Example - BBC News live headlines

The script retrieves the top headlines from http://news.bbc.co.uk/ and writes them to an HTML file. A cron job reruns it every 15 minutes to refresh the output.

 

Output:

<p><h3><a href="http://www.bbc.com">Egypt vows forceful response after attack</a></h3></p>
<p>A devastating gun and bomb attack on a mosque in the Sinai peninsula killed 235 people.</p>
<p>Time: 2017-11-25T02:16:46.000Z</p>
<p>Category: Middle East</p>
<p><h3><a href="http://www.bbc.com">US &#x27;to stop arming anti-IS Syrian Kurds&#x27;</a></h3></p>
<p>The US confirms making &quot;adjustments&quot; to support for Syrian groups, but does not name the YPG militia.</p>
<p>Time: 2017-11-25T00:40:24.000Z</p>
<p>Category: Middle East</p>
<p><h3><a href="http://www.bbc.com">Mexico creates huge marine national park</a></h3></p>
<p>The Revillagigedo Archipelago has been named a marine reserve, protecting hundreds of ocean species.</p>
<p>Time: 2017-11-25T03:24:20.000Z</p>
<p>Category: Latin America &amp; Caribbean</p>
<p><h3><a href="http://www.bbc.com">Poles protest over &#x27;threats to judiciary&#x27;</a></h3></p>
<p>Thousands of people rally across Poland over proposed bills that they say threaten the rule of law.</p>
<p>Time: 2017-11-24T21:55:43.000Z</p>
<p>Category: Europe</p>
<p><h3><a href="http://www.bbc.com">&#x27;To all of you, we are sorry&#x27;</a></h3></p>
<p>Canadian PM Justin Trudeau issues an apology to aboriginal children removed from their families.</p>
<p>Time: 2017-11-24T22:54:55.000Z</p>
<p>Category: US &amp; Canada</p>
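
The titles and summaries above keep their HTML entities (for example &#x27; and &quot;), and the Time field is an ISO 8601 timestamp with a trailing Z, which denotes UTC. If the output is consumed by another program rather than a browser, both can be converted with the Python standard library. This is only a minimal sketch and not part of the script; the variable names are illustrative, and the two sample strings are copied from the output above.

```python
from datetime import datetime, timezone
from html import unescape

# Sample values copied from the output above.
title = "US &#x27;to stop arming anti-IS Syrian Kurds&#x27;"
stamp = "2017-11-25T00:40:24.000Z"

# Turn HTML entities back into plain characters.
plain_title = unescape(title)

# Parse the timestamp; the literal "Z" suffix means UTC, so attach that zone.
published = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(plain_title)
print(published.isoformat())
```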

Source code of script:

# File: bbc_main.w
# Name: BBC News live headlines
# Description: HTML output with the top headlines from http://www.bbc.com/news
# Input: URL [http://www.bbc.com/news]
# Output format: HTML file
# Output fields: Link, Title, Summary, Time, Category

#<Logger File>
#	Global
#	FileName bbc_log.log
#	Level debug
#</Logger>

<Section>
    Name bbc_main
	
    Define $output_file bbc_output.html

    # define the variable $url and assign it a value
    Define $url http://www.bbc.com/news
    
	
	
    # clean output file
    <Action Print>
        FileName {$output_file}
		FileMode Write  
    </Action>
    	
	
    
    # load content
    <Action ContentURL>
        URL {$url}
        RemoveNewLine
        TagsToStrip br,nobr,b
    </Action>

	# the script will iterate through all headlines
	<Section While>
		# search for headlines only in the top part of the website
		EndAt <div class="container">
		
		# match the beginning of headline
		<Pattern>
			RegExp <div class="gs-c-promo-body
		</Pattern>
	
		<Section>
			# stop searching for the fields below before the beginning of the next headline
			EndAt <div class="gel-layout__item
	
			# match link (stored in $link, which the Print action below uses)
			<Pattern>
				RegExp <a class="gs-c-promo-heading{:re([^"]*)}" href="{$link:re([^"]*)}">
				Trim
				Compact
			</Pattern>
	
			# match title
			<Pattern>
				RegExp <h3 class="gs-c-promo-heading__title{:re([^"]*)}">{$title}</h3></a>
				Trim
				Compact
			</Pattern>
	
			# match summary
			<Pattern>
				Optional
				RegExp <p class="gs-c-promo-summary{:re([^"]*)}">{$summary}</p>
				Trim
				Compact
			</Pattern>
	
			# match time
			<Pattern>
				Optional
				RegExp <time class="gs-o-bullet__text date qa-status-date" datetime="{$time:re([^"]*)}"
				Trim
				Compact
			</Pattern>
	
			# match category
			<Pattern>
				Optional
				RegExp <span aria-hidden="true">{$category}</span>
				Trim
				Compact
			</Pattern>
	
			# and print parsed data
			<Action Print>
				FileName {$output_file}
				Text <p><h3><a href="http://www.bbc.com{$link}">{$title}</a></h3></p>\n<p>{$summary}</p>\n<p>Time: {$time}</p>\n<p>Category: {$category}</p>\n
			</Action>
        </Section>
    </Section>
</Section>

Main bbc_main
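
For readers who do not know the script syntax, the same extraction logic can be sketched in Python. This is a rough illustration, not the tool's implementation: the function names, the FIELDS table and the SAMPLE markup are hypothetical, and the sample only imitates the class names the patterns above look for. As in the script, link and title are required for each headline, while summary, time and category are optional and default to an empty string.

```python
import re

# Hypothetical markup, shaped after the class names matched by the script.
SAMPLE = (
    '<div class="gs-c-promo-body gel-1/2">'
    '<a class="gs-c-promo-heading nw-o-link" href="/news/world-1">'
    '<h3 class="gs-c-promo-heading__title gel-pica-bold">First headline</h3></a>'
    '<p class="gs-c-promo-summary gel-long-primer">First summary.</p>'
    '<time class="gs-o-bullet__text date qa-status-date" '
    'datetime="2017-11-25T02:16:46.000Z"></time>'
    '<span aria-hidden="true">Middle East</span></div>'
    '<div class="gs-c-promo-body gel-1/2">'
    '<a class="gs-c-promo-heading nw-o-link" href="/news/world-2">'
    '<h3 class="gs-c-promo-heading__title gel-pica-bold">Second headline</h3></a></div>'
    '<div class="container">footer</div>'
)

# Field name -> (regular expression, required?)
FIELDS = {
    "link": (r'<a class="gs-c-promo-heading[^"]*" href="([^"]*)">', True),
    "title": (r'<h3 class="gs-c-promo-heading__title[^"]*">(.*?)</h3></a>', True),
    "summary": (r'<p class="gs-c-promo-summary[^"]*">(.*?)</p>', False),
    "time": (r'<time class="gs-o-bullet__text date qa-status-date" datetime="([^"]*)"', False),
    "category": (r'<span aria-hidden="true">(.*?)</span>', False),
}


def parse_headlines(page):
    """Return one dict per headline found in the top part of the page."""
    # Only look at the part before the container div (the outer EndAt).
    top = page.split('<div class="container">', 1)[0]
    # Each headline starts at a promo-body div; drop the text before the first one.
    blocks = top.split('<div class="gs-c-promo-body')[1:]
    items = []
    for block in blocks:
        item = {}
        for name, (pattern, required) in FIELDS.items():
            match = re.search(pattern, block, re.S)
            if match is None and required:
                item = None  # skip headlines without a link or title
                break
            item[name] = match.group(1).strip() if match else ""
        if item is not None:
            items.append(item)
    return items


def render(items):
    """Format the parsed headlines the same way as the Print action."""
    template = (
        '<p><h3><a href="http://www.bbc.com{link}">{title}</a></h3></p>\n'
        '<p>{summary}</p>\n<p>Time: {time}</p>\n<p>Category: {category}</p>\n'
    )
    return "".join(template.format(**item) for item in items)


items = parse_headlines(SAMPLE)
print(render(items))
```

Splitting the page into one block per headline before matching plays the role of the nested section with its own EndAt: an optional field that is missing from one headline cannot be picked up from the next one.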