Example: The New York Times newspaper articles

To access the full content of The New York Times online, the configuration first logs in using the http processor. It then collects the links to the articles from the front page. The loop processor iterates over these links, downloads each article, and builds the resulting XML using XQuery.

<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
    <!-- sends post request with needed login information -->
    <http method="post" url="http://www.nytimes.com/auth/login">
        <http-param name="is_continue">true</http-param>
        <http-param name="URI">http://</http-param>
        <http-param name="OQ"></http-param>
        <http-param name="OP"></http-param>
        <http-param name="USERID">web-harvest</http-param>
        <http-param name="PASSWORD">web-harvest</http-param>
    </http>
    <var-def name="startUrl">http://www.nytimes.com/pages/todayspaper/index.html</var-def>
    <file action="write" path="nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
        <template>
            <![CDATA[ <newyourk_times date="${sys.datetime("dd.MM.yyyy")}"> ]]>
        </template>
        <loop item="articleUrl" index="i">
            <!-- collects URLs of all articles from the front page -->
            <list>
                <xpath expression="//div[@class='story clearfix']/h5/a[1]/@href">
                    <html-to-xml>
                        <http url="${startUrl}"/>
                    </html-to-xml>
                </xpath>
                <xpath expression="//div[@class='story' or @class='story headline']/a[1]/@href">
                    <html-to-xml>
                        <http url="${startUrl}"/>
                    </html-to-xml>
                </xpath>
            </list>
            <!-- downloads each article and extracts data from it -->
            <body>
                <xquery>
                    <xq-param name="doc">
                        <html-to-xml>
                            <http url="${sys.fullUrl(startUrl, articleUrl)}?&amp;pagewanted=print"/>
                        </html-to-xml>
                    </xq-param>
                    <xq-expression><![CDATA[
                        declare variable $doc as node() external;
                        let $author := data($doc//div[@class="byline"])
                        let $title := data($doc//h1)
                        let $text := data($doc//div[@id="articleBody"])
                        (: assumed output shape: the element names in the return clause are illustrative :)
                        return <article><title>{$title}</title><author>{$author}</author><text>{$text}</text></article>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </newyourk_times> ]]>
    </file>
</config>

The result of the extraction is the file nytimes<date>.xml (see nytimes20061027.xml from October 27th, 2006).
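
The overall shape of the result file follows from the configuration: the file processor writes the opening root element with the run date formatted as dd.MM.yyyy, the loop appends one XQuery result per article link, and the closing root element is written last. The sketch below is only an illustration of that shape for a run on October 27th, 2006. It assumes the XQuery returns an article element with title, author and text children; those element names are an assumption, and the values are placeholders rather than actual output.

```xml
<!-- illustrative shape only: article element names are assumed, values are placeholders -->
<newyourk_times date="27.10.2006">
    <article>
        <title>...</title>
        <author>...</author>
        <text>...</text>
    </article>
    <!-- one article element per link collected from the front page -->
</newyourk_times>
```

Because the root element is emitted as plain text around the loop, the file is well-formed only if every loop iteration itself yields well-formed XML.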