
Example: Google images

As in Example #2, this configuration uses a user-defined function from the included configuration file functions.xml. The function download-multipage-list collects the image URLs from a Google Images search for the specified keyword, following the "next page" link (nextXPath) and stopping after at most 5 result pages (maxloops). A loop processor then iterates over the collected URLs, downloads each image, and saves it locally.

<?xml version="1.0" encoding="UTF-8"?>
 
<!--
    Expects the following initial variable:
        search - search expression
-->
 
<!-- Updated on February 9th, 2011 -->
<config charset="UTF-8">
 
    <include path="functions.xml"/>
 
    <!-- defines search keyword and start URL -->
    <var-def name="search" overwrite="false">banana</var-def>
 
    <var-def name="url"><template>http://images.google.com/images?q=${search}&amp;hl=en</template></var-def>
    
    <!-- collects all image URLs -->
    <var-def name="imgLinks">
        <call name="download-multipage-list">
            <call-param name="pageUrl"><var name="url"/></call-param>
            <call-param name="nextXPath">//a[@id="pnnext"]/@href</call-param>
            <call-param name="itemXPath">//img[contains(@src, 'images?q=tbn')]/@src</call-param>
            <call-param name="maxloops">5</call-param>
        </call>
    </var-def>
 
    <!-- downloads the images and saves them to files -->
    <loop item="link" index="i" filter="unique">
        <list>
            <var name="imgLinks"/>
        </list>
        <body>
            <file action="write" type="binary" path="google_images/${search}_${i}.gif">
                <http url="${sys.fullUrl(url, link)}"/>
            </file>
        </body>
    </loop>
 
</config>

The result of the extraction is a collection of 100 image files stored on the file system, written under the google_images directory as ${search}_${i}.gif.
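
The included functions.xml is not reproduced on this page. The following is a rough sketch of how a multipage collector such as download-multipage-list could be written, assuming the standard Web-Harvest processors (function, return, while, empty, html-to-xml, http, xpath). Only the function name and the four parameter names (pageUrl, nextXPath, itemXPath, maxloops) come from the example above; the body and the variable names content and nextLinkUrl are illustrative assumptions, not the file shipped with the examples.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<config>
    <!--
        Sketch only: collects items from a multipage list.
        Parameters (passed via call-param):
            pageUrl   - URL of the starting page
            itemXPath - XPath selecting the items on each page
            nextXPath - XPath selecting the URL of the next page
            maxloops  - maximum number of pages to download
    -->
    <function name="download-multipage-list">
        <return>
            <!-- keep going while there is a page URL, up to maxloops pages -->
            <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
                <!-- empty suppresses the output of the bookkeeping steps -->
                <empty>
                    <!-- assumed variable: cleaned-up XML of the current page -->
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${pageUrl}"/>
                        </html-to-xml>
                    </var-def>

                    <!-- assumed variable: link to the next result page, if any -->
                    <var-def name="nextLinkUrl">
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>

                    <!-- resolve the next link against the current page URL -->
                    <var-def name="pageUrl">
                        <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                    </var-def>
                </empty>

                <!-- the items of this page become part of the returned list -->
                <xpath expression="${itemXPath}">
                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>
</config>
```

With a function of this shape, the call in the main configuration returns one combined list of all matched src values, which is why the loop there can iterate over imgLinks directly and use filter="unique" to skip duplicates.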