
Example: Google images

As in Example #2, this example uses a user-defined function from the included configuration file functions.xml. The function download-multipage-list collects the image URLs from a Google Images search for the specified keyword, following at most 5 result pages. A loop processor then iterates over the collected URLs, downloading each image and saving it locally.

<?xml version="1.0" encoding="UTF-8"?>
<!--
    Expects following initial variable:
        search - search expression
-->
<!-- Updated on February, 9th, 2011 -->
<config charset="UTF-8">
    <include path="functions.xml"/>
    <!-- defines search keyword and start URL -->
    <var-def name="search" overwrite="false">banana</var-def>
    <var-def name="url"><template>http://images.google.com/images?q=${search}&amp;hl=en</template></var-def>
    <!-- collects all image URLs -->
    <var-def name="imgLinks">
        <call name="download-multipage-list">
            <call-param name="pageUrl"><var name="url"/></call-param>
            <call-param name="nextXPath">//a[@id="pnnext"]/@href</call-param>
            <call-param name="itemXPath">//img[contains(@src, 'images?q=tbn')]/@src</call-param>
            <call-param name="maxloops">5</call-param>
        </call>
    </var-def>
    <!-- downloads images and saves them to files -->
    <loop item="link" index="i" filter="unique">
        <list>
            <var name="imgLinks"/>
        </list>
        <body>
            <file action="write" type="binary" path="google_images/${search}_${i}.gif">
                <http url="${sys.fullUrl(url, link)}"/>
            </file>
        </body>
    </loop>
</config>

The result of the extraction is a collection of 100 image files stored on the file system, in the google_images directory.
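
The function download-multipage-list is defined in functions.xml, which is not reproduced on this page. The following is only a sketch of how such a function could be written with standard Web-Harvest processors, assuming the four parameters passed in the call above (pageUrl, nextXPath, itemXPath, maxloops). The variable names content and nextLinkUrl are illustrative, and the actual definition in functions.xml may differ.

```xml
<!-- Hypothetical sketch only; the real definition in functions.xml may differ. -->
<function name="download-multipage-list">
    <return>
        <!-- continue while pageUrl is non-empty, but never more than maxloops times -->
        <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
            <empty>
                <!-- download the current page and convert it to well-formed XML -->
                <var-def name="content">
                    <html-to-xml>
                        <http url="${pageUrl}"/>
                    </html-to-xml>
                </var-def>
                <!-- look up the link to the next result page -->
                <var-def name="nextLinkUrl">
                    <xpath expression="${nextXPath}">
                        <var name="content"/>
                    </xpath>
                </var-def>
                <!-- resolve it against the current URL for the next iteration -->
                <var-def name="pageUrl">
                    <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                </var-def>
            </empty>
            <!-- items matched on this page are added to the returned list -->
            <xpath expression="${itemXPath}">
                <var name="content"/>
            </xpath>
        </while>
    </return>
</function>
```

In this sketch the empty processor hides the bookkeeping steps, so each iteration contributes only the values matched by itemXPath to the result. That is why the calling configuration can treat imgLinks as a flat list of image URLs.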