
Example: Simple web site crawler

Although the main goal of Web-Harvest is data extraction rather than web crawling, it can also be used to collect and save the pages of a web site. This example demonstrates traversing all the .php pages of the official Web-Harvest web site. The scripting processor is used intensively throughout the configuration, bringing the power of a programming language (Java) to Web-Harvest. The main idea is to keep two sets of URLs: one for all visited pages and one for all remaining pages. The URLs are collected by gathering all links, starting from the home page.
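The two-set traversal can be sketched in plain Java before reading the configuration itself. In this sketch a `Map` stands in for the network (page URL to the links found on that page); the graph, URLs, and method names are illustrative only and not part of Web-Harvest:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class TwoSetCrawl {
    /*
     * Traverses a link graph using the same bookkeeping as the
     * configuration below: a set of visited pages plus a frontier of
     * unvisited ones. The Map is a toy stand-in for real HTTP fetches.
     */
    static Set<String> crawl(String home, Map<String, List<String>> links) {
        Set<String> visited = new HashSet<>();
        Queue<String> unvisited = new ArrayDeque<>();
        unvisited.add(home);
        while (!unvisited.isEmpty()) {
            String current = unvisited.poll();
            visited.add(current);  // mark the current page as processed
            for (String link : links.getOrDefault(current, List.of())) {
                // schedule only links that are neither visited nor already queued
                if (!visited.contains(link) && !unvisited.contains(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }
}
```

The configuration achieves the same effect slightly differently: instead of one shared queue, it replaces the whole unvisited set on each pass with the new links collected from the pages just downloaded.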

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- set initial page -->
    <var-def name="home">http://web-harvest.sourceforge.net/index.php</var-def>

    <!-- define script functions and variables -->
    <script><![CDATA[
        /* checks if specified URL is valid for download */
        boolean isValidUrl(String url) {
            String urlSmall = url.toLowerCase();
            return urlSmall.startsWith("http://web-harvest.sourceforge.net/") && urlSmall.endsWith(".php");
        }

        /* creates filename based on specified URL */
        String makeFilename(String url) {
            return url.replaceAll("http://|https://|file://", "");
        }

        /* set of unvisited URLs, seeded with the home page */
        Set unvisited = new HashSet();
        unvisited.add(home.toString());

        /* pushes to web-harvest context initial set of unvisited pages */
        SetContextVar("unvisitedVar", unvisited);

        /* set of visited URLs */
        Set visited = new HashSet();
    ]]></script>

    <!-- loop while there are any unvisited links -->
    <while condition="${unvisitedVar.toList().size() != 0}">

        <!-- set collecting the new links found in this pass -->
        <script><![CDATA[
            Set newLinks = new HashSet();
        ]]></script>

        <loop item="currUrl">
            <list><var name="unvisitedVar"/></list>
            <body>
                <!-- downloads the current page -->
                <var-def name="content">
                    <html-to-xml>
                        <http url="${currUrl}"/>
                    </html-to-xml>
                </var-def>

                <script><![CDATA[
                    currentFullUrl = sys.fullUrl(home, currUrl);
                ]]></script>

                <!-- saves downloaded page -->
                <file action="write" path="spider/${makeFilename(currentFullUrl)}.html">
                    <var name="content"/>
                </file>

                <!-- adds current URL to the set of visited -->
                <script><![CDATA[
                    visited.add(sys.fullUrl(home, currUrl));
                ]]></script>

                <!-- loop through all collected links on the downloaded page -->
                <loop item="currLink">
                    <list>
                        <xpath expression="//a/@href">
                            <var name="content"/>
                        </xpath>
                    </list>
                    <body>
                        <script><![CDATA[
                            String fullLink = sys.fullUrl(home, currLink);
                            if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) ) {
                                newLinks.add(fullLink);
                            }
                        ]]></script>
                    </body>
                </loop>
            </body>
        </loop>

        <!-- unvisited links are now all the collected new links from downloaded pages -->
        <script><![CDATA[
            SetContextVar("unvisitedVar", newLinks);
        ]]></script>

    </while>
</config>
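The configuration leans on Web-Harvest's built-in `sys.fullUrl` to turn relative hrefs into absolute URLs before filtering and deduplicating them. A rough stand-in in plain Java, using `java.net.URI` (an assumption for illustration; Web-Harvest's own resolution may differ in edge cases):

```java
import java.net.URI;

public class FullUrlDemo {
    /*
     * Resolves a possibly relative link against the URL of the page it
     * was found on, similar in spirit to sys.fullUrl(base, link).
     * Absolute links pass through unchanged; relative ones replace the
     * last path segment of the base URL.
     */
    static String fullUrl(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }
}
```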

The result of the execution is the set of downloaded pages, stored under the <workingdir>/spider directory.
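Because the two script functions are plain Java, they can be lifted out of the configuration and tried on their own. The copies below reproduce the bodies from the configuration verbatim inside a hypothetical wrapper class:

```java
public class CrawlHelpers {
    /* Copy of the configuration's isValidUrl: accept only .php pages
       on the Web-Harvest site. */
    static boolean isValidUrl(String url) {
        String urlSmall = url.toLowerCase();
        return urlSmall.startsWith("http://web-harvest.sourceforge.net/") && urlSmall.endsWith(".php");
    }

    /* Copy of the configuration's makeFilename: strip the protocol
       prefix so the remainder of the URL can serve as a relative
       file path under the spider/ directory. */
    static String makeFilename(String url) {
        return url.replaceAll("http://|https://|file://", "");
    }
}
```

For the home page, for example, `makeFilename` yields `web-harvest.sourceforge.net/index.php`, so the page is written to `spider/web-harvest.sourceforge.net/index.php.html`.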