
Usage

Web-Harvest can be used in three modes: as a standalone GUI application (typical for the development and testing phase), as a command-line utility, and from Java code.

The only requirement is a Java Runtime Environment (JRE), version 1.5 or higher.

Development IDE usage

The Web-Harvest IDE eases creating and testing XML configurations. It provides a multiple document interface (MDI) with an XML editor that supports auto-completion, a hierarchical view of configuration processors, a property viewer for each executing processor, and a log area. To start the IDE, simply double-click webharvest_all_xx.jar or start it from the command line as
    java -jar webharvest_all_xx.jar
without any additional parameters.

[Screenshot of the Web-Harvest IDE]

Command line usage

The syntax for command-line use is as follows:

java -jar webharvest_all_XX.jar [-h] config=<path> [workdir=<path>] [debug=yes|no]
          [proxyhost=<proxy server> [proxyport=<proxy server port>]]
          [proxyuser=<proxy username> [proxypassword=<proxy password>]]
          [proxynthost=<NT host name>]
          [proxyntdomain=<NT domain name>]
          [loglevel=<level>]
          [logpropsfile=<path>]
          [plugins=<list of plugin classes>]
          [#var1=<value1> [#var2=<value2>...]]
    

where the parameters have the following meaning:

-h              Shows the help.
config          Path or URL of the configuration (a URL must begin with "http://" or "https://").
workdir         Path of the working directory (default is the current directory).
debug           Specifies whether Web-Harvest generates debugging output (default is no).
proxyhost       Specifies the proxy server.
proxyport       Specifies the port of the proxy server.
proxyuser       Specifies the proxy server username.
proxypassword   Specifies the proxy server password.
proxynthost     NTLM authentication scheme: the host the request is originating from.
proxyntdomain   NTLM authentication scheme: the domain to authenticate within.
loglevel        Specifies the logging level for Log4J (trace, debug, info, warn, error, fatal).
logpropsfile    File path to custom Log4J properties. If specified, loglevel is ignored.
plugins         Comma-separated list of fully qualified plugin class names.
#varN=valueN    Specifies initial variables of the Web-Harvest context.
                To be recognized, each variable name must have the prefix #.
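
When Web-Harvest is launched from another Java program, the syntax above can be assembled as a plain argument list. The following is a minimal sketch, not part of Web-Harvest: the class name WebHarvestCommand and its build method are hypothetical, and the jar name keeps the XX placeholder from the syntax above. It covers only config, workdir and the #-prefixed variables.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Web-Harvest class): assembles the command line
// "java -jar <jar> config=<path> [workdir=<path>] [#var=<value> ...]".
class WebHarvestCommand {

    static List<String> build(String jarPath, String config, String workdir,
                              Map<String, String> variables) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.add("config=" + config);
        if (workdir != null) {
            // workdir is optional; Web-Harvest defaults to the current directory
            command.add("workdir=" + workdir);
        }
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            // initial context variables must carry the # prefix
            command.add("#" + entry.getKey() + "=" + entry.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("username", "web-harvest");
        List<String> command = build("webharvest_all_XX.jar",
                "c:/wh/configs/news.xml", "c:/wh/work/", variables);
        System.out.println(String.join(" ", command));
        // To actually launch the process (jar path must exist):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Passing the list to ProcessBuilder bypasses the shell, so the # prefix needs no quoting. When typing the same command in a Unix shell, an argument that starts with # would normally have to be quoted.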

Java code usage

First, import the required Web-Harvest classes at the beginning of the Java file:

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

In the code, create an instance of ScraperConfiguration with the path to the configuration file, and create a Scraper instance with the working folder:

ScraperConfiguration config = new ScraperConfiguration("c:/wh/configs/news.xml");
Scraper scraper = new Scraper(config, "c:/wh/work/");

Optionally add custom user object instances to the variable context:

scraper.addVariableToContext("myVarName1", myObj1);
scraper.addVariableToContext("myVarName2", myObj2);
...

Optionally enable debugging (it is disabled by default):

scraper.setDebug(true);

Optionally specify a proxy server:

scraper.getHttpClientManager().setHttpProxy("proxy.wh", 3128);

Optionally specify proxy server credentials:

scraper.getHttpClientManager().setHttpProxyCredentials(myUsername, myPassword, myNTHost, myNTDomain);

where myNTHost and myNTDomain can be null and are used only for the NTLM authentication scheme.

Start configuration execution:

scraper.execute();

Here is a full example of Web-Harvest usage from Java code:

import org.webharvest.definition.DefinitionResolver;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;
import mypackage.MyXmlLibrary;

public class WebHarvestTest {

    public static void main(String[] args) {
        // register external plugins if there are any
        DefinitionResolver.registerPlugin("com.my.MyPlugin1");
        DefinitionResolver.registerPlugin("com.my.MyPlugin2");
        DefinitionResolver.registerPlugin("com.my.MyPlugin3");

        ScraperConfiguration config = 
            new ScraperConfiguration("c:/wh/configs/news.xml");
        Scraper scraper = new Scraper(config, "c:/wh/work/");
        
        scraper.addVariableToContext("username", "web-harvest");
        scraper.addVariableToContext("password", "web-harvest");
        scraper.addVariableToContext("myXmlLib", new MyXmlLibrary());
        
        scraper.setDebug(true);

        scraper.execute();
        
        // takes variable created during execution
        Variable articles = (Variable) scraper.getContext().get("articles");
        
        // do something with articles...
    }

}

Check the Web-Harvest API documentation for more details about code usage.
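
A custom launcher may want to accept the same #name=value convention that the command-line mode uses for initial variables and forward them to addVariableToContext. The sketch below shows one way to collect such arguments. It is an illustration only and is not Web-Harvest's own argument parser; the class name ContextVariables and its parse method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Web-Harvest class): extracts "#name=value"
// arguments so they can be added to the scraper's variable context.
class ContextVariables {

    static Map<String, String> parse(String[] args) {
        Map<String, String> variables = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // accept only arguments with the # prefix and a non-empty name;
            // the value is everything after the first '='
            if (arg.startsWith("#") && eq > 1) {
                variables.put(arg.substring(1, eq), arg.substring(eq + 1));
            }
        }
        return variables;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : parse(args).entrySet()) {
            // In a real launcher this is where each pair would be handed over:
            // scraper.addVariableToContext(entry.getKey(), entry.getValue());
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Arguments without the # prefix, such as config=... or workdir=..., are skipped here, so the same args array can still be inspected separately for those options.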