Version 1.0

User manual

This manual briefly describes the structure and elements of the Web-Harvest configuration file.

Predefined and user-defined variables and objects

Every Web-Harvest variable context initially contains the following helper objects, which can be used in any expression inside a configuration:

The sys object, which contains general-purpose constants and methods:

sys.lf Line feed character (\n)
sys.cr Carriage return character (\r)
sys.tab Tab character (\t)
sys.space Space character ( )
sys.quot Double quote character (")
sys.apos Single quote character (')
sys.backspace Backspace character (\b)
sys.date() Returns current date in the yyyyMMdd format
sys.time() Returns current time in the HHmmss format
sys.datetime(format) Returns date/time in specified format (Java date and time formatting patterns must be used).
sys.escapeXml(text) Escapes characters &'"<> in the specified text according to XML standard.
sys.fullUrl(pageUrl, link) For the specified URL of the web page and specified link (which could be relative, absolute or full URL) returns full URL.
sys.defineVariable(varname, varvalue, [overwrite]) Defines a new variable with the specified name and value in the current Web-Harvest context. The overwrite parameter tells whether to overwrite an existing variable with the same name; its default value is true. It has the same effect as the var-def processor, but it is useful for exchanging values between scripts and the Web-Harvest context.
sys.isVariableDefined(varname) Tells if variable with specified name is defined in the context.
sys.xpath(xpathexpr, xml) Evaluates an XPath expression on the specified XML. Returns an instance of the org.webharvest.runtime.variables.Variable class.

For more details check Java API of this class.
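
As an illustration, the sketch below combines several sys members in one configuration. The URLs and the log file name are made up for the example; the text returned by sys.fullUrl is not shown because it depends on the inputs.

```xml
<config>
    <!-- Hypothetical page URL and relative link, used only for illustration -->
    <var-def name="imageUrl">
        <template>${sys.fullUrl("http://www.example.com/news/index.html", "images/logo.gif")}</template>
    </var-def>

    <!-- One tab-separated log line per run; the file name embeds the current date (yyyyMMdd) -->
    <file action="append" path="harvest_${sys.date()}.log">
        <template>${sys.datetime("yyyy-MM-dd HH:mm:ss")}${sys.tab}${imageUrl}${sys.lf}</template>
    </file>
</config>
```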

The http object, which provides access to the HTTP client and gives information about HTTP responses:

http.client Returns instance of org.apache.commons.httpclient.HttpClient class which is used as primary HTTP client during the configuration execution. For more details, see Jakarta HttpClient documentation.
http.contentLength Length of the last HTTP response's content in bytes.
http.charset Encoding of the last HTTP response, if it contains textual content.
http.mimeType Mime type of the last HTTP response.
http.headers Map of the last HTTP response's headers. To access an individual header, use http.headers.get("headername").
http.statusCode Status code of the last HTTP response.
http.statusText Status message of the last HTTP response.
http.totalLength Total length in bytes of all responses returned to this HTTP client.
http.totalResponses Total number of responses returned to this HTTP client.

Note 1: See Usage page to check how user-defined objects can be put into variable context.
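
A minimal sketch of how the http object might be used after a request, assuming a hypothetical URL and output file name. It relies only on the properties listed above; the values written depend entirely on the server's response.

```xml
<config>
    <!-- Fetch a page (hypothetical URL); the response body is kept in "page" -->
    <var-def name="page">
        <http url="http://www.example.com/"/>
    </var-def>

    <!-- Write a one-line summary of the last HTTP response -->
    <file action="write" path="response_info.txt">
        <template>${http.statusCode} ${http.statusText}, ${http.mimeType}, ${http.contentLength} bytes, server: ${http.headers.get("Server")}</template>
    </file>
</config>
```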

Valid XML configuration elements

config

This is the root element of every configuration file.

Syntax

<config charset="charset_value" scriptlang="default_script_lang">
    configuration body
</config>

Attributes

Name Required Default Description
charset no UTF-8 Defines the default charset used throughout the configuration. Every processor that needs charset information uses this value unless another charset is set explicitly.
scriptlang no beanshell Defines the default scripting engine used throughout the configuration. Allowed values are beanshell, javascript and groovy. The default engine is used wherever no other is specified. The script and template processors can specify their own scripting engine, which makes it possible to mix scripting languages within one Web-Harvest configuration.

Example

<config charset="ISO-8859-1" scriptlang="groovy">
    <file action="write" path="squares.txt">
        <script return="${[1, 2, 3, 4, 5].collect(square)}">
            <![CDATA[ 
                square = { it * it }
            ]]>
        </script>
    </file>
</config>

Here, the file processor uses the default encoding ISO-8859-1 and the script processor uses the default language groovy.

empty

Wraps an execution sequence and returns an empty value. This element is used when the execution result is not needed.

Syntax

<empty>
    wrapped body
</empty>

Example

<file action="write" path="test/amazon_home.html">
    <empty>
        <var-def name="amazonContent">
            <http url="http://www.amazon.com"/>
        </var-def>
    </empty>
    <template>
        <![CDATA[
            <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
        ]]>
        ${amazonContent}
    </template>
</file>

A new variable is created, but its value doesn't participate in the result because it is inside the empty element. Instead, its value is used in the template processor that follows.

text

Converts embedded value to the string representation.

Syntax

<text>
    wrapped body
</text>

Example

<var-def name="digits">
    <while condition="${i.toInt() != 10}" index="i">
        <template>${i}</template>
    </while>
</var-def>
 
<file action="write" path="/test/replaced23.txt">
    <regexp replace="true">
        <regexp-pattern>(.*)(2.*3)(.*)</regexp-pattern>
        <regexp-source>
            <text>
                <var name="digits"/>
            </text>
        </regexp-source>
        <regexp-result>
            <template>${_1}here were 2 and 3${_3}</template>
        </regexp-result>
    </regexp>
</file>

The variable digits is defined using the while processor, which produces a sequence of 9 values. Next, the regexp processor is invoked to do a search-and-replace on the variable's value. The text processor concatenates all digit values into a single string. Without the text processor, the regular expression would be applied to each item in the list (every digit) separately, and no replacement would occur, because no single item matches the pattern "2.*3".

var-def

Defines a new variable, or overrides an existing one, with the specified name and value.

Syntax

<var-def name="variable_name" overwrite="overwrite_existing">
    body as value of the variable
</var-def>

Attributes

Name Required Default Description
name yes The name of the variable. It should be a valid identifier, as in most programming languages.
overwrite no true Boolean value (true or false) telling whether an existing variable with the same name will be overwritten.

Example

<var-def name="digitList">
    <while condition="true" index="i" maxloops="9">
        <var-def name="digit${i}">
            <template>${i}</template>
        </var-def>
    </while>
</var-def>

This example defines the variable digitList, which is a sequence of 9 values (digits from 1 to 9), and 9 simple variables digit1, digit2, ..., digit9 with values ranging from 1 to 9.
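
The overwrite attribute can be sketched as follows. The variable name and values are hypothetical; based on the attribute description above, the second definition should leave the existing value untouched.

```xml
<var-def name="outputName">results</var-def>

<!-- overwrite="false": "outputName" already exists, so it should keep the value "results" -->
<var-def name="outputName" overwrite="false">fallback</var-def>

<file action="write" path="${outputName}.txt">
    <var name="outputName"/>
</file>
```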

var

Returns value of defined variable. Throws an exception if variable is not defined.

Syntax

<var name="variable_name"/>

Attributes

Name Required Default Description
name yes Variable name.

Example

<var-def name="searchEngine">
    google
</var-def>
 
<var-def name="${searchEngine}Content">
    <http url="http://www.${searchEngine}.com"/>
</var-def>
 
<file action="write" path="data/${searchEngine}_content.html">
    <var name="${searchEngine}Content"/>
</file>

After execution, the file data/google_content.html contains the page content of www.google.com.

file

Reads or writes content of the specified file.

Syntax

<file action="file_action" 
      path="file_path" 
      type="file_type" 
      charset="charset_of_text_file">
    body defining content of the file if action="write" or action="append"
</file>

Attributes

Name Required Default Description
action no read Defines file action. Valid values are read, append and write.
path yes File path, relative to the working directory.
type no text Type of file: text or binary.
charset no [Default charset for config] Charset for text files. Has no effect if type is binary.

Example

<file action="write" path="123.txt">
    <file action="read" path="1.txt"/>
    -----------------------------------
    <file action="read" path="2.txt"/>
    -----------------------------------
    <file action="read" path="3.txt"/>
</file>

Here, a new file is created containing the concatenated contents of three existing files, separated by lines.
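
The append action and the charset attribute are not covered by the example above. A minimal sketch, assuming a hypothetical log file name:

```xml
<!-- Adds one timestamped line to the end of harvest.log on every run -->
<file action="append" path="harvest.log" charset="UTF-8">
    <template>${sys.datetime("yyyy-MM-dd HH:mm:ss")} run finished${sys.lf}</template>
</file>
```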

http

Sends an HTTP request to the specified URL and returns the HTTP response as the result. The body of the processor is executed first, in order to define optional HTTP parameters and/or headers, and then the HTTP request is sent.

Syntax

<http url="url" 
      method="method" 
      charset="charset"
      cookie-policy="cookie_policy"
      username="username"
      password="password">
    body that might contain http-param and/or http-header elements
</http>

Attributes

Name Required Default Description
url yes HTTP request URL.
method no get HTTP method: get or post
charset no [Default charset for config] Defines encoding of the HTTP response content. Has no effect if content type is binary.
cookie-policy no [Default cookie policy of the HTTP client] Specifies how the HTTP client manages cookies. Allowed values are: browser, ignore, netscape, rfc_2109 and default.
username no Specifies username if URL requires authentication.
password no Specifies password if URL requires authentication.

Example

<xpath expression="data(//script)">
    <html-to-xml>
        <http url="http://www.yahoo.com/"/>
    </html-to-xml>
</xpath>

The content of www.yahoo.com is downloaded, transformed to XML and then all scripts inside the page are found.
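
The optional attributes can be combined on a single request. The following is only a sketch: the URL and credentials are invented, and the attribute values are taken from the table above.

```xml
<!-- Hypothetical password-protected page; credentials are placeholders -->
<http url="http://www.example.com/private/report.html"
      method="get"
      charset="ISO-8859-1"
      cookie-policy="browser"
      username="demo"
      password="secret"/>
```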

http-param

Adds HTTP parameter for the first enclosing HTTP processor for both post and get requests. If used outside the HTTP processor an exception is thrown.

Syntax

<http-param name="param_name">
    body as parameter value
</http-param>

Attributes

Name Required Default Description
name yes The name of HTTP parameter.

Example

<var-def name="paramNames">
    USERID
    PASSWORD
</var-def>
 
<http method="post" url="http://www.nytimes.com/auth/login">
   <http-param name="is_continue">true</http-param>
   <http-param name="URI">http://</http-param>
   <http-param name="OQ"></http-param>
   <http-param name="OP"></http-param>
 
   <loop item="name">
       <list>
           <var name="paramNames"/>
       </list>
       <body>
           <http-param name="${name}">web-harvest</http-param>
       </body>
   </loop>
</http>

Sends needed parameters to www.nytimes.com/auth/login in order to log in.

http-header

Defines HTTP header for the first enclosing HTTP processor. If used outside the HTTP processor an exception is thrown.

Syntax

<http-header name="header_name">
    body as header value
</http-header>

Attributes

Name Required Default Description
name yes The name of HTTP header.

Example

<http url="http://www.google.com">
    <http-header name="User-Agent">
        Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
    </http-header>
</http>

Identifies itself to www.google.com as Firefox browser.

html-to-xml

Cleans up the content of the body and transforms it to valid XML. The body is usually HTML obtained as a result of http processor execution. The actual parsing and cleaning is delegated to the HtmlCleaner tool. Although no special tuning is needed in most cases, the cleaner may be configured with several parameters defined by the processor's attributes.

Syntax

<html-to-xml outputtype="..." advancedxmlescape="..." usecdata="..." 
             specialentities="..." unicodechars="..." omitunknowntags="..."
             treatunknowntagsascontent="..." omitdeprtags="..."
             treatdeprtagsascontent="..." omitcomments="..."
             omithtmlenvelope="..." allowmultiwordattributes="..."
             allowhtmlinsideattributes="..." namespacesaware="..." 
             prunetags="...">
    body as html to be cleaned
</html-to-xml>

Attributes

Name Required Default Description
outputtype no simple Defines how the resulting XML will be serialized. Allowed values are simple, compact, browser-compact and pretty.
advancedxmlescape no true If true, an ampersand (&) that begins a valid XML character sequence (&XXX;) is not escaped as &amp;XXX;.
usecdata no true If true, HtmlCleaner will treat SCRIPT and STYLE tag contents as CDATA sections, or otherwise it will be regarded as ordinary text (special characters will be escaped).
specialentities no true If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the unicode characters they represent (ô, ‰, ×). This doesn't include &, <, >, ", '.
unicodechars no true If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real unicode characters (e.g. &#1078; is replaced with ж).
omitunknowntags no false Tells whether to skip (ignore) unknown tags during cleanup.
treatunknowntagsascontent no false Tells whether to treat unknown tags as ordinary content, e.g. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitunknowntags is set to false.
omitdeprtags no false Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatdeprtagsascontent no false Tells whether to treat deprecated tags as ordinary content, e.g. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitdeprtags is set to false.
omitcomments no false Tells whether to skip HTML comments.
omithtmlenvelope no false Tells whether to remove HTML and BODY tags from the resulting XML, and use first tag in the BODY section instead. If BODY section doesn't contain any tags, then this attribute has no effect.
allowmultiwordattributes no true Tells the parser whether to allow attribute values consisting of multiple words. If true, the attribute att="a b c" stays as it is; if false, the parser splits it into att="a" b="b" c="c" (this is the default browser behaviour).
allowhtmlinsideattributes no false Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" stays as it is; otherwise the parser ends the attribute value after "here is ". This flag makes sense only if allowmultiwordattributes is set as well.
namespacesaware no true If true, namespace prefixes found during parsing are preserved and all necessary XML namespace declarations are added to the root element. If false, all namespace prefixes and all xmlns namespace declarations are stripped.
prunetags no empty string Comma-separated list of tags that are completely removed (with all nested elements) from the XML tree after parsing. For example, if prunetags is "script,style", the resulting XML will not contain scripts and styles.

Example

<html-to-xml outputtype="pretty">
    <http url="http://www.motors.ebay.com"/>
</html-to-xml>

Downloads the www.motors.ebay.com page and cleans it up, producing pretty-printed XML content.
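
Several cleaner attributes can be set at once. The sketch below, which assumes a hypothetical URL, drops comments as well as script and style elements before the links are extracted:

```xml
<xpath expression="//a/@href">
    <!-- Compact output, no HTML comments, script and style subtrees pruned -->
    <html-to-xml outputtype="compact" omitcomments="true" prunetags="script,style">
        <http url="http://www.example.com/"/>
    </html-to-xml>
</xpath>
```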

regexp

Searches the body for the given regular expression and optionally replaces found occurrences with specified pattern. If body is a list of values then the regexp processor is applied to every item and final execution result is the list.

Syntax

<regexp replace="true_or_false" max="max_found_occurrences">
    <regexp-pattern>
        body as pattern value
    </regexp-pattern>
    <regexp-source>
        body as the text source 
    </regexp-source>
    [<regexp-result>
        body as the result
    </regexp-result>]
</regexp>
For each group inside the search pattern and for each found occurrence, variables named _<group_number> are created. See a regular expression tutorial for a fuller explanation of groups.

Attributes

Name Required Default Description
replace no false Logical value telling if found occurrences of regular expression will be replaced. Valid values are: true/false or yes/no. If this value is true (yes), then the regexp-result needs to be specified with replacement value.
max no Limits the number of found pattern occurrences. There is no limit if it is not specified.

Example #1

<regexp>
    <regexp-pattern>([_\w\d]*)[\s]*=[\s]*([\w\d\s]*+)[\,\.\;]*</regexp-pattern>
    <regexp-source>
        var1= test1, var2 = bla bla; index=16; 
        city = Delhi,town=Kingston;
    </regexp-source>
    <regexp-result>
        <template>Value of variable "${_1}" is "${_2}"!</template>
    </regexp-result>
</regexp>

Here, the regular expression looks for the specified pattern in the source text, producing as a result a list of five values: Value of variable "var1" is "test1"!, Value of variable "var2" is "bla bla"! ...

Example #2

<regexp replace="true">
    <regexp-pattern>[\s]*[\,\.\;][\s]*</regexp-pattern>
    <regexp-source>
        var1= test1, var2 = bla bla; index=16; city = Delhi,town=Kingston;
    </regexp-source>
    <regexp-result>
        <template>|</template>
    </regexp-result>
</regexp>

Here, the regular expression replacement produces single value as the result: var1= test1|var2 = bla bla|index=16|city = Delhi|town=Kingston|.
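
Neither example uses the max attribute. The following sketch applies a simplified pattern (not the one from Example #1) and, according to the attribute description, should stop after the first two occurrences:

```xml
<regexp max="2">
    <!-- Group 1 is the name, group 2 the first word of the value -->
    <regexp-pattern>(\w+)\s*=\s*(\w+)</regexp-pattern>
    <regexp-source>
        var1= test1, var2 = bla bla; index=16;
    </regexp-source>
    <regexp-result>
        <template>${_1}:${_2}</template>
    </regexp-result>
</regexp>
```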

xpath

Uses an XPath language expression to search an XML document.

Syntax

<xpath expression="xpath_expression">
    body as xml
</xpath>

Attributes

Name Required Default Description
expression yes XPath language expression.

Example

<xpath expression="//a/@href">
    <html-to-xml>
        <http url="http://www.nba.com/"/>
    </html-to-xml>
</xpath>

The result is the sequence of links from the page retrieved from www.nba.com.

xquery

Uses an XQuery language expression to query an XML document.

Syntax

<xquery>
    [<xq-param name="xquery_param_name" [type="xquery_param_type"]>
        body as xquery parameter value
    </xq-param>] *
    <xq-expression>
        body as xquery language construct
    </xq-expression>
</xquery>

Attributes

Name Required Default Description
name yes Name of XQuery parameter
type no node() Type of XQuery parameter - one of the values: node(), integer, long, float, double, boolean, string, node()*, integer*, long*, float*, double*, boolean*, string*.

Multiple external parameters may optionally be specified for the query. In most cases at least one, containing the XML document, is needed. For every specified XQuery parameter, a declaration inside the xq-expression in the form:

declare variable $<xquery_param_name> as <xquery_param_type> external;
is required in order to match the name and type of the passed parameter. Valid parameter types supported by Web-Harvest are: node(), integer, long, float, double, boolean, string and the analogous sequence types: node()*, integer*, long*, float*, double*, boolean*, string*. If not specified, the default parameter type is node().

Example

<xquery>
    <xq-param name="doc">
        <html-to-xml>
            <http url="${sys.fullUrl(startUrl, articleUrl)}"/>
        </html-to-xml>
    </xq-param>
    <xq-expression><![CDATA[
        declare variable $doc as node() external;
        
        let $author := data($doc//div[@class="byline"])
        let $title := data($doc//h1)
        let $text := data($doc//div[@id="articleBody"])
            return
                <article>
                    <title>{$title}</title>
                    <author>{$author}</author>
                    <text>{$text}</text>
                </article>
    ]]></xq-expression>
</xquery>

The XQuery expression is applied to the downloaded page, producing XML that contains information about a newspaper article.

xslt

Applies XSLT transformation to the XML document.

Syntax

<xslt>
    <xml>
        body as xml
    </xml>
    <stylesheet>
        body as xsl
    </stylesheet>
</xslt>

Example

<xslt>
    <xml>
        <html-to-xml>
            <http url="${url}"/>
        </html-to-xml>
    </xml>
    <stylesheet>
        <file path="stylesheets/tree.xsl"/>
    </stylesheet>
</xslt>

The XSLT transformation, read from the file, is applied to the downloaded content.

script

Executes code written in the specified scripting language. Web-Harvest supports BeanShell, Groovy and JavaScript, all of which are powerful, widespread and popular scripting languages.

The body of the script processor is executed in the specified language and, optionally, the expression given in the return attribute is evaluated and returned. All variables defined during configuration execution are also available in the script processor. Note, however, that variables used throughout Web-Harvest are not simple types: they are all org.webharvest.runtime.variables.Variable objects (an internal Web-Harvest class) that expose convenient methods:

  • String toString()
  • byte[] toBinary()
  • boolean toBoolean()
  • int toInt()
  • long toLong()
  • double toDouble()
  • Object[] toArray()
  • java.util.List toList()
  • Object getWrappedObject()

The way to push a value back to Web-Harvest after the script finishes is the command sys.defineVariable(varName, varValue, [overwrite]), which creates an appropriate wrapper around the specified value: list variables for java.util.List and arrays, and simple variables for other objects. The best way to illustrate this is the simple example below.

Each script engine used in the single Web-Harvest configuration, once created, preserves its variable context throughout the configuration, meaning that all variables and objects are available in further script processors that use the same language.

Syntax

<script language="script_language" return="value_to_return">
    body as script
</script>

Attributes

Name Required Default Description
language no Default scripting language if defined in config element, or beanshell if nothing is defined. Defines which scripting engine is used in the processor. Valid values are beanshell, javascript and groovy.
return no Empty value Specifies what this processor should evaluate at the end and return as processing value.

Example

<?xml version="1.0" encoding="UTF-8"?>
 
<config>
    <var-def name="birthDate">
        11/4/1958
    </var-def>
    
    <var-def name="web_harvest_day_variable">
        <script return="namedDay.toUpperCase()"><![CDATA[
            tokenizer = new StringTokenizer(birthDate.toString(), "./-\\");
        
            day = Integer.parseInt(tokenizer.nextToken());
            month = Integer.parseInt(tokenizer.nextToken());
            year = Integer.parseInt(tokenizer.nextToken());
        
            Calendar cal = Calendar.getInstance();
            cal.set(Calendar.DAY_OF_MONTH, day);
            cal.set(Calendar.MONTH, month-1);
            cal.set(Calendar.YEAR, year);
        
            // Calendar.DAY_OF_WEEK ranges from 1 (Sunday) to 7 (Saturday)
            switch( cal.get(Calendar.DAY_OF_WEEK) ) {
                case 1 : namedDay = "Sunday"; break;
                case 2 : namedDay = "Monday"; break;
                case 3 : namedDay = "Tuesday"; break;
                case 4 : namedDay = "Wednesday"; break;
                case 5 : namedDay = "Thursday"; break;
                case 6 : namedDay = "Friday"; break;
                default: namedDay = "Saturday"; break;
            }
        ]]></script>
    </var-def>
    
    <template>
        The day when you were born was ${namedDay}.
    </template>
    
    <file action="write" path="day.txt">
        <var name="web_harvest_day_variable"/>
    </file>
</config>

This example also shows that script variables, once defined, are available in all following script and template processors (namedDay).
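
The sys.defineVariable call described earlier can be sketched as follows, using the default BeanShell engine. The variable names, the value and the output file are hypothetical.

```xml
<config>
    <var-def name="priceText">129</var-def>

    <script><![CDATA[
        // priceText is a Variable wrapper, so it is converted explicitly
        int price = priceText.toInt();
        // Push the computed value back into the Web-Harvest context
        sys.defineVariable("discounted", String.valueOf(price / 2));
    ]]></script>

    <!-- "discounted" is now an ordinary Web-Harvest variable -->
    <file action="write" path="discounted.txt">
        <var name="discounted"/>
    </file>
</config>
```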

template

For the given text content, parts surrounded with ${ and } are evaluated using the specified scripting engine. If no scripting language is specified, the default one is used (see the config element).

Syntax

<template language="script_language">
    body as text for templating
</template>

Attributes

Name Required Default Description
language no Default config language Specifies script language that will be used for evaluation of parts surrounded with ${ and }. Valid values are beanshell, javascript and groovy.

Example

<var-def name="content">
    <file path="textdata/products.txt"/>
</var-def>
 
<var-def name="changedContent">
    <template>
        ${sys.datetime("yyyy-MM-dd, HH:mm:ss")} ${sys.lf}
        ---------------------------------------------------- ${sys.lf}
        ${my.process(content.toString())}
    </template>
</var-def>

The template processor uses built-in constants and functions, together with user-defined objects from the variable context (my in this example), to produce the desired content.

case

Executes a conditional statement. Sequentially checks the conditions of the inner if elements and, for the first one that is satisfied, returns its body as the result. If no condition is true, the result is the body of the else element if specified, or an empty value otherwise.

Syntax

<case>
    [<if condition="expression">
        if body
    </if>] *
    [<else>
        else body
    </else>]
</case>

Attributes

Name Required Default Description
condition yes Attribute of the if element. If it evaluates to true (yes), the body of that if is evaluated.

Example

<var-def name="contact">
    <xpath expression="//a[contains(., 'contact')]/@href">
        <var name="pageContent"/>
    </xpath>
</var-def>
 
<var-def name="contactMail">
    <case>
        <if condition="${contact.toString() != ''}">
            <var name="contact"/>
        </if>
        <else>
            Contact is not defined!
        </else>
    </case>
</var-def>

Here, the case processor checks whether the previous xpath search found contact information on the page.
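
A case element may contain several if branches. The sketch below assumes a hypothetical URL and uses the http.statusCode property described earlier; whether a non-200 response reaches the case processor at all may depend on how the HTTP client handles errors.

```xml
<var-def name="pageContent">
    <http url="http://www.example.com/"/>
</var-def>

<var-def name="statusLabel">
    <case>
        <!-- Branches are checked in order; the first true condition wins -->
        <if condition="${http.statusCode == 200}">OK</if>
        <if condition="${http.statusCode == 404}">Page not found</if>
        <else>Unexpected response</else>
    </case>
</var-def>
```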

loop

Iterates through the specified list and executes the body logic for each item. The result is the list of processed bodies.

Syntax

<loop item="item_var_name" 
      index="index_var_name" 
      maxloops="max_loops" 
      filter="list_filter">
    <list>
        body as list value
    </list>
    <body>
        body for each list item
    </body>
</loop>

Attributes

Name Required Default Description
item no Name of the variable that takes the value of current list item.
index no Name of the index variable, initial value for the first loop is 1.
maxloops no Limits number of iterations. There is no limit if it is not specified.
filter no Expression for filtering iteration list. It consists of arbitrary number of restrictions separated by comma. There are the following types of restrictions:
  • [n]-[m], for specifying index range, for example: 3-6, -5.
  • [n][:][m], for specifying sublist starting at index n and including items at indexes n+m, n+2*m, n+3*m, ..., for example 1:2 for all odd, 2:2 for all even.
  • unique, that removes duplicates from list comparing string values of list items.
A valid filter that combines the allowed restrictions is, for example: 1-20,1:2,unique.

Example

<loop item="link" index="i" filter="unique">
    <list>
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://www.yahoo.com"/>
            </html-to-xml>
        </xpath>
    </list>
    <body>
        <file action="write" type="binary" path="images/${i}.gif">
            <http url="${sys.fullUrl('http://www.yahoo.com', link)}"/>
        </file>
    </body>
</loop>

The loop iterates over all unique image URLs from www.yahoo.com and, for each URL, downloads the image and stores it in the file system.
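
The filter attribute can also combine an index range with an every-second-item restriction. This is only a sketch with a generated list; how the restrictions interact is inferred from the attribute description, so the exact items kept are not stated here.

```xml
<loop item="n" filter="1-6,1:2">
    <list>
        <!-- Generates the values 1 to 10, in the same way as the var-def example -->
        <while condition="true" index="k" maxloops="10">
            <template>${k}</template>
        </while>
    </list>
    <body>
        <template>item ${n}</template>
    </body>
</loop>
```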

while

Loops while the specified condition is satisfied. The result is a list made of the processed bodies of each iteration.

Syntax

<while condition="expression" index="index_var_name" maxloops="max_loops">
    body
</while>

Attributes

Name Required Default Description
condition yes Expression that is evaluated for every loop and if its value is true, the body is executed.
index no Name of the index variable, initial value for the first loop is 1.
maxloops no Limits number of iterations. There is no limit if it is not specified.

Example

See example from function processor.
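
A smaller, self-contained sketch of the while processor, modelled on the digits example from the text processor section. The URL pattern is hypothetical, and maxloops is added only as a safety cap.

```xml
<var-def name="pageUrls">
    <!-- The index starts at 1, so the body should run for i = 1 to 5 -->
    <while condition="${i.toInt() != 6}" index="i" maxloops="20">
        <template>http://www.example.com/list?page=${i}</template>
    </while>
</var-def>
```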

function

Declares a user-defined function.

Syntax

<function name="function_name">
    function body
</function>

Attributes

Name Required Default Description
name yes The name of user-defined function

Example

<function name="download-multipage-list">
    <return>
        <while condition="${pageUrl.toString().trim() != ''}" maxloops="${maxloops}" index="i">
            <empty>
                <var-def name="content">
                    <html-to-xml>
                        <http url="${pageUrl}</