User manual

This manual briefly describes the structure and elements of the Web-Harvest configuration file.

Predefined and user-defined variables and objects

Every Web-Harvest variable context initially contains the following helper objects, which can be used in any expression inside a configuration:

Object sys which contains some general-purpose constants and methods:

sys.lf Line feed character (\n)
sys.cr Carriage return character (\r)
sys.tab Tab character (\t)
sys.space Space character ( )
sys.quot Double quote character (")
sys.apos Single quote character (')
sys.backspace Backspace character (\b)
sys.date() Returns current date in the yyyyMMdd format
sys.time() Returns current time in the HHmmss format
sys.datetime(format) Returns date/time in specified format (Java date and time formatting patterns must be used).
sys.escapeXml(text) Escapes characters &'"<> in the specified text according to XML standard.
sys.fullUrl(pageUrl, link) For the specified URL of the web page and specified link (which could be relative, absolute or full URL) returns full URL.
sys.defineVariable(varname, varvalue, [overwrite]) Defines a new variable with the specified name and value in the current Web-Harvest context. The optional overwrite parameter tells whether to overwrite an existing variable with the same name; its default value is true. The method has the same effect as the var-def processor, but it is useful for exchanging values between scripts and the Web-Harvest context.
sys.isVariableDefined(varname) Tells if variable with specified name is defined in the context.
xpath(xpathexpr, xml) Evaluates XPath expression on specified XML. Returns instance of org.webharvest.runtime.variables.Variable class.

For more details check Java API of this class.
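
As a minimal sketch of how these helpers can be combined, the fragment below builds a short text inside a template. The variable name report, the example.com URL and the sample strings are placeholders invented for illustration; only sys members listed above are used.

```xml
<config>
    <!-- "report" is a hypothetical variable name -->
    <var-def name="report">
        <!-- CDATA keeps the & in the sample string from breaking the XML -->
        <template><![CDATA[
            Generated on ${sys.datetime("yyyy-MM-dd HH:mm")}${sys.lf}
            Link: ${sys.fullUrl("http://www.example.com/news/index.html", "archive.html")}${sys.lf}
            Escaped: ${sys.escapeXml("Tom & Jerry")}
        ]]></template>
    </var-def>
</config>
```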

Object http which provides access to Http client and gives information about HTTP responses:

http.client Returns instance of org.apache.commons.httpclient.HttpClient class which is used as primary HTTP client during the configuration execution. For more details, see Jakarta HttpClient documentation.
http.contentLength Length of the last HTTP response's content in bytes.
http.charset Encoding of the last HTTP response, if it contains textual content.
http.mimeType MIME type of the last HTTP response.
http.headers (changed in Web-Harvest 2) Array of HTTP response header pairs.
To access an individual header by index, use
http.headers.length,
http.headers[index].key,
http.headers[index].value.
To access the first header value by name, use
http.getHeader("headername").
To get all headers with the specified name, use http.getHeaders("headername"), for example
http.getHeaders("Set-Cookie").length,
http.getHeaders("Set-Cookie")[index].
http.statusCode Status code of the last HTTP response.
http.statusText Status message of the last HTTP response.
http.totalLength Total length in bytes of all responses returned to this HTTP client.
http.totalResponses Total number of responses returned to this HTTP client.
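
For illustration, a configuration might inspect the http object right after a request, as in the following sketch. The URL and the variable names are placeholders, and comparing http.statusCode with a number assumes that the property is exposed as a numeric value.

```xml
<config>
    <var-def name="page">
        <http url="http://www.example.com/"/>
    </var-def>

    <!-- "summary" is a hypothetical variable describing the last response -->
    <var-def name="summary">
        <case>
            <if condition="${http.statusCode == 200}">
                <template>Fetched ${http.contentLength} bytes of ${http.mimeType}</template>
            </if>
            <else>
                <template>Request failed: ${http.statusCode} ${http.statusText}</template>
            </else>
        </case>
    </var-def>
</config>
```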

Note 1: See Usage page to check how user-defined objects can be put into variable context.

Valid XML configuration elements

config

This is the root element of every configuration file.

Syntax

<config charset="charset_value" scriptlang="default_script_lang">
    configuration body
</config>

Attributes

Name Required Default Description
charset no UTF-8 Defines the default charset used throughout the configuration. Every processor that needs charset information uses this value unless another one is explicitly set.
scriptlang no beanshell Defines the default scripting engine used throughout the configuration. Allowed values are beanshell, javascript and groovy. The default engine is used wherever no other is specified. The script and template processors can specify their own scripting engine, which makes it possible to mix different scripting languages within the same Web-Harvest configuration.

Example

<config charset="ISO-8859-1" scriptlang="groovy">
    <file action="write" path="squares.txt">
        <script return="${[1, 2, 3, 4, 5].collect(square)}">
            <![CDATA[ 
                square = { it * it }
            ]]>
        </script>
    </file>
</config>

Here, the file processor uses the default encoding ISO-8859-1 and the script processor uses the default language groovy.
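
The scriptlang description mentions that scripting languages can be mixed within one configuration. A hedged sketch of that idea follows; the variable names are made up, and it assumes that the language attribute of a script processor overrides the default for that processor only.

```xml
<config scriptlang="beanshell">
    <!-- Runs in the default engine (beanshell) -->
    <script><![CDATA[
        greeting = "Hello";
    ]]></script>

    <!-- Runs in the javascript engine and returns the value of "answer" -->
    <script language="javascript" return="answer"><![CDATA[
        var answer = 6 * 7;
    ]]></script>
</config>
```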

empty

Wraps an execution sequence and returns an empty value. This element is used when the execution result is not needed.

Syntax

<empty>
    wrapped body
</empty>

Example

<file action="write" path="test/amazon_home.html">
    <empty>
        <var-def name="amazonContent">
            <http url="http://www.amazon.com"/>
        </var-def>
    </empty>
    <template>
        <![CDATA[
            <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
        ]]>
        ${amazonContent}
    </template>
</file>

A new variable is created, but its value does not participate in the result because it is defined inside the empty element. Instead, its value is used in the template processor that follows.

text

Converts embedded value to the string representation.

Syntax

<text charset="charset" delimiter="delimiter">
    wrapped body
</text>

Attributes

Name Required Default Description
charset no default configuration's charset Charset used if body is converted from binary to text value.
delimiter no new line character Delimiter string used to separate items when concatenating them into single string.

Example

<var-def name="digits">
    <while condition="${i.toInt() != 10}" index="i">
        <template>${i}</template>
    </while>
</var-def>
 
<file action="write" path="/test/replaced23.txt">
    <regexp replace="true">
        <regexp-pattern>(.*)(2.*3)(.*)</regexp-pattern>
        <regexp-source>
            <text>
                <var name="digits"/>
            </text>
        </regexp-source>
        <regexp-result>
            <template>${_1}here were 2 and 3${_3}</template>
        </regexp-result>
    </regexp>
</file>

The variable named digits is defined using the while processor, producing a sequence of 9 values. Next, the regexp processor is invoked to do a search-and-replace on the variable's value. The text processor is used to concatenate all digit values into a single string. Without the text processor, the regular expression search would be applied to each item in the list (every digit) separately, and no replacement would occur, because no single item matches the pattern "2.*3".
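
The delimiter attribute is not used in the example above, so here is a small sketch of it. The variable names are illustrative; the fragment assumes the while index starts at 1, as documented for the while processor.

```xml
<var-def name="digits">
    <while condition="true" index="i" maxloops="5">
        <template>${i}</template>
    </while>
</var-def>

<!-- Joins the list items with ", " instead of the default new line -->
<var-def name="digitLine">
    <text delimiter=", ">
        <var name="digits"/>
    </text>
</var-def>
```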

var-def

Defines new or overrides existing variable with specified name and value.

Syntax

<var-def name="variable_name" overwrite="overwrite_existing">
    body as value of the variable
</var-def>

Attributes

Name Required Default Description
name yes The name of variable. Should be valid like in most programming languages.
overwrite no true Boolean value (true or false) telling whether an existing variable with the same name will be overwritten.

Example

<var-def name="digitList">
    <while condition="true" index="i" maxloops="9">
        <var-def name="digit${i}">
            <template>${i}</template>
        </var-def>
    </while>
</var-def>

This example defines the variable digitList, which is a sequence of 9 values (digits from 1 to 9), and 9 simple variables digit1, digit2, ..., digit9 with values ranging from 1 to 9.
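
The overwrite attribute can be sketched as follows. The variable name and paths are hypothetical; based on the attribute description, the second definition should leave the first value in place.

```xml
<var-def name="outputDir">data</var-def>

<!-- Should not replace the existing value, because overwrite is false -->
<var-def name="outputDir" overwrite="false">tmp</var-def>

<file action="write" path="${outputDir}/result.txt">
    <var name="outputDir"/>
</file>
```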

var

Returns value of defined variable. Throws an exception if variable is not defined.

Syntax

<var name="variable_name"/>

Attributes

Name Required Default Description
name yes Variable name.

Example

<var-def name="searchEngine">
    google
</var-def>
 
<var-def name="${searchEngine}Content">
    <http url="http://www.${searchEngine}.com"/>
</var-def>
 
<file action="write" path="data/${searchEngine}_content.html">
    <var name="${searchEngine}Content"/>
</file>

After execution, file named "google_content.html" contains page content of www.google.com.

file

Reads or writes the content of a file, or searches a directory for specified files.

Syntax

<file action="file_action" 
      path="file_path" 
      type="file_type" 
      charset="charset_of_text_file"
      listdirs="listdirs"
      listfiles="listfiles"
      listrecursive="listrecursive"
      listfilter="listfilter">
    body defining content of the file if action="write" or action="append"
</file>

Attributes

Name Required Default Description
action no read Defines file action. Valid values are read, append, write and list.
path yes File path, relative to the working directory.
type no text Type of file: text or binary.
charset no [Default charset for config] Charset for text files. Has no effect if type is binary.
listdirs no yes Tells whether to list directories (action = list).
listfiles no yes Tells whether to list files (action = list).
listrecursive no no Tells whether to recursively search directories (action = list).
listfilter no Filename pattern to search for (* is replacement for any sequence, ? for any character). Works only for action = list.

Example 1

<file action="write" path="123.txt">
    <file action="read" path="1.txt"/>
    -----------------------------------
    <file action="read" path="2.txt"/>
    -----------------------------------
    <file action="read" path="3.txt"/>
</file>

Here, a new file is created containing the concatenated contents of three existing files, separated by lines of dashes.

Example 2

<file action="write" path="c:/images/alljpegs.zip" type="binary">
    <zip>
        <loop item="filename">
            <list>
                <file path="c:/images/" action="list" listfilter="*.jpg" />
            </list>
            <body>
                <zip-entry name="${sys.getFilename(filename.toString())}">
                    <file type="binary" path="${filename}"/>
                </zip-entry>
            </body>
        </loop>
    </zip>
</file>

A ZIP file containing all JPEG images from the specified directory is created.
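
The append and list actions, together with the list* attributes, might be used as in the sketch below. File names, the directory and the variable name are placeholders chosen for illustration.

```xml
<!-- Adds one line to the end of log.txt on every run -->
<file action="append" path="log.txt">
    <template>${sys.datetime("yyyy-MM-dd HH:mm:ss")} run started${sys.lf}</template>
</file>

<!-- Lists only files (no directories) matching *.xml, including subdirectories -->
<var-def name="xmlFiles">
    <file action="list" path="data" listfiles="yes" listdirs="no"
          listrecursive="yes" listfilter="*.xml"/>
</var-def>
```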

http

Sends an HTTP request to the specified URL and returns the HTTP response as the result. The body of the processor is executed first, in order to define optional HTTP parameters and/or headers, and then the HTTP request is sent.

Syntax

<http url="url" 
      method="method" 
      charset="charset"
      cookie-policy="cookie_policy"
      username="username"
      password="password"
      multipart="multipart">
    body that might contain http-param and/or http-header elements
</http>

Attributes

Name Required Default Description
url yes HTTP request URL.
method no get HTTP method: get or post
charset no [Default charset for config] Defines encoding of the HTTP response content. Has no effect if content type is binary.
cookie-policy no [Default cookie policy of the HTTP client] Specifies how the HTTP client manages cookies. Allowed values are: browser, ignore, netscape, rfc_2109 and default.
username no Specifies username if URL requires authentication.
password no Specifies password if URL requires authentication.
multipart no no Tells if form is multipart encoded (enabling data upload).

Example

<xpath expression="data(//script)">
    <html-to-xml>
        <http url="http://www.yahoo.com/"/>
    </html-to-xml>
</xpath>

The content of www.yahoo.com is downloaded, transformed to XML and then all scripts inside the page are found.

http-param

Adds HTTP parameter for the first enclosing HTTP processor for both post and get requests. If used outside the HTTP processor an exception is thrown.

Syntax

<http-param name="param_name" 
            isfile="isfile"
            contenttype="contenttype" 
            filename="filename">
    body as parameter value
</http-param>

Attributes

Name Required Default Description
name yes The name of HTTP parameter.
isfile no no Tells if parameter is file for upload (applies only to multipart requests).
contenttype no MIME type of the upload file (effective for multipart forms where parameter is file).
filename no Name of uploaded file (effective for multipart forms where parameter is file).

Example

<var-def name="paramNames">
    USERID
    PASSWORD
</var-def>
 
<http method="post" url="http://www.nytimes.com/auth/login">
   <http-param name="is_continue">true</http-param>
   <http-param name="URI">http://</http-param>
   <http-param name="OQ"></http-param>
   <http-param name="OP"></http-param>
 
   <loop item="name">
       <list>
           <var name="paramNames"/>
       </list>
       <body>
           <http-param name="${name}">web-harvest</http-param>
       </body>
   </loop>
</http>

Sends needed parameters to www.nytimes.com/auth/login in order to log in.
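
The file-related attributes apply to multipart requests. The following is only a sketch: the URL, parameter names and file path are invented, and supplying the upload content through a binary file processor in the parameter body is an assumption rather than something this manual states.

```xml
<http method="post" url="http://www.example.com/upload" multipart="yes">
    <!-- Ordinary form field -->
    <http-param name="description">Quarterly report</http-param>

    <!-- File field: isfile marks the parameter as an upload -->
    <http-param name="attachment" isfile="yes"
                filename="report.pdf" contenttype="application/pdf">
        <file path="data/report.pdf" type="binary"/>
    </http-param>
</http>
```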

http-header

Defines HTTP header for the first enclosing HTTP processor. If used outside the HTTP processor an exception is thrown.

Syntax

<http-header name="header_name">
    body as header value
</http-header>

Attributes

Name Required Default Description
name yes The name of HTTP header.

Example

<http url="www.google.com">
    <http-header name="User-Agent">
        Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
    </http-header>
</http>

Identifies itself to www.google.com as Firefox browser.

html-to-xml

Cleans up the content of the body and transforms it to valid XML. The body is usually HTML obtained as a result of http processor execution. The actual parsing and cleaning job is delegated to the HtmlCleaner tool. Although no special tuning is needed in most cases, the cleaner may be configured with several parameters defined by the processor's attributes.

Syntax

<html-to-xml outputtype="..." advancedxmlescape="..." usecdata="..." 
             specialentities="..." unicodechars="..." omitunknowntags="..."
             treatunknowntagsascontent="..." omitdeprtags="..."
             treatdeprtagsascontent="..." omitcomments="..."
             omithtmlenvelope="..." allowmultiwordattributes="..."
             allowhtmlinsideattributes="..." namespacesaware="..." 
             prunetags="..." omitxmldecl="...">
    body as html to be cleaned
</html-to-xml>

Attributes

Name Required Default Description
outputtype no simple Defines how the resulting XML will be serialized. Allowed values are simple, compact, browser-compact and pretty.
advancedxmlescape no true If this parameter is set to true, an ampersand sign (&) that starts a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;.
usecdata no true If true, HtmlCleaner will treat the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they will be regarded as ordinary text (special characters will be escaped).
specialentities no true If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the unicode characters they represent (ô, ‰, ×). This doesn't include &, <, >, ", '.
unicodechars no true If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real unicode characters (e.g. &#1078; is replaced with ж).
omitunknowntags no false Tells whether to skip (ignore) unknown tags during cleanup.
treatunknowntagsascontent no false Tells whether to treat unknown tags as ordinary content, e.g. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitunknowntags is set to false.
omitdeprtags no false Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatdeprtagsascontent no false Tells whether to treat deprecated tags as ordinary content, e.g. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitdeprtags is set to false.
omitcomments no false Tells whether to skip HTML comments.
omithtmlenvelope no false Tells whether to remove HTML and BODY tags from the resulting XML, and use first tag in the BODY section instead. If BODY section doesn't contain any tags, then this attribute has no effect.
allowmultiwordattributes no true Tells the parser whether to allow attribute values consisting of multiple words. If true, the attribute att="a b c" will stay as it is; if false, the parser will split it into att="a" b="b" c="c" (this is the default browser behaviour).
allowhtmlinsideattributes no false Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" will stay as it is; if not, the parser will end the attribute value after "here is ".
This flag makes sense only if allowmultiwordattributes is set as well.
namespacesaware no true If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped.
prunetags no empty string Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if prunetags is "script,style", the resulting XML will not contain scripts and styles.
omitxmldecl no false Tells whether to skip XML declaration. The value must be true or false.

Example

<html-to-xml outputtype="pretty">
    <http url="http://www.motors.ebay.com"/>
</html-to-xml>

Downloads the www.motors.ebay.com page and cleans it up, producing pretty-printed XML content.
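
Several cleaner attributes can be combined in one processor. The sketch below uses a placeholder URL and simply follows the attribute descriptions in the table: scripts and styles are pruned, comments are skipped and the XML declaration is omitted.

```xml
<html-to-xml outputtype="compact"
             prunetags="script,style"
             omitcomments="true"
             omitxmldecl="true">
    <http url="http://www.example.com/"/>
</html-to-xml>
```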

regexp

Searches the body for the given regular expression and optionally replaces found occurrences with specified pattern. If body is a list of values then the regexp processor is applied to every item and final execution result is the list.

Syntax

<regexp replace="true_or_false" 
        max="max_found_occurrences"
        flag-canoneq="flag-canoneq"
        flag-caseinsensitive="flag-caseinsensitive"
        flag-dotall="flag-dotall"
        flag-multiline="flag-multiline"
        flag-unicodecase="flag-unicodecase">
    <regexp-pattern>
        body as pattern value
    </regexp-pattern>
    <regexp-source>
        body as the text source 
    </regexp-source>
    [<regexp-result>
        body as the result
    </regexp-result>]
</regexp>
For each group in the search pattern and for each occurrence found, variables named _<group_number> are created. See a regular expression tutorial for a fuller explanation of groups.

Attributes

Name Required Default Description
replace no false Logical value telling if found occurrences of regular expression will be replaced. Valid values are: true/false or yes/no. If this value is true (yes), then the regexp-result needs to be specified with replacement value.
max no Limits the number of found pattern occurrences. There is no limit if it is not specified.
flag-canoneq no no Enables canonical equivalence.
flag-caseinsensitive no no Enables case-insensitive matching.
flag-dotall no yes Enables dotall mode.
flag-multiline no no Enables multiline mode.
flag-unicodecase no yes Enables Unicode-aware case folding.

Example #1

<regexp>
    <regexp-pattern>([_\w\d]*)[\s]*=[\s]*([\w\d\s]*+)[\,\.\;]*</regexp-pattern>
    <regexp-source>
        var1= test1, var2 = bla bla; index=16; 
        city = Delhi,town=Kingston;
    </regexp-source>
    <regexp-result>
        <template>Value of variable "${_1}" is "${_2}"!</template>
    </regexp-result>
</regexp>

Here, the regular expression searches the two-line source text for the specified pattern, producing a list of five values: Value of variable "var1" is "test1"!, Value of variable "var2" is "bla bla"! ...

Example #2

<regexp replace="true">
    <regexp-pattern>[\s]*[\,\.\;][\s]*</regexp-pattern>
    <regexp-source>
        var1= test1, var2 = bla bla; index=16; city = Delhi,town=Kingston;
    </regexp-source>
    <regexp-result>
        <template>|</template>
    </regexp-result>
</regexp>

Here, the regular expression replacement produces single value as the result: var1= test1|var2 = bla bla|index=16|city = Delhi|town=Kingston|.
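
The max and flag-* attributes are not exercised by the two examples above. In the sketch below, which uses made-up source text, the case-insensitive flag lets the lowercase pattern match ERROR and Error as well, and max="2" should limit the result to the first two occurrences.

```xml
<regexp max="2" flag-caseinsensitive="yes">
    <regexp-pattern>error:\s*(\d+)</regexp-pattern>
    <regexp-source>
        ERROR: 404 on page A; Error: 500 on page B; error: 503 on page C
    </regexp-source>
    <regexp-result>
        <!-- _1 holds the first captured group, here the numeric code -->
        <template>code ${_1}</template>
    </regexp-result>
</regexp>
```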

xpath

Uses an XPath language expression to search an XML document.

Syntax

<xpath expression="xpath_expression">
    body as xml
</xpath>

Attributes

Name Required Default Description
expression yes XPath language expression.

Example

<xpath expression="//a/@href">
    <html-to-xml>
        <http url="http://www.nba.com/"/>
    </html-to-xml>
</xpath>

The result is sequence of links from the page retrieved from www.nba.com.

xquery

Uses an XQuery language expression to query an XML document.

Syntax

<xquery>
    [<xq-param name="xquery_param_name" [type="xquery_param_type"]>
        body as xquery parameter value
    </xq-param>] *
    <xq-expression>
        body as xquery language construct
    </xq-expression>
</xquery>

Attributes

Name Required Default Description
name yes Name of XQuery parameter
type no node() Type of XQuery parameter - one of the values: node(), integer, long, float, double, boolean, string, node()*, integer*, long*, float*, double*, boolean*, string*.

Multiple external parameters may optionally be specified for the query. In most cases at least one, containing the XML document, is needed. For every specified XQuery parameter, a declaration inside the xq-expression of the form:

declare variable $<xquery_param_name> as <xquery_param_type> external;
is required in order to match the name and type of the passed parameter. Valid parameter types supported by Web-Harvest are: node(), integer, long, float, double, boolean, string and the analogous sequence types: node()*, integer*, long*, float*, double*, boolean*, string*. If not specified, the default XQuery parameter type is node().

Example

<xquery>
    <xq-param name="doc">
        <html-to-xml>
            <http url="${sys.fullUrl(startUrl, articleUrl)}"/>
        </html-to-xml>
    </xq-param>
    <xq-expression><![CDATA[
        declare variable $doc as node() external;
        
        let $author := data($doc//div[@class="byline"])
        let $title := data($doc//h1)
        let $text := data($doc//div[@id="articleBody"])
            return
                <article>
                    <title>{$title}</title>
                    <author>{$author}</author>
                    <text>{$text}</text>
                </article>
    ]]></xq-expression>
</xquery>

The XQuery expression is applied to the downloaded page, producing XML that contains information about a newspaper article.

xslt

Applies XSLT transformation to the XML document.

Syntax

<xslt>
    <xml>
        body as xml
    </xml>
    <stylesheet>
        body as xsl
    </stylesheet>
</xslt>

Example

<xslt>
    <xml>
        <html-to-xml>
            <http url="${url}"/>
        </html-to-xml>
    </xml>
    <stylesheet>
        <file path="stylesheets/tree.xsl"/>
    </stylesheet>
</xslt>

The XSLT transformation, read from the file, is applied to the downloaded content.

script

Executes code written in the specified scripting language. Web-Harvest supports BeanShell, Groovy and Javascript. All of them are powerful, widespread and popular scripting languages.

The body of the script processor is executed in the specified language and, optionally, the expression given in the return attribute is evaluated and returned. All variables defined during configuration execution are also available in the script processor. However, it must be noted that variables used throughout Web-Harvest are not simple types: they are all org.webharvest.runtime.variables.Variable objects (an internal Web-Harvest class) that expose convenient methods:

  • String toString()
  • byte[] toBinary()
  • boolean toBoolean()
  • int toInt()
  • long toLong()
  • double toDouble()
  • Object[] toArray()
  • java.util.List toList()
  • Object getWrappedObject()

The way to push a value back to Web-Harvest after the script finishes is the command sys.defineVariable(varName, varValue, [overwrite]), which creates an appropriate wrapper around the specified value: list variables for java.util.List and arrays, and simple variables for other objects. This is best illustrated by the simple example below.

Each script engine used in the single Web-Harvest configuration, once created, preserves its variable context throughout the configuration, meaning that all variables and objects are available in further script processors that use the same language.

Syntax

<script language="script_language" return="value_to_return">
    body as script
</script>

Attributes

Name Required Default Description
language no Default scripting language if defined in config element, or beanshell if nothing is defined. Defines which scripting engine is used in the processor. Valid values are beanshell, javascript and groovy.
return no Empty value Specifies what this processor should evaluate at the end and return as processing value.

Example

<?xml version="1.0" encoding="UTF-8"?>
 
<config>
    <var-def name="birthDate">
        11/4/1958
    </var-def>
    
    <var-def name="web_harvest_day_variable">
        <script return="namedDay.toUpperCase()"><![CDATA[
            tokenizer = new StringTokenizer(birthDate.toString(), "./-\\");
        
            day = Integer.parseInt(tokenizer.nextToken());
            month = Integer.parseInt(tokenizer.nextToken());
            year = Integer.parseInt(tokenizer.nextToken());
        
            Calendar cal = Calendar.getInstance();
            cal.set(Calendar.DAY_OF_MONTH, day);
            cal.set(Calendar.MONTH, month-1);
            cal.set(Calendar.YEAR, year);
        
            // Calendar.DAY_OF_WEEK ranges from 1 (Sunday) to 7 (Saturday)
            switch( cal.get(Calendar.DAY_OF_WEEK) ) {
                case 1 : namedDay = "Sunday"; break;
                case 2 : namedDay = "Monday"; break;
                case 3 : namedDay = "Tuesday"; break;
                case 4 : namedDay = "Wednesday"; break;
                case 5 : namedDay = "Thursday"; break;
                case 6 : namedDay = "Friday"; break;
                default: namedDay = "Saturday"; break;
            }
        ]]></script>
    </var-def>
    
    <template>
        The day when you were born was ${namedDay}.
    </template>
    
    <file action="write" path="day.txt">
        <var name="web_harvest_day_variable"/>
    </file>
</config>

This example also shows that internal script variables, once defined, are available in all subsequent script and template processors (here, namedDay).
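
Pushing a value from a script back into the Web-Harvest context with sys.defineVariable can be sketched as follows. The variable and file names are hypothetical, and the default BeanShell engine is assumed.

```xml
<config>
    <script><![CDATA[
        // Plain Java list built inside the script
        words = new java.util.ArrayList();
        words.add("alpha");
        words.add("beta");

        // Push it back to Web-Harvest; a java.util.List becomes a list variable
        sys.defineVariable("wordList", words);
    ]]></script>

    <!-- The list defined in the script is iterated like any other variable -->
    <file action="write" path="words.txt">
        <loop item="word">
            <list>
                <var name="wordList"/>
            </list>
            <body>
                <template>${word}${sys.lf}</template>
            </body>
        </loop>
    </file>
</config>
```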

template

For the given text content, parts surrounded with ${ and } are evaluated using the specified scripting engine. If no scripting language is specified, default one is used (see config element).

Syntax

<template language="script_language">
    body as text for templating
</template>

Attributes

Name Required Default Description
language no Default config language Specifies script language that will be used for evaluation of parts surrounded with ${ and }. Valid values are beanshell, javascript and groovy.

Example

<var-def name="content">
    <file path="textdata/products.txt"/>
</var-def>
 
<var-def name="changedContent">
    <template>
        ${sys.datetime("yyyy-MM-dd, HH:mm:ss")} ${sys.lf}
        ---------------------------------------------------- ${sys.lf}
        ${my.process(content.toString())}
    </template>
</var-def>

Templater uses some built-in constants, functions and some user-defined objects from variable context in order to produce desired content.

case

Executes a conditional statement. The conditions of the inner if elements are checked sequentially, and the body of the first one that is satisfied is returned as the result. If no condition is true, the result is the body of the else element if it is specified, or an empty value otherwise.

Syntax

<case>
    [<if condition="expression">
        if body
    </if>] *
    [<else>
        else body
    </else>]
</case>

Attributes

Name Required Default Description
condition yes If true (yes), body of if is evaluated.

Example

<var-def name="contact">
    <xpath expression="//a[contains(., 'contact')]/@href">
        <var name="pageContent"/>
    </xpath>
</var-def>
 
<var-def name="contactMail">
    <case>
        <if condition="${contact.toString() != ''}">
            <var name="contact"/>
        </if>
        <else>
            Contact is not defined!
        </else>
    </case>
</var-def>

Here, conditional processor is used to check if previous xpath search has found contact information on the page.
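
A case element may hold several if branches, which are checked in order. The sketch below assumes a pageContent variable has already been defined, as in the example above; the label values and the size threshold are arbitrary.

```xml
<var-def name="sizeLabel">
    <case>
        <!-- Checked first -->
        <if condition="${pageContent.toString().length() == 0}">
            empty
        </if>
        <!-- Checked only if the first condition is false -->
        <if condition="${pageContent.toString().length() > 100000}">
            large
        </if>
        <else>
            normal
        </else>
    </case>
</var-def>
```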

loop

Iterates through the specified list and executes the body logic for each item. The result is the list of processed bodies.

Syntax

<loop item="item_var_name" 
      index="index_var_name" 
      maxloops="max_loops" 
      filter="list_filter"
      empty="empty">
    <list>
        body as list value
    </list>
    <body>
        body for each list item
    </body>
</loop>

Attributes

Name Required Default Description
item no Name of the variable that takes the value of current list item.
index no Name of the index variable, initial value for the first loop is 1.
maxloops no Limits number of iterations. There is no limit if it is not specified.
filter no Expression for filtering iteration list. It consists of arbitrary number of restrictions separated by comma. There are the following types of restrictions:
  • [n]-[m], for specifying index range, for example: 3-6, -5.
  • [n][:][m], for specifying sublist starting at index n and including items at indexes n+m, n+2*m, n+3*m, ..., for example 1:2 for all odd, 2:2 for all even.
  • unique, that removes duplicates from list comparing string values of list items.
Valid filter which is combination of allowed restrictions is for example: 1-20,1:2,unique.
empty no no Equal to surrounding body by empty element, producing empty result of iteration.

Example

<loop item="link" index="i" filter="unique">
    <list>
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://www.yahoo.com"/>
            </html-to-xml>
        </xpath>
    </list>
    <body>
        <file action="write" type="binary" path="images/${i}.gif">
            <http url="${sys.fullUrl('http://www.yahoo.com', link)}"/>
        </file>
    </body>
</loop>

The loop iterates over all unique image URLs from www.yahoo.com and, for each URL, downloads the image and stores it in the file system.
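
Filter restrictions can be combined with maxloops. In this sketch, with a placeholder URL, the filter keeps the items at indexes 1 to 20 with duplicates removed, and maxloops then limits the loop to five iterations.

```xml
<loop item="link" index="i" filter="1-20,unique" maxloops="5">
    <list>
        <xpath expression="//a/@href">
            <html-to-xml>
                <http url="http://www.example.com/"/>
            </html-to-xml>
        </xpath>
    </list>
    <body>
        <!-- Produces one line per link, prefixed with the loop index -->
        <template>${i}: ${sys.fullUrl("http://www.example.com/", link)}${sys.lf}</template>
    </body>
</loop>
```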

while

Loops while specified condition is satisfied. The result is list made of processed bodies in each iteration.

Syntax

<while condition="expression" 
       index="index_var_name" 
       maxloops="max_loops"
       empty="empty">
    body
</while>

Attributes

Name Required Default Description
condition yes Expression that is evaluated for every loop and if its value is true, the body is executed.
index no Name of the index variable, initial value for the first loop is 1.
maxloops no Limits number of iterations. There is no limit if it is not specified.
empty no no Equal to surrounding body by empty element, producing empty result of iteration.

Example

See example from function processor.
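
As a smaller, self-contained sketch, the loop below is driven by a counter variable that the body itself updates. The variable names are invented; the body should execute three times (for counter values 1, 2 and 3), with maxloops acting only as a safety limit.

```xml
<var-def name="counter">1</var-def>

<var-def name="squares">
    <while condition="${counter.toInt() != 4}" maxloops="10">
        <template>${counter.toInt() * counter.toInt()}</template>
        <!-- The counter update is wrapped in empty so it adds nothing to the result -->
        <empty>
            <var-def name="counter">
                <template>${counter.toInt() + 1}</template>
            </var-def>
        </empty>
    </while>
</var-def>
```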

function

Declares the user-defined function.

Syntax

<function name="function_name">
    function body
</function>

Attributes

Name Required Default Description
name yes The name of user-defined function

Example

<function name="download-multipage-list">
    <return>
        <while condition="${pageUrl.toString().trim() != ''}" maxloops="${maxloops}" index="i">
            <empty>
                <var-def name="content">
                    <html-to-xml>
                        <http url="${pageUrl}"/>
                    </html-to-xml>
                </var-def>
 
                <var-def name="nextLinkUrl">
                    <xpath expression="${nextXPath}">
                        <var name="content"/>
                    </xpath>
                </var-def>
 
                <var-def name="pageUrl">
                    <template>${sys.fullUrl(pageUrl, nextLinkUrl)}</template>
                </var-def>
            </empty>
 
            <xpath expression="${itemXPath}">
                <var name="content"/>
            </xpath>
        </while>
    </return>
</function>
 
<var-def name="imgLinks">
    <call name="download-multipage-list">
        <call-param name="pageUrl">
            http://images.google.com/images?q=harvest&amp;hl=en&amp;btnG=Search+Images&amp;nojs=1
        </call-param>
        <call-param name="nextXPath">
            //a[@shape='rect' and .='Next']/@href
        </call-param>
        <call-param name="itemXPath">
            //img[contains(@src, 'images?q=tbn')]/@src
        </call-param>
        <call-param name="maxloops">
            5
        </call-param>
    </call>
</var-def>

Here the function download-multipage-list is defined so that it can be reused for multiple extractions. It collects items from a series of pages: the itemXPath parameter selects the items on each page, and the nextXPath parameter locates the URL of the next page, if one exists. This situation is typical for lists of products or search results that span multiple web pages. The function is then called with specific parameters to collect image links from Google image search, limiting the number of result pages to 5.
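
Because the function is parameterized, it can be called again for a different site. The following is a sketch only: the URL and both XPath expressions are hypothetical and would have to match the markup of the real target pages.

```xml
<!-- Hypothetical site and XPath expressions, reusing the function defined above -->
<var-def name="productLinks">
    <call name="download-multipage-list">
        <call-param name="pageUrl">
            http://www.example.com/products?page=1
        </call-param>
        <call-param name="nextXPath">
            //a[@class='next']/@href
        </call-param>
        <call-param name="itemXPath">
            //div[@class='product']//a/@href
        </call-param>
        <call-param name="maxloops">
            10
        </call-param>
    </call>
</var-def>
```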

return

Returns a value from a user-defined function.

Syntax

<return>
    body as return value
</return>

Example

See the example for the function processor.

call

Calls a user-defined function.

Syntax

<call name="function_name">
    [<call-param name="param_name">
        body as actual parameter value
    </call-param>] *
</call>

Attributes

Name Required Default Description
name yes The name of user-defined function

Example

See the example for the function processor.

include

Includes another configuration file and executes its logic. This is useful for keeping libraries of common functions or for splitting a large extraction process into multiple files.

Syntax

<include path="file_path"/>

Attributes

Name Required Default Description
path yes Path of the configuration file to be included. The path is relative to the directory of the including configuration file.

Example

<include path="lib.xml"/>
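
A typical use is sketched below. It assumes that lib.xml declares the download-multipage-list function from the function example; the URL and XPath expressions are hypothetical.

```xml
<config>
    <!-- Assumption: lib.xml declares the function download-multipage-list -->
    <include path="lib.xml"/>

    <var-def name="links">
        <call name="download-multipage-list">
            <call-param name="pageUrl">http://www.example.com/news</call-param>
            <call-param name="nextXPath">//a[@rel='next']/@href</call-param>
            <call-param name="itemXPath">//h2/a/@href</call-param>
            <call-param name="maxloops">3</call-param>
        </call>
    </var-def>
</config>
```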

try

Wraps the execution of its body. If a recoverable exception occurs, the result of the catch part is returned as a default value instead of crashing the whole process.

Syntax

<try>
    <body>
        try body
    </body>
    <catch>
        catch body
    </catch>
</try>

Example

<var-def name="reportText">
    <try>
        <body>
            <file path="data/report.txt"/>
        </body>
        <catch>
            No report file!
        </catch>
    </try>
</var-def>

If reading the file throws an exception, it is caught and the default value is stored in the variable.
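
The same pattern can guard a download. This is a sketch with a hypothetical URL; it assumes that a failed request raises an exception that try treats as recoverable.

```xml
<!-- Hypothetical URL; on failure the variable receives the fallback text -->
<var-def name="pageContent">
    <try>
        <body>
            <http url="http://www.example.com/daily-report.html"/>
        </body>
        <catch>
            Page not available.
        </catch>
    </try>
</var-def>
```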

exit

Conditionally breaks the configuration execution.

Syntax

<exit condition="condition" message="message" />

Attributes

Name Required Default Description
condition no true Condition that determines whether execution stops. Must be a boolean value (true, yes, false, no).
message no Optional message shown to the user when the configuration exits. It becomes part of the logging information, or appears in a warning dialog if Web-Harvest is used in GUI mode.

Example

<exit condition='${!sys.isVariableDefined("username")}' message="No username provided!" />

The configuration stops executing if the variable username is not defined.

database (2.0)

Executes a query against a database.

JDBC driver library file(s) must be on the classpath if Web-Harvest is used programmatically, or in the same directory as the Web-Harvest executable if it is used standalone. For a SELECT statement, the processor returns a list of row objects, which can be accessed with special accessor methods:

<mydbrow>.getColumnCount() - returns the number of columns in the row.
<mydbrow>.getColumnName(index) - returns the name of the column with the given number.
<mydbrow>.get(column_index) - returns the field value for the given column number.
<mydbrow>.get(column_name) - returns the field value for the given column name.

Individual rows can be retrieved from the returned list by index:

<mydbvar>.get(rowindex)

For example:

mydb.get(0).get("image")

Syntax

<database connection="jdbc connection string" 
          jdbcclass="full named jdbc class" 
          username="username" 
          password="password" 
          autocommit="autocommit" 
          max="max rows returned">
    select, insert or delete SQL query
</database>

Attributes

Name Required Default Description
connection yes Properly formatted JDBC string for the database. It depends on database/driver vendor.
jdbcclass yes Fully qualified class name of the JDBC driver.
username no Username to access database.
password no Password to access database.
autocommit no true Whether commit is performed automatically after query execution.
max no no limit Maximum number of returned rows from the SELECT statement.

Example 1

<var-def name="employees">
    <database connection="jdbc:microsoft:Sqlserver://myserver:1433;databaseName=mycompany;user=sa;password=hehehe" 
              jdbcclass="com.microsoft.jdbc.sqlserver.SQLServerDriver">
        select name, salary from employee
    </database>
</var-def>
 
<loop item="emp">
    <list>
        <var name="employees"/>
    </list>
    <body>
        <template>Salary of ${emp.get("name")} is ${emp.get("salary")}</template>
    </body>
</loop>

Example 2

<database connection="jdbc:microsoft:Sqlserver://myserver:1433;databaseName=mycompany;user=sa;password=hehehe" 
          jdbcclass="com.microsoft.jdbc.sqlserver.SQLServerDriver">
    <template>
        insert into news (id, url, text, source)
        values (${myId}, '${myUrl}', '${myText}', '${mySource}')
    </template>
</database>

db-param (2.0)

Specifies a parameter of the SQL statement inside a database element. Can be used for storing BLOBs (Binary Large OBjects).

Syntax

<db-param type="param_type">
    parameter value
</db-param>

Attributes

Name Required Default Description
type no binary if its value is recognized as binary, text otherwise Type of the parameter. Valid values are int, long, double, text and binary.

Example

<database connection="jdbc:mysql://myserver/mydb" 
          jdbcclass="com.mysql.jdbc.Driver" 
          username="myuser" 
          password="mypass">
    insert into logos (id, img)
    values ( 1, <db-param><http url='${myImageUrl}'/></db-param> )
</database>
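
Parameters of other types can be passed the same way. The sketch below rewrites the insert statement from Example 2 of the database processor with explicit parameter types; the connection details are the hypothetical ones used above, and the variables myId, myUrl and myText are assumed to be defined earlier in the configuration.

```xml
<!-- Assumes variables myId, myUrl and myText already exist -->
<database connection="jdbc:mysql://myserver/mydb"
          jdbcclass="com.mysql.jdbc.Driver"
          username="myuser"
          password="mypass">
    insert into news (id, url, text)
    values (
        <db-param type="int"><var name="myId"/></db-param>,
        <db-param type="text"><var name="myUrl"/></db-param>,
        <db-param type="text"><var name="myText"/></db-param>
    )
</database>
```

Passing values as parameters rather than pasting them into the statement through a template avoids problems with quote characters inside the values.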

mail (2.0)

Sends an email.

Syntax

<mail smtp-host="smtp server" 
      smtp-port="smtp server port" 
      type="content type"
      from="sender"
      reply-to="reply-to header"
      to="to"
      cc="cc"
      bcc="bcc"
      subject="subject"
      charset="charset"
      username="smtp username"
      password="smtp password"
      security="smtp security type">
    mail content with optional attachments (mail-attach elements)
</mail>

Attributes

Name Required Default Description
smtp-host yes SMTP server host.
smtp-port no 25 SMTP server port.
type no text Content type of the mail body: text or html.
from yes The sender's email address.
reply-to no The email address to which replies should be sent.
to yes Comma-separated list of recipient email addresses.
cc no Comma-separated list of cc email addresses.
bcc no Comma-separated list of bcc email addresses.
subject no Subject of the email.
charset no configuration's default charset Charset of the email.
username no SMTP server username.
password no SMTP server password.
security no none SMTP server security type: none, ssl or tls.

Example

<mail from="reminder@my.com" smtp-host="smtp.gmail.com" to="myaccount@gmail.com" 
      username="myusername" password="mypassword" security="tls" 
      subject='Reminder for ${sys.datetime("dd.MM.yyyy.")}'>
      Here is what you need today:
      <file path="today.txt"/>
</mail>

mail-attach (2.0)

Adds an attachment to an email. Can be used only inside a mail processor of type html.

Syntax

<mail-attach name="name" mimetype="mimetype" inline="inline">
    body of the attachment
</mail-attach>

Attributes

Name Required Default Description
name no Attachment N Name of the attachment.
mimetype no image/jpeg for inline attachments, application/octet-stream otherwise Mime type of the attachment.
inline no no Tells whether the attachment is embedded in the mail body.

Example

<mail from="my@my.com" smtp-host="smtp.gmail.com" to="myaccount@gmail.com" type="html"
      username="myusername" password="mypassword" security="tls" subject='Photos from the ...'>
    Here is me with ...
    <![CDATA[ <img src="]]>
        <mail-attach inline="true"><file path="myphoto1.jpg" type="binary"/></mail-attach>
    <![CDATA[ "> ]]>
    And this is ...
    <![CDATA[ <img src="]]>
        <mail-attach inline="true"><file path="myphoto2.jpg" type="binary"/></mail-attach>
    <![CDATA[ "> ]]>
</mail>

zip (2.0)

Creates a ZIP archive by compressing inner content defined by zip-entry elements.

Syntax

<zip>
    ...
    [<zip-entry name="name" charset="charset">
        entry content
    </zip-entry>]*
    ...
</zip>

Attributes

Name Required Default Description
name yes Name of the file inside the ZIP archive.
charset no configuration's default charset Charset of a text file inside the ZIP archive.

Example

<zip>
    <loop item="filename" index="i">
       <list><var name="myfilenames"/></list>
       <body>
           <zip-entry name="file${i}.xls">
               <file path="${filename}" type="binary"/>
           </zip-entry>
       </body>
    </loop>
</zip>

This example creates an archive containing the listed files. The resulting ZIP archive can then be sent by email or stored in a database or on the file system, so the zip element can be placed inside mail, database, file or any other valid processor.
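
For instance, the archive can be written to disk by nesting zip inside a file processor in write mode. This is a minimal sketch; all file paths are hypothetical.

```xml
<!-- Hypothetical paths: packs two local files into backup/reports.zip -->
<file action="write" path="backup/reports.zip" type="binary">
    <zip>
        <zip-entry name="report.txt">
            <file path="data/report.txt"/>
        </zip-entry>
        <zip-entry name="logo.jpg">
            <file path="data/logo.jpg" type="binary"/>
        </zip-entry>
    </zip>
</file>
```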

ftp (2.0)

Opens an FTP connection and executes one or more of the supported FTP operations against the server: ftp-list, ftp-get, ftp-put, ftp-del, ftp-mkdir, ftp-rmdir.

Syntax

<ftp server="server" port="port" username="username" password="password" 
     account="account" remotedir="remotedir">
    [<ftp-list path="path" listfiles="listfiles" listdirs="listdirs" 
               listlinks="listlinks" listfilter="listfilter"/>]*
    [<ftp-get path="path"/>]*
    [<ftp-put path="path" charset="charset">
        content to save 
    </ftp-put>]*
    [<ftp-del path="path"/>]*
    [<ftp-mkdir path="path"/>]*
    [<ftp-rmdir path="path"/>]*
</ftp>

Attributes

Name Required Default Description
server yes FTP server address.
port no 21 FTP server port.
username yes FTP server username.
password yes FTP server password.
account no FTP server account name.
remotedir no Working remote directory on FTP server.
path yes Path of the file/directory to be accessed/added/removed.
listfiles no yes Tells whether to include files in the list.
listdirs no yes Tells whether to include directories in the list.
listlinks no yes Tells whether to include links in the list.
listfilter no Filter used for listing files. May include the wildcards * and ?, e.g. my*.ex?
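
Example

The following sketch shows how the operations might be combined. The server, credentials, directory and file names are all hypothetical, and using "." as the path of the working directory is an assumption.

```xml
<!-- Collect the names of CSV files in the remote working directory -->
<var-def name="csvNames">
    <ftp server="ftp.example.com" username="myuser" password="mypass" remotedir="/reports">
        <ftp-list path="." listdirs="no" listlinks="no" listfilter="*.csv"/>
    </ftp>
</var-def>

<!-- Upload a local text file and delete an obsolete remote one -->
<ftp server="ftp.example.com" username="myuser" password="mypass" remotedir="/reports">
    <ftp-put path="summary.txt" charset="UTF-8">
        <file path="data/summary.txt"/>
    </ftp-put>
    <ftp-del path="old-summary.txt"/>
</ftp>
```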

tokenize (2.0)

Splits the given text into elements (tokens).

Syntax

<tokenize delimiters="delimiters" 
          trimtokens="trimtokens" 
          allowemptytokens="allowemptytokens">
    content to tokenize
</tokenize>

Attributes

Name Required Default Description
delimiters no new line character Characters used as token delimiters.
trimtokens no yes Tells whether to trim the resulting tokens.
allowemptytokens no no Tells whether to include empty tokens (tokens consisting only of whitespace) in the resulting list.
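
Example

A minimal sketch with made-up input: the text is split on commas and each token is processed in a loop. With the default settings described above, the tokens are trimmed and the empty token between the two adjacent commas should be dropped.

```xml
<loop item="fruit">
    <list>
        <tokenize delimiters=",">
            apple, banana, , cherry
        </tokenize>
    </list>
    <body>
        <template>Fruit: ${fruit}</template>
    </body>
</loop>
```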

json-to-xml (2.0)

Converts given JSON content to XML.

Syntax

<json-to-xml>
    JSON content
</json-to-xml>
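
Example

Converting JSON to XML makes it possible to query the data with the xpath processor. The sketch below uses a hypothetical URL, and the XPath expression assumes that the JSON contains a field called name; the exact element names in the generated XML depend on the input.

```xml
<var-def name="userXml">
    <json-to-xml>
        <http url="http://www.example.com/api/user.json"/>
    </json-to-xml>
</var-def>

<!-- Assumes the JSON has a "name" field that becomes a name element -->
<var-def name="userName">
    <xpath expression="//name/text()">
        <var name="userXml"/>
    </xpath>
</var-def>
```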

xml-to-json (2.0)

Converts given XML content to JSON.

Syntax

<xml-to-json>
    XML content
</xml-to-json>
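
Example

A minimal sketch with hypothetical file paths: an XML file is read, converted to JSON, and written to a new file.

```xml
<file action="write" path="data/books.json">
    <xml-to-json>
        <file path="data/books.xml"/>
    </xml-to-json>
</file>
```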

Debugging

Writing a Web-Harvest configuration can be tricky, especially when it combines several features such as regular expressions, XQuery, variables and templates. To make problems easier to find, Web-Harvest uses Log4J internally to log each processor's execution. In programmatic use, Log4J can be configured as the user wishes.

Furthermore, intermediate execution values can be saved to the file system. The debugging option must be turned on (see Usage). Then, for each processor whose execution should be monitored, the special attribute id tells Web-Harvest to save the processor's result to a file named _debug/<id>_<num>.debug under the working path. For example, in the following configuration pipeline:

<xpath expression="//a/@href" id="yahoo_links">
    <html-to-xml id="yahoo_xml">
        <http url="http://www.yahoo.com" id="yahoo_html"/>
    </html-to-xml>
</xpath>

all three processors are told to save their results. Thus, the following files are created:

  • _debug/yahoo_html_<num>.debug,
  • _debug/yahoo_xml_<num>.debug,
  • _debug/yahoo_links_<num>.debug.