User manual
This manual briefly describes the structure and elements of the Web-Harvest configuration file.
Predefined and user-defined variables and objects
Every Web-Harvest variable context initially contains the following helper objects, which can be used in any expression inside a configuration:
Object sys, which contains some general-purpose constants and methods. For more details, check the Java API of this class.
Object http, which provides access to the HTTP client and gives information about HTTP responses.
Note 1: See the Usage page to check how user-defined objects can be put into the variable context.
Valid XML configuration elements
config
| Name | Required | Default | Description |
|---|---|---|---|
| charset | no | UTF-8 | Defines the default charset used throughout the configuration. Every processor that needs charset information uses this value unless another one is explicitly set. |
| scriptlang | no | beanshell | Defines the default scripting engine used throughout the configuration. Allowed values are beanshell, javascript and groovy. The default engine is used wherever no other is specified. The script and template processors can specify their own scripting engine, which makes it possible to mix different scripting languages within the same Web-Harvest configuration. |
<config charset="ISO-8859-1" scriptlang="groovy"> <file action="write" path="squares.txt"> <script return="${[1, 2, 3, 4, 5].collect(square)}"> <![CDATA[ square = { it * it } ]]> </script> </file> </config>
Here, the file processor uses the default encoding ISO-8859-1 and the script processor uses the default language groovy.
empty
Wraps an execution sequence and returns an empty value. This element is used when the execution result is not needed.
<empty> wrapped body </empty>
<file action="write" path="test/amazon_home.html"> <empty> <var-def name="amazonContent"> <http url="http://www.amazon.com"/> </var-def> </empty> <template> <![CDATA[ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> ]]> ${amazonContent} </template> </file>
A new variable is created, but its value doesn't participate in the result because it is inside the empty element. Instead, its value is used in the template processor that follows.
text
Converts the embedded value to its string representation.
<text charset="charset" delimiter="delimiter"> wrapped body </text>
| Name | Required | Default | Description |
|---|---|---|---|
| charset | no | default configuration's charset | Charset used if body is converted from binary to text value. |
| delimiter | no | new line character | Delimiter string used to separate items when concatenating them into single string. |
<var-def name="digits"> <while condition="${i.toInt() != 10}" index="i"> <template>${i}</template> </while> </var-def> <file action="write" path="/test/replaced23.txt"> <regexp replace="true"> <regexp-pattern>(.*)(2.*3)(.*)</regexp-pattern> <regexp-source> <text> <var name="digits"/> </text> </regexp-source> <regexp-result> <template>${_1}here were 2 and 3${_3}</template> </regexp-result> </regexp> </file>
The variable named digits is defined using the while processor, producing a sequence of 9 values. Next, the regexp processor is invoked to do a search-and-replace on the variable's value. The text processor is used to concatenate all digit values into a single one. Without the text processor, the regular expression search would be applied to each item in the list (every digit) and no replacement would occur, because no single item matches the pattern "2.*3".
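The delimiter attribute can be illustrated in the same way. The fragment below is a minimal, untested sketch that reuses the digits variable from the example above; the output path is hypothetical.

```xml
<!-- Joins the items of the digits list with commas instead of new lines. -->
<file action="write" path="test/digits.csv">
    <text delimiter=",">
        <var name="digits"/>
    </text>
</file>
```

If the processor behaves as the attribute table describes, the file receives all list items as one comma-separated string.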
var-def
Defines a new variable, or overrides an existing one, with the specified name and value.
<var-def name="variable_name" overwrite="overwrite_existing"> body as value of the variable </var-def>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the variable. It should be a valid identifier, as in most programming languages. |
| overwrite | no | true | Boolean value (true or false) telling whether an existing variable with the same name will be overwritten. |
<var-def name="digitList"> <while condition="true" index="i" maxloops="9"> <var-def name="digit${i}"> <template>${i}</template> </var-def> </while> </var-def>
This example defines the variable digitList, which is a sequence of 9 values (the digits from 1 to 9), and 9 simple variables digit1, digit2, ..., digit9 with values ranging from 1 to 9.
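The overwrite attribute is useful for supplying default values. The following is a minimal sketch, assuming a variable named outputDir (a hypothetical name) may or may not have been put into the variable context before the configuration runs; the file path is also illustrative.

```xml
<!-- With overwrite="false", an existing value of outputDir is kept;
     the variable is defined here only if it does not exist yet. -->
<var-def name="outputDir" overwrite="false">output</var-def>

<file action="write" path="${outputDir}/result.txt">
    <template>done</template>
</file>
```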
var
Returns the value of a defined variable. Throws an exception if the variable is not defined.
<var name="variable_name"/>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | Variable name. |
<var-def name="searchEngine"> google </var-def> <var-def name="${searchEngine}Content"> <http url="http://www.${searchEngine}.com"/> </var-def> <file action="write" path="data/${searchEngine}_content.html"> <var name="${searchEngine}Content"/> </file>
After execution, the file data/google_content.html contains the page content of www.google.com.
file
Reads or writes the content of a file, or searches a directory for specified files.
<file action="file_action" path="file_path" type="file_type" charset="charset_of_text_file" listdirs="listdirs" listfiles="listfiles" listrecursive="listrecursive" listfilter="listfilter"> body defining content of the file if action="write" or action="append" </file>
| Name | Required | Default | Description |
|---|---|---|---|
| action | no | read | Defines file action. Valid values are read, append, write and list. |
| path | yes | | File path, relative to the working directory. |
| type | no | text | Type of file: text or binary. |
| charset | no | [Default charset for config] | Charset for text files. Has no effect if type is binary. |
| listdirs | no | yes | Tells whether to list directories (action = list). |
| listfiles | no | yes | Tells whether to list files (action = list). |
| listrecursive | no | no | Tells whether to search directories recursively (action = list). |
| listfilter | no | | Filename pattern to search for (* stands for any sequence of characters, ? for any single character). Works only for action = list. |
<file action="write" path="123.txt"> <file action="read" path="1.txt"/> ----------------------------------- <file action="read" path="2.txt"/> ----------------------------------- <file action="read" path="3.txt"/> </file>
Here, a new file is created containing the concatenated contents of three existing files, separated by lines.
<file action="write" path="c:/images/alljpegs.zip" type="binary"> <zip> <loop item="filename"> <list> <file path="c:/images/" action="list" listfilter="*.jpg" /> </list> <body> <zip-entry name="${sys.getFilename(filename.toString())}"> <file type="binary" path="${filename}"/> </zip-entry> </body> </loop> </zip> </file>
A ZIP file consisting of all JPEG images taken from the specified directory is created.
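The append and list actions can be combined. The fragment below is an untested sketch that uses only attributes from the table above; the directory and file names are hypothetical.

```xml
<!-- Appends the paths of all XML files found under data/ (searched
     recursively, directories themselves excluded) to a text file. -->
<file action="append" path="log/found-files.txt">
    <loop item="path">
        <list>
            <file action="list" path="data" listfilter="*.xml"
                  listdirs="no" listrecursive="yes"/>
        </list>
        <body>
            <template>${path}${sys.lf}</template>
        </body>
    </loop>
</file>
```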
http
Sends an HTTP request to the specified URL and returns the HTTP response as the result. The body of the processor is executed first, in order to define optional HTTP parameters and/or headers, and then the HTTP request is sent.
<http url="url" method="method" charset="charset" cookie-policy="cookie_policy" username="username" password="password" multipart="multipart"> body that might contain http-param and/or http-header elements </http>
| Name | Required | Default | Description |
|---|---|---|---|
| url | yes | | HTTP request URL. |
| method | no | get | HTTP method: get or post. |
| charset | no | [Default charset for config] | Defines the encoding of the HTTP response content. Has no effect if the content type is binary. |
| cookie-policy | no | [Default cookie policy of the HTTP client] | Specifies how the HTTP client manages cookies. Allowed values are browser, ignore, netscape, rfc_2109 and default. |
| username | no | | Specifies the username if the URL requires authentication. |
| password | no | | Specifies the password if the URL requires authentication. |
| multipart | no | no | Tells whether the form is multipart encoded (enabling data upload). |
<xpath expression="data(//script)"> <html-to-xml> <http url="http://www.yahoo.com/"/> </html-to-xml> </xpath>
The content of www.yahoo.com is downloaded, transformed to XML and then all scripts inside the page are found.
http-param
Adds an HTTP parameter to the first enclosing HTTP processor, for both post and get requests. If used outside an HTTP processor, an exception is thrown.
<http-param name="param_name" isfile="isfile" contenttype="contenttype" filename="filename"> body as parameter value </http-param>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the HTTP parameter. |
| isfile | no | no | Tells whether the parameter is a file for upload (applies only to multipart requests). |
| contenttype | no | | MIME type of the uploaded file (effective for multipart forms where the parameter is a file). |
| filename | no | | Name of the uploaded file (effective for multipart forms where the parameter is a file). |
<var-def name="paramNames"> USERID PASSWORD </var-def> <http method="post" url="http://www.nytimes.com/auth/login"> <http-param name="is_continue">true</http-param> <http-param name="URI">http://</http-param> <http-param name="OQ"></http-param> <http-param name="OP"></http-param> <loop item="name"> <list> <var name="paramNames"/> </list> <body> <http-param name="${name}">web-harvest</http-param> </body> </loop> </http>
Sends the parameters needed to log in to www.nytimes.com/auth/login.
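The file-related attributes apply to multipart requests. The following is a rough, untested sketch of a file upload; the URL, parameter names and file path are hypothetical, and the yes values are an assumption based on the no defaults listed in the tables.

```xml
<!-- Posts a multipart form with one ordinary field and one uploaded file. -->
<http method="post" url="http://www.example.com/upload" multipart="yes">
    <http-param name="description">Company logo</http-param>
    <http-param name="logo" isfile="yes" filename="logo.gif" contenttype="image/gif">
        <file type="binary" path="images/logo.gif"/>
    </http-param>
</http>
```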
http-header
Defines an HTTP header for the first enclosing HTTP processor. If used outside an HTTP processor, an exception is thrown.
<http-header name="header_name"> body as header value </http-header>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the HTTP header. |
<http url="www.google.com"> <http-header name="User-Agent"> Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1 </http-header> </http>
Identifies itself to www.google.com as the Firefox browser.
html-to-xml
Cleans up the content of the body and transforms it into valid XML. The body is usually HTML obtained as a result of http processor execution. The actual parsing and cleaning is delegated to the HtmlCleaner tool. Although no special tuning is needed in most cases, the cleaner may be configured with several parameters defined through the processor's attributes.
<html-to-xml outputtype="..." advancedxmlescape="..." usecdata="..." specialentities="..." unicodechars="..." omitunknowntags="..." treatunknowntagsascontent="..." omitdeprtags="..." treatdeprtagsascontent="..." omitcomments="..." omithtmlenvelope="..." allowmultiwordattributes="..." allowhtmlinsideattributes="..." namespacesaware="..." prunetags="..." omitxmldecl="..."> body as html to be cleaned </html-to-xml>
| Name | Required | Default | Description |
|---|---|---|---|
| outputtype | no | simple | Defines how the resulting XML will be serialized. Allowed values are simple, compact, browser-compact and pretty. |
| advancedxmlescape | no | true | If true, an ampersand sign (&) that precedes a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;. |
| usecdata | no | true | If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are regarded as ordinary text (special characters will be escaped). |
| specialentities | no | true | If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the Unicode characters they represent (ô, ‰, ×). This doesn't include &amp;, &lt;, &gt;, &quot; and &apos;. |
| unicodechars | no | true | If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real Unicode characters (e.g. &#1078; is replaced with ж). |
| omitunknowntags | no | false | Tells whether to skip (ignore) unknown tags during cleanup. |
| treatunknowntagsascontent | no | false | Tells whether to treat unknown tags as ordinary content, i.e. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitunknowntags is set to false. |
| omitdeprtags | no | false | Tells whether to skip (ignore) deprecated HTML tags during cleanup. |
| treatdeprtagsascontent | no | false | Tells whether to treat deprecated tags as ordinary content, i.e. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitdeprtags is set to false. |
| omitcomments | no | false | Tells whether to skip HTML comments. |
| omithtmlenvelope | no | false | Tells whether to remove the HTML and BODY tags from the resulting XML and use the first tag in the BODY section instead. If the BODY section doesn't contain any tags, this attribute has no effect. |
| allowmultiwordattributes | no | true | Tells the parser whether to allow attribute values consisting of multiple words. If true, the attribute att="a b c" stays as it is; if false, the parser splits it into att="a" b="b" c="c" (this is the default browser behaviour). |
| allowhtmlinsideattributes | no | false | Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" stays as it is; if not, the parser ends the attribute value after "here is ". This flag makes sense only if allowmultiwordattributes is set as well. |
| namespacesaware | no | true | If true, namespace prefixes found during parsing are preserved and all necessary XML namespace declarations are added to the root element. If false, all namespace prefixes and all xmlns namespace declarations are stripped. |
| prunetags | no | empty string | Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if prunetags is "script,style", the resulting XML will not contain scripts and styles. |
| omitxmldecl | no | false | Tells whether to skip the XML declaration. The value must be true or false. |
<html-to-xml outputtype="pretty"> <http url="http://www.motors.ebay.com"/> </html-to-xml>
Downloads the www.motors.ebay.com page and cleans it up, producing pretty-printed XML content.
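Several cleaner attributes can be combined on one element. The fragment below is an untested sketch that uses only attributes from the table above; the URL is a placeholder.

```xml
<!-- Produces compact XML without scripts, styles, HTML comments
     or the XML declaration. -->
<html-to-xml outputtype="compact" prunetags="script,style"
             omitcomments="true" omitxmldecl="true">
    <http url="http://www.example.com/"/>
</html-to-xml>
```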
regexp
Searches the body for the given regular expression and optionally replaces found occurrences with the specified pattern. If the body is a list of values, the regexp processor is applied to every item and the final execution result is a list.
<regexp replace="true_or_false" max="max_found_occurrences" flag-canoneq="flag-canoneq" flag-caseinsensitive="flag-caseinsensitive" flag-dotall="flag-dotall" flag-multiline="flag-multiline" flag-unicodecase="flag-unicodecase"> <regexp-pattern> body as pattern value </regexp-pattern> <regexp-source> body as the text source </regexp-source> [<regexp-result> body as the result </regexp-result>] </regexp>
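For every match, variables named _<group_number> are created for the groups of the regular expression (for example, ${_1} and ${_2} in the examples below). See a regular expression tutorial for a fuller explanation of groups.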
| Name | Required | Default | Description |
|---|---|---|---|
| replace | no | false | Logical value telling whether found occurrences of the regular expression will be replaced. Valid values are true/false or yes/no. If this value is true (yes), then regexp-result must be specified with the replacement value. |
| max | no | | Limits the number of found pattern occurrences. There is no limit if it is not specified. |
| flag-canoneq | no | no | Enables canonical equivalence. |
| flag-caseinsensitive | no | no | Enables case-insensitive matching. |
| flag-dotall | no | yes | Enables dotall mode. |
| flag-multiline | no | no | Enables multiline mode. |
| flag-unicodecase | no | yes | Enables Unicode-aware case folding. |
<regexp> <regexp-pattern>([_\w\d]*)[\s]*=[\s]*([\w\d\s]*+)[\,\.\;]*</regexp-pattern> <regexp-source> var1= test1, var2 = bla bla; index=16; city = Delhi,town=Kingston; </regexp-source> <regexp-result> <template>Value of variable "${_1}" is "${_2}"!</template> </regexp-result> </regexp>
Here, regular expression is looking for specified pattern in two strings, producing as a result list of five values: Value of variable "var1" is "test1"!, Value of variable "var2" is "bla bla"! ...
<regexp replace="true"> <regexp-pattern>[\s]*[\,\.\;][\s]*</regexp-pattern> <regexp-source> var1= test1, var2 = bla bla; index=16; city = Delhi,town=Kingston; </regexp-source> <regexp-result> <template>|</template> </regexp-result> </regexp>
Here, the regular expression replacement produces single value as the result: var1= test1|var2 = bla bla|index=16|city = Delhi|town=Kingston|.
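The max and flag attributes can be sketched as follows. This is an untested fragment that assumes a variable named pageContent holding downloaded HTML has been defined earlier; the pattern itself is only illustrative.

```xml
<!-- Collects at most three links to PDF files, matching
     "href" and ".pdf" regardless of letter case. -->
<regexp max="3" flag-caseinsensitive="yes">
    <regexp-pattern>href="([^"]*\.pdf)"</regexp-pattern>
    <regexp-source>
        <var name="pageContent"/>
    </regexp-source>
    <regexp-result>
        <template>${_1}</template>
    </regexp-result>
</regexp>
```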
xpath
Uses an XPath language expression to search an XML document.
<xpath expression="xpath_expression"> body as xml </xpath>
| Name | Required | Default | Description |
|---|---|---|---|
| expression | yes | | XPath language expression. |
<xpath expression="//a/@href"> <html-to-xml> <http url="http://www.nba.com/"/> </html-to-xml> </xpath>
The result is a sequence of links from the page retrieved from www.nba.com.
xquery
Uses an XQuery language expression to query an XML document.
<xquery> [<xq-param name="xquery_param_name" [type="xquery_param_type"]> body as xquery parameter value </xq-param>] * <xq-expression> body as xquery language construct </xq-expression> </xquery>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | Name of the XQuery parameter. |
| type | no | node() | Type of the XQuery parameter, one of: node(), integer, long, float, double, boolean, string, node()*, integer*, long*, float*, double*, boolean*, string*. |
Multiple external parameters may optionally be specified for the query. In most cases at least one, containing the XML document, is needed. For every specified xquery parameter, a declaration of the following form is required inside the xq-expression:
declare variable $<xquery_param_name> as <xquery_param_type> external;
The parameter type may be one of node(), integer, long, float, double, boolean, string, or one of the analogous sequence types node()*, integer*, long*, float*, double*, boolean*, string*. If not specified, the default XQuery parameter type is node().
<xquery> <xq-param name="doc"> <html-to-xml> <http url="${sys.fullUrl(startUrl, articleUrl)}"/> </html-to-xml> </xq-param> <xq-expression><![CDATA[ declare variable $doc as node() external; let $author := data($doc//div[@class="byline"]) let $title := data($doc//h1) let $text := data($doc//div[@id="articleBody"]) return <article> <title>{$title}</title> <author>{$author}</author> <text>{$text}</text> </article> ]]></xq-expression> </xquery>
The XQuery expression is applied to the downloaded page, producing XML that contains information about a newspaper article.
xslt
Applies an XSLT transformation to the XML document.
<xslt> <xml> body as xml </xml> <stylesheet> body as xsl </stylesheet> </xslt>
<xslt> <xml> <html-to-xml> <http url="${url}"/> </html-to-xml> </xml> <stylesheet> <file path="stylesheets/tree.xsl"/> </stylesheet> </xslt>
The XSLT transformation, taken from the file, is applied to the downloaded content.
script
Executes code written in the specified scripting language. Web-Harvest supports BeanShell, Groovy and JavaScript. All of them are powerful, widespread and popular scripting languages.
The body of the script processor is executed in the specified language and, optionally, the expression specified in the return attribute is evaluated and returned.
All variables defined during configuration execution are also available in the script processor. However, note that variables used throughout Web-Harvest are not simple types: they are all org.webharvest.runtime.variables.Variable objects (an internal Web-Harvest class) that expose convenient methods:
String toString()
byte[] toBinary()
boolean toBoolean()
int toInt()
long toLong()
double toDouble()
Object[] toArray()
java.util.List toList()
Object getWrappedObject()
The way to push a value back to Web-Harvest after the script finishes is the command sys.defineVariable(varName, varValue, [overwrite]), which creates an appropriate wrapper around the specified value: list variables for java.util.List and arrays, and simple variables for other objects. The best way to illustrate this is the simple example below.
Each script engine used in a single Web-Harvest configuration, once created, preserves its variable context throughout the configuration, meaning that all variables and objects are available in further script processors that use the same language.
<script language="script_language" return="value_to_return"> body as script </script>
| Name | Required | Default | Description |
|---|---|---|---|
| language | no | Default scripting language if defined in the config element, or beanshell if nothing is defined. | Defines which scripting engine is used in the processor. Valid values are beanshell, javascript and groovy. |
| return | no | Empty value | Specifies what this processor should evaluate at the end and return as the processing value. |
<?xml version="1.0" encoding="UTF-8"?> <config> <var-def name="birthDate"> 11/4/1958 </var-def> <var-def name="web_harvest_day_variable"> <script return="namedDay.toUpperCase()"><![CDATA[ tokenizer = new StringTokenizer(birthDate.toString().trim(), "./-\\"); day = Integer.parseInt(tokenizer.nextToken()); month = Integer.parseInt(tokenizer.nextToken()); year = Integer.parseInt(tokenizer.nextToken()); Calendar cal = Calendar.getInstance(); cal.set(Calendar.DAY_OF_MONTH, day); cal.set(Calendar.MONTH, month-1); cal.set(Calendar.YEAR, year); /* Calendar.DAY_OF_WEEK ranges from 1 (Sunday) to 7 (Saturday) */ switch( cal.get(Calendar.DAY_OF_WEEK) ) { case 1 : namedDay = "Sunday"; break; case 2 : namedDay = "Monday"; break; case 3 : namedDay = "Tuesday"; break; case 4 : namedDay = "Wednesday"; break; case 5 : namedDay = "Thursday"; break; case 6 : namedDay = "Friday"; break; default: namedDay = "Saturday"; break; } ]]></script> </var-def> <template> The day when you were born was ${namedDay}. </template> <file action="write" path="day.txt"> <var name="web_harvest_day_variable"/> </file> </config>
This example also shows that script-internal variables (such as namedDay), once defined, are available in all subsequent script and template processors.
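Pushing a value from a script back into the Web-Harvest variable context with sys.defineVariable can be sketched as follows. This is an untested fragment that assumes the default BeanShell engine; the variable names engines and names are hypothetical.

```xml
<script><![CDATA[
    /* Build an ordinary java.util.List inside the script... */
    java.util.List names = new java.util.ArrayList();
    names.add("google");
    names.add("yahoo");
    /* ...and publish it to the Web-Harvest context under the name "engines". */
    sys.defineVariable("engines", names);
]]></script>

<!-- The published value can now be used like any other list variable. -->
<loop item="engine">
    <list>
        <var name="engines"/>
    </list>
    <body>
        <template>http://www.${engine}.com${sys.lf}</template>
    </body>
</loop>
```

Because the value is a java.util.List, the description above implies that it is wrapped as a list variable, so the loop should visit each element in turn.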
template
For the given text content, parts surrounded by ${ and } are evaluated using the specified scripting engine. If no scripting language is specified, the default one is used (see the config element).
<template language="script_language"> body as text for templating </template>
| Name | Required | Default | Description |
|---|---|---|---|
| language | no | Default config language | Specifies script language that will be used for evaluation of parts surrounded with ${ and }. Valid values are beanshell, javascript and groovy. |
<var-def name="content"> <file path="textdata/products.txt"/> </var-def> <var-def name="changedContent"> <template> ${sys.datetime("yyyy-MM-dd, HH:mm:ss")} ${sys.lf} ---------------------------------------------------- ${sys.lf} ${my.process(content.toString())} </template> </var-def>
The templater uses built-in constants and functions, plus user-defined objects from the variable context, to produce the desired content.
case
Executes a conditional statement. Sequentially checks whether any of the conditions specified in the inner if elements is satisfied; the body of the first one that is satisfied is returned as the result. If no condition is true, the result of execution is the body of the else element if specified, or an empty value otherwise.
<case> [<if condition="expression"> if body </if>] * [<else> else body </else>] </case>
| Name | Required | Default | Description |
|---|---|---|---|
| condition | yes | | If true (yes), the body of if is evaluated. |
<var-def name="contact"> <xpath expression="//a[contains(., 'contact')]/@href"> <var name="pageContent"/> </xpath> </var-def> <var-def name="contactMail"> <case> <if condition="${contact.toString() != ''}"> <var name="contact"/> </if> <else> Contact is not defined! </else> </case> </var-def>
Here, conditional processor is used to check if previous xpath search has found contact information on the page.
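Because a case element may contain several if elements, it can also act as a multi-way branch. The fragment below is an untested sketch that assumes a previously defined variable named pageContent; the thresholds and labels are arbitrary.

```xml
<var-def name="sizeLabel">
    <case>
        <!-- Conditions are checked in order; the first true one wins. -->
        <if condition="${pageContent.toString().length() == 0}">empty</if>
        <!-- "&lt;" is the XML-escaped form of "<" inside an attribute value. -->
        <if condition="${pageContent.toString().length() &lt; 1024}">small</if>
        <else>large</else>
    </case>
</var-def>
```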
loop
Iterates through the specified list and executes the specified body logic for each item. The result is the list of processed bodies.
<loop item="item_var_name" index="index_var_name" maxloops="max_loops" filter="list_filter" empty="empty"> <list> body as list value </list> <body> body for each list item </body> </loop>
| Name | Required | Default | Description |
|---|---|---|---|
| item | no | | Name of the variable that takes the value of the current list item. |
| index | no | | Name of the index variable. Its value in the first iteration is 1. |
| maxloops | no | | Limits the number of iterations. There is no limit if it is not specified. |
| filter | no | | Expression for filtering the iteration list. It consists of an arbitrary number of restrictions separated by commas, for example 1-20,1:2,unique. |
| empty | no | no | Equivalent to wrapping the body in an empty element, producing an empty result of the iteration. |
<loop item="link" index="i" filter="unique"> <list> <xpath expression="//img/@src"> <html-to-xml> <http url="http://www.yahoo.com"/> </html-to-xml> </xpath> </list> <body> <file action="write" type="binary" path="images/${i}.gif"> <http url="${sys.fullUrl('http://www.yahoo.com', link)}"/> </file> </body> </loop>
The loop iterates over all unique image URLs from www.yahoo.com and, for each URL, downloads the image and stores it in the file system.
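The maxloops and empty attributes can be added to the same pattern. The following is an untested sketch; the variable pageXml (assumed to hold a cleaned-up page), the base URL and the output paths are hypothetical.

```xml
<!-- Saves at most ten distinct linked pages; empty="yes" discards the
     per-iteration results so the loop itself returns nothing. -->
<loop item="link" index="i" filter="unique" maxloops="10" empty="yes">
    <list>
        <xpath expression="//a/@href">
            <var name="pageXml"/>
        </xpath>
    </list>
    <body>
        <file action="write" path="pages/${i}.html">
            <http url="${sys.fullUrl('http://www.example.com/', link)}"/>
        </file>
    </body>
</loop>
```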
while
Loops while the specified condition is satisfied. The result is a list made of the processed bodies of each iteration.
<while condition="expression" index="index_var_name" maxloops="max_loops" empty="empty"> body </while>
| Name | Required | Default | Description |
|---|---|---|---|
| condition | yes | | Expression that is evaluated for every loop; if its value is true, the body is executed. |
| index | no | | Name of the index variable. Its value in the first iteration is 1. |
| maxloops | no | | Limits the number of iterations. There is no limit if it is not specified. |
| empty | no | no | Equivalent to wrapping the body in an empty element, producing an empty result of the iteration. |
For an example, see the function processor.
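A smaller, self-contained use of the while processor can be sketched like this. The fragment is untested and follows the pattern of the digits example in the text section; the variable name squares is hypothetical.

```xml
<!-- The index variable i starts at 1, so the body runs for i = 1..5;
     maxloops is only a safety limit here. -->
<var-def name="squares">
    <while condition="${i.toInt() != 6}" index="i" maxloops="100">
        <template>${i.toInt() * i.toInt()}</template>
    </while>
</var-def>
```

If the index behaves as the attribute table states, squares ends up as a list of five values.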
function
Declares a user-defined function.
<function name="function_name"> function body </function>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the user-defined function. |
<function name="download-multipage-list"> <return> <while condition="${pageUrl.toString().trim() != ''}" maxloops="${maxloops}" index="i"> <empty> <var-def name="content"> <html-to-xml> <http url="${pageUrl}"/> </html-to-xml> </var-def> <var-def name="nextLinkUrl"> <xpath expression="${nextXPath}"> <var name="content"/> </xpath> </var-def> <var-def name="pageUrl"> <template>${sys.fullUrl(pageUrl, nextLinkUrl)}</template> </var-def> </empty> <xpath expression="${itemXPath}"> <var name="content"/> </xpath> </while> </return> </function> <var-def name="imgLinks"> <call name="download-multipage-list"> <call-param name="pageUrl"> http://images.google.com/images?q=harvest&amp;hl=en&amp;btnG=Search+Images&amp;nojs=1 </call-param> <call-param name="nextXPath"> //a[@shape='rect' and .='Next']/@href </call-param> <call-param name="itemXPath"> //img[contains(@src, 'images?q=tbn')]/@src </call-param> <call-param name="maxloops"> 5 </call-param> </call> </var-def>
Here the function named download-multipage-list is defined so that it can serve multiple extractions. It collects link URLs from a series of pages, where an XPath expression parameter is used to determine the URL of the next page with links, if it exists. This situation is typical for lists of products or search results that span multiple web pages. The function is then called with the specified parameters to collect image links from Google image search, limiting the number of result pages to 5.
return
Returns a value from a user-defined function.
<return> body as return value </return>
For an example, see the function processor.
call
Calls a user-defined function.
<call name="function_name"> [<call-param name="param_name"> body as actual parameter value </call-param>] * </call>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the user-defined function. |
For an example, see the function processor.
include
Includes another configuration file and executes its logic. This is useful for keeping libraries of common functions or for splitting a large extraction process into multiple files.
<include path="file_path"/>
| Name | Required | Default | Description |
|---|---|---|---|
| path | yes | | Path of the configuration file to be included. The path is relative to the directory containing the including configuration file. |
<include path="lib.xml"/>
try
Wraps execution and, for any recoverable exception, returns a default value without crashing the whole process.
<try> <body> try body </body> <catch> catch body </catch> </try>
<var-def name="reportText"> <try> <body> <file path="data/report.txt"/> </body> <catch> No report file! </catch> </try> </var-def>
A file read exception, if it occurs, is caught and the default value is stored in the variable.
exit
Conditionally breaks the configuration execution.
<exit condition="condition" message="message" />
| Name | Required | Default | Description |
|---|---|---|---|
| condition | no | true | Condition that determines if execution will stop. Must be boolean value (true, yes, false, no). |
| message | no | | Optional message to the user if the configuration is exiting. It will be part of the logging information, or a warning dialog will pop up if Web-Harvest is used in GUI mode. |
<exit condition='${!sys.isVariableDefined("username")}' message="No username provided!" />
The configuration stops execution if the variable username is not defined.
database (2.0)
Executes a query against a database. The JDBC driver library file(s) should be provided on the classpath if Web-Harvest is used programmatically, or on the same path as the Web-Harvest executable if it is used standalone.
In the case of an SQL SELECT statement, it returns a list of row objects. They can be accessed with special accessor methods:
<mydbrow>.getColumnCount() - returns the number of columns returned.
<mydbrow>.getColumnName(index) - returns the name of the column with the given number.
<mydbrow>.get(column_index) - returns the field value for the column number.
<mydbrow>.get(column_name) - returns the field value for the column name.
The whole list of returned db rows can be accessed by index to get an individual row:
<mydbvar>.get(rowindex)
For example:
mydb.get(0).get("image")
<database connection="jdbc connection string" jdbcclass="full named jdbc class" username="username" password="password" autocommit="autocommit" max="max rows returned"> select, insert or delete SQL query </database>
| Name | Required | Default | Description |
|---|---|---|---|
| connection | yes | | Properly formatted JDBC string for the database. It depends on database/driver vendor. |
| jdbcclass | yes | | Fully qualified class name of the JDBC driver. |
| username | no | | Username to access database. |
| password | no | | Password to access database. |
| autocommit | no | true | Whether commit is performed automatically after query execution. |
| max | no | no limit | Maximum number of returned rows from the SELECT statement. |
<var-def name="employees"> <database connection="jdbc:microsoft:Sqlserver://myserver:1433;databaseName=mycompany;user=sa;password=hehehe" jdbcclass="com.microsoft.jdbc.sqlserver.SQLServerDriver"> select name, salary from employee </database> </var-def> <loop item="emp"> <list> <var name="employees"/> </list> <body> <template>Salary of ${emp.get("name")} is ${emp.get("salary")}</template> </body> </loop>
<database connection="jdbc:microsoft:Sqlserver://myserver:1433;databaseName=mycompany;user=sa;password=hehehe" jdbcclass="com.microsoft.jdbc.sqlserver.SQLServerDriver"> <template> insert into news (id, url, text, source) values (${myId}, '${myUrl}', '${myText}', '${mySource}') </template> </database>
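The accessor methods and the optional attributes can be combined as in the following sketch. It is illustrative only: the MySQL connection string, the table and column names, and the variable name topEmployee are placeholders, and it assumes that row methods can be called from a template expression in the same way as emp.get("name") in the loop example above.

```xml
<!-- Hypothetical connection details; max limits the SELECT to a single row -->
<var-def name="topEmployee">
    <database connection="jdbc:mysql://myserver/mycompany"
              jdbcclass="com.mysql.jdbc.Driver"
              username="myuser"
              password="mypass"
              max="1">
        select name, salary from employee order by salary desc
    </database>
</var-def>

<!-- get(0) picks the first row of the list; get("name") reads a field by column name -->
<template>Best paid: ${topEmployee.get(0).get("name")} (${topEmployee.get(0).getColumnCount()} columns returned)</template>
```

Passing username and password as attributes keeps the credentials out of the connection string, which may be easier to maintain when the same connection is reused in several places.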
db-param 2.0
Specifies a parameter inside a database element. Can be used for storing BLOBs
(Binary Large OBjects).
<db-param type="param_type"> parameter value </db-param>
| Name | Required | Default | Description |
|---|---|---|---|
| type | no | binary if its value is recognized as binary, text otherwise | Type of the parameter. Valid values are: int, long, double, text and binary. |
<database connection="jdbc:mysql://myserver/mydb" jdbcclass="com.mysql.jdbc.Driver" username="myuser" password="mypass"> insert into logos (id, img) values ( 1, <db-param><http url='${myImageUrl}'/></db-param> ) </database>
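When the type should not be left to automatic detection, it can be stated explicitly. The sketch below is an illustration under assumptions: the table news_archive and the variables myId and myText are hypothetical, and it relies only on the type values listed in the table above.

```xml
<database connection="jdbc:mysql://myserver/mydb"
          jdbcclass="com.mysql.jdbc.Driver"
          username="myuser"
          password="mypass">
    insert into news_archive (id, body) values (
        <!-- explicit types instead of the binary/text auto-detection -->
        <db-param type="int"><var name="myId"/></db-param>,
        <db-param type="text"><var name="myText"/></db-param>
    )
</database>
```

Passing values as parameters, rather than embedding them in the SQL text as the template-based insert in the database section does, may also avoid quoting problems, assuming the parameters are bound by the JDBC driver.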
mail 2.0
Sends an email.
<mail smtp-host="smtp server" smtp-port="smtp server port" type="content type" from="sender" reply-to="reply-to header" to="to" cc="cc" bcc="bcc" subject="subject" charset="charset" username="smtp username" password="smtp password" security="smtp security type"> mail content with optional attachments (mail-attach elements) </mail>
| Name | Required | Default | Description |
|---|---|---|---|
| smtp-host | yes | SMTP server host. | |
| smtp-port | no | 25 | SMTP server port. |
| type | no | text | Content type of the mail body: text or html. |
| from | yes | | The sender's email address. |
| reply-to | no | | The email address to which replies should be sent. |
| to | yes | | Comma-separated list of recipient email addresses. |
| cc | no | | Comma-separated list of cc email addresses. |
| subject | no | | Subject of the email. |
| charset | no | default configuration's charset | Charset of the email. |
| username | no | | SMTP server username. |
| password | no | | SMTP server password. |
| security | no | none | SMTP server security type: none, ssl or tls. |
<mail from="reminder@my.com" smtp-host="smtp.gmail.com" to="myaccount@gmail.com" username="myusername" password="mypassword" security="tls" subject='Reminder for ${sys.datetime("dd.MM.yyyy.")}'> Here is what you need today: <file path="today.txt"/> </mail>
mail-attach 2.0
Adds an email attachment. Can be used only inside a mail processor of html type.
<mail-attach name="name" mimetype="mimetype" inline="inline"> body of the attachment </mail-attach>
| Name | Required | Default | Description |
|---|---|---|---|
| name | no | Attachment N | Name of the attachment. |
| mimetype | no | image/jpeg for inline attachments, application/octet-stream otherwise | Mime type of the attachment. |
| inline | no | no | Tells whether the attachment is embedded in the mail body. |
<mail from="my@my.com" smtp-host="smtp.gmail.com" to="myaccount@gmail.com" type="html" username="myusername" password="mypassword" security="tls" subject='Photos from the ...'> Here is me with ... <![CDATA[ <img src="]]> <mail-attach inline="true"><file path="myphoto1.jpg" type="binary"/></mail-attach> <![CDATA[ "> ]]> And this is ... <![CDATA[ <img src="]]> <mail-attach inline="true"><file path="myphoto2.jpg" type="binary"/></mail-attach> <![CDATA[ "> ]]> </mail>
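Attachments do not have to be inline. The following sketch sends a text file as a regular, named attachment; the host names, addresses and file path are placeholders, and the explicit name and mimetype values are chosen only for illustration.

```xml
<mail from="my@my.com"
      smtp-host="smtp.example.com"
      to="myaccount@example.com"
      cc="team@example.com"
      type="html"
      subject="Daily report">
    The daily report is attached.
    <!-- inline is omitted, so the default (no) applies and the file is sent as a regular attachment -->
    <mail-attach name="report.txt" mimetype="text/plain">
        <file path="data/report.txt"/>
    </mail-attach>
</mail>
```

Setting name and mimetype explicitly avoids the defaults from the table above (Attachment N and application/octet-stream), which may make the attachment easier for mail clients to open.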
zip 2.0
Creates a ZIP archive by compressing inner content defined by zip-entry elements.
<zip> ... [<zip-entry name="name" charset="charset"> entry content </zip-entry>]* ... </zip>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | Name of the file inside the ZIP archive. |
| charset | no | default configuration's charset | Charset of text file inside zip archive. |
<zip> <loop item="filename" index="i"> <list><var name="myfilenames"/></list> <body> <zip-entry name="file${i}.xls"> <file path="${filename}" type="binary"/> </zip-entry> </body> </loop> </zip>
This example creates an archive that contains the listed files. The resulting ZIP
archive can then be sent by email or stored in a database or on the file system, which means
the zip element can be placed inside mail, database,
file or any other valid processor.
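As a sketch of the mail case, the archive can be built directly inside a mail-attach element. The server, addresses and file paths below are placeholders, and the application/zip mime type is chosen here for illustration rather than taken from this manual.

```xml
<mail from="my@my.com"
      smtp-host="smtp.example.com"
      to="myaccount@example.com"
      type="html"
      subject="Archived files">
    The requested files are attached as a single archive.
    <mail-attach name="files.zip" mimetype="application/zip">
        <zip>
            <!-- text entry: charset controls how the text is encoded inside the archive -->
            <zip-entry name="report.txt" charset="UTF-8">
                <file path="data/report.txt"/>
            </zip-entry>
            <!-- binary entry: the file is read as binary and stored unchanged -->
            <zip-entry name="photo.jpg">
                <file path="myphoto1.jpg" type="binary"/>
            </zip-entry>
        </zip>
    </mail-attach>
</mail>
```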
ftp 2.0
Creates an FTP connection and executes one or more of the supported FTP operations against the server:
ftp-list, ftp-get, ftp-put, ftp-del,
ftp-mkdir, ftp-rmdir.
<ftp server="server" port="port" username="username" password="password" account="account" remotedir="remotedir"> [<ftp-list path="path" listfiles="listfiles" listdirs="listdirs" listlinks="listlinks" listfilter="listfilter"/>]* [<ftp-get path="path"/>]* [<ftp-put path="path" charset="charset"> content to save </ftp-put>]* [<ftp-del path="path"/>]* [<ftp-mkdir path="path"/>]* [<ftp-rmdir path="path"/>]* </ftp>
| Name | Required | Default | Description |
|---|---|---|---|
| server | yes | | FTP server address. |
| port | no | 21 | FTP server port. |
| username | yes | | FTP server username. |
| password | yes | | FTP server password. |
| account | no | | FTP server account name. |
| remotedir | no | | Working remote directory on FTP server. |
| path | yes | | Path of the file/directory to be accessed/added/removed. |
| listfiles | no | yes | Tells whether to include files in the list. |
| listdirs | no | yes | Tells whether to include directories in the list. |
| listlinks | no | yes | Tells whether to include links in the list. |
| listfilter | no | | Filter used for listing files. May include * and ?, e.g. my*.ex? |
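A minimal sketch of two typical uses follows. The server name, credentials, directory names and the variable name remoteFiles are hypothetical, and it is an assumption that paths are resolved relative to remotedir when that attribute is given.

```xml
<!-- List only *.txt files (no directories or links) in a remote directory -->
<var-def name="remoteFiles">
    <ftp server="ftp.example.com" username="myuser" password="mypass">
        <ftp-list path="reports" listdirs="no" listlinks="no" listfilter="*.txt"/>
    </ftp>
</var-def>

<!-- Create a remote directory and upload a local file into it -->
<ftp server="ftp.example.com" username="myuser" password="mypass" remotedir="reports">
    <ftp-mkdir path="archive"/>
    <ftp-put path="archive/today.txt">
        <file path="today.txt"/>
    </ftp-put>
</ftp>
```

Several operations can share one ftp element, as in the second block, so that they run over a single connection.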
tokenize 2.0
Splits the given text into elements (tokens).
<tokenize delimiters="delimiters" trimtokens="trimtokens" allowemptytokens="allowemptytokens"> content to tokenize </tokenize>
| Name | Required | Default | Description |
|---|---|---|---|
| delimiters | no | new line character | Tells which characters are used as token delimiters. |
| trimtokens | no | yes | Tells whether to trim resulting tokens. |
| allowemptytokens | no | no | Tells whether to include empty tokens (those consisting only of whitespace) in the resulting list. |
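As an illustration, the sketch below splits a comma-separated file and iterates over the resulting tokens. The file path data/tags.txt and the item name tag are hypothetical.

```xml
<loop item="tag">
    <list>
        <!-- trimtokens and allowemptytokens are left at their defaults (yes and no) -->
        <tokenize delimiters=",">
            <file path="data/tags.txt"/>
        </tokenize>
    </list>
    <body>
        <template>Tag: ${tag}</template>
    </body>
</loop>
```

With the defaults from the table above, surrounding whitespace should be trimmed from each token and empty tokens should be left out of the list.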
Writing a Web-Harvest configuration can be tricky, especially when it combines features such as regular expressions, XQuery, variables and various templates. To make problems easier to locate, Web-Harvest uses Log4J internally to log each processor's execution. In programmatic use, Log4J can be configured as the user wishes.
Furthermore, intermediate execution values can be saved to the file system.
The debugging option must be turned on (see Usage), and
each processor whose execution is to be monitored can be given the special
attribute id, which tells Web-Harvest to save the processor's result to a
file named _debug/<id>_<num>.debug under the working path.
For example, in the following configuration pipeline:
<xpath expression="//a/@href" id="yahoo_links"> <html-to-xml id="yahoo_xml"> <http url="http://www.yahoo.com" id="yahoo_html"/> </html-to-xml> </xpath>
all three processors are told to save their results. Thus, the following files are created:
