Version 1.0
User manual
This manual briefly describes the structure and elements of the Web-Harvest configuration file.
Predefined and user-defined variables and objects
Every Web-Harvest variable context initially contains the following helper objects, which can be used in any expression inside a configuration:
Object sys, which contains some general-purpose constants and methods. For more details, check the Java API of this class.
Object http, which provides access to the HTTP client and gives information about HTTP responses.
Note 1: See the Usage page for how user-defined objects can be put into the variable context.
Valid XML configuration elements
config
| Name | Required | Default | Description |
|---|---|---|---|
| charset | no | UTF-8 | Defines the default charset used throughout the configuration. Every processor that needs charset information uses this value unless another is explicitly set. |
| scriptlang | no | beanshell | Defines the default scripting engine used throughout the configuration. Allowed values are beanshell, javascript and groovy. The default engine is used wherever no other is specified. The script and template processors can specify their own scripting engine, which makes it possible to mix different scripting languages within the same Web-Harvest configuration. |
<config charset="ISO-8859-1" scriptlang="groovy"> <file action="write" path="squares.txt"> <script return="${[1, 2, 3, 4, 5].collect(square)}"> <![CDATA[ square = { it * it } ]]> </script> </file> </config>
Here, the file processor uses the default encoding ISO-8859-1 and the script processor uses the default language groovy.
empty
Wraps an execution sequence and returns an empty value. This element is used when the execution result is not needed.
<empty> wrapped body </empty>
<file action="write" path="test/amazon_home.html"> <empty> <var-def name="amazonContent"> <http url="http://www.amazon.com"/> </var-def> </empty> <template> <![CDATA[ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> ]]> ${amazonContent} </template> </file>
A new variable is created, but its value does not participate in the result because it is inside the empty element. Instead, its value is used in the template processor that follows.
text
Converts the embedded value to its string representation.
<text> wrapped body </text>
<var-def name="digits"> <while condition="${i.toInt() != 10}" index="i"> <template>${i}</template> </while> </var-def> <file action="write" path="/test/replaced23.txt"> <regexp replace="true"> <regexp-pattern>(.*)(2.*3)(.*)</regexp-pattern> <regexp-source> <text> <var name="digits"/> </text> </regexp-source> <regexp-result> <template>${_1}here were 2 and 3${_3}</template> </regexp-result> </regexp> </file>
The variable named digits is defined using the while processor, producing a sequence of 9 values. Next, the regexp processor is invoked to do a search-and-replace on the variable value. The text processor is used to concatenate all digit values into a single one. Without the text processor, the regular expression search would be applied to each item in the list (every digit), and no replacement would occur because no single item contains the sequence "2.*3".
var-def
Defines a new variable, or overrides an existing one, with the specified name and value.
<var-def name="variable_name" overwrite="overwrite_existing"> body as value of the variable </var-def>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the variable. It should be a valid identifier, as in most programming languages. |
| overwrite | no | true | Boolean value (true or false) telling whether an existing variable with the same name will be overwritten. |
<var-def name="digitList"> <while condition="true" index="i" maxloops="9"> <var-def name="digit${i}"> <template>${i}</template> </var-def> </while> </var-def>
This example defines the variable digitList, which is a sequence of 9 values (digits from 1 to 9), and 9 simple variables digit1, digit2, ..., digit9 with values ranging from 1 to 9.
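The overwrite attribute is not exercised in the example above. A minimal sketch, assuming a hypothetical variable name pageUrl and placeholder URLs, of how it might be used to keep a value that is already defined:

```xml
<!-- First definition: creates the variable pageUrl (hypothetical name). -->
<var-def name="pageUrl">http://www.example.com/</var-def>

<!-- overwrite="false": according to the attribute table, the existing
     value should be kept and this body should not replace it. -->
<var-def name="pageUrl" overwrite="false">http://www.example.org/</var-def>
```

If the attribute behaves as the table describes, pageUrl still holds the first URL after both definitions have run.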
var
Returns the value of a defined variable. Throws an exception if the variable is not defined.
<var name="variable_name"/>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | Variable name. |
<var-def name="searchEngine"> google </var-def> <var-def name="${searchEngine}Content"> <http url="http://www.${searchEngine}.com"/> </var-def> <file action="write" path="data/${searchEngine}_content.html"> <var name="${searchEngine}Content"/> </file>
After execution, the file named "google_content.html" contains the page content of www.google.com.
file
Reads or writes the content of the specified file.
<file action="file_action" path="file_path" type="file_type" charset="charset_of_text_file"> body defining content of the file if action="write" or action="append" </file>
| Name | Required | Default | Description |
|---|---|---|---|
| action | no | read | Defines file action. Valid values are read, append and write. |
| path | yes | | File path, relative to the working directory. |
| type | no | text | Type of file: text or binary. |
| charset | no | [Default charset for config] | Charset for text files. Has no effect if type is binary. |
<file action="write" path="123.txt"> <file action="read" path="1.txt"/> ----------------------------------- <file action="read" path="2.txt"/> ----------------------------------- <file action="read" path="3.txt"/> </file>
Here, a new file is created containing the appended contents of three existing files, separated by lines.
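The append action and the charset attribute are listed in the table but not shown. The following is only a sketch, assuming a hypothetical file name harvest.log; it reuses sys.datetime and sys.lf as they appear in the template example of this manual:

```xml
<!-- Appends one timestamped line to harvest.log (hypothetical path),
     encoding the text as UTF-8 instead of the configuration default. -->
<file action="append" path="harvest.log" charset="UTF-8">
    <template>${sys.datetime("yyyy-MM-dd, HH:mm:ss")} run finished${sys.lf}</template>
</file>
```

Running the configuration repeatedly would then be expected to grow the file by one line per run rather than replace it.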
http
Sends an HTTP request to the specified URL and returns the HTTP response as the result. The body of the processor is executed first in order to define optional HTTP parameters and/or headers, and then the HTTP request is sent.
<http url="url" method="method" charset="charset" cookie-policy="cookie_policy" username="username" password="password"> body that might contain http-param and/or http-header elements </http>
| Name | Required | Default | Description |
|---|---|---|---|
| url | yes | | HTTP request URL. |
| method | no | get | HTTP method: get or post |
| charset | no | [Default charset for config] | Defines encoding of the HTTP response content. Has no effect if content type is binary. |
| cookie-policy | no | [Default cookie policy of the HTTP client] | Specifies how the HTTP client manages cookies. Allowed values are: browser, ignore, netscape, rfc_2109 and default. |
| username | no | | Specifies the username if the URL requires authentication. |
| password | no | | Specifies the password if the URL requires authentication. |
<xpath expression="data(//script)"> <html-to-xml> <http url="http://www.yahoo.com/"/> </html-to-xml> </xpath>
The content of www.yahoo.com is downloaded, transformed to XML and then all scripts inside the page are found.
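The optional attributes from the table can be combined on a single request. A sketch only, in which the URL, the credentials and the variable name protectedPage are placeholders rather than anything taken from a real site:

```xml
<!-- Fetches a page that requires authentication, decodes the response
     as ISO-8859-1 and handles cookies the way a browser would. -->
<var-def name="protectedPage">
    <http url="http://www.example.com/members/"
          method="get"
          charset="ISO-8859-1"
          cookie-policy="browser"
          username="demo-user"
          password="demo-password"/>
</var-def>
```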
http-param
Adds an HTTP parameter to the first enclosing http processor, for both post and get requests. If it is used outside an http processor, an exception is thrown.
<http-param name="param_name"> body as parameter value </http-param>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the HTTP parameter. |
<var-def name="paramNames"> USERID PASSWORD </var-def> <http method="post" url="http://www.nytimes.com/auth/login"> <http-param name="is_continue">true</http-param> <http-param name="URI">http://</http-param> <http-param name="OQ"></http-param> <http-param name="OP"></http-param> <loop item="name"> <list> <var name="paramNames"/> </list> <body> <http-param name="${name}">web-harvest</http-param> </body> </loop> </http>
Sends the parameters needed to log in to www.nytimes.com/auth/login.
http-header
Defines an HTTP header for the first enclosing http processor. If it is used outside an http processor, an exception is thrown.
<http-header name="header_name"> body as header value </http-header>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the HTTP header. |
<http url="http://www.google.com"> <http-header name="User-Agent"> Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1 </http-header> </http>
Identifies itself to www.google.com as the Firefox browser.
html-to-xml
Cleans up the content of the body and transforms it to valid XML. The body is usually HTML obtained as the result of http processor execution. The actual parsing and cleaning is delegated to the HtmlCleaner tool. Although no special tuning is needed in most cases, the cleaner may be configured with several parameters defined as the processor's attributes.
<html-to-xml outputtype="..." advancedxmlescape="..." usecdata="..." specialentities="..." unicodechars="..." omitunknowntags="..." treatunknowntagsascontent="..." omitdeprtags="..." treatdeprtagsascontent="..." omitcomments="..." omithtmlenvelope="..." allowmultiwordattributes="..." allowhtmlinsideattributes="..." namespacesaware="..." prunetags="..."> body as html to be cleaned </html-to-xml>
| Name | Required | Default | Description |
|---|---|---|---|
| outputtype | no | simple | Defines how the resulting XML will be serialized. Allowed values are simple, compact, browser-compact and pretty. |
| advancedxmlescape | no | true | If this parameter is set to true, an ampersand (&) that precedes a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;. |
| usecdata | no | true | If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are regarded as ordinary text (special characters will be escaped). |
| specialentities | no | true | If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the Unicode characters they represent (ô, ‰, ×). This doesn't include &amp;, &lt;, &gt;, &quot;, &apos;. |
| unicodechars | no | true | If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real Unicode characters (e.g. &#1078; is replaced with ж). |
| omitunknowntags | no | false | Tells whether to skip (ignore) unknown tags during cleanup. |
| treatunknowntagsascontent | no | false | Tells whether to treat unknown tags as ordinary content, i.e. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitunknowntags is set to false. |
| omitdeprtags | no | false | Tells whether to skip (ignore) deprecated HTML tags during cleanup. |
| treatdeprtagsascontent | no | false | Tells whether to treat deprecated tags as ordinary content, i.e. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitdeprtags is set to false. |
| omitcomments | no | false | Tells whether to skip HTML comments. |
| omithtmlenvelope | no | false | Tells whether to remove HTML and BODY tags from the resulting XML, and use first tag in the BODY section instead. If BODY section doesn't contain any tags, then this attribute has no effect. |
| allowmultiwordattributes | no | true | Tells the parser whether to allow attribute values consisting of multiple words. If true, the attribute att="a b c" stays as it is; if false, the parser splits it into att="a" b="b" c="c" (this is the default browser behaviour). |
| allowhtmlinsideattributes | no | false | Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" stays as it is; if not, the parser ends the attribute value after "here is ". This flag makes sense only if allowmultiwordattributes is set as well. |
| namespacesaware | no | true | If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped. |
| prunetags | no | empty string | Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if prunetags is "script,style", the resulting XML will not contain scripts and styles. |
<html-to-xml outputtype="pretty"> <http url="http://www.motors.ebay.com"/> </html-to-xml>
Downloads the www.motors.ebay.com page and cleans it up, producing pretty-printed XML content.
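Several cleaner attributes from the table can be set at once. A sketch, assuming a placeholder URL and a hypothetical variable name cleanPage, that drops scripts, styles and comments before the content is queried further:

```xml
<!-- Cleans the downloaded HTML, removing script/style subtrees and
     comments, and serializes the result in compact form. -->
<var-def name="cleanPage">
    <html-to-xml outputtype="compact"
                 omitcomments="true"
                 prunetags="script,style">
        <http url="http://www.example.com/"/>
    </html-to-xml>
</var-def>
```

Pruning script and style elements up front tends to make later XPath or XQuery expressions over the page simpler, since they no longer have to skip that content.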
regexp
Searches the body for the given regular expression and optionally replaces found occurrences with the specified pattern. If the body is a list of values, the regexp processor is applied to every item and the final execution result is a list.
<regexp replace="true_or_false" max="max_found_occurrences"> <regexp-pattern> body as pattern value </regexp-pattern> <regexp-source> body as the text source </regexp-source> [<regexp-result> body as the result </regexp-result>] </regexp>
Variables named _<group_number> are created for the groups of the regular expression, as ${_1} and ${_2} in the examples of this section show. See a regular expression tutorial for a better explanation of groups.
| Name | Required | Default | Description |
|---|---|---|---|
| replace | no | false | Logical value telling whether found occurrences of the regular expression will be replaced. Valid values are true/false or yes/no. If this value is true (yes), then regexp-result must be specified with the replacement value. |
| max | no | | Limits the number of found pattern occurrences. There is no limit if it is not specified. |
<regexp> <regexp-pattern>([_\w\d]*)[\s]*=[\s]*([\w\d\s]*+)[\,\.\;]*</regexp-pattern> <regexp-source> var1= test1, var2 = bla bla; index=16; city = Delhi,town=Kingston; </regexp-source> <regexp-result> <template>Value of variable "${_1}" is "${_2}"!</template> </regexp-result> </regexp>
Here, the regular expression is matched against the source text, producing as a result a list of five values: Value of variable "var1" is "test1"!, Value of variable "var2" is "bla bla"! ...
<regexp replace="true"> <regexp-pattern>[\s]*[\,\.\;][\s]*</regexp-pattern> <regexp-source> var1= test1, var2 = bla bla; index=16; city = Delhi,town=Kingston; </regexp-source> <regexp-result> <template>|</template> </regexp-result> </regexp>
Here, the regular expression replacement produces a single value as the result: var1= test1|var2 = bla bla|index=16|city = Delhi|town=Kingston|.
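Neither example above uses the max attribute. The following sketch, with a simplified pattern and a shortened source text chosen only for illustration, shows how the number of reported occurrences might be limited:

```xml
<!-- max="2": per the attribute table, at most two occurrences of the
     pattern should be reported, even though the source contains three. -->
<regexp max="2">
    <regexp-pattern>(\w+)\s*=\s*(\w+)</regexp-pattern>
    <regexp-source>var1=test1, var2=test2, index=16</regexp-source>
    <regexp-result>
        <template>${_1} has value ${_2}</template>
    </regexp-result>
</regexp>
```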
xpath
Uses an XPath expression to search an XML document.
<xpath expression="xpath_expression"> body as xml </xpath>
| Name | Required | Default | Description |
|---|---|---|---|
| expression | yes | | XPath language expression. |
<xpath expression="//a/@href"> <html-to-xml> <http url="http://www.nba.com/"/> </html-to-xml> </xpath>
The result is the sequence of links from the page retrieved from www.nba.com.
xquery
Uses an XQuery expression to query an XML document.
<xquery> [<xq-param name="xquery_param_name" [type="xquery_param_type"]> body as xquery parameter value </xq-param>] * <xq-expression> body as xquery language construct </xq-expression> </xquery>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | Name of the XQuery parameter. |
| type | no | node() | Type of the XQuery parameter, one of: node(), integer, long, float, double, boolean, string, node()*, integer*, long*, float*, double*, boolean*, string*. |
Multiple external parameters may optionally be specified for the query. In most cases at least one, containing the XML document, is needed. Every specified xquery parameter needs a declaration inside the xq-expression in the form:
declare variable $<xquery_param_name> as <xquery_param_type> external;
The allowed parameter types are node(), integer, long, float, double, boolean and string, and the analogous sequence types node()*, integer*, long*, float*, double*, boolean* and string*. If not specified, the default XQuery parameter type is node().
<xquery> <xq-param name="doc"> <html-to-xml> <http url="${sys.fullUrl(startUrl, articleUrl)}"/> </html-to-xml> </xq-param> <xq-expression><![CDATA[ declare variable $doc as node() external; let $author := data($doc//div[@class="byline"]) let $title := data($doc//h1) let $text := data($doc//div[@id="articleBody"]) return <article> <title>{$title}</title> <author>{$author}</author> <text>{$text}</text> </article> ]]></xq-expression> </xquery>
The xquery is applied to the downloaded page, producing XML that contains information about a newspaper article.
xslt
Applies an XSLT transformation to the XML document.
<xslt> <xml> body as xml </xml> <stylesheet> body as xsl </stylesheet> </xslt>
<xslt> <xml> <html-to-xml> <http url="${url}"/> </html-to-xml> </xml> <stylesheet> <file path="stylesheets/tree.xsl"/> </stylesheet> </xslt>
The XSLT transformation, taken from the file, is applied to the downloaded content.
script
Executes code written in the specified scripting language. Web-Harvest supports BeanShell, Groovy and Javascript, all of which are powerful, widespread and popular scripting languages.
The body of the script processor is executed in the specified language and, optionally, the expression specified in the return attribute is evaluated and returned.
All variables defined during configuration execution are also available in the script processor. However, variables used throughout Web-Harvest are not simple types: they are all org.webharvest.runtime.variables.Variable objects (an internal Web-Harvest class) that expose convenient methods:
String toString()
byte[] toBinary()
boolean toBoolean()
int toInt()
long toLong()
double toDouble()
Object[] toArray()
java.util.List toList()
Object getWrappedObject()
The way to push a value back to Web-Harvest after the script finishes is the command sys.defineVariable(varName, varValue, [overwrite]), which creates the appropriate wrapper around the specified value: list variables for java.util.List and arrays, and simple variables for other objects. The best way to illustrate this is the simple example below.
Each script engine used in a single Web-Harvest configuration, once created, preserves its variable context throughout the configuration, meaning that all variables and objects are available in further script processors that use the same language.
<script language="script_language" return="value_to_return"> body as script </script>
| Name | Required | Default | Description |
|---|---|---|---|
| language | no | Default scripting language if defined in the config element, or beanshell if nothing is defined. | Defines which scripting engine is used in the processor. Valid values are beanshell, javascript and groovy. |
| return | no | Empty value | Specifies what this processor should evaluate at the end and return as processing value. |
<?xml version="1.0" encoding="UTF-8"?> <config> <var-def name="birthDate"> 11/4/1958 </var-def> <var-def name="web_harvest_day_variable"> <script return="namedDay.toUpperCase()"><![CDATA[ tokenizer = new StringTokenizer(birthDate.toString().trim(), "./-\\"); day = Integer.parseInt(tokenizer.nextToken()); month = Integer.parseInt(tokenizer.nextToken()); year = Integer.parseInt(tokenizer.nextToken()); Calendar cal = Calendar.getInstance(); cal.set(Calendar.DAY_OF_MONTH, day); cal.set(Calendar.MONTH, month-1); cal.set(Calendar.YEAR, year); /* Calendar.DAY_OF_WEEK is 1-based: SUNDAY = 1 ... SATURDAY = 7 */ switch( cal.get(Calendar.DAY_OF_WEEK) ) { case 1 : namedDay = "Sunday"; break; case 2 : namedDay = "Monday"; break; case 3 : namedDay = "Tuesday"; break; case 4 : namedDay = "Wednesday"; break; case 5 : namedDay = "Thursday"; break; case 6 : namedDay = "Friday"; break; default: namedDay = "Saturday"; break; } ]]></script> </var-def> <template> The day when you were born was ${namedDay}. </template> <file action="write" path="day.txt"> <var name="web_harvest_day_variable"/> </file> </config>
This example also shows that script-internal variables such as namedDay, once defined, are available in all subsequent script and template processors.
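The sys.defineVariable command described earlier can be sketched as follows. This is an illustration under assumptions, not a confirmed recipe: the variable names names and name are hypothetical, and the script relies on BeanShell accepting plain Java syntax.

```xml
<!-- Builds a java.util.List in the script and pushes it back into the
     Web-Harvest variable context under the name "names". -->
<script><![CDATA[
    java.util.List names = new java.util.ArrayList();
    names.add("alpha");
    names.add("beta");
    sys.defineVariable("names", names);
]]></script>

<!-- Because the pushed value is a java.util.List, it should be wrapped
     as a list variable and be iterable item by item. -->
<loop item="name">
    <list>
        <var name="names"/>
    </list>
    <body>
        <template>Item: ${name}</template>
    </body>
</loop>
```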
template
For the given text content, the parts surrounded by ${ and } are evaluated using the specified scripting engine. If no scripting language is specified, the default one is used (see the config element).
<template language="script_language"> body as text for templating </template>
| Name | Required | Default | Description |
|---|---|---|---|
| language | no | Default config language | Specifies script language that will be used for evaluation of parts surrounded with ${ and }. Valid values are beanshell, javascript and groovy. |
<var-def name="content"> <file path="textdata/products.txt"/> </var-def> <var-def name="changedContent"> <template> ${sys.datetime("yyyy-MM-dd, HH:mm:ss")} ${sys.lf} ---------------------------------------------------- ${sys.lf} ${my.process(content.toString())} </template> </var-def>
The template processor uses some built-in constants and functions, and some user-defined objects from the variable context, to produce the desired content.
case
Executes a conditional statement. Sequentially checks whether any of the conditions specified in the inner if elements is satisfied and, if one is found, returns its body as the result. If no condition is true, the result of execution is the body of the else element if specified, or an empty value otherwise.
<case> [<if condition="expression"> if body </if>] * [<else> else body </else>] </case>
| Name | Required | Default | Description |
|---|---|---|---|
| condition | yes | | If true (yes), the body of if is evaluated. |
<var-def name="contact"> <xpath expression="//a[contains(., 'contact')]/@href"> <var name="pageContent"/> </xpath> </var-def> <var-def name="contactMail"> <case> <if condition="${contact.toString() != ''}"> <var name="contact"/> </if> <else> Contact is not defined! </else> </case> </var-def>
Here, the conditional processor is used to check whether the previous xpath search found contact information on the page.
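Since case accepts any number of if elements, it can also act as a multi-way branch. A sketch, assuming the pageContent variable from the example above and a hypothetical result variable pageKind; the conditions use java.lang.String.contains on the string value:

```xml
<!-- The if elements are checked in order; the body of the first one
     whose condition is true becomes the result. -->
<var-def name="pageKind">
    <case>
        <if condition='${pageContent.toString().contains("login")}'>
            login page
        </if>
        <if condition='${pageContent.toString().contains("contact")}'>
            contact page
        </if>
        <else>
            other page
        </else>
    </case>
</var-def>
```

A page containing both words would be classified as a login page, because the first satisfied condition wins.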
loop
Iterates through the specified list and executes the body logic for each item. The result is the list of processed bodies.
<loop item="item_var_name" index="index_var_name" maxloops="max_loops" filter="list_filter"> <list> body as list value </list> <body> body for each list item </body> </loop>
| Name | Required | Default | Description |
|---|---|---|---|
| item | no | | Name of the variable that takes the value of the current list item. |
| index | no | | Name of the index variable; its value in the first iteration is 1. |
| maxloops | no | | Limits the number of iterations. There is no limit if it is not specified. |
| filter | no | | Expression for filtering the iteration list. It consists of an arbitrary number of restrictions separated by commas, for example 1-20,1:2,unique. |
<loop item="link" index="i" filter="unique"> <list> <xpath expression="//img/@src"> <html-to-xml> <http url="http://www.yahoo.com"/> </html-to-xml> </xpath> </list> <body> <file action="write" type="binary" path="images/${i}.gif"> <http url="${sys.fullUrl('http://www.yahoo.com', link)}"/> </file> </body> </loop>
The loop iterates over all unique image URLs from www.yahoo.com and, for each URL, downloads the image and stores it in the file system.
while
Loops while the specified condition is satisfied. The result is a list made of the processed bodies of each iteration.
<while condition="expression" index="index_var_name" maxloops="max_loops"> body </while>
| Name | Required | Default | Description |
|---|---|---|---|
| condition | yes | | Expression that is evaluated for every loop; if its value is true, the body is executed. |
| index | no | | Name of the index variable; its value in the first iteration is 1. |
| maxloops | no | | Limits the number of iterations. There is no limit if it is not specified. |
For an example, see the function processor.
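A smaller, self-contained sketch of while, modelled on the digits example in the text section; the variable name firstFive is hypothetical:

```xml
<!-- The index variable i starts at 1. The condition is checked before
     each iteration, so the body should run for i = 1..5 and the result
     should be a list of five values. maxloops is only a safety limit. -->
<var-def name="firstFive">
    <while condition="${i.toInt() != 6}" index="i" maxloops="10">
        <template>${i}</template>
    </while>
</var-def>
```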
function
Declares a user-defined function.
<function name="function_name"> function body </function>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the user-defined function. |
<function name="download-multipage-list"> <return> <while condition="${pageUrl.toString().trim() != ''}" maxloops="${maxloops}" index="i"> <empty> <var-def name="content"> <html-to-xml> <http url="${pageUrl}"/> </html-to-xml> </var-def> <var-def name="nextLinkUrl"> <xpath expression="${nextXPath}"> <var name="content"/> </xpath> </var-def> <var-def name="pageUrl"> <template>${sys.fullUrl(pageUrl, nextLinkUrl)}</template> </var-def> </empty> <xpath expression="${itemXPath}"> <var name="content"/> </xpath> </while> </return> </function> <var-def name="imgLinks"> <call name="download-multipage-list"> <call-param name="pageUrl"> http://images.google.com/images?q=harvest&amp;hl=en&amp;btnG=Search+Images&amp;nojs=1 </call-param> <call-param name="nextXPath"> //a[@shape='rect' and .='Next']/@href </call-param> <call-param name="itemXPath"> //img[contains(@src, 'images?q=tbn')]/@src </call-param> <call-param name="maxloops"> 5 </call-param> </call> </var-def>
Here, the function named download-multipage-list is defined so that it can serve multiple extractions. It collects link URLs from a series of pages, where an XPath expression parameter is used to determine the URL of the next page with links, if it exists. This situation is typical for a list of products or a list of search results spanning multiple web pages. The function is then called with the specified parameters in order to collect image links from a Google image search, limiting the number of result pages to 5.
return
Returns a value from the user-defined function.
<return> body as return value </return>
For an example, see the function processor.
call
Calls the user-defined function.
<call name="function_name"> [<call-param name="param_name"> body as actual parameter value </call-param>] * </call>
| Name | Required | Default | Description |
|---|---|---|---|
| name | yes | | The name of the user-defined function. |
For an example, see the function processor.
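The interplay of function, return, call and call-param can also be seen in a much smaller sketch. The function name fetch-as-xml, the variable name links and the URL are hypothetical and chosen only for illustration:

```xml
<!-- Declares a function that downloads a page and returns it as XML.
     The parameter pageUrl is supplied by the caller via call-param. -->
<function name="fetch-as-xml">
    <return>
        <html-to-xml>
            <http url="${pageUrl}"/>
        </html-to-xml>
    </return>
</function>

<!-- Calls the function and searches its return value for link targets. -->
<var-def name="links">
    <xpath expression="//a/@href">
        <call name="fetch-as-xml">
            <call-param name="pageUrl">http://www.example.com/</call-param>
        </call>
    </xpath>
</var-def>
```

The name of each call-param matches the variable name used inside the function body, which is how the parameter value reaches the http processor here.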
include
Includes another configuration file and executes its logic. This is useful for keeping libraries of common functions or for splitting a large extraction process into multiple files.
<include path="file_path"/>
| Name | Required | Default | Description |
|---|---|---|---|
| path | yes | | Path of the configuration file to be included. The path is relative to the directory of the including configuration file. |
<include path="lib.xml"/>
try
Wraps execution and, for any recoverable exception, returns a default value without crashing the whole process.
<try> <body> try body </body> <catch> catch body </catch> </try>
<var-def name="reportText"> <try> <body> <file path="data/report.txt"/> </body> <catch> No report file! </catch> </try> </var-def>
A file read exception, if one occurs, is caught and the default value is stored in the variable.
exit
Conditionally breaks the configuration execution.
<exit condition="condition" message="message" />
| Name | Required | Default | Description |
|---|---|---|---|
| condition | no | true | Condition that determines whether execution will stop. Must be a boolean value (true, yes, false, no). |
| message | no | | Optional message to the user if the configuration is exiting. It becomes part of the logging information, or a warning dialog pops up if Web-Harvest is used in GUI mode. |
<exit condition='${!sys.isVariableDefined("username")}' message="No username provided!" />
The configuration stops execution if the variable username is not defined.
Writing a Web-Harvest configuration can be tricky, especially when it combines multiple features such as regular expressions, xquery, variables and various templates. To make problems easier to find, Web-Harvest internally uses Log4J to log each processor's execution. In programmatic use, Log4J can be configured as the user wishes.
Furthermore, there is a way to save intermediate execution values to the file system. The debugging option must be turned on (see Usage), and for each processor whose execution is to be monitored a special attribute id can be defined, which tells Web-Harvest to save the processor's content to a file named _debug/<id>_<num>.debug under the working path. For example, in the following configuration pipeline:
<xpath expression="//a/@href" id="yahoo_links"> <html-to-xml id="yahoo_xml"> <http url="http://www.yahoo.com" id="yahoo_html"/> </html-to-xml> </xpath>
all three processors are told to save their results, so files named according to the pattern above are created for the ids yahoo_html, yahoo_xml and yahoo_links.
