Processor Reference
Complete reference for all Web-Harvest processors and their configuration options
Web & HTTP Processors
<http>
Performs HTTP requests to fetch web content
Attributes:
url
- Target URL (required)
method
- HTTP method (GET, POST, PUT, DELETE)
timeout
- Request timeout in milliseconds
followredirects
- Follow HTTP redirects (true/false)
useragent
- Custom User-Agent string
Example:
<http url="https://example.com" method="GET" timeout="30000">
    <http-header name="Accept">application/json</http-header>
    <http-param name="page">1</http-param>
</http>
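The remaining attributes combine in the same way. The following is a minimal sketch of a form submission, assuming that `http-param` values are sent as form fields when `method` is POST; the URL, field names, and User-Agent string are placeholders rather than values from a real service.

```xml
<!-- Hypothetical form submission: URL and field names are placeholders -->
<http url="https://example.com/login"
      method="POST"
      timeout="15000"
      followredirects="true"
      useragent="ExampleHarvester/1.0">
    <http-param name="username">demo</http-param>
    <http-param name="password">secret</http-param>
</http>
```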
<html-to-xml>
Converts HTML content to well-formed XML using advanced parsing strategies
Attributes:
advancedxmlescape
- Advanced XML character escaping
specialentities
- Support for special HTML entities
unicodechars
- Unicode character support
allowmultiwordattributes
- Allow multi-word attributes
omitunknowntags
- Omit unknown HTML tags
omitcomments
- Omit HTML comments
omittermlenvelope
- Omit XML envelope
Example:
<html-to-xml
    advancedxmlescape="true"
    specialentities="true"
    unicodechars="true">
    <http url="https://example.com"/>
</html-to-xml>
<web-browser>
Headless web browser for JavaScript-heavy sites using PhantomJS
Attributes:
path
- Path to PhantomJS executable
port
- Port for browser communication
Example:
<web-browser path="/usr/bin/phantomjs" port="8081">
    <web-browser-load url="https://spa.example.com"/>
    <web-browser-javascript>
        return document.querySelector('.content').innerHTML;
    </web-browser-javascript>
</web-browser>
Data Extraction Processors
<xpath>
Extracts data using XPath expressions
Attributes:
expression
- XPath expression (required)
id
- Unique identifier
Example:
<xpath expression="//div[@class='product']//h3">
    <html-to-xml>
        <http url="https://shop.example.com"/>
    </html-to-xml>
</xpath>
<xquery>
Advanced data processing using XQuery 3.1
Attributes:
id
- Unique identifier
Example:
<xquery>
    <xq-param name="data">
        <get var="xmlData"/>
    </xq-param>
    <xq-expression>
        <![CDATA[
        declare variable $data as node() external;
        <results>
        {
            for $item in $data//item
            return <item>
                <name>{$item/name}</name>
                <price>{$item/price}</price>
            </item>
        }
        </results>
        ]]>
    </xq-expression>
</xquery>
<regexp>
Pattern matching and text extraction using regular expressions
Attributes:
pattern
- Regular expression pattern
action
- Action (match, replace, find)
replacement
- Replacement text for replace action
flags
- Regex flags (i, m, s, etc.)
Example:
<regexp pattern="&lt;title&gt;(.*?)&lt;/title&gt;" action="match">
    <http url="https://example.com"/>
</regexp>
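The `replace` action pairs `pattern` with the `replacement` attribute. A minimal sketch, assuming the processor substitutes every match of `pattern` in its body text with the `replacement` value:

```xml
<!-- Collapse each run of whitespace in the fetched page into a single space -->
<regexp pattern="\s+" action="replace" replacement=" ">
    <http url="https://example.com"/>
</regexp>
```

Because the pattern lives inside an XML attribute, any literal `<` or `&` in a pattern has to be written as `&lt;` or `&amp;`.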
File & Data Processors
<file>
File read/write operations
Attributes:
path
- File path (required)
action
- Action (read, write, append)
charset
- Character encoding
createpath
- Create the directory path if it does not exist
Example:
<file path="output.xml" action="write" charset="UTF-8">
    <template>
        <![CDATA[
        <results>
            <title>${title}</title>
            <content>${content}</content>
        </results>
        ]]>
    </template>
</file>
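The `append` action and `createpath` attribute are useful for log-style output that accumulates across runs. A minimal sketch, assuming `createpath` accepts `true`/`false` like the other boolean attributes in this reference; the file path and the `title` variable are illustrative.

```xml
<!-- Append the current title to a log file, creating logs/harvest/ if needed -->
<file path="logs/harvest/titles.log" action="append" charset="UTF-8" createpath="true">
    <get var="title"/>
</file>
```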
<json-to-xml>
Converts JSON data to XML format
Attributes:
id
- Unique identifier
Example:
<json-to-xml>
    <http url="https://api.example.com/data"/>
</json-to-xml>
<xml-to-json>
Converts XML data to JSON format
Attributes:
id
- Unique identifier
Example:
<xml-to-json>
    <get var="xmlData"/>
</xml-to-json>
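Converting JSON to XML makes an API response queryable with the extraction processors. The sketch below assumes that `<json-to-xml>` maps JSON keys to element names and that the (hypothetical) response contains `name` fields; neither is confirmed by this reference.

```xml
<!-- Fetch JSON, convert it to XML, then select every name element -->
<def var="names">
    <xpath expression="//name">
        <json-to-xml>
            <http url="https://api.example.com/data"/>
        </json-to-xml>
    </xpath>
</def>
```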
Control Flow Processors
<loop>
Iterates over a collection of items
Attributes:
item
- Variable name for current item
index
- Variable name for current index
maxloops
- Maximum number of iterations
Example:
<loop item="product" index="i" maxloops="10">
    <list>
        <constant>Product 1</constant>
        <constant>Product 2</constant>
    </list>
    <body>
        <def var="productName">
            <get var="product"/>
        </def>
    </body>
</loop>
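In practice the list is more often produced by another processor than written out as constants. The following is a sketch under the assumption that `<list>` accepts the output of any nested processor as its collection; the URL, XPath expression, and output file name are illustrative.

```xml
<!-- Iterate over at most five product links extracted from a page -->
<loop item="link" index="i" maxloops="5">
    <list>
        <xpath expression="//a[@class='product']/@href">
            <html-to-xml>
                <http url="https://shop.example.com"/>
            </html-to-xml>
        </xpath>
    </list>
    <body>
        <file path="links.txt" action="append">
            <get var="link"/>
        </file>
    </body>
</loop>
```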
<while>
Repeats execution while condition is true
Attributes:
condition
- Condition expression (required)
maxloops
- Maximum number of iterations
Example:
<while condition="${pageUrl.toString().length() != 0}" maxloops="10">
    <def var="content">
        <http url="${pageUrl}"/>
    </def>
    <def var="nextPage">
        <xpath expression="//a[@class='next']/@href">
            <get var="content"/>
        </xpath>
    </def>
    <def var="pageUrl">
        <get var="nextPage"/>
    </def>
</while>
<if>
Conditional execution based on expression
Attributes:
condition
- Condition expression (required)
Example:
<if condition="${price &lt; 100}">
    <then>
        <def var="alert">Price is below threshold!</def>
    </then>
    <else>
        <def var="alert">Price is above threshold.</def>
    </else>
</if>
Variable Processors
<def var>
Defines a new variable with a value
Attributes:
var
- Variable name (required)
Example:
<def var="productUrl">https://shop.example.com/product/123</def>
<def var="productData">
    <html-to-xml>
        <http url="${productUrl}"/>
    </html-to-xml>
</def>
<get var>
Retrieves the value of a variable
Attributes:
var
- Variable name (required)
Example:
<def var="title">
    <xpath expression="//title">
        <get var="pageContent"/>
    </xpath>
</def>
<file path="title.txt" action="write">
    <get var="title"/>
</file>
<script>
Executes scripting code (BeanShell, JavaScript, Groovy)
Attributes:
id
- Unique identifier
Example:
<script>
    // BeanShell code
    String fullUrl = sys.fullUrl("https://example.com", "/path");
    return fullUrl;
</script>
Extension Processors
<database>
Database operations (SELECT, INSERT, UPDATE, DELETE)
Attributes:
driver
- JDBC driver class
url
- Database connection URL
username
- Database username
password
- Database password
query
- SQL query to execute
Example:
<database
    driver="com.mysql.cj.jdbc.Driver"
    url="jdbc:mysql://localhost:3306/mydb"
    username="user"
    password="pass"
    query="SELECT * FROM products WHERE price &lt; ?">
    <db-param>100</db-param>
</database>
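Write operations follow the same shape. The sketch below assumes that multiple `<db-param>` elements are bound to the `?` placeholders in document order and that a `<db-param>` may contain nested processors; both are assumptions, and the table, column, and variable names are illustrative.

```xml
<!-- Insert one scraped product; parameters are assumed to bind in order -->
<database
    driver="com.mysql.cj.jdbc.Driver"
    url="jdbc:mysql://localhost:3306/mydb"
    username="user"
    password="pass"
    query="INSERT INTO products (name, price) VALUES (?, ?)">
    <db-param><get var="productName"/></db-param>
    <db-param><get var="price"/></db-param>
</database>
```

Passing values through placeholders rather than interpolating `${...}` into the query string keeps scraped text from being interpreted as SQL.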
<mail>
Sends email messages
Attributes:
to
- Recipient email address
from
- Sender email address
subject
- Email subject
smtphost
- SMTP server host
smtpport
- SMTP server port
Example:
<mail
    to="admin@example.com"
    from="scraper@example.com"
    subject="Scraping Alert"
    smtphost="smtp.example.com"
    smtpport="587">
    <template>Price alert: ${productName} is now $${price}</template>
</mail>
<tokenize>
Splits text into tokens using delimiters
Attributes:
delimiters
- Delimiter characters
trimtokens
- Trim whitespace from tokens
allowemptytokens
- Allow empty tokens
Example:
<tokenize delimiters="," trimtokens="true">
    <file path="data.csv" action="read"/>
</tokenize>
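Tokens are typically consumed one at a time by a loop. A minimal sketch, assuming the tokenizer output can serve as a `<loop>` list and that `allowemptytokens="true"` keeps the empty field between the two adjacent commas; the input string is made up for illustration.

```xml
<!-- Split a comma-separated string and visit each token in turn -->
<loop item="field" index="i">
    <list>
        <tokenize delimiters="," trimtokens="true" allowemptytokens="true">
            <constant>alpha, beta,,delta</constant>
        </tokenize>
    </list>
    <body>
        <get var="field"/>
    </body>
</loop>
```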