Web & HTTP Processors

<http>

Performs HTTP requests to fetch web content

Attributes:

  • url - Target URL (required)
  • method - HTTP method (GET, POST, PUT, DELETE)
  • timeout - Request timeout in milliseconds
  • followredirects - Follow HTTP redirects (true/false)
  • useragent - Custom User-Agent string

Example:

<http url="https://example.com" method="GET" timeout="30000">
    <http-header name="Accept">application/json</http-header>
    <http-param name="page">1</http-param>
</http>
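
The remaining attributes combine in the same way. The sketch below sends a POST request with a custom User-Agent and redirects turned off. The endpoint, the parameter name q, and the agent string are made up for illustration, and it assumes that <http-param> values are sent as form fields on a POST.

```xml
<!-- Hypothetical search endpoint; only the attributes listed above are used -->
<http url="https://example.com/search"
      method="POST"
      timeout="10000"
      followredirects="false"
      useragent="ExampleScraper/1.0">
    <!-- Assumption: sent as a form field in the request body -->
    <http-param name="q">laptops</http-param>
</http>
```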

<html-to-xml>

Converts HTML content to well-formed XML, with options that control escaping, entity handling, and which tags are kept

Attributes:

  • advancedxmlescape - Advanced XML character escaping
  • specialentities - Support for special HTML entities
  • unicodechars - Unicode character support
  • allowmultiwordattributes - Allow multi-word attributes
  • omitunknowntags - Omit unknown HTML tags
  • omitcomments - Omit HTML comments
  • omithtmlenvelope - Omit the HTML envelope (the outer html and body tags)

Example:

<html-to-xml 
    advancedxmlescape="true" 
    specialentities="true" 
    unicodechars="true">
    <http url="https://example.com"/>
</html-to-xml>
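
A minimal sketch of the clean-up flags, assuming they take true/false like the attributes in the example above: comments and unknown tags are dropped before the result is stored. The variable name cleanPage is illustrative.

```xml
<def var="cleanPage">
    <!-- Assumption: both flags accept "true"/"false" -->
    <html-to-xml omitcomments="true" omitunknowntags="true">
        <http url="https://example.com"/>
    </html-to-xml>
</def>
```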

<web-browser>

Drives a headless browser (PhantomJS) to load and script JavaScript-heavy sites

Attributes:

  • path - Path to PhantomJS executable
  • port - Port for browser communication

Example:

<web-browser path="/usr/bin/phantomjs" port="8081">
    <web-browser-load url="https://spa.example.com"/>
    <web-browser-javascript>
        return document.querySelector('.content').innerHTML;
    </web-browser-javascript>
</web-browser>
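
Rendered markup can be passed on to the other processors. The sketch below assumes that the value returned by <web-browser-javascript> becomes the output of <web-browser>, so it can be cleaned with <html-to-xml> and queried with <xpath>. The //h2 expression is only an illustration.

```xml
<xpath expression="//h2">
    <html-to-xml>
        <web-browser path="/usr/bin/phantomjs" port="8081">
            <web-browser-load url="https://spa.example.com"/>
            <web-browser-javascript>
                // Return the whole rendered page rather than one element
                return document.documentElement.outerHTML;
            </web-browser-javascript>
        </web-browser>
    </html-to-xml>
</xpath>
```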

Data Extraction Processors

<xpath>

Extracts data using XPath expressions

Attributes:

  • expression - XPath expression (required)
  • id - Unique identifier

Example:

<xpath expression="//div[@class='product']//h3">
    <html-to-xml>
        <http url="https://shop.example.com"/>
    </html-to-xml>
</xpath>
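
When several values come from the same page, it may be cheaper to fetch and convert the page once and run each expression against the stored result. A sketch using the same shop URL; the variable names and the assumption that each product block contains a link are illustrative.

```xml
<!-- Fetch and convert the page a single time -->
<def var="shopPage">
    <html-to-xml>
        <http url="https://shop.example.com"/>
    </html-to-xml>
</def>

<!-- Product titles as text -->
<def var="titles">
    <xpath expression="//div[@class='product']//h3/text()">
        <get var="shopPage"/>
    </xpath>
</def>

<!-- Product links, assuming each product block contains an anchor -->
<def var="links">
    <xpath expression="//div[@class='product']//a/@href">
        <get var="shopPage"/>
    </xpath>
</def>
```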

<xquery>

Queries and transforms XML data using XQuery 3.1 expressions

Attributes:

  • id - Unique identifier

Example:

<xquery>
    <xq-param name="data">
        <get var="xmlData"/>
    </xq-param>
    <xq-expression>
        <![CDATA[
        declare variable $data as node() external;
        <results>
            {
                for $item in $data//item
                return <item>
                    <name>{data($item/name)}</name>
                    <price>{data($item/price)}</price>
                </item>
            }
        </results>
        ]]>
    </xq-expression>
</xquery>

<regexp>

Pattern matching and text extraction using regular expressions

Attributes:

  • pattern - Regular expression pattern
  • action - Action (match, replace, find)
  • replacement - Replacement text for replace action
  • flags - Regex flags (i, m, s, etc.)

Example:

<regexp pattern="&lt;title&gt;(.*?)&lt;/title&gt;" action="match">
    <http url="https://example.com"/>
</regexp>
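
The replace action can be sketched the same way. This example turns every line-break tag into a space, with the i flag making the match case-insensitive. The variable rawHtml is hypothetical and is assumed to hold previously fetched markup.

```xml
<!-- &lt; and &gt; stand for < and > inside the attribute value -->
<regexp pattern="&lt;br\s*/?&gt;" action="replace" replacement=" " flags="i">
    <get var="rawHtml"/>
</regexp>
```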

File & Data Processors

<file>

Reads from, writes to, or appends to a file

Attributes:

  • path - File path (required)
  • action - Action (read, write, append)
  • charset - Character encoding
  • createpath - Create the directory path if it does not exist

Example:

<file path="output.xml" action="write" charset="UTF-8">
    <template>
        <![CDATA[
        <results>
            <title>${title}</title>
            <content>${content}</content>
        </results>
        ]]>
    </template>
</file>
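
The other two actions follow the same pattern. The sketch below appends a line to a log file, creating the directory if needed, and reads a second file into a variable. The paths and variable names are placeholders, and createpath is assumed to take true/false.

```xml
<!-- Append a line to a log, creating logs/ first if it is missing -->
<file path="logs/run.log" action="append" charset="UTF-8" createpath="true">
    <template>Fetched ${pageUrl}</template>
</file>

<!-- Read a file back into a variable -->
<def var="seedUrls">
    <file path="seeds.txt" action="read" charset="UTF-8"/>
</def>
```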

<json-to-xml>

Converts JSON data to XML format

Attributes:

  • id - Unique identifier

Example:

<json-to-xml>
    <http url="https://api.example.com/data"/>
</json-to-xml>
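
Once converted, the data can be queried like any other XML. The element names in the converted document depend on the JSON keys and on how the converter maps them, so the //name expression below is an assumption about a response whose objects have a name key.

```xml
<xpath expression="//name">
    <json-to-xml>
        <http url="https://api.example.com/data"/>
    </json-to-xml>
</xpath>
```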

<xml-to-json>

Converts XML data to JSON format

Attributes:

  • id - Unique identifier

Example:

<xml-to-json>
    <get var="xmlData"/>
</xml-to-json>
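
A typical use is exporting results. This sketch writes the converted JSON to a file; the output path is a placeholder and xmlData is assumed to have been defined earlier.

```xml
<file path="products.json" action="write" charset="UTF-8">
    <xml-to-json>
        <get var="xmlData"/>
    </xml-to-json>
</file>
```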

Control Flow Processors

<loop>

Iterates over a collection of items

Attributes:

  • item - Variable name for current item
  • index - Variable name for current index
  • maxloops - Maximum number of iterations

Example:

<loop item="product" index="i" maxloops="10">
    <list>
        <constant>Product 1</constant>
        <constant>Product 2</constant>
    </list>
    <body>
        <def var="productName">
            <get var="product"/>
        </def>
    </body>
</loop>
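
The list does not have to be made of constants. The sketch below iterates over XPath results instead, assuming that <xpath> yields one list item per matching node and that the variable page already holds converted XML. The output file name is a placeholder.

```xml
<loop item="link" index="i" maxloops="5">
    <list>
        <xpath expression="//a/@href">
            <get var="page"/>
        </xpath>
    </list>
    <body>
        <!-- One entry per link, prefixed with the loop index -->
        <file path="links.txt" action="append">
            <template>${i}: ${link}; </template>
        </file>
    </body>
</loop>
```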

<while>

Repeats execution while the condition is true

Attributes:

  • condition - Condition expression (required)
  • maxloops - Maximum number of iterations

Example:

<while condition="${pageUrl.toString().length() != 0}" maxloops="10">
    <def var="content">
        <html-to-xml>
            <http url="${pageUrl}"/>
        </html-to-xml>
    </def>
    <def var="nextPage">
        <xpath expression="//a[@class='next']/@href">
            <get var="content"/>
        </xpath>
    </def>
    <def var="pageUrl">
        <get var="nextPage"/>
    </def>
</while>

<if>

Conditional execution based on expression

Attributes:

  • condition - Condition expression (required)

Example:

<if condition="${price &lt; 100}">
    <then>
        <def var="alert">Price is below threshold!</def>
    </then>
    <else>
        <def var="alert">Price is above threshold.</def>
    </else>
</if>
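
A condition can also guard against missing data. The sketch below supplies a fallback when a variable is empty, reusing the length test from the <while> example. It assumes that the <else> branch may be omitted and that title has already been defined.

```xml
<if condition="${title.toString().length() == 0}">
    <then>
        <!-- Fallback value used only when nothing was extracted -->
        <def var="title">Untitled</def>
    </then>
</if>
```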

Variable Processors

<def>

Defines a new variable with a value

Attributes:

  • var - Variable name (required)

Example:

<def var="productUrl">https://shop.example.com/product/123</def>
<def var="productData">
    <html-to-xml>
        <http url="${productUrl}"/>
    </html-to-xml>
</def>

<get>

Retrieves the value of a variable

Attributes:

  • var - Variable name (required)

Example:

<def var="title">
    <xpath expression="//title">
        <get var="pageContent"/>
    </xpath>
</def>
<file path="title.txt" action="write">
    <get var="title"/>
</file>

<script>

Executes scripting code (BeanShell, JavaScript, Groovy)

Attributes:

  • id - Unique identifier

Example:

<script>
    // BeanShell code
    String fullUrl = sys.fullUrl("https://example.com", "/path");
    return fullUrl;
</script>
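
Because the script returns a value, it can be wrapped in a definition to capture the result. A minimal sketch, assuming the returned value becomes the output of <script>; the variable name today is illustrative and the code uses only standard Java classes.

```xml
<def var="today">
    <script>
        // BeanShell accepts plain Java syntax
        return new java.text.SimpleDateFormat("yyyy-MM-dd").format(new java.util.Date());
    </script>
</def>
```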

Extension Processors

<database>

Database operations (SELECT, INSERT, UPDATE, DELETE)

Attributes:

  • driver - JDBC driver class
  • url - Database connection URL
  • username - Database username
  • password - Database password
  • query - SQL query to execute

Example:

<database 
    driver="com.mysql.cj.jdbc.Driver"
    url="jdbc:mysql://localhost:3306/mydb"
    username="user"
    password="pass"
    query="SELECT * FROM products WHERE price &lt; ?">
    <db-param>100</db-param>
</database>
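
Write operations can be sketched the same way. The example below assumes that <db-param> values are bound to the ? placeholders in order and that a parameter may contain a nested processor. The column names and the variables productName and price are hypothetical.

```xml
<database
    driver="com.mysql.cj.jdbc.Driver"
    url="jdbc:mysql://localhost:3306/mydb"
    username="user"
    password="pass"
    query="INSERT INTO products (name, price) VALUES (?, ?)">
    <!-- Assumption: parameters bind positionally, first to last -->
    <db-param><get var="productName"/></db-param>
    <db-param><get var="price"/></db-param>
</database>
```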

<mail>

Sends email messages

Attributes:

  • to - Recipient email address
  • from - Sender email address
  • subject - Email subject
  • smtphost - SMTP server host
  • smtpport - SMTP server port

Example:

<mail 
    to="admin@example.com"
    from="scraper@example.com"
    subject="Scraping Alert"
    smtphost="smtp.example.com"
    smtpport="587">
    <template>Price alert: ${productName} is now $${price}</template>
</mail>
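
In practice the message is usually sent only when some condition holds. The sketch below wraps the same mail settings in an <if>, assuming price and productName were defined earlier and that <else> may be left out.

```xml
<if condition="${price &lt; 100}">
    <then>
        <mail
            to="admin@example.com"
            from="scraper@example.com"
            subject="Scraping Alert"
            smtphost="smtp.example.com"
            smtpport="587">
            <template>Price alert: ${productName} is now $${price}</template>
        </mail>
    </then>
</if>
```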

<tokenize>

Splits text into tokens using delimiters

Attributes:

  • delimiters - Delimiter characters
  • trimtokens - Trim whitespace from tokens
  • allowemptytokens - Allow empty tokens

Example:

<tokenize delimiters="," trimtokens="true">
    <file path="data.csv" action="read"/>
</tokenize>
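
The tokens can then be fed to a loop. This sketch splits a literal comma-separated record and appends each field to a file. It assumes that <tokenize> yields one list item per token and that allowemptytokens takes true/false; the sample record and file name are illustrative.

```xml
<loop item="field" index="i">
    <list>
        <tokenize delimiters="," trimtokens="true" allowemptytokens="true">
            <constant>widget, ,9.99</constant>
        </tokenize>
    </list>
    <body>
        <!-- Record each field together with its position -->
        <file path="fields.txt" action="append">
            <template>${i}=${field};</template>
        </file>
    </body>
</loop>
```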