Web-Harvest 2.1.0 is a powerful, enterprise-grade web scraping and data extraction tool with advanced HTML parsing, XQuery support, and a comprehensive processor ecosystem.
<config xmlns="http://org.webharvest/schema/2.1/core">
<def var="products">
<html-to-xml advancedxmlescape="true">
<http url="https://shop.example.com/products"/>
</html-to-xml>
</def>
<loop item="product">
<xpath expression="//div[@class='product']">
<get var="products"/>
</xpath>
<body>
<def var="name">
<xpath expression=".//h3">
<get var="product"/>
</xpath>
</def>
<def var="price">
<xpath expression=".//span[@class='price']">
<get var="product"/>
</xpath>
</def>
</body>
</loop>
</config>
Everything you need for professional web scraping and data extraction
Multi-strategy HTML parsing with JSoup, TagSoup, and Apache Tika for maximum compatibility and reliability.
Powerful querying capabilities with full XQuery 3.1 and XPath 3.1 support for complex data extraction.
Comprehensive set of processors for HTTP requests, file operations, database connections, and more.
Robust error handling with try-catch blocks, retry mechanisms, and comprehensive logging.
Optimized for speed and memory efficiency with streaming processing and parallel execution.
Plugin-based architecture allows easy extension with custom processors and functionality.
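The error-handling feature can be sketched as follows. This is a minimal illustration, not confirmed 2.1.0 syntax: it assumes the `<try>` processor takes `<body>` and `<catch>` children, as it did in Web-Harvest 2.0, and the fallback document in the catch branch is a hypothetical placeholder.

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="page">
    <try>
      <!-- Assumed child element: the work that may fail (network error, bad HTML) -->
      <body>
        <html-to-xml advancedxmlescape="true">
          <http url="https://shop.example.com/products"/>
        </html-to-xml>
      </body>
      <!-- Assumed child element: value used instead when the body fails -->
      <catch>
        <![CDATA[<html/>]]>
      </catch>
    </try>
  </def>
</config>
```

With an empty fallback document, later `<xpath>` steps over `page` would simply return no matches instead of aborting the run.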
Comprehensive set of processors organized by functionality
- <http>: HTTP requests
- <http-header>: HTTP headers
- <http-param>: HTTP parameters
- <html-to-xml>: HTML parsing
- <web-browser>: Headless browser
- <xpath>: XPath extraction
- <xquery>: XQuery processing
- <regexp>: Regular expressions
- <json-to-xml>: JSON conversion
- <xml-to-json>: XML conversion
- <file>: File read/write
- <include>: Include files
- <template>: Template processing
- <text>: Text processing
- <xml>: XML processing
- <loop>: Loops
- <while>: While loops
- <if>: Conditional execution
- <try>: Error handling
- <case>: Switch statements
- <def var>: Variable definition
- <get var>: Variable access
- <set var>: Variable assignment
- <list>: List creation
- <script>: Script execution
- <database>: Database operations
- <mail>: Email sending
- <ftp>: FTP operations
- <zip>: Archive operations
- <tokenize>: Text tokenization

Real-world examples demonstrating Web-Harvest capabilities
Monitor product prices and availability across multiple e-commerce sites.
<config xmlns="http://org.webharvest/schema/2.1/core">
<def var="productUrl">https://shop.example.com/product/123</def>
<def var="productData">
<html-to-xml advancedxmlescape="true">
<http url="${productUrl}"/>
</html-to-xml>
</def>
<def var="currentPrice">
<xpath expression="//span[@class='price']">
<get var="productData"/>
</xpath>
</def>
</config>
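To monitor several sites rather than a single page, the same steps can be wrapped in a loop. The sketch below reuses only constructs that appear in the examples on this page (`<loop>`, `<tokenize delimiters=",">`, `${...}` interpolation); the second product URL and the variable name `productUrls` are hypothetical, and the loop form mirrors the one used in the first example rather than a verified 2.1.0 schema.

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
  <!-- Hypothetical comma-separated list of product pages -->
  <def var="productUrls">https://shop.example.com/product/123,https://shop.example.com/product/456</def>
  <loop item="productUrl">
    <!-- Split the list into one URL per iteration -->
    <tokenize delimiters=",">
      <get var="productUrls"/>
    </tokenize>
    <body>
      <!-- Fetch and normalize the current product page -->
      <def var="productData">
        <html-to-xml advancedxmlescape="true">
          <http url="${productUrl}"/>
        </html-to-xml>
      </def>
      <!-- Extract the price element from the current page -->
      <def var="currentPrice">
        <xpath expression="//span[@class='price']">
          <get var="productData"/>
        </xpath>
      </def>
    </body>
  </loop>
</config>
```

Different sites will usually need different XPath expressions, so a real configuration would likely pair each URL with its own selector.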
Fetch social media posts from a JSON API and count them by sentiment label.
<config xmlns="http://org.webharvest/schema/2.1/core">
<def var="socialData">
<http url="https://api.social.com/posts?hashtag=%23technology"/>
</def>
<def var="postsXml">
<json-to-xml>
<get var="socialData"/>
</json-to-xml>
</def>
<def var="sentimentAnalysis">
<xquery>
<xq-param name="posts">
<get var="postsXml"/>
</xq-param>
<xq-expression>
<![CDATA[
declare variable $posts as node() external;
let $positive := count($posts//post[sentiment = "positive"])
let $negative := count($posts//post[sentiment = "negative"])
return <sentiment_analysis>
<positive_posts>{$positive}</positive_posts>
<negative_posts>{$negative}</negative_posts>
</sentiment_analysis>
]]>
</xq-expression>
</xquery>
</def>
</config>
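A `#` written directly into a URL starts the fragment, which is not sent to the server, so a hashtag in a query value has to be percent-encoded as `%23`. An alternative sketch passes the value through the `<http-param>` processor listed above. It assumes that `<http-param>` takes a `name` attribute, uses its element body as the value, and encodes that value for the request, as in Web-Harvest 2.0; none of this is confirmed here for 2.1.0.

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="socialData">
    <http url="https://api.social.com/posts">
      <!-- Assumed syntax: name attribute, element body as the parameter value -->
      <http-param name="hashtag">#technology</http-param>
    </http>
  </def>
</config>
```

Keeping parameters out of the URL string also makes it easier to fill them from variables later.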
Data processing pipeline that reads a CSV file and validates each row.
<config xmlns="http://org.webharvest/schema/2.1/core">
<def var="rawData">
<file path="input.csv" action="read"/>
</def>
<def var="processedData">
<loop item="row">
<tokenize delimiters="&#10;">
<get var="rawData"/>
</tokenize>
<body>
<def var="validatedRow">
<script>
// Data validation logic
if (row.length() >= 3) {
return row;
}
</script>
</def>
</body>
</loop>
</def>
</config>
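A pipeline like this usually ends by persisting its result. The sketch below is a minimal illustration that assumes the `<file>` processor accepts `action="write"` and stores its body content at the given path, as in Web-Harvest 2.0 (the processor list on this page only describes it as "File read/write"). The output path `output.csv` is hypothetical, and `rawData` stands in for whatever variable holds the processed rows.

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
  <!-- Read the source file, as in the example above -->
  <def var="rawData">
    <file path="input.csv" action="read"/>
  </def>
  <!-- Assumed: action="write" saves the body content to the given path -->
  <file path="output.csv" action="write">
    <get var="rawData"/>
  </file>
</config>
```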
Complete API reference for developers
Web-Harvest uses XML-based configuration files to define scraping workflows. Each configuration consists of processors that perform specific tasks.
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core"
charset="UTF-8"
scriptlang="beanshell">
<!-- Your processors here -->
</config>
Attribute | Description | Default |
---|---|---|
charset | Character encoding for the configuration | UTF-8 |
scriptlang | Scripting language (beanshell, javascript, groovy) | beanshell |
Get started with Web-Harvest in minutes