Advanced Web Scraping & Data Extraction

Web-Harvest 2.1.0 is a powerful, enterprise-grade web scraping and data extraction tool with advanced HTML parsing, XQuery support, and a comprehensive processor ecosystem.

50+ Processors
100% Java
2.1.0 Latest Version
web-scraper.xml
<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="products">
    <html-to-xml advancedxmlescape="true">
      <http url="https://shop.example.com/products"/>
    </html-to-xml>
  </def>
  
  <loop item="product">
    <list>
      <xpath expression="//div[@class='product']">
        <get var="products"/>
      </xpath>
    </list>
    <body>
      <def var="name">
        <xpath expression=".//h3">
          <get var="product"/>
        </xpath>
      </def>
      <def var="price">
        <xpath expression=".//span[@class='price']">
          <get var="product"/>
        </xpath>
      </def>
    </body>
  </loop>
</config>

Powerful Features

Everything you need for professional web scraping and data extraction

Advanced HTML Parsing

Multi-strategy HTML parsing with JSoup, TagSoup, and Apache Tika for maximum compatibility and reliability.

XQuery & XPath Support

Powerful querying capabilities with full XQuery 3.1 and XPath 3.1 support for complex data extraction.

50+ Processors

Comprehensive set of processors for HTTP requests, file operations, database connections, and more.

Error Handling

Robust error handling with try-catch blocks, retry mechanisms, and comprehensive logging.
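
The `<try>` processor pairs a `<body>` with a `<catch>` branch that runs only when the body fails. A minimal sketch (the URL and variable name are illustrative):

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="page">
    <try>
      <body>
        <!-- Attempt the request; any failure transfers control to <catch> -->
        <http url="https://shop.example.com/products"/>
      </body>
      <catch>
        <!-- Fall back to an empty result instead of aborting the run -->
        <empty/>
      </catch>
    </try>
  </def>
</config>
```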

High Performance

Optimized for speed and memory efficiency with streaming processing and parallel execution.

Extensible Architecture

Plugin-based architecture allows easy extension with custom processors and functionality.

Processor Categories

Comprehensive set of processors organized by functionality

Web & HTTP

  • <http> - HTTP requests
  • <http-header> - HTTP headers
  • <http-param> - HTTP parameters
  • <html-to-xml> - HTML parsing
  • <web-browser> - Headless browser
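
The HTTP processors compose: `<http-header>` and `<http-param>` are nested inside `<http>` to customize the request. A sketch with illustrative values:

```xml
<!-- POST request with a custom header and a form parameter -->
<http url="https://shop.example.com/search" method="post">
  <!-- Sent as a request header -->
  <http-header name="User-Agent">Web-Harvest/2.1</http-header>
  <!-- Sent in the POST body -->
  <http-param name="q">laptops</http-param>
</http>
```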

Data Processing

  • <xpath> - XPath extraction
  • <xquery> - XQuery processing
  • <regexp> - Regular expressions
  • <json-to-xml> - JSON conversion
  • <xml-to-json> - XML conversion
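
For plain-text sources, `<regexp>` applies a pattern to a source and, by default, returns the matched text. A small sketch (the source string is illustrative):

```xml
<!-- Extract a decimal price such as "19.99" from a text fragment -->
<regexp>
  <regexp-pattern>\d+\.\d{2}</regexp-pattern>
  <regexp-source>Price: 19.99 EUR</regexp-source>
</regexp>
```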

File Operations

  • <file> - File read/write
  • <include> - Include files
  • <template> - Template processing
  • <text> - Text processing
  • <xml> - XML processing
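
The `<file>` processor reads or writes depending on its `action` attribute; on write, its body supplies the content. A sketch assuming a previously defined `products` variable:

```xml
<!-- Persist the value of an earlier variable to disk -->
<file action="write" path="output/results.xml" charset="UTF-8">
  <get var="products"/>
</file>
```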

Control Flow

  • <loop> - Loops
  • <while> - While loops
  • <if> - Conditional execution
  • <try> - Error handling
  • <case> - Switch statements
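
`<case>` evaluates `<if condition="...">` branches in order and falls through to `<else>`. A sketch with illustrative variable names and values:

```xml
<!-- Label a product based on a previously scraped "stock" variable -->
<def var="label">
  <case>
    <if condition="${stock.toString() == '0'}">out of stock</if>
    <else>available</else>
  </case>
</def>
```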

Variables

  • <def var> - Variable definition
  • <get var> - Variable access
  • <set var> - Variable assignment
  • <list> - List creation
  • <script> - Script execution
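
`<def var>` introduces a variable, `<set var>` overwrites it, and `${...}` expressions (evaluated by `<template>`) read it back. A minimal sketch:

```xml
<!-- Define, then overwrite, then read a variable -->
<def var="greeting">Hello</def>
<set var="greeting">Hello, Web-Harvest</set>
<template>${greeting}</template>
```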

Extensions

  • <database> - Database operations
  • <mail> - Email sending
  • <ftp> - FTP operations
  • <zip> - Archive operations
  • <tokenize> - Text tokenization

Example Configurations

Real-world examples demonstrating Web-Harvest capabilities

E-commerce Monitoring

Business

Monitor product prices and availability across multiple e-commerce sites.

<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="productUrl">https://shop.example.com/product/123</def>
  <def var="productData">
    <html-to-xml advancedxmlescape="true">
      <http url="${productUrl}"/>
    </html-to-xml>
  </def>
  <def var="currentPrice">
    <xpath expression="//span[@class='price']">
      <get var="productData"/>
    </xpath>
  </def>
</config>

Social Media Analytics

Analytics

Extract and analyze social media posts for sentiment analysis.

<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="socialData">
    <!-- '#' must be percent-encoded as %23 inside a query string -->
    <http url="https://api.social.com/posts?hashtag=%23technology"/>
  </def>
  <def var="postsXml">
    <json-to-xml>
      <get var="socialData"/>
    </json-to-xml>
  </def>
  <def var="sentimentAnalysis">
    <xquery>
      <xq-param name="posts">
        <get var="postsXml"/>
      </xq-param>
      <xq-expression>
        <![CDATA[
        declare variable $posts as node() external;
        let $positive := count($posts//post[sentiment = "positive"])
        let $negative := count($posts//post[sentiment = "negative"])
        return <sentiment_analysis>
          <positive_posts>{$positive}</positive_posts>
          <negative_posts>{$negative}</negative_posts>
        </sentiment_analysis>
        ]]>
      </xq-expression>
    </xquery>
  </def>
</config>

Data Pipeline

Processing

Complete data processing pipeline with validation and transformation.

<config xmlns="http://org.webharvest/schema/2.1/core">
  <def var="rawData">
    <file path="input.csv" action="read"/>
  </def>
  <def var="processedData">
    <loop item="row">
      <list>
        <!-- Split the raw CSV into rows on newlines (&#10;) -->
        <tokenize delimiters="&#10;">
          <get var="rawData"/>
        </tokenize>
      </list>
      <body>
        <def var="validatedRow">
          <script><![CDATA[
            // Keep rows that have at least three comma-separated fields
            fields = row.toString().split(",");
            fields.length >= 3 ? row : "";
          ]]></script>
        </def>
      </body>
    </loop>
  </def>
</config>

API Documentation

Complete API reference for developers

Getting Started

Web-Harvest uses XML-based configuration files to define scraping workflows. Each configuration consists of processors that perform specific tasks.

Basic Configuration Structure

<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core"
        charset="UTF-8"
        scriptlang="beanshell">
  <!-- Your processors here -->
</config>

Configuration Attributes

Attribute   Description                                         Default
charset     Character encoding for the configuration            UTF-8
scriptlang  Scripting language (beanshell, javascript, groovy)  beanshell

Download & Installation

Get started with Web-Harvest in minutes

Command Line Interface

Run Web-Harvest from the command line for automation and scripting.

# Download and run
java -jar webharvest-cli-2.1.0.jar config=examples/simple_test.xml

# With custom output
java -jar webharvest-cli-2.1.0.jar config=my-config.xml output=results.xml

Graphical Interface

Use the intuitive GUI for visual configuration and testing.

# Run the IDE
java -jar webharvest-ide-2.1.0.jar

Installation Steps

  1. Install Java 8 or higher if not already installed
  2. Download the appropriate JAR file (CLI or IDE) for your needs
  3. Run the JAR file from the command line, or double-click the IDE JAR
  4. Define your scraping workflow in an XML configuration file
  5. Execute and monitor your scraping jobs