Getting Started with WebHarvest

Your first web scraper in 5 minutes

Learn the basics of WebHarvest through a simple, practical example. No complex setup is required: just download, configure, and scrape.

Before You Begin

What you'll need to get started

Java 11+

Java 11 or higher installed on your system

WebHarvest v2.2.0

Latest release with all dependencies included

Text Editor

Any XML editor, or the Web IDE

For example VS Code, IntelliJ, or another IDE

1. Installation

Download and set up WebHarvest

Quick Install

Terminal
# Download webharvest-2.2.0.zip from SourceForge, then extract it
unzip webharvest-2.2.0.zip

# Navigate to CLI directory
cd webharvest-cli

# Verify installation
java -jar webharvest-cli-2.2.0.jar --version
# Output: WebHarvest v2.2.0

Tip: Add WebHarvest to your PATH or create an alias:
alias webharvest='java -jar /path/to/webharvest-cli-2.2.0.jar'

2. Create Your First Scraper

A simple example that extracts data from a webpage

Create my-first-scraper.xml

This example fetches a webpage, extracts the title and first paragraph, and saves the results to a file.

my-first-scraper.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">

    <!-- Step 1: Fetch the webpage -->
    <def var="html">
        <http url="https://example.com"/>
    </def>

    <!-- Step 2: Convert HTML to XML for parsing -->
    <def var="page">
        <html-to-xml>
            <get var="html"/>
        </html-to-xml>
    </def>

    <!-- Step 3: Extract the title -->
    <def var="title">
        <xpath expression="//title/text()">
            <get var="page"/>
        </xpath>
    </def>

    <!-- Step 4: Extract first paragraph -->
    <def var="firstPara">
        <xpath expression="(//p)[1]/text()">
            <get var="page"/>
        </xpath>
    </def>

    <!-- Step 5: Save results to file -->
    <file action="write" path="results.txt">
        Title: ${title}
        
        First Paragraph: ${firstPara}
    </file>

    <!-- Step 6: Print to console -->
    <template>
        ✅ Scraping complete!
        Title: ${title}
    </template>

</config>

3. Run Your Scraper

Execute the configuration and see the results

Execute

Terminal
# Run the scraper
java -jar webharvest-cli-2.2.0.jar my-first-scraper.xml

# Expected output:
# ✅ Scraping complete!
# Title: Example Domain

# Check the results file
cat results.txt

# Output:
# Title: Example Domain
# 
# First Paragraph: This domain is for use in illustrative examples...

You've just created and executed your first web scraper! The results are saved in results.txt.

Understanding the Code

Breaking down each component

HTTP Plugin

<http url="https://example.com"/>

Fetches the webpage content. Supports headers, authentication, cookies, POST data, and more.
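
As a rough sketch of two of the options listed above, the configuration below sends a POST request with a custom header and one form field. The method attribute and the <http-header> and <http-param> child elements are assumptions carried over from the classic Web-Harvest 2.x syntax, and the URL, header value, and field name are made up for illustration. Confirm the exact names against the HTTP plugin documentation for your release before relying on them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">

    <!-- ASSUMPTION: method, http-header and http-param follow the classic
         Web-Harvest 2.x syntax; the URL and form field are hypothetical -->
    <def var="searchResults">
        <http url="https://example.com/search" method="post">
            <!-- Custom request header -->
            <http-header name="User-Agent">my-first-scraper/1.0</http-header>
            <!-- Form field sent as POST data -->
            <http-param name="q">web scraping</http-param>
        </http>
    </def>

</config>
```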

HTML-to-XML Plugin

<html-to-xml>...</html-to-xml>

Converts messy, real-world HTML into clean, well-formed XML. XPath queries run against this XML, so the conversion has to happen before any <xpath> step.
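
When an XPath query comes back empty, it often helps to inspect the converted XML rather than the original HTML. The sketch below reuses only elements that appear in my-first-scraper.xml and writes the converted page to disk. The file name page.xml is arbitrary, and placing <get> directly inside <file> is assumed to work the same way it does inside <xpath>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">

    <!-- Fetch the raw HTML -->
    <def var="html">
        <http url="https://example.com"/>
    </def>

    <!-- Convert it to well-formed XML -->
    <def var="page">
        <html-to-xml>
            <get var="html"/>
        </html-to-xml>
    </def>

    <!-- Save the converted XML so its structure can be inspected
         (page.xml is an arbitrary name chosen for this sketch) -->
    <file action="write" path="page.xml">
        <get var="page"/>
    </file>

</config>
```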

XPath Plugin

<xpath expression="//title/text()">...</xpath>

Extracts data using XPath queries. End the expression with /text() to get an element's text content, or use a plain node path to select the elements themselves.
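
The queries below illustrate a few common shapes of XPath expression, using the same fetch-and-convert steps as my-first-scraper.xml. The variable names are arbitrary, and this sketch does not show how element or attribute results are rendered when written out, so treat it as a starting point for experiments rather than a reference.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">

    <def var="html">
        <http url="https://example.com"/>
    </def>

    <def var="page">
        <html-to-xml>
            <get var="html"/>
        </html-to-xml>
    </def>

    <!-- Text content of every h1 element -->
    <def var="headingText">
        <xpath expression="//h1/text()">
            <get var="page"/>
        </xpath>
    </def>

    <!-- The h1 elements themselves (a node path without /text()) -->
    <def var="headingNodes">
        <xpath expression="//h1">
            <get var="page"/>
        </xpath>
    </def>

    <!-- Attribute values: the href of every link -->
    <def var="links">
        <xpath expression="//a/@href">
            <get var="page"/>
        </xpath>
    </def>

    <!-- (//p)[1] is the first p in the whole document, whereas //p[1]
         matches every p that is the first p child of its parent -->
    <def var="firstPara">
        <xpath expression="(//p)[1]/text()">
            <get var="page"/>
        </xpath>
    </def>

</config>
```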

Variables

<def var="title">...</def>

Store and reuse data. Read a variable with <get var="varName"/> inside other elements, or with ${varName} syntax in templates.
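
A small sketch of both ways of reading a variable: one value is defined once and then read with <get> and with ${...}. The names siteName and label and the literal value are illustrative, and defining a variable from plain text is an assumption based on how <def> wraps other content in my-first-scraper.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">

    <!-- Define the value once (ASSUMPTION: plain text is a valid def body) -->
    <def var="siteName">Example Domain</def>

    <!-- Read it with get, here to build a second variable -->
    <def var="label">
        <get var="siteName"/>
    </def>

    <!-- Read both with ${...} inside a template -->
    <template>
        Site: ${siteName}
        Label: ${label}
    </template>

</config>
```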

Next Steps

Take your scraping to the next level

Try More Examples

Explore our library of ready-to-use configurations for common scraping scenarios.

Explore Core Plugins

Learn about all 46 built-in plugins for HTTP, parsing, control flow, and more.

Use the Web IDE

Develop, test, and debug scrapers in your browser with real-time feedback.

Advanced Topics

Once you're comfortable with basics, explore advanced features:

Full Documentation, Architecture, Create Custom Plugins, Extension Modules

Need Help?

Join our community, report bugs, or request features on SourceForge.
