Your first web scraper in 5 minutes
Learn the basics of WebHarvest through a simple, practical example. No complex setup is required: just download, configure, and scrape.
What you'll need to get started
Java 11 or higher installed on your system
Latest release with all dependencies included
Any XML editor or use the Web IDE
Download and set up WebHarvest
# Download from SourceForge
# Extract the ZIP file
unzip webharvest-2.2.0.zip
# Navigate to CLI directory
cd webharvest-cli
# Verify installation
java -jar webharvest-cli-2.2.0.jar --version
# Output: WebHarvest v2.2.0
Tip: Add WebHarvest to your PATH or create an alias:
alias webharvest='java -jar /path/to/webharvest-cli-2.2.0.jar'
A simple example that extracts data from a webpage
my-first-scraper.xml
This example fetches a webpage, extracts the title and first paragraph, and saves the results to a file.
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">

    <!-- Step 1: Fetch the webpage -->
    <def var="html">
        <http url="https://example.com"/>
    </def>

    <!-- Step 2: Convert HTML to XML for parsing -->
    <def var="page">
        <html-to-xml>
            <get var="html"/>
        </html-to-xml>
    </def>

    <!-- Step 3: Extract the title -->
    <def var="title">
        <xpath expression="//title/text()">
            <get var="page"/>
        </xpath>
    </def>

    <!-- Step 4: Extract the first paragraph -->
    <def var="firstPara">
        <xpath expression="//p[1]/text()">
            <get var="page"/>
        </xpath>
    </def>

    <!-- Step 5: Save results to a file -->
    <file action="write" path="results.txt">
        Title: ${title}
        First Paragraph: ${firstPara}
    </file>

    <!-- Step 6: Print to the console -->
    <template>
        ✅ Scraping complete!
        Title: ${title}
    </template>

</config>
Execute the configuration and see the results
# Run the scraper
java -jar webharvest-cli-2.2.0.jar my-first-scraper.xml
# Expected output:
# ✅ Scraping complete!
# Title: Example Domain
# Check the results file
cat results.txt
# Output:
# Title: Example Domain
#
# First Paragraph: This domain is for use in illustrative examples...
You've just created and executed your first web scraper! The results are saved in results.txt.
Breaking down each component
<http url="https://example.com"/>
Fetches the webpage content. Supports headers, authentication, cookies, POST data, and more.
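For POST requests with form data, the plugin takes additional configuration. The fragment below is a hedged sketch: the method attribute and http-param child element follow older WebHarvest conventions and are assumptions here, so verify the exact names against the HTTP plugin documentation before relying on them:

```xml
<!-- Hypothetical POST request; attribute and element names are assumptions -->
<def var="html">
    <http method="post" url="https://example.com/search">
        <http-param name="q">webharvest</http-param>
    </http>
</def>
```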
HTTP Plugin Docs
<html-to-xml>...</html-to-xml>
Converts messy HTML into clean, parseable XML. Essential for XPath queries.
HTML-to-XML Docs
<xpath expression="//title/text()">...</xpath>
Extracts data using XPath queries. Use /text() for text content, or node paths for elements.
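A few more expressions that work against the page variable from the example above (the //h1 query assumes the page has a heading):

```xml
<!-- Text content of the first heading -->
<xpath expression="//h1/text()">
    <get var="page"/>
</xpath>

<!-- The entire first paragraph element, tags included -->
<xpath expression="//p[1]">
    <get var="page"/>
</xpath>
```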
<def var="title">...</def>
Store and reuse data. Access with ${varName} syntax in templates.
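Defined variables compose freely. The sketch below reuses the title variable from the main example; nesting a template inside a def is an assumption inferred from how the processors above nest, so confirm it against the docs:

```xml
<!-- Build a derived value from an earlier variable (nesting is an assumption) -->
<def var="summary">
    <template>Title was: ${title}</template>
</def>

<file action="write" path="summary.txt">
    ${summary}
</file>
```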
Take your scraping to the next level
Explore our library of ready-to-use configurations for common scraping scenarios.
Learn about all 46 built-in plugins for HTTP, parsing, control flow, and more.
Develop, test, and debug scrapers in your browser with real-time feedback.
Once you're comfortable with the basics, explore the advanced features.
Join our community, report bugs, or request features on SourceForge.