Extract the web. Automate data harvesting.
A production-ready framework with 57 plugins (47 built-in core plugins and 10 optional extensions), a modern plugin architecture, session management, and a professional web IDE.
Develop, test, and debug your scrapers directly in the browser
VS Code's powerful editor with XML syntax highlighting and auto-completion
WebSocket-based live streaming of logs, progress, and results
Per-tab logs, results, and session tracking (v2.2)
Track duration, tokens, and performance in real time
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- Fetch HTML -->
    <def var="html">
        <http url="https://example.com"/>
    </def>

    <!-- Parse to XML -->
    <def var="page">
        <html-to-xml>
            <get var="html"/>
        </html-to-xml>
    </def>

    <!-- Extract Title -->
    <def var="title">
        <xpath expression="//title/text()">
            <get var="page"/>
        </xpath>
    </def>
</config>
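The three steps in the example above can also be folded into one expression. The following is a minimal sketch, assuming that processors can be nested so that each one consumes the result of its body; it uses only the elements already shown (`def`, `http`, `html-to-xml`, `xpath`) and reuses the variable name `title`.

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- Fetch, parse, and extract in a single nested expression.
         Assumption: each processor takes the output of its body as input. -->
    <def var="title">
        <xpath expression="//title/text()">
            <html-to-xml>
                <http url="https://example.com"/>
            </html-to-xml>
        </xpath>
    </def>
</config>
```

The step-by-step form keeps the intermediate `html` and `page` values available for debugging, while the nested form is shorter when only the final value matters.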
Production-ready tools for professional web scraping
HTML/XML parsing with XPath, XQuery, and CSS selectors. Multi-strategy parsing for maximum compatibility.
Full-featured HTTP client with connection pooling, cookies, authentication, and advanced headers.
Modern plugin architecture with dependency injection, auto-discovery, and 47 built-in plugins.
Built-in session management, metrics, token tracking, and performance monitoring for production use.
The CLI and GUI share a single configuration, with modern -option syntax, auto-discovery, and overrides from the command line.
The most commonly used processors, all included in webharvest-core
Fetch web pages & APIs
Parse HTML content
Extract data from XML
Parse JSON data
Define variables
Retrieve variables
Generate output
Iterate collections
Advanced XML queries
Pattern matching
JavaScript & Groovy
Read & write files
Plus 35 more: if, while, xslt, xml-to-json, tokenize, case, try-catch, function, include, and more
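Several of these processors are typically combined in one configuration. The sketch below collects every link on a page and appends each one to a text file. It is illustrative only: the `loop`, `list`, `body`, `file`, and `template` elements, their attributes (`item`, `action`, `path`), and the `${sys.lf}` newline helper are assumptions based on the classic Web-Harvest 2.x syntax and should be checked against the v2.2 processor reference. The file name `links.txt` is hypothetical.

```xml
<config xmlns="http://org.webharvest/schema/2.1/core">
    <!-- Collect the href attribute of every anchor on the page -->
    <def var="links">
        <xpath expression="//a/@href">
            <html-to-xml>
                <http url="https://example.com"/>
            </html-to-xml>
        </xpath>
    </def>

    <!-- Iterate over the collected links (assumed loop/list/body syntax) -->
    <loop item="link">
        <list>
            <get var="links"/>
        </list>
        <body>
            <!-- Append one link per line to a hypothetical output file -->
            <file action="append" path="links.txt">
                <template>${link}${sys.lf}</template>
            </file>
        </body>
    </loop>
</config>
```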
View All 47 Core Processors
10 External Plugins
Download WebHarvest v2.2.0 and automate your data extraction workflows
Java 11+ • Apache License 2.0 • Production Ready