Automate data extraction from the terminal
Production-ready command-line interface with a unified configuration system (v2.2.0). Perfect for servers, cron jobs, CI/CD pipelines, and batch processing workflows.
Everything you need for professional command-line web scraping
Execute scraping configurations with a single command. Standard input/output for easy integration with shell scripts and pipelines.
Headless execution perfect for servers and cloud environments. No GUI required, minimal resource footprint.
v2.2.0 introduces a unified settings system: the same configuration works for the CLI and the IDE, eliminating duplicate configs.
Schedule scrapers with cron jobs for periodic data collection. Perfect for monitoring, price tracking, and content updates.
Integrate with Jenkins, GitLab CI, GitHub Actions, and other CI/CD tools. Exit codes and logging for pipeline automation.
Process multiple configurations sequentially or in parallel. Handle large-scale data extraction with ease.
WebHarvest 2.2.0 introduces a unified configuration system that eliminates the need for separate settings between CLI and IDE. Here's what you need to know:
A single properties file (webharvest.properties) now drives both the CLI and the IDE, and each setting can be overridden per run with a CLI flag:
- Proxy: --proxy-host, --proxy-port, --proxy-user
- HTTP: --user-agent, --timeout, --follow-redirects
- Paths: --output-dir, --working-dir
- Logging: --debug, --verbose
Existing users: your old webharvest.properties files will continue to work, and new installations use the unified system by default. No action is required for existing setups.
✅ Backward Compatible - Existing configurations continue to work
🚀 Simplified Workflow - One config system for all tools
Get up and running in 3 simple steps
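Before anything else, verify that Java 11 or newer is installed (the CLI requires it; see the requirements note at the bottom of this page):
# Check for Java 11+
java -version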
Download the CLI JAR file from SourceForge:
# Download CLI
wget https://sourceforge.net/projects/web-harvest/files/webhervest/2.2.0/webharvest-cli-2.2.0.jar/download -O webharvest-cli-2.2.0.jar
# Or using curl
curl -L -o webharvest-cli-2.2.0.jar \
https://sourceforge.net/projects/web-harvest/files/webhervest/2.2.0/webharvest-cli-2.2.0.jar/download
Execute a configuration file:
# Basic usage
java -jar webharvest-cli-2.2.0.jar config.xml
# With options
java -jar webharvest-cli-2.2.0.jar \
--config=scraper.xml \
--output=results.json \
--verbose
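The CLI signals success or failure through its exit code (the same mechanism the CI/CD integrations below rely on), so shell scripts can branch on the result. A minimal sketch, assuming a non-zero exit code on failure:
#!/bin/bash
# Branch on the CLI exit code (assumed non-zero on failure)
if java -jar webharvest-cli-2.2.0.jar --config=scraper.xml --output=results.json; then
    echo "Scrape succeeded"
else
    echo "Scrape failed" >&2
    exit 1
fi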
Quick guide to key workflows

Running and output:
- java -jar webharvest-cli.jar config.xml for basic execution
- --output=file.json to save results

Network options:
- --proxy-host and --proxy-port for HTTP proxy
- --user-agent for custom headers
- --timeout for request limits

Debugging:
- --verbose for detailed logging, --debug for maximum verbosity
- --working-dir for relative paths
- command > results.log 2>&1 to capture all output; append & to run in the background
- 2> error.log to capture errors separately

Post-processing:
- jq for JSON processing
- xq for XML to JSON conversion
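Several of these options combine naturally; for example, a long-running scrape in the background with results and errors in separate logs:
# Background run; results and errors in separate logs
java -jar webharvest-cli.jar config.xml --verbose > results.log 2> error.log &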
Full CLI reference
java -jar webharvest-cli-2.2.0.jar [OPTIONS] <config.xml>
OPTIONS:
-c, --config=FILE Path to XML configuration file (may also be given as the positional argument)
-o, --output=FILE Output file for results (default: stdout)
-f, --format=FORMAT Output format: xml, json, text (default: xml)
-v, --verbose Enable verbose logging
-q, --quiet Suppress all output except errors
-D, --define=VAR=VALUE Define configuration variable
--working-dir=DIR Set working directory
--timeout=SECONDS Execution timeout (default: no limit)
--retries=NUM Number of retries for HTTP requests (default: 3)
--proxy=HOST:PORT HTTP proxy server
--user-agent=STRING Custom User-Agent header
-h, --help Show this help message
--version Show version information
EXAMPLES:
# Basic execution
java -jar webharvest-cli.jar scraper.xml
# With output file
java -jar webharvest-cli.jar -c scraper.xml -o results.json
# Define variables
java -jar webharvest-cli.jar -c scraper.xml -D baseUrl=https://example.com
# Verbose mode
java -jar webharvest-cli.jar -c scraper.xml --verbose
# Use proxy
java -jar webharvest-cli.jar -c scraper.xml --proxy=proxy.example.com:8080
How the CLI works under the hood
Flags such as --proxy-host, --user-agent, and --timeout change behavior for the current run only; use webharvest.properties for persistent settings that should apply to every run.
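For a concrete (if hypothetical) picture, a persistent settings file might look like the snippet below; the property key names are assumptions for illustration, not the documented schema:
# Illustrative only: key names below are assumed, not taken from the WebHarvest docs
cat > webharvest.properties <<'EOF'
proxy.host=proxy.example.com
proxy.port=8080
user.agent=WebHarvest/2.2.0
timeout=30
EOF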
Real-world CLI automation scenarios
Run scrapers on a schedule with cron:
# Run every day at 2 AM
0 2 * * * java -jar /path/to/webharvest-cli.jar \
/path/to/scraper.xml >> /var/log/scraper.log 2>&1
# Run every hour
0 * * * * java -jar webharvest-cli.jar daily-check.xml
# Run every Monday at 9 AM
0 9 * * 1 java -jar webharvest-cli.jar weekly-report.xml
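Cron runs with a minimal environment and PATH, so a small wrapper script with absolute paths keeps jobs reliable (the paths below are placeholders):
#!/bin/bash
# /usr/local/bin/run-scraper.sh - cron wrapper; absolute paths because cron's PATH is minimal
/usr/bin/java -jar /path/to/webharvest-cli.jar /path/to/scraper.xml \
    >> /var/log/scraper.log 2>&1
The crontab entry then reduces to: 0 2 * * * /usr/local/bin/run-scraper.sh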
Integrate with GitHub Actions:
name: Data Scraping
on:
schedule:
- cron: '0 0 * * *' # Daily at midnight
workflow_dispatch:
jobs:
scrape:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-java@v3
with:
java-version: '11'
- name: Run Scraper
run: |
java -jar webharvest-cli-2.2.0.jar \
--config=scraper.xml \
--output=results.json
- name: Upload Results
uses: actions/upload-artifact@v3
with:
name: scraping-results
path: results.json
Process multiple sites:
#!/bin/bash
# Process multiple configurations
for config in configs/*.xml; do
echo "Processing $config..."
java -jar webharvest-cli.jar \
--config="$config" \
--output="results/$(basename $config .xml).json"
done
echo "Batch processing complete!"
Integrate with data processing:
#!/bin/bash
# Scrape → Transform → Load pipeline
# 1. Scrape data
java -jar webharvest-cli.jar scraper.xml -o raw-data.xml
# 2. Transform: XML to JSON with xq, then JSON to CSV rows with jq
#    (field names depend on your scraper's output; .name/.value are placeholders)
xq -c '.data[]' raw-data.xml | jq -r '[.name, .value] | @csv' > processed.csv
# 3. Load the CSV into the database
psql -d mydb -c "COPY data FROM STDIN WITH CSV" < processed.csv
echo "Pipeline complete!"
Common issues and solutions
Problem: Unsupported class file major version
Solution:
# Check Java version
java -version # Should be 11+
# Install Java 11+ if needed
# Ubuntu/Debian:
sudo apt install openjdk-11-jdk
# macOS (Homebrew):
brew install openjdk@11
Problem: Configuration file not found
Solution: check the file name and the directory you run from; the configuration path resolves against the current working directory.
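Two ways to make the path unambiguous, using the --working-dir option from the reference above:
# Pass an absolute path to the configuration
java -jar webharvest-cli.jar /absolute/path/to/config.xml
# Or anchor relative paths explicitly
java -jar webharvest-cli.jar --working-dir=/path/to/project config.xml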
Problem: Connection timeouts or proxy errors
Solution:
# Use proxy
java -jar webharvest-cli.jar config.xml \
--proxy-host=proxy.example.com \
--proxy-port=8080
# Increase timeout to 60 seconds
java -jar webharvest-cli.jar config.xml \
--timeout=60
Problem: OutOfMemoryError or slow performance
Solution:
# Increase heap size
java -Xmx1g -jar webharvest-cli-2.2.0.jar config.xml
# Or configure in script
export JAVA_OPTS="-Xmx1g -Xms256m"
java $JAVA_OPTS -jar webharvest-cli.jar config.xml
Problem: No output or wrong format
Solution:
- Run with --verbose to see execution details
- Check the exit code with echo $? after execution
- Set the format explicitly with --format=json
- Write results to a file with --output=results.json
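A quick diagnostic that applies these checks in one run:
# Force JSON output to a file, log verbosely, then inspect the exit code
java -jar webharvest-cli.jar -c scraper.xml --format=json --output=results.json --verbose
echo "Exit code: $?"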
Problem: Cron jobs or scripts not working
Solution: use absolute paths for java, the JAR, and the configuration file (cron runs with a minimal environment), and redirect output with > logfile.log 2>&1 so failures end up in a log you can inspect.
Download WebHarvest CLI and experience professional command-line web scraping
Java 11+ Required • BSD License • 15+ MB Download