Automate data extraction from the terminal
Production-ready command-line interface with a unified configuration system (v2.2.0). Perfect for servers, cron jobs, CI/CD pipelines, and batch processing workflows.
Everything you need for professional command-line web scraping
Execute scraping configurations with a single command. Standard input/output for easy integration with shell scripts and pipelines.
Headless execution perfect for servers and cloud environments. No GUI required, minimal resource footprint.
v2.2.0 introduces a unified settings system: the same configuration works for the CLI and the IDE, eliminating duplicate configs.
Schedule scrapers with cron jobs for periodic data collection. Perfect for monitoring, price tracking, and content updates.
Integrate with Jenkins, GitLab CI, GitHub Actions, and other CI/CD tools. Exit codes and logging for pipeline automation.
Process multiple configurations sequentially or in parallel. Handle large-scale data extraction with ease.
WebHarvest 2.2.0 introduces a unified configuration system that eliminates the need for separate settings between CLI and IDE. Here's what you need to know:
A single properties file (webharvest.properties) now drives both the CLI and the IDE, and each setting can be overridden per run with a CLI flag:
- Proxy: --proxy-host, --proxy-port, --proxy-user
- HTTP: --user-agent, --timeout, --follow-redirects
- Paths: --output-dir, --working-dir
- Logging: --debug, --verbose
Existing users: your old webharvest.properties files will continue to work, and new installations use the unified system by default. No action is required for existing setups.
✅ Backward Compatible - Existing configurations continue to work
🚀 Simplified Workflow - One config system for all tools
Get up and running in 3 simple steps
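Before anything else, verify that Java 11 or newer is installed (the CLI requires it; see the requirements note at the bottom of this page):
# Check for Java 11+
java -version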
Download the CLI JAR file from SourceForge:
# Download CLI
wget https://sourceforge.net/projects/web-harvest/files/webhervest/2.2.0/webharvest-cli-2.2.0.jar/download -O webharvest-cli-2.2.0.jar
# Or using curl
curl -L -o webharvest-cli-2.2.0.jar \
https://sourceforge.net/projects/web-harvest/files/webhervest/2.2.0/webharvest-cli-2.2.0.jar/download
Execute a configuration file:
# Basic usage
java -jar webharvest-cli-2.2.0.jar config.xml
# With options
java -jar webharvest-cli-2.2.0.jar \
--config=scraper.xml \
--output=results.json \
--verbose
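The CLI signals success or failure through its exit code (the same mechanism the CI/CD integrations below rely on), so shell scripts can branch on the result. A minimal sketch, assuming a non-zero exit code on failure:
#!/bin/bash
# Branch on the CLI exit code (assumed non-zero on failure)
if java -jar webharvest-cli-2.2.0.jar --config=scraper.xml --output=results.json; then
    echo "Scrape succeeded"
else
    echo "Scrape failed" >&2
    exit 1
fi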
Quick guide to key workflows

Running and output:
- java -jar webharvest-cli.jar config.xml for basic execution
- --output=file.json to save results

Network options:
- --proxy-host and --proxy-port for HTTP proxy
- --user-agent for custom headers
- --timeout for request limits

Debugging:
- --verbose for detailed logging, --debug for maximum verbosity
- --working-dir for relative paths
- command > results.log 2>&1 to capture all output; append & to run in the background
- 2> error.log to capture errors separately

Post-processing:
- jq for JSON processing
- xq for XML to JSON conversion
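Several of these options combine naturally; for example, a long-running scrape in the background with results and errors in separate logs:
# Background run; results and errors in separate logs
java -jar webharvest-cli.jar config.xml --verbose > results.log 2> error.log &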
Full CLI reference
java -jar webharvest-cli-2.2.0.jar [OPTIONS] <config.xml>
OPTIONS:
-c, --config=FILE Path to XML configuration file (may also be given as the positional argument)
-o, --output=FILE Output file for results (default: stdout)
-f, --format=FORMAT Output format: xml, json, text (default: xml)
-v, --verbose Enable verbose logging
-q, --quiet Suppress all output except errors
-D, --define=VAR=VALUE Define configuration variable
--working-dir=DIR Set working directory
--timeout=SECONDS Execution timeout (default: no limit)
--retries=NUM Number of retries for HTTP requests (default: 3)
--proxy=HOST:PORT HTTP proxy server
--user-agent=STRING Custom User-Agent header
-h, --help Show this help message
--version Show version information
EXAMPLES:
# Basic execution
java -jar webharvest-cli.jar scraper.xml
# With output file
java -jar webharvest-cli.jar -c scraper.xml -o results.json
# Define variables
java -jar webharvest-cli.jar -c scraper.xml -D baseUrl=https://example.com
# Verbose mode
java -jar webharvest-cli.jar -c scraper.xml --verbose
# Use proxy
java -jar webharvest-cli.jar -c scraper.xml --proxy=proxy.example.com:8080
How the CLI works under the hood
Flags such as --proxy-host, --user-agent, and --timeout change behavior for the current run only; use webharvest.properties for persistent settings that should apply to every run.
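For a concrete (if hypothetical) picture, a persistent settings file might look like the snippet below; the property key names are assumptions for illustration, not the documented schema:
# Illustrative only: key names below are assumed, not taken from the WebHarvest docs
cat > webharvest.properties <<'EOF'
proxy.host=proxy.example.com
proxy.port=8080
user.agent=WebHarvest/2.2.0
timeout=30
EOF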
Real-world CLI automation scenarios
Run scrapers on a schedule with cron:
# Run every day at 2 AM
0 2 * * * java -jar /path/to/webharvest-cli.jar \
/path/to/scraper.xml >> /var/log/scraper.log 2>&1
# Run every hour
0 * * * * java -jar webharvest-cli.jar daily-check.xml
# Run every Monday at 9 AM
0 9 * * 1 java -jar webharvest-cli.jar weekly-report.xml
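Cron runs with a minimal environment and PATH, so a small wrapper script with absolute paths keeps jobs reliable (the paths below are placeholders):
#!/bin/bash
# /usr/local/bin/run-scraper.sh - cron wrapper; absolute paths because cron's PATH is minimal
/usr/bin/java -jar /path/to/webharvest-cli.jar /path/to/scraper.xml \
    >> /var/log/scraper.log 2>&1
The crontab entry then reduces to: 0 2 * * * /usr/local/bin/run-scraper.sh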
Integrate with GitHub Actions:
name: Data Scraping
on:
schedule:
- cron: '0 0 * * *' # Daily at midnight
workflow_dispatch:
jobs:
scrape:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-java@v3
with:
java-version: '11'
- name: Run Scraper
run: |
java -jar webharvest-cli-2.2.0.jar \
--config=scraper.xml \
--output=results.json
- name: Upload Results
uses: actions/upload-artifact@v3
with:
name: scraping-results
path: results.json
Process multiple sites:
#!/bin/bash
# Process multiple configurations
for config in configs/*.xml; do
echo "Processing $config..."
java -jar webharvest-cli.jar \
--config="$config" \
--output="results/$(basename $config .xml).json"
done
echo "Batch processing complete!"
Integrate with data processing:
#!/bin/bash
# Scrape → Transform → Load pipeline
# 1. Scrape data
java -jar webharvest-cli.jar scraper.xml -o raw-data.xml
# 2. Transform: XML to JSON with xq, then JSON to CSV rows with jq
#    (field names depend on your scraper's output; .name/.value are placeholders)
xq -c '.data[]' raw-data.xml | jq -r '[.name, .value] | @csv' > processed.csv
# 3. Load the CSV into the database
psql -d mydb -c "COPY data FROM STDIN WITH CSV" < processed.csv
echo "Pipeline complete!"
Common issues and solutions
Problem: Unsupported class file major version
Solution:
# Check Java version
java -version # Should be 11+
# Install Java 11+ if needed
# Ubuntu/Debian:
sudo apt install openjdk-11-jdk
# macOS (Homebrew):
brew install openjdk@11
Problem: Configuration file not found
Solution: check the file name and the directory you run from; the configuration path resolves against the current working directory.
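Two ways to make the path unambiguous, using the --working-dir option from the reference above:
# Pass an absolute path to the configuration
java -jar webharvest-cli.jar /absolute/path/to/config.xml
# Or anchor relative paths explicitly
java -jar webharvest-cli.jar --working-dir=/path/to/project config.xml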
Problem: Connection timeouts or proxy errors
Solution:
# Use proxy
java -jar webharvest-cli.jar config.xml \
--proxy-host=proxy.example.com \
--proxy-port=8080
# Increase timeout to 60 seconds
java -jar webharvest-cli.jar config.xml \
--timeout=60
Problem: OutOfMemoryError or slow performance
Solution:
# Increase heap size
java -Xmx1g -jar webharvest-cli-2.2.0.jar config.xml
# Or configure in script
export JAVA_OPTS="-Xmx1g -Xms256m"
java $JAVA_OPTS -jar webharvest-cli.jar config.xml
Problem: No output or wrong format
Solution:
- Run with --verbose to see execution details
- Check the exit code with echo $? after execution
- Set the format explicitly with --format=json
- Write results to a file with --output=results.json
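A quick diagnostic that applies these checks in one run:
# Force JSON output to a file, log verbosely, then inspect the exit code
java -jar webharvest-cli.jar -c scraper.xml --format=json --output=results.json --verbose
echo "Exit code: $?"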
Problem: Cron jobs or scripts not working
Solution: use absolute paths for java, the JAR, and the configuration file (cron runs with a minimal environment), and redirect output with > logfile.log 2>&1 so failures end up in a log you can inspect.
Download WebHarvest CLI and experience professional command-line web scraping
Java 11+ Required • BSD License • 15+ MB Download