Professional CLI for Web Scraping

Automate data extraction from the terminal

Production-ready command-line interface with unified configuration system (v2.2.0). Perfect for servers, cron jobs, CI/CD pipelines, and batch processing workflows.

CLI Features

Everything you need for professional command-line web scraping

Simple Command-Line

Execute scraping configurations with a single command. Standard input/output for easy integration with shell scripts and pipelines.

Server-Ready

Headless execution perfect for servers and cloud environments. No GUI required, minimal resource footprint.

Unified Configuration

v2.2.0 introduces a unified settings system: the same configuration works for both the CLI and the IDE, eliminating duplicate configs.

Cron Integration

Schedule scrapers with cron jobs for periodic data collection. Perfect for monitoring, price tracking, and content updates.

CI/CD Ready

Integrate with Jenkins, GitLab CI, GitHub Actions, and other CI/CD tools. Exit codes and logging for pipeline automation.

Batch Processing

Process multiple configurations sequentially or in parallel. Handle large-scale data extraction with ease.

Unified Configuration System (v2.2.0)

WebHarvest 2.2.0 introduces a unified configuration system that eliminates the need for separate settings between CLI and IDE. Here's what you need to know:

Problem Solved (Bug #39)

  • Before: Separate configs for GUI (webharvest.properties) and CLI
  • Now: Single unified configuration system
  • Benefit: No synchronization issues, no duplicate settings
  • Compatibility: Works seamlessly with both CLI and IDE

Unified Settings Available

  • HTTP Proxy: --proxy-host, --proxy-port, --proxy-user
  • HTTP Settings: --user-agent, --timeout, --follow-redirects
  • Output: --output-dir, --working-dir
  • Execution: --debug, --verbose
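
For example, a minimal sketch combining several of these flags in one invocation (hosts, ports, and paths are placeholder values):

terminal
# Combine unified settings flags in a single run
java -jar webharvest-cli-2.2.0.jar \
  --proxy-host=proxy.example.com \
  --proxy-port=8080 \
  --user-agent="MyBot/1.0" \
  --timeout=60 \
  --output-dir=./results \
  --verbose \
  config.xml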

Migration Guide

Existing users: Your old webharvest.properties files will continue to work. New installations use the unified system by default. No action required for existing setups.

  • Backward Compatible: existing configurations continue to work
  • Simplified Workflow: one config system for all tools

Installation

Get up and running in 3 simple steps

1. System Requirements

  • Java: Java 11 or higher (OpenJDK or Oracle JDK)
  • Memory: Minimum 256 MB RAM (512 MB recommended)
  • OS: Windows, macOS, Linux (any platform with Java 11+)
  • Network: Internet access for downloading dependencies

2. Download & Extract

Download the CLI JAR file from SourceForge:

terminal
# Download CLI
wget https://sourceforge.net/projects/web-harvest/files/webhervest/2.2.0/webharvest-cli-2.2.0.jar/download -O webharvest-cli-2.2.0.jar

# Or using curl
curl -L -o webharvest-cli-2.2.0.jar \
  https://sourceforge.net/projects/web-harvest/files/webhervest/2.2.0/webharvest-cli-2.2.0.jar/download

3. Run Your First Scraper

Execute a configuration file:

terminal
# Basic usage
java -jar webharvest-cli-2.2.0.jar config.xml

# With options
java -jar webharvest-cli-2.2.0.jar \
  --config=scraper.xml \
  --output=results.json \
  --verbose

Using the CLI

Quick guide to key workflows

Basic Execution

  1. Create XML configuration file
  2. Run: java -jar webharvest-cli.jar config.xml
  3. Results output to stdout by default
  4. Use --output=file.json to save results
  5. Check exit codes for automation (0 = success)
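
A short shell sketch of this workflow, branching on the exit code (file names are placeholders):

terminal
# Run a configuration and branch on the exit code (0 = success)
java -jar webharvest-cli.jar config.xml --output=results.json
status=$?
if [ "$status" -eq 0 ]; then
  echo "Scrape succeeded: results.json"
else
  echo "Scrape failed with exit code $status" >&2
fi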

Configuration Options

  1. Use --proxy-host and --proxy-port for HTTP proxy
  2. Set --user-agent for custom headers
  3. Configure --timeout for request limits
  4. Enable --verbose for detailed logging
  5. Use --working-dir for relative paths

Automation & Scheduling

  1. Add CLI commands to cron jobs
  2. Use in CI/CD pipelines (GitHub Actions, Jenkins)
  3. Redirect output: command > results.log 2>&1
  4. Chain with other tools using pipes
  5. Monitor with exit codes and logging

Batch Processing

  1. Create multiple XML configuration files
  2. Use shell loops to process sequentially
  3. Implement parallel processing with & (see the sketch after this list)
  4. Collect results in separate output files
  5. Monitor progress with --verbose
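
A sketch of steps 2–4, assuming configurations live in a configs/ directory (adjust paths as needed):

terminal
# Run all configurations in parallel, then wait for completion
mkdir -p results
for config in configs/*.xml; do
  java -jar webharvest-cli.jar \
    --config="$config" \
    --output="results/$(basename "$config" .xml).json" &
done
wait  # block until every background job finishes
echo "All parallel jobs done"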

Debugging

  1. Use --verbose for detailed output
  2. Enable --debug for maximum verbosity
  3. Check exit codes: 0=success, non-zero=error
  4. Redirect stderr: 2> error.log
  5. Test configurations in IDE first

Integration

  1. Use in shell scripts for automation
  2. Integrate with data pipelines
  3. Combine with jq for JSON processing
  4. Use with xq for XML to JSON conversion
  5. Pipe output to databases or APIs
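
For instance, a sketch piping JSON output into jq (the .items[].title path is a hypothetical result shape, not a guaranteed schema):

terminal
# Filter JSON results with jq in a single pipeline
java -jar webharvest-cli.jar -c scraper.xml --format=json \
  | jq -r '.items[].title'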

Command-Line Options

Full CLI reference

Usage

CLI Reference
java -jar webharvest-cli-2.2.0.jar [OPTIONS] <config.xml>

OPTIONS:
  -c, --config=FILE         Path to XML configuration file (or pass it as the positional argument)
  -o, --output=FILE         Output file for results (default: stdout)
  -f, --format=FORMAT       Output format: xml, json, text (default: xml)
  -v, --verbose             Enable verbose logging
  -q, --quiet               Suppress all output except errors
  -D, --define=VAR=VALUE    Define configuration variable
  --working-dir=DIR         Set working directory
  --timeout=SECONDS         Execution timeout (default: no limit)
  --retries=NUM             Number of retries for HTTP requests (default: 3)
  --proxy=HOST:PORT         HTTP proxy server
  --user-agent=STRING       Custom User-Agent header
  -h, --help                Show this help message
  --version                 Show version information

EXAMPLES:
  # Basic execution
  java -jar webharvest-cli.jar scraper.xml

  # With output file
  java -jar webharvest-cli.jar -c scraper.xml -o results.json

  # Define variables
  java -jar webharvest-cli.jar -c scraper.xml -D baseUrl=https://example.com

  # Verbose mode
  java -jar webharvest-cli.jar -c scraper.xml --verbose

  # Use proxy
  java -jar webharvest-cli.jar -c scraper.xml --proxy=proxy.example.com:8080

Architecture

How the CLI works under the hood

Components

  • CLI Main: Command-line argument parsing and execution orchestration
  • Core Engine: WebHarvest session API, plugin system, execution manager
  • Configuration: Unified settings system for CLI and IDE compatibility

Unified Configuration (v2.2)

  • Single Source: One configuration system for CLI and IDE
  • Command-Line Args: --proxy-host, --user-agent, --timeout
  • Properties File: webharvest.properties for persistent settings
  • Backward Compatible: Existing configs continue to work

Execution Flow

  • Parse Args: Command-line options and XML config file
  • Initialize: WebHarvest session with unified settings
  • Execute: XML configuration with plugin system
  • Output: Results to stdout, file, or specified format

Integration Points

  • Exit Codes: 0=success, non-zero=error for automation
  • Standard I/O: stdin/stdout for pipeline integration
  • File Output: JSON, XML, text formats supported
  • Logging: Configurable verbosity levels
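
As a sketch, a wrapper that maps the exit code onto automation behavior (only 0 = success is documented, so non-zero codes are handled generically):

terminal
# Exit-code handling for automation
java -jar webharvest-cli.jar -c scraper.xml -o results.json
case $? in
  0) echo "Scrape OK" ;;
  *) echo "Scrape failed" >&2; exit 1 ;;
esac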

Common Use Cases

Real-world CLI automation scenarios

Scheduled Data Collection

Run scrapers on a schedule with cron:

crontab
# Run every day at 2 AM (cron entries must fit on a single line; use absolute paths)
0 2 * * * /usr/bin/java -jar /path/to/webharvest-cli.jar /path/to/scraper.xml >> /var/log/scraper.log 2>&1

# Run every hour
0 * * * * /usr/bin/java -jar /path/to/webharvest-cli.jar /path/to/daily-check.xml

# Run every Monday at 9 AM
0 9 * * 1 /usr/bin/java -jar /path/to/webharvest-cli.jar /path/to/weekly-report.xml

CI/CD Integration

Integrate with GitHub Actions:

.github/workflows/scrape.yml
name: Data Scraping
on:
  schedule:
    - cron: '0 0 * * *'  # Daily at midnight
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '11'
      - name: Run Scraper
        run: |
          java -jar webharvest-cli-2.2.0.jar \
            --config=scraper.xml \
            --output=results.json
      - name: Upload Results
        uses: actions/upload-artifact@v3
        with:
          name: scraping-results
          path: results.json

Batch Processing

Process multiple sites:

batch-scrape.sh
#!/bin/bash
# Process multiple configurations

mkdir -p results  # ensure the output directory exists
for config in configs/*.xml; do
  echo "Processing $config..."
  java -jar webharvest-cli.jar \
    --config="$config" \
    --output="results/$(basename "$config" .xml).json"
done

echo "Batch processing complete!"

Data Pipeline

Integrate with data processing:

pipeline.sh
#!/bin/bash
# Scrape → Transform → Load pipeline

# 1. Scrape data
java -jar webharvest-cli.jar scraper.xml -o raw-data.xml

# 2. Transform with xq (XML to JSON)
xq -c '.data[]' raw-data.xml > processed.json

# 3. Convert JSON rows to CSV with jq and load into the database
#    (assumes each record has "name" and "value" fields)
jq -r '[.name, .value] | @csv' processed.json \
  | psql -d mydb -c "COPY data (name, value) FROM STDIN WITH CSV"

echo "Pipeline complete!"

Troubleshooting

Common issues and solutions

Java Version Error

Problem: Unsupported class file major version

Solution:

terminal
# Check Java version
java -version  # Should be 11+

# Install Java 11+ if needed
# Ubuntu/Debian:
sudo apt install openjdk-11-jdk

# macOS (Homebrew):
brew install openjdk@11

Configuration Not Found

Problem: Configuration file not found

Solution:

  • Check file path is correct and file exists
  • Use absolute path or correct relative path
  • Verify XML syntax is valid
  • Check file permissions (readable)
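
A quick pre-flight check along these lines catches most of them (xmllint ships with libxml2):

terminal
# Verify the file exists, is readable, and is well-formed XML
[ -r config.xml ] || echo "config.xml missing or unreadable" >&2
xmllint --noout config.xml && echo "config.xml is well-formed"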

Network Issues

Problem: Connection timeouts or proxy errors

Solution:

terminal
# Use proxy
java -jar webharvest-cli.jar config.xml \
  --proxy-host=proxy.example.com \
  --proxy-port=8080

# Increase timeout (value is in seconds, per the CLI reference)
java -jar webharvest-cli.jar config.xml \
  --timeout=60

Memory Issues

Problem: OutOfMemoryError or slow performance

Solution:

terminal
# Increase heap size
java -Xmx1g -jar webharvest-cli-2.2.0.jar config.xml

# Or configure in script
export JAVA_OPTS="-Xmx1g -Xms256m"
java $JAVA_OPTS -jar webharvest-cli.jar config.xml

Output Issues

Problem: No output or wrong format

Solution:

  • Use --verbose to see execution details
  • Check exit code: echo $? after execution
  • Specify output format: --format=json
  • Save to file: --output=results.json

Automation Issues

Problem: Cron jobs or scripts not working

Solution:

  • Use absolute paths in cron jobs
  • Set JAVA_HOME in environment
  • Redirect output: > logfile.log 2>&1
  • Check file permissions and ownership
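
A minimal cron-friendly wrapper sketch applying all four points (JAVA_HOME and the paths are placeholders):

scrape-wrapper.sh
#!/bin/bash
# Cron-friendly wrapper: explicit JAVA_HOME, absolute paths, logged output
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
"$JAVA_HOME/bin/java" -jar /opt/webharvest/webharvest-cli.jar \
  /opt/webharvest/config.xml > /var/log/scraper.log 2>&1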

Ready to Start?

Download WebHarvest CLI and experience professional command-line web scraping

Java 11+ Required • BSD License • 15+ MB Download