WebHarvest
Architecture

Modular, plugin-based, extensible design

An overview of the core architecture, plugin system, and execution model that power WebHarvest's flexibility and performance.

System Architecture (v2.2)

Modular design with clear separation of concerns

System Architecture Diagram

Applications:
  • webharvest-ide: Web-based IDE
  • webharvest-cli: Command-line tool

Foundation:
  • webharvest-core (Framework): 47 Plugins • Parser • Runtime • DI • Sessions • Events

External Plugins:
  • webharvest-database
  • webharvest-mail
  • webharvest-ftp
  • webharvest-zip
  • webharvest-webbrowser

Architecture: webharvest-core is the foundation. Applications (IDE, CLI) and external plugins build on top.

Module Breakdown

Core framework and external plugins

webharvest-core

The Foundation: Framework with 47 built-in plugins, parser, runtime engine, DI, session management, and token tracking.

Required • 47 Plugins

webharvest-cli

Command-Line Tool: Run configurations from terminal, scripting support, batch processing.

Optional

webharvest-ide

Web IDE: Embedded Jetty server, Monaco editor, real-time WebSocket updates, per-tab workspaces.

Optional • Jetty 11

External Plugins

Extensions: Database, Mail, FTP, ZIP, Web Browser - separate modules, auto-discovered when on classpath.

Optional • 5+ Plugins

Execution Flow

From XML configuration to results

Request Execution Pipeline

1. Configuration Loading: config.xml → SAXConfigParser → XmlNode tree
2. Definition Resolution: <http> → HttpDef → HttpPlugin mapping
3. Plugin Registration: @CorePlugin discovery → PluginRegistry
4. Dependency Injection: Guice → @Inject HttpService, etc.
5. Session Creation: UUID + SessionMetrics + TokenTracker
6. Execution: Scraper.execute() → Plugins → Variables
7. Event Broadcasting: SessionStarted → SessionCompleted
8. Results & Cleanup: Variables extracted • Metrics calculated

Plugin System

How plugins are discovered, registered, and executed

Plugin Discovery

  • Classpath Scanning: Reflections library scans for @CorePlugin and @Autoscanned
  • Dual Architecture: Old @Definition + new @CorePlugin coexist
  • Namespace Mapping: XML namespace → Plugin class
  • Runtime Registration: Plugins can be added at runtime
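
The discovery and mapping steps above can be sketched with plain reflection. This is a minimal illustration, not WebHarvest's implementation: the `@PluginElement` annotation, the `SketchPlugin` interface, and the "namespace:name" key format are hypothetical stand-ins, and classes are registered explicitly rather than found by classpath scanning.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an annotation such as @CorePlugin; the real attributes may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PluginElement {
    String name();                      // XML element name, e.g. "http"
    String namespace() default "core";  // assumed namespace key
}

// Hypothetical plugin contract used only by this sketch.
interface SketchPlugin {
    String execute(String input);
}

@PluginElement(name = "http")
final class HttpSketchPlugin implements SketchPlugin {
    @Override
    public String execute(String input) {
        return "GET " + input;
    }
}

final class PluginRegistrySketch {
    private final Map<String, Class<? extends SketchPlugin>> plugins = new ConcurrentHashMap<>();

    /** Registers a plugin class under "namespace:name"; safe to call at runtime. */
    void register(Class<? extends SketchPlugin> type) {
        PluginElement meta = type.getAnnotation(PluginElement.class);
        if (meta == null) {
            throw new IllegalArgumentException(type.getName() + " is not annotated");
        }
        plugins.put(meta.namespace() + ":" + meta.name(), type);
    }

    /** Resolves an XML element to a fresh plugin instance, if one is registered. */
    Optional<SketchPlugin> create(String namespace, String element) {
        Class<? extends SketchPlugin> type = plugins.get(namespace + ":" + element);
        if (type == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(type.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }
}
```

A classpath scanner would call `register` once per annotated class it finds; later calls to `register` cover the runtime-registration case.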

Dependency Injection

  • Guice Modules: ConfigModule, ServicesModule
  • Static Injection: BaseTemplater uses static DI
  • Service Location: InjectorHelper.getInjector()
  • Scoped Injection: Singleton, Prototype patterns supported

Plugin Execution

  • Context Management: DynamicScopeContext for variable scoping
  • Variable System: Typed variables (Node, List, Empty)
  • Interruption: All plugins honor thread interruption (InterruptedException)
  • Error Handling: <try>/<catch> support
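
Variable scoping of the kind DynamicScopeContext provides can be pictured as a stack of maps. The sketch below is an assumption about the general technique, not the actual class: `ScopeContextSketch` and its method names are hypothetical, and values are plain objects rather than typed variables.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ScopeContextSketch {
    // Head of the deque is the innermost (most recently entered) scope.
    private final Deque<Map<String, Object>> scopes = new ArrayDeque<>();

    ScopeContextSketch() {
        scopes.push(new HashMap<>()); // root scope
    }

    void enterScope() {
        scopes.push(new HashMap<>());
    }

    void exitScope() {
        if (scopes.size() == 1) {
            throw new IllegalStateException("cannot exit the root scope");
        }
        scopes.pop();
    }

    /** Defines or overwrites a variable in the innermost scope only. */
    void setLocal(String name, Object value) {
        scopes.peek().put(name, value);
    }

    /** Looks a variable up from the innermost scope outwards; null if undefined. */
    Object get(String name) {
        for (Map<String, Object> scope : scopes) { // iterates head first, i.e. innermost first
            if (scope.containsKey(name)) {
                return scope.get(name);
            }
        }
        return null;
    }
}
```

A variable set inside a nested scope shadows an outer one with the same name and disappears when the scope exits.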

Session Management API (v2.2)

Production-grade execution tracking and monitoring

Session Lifecycle (v2.2)

1. SESSION CREATED
   UUID: "a3f2c8e1..." • Status: PENDING • Event: SessionCreatedEvent
   ClientContext attached • SessionMetrics initialized • TokenTracker started
2. SESSION STARTED
   Status: PENDING → RUNNING • Event: SessionStartedEvent
   Start time recorded • Thread assigned • DynamicScopeContext created
3. EXECUTION
   Plugins execute • Variables stored • Metrics tracked
   TokenTracker: HTTP_REQUEST +1 • HTTP_BYTES +N • CPU_TIME +N ms • MEMORY_PEAK tracked
   SessionMetrics: processedElements +1
   Events broadcast • Progress updated
4. SESSION ENDED
   COMPLETED (success) • FAILED (error occurred) • CANCELLED (user stopped)
   End time recorded • Duration calculated • Token snapshot • Context preserved
5. CLEANUP (optional)
   Session removed • Context destroyed • Memory freed

Session Tracking

  • UUID-based IDs: UUID.randomUUID() for globally unique identifiers
  • Lifecycle States: PENDING → RUNNING → COMPLETED/FAILED/CANCELLED
  • Thread-Safe Registry: ConcurrentHashMap with atomic operations
  • Client Context: ClientID + arbitrary metadata (tags, project, user)
  • Blocking Wait: awaitCompletion(timeout) for sync execution
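
The tracking features above can be combined in a small sketch. It assumes a compare-and-set state machine and a CountDownLatch for the blocking wait; `SessionSketch`, `SessionRegistrySketch`, and their methods are hypothetical names, and the real API may differ (for example, this sketch only allows terminal transitions from RUNNING).

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

enum SessionStatus { PENDING, RUNNING, COMPLETED, FAILED, CANCELLED }

final class SessionSketch {
    final String id = UUID.randomUUID().toString();
    private final AtomicReference<SessionStatus> status =
            new AtomicReference<>(SessionStatus.PENDING);
    private final CountDownLatch done = new CountDownLatch(1);

    SessionStatus status() {
        return status.get();
    }

    /** PENDING → RUNNING; returns false if the session already left PENDING. */
    boolean start() {
        return status.compareAndSet(SessionStatus.PENDING, SessionStatus.RUNNING);
    }

    /** RUNNING → terminal state; releases any thread blocked in awaitCompletion. */
    boolean finish(SessionStatus terminal) {
        if (terminal == SessionStatus.PENDING || terminal == SessionStatus.RUNNING) {
            throw new IllegalArgumentException("not a terminal state: " + terminal);
        }
        boolean changed = status.compareAndSet(SessionStatus.RUNNING, terminal);
        if (changed) {
            done.countDown();
        }
        return changed;
    }

    /** Blocks until the session ends or the timeout elapses; true if it ended. */
    boolean awaitCompletion(long timeout, TimeUnit unit) {
        try {
            return done.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}

final class SessionRegistrySketch {
    private final Map<String, SessionSketch> sessions = new ConcurrentHashMap<>();

    SessionSketch create() {
        SessionSketch session = new SessionSketch();
        sessions.put(session.id, session);
        return session;
    }

    SessionSketch find(String id) {
        return sessions.get(id);
    }

    boolean remove(String id) {
        return sessions.remove(id) != null;
    }
}
```

Because each transition is a single compare-and-set, two threads racing to finish the same session cannot both succeed.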

Metrics & Tracking

  • Duration: Instant-based timing (nanosecond-resolution timestamps; effective precision depends on the system clock)
  • Element Count: Atomic counter of processed elements
  • Processing Rate: Real-time elements/second calculation
  • Token Usage: Granular tracking of resources consumed
  • Immutable Snapshots: TokenUsage for safe reporting
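
As a rough illustration of the duration and rate metrics, the sketch below pairs an AtomicLong counter with Instant timestamps. `SessionMetricsSketch` and its methods are hypothetical; the clock value is passed in explicitly so the calculation is easy to follow and to test.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

final class SessionMetricsSketch {
    private final AtomicLong processedElements = new AtomicLong();
    private volatile Instant startedAt;
    private volatile Instant endedAt;

    void markStarted(Instant now) {
        startedAt = now;
    }

    void markEnded(Instant now) {
        endedAt = now;
    }

    /** Lock-free increment; returns the new count. */
    long elementProcessed() {
        return processedElements.incrementAndGet();
    }

    long processedElements() {
        return processedElements.get();
    }

    /** Elapsed time up to 'now' while running, or up to the end time once finished. */
    Duration duration(Instant now) {
        if (startedAt == null) {
            return Duration.ZERO;
        }
        Instant end = (endedAt != null) ? endedAt : now;
        return Duration.between(startedAt, end);
    }

    /** Elements per second; 0 when no time has elapsed yet. */
    double elementsPerSecond(Instant now) {
        long nanos = duration(now).toNanos();
        return nanos <= 0 ? 0.0 : processedElements.get() * 1_000_000_000.0 / nanos;
    }
}
```

For example, 50 processed elements after 10 seconds gives a rate of 5 elements/second.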

Event System

  • Guava EventBus: Async, decoupled event broadcasting
  • Session Events: Created, Started, Completed, Failed, Cancelled
  • Listener Registration: @Subscribe annotation for handlers
  • IDE Integration: Real-time WebSocket updates via event listeners
  • Thread-Safe: Event posting from any thread
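
The publish/subscribe pattern behind the event system can be shown with the standard library alone. This is deliberately not Guava's EventBus: `EventBusSketch` and `SessionStartedSketchEvent` are hypothetical, handlers are registered with lambdas instead of @Subscribe, and delivery is synchronous and limited to the event's exact class.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical event type carrying only the session ID.
final class SessionStartedSketchEvent {
    final String sessionId;

    SessionStartedSketchEvent(String sessionId) {
        this.sessionId = sessionId;
    }
}

final class EventBusSketch {
    private final Map<Class<?>, List<Consumer<Object>>> listeners = new ConcurrentHashMap<>();

    /** Registers a handler for one event class; safe to call from any thread. */
    <T> void subscribe(Class<T> type, Consumer<? super T> handler) {
        listeners.computeIfAbsent(type, key -> new CopyOnWriteArrayList<>())
                 .add(event -> handler.accept(type.cast(event)));
    }

    /** Delivers the event to handlers of its exact class; returns how many were called. */
    int post(Object event) {
        List<Consumer<Object>> handlers = listeners.getOrDefault(event.getClass(), List.of());
        for (Consumer<Object> handler : handlers) {
            handler.accept(event);
        }
        return handlers.size();
    }
}
```

A listener that forwards events to WebSocket clients would subscribe once at startup and receive every matching event posted afterwards.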

Multi-Threading & Concurrency

Thread-safe execution of parallel scrapers

Concurrent Execution Model

WebHarvestService (API): executeAsync(config, context, options) → Future<Session>
ThreadPoolExecutor: 10 core threads • 200 max threads • Unbounded queue
  • [Thread 1] Session A: ScraperContext (isolated) • TokenTracker (atomic) • SessionMetrics (atomic)
  • [Thread 2] Session B: ScraperContext (isolated) • TokenTracker (atomic) • SessionMetrics (atomic)
  • [Thread 3] Session C: ScraperContext (isolated) • TokenTracker (atomic) • SessionMetrics (atomic)
SessionRegistry: ConcurrentHashMap • Thread-safe reads/writes • No global lock

Thread Safety Guarantees

✅ Isolated ScraperContext per session
✅ AtomicLong counters (lock-free)
✅ ConcurrentHashMap (concurrent access)
✅ EventBus thread-safe
✅ No synchronized blocks in hot paths

Thread Safety

  • Isolated Contexts: Each session has own DynamicScopeContext
  • Atomic Counters: AtomicLong for metrics (lock-free)
  • Concurrent Collections: ConcurrentHashMap for registries
  • Immutable Data: TokenUsage, ClientContext are immutable
  • No Shared State: Variables scoped per-session

Thread Pool

  • Elastic Pool: Scales on demand up to 200 threads
  • Keep-Alive: 60 seconds idle thread timeout
  • Unbounded Queue: All sessions accepted (no rejection)
  • Graceful Shutdown: awaitTermination() for cleanup
  • Cancellation: interrupt() propagates to plugins
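
One way to get the behavior listed above with java.util.concurrent is sketched below. It is an assumption about configuration, not WebHarvest's actual setup: a standard ThreadPoolExecutor only creates threads beyond its core size when the queue is full, so with an unbounded queue this sketch sets core and max to the same value and lets idle core threads time out. `SessionPoolSketch` is a hypothetical name.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class SessionPoolSketch {
    /**
     * core == max plus allowCoreThreadTimeOut(true): a new thread is created per
     * submitted task until maxThreads is reached, further tasks wait in the
     * unbounded queue (so nothing is rejected), and idle threads exit after
     * keepAliveSeconds, letting the pool shrink back to zero.
     */
    static ThreadPoolExecutor create(int maxThreads, long keepAliveSeconds) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                keepAliveSeconds, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    /** Stops accepting work, waits up to the timeout, then interrupts stragglers. */
    static boolean shutdownGracefully(ThreadPoolExecutor pool, long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow();
            return false;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling `SessionPoolSketch.create(200, 60)` yields a pool matching the 200-thread cap and 60-second idle timeout described above.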

Cancellation

  • User-Initiated: session.cancel() → thread interrupt
  • Plugin Support: All plugins honor thread interruption (InterruptedException)
  • Graceful Exit: Current processor finishes, then stops
  • Status Update: RUNNING → CANCELLED
  • Event: SessionCancelledEvent broadcast
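
The cancellation flow above relies on cooperative interruption. The sketch below shows that general technique with a plain ExecutorService; `CancellationSketch` is hypothetical, status is a simple string, and each loop iteration stands in for one processor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class CancellationSketch {
    /** Starts a looping task, cancels it via interrupt, and returns the final status. */
    static String runAndCancel() {
        AtomicReference<String> status = new AtomicReference<>("PENDING");
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> task = pool.submit(() -> {
                status.set("RUNNING");
                started.countDown();
                try {
                    // The interrupt flag is checked between units of work.
                    while (!Thread.currentThread().isInterrupted()) {
                        Thread.sleep(5); // blocking calls throw InterruptedException when interrupted
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag for outer code
                } finally {
                    status.set("CANCELLED");
                    finished.countDown();
                }
            });
            started.await();                         // make sure the task is running
            task.cancel(true);                       // true = interrupt the worker thread
            finished.await(5, TimeUnit.SECONDS);     // wait for the task to wind down
            return status.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return status.get();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The task never gets killed mid-step: it notices the interrupt at its next check or blocking call and then records the CANCELLED status itself.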

Token Tracking & Resource Monitoring

Optional billing model for SaaS deployments - track usage, enforce quotas

This is an example pricing model for applications using WebHarvest as a service. The framework tracks resources - billing implementation is optional and customizable.

Token Types & Billing Model

HTTP_REQUEST: Every HTTP call • Rate limiting • $0.001/request
HTTP_BYTES: Data transferred • Bandwidth monitoring • $0.10/GB
CPU_TIME: Processing time • Performance tracking • $0.02/CPU-hour
MEMORY_PEAK: Max heap usage • Resource planning • $0.01/GB-hour

Example Cost Calculations

Session A: E-commerce Scraper

HTTP_REQUEST: 1,250 requests → $1.25
HTTP_BYTES: 45 MB → $0.0045
CPU_TIME: 12,500 ms (12.5 s) → $0.00007
MEMORY_PEAK: 256 MB → negligible
TOTAL COST: ~$1.25

Session B: Data Lake Pipeline

HTTP_REQUEST: 50 requests → $0.05
HTTP_BYTES: 2.5 GB → $0.25
CPU_TIME: 180,000 ms (3 min) → $0.001
MEMORY_PEAK: 1.2 GB → $0.0012
TOTAL COST: ~$0.30
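
The arithmetic behind these examples can be reproduced with a few lines of code. The sketch uses the example rates listed under Token Types & Billing Model and assumes 1 GB = 1,000 MB; `CostSketch` is a hypothetical helper. The MEMORY_PEAK charge is left out because a GB-hour figure also needs the time the memory was held, which the examples do not state.

```java
final class CostSketch {
    static final double PER_REQUEST = 0.001;  // USD per HTTP request
    static final double PER_GB = 0.10;        // USD per GB transferred (1 GB = 1,000 MB assumed)
    static final double PER_CPU_HOUR = 0.02;  // USD per CPU-hour

    static double requestCost(long requests) {
        return requests * PER_REQUEST;
    }

    static double bandwidthCost(double megabytes) {
        return megabytes / 1000.0 * PER_GB;
    }

    static double cpuCost(long cpuMillis) {
        return cpuMillis / 3_600_000.0 * PER_CPU_HOUR; // 3,600,000 ms per hour
    }

    /** Sum of request, bandwidth, and CPU charges (memory excluded in this sketch). */
    static double total(long requests, double megabytes, long cpuMillis) {
        return requestCost(requests) + bandwidthCost(megabytes) + cpuCost(cpuMillis);
    }
}
```

For Session A, `CostSketch.total(1250, 45, 12_500)` comes to roughly $1.2546, which rounds to the ~$1.25 shown; the request charge dominates.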

Multi-Tenant Quota Management

Client "DataCorp"

Session 1: 1,000 requests
Session 2: 2,500 requests
Session 3: 500 requests
Total: 4,000 requests
Quota: 5,000/month
Status: ⚠️ 80% usage - Alert!

Client "StartupXYZ"

Plan: Free tier
Limit: 100 requests/day
Usage: 95 requests
Remaining: 5 requests
Status: ⚠️ 95% usage - Warning!

Token Tracking

  • Atomic Counters: AtomicLong for each resource type (lock-free)
  • Real-Time Updates: Incremented during execution (no batching)
  • Immutable Snapshots: TokenUsage for safe reporting
  • Per-Session: Each session has dedicated TokenTracker
  • Low Overhead: ~5ns per increment (negligible performance impact)
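
A per-session tracker along these lines can be sketched with one AtomicLong per token type. The enum values mirror the token types named in this section; `TokenTrackerSketch`, its methods, and the use of a plain map as the snapshot type are assumptions for illustration (the real snapshot type is TokenUsage).

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

enum TokenType { HTTP_REQUEST, HTTP_BYTES, CPU_TIME, MEMORY_PEAK }

final class TokenTrackerSketch {
    // The map itself is never modified after construction; only the counters change.
    private final Map<TokenType, AtomicLong> counters = new EnumMap<>(TokenType.class);

    TokenTrackerSketch() {
        for (TokenType type : TokenType.values()) {
            counters.put(type, new AtomicLong());
        }
    }

    /** Adds to a cumulative token such as HTTP_REQUEST or HTTP_BYTES. */
    void add(TokenType type, long amount) {
        counters.get(type).addAndGet(amount);
    }

    /** Peak-style tokens keep the maximum observed value rather than a running sum. */
    void recordPeak(TokenType type, long value) {
        counters.get(type).accumulateAndGet(value, Math::max);
    }

    /** Unmodifiable point-in-time copy; later increments do not affect it. */
    Map<TokenType, Long> snapshot() {
        Map<TokenType, Long> copy = new EnumMap<>(TokenType.class);
        counters.forEach((type, counter) -> copy.put(type, counter.get()));
        return Collections.unmodifiableMap(copy);
    }
}
```

Reporting code reads from the snapshot, so it never observes a value changing underneath it while the session keeps running.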

Quota Enforcement

  • Pre-Execution Check: Validate quota before starting session
  • Client-Based Limits: Per-client quotas (free/paid tiers)
  • Resource-Specific: Limit HTTP requests, bytes, CPU separately
  • Soft Limits: Warnings at 80% usage
  • Hard Limits: Reject execution if quota exceeded
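
The soft and hard limits can be expressed as a single pre-execution check. This is a minimal sketch under stated assumptions: `QuotaSketch` and `Decision` are hypothetical names, usage at or above the limit is treated as exhausted, and integer arithmetic is used so the 80% boundary is exact.

```java
final class QuotaSketch {
    enum Decision { ALLOW, WARN, REJECT }

    static final long SOFT_LIMIT_PERCENT = 80;

    /**
     * Pre-execution check for one resource type; 'used' and 'limit' share a unit
     * (requests, bytes, or CPU milliseconds).
     */
    static Decision check(long used, long limit) {
        if (limit <= 0 || used >= limit) {
            return Decision.REJECT;            // hard limit: quota exhausted
        }
        if (used * 100 >= limit * SOFT_LIMIT_PERCENT) {
            return Decision.WARN;              // soft limit: 80% or more consumed
        }
        return Decision.ALLOW;
    }
}
```

With the figures from the quota examples, `check(4000, 5000)` and `check(95, 100)` both return WARN, while `check(100, 100)` returns REJECT.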

Billing Integration

  • Usage Export: TokenUsage → JSON/CSV for billing systems
  • Pricing Models: Per-request, per-GB, per-CPU-hour, tiered plans
  • Client Metadata: Attach account ID, plan tier, payment method
  • Aggregation: Sum tokens across sessions by client
  • Integration: Stripe, AWS Marketplace, custom billing API
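
Aggregation and export can be illustrated with a small sketch that sums request tokens per client and renders CSV. `UsageRecordSketch`, `BillingExportSketch`, and the column names are hypothetical, and client IDs are assumed to contain no commas or quotes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-session usage record: one client ID plus its HTTP_REQUEST count.
final class UsageRecordSketch {
    final String clientId;
    final long httpRequests;

    UsageRecordSketch(String clientId, long httpRequests) {
        this.clientId = clientId;
        this.httpRequests = httpRequests;
    }
}

final class BillingExportSketch {
    /** Sums HTTP_REQUEST tokens per client; TreeMap keeps the output order stable. */
    static Map<String, Long> requestsByClient(List<UsageRecordSketch> records) {
        Map<String, Long> totals = new TreeMap<>();
        for (UsageRecordSketch record : records) {
            totals.merge(record.clientId, record.httpRequests, Long::sum);
        }
        return totals;
    }

    /** Renders the totals as CSV with a header row. */
    static String toCsv(Map<String, Long> totals) {
        StringBuilder csv = new StringBuilder("client_id,http_requests\n");
        totals.forEach((client, total) ->
                csv.append(client).append(',').append(total).append('\n'));
        return csv.toString();
    }
}
```

Feeding in the three DataCorp sessions from the quota example (1,000 + 2,500 + 500) produces a single row with 4,000 requests for that client.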

Use Case Scenarios

How WebHarvest fits into modern data architectures

Web → Data Lake Pipeline

Scenario: Extract e-commerce data from 1000+ sites → transform → load into S3/BigQuery

Architecture:

  • Step 1: WebHarvest scrapes product data (parallel sessions)
  • Step 2: Transform to Parquet/Avro using XQuery/XSLT
  • Step 3: Upload to S3 bucket (batch)
  • Step 4: Trigger Glue/Athena for analytics

Integration:

  • AWS Lambda triggers WebHarvest via REST API
  • Session events → CloudWatch logs
  • Token usage → Cost allocation tags

ETL Pipeline Integration

Scenario: WebHarvest as data source in enterprise ETL/orchestration platforms

Platforms:

  • Argo Workflows: Kubernetes-native orchestration
  • Apache Airflow: Python-based DAG scheduling
  • Luigi: Spotify's batch processing framework
  • Prefect: Modern workflow orchestration

Example: Argo Workflow

  • Step 1: Trigger WebHarvest container
  • Step 2: Execute scraper → export JSON
  • Step 3: Transform with dbt/Spark
  • Step 4: Load to warehouse (Snowflake, BigQuery)
  • Monitoring: Session metrics → Prometheus

Intelligent Automation

Scenario: AI-driven scraping with dynamic config generation

Flow:

  • ML Model: Analyzes site → generates XPath selectors
  • Config Builder: Programmatic XML generation
  • WebHarvest: Executes generated config
  • Validation: AI validates extracted data quality
  • Feedback Loop: Retrain model on failed extractions

Message Transformation Hub

Scenario: Transform between formats for ESB/B2B integration

Use Cases:

  • EDI Transformation: Web data → EDI X12/EDIFACT formats
  • HL7 Healthcare: API data → HL7 messages
  • B2B Integration: Supplier APIs → company format
  • Format Conversion: JSON ↔ XML ↔ CSV

Integration Platforms:

  • MuleSoft Anypoint: Custom connector
  • Apache Camel: WebHarvest component
  • IBM Integration Bus: Java processor
  • Kafka Connect: Source connector

API Aggregation

Scenario: Combine data from multiple REST APIs → unified response

Example:

  • Call 5 different APIs (weather, stock, news, traffic, currency)
  • Parse JSON responses → XML
  • Merge with XQuery
  • Template → Custom JSON format
  • Return via webhook to client app

Data Integration Platform

Scenario: Integration with Fivetran, Airbyte, Segment

Approach:

  • Custom Connector: WebHarvest as source connector
  • Configuration: Store scraper configs in connector settings
  • Scheduling: Platform triggers WebHarvest via API
  • Output: Normalized JSON → Platform schema mapping
  • Monitoring: Session metrics → Platform observability

Technology Stack

Libraries and frameworks powering WebHarvest

Core Technologies

  • Java 11+: Modern Java features, lambda expressions
  • Maven: Build tool, dependency management
  • Guice: Dependency injection framework
  • SLF4J + Log4j: Logging infrastructure

HTTP & Networking

  • Apache HttpClient 5: HTTP client library
  • Connection Pooling: Efficient resource usage
  • SSL/TLS Support: Secure connections
  • Cookie Management: Session handling

XML & Data Processing

  • SAX Parser: Fast, memory-efficient XML parsing
  • XPath (Saxon): XML querying
  • XQuery (Saxon): Advanced XML transformations
  • HTMLCleaner: HTML to XML conversion

Scripting & Templating

  • JavaScript (Rhino): JS scripting engine
  • Groovy: Dynamic scripting
  • BeanShell: Lightweight scripting
  • BaseTemplater: Variable interpolation

IDE Technologies

  • Jetty 11: Embedded web server
  • WebSocket: Real-time communication
  • Monaco Editor: VS Code editor engine
  • Gson: JSON serialization

Testing

  • TestNG: Test framework
  • Mockito: Mocking framework
  • JaCoCo: Code coverage
  • 2,569 Tests: Comprehensive test suite

Continuous Dependency Updates

Dependencies are kept on actively maintained release lines and updated to the latest stable versions, including Apache HttpClient 5.x, Jetty 11, and Guice 7.x.

Build Tool Support

Maven & Gradle integration

Maven

pom.xml
<!-- Core Library -->
<dependency>
  <groupId>org.webharvest</groupId>
  <artifactId>webharvest-core</artifactId>
  <version>2.2.0</version>
</dependency>

Gradle

build.gradle
// Core Library
dependencies {
    implementation 'org.webharvest:webharvest-core:2.2.0'
}