Modular, plugin-based, extensible design
Understanding the core architecture, plugin system, and execution model that powers WebHarvest's flexibility and performance.
Modular design with clear separation of concerns
Architecture: webharvest-core is the foundation. Applications (IDE, CLI) and external plugins build on top.
Core framework and external plugins
The Foundation: Framework with 47 built-in plugins, parser, runtime engine, DI, session management, and token tracking.
Command-Line Tool: Run configurations from terminal, scripting support, batch processing.
Web IDE: Embedded Jetty server, Monaco editor, real-time WebSocket updates, per-tab workspaces.
Extensions: Database, Mail, FTP, ZIP, Web Browser - separate modules, auto-discovered when on classpath.
From XML configuration to results
How plugins are discovered, registered, and executed
@CorePlugin and @Autoscanned
@Definition + new @CorePlugin coexist
ConfigModule, ServicesModule
BaseTemplater uses static DIInjectorHelper.getInjector()
DynamicScopeContext for variable scoping
InterruptedException
<try>/<catch> support
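As a rough illustration of annotation-driven plugin registration, the sketch below maps a tag name to a plugin class by reading a runtime annotation. It is not WebHarvest's actual API: `PluginTag`, `Plugin`, `Registry`, and the two sample plugins are hypothetical stand-ins, and explicit `register` calls take the place of classpath scanning.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PluginRegistrySketch {

    /** Hypothetical marker annotation; a stand-in, not WebHarvest's @CorePlugin. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface PluginTag {
        String value(); // name of the configuration element the plugin handles
    }

    /** Hypothetical plugin contract used only for this sketch. */
    public interface Plugin {
        String execute(String input);
    }

    @PluginTag("upper")
    public static class UpperCasePlugin implements Plugin {
        public String execute(String input) {
            return input.toUpperCase();
        }
    }

    @PluginTag("trim")
    public static class TrimPlugin implements Plugin {
        public String execute(String input) {
            return input.trim();
        }
    }

    /** Maps tag names to plugin classes and creates a fresh instance per lookup. */
    public static class Registry {
        private final Map<String, Class<? extends Plugin>> plugins = new ConcurrentHashMap<>();

        public void register(Class<? extends Plugin> type) {
            PluginTag tag = type.getAnnotation(PluginTag.class);
            if (tag == null) {
                throw new IllegalArgumentException(type.getName() + " has no @PluginTag");
            }
            // putIfAbsent is atomic, so concurrent registrations cannot overwrite each other.
            if (plugins.putIfAbsent(tag.value(), type) != null) {
                throw new IllegalStateException("Duplicate plugin tag: " + tag.value());
            }
        }

        public Plugin create(String tagName) {
            Class<? extends Plugin> type = plugins.get(tagName);
            if (type == null) {
                throw new IllegalArgumentException("Unknown plugin: " + tagName);
            }
            try {
                return type.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
            }
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(UpperCasePlugin.class);
        registry.register(TrimPlugin.class);
        System.out.println(registry.create("upper").execute("hello"));
    }
}
```

A real scanner would discover annotated classes on the classpath rather than receive them one by one; the registry lookup step would look much the same.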
Production-grade execution tracking and monitoring
UUID.randomUUID() for globally unique identifiers
ConcurrentHashMap with atomic operations
awaitCompletion(timeout) for sync execution
Instant-based timing with nanosecond precision
TokenUsage for safe reporting
@Subscribe annotation for handlers
Thread-safe execution of parallel scrapers
DynamicScopeContext
AtomicLong for metrics (lock-free)
ConcurrentHashMap for registries
TokenUsage, ClientContext are immutable
awaitTermination() for cleanup
interrupt() propagates to plugins
session.cancel() → thread interrupt
InterruptedException
SessionCancelledEvent broadcast
Optional billing model for SaaS deployments - track usage, enforce quotas
| Resource | Usage | Cost |
| --- | --- | --- |
| HTTP_REQUEST | 1,250 requests | $1.25 |
| HTTP_BYTES | 45 MB | $0.0045 |
| CPU_TIME | 12,500 ms (12.5 s) | $0.00007 |
| MEMORY_PEAK | 256 MB | negligible |
| TOTAL COST | ~$1.25 | |

| Resource | Usage | Cost |
| --- | --- | --- |
| HTTP_REQUEST | 50 requests | $0.05 |
| HTTP_BYTES | 2.5 GB | $0.25 |
| CPU_TIME | 180,000 ms (3 min) | $0.001 |
| MEMORY_PEAK | 1.2 GB | $0.0012 |
| TOTAL COST | ~$0.30 | |
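The HTTP rows in both tables are consistent with rates of about $0.001 per request and $0.0001 per MB (for example, 1,250 × $0.001 = $1.25 and 2,500 MB × $0.0001 = $0.25). Those rates are inferred from the rows, not published prices. The sketch below shows one way lock-free counters and an immutable snapshot could feed such a calculation. The class and method names are hypothetical, not WebHarvest's TokenTracker API, and 1 MB is assumed to be 1,000,000 bytes.

```java
import java.math.BigDecimal;
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public final class TokenCostSketch {

    /** Two of the resource types named in the tables. */
    public enum Resource { HTTP_REQUEST, HTTP_BYTES }

    // Assumed rates, inferred from the table rows (not published prices).
    private static final BigDecimal USD_PER_REQUEST = new BigDecimal("0.001");
    private static final BigDecimal USD_PER_MB = new BigDecimal("0.0001");
    private static final BigDecimal BYTES_PER_MB = new BigDecimal("1000000"); // decimal MB assumed

    /** Hypothetical tracker: one AtomicLong per resource, so updates need no locks. */
    public static final class Tracker {
        private final Map<Resource, AtomicLong> counters = new EnumMap<>(Resource.class);

        public Tracker() {
            // The map is fully populated here and never modified afterwards,
            // so concurrent reads of it are safe.
            for (Resource r : Resource.values()) {
                counters.put(r, new AtomicLong());
            }
        }

        public void add(Resource resource, long amount) {
            counters.get(resource).addAndGet(amount);
        }

        /** Returns an unmodifiable copy; counters are read one by one, not atomically as a group. */
        public Map<Resource, Long> snapshot() {
            Map<Resource, Long> copy = new EnumMap<>(Resource.class);
            counters.forEach((r, c) -> copy.put(r, c.get()));
            return Collections.unmodifiableMap(copy);
        }
    }

    /** Prices a snapshot using the assumed rates above. */
    public static BigDecimal cost(Map<Resource, Long> usage) {
        BigDecimal requests = USD_PER_REQUEST
                .multiply(BigDecimal.valueOf(usage.get(Resource.HTTP_REQUEST)));
        BigDecimal traffic = USD_PER_MB
                .multiply(BigDecimal.valueOf(usage.get(Resource.HTTP_BYTES)))
                .divide(BYTES_PER_MB); // exact: dividing by a power of ten always terminates
        return requests.add(traffic);
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        tracker.add(Resource.HTTP_REQUEST, 1_250);
        tracker.add(Resource.HTTP_BYTES, 45_000_000L);
        System.out.println("Estimated cost in USD: " + cost(tracker.snapshot()));
    }
}
```

With the first table's inputs (1,250 requests, 45 MB) the result is the sum of that table's request and traffic rows. BigDecimal is used so that small per-unit rates do not pick up binary floating-point rounding error.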
AtomicLong for each resource type (lock-free)
TokenUsage for safe reporting
TokenTracker
TokenUsage → JSON/CSV for billing systems
How WebHarvest fits into modern data architectures
Scenario: Extract e-commerce data from 1000+ sites → transform → load into S3/BigQuery
Scenario: WebHarvest as data source in enterprise ETL/orchestration platforms
Scenario: AI-driven scraping with dynamic config generation
Scenario: Transform between formats for ESB/B2B integration
Scenario: Combine data from multiple REST APIs → unified response
Scenario: Integration with Fivetran, Airbyte, Segment
Libraries and frameworks powering WebHarvest
All dependencies are actively maintained and kept on current stable release lines, including Apache HttpClient 5.x, Jetty 11, and Guice 7.x.
Maven & Gradle integration
<!-- Core Library -->
<dependency>
    <groupId>org.webharvest</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>2.2.0</version>
</dependency>
// Core Library
dependencies {
    implementation 'org.webharvest:webharvest-core:2.2.0'
}