Modular, plugin-based, extensible design
Understanding the core architecture, plugin system, and execution model that powers WebHarvest's flexibility and performance.
Modular design with clear separation of concerns
Architecture: webharvest-core is the foundation. Applications (IDE, CLI) and external plugins build on top.
Core framework and external plugins
The Foundation: Framework with 47 built-in plugins, parser, runtime engine, DI, session management, and token tracking.
Command-Line Tool: Run configurations from terminal, scripting support, batch processing.
Web IDE: Embedded Jetty server, Monaco editor, real-time WebSocket updates, per-tab workspaces.
Extensions: Database, Mail, FTP, ZIP, Web Browser - separate modules, auto-discovered when on classpath.
From XML configuration to results
How plugins are discovered, registered, and executed
@CorePlugin and @Autoscanned
@Definition + new @CorePlugin coexist
ConfigModule, ServicesModule
BaseTemplater uses static DIInjectorHelper.getInjector()
DynamicScopeContext for variable scoping
InterruptedException
<try>/<catch> support
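To make the discover, register, execute flow concrete, here is a minimal sketch of an annotation-driven plugin registry. It is not WebHarvest's actual implementation: the `CorePlugin` annotation below is a hypothetical stand-in with an assumed `name` attribute, and `Plugin`, `UpperCasePlugin`, and `Registry` are names invented for illustration. Classpath scanning is left out; classes are registered explicitly.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PluginRegistrySketch {

    // Hypothetical stand-in for @CorePlugin; the real annotation's
    // attributes are not shown in this document.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface CorePlugin {
        String name();
    }

    // Hypothetical plugin contract, for illustration only.
    interface Plugin {
        String execute(String input);
    }

    @CorePlugin(name = "upper")
    static class UpperCasePlugin implements Plugin {
        @Override
        public String execute(String input) {
            return input.toUpperCase();
        }
    }

    static final class Registry {
        private final Map<String, Class<? extends Plugin>> plugins = new ConcurrentHashMap<>();

        // Registration: read the annotation reflectively and record the class under its name.
        void register(Class<? extends Plugin> type) {
            CorePlugin meta = type.getAnnotation(CorePlugin.class);
            if (meta == null) {
                throw new IllegalArgumentException(type.getName() + " lacks @CorePlugin");
            }
            if (plugins.putIfAbsent(meta.name(), type) != null) {
                throw new IllegalStateException("Duplicate plugin name: " + meta.name());
            }
        }

        // Execution: look up by name, instantiate on demand, and run.
        String run(String name, String input) throws ReflectiveOperationException {
            Class<? extends Plugin> type = plugins.get(name);
            if (type == null) {
                throw new IllegalArgumentException("Unknown plugin: " + name);
            }
            return type.getDeclaredConstructor().newInstance().execute(input);
        }
    }

    public static void main(String[] args) throws Exception {
        Registry registry = new Registry();
        registry.register(UpperCasePlugin.class);
        System.out.println(registry.run("upper", "hello"));
    }
}
```

Keying the registry by the annotation's name keeps lookup independent of class names, and `putIfAbsent` rejects a second plugin that claims the same name.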
Production-grade execution tracking and monitoring
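As a rough illustration of how the tracking pieces in this section could fit together, the sketch below combines `UUID.randomUUID()` identifiers, a `ConcurrentHashMap` registry, `Instant`-based timing, and a blocking `awaitCompletion(timeout)`. The `Session` and `SessionRegistry` classes are hypothetical and use only the JDK; WebHarvest's real session classes may look quite different.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SessionTrackingSketch {

    // Hypothetical session handle, for illustration only.
    static final class Session {
        final String id = UUID.randomUUID().toString();
        final Instant startedAt = Instant.now();
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile Instant finishedAt;

        // Called by the worker when the run ends.
        void complete() {
            finishedAt = Instant.now();
            done.countDown();
        }

        // Blocks until the session completes; returns false if the timeout elapses first.
        boolean awaitCompletion(Duration timeout) throws InterruptedException {
            return done.await(timeout.toMillis(), TimeUnit.MILLISECONDS);
        }

        // Elapsed time so far, or total run time once completed.
        Duration elapsed() {
            Instant end = finishedAt;
            return Duration.between(startedAt, end != null ? end : Instant.now());
        }
    }

    static final class SessionRegistry {
        private final Map<String, Session> sessions = new ConcurrentHashMap<>();

        Session open() {
            Session session = new Session();
            sessions.putIfAbsent(session.id, session); // atomic insert, no external lock
            return session;
        }

        boolean close(String id) {
            return sessions.remove(id) != null; // atomic removal
        }

        int active() {
            return sessions.size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SessionRegistry registry = new SessionRegistry();
        Session session = registry.open();
        new Thread(session::complete).start(); // stands in for a scraper run
        boolean finished = session.awaitCompletion(Duration.ofSeconds(5));
        System.out.println(session.id + " finished=" + finished
                + " after " + session.elapsed().toNanos() + " ns");
        registry.close(session.id);
    }
}
```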
UUID.randomUUID() for globally unique identifiers
ConcurrentHashMap with atomic operations
awaitCompletion(timeout) for sync execution
Instant-based timing with nanosecond precision
TokenUsage for safe reporting
@Subscribe annotation for handlers
Thread-safe execution of parallel scrapers
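The cancellation path described in this section (`session.cancel()` leading to a thread interrupt and an `InterruptedException` in the running plugin) can be approximated with plain `java.util.concurrent`. In this sketch `Future.cancel(true)` stands in for `session.cancel()`, `Thread.sleep` stands in for blocking plugin work, and the `SessionCancelledEvent` broadcast is omitted. All names are illustrative assumptions, not WebHarvest's API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancellationSketch {

    // Starts a long-running task, cancels it, and reports whether the task
    // observed the interrupt and the pool shut down cleanly.
    static boolean runAndCancel() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);

        Future<?> task = pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stands in for blocking plugin work
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
                Thread.currentThread().interrupt(); // restore the flag for outer code
            }
        });

        started.await();    // make sure the worker is actually running
        task.cancel(true);  // stand-in for session.cancel(): interrupts the worker
        pool.shutdown();
        boolean terminated = pool.awaitTermination(5, TimeUnit.SECONDS); // cleanup
        return terminated && sawInterrupt.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cancelled cleanly: " + runAndCancel());
    }
}
```

Restoring the interrupt flag after catching `InterruptedException` lets any enclosing code notice the cancellation as well, which matters when plugins are nested.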
DynamicScopeContext
AtomicLong for metrics (lock-free)
ConcurrentHashMap for registries
TokenUsage, ClientContext are immutable
awaitTermination() for cleanup
interrupt() propagates to plugins
session.cancel() → thread interrupt
InterruptedException
SessionCancelledEvent broadcast
Optional billing model for SaaS deployments - track usage, enforce quotas
Resource | Usage | Cost
HTTP_REQUEST | 1,250 requests | $1.25
HTTP_BYTES | 45 MB | $0.0045
CPU_TIME | 12,500 ms (12.5 s) | $0.00007
MEMORY_PEAK | 256 MB | negligible
TOTAL COST | | ~$1.25

Resource | Usage | Cost
HTTP_REQUEST | 50 requests | $0.05
HTTP_BYTES | 2.5 GB | $0.25
CPU_TIME | 180,000 ms (3 min) | $0.001
MEMORY_PEAK | 1.2 GB | $0.0012
TOTAL COST | | ~$0.30
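The request and byte rows in both examples are consistent with flat rates of $0.001 per request and $0.0001 per MB. Those rates are inferred from the figures, not stated by the source, and 1 MB is taken as 1,000,000 bytes. Under those assumptions, a minimal sketch of lock-free counting plus an immutable usage snapshot might look like this; `Tracker` and `Usage` are hypothetical names, not the actual `TokenTracker` or `TokenUsage` classes.

```java
import java.math.BigDecimal;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class TokenBillingSketch {

    enum Resource { HTTP_REQUEST, HTTP_BYTES }

    // Assumed rates, inferred from the example figures.
    static final BigDecimal PER_REQUEST = new BigDecimal("0.001");
    static final BigDecimal PER_MEGABYTE = new BigDecimal("0.0001");

    // One AtomicLong per resource type; updates never take a lock.
    static final class Tracker {
        private final Map<Resource, AtomicLong> counters = new EnumMap<>(Resource.class);

        Tracker() {
            for (Resource r : Resource.values()) {
                counters.put(r, new AtomicLong());
            }
        }

        void record(Resource resource, long amount) {
            counters.get(resource).addAndGet(amount);
        }

        // Copies the current counts into an immutable value for reporting.
        Usage snapshot() {
            return new Usage(counters.get(Resource.HTTP_REQUEST).get(),
                             counters.get(Resource.HTTP_BYTES).get());
        }
    }

    // Immutable snapshot: safe to hand to a billing or reporting thread.
    static final class Usage {
        final long requests;
        final long bytes;

        Usage(long requests, long bytes) {
            this.requests = requests;
            this.bytes = bytes;
        }

        BigDecimal cost() {
            BigDecimal megabytes = BigDecimal.valueOf(bytes).movePointLeft(6);
            return PER_REQUEST.multiply(BigDecimal.valueOf(requests))
                    .add(PER_MEGABYTE.multiply(megabytes));
        }

        // One CSV row: requests,bytes,cost
        String toCsvRow() {
            return requests + "," + bytes + "," + cost().stripTrailingZeros().toPlainString();
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        tracker.record(Resource.HTTP_REQUEST, 1_250);
        tracker.record(Resource.HTTP_BYTES, 45_000_000L);
        System.out.println(tracker.snapshot().toCsvRow());
    }
}
```

`BigDecimal` is used for the cost so that small per-unit rates do not accumulate binary floating-point error. CPU time and peak memory are left out because their rates are harder to infer from the two examples.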
AtomicLong for each resource type (lock-free)
TokenUsage for safe reporting
TokenTracker
TokenUsage → JSON/CSV for billing systems
How WebHarvest fits into modern data architectures
Scenario: Extract e-commerce data from 1000+ sites → transform → load into S3/BigQuery
Scenario: WebHarvest as data source in enterprise ETL/orchestration platforms
Scenario: AI-driven scraping with dynamic config generation
Scenario: Transform between formats for ESB/B2B integration
Scenario: Combine data from multiple REST APIs → unified response
Scenario: Integration with Fivetran, Airbyte, Segment
Libraries and frameworks powering WebHarvest
All dependencies are actively maintained and kept on their latest stable versions, including Apache HttpClient 5.x, Jetty 11, and Guice 7.x.
Maven & Gradle integration
<!-- Core Library -->
<dependency>
    <groupId>org.webharvest</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>2.2.0</version>
</dependency>
// Core Library
dependencies {
    implementation 'org.webharvest:webharvest-core:2.2.0'
}