Creating Custom Plugins

Extend WebHarvest with your own functionality

Complete guide to building custom plugins with the modern @CorePlugin architecture. Integrate any API, service, or data source into your scraping workflows.

Plugin Development

Why Create Custom Plugins?

When the 47 core plugins aren't enough

WebHarvest's 47 core plugins cover most web scraping scenarios. A custom plugin makes sense when you need to:

Integration Needs

  • Integrate with proprietary APIs
  • Connect to custom databases
  • Use specialized libraries

Custom Processing

  • Implement business logic
  • Add custom transformations
  • Create a domain-specific language (DSL)

Plugin Architecture (v2.2.0)

Modern @CorePlugin vs Legacy @Definition

New Architecture (v2.2.0)

Recommended for new plugins

  • Use @CorePlugin annotation
  • Extend AbstractPlugin
  • Automatic discovery
  • Simpler, cleaner code
  • Better dependency injection

Legacy Architecture

Still supported

  • Use @Definition annotation
  • Extend AbstractProcessor
  • Requires definition class
  • More boilerplate
  • Backward compatibility

This guide focuses on the new @CorePlugin architecture. It's simpler, requires less code, and is the recommended approach for v2.2.0+.

Quick Start (5 minutes)

Your first custom plugin

1. Maven Setup

Add the WebHarvest core dependency:

pom.xml
<dependency>
    <groupId>org.webharvest</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>2.2.0</version>
</dependency>

2. Create Plugin Class

Extend AbstractPlugin and add the @CorePlugin annotation:

HelloWorldPlugin.java
package com.example.plugins;

import org.webharvest.plugin.AbstractPlugin;
import org.webharvest.plugin.annotations.CorePlugin;
import org.webharvest.plugin.PluginConfiguration;
import org.webharvest.runtime.DynamicScopeContext;
import org.webharvest.runtime.variables.Variable;
import org.webharvest.runtime.variables.NodeVariable;

@CorePlugin(elementName = "hello-world")
public class HelloWorldPlugin extends AbstractPlugin {
    
    @Override
    protected Variable doExecute(DynamicScopeContext context) 
            throws InterruptedException {
        return new NodeVariable("Hello from custom plugin!");
    }
    
    @Override
    protected void doValidateConfiguration(PluginConfiguration config) {
        // Validation logic (optional)
    }
    
    @Override
    protected void doInitialize(PluginConfiguration config) {
        // Initialization logic (optional)
    }
}

3. Package & Deploy

Build the JAR and add it to the WebHarvest classpath:

terminal
mvn clean package
cp target/my-plugin-1.0.0.jar $WEBHARVEST_HOME/lib/

# Plugin is auto-discovered on next run!

4. Use in Configuration

Your plugin is now available in XML configurations:

config.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <def var="result">
        <hello-world/>
    </def>
</config>

Complete Example: Plugin with Parameters

Real-world plugin with attributes, body, and validation

Base64 Encoder Plugin

This plugin encodes text to Base64:

Base64Plugin.java
package com.example.plugins;

import org.webharvest.plugin.AbstractPlugin;
import org.webharvest.plugin.annotations.CorePlugin;
import org.webharvest.plugin.PluginConfiguration;
import org.webharvest.plugin.PluginMetadata;
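import org.webharvest.plugin.exceptions.PluginException;
// Assumed to live alongside PluginException; adjust if your version differs
import org.webharvest.plugin.exceptions.PluginValidationException;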
import org.webharvest.runtime.DynamicScopeContext;
import org.webharvest.runtime.variables.Variable;
import org.webharvest.runtime.variables.NodeVariable;
import java.util.Base64;

/**
 * Encodes text to Base64.
 * 
 * Usage:
 * <base64 charset="UTF-8">text to encode</base64>
 */
@CorePlugin(elementName = "base64")
public class Base64Plugin extends AbstractPlugin {
    
    // Resolved in doInitialize(); defaults to UTF-8
    private java.nio.charset.Charset charset =
        java.nio.charset.StandardCharsets.UTF_8;
    
    @Override
    protected void doInitialize(PluginConfiguration config) throws PluginException {
        // Resolve the optional charset attribute (defaults to UTF-8)
        this.charset = java.nio.charset.Charset.forName(
            config.getAttribute("charset", "UTF-8")
        );
    }
    
    @Override
    protected void doValidateConfiguration(PluginConfiguration config) {
        // Validate charset if provided
        String charsetAttr = config.getAttribute("charset");
        if (charsetAttr != null && !isValidCharset(charsetAttr)) {
            throw new PluginValidationException(
                "Invalid charset: " + charsetAttr
            );
        }
    }
    
    @Override
    protected Variable doExecute(DynamicScopeContext context) 
            throws InterruptedException {
        // Get body content
        Variable bodyResult = executeBody(context);
        String input = bodyResult.toString();
        
        // Encode to Base64. getBytes(Charset) throws no checked exception,
        // unlike getBytes(String), and encodeToString avoids decoding the
        // result with the platform default charset.
        String encoded = Base64.getEncoder().encodeToString(
            input.getBytes(charset)
        );
        
        return new NodeVariable(encoded);
    }
    
    @Override
    public PluginMetadata getMetadata() {
        return new PluginMetadata() {
            @Override
            public String getName() { return "base64"; }
            
            @Override
            public String getDescription() { 
                return "Encodes text to Base64"; 
            }
            
            @Override
            public String getVersion() { return "1.0.0"; }
        };
    }
    
    private boolean isValidCharset(String charset) {
        try {
            java.nio.charset.Charset.forName(charset);
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}

Usage Example:

config.xml
<def var="encoded">
    <base64 charset="UTF-8">Hello, WebHarvest!</base64>
</def>

<!-- Result: SGVsbG8sIFdlYkhhcnZlc3Qh -->
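
The result in the comment above corresponds to the exact text "Hello, WebHarvest!" with no surrounding whitespace. It can be checked outside WebHarvest with the JDK alone. The sketch below is a standalone illustration: Base64Check is a hypothetical class name, not part of the WebHarvest API. It also shows that indentation or line breaks inside the element body change the output unless the text is trimmed first. Whether WebHarvest trims body text for you is not covered in this guide, so treat that as something to verify.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical standalone helper; not part of the WebHarvest API.
final class Base64Check {

    // Encodes text as UTF-8 bytes, then as a Base64 string.
    static String encode(String text) {
        return Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String inline = "Hello, WebHarvest!";
        String indented = "\n        Hello, WebHarvest!\n    ";

        System.out.println(encode(inline));          // SGVsbG8sIFdlYkhhcnZlc3Qh
        System.out.println(encode(indented.trim())); // same as the line above
        System.out.println(encode(indented).equals(encode(inline))); // false
    }
}
```

If your plugin should ignore layout whitespace in the XML, trimming the body text before encoding is one option.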

Automatic Plugin Discovery

No manual registration required

How It Works

WebHarvest v2.2.0 automatically discovers plugins using classpath scanning. Any class annotated with @CorePlugin is registered automatically.
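
To make the idea concrete, here is a minimal sketch of annotation-based registration in plain Java. It is not WebHarvest's implementation: DemoPlugin, HelloDemo, NotAPlugin, and DemoRegistry are hypothetical stand-ins, and the classpath scan itself is replaced by an explicit list of candidate classes.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for @CorePlugin; the real annotation may differ.
@Retention(RetentionPolicy.RUNTIME) // must be visible to reflection at runtime
@Target(ElementType.TYPE)
@interface DemoPlugin {
    String elementName();
}

@DemoPlugin(elementName = "hello-world")
class HelloDemo { }

class NotAPlugin { }

// Hypothetical registry: maps XML element names to plugin classes.
final class DemoRegistry {
    private final Map<String, Class<?>> byElementName = new LinkedHashMap<>();

    // Registers every candidate class that carries the annotation.
    void registerAnnotated(List<Class<?>> candidates) {
        for (Class<?> candidate : candidates) {
            DemoPlugin marker = candidate.getAnnotation(DemoPlugin.class);
            if (marker == null) {
                continue; // not a plugin, skip it
            }
            Class<?> previous =
                byElementName.putIfAbsent(marker.elementName(), candidate);
            if (previous != null) {
                throw new IllegalStateException(
                    "Duplicate element name: " + marker.elementName());
            }
        }
    }

    // Returns the registered class, or null if the element name is unknown.
    Class<?> lookup(String elementName) {
        return byElementName.get(elementName);
    }
}
```

Rejecting duplicate element names at registration time is a design choice in this sketch; it surfaces naming clashes between JARs early instead of at execution time.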

Method 1: Add to Classpath

terminal
# Copy JAR to lib directory
cp my-plugin.jar $WEBHARVEST_HOME/lib/

# Or run with classpath
java -cp webharvest-core.jar:my-plugin.jar \
    org.webharvest.Main config.xml

Method 2: Package Scanning (Custom Packages)

terminal
# Scan custom packages
java -Dwebharvest.plugin.packages=com.example.plugins,com.mycompany \
     -jar webharvest-cli.jar config.xml

Method 3: Programmatic Registration

Main.java
import org.webharvest.plugin.scanner.PluginPackageScanner;
import org.webharvest.definition.CorePluginRegistry;

public class Main {
    public static void main(String[] args) {
        CorePluginRegistry registry = CorePluginRegistry.getInstance();
        PluginPackageScanner scanner = new PluginPackageScanner(registry);

        // Scan your custom packages
        scanner.scanPackages("com.example.plugins", "com.mycompany.webharvest");
    }
}

Plugin Lifecycle

Understanding the execution flow

1. Initialize

doInitialize() is called once when the plugin is loaded. Use it to parse attributes, set up resources, and validate configuration.

2. Execute

doExecute() is called for each invocation. Use it to access the context, execute the body, process data, and return the result.

3. Cleanup

cleanup() is called when the plugin is unloaded. Use it to close connections, release resources, and clean up state.
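
The call order can be illustrated with a small stand-in class that only records which hook ran. LifecycleDemo and its methods are hypothetical and do not extend AbstractPlugin; the point is the contract: initialize once, execute any number of times, then clean up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a plugin; it only records which hook ran.
final class LifecycleDemo {
    private final List<String> calls = new ArrayList<>();
    private boolean initialized;

    void initialize() {
        if (initialized) {
            throw new IllegalStateException("initialize() must run only once");
        }
        initialized = true;
        calls.add("initialize");
    }

    String execute(String input) {
        if (!initialized) {
            throw new IllegalStateException("execute() called before initialize()");
        }
        calls.add("execute");
        return input.toUpperCase(Locale.ROOT); // placeholder for real work
    }

    void cleanup() {
        initialized = false;
        calls.add("cleanup");
    }

    // Returns an immutable snapshot of the recorded calls.
    List<String> calls() {
        return List.copyOf(calls);
    }
}
```

Calling initialize(), then execute("a") and execute("b"), then cleanup() records the sequence initialize, execute, execute, cleanup.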

Advanced Features

Take your plugins to the next level

Attributes & Parameters

Access XML attributes in your plugin:

Java
String url = config.getAttribute("url");
int timeout = Integer.parseInt(
    config.getAttribute("timeout", "30000")
);
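
Integer.parseInt throws NumberFormatException on malformed input, so a typo in the XML would surface as a low-level error. A minimal sketch of a more defensive parse follows; AttributeParsing is a hypothetical helper, and a plain Map stands in for PluginConfiguration so the example compiles on its own.

```java
import java.util.Map;

// Hypothetical helper; a Map stands in for PluginConfiguration here.
final class AttributeParsing {

    // Reads the optional "timeout" attribute, defaulting to 30000 ms.
    static int parseTimeoutMillis(Map<String, String> attributes) {
        String raw = attributes.getOrDefault("timeout", "30000");
        int timeout;
        try {
            timeout = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Name the attribute and the offending value in the message.
            throw new IllegalArgumentException(
                "Attribute 'timeout' must be an integer, got: '" + raw + "'", e);
        }
        if (timeout <= 0) {
            throw new IllegalArgumentException(
                "Attribute 'timeout' must be positive, got: " + timeout);
        }
        return timeout;
    }
}
```

In a real plugin this check would likely belong in doValidateConfiguration(), so that a bad value fails before any execution starts.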

Execute Body

Process child elements:

Java
// Execute all child elements
Variable bodyResult = executeBody(context);
String content = bodyResult.toString();

Access Context

Read/write variables, get session:

Java
// Get variable
Variable var = context.getVar("myVar");

// Set variable
context.setLocalVar("result", newValue);

// Get session
ScraperSession session = context.getSession();

Event System

Publish events for monitoring:

Java
EventBus eventBus = context.getEventBus();
eventBus.publish(new CustomEvent(data));

Token Tracking

Track metrics and performance:

Java
ScraperSession session = context.getSession();
session.incrementToken(TokenType.HTTP_REQUEST, 1);
session.incrementToken(TokenType.HTTP_BYTES, bytes);

Error Handling

Proper exception handling:

Java
try {
    // Plugin logic
} catch (IOException e) {
    throw new PluginException(
        "Failed to process: " + e.getMessage(), e
    );
}
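
The same pattern as a compilable sketch: the low-level exception is translated into one plugin-level type, the message names what failed, and the original exception is kept as the cause. DemoPluginException, IoTask, and ErrorHandlingDemo are hypothetical stand-ins (DemoPluginException takes the place of PluginException so the example has no WebHarvest dependency).

```java
import java.io.IOException;

// Hypothetical stand-in for PluginException, so the sketch compiles on its own.
class DemoPluginException extends Exception {
    DemoPluginException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical I/O task used only for this illustration.
interface IoTask {
    String run() throws IOException;
}

final class ErrorHandlingDemo {

    // Runs the task and translates I/O failures into the plugin-level type.
    static String runWrapped(String resource, IoTask task)
            throws DemoPluginException {
        try {
            return task.run();
        } catch (IOException e) {
            // Say what failed and keep the original exception as the cause.
            throw new DemoPluginException(
                "Failed to process '" + resource + "': " + e.getMessage(), e);
        }
    }
}
```

Keeping the cause means the full stack trace of the underlying failure is still available in logs.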

Creating Extension Modules

Separate Maven modules like FTP, Mail, Database

Project Structure

Extension modules are separate Maven projects with their own dependencies:

project structure
webharvest-myextension/
├── pom.xml
├── README.md
└── src/
    └── main/
        ├── java/
        │   └── org/webharvest/plugin/myext/
        │       ├── MyExtPlugin.java
        │       └── MyExtException.java
        └── resources/
            └── META-INF/
                └── services/
                    └── org.webharvest.plugin.Plugin

pom.xml Example:

pom.xml
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.webharvest</groupId>
    <artifactId>webharvest-myextension</artifactId>
    <version>2.2.0</version>
    
    <dependencies>
        <!-- Core dependency -->
        <dependency>
            <groupId>org.webharvest</groupId>
            <artifactId>webharvest-core</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>
        
        <!-- Your heavy dependency -->
        <dependency>
            <groupId>com.example</groupId>
            <artifactId>heavy-library</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>
</project>

Testing Your Plugin

Unit tests and integration tests

Unit Test Example

Base64PluginTest.java
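import org.junit.Test;
import static org.junit.Assert.*;

import com.example.plugins.Base64Plugin;
import org.webharvest.plugin.PluginConfiguration;
import org.webharvest.runtime.DynamicScopeContext;
import org.webharvest.runtime.variables.Variable;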

public class Base64PluginTest {
    
    @Test
    public void shouldEncodeText() throws Exception {
        // Create plugin
        Base64Plugin plugin = new Base64Plugin();
        
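        // Create mock configuration (MockPluginConfiguration is a test
        // double you provide; it is not defined in this guide)
        PluginConfiguration config = new MockPluginConfiguration();
        plugin.initialize(config);
        
        // Create mock context with body (createContextWithBody is a
        // helper you write yourself)
        DynamicScopeContext context = createContextWithBody("Hello");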
        
        // Execute
        Variable result = plugin.execute(context);
        
        // Verify
        assertEquals("SGVsbG8=", result.toString());
    }
}

Best Practices

Write production-ready plugins

Documentation

  • Write JavaDoc comments
  • Include usage examples
  • Document attributes
  • Explain return values

Validation

  • Validate required attributes
  • Check parameter types
  • Provide clear error messages
  • Fail fast on invalid config

Naming Conventions

  • Element name in lowercase-hyphen form: hello-world
  • Class name in PascalCase with a Plugin suffix: HelloWorldPlugin
  • Package: com.example.plugins
  • Descriptive and clear

Testing

  • Write unit tests
  • Test edge cases
  • Mock dependencies
  • Test error handling

Thread Safety

  • Avoid mutable state
  • Use context for data
  • Synchronize shared resources
  • Handle interrupts
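
As a rough illustration of two of these points, the sketch below keeps all working state in local variables and checks for interruption inside its loop. InterruptAwareLoop is a hypothetical helper, not part of the WebHarvest API; in a plugin, the same check would sit inside doExecute(), which already declares InterruptedException.

```java
import java.util.List;

// Hypothetical helper; not part of the WebHarvest API.
final class InterruptAwareLoop {

    // All working state is local, so concurrent calls cannot interfere.
    static int totalLength(List<String> items) throws InterruptedException {
        int total = 0;
        for (String item : items) {
            // Thread.interrupted() clears the flag, so rethrow to tell the caller.
            if (Thread.interrupted()) {
                throw new InterruptedException("Cancelled while processing items");
            }
            total += item.length();
        }
        return total;
    }
}
```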

Performance

  • Minimize object creation
  • Cache expensive operations
  • Use streaming for large data
  • Track tokens for metrics

Real-World Examples

Learn from existing plugins

Example 1: HTTP Client Plugin

Study how HttpPlugin handles requests, headers, and responses:

Reference
Source: webharvest-core/src/main/java/org/webharvest/plugin/core/HttpPlugin.java

Features to learn:
• Attribute handling (url, method, timeout)
• Child element processing (headers, params)
• Resource management (HTTP connections)
• Error handling (network failures)
• Token tracking (HTTP_REQUEST, HTTP_BYTES)

Example 2: Template Plugin

See how TemplatePlugin processes dynamic content:

Reference
Source: webharvest-core/src/main/java/org/webharvest/plugin/core/TemplatePlugin.java

Features to learn:
• Variable substitution
• Context manipulation
• Body execution
• String processing

Example 3: Mail Extension Module

Learn how extension modules work:

Reference
Source: webharvest-mail/src/main/java/org/webharvest/plugin/mail/MailPlugin.java

Features to learn:
• External dependencies (JavaMail)
• Separate Maven module
• Complex configuration
• Resource cleanup
