Extend WebHarvest with your own functionality
Complete guide to building custom plugins with the modern @CorePlugin architecture. Integrate any API, service, or data source into your scraping workflows.
When the 47 core plugins aren't enough
WebHarvest's 47 core plugins cover most web scraping scenarios. You might still need a custom plugin to integrate an API, service, or data source that they don't cover.
Modern @CorePlugin vs Legacy @Definition
Modern (recommended for new plugins): @CorePlugin annotation, base class AbstractPlugin
Legacy (still supported): @Definition annotation, base class AbstractProcessor
Your first custom plugin
Add WebHarvest core dependency:
<dependency>
    <groupId>org.webharvest</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>2.2.0</version>
</dependency>
Extend AbstractPlugin and add @CorePlugin annotation:
package com.example.plugins;

import org.webharvest.plugin.AbstractPlugin;
import org.webharvest.plugin.annotations.CorePlugin;
import org.webharvest.plugin.PluginConfiguration;
import org.webharvest.runtime.DynamicScopeContext;
import org.webharvest.runtime.variables.Variable;
import org.webharvest.runtime.variables.NodeVariable;

@CorePlugin(elementName = "hello-world")
public class HelloWorldPlugin extends AbstractPlugin {

    @Override
    protected Variable doExecute(DynamicScopeContext context)
            throws InterruptedException {
        return new NodeVariable("Hello from custom plugin!");
    }

    @Override
    protected void doValidateConfiguration(PluginConfiguration config) {
        // Validation logic (optional)
    }

    @Override
    protected void doInitialize(PluginConfiguration config) {
        // Initialization logic (optional)
    }
}
Build JAR and add to WebHarvest classpath:
mvn clean package
cp target/my-plugin-1.0.0.jar $WEBHARVEST_HOME/lib/
# Plugin is auto-discovered on next run!
Your plugin is now available in XML configurations:
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
    <def var="result">
        <hello-world/>
    </def>
</config>
Real-world plugin with attributes, body, and validation
This plugin encodes text to Base64:
package com.example.plugins;

import org.webharvest.plugin.AbstractPlugin;
import org.webharvest.plugin.annotations.CorePlugin;
import org.webharvest.plugin.PluginConfiguration;
import org.webharvest.plugin.PluginMetadata;
import org.webharvest.plugin.exceptions.PluginException;
// Assumption: PluginValidationException lives in the same package as
// PluginException; adjust this import to match your WebHarvest version.
import org.webharvest.plugin.exceptions.PluginValidationException;
import org.webharvest.runtime.DynamicScopeContext;
import org.webharvest.runtime.variables.Variable;
import org.webharvest.runtime.variables.NodeVariable;

import java.nio.charset.Charset;
import java.util.Base64;

/**
 * Encodes text to Base64.
 *
 * Usage:
 * <base64 charset="UTF-8">text to encode</base64>
 */
@CorePlugin(elementName = "base64")
public class Base64Plugin extends AbstractPlugin {

    private String charset = "UTF-8";

    @Override
    protected void doInitialize(PluginConfiguration config) throws PluginException {
        // Get optional charset attribute
        this.charset = config.getAttribute("charset", "UTF-8");
    }

    @Override
    protected void doValidateConfiguration(PluginConfiguration config) {
        // Validate charset if provided
        String charsetAttr = config.getAttribute("charset");
        if (charsetAttr != null && !isValidCharset(charsetAttr)) {
            throw new PluginValidationException(
                "Invalid charset: " + charsetAttr
            );
        }
    }

    @Override
    protected Variable doExecute(DynamicScopeContext context)
            throws InterruptedException {
        // Get body content
        Variable bodyResult = executeBody(context);
        String input = bodyResult.toString();

        // Encode to Base64. Charset.forName avoids the checked
        // UnsupportedEncodingException that getBytes(String) declares.
        String encoded = Base64.getEncoder().encodeToString(
            input.getBytes(Charset.forName(charset))
        );
        return new NodeVariable(encoded);
    }

    @Override
    public PluginMetadata getMetadata() {
        return new PluginMetadata() {
            @Override
            public String getName() { return "base64"; }

            @Override
            public String getDescription() {
                return "Encodes text to Base64";
            }

            @Override
            public String getVersion() { return "1.0.0"; }
        };
    }

    private boolean isValidCharset(String charset) {
        try {
            Charset.forName(charset);
            return true;
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return false;
        }
    }
}
<def var="encoded">
    <base64 charset="UTF-8">Hello, WebHarvest!</base64>
</def>
<!-- Result: SGVsbG8sIFdlYkhhcnZlc3Qh -->
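The result shown above is the encoding of the text with no surrounding whitespace. This page does not say whether WebHarvest trims body content before the plugin sees it, so the following standalone sketch may help when checking expected values. It uses only the JDK; the class name Base64Check is hypothetical and not part of WebHarvest.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Standalone check using only the JDK; not part of WebHarvest.
class Base64Check {

    // Encodes the given text as Base64 over its UTF-8 bytes.
    static String encode(String text) {
        return Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String trimmed = "Hello, WebHarvest!";
        String padded = "\nHello, WebHarvest!\n";

        System.out.println(encode(trimmed));
        System.out.println(encode(padded));
        // Whitespace is encoded like any other byte, so the outputs differ.
        System.out.println(encode(trimmed).equals(encode(padded))); // false
    }
}
```

If the two outputs matter for your workflow, trimming the body inside doExecute() is one way to make the plugin insensitive to XML formatting.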
No manual registration required
WebHarvest v2.2.0 automatically discovers plugins using classpath scanning.
Any class annotated with @CorePlugin is registered automatically.
# Copy JAR to lib directory
cp my-plugin.jar $WEBHARVEST_HOME/lib/
# Or run with classpath
java -cp webharvest-core.jar:my-plugin.jar \
org.webharvest.Main config.xml
# Scan custom packages
java -Dwebharvest.plugin.packages=com.example.plugins,com.mycompany \
-jar webharvest-cli.jar config.xml
import org.webharvest.plugin.scanner.PluginPackageScanner;
import org.webharvest.definition.CorePluginRegistry;
CorePluginRegistry registry = CorePluginRegistry.getInstance();
PluginPackageScanner scanner = new PluginPackageScanner(registry);
// Scan your custom packages
scanner.scanPackages("com.example.plugins", "com.mycompany.webharvest");
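To illustrate what annotation-based registration involves, here is a minimal sketch under stated assumptions. It is not WebHarvest's scanner: @ElementName is a hypothetical stand-in for @CorePlugin, AnnotationRegistry is a hypothetical stand-in for the registry, and the candidate classes are passed in explicitly rather than found by scanning the classpath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry: maps an XML element name to the class that handles it.
class AnnotationRegistry {

    private final Map<String, Class<?>> plugins = new LinkedHashMap<>();

    // Registers every candidate that carries @ElementName; others are skipped.
    void register(Class<?>... candidates) {
        for (Class<?> candidate : candidates) {
            ElementName name = candidate.getAnnotation(ElementName.class);
            if (name == null) {
                continue; // not a plugin
            }
            Class<?> previous = plugins.putIfAbsent(name.value(), candidate);
            if (previous != null) {
                throw new IllegalStateException(
                        "Duplicate element name: " + name.value());
            }
        }
    }

    Class<?> lookup(String elementName) {
        return plugins.get(elementName);
    }

    int size() {
        return plugins.size();
    }

    public static void main(String[] args) {
        AnnotationRegistry registry = new AnnotationRegistry();
        registry.register(HelloPlugin.class, HelperClass.class, EncodePlugin.class);
        System.out.println(registry.size());
        System.out.println(registry.lookup("base64").getSimpleName());
    }
}

// Stand-in for @CorePlugin. RUNTIME retention is what makes the
// annotation visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ElementName {
    String value();
}

@ElementName("hello-world")
class HelloPlugin {}

@ElementName("base64")
class EncodePlugin {}

// Not annotated, so registration ignores it.
class HelperClass {}
```

Rejecting duplicate element names at registration time is a design choice of this sketch; how WebHarvest resolves such conflicts is not covered here.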
Understanding the execution flow
doInitialize() is called once, when the plugin is loaded.
Parse attributes, set up resources, validate configuration.
doExecute() is called for each invocation.
Access context, execute body, process data, return result.
cleanup() is called when the plugin is unloaded.
Close connections, release resources, clean up state.
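The three phases above can be sketched with plain Java stand-ins. RecordingPlugin and LifecycleDemo are hypothetical names used only for illustration; they do not extend AbstractPlugin, and the driver loop is an assumption about how a host might call a plugin, not WebHarvest's actual runtime.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver that walks a plugin through its lifecycle.
class LifecycleDemo {

    // Initializes once, executes the requested number of times, then cleans up.
    static List<String> run(int invocations) {
        RecordingPlugin plugin = new RecordingPlugin();
        plugin.initialize();
        try {
            for (int i = 0; i < invocations; i++) {
                plugin.execute();
            }
        } finally {
            plugin.cleanup(); // runs even if an invocation throws
        }
        return plugin.calls;
    }

    public static void main(String[] args) {
        System.out.println(run(2)); // [initialize, execute, execute, cleanup]
    }
}

// Hypothetical stand-in for a plugin; records each lifecycle call.
class RecordingPlugin {

    final List<String> calls = new ArrayList<>();
    private boolean initialized;

    void initialize() {
        initialized = true;
        calls.add("initialize");
    }

    void execute() {
        if (!initialized) {
            throw new IllegalStateException("execute() called before initialize()");
        }
        calls.add("execute");
    }

    void cleanup() {
        initialized = false;
        calls.add("cleanup");
    }
}
```

Because initialization happens once and execution many times, anything computed per invocation belongs in doExecute(), while expensive setup belongs in doInitialize().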
Take your plugins to the next level
Access XML attributes in your plugin:
String url = config.getAttribute("url");
int timeout = Integer.parseInt(
config.getAttribute("timeout", "30000")
);
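Integer.parseInt throws NumberFormatException on malformed input, so a numeric attribute is worth validating with a clear message. The sketch below shows one way to do that. It uses a plain Map as a stand-in for PluginConfiguration, and the helper name positiveInt is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a Map stands in for PluginConfiguration here.
class AttributeParsing {

    // Returns the attribute as a positive int, or defaultValue when it is absent or blank.
    static int positiveInt(Map<String, String> attrs, String name, int defaultValue) {
        String raw = attrs.get(name);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        final int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Attribute '" + name + "' must be an integer, got: " + raw, e);
        }
        if (value <= 0) {
            throw new IllegalArgumentException(
                    "Attribute '" + name + "' must be positive, got: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("timeout", "5000");

        System.out.println(positiveInt(attrs, "timeout", 30000)); // 5000
        System.out.println(positiveInt(attrs, "retries", 3));     // 3
    }
}
```

In a real plugin, a check like this would fit naturally in doValidateConfiguration(), so that a bad value fails before any request is made.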
Process child elements:
// Execute all child elements
Variable bodyResult = executeBody(context);
String content = bodyResult.toString();
Read/write variables, get session:
// Get variable
Variable var = context.getVar("myVar");
// Set variable
context.setLocalVar("result", newValue);
// Get session
ScraperSession session = context.getSession();
Publish events for monitoring:
EventBus eventBus = context.getEventBus();
eventBus.publish(new CustomEvent(data));
Track metrics and performance:
ScraperSession session = context.getSession();
session.incrementToken(TokenType.HTTP_REQUEST, 1);
session.incrementToken(TokenType.HTTP_BYTES, bytes);
Proper exception handling:
try {
    // Plugin logic
} catch (IOException e) {
    throw new PluginException(
        "Failed to process: " + e.getMessage(), e
    );
}
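Passing the original exception as the cause keeps the underlying stack trace available to whoever handles the plugin failure. The following standalone sketch demonstrates the pattern. SketchPluginException is a hypothetical stand-in for PluginException and is unchecked only to keep the example short; the real class may be checked.

```java
import java.io.IOException;

// Hypothetical example of wrapping a low-level failure with its cause.
class ExceptionWrapping {

    // Simulates plugin work that may fail with an I/O error.
    static String process(boolean fail) {
        try {
            if (fail) {
                throw new IOException("connection reset");
            }
            return "ok";
        } catch (IOException e) {
            // Keep the message readable and attach the original exception.
            throw new SketchPluginException(
                    "Failed to process: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(process(false));
        try {
            process(true);
        } catch (SketchPluginException e) {
            System.out.println(e.getMessage());
            System.out.println(e.getCause().getClass().getSimpleName());
        }
    }
}

// Stand-in for PluginException (assumption: a message-plus-cause constructor).
class SketchPluginException extends RuntimeException {
    SketchPluginException(String message, Throwable cause) {
        super(message, cause);
    }
}
```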
Separate Maven modules like FTP, Mail, Database
Extension modules are separate Maven projects with their own dependencies:
webharvest-myextension/
├── pom.xml
├── README.md
└── src/
    └── main/
        ├── java/
        │   └── org/webharvest/plugin/myext/
        │       ├── MyExtPlugin.java
        │       └── MyExtException.java
        └── resources/
            └── META-INF/
                └── services/
                    └── org.webharvest.plugin.Plugin
<project>
    <groupId>org.webharvest</groupId>
    <artifactId>webharvest-myextension</artifactId>
    <version>2.2.0</version>
    <dependencies>
        <!-- Core dependency -->
        <dependency>
            <groupId>org.webharvest</groupId>
            <artifactId>webharvest-core</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>
        <!-- Your heavy dependency -->
        <dependency>
            <groupId>com.example</groupId>
            <artifactId>heavy-library</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>
</project>
Unit tests and integration tests
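package com.example.plugins;

import org.junit.Test;
import org.webharvest.plugin.PluginConfiguration;
import org.webharvest.runtime.DynamicScopeContext;
import org.webharvest.runtime.variables.Variable;

import static org.junit.Assert.*;

public class Base64PluginTest {

    @Test
    public void shouldEncodeText() throws Exception {
        // Create plugin
        Base64Plugin plugin = new Base64Plugin();

        // Create mock configuration
        // (MockPluginConfiguration is a test double you provide; it is not shown here)
        PluginConfiguration config = new MockPluginConfiguration();
        plugin.initialize(config);

        // Create mock context with body
        // (createContextWithBody is a helper you write for your test setup)
        DynamicScopeContext context = createContextWithBody("Hello");

        // Execute
        Variable result = plugin.execute(context);

        // Verify
        assertEquals("SGVsbG8=", result.toString());
    }
}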
Write production-ready plugins
Naming conventions: element name my-plugin, class name MyPluginPlugin, package com.example.plugins
Learn from existing plugins
Study how HttpPlugin handles requests, headers, and responses:
Source: webharvest-core/src/main/java/org/webharvest/plugin/core/HttpPlugin.java
Features to learn:
• Attribute handling (url, method, timeout)
• Child element processing (headers, params)
• Resource management (HTTP connections)
• Error handling (network failures)
• Token tracking (HTTP_REQUEST, HTTP_BYTES)
See how TemplatePlugin processes dynamic content:
Source: webharvest-core/src/main/java/org/webharvest/plugin/core/TemplatePlugin.java
Features to learn:
• Variable substitution
• Context manipulation
• Body execution
• String processing
Learn how extension modules work:
Source: webharvest-mail/src/main/java/org/webharvest/plugin/mail/MailPlugin.java
Features to learn:
• External dependencies (JavaMail)
• Separate Maven module
• Complex configuration
• Resource cleanup