<regexp-pattern>

Regexp pattern definition

Core v2.2.0

Overview

The <regexp-pattern> processor defines the regular expression used for matching and extraction. It must be used as a child element of <regexp>. The pattern uses Java regex syntax, including capture groups.

Usage Examples

Example 1: Extract links from HTML

example-1.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
  <regexp>
    <regexp-pattern><![CDATA[<a href="([^"]+)">]]></regexp-pattern>
    <regexp-source>${htmlPage}</regexp-source>
  </regexp>
</config>
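
To see what this pattern captures, the following is a minimal sketch in plain Java, outside Web-Harvest. It assumes the pattern text is handed to java.util.regex unchanged; the class name LinkExtractor and the method extractLinks are hypothetical and are not part of Web-Harvest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Same pattern text as in example-1.xml; quotes are escaped for the Java string literal.
    // Group 1 captures the value of the href attribute.
    private static final Pattern LINK = Pattern.compile("<a href=\"([^\"]+)\">");

    /** Returns the href value of every anchor tag that matches the pattern. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"https://example.com/a\">A</a> <a href=\"/b\">B</a></p>";
        System.out.println(extractLinks(html)); // prints [https://example.com/a, /b]
    }
}
```

The pattern only matches anchors where href is the sole attribute and uses double quotes, so a tag such as <a class="nav" href="/c"> is skipped.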

Example 2: Extract emails

example-2.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
  <regexp>
    <regexp-pattern><![CDATA[([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})]]></regexp-pattern>
    <regexp-source>${textContent}</regexp-source>
  </regexp>
</config>
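
The email pattern can be tried in isolation with the sketch below. It is plain Java under the assumption that the pattern behaves as it does in java.util.regex; EmailExtractor and extractEmails are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // Same pattern text as in example-2.xml; backslashes are doubled for the Java string literal.
    private static final Pattern EMAIL =
            Pattern.compile("([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})");

    /** Returns every substring of the text that matches the email pattern. */
    public static List<String> extractEmails(String text) {
        List<String> emails = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            // The parentheses wrap the whole pattern, so group(1) equals group(0).
            emails.add(m.group(1));
        }
        return emails;
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com or bob.smith@mail.example.org.";
        System.out.println(extractEmails(text));
        // prints [alice@example.com, bob.smith@mail.example.org]
    }
}
```

The trailing full stop of the sentence is not included in the second match, because the pattern requires the address to end in at least two letters. The pattern is a pragmatic approximation and does not validate addresses against the full email address grammar.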

Example 3: Price extraction with capture groups

example-3.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
  <regexp>
    <regexp-pattern><![CDATA[Price:\s*\$(\d+)\.(\d{2})]]></regexp-pattern>
    <regexp-source>Price: $19.99</regexp-source>
    <regexp-result index="1"/> <!-- Returns "19" -->
    <regexp-result index="2"/> <!-- Returns "99" -->
  </regexp>
</config>
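
The group numbering used above can be checked with a short plain-Java sketch. It assumes java.util.regex semantics for the pattern; PriceParser and toCents are hypothetical names, and converting the result to cents is an illustrative addition, not something the processor does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceParser {
    // Same pattern text as in example-3.xml. The dollar sign is escaped because
    // an unescaped $ is an end-of-line anchor in Java regex syntax.
    private static final Pattern PRICE = Pattern.compile("Price:\\s*\\$(\\d+)\\.(\\d{2})");

    /** Returns the first price found in the text as a total number of cents. */
    public static long toCents(String text) {
        Matcher m = PRICE.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no price found in: " + text);
        }
        long dollars = Long.parseLong(m.group(1)); // group 1: digits before the decimal point
        long cents = Long.parseLong(m.group(2));   // group 2: exactly two digits after it
        return dollars * 100 + cents;
    }

    public static void main(String[] args) {
        System.out.println(toCents("Price: $19.99")); // prints 1999
    }
}
```

Because the pattern uses \s*, the input "Price:$5.00" with no space also matches.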

Example 4: Complex pattern with multiple groups

example-4.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
  <regexp>
    <regexp-pattern><![CDATA[<div class="product" data-id="(\d+)" data-name="([^"]+)" data-price="([\d.]+)">]]></regexp-pattern>
    <regexp-source>${productHtml}</regexp-source>
  </regexp>
  <!-- Access groups with <regexp-result index="1"/>, <regexp-result index="2"/>, etc. -->
</config>
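
As a rough illustration of how the three groups line up with the attributes, here is a plain-Java sketch that maps them to typed fields. It assumes java.util.regex semantics; ProductParser, Product, and parse are hypothetical names and do not correspond to any Web-Harvest API.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProductParser {
    // Same pattern text as in example-4.xml, escaped for a Java string literal.
    private static final Pattern PRODUCT = Pattern.compile(
            "<div class=\"product\" data-id=\"(\\d+)\" data-name=\"([^\"]+)\" data-price=\"([\\d.]+)\">");

    /** Simple value holder for the three captured groups. */
    public static final class Product {
        public final long id;
        public final String name;
        public final BigDecimal price;

        Product(long id, String name, BigDecimal price) {
            this.id = id;
            this.name = name;
            this.price = price;
        }
    }

    /** Returns the first product found in the HTML, or null if nothing matches. */
    public static Product parse(String html) {
        Matcher m = PRODUCT.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Group 3 accepts any mix of digits and dots, so new BigDecimal(...)
        // throws NumberFormatException for values such as "1.2.3".
        return new Product(Long.parseLong(m.group(1)), m.group(2), new BigDecimal(m.group(3)));
    }

    public static void main(String[] args) {
        Product p = parse("<div class=\"product\" data-id=\"42\" data-name=\"Blue Mug\" data-price=\"7.50\">");
        System.out.println(p.id + " " + p.name + " " + p.price); // prints 42 Blue Mug 7.50
    }
}
```

The pattern depends on the attributes appearing in exactly this order, separated by single spaces; markup that reorders them or adds further attributes in between will not match.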

Parameters

Important Notes

Related Processors