XPath Best Practices in WebHarvest

Master text extraction vs element selection

Learn the most common XPath pitfall in WebHarvest and how to avoid it. Understand when to use /text(), string(), or plain element selection.

The Most Common XPath Issue

Element vs Text Content

"When I use //title or //h1, I get <h1>Herman Melville - Moby-Dick</h1> instead of just Herman Melville - Moby-Dick. How do I extract only the text?"
— Common WebHarvest Question

By default, an XPath expression such as //title returns the whole element, tags included. To extract only the text content, append /text() to the expression or wrap it in the string() function.

💡 Quick Fix: Add /text() to your XPath expression: //title/text() instead of //title

Pattern 1: Basic Text Extraction

Extract text content only

When you want text only (for display, templates, or variables), always use /text():

Incorrect (returns element with tags)

bad-example.xml

<xpath expression="//title">
    <get var="page"/>
</xpath>

<!-- Result: <title>My Page</title> -->

Correct (returns text only)

good-example.xml

<xpath expression="//title/text()">
    <get var="page"/>
</xpath>

<!-- Result: My Page -->

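💡 Alternative: You can also use string(//title). For an element that contains only text, it gives the same result as //title/text(). For an element with nested tags, string() also includes the text of all descendants, while /text() returns only the element's own text nodes.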
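To make the difference between the two forms concrete, here is a minimal sketch. The sample markup in the first comment is hypothetical, the page variable is assumed to hold it as well-formed XML, and the results noted in the comments follow from standard XPath semantics rather than from a recorded WebHarvest run:

```xml
<!-- Hypothetical content of "page": <h1>Moby-Dick <em>(1851)</em></h1> -->

<!-- /text(): only the direct text node of <h1>, i.e. "Moby-Dick " -->
<xpath expression="//h1/text()">
    <get var="page"/>
</xpath>

<!-- string(): text of <h1> and all descendants, i.e. "Moby-Dick (1851)" -->
<xpath expression="string(//h1)">
    <get var="page"/>
</xpath>
```

Note that string() expects a single node. Under XPath 1.0 it uses only the first match in document order, and under XPath 2.0 passing several nodes is an error, so prefer /text() when an expression can match more than one element.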

Pattern 2: Multiple Elements

Extract text from several elements at once

When selecting multiple elements with |, add /text() to each:

Incorrect

bad-multiple.xml

<xpath expression="//h1 | //h2 | //h3"><get var="page"/></xpath>

Correct

good-multiple.xml

<xpath expression="//h1/text() | //h2/text() | //h3/text()"><get var="page"/></xpath>
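The union can also be written once, with /text() applied to the combined result. This is a minimal sketch, assuming the same page variable as in the earlier examples. The parenthesized form is standard XPath, but it has not been checked here against a specific WebHarvest release:

```xml
<!-- Same result as //h1/text() | //h2/text() | //h3/text() -->
<xpath expression="(//h1 | //h2 | //h3)/text()">
    <get var="page"/>
</xpath>
```

This keeps the expression shorter as the list of elements grows and avoids forgetting /text() on one branch.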

Pattern 3: Nested Paths

Extract text from nested elements

Even with complex nested paths, add /text() at the end:

Incorrect

bad-nested.xml

<xpath expression="//div[@id='content']//span"><get var="page"/></xpath>

Correct

good-nested.xml

<xpath expression="//div[@id='content']//span/text()"><get var="page"/></xpath>

Pattern 4: Loop Extraction

Extract text from child elements in loops

When processing XML/HTML in a loop, use /text() to extract values from child elements:

loop-extraction.xml

<loop item="row">
    <xpath expression="//row">
        <get var="dbResult"/>
    </xpath>
    <body>
        <product>
            <!-- ❌ INCORRECT: Returns <id>123</id> -->
            <id><xpath expression="id"><get var="row"/></xpath></id>
            
            <!-- ✅ CORRECT: Returns 123 -->
            <id><xpath expression="id/text()"><get var="row"/></xpath></id>
        </product>
    </body>
</loop>

When NOT to Use /text()

Exceptions to the rule

There are three situations where you should NOT use /text():

1. Node Selection for Iteration

When selecting elements for loop processing:

<!-- OK - need node objects for iteration -->
<loop item="row">
    <xpath expression="//row">
        <get var="xmlData"/>
    </xpath>
    <body>
        <!-- Process each row -->
    </body>
</loop>

2. Further XPath Processing

When you need the element for nested queries:

<!-- OK - need element for further XPath -->
<def var="contentDiv">
    <xpath expression="//div[@id='content']">
        <get var="page"/>
    </xpath>
</def>
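A follow-up query can then run against the stored element, and that is the point where /text() belongs. The sketch below assumes contentDiv was defined as shown above and feeds it back into <xpath> the same way the loop example does with row. The variable name contentSpans is hypothetical:

```xml
<!-- Second step: query inside the stored element and extract text only -->
<def var="contentSpans">
    <xpath expression="//span/text()">
        <get var="contentDiv"/>
    </xpath>
</def>
```

Had the first step used //div[@id='content']/text(), only loose text would remain and there would be no span elements left to query.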

3. Attribute Extraction

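Attributes return their values directly. An attribute node has no child text nodes, so appending /text() (as in //a/@href/text()) matches nothing: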

<!-- Attributes don't need /text() -->
<xpath expression="//a/@href">
    <get var="page"/>
</xpath>

Quick Reference Table

Choose the right XPath expression

| Goal | XPath Expression | Example Result |
| --- | --- | --- |
| Get text content | `//element/text()` | `//h1/text()` → "Title" |
| Get text (function) | `string(//element)` | `string(//h1)` → "Title" |
| Get element | `//element` | `//h1` → `<h1>Title</h1>` |
| Get attribute | `//element/@attr` | `//a/@href` → "url" |
| Multiple text | `//e1/text() \| //e2/text()` | Text from multiple elements |

Testing Your XPath

Verify expressions work correctly

Use the <file> plugin to save extracted data and verify it's correct:

test-xpath.xml

<def var="test">
    <xpath expression="//h1/text()">
        <get var="page"/>
    </xpath>
</def>

<!-- 🐛 DEBUG: Save to file -->
<file action="write" path="debug/xpath-result.txt">
    Result: ${test}
</file>

💡 Pro Tip: Check debug/xpath-result.txt to verify the result matches your expectations. If you see XML tags in the output, you forgot /text()!
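To see the pitfall directly, both forms can be written to the same debug file and compared. This is a sketch under assumptions: the variable names asElement and asText and the path debug/xpath-compare.txt are hypothetical, and the <file> usage simply mirrors the example above:

```xml
<!-- Same element selected twice: once as a node, once as text -->
<def var="asElement">
    <xpath expression="//h1"><get var="page"/></xpath>
</def>
<def var="asText">
    <xpath expression="//h1/text()"><get var="page"/></xpath>
</def>

<!-- The first line should contain tags, the second only text -->
<file action="write" path="debug/xpath-compare.txt">
    Element: ${asElement}
    Text: ${asText}
</file>
```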

Summary

Key takeaways

General Rule

If you're extracting content for display, templates, or text processing, always use /text() or the string() function.

Exception: Use //element without /text() only when you need the element node itself, for further XPath processing or for iteration. Attribute selections such as //a/@href never need /text().
