Master text extraction vs element selection
Learn the most common XPath pitfall in WebHarvest and how to avoid it.
Understand when to use /text(), string(), or plain element selection.
Element vs Text Content
"When I use //title or //h1, I get <h1>Herman Melville - Moby-Dick</h1> instead of just Herman Melville - Moby-Dick. How do I extract only the text?"
By default, XPath expressions like //title return the entire element,
including its tags.
To extract only the text content, use /text() or the string()
function.
Add /text() to your XPath expression: use
//title/text() instead of //title.
Extract text content only
When you want text only (for display, templates, or variables), always use /text():
<xpath expression="//title">
    <get var="page"/>
</xpath>
<!-- Result: <title>My Page</title> -->
<xpath expression="//title/text()">
    <get var="page"/>
</xpath>
<!-- Result: My Page -->
Alternatively, use string(//title), which returns the element's text
content as a string.
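As a sketch of the string() form in a complete snippet, the fragment below assumes `page` holds the parsed document, as in the examples above. The variable name `pageTitle` is hypothetical. In standard XPath, string() returns the string value of the element, which also includes text inside any nested child elements.

```xml
<!-- Assumption: `page` already holds the parsed document -->
<!-- `pageTitle` is a hypothetical variable name used for illustration -->
<def var="pageTitle">
    <xpath expression="string(//title)">
        <get var="page"/>
    </xpath>
</def>
<!-- Expected to hold the title text without the surrounding <title> tags -->
```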
Extract text from several elements at once
When selecting multiple elements with |, add /text() to each:
<!-- Returns the whole elements, including tags -->
<xpath expression="//h1 | //h2 | //h3">
<!-- Returns only the text of each element -->
<xpath expression="//h1/text() | //h2/text() | //h3/text()">
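Written out as a complete snippet, the union form might look like the sketch below. It assumes `page` holds the parsed document, as in the earlier examples. The variable name `headings` is hypothetical.

```xml
<!-- Assumption: `page` already holds the parsed document -->
<!-- `headings` is a hypothetical variable name used for illustration -->
<def var="headings">
    <xpath expression="//h1/text() | //h2/text() | //h3/text()">
        <get var="page"/>
    </xpath>
</def>
```

A union returns its matches in document order, so the headings should come back in the order they appear on the page rather than grouped by level.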
Extract text from nested elements
Even with complex nested paths, add /text() at the end:
<!-- Returns the whole <span> elements -->
<xpath expression="//div[@id='content']//span">
<!-- Returns only the text of each <span> -->
<xpath expression="//div[@id='content']//span/text()">
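One caveat is worth a sketch: /text() selects only the text nodes that are direct children of the matched element. If a span contains inline markup such as <b> or <em>, the text inside those children is skipped. The descendant form //text() collects text at every depth. The fragment below assumes the same `page` variable. The variable name `spanText` and the sample markup in the comments are hypothetical.

```xml
<!-- Sample markup (hypothetical): <span>Price: <b>42</b> USD</span> -->
<!-- span/text()  selects "Price: " and " USD" (direct text nodes only) -->
<!-- span//text() selects "Price: ", "42" and " USD" (all descendant text) -->
<def var="spanText">
    <xpath expression="//div[@id='content']//span//text()">
        <get var="page"/>
    </xpath>
</def>
```

How the separate text nodes are joined in the final value depends on the engine, so it is worth checking the output before relying on it.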
Extract text from child elements in loops
When processing XML/HTML in a loop, use /text() to extract values from child elements:
<loop item="row">
    <xpath expression="//row">
        <get var="dbResult"/>
    </xpath>
    <body>
        <product>
            <!-- ❌ INCORRECT: Returns <id>123</id> -->
            <id><xpath expression="id"><get var="row"/></xpath></id>
            <!-- ✅ CORRECT: Returns 123 -->
            <id><xpath expression="id/text()"><get var="row"/></xpath></id>
        </product>
    </body>
</loop>
Exceptions to the rule
There are three situations where you should NOT use /text():
When selecting elements for loop processing:
<!-- OK - need node objects for iteration -->
<loop item="row">
    <xpath expression="//row">
        <get var="xmlData"/>
    </xpath>
    <body>
        <!-- Process each row -->
    </body>
</loop>
When you need the element for nested queries:
<!-- OK - need element for further XPath -->
<def var="contentDiv">
    <xpath expression="//div[@id='content']">
        <get var="page"/>
    </xpath>
</def>
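A follow-up query can then run against the stored element instead of the whole page. This is a sketch only: the variable name `contentHeadings` and the //h2 target are hypothetical. It also assumes an XPath can be evaluated against a variable that holds an element, as the loop example does with `row`.

```xml
<!-- Assumption: `contentDiv` holds the <div id="content"> element selected above -->
<!-- `contentHeadings` and the //h2 target are hypothetical -->
<def var="contentHeadings">
    <xpath expression="//h2/text()">
        <get var="contentDiv"/>
    </xpath>
</def>
```

Keeping the first selection as an element is what makes the second query possible. If `contentDiv` had been extracted with /text(), there would be no nested elements left to search.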
Attributes return values directly:
<!-- Attributes don't need /text() -->
<xpath expression="//a/@href">
    <get var="page"/>
</xpath>
Choose the right XPath expression
| Goal | XPath Expression | Example Result |
|---|---|---|
| Get text content | //element/text() | //h1/text() → "Title" |
| Get text (function) | string(//element) | string(//h1) → "Title" |
| Get element | //element | //h1 → <h1>Title</h1> |
| Get attribute | //element/@attr | //a/@href → "url" |
| Multiple text | //e1/text() \| //e2/text() | Text from multiple elements |
Verify expressions work correctly
Use the <file> plugin to save extracted data and verify it's correct:
<def var="test">
    <xpath expression="//h1/text()">
        <get var="page"/>
    </xpath>
</def>
<!-- 🐛 DEBUG: Save to file -->
<file action="write" path="debug/xpath-result.txt">
Result: ${test}
</file>
Open debug/xpath-result.txt to verify the result matches your
expectations.
If you see XML tags in the output, you forgot /text()!
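To see the difference directly, one option is to write both forms to the same debug file and compare them. The sketch below reuses the <file> plugin and the `page` variable from the example above. The variable names `asElement` and `asText` and the path debug/xpath-compare.txt are hypothetical.

```xml
<!-- Assumption: `page` already holds the parsed document -->
<!-- `asElement`, `asText` and the output path are hypothetical -->
<def var="asElement">
    <xpath expression="//h1">
        <get var="page"/>
    </xpath>
</def>
<def var="asText">
    <xpath expression="//h1/text()">
        <get var="page"/>
    </xpath>
</def>
<!-- Write both results to one file for a side-by-side check -->
<file action="write" path="debug/xpath-compare.txt">
Element: ${asElement}
Text: ${asText}
</file>
```

The first line of the file should contain the <h1> tags and the second should not.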
Key takeaways
If you're extracting content for display, templates, or text processing,
always use /text() or the string() function.
Exception: Only use //element (without /text())
when you need the element node for further XPath processing or iteration.