Debugging Complex Scrapers

Complete guide to debugging WebHarvest configurations

Learn five debugging patterns for inspecting responses, variables, and execution flow when scraping complex websites.

The Problem

Common debugging challenges in web scraping

"I am developing a scraping tool for a complex javascript website... frequently, I am getting the wrong response. However, it is not easy to spot such instances. Debugging a configuration is really complicated without having a possibility to look at what WebHarvest is actually processing."
— GitHub Issue #16

💡 Good news: WebHarvest has debugging capabilities built in. This guide shows five debugging patterns you can start using right away.

Pattern 1: Save Responses to Debug Files

The simplest and most effective debugging technique

Save every response to a file so you can inspect it in your browser or text editor. This lets you see exactly what HTML WebHarvest is processing.

debug-save-response.xml
<!-- Fetch page -->
<def var="response">
    <http url="https://complex-website.com/api/data">
        <http-header name="X-API-Key" value="${apiKey}"/>
    </http>
</def>

<!-- 🐛 DEBUG: Save response to file -->
<file path="debug/response-${sys.timestamp()}.html" action="write">
    ${response}
</file>

<!-- Now process it -->
<def var="data">
    <xpath expression="//div[@class='result']">
        <html-to-xml>${response}</html-to-xml>
    </xpath>
</def>
💡 Pro Tip: Use ${sys.timestamp()} or ${index} to create unique filenames for each request. This prevents overwriting and lets you see the sequence of requests.
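
When the same timestamp should appear in both the file name and a log line, one option is to capture it once in a variable. The following is a minimal sketch, assuming `${sys.timestamp()}` can be stored with `<def>` like any other value; the variable name `stamp` is made up for illustration.

```xml
<!-- Capture the timestamp once so the file name and the log line agree -->
<!-- ("stamp" is a hypothetical variable name) -->
<def var="stamp">${sys.timestamp()}</def>

<def var="response">
    <http url="https://complex-website.com/api/data">
        <http-header name="X-API-Key" value="${apiKey}"/>
    </http>
</def>

<!-- Save the response and record where it went -->
<file path="debug/response-${stamp}.html" action="write">${response}</file>
<log message="Saved response to debug/response-${stamp}.html"/>
```

Calling `${sys.timestamp()}` separately in the path and in the message could yield two different values, which is why the sketch stores it first.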

Pattern 2: Variable Inspection with Log

See what's actually in your variables

Use <log> to print variable values at key points in your scraper.

debug-variables.xml
<!-- Extract data -->
<def var="productName">
    <xpath expression="//h1[@class='product-title']/text()">${page}</xpath>
</def>

<!-- 🐛 DEBUG: Print variable -->
<log message="Product name: ${productName}"/>

<!-- Continue processing... -->
💡 Tip: Log variables before and after transformations to see where data gets corrupted.
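
As a rough illustration of logging before and after a transformation, the sketch below extracts a price, logs it, cleans it, and logs it again. The variable names `rawPrice` and `cleanPrice` and the cleanup pattern are assumptions for the example. The `<regexp>` attributes follow the form used elsewhere in this guide.

```xml
<!-- Raw value straight from the page -->
<def var="rawPrice">
    <xpath expression="//span[@class='price']/text()">${page}</xpath>
</def>
<!-- Square brackets make stray whitespace or an empty value visible in the log -->
<log message="BEFORE cleanup: [${rawPrice}]"/>

<!-- Hypothetical cleanup step: drop everything except digits and the decimal point -->
<def var="cleanPrice">
    <regexp pattern="[^0-9.]" replace="">${rawPrice}</regexp>
</def>
<log message="AFTER cleanup: [${cleanPrice}]"/>
```

If the BEFORE line already looks wrong, the problem is in the extraction. If only the AFTER line is wrong, the transformation is the culprit.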

Pattern 3: Multi-Step Pipeline Debugging

Save output at each transformation step

For complex pipelines (fetch → parse → transform → extract), save output after each step.

debug-pipeline.xml
<!-- Step 1: Fetch -->
<def var="raw">
    <http url="https://example.com"/>
</def>
<file path="debug/1-raw.html" action="write">${raw}</file>

<!-- Step 2: Parse HTML to XML -->
<def var="xml">
    <html-to-xml>${raw}</html-to-xml>
</def>
<file path="debug/2-xml.xml" action="write">${xml}</file>

<!-- Step 3: Extract data -->
<def var="data">
    <xpath expression="//div[@class='result']">${xml}</xpath>
</def>
<file path="debug/3-data.txt" action="write">${data}</file>

<!-- Step 4: Transform (strip leftover tags; "<" must be escaped inside an XML attribute) -->
<def var="clean">
    <regexp pattern="&lt;[^&gt;]+&gt;" replace="">${data}</regexp>
</def>
<file path="debug/4-clean.txt" action="write">${clean}</file>
💡 Result: You get a "paper trail" showing exactly where data gets corrupted or lost. Open files 1, 2, 3, 4 and see which step breaks!
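
Instead of opening every file by hand, the pipeline can also report the first stage that came back empty. This is a sketch only, assuming `empty()` works on these variables the same way it does in the conditional-logging pattern in this guide. It reuses the `raw`, `xml`, and `data` variables from the pipeline above.

```xml
<!-- Flag each stage that produced nothing -->
<if condition="${empty(raw)}">
    <log level="ERROR" message="Step 1 (fetch) produced no output"/>
</if>
<if condition="${empty(xml)}">
    <log level="ERROR" message="Step 2 (html-to-xml) produced no output"/>
</if>
<if condition="${empty(data)}">
    <log level="ERROR" message="Step 3 (xpath) matched nothing; check debug/2-xml.xml"/>
</if>
```

The earliest stage that logs an error is usually the one worth inspecting first, since every later stage inherits its empty input.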

Pattern 4: Conditional Logging for Edge Cases

Log only when something unexpected happens

Use <if> to log only when conditions fail, reducing noise.

debug-conditional.xml
<!-- Extract product price -->
<def var="price">
    <xpath expression="//span[@class='price']/text()">${page}</xpath>
</def>

<!-- 🐛 DEBUG: Log if price is empty or invalid -->
<if condition="${empty(price) or price == ''}">
    <log level="ERROR" message="⚠️ Price extraction failed!"/>
    <file path="debug/failed-page.html" action="write">${page}</file>
</if>
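
When the scraper visits many pages, a fixed file name keeps only the most recent failure. A possible variation, sketched here under the assumption that a variable holding the page address is available (the name `pageUrl` is hypothetical), gives each failure its own file and names the offending page in the log.

```xml
<!-- pageUrl is a hypothetical variable holding the address the page was fetched from -->
<if condition="${empty(price) or price == ''}">
    <log level="ERROR" message="Price extraction failed for ${pageUrl}"/>
    <!-- Timestamped name so a second failure does not overwrite the first -->
    <file path="debug/failed-page-${sys.timestamp()}.html" action="write">${page}</file>
</if>
```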

Pattern 5: Debug Mode Configuration

Toggle debugging on/off with a single variable

Create a DEBUG_MODE variable to control all debugging output.

debug-mode.xml
<!-- Set debug mode (true/false) -->
<def var="DEBUG_MODE">true</def>

<!-- Your scraper logic -->
<def var="data">
    <http url="https://example.com"/>
</def>

<!-- Conditional debug save -->
<if condition="${DEBUG_MODE}">
    <file path="debug/response.html" action="write">${data}</file>
    <log message="DEBUG: Saved response to debug/response.html"/>
</if>
✅ How to Use:
  1. Set DEBUG_MODE to true during development
  2. Set to false in production
  3. Every debug save and log is then controlled by that single variable
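
The same switch can guard several debug artifacts at once. The sketch below combines it with the step-by-step saves from the pipeline pattern. It is an illustration built from the elements shown in this guide, not a prescribed layout.

```xml
<def var="DEBUG_MODE">true</def>

<!-- Normal scraper steps run regardless of the switch -->
<def var="raw">
    <http url="https://example.com"/>
</def>
<def var="xml">
    <html-to-xml>${raw}</html-to-xml>
</def>

<!-- One guard controls every debug artifact for this run -->
<if condition="${DEBUG_MODE}">
    <file path="debug/1-raw.html" action="write">${raw}</file>
    <file path="debug/2-xml.xml" action="write">${xml}</file>
    <log message="DEBUG: wrote debug/1-raw.html and debug/2-xml.xml"/>
</if>
```

Grouping the saves under one `<if>` keeps the scraper logic readable and means switching `DEBUG_MODE` to false removes all of the debug output in one step.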

Debugging Checklist

Essential steps when debugging complex websites

Save Responses

Always save HTTP responses to files using <file path="debug/...">

Log Variables

Use <log> to print variable values at key transformation points

Pipeline Debugging

Save output after each step in complex transformation pipelines

Debug Mode

Use DEBUG_MODE variable to control all debugging output
