Tutorial: Form Submit & Pagination

Master advanced scraping patterns

A step-by-step guide to logging in, submitting forms, handling pagination, and scraping detail pages.

What You'll Learn

Complete workflow from login to data extraction

Form Submission

Submit login forms with POST requests

Pagination

Iterate through multiple pages automatically

Detail Pages

Extract links and scrape individual detail pages

Save Results

Store extracted data to files

Step 1: Submit Login Form

Authenticate with POST request

step1-login.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- Submit login form -->
<def var="loginResponse">
    <http url="https://example.com/login" method="POST">
        <http-param name="username" value="myuser"/>
        <http-param name="password" value="mypass"/>
    </http>
</def>

<!-- Save for debugging -->
<file path="output/login-response.html" action="write">
    ${loginResponse}
</file>
</config>
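💡 Key Points: Use method="POST" to submit the form, with one <http-param> per form field. HTTP session cookies are maintained automatically across requests, so later <http> calls in the same run are sent with the logged-in session.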
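To illustrate the cookie behavior described above, here is a minimal sketch that logs in and then fetches a page that requires the session. The URL https://example.com/account and the output path are hypothetical placeholders, and the sketch assumes both requests run in the same configuration so that they share one HTTP session.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- Log in first; the server's response sets the session cookie -->
<def var="loginResponse">
    <http url="https://example.com/login" method="POST">
        <http-param name="username" value="myuser"/>
        <http-param name="password" value="mypass"/>
    </http>
</def>

<!-- Hypothetical protected page: requested in the same run,
     so the session cookie from the login is sent along -->
<def var="accountPage">
    <http url="https://example.com/account"/>
</def>

<!-- Save the page to confirm the login took effect -->
<file path="output/account.html" action="write">
    ${accountPage}
</file>
</config>
```

If the saved file shows the login form again instead of the account page, the login most likely failed, for example because the form expects different field names or a hidden token.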

Step 2: Extract Product Links

Get all detail page URLs

step2-extract-links.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- Fetch results page -->
<def var="resultsPage">
    <http url="https://example.com/products"/>
</def>

<!-- Extract all product links -->
<def var="productLinks">
    <xpath expression="//a[@class='product-link']/@href">
        <html-to-xml>${resultsPage}</html-to-xml>
    </xpath>
</def>

<log message="Found ${productLinks.length} product links"/>
</config>

Step 3: Loop Through Detail Pages

Scrape each product page

step3-detail-pages.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- Loop through each product link.
     productLinks comes from Step 2 and must be defined earlier in the same configuration. -->
<loop item="link" index="i">
    ${productLinks}
    
    <!-- Fetch detail page -->
    <def var="detailPage">
        <http url="${link}"/>
    </def>
    
    <!-- Extract product data -->
    <def var="productName">
        <xpath expression="//h1[@class='title']/text()">
            <html-to-xml>${detailPage}</html-to-xml>
        </xpath>
    </def>
    
    <def var="price">
        <xpath expression="//span[@class='price']/text()">
            <html-to-xml>${detailPage}</html-to-xml>
        </xpath>
    </def>
    
    <!-- Save to file -->
    <file path="output/product-${i}.txt" action="write">
        Name: ${productName}
        Price: ${price}
    </file>
    
    <!-- Be polite - wait 2 seconds -->
    <sleep time="2000"/>
</loop>
</config>
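Step 3 writes one file per product. If a single combined file is easier to work with, a variant along the following lines may help. This is a sketch under two assumptions: productLinks is already defined as in Step 2, and the <file> element accepts action="append" as it does in classic Web-Harvest. The output path output/products.txt is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- productLinks is assumed to be defined as in Step 2 -->
<loop item="link" index="i">
    ${productLinks}

    <def var="detailPage">
        <http url="${link}"/>
    </def>

    <def var="productName">
        <xpath expression="//h1[@class='title']/text()">
            <html-to-xml>${detailPage}</html-to-xml>
        </xpath>
    </def>

    <!-- ASSUMPTION: action="append" adds to the file instead of overwriting it -->
    <file path="output/products.txt" action="append">
        ${i}: ${productName} (${link})
    </file>

    <!-- Be polite - wait 2 seconds -->
    <sleep time="2000"/>
</loop>
</config>
```

Because the file is appended to, delete or rename it between runs so that results from earlier runs do not accumulate.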

Step 4: Handle Pagination

Automatically navigate through multiple pages

step4-pagination.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<def var="currentPage">1</def>
<def var="hasNextPage">true</def>

<while condition="${hasNextPage}">
    <!-- Fetch page -->
    <def var="page">
        <http url="https://example.com/products?page=${currentPage}"/>
    </def>
    
    <!-- Extract and process products on this page -->
    <!-- ... your extraction logic ... -->
    
    <!-- Check if "Next" button exists -->
    <def var="nextButton">
        <xpath expression="//a[@class='next-page']">
            <html-to-xml>${page}</html-to-xml>
        </xpath>
    </def>
    
    <!-- Update loop condition -->
    <set-var name="hasNextPage">
        <script>!context.getVar('nextButton').isEmpty()</script>
    </set-var>
    
    <!-- Increment page counter -->
    <set-var name="currentPage">
        <script>parseInt(context.getVar('currentPage')) + 1</script>
    </set-var>
    
    <sleep time="2000"/>
</while>
</config>
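💡 How This Works: The <while> loop checks hasNextPage before each iteration. As long as the fetched page contains a "Next" link (//a[@class='next-page']), the page counter is incremented and the next page is fetched. When no such link is found, hasNextPage becomes false and the loop stops.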
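The pagination loop and the link extraction from Step 2 can be combined. The following is one possible arrangement rather than a definitive one. It reuses only elements and expressions already shown in this tutorial and assumes every results page marks product links with class='product-link'.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<def var="currentPage">1</def>
<def var="hasNextPage">true</def>

<while condition="${hasNextPage}">
    <!-- Fetch the current results page -->
    <def var="page">
        <http url="https://example.com/products?page=${currentPage}"/>
    </def>

    <!-- Step 2 logic: collect the product links on this page -->
    <def var="productLinks">
        <xpath expression="//a[@class='product-link']/@href">
            <html-to-xml>${page}</html-to-xml>
        </xpath>
    </def>

    <log message="Page ${currentPage}: found ${productLinks.length} product links"/>

    <!-- The detail-page loop from Step 3 would run here, once per results page -->

    <!-- Check if "Next" button exists -->
    <def var="nextButton">
        <xpath expression="//a[@class='next-page']">
            <html-to-xml>${page}</html-to-xml>
        </xpath>
    </def>

    <set-var name="hasNextPage">
        <script>!context.getVar('nextButton').isEmpty()</script>
    </set-var>

    <set-var name="currentPage">
        <script>parseInt(context.getVar('currentPage')) + 1</script>
    </set-var>

    <sleep time="2000"/>
</while>
</config>
```

Running the detail-page loop inside the <while> body keeps memory use flat, because only one page of links is held at a time. The trade-off is that the total number of products is not known until the last page has been processed.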

Related Resources