Rate Limiting & Polite Scraping

Best practices for respectful web scraping

Learn how to control scraping speed, avoid overloading servers, and implement responsible scraping patterns.

Why Rate Limiting Matters

Protecting servers and ensuring reliable scraping

Prevent Server Overload

Too many requests can crash small websites or trigger the server's own rate limits

Avoid IP Bans

Aggressive scraping can get your IP blocked, sometimes permanently

Be Respectful

Good scraping citizens don't abuse free services

Better Reliability

Slower scraping = fewer errors = more successful extractions

Pattern 1: Simple Delays with Sleep

Add pauses between requests

Use <sleep> to pause execution for a specified time (in milliseconds).

rate-limit-sleep.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<loop item="url">
    ${urlList}
    
    <!-- Fetch page -->
    <http url="${url}"/>
    
    <!-- ⏱️ WAIT 2 seconds before next request -->
    <sleep time="2000"/>
</loop>
</config>
💡 Recommendation: Start with 2-second delays for unknown sites, then adjust based on server response.
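
One way to make that tuning painless is to keep the delay in a single variable, so the whole configuration can be slowed down or sped up with one edit. The sketch below reuses only the <def>, <loop>, <http>, and <sleep> elements shown on this page; the variable name delayMs, the file name, and the ${urlList} placeholder are illustrative, not part of any fixed API.

rate-limit-configurable.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- One place to tune politeness: raise this value if the server slows down or starts returning errors -->
<def var="delayMs">
    <script>2000</script>
</def>

<loop item="url">
    ${urlList}
    
    <http url="${url}"/>
    
    <!-- Pause for the configured delay before the next request -->
    <sleep time="${delayMs}"/>
</loop>
</config>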

Pattern 2: Random Delays

Vary delay times to appear more human

Random delays (for example, 1-3 seconds) vary your request timing, which is harder for anti-bot systems to fingerprint than a fixed interval.

rate-limit-random.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<loop item="url">
    ${urlList}
    
    <http url="${url}"/>
    
    <!-- Random delay between 1-3 seconds -->
    <def var="randomDelay">
        <script>1000 + Math.random() * 2000</script>
    </def>
    
    <sleep time="${randomDelay}"/>
</loop>
</config>

Best Practices

DO

  • Start with 2-3 second delays
  • Use random delays when possible
  • Respect robots.txt
  • Set proper User-Agent header (see the sketch after this list)
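
A minimal sketch of the User-Agent point, assuming the <http-header> element accepted inside Web-Harvest's <http> processor (verify against your schema version). The scraper name and contact URL in the header value, along with the file name and example page URL, are placeholders to replace with your own.

polite-user-agent.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- Identify the scraper honestly and give the site operator a way to reach you -->
<http url="https://example.com/page">
    <http-header name="User-Agent">MyScraper/1.0 (+https://example.com/contact)</http-header>
</http>

<!-- Keep the usual polite delay before any follow-up request -->
<sleep time="2000"/>
</config>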

DON'T

  • Never scrape at maximum speed
  • Don't ignore HTTP 429 (Too Many Requests); see the back-off sketch after this list
  • Don't scrape during peak hours
  • Don't pretend to be a real browser if you're not
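
For the HTTP 429 point, here is a hedged sketch rather than a drop-in recipe: it assumes the http.statusCode value Web-Harvest exposes after an <http> call and the <case>/<if>/<else> branching elements, both of which you should verify against your schema version, and the one-minute back-off is an arbitrary illustration.

backoff-on-429.xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<loop item="url">
    ${urlList}
    
    <http url="${url}"/>
    
    <case>
        <!-- 429 means the server is asking us to slow down: back off for a full minute -->
        <if condition="${http.statusCode == 429}">
            <sleep time="60000"/>
        </if>
        <!-- Normal response: keep the usual polite delay -->
        <else>
            <sleep time="2000"/>
        </else>
    </case>
</loop>
</config>

A production configuration would also retry or re-queue the throttled URL; this fragment only shows the branching and the longer pause.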

Related Resources