Web Scraping Best Practices for AI Applications
Build responsible, reliable web scraping systems for your AI applications. Best practices for production deployment.
Marcus Chen
Founder & CEO

Responsible web scraping is essential for building sustainable AI applications. Here are the best practices every developer should follow.
Rate Limiting
Never hammer a website with requests:
import Bottleneck from 'bottleneck';

// Bad: no rate limiting (this can hit the target at 100+ requests/second)
for (const url of urls) {
  await scrape(url);
}

// Good: rate limited (maxConcurrent caps parallelism, minTime spaces out request starts)
const limiter = new Bottleneck({ maxConcurrent: 2, minTime: 1000 });
for (const url of urls) {
  await limiter.schedule(() => scrape(url)); // at most ~1 request per second
}
Respect robots.txt
Always check and follow robots.txt rules:
import robotsParser from 'robots-parser';

// Fetch and parse the site's robots.txt before crawling
const robotsUrl = 'https://example.com/robots.txt';
const robotsTxt = await (await fetch(robotsUrl)).text();
const robots = robotsParser(robotsUrl, robotsTxt);

if (!robots.isAllowed(url, 'MyBot')) {
  console.log('URL disallowed by robots.txt');
  return null;
}
Error Handling
| Error | Action |
|---|---|
| 429 Rate Limited | Exponential backoff, reduce rate |
| 403 Forbidden | Check robots.txt, may be blocked |
| 404 Not Found | Log and skip, don't retry |
| 500 Server Error | Retry with backoff (max 3x) |
| Timeout | Retry once, then skip |
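The 429 and 500 rows both call for retries with exponential backoff. Here is a minimal sketch of that retry loop, assuming a generic scrape(url) that throws an error carrying an HTTP status; the helper name and error shape are illustrative, not part of any specific library.

// Retry sketch (assumed helper; scrape() and its error shape are illustrative)
async function scrapeWithRetry(url: string, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await scrape(url);
    } catch (err: any) {
      const status = err?.status;
      // 404 / 403: log and move on, don't retry
      if (status === 404 || status === 403) return null;
      // 429 / 5xx / timeouts: back off exponentially (1s, 2s, 4s, ...)
      if (attempt === maxAttempts) throw err;
      const delay = 1000 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  return null;
}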
Legal Considerations
- Public data only: Don't scrape login-required content
- Check ToS: Some sites explicitly prohibit scraping
- Copyrighted content: Fair use may apply, but be cautious
- Personal data: GDPR/CCPA compliance required
Caching Strategy
// Assumes a `cache` with async get/set and a `tryb` client for fetching pages
async function scrapeWithCache(url: string, maxAge = 3600) {
  // Serve from cache while the entry is younger than maxAge (seconds)
  const cached = await cache.get(url);
  if (cached && Date.now() - cached.timestamp < maxAge * 1000) {
    return cached.data;
  }

  // Otherwise fetch fresh content and store it with a timestamp
  const fresh = await tryb.read(url);
  await cache.set(url, { data: fresh, timestamp: Date.now() });
  return fresh;
}
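The cache above can be anything exposing async get and set. A minimal in-memory version looks like the sketch below; it is illustrative only, and a shared store such as Redis is a better fit once you run multiple workers.

// Minimal in-memory cache sketch (assumed shape, not a specific library)
type CacheEntry = { data: unknown; timestamp: number };

const store = new Map<string, CacheEntry>();

const cache = {
  async get(key: string): Promise<CacheEntry | undefined> {
    return store.get(key);
  },
  async set(key: string, entry: CacheEntry): Promise<void> {
    store.set(key, entry);
  },
};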
Monitoring & Alerts
- Track success/failure rates per domain
- Alert on unusual failure spikes
- Monitor credit/cost usage
- Log all requests for debugging
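As a sketch of the first two points, per-domain counters plus a simple failure-rate check are enough to get started; the threshold and the sendAlert() hook below are placeholders, not a specific monitoring stack.

// Per-domain success/failure counters with a naive failure-rate alert (illustrative)
const stats = new Map<string, { success: number; failure: number }>();

function record(url: string, ok: boolean) {
  const domain = new URL(url).hostname;
  const entry = stats.get(domain) ?? { success: 0, failure: 0 };
  if (ok) entry.success++;
  else entry.failure++;
  stats.set(domain, entry);

  const total = entry.success + entry.failure;
  const failureRate = entry.failure / total;
  // Placeholder threshold: alert once a domain has enough traffic and >50% failures
  if (total >= 20 && failureRate > 0.5) {
    sendAlert(`High failure rate for ${domain}: ${(failureRate * 100).toFixed(0)}%`);
  }
}

function sendAlert(message: string) {
  // Placeholder: wire this to Slack, PagerDuty, email, etc.
  console.error(`[ALERT] ${message}`);
}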
Marcus Chen
Founder & CEO at Tryb
Marcus builds ethical AI infrastructure.


