Web Scraping Best Practices for AI Applications
Build responsible, reliable web scraping systems for your AI applications. Best practices for production deployment.
Marcus Chen
Founder & CEO

Responsible web scraping is essential for building sustainable AI applications. Here are the best practices every developer should follow.
Rate Limiting
Never hammer a website with requests:
import Bottleneck from 'bottleneck';

// Bad: no rate limiting (this can hit the target at 100+ requests/second)
for (const url of urls) {
  await scrape(url);
}

// Good: rate limited (maxConcurrent caps parallelism, minTime spaces out request starts)
const limiter = new Bottleneck({ maxConcurrent: 2, minTime: 1000 });
for (const url of urls) {
  await limiter.schedule(() => scrape(url)); // at most ~1 request per second
}
Respect robots.txt
Always check and follow robots.txt rules:
import robotsParser from 'robots-parser';

// Fetch and parse the site's robots.txt before crawling
const robotsUrl = 'https://example.com/robots.txt';
const robotsTxt = await (await fetch(robotsUrl)).text();
const robots = robotsParser(robotsUrl, robotsTxt);

if (!robots.isAllowed(url, 'MyBot')) {
  console.log('URL disallowed by robots.txt');
  return null;
}
Error Handling
| Error | Action |
|---|---|
| 429 Rate Limited | Exponential backoff, reduce rate |
| 403 Forbidden | Check robots.txt, may be blocked |
| 404 Not Found | Log and skip, don't retry |
| 500 Server Error | Retry with backoff (max 3x) |
| Timeout | Retry once, then skip |
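The 429 and 500 rows both call for retries with exponential backoff. Here is a minimal sketch of that retry loop, assuming a generic scrape(url) that throws an error carrying an HTTP status; the helper name and error shape are illustrative, not part of any specific library.

// Retry sketch (assumed helper; scrape() and its error shape are illustrative)
async function scrapeWithRetry(url: string, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await scrape(url);
    } catch (err: any) {
      const status = err?.status;
      // 404 / 403: log and move on, don't retry
      if (status === 404 || status === 403) return null;
      // 429 / 5xx / timeouts: back off exponentially (1s, 2s, 4s, ...)
      if (attempt === maxAttempts) throw err;
      const delay = 1000 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  return null;
}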
Legal Considerations
- Public data only: Don't scrape login-required content
- Check ToS: Some sites explicitly prohibit scraping
- Copyrighted content: Fair use may apply, but be cautious
- Personal data: GDPR/CCPA compliance required
Caching Strategy
// Assumes a `cache` with async get/set and a `tryb` client for fetching pages
async function scrapeWithCache(url: string, maxAge = 3600) {
  // Serve from cache while the entry is younger than maxAge (seconds)
  const cached = await cache.get(url);
  if (cached && Date.now() - cached.timestamp < maxAge * 1000) {
    return cached.data;
  }

  // Otherwise fetch fresh content and store it with a timestamp
  const fresh = await tryb.read(url);
  await cache.set(url, { data: fresh, timestamp: Date.now() });
  return fresh;
}
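The cache above can be anything exposing async get and set. A minimal in-memory version looks like the sketch below; it is illustrative only, and a shared store such as Redis is a better fit once you run multiple workers.

// Minimal in-memory cache sketch (assumed shape, not a specific library)
type CacheEntry = { data: unknown; timestamp: number };

const store = new Map<string, CacheEntry>();

const cache = {
  async get(key: string): Promise<CacheEntry | undefined> {
    return store.get(key);
  },
  async set(key: string, entry: CacheEntry): Promise<void> {
    store.set(key, entry);
  },
};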
Monitoring & Alerts
- Track success/failure rates per domain
- Alert on unusual failure spikes
- Monitor credit/cost usage
- Log all requests for debugging
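As a sketch of the first two points, per-domain counters plus a simple failure-rate check are enough to get started; the threshold and the sendAlert() hook below are placeholders, not a specific monitoring stack.

// Per-domain success/failure counters with a naive failure-rate alert (illustrative)
const stats = new Map<string, { success: number; failure: number }>();

function record(url: string, ok: boolean) {
  const domain = new URL(url).hostname;
  const entry = stats.get(domain) ?? { success: 0, failure: 0 };
  if (ok) entry.success++;
  else entry.failure++;
  stats.set(domain, entry);

  const total = entry.success + entry.failure;
  const failureRate = entry.failure / total;
  // Placeholder threshold: alert once a domain has enough traffic and >50% failures
  if (total >= 20 && failureRate > 0.5) {
    sendAlert(`High failure rate for ${domain}: ${(failureRate * 100).toFixed(0)}%`);
  }
}

function sendAlert(message: string) {
  // Placeholder: wire this to Slack, PagerDuty, email, etc.
  console.error(`[ALERT] ${message}`);
}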
Marcus Chen
Founder & CEO at Tryb
Marcus builds ethical AI infrastructure.


