robots.txt for AI Agents: Complete Guide
Learn how to read, respect, and work with robots.txt when building AI agents that access the web.
Marcus Chen
Founder & CEO

robots.txt is the web's standard way of telling bots which parts of a site they may and may not access, formalized as the Robots Exclusion Protocol in RFC 9309. Understanding it is essential for building ethical AI agents.
What is robots.txt?
robots.txt is a plain-text file served from the root of a domain (e.g., https://example.com/robots.txt) that gives crawling instructions to bots.
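Because it lives at a well-known path, you can inspect any site's file with a single request. A minimal sketch using the built-in fetch (example.com is a placeholder):

```typescript
// Fetch a site's robots.txt; it is just plain text at a fixed, well-known path.
const res = await fetch('https://example.com/robots.txt');
console.log(res.ok ? await res.text() : `No robots.txt (HTTP ${res.status})`);
```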
Basic Syntax
```
# Allow all bots
User-agent: *
Allow: /

# Block all bots
User-agent: *
Disallow: /

# Block specific paths
User-agent: *
Disallow: /admin/
Disallow: /private/

# Allow specific bot
User-agent: Tryb-Agent
Allow: /
```
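To see how a parser interprets these rules, here is a small sketch that feeds the "Block specific paths" example to robots-parser, the library used later in this guide; the URLs checked are placeholders:

```typescript
import robotsParser from 'robots-parser';

// Parse the "Block specific paths" example from above.
const robots = robotsParser('https://example.com/robots.txt', [
  'User-agent: *',
  'Disallow: /admin/',
  'Disallow: /private/',
].join('\n'));

console.log(robots.isAllowed('https://example.com/blog/post', 'Tryb-Agent'));   // true
console.log(robots.isAllowed('https://example.com/admin/users', 'Tryb-Agent')); // false
```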
Common Directives
| Directive | Meaning |
|---|---|
| User-agent: * | Applies to all bots |
| Disallow: / | Block entire site |
| Disallow: /path/ | Block specific path |
| Allow: /path/ | Explicitly allow path |
| Crawl-delay: 10 | Wait 10 s between requests (non-standard, but honored by many crawlers) |
| Sitemap: url | Location of sitemap |
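Crawl-delay and Sitemap are metadata rather than allow/deny rules. As a small sketch, they can be read through robots-parser's getCrawlDelay() and getSitemaps() helpers (the same library used in the example below):

```typescript
import robotsParser from 'robots-parser';

const robots = robotsParser('https://example.com/robots.txt', [
  'User-agent: *',
  'Crawl-delay: 10',
  'Sitemap: https://example.com/sitemap.xml',
].join('\n'));

console.log(robots.getCrawlDelay('Tryb-Agent')); // 10
console.log(robots.getSitemaps());               // ['https://example.com/sitemap.xml']
```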
AI Agent Considerations
AI agents should:
- Check robots.txt before scraping any domain
- Use a descriptive User-agent string
- Respect Crawl-delay directives
- Cache robots.txt (refresh every 24h); a caching sketch follows the example below
A basic check before fetching any URL:

```typescript
import robotsParser from 'robots-parser';

async function canScrape(url: string): Promise<boolean> {
  const domain = new URL(url).origin;
  const robotsUrl = `${domain}/robots.txt`;

  const response = await fetch(robotsUrl);
  // A missing robots.txt (404) or error page means crawling is not restricted.
  if (!response.ok) return true;

  const robotsTxt = await response.text();
  const robots = robotsParser(robotsUrl, robotsTxt);
  // isAllowed() returns undefined for URLs the file does not cover; treat that as allowed.
  return robots.isAllowed(url, 'Tryb-Agent') ?? true;
}
```
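The check above fetches robots.txt on every call. Below is a minimal sketch of the caching and crawl-delay points from the list above; the 24h TTL, the in-memory cache, and the getRobots/politeFetch helpers are illustrative assumptions, not an established API.

```typescript
import robotsParser from 'robots-parser';

type Robots = ReturnType<typeof robotsParser>;

// Hypothetical in-memory cache keyed by origin, refreshed every 24h per the list above.
const cache = new Map<string, { robots: Robots; fetchedAt: number }>();
const TTL_MS = 24 * 60 * 60 * 1000;

async function getRobots(origin: string): Promise<Robots> {
  const cached = cache.get(origin);
  if (cached && Date.now() - cached.fetchedAt < TTL_MS) return cached.robots;

  const robotsUrl = `${origin}/robots.txt`;
  const response = await fetch(robotsUrl);
  // An empty file parses to "no restrictions".
  const robotsTxt = response.ok ? await response.text() : '';
  const robots = robotsParser(robotsUrl, robotsTxt);
  cache.set(origin, { robots, fetchedAt: Date.now() });
  return robots;
}

async function politeFetch(url: string): Promise<Response | null> {
  const robots = await getRobots(new URL(url).origin);
  if (!(robots.isAllowed(url, 'Tryb-Agent') ?? true)) return null;

  // Honor Crawl-delay if the site declares one.
  const delay = robots.getCrawlDelay('Tryb-Agent');
  if (delay) await new Promise((resolve) => setTimeout(resolve, delay * 1000));

  return fetch(url, { headers: { 'User-Agent': 'Tryb-Agent' } });
}
```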
Legal Status
robots.txt is not legally binding, but:
- Courts have referenced it in scraping cases
- Ignoring it may support claims of trespass or a terms-of-service violation
- Following it demonstrates good faith

Marcus Chen
Founder & CEO at Tryb
Marcus advocates for ethical AI development.


