
# AI crawlers — 2026 reference

State as of 2026-04. Cross-check via WebSearch when running FULL audits — new bots and renames ship monthly.

## The two categories that matter

The blanket "block AI" strategy of 2024 is obsolete. Bots now split into two roles, and treating them the same loses traffic.

### Training bots — scrape content to train future models

No direct user traffic. No citation back. Content vanishes into weights.

| User-agent | Company | Notes |
|---|---|---|
| GPTBot | OpenAI | Training for GPT models |
| Google-Extended | Google | Opt-out for Gemini training |
| CCBot | Common Crawl | Feeds many LLMs (open dataset) |
| anthropic-ai | Anthropic | Legacy training bot (being phased out) |
| ClaudeBot | Anthropic | Current training bot |
| Bytespider | ByteDance / TikTok | Aggressive scraper, frequent complaints |
| Meta-ExternalAgent | Meta | Training for Llama family |
| Meta-ExternalFetcher | Meta | Per-request fetch |
| Applebot-Extended | Apple | Opt-out for Apple Intelligence training |
| Amazonbot | Amazon | Alexa + internal LLMs |
| cohere-ai | Cohere | Training |
| Diffbot | Diffbot | Knowledge Graph construction |
| omgilibot | Webz.io | Data resale |
| img2dataset | Various | Image dataset builders |
| Timpibot | Timpi | Search-index + training hybrid |

### Search / retrieval bots — fetch content to cite in live answers

User asked a question → bot fetches → cites your URL → traffic returns.

| User-agent | Company | Notes |
|---|---|---|
| OAI-SearchBot | OpenAI | Powers ChatGPT Search |
| ChatGPT-User | OpenAI | On-demand fetch when user asks ChatGPT about a URL |
| Claude-SearchBot | Anthropic | Powers Claude web search |
| Claude-User | Anthropic | On-demand fetch inside Claude |
| Claude-Web | Anthropic | Legacy retrieval bot |
| PerplexityBot | Perplexity | Index builder |
| Perplexity-User | Perplexity | On-demand fetch |
| GoogleOther | Google | Various Google retrieval use cases |
| FacebookBot | Meta | Meta AI search |
| DuckAssistBot | DuckDuckGo | DuckAssist answers |
| YouBot | You.com | You.com retrieval |
| MistralAI-User | Mistral | On-demand fetch |

## Recommended default strategy — PERMISSIVE

Rationale: the user's stated goal is to maximise AI visibility. The future-of-search brief favours being cited over being protected.

```
# robots.txt — PERMISSIVE default (allow everything, block problem bots)

# --- Training bots: allow (contributes to brand visibility long-term) ---
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: CCBot
Allow: /

# --- Search / retrieval bots: always allow (direct traffic) ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Block only known-abusive bots (aggressive scraping, no return value) ---
User-agent: Bytespider
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: img2dataset
Disallow: /

# --- Default: allow the rest ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
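The permissive rules can be sanity-checked without deploying anything, using Python's stdlib robots.txt parser. This sketch inlines a reduced copy of the key groups above (one allowed training bot, one blocked abusive bot, the wildcard default); the domain is the same placeholder as in the file:

```python
# Sanity-check the PERMISSIVE robots.txt groups with the stdlib parser.
# Rules are a reduced inline copy of the file above; urllib.robotparser
# matches user-agent tokens case-insensitively.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/"))        # True
print(rp.can_fetch("Bytespider", "https://example.com/"))    # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/"))  # True (wildcard)
```

Note that `urllib.robotparser` implements the classic substring-style agent match, not the RFC 9309 token match some crawlers use, so treat agreement here as a smoke test rather than proof.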

## Alternative — RESTRICTIVE (for premium, paywalled, or regulated content)

```
# robots.txt — RESTRICTIVE (block training, allow retrieval)

# Block all training bots
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Timpibot
Disallow: /

# Allow search/retrieval (keeps citations flowing)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
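The permissive-vs-restrictive choice is easier with real traffic numbers. A rough sketch of an access-log tally, assuming the common/combined log format (the bot names come from the tables above; `SAMPLE` is stand-in data — on a real server, pass in lines from your actual access log, e.g. a hypothetical `/var/log/nginx/access.log`):

```python
# Rough tally of AI-crawler hits per user-agent token, to ground the
# permissive-vs-restrictive decision in real traffic numbers.
# SAMPLE is fabricated stand-in data; replace with your real log lines.
from collections import Counter

BOTS = ["GPTBot", "ClaudeBot", "Claude-User", "OAI-SearchBot",
        "ChatGPT-User", "PerplexityBot", "Bytespider", "CCBot"]

def tally(lines):
    """Count log lines whose user-agent field mentions a known AI bot."""
    counts = Counter()
    for line in lines:
        for bot in BOTS:
            if bot in line:
                counts[bot] += 1
    return counts

SAMPLE = [
    '203.0.113.9 - - [01/Apr/2026] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
    '203.0.113.9 - - [01/Apr/2026] "GET /post HTTP/1.1" 200 512 "-" "Bytespider; spider-feedback@bytedance.com"',
    '198.51.100.4 - - [01/Apr/2026] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
]

for bot, n in tally(SAMPLE).most_common():
    print(f"{bot}: {n}")   # GPTBot: 2, then Bytespider: 1
```

If retrieval bots dominate and training bots are a trickle, the permissive default costs little; if Bytespider-class scrapers dominate, the block list earns its keep.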

## Common mistakes

- Only blocking ClaudeBot — does not block Claude-SearchBot or Claude-User. The same applies to the other vendor families.
- Using GPTBot to block ChatGPT Search — wrong bot. OAI-SearchBot and ChatGPT-User are the search bots.
- Blocking CCBot — has knock-on effects across dozens of downstream LLMs that train on Common Crawl.
- Using wildcards (e.g. `User-agent: *AI*`) — wildcards in robots.txt user-agent lines are not universally supported.
- Relying on `<meta name="robots">` alone — the meta tag is less respected than robots.txt by AI crawlers. Use both.
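The first mistake is easy to demonstrate: a `Disallow` group for ClaudeBot simply never matches Claude-User, which falls through to the wildcard group. A minimal sketch using Python's stdlib parser:

```python
# Blocking ClaudeBot does NOT block Claude-User: robots.txt groups
# match by user-agent token, not by vendor family, so Claude-User
# falls through to the wildcard group.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("ClaudeBot", "https://example.com/"))    # False
print(rp.can_fetch("Claude-User", "https://example.com/"))  # True — still allowed
```

To block the whole family, every token (ClaudeBot, Claude-SearchBot, Claude-User, Claude-Web) needs its own group.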

## Verification

Simulate a request as each bot's user-agent: allowed bots should return 200, blocked bots 403:

```bash
DOMAIN="example.com"
for UA in "GPTBot" "ClaudeBot" "PerplexityBot" "OAI-SearchBot" "ChatGPT-User" "Google-Extended"; do
  CODE=$(curl -sI -A "$UA" -o /dev/null -w "%{http_code}" "https://$DOMAIN/")
  echo "$UA: $CODE"
done
```

This hits the page, not robots.txt directly — but if the origin respects robots.txt via CDN/WAF rules, you'll see the difference.

## Sources to refresh this doc