claude/agents/resources/ai-crawlers-2026.md
bastien 95347d2e47 feat(seo/geo): split into parallel seo + geo agents with shared resources
Refactor the monolithic seo-analyzer into two specialist agents
orchestrated in parallel by the /seo skill, plus a standalone /geo
skill for AI-only audits.

Changes
- agents/seo-analyzer.md: refocused on classical engines (Google, Bing,
  DuckDuckGo). Adds Core Web Vitals 2.0 (LCP/INP/CLS + VSI), CSP + full
  security headers, hreflang audit, video SEO (transcripts), accessibility
  as ranking signal, image/video sitemaps.
- agents/geo-analyzer.md: new agent for AI engines (ChatGPT, Claude,
  Perplexity, Gemini, Google AI Overviews, Copilot). Covers AI crawler
  policy, llms.txt/llms-full.txt, Schema.org for AI extraction (QAPage,
  Speakable, Person+Article, Organization graph), entity SEO (Wikidata,
  sameAs, Knowledge Panel), content shape (Definition Lead, TL;DR,
  Q->A, citable stats, freshness), AI visibility testing.
- agents/resources/: shared knowledge base referenced by both agents —
  ai-crawlers-2026.md (25+ bots, training vs retrieval categories,
  permissive/restrictive templates), llms-txt-template.md, geo-schemas.md
  (incl. deprecated list: ClaimReview, CourseInfo, etc. removed June 2025),
  entity-seo.md, content-shape-for-ai.md, ai-visibility-tools.md,
  automation-catalog.md.
- skills/seo/SKILL.md: becomes parallel dispatcher. Collects context
  once (depth + business), spawns both agents in a single message for
  concurrent execution, merges envelopes into unified SEO.md. Includes
  authoritative file-ownership matrix to prevent parallel-edit races.
- skills/geo/SKILL.md: new standalone wrapper for GEO-only audits.

Scoring
- Combined score: GLOBAL = 0.80 * SEO + 0.20 * GEO (local B2C),
  0.75 * SEO + 0.25 * GEO (SaaS/national/content).
- GEO axis weight raised from 5% (old) to first-class dimension.

Policy
- AI crawlers: permissive default (maximise AI citations). Restrictive
  template available for premium/regulated content.
- Every user action in SEO.md section 11 must cite automation options
  from automation-catalog.md.

Tools
- WebFetch + WebSearch added to allowed-tools of both skills and
  both agents (needed for live CWV via PageSpeed API, AI visibility
  testing, Wikidata/Knowledge Panel lookups, competitor analysis).

Research basis (2026 state of the art validated via WebSearch):
- Core Web Vitals 2.0 (VSI signal, Google core update March 2026)
- AI Overviews trigger on ~48% of Google searches
- ClaimReview + 6 other schema types deprecated June 2025
- Definition Lead Architecture (CMU KDD 2024, +impression score)
- Citations + stats add up to 40% AI visibility (Aggarwal 2024)
- Wikidata grounds every major LLM (ChatGPT, Claude, Gemini, Perplexity)

Backup
- agents/seo-analyzer.md.bak kept for rollback reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-21 16:16:30 +02:00


# AI crawlers — 2026 reference
State as of 2026-04. Cross-check via WebSearch on FULL audits — new
bots and renames ship monthly.
## The two categories that matter
The blanket "block AI" strategy of 2024 is obsolete. Bots now split
into two roles, and treating them the same loses traffic.
### Training bots — scrape content to train future models
No direct user traffic. No citation back. Content vanishes into weights.
| User-agent | Company | Notes |
|---|---|---|
| `GPTBot` | OpenAI | Training for GPT models |
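| `Google-Extended` | Google | Opt-out token for Gemini training. robots.txt control only: crawling is done by the regular Google crawlers, so this string never appears as an HTTP User-Agent |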
| `CCBot` | Common Crawl | Feeds many LLMs (open dataset) |
| `anthropic-ai` | Anthropic | Legacy training bot (being phased out) |
| `ClaudeBot` | Anthropic | Current training bot |
| `Bytespider` | ByteDance / TikTok | Aggressive scraper, frequent complaints |
| `Meta-ExternalAgent` | Meta | Training for Llama family |
| `Meta-ExternalFetcher` | Meta | Per-request, user-initiated fetch (closer to the retrieval role; listed here with its family) |
| `Applebot-Extended` | Apple | Opt-out token for Apple Intelligence training. robots.txt control only: pages are fetched by `Applebot` |
| `Amazonbot` | Amazon | Alexa + internal LLMs |
| `cohere-ai` | Cohere | Training |
| `Diffbot` | Diffbot | Knowledge Graph construction |
| `omgilibot` | Webz.io | Data resale |
| `img2dataset` | Various | Image dataset builders |
| `Timpibot` | Timpi | Search-index + training hybrid |
### Search / retrieval bots — fetch content to cite in live answers
User asked a question → bot fetches → cites your URL → traffic returns.
| User-agent | Company | Notes |
|---|---|---|
| `OAI-SearchBot` | OpenAI | Powers ChatGPT Search |
| `ChatGPT-User` | OpenAI | On-demand fetch when user asks ChatGPT about a URL |
| `Claude-SearchBot` | Anthropic | Powers Claude web search |
| `Claude-User` | Anthropic | On-demand fetch inside Claude |
| `Claude-Web` | Anthropic | Legacy retrieval bot |
| `PerplexityBot` | Perplexity | Index builder |
| `Perplexity-User` | Perplexity | On-demand fetch |
| `GoogleOther` | Google | Various Google retrieval use cases |
| `FacebookBot` | Meta | Meta AI search |
| `DuckAssistBot` | DuckDuckGo | DuckAssist answers |
| `YouBot` | You.com | You.com retrieval |
| `MistralAI-User` | Mistral | On-demand fetch |
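The two tables can be turned into a small lookup for tagging server-log entries by role. The following is a minimal sketch, not a definitive implementation: the function name `classify_user_agent` and the substring-matching approach are assumptions, and the token lists simply mirror the tables above.

```python
# Tokens copied from the two tables above; extend when new bots ship.
TRAINING = (
    "GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot",
    "Bytespider", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Applebot-Extended", "Amazonbot", "cohere-ai", "Diffbot",
    "omgilibot", "img2dataset", "Timpibot",
)
RETRIEVAL = (
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User",
    "Claude-Web", "PerplexityBot", "Perplexity-User", "GoogleOther",
    "FacebookBot", "DuckAssistBot", "YouBot", "MistralAI-User",
)


def classify_user_agent(header: str) -> str:
    """Return 'training', 'retrieval', or 'unknown' for a User-Agent header.

    Matching is a case-insensitive substring test (an assumption of this
    sketch), because real headers wrap the token in a longer string.
    """
    ua = header.lower()
    for token in TRAINING:
        if token.lower() in ua:
            return "training"
    for token in RETRIEVAL:
        if token.lower() in ua:
            return "retrieval"
    return "unknown"


# The header strings below are illustrative, not the bots' exact headers.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

Counting log lines per category gives a rough picture of how much crawl volume is training versus citation-driving retrieval. User-Agent headers can be spoofed, so treat the result as indicative only.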
## Recommended default strategy — PERMISSIVE
Rationale: the user's stated goal is to maximise AI visibility. The
future-of-search brief favours being cited over being protected.
```
# robots.txt — PERMISSIVE default (allow everything, block problem bots)
# --- Training bots: allow (contributes to brand visibility long-term) ---
User-agent: GPTBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: CCBot
Allow: /
# --- Search / retrieval bots: always allow (direct traffic) ---
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# --- Block only known-abusive bots (aggressive scraping, no return value) ---
User-agent: Bytespider
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: img2dataset
Disallow: /
# --- Default: allow the rest ---
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
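Since the template is one group per token plus a catch-all, it can be generated from two lists instead of being maintained by hand. A minimal sketch under that assumption follows; `render_robots` and the `ALLOW`/`BLOCK` names are hypothetical helpers, not part of any tool referenced here.

```python
def render_robots(allow: list[str], block: list[str], sitemap: str) -> str:
    """Render one group per token, then a catch-all group and the sitemap."""
    lines: list[str] = []
    for token in allow:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    for token in block:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    # Catch-all group: anything not named above is allowed.
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


# Token lists taken from the permissive template above.
ALLOW = [
    "GPTBot", "Google-Extended", "ClaudeBot", "Applebot-Extended",
    "Meta-ExternalAgent", "CCBot", "OAI-SearchBot", "ChatGPT-User",
    "Claude-SearchBot", "Claude-User", "PerplexityBot", "Perplexity-User",
]
BLOCK = ["Bytespider", "omgilibot", "img2dataset"]

print(render_robots(ALLOW, BLOCK, "https://example.com/sitemap.xml"))
```

Moving tokens from `ALLOW` to `BLOCK` produces a restrictive variant from the same function, which keeps both policies in sync with the bot tables.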
## Alternative — RESTRICTIVE (for premium content, paywalled, regulated)
```
# robots.txt — RESTRICTIVE (block training, allow retrieval)
# Block all training bots
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: img2dataset
Disallow: /
# Allow search/retrieval (keeps citations flowing)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
## Common mistakes
- **Only blocking `ClaudeBot`** — does not block `Claude-SearchBot` or `Claude-User`. Same for other families.
- **Using `GPTBot` to block ChatGPT Search** — wrong. `OAI-SearchBot` and `ChatGPT-User` are the search bots.
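- **Blocking `CCBot` without weighing the cost**: Common Crawl feeds dozens of downstream LLMs, so a single rule removes the site from all of them at once. Block it deliberately (as the restrictive template does), not by reflex.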
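- **Using partial wildcards in `User-agent`** (e.g. `User-agent: *AI*`): not valid. Only a bare `*` is recognised as a user-agent value. Pattern wildcards belong in `Allow`/`Disallow` paths, and even there support varies by crawler.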
- **Relying on meta robots** — `<meta name="robots">` is less respected than robots.txt by AI crawlers. Use both.
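The first mistake in the list (naming one bot of a family and forgetting its siblings) is easy to detect mechanically. The sketch below is a rough audit, not a full robots.txt parser: the `FAMILIES` grouping and the function names are assumptions made for illustration, and it only checks which tokens have an explicit `User-agent` line.

```python
import re

# Vendor families built from the tables in this document.
FAMILIES = {
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "Anthropic": ["ClaudeBot", "Claude-SearchBot", "Claude-User"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
}


def named_agents(robots_txt: str) -> set[str]:
    """Collect the lowercase tokens that have an explicit User-agent line."""
    pattern = re.compile(r"^\s*user-agent\s*:\s*([^#\s]+)", re.I | re.M)
    return {m.group(1).lower() for m in pattern.finditer(robots_txt)}


def partially_covered(robots_txt: str) -> dict[str, list[str]]:
    """Map each family to the members without their own group, but only
    when at least one sibling does have one (the 'forgot the rest' case)."""
    named = named_agents(robots_txt)
    gaps: dict[str, list[str]] = {}
    for family, tokens in FAMILIES.items():
        missing = [t for t in tokens if t.lower() not in named]
        if missing and len(missing) < len(tokens):
            gaps[family] = missing
    return gaps


print(partially_covered("User-agent: ClaudeBot\nDisallow: /\n"))
```

A family that is entirely absent is not flagged, since it falls through to the `*` group by design. Only half-configured families are reported.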
## Verification
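robots.txt is advisory: the origin serves the page regardless, unless a CDN/WAF rule enforces the policy. Where such rules exist, simulated requests should return 200 for allowed bots and 403 for blocked ones:
```bash
DOMAIN="example.com"
# Google-Extended is omitted: it is a robots.txt token only and is never sent as an HTTP User-Agent.
for UA in "GPTBot" "ClaudeBot" "PerplexityBot" "OAI-SearchBot" "ChatGPT-User"; do
  # -I sends a HEAD request; -w prints only the HTTP status code
  CODE=$(curl -sI -A "$UA" -o /dev/null -w "%{http_code}" "https://$DOMAIN/")
  echo "$UA: $CODE"
done
```
This hits the page, not robots.txt directly. Identical codes for every bot therefore mean only that nothing enforces the policy at the edge, not that the robots.txt rules are wrong.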
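Because the curl loop exercises the page rather than the policy file, a complementary check is to evaluate the robots.txt rules themselves. The sketch below uses Python's standard `urllib.robotparser`. The `evaluate` helper and the `SAMPLE` policy are illustrative assumptions, and Python's matching may differ in detail from how each crawler parses the file.

```python
from urllib.robotparser import RobotFileParser


def evaluate(robots_txt: str, tokens: list[str], path: str = "/") -> dict[str, bool]:
    """Return {token: allowed?} according to the robots.txt rules alone."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in tokens}


# A shortened policy in the style of the restrictive template.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

for token, allowed in evaluate(SAMPLE, ["GPTBot", "OAI-SearchBot", "YouBot"]).items():
    print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

For a live site, `RobotFileParser.set_url()` followed by `read()` loads the file over HTTP instead of parsing a string. Running both this check and the curl loop separates "the policy says blocked" from "the edge actually blocks".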
## Sources to refresh this doc
- https://platform.openai.com/docs/bots
- https://darkvisitors.com/agents (community-maintained)
- https://github.com/ai-robots-txt/ai.robots.txt
- Anthropic docs: https://docs.anthropic.com/
- Cloudflare AI crawlers dashboard (if account available)