Refactor the monolithic seo-analyzer into two specialist agents orchestrated in parallel by the /seo skill, plus a standalone /geo skill for AI-only audits.

Changes
- agents/seo-analyzer.md: refocused on classical engines (Google, Bing, DuckDuckGo). Adds Core Web Vitals 2.0 (LCP/INP/CLS + VSI), CSP + full security headers, hreflang audit, video SEO (transcripts), accessibility as ranking signal, image/video sitemaps.
- agents/geo-analyzer.md: new agent for AI engines (ChatGPT, Claude, Perplexity, Gemini, Google AI Overviews, Copilot). Covers AI crawler policy, llms.txt/llms-full.txt, Schema.org for AI extraction (QAPage, Speakable, Person+Article, Organization graph), entity SEO (Wikidata, sameAs, Knowledge Panel), content shape (Definition Lead, TL;DR, Q->A, citable stats, freshness), AI visibility testing.
- agents/resources/: shared knowledge base referenced by both agents: ai-crawlers-2026.md (25+ bots, training vs retrieval categories, permissive/restrictive templates), llms-txt-template.md, geo-schemas.md (incl. deprecated list: ClaimReview, CourseInfo, etc. removed June 2025), entity-seo.md, content-shape-for-ai.md, ai-visibility-tools.md, automation-catalog.md.
- skills/seo/SKILL.md: becomes parallel dispatcher. Collects context once (depth + business), spawns both agents in a single message for concurrent execution, merges envelopes into unified SEO.md. Includes authoritative file-ownership matrix to prevent parallel-edit races.
- skills/geo/SKILL.md: new standalone wrapper for GEO-only audits.

Scoring
- Combined score: GLOBAL = 0.80 * SEO + 0.20 * GEO (local B2C), 0.75 * SEO + 0.25 * GEO (SaaS/national/content).
- GEO axis weight raised from 5% (old) to first-class dimension.

Policy
- AI crawlers: permissive default (maximise AI citations). Restrictive template available for premium/regulated content.
- Every user action in SEO.md section 11 must cite automation options from automation-catalog.md.

Tools
- WebFetch + WebSearch added to allowed-tools of both skills and both agents (needed for live CWV via PageSpeed API, AI visibility testing, Wikidata/Knowledge Panel lookups, competitor analysis).

Research basis (2026 state of the art validated via WebSearch):
- Core Web Vitals 2.0 (VSI signal, Google core update March 2026)
- AI Overviews trigger on ~48% of Google searches
- ClaimReview + 6 other schema types deprecated June 2025
- Definition Lead Architecture (CMU KDD 2024, +impression score)
- Citations + stats add up to 40% AI visibility (Aggarwal 2024)
- Wikidata grounds every major LLM (ChatGPT, Claude, Gemini, Perplexity)

Backup
- agents/seo-analyzer.md.bak kept for rollback reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

210 lines
5.5 KiB
Markdown
# AI crawlers: 2026 reference

State as of 2026-04. Cross-check via WebSearch on FULL audits; new
bots and renames ship monthly.

## The two categories that matter

The blanket "block AI" strategy of 2024 is obsolete. Bots now split
into two roles, and treating them the same loses traffic.

### Training bots: scrape content to train future models

No direct user traffic. No citation back. Content vanishes into weights.

| User-agent | Company | Notes |
|---|---|---|
| `GPTBot` | OpenAI | Training for GPT models |
| `Google-Extended` | Google | Opt-out for Gemini training |
| `CCBot` | Common Crawl | Feeds many LLMs (open dataset) |
| `anthropic-ai` | Anthropic | Legacy training bot (being phased out) |
| `ClaudeBot` | Anthropic | Current training bot |
| `Bytespider` | ByteDance / TikTok | Aggressive scraper, frequent complaints |
| `Meta-ExternalAgent` | Meta | Training for Llama family |
| `Meta-ExternalFetcher` | Meta | Per-request, user-initiated fetch (behaves like a retrieval bot) |
| `Applebot-Extended` | Apple | Opt-out for Apple Intelligence training |
| `Amazonbot` | Amazon | Alexa + internal LLMs |
| `cohere-ai` | Cohere | Training |
| `Diffbot` | Diffbot | Knowledge Graph construction |
| `omgilibot` | Webz.io | Data resale |
| `img2dataset` | Various | Image dataset builders |
| `Timpibot` | Timpi | Search-index + training hybrid |

### Search / retrieval bots: fetch content to cite in live answers

User asked a question → bot fetches → cites your URL → traffic returns.

| User-agent | Company | Notes |
|---|---|---|
| `OAI-SearchBot` | OpenAI | Powers ChatGPT Search |
| `ChatGPT-User` | OpenAI | On-demand fetch when user asks ChatGPT about a URL |
| `Claude-SearchBot` | Anthropic | Powers Claude web search |
| `Claude-User` | Anthropic | On-demand fetch inside Claude |
| `Claude-Web` | Anthropic | Legacy retrieval bot |
| `PerplexityBot` | Perplexity | Index builder |
| `Perplexity-User` | Perplexity | On-demand fetch |
| `GoogleOther` | Google | Various Google retrieval use cases |
| `FacebookBot` | Meta | Meta AI search |
| `DuckAssistBot` | DuckDuckGo | DuckAssist answers |
| `YouBot` | You.com | You.com retrieval |
| `MistralAI-User` | Mistral | On-demand fetch |

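The split between the two tables above can be captured in a small lookup, which is handy when triaging user-agents found in access logs. The following is a minimal sketch, assuming only the tokens listed in this document; `classify_bot` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# classify_bot UA_TOKEN: print "training", "retrieval", or "unknown".
# Hypothetical helper; the token lists mirror the two tables in this document.
classify_bot() {
  case "$1" in
    GPTBot|Google-Extended|CCBot|anthropic-ai|ClaudeBot) echo training ;;
    # Meta-ExternalFetcher is listed in the training table although its
    # note describes per-request fetching.
    Bytespider|Meta-ExternalAgent|Meta-ExternalFetcher) echo training ;;
    Applebot-Extended|Amazonbot|cohere-ai|Diffbot) echo training ;;
    omgilibot|img2dataset|Timpibot) echo training ;;
    OAI-SearchBot|ChatGPT-User) echo retrieval ;;
    Claude-SearchBot|Claude-User|Claude-Web) echo retrieval ;;
    PerplexityBot|Perplexity-User|GoogleOther|FacebookBot) echo retrieval ;;
    DuckAssistBot|YouBot|MistralAI-User) echo retrieval ;;
    *) echo unknown ;;
  esac
}

for ua in GPTBot OAI-SearchBot SomeNewBot; do
  printf '%s: %s\n' "$ua" "$(classify_bot "$ua")"
done
```

Anything reported as `unknown` is a candidate for the WebSearch cross-check mentioned at the top of this file.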
## Recommended default strategy: PERMISSIVE

Rationale: the user's stated goal is to maximise AI visibility. The
future-of-search brief favours being cited over being protected.

```
# robots.txt: PERMISSIVE default (allow everything, block problem bots)

# --- Training bots: allow (contributes to brand visibility long-term) ---
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: CCBot
Allow: /

# --- Search / retrieval bots: always allow (direct traffic) ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Block only known-abusive bots (aggressive scraping, no return value) ---
User-agent: Bytespider
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: img2dataset
Disallow: /

# --- Default: allow the rest ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

## Alternative: RESTRICTIVE (for premium content, paywalled, regulated)

```
# robots.txt: RESTRICTIVE (block training, allow retrieval)

# Block all training bots
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Timpibot
Disallow: /

# Also blocked in the PERMISSIVE template, so keep them blocked here
User-agent: omgilibot
Disallow: /

User-agent: img2dataset
Disallow: /

# Allow search/retrieval (keeps citations flowing)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

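Because the restrictive policy is simply "Disallow for training tokens, Allow for retrieval tokens", the template can be generated from two lists instead of being maintained by hand. This is a rough sketch under that assumption; the variable names `TRAINING` and `RETRIEVAL`, the helper names, and the shortened token lists are illustrative only.

```shell
#!/bin/sh
# Illustrative excerpts of the two categories; extend them from the tables.
TRAINING="GPTBot Google-Extended ClaudeBot anthropic-ai CCBot"
RETRIEVAL="OAI-SearchBot ChatGPT-User Claude-SearchBot Claude-User"

# emit_groups DIRECTIVE TOKEN...: print one robots.txt group per token.
emit_groups() {
  directive=$1
  shift
  for token in "$@"; do
    printf 'User-agent: %s\n%s: /\n\n' "$token" "$directive"
  done
}

# render_restrictive: block training tokens, allow retrieval tokens,
# then allow everything else.
render_restrictive() {
  # The lists are left unquoted on purpose so they split into separate tokens.
  emit_groups Disallow $TRAINING
  emit_groups Allow $RETRIEVAL
  printf 'User-agent: *\nAllow: /\n'
}

render_restrictive
```

Keeping the lists in one place means a renamed or newly announced bot is a one-word change rather than a new hand-written group.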
## Common mistakes

- **Only blocking `ClaudeBot`**: this does not block `Claude-SearchBot` or `Claude-User`. The same applies to other bot families; each token needs its own group.
- **Using `GPTBot` to block ChatGPT Search**: wrong. `GPTBot` is the training bot; `OAI-SearchBot` and `ChatGPT-User` are the search bots.
- **Blocking `CCBot`**: has knock-on effects across dozens of downstream LLMs that train on Common Crawl.
- **Using wildcards** (e.g. `User-agent: *AI*`): partial wildcards in a `User-agent` value are not part of the robots.txt standard (RFC 9309). Only a bare `*` is defined, so list each token explicitly.
- **Relying on meta robots**: `<meta name="robots">` is less respected than robots.txt by AI crawlers, and it is only seen after the page has already been fetched. Use both.

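When the intent really is to shut out a whole vendor, the first mistake above can be avoided by expanding the vendor into every token this document lists for it. The sketch below assumes exactly those tokens; `family_tokens` and `block_family` are hypothetical helper names.

```shell
#!/bin/sh
# family_tokens VENDOR: print every robots.txt token listed for that vendor.
family_tokens() {
  case "$1" in
    anthropic)  echo "anthropic-ai ClaudeBot Claude-SearchBot Claude-User Claude-Web" ;;
    openai)     echo "GPTBot OAI-SearchBot ChatGPT-User" ;;
    perplexity) echo "PerplexityBot Perplexity-User" ;;
    *)          return 1 ;;
  esac
}

# block_family VENDOR: emit one Disallow group per token in the family.
block_family() {
  tokens=$(family_tokens "$1") || { echo "unknown vendor: $1" >&2; return 1; }
  for token in $tokens; do
    printf 'User-agent: %s\nDisallow: /\n\n' "$token"
  done
}

block_family anthropic
```

Blocking a full family also removes the retrieval bots, and with them the citations, so this only fits cases where no AI exposure is wanted for that vendor.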
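## Verification

robots.txt is advisory and never changes HTTP status codes by itself.
The simulated requests below show whether a CDN/WAF enforces the policy
per user-agent: where it does, expect 200 for allowed bots and 403 for
blocked ones.

```bash
DOMAIN="example.com"
# Send one HEAD request per bot token and print the HTTP status code.
for UA in "GPTBot" "ClaudeBot" "PerplexityBot" "OAI-SearchBot" "ChatGPT-User" "Google-Extended"; do
  CODE=$(curl -sI -A "$UA" -o /dev/null -w "%{http_code}" "https://$DOMAIN/")
  echo "$UA: $CODE"
done
```

This hits the page, not robots.txt directly. If the origin enforces
robots.txt via CDN/WAF rules, you'll see the difference; if it does
not, every bot returns the same status even when robots.txt disallows
it. `Google-Extended` is a robots.txt control token rather than a
request user-agent, so its result here is informational only.
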
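Since the curl loop above tests page responses rather than the rules themselves, a complementary check is to read robots.txt and report which rule applies to a given token. The following is a simplified sketch, not a full RFC 9309 parser: `robots_rule` is a hypothetical helper that does exact, case-insensitive token matching, falls back to the `*` group, and does not evaluate paths.

```shell
#!/bin/sh
# robots_rule FILE UA: print the first Allow/Disallow rule that applies to UA.
robots_rule() {
  awk -v ua="$2" '
    BEGIN { ua = tolower(ua) }
    {
      sub(/#.*/, "")                       # drop comments
      i = index($0, ":")
      if (i == 0) next                     # blank or malformed line
      key = tolower(substr($0, 1, i - 1))
      val = substr($0, i + 1)
      gsub(/^[ \t]+|[ \t\r]+$/, "", key)
      gsub(/^[ \t]+|[ \t\r]+$/, "", val)
      if (key == "user-agent") {
        if (in_rules) { in_rules = 0; hit = 0; star = 0 }   # a new group starts
        if (tolower(val) == ua) hit = 1
        if (val == "*") star = 1
      } else if (key == "allow" || key == "disallow") {
        in_rules = 1
        if (hit && specific == "") specific = key ": " val
        if (star && fallback == "") fallback = key ": " val
      }
    }
    END {
      if (specific != "") print specific
      else if (fallback != "") print fallback
      else print "allow: /"                # no matching group: crawling is allowed
    }
  ' "$1"
}

# Demo against a small local sample instead of a live site.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
EOF
for ua in GPTBot OAI-SearchBot SomeNewBot; do
  printf '%s -> %s\n' "$ua" "$(robots_rule "$tmp" "$ua")"
done
rm -f "$tmp"
```

For a live audit, save the file first (for example with `curl -s "https://$DOMAIN/robots.txt" > robots.txt`) and pass that path as the first argument.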
## Sources to refresh this doc

- https://platform.openai.com/docs/bots
- https://darkvisitors.com/agents (community-maintained)
- https://github.com/ai-robots-txt/ai.robots.txt
- Anthropic docs: https://docs.anthropic.com/
- Cloudflare AI crawlers dashboard (if account available)