bastien 95347d2e47 feat(seo/geo): split into parallel seo + geo agents with shared resources

Refactor the monolithic seo-analyzer into two specialist agents
orchestrated in parallel by the /seo skill, plus a standalone /geo
skill for AI-only audits.

Changes
- agents/seo-analyzer.md: refocused on classical engines (Google, Bing,
  DuckDuckGo). Adds Core Web Vitals 2.0 (LCP/INP/CLS + VSI), CSP + full
  security headers, hreflang audit, video SEO (transcripts), accessibility
  as ranking signal, image/video sitemaps.
- agents/geo-analyzer.md: new agent for AI engines (ChatGPT, Claude,
  Perplexity, Gemini, Google AI Overviews, Copilot). Covers AI crawler
  policy, llms.txt/llms-full.txt, Schema.org for AI extraction (QAPage,
  Speakable, Person+Article, Organization graph), entity SEO (Wikidata,
  sameAs, Knowledge Panel), content shape (Definition Lead, TL;DR,
  Q->A, citable stats, freshness), AI visibility testing.
- agents/resources/: shared knowledge base referenced by both agents —
  ai-crawlers-2026.md (25+ bots, training vs retrieval categories,
  permissive/restrictive templates), llms-txt-template.md, geo-schemas.md
  (incl. deprecated list: ClaimReview, CourseInfo, etc. removed June 2025),
  entity-seo.md, content-shape-for-ai.md, ai-visibility-tools.md,
  automation-catalog.md.
- skills/seo/SKILL.md: becomes parallel dispatcher. Collects context
  once (depth + business), spawns both agents in a single message for
  concurrent execution, merges envelopes into unified SEO.md. Includes
  authoritative file-ownership matrix to prevent parallel-edit races.
- skills/geo/SKILL.md: new standalone wrapper for GEO-only audits.

Scoring
- Combined score: GLOBAL = 0.80 * SEO + 0.20 * GEO (local B2C),
  0.75 * SEO + 0.25 * GEO (SaaS/national/content).
- GEO axis weight raised from 5% (old) to first-class dimension.

Policy
- AI crawlers: permissive default (maximise AI citations). Restrictive
  template available for premium/regulated content.
- Every user action in SEO.md section 11 must cite automation options
  from automation-catalog.md.

Tools
- WebFetch + WebSearch added to allowed-tools of both skills and
  both agents (needed for live CWV via PageSpeed API, AI visibility
  testing, Wikidata/Knowledge Panel lookups, competitor analysis).

Research basis (2026 state of the art validated via WebSearch):
- Core Web Vitals 2.0 (VSI signal, Google core update March 2026)
- AI Overviews trigger on ~48% of Google searches
- ClaimReview + 6 other schema types deprecated June 2025
- Definition Lead Architecture (CMU KDD 2024, +impression score)
- Citations + stats add up to 40% AI visibility (Aggarwal 2024)
- Wikidata grounds every major LLM (ChatGPT, Claude, Gemini, Perplexity)

Backup
- agents/seo-analyzer.md.bak kept for rollback reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-21 16:16:30 +02:00

5.5 KiB


AI crawlers — 2026 reference

State as of 2026-04. Cross-check via WebSearch on FULL audits — new bots and renames ship monthly.

The two categories that matter

The blanket "block AI" strategy of 2024 is obsolete. Bots now split into two roles, and treating them the same loses traffic.

Training bots — scrape content to train future models

No direct user traffic. No citation back. Content vanishes into weights.

User-agent	Company	Notes
`GPTBot`	OpenAI	Training for GPT models
`Google-Extended`	Google	Opt-out for Gemini training
`CCBot`	Common Crawl	Feeds many LLMs (open dataset)
`anthropic-ai`	Anthropic	Legacy training bot (being phased out)
`ClaudeBot`	Anthropic	Current training bot
`Bytespider`	ByteDance / TikTok	Aggressive scraper, frequent complaints
`Meta-ExternalAgent`	Meta	Training for Llama family
`Meta-ExternalFetcher`	Meta	Per-request fetch triggered by a user action (closer to retrieval than to training)
`Applebot-Extended`	Apple	Opt-out for Apple Intelligence training
`Amazonbot`	Amazon	Alexa + internal LLMs
`cohere-ai`	Cohere	Training
`Diffbot`	Diffbot	Knowledge Graph construction
`omgilibot`	Webz.io	Data resale
`img2dataset`	Various	Image dataset builders
`Timpibot`	Timpi	Search-index + training hybrid

Search / retrieval bots — fetch content to cite in live answers

User asks a question → bot fetches → answer cites your URL → traffic returns.

User-agent	Company	Notes
`OAI-SearchBot`	OpenAI	Powers ChatGPT Search
`ChatGPT-User`	OpenAI	On-demand fetch when user asks ChatGPT about a URL
`Claude-SearchBot`	Anthropic	Powers Claude web search
`Claude-User`	Anthropic	On-demand fetch inside Claude
`Claude-Web`	Anthropic	Legacy retrieval bot
`PerplexityBot`	Perplexity	Index builder
`Perplexity-User`	Perplexity	On-demand fetch
`GoogleOther`	Google	Various Google retrieval use cases
`FacebookBot`	Meta	Meta AI search
`DuckAssistBot`	DuckDuckGo	DuckAssist answers
`YouBot`	You.com	You.com retrieval
`MistralAI-User`	Mistral	On-demand fetch
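
The split above can be applied to server logs with a small lookup. The following is a minimal sketch, not a complete detector: the token sets are an abridged copy of the two tables, the function name `classify` is hypothetical, and matching is a plain case-insensitive substring test on the raw User-Agent header. `Google-Extended` and `Applebot-Extended` are left out on the assumption that they are robots.txt control tokens rather than strings sent in request headers.

```python
# Abridged copy of the two tables above; extend it when the tables change.
TRAINING = {"GPTBot", "CCBot", "anthropic-ai", "ClaudeBot", "Bytespider",
            "Meta-ExternalAgent", "Amazonbot", "cohere-ai", "Diffbot"}
RETRIEVAL = {"OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User",
             "PerplexityBot", "Perplexity-User", "DuckAssistBot", "YouBot"}


def classify(user_agent: str) -> str:
    """Return 'training', 'retrieval', or 'unknown' for a raw User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING):
        return "training"
    if any(token.lower() in ua for token in RETRIEVAL):
        return "retrieval"
    return "unknown"


# Illustrative header strings, not copied from vendor documentation.
print(classify("Mozilla/5.0 (compatible; GPTBot/1.2)"))        # training
print(classify("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))  # retrieval
print(classify("Mozilla/5.0 (X11; Linux x86_64) Firefox/125")) # unknown
```

User-Agent headers are trivially spoofed, so a result from this lookup is a hint for log analysis rather than proof of origin.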

Recommended default strategy — PERMISSIVE

Rationale: the user's stated goal is to maximise AI visibility. The future-of-search brief favours being cited over being protected.

# robots.txt — PERMISSIVE default (allow everything, block problem bots)

# --- Training bots: allow (contributes to brand visibility long-term) ---
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: CCBot
Allow: /

# --- Search / retrieval bots: always allow (direct traffic) ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Block only known-abusive bots (aggressive scraping, no return value) ---
User-agent: Bytespider
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: img2dataset
Disallow: /

# --- Default: allow the rest ---
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Alternative — RESTRICTIVE (for premium content, paywalled, regulated)

# robots.txt — RESTRICTIVE (block training, allow retrieval)

# Block all training bots
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Timpibot
Disallow: /

# Allow search/retrieval (keeps citations flowing)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
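
Both templates follow the same shape: one group per named bot, a catch-all group, then the sitemap. A minimal sketch of generating them from two lists is shown below, so the file does not have to be maintained by hand. The helper name `render_robots` and the abridged bot lists are assumptions made for illustration.

```python
# Abridged lists; the full sets are in the tables earlier in this document.
TRAINING_BOTS = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot"]


def render_robots(disallow: list[str], allow: list[str], sitemap: str) -> str:
    """Render one robots.txt group per user-agent, then the catch-all group."""
    groups = [f"User-agent: {ua}\nDisallow: /" for ua in disallow]
    groups += [f"User-agent: {ua}\nAllow: /" for ua in allow]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap}\n"


# RESTRICTIVE: block training bots, allow retrieval bots.
restrictive = render_robots(TRAINING_BOTS, RETRIEVAL_BOTS,
                            "https://example.com/sitemap.xml")
print(restrictive)
```

The PERMISSIVE variant is the same call with only the known-abusive bots in `disallow` and every other named bot in `allow`.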

Common mistakes

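Only blocking ClaudeBot: this leaves Claude-SearchBot and Claude-User unblocked. The same applies to the OpenAI and Perplexity families, where each bot has its own user-agent token.
Using GPTBot to block ChatGPT Search: GPTBot is the training bot. OAI-SearchBot and ChatGPT-User are the search bots.
Blocking CCBot: this has knock-on effects across dozens of downstream LLMs that train on Common Crawl.
Using wildcards in the user-agent (e.g. User-agent: *AI*): robots.txt matches the User-agent line against a crawler's product token, and the only special value is a bare *. Partial wildcards are not part of the standard, so list each bot by name.
Relying on meta robots alone: <meta name="robots"> is less consistently honoured than robots.txt by AI crawlers, and a crawler only sees it after fetching the page. Use both.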
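
The first two mistakes come down to blocking one member of a vendor family and missing its siblings. A minimal sketch of a check for that is given below. The `FAMILIES` mapping is assembled from the tables in this document, and the function name `unblocked_siblings` is hypothetical.

```python
FAMILIES = {
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "Anthropic": ["ClaudeBot", "anthropic-ai", "Claude-SearchBot",
                  "Claude-User", "Claude-Web"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
}


def unblocked_siblings(blocked: set[str]) -> dict[str, list[str]]:
    """For each vendor with at least one blocked bot, list its bots still allowed."""
    blocked_lower = {ua.lower() for ua in blocked}
    gaps = {}
    for vendor, bots in FAMILIES.items():
        hit = [b for b in bots if b.lower() in blocked_lower]
        missed = [b for b in bots if b.lower() not in blocked_lower]
        if hit and missed:
            gaps[vendor] = missed
    return gaps


print(unblocked_siblings({"ClaudeBot"}))
# {'Anthropic': ['anthropic-ai', 'Claude-SearchBot', 'Claude-User', 'Claude-Web']}
```

A non-empty result is not always an error: the RESTRICTIVE template deliberately blocks training bots while leaving the retrieval siblings allowed. The check is only meant to make that choice explicit.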

Verification

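robots.txt is advisory and does not change HTTP status codes by itself. A blocked bot receives 403 only where a CDN/WAF rule enforces the block. Where such rules exist, simulated requests should return 200 for allowed user-agents and 403 for blocked ones: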

DOMAIN="example.com"
for UA in "GPTBot" "ClaudeBot" "PerplexityBot" "OAI-SearchBot" "ChatGPT-User" "Google-Extended"; do
  CODE=$(curl -sI -A "$UA" -o /dev/null -w "%{http_code}" "https://$DOMAIN/")
  echo "$UA: $CODE"
done

This requests the page itself, not robots.txt. A difference in status codes therefore shows only whether a CDN/WAF rule filters by user-agent. To confirm the robots.txt rules themselves, read https://$DOMAIN/robots.txt directly.
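
The robots.txt rules can also be checked offline, without sending any request to the site. The sketch below uses Python's standard `urllib.robotparser`. The helper name `robots_verdicts` and the `SAMPLE` policy are assumptions made for illustration. The standard-library parser is a simplified matcher, so real crawlers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def robots_verdicts(robots_txt: str, user_agents: list[str],
                    url: str = "/") -> dict[str, bool]:
    """Map each user-agent to whether the given robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, url) for ua in user_agents}


# Tiny stand-in policy; paste the real file contents here when auditing.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

verdicts = robots_verdicts(SAMPLE, ["GPTBot", "OAI-SearchBot", "SomeOtherBot"])
print(verdicts)
# {'GPTBot': False, 'OAI-SearchBot': True, 'SomeOtherBot': True}
```

Comparing these verdicts with the status codes from the curl loop separates what the policy says from what the server enforces.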

Sources to refresh this doc

https://platform.openai.com/docs/bots
https://darkvisitors.com/agents (community-maintained)
https://github.com/ai-robots-txt/ai.robots.txt
Anthropic docs: https://docs.anthropic.com/
Cloudflare AI crawlers dashboard (if account available)
