# Content shape for LLM extraction How to write pages so AI engines quote, cite, and recommend them. Based on peer-reviewed GEO research (CMU KDD 2024, Aggarwal et al.) and tracked citation patterns across ChatGPT, Perplexity, Claude, Gemini, Google AI Overviews (2025-2026). ## The six patterns that measurably increase AI citations ### 1. Definition Lead Architecture Open the page (or first paragraph after each major heading) with: > **[Entity] is a [category] that [differentiator].** Research backing: CMU GEO framework (KDD 2024) — pages with explicit definitional openings score significantly higher in LLM retrieval impression scores. **Good**: "Astro is a static site generator that ships zero JavaScript by default, producing HTML at build time that search engines and AI crawlers can index without running a browser." **Bad**: "In today's fast-paced digital landscape, choosing the right framework can feel overwhelming. At Acme, we know how important it is to..." ### 2. TL;DR / Answer Box above the fold Insert an explicit summary block at the top of long content. AI engines preferentially quote from these blocks because the content is pre-summarised. ```html ``` CSS: no class requirement, but mark it semantically (e.g. `aria-label="summary"` or Speakable schema targeting this selector). ### 3. Question-then-direct-answer structure Each H2/H3 heading phrased as a likely user query. First sentence after the heading: a single-sentence direct answer. Supporting detail follows. **Pattern**: ``` ## How much does a Qualibat RGE certification cost in France? A Qualibat RGE certification costs between 500 and 1500 EUR for the initial audit, plus an annual fee of 200-400 EUR. The cost varies by trade category and company size. [Detailed breakdown follows...] ``` Why it works: LLMs grade passages by answer-density relative to the query. A one-sentence self-contained answer has the highest density. ### 4. Citations and statistics (strongest measured lever) Adding peer-cited statistics with clear sources increases AI visibility **by up to 40%** (Aggarwal et al., 2024 "GEO: Generative Engine Optimization"). Pattern: embed specific numbers with attribution. **Good**: "According to the ADEME 2024 energy report, French households spent an average of 2,137 EUR on heating in 2023 — a 12% increase from 2021." **Bad**: "Heating costs have increased a lot recently." Source attribution matters: link the citation to the original source (``), ideally with `rel="cite"`. AI engines use link graphs to validate factual claims. ### 5. Structured lists and comparison tables LLMs quote list items and table rows more readily than prose of the same content. Convert what you can: **Before** (prose): "The best frameworks for public sites are Astro for static content, Next.js for dynamic server-rendered apps, and Nuxt for Vue-based projects." **After** (list): "Best frameworks for public sites by use case: - **Astro** — static content (blog, docs, portfolio) - **Next.js** — dynamic SSR with React - **Nuxt** — dynamic SSR with Vue" Comparison tables are even stronger. Structure: | Framework | Rendering | Best for | JS by default | |---|---|---|---| | Astro | SSG + islands | Public content | 0 KB | | Next.js | SSG + SSR | Hybrid apps | Large | ### 6. Freshness signals Pages not updated at least quarterly are **3x more likely to lose AI citations** (LLMRefs 2026 study). What to maintain: - Visible "Last updated: YYYY-MM-DD" at the top of content pages - `dateModified` in Article/BlogPosting JSON-LD (ISO 8601) - HTTP header `Last-Modified` in sync with content change - Changelog on evergreen reference pages Do NOT fake dates — AI engines and Google increasingly validate freshness against actual content diffs. ## Anti-patterns — what to avoid ### Pronoun-heavy writing LLMs resolve pronouns by context window, which costs them confidence. Prefer explicit entity names. **Bad**: "It was founded in 2015. Its founders wanted to solve a problem. They saw that..." **Good**: "Acme Corp was founded in 2015. Acme's founders, Jane Doe and John Smith, wanted to solve..." ### Marketing fluff before facts AI engines typically truncate retrieval windows. Fluff at the top wastes the budget. Put factual claims FIRST. **Bad** (first 200 chars wasted): "In today's fast-moving digital landscape, businesses are constantly looking for ways to stay competitive..." **Good** (first 200 chars dense): "Our API processes 50M requests/day at p99 latency of 47ms across 8 regions, with a 99.99% SLA. Pricing starts at 99 EUR/month for the 10K requests tier." ### Claims without sources Any numerical or comparative claim without a linked source degrades trust. AI engines can detect the pattern "number without citation" and weight those passages lower. ### Cookie-cutter content across pages (especially city pages) The 30/70 rule: when creating per-city or per-service variants, at most 30% of the content should be templated. 70% must be unique per page (local landmarks, specific testimonials, unique stats, real photos). Generic city pages get filtered out as "doorway pages" by both classical search and AI engines. ## Page templates by type ### Service page (local business) ```

[Service] in [City] — [Business Name]

En résumé : [Business] offers [service] in [city + surrounding]. [Key differentiator — price, response time, certifications]. Open [hours]. Call [phone] or request a quote online.

What is [service]?

[Service] is a [category] that [differentiator]. In [city], demand is driven by [local factor — housing stock, climate, regulations].

How much does [service] cost in [city]?

[Specific price range] for a typical [job type], based on [n] projects completed in [year]. Factors affecting cost: [list].

Why choose [Business] for [service]?

FAQ

[QAPage or FAQPage schema + visible Q&A] ``` ### Blog post / guide ```

[Clear, question-style or noun-phrase headline]

By [Author Name] — Updated [Date]

[3-5 sentence summary. Include the key number, the key conclusion, and any nuance.]

[Question 1]

[One-sentence answer.] [Supporting detail with cited statistics.]

[Question 2]

...

Sources

``` ### Homepage / landing ```

[Entity] is a [category] that [differentiator].

[Elaboration on the H1. Include one concrete stat or proof point.]

[Primary CTA]

What [Entity] does

[Functional description, one paragraph.]

Who uses [Entity]

How it works

Frequently asked

``` ## Self-audit — is this page AI-friendly? - [ ] First sentence: `[Entity] is a [category] that [differentiator]` ? - [ ] TL;DR or summary block above the fold ? - [ ] Every H2/H3 phrased as a likely user question ? - [ ] First sentence under each heading: direct answer ? - [ ] At least 2-3 specific numerical claims with linked sources ? - [ ] Visible "Last updated" date + matching `dateModified` in JSON-LD ? - [ ] Lists or tables instead of dense prose where possible ? - [ ] Entity names used explicitly, not pronouns ? - [ ] If it's a city/service variant: ≥70% unique content ?