Executive Summary
In early 2026, Demand Genius published a study that quickly became one of the most talked-about pieces of research in the AEO space. They observed that AI-powered search engines appear to narrow their brand recommendations as users move deeper into the buying journey—from exploratory questions (TOFU) to purchase-ready queries (BOFU)—and they introduced useful conceptual language for this pattern, calling it the “Dark AI” influence layer. Read their study here: https://demand-genius.com/resource/dark-ai-and-what-actually-drives-aeo-influence/
Their framing was compelling, and the question it raised was one we could not stop thinking about: if this convergence is real, does it hold across the models buyers actually use, across categories beyond B2B tech, and with enough repetition to separate signal from noise? Demand Genius described their work as a preliminary investigation, and that was exactly the invitation we needed.
This study picks up where they left off. We tested the same core theory across 4 LLMs (ChatGPT, Claude, Gemini, and Perplexity), 8 diverse industry verticals, and 4,480 individual API calls, with fully documented prompts, model versions, and analysis methodology. Every data point is reproducible.
The headline: the convergence pattern is real and holds up across every model and vertical we tested. What multi-model testing adds is a new dimension to the picture—the strength, shape, and strategic implications of convergence vary substantially depending on which model is answering and which category is being asked about. That variation changes how brands should act on the findings.
What We Found
Convergence is real and model-dependent. All four models show greater brand consistency at BOFU than at TOFU. The strength varies by 6x across models—Claude and ChatGPT converge strongly (avg K1 delta +0.34), while Gemini barely converges at all (+0.09).
There is no universal brand canon. At BOFU, each model recommends a substantially different set of brands. Average cross-model agreement (Jaccard similarity) is just 0.15. A brand optimizing for ChatGPT visibility may be invisible to Claude or Perplexity.
The MOFU “trough” reveals a behavioral mode switch rather than gradual narrowing. Models write their longest, most detailed responses at MOFU while mentioning the fewest brands. At BOFU, they switch to shorter, list-heavy responses. This reframes the funnel as two distinct response modes rather than a linear progression.
Citations are architectural, not funnel-driven. Perplexity cites sources 73% of the time regardless of funnel stage. ChatGPT and Gemini cite 0%. Citation behavior is a property of model architecture rather than buyer intent.
Vertical matters enormously. SaaS and Construction show genuine cross-model brand agreement (Jaccard 0.30–0.35). Legal and Marketing show near-zero agreement (0.03). The “Dark AI” effect is heavily category-dependent.
Methodology
Study Design
To test how broadly the convergence pattern generalizes, we expanded the study design along five dimensions: model coverage, vertical diversity, run count, prompt transparency, and sample size. The goal was to produce a dataset large enough and transparent enough that anyone could replicate the analysis end-to-end.
| Variable | Original Study | Expanded Replication |
| --- | --- | --- |
| Models Tested | ChatGPT (version undisclosed) | ChatGPT (GPT-4o), Claude (Sonnet 4), Gemini (2.0 Flash), Perplexity (Sonar Pro) |
| Runs per Prompt | 3 | 10 |
| Verticals | 14 (all B2B tech) | 8 (Marketing, SaaS, Defense, Legal, Manufacturing, Construction, Real Estate, Higher Ed) |
| Prompts Disclosed | No | Yes (all 112 prompts documented) |
| Temperature | Not specified | 0.7 |
| Session Handling | Not specified | Stateless (fresh API call each run) |
| Total API Calls | ~126 (estimated) | 4,480 |
Table 1: Methodology comparison between the original study and this expanded replication.
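The call count in Table 1 follows directly from the design: 112 prompts, 4 models, and 10 stateless runs per prompt. The sketch below shows one way such a call plan could be enumerated. The function name `build_call_plan` and the dictionary fields are illustrative assumptions, not the study's actual harness, and no real API is called.

```python
from itertools import product

# Values taken from Table 1; the structure below is a hypothetical sketch.
MODELS = [
    "ChatGPT (GPT-4o)",
    "Claude (Sonnet 4)",
    "Gemini (2.0 Flash)",
    "Perplexity (Sonar Pro)",
]
N_PROMPTS = 112
RUNS_PER_PROMPT = 10
TEMPERATURE = 0.7


def build_call_plan(prompt_ids, models, runs):
    """Return one entry per (prompt, model, run).

    Each entry stands for an independent, stateless API call:
    no conversation history is carried between entries.
    """
    return [
        {"prompt_id": p, "model": m, "run": r, "temperature": TEMPERATURE}
        for p, m, r in product(prompt_ids, models, range(1, runs + 1))
    ]


plan = build_call_plan(range(N_PROMPTS), MODELS, RUNS_PER_PROMPT)
print(len(plan))  # 4480
```

Enumerating the plan up front makes it easy to check that every prompt was run the same number of times on every model before any analysis begins.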
Prompt Design
For each of the 8 verticals, we wrote prompts at three funnel stages designed to mirror realistic buyer behavior. TOFU prompts explore problems without naming solutions. MOFU prompts compare approaches and evaluate trade-offs. BOFU prompts directly ask for brand recommendations. Each vertical has 4–5 prompts per stage, with persona variation (e.g., VP of Sales, CEO, Operations Director) and specificity variation (from broad to narrow).
All 112 prompts are available in the published dataset. No prompts were derived from any client’s existing content.
Brand Extraction
Raw LLM responses were processed through a dedicated brand extraction pipeline using Claude’s API at temperature 0. The extraction prompt was specifically tuned to handle differences in markdown formatting across models, normalize brand name variants, and identify organizations across all 8 verticals (including defense contractors, law firms, universities, and construction firms). We verified citation presence through the extraction model and direct URL pattern matching.
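Two of the steps described above, normalizing brand name variants and checking for citations by URL pattern, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the alias table, the regular expression, and the function names are hypothetical, and the study's actual extraction ran through an LLM prompt rather than a lookup table. A bare-URL check would also miss citations rendered only as numbered footnote markers.

```python
import re

# Assumed pattern: any http(s) URL counts as a citation signal.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

# Hypothetical alias table mapping lowercase variants to a canonical name.
ALIASES = {
    "salesforce.com": "Salesforce",
    "sfdc": "Salesforce",
    "hubspot crm": "HubSpot",
}


def has_citation(response_text: str) -> bool:
    """True if the response contains at least one URL."""
    return URL_PATTERN.search(response_text) is not None


def normalize_brand(raw: str) -> str:
    """Strip markdown emphasis and map known variants to one canonical name."""
    cleaned = re.sub(r"[*_`]", "", raw).strip()
    return ALIASES.get(cleaned.lower(), cleaned)


print(normalize_brand("**Salesforce.com**"))  # Salesforce
print(has_citation("See https://example.com/report for details"))  # True
```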
Findings
Finding 1: Convergence Direction Confirmed Across All Verticals
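The directional claim at the heart of the Demand Genius study holds: averaged across models, K1 delta from TOFU to BOFU is positive in every vertical, and averaged across verticals, it is positive for every model. Brands do become more consistently mentioned when users ask purchase-ready questions. The one exception at the level of individual model-vertical pairs is Gemini in Defense, which shows a negative delta (-0.12; see Finding 6). This is a meaningful replication: the direction of the pattern is reproducible across model choice and category.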

Figure 1: K1 (Canon Concentration) increases from TOFU to BOFU across all verticals.
The strongest convergence appears in SaaS (+0.356 avg delta), Higher Ed (+0.293), and Manufacturing (+0.254). The weakest appears in Defense (+0.172) and Real Estate (+0.196). One plausible explanation is brand density in training data: SaaS and Higher Ed have extremely well-documented brand landscapes, while Defense and Real Estate are more fragmented and regional.
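To make the K1 delta concrete, the sketch below computes it on toy data. It rests on an assumption that should be read loudly: this report does not spell out the K1 formula, so the code treats K1 as the share of responses in which the single most frequently mentioned brand appears, and the delta as BOFU K1 minus TOFU K1. The brand labels are placeholders, not study data.

```python
from collections import Counter


def k1(runs):
    """Share of responses that mention the most frequently mentioned brand.

    `runs` is a list of brand collections, one per response.
    ASSUMPTION: this definition of K1 is illustrative; the study's exact
    formula is not given in the text.
    """
    if not runs:
        return 0.0
    # Count each brand at most once per response.
    counts = Counter(brand for brands in runs for brand in set(brands))
    if not counts:
        return 0.0
    top_count = counts.most_common(1)[0][1]
    return top_count / len(runs)


# Placeholder brands "A".."D"; four responses per stage.
tofu_runs = [{"A"}, {"B"}, {"C"}, {"A", "D"}]
bofu_runs = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]

delta = k1(bofu_runs) - k1(tofu_runs)
print(k1(tofu_runs), k1(bofu_runs), delta)  # 0.5 0.75 0.25
```

Under this definition, a positive delta means the leading brand shows up in a larger share of purchase-ready responses than exploratory ones.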
Finding 2: Convergence Strength Varies Dramatically by Model
This is the most significant finding for AEO practitioners, and it’s the one that only becomes visible once you look across multiple models: convergence strength varies by roughly 6x across the four LLMs we tested.

Figure 2: BOFU K1 by model across all verticals. Claude and ChatGPT converge strongly; Gemini barely converges.
| Model | Avg BOFU K1 | Avg Delta | Behavior |
| --- | --- | --- | --- |
| Claude | 0.562 | +0.348 | Strongest internal opinions; highest consistency |
| ChatGPT | 0.525 | +0.340 | Strong convergence; closest to original study findings |
| Perplexity | 0.422 | +0.202 | Retrieval creates run-to-run variance |
| Gemini | 0.153 | +0.091 | Very short responses; minimal convergence |
Table 2: Model-level convergence comparison.
Each model converges for a different underlying reason. Claude’s consistency comes from training weights: it forms strong internal opinions and repeats them across runs. Perplexity’s variability comes from live retrieval: it pulls different sources each time, creating natural variance. Gemini’s weak convergence correlates with its dramatically shorter responses (avg 606 chars at BOFU vs. 2,200+ for other models), which mechanically limit the number of brands it can name.
Implication for AEO strategy: optimizing for “AI visibility” is not a single problem. It is, at minimum, four separate problems, each with different dynamics.
Finding 3: Each Model Recommends Different Brands
Perhaps the most practically important finding is that there is no universal brand canon across models. The average Jaccard similarity between any two models’ top-3 brand lists at BOFU is just 0.15.

Figure 3: Cross-model brand agreement at BOFU. SaaS and Construction show partial consensus; Legal and Marketing show almost none.
Only two verticals show meaningful cross-model agreement: SaaS (Jaccard 0.35, driven by Salesforce and HubSpot appearing in 3–4 models) and Construction (Jaccard 0.30, driven by Procore appearing in all 4 models). In Legal, each model recommends entirely different firms—ChatGPT surfaces Hogan Lovells, Claude surfaces Chambers USA (a directory, not a firm), Gemini surfaces Littler Mendelson, and Perplexity surfaces Latham & Watkins.
Brands that appear in 3+ models’ BOFU top-3: Procore (construction), Salesforce (SaaS), HubSpot (SaaS), Northrop Grumman (defense), Rockwell Automation (manufacturing), Siemens (manufacturing). These represent genuine cross-model authority.
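The Jaccard similarity used in this finding is the size of the intersection of two brand sets divided by the size of their union. The sketch below computes it for each pair of models and averages the results. The model keys and brand labels are placeholders chosen for illustration, not the study's data.

```python
from itertools import combinations


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(top3_by_model):
    """Average Jaccard similarity over every unordered pair of models."""
    pairs = list(combinations(top3_by_model.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder top-3 lists for four hypothetical models.
toy_top3 = {
    "model_1": ["A", "B", "C"],
    "model_2": ["A", "B", "D"],
    "model_3": ["E", "F", "G"],
    "model_4": ["H", "I", "J"],
}

print(jaccard(toy_top3["model_1"], toy_top3["model_2"]))  # 0.5
print(mean_pairwise_jaccard(toy_top3))
```

For two three-brand lists, the score can only take the values 0 (no overlap), 0.2 (one shared brand), 0.5 (two shared), or 1.0 (identical). An average of 0.15 therefore means a typical pair of models shares fewer than one brand in their top three.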
Finding 4: The MOFU Behavioral Mode Switch
A smooth narrowing from TOFU through MOFU to BOFU is an intuitive mental model—but our data shows something more interesting: a behavioral mode switch.

Figure 4: Brand mention efficiency by stage. MOFU produces the longest responses with the lowest brand density.

Figure 5: BOFU responses are shorter but dramatically more brand-dense.
Across all verticals, MOFU responses average 2,489 characters—the longest of any stage—while mentioning just 0.60 brands per 1,000 characters. BOFU responses average only 1,808 characters but pack in 5.01 brands per 1,000 characters. The models are not gradually narrowing a consideration set. They are switching from advisory mode (MOFU: “here’s how to evaluate options”) to list-generation mode (BOFU: “here are the top options”).
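The density figures above can be turned into rough per-response brand counts. The sketch below shows the density calculation and then multiplies the reported averages together. The function names are illustrative, and the product of two averages only approximates the true mean brand count per response, so the results should be read as ballpark figures.

```python
def brands_per_1000_chars(brand_count, char_count):
    """Brand mentions normalized to a 1,000-character response."""
    return 1000 * brand_count / char_count


def implied_brands_per_response(density_per_1000, avg_chars):
    """Approximate brand count implied by an average density and length."""
    return density_per_1000 * avg_chars / 1000


# Averages reported in this finding.
mofu = implied_brands_per_response(0.60, 2489)
bofu = implied_brands_per_response(5.01, 1808)

print(round(mofu, 2))  # 1.49
print(round(bofu, 2))  # 9.06
```

On these numbers, a typical MOFU response names between one and two brands, while a typical BOFU response names about nine despite being shorter. That gap is what the mode-switch interpretation rests on.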
This distinction matters for AEO strategy. The funnel metaphor implies that brands should optimize content at each stage to guide buyers through a steadily narrowing consideration set. The behavioral reality suggests brands need two different strategies: one for appearing in advisory content (conceptual authority) and one for appearing in recommendation lists (brand recall).
Finding 5: Citations Are Architectural, Not Behavioral
The original study suggested that citations and source links emerge primarily at BOFU, hinting that LLMs provide more verifiable information at purchase-ready stages. Viewing the same question across four models significantly reframes that observation.

Figure 6: Citation rate by model. Perplexity cites 73% of the time, regardless of stage; others almost never cite.
Perplexity cites sources in 73% of responses across TOFU, MOFU, and BOFU because retrieval is an architectural feature, not a behavioral choice. ChatGPT cites in 0.0% of responses, and Claude in 1.0%. We conclude that citation behavior is a property of model architecture rather than a signal about where a buyer is in their journey.
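Checking whether citation behavior tracks the funnel stage or the model comes down to grouping responses by both and comparing rates. The sketch below shows that grouping on placeholder records. The field names (`model`, `stage`, `has_citation`) and the sample rows are assumptions for illustration, not the published dataset's schema.

```python
from collections import defaultdict


def citation_rates(records):
    """Return {(model, stage): share of responses with a citation}."""
    totals = defaultdict(int)
    cited = defaultdict(int)
    for rec in records:
        key = (rec["model"], rec["stage"])
        totals[key] += 1
        cited[key] += int(rec["has_citation"])
    return {key: cited[key] / totals[key] for key in totals}


# Placeholder records, not study data.
toy_records = [
    {"model": "model_x", "stage": "TOFU", "has_citation": True},
    {"model": "model_x", "stage": "TOFU", "has_citation": False},
    {"model": "model_x", "stage": "BOFU", "has_citation": True},
    {"model": "model_y", "stage": "BOFU", "has_citation": False},
]

rates = citation_rates(toy_records)
print(rates[("model_x", "TOFU")])  # 0.5
```

If rates are flat across stages within a model but differ sharply between models, the pattern points to architecture rather than buyer intent.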
Finding 6: Convergence Delta Is Driven by Model, Not Vertical

Figure 7: Convergence delta heatmap. ChatGPT and Claude show strong deltas across most verticals; Gemini is near-zero everywhere.
The heatmap reveals that model choice explains more variance in convergence than vertical choice. ChatGPT and Claude show consistently strong deltas across nearly all verticals. Gemini shows near-zero deltas across the board, including a negative delta in Defense (-0.12), indicating it was more consistent at TOFU than at BOFU. This suggests that the convergence phenomenon is largely a function of how specific models handle recommendation-style prompts, rather than a universal property of how LLMs process buyer intent.
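A rough check of this claim is possible using only the averages reported earlier: compare the spread of average deltas across models (Table 2) with the spread across verticals (the strongest and weakest from Finding 1). This is a back-of-the-envelope proxy, not a variance decomposition. A fuller analysis would work from the complete model-by-vertical grid in the dataset.

```python
# Average deltas by model, from Table 2.
model_delta = {
    "Claude": 0.348,
    "ChatGPT": 0.340,
    "Perplexity": 0.202,
    "Gemini": 0.091,
}

# Strongest and weakest vertical averages, from Finding 1.
vertical_delta_extremes = {"SaaS": 0.356, "Defense": 0.172}


def spread(values):
    """Range (max minus min) of a collection of numbers."""
    values = list(values)
    return max(values) - min(values)


model_spread = spread(model_delta.values())
vertical_spread = spread(vertical_delta_extremes.values())

print(round(model_spread, 3))     # 0.257
print(round(vertical_spread, 3))  # 0.184
```

The spread between models is larger than the spread between verticals, which is consistent with the heatmap reading that model choice matters more.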
Implications for AEO Strategy
What This Means for Brands
1. Multi-model monitoring is non-negotiable. A brand that ranks #1 in Claude’s recommendations may not appear at all in ChatGPT’s or Gemini’s. Any AEO monitoring strategy that tracks only one model is seeing at most 25% of the picture.
2. Vertical context determines strategy. In high-agreement verticals (SaaS, Construction), there is a stable cross-model canon that brands can target. In low-agreement verticals (Legal, Marketing), the opportunity is wide open, but the strategy must be model-specific.
3. The funnel metaphor needs an update. The behavioral mode switch at BOFU means brands should think of advisory presence (MOFU) and list inclusion (BOFU) as two separate optimization challenges, not as a linear progression.
4. Perplexity is structurally different. Because it pulls live sources, Perplexity is the only model in which content recency and source authority directly influence brand visibility. Traditional SEO signals (backlinks, domain authority, fresh content) have a direct bearing on Perplexity results, unlike ChatGPT or Claude.
5. Gemini’s short responses create a different competitive landscape. With responses averaging 600–800 characters at BOFU (vs. 2,000+ for others), Gemini surfaces fewer brands total. This means the “slots” available for brand mentions are scarcer, making inclusion more competitive but also more binary. You’re either in, or you’re not.
Extending the Framework
The Demand Genius study gave the field valuable conceptual vocabulary, such as canon concentration and the “dark AI” influence layer, and identified a directional pattern that our expanded testing confirms. What multi-model, multi-vertical testing adds is dimensionality: the convergence effect operates differently across models, produces different brand canons in different models, and interacts with vertical context in ways that simply aren’t visible from a single-model view.
Our goal with this work is to build on that foundation with the model-by-model and vertical-by-vertical nuance that operators need to act on it. A few specific extensions worth highlighting:
- Convergence magnitude is model-specific. A brand strategy calibrated to ChatGPT’s convergence behavior will over- or under-estimate visibility on Gemini and Perplexity.
- Citation behavior is architectural. Treating citations as a funnel-stage signal works inside a single model but doesn’t generalize; across models, it’s a retrieval-vs-generation design choice.
- Vertical density shapes the opportunity. Dense, well-documented categories (SaaS, Higher Ed) behave differently from fragmented ones (Defense, Real Estate), and marketers should calibrate AEO strategies accordingly.
Taken together, these extensions preserve the core insight that AI-mediated recommendations narrow in identifiable, measurable ways while giving practitioners a more actionable map of how that narrowing actually works in the wild.
Methodology Notes & Limitations
Gemini response length. Gemini 2.0 Flash consistently returned responses that were 2–4x shorter than those of other models. This may reflect the model’s design (Flash is optimized for speed, not depth) or a difference in how max_tokens is handled. Future replications should also test Gemini Pro.
Perplexity self-referential behavior. For Marketing BOFU prompts, Perplexity’s top-recommended “brand” was ChatGPT (K1 = 0.525), followed by Perplexity itself. When asked about marketing tools, it tended to recommend AI tools, including competing LLMs. This warrants further investigation.
Brand extraction limitations. All brand extraction was performed via Claude’s API at temperature 0. While we validated extraction quality across model output formats, some entity types (government agencies, industry standards, generic product categories) may be inconsistently classified as “brands.”
Temporal snapshot. All data was collected in April 2026. LLM behavior changes over time as models are updated. These findings represent a point-in-time snapshot.