A Comparative Evaluation of LLM Responses from Gemini, OpenAI, and Perplexity - Search Atlas - Advanced SEO Software

Manick Bhan

Founder CEO/CTO

You may also read a concise version of this research in our blog: Comparative Analysis of LLM Citation Behavior: SEO Strategy Implications

Introduction

This study compares how large language models (LLMs) reference external sources when responding to identical queries. By examining their domain citation behavior, we assess whether differences in web search capability and model architecture influence how information is retrieved and attributed.

The dataset comprises 5,504,399 LLM responses across 748,425 unique user queries, collected over roughly one month, from August 25 to September 25, 2025. Among the models studied, Perplexity Sonar operates with web search enabled, while Gemini-2.0-Flash-Lite and OpenAI’s GPT-4o-mini generate responses without live retrieval. This configuration provides a controlled framework to evaluate citation breadth, overlap, and agreement across systems with distinct access to external data sources.

Summary of Dataset

  • Total responses: 5,504,399 (Gemini, OpenAI, and Perplexity combined)
  • Unique prompt queries: 748,425
  • Data collection period: August 25 – September 25, 2025
  • Models analyzed:
    • Perplexity Sonar – Web search enabled
    • Gemini-2.0-Flash-Lite – Web search disabled
    • OpenAI GPT-4o-mini – Web search disabled

Methodology

Data Source and Collection

The analysis draws on a dataset of roughly 5.5 million LLM-generated responses spanning 748,425 queries, collected between August 25 and September 25, 2025. The dataset includes outputs from Gemini, OpenAI, and Perplexity Sonar, covering models both with and without active web retrieval.

Data Normalization and Filtering

All citations were extracted from model outputs, and cited domains were standardized to a normalized domain.tld format to ensure cross-model consistency. For fair comparison, only queries where all three LLMs produced citations were retained for analysis.
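The normalization and filtering step can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function names, the model keys, and the toy URLs are hypothetical, and the "last two host labels" rule is a simplifying assumption that mishandles multi-part suffixes such as example.co.uk (a real pipeline would more likely consult the Public Suffix List).

```python
from urllib.parse import urlparse

# Hypothetical model keys; the study compares these three providers.
REQUIRED_MODELS = ("perplexity", "gemini", "openai")


def normalize_domain(url: str) -> str:
    """Reduce a cited URL to a naive domain.tld form.

    Assumption: the last two host labels form the registrable domain.
    This is wrong for suffixes such as .co.uk.
    """
    # urlparse only populates the host when a scheme or leading // is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = (parsed.hostname or "").lower().rstrip(".")
    labels = [label for label in host.split(".") if label]
    return ".".join(labels[-2:])


def keep_fully_cited(citations_by_query: dict) -> dict:
    """Keep only queries where every required model cited at least one domain."""
    kept = {}
    for query, urls_by_model in citations_by_query.items():
        domains = {
            model: {normalize_domain(u) for u in urls} - {""}
            for model, urls in urls_by_model.items()
        }
        if all(domains.get(model) for model in REQUIRED_MODELS):
            kept[query] = domains
    return kept


# Toy input (made up): query -> model -> list of cited URLs.
raw = {
    "q1": {
        "perplexity": ["https://www.chess.com/puzzles", "https://blog.lichess.org/x"],
        "gemini": ["chess.com/learn"],
        "openai": ["http://Chess.com"],
    },
    "q2": {
        "perplexity": ["https://schema.org/docs"],
        "gemini": [],  # no citation, so q2 is dropped
        "openai": ["https://schema.org"],
    },
}

filtered = keep_fully_cited(raw)
print(filtered)
```

In this toy input only q1 survives the filter, because Gemini produced no citation for q2.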

Analytical Framework

Citation behavior was evaluated using three primary metrics:

  1. Domain citation count – Number of unique domains cited per query.
  2. Jaccard similarity – Ratio of shared to total unique domains between model pairs.
  3. Agreement rate – Percentage of queries where at least one domain overlapped across models.
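As a rough illustration of metrics 1 and 3, the sketch below computes per-query domain counts and pairwise agreement rates on a made-up dataset. The data layout (query, then model, then a set of normalized domains) and the function names are assumptions for this example, not the study's code.

```python
from itertools import combinations

# Toy data (made up): query -> model -> set of normalized cited domains.
cited = {
    "q1": {
        "perplexity": {"a.com", "b.com", "c.com"},
        "gemini": {"a.com"},
        "openai": {"a.com", "d.com"},
    },
    "q2": {
        "perplexity": {"e.com", "f.com"},
        "gemini": {"g.com"},
        "openai": {"g.com", "h.com"},
    },
}


def domain_counts(cited: dict) -> dict:
    """Metric 1: number of unique domains each model cited per query."""
    return {q: {m: len(d) for m, d in models.items()} for q, models in cited.items()}


def agreement_rate(cited: dict, model_a: str, model_b: str) -> float:
    """Metric 3: share of queries where the pair shares at least one domain."""
    hits = sum(1 for models in cited.values() if models[model_a] & models[model_b])
    return hits / len(cited)


print(domain_counts(cited)["q1"])
for a, b in combinations(["perplexity", "gemini", "openai"], 2):
    print(a, b, agreement_rate(cited, a, b))
```

Here Gemini and OpenAI share a domain on both toy queries, while each pairing with Perplexity overlaps on only one of the two.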

Extended Analyses

Complementary evaluations examined response length, citation density, and URL freshness to assess whether verbosity or publication recency influenced retrieval diversity and citation breadth.

Domain Citation Behavior Across LLM Models

Total Domains Cited by Each LLM

For queries where all three LLMs cited domains, the plot below shows the total number of domains cited by each model.

[Figure: Total Domains Cited by Each LLM]

Distribution of Domain Citations per Query

[Figure: Distribution of Domain Citations per Query]

Average Domains Cited per Query

For samples where all three LLMs cited domains for the same queries, this chart shows the average domains cited per query.

[Figure: Domains cited per query (mean)]

The median domains cited per query shows the typical number of sources each model references for a given question, providing a clearer picture of their usual citation behavior without the influence of outliers.

[Figure: Domains cited per query (median)]
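The difference between the mean and the median can be seen in a small made-up example, where a single citation-heavy response pulls the mean well above the typical value. The numbers below are illustrative only and are not taken from the dataset.

```python
from statistics import mean, median

# Made-up per-query domain counts for one model; the last query is an outlier.
domains_per_query = [2, 3, 3, 4, 3, 2, 40]

avg = mean(domains_per_query)    # pulled upward by the outlier
mid = median(domains_per_query)  # stays at the typical value

print(f"mean = {avg:.2f}, median = {mid}")
```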

AI Cited Domain Agreement Across LLM Models

For samples where all three LLMs cited domains for the same queries, this chart shows the average similarity between each pair’s cited domains, measured using Jaccard similarity (calculated as the size of the intersection divided by the size of the union of their cited domains).

Note on Web Search: Web search was disabled for Gemini and OpenAI. It remained enabled for Perplexity, whose web search is always on and cannot be deactivated.

This may explain why Perplexity returns more domains per query.

Formula: Jaccard Similarity

To measure the similarity between two sets of LLM responses for the same query, we used Jaccard similarity.


The formula is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B are the sets of domains cited by two models for the same query.

Example:

Consider two sets of cited domains:

  • Gemini = {A, B, C, D, E}
  • OpenAI = {A, B, C, D, E, F, G}

Domains cited by both Gemini and OpenAI (intersection): {A, B, C, D, E}, which is 5 domains

All unique domains across both (union): {A, B, C, D, E, F, G}, which is 7 domains

Jaccard similarity = intersection (5) / union (7) ≈ 0.714 = 71.4%
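The worked example can be reproduced with a few lines of Python. This is a minimal sketch; the function name is ours, and treating two empty citation sets as a similarity of 0 is an assumption that the study does not specify.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty citation sets count as no overlap
    return len(a & b) / len(union)


gemini = {"A", "B", "C", "D", "E"}
openai = {"A", "B", "C", "D", "E", "F", "G"}

score = jaccard(gemini, openai)
print(f"{score:.3f}")  # 0.714
```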

Average Domain Overlap (Agreement Rate) Between LLMs

[Figure: Percentage of queries where each model pair agreed on at least one citation]

Across LLM pairs, Gemini and OpenAI show the strongest citation alignment, sharing ≈ 42% of cited domains on average.

However, while overall overlap is modest, most queries (≈ 60–65%) still contain at least one shared domain, indicating partial convergence in source selection even when full citation sets differ.

Distribution of Domain Overlap Scores

[Figure: Distribution of domain overlap scores for each model pair]

Gemini and OpenAI show the highest and most stable domain overlap, while overlaps involving Perplexity are lower and more dispersed, likely because Perplexity’s active web search retrieves a wider and more diverse set of sources.

Venn Diagram: Domain Citations Overlap Between LLMs

For samples where all three LLMs (Perplexity, OpenAI, and Gemini) cited domains for the same set of queries, the chart below shows the domain citation overlap between all three models.

[Figure: Venn diagram of domain citation overlap among Perplexity, OpenAI, and Gemini]
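The seven regions of a three-set Venn diagram can be derived with plain set operations. The sketch below assumes each model's domains are pooled into a single set across the retained queries, which the study does not state explicitly; the domain labels are made up.

```python
def venn_regions(p: set, o: set, g: set) -> dict:
    """Count domains in each of the seven regions of a three-set Venn diagram."""
    return {
        "perplexity_only": len(p - o - g),
        "openai_only": len(o - p - g),
        "gemini_only": len(g - p - o),
        "perplexity_openai": len((p & o) - g),
        "perplexity_gemini": len((p & g) - o),
        "openai_gemini": len((o & g) - p),
        "all_three": len(p & o & g),
    }


# Made-up pooled domain sets, one per model.
perplexity = {"a", "b", "c", "d", "e"}
openai = {"a", "b", "f"}
gemini = {"a", "c", "f", "g"}

regions = venn_regions(perplexity, openai, gemini)
print(regions)
```

The region counts always sum to the number of distinct domains across all three models, which gives a quick sanity check on the diagram.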

LLM Output Citation Count & Length Comparison

For samples where all three LLMs cited domains for the same queries, we analyzed the number of citations generated by each model and the corresponding output length. This comparison helps assess whether longer responses tend to include more citations and whether citation density varies significantly across LLMs.

Citation Count by Platform & Model

Key Insights:

OpenAI GPT‑4o‑mini‑2024‑07‑18 cites far less frequently than the other models, with no extreme outliers in citation count.

For consistent source attribution, Perplexity (Sonar) stands out as the clear choice, providing citations in nearly every response.

OpenAI GPT‑5‑nano‑2025‑08‑07 and Gemini‑2.0‑Flash‑lite behave similarly: they rarely cite, and when they do, it’s often in the form of rare, citation-heavy outliers.

Example of Queries with Low Domain Citations

[Figure: Example of Queries with Low Domain Citations]

Insights from Single-Citation Examples

  1. Local or Brand-Specific Queries
    Several prompts directly reference a single company or service provider, such as “Action Pest Control bed bug removal Mid-South,” “Rick Lucas Plumbing maintenance service,” and “Toyota of Boerne best-selling new vehicles.” In these cases, the cited domain represents the official business website, which fully satisfies the information intent.
  2. Product and Service Reviews
    Queries such as “Appliances Connection reviews on brand selection” or “entrepreneur tool reviews and comparisons” often point to domains that either host reviews (reviewed.com) or represent platforms that publish comparison-oriented marketing content (monday.com). In both cases, one authoritative domain provides sufficient coverage for the topic.
  3. General Informational or How-To Queries
    Some queries (e.g., “How to improve with Chess.com puzzles,” “how AI changes SEO for marketers”) rely on instructional or standardized knowledge that can be supported by one high-trust source (e.g., chess.com, schema.org).
  4. Web Search Availability and Single-Citation Behavior Across Models
    Although Perplexity operates with web search enabled while Gemini and OpenAI do not, all three models display single-citation behavior for certain query types. This suggests that enabling web search does not necessarily expand citation breadth in every case, particularly for narrow, brand-specific, or self-contained prompts where one domain sufficiently answers the query.

LLM Response Length (Character Count) by Platform & Model

[Figure: LLM Response Length (Character Count) by Platform & Model]

Key Insights:

  • For concise answers, Perplexity (Sonar) performs best.
  • For verbose or detailed outputs, Gemini is the clear leader (though sometimes excessively long).
  • OpenAI models strike a middle ground between verbosity and brevity, with GPT-4o-mini leaning more towards brevity, while GPT-5-nano tends to be slightly more verbose.

LLM Response Length vs Citation Relationship

  1. Response Length Variability Across Models:
    Gemini (Gemini 2.0 Flash-Lite) shows the widest range in output length, occasionally exceeding 60,000 characters, indicating a tendency toward longer and more verbose responses. In contrast, Perplexity (Sonar) consistently produces shorter outputs with minimal variation.
  2. Citation Density Differences:
    Despite shorter responses, Perplexity tends to include relatively more citations per output than Gemini or GPT-4o-mini, suggesting a stronger focus on referencing. OpenAI’s GPT-5-nano occasionally produces very high citation counts, though these are outliers.
  3. Length vs. Citation Count Relationship:
    While longer outputs (e.g., Gemini) generally provide more opportunity for citations, the data does not show a direct correlation between response length and citation count. Some models (like Perplexity) achieve higher citation density even within shorter responses, implying that citation behavior is model-specific rather than purely a function of length.
  4. Model Output Characteristics:
    • Gemini: Longest, most variable responses; fewer citations per length.
    • GPT-5-nano: Moderate length with occasional bursts of high citation counts.
    • GPT-4o-mini: Shorter, concise responses with balanced citation levels.
    • Perplexity Sonar: Shortest responses but most citation-dense overall.
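To make the distinction between citation count and citation density concrete, the sketch below computes citations per 1,000 characters and a Pearson correlation between response length and citation count. The per-1,000-character scaling, the function names, and every number in the toy data are assumptions for illustration; none of them come from the study's dataset.

```python
from math import sqrt


def citation_density(citations: int, chars: int) -> float:
    """Citations per 1,000 characters (the scaling is an assumption)."""
    return 1000 * citations / chars if chars else 0.0


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


# Made-up toy responses: (model, character count, citation count).
responses = [
    ("perplexity", 1200, 6),
    ("perplexity", 1500, 7),
    ("gemini", 9000, 2),
    ("gemini", 20000, 3),
    ("openai", 3000, 2),
    ("openai", 4000, 1),
]

lengths = [chars for _, chars, _ in responses]
counts = [cites for _, _, cites in responses]
r = pearson(lengths, counts)
print(f"length vs citation count: r = {r:.2f}")

for model in ("perplexity", "gemini", "openai"):
    densities = [citation_density(c, n) for m, n, c in responses if m == model]
    print(model, round(sum(densities) / len(densities), 2))
```

A short response with several citations scores a high density even when its raw count is unremarkable, which is the pattern described for Perplexity above.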

Conclusion

This analysis reveals clear differences in how major LLMs cite external sources, shaped by their architecture and access to web search.

  • Perplexity Sonar consistently delivers the highest citation density and broadest domain diversity. Its always-on web search allows it to cite more recent URLs and multiple domains per query, making it the most transparent and retrieval-aligned model.
  • Gemini 2.0 Flash-Lite produces the longest responses, often exceeding 60k characters, but cites fewer sources. While it shares some domain overlap with OpenAI, it favors verbosity over breadth of citations.
  • OpenAI GPT models (GPT-4o-mini, GPT-5-nano) offer a balanced approach: moderate response lengths, consistent citation behavior, and strong overlap with Gemini on trusted domains. GPT-5-nano occasionally shows bursts of high citation count.

Key Observations:

  • Citation overlap across models is limited but meaningful, with about 60–65% of shared queries including at least one common domain.
  • Models without web search (e.g., Gemini, OpenAI) still cite authoritative sources effectively, especially for niche or brand-specific queries.
  • There is no direct correlation between response length and number of citations, indicating that verbosity does not necessarily translate to citation richness.

Implications

Models with web retrieval capabilities (like Perplexity) are more suitable for tasks requiring up-to-date, citation-rich, and diverse content. In contrast, non-retrieval models (such as Gemini and OpenAI) may perform adequately in structured or narrowly scoped tasks. 

Ultimately, model selection should align with the specific trade-offs between citation fidelity, verbosity, and content freshness.
