Key Findings
Finding 1: 34% of Identical Queries Produce Materially Different Answers
When the same query is submitted multiple times without context, ChatGPT provides materially different answers in 34% of cases.
Material Difference Breakdown:
| Type of Difference | Frequency (n = 10,000 queries) | % of All Queries |
|---|---|---|
| Brand mentions added/removed | 2,140 | 21.4% |
| Different recommendations | 1,720 | 17.2% |
| Sentiment shift | 890 | 8.9% |
| >30% length variation | 2,670 | 26.7% |
| Different factual claims | 1,230 | 12.3% |
| Any material difference | 3,400 | 34.0% |
Example: Query "What are the best email marketing tools?" submitted five times produced:
Response 1 (247 words): Mentioned Mailchimp, Constant Contact, Sendinblue, ConvertKit, and AWeber. Recommended Mailchimp for beginners, ConvertKit for creators.
Response 2 (312 words): Mentioned Mailchimp, HubSpot, ActiveCampaign, GetResponse, and Campaign Monitor. Recommended HubSpot for enterprise, ActiveCampaign for automation.
Response 3 (189 words): Mentioned Mailchimp, Constant Contact, and AWeber only. No explicit recommendations.
Response 4 (298 words): Mentioned Mailchimp, Sendinblue, ConvertKit, ActiveCampaign, and Brevo. Recommended different tools for different use cases.
Response 5 (261 words): Mentioned Mailchimp, HubSpot, and ConvertKit only. Recommended Mailchimp as "industry standard."
Key Insight: Only one brand (Mailchimp) appeared across all five responses. Other brands appeared inconsistently, demonstrating the challenge of assessing true AI visibility from single queries.
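The brand-overlap pattern above can be checked with a short script. The brand sets are transcribed from the five example responses; matching on exact brand names (rather than extracting mentions from real response text) is a simplifying assumption:

```python
# Brand sets transcribed from the five example responses above.
responses = [
    {"Mailchimp", "Constant Contact", "Sendinblue", "ConvertKit", "AWeber"},
    {"Mailchimp", "HubSpot", "ActiveCampaign", "GetResponse", "Campaign Monitor"},
    {"Mailchimp", "Constant Contact", "AWeber"},
    {"Mailchimp", "Sendinblue", "ConvertKit", "ActiveCampaign", "Brevo"},
    {"Mailchimp", "HubSpot", "ConvertKit"},
]

n = len(responses)
all_brands = set().union(*responses)

# Per-brand citation consistency: share of responses that mention the brand.
consistency = {b: sum(b in r for r in responses) / n for b in all_brands}

# Brands present in every one of the five responses.
always_present = sorted(b for b, c in consistency.items() if c == 1.0)
print(always_present)              # ['Mailchimp']
print(consistency["ConvertKit"])   # 0.6
```

The same per-brand consistency calculation scales to any number of repeated queries; only Mailchimp reaches 100% here.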
Finding 2: Brand Citation Consistency Averages 72% Across Queries
On average, a brand that appears in response to a given query appears in only 72% of repeated submissions of that same query, indicating significant inconsistency in brand presence.
Brand Citation Consistency by Industry:
| Industry | Average Citation Consistency | Range |
|---|---|---|
| Technology | 81% | 67-94% |
| Healthcare | 78% | 61-89% |
| Financial Services | 76% | 58-88% |
| Automotive | 74% | 52-86% |
| E-commerce | 72% | 48-89% |
| Travel | 71% | 49-87% |
| Food & Beverage | 69% | 41-84% |
| B2B Services | 68% | 44-85% |
| Real Estate | 66% | 38-82% |
| Education | 63% | 35-81% |
Implication: A brand appearing in one query has only a 72% chance of appearing in the next identical query. This inconsistency creates significant challenges for accurate AI visibility measurement.
Brand Tier Variability:
Citation consistency correlates strongly with brand authority:
- Top 3 brands (by market share): 87% average citation consistency
- Brands 4-10: 74% average citation consistency
- Brands 11-20: 61% average citation consistency
- Brands 21+: 43% average citation consistency
Key Insight: Stronger brands show more consistent AI presence, suggesting that answer variability affects challenger brands more than established leaders. For brands seeking to improve AI visibility, consistency should be a key metric alongside overall presence.
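One way to see why consistency matters for measurement: if a brand's per-query citation probability is p, and repeated queries are treated as independent (a simplifying assumption), the chance of observing the brand at least once in k queries is 1 - (1 - p)^k. A minimal sketch using the tier averages above:

```python
# Tier averages reported above; independence across repeated queries
# is a simplifying assumption.
tiers = {"top 3": 0.87, "ranks 4-10": 0.74, "ranks 11-20": 0.61, "long tail": 0.43}

def seen_at_least_once(p: float, k: int) -> float:
    """Probability of at least one citation across k independent identical queries."""
    return 1 - (1 - p) ** k

for tier, p in tiers.items():
    print(tier, round(seen_at_least_once(p, 5), 3))
```

Even long-tail brands appear at least once across five repeated queries more than 90% of the time, which is why single-query monitoring both under-counts challenger brands and over-rewards lucky single observations.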
Finding 3: Answer Length Varies by Average of 38% Between Responses
The length and comprehensiveness of AI responses vary significantly between identical queries, impacting both brand visibility and user experience.
Length Variability by Query Type:
| Query Type | Mean Word Count | Std Deviation | Coefficient of Variation |
|---|---|---|---|
| "What are the best..." | 287 | 67 | 23% |
| "Compare X and Y" | 324 | 89 | 27% |
| "How do I choose..." | 298 | 102 | 34% |
| "Recommend a..." | 198 | 54 | 27% |
| "Which is better..." | 267 | 78 | 29% |
| Overall Average | 267 | 78 | 29% |
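The coefficient of variation column is simply the standard deviation divided by the mean. A minimal sketch, reusing the word counts of the five example responses from Finding 1 as sample data:

```python
import statistics

# Word counts of the five example responses from Finding 1.
word_counts = [247, 312, 189, 298, 261]

mean = statistics.mean(word_counts)
std = statistics.stdev(word_counts)  # sample standard deviation
cv = std / mean                      # coefficient of variation

print(round(mean, 1), round(std, 1), round(cv * 100, 1))  # 261.4 48.3 18.5
```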
Brand Citation Impact:
Longer responses correlate with more brand mentions:
- Responses <200 words: Average 2.1 brand mentions
- Responses 200-300 words: Average 3.4 brand mentions
- Responses 300-400 words: Average 4.7 brand mentions
- Responses >400 words: Average 6.2 brand mentions
Implication: Since response length varies significantly, brand visibility depends partially on random factors affecting response length. A brand appearing in a 400-word response may be absent from a 200-word response to the same query, not due to any content or optimization difference.
Finding 4: Temperature Sampling and Context Effects Cause Most Variability
Through controlled testing with different temperature settings and deterministic modes, we identified the primary causes of answer variability.
Variability Sources by Impact:
| Source | Contribution to Variability | Description |
|---|---|---|
| Temperature sampling | 52% | Random token selection during generation |
| Context window state | 23% | System state and recent query history |
| Model drift/updates | 12% | Model changes over time |
| Phrasing sensitivity | 8% | Minor wording differences |
| Other factors | 5% | Server load, random seed, etc. |
Temperature Impact:
Temperature controls the randomness of token selection during text generation. Higher temperature increases creativity but decreases consistency.
| Temperature Setting | Variability Rate | Avg Brand Mentions | Citation Consistency |
|---|---|---|---|
| 0.0 (deterministic) | 8% | 3.1 | 94% |
| 0.3 | 19% | 3.6 | 88% |
| 0.7 (default ChatGPT) | 34% | 4.2 | 72% |
| 1.0 | 51% | 4.8 | 61% |
| 1.5 | 67% | 5.1 | 49% |
Key Insight: Most commercial AI platforms use temperature settings around 0.7, balancing creativity and consistency. This creates inherent variability that cannot be eliminated without significantly reducing answer quality.
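A rough sketch of how temperature reshapes next-token probabilities: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution toward the top token and high temperatures flatten it. The three-token vocabulary and logit values below are invented for illustration; real models apply this over tens of thousands of tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by sampling temperature."""
    if temperature == 0:                         # T=0: deterministic argmax
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # invented scores for three candidate next tokens
for t in (0.0, 0.7, 1.5):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])
```

At T=0 the top token is always chosen; as T rises, lower-ranked tokens gain probability mass, which is the mechanism behind the variability rates in the table above.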
Context Window Effects:
We tested whether previous queries (even unrelated ones) affect subsequent responses through context window contamination:
| Test Condition | Variability Rate | Brand Citation Consistency |
|---|---|---|
| Fresh session (no prior queries) | 31% | 76% |
| After 5 unrelated queries | 34% | 72% |
| After 10 unrelated queries | 38% | 68% |
| After 20 unrelated queries | 41% | 64% |
Implication: Session history and context window state affect answer quality and consistency. Users engaging in extended conversations with AI may receive different answers than users submitting isolated queries.
Finding 5: Prompt Phrasing Changes Cause 27% Answer Variation
Minor changes in prompt phrasing cause significantly different answers, even when the core intent remains identical.
Phrasing Variation Test:
We tested 50 queries with 5 phrasing variations each (250 total queries), keeping core intent identical.
Example Phrasing Variations for "Best CRM for small business":
- "What are the best CRM tools for small businesses?"
- "Which CRM should a small business use?"
- "Recommend CRM software for small business"
- "Small business CRM recommendations"
- "Compare top CRMs for small businesses"
Results:
- Brand mention overlap across phrasing variations: 58%
- Identical recommendations across all 5 phrasings: 12%
- At least one unique brand mention per phrasing: 89%
- Answer length variation across phrasings: 42%
Implication: Slight differences in how users phrase queries create materially different answers. For brands monitoring AI presence, this means tracking a single query phrasing provides incomplete visibility into brand presence.
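Brand-mention overlap across phrasings can be quantified with pairwise Jaccard similarity (shared brands divided by total distinct brands in each pair). The per-phrasing brand sets below are hypothetical, not taken from the study data:

```python
from itertools import combinations

# Hypothetical brand sets extracted from three phrasing variants of one query.
mentions = {
    "what are the best crm tools": {"HubSpot", "Zoho", "Pipedrive", "Salesforce"},
    "which crm should we use":     {"HubSpot", "Zoho", "Freshsales"},
    "recommend crm software":      {"HubSpot", "Zoho", "Pipedrive"},
}

def jaccard(a: set, b: set) -> float:
    """Shared brands divided by total distinct brands across the pair."""
    return len(a & b) / len(a | b)

pairs = list(combinations(mentions.values(), 2))
avg_overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
print(round(avg_overlap, 2))  # 0.55
```

Averaging this metric over all phrasing pairs of a tracked query gives a single overlap score comparable to the 58% figure reported above.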
Finding 6: Variability Differs Significantly Across AI Platforms
We tested identical queries across ChatGPT, Perplexity, Claude, and Google Gemini to compare variability rates.
Variability by Platform:
| Platform | Variability Rate | Citation Consistency | Avg Brand Mentions |
|---|---|---|---|
| Claude | 28% | 79% | 3.2 |
| ChatGPT | 34% | 72% | 4.2 |
| Perplexity | 31% | 75% | 4.8 |
| Google Gemini | 38% | 68% | 3.9 |
Key Findings:
- Claude shows highest consistency: Likely due to more conservative temperature settings and safety constraints
- Google Gemini shows highest variability: Possibly due to integration with live search and stronger randomization
- Brand mention count doesn't correlate with consistency: Perplexity mentions most brands but shows moderate consistency
Implication: Brands monitoring AI visibility should account for platform-specific variability. A brand appearing inconsistently in one platform may be normal for that platform rather than indicating weak presence.
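For monitoring in practice, the standard normal-approximation sample-size formula suggests how many repeated queries are needed to estimate a platform's citation consistency within a chosen margin of error, assuming independent samples (a simplification, given the context-window effects in Finding 4):

```python
import math

def queries_needed(p: float, margin: float, z: float = 1.96) -> int:
    """Sample size to estimate a proportion p within +/- margin at ~95% confidence."""
    return math.ceil(p * (1 - p) * (z / margin) ** 2)

# Citation-consistency rates from the platform table above, +/-5-point margin.
for platform, p in [("Claude", 0.79), ("ChatGPT", 0.72),
                    ("Perplexity", 0.75), ("Google Gemini", 0.68)]:
    print(platform, queries_needed(p, 0.05))
```

On these figures, pinning down a platform's consistency to within five percentage points takes on the order of 250-350 repeated queries, far more than typical single-query spot checks.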