Behind Gemini 3.1 Pro's "13 out of 16 Wins": The Benchmarks Published and the Benchmarks Left Out
TL;DR
- Gemini 3.1 Pro excels in reasoning and science: ARC-AGI-2 77.1%, GPQA Diamond 94.3%
- But OpenAI published GPT-5.3-Codex scores for only 2 of the 16 benchmarks, so many "wins" are against an absent competitor
- GDPval-AA (enterprise tasks): Claude models lead by ~300 points
- Arena (user voting): Opus 4.6 leads by just 4 points, which is essentially a tie
- Less than half the cost of Opus 4.6. No single "best overall" model exists; choose by use case
Audience: Engineers and decision-makers evaluating LLM performance
- Understand how to read Google's official benchmark claims
- Compare real performance gaps between Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.3-Codex
- Get use-case-specific guidance on model selection
Key Takeaways
- "13/16 wins" needs context: GPT-5.3-Codex scores are unpublished for most benchmarks, so in many categories Gemini wins against an absent competitor
- Third-party data tells a different story: Arena shows Opus 4.6 just 4 points ahead, and GDPval-AA shows the Claude models leading by roughly 300 points (289 for Opus 4.6, 316 for Sonnet 4.6)
- Cost is the clear differentiator: less than half the cost of Opus 4.6 with comparable reasoning performance
Breaking news disclaimer
This article is based on information available at launch on February 20, 2026. Arena (formerly LMSYS) user voting data is in its early stages, and rankings may shift as more data accumulates.
Google's Official Benchmarks: Claiming 13 of 16 Top Spots
On February 20, 2026, Google released Gemini 3.1 Pro as a preview. The model achieved 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, recording industry-leading scores across multiple benchmarks.
Google published a comparison table on deepmind.google covering 6 models: Gemini 3.1 Pro, Gemini 3 Pro, Sonnet 4.6, Opus 4.6, GPT-5.2, and GPT-5.3-Codex.
Score Overview
| Benchmark | Gemini 3.1 Pro | Opus 4.6 | GPT-5.2 | GPT-5.3-Codex |
|---|---|---|---|---|
| ARC-AGI-2 (Abstract reasoning) | **77.1%** | 68.8% | 52.9% | — |
| GPQA Diamond (Science) | **94.3%** | 91.3% | 92.4% | — |
| HLE without tools (Academic reasoning) | **44.4%** | 40.0% | 34.5% | — |
| HLE with tools | 51.4% | **53.1%** | 45.5% | — |
| Terminal-Bench 2.0 standard harness | **68.5%** | 65.4% | 54.0% | 64.7% |
| Terminal-Bench 2.0 custom harness | — | — | 62.2% | **77.3%** |
| SWE-Bench Verified | 80.6% | **80.8%** | 80.0% | — |
| SWE-Bench Pro (Public) | 54.2% | — | 55.6% | **56.8%** |
| GDPval-AA Elo (Enterprise tasks) | 1317 | **1606** | 1462 | — |
| APEX-Agents | **33.5%** | 29.8% | 23.0% | — |
| MRCR v2 128k | **84.9%** | 84.0% | 83.8% | — |
Bold = highest among the models shown in each row. "—" = unpublished. Note: Sonnet 4.6 (not shown) scored 1633 on GDPval-AA, the highest of all models.
The table above highlights key benchmarks only. The official Model Card also shows Gemini 3.1 Pro leading in agentic tasks (BrowseComp 85.9%, MCP Atlas 69.2%, τ2-bench Telecom 99.3%), competitive programming (LiveCodeBench Pro Elo 2887), and multilingual (MMMLU 92.6%).
Three Things This Table Doesn't Say
1. GPT-5.3-Codex is mostly absent. Of the 16 benchmarks, Codex scores appear in only 2: Terminal-Bench 2.0 and SWE-Bench Pro (Public). Codex is positioned as a coding-specialized model, but with no general reasoning scores published, there is little basis for declaring that Gemini beats it outside those two benchmarks.
2. GDPval-AA shows a significant gap. GDPval-AA measures enterprise task performance (finance, legal, etc.). Gemini 3.1 Pro scored 1317, trailing Opus 4.6 (1606) by 289 points and Sonnet 4.6 (1633) by 316 points. Artificial Analysis independently confirmed that Gemini "improved but did not take the lead."
3. Google didn't publish its own custom harness scores. For Terminal-Bench 2.0, GPT-5.3-Codex reported 77.3% using its custom harness (Codex harness). Google only published the standard harness (Terminus-2) score. Whether Google lacks custom harness results or chose not to publish them remains unclear.
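The "absent competitor" point can be checked mechanically against the Score Overview table. The sketch below is illustrative only: it transcribes the 11 rows shown above (not the full set of 16 benchmarks), uses `None` for unpublished cells, and the helper names `leader` and `split_wins` are made up for this example.

```python
MODELS = ("Gemini 3.1 Pro", "Opus 4.6", "GPT-5.2", "GPT-5.3-Codex")

# Rows transcribed from the Score Overview table (an 11-row excerpt).
# None marks an unpublished cell. Higher is better in every row.
TABLE = {
    "ARC-AGI-2": (77.1, 68.8, 52.9, None),
    "GPQA Diamond": (94.3, 91.3, 92.4, None),
    "HLE without tools": (44.4, 40.0, 34.5, None),
    "HLE with tools": (51.4, 53.1, 45.5, None),
    "Terminal-Bench 2.0 (standard harness)": (68.5, 65.4, 54.0, 64.7),
    "Terminal-Bench 2.0 (custom harness)": (None, None, 62.2, 77.3),
    "SWE-Bench Verified": (80.6, 80.8, 80.0, None),
    "SWE-Bench Pro (Public)": (54.2, None, 55.6, 56.8),
    "GDPval-AA Elo": (1317, 1606, 1462, None),
    "APEX-Agents": (33.5, 29.8, 23.0, None),
    "MRCR v2 128k": (84.9, 84.0, 83.8, None),
}


def leader(row):
    """Return the model with the highest published score in a row."""
    published = [(score, model) for score, model in zip(row, MODELS) if score is not None]
    return max(published)[1]


def split_wins(model, rival):
    """Split the rows that `model` leads by whether `rival` published a score."""
    rival_idx = MODELS.index(rival)
    contested, uncontested = [], []
    for name, row in TABLE.items():
        if leader(row) != model:
            continue
        if row[rival_idx] is not None:
            contested.append(name)
        else:
            uncontested.append(name)
    return contested, uncontested


if __name__ == "__main__":
    contested, uncontested = split_wins("Gemini 3.1 Pro", "GPT-5.3-Codex")
    print("Led with a Codex score present:", contested)
    print("Led with no Codex score published:", uncontested)
```

Within this excerpt, Gemini 3.1 Pro leads six rows, and only one of them (Terminal-Bench 2.0 on the standard harness) has a published Codex score to compare against.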
Third-Party Evaluations: Different Leaderboards, Different Winners
Independent evaluations paint a more nuanced picture. Automated benchmark aggregation puts Gemini on top; user voting puts Claude Opus 4.6 slightly ahead.
Artificial Analysis Intelligence Index v4.0
Artificial Analysis received pre-release access from Google and conducted independent evaluation. Gemini 3.1 Pro scored 57, taking the Index's top spot — 4 points ahead of Opus 4.6 (53) and 6 points ahead of Sonnet 4.6 (51).
It ranked first in 6 of 10 evaluation categories: Terminal-Bench Hard, AA-Omniscience (hallucination reduction), HLE, GPQA Diamond, SciCode, and CritPt (research-level physics reasoning). CritPt at 18% exceeded the runner-up by over 5 points, highlighting particular strength in scientific reasoning.
However, the same report explicitly noted that GDPval-AA performance "improved but did not reach the top," confirming the enterprise task gap from a third-party perspective.
Arena (formerly LMSYS Chatbot Arena)
Arena's blind user voting presents a different picture. As of February 20, the Text Arena top rankings were:
- Claude Opus 4.6 — 1504 Elo
- Claude Opus 4.6 Thinking — 1504 Elo
- Gemini 3.1 Pro Preview — 1500 Elo
- Gemini 3 Pro — 1487 Elo
In the overall Text category, Gemini 3.1 Pro sits just 4 points behind the leading Opus 4.6. Contrary to the "dominant leader" narrative from official benchmarks, human blind testing shows the two models as essentially tied. Vision and Code category evaluations are still ongoing, and rankings may shift over the coming weeks.
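To get a feel for what a 4-point gap means next to the roughly 300-point GDPval-AA gap, the classic Elo expected-score formula is a useful rough guide. This is a sketch under an assumption: it applies the textbook logistic formula with a 400-point scale, and neither Arena nor GDPval-AA is confirmed here to use exactly this parameterization. The function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score (roughly, win probability) of A against B under classic Elo.

    Assumes the textbook logistic curve with a 400-point scale; the
    leaderboards discussed in the article may use a different fit.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Arena Text: Opus 4.6 (1504) vs Gemini 3.1 Pro Preview (1500)
    print(f"Arena:     {elo_expected_score(1504, 1500):.3f}")
    # GDPval-AA: Opus 4.6 (1606) vs Gemini 3.1 Pro (1317)
    print(f"GDPval-AA: {elo_expected_score(1606, 1317):.3f}")
```

Under that assumption, a 4-point lead corresponds to an expected score of about 0.51, barely distinguishable from a coin flip, while a 289-point lead corresponds to roughly 0.84.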
Data Missing from Google's Comparison Table
Several scores from official announcements by other companies are absent from Google's table.
| Benchmark | Model | Score | Source |
|---|---|---|---|
| OSWorld (PC operation) | Opus 4.6 | 72.7% | Anthropic official |
| OSWorld (PC operation) | GPT-5.3-Codex | 64.7% | OpenAI official |
| MRCR v2 1M 8-needle | Opus 4.6 | 76% | Anthropic official |
| BigLaw Bench (Legal reasoning) | Opus 4.6 | 90.2% | Anthropic official |
| Cybersecurity CTF | GPT-5.3-Codex | 77.6% | OpenAI official |
Opus 4.6's OSWorld score of 72.7% exceeds any model in Google's table. For MRCR v2 at 1 million tokens, Google lists Opus 4.6 as "Not supported," yet Anthropic claims 76% with its beta 1M context window. Whether this reflects evaluation timing differences or intentional exclusion is unknown.
Cost Performance: A Clear Advantage
Beyond raw performance, cost structure deserves attention. For high-volume API use cases, this difference can be decisive.
| Model | Input / 1M tokens | Output / 1M tokens | AA Index run cost |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | $892 |
| Opus 4.6 (max) | $5.00 | $25.00 | $1,800+ |
| GPT-5.2 (xhigh) | — | — | $1,800+ |
At less than half the cost of Opus 4.6 for Artificial Analysis's full benchmark suite, Gemini 3.1 Pro becomes a practical choice for applications requiring high-volume inference.
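The list prices in the table can be turned into a per-workload estimate. The following is a minimal sketch: the prices come from the table above, while the workload size (100M input tokens, 20M output tokens) is a hypothetical figure chosen only for illustration, and the names `Pricing` and `workload_cost` are not from any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# List prices taken from the cost table above.
PRICES = {
    "Gemini 3.1 Pro": Pricing(2.00, 12.00),
    "Opus 4.6": Pricing(5.00, 25.00),
}


def workload_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a workload at list prices (no caching or discounts)."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )


if __name__ == "__main__":
    # Hypothetical monthly workload, for illustration only.
    input_tokens, output_tokens = 100_000_000, 20_000_000
    for model, pricing in PRICES.items():
        print(f"{model}: ${workload_cost(pricing, input_tokens, output_tokens):,.2f}")
```

For this hypothetical workload the estimate is $440 for Gemini 3.1 Pro against $1,000 for Opus 4.6. At these list prices the ratio stays between 0.40 (input-only) and 0.48 (output-only) for any input/output mix, which is consistent with the "less than half" figure from the Artificial Analysis run cost.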
Use-Case Recommendations (Current Assessment)
No single "best overall" model exists. The optimal choice depends on the specific use case.
| Use Case | Top Candidate | Evidence |
|---|---|---|
| Abstract reasoning & science | Gemini 3.1 Pro | ARC-AGI-2 77.1%, GPQA 94.3% |
| Competitive programming | Gemini 3.1 Pro | LiveCodeBench Pro Elo 2887 (~500 ahead of 2nd) |
| Agentic (search & MCP) | Gemini 3.1 Pro | BrowseComp 85.9%, MCP Atlas 69.2% |
| Enterprise tasks (finance, legal) | Sonnet 4.6 / Opus 4.6 | GDPval-AA 1633/1606 vs Gemini 1317 |
| Terminal / CLI coding | GPT-5.3-Codex | Custom harness 77.3% (standard: Gemini leads) |
| PC operation agent | Opus 4.6 | OSWorld 72.7% |
| Long context (1M tokens) | Opus 4.6 | MRCR v2 1M 76% (beta) |
| Cost performance | Gemini 3.1 Pro | Less than half the cost of Opus 4.6 with comparable reasoning |
Summary: Reading the Asymmetry in Benchmark Disclosure
Gemini 3.1 Pro is a strong model for reasoning, scientific knowledge, and cost efficiency, as evidenced by its first-place ranking on the Artificial Analysis Intelligence Index. However, understanding each vendor's benchmark disclosure strategy changes how "13 out of 16 wins" should be read.
- Winning against absent competitors. With GPT-5.3-Codex scores published for only 2 of 16 benchmarks, concluding that "Gemini won" across the remaining 14 is premature.
- A roughly 300-point gap in enterprise tasks. The GDPval-AA deficit against the Claude models cannot be ignored when evaluating for finance, legal, or other enterprise applications.
- Losing to the previous generation in some areas. MMMU-Pro (multimodal understanding) shows Gemini 3 Pro (81.0%) outperforming 3.1 Pro (80.5%). Newer models don't always surpass their predecessors across the board.
- An industry-wide pattern. Highlighting favorable benchmarks while omitting unfavorable ones is not unique to Google. Selecting by use case and waiting for independent verification remain the rational approach.
This asymmetry in benchmark disclosure is likely to deepen. As each vendor pushes agentic capabilities (tool use, multi-step reasoning), standardizing evaluation conditions becomes harder, not easier. The HLE "with tools / without tools" case, where the top spot flips between Gemini 3.1 Pro and Opus 4.6 depending on execution conditions, illustrates the problem. Asking "under what conditions was this measured?" rather than looking only at the model name and score will become a core skill for model selection.
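The HLE flip is easy to reproduce from the published numbers. The sketch below uses only the two HLE rows from the Score Overview table; the `ranking` helper is an illustrative name, and GPT-5.3-Codex is omitted because its HLE scores are unpublished.

```python
# HLE scores (percent) from the Score Overview table, keyed by evaluation condition.
HLE = {
    "without tools": {"Gemini 3.1 Pro": 44.4, "Opus 4.6": 40.0, "GPT-5.2": 34.5},
    "with tools": {"Gemini 3.1 Pro": 51.4, "Opus 4.6": 53.1, "GPT-5.2": 45.5},
}


def ranking(scores: dict) -> list:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    for condition, scores in HLE.items():
        print(f"HLE {condition}: {' > '.join(ranking(scores))}")
```

The same benchmark name yields a different leader depending on the condition, which is why a score is only meaningful together with its harness and tool settings.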
Update schedule
This article will be updated in detail once Arena scores stabilize and sufficient data accumulates across categories (Vision, Code, etc.).
References
- Google DeepMind Model Card: Gemini 3.1 Pro
- Google DeepMind Evals Methodology: Gemini 3.1 Pro
- Google Official Blog: Gemini 3.1 Pro
- Anthropic Official: Claude Opus 4.6
- OpenAI Official: GPT-5.3-Codex
- Artificial Analysis: Gemini 3.1 Pro Preview
- Arena Leaderboard Changelog