Behind Gemini 3.1 Pro's "13 out of 16 Wins": The Benchmarks Published and the Benchmarks Left Out

TL;DR

Gemini 3.1 Pro excels in reasoning and science: ARC-AGI-2 77.1%, GPQA Diamond 94.3%
But GPT-5.3-Codex left 14 of 16 benchmarks unpublished — many "wins" are against an absent competitor
GDPval-AA (enterprise tasks): Claude models lead by ~300 points
Arena (user voting): Opus 4.6 leads by just 4 points — essentially tied
Less than half the cost of Opus 4.6. No single "best overall" model exists; choose by use case

Audience: Engineers and decision-makers evaluating LLM performance

Understand how to read Google's official benchmark claims
Compare real performance gaps between Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.3-Codex
Get use-case-specific guidance on model selection

Key Takeaways

"13/16 wins" needs context: GPT-5.3-Codex left most benchmarks unpublished, so in many categories Gemini wins against an absent competitor
Third-party data tells a different story: Arena shows Opus 4.6 just 4 points ahead; GDPval-AA shows Claude models leading by roughly 300 points (289 for Opus 4.6, 316 for Sonnet 4.6)
Cost is the clear differentiator: less than half the cost of Opus 4.6 with comparable reasoning performance

Breaking news disclaimer

This article is based on information available at launch on February 20, 2026. Arena (formerly LMSYS) user voting data is in its early stages, and rankings may shift as more data accumulates.

Google's Official Benchmarks: Claiming 13 of 16 Top Spots

On February 20, 2026, Google released Gemini 3.1 Pro as a preview. The model achieved 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, recording industry-leading scores across multiple benchmarks.

Google published a comparison table on deepmind.google covering 6 models: Gemini 3.1 Pro, Gemini 3 Pro, Sonnet 4.6, Opus 4.6, GPT-5.2, and GPT-5.3-Codex.

Score Overview

Benchmark	Gemini 3.1 Pro	Opus 4.6	GPT-5.2	GPT-5.3-Codex
ARC-AGI-2 (Abstract reasoning)	77.1%	68.8%	52.9%	—
GPQA Diamond (Science)	94.3%	91.3%	92.4%	—
HLE without tools (Academic reasoning)	44.4%	40.0%	34.5%	—
HLE with tools	51.4%	53.1%	45.5%	—
Terminal-Bench 2.0 standard harness	68.5%	65.4%	54.0%	64.7%
Terminal-Bench 2.0 custom harness	—	—	62.2%	77.3%
SWE-Bench Verified	80.6%	80.8%	80.0%	—
SWE-Bench Pro (Public)	54.2%	—	55.6%	56.8%
GDPval-AA Elo (Enterprise tasks)	1317	1606	1462	—
APEX-Agents	33.5%	29.8%	23.0%	—
MRCR v2 128k	84.9%	84.0%	83.8%	—

Bold = highest per row. "—" = unpublished. Note: Sonnet 4.6 scored 1633 on GDPval-AA, the highest of all models.

The table above highlights key benchmarks only. The official Model Card also shows Gemini 3.1 Pro leading in agentic tasks (BrowseComp 85.9%, MCP Atlas 69.2%, τ2-bench Telecom 99.3%), competitive programming (LiveCodeBench Pro Elo 2887), and multilingual (MMMLU 92.6%).

Three Things This Table Doesn't Say

1. GPT-5.3-Codex is mostly absent. Out of 16 benchmarks, Codex scores appear in only 2: Terminal-Bench 2.0 and SWE-Bench Pro (Public). While Codex is positioned as a coding-specialized model, the lack of general reasoning scores means the playing field for declaring "Gemini wins" is severely limited.

2. GDPval-AA shows a significant gap. GDPval-AA measures enterprise task performance (finance, legal, etc.). Gemini 3.1 Pro scored 1317, trailing Sonnet 4.6 (1633) by 316 points and Opus 4.6 (1606) by 289 points. Artificial Analysis independently confirmed that Gemini "improved but did not take the lead."

3. Google didn't publish its own custom harness scores. For Terminal-Bench 2.0, GPT-5.3-Codex reported 77.3% using its custom harness (Codex harness). Google only published the standard harness (Terminus-2) score. Whether Google lacks custom harness results or chose not to publish them remains unclear.
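The "absent competitor" point can be checked mechanically. The sketch below transcribes the score table above into Python and splits each pairwise comparison into wins, losses, and rows where one side has no published score. The names `MODELS`, `TABLE`, and `head_to_head` are illustrative, and the data covers only the rows shown in this article, not Google's full 16-benchmark table.

```python
from typing import Dict, List, Optional, Tuple

MODELS = ("Gemini 3.1 Pro", "Opus 4.6", "GPT-5.2", "GPT-5.3-Codex")

# Scores transcribed from the table above; None marks an unpublished cell.
TABLE: Dict[str, Tuple[Optional[float], ...]] = {
    "ARC-AGI-2": (77.1, 68.8, 52.9, None),
    "GPQA Diamond": (94.3, 91.3, 92.4, None),
    "HLE without tools": (44.4, 40.0, 34.5, None),
    "HLE with tools": (51.4, 53.1, 45.5, None),
    "Terminal-Bench 2.0 standard harness": (68.5, 65.4, 54.0, 64.7),
    "Terminal-Bench 2.0 custom harness": (None, None, 62.2, 77.3),
    "SWE-Bench Verified": (80.6, 80.8, 80.0, None),
    "SWE-Bench Pro (Public)": (54.2, None, 55.6, 56.8),
    "GDPval-AA Elo": (1317, 1606, 1462, None),
    "APEX-Agents": (33.5, 29.8, 23.0, None),
    "MRCR v2 128k": (84.9, 84.0, 83.8, None),
}


def head_to_head(model: str, rival: str) -> Dict[str, List[str]]:
    """Classify every benchmark row for `model` against `rival`."""
    i, j = MODELS.index(model), MODELS.index(rival)
    result: Dict[str, List[str]] = {
        "wins": [], "losses": [], "ties": [], "no_comparison": []
    }
    for bench, row in TABLE.items():
        a, b = row[i], row[j]
        if a is None or b is None:
            # One side is unpublished, so there is nothing to compare.
            result["no_comparison"].append(bench)
        elif a > b:
            result["wins"].append(bench)
        elif a < b:
            result["losses"].append(bench)
        else:
            result["ties"].append(bench)
    return result


if __name__ == "__main__":
    for rival in MODELS[1:]:
        outcome = head_to_head("Gemini 3.1 Pro", rival)
        print(rival, {key: len(rows) for key, rows in outcome.items()})
```

For the rows transcribed here, Gemini 3.1 Pro versus GPT-5.3-Codex yields one win, one loss, and nine rows with no direct comparison, which is why a headline win count says little about that pairing.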

Third-Party Evaluations: Different Leaderboards, Different Winners

Independent evaluations paint a more nuanced picture. Automated benchmark aggregation puts Gemini on top; user voting puts Claude Opus 4.6 slightly ahead.

Artificial Analysis Intelligence Index v4.0

Artificial Analysis received pre-release access from Google and conducted independent evaluation. Gemini 3.1 Pro scored 57, taking the Index's top spot — 4 points ahead of Opus 4.6 (53) and 6 points ahead of Sonnet 4.6 (51).

It ranked first in 6 of 10 evaluation categories: Terminal-Bench Hard, AA-Omniscience (hallucination reduction), HLE, GPQA Diamond, SciCode, and CritPt (research-level physics reasoning). CritPt at 18% exceeded the runner-up by over 5 points, highlighting particular strength in scientific reasoning.

However, the same report explicitly noted that GDPval-AA performance "improved but did not reach the top," confirming the enterprise task gap from a third-party perspective.

Arena (formerly LMSYS Chatbot Arena)

Arena's blind user voting presents a different picture. As of February 20, the Text Arena top rankings were:

Claude Opus 4.6 — 1504 Elo
Claude Opus 4.6 Thinking — 1504 Elo
Gemini 3.1 Pro Preview — 1500 Elo
Gemini 3 Pro — 1487 Elo

In the overall Text category, Gemini 3.1 Pro sits just 4 points behind the leading Opus 4.6. Contrary to the "dominant leader" narrative from official benchmarks, human blind testing shows the two models as essentially tied. Vision and Code category evaluations are still ongoing, and rankings may shift over the coming weeks.
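To see why a 4-point gap counts as a tie, it helps to convert the rating difference into an expected head-to-head win rate. The sketch below assumes the conventional logistic Elo formula with a 400-point scale; Arena's published ratings come from its own fitting procedure, so treat the result as an approximation rather than Arena's exact model. The function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A over B under the standard Elo model.

    Assumption: logistic curve with a 400-point scale, as in classic Elo.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings from the Text Arena list above.
    opus_vs_gemini = elo_expected_score(1504, 1500)
    print(f"Opus 4.6 vs Gemini 3.1 Pro Preview: {opus_vs_gemini:.1%}")  # about 50.6%
```

Under that assumption, a 4-point lead corresponds to Opus 4.6 being preferred in roughly 50.6% of direct matchups, a margin that early-stage voting noise can easily erase.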

Data Missing from Google's Comparison Table

Several scores from official announcements by other companies are absent from Google's table.

Benchmark	Model	Score	Source
OSWorld (PC operation)	Opus 4.6	72.7%	Anthropic official
OSWorld (PC operation)	GPT-5.3-Codex	64.7%	OpenAI official
MRCR v2 1M 8-needle	Opus 4.6	76%	Anthropic official
BigLaw Bench (Legal reasoning)	Opus 4.6	90.2%	Anthropic official
Cybersecurity CTF	GPT-5.3-Codex	77.6%	OpenAI official

Opus 4.6's OSWorld score of 72.7% exceeds any model in Google's table. For MRCR v2 at 1 million tokens, Google lists Opus 4.6 as "Not supported," yet Anthropic claims 76% with its beta 1M context window. Whether this reflects evaluation timing differences or intentional exclusion is unknown.

Cost Performance: A Clear Advantage

Beyond raw performance, cost structure deserves attention. For high-volume API use cases, this difference can be decisive.

Model	Input / 1M tokens	Output / 1M tokens	AA Index run cost
Gemini 3.1 Pro	$2.00	$12.00	$892
Opus 4.6 (max)	$5.00	$25.00	$1,800+
GPT-5.2 (xhigh)	—	—	$1,800+

At less than half the cost of Opus 4.6 for Artificial Analysis's full benchmark suite, Gemini 3.1 Pro becomes a practical choice for applications requiring high-volume inference.
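The per-token prices in the table can be turned into a rough budget estimate. The sketch below is a minimal calculation using only the listed input and output prices; the monthly token volumes are hypothetical, and it ignores any tiered, cached, or batch pricing that vendors may apply.

```python
# Listed prices in USD per 1M tokens (input, output), taken from the table above.
PRICES = {
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Opus 4.6": (5.00, 25.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate API cost in USD for a given token volume at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price


if __name__ == "__main__":
    # Hypothetical monthly workload: 500M input tokens, 100M output tokens.
    gemini = workload_cost("Gemini 3.1 Pro", 500_000_000, 100_000_000)
    opus = workload_cost("Opus 4.6", 500_000_000, 100_000_000)
    print(f"Gemini 3.1 Pro: ${gemini:,.0f}")  # $2,200
    print(f"Opus 4.6:       ${opus:,.0f}")    # $5,000
    print(f"Ratio: {gemini / opus:.0%}")      # 44%
```

The ratio depends on the input-to-output mix: at list price Gemini 3.1 Pro costs 40% of Opus 4.6 on input tokens and 48% on output tokens, so any blend lands between those two figures.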

Use-Case Recommendations (Current Assessment)

No single "best overall" model exists. The optimal choice depends on the specific use case.

Use Case	Top Candidate	Evidence
Abstract reasoning & science	Gemini 3.1 Pro	ARC-AGI-2 77.1%, GPQA 94.3%
Competitive programming	Gemini 3.1 Pro	LiveCodeBench Pro Elo 2887 (~500 ahead of 2nd place)
Agentic (search & MCP)	Gemini 3.1 Pro	BrowseComp 85.9%, MCP Atlas 69.2%
Enterprise tasks (finance, legal)	Sonnet 4.6 / Opus 4.6	GDPval-AA 1633/1606 vs Gemini 1317
Terminal / CLI coding	GPT-5.3-Codex	Custom harness 77.3% (standard: Gemini leads)
PC operation agent	Opus 4.6	OSWorld 72.7%
Long context (1M tokens)	Opus 4.6	MRCR v2 1M 76% (beta)
Cost performance	Gemini 3.1 Pro	Less than half the cost of Opus 4.6 with comparable reasoning

Summary: Reading the Asymmetry in Benchmark Disclosure

Gemini 3.1 Pro is a strong model for reasoning, scientific knowledge, and cost efficiency, as evidenced by its Artificial Analysis Intelligence Index first-place ranking. However, understanding each vendor's benchmark disclosure strategy changes the resolution of "13 out of 16 wins."

Winning against absent competitors. With GPT-5.3-Codex scores published for only 2 of 16 benchmarks, concluding that "Gemini won" across the remaining 14 is premature
A 300-point gap in enterprise tasks. The GDPval-AA deficit against Claude models cannot be ignored when evaluating for finance, legal, or other enterprise applications
Losing to the previous generation in some areas. MMMU-Pro (multimodal understanding) shows Gemini 3 Pro (81.0%) outperforming 3.1 Pro (80.5%) — newer models don't always surpass their predecessors across the board
An industry-wide pattern. Highlighting favorable benchmarks while omitting unfavorable ones is not unique to Google. Use-case-specific selection and waiting for independent verification remain the rational approach

This asymmetry in benchmark disclosure is likely to deepen. As each vendor pushes agentic capabilities (tool use, multi-step reasoning), standardizing evaluation conditions becomes harder, not easier. The HLE "with tools / without tools" case — where rankings flip depending on execution conditions — illustrates the problem. Reading "under what conditions was this measured" rather than just the model name will become the core literacy for model selection going forward.
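The HLE flip is easy to reproduce from the scores reported earlier in this article. The sketch below ranks the same three models under each execution condition; the names `HLE` and `ranking` are illustrative.

```python
from typing import Dict, List

# HLE scores (%) from the score overview table, keyed by execution condition.
HLE: Dict[str, Dict[str, float]] = {
    "without tools": {"Gemini 3.1 Pro": 44.4, "Opus 4.6": 40.0, "GPT-5.2": 34.5},
    "with tools": {"Gemini 3.1 Pro": 51.4, "Opus 4.6": 53.1, "GPT-5.2": 45.5},
}


def ranking(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    for condition, scores in HLE.items():
        print(f"HLE {condition}: {' > '.join(ranking(scores))}")
```

The same benchmark name produces a different leader depending on whether tools are enabled, so a comparison is only meaningful when the condition is recorded next to the score.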

Update schedule

This article will be updated in detail once Arena scores stabilize and sufficient data accumulates across categories (Vision, Code, etc.).


References

Google DeepMind Model Card: Gemini 3.1 Pro
Google DeepMind Evals Methodology: Gemini 3.1 Pro
Google Official Blog: Gemini 3.1 Pro
Anthropic Official: Claude Opus 4.6
OpenAI Official: GPT-5.3-Codex
Artificial Analysis: Gemini 3.1 Pro Preview
Arena Leaderboard Changelog