Behind Gemini 3.1 Pro's "13 out of 16 Wins" — The Benchmarks Published and the Benchmarks Left Out

TL;DR

  • Gemini 3.1 Pro excels in reasoning and science: ARC-AGI-2 77.1%, GPQA Diamond 94.3%
  • But GPT-5.3-Codex left 14 of 16 benchmarks unpublished — many "wins" are against an absent competitor
  • GDPval-AA (enterprise tasks): Claude models lead by ~300 points
  • Arena (user voting): Opus 4.6 leads by just 4 points — essentially tied
  • Half the cost of Opus 4.6. No single "best overall" model exists; choose by use case

Audience: Engineers and decision-makers evaluating LLM performance

  • Understand how to read Google's official benchmark claims
  • Compare real performance gaps between Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.3-Codex
  • Get use-case-specific guidance on model selection

Key Takeaways

  • "13/16 wins" needs context: GPT-5.3-Codex left most benchmarks unpublished, so Gemini wins against an absent competitor in many categories
  • Third-party data tells a different story: Arena shows Opus 4.6 just 4 points ahead; GDPval-AA shows the Claude models leading by roughly 290 to 316 points
  • Cost is the clear differentiator: less than half the cost of Opus 4.6 with comparable reasoning performance

Breaking news disclaimer

This article is based on information available at launch on February 20, 2026. Arena (formerly LMSYS) user voting data is in its early stages, and rankings may shift as more data accumulates.

Google's Official Benchmarks: Claiming 13 of 16 Top Spots

On February 20, 2026, Google released Gemini 3.1 Pro as a preview. The model achieved 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, recording industry-leading scores across multiple benchmarks.

Google published a comparison table on deepmind.google covering 6 models: Gemini 3.1 Pro, Gemini 3 Pro, Sonnet 4.6, Opus 4.6, GPT-5.2, and GPT-5.3-Codex.

Score Overview

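| Benchmark | Gemini 3.1 Pro | Opus 4.6 | GPT-5.2 | GPT-5.3-Codex |
| --- | --- | --- | --- | --- |
| ARC-AGI-2 (Abstract reasoning) | **77.1%** | 68.8% | 52.9% | — |
| GPQA Diamond (Science) | **94.3%** | 91.3% | 92.4% | — |
| HLE without tools (Academic reasoning) | **44.4%** | 40.0% | 34.5% | — |
| HLE with tools | 51.4% | **53.1%** | 45.5% | — |
| Terminal-Bench 2.0, standard harness | **68.5%** | 65.4% | 54.0% | 64.7% |
| Terminal-Bench 2.0, custom harness | — | — | 62.2% | **77.3%** |
| SWE-Bench Verified | 80.6% | **80.8%** | 80.0% | — |
| SWE-Bench Pro (Public) | 54.2% | — | 55.6% | **56.8%** |
| GDPval-AA Elo (Enterprise tasks) | 1317 | **1606** | 1462 | — |
| APEX-Agents | **33.5%** | 29.8% | 23.0% | — |
| MRCR v2 128k | **84.9%** | 84.0% | 83.8% | — |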

Bold = highest per row. "—" = unpublished. Note: Sonnet 4.6 scored 1633 on GDPval-AA, the highest of all models.

The table above highlights key benchmarks only. The official Model Card also shows Gemini 3.1 Pro leading in agentic tasks (BrowseComp 85.9%, MCP Atlas 69.2%, τ2-bench Telecom 99.3%), competitive programming (LiveCodeBench Pro Elo 2887), and multilingual (MMMLU 92.6%).

Three Things This Table Doesn't Say

1. GPT-5.3-Codex is mostly absent. Out of 16 benchmarks, Codex scores appear in only 2: Terminal-Bench 2.0 and SWE-Bench Pro (Public). Codex is positioned as a coding-specialized model, but without general reasoning scores there are very few benchmarks on which a Gemini win over Codex can actually be declared.

2. GDPval-AA shows a significant gap. GDPval-AA measures enterprise task performance (finance, legal, etc.). Gemini 3.1 Pro scored 1317, trailing Sonnet 4.6 (1633) by 316 points and Opus 4.6 (1606) by 289 points. Artificial Analysis independently confirmed that Gemini "improved but did not take the lead."

3. Google didn't publish its own custom harness scores. For Terminal-Bench 2.0, GPT-5.3-Codex reported 77.3% using its custom harness (Codex harness). Google only published the standard harness (Terminus-2) score. Whether Google lacks custom harness results or chose not to publish them remains unclear.
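The "absent competitor" point can be checked with a small tally. The sketch below is illustrative only: it transcribes the Gemini 3.1 Pro and GPT-5.3-Codex cells from the table above (covering the rows shown in this article, not all 16 benchmarks), treats unpublished cells as `None`, and uses a hypothetical helper named `head_to_head`.

```python
# Scores transcribed from the table above as (Gemini 3.1 Pro, GPT-5.3-Codex).
# None marks a cell that the vendor left unpublished.
SCORES = {
    "ARC-AGI-2": (77.1, None),
    "GPQA Diamond": (94.3, None),
    "HLE without tools": (44.4, None),
    "HLE with tools": (51.4, None),
    "Terminal-Bench 2.0 (standard harness)": (68.5, 64.7),
    "Terminal-Bench 2.0 (custom harness)": (None, 77.3),
    "SWE-Bench Verified": (80.6, None),
    "SWE-Bench Pro (Public)": (54.2, 56.8),
    "GDPval-AA Elo": (1317, None),
    "APEX-Agents": (33.5, None),
    "MRCR v2 128k": (84.9, None),
}


def head_to_head(scores):
    """Split rows into comparable, Gemini wins, and uncontested (hypothetical helper)."""
    comparable = [
        name for name, (gemini, codex) in scores.items()
        if gemini is not None and codex is not None
    ]
    gemini_wins = [name for name in comparable if scores[name][0] > scores[name][1]]
    uncontested = [
        name for name, (gemini, codex) in scores.items()
        if gemini is not None and codex is None
    ]
    return comparable, gemini_wins, uncontested


comparable, gemini_wins, uncontested = head_to_head(SCORES)
print(f"Rows where both models published a score: {len(comparable)}")
print(f"Gemini wins among those rows: {len(gemini_wins)}")
print(f"Rows where Codex has no published score: {len(uncontested)}")
```

Counting only the rows where both vendors published a number gives a much smaller, more even comparison than the headline figure suggests.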

Third-Party Evaluations: Different Leaderboards, Different Winners

Independent evaluations paint a more nuanced picture. Automated benchmark aggregation puts Gemini on top; user voting puts Claude Opus 4.6 slightly ahead.

Artificial Analysis Intelligence Index v4.0

Artificial Analysis received pre-release access from Google and conducted independent evaluation. Gemini 3.1 Pro scored 57, taking the Index's top spot — 4 points ahead of Opus 4.6 (53) and 6 points ahead of Sonnet 4.6 (51).

It ranked first in 6 of 10 evaluation categories: Terminal-Bench Hard, AA-Omniscience (hallucination reduction), HLE, GPQA Diamond, SciCode, and CritPt (research-level physics reasoning). CritPt at 18% exceeded the runner-up by over 5 points, highlighting particular strength in scientific reasoning.

However, the same report explicitly noted that GDPval-AA performance "improved but did not reach the top," confirming the enterprise task gap from a third-party perspective.

Arena (formerly LMSYS Chatbot Arena)

Arena's blind user voting presents a different picture. As of February 20, the Text Arena top rankings were:

  • Claude Opus 4.6 — 1504 Elo
  • Claude Opus 4.6 Thinking — 1504 Elo
  • Gemini 3.1 Pro Preview — 1500 Elo
  • Gemini 3 Pro — 1487 Elo

In the overall Text category, Gemini 3.1 Pro sits just 4 points behind the leading Opus 4.6. Contrary to the "dominant leader" narrative from official benchmarks, human blind testing shows the two models as essentially tied. Vision and Code category evaluations are still ongoing, and rankings may shift over the coming weeks.
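To put these rating gaps in perspective, the following sketch converts an Elo difference into an expected head-to-head win rate. It assumes the classic logistic Elo formula with a 400-point scale; Arena and GDPval-AA may compute their ratings differently, so the results are a rough intuition rather than the leaderboards' own numbers. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model under the standard Elo model.

    Assumption: ratings follow the classic logistic Elo curve with a
    400-point scale. The leaderboards cited here may use other conventions.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Rating gaps taken from the figures quoted in this article.
arena_gap = 1504 - 1500           # Opus 4.6 vs Gemini 3.1 Pro Preview (Arena Text)
gdpval_gap_opus = 1606 - 1317     # Opus 4.6 vs Gemini 3.1 Pro (GDPval-AA)
gdpval_gap_sonnet = 1633 - 1317   # Sonnet 4.6 vs Gemini 3.1 Pro (GDPval-AA)

for label, gap in [
    ("Arena Text, Opus 4.6 vs Gemini 3.1 Pro", arena_gap),
    ("GDPval-AA, Opus 4.6 vs Gemini 3.1 Pro", gdpval_gap_opus),
    ("GDPval-AA, Sonnet 4.6 vs Gemini 3.1 Pro", gdpval_gap_sonnet),
]:
    print(f"{label}: gap {gap}, expected win rate {expected_win_rate(gap):.1%}")
```

Under this assumption a 4-point gap is barely distinguishable from a coin flip, while a gap near 300 points corresponds to the higher-rated model being preferred in well over 80% of pairings.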

Data Missing from Google's Comparison Table

Several scores from official announcements by other companies are absent from Google's table.

| Benchmark | Model | Score | Source |
| --- | --- | --- | --- |
| OSWorld (PC operation) | Opus 4.6 | 72.7% | Anthropic official |
| OSWorld (PC operation) | GPT-5.3-Codex | 64.7% | OpenAI official |
| MRCR v2 1M 8-needle | Opus 4.6 | 76% | Anthropic official |
| BigLaw Bench (Legal reasoning) | Opus 4.6 | 90.2% | Anthropic official |
| Cybersecurity CTF | GPT-5.3-Codex | 77.6% | OpenAI official |

Opus 4.6's OSWorld score of 72.7% exceeds any model in Google's table. For MRCR v2 at 1 million tokens, Google lists Opus 4.6 as "Not supported," yet Anthropic claims 76% with its beta 1M context window. Whether this reflects evaluation timing differences or intentional exclusion is unknown.

Cost Performance: A Clear Advantage

Beyond raw performance, cost structure deserves attention. For high-volume API use cases, this difference can be decisive.

| Model | Input / 1M tokens | Output / 1M tokens | AA Index run cost |
| --- | --- | --- | --- |
| Gemini 3.1 Pro | $2.00 | $12.00 | $892 |
| Opus 4.6 (max) | $5.00 | $25.00 | $1,800+ |
| GPT-5.2 (xhigh) |  |  | $1,800+ |

At less than half the cost of Opus 4.6 for Artificial Analysis's full benchmark suite, Gemini 3.1 Pro becomes a practical choice for applications requiring high-volume inference.
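As a rough illustration of how the list prices translate into a bill, the sketch below prices one workload on both models. The per-token prices come from the table above; the workload size (10M input and 2M output tokens) and the `Pricing` class are hypothetical, and the calculation ignores any caching discounts or tiered pricing a vendor may apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per 1M tokens (illustrative helper, not a vendor API)."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for the given token counts at list price."""
        total = (
            input_tokens * self.input_per_million
            + output_tokens * self.output_per_million
        )
        return total / 1_000_000


# Prices taken from the cost table in this article.
GEMINI_3_1_PRO = Pricing(input_per_million=2.00, output_per_million=12.00)
OPUS_4_6 = Pricing(input_per_million=5.00, output_per_million=25.00)

# Hypothetical workload, chosen only for illustration.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

gemini_cost = GEMINI_3_1_PRO.cost(INPUT_TOKENS, OUTPUT_TOKENS)
opus_cost = OPUS_4_6.cost(INPUT_TOKENS, OUTPUT_TOKENS)
ratio = gemini_cost / opus_cost

print(f"Gemini 3.1 Pro: ${gemini_cost:,.2f}")
print(f"Opus 4.6:       ${opus_cost:,.2f}")
print(f"Gemini / Opus cost ratio: {ratio:.2f}")
```

Because the price ratio is 0.40 on input tokens and 0.48 on output tokens, any mix of the two lands between those bounds at list price.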

Use-Case Recommendations (Current Assessment)

No single "best overall" model exists. The optimal choice depends on the specific use case.

| Use Case | Top Candidate | Evidence |
| --- | --- | --- |
| Abstract reasoning & science | Gemini 3.1 Pro | ARC-AGI-2 77.1%, GPQA 94.3% |
| Competitive programming | Gemini 3.1 Pro | LiveCodeBench Pro Elo 2887 (~500 ahead of 2nd) |
| Agentic (search & MCP) | Gemini 3.1 Pro | BrowseComp 85.9%, MCP Atlas 69.2% |
| Enterprise tasks (finance, legal) | Sonnet 4.6 / Opus 4.6 | GDPval-AA 1633/1606 vs Gemini 1317 |
| Terminal / CLI coding | GPT-5.3-Codex | Custom harness 77.3% (standard: Gemini leads) |
| PC operation agent | Opus 4.6 | OSWorld 72.7% |
| Long context (1M tokens) | Opus 4.6 | MRCR v2 1M 76% (beta) |
| Cost performance | Gemini 3.1 Pro | Half the cost of Opus with comparable reasoning |

Summary: Reading the Asymmetry in Benchmark Disclosure

Gemini 3.1 Pro is a strong model for reasoning, scientific knowledge, and cost efficiency, as evidenced by its first-place ranking on the Artificial Analysis Intelligence Index. However, understanding each vendor's benchmark disclosure strategy changes how "13 out of 16 wins" should be read.

  • Winning against absent competitors. With GPT-5.3-Codex scores published for only 2 of 16 benchmarks, concluding that "Gemini won" across the remaining 14 is premature
  • A 300-point gap in enterprise tasks. The GDPval-AA deficit against Claude models cannot be ignored when evaluating for finance, legal, or other enterprise applications
  • Losing to the previous generation in some areas. MMMU-Pro (multimodal understanding) shows Gemini 3 Pro (81.0%) outperforming 3.1 Pro (80.5%) — newer models don't always surpass their predecessors across the board
  • An industry-wide pattern. Highlighting favorable benchmarks while omitting unfavorable ones is not unique to Google. Selecting by use case and waiting for independent verification remain the rational approach

This asymmetry in benchmark disclosure is likely to deepen. As each vendor pushes agentic capabilities (tool use, multi-step reasoning), standardizing evaluation conditions becomes harder, not easier. The HLE "with tools / without tools" case, where the ranking flips depending on execution conditions, illustrates the problem. Asking "under what conditions was this measured?" rather than looking only at the model name will become the core literacy for model selection going forward.
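The HLE flip can be reproduced directly from the scores quoted earlier. The sketch below is a minimal illustration: it assumes the HLE values map to Gemini 3.1 Pro, Opus 4.6, and GPT-5.2 in the column order of the score table, and the `ranking` helper is hypothetical.

```python
# HLE scores (%) as quoted in the score overview, keyed by evaluation condition.
HLE_SCORES = {
    "without tools": {"Gemini 3.1 Pro": 44.4, "Opus 4.6": 40.0, "GPT-5.2": 34.5},
    "with tools": {"Gemini 3.1 Pro": 51.4, "Opus 4.6": 53.1, "GPT-5.2": 45.5},
}


def ranking(scores: dict[str, float]) -> list[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


rankings = {condition: ranking(scores) for condition, scores in HLE_SCORES.items()}
leaders = {condition: order[0] for condition, order in rankings.items()}

for condition, order in rankings.items():
    print(f"HLE {condition}: {' > '.join(order)}")

if len(set(leaders.values())) > 1:
    print("The leader changes with the evaluation condition.")
```

Keeping the evaluation condition as an explicit key, instead of collapsing each benchmark to a single number per model, makes this kind of flip visible when comparing vendors' tables.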

Update schedule

This article will be updated in detail once Arena scores stabilize and sufficient data accumulates across categories (Vision, Code, etc.).
