Opus 4.6 vs GPT-5.3-Codex vs GPT-5.4
Updated to include GPT-5.4 and Gemini 3.1 Pro
A comparison of Terminal-Bench 2.0 results for Opus 4.6, GPT-5.3-Codex, GPT-5.4, and other recent models.
Anthropic and OpenAI both recently published Terminal-Bench 2.0 results, but in separate charts and a table. I wanted the full picture, so I combined them.
Agentic Coding
Note: All OpenAI models shown at xhigh compute setting. GPT-5.2-Codex appears twice — 64.7% as reported by Anthropic, 64.0% as reported by OpenAI. Harnesses differ: Anthropic & Google used the Terminus-2 harness; OpenAI used Codex. Scores are not directly comparable across providers.
| Model | Accuracy | Source |
|---|---|---|
| GPT-5.3-Codex (xhigh) | 77.3% | OpenAI |
| GPT-5.4 | 75.1% | OpenAI |
| Gemini 3.1 Pro | 68.5% | — |
| Opus 4.6 | 65.4% | Anthropic |
| GPT-5.2-Codex (xhigh) | 64.0–64.7% | Both |
| GPT-5.2 (xhigh) | 62.2% | OpenAI |
| Opus 4.5 | 59.8% | Anthropic |
| Gemini 3 Pro - Thinking (High) | 56.9% | — |
| Sonnet 4.5 | 51.0% | Anthropic |
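
If you want to reproduce the merge yourself, here's a minimal Python sketch that combines the per-provider scores into a single table sorted by best reported score. The numbers come straight from the providers' published results quoted above; the dictionary names and merge logic are just illustrative, not any official format.

```python
# Minimal sketch: merge per-provider Terminal-Bench 2.0 scores into one sorted table.
# Score values mirror the published numbers above; variable names are illustrative.

anthropic_scores = {          # reported by Anthropic (Terminus-2 harness)
    "Opus 4.6": 65.4,
    "GPT-5.2-Codex (xhigh)": 64.7,
    "Opus 4.5": 59.8,
    "Sonnet 4.5": 51.0,
}

openai_scores = {             # reported by OpenAI (Codex harness)
    "GPT-5.3-Codex (xhigh)": 77.3,
    "GPT-5.4": 75.1,
    "GPT-5.2-Codex (xhigh)": 64.0,
    "GPT-5.2 (xhigh)": 62.2,
}

other_scores = {              # source not stated in the combined table
    "Gemini 3.1 Pro": 68.5,
    "Gemini 3 Pro - Thinking (High)": 56.9,
}

def merge(*sources: dict[str, float]) -> dict[str, list[float]]:
    """Collect every reported score per model so disagreements stay visible."""
    merged: dict[str, list[float]] = {}
    for source in sources:
        for model, score in source.items():
            merged.setdefault(model, []).append(score)
    return merged

combined = merge(anthropic_scores, openai_scores, other_scores)

# Sort by the best reported score and print a compact table,
# showing a range (e.g. 64.0-64.7%) when providers disagree.
for model, scores in sorted(combined.items(), key=lambda kv: max(kv[1]), reverse=True):
    shown = "-".join(f"{s:.1f}" for s in sorted(scores))
    print(f"{model:<35} {shown}%")
```

Keeping every reported number per model (rather than picking one) is what makes the GPT-5.2-Codex discrepancy between the two providers visible in the output.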
GPT-5.3-Codex (xhigh) leads at 77.3%, with GPT-5.4 close behind at 75.1%. Gemini 3.1 Pro follows at 68.5%, and the middle of the field (Opus 4.6, both GPT-5.2 variants, and Opus 4.5) is tightly packed between roughly 60% and 65%.
A word of caution: these numbers aren’t directly comparable. The harness used to run the benchmark matters enormously — Anthropic and Google ran their models through the Terminus-2 harness, while OpenAI ran theirs through Codex. Different harnesses can affect scaffolding, tool access, and retry logic, which means some of the gap between providers may reflect the harness, not the model.
When companies benchmark in isolation, you only see their angle. Putting the numbers side by side tells a different story.