Opus 4.6 vs GPT-5.3-Codex vs GPT-5.4
Updated to include GPT-5.4 and Gemini 3.1 Pro
A comparison of Terminal-Bench 2.0 results for Opus 4.6, GPT-5.3-Codex, GPT-5.4, and other recent models.
Anthropic and OpenAI both recently published Terminal-Bench 2.0 results, but in separate charts and a table. I wanted the full picture, so I combined them.
Agentic Coding
Note: All OpenAI models shown at xhigh compute setting. GPT-5.2-Codex appears twice — 64.7% as reported by Anthropic, 64.0% as reported by OpenAI. Harnesses differ: Anthropic & Google used the Terminus-2 harness; OpenAI used Codex. Scores are not directly comparable across providers.
| Model | Accuracy | Source |
|---|---|---|
| GPT-5.3-Codex (xhigh) | 77.3% | OpenAI |
| GPT-5.4 | 75.1% | OpenAI |
| Gemini 3.1 Pro | 68.5% | — |
| Opus 4.6 | 65.4% | Anthropic |
| GPT-5.2-Codex (xhigh) | 64.0–64.7% | Both |
| GPT-5.2 (xhigh) | 62.2% | OpenAI |
| Opus 4.5 | 59.8% | Anthropic |
| Gemini 3 Pro - Thinking (High) | 56.9% | — |
| Sonnet 4.5 | 51.0% | Anthropic |
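
If you want to reproduce the merge yourself, here's a minimal Python sketch that combines the per-provider scores into a single table sorted by best reported score. The numbers come straight from the providers' published results quoted above; the dictionary names and merge logic are just illustrative, not any official format.

```python
# Minimal sketch: merge per-provider Terminal-Bench 2.0 scores into one sorted table.
# Score values mirror the published numbers above; variable names are illustrative.

anthropic_scores = {          # reported by Anthropic (Terminus-2 harness)
    "Opus 4.6": 65.4,
    "GPT-5.2-Codex (xhigh)": 64.7,
    "Opus 4.5": 59.8,
    "Sonnet 4.5": 51.0,
}

openai_scores = {             # reported by OpenAI (Codex harness)
    "GPT-5.3-Codex (xhigh)": 77.3,
    "GPT-5.4": 75.1,
    "GPT-5.2-Codex (xhigh)": 64.0,
    "GPT-5.2 (xhigh)": 62.2,
}

other_scores = {              # source not stated in the combined table
    "Gemini 3.1 Pro": 68.5,
    "Gemini 3 Pro - Thinking (High)": 56.9,
}

def merge(*sources: dict[str, float]) -> dict[str, list[float]]:
    """Collect every reported score per model so disagreements stay visible."""
    merged: dict[str, list[float]] = {}
    for source in sources:
        for model, score in source.items():
            merged.setdefault(model, []).append(score)
    return merged

combined = merge(anthropic_scores, openai_scores, other_scores)

# Sort by the best reported score and print a compact table,
# showing a range (e.g. 64.0-64.7%) when providers disagree.
for model, scores in sorted(combined.items(), key=lambda kv: max(kv[1]), reverse=True):
    shown = "-".join(f"{s:.1f}" for s in sorted(scores))
    print(f"{model:<35} {shown}%")
```

Keeping every reported number per model (rather than picking one) is what makes the GPT-5.2-Codex discrepancy between the two providers visible in the output.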
GPT-5.3-Codex (xhigh) leads at 77.3%, with GPT-5.4 close behind at 75.1%. Gemini 3.1 Pro follows at 68.5%, and the middle of the field (Opus 4.6, both GPT-5.2 variants, and Opus 4.5) is tightly packed between roughly 60% and 65%.
A word of caution: these numbers aren’t directly comparable. The harness used to run the benchmark matters enormously — Anthropic and Google ran their models through the Terminus-2 harness, while OpenAI ran theirs through Codex. Different harnesses can affect scaffolding, tool access, and retry logic, which means some of the gap between providers may reflect the harness, not the model.
When companies benchmark in isolation, you only see their angle. Putting the numbers side by side tells a different story.