
Small coding models on Terminal-Bench 2

Updated on: April 23rd 2026
Original date: Feb 26th 2026

Frontier models get most of the headlines, but the more interesting race is happening one tier down. Here’s how open-weight and smaller models stack up on Terminal-Bench 2.0.

Benchmark Comparison

[Chart: Small coding models, Terminal-Bench 2.0 scores]

Source: Terminal-Bench 2.0 leaderboard. All Qwen MoE models list activated parameter counts (the A-suffix). K2.5-1T-A32B is a 1T-parameter sparse MoE from Moonshot AI with 32B active parameters.

The top of the chart is no longer just a Qwen3.5 story. Qwen3.6-27B now leads this group at 59.3%, outperforming even the much larger Qwen3.5-397B-A17B at 52.5%. Qwen3.6-35B-A3B also lands in the top tier at 51.5%, suggesting Alibaba’s newer generation is pushing small-model coding performance meaningfully higher.

Just below that, the older leaders still hold up well. K2.5-1T-A32B scores 50.8%, Qwen3.5-122B-A10B reaches 49.4%, and Gemma4-31B comes in at 42.9%. In the same general size class, Qwen3.5-27B posts 41.6%, Qwen3.5-35B-A3B scores 40.5%, and Gemma4-26B-A4B trails at 34.2%.

GPT-OSS-120B at 18.7% remains the clearest outlier. Despite a footprint several times larger than the 27B–35B class, it trails models a fraction of its size on disk. Once you add model size as a second dimension, the efficiency story becomes much more interesting than the raw leaderboard alone.

Model | Score | Provider | GGUF | Size
----- | ----- | -------- | ---- | ----
Qwen3.6-27B | 59.3% | Alibaba | unsloth/Qwen3.6-27B-GGUF | 16.8 GB
Qwen3.5-397B-A17B | 52.5% | Alibaba | unsloth/Qwen3.5-397B-A17B-GGUF | 244 GB
Qwen3.6-35B-A3B | 51.5% | Alibaba | unsloth/Qwen3.6-35B-A3B-GGUF | 22.1 GB
K2.5-1T-A32B | 50.8% | Moonshot | unsloth/Kimi-K2.5-GGUF | 621 GB
Qwen3.5-122B-A10B | 49.4% | Alibaba | unsloth/Qwen3.5-122B-A10B-GGUF | 76.5 GB
Gemma4-31B | 42.9% | Google | unsloth/gemma-4-31B-it-GGUF | 18.3 GB
Qwen3.5-27B | 41.6% | Alibaba | unsloth/Qwen3.5-27B-GGUF | 16.7 GB
Qwen3.5-35B-A3B | 40.5% | Alibaba | unsloth/Qwen3.5-35B-A3B-GGUF | 22 GB
Gemma4-26B-A4B | 34.2% | Google | unsloth/gemma-4-26B-A4B-it-GGUF | 16.9 GB
GPT-OSS-120B | 18.7% | OpenAI | unsloth/gpt-oss-120b-GGUF | 62.8 GB
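To make the efficiency angle concrete, here is a short Python sketch that ranks these entries by score per gigabyte, using the scores and Q4_K_M sizes from the table above (the ranking logic itself is ours, not part of the leaderboard):

```python
# Terminal-Bench 2.0 score (%) and Q4_K_M GGUF size (GB), from the table above.
models = {
    "Qwen3.6-27B":       (59.3,  16.8),
    "Qwen3.5-397B-A17B": (52.5, 244.0),
    "Qwen3.6-35B-A3B":   (51.5,  22.1),
    "K2.5-1T-A32B":      (50.8, 621.0),
    "Qwen3.5-122B-A10B": (49.4,  76.5),
    "Gemma4-31B":        (42.9,  18.3),
    "Qwen3.5-27B":       (41.6,  16.7),
    "Qwen3.5-35B-A3B":   (40.5,  22.0),
    "Gemma4-26B-A4B":    (34.2,  16.9),
    "GPT-OSS-120B":      (18.7,  62.8),
}

# Sort by score-per-GB, best first.
ranked = sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (score, size) in ranked:
    print(f"{name:<20} {score / size:5.2f} pts/GB")
# Qwen3.6-27B tops this ranking at ~3.53 pts/GB; K2.5-1T-A32B is last.
```

Score-per-GB is a crude metric (it ignores inference speed and active parameters), but it captures why the giant MoE checkpoints look so expensive here.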
Intelligence vs Size

[Chart: Terminal-Bench 2.0 score vs Q4_K_M GGUF size in GB]

Note: Sizes are GGUF download sizes for the same quantization level, Q4_K_M. This chart compares storage footprint against Terminal-Bench 2.0 score, making the efficiency tradeoff more visible than the leaderboard alone.

Using the same Q4_K_M quantization across the board makes the size comparison much cleaner. The scatter plot shows the main takeaway immediately: Qwen3.6-27B sits in the best part of the frontier here, delivering the highest score while staying under 17 GB. Gemma4-31B and Qwen3.5-27B also look strong on a score-per-GB basis, while the giant MoE checkpoints buy you some extra capability at a very steep storage cost.
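The frontier claim can be made precise: a model sits on the size/score Pareto frontier if no checkpoint of equal or smaller download size scores higher. A minimal sketch over the table's numbers (our own helper, not leaderboard code):

```python
# (name, Terminal-Bench 2.0 score %, Q4_K_M GGUF size GB) from the table above.
models = [
    ("Qwen3.6-27B",        59.3,  16.8),
    ("Qwen3.5-397B-A17B",  52.5, 244.0),
    ("Qwen3.6-35B-A3B",    51.5,  22.1),
    ("K2.5-1T-A32B",       50.8, 621.0),
    ("Qwen3.5-122B-A10B",  49.4,  76.5),
    ("Gemma4-31B",         42.9,  18.3),
    ("Qwen3.5-27B",        41.6,  16.7),
    ("Qwen3.5-35B-A3B",    40.5,  22.0),
    ("Gemma4-26B-A4B",     34.2,  16.9),
    ("GPT-OSS-120B",       18.7,  62.8),
]

def pareto_frontier(entries):
    """Keep models that beat every smaller-or-equal checkpoint's score."""
    frontier, best = [], -1.0
    for name, score, size in sorted(entries, key=lambda e: e[2]):
        if score > best:  # strictly better than everything at least as small
            frontier.append(name)
            best = score
    return frontier

frontier = pareto_frontier(models)
print(frontier)  # → ['Qwen3.5-27B', 'Qwen3.6-27B']
```

On this data only two models survive: everything larger than 16.8 GB is dominated by Qwen3.6-27B, which is exactly the frontier position the chart shows.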

The ceiling for this set is now close to 60%, which is a meaningful jump from the earlier ~52% range. Small and mid-sized open models are improving fast, and the newest Qwen3.6 entries make that trend hard to ignore.