Terminal-Bench-2

Terminal Bench v2: agentic tasks in the command line (CLI).

26 models have a published score.
Rank  Model                       Company          Score
1     Claude Mythos 5             Anthropic        88.0
2     Claude Fable 5              Anthropic        84.3
3     GPT-5.5                     OpenAI           82.7
4     Claude Mythos Preview       Anthropic        82.0
5     GLM-5.2                     Zhipu AI         81.0
6     GPT-5.3-Codex               OpenAI           77.3
7     Gemini 3.5 Flash            Google DeepMind  76.2
8     Qwen3.7-Max                 Alibaba          69.7
9     Claude Opus 4.7             Anthropic        69.4
10    Gemini 3.1 Pro              Google DeepMind  68.5
11    MiMo V2.5 Pro               Xiaomi           68.4
12    Kimi K2.6                   Moonshot AI      66.7
13    MiniMax M3                  MiniMax          66.0
14    MiMo V2.5                   Xiaomi           65.8
15    Claude Opus 4.6             Anthropic        65.4
16    GLM-5.1                     Zhipu AI         63.5
17    Qwen3.6-Plus                Alibaba          61.6
18    Qwen3.6-27B                 Alibaba          59.3
19    MiniMax M2.7                MiniMax          57.0
20    Gemini 3 Pro                Google DeepMind  56.9
21    Hunyuan Hy3-preview         Tencent          54.4
22    Nemotron 3 Ultra 550B-A55B  Nvidia           54.0
23    Qwen3.6-35B-A3B             Alibaba          51.5
24    Step 3.5 Flash              StepFun          51.0
25    GLM-4.7                     Zhipu AI         41.0
26    Qwen3-Coder-Next            Alibaba          36.2