Coding
Terminal-Bench-2
Terminal Bench v2: agentic tasks in the CLI.
26 models have reported a score
| # | Model | Company | Score |
|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | 88.0 |
| 2 | Claude Fable 5 | Anthropic | 84.3 |
| 3 | GPT-5.5 | OpenAI | 82.7 |
| 4 | Claude Mythos Preview | Anthropic | 82.0 |
| 5 | GLM-5.2 | Zhipu AI | 81.0 |
| 6 | GPT-5.3-Codex | OpenAI | 77.3 |
| 7 | Gemini 3.5 Flash | Google DeepMind | 76.2 |
| 8 | Qwen3.7-Max | Alibaba | 69.7 |
| 9 | Claude Opus 4.7 | Anthropic | 69.4 |
| 10 | Gemini 3.1 Pro | Google DeepMind | 68.5 |
| 11 | MiMo V2.5 Pro | Xiaomi | 68.4 |
| 12 | Kimi K2.6 | Moonshot AI | 66.7 |
| 13 | MiniMax M3 | MiniMax | 66.0 |
| 14 | MiMo V2.5 | Xiaomi | 65.8 |
| 15 | Claude Opus 4.6 | Anthropic | 65.4 |
| 16 | GLM-5.1 | Zhipu AI | 63.5 |
| 17 | Qwen3.6-Plus | Alibaba | 61.6 |
| 18 | Qwen3.6-27B | Alibaba | 59.3 |
| 19 | MiniMax M2.7 | MiniMax | 57.0 |
| 20 | Gemini 3 Pro | Google DeepMind | 56.9 |
| 21 | Hunyuan Hy3-preview | Tencent | 54.4 |
| 22 | Nemotron 3 Ultra 550B-A55B | Nvidia | 54.0 |
| 23 | Qwen3.6-35B-A3B | Alibaba | 51.5 |
| 24 | Step 3.5 Flash | StepFun | 51.0 |
| 25 | GLM-4.7 | Zhipu AI | 41.0 |
| 26 | Qwen3-Coder-Next | Alibaba | 36.2 |