Coding
SWE-bench-Pro
Professional version of SWE-bench with more complex issues.
25 models have published a score
| # | Model | Company | Score |
|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | 80.3 |
| 2 | Claude Fable 5 | Anthropic | 80.0 |
| 3 | Claude Mythos Preview | Anthropic | 77.8 |
| 4 | Claude Opus 4.8 | Anthropic | 69.2 |
| 5 | Claude Opus 4.7 | Anthropic | 64.3 |
| 6 | GLM-5.2 | Zhipu AI | 62.1 |
| 7 | Qwen3.7-Max | Alibaba | 60.6 |
| 8 | MiniMax M3 | MiniMax | 59.0 |
| 9 | GPT-5.5 | OpenAI | 58.6 |
| 10 | Kimi K2.6 | Moonshot AI | 58.6 |
| 11 | GLM-5.1 | Zhipu AI | 58.4 |
| 12 | GPT-5.4 | OpenAI | 57.7 |
| 13 | Qwen3.7-Plus | Alibaba | 57.6 |
| 14 | MiMo V2.5 Pro | Xiaomi | 57.2 |
| 15 | GPT-5.3-Codex | OpenAI | 56.8 |
| 16 | Step 3.7 Flash | StepFun | 56.3 |
| 17 | MiniMax M2.7 | MiniMax | 56.2 |
| 18 | MiMo V2.5 | Xiaomi | 56.1 |
| 19 | GPT-5.2 | OpenAI | 55.6 |
| 20 | DeepSeek V4 Pro | DeepSeek | 55.4 |
| 21 | Gemini 3.5 Flash | Google DeepMind | 55.1 |
| 22 | Qwen3.6-27B | Alibaba | 53.5 |
| 23 | Kimi K2.5 | Moonshot AI | 50.7 |
| 24 | Qwen3.6-35B-A3B | Alibaba | 49.5 |
| 25 | Qwen3-Coder-Next | Alibaba | 44.3 |