SWE-bench-Pro

A professional version of SWE-bench with more complex issues.

25 models have published scores
#   Model                  Company          Score
1   Claude Mythos 5        Anthropic        80.3
2   Claude Fable 5         Anthropic        80.0
3   Claude Mythos Preview  Anthropic        77.8
4   Claude Opus 4.8        Anthropic        69.2
5   Claude Opus 4.7        Anthropic        64.3
6   GLM-5.2                Zhipu AI         62.1
7   Qwen3.7-Max            Alibaba          60.6
8   MiniMax M3             MiniMax          59.0
9   GPT-5.5                OpenAI           58.6
10  Kimi K2.6              Moonshot AI      58.6
11  GLM-5.1                Zhipu AI         58.4
12  GPT-5.4                OpenAI           57.7
13  Qwen3.7-Plus           Alibaba          57.6
14  MiMo V2.5 Pro          Xiaomi           57.2
15  GPT-5.3-Codex          OpenAI           56.8
16  Step 3.7 Flash         StepFun          56.3
17  MiniMax M2.7           MiniMax          56.2
18  MiMo V2.5              Xiaomi           56.1
19  GPT-5.2                OpenAI           55.6
20  DeepSeek V4 Pro        DeepSeek         55.4
21  Gemini 3.5 Flash       Google DeepMind  55.1
22  Qwen3.6-27B            Alibaba          53.5
23  Kimi K2.5              Moonshot AI      50.7
24  Qwen3.6-35B-A3B        Alibaba          49.5
25  Qwen3-Coder-Next       Alibaba          44.3