Reasoning

Humanity's Last Exam

The hardest known benchmark: novel academic problems.

27 models have published scores
| Rank | Model | Company | Score |
|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | 59.0 |
| 2 | Muse Spark | Meta | 58.0 |
| 3 | Claude Mythos Preview | Anthropic | 56.8 |
| 4 | Claude Opus 4.7 | Anthropic | 54.7 |
| 5 | Claude Fable 5 | Anthropic | 53.0 |
| 6 | Grok 4 Heavy | xAI | 50.7 |
| 7 | Kimi K2.5 | Moonshot AI | 50.2 |
| 8 | Claude Opus 4.8 | Anthropic | 49.8 |
| 9 | Gemini 3 Deep Think | Google DeepMind | 48.4 |
| 10 | Gemini 3.1 Pro | Google DeepMind | 44.4 |
| 11 | GLM-4.7 | Zhipu AI | 42.8 |
| 12 | Qwen3.7-Max | Alibaba | 41.4 |
| 13 | GLM-5.2 | Zhipu AI | 40.5 |
| 14 | Gemini 3.5 Flash | Google DeepMind | 40.2 |
| 15 | Claude Opus 4.6 | Anthropic | 40.0 |
| 16 | DeepSeek V4 Pro | DeepSeek | 37.7 |
| 17 | Gemini 3 Pro | Google DeepMind | 37.5 |
| 18 | Kimi K2.6 | Moonshot AI | 34.7 |
| 19 | Gemini 3 Flash | Google DeepMind | 33.7 |
| 20 | GLM-5.1 | Zhipu AI | 31.0 |
| 21 | GLM-5 | Zhipu AI | 30.5 |
| 22 | Nemotron 3 Ultra 550B-A55B | Nvidia | 26.7 |
| 23 | Grok 4 | xAI | 25.4 |
| 24 | MiMo V2 Flash | Xiaomi | 22.1 |
| 25 | Nemotron 3 Super | Nvidia | 18.3 |
| 26 | Gemini 3.1 Flash-Lite | Google DeepMind | 16.0 |
| 27 | K-EXAONE 236B-A23B | LG AI Research | 13.6 |