Reasoning

Humanity's Last Exam

The hardest known benchmark: novel academic problems.

27 models have reported a score

#	Model	Company	Score
1	Claude Mythos 5	Anthropic	59.0
2	Muse Spark	Meta	58.0
3	Claude Mythos Preview	Anthropic	56.8
4	Claude Opus 4.7	Anthropic	54.7
5	Claude Fable 5	Anthropic	53.0
6	Grok 4 Heavy	xAI	50.7
7	Kimi K2.5	Moonshot AI	50.2
8	Claude Opus 4.8	Anthropic	49.8
9	Gemini 3 Deep Think	Google DeepMind	48.4
10	Gemini 3.1 Pro	Google DeepMind	44.4
11	GLM-4.7	Zhipu AI	42.8
12	Qwen3.7-Max	Alibaba	41.4
13	GLM-5.2	Zhipu AI	40.5
14	Gemini 3.5 Flash	Google DeepMind	40.2
15	Claude Opus 4.6	Anthropic	40.0
16	DeepSeek V4 Pro	DeepSeek	37.7
17	Gemini 3 Pro	Google DeepMind	37.5
18	Kimi K2.6	Moonshot AI	34.7
19	Gemini 3 Flash	Google DeepMind	33.7
20	GLM-5.1	Zhipu AI	31.0
21	GLM-5	Zhipu AI	30.5
22	Nemotron 3 Ultra 550B-A55B	Nvidia	26.7
23	Grok 4	xAI	25.4
24	MiMo V2 Flash	Xiaomi	22.1
25	Nemotron 3 Super	Nvidia	18.3
26	Gemini 3.1 Flash-Lite	Google DeepMind	16.0
27	K-EXAONE 236B-A23B	LG AI Research	13.6