Reasoning

MMLU-Pro

A harder version of MMLU, with more difficult questions and 10 answer options per question instead of 4.
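To put the scores below in context, here is a minimal sketch of the random-guess baseline implied by the number of answer options. It assumes every question has exactly 10 options and a single correct answer; the function name `chance_accuracy` is hypothetical and not part of any benchmark tooling.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing.

    Assumes one correct answer per question and the same number of
    options on every question (an illustrative simplification).
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 100.0 / num_options


# 10 options per question, as described above.
mmlu_pro_chance = chance_accuracy(10)

# 4 options per question, as in the original MMLU.
mmlu_chance = chance_accuracy(4)

print(f"MMLU-Pro chance baseline: {mmlu_pro_chance:.1f}%")
print(f"MMLU chance baseline:     {mmlu_chance:.1f}%")
```

Under these assumptions, going from 4 to 10 options lowers the guessing floor from 25% to 10%, so a given score reflects less luck than the same score on a 4-option test.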

30 models have published a score.

#	Model	Company	Score (%)
1	Claude Opus 4.7	Anthropic	91.5
2	Claude Opus 4.6	Anthropic	90.5
3	Claude Opus 4.5	Anthropic	90.0
4	Gemini 3 Pro	Google DeepMind	89.8
5	Qwen3.7-Max	Alibaba	89.6
6	Doubao Seed 2.0 Lite	ByteDance	87.7
7	DeepSeek V4 Pro	DeepSeek	87.5
8	Kimi K2.5	Moonshot AI	87.1
9	Grok 4	xAI	87.0
10	Doubao Seed 2.0 Pro	ByteDance	87.0
11	Nemotron 3 Ultra 550B-A55B	Nvidia	86.8
12	Qwen3-Max-Thinking	Alibaba	85.7
13	Gemma 4 (31B dense)	Google DeepMind	85.2
14	Qwen3.6-35B-A3B	Alibaba	85.2
15	DeepSeek V3.2	DeepSeek	85.0
16	K-EXAONE 236B-A23B	LG AI Research	83.8
17	Nemotron 3 Super	Nvidia	83.7
18	EXAONE 4.5 33B	LG AI Research	83.3
19	Gemma 4 26B-A4B	Google DeepMind	82.6
20	Llama 4 Behemoth	Meta	82.2
21	Nova 2 Lite	Amazon	80.9
22	Llama 4 Maverick	Meta	80.5
23	Nemotron 3 Nano	Nvidia	78.3
24	Mistral Large 3	Mistral AI	78.0
25	Llama 4 Scout	Meta	74.3
26	Nova Premier	Amazon	73.3
27	Gemma 4 E4B	Google DeepMind	69.4
28	Reka Flash 3	Reka	65.0
29	Gemma 4 E2B	Google DeepMind	60.0
30	Jamba 1.7 Large	AI21 Labs	57.7