Reasoning
MMLU
Massive Multitask Language Understanding: 57 academic subjects, ~16K questions.
12 models have published a score
| # | Model | Company | Score (%) |
|---|---|---|---|
| 1 | Gemini 3 Pro | Google DeepMind | 91.8 |
| 2 | DeepSeek R1 0528 | DeepSeek | 90.8 |
| 3 | Nova Premier | Amazon | 87.4 |
| 4 | Grok 4 | xAI | 86.6 |
| 5 | Nova Pro | Amazon | 85.9 |
| 6 | Llama 4 Maverick | Meta | 85.5 |
| 7 | Mistral Large 3 | Mistral AI | 85.5 |
| 8 | Command A | Cohere | 85.5 |
| 9 | Nova Lite | Amazon | 80.5 |
| 10 | AFM Server | Apple | 80.0 |
| 11 | Llama 4 Scout | Meta | 79.6 |
| 12 | AFM On-Device | Apple | 67.8 |
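
An MMLU score is typically reported as percent accuracy on four-option multiple-choice questions. The table does not say whether each vendor averaged over all questions (micro) or over the 57 subjects (macro), so the sketch below computes both. The function name `mmlu_accuracy`, the record layout, and the sample answers are hypothetical and purely illustrative; they are not taken from any vendor's evaluation code.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Return (micro, macro) percent accuracy, rounded to one decimal.

    records: iterable of (subject, predicted, gold) tuples, where
    predicted and gold are answer letters such as "A".."D".
    This layout is an assumption made for illustration.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        per_subject[subject][0] += int(predicted == gold)
        per_subject[subject][1] += 1

    if not per_subject:
        raise ValueError("no records to score")

    total_correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())

    # Micro: every question counts equally.
    micro = 100.0 * total_correct / total
    # Macro: every subject counts equally, regardless of its size.
    macro = 100.0 * sum(c / t for c, t in per_subject.values()) / len(per_subject)
    return round(micro, 1), round(macro, 1)


# Toy data only; not real model outputs.
sample = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("astronomy", "D", "D"),
    ("virology", "C", "C"),
]
micro, macro = mmlu_accuracy(sample)
print(micro, macro)
```

The two averages can differ when subjects have unequal question counts, which is one reason published scores from different sources are not always directly comparable.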