Coding

HumanEval

Measures the functional correctness of generated code on 164 hand-written Python programming problems, each checked by running unit tests.
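Functional correctness means a solution is judged by executing it against tests, not by comparing its text to a reference. The sketch below illustrates the idea under stated assumptions: `passes_tests`, `pass_rate`, and the toy `add` problem are hypothetical names made up for illustration, not the official evaluation harness, and the test string is assumed to define a `check(candidate)` function. A real harness would also run untrusted code in a sandbox with a timeout, which this sketch omits.

```python
def passes_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate code defines `entry_point` and every
    assertion in the test code passes; any exception counts as a failure."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # defines the candidate function
        exec(test_src, namespace)        # assumed to define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


def pass_rate(results: list[bool]) -> float:
    """Percentage of problems whose generated solution passed its tests."""
    return 100.0 * sum(results) / len(results)


# Toy problem (hypothetical, not taken from the benchmark).
toy_tests = """
def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
"""
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

results = [
    passes_tests(good, toy_tests, "add"),
    passes_tests(bad, toy_tests, "add"),
]
print(pass_rate(results))  # 50.0
```

With one generated solution per problem, a score computed this way is the share of problems solved on the first attempt.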

8 models have published a score.

#	Model	Company	Score (%)
1	Qwen3.5-Omni-Plus	Alibaba	92.6
2	Mistral Large 3	Mistral AI	92.0
3	Nova Pro	Amazon	89.0
4	Codestral 25.08	Mistral AI	86.6
5	Llama 4 Maverick	Meta	86.4
6	Nova Lite	Amazon	85.4
7	Yi-Lightning	01.AI	83.5
8	DeepSeek V4 Pro	DeepSeek	76.8