Coding

HumanEval

Measures functional correctness on 164 Python coding problems: a generated solution counts as correct only if it passes the problem's unit tests.
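To make "functional correctness" concrete, here is a minimal sketch of how such a score could be computed: each candidate solution is executed against assertion-based tests, and the score is the percentage of problems passed. The helper names `passes` and `score`, the toy `add` problem, and the rounding to one decimal are assumptions for illustration; this is not the leaderboard's actual harness, and it is not stated here how the listed models were sampled or scored.

```python
def passes(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate passes every assertion in test_src.

    Hypothetical helper: assumes test_src defines check(candidate).
    A real harness would run this in a sandbox with a timeout;
    never exec untrusted model output directly like this.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)              # define the candidate function
        exec(test_src, namespace)                   # define check(candidate)
        namespace["check"](namespace[entry_point])  # raises on any failed assert
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False


def score(results: list[bool]) -> float:
    """Percentage of problems solved, rounded to one decimal place."""
    return round(100.0 * sum(results) / len(results), 1)


# Toy example: one correct and one incorrect solution to the same problem.
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(0, 0) == 0\n"
)
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

results = [passes(good, tests, "add"), passes(bad, tests, "add")]
print(score(results))  # 50.0
```

The key design point is that a solution is judged by its behavior under tests rather than by textual similarity to a reference answer.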

8 models have published a score.
#  Model              Company     Score
1  Qwen3.5-Omni-Plus  Alibaba     92.6
2  Mistral Large 3    Mistral AI  92.0
3  Nova Pro           Amazon      89.0
4  Codestral 25.08    Mistral AI  86.6
5  Llama 4 Maverick   Meta        86.4
6  Nova Lite          Amazon      85.4
7  Yi-Lightning       01.AI       83.5
8  DeepSeek V4 Pro    DeepSeek    76.8