Agentic

OSWorld

Computer-use benchmark: real desktop tasks.

14 models have published scores
Rank  Model                  Company          Score
1     Claude Fable 5         Anthropic        85.0
2     Claude Mythos 5        Anthropic        85.0
3     Claude Opus 4.8        Anthropic        83.4
4     Claude Mythos Preview  Anthropic        79.6
5     GPT-5.5                OpenAI           78.7
6     Gemini 3.5 Flash       Google DeepMind  78.4
7     Claude Opus 4.7        Anthropic        78.0
8     GPT-5.4 Pro            OpenAI           75.0
9     GPT-5.4                OpenAI           75.0
10    Kimi K2.6              Moonshot AI      73.1
11    Claude Opus 4.6        Anthropic        72.7
12    Claude Sonnet 4.6      Anthropic        72.5
13    MiniMax M3             MiniMax          70.1
14    GPT-5.3-Codex          OpenAI           64.7