Agentic
OSWorld
Computer-use benchmark: real desktop tasks.
14 models have published scores
| # | Model | Company | Score |
|---|---|---|---|
| 1 | Claude Fable 5 | Anthropic | 85.0 |
| 2 | Claude Mythos 5 | Anthropic | 85.0 |
| 3 | Claude Opus 4.8 | Anthropic | 83.4 |
| 4 | Claude Mythos Preview | Anthropic | 79.6 |
| 5 | GPT-5.5 | OpenAI | 78.7 |
| 6 | Gemini 3.5 Flash | Google DeepMind | 78.4 |
| 7 | Claude Opus 4.7 | Anthropic | 78.0 |
| 8 | GPT-5.4 Pro | OpenAI | 75.0 |
| 9 | GPT-5.4 | OpenAI | 75.0 |
| 10 | Kimi K2.6 | Moonshot AI | 73.1 |
| 11 | Claude Opus 4.6 | Anthropic | 72.7 |
| 12 | Claude Sonnet 4.6 | Anthropic | 72.5 |
| 13 | MiniMax M3 | MiniMax | 70.1 |
| 14 | GPT-5.3-Codex | OpenAI | 64.7 |