Methodology
How we collect, verify and publish data. 128 models, 31 benchmarks, 28 companies.
Sources
Scores published in this atlas come exclusively from verifiable primary and secondary sources:
- Official papers and system cards from labs (OpenAI, Anthropic, Google DeepMind, etc.)
- Technical blogs and model cards at launch
- Hugging Face leaderboards (when applicable)
- Verified third-party technical reports
- LMSYS Chatbot Arena (Arena Elo snapshots with explicit dates)
Note: we do NOT store the AA-Index (Artificial Analysis Intelligence Index) in the catalog because its scale changes with AA methodology (v3 → v4) without our being able to audit the delta. Instead we publish our own Frontier Index, reproducible and documented.
Frontier Index
Reproducible composite derived only from atomic model benchmarks (measured once by the provider, with no scale that drifts over time).
Algorithm: Core-set Percentile Rank
CORE-SET (8 benchmarks): GPQA-Diamond, SWE-bench-Verified,
  AIME-2025, MMLU-Pro, SWE-bench-Pro, Terminal-Bench-2, LiveCodeBench, HLE
for each core benchmark b reported by model m:
  pct_b = (#flagships_with_score_<=_m / #flagships_with_b) * 100
FrontierIndex(m) = (Σ pct_b * weight_b) / (Σ weight_b)   [b ∈ core ∩ m]
  // normalized by PRESENT weights: NO coverage penalty
eligible(m) = (#core_benchmarks_of_m >= 3)
  AND has(reasoning) AND (has(coding) OR has(math))
  // not eligible → "insufficient data", no ranking position
Why percentile rank and not a simple weighted average: Different benchmarks have different natural ceilings. A raw score of 80 on HumanEval (ceiling near 95) does not mean the same as a raw 80 on FrontierMath (ceiling near 50). Percentile rank neutralizes that difference: standing at the 80th percentile of the field means the same thing on every benchmark.
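The percentile step can be sketched as follows. This is a minimal illustration, not the code in frontierIndex.ts: the function name `percentileRank` and both sample pools are made up to show how the same formula treats a saturated benchmark and a hard one.

```typescript
// Hypothetical helper: percentile rank of one score within the flagship pool
// for a single benchmark, following pct_b = (#scores <= m / #scores) * 100.
function percentileRank(score: number, poolScores: number[]): number {
  if (poolScores.length === 0) {
    throw new Error("no flagship reports this benchmark");
  }
  // Count flagships whose score is less than or equal to the model's score.
  const atOrBelow = poolScores.filter((s) => s <= score).length;
  return (atOrBelow * 100) / poolScores.length;
}

// Made-up pools (10 flagships each), for illustration only.
const saturatedBench = [80, 85, 88, 90, 92, 94, 95, 95, 96, 97];
const hardBench = [5, 8, 12, 15, 20, 25, 30, 38, 45, 50];

const lowOnSaturated = percentileRank(80, saturatedBench); // 10
const highOnHard = percentileRank(45, hardBench); // 90
```

In this toy data a raw 80 lands at the bottom of the saturated pool, while a raw 45 lands near the top of the hard one, which is the distortion a raw weighted average would miss.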
Empirical validation
The Frontier Index ranks only over a fixed CORE-SET of 8 benchmarks that a serious flagship reports (GPQA-Diamond, SWE-bench-Verified, SWE-bench-Pro, AIME-2025, MMLU-Pro, Terminal-Bench-2, LiveCodeBench, HLE), with NO coverage penalty and an eligibility gate. We validate it with Spearman rank correlation against 3 external references (June 2026), measured on the production pool (FLAGSHIP_POOL). The table compares the old method (an average of ALL reported benchmarks with a coverage penalty) with the current core-set method:
| Method | AA-v4 | LMArena | llm-stats |
|---|---|---|---|
| Old (all + penalty) | 0.43 | 0.07 | 0.50 |
| Core-set ✓ | 0.97 | 0.75 | 0.89 |
The core-set lifts the average correlation with the two technical indices from ~0.47 to ~0.93 (Artificial Analysis 0.43 → 0.97, llm-stats 0.50 → 0.89). Even LMArena, which measures human conversational preference and is an orthogonal axis, rises from 0.07 to 0.75. The old method averaged self-reported benchmarks over heterogeneous sets and rewarded the VOLUME of benchmarks through its multiplicative coverage penalty, producing absurd inversions (a specialized preview at #1, an open model above the leading flagship just for reporting more benchmarks). The empirical test lives at packages/data/src/empirical.test.ts and runs in CI.
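For readers unfamiliar with the metric, Spearman rank correlation can be sketched as below. This is an illustration under stated assumptions, not the contents of empirical.test.ts: the function names and the five sample scores are hypothetical. Ties receive average ranks, and the coefficient is the Pearson correlation of the two rank vectors.

```typescript
// Average ranks (1-based); tied values share the mean of their positions.
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const out = new Array<number>(values.length).fill(0);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) {
      end++;
    }
    const avgRank = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) {
      out[order[k].i] = avgRank;
    }
    start = end + 1;
  }
  return out;
}

// Plain Pearson correlation of two equally long vectors.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - meanX) * (y[i] - meanY);
    varX += (x[i] - meanX) ** 2;
    varY += (y[i] - meanY) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}

function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("need two vectors of equal length >= 2");
  }
  return pearson(ranks(x), ranks(y));
}

// Hypothetical scores for five models: our index vs. an external reference.
const ourIndex = [91, 84, 77, 60, 55];
const externalIndex = [70, 66, 68, 40, 35];
const rho = spearman(ourIndex, externalIndex); // about 0.9: one adjacent pair is swapped
```

Because only the ordering matters, the two references can live on entirely different scales, which is why the check works against an Elo rating as well as against a 0-100 index.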
Core-set & weights
The core-set is the eight benchmarks with n≥10 in the pool, enough data points for a reliable percentile. The tail with n<10 (IFEval, ARC-AGI-2) is excluded, because that is where cherry-picking lived. Weights (sum 1.0): GPQA-Diamond 0.26, SWE-bench-Verified 0.18, AIME-2025 0.15, MMLU-Pro 0.10, SWE-bench-Pro 0.09, Terminal-Bench-2 0.08, LiveCodeBench 0.07, HLE 0.07. The index renormalizes by PRESENT weights, so reporting more benchmarks no longer inflates the score.
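The renormalization can be sketched with the weights listed above. This is a minimal sketch, not the production implementation: the names `CORE_WEIGHTS` and `frontierIndex` and the sample percentiles are assumptions made for illustration.

```typescript
// Core-set weights as listed above (they sum to 1.0).
const CORE_WEIGHTS: Record<string, number> = {
  "GPQA-Diamond": 0.26,
  "SWE-bench-Verified": 0.18,
  "AIME-2025": 0.15,
  "MMLU-Pro": 0.10,
  "SWE-bench-Pro": 0.09,
  "Terminal-Bench-2": 0.08,
  "LiveCodeBench": 0.07,
  "HLE": 0.07,
};

// percentiles: benchmark name -> percentile rank (0..100), only for the
// benchmarks the model actually reports.
function frontierIndex(percentiles: Record<string, number>): number {
  let weightedSum = 0;
  let presentWeight = 0;
  for (const [bench, pct] of Object.entries(percentiles)) {
    const weight = CORE_WEIGHTS[bench];
    if (weight === undefined) continue; // benchmarks outside the core-set are ignored
    weightedSum += pct * weight;
    presentWeight += weight;
  }
  if (presentWeight === 0) {
    throw new Error("model reports no core-set benchmark");
  }
  // Divide by the weights that are PRESENT, not by 1.0: no coverage penalty.
  return weightedSum / presentWeight;
}

// Hypothetical model reporting three core benchmarks.
const threeBenchIndex = frontierIndex({
  "GPQA-Diamond": 90,
  "SWE-bench-Verified": 80,
  "AIME-2025": 70,
}); // (23.4 + 14.4 + 10.5) / 0.59, roughly 81.9

// Same percentile everywhere: coverage alone does not move the index.
const allEightAt80 = frontierIndex(
  Object.fromEntries(Object.keys(CORE_WEIGHTS).map((b) => [b, 80])),
);
const threeAt80 = frontierIndex({ "GPQA-Diamond": 80, "HLE": 80, "AIME-2025": 80 });
```

Both `allEightAt80` and `threeAt80` come out at 80 (up to floating-point rounding), which illustrates that reporting more benchmarks at the same standing neither raises nor lowers the index.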
Eligibility gate
A model enters the ranking only if it reports ≥3 of the core-set AND covers reasoning + (coding or math). Whoever falls short gets no number: the model is marked "insufficient data" rather than given an invented position. That is why some flagships (Grok 4.20/4.3, ERNIE, Nova) do not appear ranked: their vendors do not publish comparable benchmarks and report LMArena Elo, tau2-Bench or olympiad results instead.
The algorithm lives in packages/core/src/scoring/frontierIndex.ts. Percentiles are computed over FLAGSHIP_POOL (GA flagships, no specialized/preview/legacy), so a model has the same index on every page.
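The gate itself is a small predicate. The sketch below is an assumption-laden illustration rather than the code in frontierIndex.ts: the name `isEligible` is hypothetical, and the category assigned to each core benchmark is our guess for the example (only GPQA-Diamond and HLE are described as reasoning benchmarks elsewhere on this page).

```typescript
type Category = "reasoning" | "coding" | "math" | "knowledge";

// ASSUMPTION: illustrative category mapping, not the site's actual taxonomy data.
const CORE_CATEGORY = new Map<string, Category>([
  ["GPQA-Diamond", "reasoning"],
  ["HLE", "reasoning"],
  ["SWE-bench-Verified", "coding"],
  ["SWE-bench-Pro", "coding"],
  ["Terminal-Bench-2", "coding"],
  ["LiveCodeBench", "coding"],
  ["AIME-2025", "math"],
  ["MMLU-Pro", "knowledge"],
]);

// reported: names of every benchmark the model has a score for.
function isEligible(reported: string[]): boolean {
  // Keep only distinct core-set benchmarks.
  const core = Array.from(new Set(reported)).filter((b) => CORE_CATEGORY.has(b));
  const categories = new Set(core.map((b) => CORE_CATEGORY.get(b)));
  return (
    core.length >= 3 &&
    categories.has("reasoning") &&
    (categories.has("coding") || categories.has("math"))
  );
}

// Three core benchmarks including a reasoning one: ranked.
const ranked = isEligible(["GPQA-Diamond", "SWE-bench-Pro", "Terminal-Bench-2"]); // true
// Three core benchmarks but no reasoning: "insufficient data".
const codingOnly = isEligible(["SWE-bench-Pro", "Terminal-Bench-2", "LiveCodeBench"]); // false
// Reasoning and math, but only two core benchmarks: "insufficient data".
const tooFew = isEligible(["GPQA-Diamond", "AIME-2025", "IFEval"]); // false
```

Returning a boolean instead of a fallback score mirrors the policy above: a model that fails the gate gets no number at all.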
Third-party reasoning scores
The newest agentic flagships (MiniMax M3, several Chinese open models) self-report only coding/agentic benchmarks (SWE-bench-Pro, Terminal-Bench), not the reasoning ones (GPQA-Diamond, HLE) the gate requires. When a vendor skips a core-set reasoning benchmark, we fill the gap with Artificial Analysis’s independent measurement, labeled as third-party on the model page. These scores let the model qualify, but are kept out of the vendor percentile pool (the ruler stays 100% vendor self-reported) — a third-party value is only measured against that vendor ruler, never mixed into it. A model whose third-party data still lacks a reasoning benchmark (e.g. Kimi K2.7-Code) stays unranked.
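The separation between the vendor ruler and third-party values can be sketched as follows. The types, function names and scores are hypothetical and only illustrate the rule: a third-party score is ranked against vendor-reported scores but never added to them.

```typescript
type Source = "vendor" | "third-party";

interface ScoreEntry {
  model: string;
  benchmark: string;
  value: number;
  source: Source;
}

// The ruler for a benchmark contains vendor self-reported scores only.
function vendorRuler(entries: ScoreEntry[], benchmark: string): number[] {
  return entries
    .filter((e) => e.benchmark === benchmark && e.source === "vendor")
    .map((e) => e.value);
}

// Percentile of any value (vendor or third-party) measured on that ruler.
function percentileOnRuler(value: number, ruler: number[]): number {
  if (ruler.length === 0) {
    throw new Error("no vendor scores for this benchmark");
  }
  const atOrBelow = ruler.filter((s) => s <= value).length;
  return (atOrBelow * 100) / ruler.length;
}

// Made-up data: four vendor scores and one third-party measurement for model X.
const entries: ScoreEntry[] = [
  { model: "A", benchmark: "GPQA-Diamond", value: 70, source: "vendor" },
  { model: "B", benchmark: "GPQA-Diamond", value: 80, source: "vendor" },
  { model: "C", benchmark: "GPQA-Diamond", value: 85, source: "vendor" },
  { model: "D", benchmark: "GPQA-Diamond", value: 90, source: "vendor" },
  { model: "X", benchmark: "GPQA-Diamond", value: 86, source: "third-party" },
];

const ruler = vendorRuler(entries, "GPQA-Diamond"); // [70, 80, 85, 90]
const xPercentile = percentileOnRuler(86, ruler); // 75
```

Model X receives a percentile, yet the ruler still has four entries, so the percentiles of models A to D are unaffected by the third-party value.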
Benchmark taxonomy
The 31 benchmarks are organized into 8 categories: Reasoning, Coding, Math, Knowledge, Instruction, Multilingual, Agentic and General. The taxonomy is opinionated but clear: each benchmark lives in a single category.
What we do NOT do
- We do NOT run models. We mirror what verified sources report.
- We do NOT use synthetic data or estimates that are not published.
- We do NOT receive payment to include or feature models.
- We do NOT have sponsored rankings.
- We do NOT offer a public API: the only way to consume the data is this site.
Update policy
Data is reviewed at the launch of every frontier model. When a new model launches with official scores, we add it. When a reported source is corrected or amended, we update. When a score is identified as contaminated by training data, we flag it.
Each significant change goes to the changelog with its date and reason.
Hardware estimation
The Hardware Checker estimates whether a model fits in your GPU using an explicit formula:
VRAM = params × bytes_per_param + KV_cache(context) + overhead
Full detail (bytes per param of each quantization, KV cache scale, MoE caveat, Apple unified memory) is on the Hardware Checker page. The estimate is best-effort: real numbers can vary by 5-15% depending on framework, batch size and KV-cache compression.
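As a rough illustration of the formula, the sketch below plugs in commonly used values. Everything specific here is an assumption and not the Hardware Checker's actual implementation: the nominal bytes per parameter, the standard KV-cache size expression (keys and values per layer, KV head and token), the flat 1 GiB overhead, and the hypothetical 8B model shape.

```typescript
// ASSUMPTION: nominal sizes; real quantization formats carry extra metadata.
const BYTES_PER_PARAM = { fp16: 2, int8: 1, int4: 0.5 } as const;
type Quant = keyof typeof BYTES_PER_PARAM;

interface ModelShape {
  params: number; // total parameter count
  layers: number; // transformer layers
  kvHeads: number; // key/value heads per layer
  headDim: number; // dimension of each head
}

const GIB = 1024 ** 3;

// KV cache: 2 tensors (K and V) per layer, per KV head, per head dim, per token.
function kvCacheBytes(m: ModelShape, contextTokens: number, bytesPerValue = 2): number {
  return 2 * m.layers * m.kvHeads * m.headDim * contextTokens * bytesPerValue;
}

// VRAM = params × bytes_per_param + KV_cache(context) + overhead
function estimateVramGiB(
  m: ModelShape,
  quant: Quant,
  contextTokens: number,
  overheadGiB = 1, // ASSUMPTION: flat allowance for runtime buffers
): number {
  const weightBytes = m.params * BYTES_PER_PARAM[quant];
  return (weightBytes + kvCacheBytes(m, contextTokens)) / GIB + overheadGiB;
}

// Hypothetical dense 8B model with an 8192-token context.
const shape: ModelShape = { params: 8e9, layers: 32, kvHeads: 8, headDim: 128 };
const kvGiB = kvCacheBytes(shape, 8192) / GIB; // exactly 1 GiB for this shape
const fp16GiB = estimateVramGiB(shape, "fp16", 8192); // about 16.9 GiB
const int4GiB = estimateVramGiB(shape, "int4", 8192); // about 5.7 GiB
```

Under these assumptions the fp16 variant would not fit a 16 GB card while the int4 variant would. The sketch treats the model as dense and ignores the MoE and unified-memory caveats mentioned above.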
Editorial tone
Numbers without context are numbers without meaning. When a model launches with a record score, we try to explain why it matters, what methodology the benchmark uses, and whether there are caveats (contamination, benchmark version, evaluation conditions). We prefer honesty over marketing — even when that means saying we do not know.