Methodology

Performance Metrics

Each task in FrontierSWE produces a raw performance metric natural to its problem domain.

Common metrics by task category:

Performance Engineering: Speedup relative to unoptimized code (e.g. 3.2×)
Migration & Compilation: Pass Rate, the fraction of the test suite passing (e.g. 42%)
Research: Prediction Quality against held-out ground truth (e.g. ρ = 0.68)
Compression: Ratio of compressed to original size (e.g. 0.34)

Normalization

Because these raw metrics live on different scales and have different units, we normalize them into a single comparable score per task.

We first apply a variance-stabilizing transform appropriate to the metric type: a logarithm for speedup ratios, a logit for pass rates, and Fisher's arctanh for correlations. This ensures that multiplicative gains in performance tasks and high-end improvements in pass-rate tasks are weighted in proportion to their difficulty.
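A minimal sketch of these transforms, using only the standard library (the function name and metric-type labels are illustrative, not part of FrontierSWE's API):

```python
import math

def stabilize(value, metric_type):
    """Apply a variance-stabilizing transform suited to the metric type."""
    if metric_type == "speedup":
        # Speedups are multiplicative, so take the log: 2x -> 4x counts
        # the same as 1x -> 2x.
        return math.log(value)
    if metric_type == "pass_rate":
        # Pass rates are bounded in (0, 1); the logit stretches the
        # high end so 0.90 -> 0.99 is rewarded more than 0.50 -> 0.59.
        return math.log(value / (1.0 - value))
    if metric_type == "correlation":
        # Correlations are bounded in (-1, 1); Fisher's arctanh
        # stabilizes their variance near the extremes.
        return math.atanh(value)
    raise ValueError(f"unknown metric type: {metric_type}")
```

Each transform maps a bounded or multiplicative scale onto an unbounded one, which is what makes the z-scoring in the next step meaningful.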

We then compute a z-score for each task against a frozen reference cohort of frontier models evaluated at launch (April 2026), using the Median Absolute Deviation (MAD) in place of the standard deviation so that outlier runs cannot distort the scale.

Because these normalization statistics remain frozen, scores reflect absolute progress relative to a fixed reference point rather than shifting as new models are added to the leaderboard.

Scoring

overall_score = mean(z₁, z₂, ..., zₙ)

A model's overall FrontierSWE score is the mean of its per-task z-scores. Tasks where a model fails to produce a correct implementation contribute a fixed penalty rather than a performance score, so reliability matters alongside raw capability.
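The aggregation rule above can be sketched in a few lines; the penalty value of -2.0 is a hypothetical placeholder, not the benchmark's actual constant:

```python
FAIL_PENALTY = -2.0  # hypothetical fixed penalty for a failed task

def overall_score(per_task_z):
    """Mean of per-task z-scores. A None entry marks a task where the
    model failed to produce a correct implementation; it contributes
    the fixed penalty instead of a performance-derived score."""
    scores = [FAIL_PENALTY if z is None else z for z in per_task_z]
    return sum(scores) / len(scores)
```

Under this rule a model that fails a task outright is always worse off than one that produces a slow but correct solution, so reliability is priced into the aggregate.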

We report both an aggregate score across all tasks and a breakdown by category, since a model that excels at performance engineering may struggle with open-ended research and vice versa.