| # | Model | Harness | Score | Correctness | Performance | Research |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Claude Code | +2.47 | +2.81 | +2.14 | +1.92 |
| 2 | GPT-5.4 | Codex | +1.89 | +1.53 | +2.41 | +1.74 |
| 3 | Claude Opus 4.6 | Cursor | +1.64 | +2.12 | +1.48 | +2.08 |
| 4 | Gemini 3.1 Pro | Aider | +1.21 | +0.93 | +1.67 | +1.18 |
| 5 | GPT-5.4 | OpenCode | +0.84 | +1.11 | +0.72 | +0.89 |
| 6 | Claude Sonnet 4.6 | Claude Code | +0.61 | +0.83 | +0.54 | +0.47 |
| 7 | Grok 4 | Cursor | +0.33 | +0.41 | +0.28 | +0.19 |
| 8 | DeepSeek V3.2 | Aider | +0.11 | +0.24 | -0.13 | +0.31 |
| 9 | Gemini 3.1 Pro | OpenCode | -0.28 | -0.14 | -0.31 | -0.22 |
| 10 | Qwen 3.5 | Aider | -0.57 | -0.43 | -0.68 | -0.39 |
| 11 | Llama 4 Maverick | OpenCode | -0.91 | -0.72 | -1.14 | -0.63 |
| 12 | Kimi K2.5 | Aider | -1.18 | -1.03 | -1.28 | -0.97 |
| 13 | Claude Sonnet 4.6 | Cursor | -1.34 | -1.21 | -1.47 | -1.12 |
| 14 | DeepSeek V3.2 | OpenCode | -1.72 | -1.58 | -1.83 | -1.49 |

Scores are z-scores measured against the launch cohort median and averaged across tasks. A score of 0 matches the median; positive scores are above it and negative scores are below it.

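
As a rough illustration of the scoring note above, the sketch below centers each raw task score on the cohort median and averages the results across tasks. The choice of scale (the cohort's population standard deviation), the function names, and the raw numbers are all assumptions made for this example; the leaderboard does not state how its z-scores are scaled.

```python
from statistics import median, pstdev


def median_centered_z(raw, cohort):
    """Distance of `raw` from the cohort median, in units of cohort spread.

    Assumption: spread is the population standard deviation of the cohort.
    The leaderboard does not say which scale it actually uses.
    """
    spread = pstdev(cohort)
    if spread == 0:
        return 0.0  # every cohort entry scored the same; no spread to scale by
    return (raw - median(cohort)) / spread


def aggregate_score(entry_raw, cohort_raw):
    """Average the per-task z-scores; both dicts are keyed by task name."""
    zs = [median_centered_z(entry_raw[task], cohort_raw[task]) for task in entry_raw]
    return sum(zs) / len(zs)


# Hypothetical raw scores, for illustration only.
cohort_raw = {
    "task_a": [10.0, 20.0, 30.0],
    "task_b": [1.0, 2.0, 3.0, 4.0],
}
entry_raw = {"task_a": 20.0, "task_b": 4.0}

score = aggregate_score(entry_raw, cohort_raw)
print(f"{score:+.2f}")
```

Under these assumptions, an entry that sits exactly at the median on one task contributes 0 for that task, so its aggregate is pulled toward 0 regardless of how well it did elsewhere.
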
- Reimplement Dart's code formatter in Haskell.
- Build an AOT compiler for Lua 5.4 to native x86-64 ELF.
- Optimize Mamba2 inference on NVIDIA B200.
- Rewrite FFmpeg's libswscale pixel format converter.
- Optimize the Pyright type checker.
- Optimize the Revideo rendering pipeline.
- Optimize a dependent type checker.
- Train a model for protein fitness prediction.
- Train a model for molecular property prediction.
- Train a model for CRISPRi perturbation prediction.
- Build an RL post-training pipeline for a board game.
- Train a model for RNA secondary structure prediction.