FrontierSWE

#	Model	Harness	AVG RANK	Dominance	Implementation	Performance	Research
1	GPT-5.5	Codex	2.53	83%	3.50	1.89	2.83
2	Claude Opus 4.7	Claude Code	3.56	72%	2.90	4.00	3.33
3	Claude Opus 4.6	Claude Code	4.18	65%	4.00	4.33	4.00
4	GPT-5.4	Codex	4.29	63%	3.40	5.67	1.67
5	Composer 2.5	Cursor CLI	5.71	48%	4.50	7.22	3.17
6	Gemini 3.1 Pro	Gemini CLI	5.79	47%	7.30	4.11	8.33
7	DeepSeek V4 Pro	Claude Code	6.76	36%	6.70	6.83	6.67
8	Kimi K2.6	Kimi CLI	7.12	32%	5.60	7.89	7.33
9	Kimi K2.5	Kimi CLI	7.41	29%	8.00	6.56	9.00
10	Qwen3.6-Plus	Qwen Code	7.65	26%	9.10	6.50	8.67
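
The page does not define the AVG RANK and Dominance columns, but the published figures are consistent with two simple relationships. AVG RANK matches the mean of the three category ranks weighted by task count (5 Implementation, 9 Performance, 3 Research). Dominance matches (N - AVG RANK) / (N - 1) for N = 10 models. The sketch below checks this against four rows of the table. Both formulas are inferences from the numbers, not a documented methodology, and the helper names are hypothetical. Because the category ranks are rounded to two decimals, a recomputed value can differ from the reported one by about 0.01.

```python
# Task counts per category, taken from the section headings on this page.
TASK_COUNTS = {"implementation": 5, "performance": 9, "research": 3}
NUM_MODELS = 10

# (model, reported avg rank, reported dominance %, category ranks), copied from the table.
ROWS = [
    ("GPT-5.5", 2.53, 83,
     {"implementation": 3.50, "performance": 1.89, "research": 2.83}),
    ("Claude Opus 4.7", 3.56, 72,
     {"implementation": 2.90, "performance": 4.00, "research": 3.33}),
    ("Gemini 3.1 Pro", 5.79, 47,
     {"implementation": 7.30, "performance": 4.11, "research": 8.33}),
    ("Qwen3.6-Plus", 7.65, 26,
     {"implementation": 9.10, "performance": 6.50, "research": 8.67}),
]


def weighted_avg_rank(category_ranks):
    """Mean of category ranks weighted by the number of tasks per category (assumed formula)."""
    total_tasks = sum(TASK_COUNTS.values())
    weighted_sum = sum(category_ranks[name] * count for name, count in TASK_COUNTS.items())
    return weighted_sum / total_tasks


def dominance(avg_rank, num_models=NUM_MODELS):
    """Share of the other models outranked on average (assumed formula)."""
    return (num_models - avg_rank) / (num_models - 1)


def check_rows(rows=ROWS):
    """Recompute both columns and return (model, avg_rank, dominance_percent) tuples."""
    results = []
    for model, _reported_rank, _reported_dom, ranks in rows:
        rank = weighted_avg_rank(ranks)
        results.append((model, rank, 100 * dominance(rank)))
    return results


if __name__ == "__main__":
    for (model, rank, dom), (_, reported_rank, reported_dom, _) in zip(check_rows(), ROWS):
        print(f"{model}: rank {rank:.2f} (reported {reported_rank}), "
              f"dominance {dom:.0f}% (reported {reported_dom}%)")
```

If this reading is right, Dominance carries the same information as AVG RANK, rescaled so that a model ranked first on every task would score 100% and one ranked last on every task would score 0%.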

Task Performance

For each task, we show the best score achieved by any model.

Implementation (5 tasks)

Implementation tasks challenge agents to build complex software systems from scratch or to reimplement existing ones in a different language. No model completed any of these tasks successfully in any trial, so we used the best@5 test pass rate as a partial reward to rank models.
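
One way to read "best@5 test pass rate" is: run five trials, compute the fraction of tests passed in each, and keep the maximum. A task counts as solved only if some trial passes every test. The sketch below follows that reading. The function names and the input shape (a list of (passed, total) pairs) are hypothetical, and the sample numbers in the usage note are made up for illustration, not benchmark data.

```python
from typing import Sequence, Tuple

Trial = Tuple[int, int]  # (tests passed, total tests) for one trial


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tests passed in a single trial."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


def best_at_k(trials: Sequence[Trial]) -> float:
    """Highest pass rate across the given trials (best@k, where k = len(trials))."""
    if not trials:
        raise ValueError("at least one trial is required")
    return max(pass_rate(passed, total) for passed, total in trials)


def success_count(trials: Sequence[Trial]) -> int:
    """Number of trials that passed every test."""
    return sum(1 for passed, total in trials if passed == total)
```

For example, `best_at_k([(10, 100), (16, 100), (0, 100), (5, 100), (12, 100)])` picks out the second trial's rate, while `success_count` on the same list is 0 because no trial passed all 100 tests. That is the shape of a "0/5 success rate" shown alongside a nonzero test pass rate.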

01 Implementation

PostgreSQL 18 on SQLite

Build a PostgreSQL 18 server in Zig that uses SQLite for storage.

0/5 success rate

16% test pass rate

Claude Opus 4.6 (Claude Code)

02 Implementation

Wan 2.1 on MAX/Mojo

With Modular

Implement Wan 2.1 text-to-video inference on Modular's MAX/Mojo stack.

0/5 success rate

50% workloads passed

Claude Opus 4.6 (Claude Code)

03 Implementation

Git to Zig

Reimplement git v2.47.0 in Zig.

0/5 success rate

23% test pass rate

Claude Opus 4.6 (Claude Code)

04 Implementation

Dart → Haskell

Reimplement the dart_style formatter as a standalone Haskell executable.

0/5 success rate

27% test pass rate

GPT-5.5 (Codex)

05 Implementation

Lua Native Compiler

Build a real AOT compiler from Lua 5.4 source to native x86-64 ELF.

0/5 success rate

89% test pass rate

Claude Opus 4.7 (Claude Code)

Research (3 tasks)

Research tasks require agents to design and train ML models or devise novel algorithms, evaluated on held-out data the agent never sees during development.

06 Research

FrogsGame Post-Training

With Thoughtful Lab

Post-train Qwen3-8B to solve FrogsGame boards through tool use.

4% solve rate

GPT-5.5 (Codex)

07 Research

PCQM4Mv2 Molecular Gap Prediction

Train a 2D-only molecular graph regressor for a PCQM4Mv2-derived task.

0.91 exp(−MAE)

Claude Opus 4.6 (Claude Code)

08 Research

Optimizer Design

Design a single optimizer that beats tuned AdamW across diverse ML workloads.

3.2x fewer steps vs. tuned AdamW

GPT-5.5 (Codex)

Performance Optimization (9 tasks)

Performance optimization tasks ask agents to improve speed or compression ratio without breaking existing behavior. Correctness here is a gate, not the goal: a failed correctness check means the agent's solution did not pass all functional tests. Tasks are ranked by 0.5 × correctness + 0.5 × speedup (or 1 - compression ratio), and the speedup shown is best@5.
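
The scoring rule above can be sketched as follows. The page does not say how the correctness and speedup terms are normalized, so this sketch assumes that both are scaled to [0, 1], that correctness is binary per trial, and that a trial failing the correctness gate earns no performance credit. The function and parameter names are hypothetical.

```python
def compression_term(compressed_size: int, original_size: int) -> float:
    """1 - compression ratio, clamped at 0 when the output is larger than the input."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    if compressed_size < 0:
        raise ValueError("compressed_size must be non-negative")
    return max(0.0, 1.0 - compressed_size / original_size)


def performance_score(is_correct: bool, perf_term: float) -> float:
    """0.5 * correctness + 0.5 * perf_term, with correctness acting as a gate.

    perf_term is either a speedup already normalized to [0, 1] (normalization
    is an assumption; the page does not specify it) or 1 - compression ratio.
    """
    if not 0.0 <= perf_term <= 1.0:
        raise ValueError("perf_term must be in [0, 1]")
    if not is_correct:
        # Assumed gate: an incorrect solution gets no credit for being fast or small.
        return 0.0
    return 0.5 * 1.0 + 0.5 * perf_term
```

Under these assumptions, a correct solution with no measurable improvement scores 0.5, and an incorrect one scores 0 however fast it runs, which is one way to make correctness a gate rather than a goal.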

09 Performance

libexpat to x86-64 Assembly

Reimplement the libexpat XML parser as a drop-in x86-64 assembly shared library.

0/5 correctness

n/a speedup

Claude Opus 4.6 (Claude Code)

10 Performance

FFmpeg libswscale Re-implementation

Reimplement FFmpeg's libswscale scaler and pixel-format converter in Zig or Rust.

0/5 correctness

n/a speedup

GPT-5.4 (Codex)

11 Performance

Pyright Type Checking Optimization

Make pyright faster without changing diagnostics.

3/5 correctness

+28% speedup

Claude Opus 4.6 (Claude Code)

12 Performance

Notebook Compression

Build a lossless domain-specific compressor for canonicalized Jupyter notebooks.

3/5 correctness

0.693 reduction

Claude Opus 4.6 (Claude Code)

13 Performance

Revideo Rendering Pipeline Optimization

Speed up Revideo's rendering pipeline without changing video output.

4/5 correctness

+194% speedup

Claude Opus 4.6 (Claude Code)

14 Performance

Cranelift Codegen Optimization

Speed up Wasmtime's Cranelift backend without breaking correctness.

5/5 correctness

+1% speedup

Gemini 3.1 Pro (Gemini CLI)

15 Performance

Dependent Type Checker

Implement a fast dependent type checker for a Martin-Löf-style core language.

0/5 correctness

n/a speedup

GPT-5.4 (Codex)

16 Performance

Granite Mamba2 Inference Optimization

With Prime Intellect

Make a pinned Granite hybrid Mamba2 layer faster on B200 without changing semantics.

5/5 correctness

+148% speedup

GPT-5.5 (Codex)

17 Performance

SGLang Inference System Optimization

Make SGLang serving for Qwen3.5-4B faster on a B200 GPU.

0/5 correctness

n/a speedup

Claude Opus 4.7 (Claude Code)
