FrontierSWE

Benchmarking software engineering skill at the edge of human ability.

Results

| Rank | Model             | Scaffold    | Score (z) |
|-----:|-------------------|-------------|----------:|
| 1    | Claude Opus 4.6   | Claude Code | +2.47     |
| 2    | GPT-5.4           | Codex       | +1.89     |
| 3    | Claude Opus 4.6   | Cursor      | +1.64     |
| 4    | Gemini 3.1 Pro    | Aider       | +1.21     |
| 5    | GPT-5.4           | OpenCode    | +0.84     |
| 6    | Claude Sonnet 4.6 | Claude Code | +0.61     |
| 7    | Grok 4            | Cursor      | +0.33     |
| 8    | DeepSeek V3.2     | Aider       | +0.11     |
| 9    | Gemini 3.1 Pro    | OpenCode    | -0.28     |
| 10   | Qwen 3.5          | Aider       | -0.57     |
| 11   | Llama 4 Maverick  | OpenCode    | -0.91     |
| 12   | Kimi K2.5         | Aider       | -1.18     |
| 13   | Claude Sonnet 4.6 | Cursor      | -1.34     |
| 14   | DeepSeek V3.2     | OpenCode    | -1.72     |

Scores are per-task z-scores computed relative to the launch-cohort median, then averaged across tasks. 0 = median; positive = above median.
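For concreteness, here is a minimal sketch of that scoring rule as we read it: per task, center each raw score on the cohort median, scale by the cohort's spread, then average each entry's per-task z-scores. The function name, the input layout, and the use of the plain standard deviation as the scale are assumptions; the footnote only specifies median centering and averaging across tasks.

```python
import statistics

def frontierswe_scores(raw: dict[str, list[float]]) -> dict[str, float]:
    """Hypothetical reconstruction of the leaderboard score.

    `raw` maps entry name -> list of raw task scores, one per task,
    with every entry scored on the same tasks in the same order.
    """
    entries = list(raw)
    n_tasks = len(next(iter(raw.values())))
    z: dict[str, list[float]] = {e: [] for e in entries}
    for t in range(n_tasks):
        cohort = [raw[e][t] for e in entries]
        med = statistics.median(cohort)
        # Assumption: scale by the cohort standard deviation; a robust
        # spread measure such as the MAD would also fit the description.
        spread = statistics.stdev(cohort)
        for e in entries:
            z[e].append((raw[e][t] - med) / spread)
    # Leaderboard score: mean per-task z-score, so 0 sits at the median.
    return {e: statistics.fmean(z[e]) for e in entries}
```

Centering on the median rather than the mean keeps a single runaway entry, in either direction, from shifting the zero point for everyone else.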