Dart → Haskell

Results

#	Model	Harness	Success Rate	Avg Tests Passed	Best Tests Passed	Avg Tokens	Avg Time
1	Claude Opus 4.8	Claude Code	0/5	22%	30%	18.6M	55m
2	GPT-5.5	Codex	0/5	19%	27%	9.3M	18m
3	Claude Opus 4.7	Claude Code	0/5	16%	25%	84.4M	6h 50m
4	GPT-5.4	Codex	0/5	14%	21%	6.6M	23m
5	Composer 2.5	Cursor CLI	0/5	3.1%	6.7%	6.1M	2h 29m
6	Claude Opus 4.6	Claude Code	0/5	3.0%	3.8%	70.0M	9h 30m
7	Kimi K2.6	Kimi CLI	0/5	2.9%	7.5%	40.2M	6h 58m
8	Gemini 3.1 Pro	Gemini CLI	0/5	2.2%	11%	2.4M	1h 2m
9	Kimi K2.5	Kimi CLI	0/5	1.1%	3.8%	13.7M	54m
10	DeepSeek V4 Pro	Claude Code	0/5	0.4%	0.9%	38.9M	2h 2m
11	Qwen3.6-Plus	Qwen Code	0/5	0.2%	0.8%	31.1M	1h 47m

Background

dart_style is a production formatter whose behavior is defined by both syntax rules and a large accumulated golden corpus. The reference implementation includes separate short-style and tall-style pipelines tied to different Dart language versions.

In this task, the formatter source tree and test corpus act as the operative specification. Rewriting the tool in Haskell requires reproducing the formatter's decisions byte-for-byte across both formatting regimes.

Task

Starting from the Dart dart_style formatter source tree and a Haskell toolchain, the agent must rebuild the formatter as a standalone Haskell executable. The task is not limited to a handful of pretty-printer rules; it covers both of the formatter's modern pipelines, the short-style and tall-style regimes used by different Dart language versions.

Match the command line contract of the formatter.
Preserve byte-for-byte formatting behavior on a large hidden golden suite.
Handle the language-version split that changes how formatting decisions are made.
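
To make the language-version split concrete, here is a minimal sketch of how a reimplementation might choose between the two pipelines. It is written in Python purely for illustration (the task itself requires Haskell). The names `select_pipeline` and `TALL_STYLE_MIN_VERSION` are hypothetical, and the 3.7 cutoff is an assumption not stated in this write-up; the real rule has to be read from the reference source tree.

```python
# Assumption: tall style applies from language version 3.7 onward.
# This threshold is NOT given in the task description; check the reference source.
TALL_STYLE_MIN_VERSION = (3, 7)


def parse_version(text: str) -> tuple[int, int]:
    """Parse a 'major.minor' language version string into a comparable tuple."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def select_pipeline(language_version: str) -> str:
    """Return 'tall' or 'short' depending on the file's language version."""
    if parse_version(language_version) >= TALL_STYLE_MIN_VERSION:
        return "tall"
    return "short"
```

Comparing versions as integer tuples rather than strings matters here: a string comparison would order "3.10" before "3.7" and route such a file to the wrong pipeline.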

Evaluation

The verifier runs a large golden suite derived from formatter tests and additional corpus-sourced or fuzzed files. Anti-cheat, build, and formatter-discovery failures zero the result; otherwise the score is the plain pass rate on the hidden suite.

Hidden files cover both short-style and tall-style formatting behavior.
Performance-oriented files are included, but the public score is still reported as pass rate rather than runtime speed.
Output must match the reference formatter byte-for-byte.
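
The scoring rule described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual verifier: the `RunStatus` fields and the `score` function are hypothetical names, and the real checks behind each gate are not described in this write-up.

```python
from dataclasses import dataclass


@dataclass
class RunStatus:
    """Hypothetical gate flags; any failed gate zeroes the whole result."""
    anti_cheat_ok: bool
    build_ok: bool
    formatter_found: bool


def score(status: RunStatus,
          outputs: dict[str, bytes],
          expected: dict[str, bytes]) -> float:
    """Return the hidden-suite pass rate in [0, 1], or 0.0 if a gate failed."""
    if not (status.anti_cheat_ok and status.build_ok and status.formatter_found):
        return 0.0
    if not expected:
        return 0.0
    # A file passes only if the output is byte-for-byte identical to the golden.
    passed = sum(1 for name, want in expected.items()
                 if outputs.get(name) == want)
    return passed / len(expected)
```

Comparing raw bytes rather than decoded or normalized text means that a single trailing space or a missing final newline fails the file, which is what "byte-for-byte" implies.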

No model completed this task successfully in any of its five runs, so we rank models by overall test pass rate, used as a partial reward.

Environment And Constraints

The task runs without internet access in a CPU-only environment with a preinstalled Haskell toolchain. Agents have the reference formatter source tree and enough tooling to build a standalone executable, but they cannot look up external grammar or library documentation while they work.