14 Performance

Cranelift Codegen Optimization

Results

#  Model            Agent        Correctness  Avg    Best
1  Kimi K2.5        Kimi CLI     2/5          0.398  0.997
2  Claude Opus 4.6  Claude Code  1/5          0.200  1.000
3  GPT-5.4          Codex        1/5          0.199  0.993
4  Gemini 3.1 Pro   Gemini CLI   0/5          0.000  0.000
5  Qwen3.6-Plus     Qwen Code    0/5          0.000  0.000

Background

Wasmtime is a WebAssembly runtime from the Bytecode Alliance, and Cranelift is the optimizing code generator it uses to turn WebAssembly into native machine code quickly. In practice that means this task is not about one isolated compiler pass; it sits inside the backend that a production runtime uses to compile and execute real Wasm programs.

Code generation performance here depends on several interacting subsystems rather than a single hot loop. This task exposes instruction selection, lowering, ISLE rules, e-graph optimization, register allocation, and nearby compiler passes in one codebase, so local wins in one part of the backend can still create regressions elsewhere.
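The kind of local rewrite these subsystems perform can be illustrated with a tiny, hypothetical simplification pass. The mini-IR below is illustrative only; Cranelift's real rules are written in ISLE and applied over an e-graph, but the shape of an algebraic identity rewrite is the same:

```rust
// Hypothetical mini-IR, not Cranelift's actual types or API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// One bottom-up pass applying two classic identity rewrites:
//   x + 0 -> x    and    x * 1 -> x
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => {
            let (a, b) = (simplify(*a), simplify(*b));
            match (&a, &b) {
                (_, Expr::Const(0)) => a,
                (Expr::Const(0), _) => b,
                _ => Expr::Add(Box::new(a), Box::new(b)),
            }
        }
        Expr::Mul(a, b) => {
            let (a, b) = (simplify(*a), simplify(*b));
            match (&a, &b) {
                (_, Expr::Const(1)) => a,
                (Expr::Const(1), _) => b,
                _ => Expr::Mul(Box::new(a), Box::new(b)),
            }
        }
        other => other,
    }
}

fn main() {
    // (x + 0) * 1 simplifies to x.
    let e = Expr::Mul(
        Box::new(Expr::Add(
            Box::new(Expr::Var("x".into())),
            Box::new(Expr::Const(0)),
        )),
        Box::new(Expr::Const(1)),
    );
    assert_eq!(simplify(e), Expr::Var("x".into()));
}
```

An e-graph generalizes this by keeping every equivalent form and choosing the cheapest at extraction time, which is why a rewrite that looks like a pure win locally can still interact badly with later passes such as register allocation.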

Task

The agent starts from the Wasmtime/Cranelift source tree at a pinned commit and must make code generation faster without breaking correctness. The deliverable is modified Rust and ISLE files that the verifier rebuilds and benchmarks across 53 workloads. The permitted optimization surface is broad: instruction selection, lowering, ISLE rules, e-graph optimization, register allocation, and nearby compiler subsystems are all fair game.

Evaluation

The verifier runs a 53-workload suite spanning production-style Wasm, Rust and C libraries, crypto, numerical kernels, and microbenchmarks. It penalizes regressions more harshly than equally sized improvements, maps the weighted harmonic mean of per-workload results to a reward, and then multiplies by a compile-time penalty.

  • Correctness is a hard gate, including build checks, Cranelift tests, and Wasm-spec style validation.
  • Regressions hurt more than equally sized improvements help.
  • One crash or major slowdown can drag the aggregate score down sharply.
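The exact scoring formula is not given here, but its described shape can be sketched. In the sketch below, the speedup ratios, the doubled regression weight, and the compile-time factor are all assumptions chosen to match the three properties above, not the verifier's real constants:

```rust
// Hypothetical reward sketch matching the described behavior:
// regressions weigh more than improvements, the aggregate is a
// weighted harmonic mean, and compile time applies a multiplier.

/// `speedups` holds per-workload ratios where > 1.0 means faster.
/// `compile_time_ratio` > 1.0 means the modified compiler got slower.
fn reward(speedups: &[f64], compile_time_ratio: f64) -> f64 {
    let regression_weight = 2.0; // assumption: regressions count double
    let mut weight_sum = 0.0;
    let mut inv_sum = 0.0;
    for &s in speedups {
        let w = if s < 1.0 { regression_weight } else { 1.0 };
        weight_sum += w;
        inv_sum += w / s; // harmonic mean accumulates reciprocals
    }
    let hmean = weight_sum / inv_sum;
    // Multiplicative compile-time penalty: never a bonus, only a cut.
    let penalty = (1.0 / compile_time_ratio).min(1.0);
    hmean * penalty
}

fn main() {
    // A 20% win paired with a 20% regression nets out below 1.0:
    // the regression's extra weight dominates the harmonic mean.
    assert!(reward(&[1.2, 0.8], 1.0) < 1.0);
}
```

The harmonic mean is itself a deliberate choice in this kind of scoring: one near-zero ratio (a crash or major slowdown) drags the aggregate toward zero regardless of how many workloads improved.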

Environment and Constraints

The task runs offline on 8 CPUs and 32 GB RAM inside a large Rust workspace. The benchmark harness uses careful measurement practices such as pinning and repeated timing, because the entire point is to measure compiler engineering rather than random machine drift.