14 Performance

Cranelift Codegen Optimization

Results

#  Model            Agent        Correctness  Avg    Best
1  Kimi K2.5        Kimi CLI     2/5          0.398  0.997
2  Claude Opus 4.6  Claude Code  1/5          0.200  1.000
3  GPT-5.4          Codex        1/5          0.199  0.993
4  Gemini 3.1 Pro   Gemini CLI   0/5          0.000  0.000
5  Qwen3.6-Plus     Qwen Code    0/5          0.000  0.000

Background

Wasmtime is a WebAssembly runtime from the Bytecode Alliance, and Cranelift is the optimizing code generator it uses to turn WebAssembly into native machine code quickly. In practice that means this task is not about one isolated compiler pass; it sits inside the backend that a production runtime uses to compile and execute real Wasm programs.

Code generation performance here depends on several interacting subsystems rather than a single hot loop. This task exposes instruction selection, lowering, ISLE rules, e-graph optimization, register allocation, and nearby compiler passes in one codebase, so local wins in one part of the backend can still create regressions elsewhere.
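The kind of local rewrite these subsystems perform can be illustrated with a tiny, hypothetical simplification pass. The mini-IR below is illustrative only; Cranelift's real rules are written in ISLE and applied over an e-graph, but the shape of an algebraic identity rewrite is the same:

```rust
// Hypothetical mini-IR, not Cranelift's actual types or API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// One bottom-up pass applying two classic identity rewrites:
//   x + 0 -> x    and    x * 1 -> x
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => {
            let (a, b) = (simplify(*a), simplify(*b));
            match (&a, &b) {
                (_, Expr::Const(0)) => a,
                (Expr::Const(0), _) => b,
                _ => Expr::Add(Box::new(a), Box::new(b)),
            }
        }
        Expr::Mul(a, b) => {
            let (a, b) = (simplify(*a), simplify(*b));
            match (&a, &b) {
                (_, Expr::Const(1)) => a,
                (Expr::Const(1), _) => b,
                _ => Expr::Mul(Box::new(a), Box::new(b)),
            }
        }
        other => other,
    }
}

fn main() {
    // (x + 0) * 1 simplifies to x.
    let e = Expr::Mul(
        Box::new(Expr::Add(
            Box::new(Expr::Var("x".into())),
            Box::new(Expr::Const(0)),
        )),
        Box::new(Expr::Const(1)),
    );
    assert_eq!(simplify(e), Expr::Var("x".into()));
}
```

An e-graph generalizes this by keeping every equivalent form and choosing the cheapest at extraction time, which is why a rewrite that looks like a pure win locally can still interact badly with later passes such as register allocation.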

Task

The agent starts from the Wasmtime/Cranelift source tree at a pinned commit and must make code generation faster without breaking correctness. The deliverable is modified Rust and ISLE files that the verifier rebuilds and benchmarks across 53 workloads. The permitted optimization surface is broad: instruction selection, lowering, ISLE rules, e-graph optimization, register allocation, and nearby compiler subsystems are all fair game.

Evaluation

The verifier runs a 53-workload suite spanning production-style Wasm, Rust and C libraries, crypto, numerical kernels, and microbenchmarks. It penalizes regressions more harshly than equally sized improvements, maps the weighted harmonic mean of per-workload results to a reward, and then multiplies by a compile-time penalty.

  • Correctness is a hard gate, including build checks, Cranelift tests, and Wasm-spec style validation.
  • Regressions hurt more than equally sized improvements help.
  • One crash or major slowdown can drag the aggregate score down sharply.
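The exact scoring formula is not given here, but its described shape can be sketched. In the sketch below, the speedup ratios, the doubled regression weight, and the compile-time factor are all assumptions chosen to match the three properties above, not the verifier's real constants:

```rust
// Hypothetical reward sketch matching the described behavior:
// regressions weigh more than improvements, the aggregate is a
// weighted harmonic mean, and compile time applies a multiplier.

/// `speedups` holds per-workload ratios where > 1.0 means faster.
/// `compile_time_ratio` > 1.0 means the modified compiler got slower.
fn reward(speedups: &[f64], compile_time_ratio: f64) -> f64 {
    let regression_weight = 2.0; // assumption: regressions count double
    let mut weight_sum = 0.0;
    let mut inv_sum = 0.0;
    for &s in speedups {
        let w = if s < 1.0 { regression_weight } else { 1.0 };
        weight_sum += w;
        inv_sum += w / s; // harmonic mean accumulates reciprocals
    }
    let hmean = weight_sum / inv_sum;
    // Multiplicative compile-time penalty: never a bonus, only a cut.
    let penalty = (1.0 / compile_time_ratio).min(1.0);
    hmean * penalty
}

fn main() {
    // A 20% win paired with a 20% regression nets out below 1.0:
    // the regression's extra weight dominates the harmonic mean.
    assert!(reward(&[1.2, 0.8], 1.0) < 1.0);
}
```

The harmonic mean is itself a deliberate choice in this kind of scoring: one near-zero ratio (a crash or major slowdown) drags the aggregate toward zero regardless of how many workloads improved.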

Environment and Constraints

The task runs offline on 8 CPUs and 32 GB RAM inside a large Rust workspace. The benchmark harness uses careful measurement practices such as pinning and repeated timing, because the entire point is to measure compiler engineering rather than random machine drift.