11 Performance

Pyright Type Checking Optimization

Results

#  Model (agent)                  Correctness  Avg    Best
1  Gemini 3.1 Pro (Gemini CLI)    5/5          1.076  1.096
2  GPT-5.4 (Codex)                5/5          1.030  1.075
3  Qwen3.6-Plus (Qwen Code)       5/5          1.021  1.100
4  Kimi K2.5 (Kimi CLI)           5/5          1.001  1.002
5  Claude Opus 4.6 (Claude Code)  3/5          0.837  1.139

Background

Pyright 1.1.400 is a production static analyzer implemented as a large TypeScript codebase. Its runtime is shaped by parsing, binding, type evaluation, caching, and large-codebase traversal, while its external contract is the exact set of diagnostics it reports.

Task

This is a performance task on a real TypeScript codebase, not a clean-room rewrite. The agent starts inside Pyright 1.1.400 and must make it faster while preserving exact diagnostic behavior.

  • Modify the real source tree and rebuild the bundled CLI.
  • Keep Jest tests and diagnostic parity intact.
  • Improve runtime across public and hidden evaluation programs.
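
To make "diagnostic parity" concrete, the sketch below compares two runs' diagnostics as unordered collections in which duplicates still count. It is an illustration only: the `Diagnostic` shape and the function names are hypothetical, and the task description does not say how the real verifier compares output.

```typescript
// Hypothetical diagnostic record; the field names are illustrative and are
// not taken from the verifier or from Pyright's output format.
interface Diagnostic {
    file: string;
    line: number;
    column: number;
    severity: string;
    message: string;
}

// Serialize every field that matters for parity into one comparable key.
function diagnosticKey(d: Diagnostic): string {
    return JSON.stringify([d.file, d.line, d.column, d.severity, d.message]);
}

// True when both runs report exactly the same diagnostics, ignoring order
// but counting duplicates (a multiset comparison; this is an assumption).
function sameDiagnostics(baseline: Diagnostic[], candidate: Diagnostic[]): boolean {
    const a = baseline.map(diagnosticKey).sort();
    const b = candidate.map(diagnosticKey).sort();
    return a.length === b.length && a.every((key, i) => key === b[i]);
}
```

Under this assumption, a changed message, a shifted column, or a dropped warning would each count as drift, which is why an optimization has to leave every code path that produces diagnostics behaviorally identical.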

Evaluation

The verifier applies a hard correctness gate first, then measures geometric-mean speedup on public and hidden workloads using ABBA-style paired timing. Any one of build failure, Jest failure, diagnostic drift, anti-cheat failure, or missing benchmark data zeroes the reward.

  • Public and hidden workloads both contribute to the geometric mean.
  • ABBA-style paired timing is used to reduce drift and measurement noise.
  • Once correctness is preserved, the score has no upper cap: any speedup above 1.0 counts in full.
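
The scoring rule described above can be sketched roughly as follows. This is a minimal illustration and not the verifier's code: `TimedRun`, `GateResult`, `abbaSpeedup`, and `reward` are hypothetical names, and the way timings are aggregated within an ABBA block is an assumption.

```typescript
// A caller-supplied function that runs one workload once and returns the
// elapsed time in milliseconds (hypothetical; no real timer is used here).
type TimedRun = () => number;

// One ABBA block runs baseline, candidate, candidate, baseline. Both
// variants then have the same average position in time, so slow linear
// drift (thermal throttling, cache warm-up) adds equally to both totals.
function abbaSpeedup(runBaseline: TimedRun, runCandidate: TimedRun, blocks: number): number {
    let baselineTotal = 0;
    let candidateTotal = 0;
    for (let i = 0; i < blocks; i++) {
        baselineTotal += runBaseline();   // A
        candidateTotal += runCandidate(); // B
        candidateTotal += runCandidate(); // B
        baselineTotal += runBaseline();   // A
    }
    return baselineTotal / candidateTotal; // above 1.0 means the candidate is faster
}

// Hypothetical summary of the hard correctness gate.
interface GateResult {
    buildOk: boolean;
    jestOk: boolean;
    diagnosticsMatch: boolean;
    antiCheatOk: boolean;
}

// Geometric mean via the mean of logarithms, which stays numerically stable.
function geometricMean(values: number[]): number {
    const logSum = values.reduce((acc, v) => acc + Math.log(v), 0);
    return Math.exp(logSum / values.length);
}

// Reward is 0 unless every gate passes and every workload has a usable
// speedup; otherwise it is the uncapped geometric mean of the speedups.
function reward(gate: GateResult, speedups: number[]): number {
    const gatePassed = gate.buildOk && gate.jestOk && gate.diagnosticsMatch && gate.antiCheatOk;
    const dataOk = speedups.length > 0 && speedups.every((s) => isFinite(s) && s > 0);
    if (!gatePassed || !dataOk) {
        return 0;
    }
    return geometricMean(speedups);
}
```

Because the mean is geometric, a 2x win on one workload is exactly cancelled by a 2x regression on another, so narrow optimizations that slow down other programs do not pay off.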

Environment And Constraints

Agents work offline in an 8-CPU, 32 GB environment with all npm dependencies preinstalled. The task is intentionally CPU-bound: the engineering challenge is understanding and reshaping a real codebase, not outsourcing work to extra hardware.