
11 Performance

Pyright Type Checking Optimization

Results

#	Model	Harness	Correctness	Avg Score	Best Score	Avg Tokens	Avg Time
1	Claude Fable 5	Claude Code	5/5	1.227	1.251	104.4M	8h 53m
2	Grok 4.5	Grok CLI	5/5	1.116	1.153	11.6M	42m
3	GLM-5.2	Claude Code	5/5	1.078	1.131	74.2M	4h 26m
4	Gemini 3.1 Pro	Gemini CLI	5/5	1.076	1.096	9.6M	1h 8m
5	GPT-5.5	Codex	5/5	1.071	1.095	20.1M	15h 25m
6	GPT-5.4	Codex	5/5	1.030	1.075	15.4M	30m
7	Qwen3.6-Plus	Qwen Code	5/5	1.021	1.100	2.4M	32m
8	Kimi K2.5	Kimi CLI	5/5	1.001	1.002	11.5M	3h 33m
9	GLM-5.1	Claude Code	4/5	0.936	1.130	164.2M	14h 45m
10	Claude Opus 4.8	Claude Code	4/4	0.862	1.090	12.2M	10h 33m
11	Claude Opus 4.6	Claude Code	3/5	0.837	1.139	66.0M	5h 22m
12	DeepSeek V4 Pro	Claude Code	3/5	0.805	1.071	18.7M	9h 19m
13	Claude Opus 4.7	Claude Code	2/3	0.535	1.105	58.4M	11h 41m
14	Composer 2.5	Cursor CLI	0/5	0.437	0.474	1.4M	8h 39m
15	Kimi K2.6	Kimi CLI	2/5	0.430	1.077	20.0M	4h 38m

Background

Pyright 1.1.400 is a production static analyzer implemented as a large TypeScript codebase. Its runtime is shaped by parsing, binding, type evaluation, caching, and large-codebase traversal, while its external contract is the exact set of diagnostics it reports.

Task

This is a performance task on a real TypeScript codebase, not a clean-room rewrite. The agent starts inside Pyright 1.1.400 and must make it faster while preserving exact diagnostic behavior.

Modify the real source tree and rebuild the bundled CLI.
Keep Jest tests and diagnostic parity intact.
Improve runtime across public and hidden evaluation programs.
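
Diagnostic parity can be pictured as a multiset comparison between the diagnostics reported by the original build and those reported by the optimized build. The sketch below is illustrative only: the verifier's actual comparison is not described here, the function names are hypothetical, and the field names (`file`, `range`, `severity`, `rule`, `message`) are assumptions loosely modeled on the JSON report that Pyright's CLI can emit.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple

# Hypothetical identity of one diagnostic: file, position, severity, rule, message.
DiagnosticKey = Tuple[str, int, int, str, str, str]


def diagnostic_key(diag: Mapping) -> DiagnosticKey:
    """Reduce one diagnostic record (assumed field layout) to a comparable tuple."""
    start = diag["range"]["start"]
    return (
        diag["file"],
        start["line"],
        start["character"],
        diag["severity"],
        diag.get("rule", ""),  # not every diagnostic is assumed to carry a rule name
        diag["message"],
    )


def diagnostic_drift(
    baseline: Iterable[Mapping], candidate: Iterable[Mapping]
) -> Tuple[List[DiagnosticKey], List[DiagnosticKey]]:
    """Return (missing, unexpected) diagnostics, respecting duplicate counts."""
    base = Counter(diagnostic_key(d) for d in baseline)
    cand = Counter(diagnostic_key(d) for d in candidate)
    missing = sorted((base - cand).elements())     # reported before, gone now
    unexpected = sorted((cand - base).elements())  # new in the optimized build
    return missing, unexpected


def diagnostics_match(baseline: Iterable[Mapping], candidate: Iterable[Mapping]) -> bool:
    """True only when both runs report exactly the same diagnostics."""
    missing, unexpected = diagnostic_drift(list(baseline), list(candidate))
    return not missing and not unexpected
```

Comparing multisets rather than sets means that a diagnostic reported twice by the baseline must also be reported twice by the candidate, and the result does not depend on the order in which diagnostics are emitted.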

Evaluation

The verifier applies a hard correctness gate first, then measures geometric-mean speedup on public and hidden workloads using ABBA-style paired timing. Any build failure, Jest failure, diagnostic drift, anti-cheat failure, or missing benchmark data zeroes the reward.

Public and hidden workloads both contribute to the geometric mean.
ABBA-style paired timing is used to reduce drift and measurement noise.
The score is unbounded above 1.0 once correctness is preserved.
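
The scoring rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the verifier's implementation: the function names, the `rounds` and `clock` parameters, the use of total-time ratios inside each ABBA round, and the treatment of public and hidden workloads as one flat list are all hypothetical.

```python
import math
import time
from typing import Callable, Sequence


def abba_speedup(
    run_baseline: Callable[[], None],
    run_candidate: Callable[[], None],
    rounds: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time baseline (A) and candidate (B) in A-B-B-A order; return time(A) / time(B)."""

    def timed(fn: Callable[[], None]) -> float:
        start = clock()
        fn()
        return clock() - start

    a_total = 0.0
    b_total = 0.0
    for _ in range(rounds):
        # Each variant runs once early and once late in the round, so slow
        # drift (thermal, cache, background load) affects both about equally.
        a_total += timed(run_baseline)
        b_total += timed(run_candidate)
        b_total += timed(run_candidate)
        a_total += timed(run_baseline)
    return a_total / b_total


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of positive values, computed in log space."""
    if not values:
        raise ValueError("geometric mean needs at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def score(gate_passed: bool, speedups: Sequence[float]) -> float:
    """Zero if the correctness gate fails or data is missing, else the mean speedup."""
    if not gate_passed or not speedups:
        return 0.0
    return geometric_mean(speedups)
```

Under these assumptions, a geometric mean weights a 2x speedup on one workload and a 2x slowdown on another as cancelling out, so an optimization that helps one program at the expense of another does not raise the score.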

Environment And Constraints

Agents work offline in an 8-CPU, 32 GB environment with all npm dependencies preinstalled. The task is intentionally CPU-bound: the engineering challenge is understanding and reshaping a real codebase, not outsourcing work to extra hardware.