Canonicalized Jupyter notebooks are structured documents that can mix source code, markdown, tracebacks, HTML, SVG, metadata, and large binary outputs encoded inside JSON. This task treats the corpus as a domain-specific compression problem rather than a generic text or blob compression problem.
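To make the domain-specific angle concrete: a notebook can be split into homogeneous streams (code source, markdown source, serialized outputs) so each stream can get its own model. This is an illustrative sketch only, assuming the canonical form is plain nbformat-style JSON; `split_streams` and the stream names are hypothetical, not part of the task.

```python
import json

def split_streams(nb_bytes: bytes) -> dict:
    """Partition a notebook's JSON into homogeneous streams.

    Hypothetical helper: stream names and grouping are illustrative.
    """
    nb = json.loads(nb_bytes)
    streams = {"code": [], "markdown": [], "outputs": []}
    for cell in nb.get("cells", []):
        text = "".join(cell.get("source", []))
        if cell.get("cell_type") == "code":
            streams["code"].append(text)
            # Outputs often carry base64 blobs (PNGs, HTML) that behave
            # very differently from source text under compression.
            streams["outputs"].append(json.dumps(cell.get("outputs", [])))
        else:
            streams["markdown"].append(text)
    return streams
```

Grouping like-with-like is the usual first step before fitting per-stream dictionaries or entropy models.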
The agent receives a visible corpus of canonicalized Jupyter notebooks and must build a lossless compression pipeline for them. The submission is not a single executable with one verb. It is a three-stage system: fit a model on the visible data, compress a hidden holdout, and decompress it back to the original tree exactly.
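A minimal sketch of such a three-stage system, using the standard library's zlib preset-dictionary support as a stand-in for a learned model. The function names, signatures, and the 32 KiB tail heuristic are assumptions for illustration, not the task's required interface.

```python
import pathlib
import zlib

def fit(visible_paths, model_path):
    # Crude "model": zlib consults only the last 32 KiB of a preset
    # dictionary, so keep a tail of the concatenated visible corpus.
    blob = b"".join(pathlib.Path(p).read_bytes() for p in visible_paths)
    pathlib.Path(model_path).write_bytes(blob[-32 * 1024:])

def compress(data: bytes, zdict: bytes) -> bytes:
    # Compress against the fitted dictionary.
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress(blob: bytes, zdict: bytes) -> bytes:
    # The identical dictionary must be available at decode time.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()
```

A real submission would replace the dictionary tail with something learned from the visible corpus, but the fit/compress/decompress contract stays the same: whatever `fit` produces must ship with `decompress`.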
The three stages have distinct responsibilities:

- fit can build learned artifacts or dictionaries from the visible data.
- compress must minimize the size of the hidden holdout.
- decompress must reconstruct every file byte-for-byte at the same relative paths.

The verifier canonicalizes the notebooks, runs the submitted fit/compress/decompress pipeline, enforces a strict lossless round trip, and then computes the geometric-mean compression ratio r across the hidden holdout. The reported score is the reduction 1 - r, so higher is better.
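The scoring rule can be made concrete. `reduction_score` is a hypothetical helper, assuming the verifier takes the geometric mean of per-file compressed/original size ratios:

```python
import math

def reduction_score(pairs) -> float:
    """pairs: (compressed_size, original_size) per holdout file.

    Hypothetical scorer: geometric mean of per-file ratios r,
    reported as the reduction 1 - r.
    """
    ratios = [c / o for c, o in pairs]
    gmean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return 1.0 - gmean
```

Because the mean is geometric, halving one file's ratio helps as much as halving another's, regardless of file size; a single file that barely compresses drags the whole score down multiplicatively.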
This is a large CPU-and-disk task: 16 vCPU, 32 GiB RAM, 150 GiB scratch space, no GPU, and no internet access. The hidden holdout is large enough that aggressive but invalid tricks are easy for the verifier to catch during reconstruction.
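A round-trip check of the kind the verifier performs might look like the following sketch; `verify_roundtrip` is a hypothetical name and the error strings are illustrative.

```python
import pathlib

def verify_roundtrip(original_root, restored_root):
    """Return a list of mismatches between the two trees.

    Every file under original_root must reappear at the same relative
    path under restored_root with identical bytes.
    """
    orig = pathlib.Path(original_root)
    rest = pathlib.Path(restored_root)
    errors = []
    orig_files = {p.relative_to(orig) for p in orig.rglob("*") if p.is_file()}
    rest_files = {p.relative_to(rest) for p in rest.rglob("*") if p.is_file()}
    # Symmetric difference: files missing from one side or the other.
    for rel in sorted(orig_files ^ rest_files):
        errors.append(f"missing or extra: {rel}")
    # Shared paths must match byte-for-byte.
    for rel in sorted(orig_files & rest_files):
        if (orig / rel).read_bytes() != (rest / rel).read_bytes():
            errors.append(f"bytes differ: {rel}")
    return errors
```

Any shortcut that alters even one byte (re-serializing JSON with different key order, normalizing whitespace, dropping unused metadata) fails this check, which is why "aggressive but invalid" tricks do not survive reconstruction.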