13 Performance

Revideo Rendering Pipeline Optimization

Results

#   Model            Harness      Correctness   Avg     Best
1   Claude Opus 4.6  Claude Code  4/5           1.194   1.970
2   Gemini 3.1 Pro   Gemini CLI   5/5           1.099   1.157
3   Qwen3.6-Plus     Qwen Code    5/5           0.849   0.860
4   Kimi K2.5        Kimi CLI     5/5           0.846   0.869
5   GPT-5.4          Codex        4/5           0.778   0.889

Background

Revideo v0.4.2 is an end-to-end video rendering stack that spans browser rendering, canvas work, FFmpeg encoding, and surrounding orchestration. The task uses the real TypeScript codebase rather than an isolated kernel, so changes can land anywhere in that pipeline as long as the rendered video stays visually aligned with the baseline output.

Task

The agent starts from the Revideo v0.4.2 source tree and a benchmark project with example scenes. The deliverable is a modified codebase that the verifier rebuilds from source, renders against hidden scenes, and measures for speedup. Because the stack spans several layers, from browser rendering and canvas work to FFmpeg encoding and orchestration, there are several places where a submission can win or lose time.

  • Modify the real TypeScript codebase rather than writing an external shim.
  • Preserve the rendered output closely enough for the perceptual checks to pass.
  • Improve geometric-mean speed across a set of hidden scenes.

Evaluation

The verifier builds the agent's modified Revideo from source and renders a set of scenes the agent never saw during development. To cancel cold-start bias it uses an ABBA protocol — baseline, candidate, candidate, baseline — so that each codebase gets one cold-cache run and one warm-cache run and the systematic advantage of going second washes out.

ABBA Render Sequence

  1. A: Baseline
  2. B: Candidate
  3. B: Candidate
  4. A: Baseline

Avg baseline time  = (A₁ + A₂) / 2
Avg candidate time = (B₁ + B₂) / 2
speedup = avg baseline / avg candidate
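The averaging above reduces to a few lines. This is a minimal sketch of the arithmetic only; the `AbbaTimings` shape and `abbaSpeedup` name are hypothetical, not part of the verifier's actual code.

```typescript
// Hypothetical sketch of the ABBA timing arithmetic described above.
// Runs are ordered baseline, candidate, candidate, baseline, so each
// codebase gets one cold-cache run and one warm-cache run.
interface AbbaTimings {
  a1: number; // first baseline render (seconds)
  b1: number; // first candidate render
  b2: number; // second candidate render
  a2: number; // second baseline render
}

function abbaSpeedup(t: AbbaTimings): number {
  const avgBaseline = (t.a1 + t.a2) / 2;
  const avgCandidate = (t.b1 + t.b2) / 2;
  return avgBaseline / avgCandidate;
}
```

A value above 1 means the candidate rendered faster on average; with timings {a1: 12, b1: 10, b2: 8, a2: 10} the speedup is 11/9 ≈ 1.22.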

The public score is a performance score, but it is correctness-gated: visibly wrong video does not count no matter how fast it renders. The check runs on the rendered MP4s — frame-level SSIM (structural similarity, a perceptual metric where 1.0 means identical) and overall duration must both stay within tight bounds for the speedup to count.

SSIM Formula
\mathrm{SSIM}(x,y)=\frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
  • \mu = mean pixel intensity
  • \sigma^2 = variance
  • \sigma_{xy} = covariance
  • C_1, C_2 = stabilization constants

Computed per frame; a scene passes only if every frame stays at or above 0.99 SSIM.

The practical implication is that SSIM is somewhat tolerant of tiny pixel-level noise but much less tolerant of changes to structure or alignment. An optimization can change how a frame gets produced and still pass if the same edges, shapes, and brightness relationships end up on screen, but blur, layout drift, or altered motion will pull the score down quickly.
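To make the formula concrete, here is a single-window SSIM over two flattened grayscale frames. This is an illustrative simplification: production SSIM implementations are windowed (e.g. sliding local windows averaged over the frame), and the `ssim` helper below is an assumption, not the verifier's code. The constants follow the common convention C₁ = (0.01·L)², C₂ = (0.03·L)² for dynamic range L.

```typescript
// Illustrative global (single-window) SSIM over two grayscale frames
// with pixel values in 0–255. Real verifiers typically use a windowed
// variant; this sketch just makes the formula's terms concrete.
function ssim(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("frames must have the same nonzero length");
  }
  const n = x.length;
  const L = 255;              // dynamic range of 8-bit pixels
  const C1 = (0.01 * L) ** 2; // conventional stabilization constants
  const C2 = (0.03 * L) ** 2;

  const mean = (v: number[]) => v.reduce((s, p) => s + p, 0) / n;
  const mx = mean(x);
  const my = mean(y);

  let vx = 0, vy = 0, cov = 0; // variances and covariance
  for (let i = 0; i < n; i++) {
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
    cov += (x[i] - mx) * (y[i] - my);
  }
  vx /= n; vy /= n; cov /= n;

  return ((2 * mx * my + C1) * (2 * cov + C2)) /
         ((mx * mx + my * my + C1) * (vx + vy + C2));
}
```

Identical frames score exactly 1.0; a frame compared against a structurally different one falls far below the 0.99 gate.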

Because the verifier computes SSIM frame by frame against the baseline, timing mistakes are also expensive. A dropped frame, a duplicated frame, or even a one-frame delay means the comparison is suddenly looking at different moments in the animation. Two videos can both look plausible when watched casually and still fail the gate if the frame sequence is no longer aligned.

Putting it together, the verifier follows a three-step pipeline. Only submissions that clear the correctness gate earn a speed score.

01 Render: ABBA timings average two baseline and two candidate renders before scoring.
02 Compare: Duration must stay within roughly 98% to 102% of the baseline video.
03 Gate: Any hidden scene below SSIM 0.99 zeros the speed score.

Once a submission passes the gate, the verifier counts its hidden-scene geometric-mean speedup.
Scoring Formula
\mathrm{score}=\begin{cases} 0, & \text{if any hidden scene fails correctness or is missing} \\ \min\!\left(100,\left(\prod_{i=1}^{n} s_i\right)^{1/n}\right), & \text{otherwise} \end{cases}

s_i=\frac{t_i^{\mathrm{base}}}{t_i^{\mathrm{cand}}}
  • s_i = hidden-scene speedup versus the frozen baseline pipeline
  • The 100× cap only clips extreme outliers; normal runs score below it.
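The gated geometric mean can be sketched directly from the formula above; the `SceneResult` shape and `score` helper are illustrative assumptions, not the verifier's actual interface. Summing logs and exponentiating avoids overflow from multiplying many speedups.

```typescript
// Sketch of the gated scoring rule: any failing or missing scene zeros
// the score; otherwise per-scene speedups combine by geometric mean,
// capped at 100. Shapes and names here are hypothetical.
interface SceneResult {
  passed: boolean;       // SSIM >= 0.99 on every frame AND duration in bounds
  baselineTime: number;  // t_base for this scene
  candidateTime: number; // t_cand for this scene
}

function score(scenes: SceneResult[]): number {
  if (scenes.length === 0 || scenes.some((s) => !s.passed)) return 0;
  // Geometric mean via the log domain: exp(mean(log(s_i))).
  const logSum = scenes.reduce(
    (sum, r) => sum + Math.log(r.baselineTime / r.candidateTime),
    0,
  );
  return Math.min(100, Math.exp(logSum / scenes.length));
}
```

For example, two passing scenes with speedups of 2x and 8x score √(2·8) = 4, while a single failing scene zeros everything regardless of the other speedups.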

Environment And Constraints

The task runs in a Modal container with 8 CPUs, 32 GB RAM, and no internet access. The container image includes the full Revideo monorepo, benchmark scenes, source media, and the preinstalled npm packages that matter for optimization. The work is to reshape a live rendering stack under /app/revideo, not to recreate it from scratch.

Caveats

This task carries a genuine public-data caveat: it is closely related to blog posts about the underlying Revideo optimizations, so the strongest observed results may partly reflect pretraining recall rather than fresh engineering.