13 Performance

Revideo Rendering Pipeline Optimization

Results

#   Model            Harness      Correctness   Avg     Best
1   Claude Opus 4.6  Claude Code  4/5           1.194   1.970
2   Gemini 3.1 Pro   Gemini CLI   5/5           1.099   1.157
3   Qwen3.6-Plus     Qwen Code    5/5           0.849   0.860
4   Kimi K2.5        Kimi CLI     5/5           0.846   0.869
5   GPT-5.4          Codex        4/5           0.778   0.889

Background

Revideo v0.4.2 is an end-to-end video rendering stack that spans browser rendering, canvas work, FFmpeg encoding, and surrounding orchestration. The task uses the real TypeScript codebase rather than an isolated kernel, so changes can land anywhere in that pipeline as long as the rendered video stays visually aligned with the baseline output.

Task

The agent starts from the Revideo v0.4.2 source tree and a benchmark project with example scenes. The deliverable is a modified codebase that the verifier rebuilds from source, renders against hidden scenes, and measures for speedup. Because the stack spans several layers, from browser rendering and canvas work to FFmpeg encoding and orchestration, there are several places where a submission can win or lose time.

  • Modify the real TypeScript codebase rather than writing an external shim.
  • Preserve the rendered output closely enough for the perceptual checks to pass.
  • Improve geometric-mean speed across a set of hidden scenes.

Evaluation

The verifier builds the agent's modified Revideo from source and renders a set of scenes the agent never saw during development. To cancel cold-start bias it uses an ABBA protocol — baseline, candidate, candidate, baseline — so that each codebase gets one cold-cache run and one warm-cache run and the systematic advantage of going second washes out.

ABBA Render Sequence

  1. A: Baseline
  2. B: Candidate
  3. B: Candidate
  4. A: Baseline

Avg baseline time  = (A₁ + A₂) / 2
Avg candidate time = (B₁ + B₂) / 2
speedup = avg baseline / avg candidate
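The averaging above reduces to a few lines. This is a minimal sketch of the arithmetic only; the `AbbaTimings` shape and `abbaSpeedup` name are hypothetical, not part of the verifier's actual code.

```typescript
// Hypothetical sketch of the ABBA timing arithmetic described above.
// Runs are ordered baseline, candidate, candidate, baseline, so each
// codebase gets one cold-cache run and one warm-cache run.
interface AbbaTimings {
  a1: number; // first baseline render (seconds)
  b1: number; // first candidate render
  b2: number; // second candidate render
  a2: number; // second baseline render
}

function abbaSpeedup(t: AbbaTimings): number {
  const avgBaseline = (t.a1 + t.a2) / 2;
  const avgCandidate = (t.b1 + t.b2) / 2;
  return avgBaseline / avgCandidate;
}
```

A value above 1 means the candidate rendered faster on average; with timings {a1: 12, b1: 10, b2: 8, a2: 10} the speedup is 11/9 ≈ 1.22.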

The public score is a performance score, but it is correctness-gated: visibly wrong video does not count no matter how fast it renders. The check runs on the rendered MP4s — frame-level SSIM (structural similarity, a perceptual metric where 1.0 means identical) and overall duration must both stay within tight bounds for the speedup to count.

SSIM Formula
\mathrm{SSIM}(x,y)=\frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
  • \mu = mean pixel intensity
  • \sigma^2 = variance
  • \sigma_{xy} = covariance
  • C_1, C_2 = stabilization constants

Computed per frame; a scene passes only if every frame stays at or above 0.99 SSIM.

The practical implication is that SSIM is somewhat tolerant of tiny pixel-level noise but much less tolerant of changes to structure or alignment. An optimization can change how a frame gets produced and still pass if the same edges, shapes, and brightness relationships end up on screen, but blur, layout drift, or altered motion will pull the score down quickly.
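To make the formula concrete, here is a single-window SSIM over two flattened grayscale frames. This is an illustrative simplification: production SSIM implementations are windowed (e.g. sliding local windows averaged over the frame), and the `ssim` helper below is an assumption, not the verifier's code. The constants follow the common convention C₁ = (0.01·L)², C₂ = (0.03·L)² for dynamic range L.

```typescript
// Illustrative global (single-window) SSIM over two grayscale frames
// with pixel values in 0–255. Real verifiers typically use a windowed
// variant; this sketch just makes the formula's terms concrete.
function ssim(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("frames must have the same nonzero length");
  }
  const n = x.length;
  const L = 255;              // dynamic range of 8-bit pixels
  const C1 = (0.01 * L) ** 2; // conventional stabilization constants
  const C2 = (0.03 * L) ** 2;

  const mean = (v: number[]) => v.reduce((s, p) => s + p, 0) / n;
  const mx = mean(x);
  const my = mean(y);

  let vx = 0, vy = 0, cov = 0; // variances and covariance
  for (let i = 0; i < n; i++) {
    vx += (x[i] - mx) ** 2;
    vy += (y[i] - my) ** 2;
    cov += (x[i] - mx) * (y[i] - my);
  }
  vx /= n; vy /= n; cov /= n;

  return ((2 * mx * my + C1) * (2 * cov + C2)) /
         ((mx * mx + my * my + C1) * (vx + vy + C2));
}
```

Identical frames score exactly 1.0; a frame compared against a structurally different one falls far below the 0.99 gate.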

Because the verifier computes SSIM frame by frame against the baseline, timing mistakes are also expensive. A dropped frame, a duplicated frame, or even a one-frame delay means the comparison is suddenly looking at different moments in the animation. Two videos can both look plausible when watched casually and still fail the gate if the frame sequence is no longer aligned.

Putting it together, the verifier follows a three-step pipeline. Only submissions that clear the correctness gate earn a speed score.

01 Render: ABBA timings average two baseline and two candidate renders before scoring.
02 Compare: Duration must stay within roughly 98% to 102% of the baseline video.
03 Gate: Any hidden scene below SSIM 0.99 zeros the speed score.

Once a submission passes the gate, the verifier counts its hidden-scene geometric-mean speedup.
Scoring Formula
\mathrm{score}=\begin{cases} 0, & \text{if any hidden scene fails correctness or is missing} \\ \min\!\left(100,\left(\prod_{i=1}^{n} s_i\right)^{1/n}\right), & \text{otherwise} \end{cases}

s_i=\frac{t_i^{\mathrm{base}}}{t_i^{\mathrm{cand}}}
  • s_i = hidden-scene speedup versus the frozen baseline pipeline
  • The 100× cap only clips extreme outliers; normal runs score below it.
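The gated geometric mean can be sketched directly from the formula above; the `SceneResult` shape and `score` helper are illustrative assumptions, not the verifier's actual interface. Summing logs and exponentiating avoids overflow from multiplying many speedups.

```typescript
// Sketch of the gated scoring rule: any failing or missing scene zeros
// the score; otherwise per-scene speedups combine by geometric mean,
// capped at 100. Shapes and names here are hypothetical.
interface SceneResult {
  passed: boolean;       // SSIM >= 0.99 on every frame AND duration in bounds
  baselineTime: number;  // t_base for this scene
  candidateTime: number; // t_cand for this scene
}

function score(scenes: SceneResult[]): number {
  if (scenes.length === 0 || scenes.some((s) => !s.passed)) return 0;
  // Geometric mean via the log domain: exp(mean(log(s_i))).
  const logSum = scenes.reduce(
    (sum, r) => sum + Math.log(r.baselineTime / r.candidateTime),
    0,
  );
  return Math.min(100, Math.exp(logSum / scenes.length));
}
```

For example, two passing scenes with speedups of 2x and 8x score √(2·8) = 4, while a single failing scene zeros everything regardless of the other speedups.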

Environment And Constraints

The task runs in a Modal container with 8 CPUs, 32 GB RAM, and no internet access. The container image includes the full Revideo monorepo, benchmark scenes, source media, and the preinstalled npm packages that matter for optimization. The work is to reshape a live rendering stack under /app/revideo, not to recreate it from scratch.

Caveats

This task carries a genuine public-data caveat: it is closely related to blog posts about the underlying Revideo optimizations, so the strongest observed results may partly reflect pretraining recall rather than fresh engineering.