02 Implementation

Wan 2.1 on MAX/Mojo

Results

| # | Model | Agent | Correctness | Avg | Best |
|---|-------|-------|-------------|-----|------|
| 1 | GPT-5.4 | Codex | 0/5 | 20% | 50% |
| 2 | Claude Opus 4.6 | Claude Code | 0/5 | 10% | 50% |
| 3 | Gemini 3.1 Pro | Gemini CLI | 0/5 | 0% | 0% |
| 4 | Kimi K2.5 | Kimi CLI | 0/5 | 0% | 0% |
| 5 | Qwen3.6-Plus | Qwen Code | 0/5 | 0% | 0% |

Background

Wan 2.1 T2V-1.3B is a text-to-video inference pipeline with model code, scheduler logic, and video-specific tensor handling. The task fixes the output settings and asks the agent to port that pipeline from the supplied PyTorch reference into a MAX/Mojo environment.

The supplied reference is a full inference stack rather than an isolated operator. It includes text conditioning, scheduler updates across denoising steps, and video-specific latent and frame handling.

The task fixes the output configuration, including resolution and denoising steps, so the port targets one defined pipeline path.

Wan 2.1 Inference Pipeline (MAX/Mojo only)

Prompt: "a red ball bouncing..."

  1. Encode: UMT5-XXL text encoder → text embeddings (seq × 4096)
  2. Denoise: 1.3B diffusion transformer → video latents (C × F × H × W)
  3. Decode: 3D causal VAE → 480 × 832 output video
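The three stages above can be sketched end to end in plain Python. Every function name and latent dimension below is an illustrative placeholder chosen to match the diagram's shapes, not the real MAX API or Wan's exact internals:

```python
import numpy as np

def encode(prompt: str, seq: int = 16) -> np.ndarray:
    # Stand-in for UMT5-XXL: maps a prompt to (seq, 4096) embeddings.
    return np.zeros((seq, 4096), dtype=np.float32)

def denoise(emb: np.ndarray, steps: int = 8) -> np.ndarray:
    # Stand-in for the 1.3B DiT: iteratively refines (C, F, H, W) latents.
    C, F, H, W = 16, 21, 60, 104  # hypothetical latent dims, not Wan's actual ones
    latents = np.random.default_rng(0).standard_normal((C, F, H, W)).astype(np.float32)
    for _ in range(steps):
        latents = latents * 0.9  # placeholder for a real scheduler update
    return latents

def decode(latents: np.ndarray) -> np.ndarray:
    # Stand-in for the 3D causal VAE: upsamples latents to 480x832 RGB frames.
    _, F, _, _ = latents.shape
    return np.zeros((F, 480, 832, 3), dtype=np.uint8)

frames = decode(denoise(encode("a red ball bouncing...")))
print(frames.shape)  # (21, 480, 832, 3)
```

The point of the sketch is the data flow and shape contract between stages; a real port replaces each stub with a MAX graph or custom Mojo kernel.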

Task

Modular’s MAX is an open-source inference framework written from the ground up in Mojo — a language designed for high-performance AI workloads on both GPU and CPU.

The agent is given a PyTorch implementation of Wan 2.1 T2V-1.3B, a 1.3-billion-parameter open text-to-video diffusion model, and a MAX/Mojo environment, then asked to port the full video generation pipeline so that it produces reference-matching frames on hidden prompts. Every pipeline component must be reimplemented using the MAX Python API and, where needed, custom Mojo kernels.

  • The task fixes resolution and denoising settings so work is focused on the port itself.
  • Reference frames are provided to reduce wasted PyTorch regeneration work during development.
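A port has to reproduce the scheduler update across all denoising steps. As a hedged illustration only, a generic Euler-style flow-matching step looks like the following; this is a common sampler scheme, not necessarily Wan 2.1's exact scheduler, and the velocity here is a synthetic stand-in for a model prediction:

```python
import numpy as np

def euler_flow_step(x: np.ndarray, v: np.ndarray, t: float, t_prev: float) -> np.ndarray:
    """One generic Euler update for a flow-matching sampler: move latents x
    along the predicted velocity v from time t to t_prev."""
    return x + (t_prev - t) * v

# Toy 8-step schedule from t=1.0 (noise) down to t=0.0.
timesteps = np.linspace(1.0, 0.0, 9)
x = np.ones(4)                 # stand-in latents
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    v = x                      # stand-in for the transformer's velocity output
    x = euler_flow_step(x, v, t, t_prev)
print(x)  # each entry ~0.344 (0.875 ** 8)
```

Because the verifier compares frames against fixed references, even small numerical drift in this loop compounds over steps, which is why the port must match the reference scheduler rather than any generically correct sampler.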

Evaluation

Hidden workloads vary prompt length and frame count while holding the output format fixed at 480 × 832 with 8 denoising steps. The verifier compares generated frames against reference outputs and turns those per-workload PSNR checks into a plain correctness fraction.

  • Every hidden workload must return correctly sized, non-blank video output.
  • The public correctness threshold is PSNR of at least 25 dB per hidden workload.
  • For example, if 3 of 4 hidden workloads clear 25 dB, the score is 0.75.
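The scoring described above can be approximated with a small NumPy sketch. The 25 dB threshold comes from the task description; the frame data here is synthetic, and the verifier's exact PSNR implementation is unknown:

```python
import numpy as np

def psnr(ref: np.ndarray, out: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two uint8 frame stacks."""
    mse = np.mean((ref.astype(np.float64) - out.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

def correctness_fraction(pairs, threshold_db: float = 25.0) -> float:
    """Fraction of (reference, generated) workloads clearing the PSNR bar."""
    passed = sum(psnr(r, o) >= threshold_db for r, o in pairs)
    return passed / len(pairs)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(8, 480, 832, 3), dtype=np.uint8)
good = ref.copy()                                              # passes: infinite PSNR
noisy = rng.integers(0, 256, size=ref.shape, dtype=np.uint8)   # fails: ~8 dB
print(correctness_fraction([(ref, good), (ref, noisy)]))  # 0.5
```

Uncorrelated random frames land well under 10 dB, so the 25 dB bar requires genuinely reference-matching output, not merely non-blank video.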

No model was able to complete this task successfully, so we used overall test pass rate as a partial reward to rank models.

Environment

The task runs in a Modal container with a single H100, 8 CPU cores, 64 GB RAM, no internet access, and a four-hour agent budget. The container image includes MAX, Mojo, and the supporting toolchain needed for the port. Model assets sit on a mounted volume.

Constraints

  • There is no access to a verifier-owned hidden baseline. Although PyTorch and diffusers are present for verification purposes, the anti-cheat setup is designed around a genuine MAX/Mojo port.
  • Pre-scoring source scans require at least one max. import in the candidate.
  • The same scans reject imports of torch, transformers, or diffusers.
  • They also reject subprocess, os.system, sys.modules, or __import__ tricks used to reach those packages, so model computation must stay on MAX/Mojo.
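A pre-scoring scan like the one described can be sketched with the standard-library ast module. The verifier's actual rules are unknown, so this is an illustrative approximation of the checks the constraints name:

```python
import ast

BANNED_IMPORTS = {"torch", "transformers", "diffusers", "subprocess"}
BANNED_ATTRS = {("os", "system"), ("sys", "modules")}

def scan(source: str) -> list[str]:
    """Return a list of violations; an empty list means the candidate passes."""
    violations, has_max = [], False
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                root = name.split(".")[0]
                if root == "max":
                    has_max = True          # satisfies the max. import requirement
                if root in BANNED_IMPORTS:
                    violations.append(f"banned import: {name}")
        elif isinstance(node, ast.Attribute):
            # Catches os.system(...) and sys.modules tricks.
            if isinstance(node.value, ast.Name) and (node.value.id, node.attr) in BANNED_ATTRS:
                violations.append(f"banned use: {node.value.id}.{node.attr}")
        elif isinstance(node, ast.Call):
            # Catches __import__("torch")-style evasion.
            if isinstance(node.func, ast.Name) and node.func.id == "__import__":
                violations.append("banned call: __import__")
    if not has_max:
        violations.append("missing max. import")
    return violations

print(scan("import max.graph\nimport torch\n"))  # ['banned import: torch']
```

A static scan like this can be evaded by sufficiently indirect code, which is presumably why the task pairs it with reference-matching PSNR checks: output that matches the reference had to come from a working pipeline.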