02 Implementation

Wan 2.1 on MAX/Mojo

Results

| # | Model | Agent | Correctness | Avg | Best |
|---|-------|-------|-------------|-----|------|
| 1 | GPT-5.4 | Codex | 0/5 | 20% | 50% |
| 2 | Claude Opus 4.6 | Claude Code | 0/5 | 10% | 50% |
| 3 | Gemini 3.1 Pro | Gemini CLI | 0/5 | 0% | 0% |
| 4 | Kimi K2.5 | Kimi CLI | 0/5 | 0% | 0% |
| 5 | Qwen3.6-Plus | Qwen Code | 0/5 | 0% | 0% |

Background

Wan 2.1 T2V-1.3B is a text-to-video inference pipeline with model code, scheduler logic, and video-specific tensor handling. The task fixes the output settings and asks the agent to port that pipeline from the supplied PyTorch reference into a MAX/Mojo environment.

The supplied reference is a full inference stack rather than an isolated operator. It includes text conditioning, scheduler updates across denoising steps, and video-specific latent and frame handling.

The task fixes the output configuration, including resolution and denoising steps, so the port targets one defined pipeline path.

Wan 2.1 Inference Pipeline (MAX/Mojo only)

Prompt: "a red ball bouncing..."

  1. Encode: UMT5-XXL text encoder → text embeddings (seq × 4096)
  2. Denoise: 1.3B diffusion transformer → video latents (C × F × H × W)
  3. Decode: 3D causal VAE → 480 × 832 output video
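The three stages above can be sketched end to end in plain Python. Every function name and latent dimension below is an illustrative placeholder chosen to match the diagram's shapes, not the real MAX API or Wan's exact internals:

```python
import numpy as np

def encode(prompt: str, seq: int = 16) -> np.ndarray:
    # Stand-in for UMT5-XXL: maps a prompt to (seq, 4096) embeddings.
    return np.zeros((seq, 4096), dtype=np.float32)

def denoise(emb: np.ndarray, steps: int = 8) -> np.ndarray:
    # Stand-in for the 1.3B DiT: iteratively refines (C, F, H, W) latents.
    C, F, H, W = 16, 21, 60, 104  # hypothetical latent dims, not Wan's actual ones
    latents = np.random.default_rng(0).standard_normal((C, F, H, W)).astype(np.float32)
    for _ in range(steps):
        latents = latents * 0.9  # placeholder for a real scheduler update
    return latents

def decode(latents: np.ndarray) -> np.ndarray:
    # Stand-in for the 3D causal VAE: upsamples latents to 480x832 RGB frames.
    _, F, _, _ = latents.shape
    return np.zeros((F, 480, 832, 3), dtype=np.uint8)

frames = decode(denoise(encode("a red ball bouncing...")))
print(frames.shape)  # (21, 480, 832, 3)
```

The point of the sketch is the data flow and shape contract between stages; a real port replaces each stub with a MAX graph or custom Mojo kernel.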

Task

Modular’s MAX is an open-source inference framework written from the ground up in Mojo — a language designed for high-performance AI workloads on both GPU and CPU.

The agent is given a PyTorch implementation of Wan 2.1 T2V-1.3B, a 1.3-billion-parameter open text-to-video diffusion model, and a MAX/Mojo environment, then asked to port the full video generation pipeline so that it produces reference-matching frames on hidden prompts. Every pipeline component must be reimplemented using the MAX Python API and, where needed, custom Mojo kernels.

  • The task fixes resolution and denoising settings so work is focused on the port itself.
  • Reference frames are provided to reduce wasted PyTorch regeneration work during development.
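A port has to reproduce the scheduler update across all denoising steps. As a hedged illustration only, a generic Euler-style flow-matching step looks like the following; this is a common sampler scheme, not necessarily Wan 2.1's exact scheduler, and the velocity here is a synthetic stand-in for a model prediction:

```python
import numpy as np

def euler_flow_step(x: np.ndarray, v: np.ndarray, t: float, t_prev: float) -> np.ndarray:
    """One generic Euler update for a flow-matching sampler: move latents x
    along the predicted velocity v from time t to t_prev."""
    return x + (t_prev - t) * v

# Toy 8-step schedule from t=1.0 (noise) down to t=0.0.
timesteps = np.linspace(1.0, 0.0, 9)
x = np.ones(4)                 # stand-in latents
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    v = x                      # stand-in for the transformer's velocity output
    x = euler_flow_step(x, v, t, t_prev)
print(x)  # each entry ~0.344 (0.875 ** 8)
```

Because the verifier compares frames against fixed references, even small numerical drift in this loop compounds over steps, which is why the port must match the reference scheduler rather than any generically correct sampler.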

Evaluation

Hidden workloads vary prompt length and frame count while holding the output format fixed at 480 × 832 with 8 denoising steps. The verifier compares generated frames against reference outputs and turns those per-workload PSNR checks into a plain correctness fraction.

  • Every hidden workload must return correctly sized, non-blank video output.
  • The public correctness threshold is PSNR of at least 25 dB per hidden workload.
  • For example, if 3 of 4 hidden workloads clear 25 dB, the score is 0.75.
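The scoring described above can be approximated with a small NumPy sketch. The 25 dB threshold comes from the task description; the frame data here is synthetic, and the verifier's exact PSNR implementation is unknown:

```python
import numpy as np

def psnr(ref: np.ndarray, out: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two uint8 frame stacks."""
    mse = np.mean((ref.astype(np.float64) - out.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

def correctness_fraction(pairs, threshold_db: float = 25.0) -> float:
    """Fraction of (reference, generated) workloads clearing the PSNR bar."""
    passed = sum(psnr(r, o) >= threshold_db for r, o in pairs)
    return passed / len(pairs)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(8, 480, 832, 3), dtype=np.uint8)
good = ref.copy()                                              # passes: infinite PSNR
noisy = rng.integers(0, 256, size=ref.shape, dtype=np.uint8)   # fails: ~8 dB
print(correctness_fraction([(ref, good), (ref, noisy)]))  # 0.5
```

Uncorrelated random frames land well under 10 dB, so the 25 dB bar requires genuinely reference-matching output, not merely non-blank video.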

No model was able to complete this task successfully, so we used overall test pass rate as a partial reward to rank models.

Environment

The task runs in a Modal container with a single H100, 8 CPU cores, 64 GB RAM, no internet access, and a four-hour agent budget. The container image includes MAX, Mojo, and the supporting toolchain needed for the port. Model assets sit on a mounted volume.

Constraints

  • There is no access to a verifier-owned hidden baseline. Although PyTorch and diffusers are present for verification purposes, the anti-cheat setup is designed around a genuine MAX/Mojo port.
  • Pre-scoring source scans require at least one max. import in the candidate.
  • The same scans reject imports of torch, transformers, or diffusers.
  • They also reject subprocess, os.system, sys.modules, or __import__ tricks used to reach those packages, so model computation must stay on MAX/Mojo.
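A pre-scoring scan like the one described can be sketched with the standard-library ast module. The verifier's actual rules are unknown, so this is an illustrative approximation of the checks the constraints name:

```python
import ast

BANNED_IMPORTS = {"torch", "transformers", "diffusers", "subprocess"}
BANNED_ATTRS = {("os", "system"), ("sys", "modules")}

def scan(source: str) -> list[str]:
    """Return a list of violations; an empty list means the candidate passes."""
    violations, has_max = [], False
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                root = name.split(".")[0]
                if root == "max":
                    has_max = True          # satisfies the max. import requirement
                if root in BANNED_IMPORTS:
                    violations.append(f"banned import: {name}")
        elif isinstance(node, ast.Attribute):
            # Catches os.system(...) and sys.modules tricks.
            if isinstance(node.value, ast.Name) and (node.value.id, node.attr) in BANNED_ATTRS:
                violations.append(f"banned use: {node.value.id}.{node.attr}")
        elif isinstance(node, ast.Call):
            # Catches __import__("torch")-style evasion.
            if isinstance(node.func, ast.Name) and node.func.id == "__import__":
                violations.append("banned call: __import__")
    if not has_max:
        violations.append("missing max. import")
    return violations

print(scan("import max.graph\nimport torch\n"))  # ['banned import: torch']
```

A static scan like this can be evaded by sufficiently indirect code, which is presumably why the task pairs it with reference-matching PSNR checks: output that matches the reference had to come from a working pipeline.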