Wan 2.1 T2V-1.3B is a text-to-video inference pipeline comprising model code, scheduler logic, and video-specific tensor handling. The task supplies a PyTorch reference implementation and asks for the pipeline to be ported into a MAX/Mojo environment.
The reference is a full inference stack rather than an isolated operator: it includes text conditioning, scheduler updates across denoising steps, and video-specific latent and frame handling.
The task fixes the output configuration, including resolution and denoising step count, so the port targets one defined pipeline path.
Modular’s MAX is an open-source inference framework written from the ground up in Mojo — a language designed for high-performance AI workloads on both GPU and CPU.
The agent is given a PyTorch implementation of Wan 2.1 T2V-1.3B, a 1.3-billion-parameter open text-to-video diffusion model, and a MAX/Mojo environment, then asked to port the full video generation pipeline so that it produces reference-matching frames on hidden prompts. Every pipeline component must be reimplemented using the MAX Python API and, where needed, custom Mojo kernels.
Hidden workloads vary prompt length and frame count while holding the output format fixed at 480x832 with 8 denoising steps. The verifier compares generated frames against reference outputs with a per-workload PSNR check and reports the fraction of workloads that pass as the correctness score.
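A per-workload PSNR check of this kind can be sketched as follows. This is an illustrative reimplementation, not the task's actual verifier; the function names and the 30 dB pass threshold are assumptions.

```python
import numpy as np

def psnr(ref: np.ndarray, out: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two uint8 frame stacks."""
    mse = np.mean((ref.astype(np.float64) - out.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

def correctness_fraction(pairs, threshold_db: float = 30.0) -> float:
    """Fraction of (reference, candidate) workloads whose PSNR clears the bar.

    The 30 dB threshold is a placeholder, not the task's real cutoff.
    """
    passes = [psnr(ref, out) >= threshold_db for ref, out in pairs]
    return sum(passes) / len(passes)
```

Collapsing per-workload checks into a single fraction gives a smooth partial-credit signal even when no candidate passes every workload.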
No model was able to complete this task successfully, so we used overall test pass rate as a partial reward to rank models.
The task runs in a Modal container with a single H100, 8 CPU cores, 64 GB RAM, no internet access, and a four-hour agent budget. The container image includes MAX, Mojo, and the supporting toolchain needed for the port. Model assets sit on a mounted volume.
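A sandbox like the one above could be declared with Modal's Python API roughly as follows. This is a hedged configuration sketch, not the task's actual setup: the image contents, app name, volume name, and function body are placeholders.

```python
import modal

# Placeholder image: the real one bundles MAX, Mojo, and the supporting toolchain.
image = modal.Image.debian_slim().pip_install("max")  # assumed package list

app = modal.App("wan-port-eval", image=image)        # hypothetical app name
weights = modal.Volume.from_name("wan21-weights")    # hypothetical volume name

@app.function(
    gpu="H100",           # single H100, as stated in the task
    cpu=8.0,              # 8 CPU cores
    memory=64 * 1024,     # 64 GB RAM (Modal takes MiB)
    timeout=4 * 60 * 60,  # four-hour agent budget, in seconds
    block_network=True,   # no internet access
    volumes={"/models": weights},  # model assets on a mounted volume
)
def run_agent():
    ...  # launch the agent against the mounted model assets
```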
The harness checks for `max` imports in the candidate and blocks `torch`, `transformers`, and `diffusers`, along with `subprocess`, `os.system`, `sys.modules`, and `__import__` tricks used to reach those packages, so model computation must stay on MAX/Mojo.
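One way such an import guard can be implemented is a static AST scan of the candidate's source. This is an illustrative sketch, not the task's actual checker, and the banned-name lists here are assumptions based on the packages named above.

```python
import ast

BANNED_MODULES = {"torch", "transformers", "diffusers", "subprocess"}

def check_source(source: str) -> list[str]:
    """Return a list of import-policy violations found in candidate source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                violations.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Call):
            # Catch dynamic-import escape hatches like __import__("torch").
            if isinstance(node.func, ast.Name) and node.func.id == "__import__":
                violations.append("__import__ call")
        elif isinstance(node, ast.Attribute):
            # Catch shelling out via os.system.
            if (isinstance(node.value, ast.Name)
                    and node.value.id == "os" and node.attr == "system"):
                violations.append("os.system call")
    return violations
```

A static scan like this is cheap but not airtight; a real harness would likely pair it with runtime checks on `sys.modules`, since string-built module names can evade AST inspection.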