GraniteMoeHybridMambaLayer sits between toy kernel tasks and the full-system Inference System Optimization task. The task isolates a single layer from IBM's Granite model family, yet that layer still exposes prefill and decode paths, cache updates, memory layout choices, and hardware-specific tuning on an NVIDIA B200. The agent's job is to make the layer faster without changing semantics, starting from a clean PyTorch eager implementation plus a bundle of Triton ops it can assemble or extend.
The agent submits its work as candidate_impl.py. The verifier first enforces a full correctness gate on hidden states, cache updates, and logits, using KL-style tolerances. Only candidates that clear that gate receive any speed credit; those are scored by the geometric mean of paired speedups across four hidden workloads.
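A minimal sketch of what such a gate-then-score flow might look like. The function names, tolerance values, and the exact combination of checks here are illustrative assumptions, not the verifier's actual implementation; the real gate also covers cache updates and runs on tensors rather than flat lists.

```python
import math

def max_abs_diff(a, b):
    # Largest element-wise absolute difference between two flat vectors.
    return max(abs(x - y) for x, y in zip(a, b))

def kl_divergence(p_logits, q_logits):
    # KL(p || q) between the softmax distributions of two logit vectors.
    def softmax(logits):
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def passes_gate(ref_hidden, cand_hidden, ref_logits, cand_logits,
                atol=1e-3, kl_tol=1e-4):
    # Hypothetical gate: hidden states must match within an absolute
    # tolerance AND candidate logits must stay KL-close to the reference.
    return (max_abs_diff(ref_hidden, cand_hidden) <= atol
            and kl_divergence(ref_logits, cand_logits) <= kl_tol)
```

The key property the sketch captures is that correctness is a hard gate: a candidate that fails any check gets no speed credit at all, regardless of how fast it runs.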
A score above 1.0 therefore means beating an already optimized production path, not a toy eager baseline.
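The geometric mean of paired speedups can be computed as below; the workload count and timing units are assumptions for illustration, but the formula itself is the standard geometric mean over per-workload speedup ratios.

```python
import math

def geomean_speedup(baseline_ms, candidate_ms):
    # Per-workload speedup is baseline time / candidate time (>1 is faster).
    speedups = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    # Geometric mean via log-space averaging to avoid overflow/underflow.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

One regression can erase several wins: speeding up three of four workloads 2x while slowing the fourth to half speed yields a geomean of only 2^(1/2) ≈ 1.41, and larger regressions pull the score below 1.0 entirely.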
The task runs in a Modal container with a single NVIDIA B200, 8 CPU cores, 64 GB RAM, no internet access, and a short verifier budget. The container image includes the reference block, extracted assets, a ready-to-use Python environment, and the raw CUDA and Triton tooling needed to go beyond the supplied ops. That leaves most of the budget for kernel work rather than setup or downloads.