16 Performance

Granite Mamba2 Inference Optimization

Results

| # | Model | Harness | Correctness | Avg | Best |
|---|-------|---------|-------------|-----|------|
| 1 | GPT-5.4 | Codex | 5/5 | 0.697 | 1.033 |
| 2 | Gemini 3.1 Pro | Gemini CLI | 5/5 | 0.676 | 0.970 |
| 3 | Qwen3.6-Plus | Qwen Code | 5/5 | 0.611 | 0.619 |
| 4 | Claude Opus 4.6 | Claude Code | 5/5 | 0.606 | 0.623 |
| 5 | Kimi K2.5 | Kimi CLI | 5/5 | 0.605 | 0.608 |

Background

GraniteMoeHybridMambaLayer sits between toy kernel tasks and the full-system Inference System Optimization task. It is a single layer from IBM's Granite family, but it still exposes prefill and decode paths, cache updates, memory layout choices, and hardware-specific tuning on an NVIDIA B200.

Task

This task isolates a single GraniteMoeHybridMambaLayer from IBM's Granite model family and asks the agent to make that layer faster on an NVIDIA B200 without changing semantics. The starting point is a clean PyTorch eager implementation plus a bundle of Triton ops the agent can assemble or extend.

  • Produce a faster candidate_impl.py.
  • Preserve hidden-state, cache, and logits behavior closely enough to satisfy the verifier's tolerances.
  • Beat a strong Triton baseline that already reflects production-grade optimization.
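The correctness requirement above amounts to an elementwise tolerance check between the candidate's outputs and the reference's. A minimal stdlib-only sketch of that shape, using hypothetical flat lists and made-up `atol`/`rtol` values (the real verifier compares hidden states, cache tensors, and logits with its own tolerances):

```python
import math

def within_tolerance(reference, candidate, atol=1e-3, rtol=1e-3):
    """Hypothetical correctness gate: every candidate element must match
    the reference within atol + rtol * |reference| (allclose-style)."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isfinite(c) and abs(c - r) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )

# Small numerical drift passes; a real semantic change fails.
ref = [0.25, 1.5, -3.0]
print(within_tolerance(ref, [0.2501, 1.5005, -3.001]))  # True
print(within_tolerance(ref, [0.25, 1.5, -2.0]))         # False
```

In practice the gate applies tensor-wise (e.g. `torch.allclose`-style comparisons) rather than to Python lists, but the pass/fail logic is the same.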

Evaluation

The verifier first enforces a full correctness gate on hidden states, cache updates, logits, and KL-style tolerances. Only candidates that clear that gate receive any speed credit; those are scored by the geometric mean of paired speedups across four hidden workloads.

A score above 1.0 therefore means beating an already optimized production path, not a toy eager baseline.
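The geometric-mean scoring can be sketched in a few lines. The per-workload speedup numbers below are hypothetical; the point is that one slow workload drags the whole score down multiplicatively, so a candidate cannot win by over-fitting a single shape:

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-workload speedups: exp(mean(log(s)))."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical paired speedups on the four hidden workloads.
score = geomean_speedup([1.10, 0.95, 1.20, 1.05])
print(round(score, 3))
```

Note that a 2x win on one workload paired with a 2x loss on another yields exactly 1.0, i.e. no net credit.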

Environment

The task runs in a Modal container with a single NVIDIA B200, 8 CPU cores, 64 GB RAM, no internet access, and a short verifier budget. The container image includes the reference block, extracted assets, a ready-to-use Python environment, and the raw CUDA and Triton tooling needed to go beyond the supplied ops. That leaves most of the budget for kernel work rather than setup or downloads.
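With a short verifier budget, per-workload timing typically follows a warmup-then-measure pattern. A stdlib-only sketch of that shape, where `baseline` and `candidate` are stand-in callables (a real B200 harness would wrap the timed region in CUDA synchronization, e.g. `torch.cuda.synchronize()`, and use CUDA events rather than wall-clock time):

```python
import time

def best_time(fn, warmup=3, iters=10):
    """Run fn several times after warmup and keep the best wall-clock time,
    which filters out one-off jitter such as JIT compilation."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def paired_speedup(baseline, candidate, **kw):
    """Speedup > 1.0 means the candidate beats the baseline on this workload."""
    return best_time(baseline, **kw) / best_time(candidate, **kw)
```

The warmup iterations matter on GPU: the first calls pay Triton autotuning and compilation costs that should not count against the candidate.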