16 Performance

Granite Mamba2 Inference Optimization

Results

| # | Model | Harness | Correctness | Avg | Best |
|---|-------|---------|-------------|-----|------|
| 1 | GPT-5.4 | Codex | 5/5 | 0.697 | 1.033 |
| 2 | Gemini 3.1 Pro | Gemini CLI | 5/5 | 0.676 | 0.970 |
| 3 | Qwen3.6-Plus | Qwen Code | 5/5 | 0.611 | 0.619 |
| 4 | Claude Opus 4.6 | Claude Code | 5/5 | 0.606 | 0.623 |
| 5 | Kimi K2.5 | Kimi CLI | 5/5 | 0.605 | 0.608 |

Background

GraniteMoeHybridMambaLayer sits between toy kernel tasks and the full-system Inference System Optimization task. It is a single layer from IBM's Granite family, but it still exposes prefill and decode paths, cache updates, memory layout choices, and hardware-specific tuning on an NVIDIA B200.

Task

This task isolates a single GraniteMoeHybridMambaLayer from IBM's Granite model family and asks the agent to make that layer faster on an NVIDIA B200 without changing semantics. The starting point is a clean PyTorch eager implementation plus a bundle of Triton ops the agent can assemble or extend.

  • Produce a faster candidate_impl.py.
  • Preserve hidden-state, cache, and logits behavior closely enough to satisfy the verifier's tolerances.
  • Beat a strong Triton baseline that already reflects production-grade optimization.
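The correctness requirement above amounts to an elementwise tolerance check between the candidate's outputs and the reference's. A minimal stdlib-only sketch of that shape, using hypothetical flat lists and made-up `atol`/`rtol` values (the real verifier compares hidden states, cache tensors, and logits with its own tolerances):

```python
import math

def within_tolerance(reference, candidate, atol=1e-3, rtol=1e-3):
    """Hypothetical correctness gate: every candidate element must match
    the reference within atol + rtol * |reference| (allclose-style)."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isfinite(c) and abs(c - r) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )

# Small numerical drift passes; a real semantic change fails.
ref = [0.25, 1.5, -3.0]
print(within_tolerance(ref, [0.2501, 1.5005, -3.001]))  # True
print(within_tolerance(ref, [0.25, 1.5, -2.0]))         # False
```

In practice the gate applies tensor-wise (e.g. `torch.allclose`-style comparisons) rather than to Python lists, but the pass/fail logic is the same.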

Evaluation

The verifier first enforces a full correctness gate on hidden states, cache updates, logits, and KL-style tolerances. Only candidates that clear that gate receive any speed credit; those are scored by the geometric mean of paired speedups across four hidden workloads.

A score above 1.0 therefore means beating an already optimized production path, not a toy eager baseline.
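The geometric-mean scoring can be sketched in a few lines. The per-workload speedup numbers below are hypothetical; the point is that one slow workload drags the whole score down multiplicatively, so a candidate cannot win by over-fitting a single shape:

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-workload speedups: exp(mean(log(s)))."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical paired speedups on the four hidden workloads.
score = geomean_speedup([1.10, 0.95, 1.20, 1.05])
print(round(score, 3))
```

Note that a 2x win on one workload paired with a 2x loss on another yields exactly 1.0, i.e. no net credit.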

Environment

The task runs in a Modal container with a single NVIDIA B200, 8 CPU cores, 64 GB RAM, no internet access, and a short verifier budget. The container image includes the reference block, extracted assets, a ready-to-use Python environment, and the raw CUDA and Triton tooling needed to go beyond the supplied ops. That leaves most of the budget for kernel work rather than setup or downloads.
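With a short verifier budget, per-workload timing typically follows a warmup-then-measure pattern. A stdlib-only sketch of that shape, where `baseline` and `candidate` are stand-in callables (a real B200 harness would wrap the timed region in CUDA synchronization, e.g. `torch.cuda.synchronize()`, and use CUDA events rather than wall-clock time):

```python
import time

def best_time(fn, warmup=3, iters=10):
    """Run fn several times after warmup and keep the best wall-clock time,
    which filters out one-off jitter such as JIT compilation."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def paired_speedup(baseline, candidate, **kw):
    """Speedup > 1.0 means the candidate beats the baseline on this workload."""
    return best_time(baseline, **kw) / best_time(candidate, **kw)
```

The warmup iterations matter on GPU: the first calls pay Triton autotuning and compilation costs that should not count against the candidate.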