SGLang is a high-performance serving framework for large language models and vision-language models. In this task, the agent must optimize an SGLang server for Qwen3.5-4B across the full serving path: launch configuration, request scheduling, batching, caching, decode execution, and concurrent request handling. Where the Granite Mamba2 Inference Optimization task focuses on a single layer in isolation, this one covers the full stack.
The agent is given an SGLang serving setup for Qwen3.5-4B on a single B200 with pre-downloaded model weights and must make the server faster. In practice that means keeping /app/launch_server.sh able to start a correct server process while improving the full serving stack. The optimization surface is intentionally broad: launch configuration, scheduler behavior, kernels, caching strategy, and even model-level changes are all part of the search space as long as correctness remains within the task tolerances.
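As a concrete starting point, a launch script along these lines would fit the setup described above. This is a minimal sketch, not the task's actual script: the model path, port, and all flag values are assumptions, though the flags themselves (`--mem-fraction-static`, `--max-running-requests`, `--chunked-prefill-size`, `--schedule-policy`) are standard `sglang.launch_server` options.

```shell
#!/usr/bin/env bash
# Hypothetical /app/launch_server.sh sketch; model path and flag values
# are illustrative assumptions, not the task's baseline configuration.
set -euo pipefail

# Assumed mount point for the pre-downloaded weights.
MODEL_PATH="${MODEL_PATH:-/models/Qwen3.5-4B}"

# --mem-fraction-static:   GPU memory fraction reserved for weights + KV cache
# --max-running-requests:  cap on concurrently decoding requests
# --chunked-prefill-size:  split long prefills so decode batches keep progressing
# --schedule-policy lpm:   longest-prefix-match scheduling to exploit the radix cache
exec python -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.85 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --schedule-policy lpm
```

Tuning these values against the hidden workloads (and exploring kernel- or model-level changes beyond them) is the optimization surface the task describes.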
The public score blends correctness and latency. Let C be the average greedy token-match rate on the hidden prompt set and let S be the geometric-mean latency speedup across the sequential and concurrent hidden workloads. Speed only contributes once correctness is high enough.
A system that matches baseline correctness and baseline speed scores exactly 1.0.
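The exact blending formula is not given, but the gating behavior can be sketched as follows. The correctness threshold (0.95) and the below-gate fallback (returning the match rate itself) are hypothetical choices for illustration; only the averaging, the geometric mean, and the "score 1.0 at baseline" anchor come from the text.

```python
import math

def blended_score(match_rates, speedups, gate=0.95):
    """Hypothetical public-score sketch.

    match_rates: per-prompt greedy token-match rates on the hidden set.
    speedups: per-workload latency speedups (sequential, concurrent).
    gate: assumed correctness threshold below which speed does not count.
    """
    c = sum(match_rates) / len(match_rates)          # average token-match rate
    s = math.prod(speedups) ** (1 / len(speedups))   # geometric-mean speedup
    # Speed only contributes once correctness clears the gate
    # (the fallback below the gate is an assumption).
    return s if c >= gate else c

# A system matching baseline correctness (all match rates 1.0) and
# baseline speed (all speedups 1.0) scores exactly 1.0.
print(blended_score([1.0, 1.0], [1.0, 1.0]))  # → 1.0
```

Note how the geometric mean rewards balanced gains: doubling sequential throughput while halving concurrent throughput nets out to a speedup of 1.0.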
The task runs in a Modal container with a single NVIDIA B200, 8 CPU cores, 64 GB RAM, and no public internet access. The container image is prebuilt for server surgery and benchmarking, while larger model assets live on mounted storage. The baseline config is verifier-owned, but the candidate server is still started fresh by the verifier from the agent-maintained launch script rather than provided as a long-running process ahead of time.
The baseline is launched by the verifier from /tests/launch_baseline.sh; agent-reported files do not affect that measurement. The candidate server is launched from /app/launch_server.sh, and leftover launch_server processes are killed before scoring. Network isolation is enforced with allow_internet = false and Harbor-managed CIDR allowlists.
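The scoring flow above can be sketched as a verifier-side shell sequence. The two script paths come from the text; the port, the `/health` endpoint (which SGLang servers expose), and the polling loop are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical verifier-side scoring-flow sketch; paths come from the task
# description, while the port and health-check loop are assumptions.
set -euo pipefail

# Leftover launch_server processes are killed before scoring.
pkill -f launch_server || true

# The candidate server is started fresh from the agent-maintained script.
bash /app/launch_server.sh &

# Wait until the server answers before running the hidden workloads.
until curl -sf http://localhost:30000/health > /dev/null; do
  sleep 1
done
```

The point of this structure for the agent: only what /app/launch_server.sh actually starts is measured, so any optimization must survive a cold, verifier-initiated launch.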