17 Performance

SGLang Inference System Optimization

Results

| # | Model | Agent | Correctness | Avg | Best |
|---|-------|-------|-------------|-----|------|
| 1 | Claude Opus 4.6 | Claude Code | 0/5 | 0.325 | 0.361 |
| 2 | Qwen3.6-Plus | Qwen Code | 0/5 | 0.112 | 0.190 |
| 3 | Kimi K2.5 | Kimi CLI | 0/5 | 0.103 | 0.293 |
| 4 | GPT-5.4 | Codex | 0/5 | 0.041 | 0.207 |
| 5 | Gemini 3.1 Pro | Gemini CLI | 0/5 | 0.029 | 0.146 |

Background

SGLang is a high-performance serving framework for large language and vision-language models. In this task, the agent has to optimize an SGLang server for Qwen3.5-4B. That means the full serving path: launch configuration, request scheduling, batching, caching, decode execution, and concurrent request handling. The Granite Mamba2 Inference Optimization task focuses on a single layer in isolation; this one covers the full stack.

Task

The agent is given an SGLang serving setup for Qwen3.5-4B on a single B200 with pre-downloaded model weights and must make the server faster. In practice that means keeping /app/launch_server.sh able to start a correct server process while improving the full serving stack. The optimization surface is intentionally broad: launch configuration, scheduler behavior, kernels, caching strategy, and even model-level changes are all part of the search space as long as correctness remains within the task tolerances.
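To make the optimization surface concrete, here is a minimal sketch of what an agent-maintained /app/launch_server.sh might look like. The model path and every flag value below are illustrative assumptions, not the task's baseline configuration; the flags themselves are standard SGLang server options.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: the weights path and all flag values are assumptions,
# not the task baseline. The verifier starts this script fresh before scoring.

MODEL_PATH=/models/Qwen3.5-4B   # pre-downloaded weights (location assumed)

exec python -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.90 \
  --chunked-prefill-size 8192 \
  --max-running-requests 256 \
  --enable-torch-compile
```

Tuning knobs like the KV-cache memory fraction, chunked prefill size, and concurrency ceiling are exactly the kind of launch-configuration surface the task description points at, alongside deeper scheduler, kernel, and caching changes.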

Evaluation

The public score blends correctness and latency. Let c be the average greedy token-match rate on the hidden prompt set and let g be the geometric-mean latency speedup across the sequential and concurrent hidden workloads. Speed only contributes once correctness is high enough.

A system that matches baseline correctness and baseline speed scores exactly 1.0.
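One consistent reading of this rule, written out as math. The geometric mean over the two hidden workloads follows directly from the definition of g; the hard gate at a correctness threshold c_min is an assumption about how "speed only contributes once correctness is high enough" is implemented:

```latex
g = \sqrt{s_{\text{seq}} \cdot s_{\text{conc}}},
\qquad
\text{score} =
\begin{cases}
g, & c \ge c_{\min} \\
0, & c < c_{\min}
\end{cases}
```

Under this reading, a system at baseline correctness and baseline speed has s_seq = s_conc = 1, hence g = 1 and a score of exactly 1.0, consistent with the statement above.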

Environment

The task runs in a Modal container with a single NVIDIA B200, 8 CPU cores, 64 GB RAM, and no public internet access. The container image is prebuilt for server surgery and benchmarking, while larger model assets live on mounted storage. The baseline config is verifier-owned; the candidate server is started fresh by the verifier from the agent-maintained launch script rather than handed over as a long-running process.

Constraints

  • The verifier always scores against its own baseline launched from /tests/launch_baseline.sh; agent-reported files do not affect that measurement.
  • The candidate is started fresh from /app/launch_server.sh, and leftover launch_server processes are killed before scoring.
  • Runtime internet access is disabled by allow_internet = false and Harbor-managed CIDR allowlists.
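The constraints above imply a scoring sequence roughly like the following. The exact commands are assumptions reconstructed from the description, not the verifier's actual code:

```shell
# Hypothetical sketch of the verifier's scoring flow (commands assumed).

# 1. Kill any leftover candidate server processes before scoring.
pkill -f launch_server || true

# 2. Start the candidate fresh from the agent-maintained script.
bash /app/launch_server.sh &

# 3. The baseline is always launched from the verifier-owned script,
#    so edits to agent-reported files cannot inflate the measured speedup.
bash /tests/launch_baseline.sh &
```

The practical consequence for the agent is that only /app/launch_server.sh and whatever it transitively starts matter; any long-running process left behind between runs is discarded.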