SGLang is a high-performance serving framework for large language models and vision-language models. In this task, the agent must optimize an SGLang server for Qwen3.5-4B across the full serving path: launch configuration, request scheduling, batching, caching, decode execution, and concurrent request handling. Where the Granite Mamba2 Inference Optimization task focuses on a single layer in isolation, this one covers the full stack.
The agent is given an SGLang serving setup for Qwen3.5-4B on a single B200 with pre-downloaded model weights and must make the server faster. In practice that means keeping /app/launch_server.sh able to start a correct server process while improving the full serving stack. The optimization surface is intentionally broad: launch configuration, scheduler behavior, kernels, caching strategy, and even model-level changes are all part of the search space as long as correctness remains within the task tolerances.
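As a concrete starting point, a launch script along these lines would fit the setup described above. This is a minimal sketch, not the task's actual script: the model path, port, and all flag values are assumptions, though the flags themselves (`--mem-fraction-static`, `--max-running-requests`, `--chunked-prefill-size`, `--schedule-policy`) are standard `sglang.launch_server` options.

```shell
#!/usr/bin/env bash
# Hypothetical /app/launch_server.sh sketch; model path and flag values
# are illustrative assumptions, not the task's baseline configuration.
set -euo pipefail

# Assumed mount point for the pre-downloaded weights.
MODEL_PATH="${MODEL_PATH:-/models/Qwen3.5-4B}"

# --mem-fraction-static:   GPU memory fraction reserved for weights + KV cache
# --max-running-requests:  cap on concurrently decoding requests
# --chunked-prefill-size:  split long prefills so decode batches keep progressing
# --schedule-policy lpm:   longest-prefix-match scheduling to exploit the radix cache
exec python -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.85 \
  --max-running-requests 256 \
  --chunked-prefill-size 8192 \
  --schedule-policy lpm
```

Tuning these values against the hidden workloads (and exploring kernel- or model-level changes beyond them) is the optimization surface the task describes.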
The public score blends correctness and latency. Let C be the average greedy token-match rate on the hidden prompt set and let S be the geometric-mean latency speedup across the sequential and concurrent hidden workloads. Speed only contributes once correctness is high enough.
A system that matches baseline correctness and baseline speed scores exactly 1.0.
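The exact blending formula is not given, but the gating behavior can be sketched as follows. The correctness threshold (0.95) and the below-gate fallback (returning the match rate itself) are hypothetical choices for illustration; only the averaging, the geometric mean, and the "score 1.0 at baseline" anchor come from the text.

```python
import math

def blended_score(match_rates, speedups, gate=0.95):
    """Hypothetical public-score sketch.

    match_rates: per-prompt greedy token-match rates on the hidden set.
    speedups: per-workload latency speedups (sequential, concurrent).
    gate: assumed correctness threshold below which speed does not count.
    """
    c = sum(match_rates) / len(match_rates)          # average token-match rate
    s = math.prod(speedups) ** (1 / len(speedups))   # geometric-mean speedup
    # Speed only contributes once correctness clears the gate
    # (the fallback below the gate is an assumption).
    return s if c >= gate else c

# A system matching baseline correctness (all match rates 1.0) and
# baseline speed (all speedups 1.0) scores exactly 1.0.
print(blended_score([1.0, 1.0], [1.0, 1.0]))  # → 1.0
```

Note how the geometric mean rewards balanced gains: doubling sequential throughput while halving concurrent throughput nets out to a speedup of 1.0.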
The task runs in a Modal container with a single NVIDIA B200, 8 CPU cores, 64 GB RAM, and no public internet access. The container image is prebuilt for server surgery and benchmarking, while larger model assets live on mounted storage. The baseline config is verifier-owned, but the candidate server is still started fresh by the verifier from the agent-maintained launch script rather than provided as a long-running process ahead of time.
The baseline is launched by the verifier from /tests/launch_baseline.sh; agent-reported files do not affect that measurement. The candidate server is launched from /app/launch_server.sh, and leftover launch_server processes are killed before scoring. Network isolation is enforced with allow_internet = false and Harbor-managed CIDR allowlists.
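The scoring flow above can be sketched as a verifier-side shell sequence. The two script paths come from the text; the port, the `/health` endpoint (which SGLang servers expose), and the polling loop are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical verifier-side scoring-flow sketch; paths come from the task
# description, while the port and health-check loop are assumptions.
set -euo pipefail

# Leftover launch_server processes are killed before scoring.
pkill -f launch_server || true

# The candidate server is started fresh from the agent-maintained script.
bash /app/launch_server.sh &

# Wait until the server answers before running the hidden workloads.
until curl -sf http://localhost:30000/health > /dev/null; do
  sleep 1
done
```

The point of this structure for the agent: only what /app/launch_server.sh actually starts is measured, so any optimization must survive a cold, verifier-initiated launch.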