08 Research

Optimizer Design

Results

#  Model            Scaffold     Avg   Best
1  GPT-5.4          Codex        1.6x  2.5x
2  Claude Opus 4.6  Claude Code  1.5x  1.9x
3  Gemini 3.1 Pro   Gemini CLI   1.1x  1.2x
4  Kimi K2.5        Kimi CLI     1.1x  1.5x
5  Qwen3.6-Plus     Qwen Code    0.9x  1.3x

Background

This task compares one shared optimizer across 10 workloads covering language modeling, vision, graphs, recommendation, and hidden architectures. The same optimizer class and config are reused everywhere, so the task is about cross-workload robustness rather than per-task tuning.

Task

Working on a single H100 with 7 visible workloads (and 3 more hidden at verification), the agent must design one torch.optim.Optimizer subclass plus a shared config that beats tuned AdamW across all of them. The same optimizer implementation and config are used for every task; there is no per-workload hyperparameter tuning at submission time.

  • Submit custom_optimizer.py and optimizer_config.json.
  • Generalize across language modeling, vision, graphs, recommendation, and other workload families.
  • Avoid workload-name hacks or hidden branching on model identity.
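A minimal sketch of the expected submission shape: a torch.optim.Optimizer subclass with a shared hyperparameter config. The class name, the SGD-with-momentum update, and the "lr"/"momentum" keys are all illustrative assumptions, not the benchmark's actual schema; they only show the interface the verifier drives.

```python
# Hypothetical submission skeleton; the update rule and hyperparameter
# names here are placeholders, not the benchmark's real optimizer.
import torch


class CustomOptimizer(torch.optim.Optimizer):
    """Stand-in optimizer class of the kind submitted in custom_optimizer.py."""

    def __init__(self, params, lr=1e-3, momentum=0.9):
        # One shared config for every workload, mirroring optimizer_config.json.
        defaults = dict(lr=lr, momentum=momentum)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Per-parameter state lives in self.state, as torch expects.
                state = self.state[p]
                buf = state.get("momentum_buffer")
                if buf is None:
                    buf = state["momentum_buffer"] = torch.zeros_like(p)
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(buf, alpha=-group["lr"])
        return loss
```

Note the optimizer sees only parameters and gradients, which is what makes workload-name hacks or branching on model identity detectable and out of scope.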

Evaluation

Each workload runs for up to 10,000 optimization steps. If the candidate reaches the target loss early, it is credited by the ratio of the tuned AdamW baseline's step count to its own; if it misses the target, the verifier awards capped partial credit based on the final EMA validation loss.

  • The final score is the geometric mean across all 10 workloads.
  • One bad failure can drag the aggregate down sharply.
  • The score is reported as step-count reduction; a score of 2.5x means the candidate reaches the target loss in 1/2.5 the steps of tuned AdamW.
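The aggregation above can be sketched in a few lines: per-workload speedups (baseline steps divided by candidate steps to the target) combined with a geometric mean. The partial-credit path for missed targets is not specified here, so this sketch only covers the speedup case.

```python
# Illustrative scoring sketch; the real verifier's partial-credit
# handling for missed targets is omitted here.
import math


def workload_speedup(baseline_steps, candidate_steps):
    """Step-count reduction vs. tuned AdamW; 2.5 means 1/2.5 the steps."""
    return baseline_steps / candidate_steps


def aggregate_score(speedups):
    """Geometric mean across workloads, computed in log space."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is why one bad failure is so costly: a single near-zero workload pulls the whole aggregate toward zero regardless of how strong the other workloads are.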

Environment And Constraints

Agents get a single H100, 8 CPU cores, 128 GB RAM, and no internet. The workload set spans GPTs, CNNs, graph models, transformers, recommendation models, and hidden architectures chosen specifically to punish overfitting to the visible set.

Caveats

Task Caveat
The current benchmark code contains a known calibration issue: the tuned AdamW baseline can itself score above 1.0 because the stored target losses do not always line up perfectly with the measured convergence step. That does not destroy the relative ranking, but it does mean the absolute score should be read as a task-internal comparison number rather than a literal wall-clock speedup claim.
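A small numeric illustration of the calibration issue, with made-up values: if the stored baseline step count is larger than the step at which tuned AdamW actually reaches the stored target loss, the baseline "beats itself" and scores above 1.0x.

```python
# Hypothetical numbers illustrating the calibration caveat; the real
# benchmark's stored targets and step counts are not reproduced here.
stored_baseline_steps = 10_000   # step count recorded alongside the target loss
measured_baseline_steps = 8_000  # step at which tuned AdamW actually hits it
baseline_self_score = stored_baseline_steps / measured_baseline_steps
assert baseline_self_score > 1.0  # baseline scores 1.25x against itself
```

Since the same miscalibration inflates every candidate's score on that workload equally, relative rankings survive even though absolute speedups do not.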