08 Research

Optimizer Design

Results

#  Model            Scaffold     Avg   Best
1  GPT-5.4          Codex        1.6x  2.5x
2  Claude Opus 4.6  Claude Code  1.5x  1.9x
3  Gemini 3.1 Pro   Gemini CLI   1.1x  1.2x
4  Kimi K2.5        Kimi CLI     1.1x  1.5x
5  Qwen3.6-Plus     Qwen Code    0.9x  1.3x

Background

This task compares one shared optimizer across 10 workloads covering language modeling, vision, graphs, recommendation, and hidden architectures. The same optimizer class and config are reused everywhere, so the task is about cross-workload robustness rather than per-task tuning.

Task

Working on a single H100 with 7 visible workloads (and 3 more hidden at verification), the agent must design one torch.optim.Optimizer subclass plus a shared config that beats tuned AdamW across all of them. The same optimizer implementation and config are used for every task; there is no per-workload hyperparameter tuning at submission time.

  • Submit custom_optimizer.py and optimizer_config.json.
  • Generalize across language modeling, vision, graphs, recommendation, and other workload families.
  • Avoid workload-name hacks or hidden branching on model identity.
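A minimal sketch of the expected submission shape: a torch.optim.Optimizer subclass with a shared hyperparameter config. The class name, the SGD-with-momentum update, and the "lr"/"momentum" keys are all illustrative assumptions, not the benchmark's actual schema; they only show the interface the verifier drives.

```python
# Hypothetical submission skeleton; the update rule and hyperparameter
# names here are placeholders, not the benchmark's real optimizer.
import torch


class CustomOptimizer(torch.optim.Optimizer):
    """Stand-in optimizer class of the kind submitted in custom_optimizer.py."""

    def __init__(self, params, lr=1e-3, momentum=0.9):
        # One shared config for every workload, mirroring optimizer_config.json.
        defaults = dict(lr=lr, momentum=momentum)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Per-parameter state lives in self.state, as torch expects.
                state = self.state[p]
                buf = state.get("momentum_buffer")
                if buf is None:
                    buf = state["momentum_buffer"] = torch.zeros_like(p)
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(buf, alpha=-group["lr"])
        return loss
```

Note the optimizer sees only parameters and gradients, which is what makes workload-name hacks or branching on model identity detectable and out of scope.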

Evaluation

Each workload runs for up to 10,000 optimization steps. If the candidate reaches the target loss early, it is credited by the ratio of the tuned AdamW baseline's step count to its own; if it misses the target, the verifier awards capped partial credit based on the final EMA validation loss.

  • The final score is the geometric mean across all 10 workloads.
  • One bad failure can drag the aggregate down sharply.
  • The score is reported as step-count reduction; a score of 2.5x means the candidate reaches the target loss in 1/2.5 the steps of tuned AdamW.
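The aggregation above can be sketched in a few lines: per-workload speedups (baseline steps divided by candidate steps to the target) combined with a geometric mean. The partial-credit path for missed targets is not specified here, so this sketch only covers the speedup case.

```python
# Illustrative scoring sketch; the real verifier's partial-credit
# handling for missed targets is omitted here.
import math


def workload_speedup(baseline_steps, candidate_steps):
    """Step-count reduction vs. tuned AdamW; 2.5 means 1/2.5 the steps."""
    return baseline_steps / candidate_steps


def aggregate_score(speedups):
    """Geometric mean across workloads, computed in log space."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is why one bad failure is so costly: a single near-zero workload pulls the whole aggregate toward zero regardless of how strong the other workloads are.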

Environment And Constraints

Agents get a single H100, 8 CPU cores, 128 GB RAM, and no internet. The workload set spans GPTs, CNNs, graph models, transformers, recommendation models, and hidden architectures chosen specifically to punish overfitting to the visible set.

Caveats

Task Caveat
The current benchmark code contains a known calibration issue: the tuned AdamW baseline can itself score above 1.0 because the stored target losses do not always line up perfectly with the measured convergence step. That does not destroy the relative ranking, but it does mean the absolute score should be read as a task-internal comparison number rather than a literal wall-clock speedup claim.
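A small numeric illustration of the calibration issue, with made-up values: if the stored baseline step count is larger than the step at which tuned AdamW actually reaches the stored target loss, the baseline "beats itself" and scores above 1.0x.

```python
# Hypothetical numbers illustrating the calibration caveat; the real
# benchmark's stored targets and step counts are not reproduced here.
stored_baseline_steps = 10_000   # step count recorded alongside the target loss
measured_baseline_steps = 8_000  # step at which tuned AdamW actually hits it
baseline_self_score = stored_baseline_steps / measured_baseline_steps
assert baseline_self_score > 1.0  # baseline scores 1.25x against itself
```

Since the same miscalibration inflates every candidate's score on that workload equally, relative rankings survive even though absolute speedups do not.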