This task evaluates one shared optimizer across 10 workloads covering language modeling, vision, graphs, recommendation, and hidden architectures. Working on a single H100 with 7 visible workloads (and 3 more hidden at verification time), the agent must design one torch.optim.Optimizer subclass plus a shared config that beats tuned AdamW across all of them. The same optimizer class and config are reused for every workload, with no per-workload hyperparameter tuning at submission time, so the task measures cross-workload robustness rather than per-task fit.
The submission consists of two files, custom_optimizer.py and optimizer_config.json. Each workload runs for up to 10,000 optimization steps. If the candidate reaches the target loss early, it earns credit based on how many steps it needed relative to the tuned AdamW baseline; if it misses the target, the verifier awards capped partial credit based on the final EMA validation loss.
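As a rough illustration of what custom_optimizer.py might contain, here is a minimal sketch of a torch.optim.Optimizer subclass. The class name, hyperparameters, and update rule (momentum SGD with AdamW-style decoupled weight decay) are assumptions for illustration, not the actual required interface beyond subclassing Optimizer.

```python
import torch
from torch.optim import Optimizer

class SharedOptimizer(Optimizer):
    """Hypothetical example: momentum SGD with decoupled weight decay.

    Any concrete update rule would do; the task only requires a single
    torch.optim.Optimizer subclass shared across all workloads.
    """

    def __init__(self, params, lr=1e-3, momentum=0.9, weight_decay=0.01):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                # Decoupled weight decay, applied to the parameter directly
                # rather than folded into the gradient (AdamW-style).
                p.mul_(1 - group["lr"] * group["weight_decay"])
                p.add_(buf, alpha=-group["lr"])
        return loss
```

A verifier would typically instantiate this class with the values from optimizer_config.json and run it unchanged on every workload.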
Agents get a single H100, 8 CPU cores, 128 GB RAM, and no internet access. The workload set spans GPTs, CNNs, graph models, transformers, recommendation models, and hidden architectures chosen specifically to punish overfitting to the visible set.
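For concreteness, optimizer_config.json might look like the fragment below. The key names and values are assumptions for illustration; the only stated requirement is that one shared config is applied to every workload.

```json
{
  "lr": 3e-4,
  "momentum": 0.9,
  "weight_decay": 0.01
}
```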