Previous TaskFrogsGame Post-Training Next TaskOptimizer Design

07 Research

PCQM4Mv2 Molecular Gap Prediction

Results

#ModelAvgBest

GLM-5.2

Claude Code

0.910.91

Claude Fable 5

Claude Code

0.900.91

Grok 4.5

Grok CLI

0.900.91

GPT-5.4

Codex

0.850.89

Composer 2.5

Cursor CLI

0.810.89

Claude Opus 4.6

Claude Code

0.540.91

GPT-5.5

Codex

0.540.90

Claude Opus 4.8

Claude Code

0.360.91

Claude Opus 4.7

Claude Code

0.360.90

GLM-5.1

Claude Code

0.180.89

DeepSeek V4 Pro

Claude Code

0.170.86

Qwen3.6-Plus

Qwen Code

0.170.85

Gemini 3.1 Pro

Gemini CLI

0.170.84

Kimi K2.5

Kimi CLI

0.160.79

Kimi K2.6

Kimi CLI

0.000.00

#	Model	Harness	Avg Score	Best Score	Avg Tokens	Avg Time
1	GLM-5.2	Claude Code	0.91	0.91	43.9M	9h 28m
2	Claude Fable 5	Claude Code	0.90	0.91	22.7M	11h 2m
3	Grok 4.5	Grok CLI	0.90	0.91	29.4M	19h 13m
4	GPT-5.4	Codex	0.85	0.89	17.1M	2h
5	Composer 2.5	Cursor CLI	0.81	0.89	257K	5h 41m
6	Claude Opus 4.6	Claude Code	0.54	0.91	24.1M	19h 10m
7	GPT-5.5	Codex	0.54	0.90	31.5M	7h 26m
8	Claude Opus 4.8	Claude Code	0.36	0.91	18.1M	8h 9m
9	Claude Opus 4.7	Claude Code	0.36	0.90	38.8M	9h 45m
10	GLM-5.1	Claude Code	0.18	0.89	30.8M	11h 50m
11	DeepSeek V4 Pro	Claude Code	0.17	0.86	33.8M	11h 8m
12	Qwen3.6-Plus	Qwen Code	0.17	0.85	49.1M	3h 11m
13	Gemini 3.1 Pro	Gemini CLI	0.17	0.84	4.3M	3h 7m
14	Kimi K2.5	Kimi CLI	0.16	0.79	12.0M	2h 38m
15	Kimi K2.6	Kimi CLI	0.00	0.00	26.4M	8h 43m

Background

PCQM4Mv2 is a molecular property regression benchmark in which inputs are molecules represented by SMILES strings and 2D graph structure, and the target is related to the HOMO-LUMO gap. This version uses scaffold-based splits and a closed-data setup, so the task emphasizes generalization across molecular families without 3D geometry.

Task

Starting from PCQM4Mv2 train and dev splits (SMILES strings and 2D graph structure, no 3D geometry), the agent must train a molecular property predictor under a 50M parameter cap and deliver checkpoints plus a compliant predict.pyscript. The model predicts a quantum-chemistry target related to each molecule's HOMO-LUMO gap.

Produce checkpoints, a compliant prediction script, and dev predictions.
Stay in the closed-data regime: no external chemistry datasets and no external pretrained checkpoints.
Use 2D structure only; no 3D conformer generation or geometry tricks.

Evaluation

The verifier measures hidden-test mean absolute error, then maps that loss into the public score with an exponential transform so that higher remains better. Hard checks on parameter count, inference-time budget, and trace policy still zero the submission before the transformed score is reported.

Parameter counts are checked independently by the verifier.
Inference-time file access is traced so the model cannot read hidden labels or mutate the checkpoint in place.
The split is scaffold-based, not random.

Environment And Constraints

Agents get a single H100, 8 CPU cores, 64 GB RAM, and an eight-hour budget. The checked-in fixture is tiny for repository convenience, but the production task is built around a much larger PCQM4Mv2-style dataset and hidden holdout.