
FrogsGame Post-Training

Results

#  Model            Agent        Avg   Best
1  GPT-5.4          Codex        2.0%  3.4%
2  Claude Opus 4.6  Claude Code  1.0%  3.0%
3  Gemini 3.1 Pro   Gemini CLI   0%    0%
4  Kimi K2.5        Kimi CLI     0%    0%
5  Qwen3.6-Plus     Qwen Code    0%    0%

Background

FrogsGame is a logic puzzle, a variant of N-queens played on a colored grid.

Each board is an N x N grid divided into colored regions. The goal is to place exactly N frogs on the board, subject to two rules:

1. No two frogs can share a row, a column, or a color region. On the board below, crossed-out cells show the rest of the claimed row, column, and color region — all off limits.

2. No two frogs can touch, even diagonally. All eight surrounding cells are forbidden — king-move adjacency, just like the queens-and-knights variant of N-queens.

Together these constraints mean that a single frog placement eliminates a large fraction of the board. Solving the puzzle is hard because it requires balancing row, column, region, and spacing simultaneously.
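The two rules above can be sketched as a legality check. This is an illustrative helper, not the task's actual game engine: `regions` is assumed to map each (row, col) cell to its color-region label, and `frogs` is the set of occupied cells.

```python
# Hypothetical legality check for a FrogsGame placement -- a sketch,
# not the real engine. `regions` maps (row, col) -> region label;
# `frogs` is a set of (row, col) positions already on the board.

def is_legal(frogs, regions, row, col):
    """Return True if placing a frog at (row, col) breaks no rule."""
    for r, c in frogs:
        # Rule 1: no shared row, column, or color region.
        if r == row or c == col or regions[(r, c)] == regions[(row, col)]:
            return False
        # Rule 2: no king-move adjacency (all eight neighbors forbidden).
        if abs(r - row) <= 1 and abs(c - col) <= 1:
            return False
    return True
```

Note how a single frog at (0, 0) already rules out its entire row, column, region, and all eight surrounding cells, which is why each placement eliminates a large fraction of the board.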

Figure: 6 x 6 board with one legal frog placement (six color regions A through F; grid graphic not reproduced here).

This exact layout satisfies all four board constraints: one frog in every row, one in every column, one in every region, and no king-move neighbors.

Task

We ask the agent to post-train a model to solve FrogsGame boards using tool use. The agent starts in a container with the Qwen3-8B tokenizer, task scaffolding that generates boards, and access to a remote training and inference API via Tinker. It must then post-train Qwen3-8B and submit a LoRA checkpoint.

The trained model interacts with each puzzle through six tools exposed as an XML tool-call schema. It starts blind — the prompt contains no board — and must call get_state first to discover the grid, regions, and colors. From there it runs a multi-turn loop: on each turn the model sees the full history of prior tool calls and their results, decides the next action, and receives the outcome. It keeps going until it calls submit or hits the 200-call cap.

Example opening exchange:

Prompt: "Solve this Frog Placement Game puzzle."
Tool call: get_state
Tool result: { board, frogs, n, colors }
Available Tools

  • get_state: Returns the board grid, placed frogs, board size, and color list. Must be called first.
  • place_frog(row, col): Places a frog at a cell. Illegal moves are rejected; the frog is not placed.
  • check_violations: Checks the rules and returns the frog count. Always valid, since place_frog rejects illegal moves.
  • remove_frog(row, col): Removes a previously placed frog.
  • reset: Clears all frogs and starts over.
  • submit: Submits the current placement as the final answer. Episode ends.
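The multi-turn loop described above can be sketched as follows. The six tool names match the documented interface; the `model` callable, the `env.call` dispatcher, and the exact XML tool-call syntax are illustrative assumptions, not the real scaffold.

```python
# Minimal sketch of the multi-turn tool loop (assumed XML syntax,
# e.g. <tool>place_frog(1, 2)</tool>); not the task's actual scaffold.
import re

MAX_CALLS = 200  # documented per-episode tool-call cap

def run_episode(model, env):
    history = ["Solve this Frog Placement Game puzzle."]
    for _ in range(MAX_CALLS):
        reply = model(history)  # model sees the full prior history each turn
        m = re.search(r"<tool>(\w+)(?:\((\d+),\s*(\d+)\))?</tool>", reply)
        if m is None:
            break  # no tool call parsed; end the episode
        name = m.group(1)
        args = [int(a) for a in m.groups()[1:] if a]
        result = env.call(name, *args)  # e.g. get_state, place_frog(row, col)
        history += [reply, str(result)]
        if name == "submit":  # submit ends the episode immediately
            break
    return history
```

The key property is that the model starts blind: nothing in the prompt describes the board, so the first useful action is always get_state.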

Evaluation

The verifier generates 500 hidden boards across four difficulty tiers and measures how many of them the trained model solves. Publicly this is described as a solve rate, but the underlying scorer computes the normalized score directly from the raw count of solved boards.

Scoring Formula
score = S / 500
  • S = hidden boards solved by the submitted checkpoint
  • The raw verifier stores S, while the site presents the corresponding solve rate.
  • Board sizes span 6x6 through 13x13.
  • The evaluation prompt and tool interface are frozen by prepare.py.
  • Training-time access to hidden solutions is blocked by dedicated anti-cheat checks.
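As a concrete illustration of the formula, the leaderboard's best result of 3.4% corresponds to 17 of the 500 hidden boards:

```python
# Illustrative only: the published solve rate is the normalized score.
def score(solved: int, total: int = 500) -> float:
    return solved / total

best = score(17)  # 17 boards solved out of 500 -> 0.034, i.e. 3.4%
```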

Environment

The task runs in a Modal container with 8 CPUs, 32 GB RAM, no local GPU, and a narrow domain allowlist. The container image includes the game engine, tokenizer, and scripts under /app, but training and inference happen remotely through the Tinker API. The container mostly acts as an orchestration layer for board generation and remote job launches.

Constraints

  • prepare.py is immutable: the verifier hashes /app/prepare.py and hard-fails any submission that changes it.
  • The scored artifact is a tinker:// checkpoint path in /app/checkpoint/path.txt; the verifier downloads that checkpoint and evaluates it independently instead of trusting results.json.
  • Runtime network access is restricted to the job allowlist for Tinker and Hugging Face domains.
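The immutability check on prepare.py amounts to comparing a file digest against a frozen value. A minimal sketch, assuming SHA-256 and a hypothetical FROZEN_DIGEST recorded when the task was built (the real verifier's digest and failure handling are not specified here):

```python
# Sketch of the prepare.py immutability check described above.
# FROZEN_DIGEST is a placeholder, not the task's actual hash value.
import hashlib
from pathlib import Path

def file_sha256(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Verifier-side usage (illustrative):
# if file_sha256("/app/prepare.py") != FROZEN_DIGEST:
#     raise RuntimeError("prepare.py was modified; submission rejected")
```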

Caveats

Difficulty Caveat
Baseline Qwen3-8B performance is effectively zero on the hidden set. That makes the task harsher on plain RL recipes than many post-training tasks and increases the importance of explicit bootstrapping strategies.