
FrogsGame Post-Training

Results

#  Model            Agent        Avg   Best
1  GPT-5.4          Codex        2.0%  3.4%
2  Claude Opus 4.6  Claude Code  1.0%  3.0%
3  Gemini 3.1 Pro   Gemini CLI   0%    0%
4  Kimi K2.5        Kimi CLI     0%    0%
5  Qwen3.6-Plus     Qwen Code    0%    0%

Background

FrogsGame is a logic puzzle, a variant of N-queens played on a colored grid.

Each board is an N x N grid divided into colored regions. The goal is to place exactly N frogs on the board, subject to two rules:

1. No two frogs can share a row, a column, or a color region. On the board below, crossed-out cells show the rest of the claimed row, column, and color region — all off limits.

2. No two frogs can touch, even diagonally. All eight surrounding cells are forbidden — king-move adjacency, just like the queens-and-knights variant of N-queens.

Together these constraints mean that a single frog placement eliminates a large fraction of the board. Solving the puzzle is hard because it requires balancing row, column, region, and spacing simultaneously.
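The two rules above can be sketched as a legality check. This is an illustrative helper, not the task's actual game engine: `regions` is assumed to map each (row, col) cell to its color-region label, and `frogs` is the set of occupied cells.

```python
# Hypothetical legality check for a FrogsGame placement -- a sketch,
# not the real engine. `regions` maps (row, col) -> region label;
# `frogs` is a set of (row, col) positions already on the board.

def is_legal(frogs, regions, row, col):
    """Return True if placing a frog at (row, col) breaks no rule."""
    for r, c in frogs:
        # Rule 1: no shared row, column, or color region.
        if r == row or c == col or regions[(r, c)] == regions[(row, col)]:
            return False
        # Rule 2: no king-move adjacency (all eight neighbors forbidden).
        if abs(r - row) <= 1 and abs(c - col) <= 1:
            return False
    return True
```

Note how a single frog at (0, 0) already rules out its entire row, column, region, and all eight surrounding cells, which is why each placement eliminates a large fraction of the board.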

Figure: 6 x 6 board with one legal frog placement (six color regions A through F; grid graphic not reproduced here).

This exact layout satisfies all four board constraints: one frog in every row, one in every column, one in every region, and no king-move neighbors.

Task

We ask the agent to post-train a model to solve FrogsGame boards using tool use. The agent starts in a container with the Qwen3-8B tokenizer, task scaffolding that generates boards, and access to a remote training and inference API via Tinker. It must then post-train Qwen3-8B and submit a LoRA checkpoint.

The trained model interacts with each puzzle through six tools exposed as an XML tool-call schema. It starts blind — the prompt contains no board — and must call get_state first to discover the grid, regions, and colors. From there it runs a multi-turn loop: on each turn the model sees the full history of prior tool calls and their results, decides the next action, and receives the outcome. It keeps going until it calls submit or hits the 200-call cap.

Example opening exchange:

Prompt: "Solve this Frog Placement Game puzzle."
Tool call: get_state
Tool result: { board, frogs, n, colors }
Available Tools

  • get_state: Returns the board grid, placed frogs, board size, and color list. Must be called first.
  • place_frog(row, col): Places a frog at a cell. Illegal moves are rejected; the frog is not placed.
  • check_violations: Checks the rules and returns the frog count. Always valid, since place_frog rejects illegal moves.
  • remove_frog(row, col): Removes a previously placed frog.
  • reset: Clears all frogs and starts over.
  • submit: Submits the current placement as the final answer. Episode ends.
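The multi-turn loop described above can be sketched as follows. The six tool names match the documented interface; the `model` callable, the `env.call` dispatcher, and the exact XML tool-call syntax are illustrative assumptions, not the real scaffold.

```python
# Minimal sketch of the multi-turn tool loop (assumed XML syntax,
# e.g. <tool>place_frog(1, 2)</tool>); not the task's actual scaffold.
import re

MAX_CALLS = 200  # documented per-episode tool-call cap

def run_episode(model, env):
    history = ["Solve this Frog Placement Game puzzle."]
    for _ in range(MAX_CALLS):
        reply = model(history)  # model sees the full prior history each turn
        m = re.search(r"<tool>(\w+)(?:\((\d+),\s*(\d+)\))?</tool>", reply)
        if m is None:
            break  # no tool call parsed; end the episode
        name = m.group(1)
        args = [int(a) for a in m.groups()[1:] if a]
        result = env.call(name, *args)  # e.g. get_state, place_frog(row, col)
        history += [reply, str(result)]
        if name == "submit":  # submit ends the episode immediately
            break
    return history
```

The key property is that the model starts blind: nothing in the prompt describes the board, so the first useful action is always get_state.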

Evaluation

The verifier generates 500 hidden boards across four difficulty tiers and measures how many of them the trained model solves. Publicly this is described as a solve rate, but the underlying scorer computes the normalized score directly from the raw count of solved boards.

Scoring Formula
score = S / 500
  • S = hidden boards solved by the submitted checkpoint
  • The raw verifier stores S, while the site presents the corresponding solve rate.
  • Board sizes span 6x6 through 13x13.
  • The evaluation prompt and tool interface are frozen by prepare.py.
  • Training-time access to hidden solutions is blocked by dedicated anti-cheat checks.
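As a concrete illustration of the formula, the leaderboard's best result of 3.4% corresponds to 17 of the 500 hidden boards:

```python
# Illustrative only: the published solve rate is the normalized score.
def score(solved: int, total: int = 500) -> float:
    return solved / total

best = score(17)  # 17 boards solved out of 500 -> 0.034, i.e. 3.4%
```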

Environment

The task runs in a Modal container with 8 CPUs, 32 GB RAM, no local GPU, and a narrow domain allowlist. The container image includes the game engine, tokenizer, and scripts under /app, but training and inference happen remotely through the Tinker API. The container mostly acts as an orchestration layer for board generation and remote job launches.

Constraints

  • prepare.py is immutable: the verifier hashes /app/prepare.py and hard-fails any submission that changes it.
  • The scored artifact is a tinker:// checkpoint path in /app/checkpoint/path.txt; the verifier downloads that checkpoint and evaluates it independently instead of trusting results.json.
  • Runtime network access is restricted to the job allowlist for Tinker and Hugging Face domains.
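The immutability check on prepare.py amounts to comparing a file digest against a frozen value. A minimal sketch, assuming SHA-256 and a hypothetical FROZEN_DIGEST recorded when the task was built (the real verifier's digest and failure handling are not specified here):

```python
# Sketch of the prepare.py immutability check described above.
# FROZEN_DIGEST is a placeholder, not the task's actual hash value.
import hashlib
from pathlib import Path

def file_sha256(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Verifier-side usage (illustrative):
# if file_sha256("/app/prepare.py") != FROZEN_DIGEST:
#     raise RuntimeError("prepare.py was modified; submission rejected")
```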

Caveats

Difficulty Caveat
Baseline Qwen3-8B performance is effectively zero on the hidden set. That makes the task harsher on plain RL recipes than many post-training tasks and increases the importance of explicit bootstrapping strategies.