FrogsGame is a logic puzzle, a variant of N-queens played on a colored grid.
Each board is an N x N grid divided into colored regions. The goal is to place exactly N frogs on the board, subject to two rules:
1. No two frogs can share a row, a column, or a color region. On the board below, crossed-out cells show the rest of the claimed row, column, and color region — all off limits.
2. No two frogs can touch, even diagonally. All eight surrounding cells are forbidden — king-move adjacency, just like the queens-and-knights variant of N-queens.
Together these constraints mean that a single frog placement eliminates a large fraction of the board. Solving the puzzle is hard because it requires satisfying the row, column, region, and spacing constraints simultaneously.
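Rule checks like these are straightforward to express in code. Below is a minimal sketch of a placement-legality check; the function name and the region-grid layout are assumptions for illustration, not the task's actual engine:

```python
def is_valid_placement(regions, frogs, row, col):
    """Check whether placing a frog at (row, col) keeps the board legal.

    regions: 2D list where regions[r][c] is the color-region id of cell (r, c).
    frogs: set of (r, c) positions already placed.
    """
    for fr, fc in frogs:
        if fr == row or fc == col:                      # shared row or column
            return False
        if regions[fr][fc] == regions[row][col]:        # shared color region
            return False
        if abs(fr - row) <= 1 and abs(fc - col) <= 1:   # king-move adjacency
            return False
    return True
```

Note how a single placed frog rules out its entire row, column, and region plus all eight neighbors, which is why the search space collapses quickly.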
This exact layout satisfies all four board constraints: one frog in every row, one in every column, one in every region, and no king-move neighbors.
We ask the agent to post-train a model to solve FrogsGame boards using tool use. The agent starts in a container with the Qwen3-8B tokenizer, task scaffolding that generates boards, and access to a remote training and inference API via Tinker. It must then post-train Qwen3-8B and submit a LoRA checkpoint.
The trained model interacts with each puzzle through six tools exposed as an XML tool-call schema. It starts blind — the prompt contains no board — and must call get_state first to discover the grid, regions, and colors. From there it runs a multi-turn loop: on each turn the model sees the full history of prior tool calls and their results, decides the next action, and receives the outcome. It keeps going until it calls submit or hits the 200-call cap.
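The multi-turn loop can be sketched as follows. The helper names (model_generate, execute_tool) and the exact XML tool-call syntax are illustrative assumptions, since the real schema isn't shown:

```python
import re

MAX_CALLS = 200  # per-episode tool-call cap

def run_episode(model_generate, execute_tool):
    """Drive one puzzle episode.

    Both callables are hypothetical stand-ins:
      model_generate(history) -> assistant text containing one tool call
      execute_tool(name, args) -> string result from the game engine
    """
    history = []  # full transcript of prior tool calls and their results
    for _ in range(MAX_CALLS):
        reply = model_generate(history)
        # Illustrative call format, e.g. <tool_call name="place_frog" args="3,4"/>
        match = re.search(r'<tool_call name="(\w+)"(?: args="([^"]*)")?\s*/>', reply)
        if match is None:
            break  # malformed output ends the episode
        name, args = match.group(1), match.group(2) or ""
        history.append((name, args, execute_tool(name, args)))
        if name == "submit":
            break  # episode ends on submit
    return history
```

The important property is that the model sees the accumulated history on every turn, so each tool result can inform the next placement.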
The six tools:

- get_state: Returns the board grid, placed frogs, board size, and color list. Must be called first.
- place_frog(row, col): Places a frog at a cell. Illegal moves are rejected and the frog is not placed.
- check_violations: Checks the rules and returns the frog count. Always valid, since place_frog rejects illegal moves.
- remove_frog(row, col): Removes a previously placed frog.
- reset: Clears all frogs and starts over.
- submit: Submits the current placement as the final answer. The episode ends.

The verifier generates 500 hidden boards across four difficulty tiers and measures how many of them the trained model is able to solve. Publicly we describe that as solve rate, but the underlying scorer computes the normalized score directly from the raw solved-board count.
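As a rough sketch of that scoring (the result layout and function name are assumptions; the real scorer's internals aren't shown), the normalized score is the solved fraction of the hidden boards, which can also be broken down per tier:

```python
from collections import Counter

def score_results(results):
    """results: list of (tier, solved) pairs, one per hidden board.

    Returns the normalized score (overall solved fraction) and a
    per-tier solve-rate breakdown. Sketch only, not the real verifier.
    """
    solved = sum(ok for _, ok in results)
    totals, wins = Counter(), Counter()
    for tier, ok in results:
        totals[tier] += 1
        wins[tier] += ok
    return solved / len(results), {t: wins[t] / totals[t] for t in totals}
```

With 500 boards, the normalized score and the public solve rate coincide up to scaling, since both derive from the same raw solved-board count.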
The task runs in a Modal container with 8 CPUs, 32 GB of RAM, no local GPU, and a narrow domain allowlist. The container image includes the game engine, tokenizer, and scripts under /app, but training and inference happen remotely through the Tinker API. The container mostly acts as an orchestration layer for board generation and remote job launches.
prepare.py is immutable: the verifier hashes /app/prepare.py and hard-fails any submission that changes it. The agent submits by writing a tinker:// checkpoint path to /app/checkpoint/path.txt; the verifier downloads that checkpoint and evaluates it independently instead of trusting results.json.
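The verifier's integrity checks can be sketched like this. The function name, arguments, and error messages are assumptions for illustration, not the real verifier's API:

```python
import hashlib
from pathlib import Path

def verify_submission(app_dir, expected_sha256):
    """Sketch of the two integrity checks described above.

    1) Hard-fail if /app/prepare.py was modified (hash mismatch).
    2) Read the tinker:// checkpoint path that will be downloaded
       and evaluated independently.
    """
    prep = Path(app_dir) / "prepare.py"
    if hashlib.sha256(prep.read_bytes()).hexdigest() != expected_sha256:
        raise RuntimeError("prepare.py was modified; submission rejected")
    ckpt = (Path(app_dir) / "checkpoint" / "path.txt").read_text().strip()
    if not ckpt.startswith("tinker://"):
        raise RuntimeError("expected a tinker:// checkpoint path")
    return ckpt
```

Because the checkpoint is re-evaluated from the tinker:// path, a submission cannot pass by fabricating results.json; only the checkpoint's actual behavior on the hidden boards counts.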