Previous TaskSGLang Inference System Optimization Next TaskWan 2.1 on MAX/Mojo

01 Implementation

PostgreSQL 18 on SQLite

Results

#ModelSuccess RateAvgBest

Claude Fable 5

Claude Code

0/517%17%

GLM-5.2

Claude Code

0/515%15%

Claude Opus 4.8

Claude Code

0/513%16%

Kimi K2.6

Kimi CLI

0/513%15%

Composer 2.5

Cursor CLI

0/512%14%

Claude Opus 4.7

Claude Code

0/511%16%

Grok 4.5

Grok CLI

0/511%16%

GPT-5.5

Codex

0/59.9%15%

GPT-5.4

Codex

0/58.7%15%

DeepSeek V4 Pro

Claude Code

0/58.7%15%

GLM-5.1

Claude Code

0/55.7%15%

Gemini 3.1 Pro

Gemini CLI

0/54.2%7.0%

Claude Opus 4.6

Claude Code

0/53.2%16%

Kimi K2.5

Kimi CLI

0/50%0%

Qwen3.6-Plus

Qwen Code

0/50%0%

#	Model	Harness	Success Rate	Avg Tests Passed	Best Tests Passed	Avg Tokens	Avg Time
1	Claude Fable 5	Claude Code	0/5	17%	17%	664.4M	15h 2m
2	GLM-5.2	Claude Code	0/5	15%	15%	161.4M	3h 29m
3	Claude Opus 4.8	Claude Code	0/5	13%	16%	15.8M	51m
4	Kimi K2.6	Kimi CLI	0/5	13%	15%	16.6M	3h 56m
5	Composer 2.5	Cursor CLI	0/5	12%	14%	3.4M	20m
6	Claude Opus 4.7	Claude Code	0/5	11%	16%	49.0M	2h 5m
7	Grok 4.5	Grok CLI	0/5	11%	16%	6.0M	21m
8	GPT-5.5	Codex	0/5	9.9%	15%	19.3M	28m
9	GPT-5.4	Codex	0/5	8.7%	15%	29.7M	4h 35m
10	DeepSeek V4 Pro	Claude Code	0/5	8.7%	15%	42.9M	2h
11	GLM-5.1	Claude Code	0/5	5.7%	15%	28.0M	11h 33m
12	Gemini 3.1 Pro	Gemini CLI	0/5	4.2%	7.0%	1.8M	19m
13	Claude Opus 4.6	Claude Code	0/5	3.2%	16%	185.3M	9h 30m
14	Kimi K2.5	Kimi CLI	0/5	0%	0%	25.2M	3h 35m
15	Qwen3.6-Plus	Qwen Code	0/5	0%	0%	31.7M	1h 34m

Background

PostgreSQL runs as its own process that clients connect to over TCP. After login, the client and server exchange queries using a fixed binary message format called the PostgreSQL wire protocol. It defines every byte that goes over the socket: how startup and authentication work, how SQL strings and parameters are sent, and how result rows, errors, and ready-for-query signals come back. On the server, every table, column, type, and function is described by rows in a set of system tables called the catalog (pg_class, pg_type, pg_proc, pg_namespace, etc.). Tools like psql, ORMs, and migration frameworks read the catalog constantly to figure out the schema, so reproducing its shape matters as much as answering queries.

SQLite skips all of that. It is a small C library your app links against and calls directly, with no network, no login, and no separate process. The whole database is a single file, and its internal schema lives in a much smaller table (sqlite_master). The SQL dialect also diverges from PostgreSQL in real ways. It has looser typing, different date/time functions, a different system catalog, and different rules for things like ALTER TABLE.

PostgreSQL also ships admin tools that deployments depend on: initdb creates a new data directory with the expected layout, and pg_ctl starts and stops the server, manages its PID file, and forwards signals. A drop-in has to act like these tools, not just like the server.

Task

Starting from PostgreSQL 18.3 documentation, a tiny Zig scaffold, and access to SQLite3 and libc, the agent must build a single binary that PostgreSQL clients can connect to normally, while storing data in SQLite underneath. The same compiled binary is expected to stand in for postgres, initdb, and pg_ctl by switching behavior based on argv[0].

Implement the PostgreSQL wire protocol, including startup and query flow.
Translate SQL and catalog behavior onto SQLite-backed storage.
Support lifecycle operations like cluster init, startup, shutdown, ports, and socket paths.
Convince clients and test harnesses that they are talking to PostgreSQL 18.3.

Evaluation

The hidden verifier combines three kinds of tests: PostgreSQL's own regression tests (which compare SQL output character-for-character), integration tests that stress lifecycle management and authentication, and 60 extra smoke tests covering ordinary database operations.

Regression output must match PostgreSQL very closely.
Integration tests check that the server handles connections, authentication, and lifecycle correctly.
The added smoke tests cover ordinary SQL behaviors such as inserts, selects, and schema changes.

No model was able to complete this task successfully, so we used overall test pass rate as a partial reward to rank models. A submission that passes half of the hidden checks across those three buckets scores exactly 0.5.

Environment And Constraints

The task runs in a Modal container with 8 CPUs, 32 GB RAM, and no internet access. The container image stages/app/postgres-sqlite, the offline PostgreSQL docs tree, and the stock psql client. PostgreSQL source code is not present, but the environment does provide the tiny Zig scaffold plussqlite3 and libc system linking, with no external packages or ready-made protocol libraries.

The implementation is intended to be native Zig rather than a thin wrapper around PostgreSQL components.
The verifier rejects external Zig packages entirely and dependency scans reject PostgreSQL-related system libraries such as pg, libpq, pgcommon, and pgport, plus Zig imports like pgwire, postgres, postgresql, libpq, or pq.
The hidden verifier owns the real PostgreSQL binaries, tests, and client harness setup.