FrontierSWE is a benchmark that tests coding agents at the limits of human ability. We collect ultra-long-horizon technical challenges from domains such as performance engineering, computational science, and ML research.
Even with a 24-hour time budget per task, frontier models make little progress. Tasks are sourced from partner companies, researchers, and engineers so that they reflect real-world problems.
FrontierSWE is built by Proximal, a research company working on evaluations and infrastructure for frontier coding agents.