Executable World Models for ARC-AGI-3 in the Era of Coding Agents
This paper introduces a new approach for solving ARC-AGI-3, a challenging benchmark for artificial intelligence that requires agents to solve abstract, interactive puzzles without explicit instructions. Instead of relying on trial and error within the game environment—which is costly and risky—the researchers developed a system where an AI coding agent builds, tests, and refines an "executable world model." This model is a Python codebase that simulates the game's rules, allowing the agent to plan and verify its actions internally before executing them in the real game.
How the System Works
The agent operates through a continuous loop of observation, modeling, verification, and planning. It maintains three primary Python components: one for reconstructing game states, one for predicting how the environment changes (the world model), and one for planning sequences of actions.
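The division of labor between the world model and the planner can be sketched in miniature. This is a toy illustration, not the paper's actual codebase: the `WorldModel` class, the lookup-table transition rules, and the `plan` function are all hypothetical names and simplifications. The key idea it demonstrates is that planning happens by searching over the *model's* predictions rather than by acting in the real game.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """Hypothetical world model: predicts the next state from (state, action).

    In the real system this would be learned Python code capturing the game's
    rules; here it is a trivial lookup table so the sketch stays self-contained.
    """
    rules: dict = field(default_factory=dict)

    def predict(self, state, action):
        # Unknown transitions default to "nothing changes".
        return self.rules.get((state, action), state)

def plan(model, start, goal, actions, max_depth=5):
    """Breadth-first search over the model (not the environment) for an
    action sequence reaching `goal`, allowing the agent to verify a plan
    internally before executing it."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        if len(path) >= max_depth:
            continue
        for a in actions:
            nxt = model.predict(state, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [a]))
    return None  # no plan found within the depth budget

# Toy usage: a 1-D world where "right" moves the agent one cell forward.
model = WorldModel(rules={(i, "right"): i + 1 for i in range(5)})
print(plan(model, start=0, goal=3, actions=["right"]))  # ['right', 'right', 'right']
```

The planner never touches the real game here; if the model is wrong, the plan fails on execution, and that mismatch is exactly the signal the agent's verification step feeds back into model refinement.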
A key feature of this approach is the "refactoring loop." As the agent encounters new game levels, it is prompted to simplify its code, replacing complex, ad-hoc rules with more general abstractions. This serves as a practical proxy for the Minimum Description Length (MDL) principle, which suggests that the best explanations are those that are the most compact. By keeping the code simple and general, the agent is better prepared to handle new, unseen levels within the same game.
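The MDL-flavored refactoring step can be illustrated with a deliberately small example. The functions below are hypothetical, not taken from the paper: they stand in for per-level, ad-hoc rules that the refactoring loop would collapse into one shorter, parameterized rule that also covers unseen levels.

```python
# Before refactoring: ad-hoc movement rules learned separately per level.
def next_pos_level1(pos):
    # Hypothetical rule for a level with a 10-cell grid.
    return pos + 1 if pos < 9 else 9

def next_pos_level2(pos):
    # Same behavior rediscovered for a 20-cell grid.
    return pos + 1 if pos < 19 else 19

# After refactoring: one general rule parameterized by grid width.
# Shorter to state (an MDL-style compression) and valid for any level size.
def next_pos(pos, width):
    return min(pos + 1, width - 1)

# The general rule reproduces both ad-hoc rules...
assert next_pos(8, 10) == next_pos_level1(8)
assert next_pos(19, 20) == next_pos_level2(19)
# ...and extends to a level size never seen before.
print(next_pos(49, 50))  # 49: clamped at the new grid's edge
```

The compression is the point: the refactored codebase encodes the same observed behavior in fewer, more general rules, which is what positions the agent to handle later levels of the same game.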
Performance and Results
The researchers tested this system on all 25 public games available in the ARC-AGI-3 benchmark. The results were varied: the agent fully solved 7 games and achieved a high human-normalized efficiency score on 6 others. Across all games, it reached a mean Relative Human Action Efficiency (RHAE) of 32.58%.
The study noted significant performance differences between playthroughs of the same game. Because the agent makes open-ended decisions about how to model the environment, small differences in its initial hypotheses can lead to vastly different outcomes. While some games were mastered completely, others resulted in near-total failure, highlighting the sensitivity of the agent's early reasoning.
Limitations and Future Directions
The current implementation serves as a baseline rather than a finished product, and it faces two primary hurdles. First, the agent sometimes suffers from "tunnel vision," where it commits to an incorrect initial hypothesis about the game's rules and fails to consider alternatives. Second, there is a clear gap between building a correct model and using that model to plan effectively; even when the agent understands the game's dynamics, its planner may struggle to find the right sequence of moves.
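The gap between a correct model and effective planning can be made concrete with a toy example. Everything below is an illustrative assumption, not the paper's setup: the dynamics in `step` are perfectly accurate, yet a depth-limited planner still fails because the solution requires an initial detour longer than its search horizon.

```python
def step(state, action):
    # Correct dynamics for a toy 6-cell corridor: a key sits at cell 0,
    # the door at cell 5 only counts as solved once the key is held.
    pos, has_key = state
    if action == "left":
        return (max(pos - 1, 0), has_key or pos - 1 == 0)
    if action == "right":
        return (min(pos + 1, 5), has_key)
    return state

def solved(state):
    return state == (5, True)

def search(state, depth):
    """Depth-limited planner: returns an action list, or None if no
    solution exists within `depth` moves."""
    if solved(state):
        return []
    if depth == 0:
        return None
    for a in ("left", "right"):
        sub = search(step(state, a), depth - 1)
        if sub is not None:
            return [a] + sub
    return None

# Starting at cell 3 without the key, the shortest solution is 8 moves:
# walk left to fetch the key, then right to the door.
print(search((3, False), depth=2))  # None: correct model, horizon too short
print(search((3, False), depth=8))  # the full 8-move plan
```

The transition function here is flawless, so no amount of model refinement would fix the failure at `depth=2`; only a stronger planner would. This mirrors the paper's observation that modeling competence and planning competence are separate bottlenecks.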
To improve, the researchers suggest that future versions should incorporate better hypothesis tracking, more sophisticated planning skills, and a stricter enforcement of model-mediated execution. Testing the system on the private validation set will be the next critical step to determine if this approach truly generalizes across different types of games.