Key Takeaways

  • This paper introduces a new approach for solving ARC-AGI-3, a challenging benchmark for artificial intelligence built around abstract, interactive puzzles.
  • The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic.
  • We report results on the 25 public ARC-AGI-3 games.
  • Each recorded playthrough uses a fresh agent instance with no access to previous playthrough-specific files or conversation state.
  • Most games have a single recorded playthrough; for a few games, we report multiple independent fresh-agent playthroughs to expose run-to-run variability.
Paper Abstract

We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. We report results on the 25 public ARC-AGI-3 games. Each recorded playthrough uses a fresh agent instance with no access to previous playthrough-specific files or conversation state. Most games have a single recorded playthrough; for a few games, we report multiple independent fresh-agent playthroughs to expose run-to-run variability. The agent fully solved 7 games, achieved a Relative Human Action Efficiency greater than 75% on 6 games, and obtained a mean per-game RHAE of 32.58%. Because the system uses no game-specific code, it can serve as a game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents.

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

This paper introduces a new approach for solving ARC-AGI-3, a challenging benchmark for artificial intelligence that requires agents to solve abstract, interactive puzzles without explicit instructions. Instead of relying on trial and error within the game environment—which is costly and risky—the researchers developed a system where an AI coding agent builds, tests, and refines an "executable world model." This model is a Python codebase that simulates the game's rules, allowing the agent to plan and verify its actions internally before executing them in the real game.

How the System Works

The agent operates through a continuous loop of observation, modeling, verification, and planning. It maintains three primary Python components: one for reconstructing game states, one for predicting how the environment changes (the world model), and one for planning sequences of actions.
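The loop of observation, modeling, verification, and planning can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class names, toy dynamics, and depth-limited search are all assumptions chosen to show how verifying a model against recorded transitions and planning through it offline fit together.

```python
# Illustrative sketch of the observe -> model -> verify -> plan loop.
# All names and dynamics here are hypothetical; the paper does not
# publish its actual interfaces.

class WorldModel:
    """Executable model: predicts the next state for a given action."""
    def predict(self, state, action):
        # Toy dynamics: "right" increments the x coordinate, all
        # other actions leave the state unchanged.
        x, y = state
        return (x + 1, y) if action == "right" else (x, y)

def verify(model, history):
    """Replay recorded (state, action, next_state) transitions and
    return those the model mispredicts; an empty list means the
    model is consistent with everything observed so far."""
    return [(s, a, s2) for (s, a, s2) in history
            if model.predict(s, a) != s2]

def plan(model, state, goal, actions, max_depth=5):
    """Depth-limited search run entirely inside the model, so no
    real-environment actions are spent exploring."""
    if state == goal:
        return []
    if max_depth == 0:
        return None
    for a in actions:
        rest = plan(model, model.predict(state, a), goal, actions,
                    max_depth - 1)
        if rest is not None:
            return [a] + rest
    return None

model = WorldModel()
history = [((0, 0), "right", (1, 0)), ((1, 0), "right", (2, 0))]
mismatches = verify(model, history)  # empty: no contradicting evidence
route = plan(model, (0, 0), (3, 0), ["right", "up"])
```

Only once `verify` returns no mismatches and `plan` finds a route does the agent commit actions to the real game, which is the point of the model-mediated design.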
A key feature of this approach is the "refactoring loop." As the agent encounters new game levels, it is prompted to simplify its code, replacing complex, ad-hoc rules with more general abstractions. This serves as a practical proxy for the Minimum Description Length (MDL) principle, which suggests that the best explanations are those that are the most compact. By keeping the code simple and general, the agent is better prepared to handle new, unseen levels within the same game.
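A hypothetical before-and-after example illustrates the kind of simplification the refactoring loop targets. The game, cell names, and functions below are invented for illustration; the point is that the refactored rule is both shorter and more likely to transfer to unseen levels.

```python
# Hypothetical example of MDL-style refactoring: per-level special
# cases collapsed into one general, data-driven rule.

# Before refactoring: one ad-hoc branch per level observed so far.
def passable_v1(cell, level):
    if level == 1:
        return cell in ("floor", "door_open")
    if level == 2:
        return cell in ("floor", "door_open", "ice")
    return cell == "floor"

# After refactoring: a single rule with a shorter description that
# also generalizes to levels the agent has not yet seen.
PASSABLE = {"floor", "door_open", "ice"}

def passable_v2(cell, level=None):
    return cell in PASSABLE
```

Note that the two versions disagree on inputs outside the observed data (e.g. `"ice"` on level 1): the refactored rule generalizes where the ad-hoc one hard-codes, which is exactly the bet the MDL-style bias makes.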

Performance and Results

The researchers tested this system on all 25 public games available in the ARC-AGI-3 benchmark. The results were varied: the agent fully solved 7 games and achieved a high human-normalized efficiency score on 6 others. Across all games, it reached a mean Relative Human Action Efficiency (RHAE) of 32.58%.
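The summary does not spell out the exact RHAE formula. Assuming it compares the agent's action count against a human reference count, with fewer actions than the human baseline scoring higher and a cap at 100%, the reported mean would be a plain per-game average, along these lines:

```python
# Hedged sketch: this RHAE formula (human reference actions divided by
# agent actions, capped at 100%) is an assumption for illustration;
# the article only reports the resulting scores.

def rhae(human_actions, agent_actions):
    if agent_actions == 0:
        return 0.0
    return min(100.0, 100.0 * human_actions / agent_actions)

def mean_rhae(per_game):
    """per_game: list of (human_actions, agent_actions) pairs,
    one per game."""
    scores = [rhae(h, a) for h, a in per_game]
    return sum(scores) / len(scores)

games = [(50, 50), (40, 200), (30, 60)]  # illustrative counts only
print(round(mean_rhae(games), 2))        # -> 56.67
```

Under this reading, the reported mean of 32.58% averages games the agent solved efficiently together with games where it spent many more actions than a human, which is consistent with the high variance described next.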
The study noted significant performance differences between playthroughs of the same game. Because the agent makes open-ended decisions about how to model the environment, small differences in its initial hypotheses can lead to vastly different outcomes. While some games were mastered completely, others resulted in near-total failure, highlighting the sensitivity of the agent's early reasoning.

Limitations and Future Directions

The current implementation serves as a baseline rather than a finished product, and it faces two primary hurdles. First, the agent sometimes suffers from "tunnel vision," where it commits to an incorrect initial hypothesis about the game's rules and fails to consider alternatives. Second, there is a clear gap between building a correct model and using that model to plan effectively; even when the agent understands the game's dynamics, its planner may struggle to find the right sequence of moves.
To improve, the researchers suggest that future versions should incorporate better hypothesis tracking, more sophisticated planning skills, and a stricter enforcement of model-mediated execution. Testing the system on the private validation set will be the next critical step to determine if this approach truly generalizes across different types of games.
