Autonomous agents are increasingly expected to improve their own decision-making policies through feedback, yet current evaluation methods often cannot tell genuine improvement apart from simple trial and error. This paper introduces "Autonomous Policy Evolution," a controlled evaluation setting that measures how effectively an agent can refine an executable policy within a strict interaction budget. To implement it, the authors present EvoPolicyGym, a benchmark in which agents iteratively edit policy code based on feedback from simulated environments, so that the process of evolution itself, and not just the final result, is what gets measured.
The EvoPolicyGym Framework
EvoPolicyGym moves away from open-ended software engineering tasks, which are difficult to control and to trace. Instead, it uses a structured loop: an agent is given a workspace and a fixed budget of episodes. The agent submits a policy, receives feedback from the environment (such as performance metrics and diagnostic data), and then uses that information to edit the policy code. This cycle repeats until the budget is exhausted. Because the environment and the interaction rules stay fixed, researchers can see exactly how an agent spends its limited episode budget and how it translates failure signals into successful code revisions.
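The submit, feedback, edit cycle described above can be sketched in a few lines. This is a minimal illustration and not the EvoPolicyGym API: the names `evolve`, `Feedback`, `toy_evaluate`, and `toy_revise` are hypothetical, the "environment" is a toy regression target, and the "agent" is a fixed rule that bumps a constant in the policy source.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    episode: int
    score: float  # higher is better; 0.0 is a perfect score in the toy task


def evolve(policy_src: str,
           evaluate: Callable[[str], float],
           revise: Callable[[str, Feedback], str],
           budget: int) -> Tuple[str, List[Tuple[str, Feedback]]]:
    """Run a fixed number of episodes, recording every submitted policy."""
    trace: List[Tuple[str, Feedback]] = []
    for episode in range(budget):
        fb = Feedback(episode, evaluate(policy_src))  # submit and get feedback
        trace.append((policy_src, fb))                # keep the full trajectory
        policy_src = revise(policy_src, fb)           # edit the policy code
    return policy_src, trace


def toy_evaluate(policy_src: str) -> float:
    """Hypothetical environment: the ideal action is 3 * observation."""
    namespace: dict = {}
    exec(policy_src, namespace)  # load the policy's act() function
    act = namespace["act"]
    observations = [1.0, 2.0, 3.0]
    errors = [abs(act(o) - 3.0 * o) for o in observations]
    return -sum(errors) / len(errors)


def toy_revise(policy_src: str, fb: Feedback) -> str:
    """Hypothetical agent: raise GAIN by 0.5 until the score stops being negative."""
    if fb.score >= 0.0:
        return policy_src
    gain = float(re.search(r"GAIN = ([\d.]+)", policy_src).group(1))
    return re.sub(r"GAIN = [\d.]+", f"GAIN = {gain + 0.5}", policy_src)


INITIAL_POLICY = "GAIN = 1.0\ndef act(obs):\n    return GAIN * obs\n"

final_policy, trace = evolve(INITIAL_POLICY, toy_evaluate, toy_revise, budget=8)
```

Here the budget is counted in episodes, so every call to `evaluate` consumes one unit. Keeping the `(policy, feedback)` pairs in `trace` is what makes it possible to study the path an agent took, and not only where it ended up.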
Evaluating Agent Performance
The researchers tested four different AI models on the "Core16" suite, a collection of 16 diverse environments ranging from classic control tasks and robotics to navigation challenges. Each model was given a 128-episode budget to optimize its policy. The results showed that GPT-5.5 achieved the strongest overall performance, ranking in the top two across all 16 environments. Claude Opus 4.7 also performed well, particularly in the MiniGrid family of tasks. The study highlights that the top-ranked agents are the ones that stay competitive across the whole suite, not those that excel only in isolated environments.
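A claim such as "top two across all environments" is a statement about per-environment ranks, not about an average score. The sketch below shows one way such a check could be computed. The function names are hypothetical, and the scores are placeholder values invented for illustration; they are not results from the paper.

```python
from typing import Dict


def per_env_ranks(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, int]]:
    """Map scores[env][model] to ranks[env][model], where 1 is best.

    Assumes higher scores are better. Ties are broken by insertion order.
    """
    ranks: Dict[str, Dict[str, int]] = {}
    for env, by_model in scores.items():
        ordered = sorted(by_model, key=by_model.get, reverse=True)
        ranks[env] = {model: i + 1 for i, model in enumerate(ordered)}
    return ranks


def always_top_k(ranks: Dict[str, Dict[str, int]], model: str, k: int = 2) -> bool:
    """True if the model ranks within the top k in every environment."""
    return all(by_model[model] <= k for by_model in ranks.values())


# Placeholder numbers for illustration only (not from the paper).
PLACEHOLDER_SCORES = {
    "env_1": {"model_a": 0.9, "model_b": 0.7, "model_c": 0.4},
    "env_2": {"model_a": 0.5, "model_b": 0.8, "model_c": 0.6},
}

ranks = per_env_ranks(PLACEHOLDER_SCORES)
consistent = {m: always_top_k(ranks, m) for m in ("model_a", "model_b", "model_c")}
```

In this toy table `model_a` has the single best score in one environment but falls to third place in the other, so only `model_b` passes the "always top two" check. That is the difference between consistent and isolated success.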
Beyond Final Scores
A key feature of this benchmark is its ability to provide "trajectory-level diagnostics." Because the system records every edit and feedback loop, researchers can look past the final score to understand the agent's decision-making process. The analysis reveals that successful policy evolution is not just about winning a task; it is about the agent's ability to discover the right mechanisms for the environment and refine its code efficiently within a limited budget. By auditing these traces, the authors can distinguish between agents that simply tune parameters and those that make more complex structural changes to their code.
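One simple way to separate parameter tuning from structural edits is to compare the syntax trees of two consecutive policy versions with all numeric literals masked out. The sketch below uses Python's standard `ast` module. It is a heuristic chosen for illustration, not the auditing procedure used by the authors, and `classify_edit` is a hypothetical name.

```python
import ast


class _MaskNumbers(ast.NodeTransformer):
    """Replace every int or float literal with 0 so only code structure remains."""

    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        is_number = isinstance(node.value, (int, float)) and not isinstance(node.value, bool)
        if is_number:
            return ast.copy_location(ast.Constant(value=0, kind=None), node)
        return node


def _skeleton(src: str) -> str:
    """Dump the AST of src with numeric literals masked."""
    return ast.dump(_MaskNumbers().visit(ast.parse(src)))


def classify_edit(before: str, after: str) -> str:
    """Return 'none', 'parameter', or 'structural' for one policy revision.

    'none' also covers comment-only and whitespace-only changes, because
    those do not alter the AST. A sign flip such as 1.0 -> -1.0 counts as
    structural here, since it adds a unary-minus node.
    """
    if ast.dump(ast.parse(before)) == ast.dump(ast.parse(after)):
        return "none"
    if _skeleton(before) == _skeleton(after):
        return "parameter"
    return "structural"


BEFORE = "GAIN = 1.0\ndef act(obs):\n    return GAIN * obs\n"
TUNED = "GAIN = 2.5\ndef act(obs):\n    return GAIN * obs\n"
REWRITTEN = (
    "GAIN = 1.0\n"
    "def act(obs):\n"
    "    if obs > 0:\n"
    "        return GAIN * obs\n"
    "    return 0.0\n"
)

labels = [classify_edit(BEFORE, v) for v in (BEFORE, TUNED, REWRITTEN)]
```

Applied to every consecutive pair in a recorded trajectory, a classifier like this would give a rough count of how often an agent only adjusted constants and how often it changed control flow.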
Implications for Future Research
EvoPolicyGym serves as both a leaderboard for comparing AI models and a diagnostic tool for understanding how self-evolving systems interact with their environment. It enforces a strict visibility boundary: training feedback is visible to the agent, while validation and final testing remain hidden. This design rewards agents that learn to generalize and penalizes those that merely overfit to visible data. The approach provides a more rigorous way to study how autonomous agents can be trained to improve their own behavior in a reliable, measurable, and efficient manner.
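The visibility boundary can be pictured as a gate between the agent and the evaluation harness. The sketch below is an assumption about how such a boundary might be enforced, not the benchmark's actual code. `VisibilityGate`, the split names, the seeds, and the toy episode function are all hypothetical.

```python
from typing import Callable, Dict, Sequence

Policy = Callable[[int], int]

VISIBLE_SPLITS = frozenset({"train"})  # the only split the agent may query


class VisibilityGate:
    """Expose training feedback to the agent and keep other splits hidden."""

    def __init__(self,
                 run_episode: Callable[[Policy, int], float],
                 seeds: Dict[str, Sequence[int]]) -> None:
        self._run_episode = run_episode
        self._seeds = seeds

    def _mean_score(self, policy: Policy, split: str) -> float:
        results = [self._run_episode(policy, seed) for seed in self._seeds[split]]
        return sum(results) / len(results)

    def agent_feedback(self, policy: Policy, split: str = "train") -> float:
        """Agent-facing call: any split other than 'train' is refused."""
        if split not in VISIBLE_SPLITS:
            raise PermissionError(f"split {split!r} is hidden from the agent")
        return self._mean_score(policy, split)

    def harness_score(self, policy: Policy, split: str) -> float:
        """Harness-facing call: may score any split, including hidden ones."""
        return self._mean_score(policy, split)


def toy_episode(policy: Policy, seed: int) -> float:
    """Toy task: the seed doubles as the observation, and the target is 2 * seed."""
    return -float(abs(policy(seed) - 2 * seed))


gate = VisibilityGate(toy_episode, {"train": [1, 2, 3], "validation": [10, 11], "test": [20, 21]})

memorized = lambda obs: {1: 2, 2: 4, 3: 6}.get(obs, 0)  # only knows the training seeds
general = lambda obs: 2 * obs                           # implements the actual rule

train_scores = (gate.agent_feedback(memorized), gate.agent_feedback(general))
test_scores = (gate.harness_score(memorized, "test"), gate.harness_score(general, "test"))
```

Both toy policies look identical from the agent's side, because each one is perfect on the training seeds. Only the hidden test split, which the harness alone can score, separates the policy that memorized the visible cases from the one that captured the underlying rule.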