Large language model (LLM) agents often struggle in unfamiliar environments because they tend to "prematurely exploit": they try to solve tasks using prior knowledge before they understand their surroundings. This paper introduces a framework that addresses this by training agents to explore an environment systematically before they attempt any specific goal.
The Problem: Premature Exploitation
Current AI agents are typically trained to maximize task-completion rewards. While this works well in familiar or static settings, it fails when the agent encounters new environments. Because these agents lack an intrinsic drive to learn about their surroundings, they often fall into one of two traps: they act aimlessly through trial and error, or they confidently follow a plan based on incorrect assumptions about the environment’s tools and layout. The authors argue that agents need to acquire "grounded" knowledge, meaning information gathered through direct interaction with the environment, rather than relying solely on what they learned during training.
Measuring Exploration
To address this, the researchers formalized "autonomous exploration" as a measurable skill. They introduced a new metric called Exploration Checkpoint Coverage (ECC). This metric tracks how effectively an agent discovers key states, objects, and functional affordances within an environment. By using ECC, the team discovered that standard task-oriented training does not naturally lead to good exploration skills; instead, agents trained this way tend to repeat the same narrow behaviors, failing to map out the environment effectively.
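As a rough illustration of a coverage-style metric like ECC, the sketch below scores a trajectory as the fraction of predefined checkpoints it reaches at least once. This is an assumption about the general shape of such a metric, not the paper's exact definition; the function name, the checkpoint labels, and the string-based event format are all hypothetical.

```python
from typing import Iterable, Set


def exploration_checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of checkpoints reached at least once (hypothetical formulation)."""
    if not checkpoints:
        return 0.0
    # Repeated visits and events that are not checkpoints do not raise the score.
    reached = set(visited) & checkpoints
    return len(reached) / len(checkpoints)


# Hypothetical checkpoints for a kitchen-like environment.
checkpoints = {"open_fridge", "find_knife", "use_stove", "open_drawer"}
trajectory = ["open_fridge", "open_fridge", "find_knife", "look_around"]
score = exploration_checkpoint_coverage(trajectory, checkpoints)
```

Under this formulation, an agent that repeats the same narrow behavior gains nothing from the repetition, which matches the failure mode described above.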
The Explore-then-Act Paradigm
The authors propose a new training strategy that alternates between two types of rollouts: task-execution and exploration. During exploration rollouts, the agent is rewarded based on its ECC score, which encourages it to discover as much as possible about the environment.
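One way to picture the two rollout types is a reward function that switches on the rollout kind. The sketch below is a simplified assumption: it alternates strictly between task and exploration rollouts, uses a binary task reward, and computes coverage as the fraction of checkpoints reached. The summary does not specify the actual schedule or reward shaping, and `Rollout`, `rollout_reward`, and `rollout_schedule` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Rollout:
    kind: str                  # "task" or "explore"
    events: List[str]          # events observed during the rollout
    task_success: bool = False


def rollout_reward(rollout: Rollout, checkpoints: Set[str]) -> float:
    """Coverage reward for exploration rollouts, binary reward for task rollouts."""
    if rollout.kind == "explore":
        if not checkpoints:
            return 0.0
        return len(set(rollout.events) & checkpoints) / len(checkpoints)
    return 1.0 if rollout.task_success else 0.0


def rollout_schedule(n: int) -> List[str]:
    """Strict alternation, assumed here only for illustration."""
    return ["task" if i % 2 == 0 else "explore" for i in range(n)]


schedule = rollout_schedule(4)
explore_reward = rollout_reward(Rollout("explore", ["a", "z"]), {"a", "b"})
task_reward = rollout_reward(Rollout("task", [], task_success=True), {"a", "b"})
```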
This leads to the "Explore-then-Act" paradigm. Instead of jumping straight into a task, the agent is given an "interaction budget." During this phase, it explores the environment without a goal, gathering information about constraints and available tools. It then summarizes this information into a knowledge base, which it uses to inform its decisions when it finally begins the task-execution phase.
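The three phases described above (explore within a budget, summarize, then act) can be sketched with a toy environment. Everything here is a hypothetical stand-in: `ToyEnv` and its methods are not a real benchmark API, the exploration policy simply tries actions in order, and the keyword lookup in `act` replaces what would be an LLM policy conditioned on the summary.

```python
from typing import Dict, List


class ToyEnv:
    """Hypothetical text environment: each action returns an observation."""

    def __init__(self) -> None:
        self._responses = {
            "look": "You see a drawer and a stove.",
            "open drawer": "The drawer contains a knife.",
            "use stove": "The stove needs a pan.",
        }

    def available_actions(self) -> List[str]:
        return list(self._responses)

    def step(self, action: str) -> str:
        return self._responses.get(action, "Nothing happens.")


def explore(env: ToyEnv, budget: int) -> Dict[str, str]:
    """Goal-free exploration: try up to `budget` actions and record the results."""
    return {action: env.step(action) for action in env.available_actions()[:budget]}


def summarize(knowledge: Dict[str, str]) -> str:
    """Turn raw observations into notes that a policy could be prompted with."""
    return "\n".join(f"{action} -> {obs}" for action, obs in knowledge.items())


def act(goal_keyword: str, knowledge: Dict[str, str]) -> str:
    """Stand-in for the task policy: pick an action whose result mentions the goal."""
    for action, obs in knowledge.items():
        if goal_keyword in obs:
            return action
    return "look"  # fall back to observing when the notes do not help


knowledge = explore(ToyEnv(), budget=2)
notes = summarize(knowledge)
chosen = act("knife", knowledge)
```

With a budget of 2 the agent never tries the stove, so a stove-related goal would fall back to `look`. The budget directly limits what the knowledge base can contain.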
Results and Impact
Experiments across environments such as ALFWorld, SciWorld, and TextCraft show that this approach significantly improves performance. Agents trained with the ECC-guided strategy were better at navigating unfamiliar settings and achieved higher success rates on downstream tasks than agents trained only for task completion. The findings suggest that autonomous exploration is a vital "meta-capability" for building AI agents that can cope with real-world settings, where environments are diverse and constantly changing.