Look Before You Leap: Autonomous Exploration for LL...

Key Takeaways

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information.
We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents.
To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances.
Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance.

Paper Abstract

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.

Large language model (LLM) agents often struggle in unfamiliar environments because they tend to "prematurely exploit": they try to solve tasks using prior knowledge before they understand their surroundings. This paper introduces a framework that addresses the problem by teaching agents to explore their environment systematically before they attempt any specific goal.

The Problem: Premature Exploitation

Current AI agents are typically trained to maximize task-completion rewards. While this works well in familiar or static settings, it fails when the agent encounters new environments. Because these agents lack an intrinsic drive to learn about their surroundings, they often fall into two traps: they either act aimlessly through trial and error or they confidently follow a plan based on incorrect assumptions about the environment’s tools and layout. The authors argue that agents need to be able to acquire "grounded" knowledge—real-world information gathered through direct interaction—rather than relying solely on pre-existing training data.

Measuring Exploration

To address this, the researchers formalized "autonomous exploration" as a measurable skill. They introduced a new metric called Exploration Checkpoint Coverage (ECC). This metric tracks how effectively an agent discovers key states, objects, and functional affordances within an environment. By using ECC, the team discovered that standard task-oriented training does not naturally lead to good exploration skills; instead, agents trained this way tend to repeat the same narrow behaviors, failing to map out the environment effectively.
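
The summary does not give the exact formula for ECC, so the following is only a minimal sketch of one plausible reading: coverage as the fraction of a predefined checkpoint set that a trajectory reaches at least once. The function name `checkpoint_coverage` and the checkpoint labels are hypothetical and chosen for illustration.

```python
from typing import Iterable, Set


def checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of predefined checkpoints reached at least once.

    Assumption: a checkpoint counts once, however often it is revisited.
    """
    if not checkpoints:
        return 0.0
    discovered = set(visited) & checkpoints
    return len(discovered) / len(checkpoints)


# Hypothetical checkpoint set for a small household environment.
checkpoints = {"open_fridge", "find_knife", "visit_kitchen", "use_microwave"}

# A narrow, repetitive trajectory versus a broader one of the same length.
narrow = ["visit_kitchen", "open_fridge", "open_fridge", "open_fridge"]
broad = ["visit_kitchen", "open_fridge", "find_knife", "look_at_sofa"]

narrow_score = checkpoint_coverage(narrow, checkpoints)  # 2 of 4 checkpoints
broad_score = checkpoint_coverage(broad, checkpoints)    # 3 of 4 checkpoints
```

Under this reading, repeating an already discovered action adds nothing to the score, which is why the repetitive trajectory scores lower than the broader one despite taking the same number of steps.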

The Explore-then-Act Paradigm

The authors propose a new training strategy that alternates between two types of rollouts: task-execution and exploration. During exploration rollouts, the agent is rewarded based on its ECC score, which encourages it to discover as much as possible about the environment.
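
The alternation described above can be sketched as a batch builder that tags each rollout with its type and scores it with the matching reward. This is an illustration under assumptions, not the authors' implementation: the strict 1:1 alternation, the binary task reward, the coverage-style exploration reward, and all names (`build_batch`, `stub_rollout`) are hypothetical.

```python
from typing import Callable, Dict, List, Set

Trajectory = List[str]


def task_reward(traj: Trajectory, goal: str) -> float:
    """Assumed task reward: 1.0 if the goal event occurs, else 0.0."""
    return 1.0 if goal in traj else 0.0


def exploration_reward(traj: Trajectory, checkpoints: Set[str]) -> float:
    """Assumed exploration reward: fraction of checkpoints discovered."""
    if not checkpoints:
        return 0.0
    return len(set(traj) & checkpoints) / len(checkpoints)


def build_batch(
    rollout: Callable[[str], Trajectory],
    n_rollouts: int,
    goal: str,
    checkpoints: Set[str],
) -> List[Dict]:
    """Alternate task and exploration rollouts, scoring each with its own reward."""
    batch = []
    for i in range(n_rollouts):
        mode = "explore" if i % 2 else "task"
        traj = rollout(mode)
        if mode == "explore":
            reward = exploration_reward(traj, checkpoints)
        else:
            reward = task_reward(traj, goal)
        batch.append({"mode": mode, "trajectory": traj, "reward": reward})
    return batch


def stub_rollout(mode: str) -> Trajectory:
    """Stand-in for sampling a trajectory from the policy."""
    if mode == "explore":
        return ["visit_kitchen", "open_fridge", "find_knife"]
    return ["visit_kitchen", "slice_apple"]


batch = build_batch(
    stub_rollout,
    n_rollouts=4,
    goal="slice_apple",
    checkpoints={"visit_kitchen", "open_fridge", "find_knife", "use_microwave"},
)
```

In a real training loop the resulting batch would feed a policy-gradient update; that step is omitted here because the summary does not say which optimizer or algorithm is used.
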
This leads to the "Explore-then-Act" paradigm. Instead of jumping straight into a task, the agent is given an "interaction budget." During this phase, it explores the environment without a goal, gathering information about constraints and available tools. It then summarizes this information into a knowledge base, which it uses to inform its decisions when it finally begins the task-execution phase.
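
The two phases can be sketched with a toy environment. Everything here is assumed for illustration: `ToyEnv`, the fixed list of candidate actions (a trained agent would choose its own), and the plain-text summary format are hypothetical, and the task phase stops at building the prompt rather than calling a model.

```python
from typing import Dict, List


class ToyEnv:
    """Hypothetical text environment: each action returns a fixed observation."""

    def __init__(self) -> None:
        self.responses = {
            "look": "You see a drawer and a locked chest.",
            "open drawer": "The drawer contains a key.",
            "open chest": "The chest is locked; it needs a key.",
        }

    def step(self, action: str) -> str:
        return self.responses.get(action, "Nothing happens.")


def explore(env: ToyEnv, actions: List[str], budget: int) -> Dict[str, str]:
    """Phase 1: goal-free interaction, capped by the interaction budget."""
    knowledge: Dict[str, str] = {}
    for action in actions[:budget]:
        knowledge[action] = env.step(action)
    return knowledge


def summarize(knowledge: Dict[str, str]) -> str:
    """Condense the observations into a small text knowledge base."""
    return "\n".join(f"{a} -> {o}" for a, o in knowledge.items())


def build_task_prompt(task: str, notes: str) -> str:
    """Phase 2: prepend the exploration notes to the task description."""
    return f"Environment notes:\n{notes}\n\nTask: {task}"


env = ToyEnv()
notes = explore(env, ["look", "open drawer", "open chest", "dance"], budget=3)
prompt = build_task_prompt("Open the chest.", summarize(notes))
```

With a budget of 3, the fourth candidate action is never tried, and the task prompt already records that the chest needs a key and that the key is in the drawer.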

Results and Impact

Experiments across environments like ALFWorld, SciWorld, and TextCraft show that this approach significantly improves performance. Agents trained with the ECC-guided strategy were better at navigating unfamiliar settings and showed higher success rates in downstream tasks compared to agents trained only for task completion. The findings suggest that autonomous exploration is a vital "meta-capability" for building AI agents that are truly ready for the complexities of the real world, where environments are diverse and constantly changing.