AGENTCL: Toward Rigorous Evaluation of Continual Le...

Key Takeaways

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes.
Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences.
Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously.
This paper presents AgentCL, an evaluation framework for continual learning in agents, centered on controlled task streams and metrics for transfer gains.

Paper Abstract

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AGENTCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

Language agents, such as those powered by large language models, often perform complex tasks but fail to retain or reuse the knowledge gained from one episode to the next. While continual learning aims to help these agents accumulate experience over time, current benchmarks struggle to measure this effectively because they often rely on random, "naive" sequences of tasks. This paper introduces AgentCL, a new evaluation framework designed to rigorously test how agents learn, store, and reuse information across controlled task streams.

Controlling Task Relationships

The core innovation of AgentCL is the use of "compositional streams." Unlike naive streams, where reuse across tasks is not guaranteed, compositional streams are intentionally designed so that earlier tasks provide sub-solutions, evidence, or workflows that are directly useful for solving later, more complex tasks. By contrasting these two types of streams, the researchers can determine whether an agent is genuinely learning to reuse abstracted knowledge or simply benefiting from incidental exposure to similar domains.
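The contrast between the two stream types can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Task` class, its `reuses` field, the helper names, and the choice to model a naive stream as a seeded shuffle are hypothetical and are not taken from the AgentCL codebase.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    # Hypothetical field: names of earlier tasks whose sub-solutions,
    # evidence, or workflows this task can reuse.
    reuses: tuple = ()


def compositional_stream(tasks):
    """Order tasks so each one appears after the tasks it reuses.

    Assumes the reuse links form a DAG (no cycles).
    """
    by_name = {t.name: t for t in tasks}
    ordered, seen = [], set()

    def visit(task):
        if task.name in seen:
            return
        seen.add(task.name)
        for dep in task.reuses:
            visit(by_name[dep])
        ordered.append(task)  # emitted only after all of its prerequisites

    for t in tasks:
        visit(t)
    return ordered


def naive_stream(tasks, seed=0):
    """Assumed stand-in for a naive stream: a seeded shuffle, no reuse guarantee."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled


def reuse_satisfied(stream):
    """Fraction of reuse links whose source task appears earlier in the stream."""
    position = {t.name: i for i, t in enumerate(stream)}
    links = [(t.name, dep) for t in stream for dep in t.reuses]
    if not links:
        return 1.0
    ok = sum(position[dep] < position[name] for name, dep in links)
    return ok / len(links)
```

Under these assumptions, `reuse_satisfied` always returns 1.0 for a compositional stream, while a naive stream may satisfy only some of the links, which is the property the comparison relies on.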

Measuring Learning with Two-Pass Evaluation

To quantify an agent's performance, the framework employs a two-pass evaluation protocol. In the first pass, the agent is allowed to read from and write to its memory as it processes a sequence of tasks. In the second pass, the agent’s memory is "frozen," meaning it can only read from what it previously stored. This allows researchers to calculate three specific metrics:

Plasticity Gain: Does the memory built from earlier tasks actually help the agent solve current ones?
Stability Gain: Does the experience from a task remain accessible and useful even after the agent has processed many subsequent, potentially distracting tasks?
Generalization Gain: Can the agent apply its accumulated memory to entirely new, held-out tasks?
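The two-pass protocol and the three gains can be sketched as follows. This is an illustration under stated assumptions, not the paper's definitions: the summary does not give the exact formulas, so each gain is modeled here as a success-rate difference against a no-memory baseline. The `solve(task, memory)` interface, the list-of-strings memory, and `toy_solve` are all hypothetical.

```python
def success_rate(outcomes):
    """Share of successful outcomes; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def two_pass_eval(solve, stream, held_out):
    """Run a read/write pass, then a frozen-memory pass.

    `solve(task, memory)` is a hypothetical agent interface returning
    (success: bool, new_entries: list). The gain formulas are assumptions.
    """
    # Baseline: every task attempted with empty memory.
    base_stream = [solve(t, ())[0] for t in stream]
    base_held = [solve(t, ())[0] for t in held_out]

    # Pass 1: the agent reads memory and writes new entries after each task.
    memory, pass1 = [], []
    for task in stream:
        ok, new_entries = solve(task, tuple(memory))
        pass1.append(ok)
        memory.extend(new_entries)

    # Pass 2: memory is frozen; anything the agent tries to write is discarded.
    frozen = tuple(memory)
    pass2 = [solve(t, frozen)[0] for t in stream]
    held = [solve(t, frozen)[0] for t in held_out]

    return {
        "plasticity_gain": success_rate(pass1) - success_rate(base_stream),
        "stability_gain": success_rate(pass2) - success_rate(base_stream),
        "generalization_gain": success_rate(held) - success_rate(base_held),
    }


def toy_solve(task, memory):
    """Toy agent: single-skill tasks always succeed and store the skill;
    composite tasks ("a+b") succeed only if every part is already in memory."""
    parts = task.split("+")
    if len(parts) == 1:
        return True, [task]
    return all(p in memory for p in parts), []


gains = two_pass_eval(toy_solve, ["a", "b", "a+b"], ["b+a", "a+c"])
```

In this toy setting the composite task `a+b` fails without memory but succeeds once `a` and `b` have been stored, so all three gains come out positive. A real agent can also yield negative gains, which is how memory-induced degradation would show up.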

Probing Memory Designs

The authors also developed MemProbe, a probing method for analyzing how different memory components (interaction history, task insights, and procedural skills) contribute to an agent's success. MemProbe filters out unreliable experiences during consolidation, so that low-quality information is less likely to be stored. This helps researchers understand which memory design choices lead to better performance and which ones fail to balance the need for new learning (plasticity) with the need to retain past knowledge (stability).
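A memory that keeps the three component types separate and filters experiences at consolidation time might look like the sketch below. It is not MemProbe's implementation: the `Experience` and `ProbeMemory` names, the numeric `reliability` score, and the fixed threshold are assumptions introduced only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, List

KINDS = ("interaction", "insight", "skill")


@dataclass
class Experience:
    kind: str           # one of KINDS
    content: str
    reliability: float  # hypothetical score in [0, 1], e.g. derived from task success


@dataclass
class ProbeMemory:
    min_reliability: float = 0.5  # assumed threshold, not from the paper
    store: Dict[str, List[str]] = field(
        default_factory=lambda: {k: [] for k in KINDS}
    )

    def consolidate(self, batch):
        """Store experiences that pass the reliability filter; return the rest."""
        rejected = []
        for exp in batch:
            if exp.kind not in self.store:
                raise ValueError(f"unknown memory kind: {exp.kind}")
            if exp.reliability >= self.min_reliability:
                self.store[exp.kind].append(exp.content)
            else:
                rejected.append(exp)
        return rejected

    def read(self, kinds=KINDS):
        """Return stored entries, optionally restricted to some component types."""
        return [entry for k in kinds for entry in self.store[k]]
```

Restricting `read` to a subset of kinds, for example `read(kinds=("skill",))`, gives a simple way to ablate one component at a time and compare the resulting gains.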

Key Findings and Limitations

The empirical analysis reveals that naive task streams are often too blunt to distinguish between memory designs: they compress performance differences and make it hard to tell whether an agent is truly improving. In contrast, compositional streams separate memory designs more clearly, exposing their strengths and weaknesses. The study also uncovers a significant "stability bottleneck": while many memory designs perform well when tasks are explicitly related, they often yield limited gains in naive or held-out settings, where stored memory can interfere with the task at hand and degrade performance. These results suggest that future research should focus on more robust memory consolidation techniques that can handle both predictable and unpredictable task environments.