Language agents, such as those powered by large language models, often perform complex tasks but fail to retain or reuse the knowledge gained from one episode to the next. While continual learning aims to help these agents accumulate experience over time, current benchmarks struggle to measure this effectively because they often rely on random, "naive" sequences of tasks. This paper introduces AgentCL, a new evaluation framework designed to rigorously test how agents learn, store, and reuse information across controlled task streams.
Controlling Task Relationships
The core innovation of AgentCL is the use of "compositional streams." Unlike naive streams, where tasks are ordered randomly, compositional streams are intentionally designed so that earlier tasks provide sub-solutions, evidence, or workflows that are directly useful for solving later, more complex tasks. By contrasting these two types of streams, the researchers can determine whether an agent is genuinely learning to reuse abstracted knowledge or simply benefiting from incidental exposure to similar domains.
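The contrast between the two stream types can be made concrete with a small sketch. The task names and the `DEPENDS_ON` graph below are hypothetical and are not taken from AgentCL; the sketch assumes only that a compositional stream places every prerequisite task before the task that reuses it, while a naive stream ignores those relationships.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each task maps to the earlier tasks
# whose sub-solutions it can reuse. Not taken from the paper.
DEPENDS_ON = {
    "parse_log": set(),
    "extract_errors": {"parse_log"},
    "summarize_errors": {"extract_errors"},
    "file_bug_report": {"summarize_errors", "parse_log"},
}


def compositional_stream(depends_on):
    """Order tasks so that every prerequisite precedes the task reusing it."""
    return list(TopologicalSorter(depends_on).static_order())


def naive_stream(depends_on, seed=0):
    """Order tasks randomly, ignoring the dependency graph."""
    tasks = sorted(depends_on)
    random.Random(seed).shuffle(tasks)
    return tasks


def respects_dependencies(stream, depends_on):
    """Check whether each task appears after all of its prerequisites."""
    position = {task: i for i, task in enumerate(stream)}
    return all(
        position[dep] < position[task]
        for task, deps in depends_on.items()
        for dep in deps
    )
```

Running the same agent on both orderings of the same task set is what isolates the effect of task relationships from the effect of mere domain exposure.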
Measuring Learning with Two-Pass Evaluation
To quantify an agent's performance, the framework employs a two-pass evaluation protocol. In the first pass, the agent is allowed to read from and write to its memory as it processes a sequence of tasks. In the second pass, the agent’s memory is "frozen," meaning it can only read from what it previously stored. This allows researchers to calculate three specific metrics:
Plasticity Gain: Does the memory built from earlier tasks actually help the agent solve current ones?
Stability Gain: Does the experience from a task remain accessible and useful even after the agent has processed many subsequent, potentially distracting tasks?
Generalization Gain: Can the agent apply its accumulated memory to entirely new, held-out tasks?
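The mechanics of the two-pass protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the `Memory` class, the `toy_agent`, and the `gain_over_baseline` helper are all hypothetical, and the summary does not give the exact formulas for the three gains. The helper simply compares a success rate against a memory-free baseline as a stand-in for that kind of measurement.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical key-value memory that can be frozen (read-only)."""
    entries: dict = field(default_factory=dict)
    frozen: bool = False

    def read(self, key):
        return self.entries.get(key)

    def write(self, key, value):
        if not self.frozen:  # writes are ignored once memory is frozen
            self.entries[key] = value


def toy_agent(task, memory):
    """Hypothetical agent: solves a task if it has no prerequisite
    or if the prerequisite's solution is already in memory."""
    name, prereq = task
    solved = prereq is None or memory.read(prereq) is not None
    if solved:
        memory.write(name, "solution")
    return solved


def two_pass(agent, stream):
    """Pass 1 reads and writes memory; pass 2 reads from frozen memory."""
    memory = Memory()
    first = [agent(task, memory) for task in stream]
    memory.frozen = True
    second = [agent(task, memory) for task in stream]
    return first, second, memory


def success_rate(results):
    return sum(results) / len(results)


def gain_over_baseline(results, agent, stream):
    """Assumed stand-in metric: success rate minus a no-memory baseline."""
    baseline = [agent(task, Memory(frozen=True)) for task in stream]
    return success_rate(results) - success_rate(baseline)
```

For example, `two_pass(toy_agent, [("a", None), ("b", "a"), ("c", "b")])` runs a three-task stream in which each task depends on the one before it. Comparing first-pass and second-pass results against the baseline separates what memory contributed while it was being built from what remains usable once it is frozen.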
Probing Memory Designs
The authors also developed MemProbe, a diagnostic tool for analyzing how different memory components (such as interaction history, task insights, and procedural skills) contribute to an agent's success. MemProbe filters out unreliable experiences during consolidation so that only high-quality information is stored. This helps researchers see which memory design choices improve performance and which fail to balance the need for new learning (plasticity) against the need to retain past knowledge (stability).
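A filtered consolidation step of this kind might look like the sketch below. It is not MemProbe's actual code: the `Experience` record, the three component labels, and the default reliability rule (keep only experiences from tasks that succeeded) are assumptions made for illustration, since the summary does not say how reliability is judged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    """Hypothetical record of something the agent could store."""
    task: str
    component: str  # assumed labels: "history", "insight", or "skill"
    content: str
    succeeded: bool


def consolidate(experiences, is_reliable=lambda exp: exp.succeeded):
    """Store only experiences that pass the reliability check,
    grouped by memory component."""
    store = {"history": [], "insight": [], "skill": []}
    for exp in experiences:
        if is_reliable(exp):
            store[exp.component].append(exp.content)
    return store
```

Keeping the components in separate buckets makes it possible to disable one at a time and observe how each contributes to performance, and passing a different `is_reliable` function swaps in another filtering rule.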
Key Findings and Limitations
The empirical analysis finds that naive task streams are often too blunt to distinguish between memory designs: they compress performance differences and make it hard to tell whether an agent is truly improving. In contrast, compositional streams clearly expose the strengths and weaknesses of the various memory architectures. The study also uncovers a significant "stability bottleneck": many memory designs perform well when tasks are explicitly related, but they often struggle in naive or held-out settings, where stored memory can interfere with the current task and degrade performance. These results suggest that future research should focus on more robust memory consolidation techniques that can handle both predictable and unpredictable task environments.