Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents
Long-running language agents often struggle to maintain consistent goals and behaviors over time. While current systems rely heavily on retrieval—fetching past facts from an external database—this approach only works if the relevant information is actively brought into the agent's current working context. This paper introduces the concept of "memory depth," which refers to the ability of an agent to retain durable, goal-conditioned behaviors that persist even after the working context is cleared. The researchers propose a mechanism called EVAF to selectively consolidate important experiences into the model’s own parameters, ensuring that the agent’s core tendencies remain intact without needing to constantly re-retrieve old information.
The Loop-Drift Protocol
To test this, the authors developed the "loop-drift protocol," a controlled stress test designed to distinguish between simple fact retrieval and true behavioral persistence. In this protocol, the agent is subjected to various distractions, conflicting requests, and long periods of interference. Crucially, the retrieval system remains active throughout the test, but the working context is periodically "unloaded." This allows the authors to measure whether the agent still acts according to its goals after the context is wiped, which would indicate that the memory has been integrated into the model itself rather than just stored as a searchable file.
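The overall shape of such a test can be sketched in a few lines of Python. This is a toy harness under stated assumptions, not the authors' implementation: the function `loop_drift`, the two stand-in agents, and the `unload_every` parameter are hypothetical names chosen for illustration, and the retrieval system, which the protocol leaves active, is not modeled here.

```python
from typing import Callable, List


def loop_drift(act: Callable[[List[str]], str],
               goal_action: str,
               interference: List[str],
               unload_every: int) -> List[bool]:
    """Feed interference turns, wipe the working context every `unload_every`
    turns, and record whether the agent's next action still matches the goal."""
    context: List[str] = [f"goal: {goal_action}"]
    results: List[bool] = []
    for turn, message in enumerate(interference, start=1):
        context.append(message)
        if turn % unload_every == 0:
            # Unload: wipe the working context. An external retrieval store
            # (not modeled in this sketch) would stay intact.
            context.clear()
            results.append(act(context) == goal_action)
    return results


def context_bound_agent(context: List[str]) -> str:
    """Hypothetical agent that follows the goal only while it is in context."""
    for line in context:
        if line.startswith("goal: "):
            return line[len("goal: "):]
    return "no_goal"


def make_persistent_agent(goal_action: str) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for an agent whose goal tendency lives outside
    the working context (for example, in its parameters)."""
    def act(context: List[str]) -> str:
        return goal_action
    return act


interference = ["distraction", "conflicting request", "distraction", "distraction"]
context_results = loop_drift(context_bound_agent, "file_report", interference, 2)
persistent_results = loop_drift(make_persistent_agent("file_report"),
                                "file_report", interference, 2)
```

In this toy setup the context-bound agent fails every post-unload probe while the persistent stand-in passes them, which is the kind of contrast the protocol is designed to expose. The real protocol presumably scores much richer behavior than a single string match.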
How EVAF Works
EVAF (a surprise- and valence-gated LoRA consolidation mechanism) acts as a filter for the agent’s experiences. It uses two main criteria to decide what is worth remembering:
Surprise: How unexpected or informative an event is based on the model's current state.
Valence: How relevant an event is to the user’s long-term goals and preferences.
When an event is deemed important enough, it is written into a small, low-rank adapter (LoRA). By being highly selective, EVAF avoids the "noise" of storing every single interaction, which keeps the model's memory focused on behavioral tendencies rather than just transient facts.
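The two-criterion gate described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not say how surprise and valence are computed or combined, so the sketch assumes surprise is the mean negative log-likelihood per token, valence is a precomputed goal-relevance score in [0, 1], and an event is written only when both exceed a threshold. The `Event` class, the function names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    text: str
    token_logprobs: List[float]  # assumed: model log-probabilities of the event's tokens
    valence: float               # assumed: goal relevance score in [0, 1]


def surprise(event: Event) -> float:
    """Mean negative log-likelihood per token; higher means less expected."""
    return -sum(event.token_logprobs) / len(event.token_logprobs)


def should_consolidate(event: Event,
                       surprise_threshold: float = 2.0,
                       valence_threshold: float = 0.5) -> bool:
    """Assumed gate: select an event only if it is both surprising and goal-relevant."""
    return (surprise(event) >= surprise_threshold
            and event.valence >= valence_threshold)


events = [
    Event("User now wants all reports in metric units", [-3.0, -2.5, -3.5], 0.9),
    Event("User said hello", [-0.1, -0.2], 0.9),
    Event("Random trivia about an unrelated topic", [-4.0, -4.0], 0.1),
]

# Only events that pass the gate would be passed on to the adapter update.
selected = [e.text for e in events if should_consolidate(e)]
```

Here only the first event passes: the greeting is goal-relevant but unsurprising, and the trivia is surprising but irrelevant. A product or weighted sum of the two scores would be an equally plausible way to combine them.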
Key Findings
The research demonstrates a clear "depth flip" in performance. Retrieval-augmented generation (RAG) excels at recalling specific, recent facts, but it fails to maintain consistent goal-oriented behavior after a context unload. Conversely, EVAF performs significantly better at maintaining these goals and recovering them after an unload, even with very few parametric writes. The study also highlights that "actuation"—how strongly the model updates its weights based on selected events—is a separate, critical factor. If the model writes too strongly or indiscriminately, it can actually degrade performance, suggesting that successful memory consolidation requires a careful balance between selecting the right information and applying the right amount of weight change.
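The distinction between selection and actuation can be made concrete with a small numeric sketch. It assumes actuation behaves like a scalar multiplier on a low-rank weight update, in the spirit of the scaling factor in standard LoRA, where the adapted weight is W + scale * (B @ A). The summary does not state how the paper parameterizes actuation, so `strength` and the helper names below are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for tiny illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_low_rank_update(w: Matrix, b: Matrix, a: Matrix,
                          strength: float) -> Matrix:
    """Return W + strength * (B @ A); `strength` stands in for actuation."""
    delta = matmul(b, a)
    return [[w[i][j] + strength * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


w = [[1.0, 0.0],
     [0.0, 1.0]]       # base weights (2 x 2)
b = [[1.0],
     [0.0]]            # low-rank factor B (2 x 1)
a = [[0.0, 1.0]]       # low-rank factor A (1 x 2)

gentle = apply_low_rank_update(w, b, a, strength=0.5)
strong = apply_low_rank_update(w, b, a, strength=8.0)
```

The same selected event, encoded in `b` and `a`, moves the weights by 0.5 in one case and by 8.0 in the other. Which events are written and how hard each write pushes the weights are independent choices, and an overly large `strength` is the kind of setting the findings suggest can degrade performance.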
Limitations and Future Work
The authors emphasize that this research is a narrow mechanism claim rather than a universal solution. While EVAF is effective for goal persistence, it does not currently solve the problem of "stale-memory invalidation"—the ability to delete or update outdated information. Tests using public Memora event streams showed that while the mechanism is directionally positive, it does not yet provide a robust way to handle memory updates. Consequently, the authors identify validity gating and reconsolidation as important areas for future development.