Key Takeaways

  • Long-running language agents need more than memory access.
  • Retrieval systems can fetch past facts at query time, but they do not decide which experiences should continue to shape behavior after the working context is unloaded.
  • We study this separate problem as memory depth: durable goal-conditioned tendencies written into a small parametric store.
  • We introduce the loop-drift protocol, a controlled stress test in which the retrieval index remains intact while working context is unloaded and goal-conditioned behavior must persist under long-loop interference.
Paper Abstract

Long-running language agents need more than memory access. Retrieval systems can fetch past facts at query time, but they do not decide which experiences should continue to shape behavior after the working context is unloaded. We study this separate problem as memory depth: durable goal-conditioned tendencies written into a small parametric store. We introduce the loop-drift protocol, a controlled stress test in which the retrieval index remains intact while working context is unloaded and goal-conditioned behavior must persist under long-loop interference. We evaluate EVAF, a surprise- and valence-gated LoRA consolidation mechanism. Across GPT-2 and TinyLlama, retrieval is strongest on shallow factual recall (short-fact accuracy 0.956–0.973), while EVAF is strongest on goal persistence and post-unload recovery (0.812–0.904) with only 2–3 parametric writes per 200 events. Mechanism controls show that selective consolidation factorizes into two controllable dimensions: selection and actuation. Matched random gates isolate selection beyond sparse writing; fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B show that inner-loop write strength is model-dependent; and a Mistral-7B matched-gate inversion reveals asymmetric selection-actuation coupling under miscalibrated actuation. Public Memora event streams serve as an external diagnostic, exposing stale-memory invalidation as an unresolved boundary. Within this probe, selective parametric consolidation supplies memory depth distinct from and complementary to retrieval access.

Memory Depth, Not Memory Access: Selective Parametric Consolidation for Long-Running Language Agents
Long-running language agents often struggle to maintain consistent goals and behaviors over time. While current systems rely heavily on retrieval—fetching past facts from an external database—this approach only works if the relevant information is actively brought into the agent's current working context. This paper introduces the concept of "memory depth," which refers to the ability of an agent to retain durable, goal-conditioned behaviors that persist even after the working context is cleared. The researchers propose a mechanism called EVAF to selectively consolidate important experiences into the model’s own parameters, ensuring that the agent’s core tendencies remain intact without needing to constantly re-retrieve old information.

The Loop-Drift Protocol

To test this, the authors developed the "loop-drift protocol," a controlled stress test designed to separate simple fact retrieval from behavioral persistence. In this protocol, the agent is subjected to distractions, conflicting requests, and long periods of interference. Crucially, the retrieval index remains intact throughout the test, but the working context is unloaded. Researchers can then measure whether the agent still acts according to its goals after the context is wiped, which indicates that the behavior has been consolidated into the model's parameters rather than merely stored as a searchable record.
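
The distinction the protocol draws can be illustrated with a toy harness. This is a minimal sketch under stated assumptions, not the authors' implementation: `ToyAgent`, `run_loop_drift`, the three-tier layout, and the plain dictionary standing in for adapter weights are all hypothetical names and simplifications introduced here. It only shows why an intact retrieval index does not by itself keep a goal active once the working context is gone.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent with three memory tiers (all names are illustrative)."""
    working_context: list = field(default_factory=list)    # wiped on unload
    retrieval_index: dict = field(default_factory=dict)    # intact after unload
    parametric_store: dict = field(default_factory=dict)   # stand-in for adapter weights

    def observe(self, key, value, consolidate=False):
        """Record an event; optionally write it into the parametric store."""
        self.working_context.append((key, value))
        self.retrieval_index[key] = value
        if consolidate:
            self.parametric_store[key] = value

    def unload(self):
        """Clear the working context only; index and parameters are untouched."""
        self.working_context.clear()

    def act(self, key, query_index=False):
        """Return the behavior for `key`, or None if nothing shapes it."""
        for k, v in reversed(self.working_context):
            if k == key:
                return v
        if key in self.parametric_store:
            return self.parametric_store[key]
        if query_index:  # retrieval helps only when something issues the query
            return self.retrieval_index.get(key)
        return None


def run_loop_drift(consolidate_goal, n_distractors=200):
    """Goal event, then interference, then unload; report what survives."""
    agent = ToyAgent()
    agent.observe("reply_style", "concise", consolidate=consolidate_goal)
    for i in range(n_distractors):
        agent.observe(f"distractor_{i}", f"noise_{i}")
    agent.unload()
    return {
        "goal_persists": agent.act("reply_style") == "concise",
        "fact_retrievable": agent.act("reply_style", query_index=True) == "concise",
    }


with_depth = run_loop_drift(consolidate_goal=True)
access_only = run_loop_drift(consolidate_goal=False)
```

In this toy setup both agents can still retrieve the goal when explicitly asked, but only the agent that consolidated it behaves accordingly by default after the unload.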

How EVAF Works

EVAF (a surprise- and valence-gated LoRA consolidation mechanism) acts as a filter for the agent’s experiences. It uses two main criteria to decide what is worth remembering:

  • Surprise: How unexpected or informative an event is based on the model's current state.
  • Valence: How relevant an event is to the user’s long-term goals and preferences.

When an event is deemed important enough, it is written into a small, low-rank adapter (LoRA). By being highly selective, EVAF avoids the "noise" of storing every single interaction, which keeps the model's memory focused on behavioral tendencies rather than just transient facts.
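
The gating idea can be sketched in a few lines. The summary does not give EVAF's actual scoring functions or thresholds, so everything below is an assumption made for illustration: surprise is taken to be surprisal (negative log probability of the event under the model), valence is a hand-assigned relevance score in [0, 1], the two are combined with a simple "both must clear a threshold" rule, and the threshold values and event list are invented.

```python
import math


def surprise(prob):
    """Surprisal in nats: -log p of the event under the current model."""
    return -math.log(prob)


def should_consolidate(prob, valence, min_surprise=2.0, min_valence=0.5):
    """Assumed conjunctive gate: write only if both scores clear a threshold."""
    return surprise(prob) >= min_surprise and valence >= min_valence


# (description, model probability of the event, goal relevance in [0, 1])
events = [
    ("small talk about the weather", 0.90, 0.10),            # expected, irrelevant
    ("user asks for metric units from now on", 0.02, 0.90),  # surprising, relevant
    ("unrelated trivia", 0.01, 0.05),                        # surprising, irrelevant
    ("routine greeting", 0.80, 0.70),                        # relevant, but expected
]

# Only events that pass both gates would trigger a parametric write.
selected = [text for text, p, v in events if should_consolidate(p, v)]
```

Under these made-up scores, only the unit-preference event passes both gates, which mirrors the intent of writing rarely and only for goal-relevant surprises.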

Key Findings

The research demonstrates a clear "depth flip" in performance. Across GPT-2 and TinyLlama, retrieval-augmented generation (RAG) is strongest on shallow factual recall (short-fact accuracy 0.956–0.973) but weaker at maintaining goal-oriented behavior after a context unload. Conversely, EVAF is strongest on goal persistence and post-unload recovery (0.812–0.904), even with only 2–3 parametric writes per 200 events. The study also separates selection (which events are written) from "actuation" (how strongly the model updates its weights for the selected events). Matched random gates isolate the contribution of selection beyond sparse writing alone, and fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B show that the appropriate write strength is model-dependent. If the model writes too strongly or indiscriminately, performance can degrade, so successful consolidation requires both selecting the right events and applying the right amount of weight change.
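
The two dimensions can be made concrete with a small sketch. It is illustrative only and rests on assumptions: `matched_random_gate` is a hypothetical control that spends the same write budget as a selective gate but picks events uniformly at random, and `rank1_write` is a toy rank-1 weight update in which a single `strength` scalar stands in for actuation. The selected indices and vectors are invented, and a real LoRA update would use trained low-rank matrices rather than fixed vectors.

```python
import random


def matched_random_gate(n_events, n_writes, seed=0):
    """Pick n_writes event indices uniformly at random (same write budget)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_events), n_writes))


def rank1_write(weights, u, v, strength):
    """Return weights + strength * outer(u, v); `strength` is the actuation knob."""
    return [
        [w + strength * ui * vj for w, vj in zip(row, v)]
        for row, ui in zip(weights, u)
    ]


n_events = 200
selective_idx = [17, 93, 158]  # hypothetical output of a selective gate
random_idx = matched_random_gate(n_events, len(selective_idx))

# Same selected event, two actuation settings: only the write strength differs.
base = [[0.0, 0.0], [0.0, 0.0]]
u, v = [1.0, 2.0], [3.0, 4.0]
gentle = rank1_write(base, u, v, strength=0.1)
strong = rank1_write(base, u, v, strength=1.0)
```

Comparing the selective and random gates at equal write counts varies selection while holding sparsity fixed, and comparing `gentle` with `strong` varies actuation while holding selection fixed.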

Limitations and Future Work

The authors emphasize that this research is a narrow mechanism claim rather than a universal solution. While EVAF is effective for goal persistence, it does not currently solve "stale-memory invalidation," the problem of deleting or updating outdated information. Tests using public Memora event streams showed that the mechanism is directionally positive but does not yet provide a robust way to handle memory updates. Consequently, the authors identify validity gating and reconsolidation as important areas for future development.
