Key Takeaways

  • Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory.
  • However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades.
  • As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise.
  • This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning.
Paper Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents
Large Language Models (LLMs) often struggle with long-horizon tasks because they rely on recursive summarization to manage memory. As these agents compress interaction histories into compact notes, they frequently discard important information or introduce "semantic noise." The result is "belief deviation": the agent's internal estimate of the task state drifts away from the true state, and long-horizon reasoning eventually breaks down. This paper introduces a training framework called Metacognitive Memory Policy Optimization (MMPO) that shifts the training signal from final outcomes alone to the clarity of each intermediate memory summary.

The Problem with Sparse Feedback

Existing methods for training memory-augmented agents typically use reinforcement learning on the final outcome, that is, whether the agent succeeded or failed at the end of a task. The authors argue that this signal is too "sparse": when an agent fails, it is hard to pinpoint which intermediate summary caused the error. With no guidance on how to manage its memory during the task, the agent tends to accumulate irrelevant or noisy information, and performance decays as the interaction grows longer.

Introducing Belief Entropy

To address this, the researchers propose a self-supervised proxy called "Belief Entropy." The metric measures how uncertain the model is about the latent task state given its current memory. To compute it, the model is prompted with an "anchor question," for example asking the agent to describe its current progress and what information it still needs. A highly uncertain response indicates that the memory summary is ambiguous or incomplete. Measuring this uncertainty lets the system distinguish "clear" summaries from "noisy" ones.
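As a rough illustration of the idea, the sketch below scores a memory summary by the average Shannon entropy of the model's token distributions while it answers the anchor question. The summary above does not give the paper's exact formula, so the averaging scheme, the function names `token_entropy` and `belief_entropy`, and the toy probability values are all assumptions made for illustration.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def belief_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the answer to the anchor question.

    Assumption: each inner sequence is the model's probability distribution
    over candidate tokens at one position of its answer, conditioned on the
    current memory summary. Averaging over positions is an illustrative
    choice, not necessarily the paper's definition.
    """
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)


# Toy distributions (hypothetical values): a clear memory yields peaked
# distributions, a noisy memory yields flat ones.
clear_memory = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
noisy_memory = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

clear_score = belief_entropy(clear_memory)
noisy_score = belief_entropy(noisy_memory)
```

Under these assumptions, `clear_score` comes out lower than `noisy_score`, which is the ordering the metric is meant to expose: a summary that leaves the model unsure how to answer the anchor question gets a higher score.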

How MMPO Improves Reasoning

MMPO uses Belief Entropy to provide "dense" supervision during training. Instead of rewarding only the final outcome, the framework also assigns rewards to intermediate steps based on the clarity of the memory. If a summary leads to lower Belief Entropy, the agent receives a positive signal, encouraging it to produce more precise and informative summaries. Using a technique called Group Relative Advantage Estimation, training compares different reasoning paths to determine which memory strategies are most effective at maintaining a stable, accurate belief about the task throughout the entire interaction.
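The following sketch shows one way an entropy penalty could be folded into a group-relative advantage. It is a simplification under stated assumptions: the reward is shaped once per rollout rather than per intermediate step, and the penalty weight `beta`, the helper names `shaped_reward` and `group_relative_advantages`, and the toy rollout values are hypothetical rather than taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def shaped_reward(outcome: float, belief_entropies: List[float],
                  beta: float = 0.1) -> float:
    """Outcome reward minus a penalty on the rollout's mean belief entropy.

    Assumption: `beta` and the linear penalty are illustrative choices.
    """
    penalty = mean(belief_entropies) if belief_entropies else 0.0
    return outcome - beta * penalty


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same task:
# (final outcome, belief entropy of each intermediate summary).
rollouts: List[Tuple[float, List[float]]] = [
    (1.0, [0.2, 0.3]),  # success with clear memory
    (1.0, [1.1, 1.4]),  # success with noisy memory
    (0.0, [0.4, 0.5]),  # failure with fairly clear memory
    (0.0, [1.2, 1.3]),  # failure with noisy memory
]

rewards = [shaped_reward(outcome, entropies) for outcome, entropies in rollouts]
advantages = group_relative_advantages(rewards)
```

With these toy values, the two successful rollouts no longer receive the same advantage: the one with clearer intermediate memory ranks higher, so the signal separates memory quality even when final outcomes tie.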

Performance and Scalability

The researchers tested MMPO on complex long-horizon tasks, including RULER-HotpotQA. The results show that MMPO consistently outperforms existing memory-agent frameworks. Notably, the model maintains 97.1% performance even when scaled to contexts as large as 1.75 million tokens. This suggests that by focusing on the quality of intermediate beliefs rather than just final outcomes, agents can remain stable and effective even during extremely long and complex reasoning processes.
