Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents
Large Language Model (LLM) agents often struggle with long-horizon tasks because they rely on recursive summarization to manage memory. As an agent compresses its interaction history into compact notes, it frequently discards important information or introduces "semantic noise." The result is "belief deviation": the agent’s internal understanding of the task drifts away from the true task state, and the reasoning process eventually collapses. This paper introduces a training framework called Meta-Cognitive Memory Policy Optimization (MMPO), which shifts the training signal from final results alone to the clarity and reliability of every intermediate memory summary.

The Problem with Sparse Feedback

Existing methods for training memory-augmented agents typically use Reinforcement Learning driven by the final outcome, that is, whether the agent succeeded or failed at the end of a task. The authors argue that this approach is insufficient because it provides only "sparse" feedback. If an agent fails, it is difficult to pinpoint which intermediate summary caused the error. Because the agent receives no guidance on how to manage its memory during the task, it tends to accumulate irrelevant or noisy information, and performance decays as the interaction grows longer.

Introducing Belief Entropy

To address this, the researchers propose a self-supervised proxy called "Belief Entropy." This metric measures how uncertain the model is about the task state given only its current memory. To compute it, the model is prompted with an "anchor question," for example a request to describe its current progress and the information it still needs. If the model’s response to this question is highly uncertain, the memory summary is likely ambiguous or incomplete. Measuring this uncertainty lets the system separate "clear" summaries from "noisy" ones.
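To make the idea concrete, here is a minimal sketch of one way such an uncertainty score could be computed. The summary above does not give the exact formula, so this assumes Belief Entropy is the mean Shannon entropy of the model's next-token distributions while it answers the anchor question. The function names and the toy probability values are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def belief_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over the answer to the anchor question.

    Assumption: averaging token-level entropies is one plausible proxy;
    the paper's exact definition may differ.
    """
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)


# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
clear_memory = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
noisy_memory = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

if __name__ == "__main__":
    print("clear:", belief_entropy(clear_memory))
    print("noisy:", belief_entropy(noisy_memory))
```

Under this assumption, a memory that lets the model answer the anchor question with peaked distributions scores lower than one that leaves the model spreading probability across many continuations.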

How MMPO Improves Reasoning

MMPO uses Belief Entropy to provide "dense" supervision during training. Instead of rewarding only the final outcome, the framework assigns rewards to intermediate steps based on the clarity of the memory. If a summary leads to lower Belief Entropy, the agent receives a positive signal, which encourages more precise and informative summaries. Using a technique called Group Relative Advantage Estimation, training compares different reasoning paths for the same task to determine which memory strategies best maintain a stable, accurate belief about the task throughout the interaction.
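The combination of outcome and dense signals can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: rewards are normalized within a group of rollouts for the same task (subtract the group mean, divide by the group standard deviation), rollouts are assumed to have the same number of memory updates, and the weight `beta` and all function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def group_relative(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize values against their own group: (v - mean) / (std + eps)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def step_advantages(
    group_entropies: Sequence[Sequence[float]],
    outcomes: Sequence[float],
    beta: float = 0.5,  # hypothetical weight on the dense term
) -> List[List[float]]:
    """Per-step advantages for each rollout in a group.

    group_entropies[i][t] is the belief entropy of rollout i after its
    t-th memory update; outcomes[i] is its final task reward.
    Lower entropy than the rest of the group yields a positive dense term.
    """
    n_steps = len(group_entropies[0])
    if any(len(r) != n_steps for r in group_entropies):
        raise ValueError("this sketch assumes equal-length rollouts")
    outcome_adv = group_relative(outcomes)
    dense_adv = [
        group_relative([-rollout[t] for rollout in group_entropies])
        for t in range(n_steps)
    ]
    return [
        [outcome_adv[i] + beta * dense_adv[t][i] for t in range(n_steps)]
        for i in range(len(outcomes))
    ]


if __name__ == "__main__":
    # Two rollouts with the same final outcome but different memory clarity.
    entropies = [[0.2, 0.1], [0.9, 1.0]]
    print(step_advantages(entropies, outcomes=[1.0, 1.0]))
```

In this toy case both rollouts succeed, so the outcome term alone cannot tell them apart. The dense term still favors the rollout whose summaries kept belief entropy low, which is the kind of intermediate signal the section describes.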

Performance and Scalability

The researchers tested MMPO on complex long-horizon tasks, including RULER-HotpotQA. The results show that MMPO consistently outperforms existing memory-agent frameworks. Notably, the model maintains high performance (97.1%) even when scaled to contexts as large as 1.75 million tokens. This suggests that by optimizing the quality of intermediate beliefs rather than final outcomes alone, agents can remain stable and effective over extremely long and complex reasoning processes.
