AutoMem: Automated Learning of Memory as a Cognitiv...

Key Takeaways

Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory.
We bring this perspective to LLMs by treating memory management as a trainable skill.
We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory.
This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it.

Paper Abstract

Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance ~2x-4x, bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.

AutoMem: Automated Learning of Memory as a Cognitive Skill
Large Language Models (LLMs) often struggle with long-horizon tasks because their "working memory" (the context window) is limited. While humans extend their memory with external tools such as notes and files, LLM agents typically rely on fixed, pre-designed memory systems. This paper introduces AutoMem, a framework that treats memory management as a trainable skill. Instead of building a static memory module, AutoMem exposes file-system operations (such as reading, writing, and searching) as first-class actions alongside task actions, so the model decides for itself what to remember and how to organize it.
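To make the idea of first-class memory actions concrete, here is a minimal sketch of a single action vocabulary that mixes memory operations with task actions. It is an illustration only, not the paper's implementation: the `MemoryStore` class, the `step` function, and the action dictionary layout are all hypothetical, and an in-memory dictionary stands in for real files.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy in-memory stand-in for the agent's memory files (hypothetical)."""
    files: dict = field(default_factory=dict)  # path -> text

    def write(self, path: str, text: str) -> str:
        self.files[path] = text
        return f"wrote {len(text)} chars to {path}"

    def read(self, path: str) -> str:
        return self.files.get(path, f"(no such file: {path})")

    def search(self, query: str) -> list:
        # Case-insensitive substring search over file contents.
        q = query.lower()
        return sorted(p for p, t in self.files.items() if q in t.lower())


def step(action: dict, memory: MemoryStore, env_step):
    """Dispatch one model-chosen action.

    Memory operations and task actions share one vocabulary, so the
    model chooses between them at every step.
    """
    kind = action["type"]
    if kind == "write":
        return memory.write(action["path"], action["text"])
    if kind == "read":
        return memory.read(action["path"])
    if kind == "search":
        return memory.search(action["query"])
    # Anything else is treated as a task action and sent to the environment.
    return env_step(action)
```

In this sketch the model, not the harness, decides when a step is spent on a note rather than on a move; the harness only routes the action.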

Automating Memory Improvement

AutoMem optimizes memory through two automated loops, both driven by a "meta-LLM" that reviews entire episode logs. Episodes can run for thousands of steps, and a memory mistake may surface long after it was made, so reviewing full trajectories by hand is impractical. In the first loop, the meta-LLM acts like a code reviewer: it analyzes the agent's trajectories and iteratively revises the "scaffold", meaning the prompts, file schemas, and rules that govern how the agent interacts with its memory. In the second loop, the meta-LLM identifies the agent's most successful memory decisions across many episodes, and those decisions become the training signal used to fine-tune a dedicated "memory specialist" model.
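The two loops described above can be sketched as plain functions. This is a rough outline under several assumptions, not the authors' procedure: `run_episodes`, `review`, and `is_good` are hypothetical callables standing in for the agent rollouts and the meta-LLM, trajectories are assumed to be dictionaries with `score` and `memory_decisions` keys, and the keep-only-if-the-score-improves rule is a simplification chosen for the example.

```python
def mean_score(trajectories: list) -> float:
    """Average episode score over a batch of trajectories."""
    return sum(t["score"] for t in trajectories) / len(trajectories)


def refine_scaffold(scaffold: dict, run_episodes, review, rounds: int = 3) -> dict:
    """Loop 1 (sketch): a reviewer proposes scaffold revisions from full trajectories.

    run_episodes(scaffold) -> list of trajectories   (hypothetical)
    review(scaffold, trajectories) -> revised scaffold (stands in for the meta-LLM)
    """
    best, best_trajs = scaffold, run_episodes(scaffold)
    for _ in range(rounds):
        candidate = review(best, best_trajs)
        cand_trajs = run_episodes(candidate)
        # Assumption: accept a revision only when the average score goes up.
        if mean_score(cand_trajs) > mean_score(best_trajs):
            best, best_trajs = candidate, cand_trajs
    return best


def select_training_examples(trajectories: list, is_good) -> list:
    """Loop 2 (sketch): collect memory decisions judged good, for fine-tuning.

    is_good(decision, trajectory) -> bool stands in for the meta-LLM's judgment.
    """
    return [
        decision
        for traj in trajectories
        for decision in traj["memory_decisions"]
        if is_good(decision, traj)
    ]
```

The selected examples would then feed an ordinary fine-tuning job for the memory specialist; that step is omitted here because the summary does not describe it.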

Separating Memory from Action

A key design choice in AutoMem is the separation of concerns. The framework uses two model instances: a "memory specialist" that handles file operations and a "gameplay model" that executes world actions. Because the gameplay model remains unmodified, the agent retains its original task competence while its ability to manage information is sharpened. This separation ensures that improvements in memory proficiency do not interfere with the model’s ability to perform tasks, allowing the two skills to stack and provide a cumulative performance boost.
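The division of labor between the two model instances might look like the sketch below. The summary does not say how the two models are interleaved, so the ordering (memory first, then gameplay), the `run_turn` function, and the callable signatures are all assumptions made for illustration.

```python
def run_turn(observation: str, files: dict, memory_specialist, gameplay_model, env_step):
    """One agent turn with memory and gameplay handled by separate models (sketch).

    memory_specialist(observation, files) -> list of (path, text) writes  (hypothetical)
    gameplay_model(observation, files)    -> a world action               (hypothetical)
    env_step(action)                      -> environment result           (hypothetical)
    """
    # 1) The memory specialist is the only component that changes the files.
    for path, text in memory_specialist(observation, files):
        files[path] = text

    # 2) The unmodified gameplay model reads a copy of the notes and picks a
    #    world action; passing a copy keeps it from mutating memory.
    action = gameplay_model(observation, dict(files))
    return env_step(action)
```

Because only the memory specialist is trained, swapping in a better specialist leaves `gameplay_model` untouched, which is the property the paragraph above relies on.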

High-Leverage Results

The researchers tested AutoMem on three complex, procedurally generated games: Crafter, MiniHack, and NetHack. Optimizing memory management alone, without modifying the model’s task-action behavior, improved the base agent's performance by roughly 2x to 4x. This made a 32B open-weight model competitive with frontier proprietary systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. These results suggest that teaching an LLM to manage its own memory is a high-leverage way to improve performance on long-horizon tasks, one that in these experiments narrowed the gap to frontier systems without scaling up the model.

Why This Matters

The results indicate that memory management is an independently learnable skill. By giving the model a traceable, file-based memory system and using a meta-LLM to automate the refinement of that system, the framework works around the context-window bottleneck. For long-horizon tasks, the ability to organize and retrieve information appears to matter as much as the model's underlying reasoning capabilities.