StreamMemBench: Streaming Evaluation of Agent Memor... | AI Research

Key Takeaways

  • A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance.
  • In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks.
  • Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested.
  • We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams.
Paper Abstract

A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance. In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks. Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested. We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task tests evidence use, while the follow-up task tests whether feedback and interaction experience are reused. Four metrics diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse. Experiments with eight memory systems across two backbones show that current systems often fail to use observed evidence or turn feedback into reliable follow-up behavior, even when evidence is stored or feedback is incorporated locally. StreamMemBench is publicly available at this https URL.

Bridging the Gap in Agent Memory

The primary goal of personal-agent memory is to transform raw observations and past interactions into helpful, future-oriented assistance. Many existing benchmarks evaluate how well an agent recalls dialogue or improves on a single task, but they test these abilities in isolation. They leave largely untested the "trajectory" of memory: how an agent carries information from a streaming observation forward to assist with a similar task later on. The authors introduce StreamMemBench, a streaming benchmark designed to evaluate that full path from observation to later assistance.

A Two-Step Evaluation Approach

StreamMemBench evaluates memory by constructing a two-step task sequence based on "evidence anchors" taken from the EgoLife egocentric (first-person) dataset. This structure allows researchers to test the agent's performance in two distinct phases:

  • Initial Task: This tests whether the agent can successfully use observed evidence to complete an immediate request.

  • Follow-up Task: This tests whether the agent can retain the feedback and interaction experience from the first task and apply that knowledge to a similar, subsequent task.

To provide a comprehensive assessment, the benchmark utilizes four specific metrics: evidence recall, initial evidence use, feedback incorporation, and follow-up reuse.
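
The two-step structure and its four metrics can be pictured with a small sketch. Everything in it is an illustrative assumption rather than the benchmark's actual code: the class name `AnchorResult`, its fields, the pass/fail scoring per anchor, and the simple averaging are all hypothetical, and StreamMemBench may define and score these metrics differently.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AnchorResult:
    """Hypothetical per-anchor outcome: one pass/fail flag per metric named in the text."""
    evidence_recall: bool         # could the observed evidence be retrieved from memory?
    initial_evidence_use: bool    # did the initial-task answer actually use that evidence?
    feedback_incorporation: bool  # was feedback on the initial task taken up?
    follow_up_reuse: bool         # did the follow-up task reuse that experience?


def summarize(results: list[AnchorResult]) -> dict[str, float]:
    """Average each flag over all anchors (an assumed aggregation, not the paper's)."""
    if not results:
        raise ValueError("need at least one anchor result")
    return {
        f.name: sum(getattr(r, f.name) for r in results) / len(results)
        for f in fields(AnchorResult)
    }
```

Keeping the four flags separate, instead of collapsing them into one score, is what lets a report distinguish "the evidence was never stored" from "the evidence was stored but not used".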

Current Limitations in Memory Systems

The researchers tested eight memory systems across two backbones using StreamMemBench. The results show that current systems often fail to use observed evidence or to turn feedback into reliable follow-up behavior, and that this happens even when the evidence is stored or the feedback is incorporated locally. In other words, an agent may "remember" the relevant data and still fail to apply it when a similar task arrives later.
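
One way to make this "stored but not applied" pattern concrete is to count the anchors where an earlier stage succeeds and the next one fails. The sketch below is only an illustration under assumed inputs: the function name `stage_gap` and the per-anchor boolean lists are hypothetical, and this is not a metric defined by the benchmark.

```python
def stage_gap(earlier: list[bool], later: list[bool]) -> float:
    """Fraction of anchors where the earlier stage passed but the later stage failed."""
    if len(earlier) != len(later):
        raise ValueError("flag lists must align anchor by anchor")
    if not earlier:
        raise ValueError("need at least one anchor")
    # Count anchors that were lost between the two stages.
    dropped = sum(1 for e, l in zip(earlier, later) if e and not l)
    return dropped / len(earlier)
```

With toy flags such as `stage_gap([True, True, True, False], [True, False, False, False])`, two of the four anchors pass the earlier stage and fail the later one. The same helper could compare feedback incorporation against follow-up reuse.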

Implications for Future Development

StreamMemBench highlights that the challenge for personal agents is not just storage, but the active, long-term application of experience. By moving beyond isolated task testing, this benchmark provides a clearer picture of how agents perform in dynamic, streaming environments. The authors have made StreamMemBench publicly available to help the research community address these persistent failures in memory-based assistance.
