AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents introduces a methodology for managing how AI agents store and access information during complex, long-horizon tasks. Current methods often append all past interactions to the prompt, which produces a "jumble" of data in which it is hard to tell which memories help the agent and which hinder it. The paper proposes a "bounded contract" instead: memory is retrieved in a structured, typed format, so the prompt size stays bounded regardless of how long the agent has been running.
A New Approach to Agent Memory
The core innovation of this research is replacing the standard practice of appending raw transcripts with a system of typed retrieval. For every decision, the system assembles a fresh user message from specific, retrieved memory components, so the agent's prompt remains bounded in size. This design lets developers isolate and test individual memory layers (such as reflections or tool-use history) to see exactly how each piece of information influences the agent's decision-making.
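The idea of typed, bounded retrieval can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the names `MemoryType`, `MemoryItem`, `build_user_message`, the relevance `score` field, and the `per_type_limit` and `max_chars` budgets are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    # Hypothetical layer names, loosely following the layers the summary mentions.
    REFLECTION = "reflection"
    SKILL = "skill"
    TOOL_HISTORY = "tool_history"


@dataclass(frozen=True)
class MemoryItem:
    kind: MemoryType
    text: str
    score: float  # assumed relevance score; how it is computed is out of scope


def build_user_message(
    state: str,
    store: list[MemoryItem],
    enabled: set[MemoryType],
    per_type_limit: int = 2,
    max_chars: int = 200,
) -> str:
    """Assemble a fresh message from the current state plus top items per type.

    At most `per_type_limit` items of at most `max_chars` characters are kept
    for each enabled type, so the message size does not grow with the store.
    """
    sections = [f"STATE: {state}"]
    for kind in MemoryType:
        if kind not in enabled:
            continue  # a disabled layer contributes nothing (ablation)
        candidates = [m for m in store if m.kind is kind]
        top = sorted(candidates, key=lambda m: m.score, reverse=True)[:per_type_limit]
        sections.extend(f"[{kind.value}] {m.text[:max_chars]}" for m in top)
    return "\n".join(sections)


# Example: the store can hold many items, but only the top few per type are used.
store = [MemoryItem(MemoryType.REFLECTION, f"reflection {i}", float(i)) for i in range(500)]
store += [MemoryItem(MemoryType.SKILL, f"skill {i}", float(i)) for i in range(500)]
print(build_user_message("floor 3, 40 HP", store, enabled={MemoryType.SKILL}))
```

Because each layer is switched on or off through `enabled`, two runs that differ only in that set differ only in the corresponding section of the prompt, which is what makes per-layer comparisons possible.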
Testing in Slay the Spire 2
To validate this framework, the authors used Slay the Spire 2, a complex, stochastic deck-building game that requires hundreds of tactical and strategic decisions. The game serves as a challenging testbed because it is difficult for current frontier LLMs to master; while human players achieve a 16% win rate at the lowest difficulty, existing LLM benchmarks have reported zero wins. By using this game, the researchers created a controlled environment to observe how different memory configurations affect an agent's ability to succeed over a long sequence of events.
Key Observations and Findings
In their initial experiments, the researchers compared a "no-store" baseline against an agent equipped with a "triggered strategic skills" layer. The baseline won 3 out of 10 games (30%), while the version with the skill layer won 6 out of 10 (60%). The authors note that a sample of this size is directional rather than statistically decisive, but the comparison illustrates how their harness can measure the impact of a specific memory component.
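To see why 3/10 versus 6/10 is only directional, one can run a two-sided Fisher exact test on the two win counts. The test choice and the helper name `fisher_exact_two_sided` are assumptions for illustration; the summary does not say which analysis the authors used.

```python
from math import comb


def fisher_exact_two_sided(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for two win counts with fixed margins."""
    total_wins = wins_a + wins_b
    total_games = n_a + n_b
    lowest = max(0, total_wins - n_b)
    highest = min(n_a, total_wins)

    def ways(k: int) -> int:
        # Number of ways group A ends up with exactly k of the total wins.
        return comb(total_wins, k) * comb(total_games - total_wins, n_a - k)

    observed = ways(wins_a)
    # Sum every outcome that is no more likely than the observed one.
    extreme = sum(ways(k) for k in range(lowest, highest + 1) if ways(k) <= observed)
    return extreme / comb(total_games, n_a)


# Baseline: 3 wins in 10 games; skill layer: 6 wins in 10 games.
p_value = fisher_exact_two_sided(3, 10, 6, 10)
print(f"two-sided p = {p_value:.3f}")
```

Under these assumptions the p-value comes out at about 0.37, far above conventional significance thresholds, which is consistent with treating the result as directional.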
A Resource for Future Research
The authors have released a comprehensive, reproducible testbed to support further study in this area. This release includes 298 completed game trajectories, condition tags, frozen snapshots of memory and skills, and the analysis scripts used in the study. By providing these tools, the researchers aim to offer a validated methodology for the AI community to better understand how explicit memory layers shape the behavior of long-horizon agents.