AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

Paper Abstract

Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p ≈ 0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts -- an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.

AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents introduces a methodology for controlling what information an AI agent can see at each decision during long, complex tasks. The simplest approach appends all past observations, tool calls, and reflections to every prompt, which produces a jumbled mixture in which it is hard to isolate the effect of any single memory component. The paper proposes a "bounded contract" instead: every decision is made from a fresh user message assembled by typed retrieval, so the prompt stays bounded no matter how long the agent has been running.

A New Approach to Agent Memory

The core change is replacing the practice of appending raw transcripts with typed retrieval. Because a fresh user message is assembled for every decision from specific retrieved memory components, and no raw cross-decision transcript is carried over, the agent's prompt remains bounded in size. The design also lets any single memory layer, such as the triggered strategic skills layer, be ablated in isolation to measure how it influences the agent's decisions.
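
The idea of a bounded, layer-by-layer prompt can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AgenticSTS implementation: the names MemoryLayer, build_user_message, size_bound, and retrieve_skills, the character caps, and the toy skill store are all hypothetical. The sketch shows that when each layer contributes at most a fixed number of capped entries, the assembled message cannot grow with the number of past decisions, and that switching one layer off removes only that layer's contribution.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical names throughout: an illustration of the bounded contract,
# not the authors' code.

@dataclass
class MemoryLayer:
    name: str                              # label for the layer, e.g. "skills"
    retrieve: Callable[[dict], List[str]]  # typed retrieval for this layer
    max_items: int = 3                     # cap on entries per decision
    max_chars: int = 200                   # cap on characters per entry
    enabled: bool = True                   # set False to ablate the layer


def build_user_message(state: dict, layers: List[MemoryLayer]) -> str:
    """Assemble a fresh message from the current state and capped retrievals.

    No transcript of earlier decisions is appended, so the size depends on
    the caps and not on how many decisions came before.
    """
    parts = [f"[state] {state['summary'][:500]}"]
    for layer in layers:
        if not layer.enabled:
            continue
        for item in layer.retrieve(state)[: layer.max_items]:
            parts.append(f"[{layer.name}] {item[: layer.max_chars]}")
    return "\n".join(parts)


def size_bound(layers: List[MemoryLayer]) -> int:
    """Worst-case message length in characters for the given layers."""
    bound = len("[state] ") + 500
    for layer in layers:
        # Each entry: newline + "[name] " prefix + capped text.
        bound += layer.max_items * (1 + len(layer.name) + 3 + layer.max_chars)
    return bound


# Toy store that keeps growing as the run goes on.
skill_store: List[dict] = []

def retrieve_skills(state: dict) -> List[str]:
    # Return only the entries whose trigger matches the current screen.
    return [s["text"] for s in skill_store if s["trigger"] == state["screen"]]

layers = [MemoryLayer("skills", retrieve_skills)]

lengths = []
for step in range(300):
    skill_store.append({"trigger": "combat", "text": f"skill note {step}"})
    state = {"summary": f"turn {step}, hp 40", "screen": "combat"}
    lengths.append(len(build_user_message(state, layers)))

bound = size_bound(layers)
print(max(lengths) <= bound)
```

Setting layers[0].enabled = False and rebuilding the message gives the ablated condition: the state line is unchanged and only the "[skills]" entries disappear.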

Testing in Slay the Spire 2

To instantiate this framework, the authors used Slay the Spire 2, a closed-rule, stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. The game is hard but not saturated: a public online benchmark of frontier LLMs reports zero wins at the lowest difficulty across five configurations, while the developer-reported human win rate at that difficulty is 16%. This gives the researchers a setting in which to observe how different memory configurations affect an agent's success over a long sequence of decisions.

Key Observations and Findings

In a fixed-A0 ablation within their harness, the researchers compared a "no-store" baseline against an agent with a "triggered strategic skills" layer enabled. The baseline won 3 out of 10 games, while the version with the skill layer won 6 out of 10, the largest difference observed. The authors note that at this sample size the comparison is directional rather than statistically decisive (Fisher exact p ≈ 0.37). A cross-backbone probe and public accumulating-context baselines are reported as operational comparisons, not as controlled tests of the contract itself.
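
The Fisher exact p-value quoted in the abstract can be reproduced from the two win counts alone. The sketch below uses only the Python standard library; the function name fisher_exact_two_sided is hypothetical, and the two-sided convention used here (sum the probabilities of all tables no more likely than the observed one) is an assumption, since the source does not say which variant the authors computed.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k wins in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Rows: no-store baseline (3 wins, 7 losses), skill layer (6 wins, 4 losses).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 2))  # 0.37
```

With ten games per condition, even a doubling of the win rate leaves a p-value well above conventional thresholds, which is why the comparison is described as directional.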

A Resource for Future Research

The authors have released a reproducible testbed to support further study. The release includes 298 completed trajectories with condition tags, frozen memory and skill snapshots, prompt records, and the analysis scripts. The stated aim is a validated, reusable methodology for studying how explicit memory layers shape the decisions of long-horizon LLM agents.
