What the paper is about
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.
What it covers
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang
Affiliations: [1] ByteDance Inc.; [2] Rochester Institute of Technology
[*] Work done during an internship at the ByteBrain team. [†] Corresponding author.
Correspondence: Tieying Zhang
May 26, 2026

[Figure 1: grouped bar chart. y-axis: Accuracy (%), 0 to 100; x-axis: Sci. & Eng., Data Analysis, Document Proc., Ops & Planning, Overall. Bar labels:]

Agent / setting       Sci. & Eng.   Data Analysis   Document Proc.   Ops & Planning   Overall
Codex w/o             55.7          44.6            82.2             36               52.1
Codex w/ hum          78.6          60.2            84.4             51.4             67.3
Hermes w/o            48.6          43.6            71.1             36               47.9
Hermes w/ hum         72.9          47.4            82.2             50.1             61.2
MUSE w/o (Ours)       55.7          47.4            77.8             40.2             53.2
MUSE w/ hum (Ours)    72.9          61.8            88.9             57.1             68.4

Figure 1: MUSE-Autoskill (ours) leads on SkillsBench across domains. Accuracy (%) of three GPT-5.5-backed agents on 51 SkillsBench tasks across four super-domains (Science & Engineering, Data Analysis, Document Processing, Ops & Planning) and the overall Total. Paired bars per agent: lighter = without skills, saturated = with human skills. MUSE-Autoskill achieves the highest with-skills score in 3 of 4 domains and on Total (68.4% vs. Codex 67.3% / Hermes 61.2%), a +15.2 pp lift consistent across agents. See Section 4 and Table 3.

1 Introduction

Skills for agents. Large language model (LLM) agents are increasingly tasked with solving complex, real-world problems that involve interacting with external tools, data, and code, often spanning many steps and disparate domains [35, 16, 8]. As task scope grows, raw model reasoning alone is insufficient: agents need access to reusable units of capability, namely skills, that encapsulate procedures, executable code, or domain-specific instructions and can be composed into solutions [27, 2]. Skills are emerging as the natural abstraction for scalable agent systems because they decouple capability from monolithic model weights, enabling modular execution and the accumulation of structured domain knowledge [2, 31]. The central open question is how to enable agents to continuously improve their capabilities through skills they can obtain, organize, and refine on their own, without relying on human authoring at every step.

Limits of AutoSkill.
A growing line of work uses LLMs to synthesize skills automatically, starting from Voyager’s executable code library in Minecraft [ 27 ] and extending to general-purpose agents via AutoSkill [ 34 ] , EvoSkill [ 1 ] , and SkillGen [ 14 ] . More recent approaches use reinforcement learning to jointly optimize skill selection, use, and distillation (Skill1 [ 24 ] ) or to train a dedicated skill curator (SkillOS [ 17 ] ). On the production side, Anthropic’s Agent Skills [ 2 ] standardize skills as portable folders of instructions and scripts. While these methods successfully expand agent functionality, they typically cover only part of the skill lifecycle and leave four practical gaps: (i) a creation–usage mismatch , where skills are produced without access to the agent’s runtime context; (ii) no structured per-skill memory that accumulates free-form experience about individual skills across tasks; (iii) static, unvalidated skills without unit-test-driven evaluation or refinement; and (iv) poor context handling , where flat conversation histories truncate or overflow on long-horizon tasks. Skill lifecycle. We argue that skills should not be one-off generation outputs but long-lived, evolving assets of an agent system. A useful skill is created on demand within the agent’s reasoning loop, stored with associated experience and metadata [ 18 , 19 , 26 ] , retrieved when contextually relevant, validated through tests and runtime feedback, and continuously refined as new evidence accumulates [ 3 , 15 , 14 ] . We formalize this perspective as a unified skill lifecycle with five stages: creation, memory, management, evaluation, and refinement . This reframing turns skills from disposable artifacts into managed, testable, and transferable infrastructure: the foundation needed for agents to accumulate experience across tasks, sessions, and even across different agent systems. MUSE-Autoskill framework. 
We instantiate this lifecycle in MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution; Figure 2). MUSE tightly couples skill creation with execution through a built-in skill_create tool invoked from within the runtime loop, eliminating the creation–usage mismatch. It introduces a multi-level memory comprising short-term, long-term, and (uniquely) skill-level memory, which accumulates per-skill experience across tasks and informs future invocations. An evaluation subsystem grounds reliability in unit tests and execution feedback, automatically triggering refinement when tests fail. A structured context manager with adaptive compression and cross-session state persistence supports long-horizon tasks without information loss or context-window blowup. Together, these components make skills externalized, testable, and transferable, rather than internal model behavior locked inside opaque weights.

Results. Figure 1 previews our headline results on SkillsBench, a benchmark of 51 real-world tasks graded by automated verifiers in standardized Docker environments. Among three GPT-5.5-backed agents, MUSE-Autoskill achieves the best with-skills accuracy in 3 of 4 super-domains and overall (68.40%, a +15.21 pp lift over its no-skills baseline). When MUSE-Autoskill creates skills from its own successful trajectories, accuracy on the 35 tasks where generation succeeds reaches 87.94%, surpassing the human-skill ceiling. Generated skills also transfer cleanly: injected into a different agent (Hermes), they raise its accuracy by +10.51 pp, closing 79% of the gap to Hermes with human skills, evidence that MUSE produces externalized knowledge assets rather than agent-specific behavior tied to one runtime.

Contributions. This paper makes four contributions:
• Skill lifecycle. We reframe skills from one-off generation outputs into long-lived, lifecycle-managed assets, identifying five stages (creation, memory, management, evaluation, refinement) that any practical skill-centric agent system must address.
• MUSE-Autoskill. A skill-centric agent that improves its task-solving capability over time by integrating skill creation with runtime execution, evaluating skills via unit tests and feedback, and automatically refining them when tests fail.
• Infrastructure. Multi-level memory with a novel skill-level memory that accumulates per-skill experience across tasks; adaptive context compression with cross-session state persistence; and cross-agent skill transfer that makes generated skills usable beyond their authoring agent.
• Validation. Best-in-class SkillsBench accuracy among three GPT-5.5-backed agents (68.40% with human skills, +15.21 pp lift); self-generated skills exceed the human-skill ceiling on 35 tasks (87.94%); generated skills transfer to a different agent with minimal loss.

Figure 2: MUSE-Autoskill Agent architecture. MUSE organizes skills into a unified lifecycle of creation, memory, management, evaluation, and refinement, enabling agents to generate, refine, and reuse skills with accumulated experience over time.

2 Related Work

2.1 LLM Agents

LLM-based agents that interact with tools, environments, and data have advanced rapidly in recent years [6, 22, 5, 29]. Building on ReAct [35]’s interleaving of reasoning and action, follow-up systems extend the paradigm to broader workflows, including multimodal autonomous agents such as Agent-Omni [11] and OmniGAIA [10], and a wider body of work on self-improving agents [26, 15]. A parallel line of work focuses on equipping agents with tool-use capabilities, ranging from few-shot tool calling [21] to tool orchestration via model selection [23] and large-scale API retrieval [20]; for software engineering specifically, agents such as CodeAgent [36], SWE-Agent [32], and OpenHands [28] drive tool-integrated workflows over sandboxed shells and editors to resolve real-world repository tasks. The capabilities of such systems are now measured by general agent benchmarks including GAIA [16], SWE-bench [8], and AgentBench [13], which together cover web browsing, real-world software engineering, and multi-environment tool use. Despite this progress, most agent frameworks treat the set of available actions as either a fixed, hand-engineered tool registry or a flat conversational scratchpad.
They do not natively support agents that can author, validate, and accumulate their own reusable capabilities over time, which is precisely the gap the skill-centric literature, and our framework, set out to close. 2.2 Automatic Skill Systems We organize the growing literature on automatic skill systems along two axes: which stages of the skill lifecycle ( creation, memory, management, evaluation, refinement ) a method addresses, and whether it operates entirely at inference time or requires additional model training. Table 1 summarizes the resulting comparison along these two axes. The first major direction builds skill systems on top of pretrained LLMs without any fine-tuning. Voyager [ 27 ] is the seminal example: in the Minecraft setting, it maintains an ever-growing library of executable-code skills, with self-verification and iterative prompting that lets the same LLM both author and refine skills in response to environment feedback. Follow-up work generalizes this paradigm to general-purpose agents: AutoSkill [ 34 ] derives, maintains, and reuses skills from dialogue and interaction traces as a model-agnostic plugin layer; EvoSkill [ 1 ] analyses execution failures and proposes new skills or edits, retaining only those that improve held-out validation under a Pareto-frontier selection; and SkillGen [ 14 ] iteratively refines skills via contrastive induction over successful and failed trajectories, modelling each skill as an intervention to empirically verify its net effect. The feedback-driven refinement underlying these methods is rooted in a broader self-improvement literature outside the skill setting: Reflexion [ 26 ] maintains reflective text in an episodic memory buffer across attempts, Self-Refine [ 15 ] iteratively rewrites outputs using self-generated critiques, Self-Debug [ 3 ] closes the loop on code generation with execution and unit-test traces, and ExpeL [ 37 ] extracts natural-language insights across training tasks for inference-time reuse. 
These methods all improve agent behavior through linguistic feedback but stop short of treating skills as first-class, externalized, testable artifacts that outlive a single task or agent. On the industrial side, Anthropic’s Agent Skills [ 2 ] standardize skills as portable folders of SKILL.md instructions and scripts loaded via progressive disclosure; this is the closest practical analogue of our externalized skill format, but the system leaves evaluation and refinement to human authoring. Collectively, these training-free methods are lightweight and naturally portable across LLM backbones, yet each covers only part of the lifecycle: none simultaneously supports structured per-skill memory, unit-test-driven evaluation, and automatic refinement triggered by test feedback. A second, concurrent direction uses reinforcement learning to optimize skill behavior jointly with the policy. SkillMaster [ 33 ] learns a single policy that both acts and edits its skill bank, with edits credited by counterfactual downstream utility. Skill1 [ 24 ] frames skill evolution as a unified RL problem, co-optimizing skill selection, utilization, and distillation under a shared task-outcome reward. SkillOS [ 17 ] pairs a frozen executor with a trainable curator that updates an external skill repository from accumulated experience, and shows that the curator generalizes across executor backbones; this is a portability axis complementary to ours, where the skills themselves rather than the curator are the unit of transfer. Youtu-Agent [ 25 ] pursues a related direction via hybrid policy optimization of tools and agent configurations. These RL-based methods can attain strong optimality on the environments they are trained on, but they couple skill behavior to a trained policy or curator: migrating to a new backbone typically requires additional training, and skills produced by one trained policy are not directly usable by a different agent without re-training. 
Table 1: Related work on automatic skill systems by lifecycle stage. ✓ = covered; △ = partial; ✗ = not addressed. Memory = persistent per-skill experience across tasks. Cross-agent = skills from one agent are usable by another without modification; ✓ requires an explicit cross-agent transfer experiment, △ indicates portability only across LLM backbones or product variants of the same agent. Training-free = inference-time only, no fine-tuning or RL. Columns Creation through Refinement are the lifecycle stages.

Method                  Creation  Memory  Management  Evaluation  Refinement  Cross-agent  Training-free
Voyager [27]            ✓         ✗       ✓           △           ✓           △            ✓
AutoSkill [34]          ✓         △       △           ✗           ✓           ✗            ✓
EvoSkill [1]            ✓         ✗       △           ✓           ✓           ✗            ✓
SkillGen [14]           ✓         ✗       ✗           ✓           ✓           △            ✓
Anthropic Skills [2]    ✓         ✗       ✓           ✗           ✗           △            ✓
SkillMaster [33]        ✓         ✗       △           △           ✓           ✗            ✗
Youtu-Agent [25]        ✓         ✗       △           ✗           ✗           ✗            ✗
Skill1 [24]             ✓         △       △           △           △           ✗            ✗
SkillOS [17]            ✓         ✓       ✓           △           ✓           ✗            ✗
MUSE-Autoskill (Ours)   ✓         ✓       ✓           ✓           ✓           ✓            ✓

2.3 Benchmarks and Positioning

Several recent benchmarks complement the methods above by stressing different lifecycle stages. SkillsBench [9], which we adopt in our experiments, measures end-to-end task accuracy with and without skills across diverse Docker-evaluated real-world tasks. SkillRet [4] isolates the management stage by evaluating skill retrieval at scale from a library of nearly 18,000 community-contributed skills. SkillLearnBench [39] and LifelongAgentBench [38] focus on continual and lifelong skill acquisition over task streams, and notably report that strong individual methods do not consistently dominate, motivating system-level designs such as ours.
A concurrent survey [31] catalogues skill-acquisition modalities and architectural choices for LLM agents, situating both training-free and training-based directions within a broader taxonomy. Compared with the methods above, MUSE-Autoskill differs in that it brings all five lifecycle stages together within a single training-free framework, rather than addressing creation or refinement in isolation. In particular, it introduces skill-level memory that accumulates per-skill experience across tasks, uses unit-test-driven evaluation that automatically triggers refinement when tests fail, and is the only general-purpose method to empirically validate cross-agent skill transfer by injecting its generated skills into a different agent without modification (Section 4); other portability claims in the literature are limited to swapping the underlying LLM backbone or sharing skills across product variants of the same agent family, without an explicit cross-agent experiment. The combination of full lifecycle coverage and a training-free design also makes the system portable across LLMs and agent architectures, as summarized in the bottom row of Table 1.

Figure 3: End-to-end flow of MUSE-Autoskill. The Master Agent runs a ReAct loop; when a skill is needed it either retrieves one from the Skill Bank or dispatches the Skill Creator to synthesize a new package (SKILL.md plus optional scripts/ and tests/). The Evaluator runs the bundled tests; on pass, observations are appended to Memory and surfaced on later steps; on fail, the Refiner patches the package and re-enters the loop.

3 MUSE-Autoskill Agent

In this section, we present MUSE-Autoskill Agent, a skill-centric agent framework that solves complex tasks by dynamically creating, reusing, and refining skills. MUSE integrates skill creation, execution, memory, management, and evaluation within a unified agent loop. Figure 2 illustrates the overall architecture and the five lifecycle stages described below.
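The retrieve-or-create flow in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `SkillPackage`, `obtain_skill`, the three callables, and the `max_refinements` cap are hypothetical names introduced here, and the toy functions stand in for the Skill Creator, Evaluator, and Refiner.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SkillPackage:
    name: str
    body: str                                       # stand-in for SKILL.md, scripts/, tests/
    notes: List[str] = field(default_factory=list)  # stand-in for Memory


def obtain_skill(
    name: str,
    bank: Dict[str, SkillPackage],
    create: Callable[[str], SkillPackage],
    evaluate: Callable[[SkillPackage], Tuple[bool, str]],
    refine: Callable[[SkillPackage, str], SkillPackage],
    max_refinements: int = 3,                       # hypothetical cap, not from the paper
) -> Optional[SkillPackage]:
    """Retrieve a skill from the bank, or create, test, and refine a new one."""
    if name in bank:                                # reuse path: Skill Bank hit
        return bank[name]
    skill = create(name)                            # Skill Creator synthesizes a package
    for attempt in range(max_refinements + 1):
        passed, report = evaluate(skill)            # Evaluator runs the bundled tests
        if passed:
            skill.notes.append(f"passed tests after {attempt} refinement(s)")
            bank[name] = skill                      # only passing skills are registered
            return skill
        if attempt < max_refinements:
            skill = refine(skill, report)           # Refiner patches the package
    return None                                     # give up; nothing is registered


# Toy stand-ins so the sketch runs end to end.
def toy_create(name: str) -> SkillPackage:
    return SkillPackage(name, body="draft")


def toy_evaluate(skill: SkillPackage) -> Tuple[bool, str]:
    if skill.body == "patched":
        return True, ""
    return False, "expected body 'patched'"


def toy_refine(skill: SkillPackage, report: str) -> SkillPackage:
    return SkillPackage(skill.name, body="patched", notes=skill.notes + [report])


bank: Dict[str, SkillPackage] = {}
first = obtain_skill("csv_summary", bank, toy_create, toy_evaluate, toy_refine)
second = obtain_skill("csv_summary", bank, toy_create, toy_evaluate, toy_refine)
```

The second call returns the registered package without calling the creator again, which is the reuse path the caption describes.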
3.1 Agent Framework The agent operates in an iterative decision-making loop consisting of three core stages: Planning , Action , and Observation [ 35 ] . Given an input query, the agent continuously cycles through these stages to progressively solve the task. This design enables dynamic reasoning, skill invocation, and adaptive refinement based on intermediate feedback. Planning In the planning stage, the agent interprets the input query and determines the next step toward achieving the task objective. This involves decomposing the problem, selecting appropriate strategies, and deciding whether to invoke external skills. The agent may also leverage past observations and memory to refine its plan, enabling more informed and context-aware decisions. Action In the action stage, the agent executes the planned step by invoking skills. These may include retrieving existing skills from the skill bank or utilizing built-in functions such as skill creation and web search. The selected skill is invoked within the agent’s ReAct loop using its built-in tools, producing intermediate or final outputs for the task. The detailed execution mechanism of skills will be introduced in Section 3.2 . Observation In the observation stage, the agent collects and analyzes the results returned from execution. These observations are used to evaluate progress toward the goal and to inform subsequent planning decisions. Through this feedback loop, the agent can iteratively refine its behavior, handle errors, and adapt to complex, multi-step tasks. 3.2 Skill Lifecycle As illustrated in Figure 3 , the agent organizes skills into a unified lifecycle of five stages: creation , memory , management , evaluation , and refinement . To bootstrap this process, the agent is equipped with a small set of built-in skills, including skill_create and web_search . 
All other skills are not predefined but must be created through this mechanism, ensuring that the agent’s capabilities are dynamically constructed and continuously evolving. Skill As illustrated in Figure 2 , a skill is the basic unit of execution in our system. Each skill is packaged as a structured directory with standard components, following Anthropic’s Agent Skills format [ 2 ] . It includes a SKILL.md file that defines its interface, such as name, description, inputs, and outputs, and may also include subdirectories like scripts/ for executable code, resources/ for auxiliary data, and tests/ for validation. Skills are executed through a unified interface. At runtime, the agent reads SKILL.md to understand how to use the skill, and decides whether to read resources, run scripts, or both. If scripts are required, the execution engine runs the corresponding code with the given inputs and returns the outputs. Using skills improves efficiency. Instead of generating detailed reasoning steps every time, the agent can call a skill with a short interface, which reduces token usage. Skills can also be reused across tasks, allowing the agent to avoid repeating work and making the system more scalable over time. Skill Creation As illustrated in Figure 2 , new skills are generated through the built-in skill_create skill. When existing skills are not sufficient, the agent provides a high-level specification of the desired functionality, including its purpose, inputs, and expected outputs. Based on this specification, the system follows a structured pipeline to construct the skill. It first generates the SKILL.md file to define the interface, then plans the internal structure such as scripts/ , resources/ , and tests/ , and finally generates the corresponding files. The result is a complete and executable skill package. 
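As an illustration of the creation pipeline just described (interface first, then the planned files), the sketch below writes a small package to a temporary directory. The `SkillSpec` type, the function names, and the exact SKILL.md layout (front matter followed by Inputs and Outputs sections) are assumptions made for this example; the text only states that SKILL.md carries the name, description, inputs, and outputs.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass
class SkillSpec:
    name: str
    description: str
    inputs: str
    outputs: str
    files: Dict[str, str] = field(default_factory=dict)  # relative path -> content


def render_skill_md(spec: SkillSpec) -> str:
    # Assumed layout: front matter for name/description, then the interface.
    return (
        "---\n"
        f"name: {spec.name}\n"
        f"description: {spec.description}\n"
        "---\n\n"
        f"## Inputs\n{spec.inputs}\n\n"
        f"## Outputs\n{spec.outputs}\n"
    )


def write_skill_package(spec: SkillSpec, root: Path) -> Path:
    skill_dir = root / spec.name
    skill_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: the interface file is written first.
    (skill_dir / "SKILL.md").write_text(render_skill_md(spec), encoding="utf-8")
    # Steps 2-3: write the planned files under scripts/, resources/, tests/.
    for rel_path, content in spec.files.items():
        target = skill_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return skill_dir


spec = SkillSpec(
    name="slugify_text",
    description="Turn a title into a lowercase, hyphen-separated slug.",
    inputs="text: str",
    outputs="slug: str",
    files={
        "scripts/make_slug.py": (
            "def make_slug(text):\n"
            "    return text.lower().replace(' ', '-')\n"
        ),
        "tests/test_make_slug.py": (
            "import unittest\n"
            "from make_slug import make_slug\n\n"
            "class MakeSlugTest(unittest.TestCase):\n"
            "    def test_basic(self):\n"
            "        self.assertEqual(make_slug('Hello World'), 'hello-world')\n"
        ),
    },
)

with tempfile.TemporaryDirectory() as tmp:
    skill_dir = write_skill_package(spec, Path(tmp))
    created = sorted(
        p.relative_to(skill_dir).as_posix() for p in skill_dir.rglob("*") if p.is_file()
    )
    skill_md = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
```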
After creation, each skill is gated by an evaluation step: the system runs the unit tests in the newly written tests/ directory inside the sandbox, and only registers the skill into the Skill Bank if all tests pass. If tests fail, the agent inspects the error trace and invokes update_skill to patch the package before re-running tests. This create → evaluate → register loop ensures only reliable skills enter the bank and are reusable in future tasks. This design also keeps all non-built-in functionality consistently created as skills, making them easy to reuse, validate, and improve over time.

Skill Evaluation. As illustrated in Figure 2, skills are evaluated to ensure their correctness and reliability before being reused. This evaluation is primarily performed through unit tests defined in the tests/ directory of each skill. After a skill is created, the system executes these tests with predefined inputs and verifies whether the outputs match expected results. This process filters out incorrect or unstable skills and provides signals for further refinement. As part of the self-evolution loop shown in Figure 2, failed tests can trigger updates or regeneration of the skill. By enforcing systematic evaluation, the agent maintains a high-quality skill set and ensures robust performance during execution.

Skill Execution. As illustrated in Figure 2, skill execution is carried out within the agent’s ReAct loop using its built-in tools. Given a task, the agent reads the available skill catalog and selects an appropriate skill. It then reads the SKILL.md file to understand the skill interface, standard operating procedure, and required components. Following the procedure defined in SKILL.md, the agent decides whether to read from resources/, execute code in scripts/ via sandbox tools, or combine both.
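The create → evaluate → register gate can be illustrated with Python's standard `unittest` runner. This is a sketch under assumptions rather than the paper's evaluator: a child process stands in for the sandbox, `run_skill_tests` and `register_if_passing` are hypothetical helper names, and the in-place file rewrite stands in for the agent's update_skill patch.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, Tuple

TEST_SRC = (
    "import unittest\n"
    "from make_slug import make_slug\n\n"
    "class MakeSlugTest(unittest.TestCase):\n"
    "    def test_lowercases_and_hyphenates(self):\n"
    "        self.assertEqual(make_slug('Hello World'), 'hello-world')\n"
)
BUGGY_SRC = "def make_slug(text):\n    return text.replace(' ', '-')\n"
FIXED_SRC = "def make_slug(text):\n    return text.lower().replace(' ', '-')\n"


def run_skill_tests(skill_dir: Path) -> Tuple[bool, str]:
    """Run the skill's tests/ in a child process; return (passed, report)."""
    env = dict(
        os.environ,
        PYTHONPATH=str(skill_dir / "scripts"),   # let tests import the skill's scripts
        PYTHONDONTWRITEBYTECODE="1",             # avoid stale bytecode between runs
    )
    proc = subprocess.run(
        [sys.executable, "-m", "unittest", "discover", "-s", "tests"],
        cwd=skill_dir, env=env, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stderr     # unittest reports on stderr


def register_if_passing(skill_dir: Path, bank: Dict[str, Path]) -> Tuple[bool, str]:
    passed, report = run_skill_tests(skill_dir)
    if passed:                                   # only passing skills enter the bank
        bank[skill_dir.name] = skill_dir
    return passed, report


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "slugify_text"
    (skill_dir / "scripts").mkdir(parents=True)
    (skill_dir / "tests").mkdir()
    (skill_dir / "scripts" / "make_slug.py").write_text(BUGGY_SRC, encoding="utf-8")
    (skill_dir / "tests" / "test_make_slug.py").write_text(TEST_SRC, encoding="utf-8")

    bank: Dict[str, Path] = {}
    first_passed, first_report = register_if_passing(skill_dir, bank)
    registered_after_first = sorted(bank)

    # Stand-in for the agent patching the package after reading the error trace.
    (skill_dir / "scripts" / "make_slug.py").write_text(FIXED_SRC, encoding="utf-8")
    second_passed, _ = register_if_passing(skill_dir, bank)
    registered_after_second = sorted(bank)
```

The first run fails and leaves the bank empty; the patched package passes and is registered, so nothing untested becomes reusable.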
Code execution is mediated by a small set of sandbox lifecycle tools ( create_sandbox , sandbox_run , sandbox_upload / sandbox_download , and close_sandbox ) that the agent invokes from inside its ReAct loop. Each sandbox is an isolated process / container with its own filesystem, so failures, side effects, and resource usage are contained per skill invocation. Rather than introducing a separate execution engine, skill execution reuses the same general-purpose tools the agent already uses (file reading, terminal commands, sandbox calls), which avoids redundant infrastructure and lets execution benefit from the agent’s full reasoning capability. The execution process is iterative: intermediate results are fed back into the agent’s reasoning loop, enabling progressive refinement and error handling. This unified approach ensures consistent execution across all skills while preserving flexibility for both simple and complex tasks. Skill Memory As illustrated in Figure 2 , the agent maintains memory at multiple levels to support skill reuse and accumulation over time. In particular, skill-level memory stores the skills themselves along with their metadata, such as descriptions, inputs, and usage history. This allows the agent to efficiently retrieve relevant skills for new tasks. In addition, the agent appends notes and observations to short-term and long-term memory, providing context for future decisions. This memory helps the agent avoid redundant skill creation, reuse effective solutions, and improve performance over time. By maintaining structured memory around skills, the system enables continuous learning and more efficient task execution. Skill Management As illustrated in Figure 2 , skill management maintains the quality and usability of the skill bank. Each skill is indexed using metadata from SKILL.md , including its name, description, inputs, and outputs. 
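One way to picture the metadata index is a small parser that reads the header of each SKILL.md and emits one catalog line per skill. The sketch assumes a simple `key: value` front-matter block delimited by `---` lines; `parse_front_matter` and `build_catalog` are hypothetical names, and only the name and description are surfaced here, although the index described above also covers inputs and outputs.

```python
from typing import Dict


def parse_front_matter(text: str) -> Dict[str, str]:
    """Read `key: value` pairs between the leading pair of '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")    # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_catalog(skill_docs: Dict[str, str]) -> str:
    """Return one short line per skill, sorted by skill id."""
    entries = []
    for skill_id in sorted(skill_docs):
        meta = parse_front_matter(skill_docs[skill_id])
        name = meta.get("name", skill_id)
        description = meta.get("description", "(no description)")
        entries.append(f"- {name}: {description}")
    return "\n".join(entries)


# Illustrative SKILL.md contents keyed by skill directory name.
docs = {
    "slugify_text": (
        "---\nname: slugify_text\n"
        "description: Turn a title into a URL slug.\n---\n\n"
        "## Inputs\ntext: str\n"
    ),
    "csv_summary": (
        "---\nname: csv_summary\n"
        "description: Summarize columns of a CSV file.\n---\n"
    ),
}
catalog = build_catalog(docs)
```

Keeping each entry to a single line means the full SKILL.md body is read only once a skill has actually been selected.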
At the start of each task, the agent is provided with a catalog of available skills injected into the system prompt, following the progressive-disclosure pattern of Anthropic’s Agent Skills [ 2 ] . The agent then selects the most relevant skill during planning based on this catalog, enabling efficient reuse and reducing unnecessary skill creation. In addition to retrieval, the system supports continuous maintenance of the skill bank through three mechanisms: refinement , merging , and pruning . When a skill fails unit tests or produces incorrect outputs during execution, the agent revises or regenerates it based on the error feedback. When newly created skills overlap significantly with existing ones, the agent merges them into a single, more general skill to avoid redundancy. Skills that consistently fail or remain unused over time are pruned from the skill bank. Together, these mechanisms keep the skill bank compact, reliable, and scalable as the agent accumulates more skills over time. 3.3 Memory Memory plays a central role in enabling MUSE to accumulate knowledge and reuse previously acquired capabilities. Our design builds on prior hierarchical memory architectures for LLM agents: MemGPT [ 18 ] pages between in-context and external memory in an OS-style hierarchy, Generative Agents [ 19 ] maintain a memory stream with periodic synthesis into higher-level reflections, and Reflexion [ 26 ] and ExpeL [ 37 ] accumulate natural-language reflections and insights across episodes. MUSE extends these by adding a per-skill memory scope tied to each SKILL.md file, complementing short- and long-term layers shared with prior work. Skill-level Memory Each skill in the bank carries its own .memory.md file, into which the agent appends notes, lessons, and usage observations accumulated across tasks (e.g., known failure modes, input format quirks, performance caveats). 
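A minimal sketch of such a per-skill memory file follows. Only the .memory.md file name comes from the text; the bullet format, the "Notes from earlier tasks" heading, the helper names, and the two example notes are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

MEMORY_FILE = ".memory.md"


def append_skill_note(skill_dir: Path, note: str) -> None:
    """Append one lesson or observation to the skill's own memory file."""
    with (skill_dir / MEMORY_FILE).open("a", encoding="utf-8") as handle:
        handle.write(f"- {note.strip()}\n")


def load_skill_context(skill_dir: Path) -> str:
    """Return SKILL.md, followed by accumulated notes when any exist."""
    interface = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    memory_path = skill_dir / MEMORY_FILE
    if not memory_path.exists():
        return interface
    notes = memory_path.read_text(encoding="utf-8")
    return f"{interface.rstrip()}\n\n## Notes from earlier tasks\n{notes}"


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "csv_summary"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "# csv_summary\nSummarize a CSV file.\n", encoding="utf-8"
    )
    before = load_skill_context(skill_dir)
    append_skill_note(skill_dir, "Fails on files with a UTF-8 BOM; strip it first.")
    append_skill_note(skill_dir, "Slow on very large files; sample rows first.")
    after = load_skill_context(skill_dir)
```

Because the notes live inside the skill directory, they travel with the package when it is copied to another agent.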
When the same skill is loaded later, this per-skill memory is surfaced alongside its SKILL.md interface, letting the agent benefit from previously learned experience without re-deriving it.

Short-term Memory. Short-term memory maintains the current task context, including intermediate reasoning steps, observations, and temporary execution results. As the context grows, it is adaptively compressed by summarizing intermediate steps, allowing the agent to handle long-horizon tasks without exceeding the model’s token budget.

Long-term Memory. Long-term memory stores persistent notes the agent appends across sessions, including reusable conclusions, environment quirks, and general lessons learned outside any single skill (e.g., “prefer batched I/O,” “the project uses pinned package versions”). Unlike short-term memory, long-term memory is not subject to compression and serves as a growing repository of accumulated experience, enabling the agent to improve decision-making over time by drawing on lessons learned in prior runs.

3.4 Context Management

The agent maintains context as a DAG of conversation nodes, one per turn (Figure 4). Each node records the model response, tool calls, observations, and per-call token usage from one step. Every node carries two sets of pointers: a mutable parent_id that defines the current active chain sent to the LLM, and an immutable history_prev / history_next pair that defines the full history of original turns. The active chain is always a sub-graph of the full history.

Figure 4: Adaptive context compression over a DAG of ReAct turns. Each turn is a (plan, action, observation) triple; the first KEEP_FIRST and last KEEP_LAST turns are always pinned and only the middle is eligible for compression. Top → Middle: Level-1 rewrites individually oversized turns in place. Middle → Bottom: when no single turn is oversized but the chain is still over budget, Level-2 merges the compressible span into one synthetic node.
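The two pointer sets and the two compression levels can be sketched as follows. This is an illustrative reading of the Figure 4 caption, not the paper's code: the `Node` fields other than parent_id, history_prev, and history_next, the values of KEEP_FIRST and KEEP_LAST, the whitespace token count, the treatment of compressed turns as separate synthetic nodes, and the `first_words` truncation (a placeholder for an LLM-written summary) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

KEEP_FIRST = 1   # hypothetical values for the pinned prefix and suffix
KEEP_LAST = 1


@dataclass
class Node:
    node_id: int
    text: str
    parent_id: Optional[int]             # mutable: defines the active chain
    history_prev: Optional[int] = None   # written once: full history of turns
    history_next: Optional[int] = None


def n_tokens(text: str) -> int:
    return len(text.split())             # crude whitespace proxy for tokens


def chain_tokens(chain: List[Node]) -> int:
    return sum(n_tokens(node.text) for node in chain)


def append_turn(nodes: Dict[int, Node], tail: Optional[int], text: str) -> int:
    node_id = max(nodes, default=-1) + 1
    nodes[node_id] = Node(node_id, text, parent_id=tail, history_prev=tail)
    if tail is not None:
        nodes[tail].history_next = node_id
    return node_id


def active_chain(nodes: Dict[int, Node], tail: int) -> List[Node]:
    chain, cur = [], tail
    while cur is not None:
        chain.append(nodes[cur])
        cur = nodes[cur].parent_id
    return chain[::-1]


def compress(nodes: Dict[int, Node], tail: int, budget: int, turn_limit: int,
             summarize: Callable[[str], str]) -> int:
    """Return the tail id of the active chain after compression."""
    chain = active_chain(nodes, tail)
    if chain_tokens(chain) <= budget or len(chain) <= KEEP_FIRST + KEEP_LAST:
        return tail
    cut = len(chain) - KEEP_LAST
    head, middle, last = chain[:KEEP_FIRST], chain[KEEP_FIRST:cut], chain[cut:]

    def synthetic(text: str) -> Node:
        node_id = max(nodes) + 1
        nodes[node_id] = Node(node_id, text, parent_id=None)
        return nodes[node_id]

    # Level 1: swap each oversized middle turn for a shortened copy.
    middle = [synthetic(summarize(n.text)) if n_tokens(n.text) > turn_limit else n
              for n in middle]
    # Level 2: still over budget, so merge the whole middle span into one node.
    if chain_tokens(head + middle + last) > budget:
        middle = [synthetic(summarize(" ".join(n.text for n in middle)))]
    # Rewire parent_id only; history_prev / history_next are never touched.
    prev = None
    for node in head + middle + last:
        node.parent_id = prev
        prev = node.node_id
    return prev


def first_words(text: str, limit: int = 8) -> str:
    return " ".join(text.split()[:limit])   # placeholder for an LLM summary


nodes: Dict[int, Node] = {}
tail = None
for i, size in enumerate([3, 20, 4, 4, 3]):
    tail = append_turn(nodes, tail, " ".join(f"t{i}w{j}" for j in range(size)))

new_tail = compress(nodes, tail, budget=20, turn_limit=8, summarize=first_words)
active_ids = [node.node_id for node in active_chain(nodes, new_tail)]
active_total = chain_tokens(active_chain(nodes, new_tail))

history_ids, cur = [], 0
while cur is not None:
    history_ids.append(cur)
    cur = nodes[cur].history_next
```

In this run the 20-word turn is shortened first, the chain is still over the 20-token budget, and the three middle turns are then merged into one synthetic node, while walking history_next from the first turn still visits all five original turns.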
Original turns remain in the full history (linked by immutable history_prev / history_next pointers), so the trajectory is fully replayable. As tasks grow longer, the accumulated short-term context can exceed the model’s token budget. Existing remedies span token-level prompt compression [ 7 ] , attention-sink-based KV retention for streaming inference [ 30 ] , and OS-style virtual context management for general LLM agents [ 18 ] ; positional studies further document significant degradation when relevant content is buried in the middle of a long context [ 12 ] , which motivates the explicit first/last pinning we adopt below. To handle this at the agent level, MUSE applies adaptive context compression with two levels. Level