From Model Scaling to System Scaling: Scaling the H...

What the paper is about

This paper frames the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

What it covers

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Shangding Gu, UC Berkeley

This manuscript is under active development, and we welcome any constructive comments and suggestions at [email protected].

Abstract

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Recent progress in large language models (LLMs) has enabled agents that use tools, retrieve information, maintain memory, and execute long-horizon workflows. Yet evaluation remains largely model-centric, reducing agents to final-task success or benchmark accuracy while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate: agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer for tools and subagents, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, the system that translates model capability into long-horizon agent behavior. We therefore study scaling the harness through three core bottlenecks in agentic AI: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that operationalize system scaling, going beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time.
Alongside the framework, we develop and release CheetahClaws (https://github.com/SafeRL-Lab/cheetahclaws), a Python-native reference harness, and use it together with Claude Code and OpenClaw as concrete points of comparison that make harness-level design choices explicit. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

Website: https://cheetahclaws.github.io

1 Introduction

The dominant story of recent AI progress has been model scaling: larger models, more data, stronger post-training, and higher benchmark scores (OpenAI, 2026; Anthropic, 2026; Google, 2026). For agentic AI, this story is now incomplete. Once foundation models are embedded into tools, terminals, browsers, repositories, memory stores, and external services, their behavior is no longer determined by the model alone. It is determined by a system: how context is constructed, how memory is retrieved, how tools are invoked, how subagents are routed, how actions are verified, and how failures are audited. Our key claim is therefore that agentic AI should be studied and evaluated as a system-scaling problem, not merely as a model-scaling problem.

By model scaling, we refer to improvements in the standalone foundation model, including model size, training data, post-training, and raw reasoning capability. By system scaling, we refer to improvements in the surrounding architecture, including memory, context construction, skill routing across tools and subagents, orchestration, and verification-and-governance, and how these components adapt over time. Equivalently, this is a problem of scaling the harness: improving the structured execution layer around the foundation model, so that these system components work reliably over long horizons.
Our claim is not that model scaling no longer matters; rather, once models reach a sufficient capability threshold, many additional gains in long-horizon agent performance increasingly depend on how the system around the model is designed.

Modern agentic systems already illustrate what scaling the harness looks like in practice. Production harnesses such as Claude Code (Anthropic, 2025a) and OpenClaw (Team, 2026) couple foundation models to tools, subagents, and persistent project memory (detailed in §3.1); research-side harnesses such as SWE-agent further show that careful tool-schema design alone can improve benchmark accuracy substantially even with a fixed backbone model (Yang et al., 2024). These systems show that practical agent capability does not arise from next-token prediction alone, but from the interaction between the foundation model and the harness that surrounds it. The relevant object of study is therefore not simply a model plus prompt, but a structured execution system, a view increasingly reflected in recent work on code-centered agent harnesses (Ning et al., 2026).

This perspective is highlighted by recent empirical findings. A field-level analysis of agent benchmarks finds that many results do not separate capability from costs, prompting strategy, and demonstrations, and become non-Pareto-optimal once these factors are controlled (Kapoor et al., 2024). Consistent with this, redesigning the agent-computer interface alone, while holding the underlying model fixed, can substantially improve SWE-bench accuracy (Yang et al., 2024). Thus, what is often reported as a model score is in fact a model-plus-harness score. Context length is another example: larger context windows do not guarantee effective information access, because attention dilutes over long inputs (Gu, 2026), and models often prefer evidence at the start or end of the context rather than in the middle (Liu et al., 2024a).
Multi-agent systems show a similar pattern: they can outperform single agents on breadth-first tasks but introduce coordination failures that single-agent metrics miss (Anthropic, 2025d; Cemri et al., 2026); we return to this in §5.2. Realistic agent benchmarks such as GAIA (Mialon et al., 2024), $\tau$-bench (Yao et al., 2024), and Terminal-Bench (Merrill et al., 2026) further show that frontier models struggle when evaluation moves from one-shot prompting to multi-step interaction with tools, environments, and users. In particular, $\tau$-bench shows that agents that look strong under single-shot pass rates can collapse under $\text{pass}^k$, the probability of succeeding on all $k$ of $k$ independent rollouts. This exposes a reliability gap that endpoint accuracy hides.

These findings suggest that we need to rethink several parts of the agent system. Prompt engineering (White et al., 2023) remains useful for local control, but long-horizon performance increasingly depends on reusable skills, persistent memory, disciplined context construction, and verification-aware execution. The key issue is not only context size, but context governance: what should be retrieved, compressed, ordered, refreshed, trusted, and kept active at each step. Memory is not merely a storage layer; the harder problem is memory quality, including what to store, what to discard, how to retrieve the right information at the right time, and how to avoid staleness, drift, contamination (Al-Tawaha et al., 2026), and over-generalization. Multi-agent systems are not automatically collaborative; reliable collaboration requires explicit communication protocols and uncertainty sharing (Guo et al., 2026), which we expand on in §5.2. Finally, the field still lacks a mature framework for agent evolution over time, including how agents should update skills, refine memory, communicate across roles, and remain auditable as they adapt.
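To make the reliability gap behind $\text{pass}^k$ concrete, the following is a minimal sketch rather than the benchmark's own estimator. It assumes, for illustration only, that every rollout succeeds independently with the same probability `p`; the function names `pass_hat_k` and `pass_at_k` are hypothetical.

```python
def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that ALL k independent rollouts succeed."""
    return p ** k


def pass_at_k(p: float, k: int) -> float:
    """pass@k: probability that AT LEAST ONE of k independent rollouts succeeds."""
    return 1.0 - (1.0 - p) ** k


# Assumed single-rollout success rate of 0.8 (illustrative, not a reported result).
for k in (1, 4, 8):
    print(k, round(pass_hat_k(0.8, k), 3), round(pass_at_k(0.8, k), 3))
```

Under these assumptions, an agent that succeeds 80% of the time on a single rollout completes all of 8 rollouts only about 17% of the time, even though its chance of at least one success is close to 1. The two metrics diverge as $k$ grows, which is why single-shot pass rates can overstate reliability.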
This paper makes three main contributions:

• System-scaling framing. We develop a systems-centered framing of agentic AI in which progress depends on scaling the harness, not only scaling the model. Our main claim is that the next bottleneck in agentic AI is not only how powerful the model is, but how well the surrounding system manages memory, context, skill routing across tools and subagents, orchestration, verification and governance, and adaptation over time.

• Harness-level framework. We propose a framework that separates base-model reasoning from system factors including memory, context construction, skill routing, orchestration, and verification-and-governance. This framework treats the agent harness as a first-class object of design and analysis.

• Evaluation agenda and reference harness. We outline an evaluation agenda for agentic systems, highlighting that future benchmarks should measure process-level and longitudinal properties such as trajectory quality, memory hygiene, context efficiency, verification cost, safe evolution, and robustness under repeated use. To make the discussion concrete, we develop CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw, treating their harness-level design choices as instances of the system-scaling variables identified by our framework.

2 Related Work

Agentic coding systems and harness engineering. Modern coding agents follow a line of work on tool-using language models, beginning with interleaved reasoning-and-acting policies such as ReAct (Yao et al., 2022), self-taught tool invocation (Schick et al., 2023), and verbal self-correction loops (Shinn et al., 2023). Production systems such as Claude Code (Anthropic, 2025a, c) and Codex-style "harness engineering" (Ryan Lopopolo, 2026) package these primitives into programmable agent runtimes with tools, subagents, hooks, and persistent project memory. A parallel research line targets software engineering specifically, including SWE-agent's agent-computer interface, which shows that carefully designed tool schemas can by themselves move benchmark accuracy substantially even with a fixed backbone model (Yang et al., 2024). Most of this work, however, reports results at the level of individual model variants; comparatively little attention has been paid to the harness itself as a controllable, reproducible object of study, which is the vantage we adopt throughout this paper.

Context, memory, and retrieval. Retrieval-augmented generation (Lewis et al., 2020) showed that augmenting parametric language models with external non-parametric memory can substantially improve knowledge-intensive generation and question answering.
Follow-up work studies memory as a system component, including MemGPT's hierarchical memory management (Packer et al., 2023) and Voyager's growing skill library for open-ended exploration (Wang et al., 2023). At the same time, recent analyses show that longer context windows come with their own failure modes such as privacy drift (Gu, 2026), and that agents still need calibrated uncertainty to decide when to retrieve at all (Guo et al., 2026). These results motivate our treatment of context, memory, and retrieval as a context-governance problem rather than as independent capabilities.

Skills and multi-agent coordination. Reusable skills have emerged as a way to offload recurring behavior from prompts into durable, callable components (Kazuhiro Sera, 2026; Emre Okcular, 2026; Wang et al., 2023), extending earlier work on chain-of-thought prompting (Wei et al., 2022) and prompt-pattern catalogs (White et al., 2023). In parallel, multi-agent frameworks such as AutoGen (Wu et al., 2024), MetaGPT (Hong et al., 2024), and CAMEL (Li et al., 2023) formalize agent-to-agent communication, while Anthropic reports substantial gains from orchestrator-plus-subagent configurations on breadth-first research tasks (Anthropic, 2025d). Complementary work studies how population diversity (Yang et al., 2026; Ye et al., 2025) and negotiation-style frameworks (Liu et al., 2026) shape collective behavior, and how such agents compose into a broader "agentic web" (Yang et al., 2025). Our framing treats skills and delegation jointly as the skill lever and emphasizes that skill routing under heterogeneous subagents, rather than the existence of skills or subagents, is the next open systems bottleneck.

Benchmarks, governance, and agent evolution. A growing line of work evaluates agents as systems through executable, multi-step benchmarks (Jimenez et al., 2024; Liu et al., 2024b; Zhou et al., 2024; Merrill et al., 2026), alongside broader surveys of LLM-based agents (Xi et al., 2025) and catalogues of agentic safety threats (OWASP GenAI Security Project, 2025); yet single-episode success still dominates the reported metrics, leaving memory quality, context efficiency, communication fidelity, and safe evolution under repeated use largely unmeasured (we return to these in §5). Compared to these lines of work, our contribution is to reframe prior developments through a system-scaling perspective and to make its engineering content concrete through a comparative analysis of Claude Code, OpenClaw, and our Python-native reference harness CheetahClaws.

3 System Scaling: A Framework for Agentic AI

Throughout this paper, we use harness to refer to the structured system layer surrounding a foundation model: the tool interface, control loop, context constructor, memory store, skill-routing mechanism, and verification-and-governance layer that together mediate between user intent, model outputs, and the external environment. The harness is what model scaling does not include and what system scaling targets. We use system scaling to denote improvements in this harness that determine how information, computation, authority, and verification are allocated over time, and refer to this engineering agenda as scaling the harness. Under this view, an agent is not simply a model with a prompt, but a system composed of six interacting components: a reasoning substrate ($\mathcal{R}$), a memory store ($\mathcal{M}$), a context constructor ($\mathcal{C}$), a skill-routing layer ($\mathcal{S}$, which dispatches tools and subagents), an orchestration loop ($\mathcal{O}$), and a verification and governance layer ($\mathcal{G}$).
Let performance over a horizon $H$ be

$$\mathcal{P}_H = \Phi(\mathcal{R}, \mathcal{M}, \mathcal{C}, \mathcal{S}, \mathcal{O}, \mathcal{G}), \tag{1}$$

where $\mathcal{R}$ denotes base reasoning quality, $\mathcal{M}$ memory quality, $\mathcal{C}$ context-construction quality, $\mathcal{S}$ skill selection and composition quality, $\mathcal{O}$ orchestration quality, and $\mathcal{G}$ governance quality. Model scaling primarily improves $\mathcal{R}$; system scaling improves $\mathcal{M}$, $\mathcal{C}$, $\mathcal{S}$, $\mathcal{O}$, and $\mathcal{G}$. The main claim of this paper is that, once models reach a sufficient capability level, long-horizon agent performance may be limited not only by $\mathcal{R}$ itself, but also by the surrounding system factors. A useful further factorization is

$$\mathcal{M} = (\text{precision},\ \text{durability},\ \text{retrievability},\ \text{verifiability}), \tag{2}$$

$$\mathcal{C} = (\text{relevance},\ \text{compactness},\ \text{traceability},\ \text{refresh policy}). \tag{3}$$

Each factor names a system-level lever, not a hidden engineering detail. Figure 1 sketches how these components interact: the orchestration loop $\mathcal{O}$ wraps a flow in which $\mathcal{C}$ draws from $\mathcal{M}$ to assemble inputs for $\mathcal{R}$, $\mathcal{S}$ dispatches tools and subagents, and $\mathcal{G}$ gates both intermediate reasoning and external action before any verified result is written back to memory.

Status of the decomposition. Equation 1 is a conceptual organization rather than a quantitative model: $\Phi$ has no closed form, the factors are not strictly orthogonal, and we do not claim they jointly determine $\mathcal{P}_H$ as a measurable equation. What we do claim is that each factor names a distinct point of intervention, a place where engineering or research effort changes long-horizon behavior, and that existing discussions frequently fail to distinguish between them. We choose these six axes because each one can be changed, turned off, or measured on its own, while keeping the same foundation model. For instance, run $\mathcal{O}$ in a one-shot loop, or turn $\mathcal{G}$ off, and the same $\mathcal{R}$, $\mathcal{C}$, $\mathcal{S}$ start to act like noticeably different agents. Among the six, $\mathcal{R}$ and $\mathcal{C}$ are the hardest to separate (a stronger reasoning substrate can compensate for noisier context, and vice versa), while $\mathcal{M}$ and $\mathcal{G}$ are the easiest to isolate, since they govern writes and audit trails that exist independently of any single inference step.

Figure 1: A six-component view of an agentic system: $\mathcal{P}_H = \Phi(\mathcal{R}, \mathcal{M}, \mathcal{C}, \mathcal{S}, \mathcal{O}, \mathcal{G})$. The orchestration layer ($\mathcal{O}$) wraps a control loop in which the context constructor ($\mathcal{C}$) draws from durable memory ($\mathcal{M}$) and the current task to assemble inputs for the reasoning substrate ($\mathcal{R}$, i.e., the foundation model). The skill router ($\mathcal{S}$) dispatches tools or subagents; their effects on the environment, together with the model's intermediate steps, are gated through verification and governance ($\mathcal{G}$) before they become permitted actions or verified memory write-backs. Model scaling improves $\mathcal{R}$; system scaling improves $\mathcal{M}$, $\mathcal{C}$, $\mathcal{S}$, $\mathcal{O}$, and $\mathcal{G}$.
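The control flow in Figure 1 can be illustrated with a small sketch. This is not the CheetahClaws implementation or any other system's API: the `Harness` class, its field names, the `"tool:argument"` action format, and the toy callables are all assumptions chosen to show where each of the six components sits in one loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Tool = Callable[[str], str]


@dataclass
class Harness:
    reason: Callable[[str], str]                    # R: context -> proposed action
    build_context: Callable[[str, List[str]], str]  # C: (task, memory) -> context
    tools: Dict[str, Tool]                          # S: routing table, name -> tool
    permit: Callable[[str], bool]                   # G: gate before acting
    verify: Callable[[str], bool]                   # G: gate before memory write-back
    memory: List[str] = field(default_factory=list)  # M: durable store
    max_steps: int = 3                              # O: step budget

    def run(self, task: str) -> List[str]:
        trace: List[str] = []
        for _ in range(self.max_steps):                       # O: orchestration loop
            context = self.build_context(task, self.memory)   # C draws from M
            action = self.reason(context)                     # R proposes an action
            if action == "done":
                break
            name, _, arg = action.partition(":")
            tool = self.tools.get(name)                       # S dispatches a tool
            if tool is None or not self.permit(action):       # G blocks the action
                trace.append(f"rejected:{action}")
                continue
            result = tool(arg)
            if self.verify(result):                           # G checks the result
                self.memory.append(result)                    # verified write-back
                trace.append(f"ok:{action}")
            else:
                trace.append(f"unverified:{action}")
        return trace


def toy_reason(context: str) -> str:
    # Stand-in for a foundation model: read one file, then stop.
    return "done" if "README" in context else "read:README"


harness = Harness(
    reason=toy_reason,
    build_context=lambda task, mem: task + " | " + " ; ".join(mem[-2:]),
    tools={"read": lambda arg: f"contents of {arg}"},
    permit=lambda action: not action.startswith("delete"),
    verify=lambda result: result.startswith("contents"),
)
trace = harness.run("summarize the repo")
```

Because each component is a separate argument, one can swap `build_context`, disable `verify`, or set `max_steps=1` while holding `reason` fixed, which mirrors the point that the six axes can be varied independently of the foundation model.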
3.1 Agent Harnesses as System Infrastructure

Modern agent harnesses such as OpenClaw (Team, 2026) and Claude Code (Anthropic, 2025a) are better understood as systems infrastructures rather than simple model interfaces: their behavior depends not only on the underlying language model, but on the surrounding tool interface, execution loop, context constructor, memory substrate, and orchestration policy. Claude Code in particular benefits from substantial harness engineering (Ryan Lopopolo, 2026): it bundles tools for codebase navigation, file editing, and command execution; dispatches specialized subagents with their own context windows, prompts, and permissions; and adopts a hybrid context strategy that loads persistent project guidance up front while retrieving information just in time through glob/grep-style tools. (See the documentation at https://code.claude.com/docs/en/overview (overview), https://code.claude.com/docs/en/sub-agents (subagents), and https://platform.claude.com/docs/en/agent-sdk/python (SDK).) What distinguishes modern agentic coding systems from classic code assistants is therefore not stronger token-level generation alone, but the presence of an execution harness that supports tool use, iterative verification, and task decomposition.

These details matter because they shift attention from model capability alone to the system conditions under which that capability is expressed. The relevant unit of analysis is not simply an isolated foundation model conditioned on a prompt, but the interaction among the six components introduced in Equation 1: $(\mathcal{R}, \mathcal{M}, \mathcal{C}, \mathcal{S}, \mathcal{O}, \mathcal{G})$. These are not minor implementation details. They determine what information is available at decision time, how external actions are executed and verified, and how progress accumulates across turns. As a result, they increasingly govern task-level performance in long-horizon settings.
Once these components are treated as first-class objects, the key research question shifts from "How do we prompt the model better?" to "How do we allocate computation across memory, retrieval, tools, and subagents over time?"

Table 1: Illustrative harness design patterns. The point is not to rank systems, but to show that comparable agent primitives can be governed differently under different deployment priorities.

| | Claude Code | OpenClaw | CheetahClaws |
|---|---|---|---|
| Implementation | TypeScript | TypeScript | Python |
| Primary setting | Vendor coding agent | Personal assistant | Research reference |
| Primary user interaction | Terminal CLI / IDE | Messaging app (Discord, Slack, iMessage, …) | Terminal CLI |
| Context governance | User, project, session | User, channel-peer, session | User, project, session |
| Memory | Persistent text memory, auto-extraction | Conversation history, vector retrieval | Structured entries with confidence, recency |
| Source availability | Closed-source | Open-source | Open-source |

A natural skeptical view holds that, once the foundation model is held fixed, most harnesses collapse to the same tool loop, with only superficial differences between them. We show instead that the same core systems problems (context governance, memory trust, skill routing, and auditability) admit substantially different solutions depending on deployment priorities. Table 1 sketches three illustrative design points built around comparable frontier-model capabilities: Claude Code (v2.1.88), a production-grade vendor harness; OpenClaw (v2026.4.6), a community TypeScript harness for multi-channel personal assistance; and CheetahClaws (v3.05.79), a Python-native reference harness used here as an open illustrative design point.

Two observations follow. First, the three systems reflect a shared systems-decomposition principle: each addresses context governance, memory management, and skill routing, even though these levers are realized through different design choices.
This convergence suggests that they are intrinsic design problems for agentic AI systems, rather than incidental features of any particular implementation. Second, their main differences are driven less by the foundation model than by deployment priorities: vendor-scale systems prioritize reliable use, personal-assistant systems prioritize a gateway for multi-channel management, and research-oriented harnesses prioritize transparency and reproducibility. The remainder of the paper makes these levers concrete: §4 expands the three bottlenecks of context, memory, and skill, and §5 discusses how to evaluate and govern their evolution.

All three systems consolidate session content into persistent memory through subagent extraction, background daemons, or dedicated consolidation routines. What differs is the representation of trust: CheetahClaws stores per-entry confidence and recency as first-class fields, used directly in retrieval ranking and conflict resolution. The other two derive trust implicitly from access patterns. In this sense CheetahClaws operationalizes the trust axes of §4.2 more directly.

3.2 Prompt, Skill, and Memory as Temporal Layers

We interpret prompt, skill, and memory as three primary temporal axes of system scaling in agentic AI. This view is complementary to Equation 1: skill corresponds to $\mathcal{S}$ and memory to $\mathcal{M}$, while prompt sits inside each per-turn output of the context constructor $\mathcal{C}$; the orchestration ($\mathcal{O}$) and verification-and-governance ($\mathcal{G}$) layers determine how the three are sequenced and verified over time. As shown in Table 2, they operate at different temporal scales and support different forms of adaptation.

Prompt. Prompt is the short-horizon control interface. It specifies the immediate role, constraints, and objective. Prompting is flexible and cheap, but also brittle: it does not by itself create persistence, transfer, or reliable long-horizon structure.

Skill.
A skill is a reusable execution pattern. In practice, a skill may appear as a workflow template, a tool-use routine, a specialized subagent, or a versioned bundle of instructions and scripts. OpenAI's recent discussions of skills for coding agents make this direction explicit: durable procedures are separated from one-off prompts and packaged as reusable components attached to the execution environment (Kazuhiro Sera, 2026; Emre Okcular, 2026). Skills make behavior more reusable, but introduce a routing problem: the agent must decide which skill to invoke, when to switch skills, and how to compose multiple skills in one trajectory.

Memory. Memory is the longitudinal layer. It stores what should persist across turns or sessions: project conventions, user preferences, stable facts about the environment, prior failures, and distilled structure from earlier work. Memory is essential for repeated tasks, but it can fail along three trust axes elaborated in Section 4.2: drift (loss of durability), over-generalization (loss of precision), and pollution (loss of verifiability).

These three levers are complementary rather than interchangeable. Prompt controls what to do now; skill controls how to do this class of things; memory controls what should survive over time. A robust agent is therefore not merely well prompted. It is well prompted, appropriately skilled, and selectively grounded in durable memory.

Table 2: Prompt, skill, and memory as three core axes of system scaling in long-horizon agents.

| Lever | Timescale | Primary role | Typical failure mode |
|---|---|---|---|
| Prompt | Local | Specify current goal, constraints, and style | Brittle over long horizons; poor transfer |
| Skill | Task-level | Reusable procedure or workflow pattern | Wrong routing or poor composition |
| Memory | Longitudinal | Preserve durable facts and prior experience | Drift, over-generalization, pollution (durability / precision / verifiability) |

4 Three Bottlenecks in System Scaling

We now expand three system factors from Equation 1 where model scaling alone has been least sufficient: context construction $\mathcal{C}$, memory $\mathcal{M}$, and skill routing $\mathcal{S}$, tightly coupled to verification and governance $\mathcal{G}$. Each subsection names four subaxes of its component, the dominant failure mode, and the system move that addresses it; Table 3 summarizes the three.

Table 3: Three bottlenecks of system scaling. Each subsection names four subaxes of one component, a characteristic failure mode, and the system move that addresses it.

| Component | Subaxes | Failure mode | System move |
|---|---|---|---|
| $\mathcal{C}$ governance (§4.1) | relevance, compactness, traceability, refresh | exposure without access | assembly as a policy; persistent priors plus just-in-time refresh |
| $\mathcal{M}$ trust (§4.2) | precision, durability, retrievability, verifiability | stale-but-confident | trust re-established at retrieval; periodic verification against the environment |
| $\mathcal{S}$ routing (§4.3) | specificity, selectivity, composability, verifiability | confident-but-unchecked | adaptive routing coupled with explicit post-condition checks |

4.1 Context Governance

The hard problem of context is not capacity, but governance. From the four axes of $\mathcal{C}$ in Equation 3, an effective context assembly is jointly relevant to the current task, compact (no more than the minimum sufficient set), traceable to its sources, and refreshed against a moving environment.
Larger context windows expand capacity, but they do not guarantee relevance, compactness, traceability, or freshness.

The threat we are guarding against is exposure without access: as context grows, the model sees more tokens but does not necessarily attend to the right ones. Relevant evidence competes with low-value padding (signal dilution; Gu, 2026), task-relevant structure is buried in unorganized text, and token salience is driven by local statistics rather than decision importance. Long context does not indicate good context; tokens added without governance often degrade performance rather than improve it.

The system move is to treat each turn's context as the output of a selection policy, not a fixed buffer. The policy should weight semantic relevance, penalize verbosity against a token budget, prefer recently validated content, and record provenance so failures can be attributed at audit time. The right systems question is therefore not how many tokens the model can hold, but how the system constructs the minimum sufficient context for the current subproblem.

4.2 Trustworthy Memory

The hard problem of agent memory is not storage, but trust. Matching three of the four axes of $\mathcal{M}$ in Equation 2, a memory item earns trust when it is precise within a defined scope, remains durable (its target has not silently drifted), and is verifiable against the current environment. The fourth axis, retrievability, controls whether that trust can be used at acceptable cost. It is a precondition for using trust, not a source of trust.

The threat we are guarding against is stale-but-confident. A note that was correct at one point, say "the data loader is defined in utils/loader.py", can become flatly wrong after a refactor without any change to its wording. Semantic search and reuse statistics still rank it highly, but its target has drifted, and acting on it is now destructive (calling a deleted symbol, or reintroducing a fixed regression).
The failure mode is asymmetric: stale memory rarely prevents retrieval, but regularly leads the agent to act confidently on invalidated assumptions.

The system move is to make trust a runtime decision, not a property of the stored item. Retrieval should weight a staleness penalty (against the time of last verification) and a confidence-gated risk term alongside any relevance score, and should treat the retrieved content as a hypothesis until re-checked against the live environment. Claude Code realizes this coupling through a hybrid design: CLAUDE.md carries persistent project context, while built-in primitives (glob, grep, file reads) provide just-in-time access t
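
The runtime trust decision described in this subsection can be sketched as follows. This is an illustrative assumption, not the scoring used by CheetahClaws, Claude Code, or OpenClaw: the exponential half-life decay, the `risk_weight` constant, the `MemoryEntry` fields, and the post-refactor path `data/loading.py` are all hypothetical. The sketch ranks entries by relevance discounted for staleness, subtracts a confidence-gated risk term, and accepts a retrieved entry only after a re-check against the live environment.

```python
from dataclasses import dataclass
from typing import Callable, List

DAY = 86_400.0  # seconds


@dataclass
class MemoryEntry:
    text: str
    target: str           # what the note points at (hypothetical field)
    relevance: float      # similarity to the current query, in [0, 1]
    confidence: float     # stored confidence, in [0, 1]
    last_verified: float  # timestamp of the last successful re-check


def trust_score(e: MemoryEntry, now: float,
                half_life: float = 7 * DAY, risk_weight: float = 0.5) -> float:
    age = max(0.0, now - e.last_verified)
    freshness = 0.5 ** (age / half_life)       # staleness penalty (assumed decay)
    risk = risk_weight * (1.0 - e.confidence)  # confidence-gated risk term
    return e.relevance * freshness - risk


def retrieve(entries: List[MemoryEntry], now: float,
             recheck: Callable[[MemoryEntry], bool], k: int = 2) -> List[MemoryEntry]:
    ranked = sorted(entries, key=lambda e: trust_score(e, now), reverse=True)
    trusted = []
    for e in ranked[:k]:
        if recheck(e):            # hypothesis confirmed against the environment
            e.last_verified = now
            trusted.append(e)
    return trusted


now = 1_000_000_000.0
stale = MemoryEntry("the data loader is defined in utils/loader.py",
                    "utils/loader.py", 0.9, 0.9, now - 60 * DAY)
fresh = MemoryEntry("the data loader is defined in data/loading.py",
                    "data/loading.py", 0.7, 0.8, now - DAY)
live_files = {"data/loading.py"}  # hypothetical state of the repo after a refactor
hits = retrieve([stale, fresh], now, lambda e: e.target in live_files)
```

In this toy run the stale note has the higher raw relevance and stored confidence, yet it ranks below the recently verified note and is then rejected by the re-check, so only the entry whose target still exists is returned.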
