Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
This paper investigates how to build reliable autonomous agents that manage real money in financial markets. Rather than focusing solely on the intelligence of the language model itself, the authors argue that reliability is an "operating-layer" property. By studying a 21-day deployment in which more than 3,500 agents traded real ETH, the researchers demonstrate that the system surrounding the model—including prompt compilation, policy validation, and execution guards—is what ultimately determines whether an agent acts safely and effectively.
The Operating Layer Approach
The researchers define the "operating layer" as the entire system connecting a user’s goal to an onchain result. This includes the user interface, the prompt compiler, the model’s reasoning process, policy checks, and the final execution on the blockchain. The study emphasizes that even a highly capable model can fail if the surrounding system provides poor instructions, misinterprets market data, or fails to enforce strict safety boundaries. By keeping the model, hardware, and infrastructure constant, the team was able to isolate how specific changes to this operating layer influenced agent behavior.
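The stages described above can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: the names (`Mandate`, `compile_prompt`, `PolicyValidator`, `run_step`) and the specific limits are assumptions, chosen only to show how a policy check and execution guard sit outside the model.

```python
# Hypothetical operating-layer pipeline: user mandate -> compiled prompt ->
# model proposal -> policy validation -> guarded execution. All names and
# limits here are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Mandate:
    goal: str
    risk_level: int       # e.g. 1 (conservative) .. 5 (aggressive)
    max_trade_eth: float  # hard cap enforced outside the model

def compile_prompt(mandate, market_context):
    """Deterministically assemble the model's instructions."""
    return (
        f"Goal: {mandate.goal}\n"
        f"Risk level: {mandate.risk_level}/5\n"
        f"Market context:\n{market_context}"
    )

class PolicyValidator:
    """Checks a proposed action against hard limits before execution."""
    def validate(self, action, mandate):
        if action["size_eth"] > mandate.max_trade_eth:
            return False, "trade size exceeds mandate cap"
        return True, "ok"

def run_step(mandate, market_context, model, executor):
    prompt = compile_prompt(mandate, market_context)
    action = model(prompt)  # model proposes an action as structured data
    ok, reason = PolicyValidator().validate(action, mandate)
    if not ok:
        # A guard rejection is a valid, logged outcome, not a silent failure.
        return {"status": "rejected", "reason": reason}
    return executor(action)  # execution guard / onchain call would live here
```

The key design point mirrors the paper's argument: the cap is enforced by code the model cannot override, so even a confidently wrong model proposal cannot breach the mandate.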
Identifying and Fixing Failure Modes
During pre-launch testing, the team discovered several critical failure modes that standard text-based benchmarks often miss. For example, agents sometimes "fabricated" trading rules that did not exist, became paralyzed by transaction fees, or misinterpreted complex tokenomics. By analyzing the full path from user mandate to final settlement, the researchers were able to implement targeted fixes: relocating the mention of fees within the prompt, or providing structured context for token mechanics, significantly improved performance. This shows that such failures are often problems of prompt construction and context management rather than model intelligence.
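The idea of restructuring context rather than retraining the model can be illustrated with a small sketch. The section names and fields below are invented for illustration; the point is that hard constraints and token mechanics are placed as prioritized, structured sections ahead of narrative text, rather than buried inside it.

```python
# Illustrative context builder: constraints and token mechanics come first,
# as structured fields, before free-form market narrative. Field names and
# section headings are assumptions, not the paper's schema.

def build_context(fee_gwei, token_info, news):
    """Place hard constraints and mechanics before narrative context."""
    sections = [
        "## Constraints (always apply)",
        f"- Network fee estimate: {fee_gwei} gwei per transaction",
        "## Token mechanics",
    ]
    # Structured key/value pairs instead of prose descriptions of tokenomics.
    sections += [f"- {key}: {value}" for key, value in token_info.items()]
    sections += ["## Market narrative", news]
    return "\n".join(sections)
```

Ordering matters here: a fee figure stated once in a dedicated constraints section is harder for the model to over-weight or ignore than the same figure mentioned mid-paragraph.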
Reliability Through Observability
A key contribution of this work is the use of "instruction-to-settlement" traces. Because the system logged every step of the decision-making process—from the user's initial strategy to the final blockchain transaction—the researchers could diagnose exactly why a trade succeeded or failed. This level of transparency allowed the team to distinguish a model error from a contradictory user instruction, and both from a valid rejection by the system's safety guards.
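A minimal version of such tracing can be sketched as an append-only event log with a classifier over the recorded stages. The stage names and classification rules below are assumptions for illustration, not the paper's actual schema.

```python
# Minimal instruction-to-settlement trace: every stage appends an event,
# and the outcome is classified from the recorded stages. Stage names
# ("prompt", "policy", "settlement") are illustrative assumptions.
import time

class Trace:
    def __init__(self, mandate_id):
        self.mandate_id = mandate_id
        self.events = []

    def record(self, stage, payload):
        """Append one timestamped event for a pipeline stage."""
        self.events.append({
            "ts": time.time(),
            "stage": stage,
            "payload": payload,
        })

    def outcome(self):
        """Classify why this step ended the way it did."""
        stages = {e["stage"]: e["payload"] for e in self.events}
        if stages.get("policy", {}).get("rejected"):
            return "guard_rejection"   # safety guard blocked the action
        if "settlement" in stages:
            return "settled"           # transaction reached the chain
        return "incomplete"            # failed before settlement
```

Because every outcome maps back to a recorded stage, a failed trade can be attributed to a specific point in the pipeline rather than debated after the fact.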
Key Takeaways for Agent Design
The study concludes that for agents managing real capital, evaluation must go beyond simple task completion. The authors found that structured controls—such as sliders for risk and activity levels—were more reliable than free-form chat for managing financial mandates. Furthermore, the research suggests that traditional memory and retrieval methods can sometimes introduce more confusion than clarity in fast-moving markets. Instead, providing agents with clear, structured, and prioritized context is the most effective way to ensure they act in alignment with user intent.
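The preference for structured controls over free-form chat can be made concrete with a small sketch: slider positions compile deterministically into explicit, prioritized directives. The mapping tables below are invented for illustration; the paper does not specify these values.

```python
# Sketch of compiling slider-style controls (risk, activity) into
# unambiguous, prioritized instructions. The directive text and the
# slider scales are assumptions made for illustration.
RISK_DIRECTIVES = {
    1: "Only trade highly liquid assets; never exceed 1% of portfolio per trade.",
    2: "Prefer liquid assets; cap any single trade at 5% of portfolio.",
    3: "Balanced exposure; cap any single trade at 10% of portfolio.",
}

ACTIVITY_DIRECTIVES = {
    1: "Trade at most once per day.",
    2: "Trade at most once per hour.",
    3: "Trade whenever the strategy signals.",
}

def compile_mandate(risk, activity):
    """Turn slider positions into explicit, prioritized instructions."""
    return "\n".join([
        "Priority 1 (hard limits): " + RISK_DIRECTIVES[risk],
        "Priority 2 (cadence): " + ACTIVITY_DIRECTIVES[activity],
    ])
```

Unlike a free-form chat message, the same slider position always produces the same directive text, so two users with the same settings get identically constrained agents.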