
Key Takeaways

  • Production agentic systems often rely on massive, expensive "frontier" models for every step of a workflow, even for simple tasks like looking up a record or extracting data.
  • Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine.
  • This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models?
  • We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs.
Paper Abstract

Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate, the strongest open-weight model matches GPT-5 on our benchmark while being substantially cheaper and faster to run. The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability. We also find that this boundary is not explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific rather than universal. These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control. We release the benchmark, harness, sweep configurations, and full run corpus.

AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
Production agentic systems often rely on massive, expensive "frontier" models for every step of a workflow, even for simple tasks like looking up a record or extracting data. This paper investigates whether this "frontier-first" approach is actually necessary or if smaller, more efficient open-weight models can handle the majority of these routine actions. The authors introduce AgentFloor, a benchmark designed to map out exactly where smaller models succeed and where they fall short compared to flagship models like GPT-5.

A Six-Tier Capability Ladder

To test model performance, the researchers created a deterministic 30-task benchmark organized into six tiers of increasing cognitive difficulty. These tiers range from basic instruction following (Tier A0) and single-tool use (Tier A) to more complex requirements like sequential tool chaining (Tier B), conditional branching (Tier C), multi-source synthesis (Tier D), and finally, long-horizon planning under persistent constraints (Tier E). By using an abstract, in-memory database rather than live web APIs, the researchers ensured the benchmark was stable, repeatable, and free from the noise or contamination often found in real-world environments.
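The deterministic, in-memory design described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual harness: the tool name `lookup_record`, the `Tier` labels, and the toy database are all assumptions made for clarity.

```python
# Sketch of a deterministic, in-memory tool environment in the spirit of
# AgentFloor's design. Because the "database" is a fixed dict rather than a
# live API, every run is repeatable and outputs can be scored exactly.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    A0 = "instruction following"
    A = "single-tool use"
    B = "sequential tool chaining"
    C = "conditional branching"
    D = "multi-source synthesis"
    E = "long-horizon planning"

# Fixed in-memory state: no network calls, no noise, no contamination.
DB = {
    "orders": {"o-1": {"item": "widget", "qty": 3}},
    "customers": {"c-1": {"name": "Ada", "order_id": "o-1"}},
}

def lookup_record(table: str, key: str) -> dict:
    """A deterministic tool call: same inputs always yield the same output."""
    return DB[table][key]

@dataclass
class Task:
    tier: Tier
    prompt: str
    expected: dict  # exact-match scoring is possible because the env is fixed

task = Task(
    tier=Tier.B,
    prompt="Find customer c-1's order and report its quantity.",
    expected={"qty": 3},
)

# A Tier-B task chains two tool calls: customer record -> order record.
customer = lookup_record("customers", "c-1")
order = lookup_record("orders", customer["order_id"])
assert {"qty": order["qty"]} == task.expected
```

A higher tier would extend this pattern, e.g. Tier E would add persistent constraints that the model must keep satisfying across many such calls.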

The Performance Boundary

The study evaluated 16 open-weight models, ranging from 0.27B to 32B parameters, against GPT-5. The results reveal a clear, stratified boundary of model necessity. Small and mid-sized open-weight models are highly capable of handling the short-horizon, structured tool use that makes up the bulk of most agentic pipelines. In fact, the strongest open-weight model tested, gemma4:26b, matched the performance of GPT-5 across the benchmark while being significantly faster and cheaper to operate. However, a clear gap emerges at the highest tier (Tier E). When tasks require sustained coordination and reliable constraint tracking over many steps, frontier models still hold an advantage, though the authors note that even these larger models struggle to reach high levels of reliability in these complex scenarios.

Practical Implications for System Design

The findings suggest a new design principle for building agentic systems: developers should route routine, short-horizon tasks to smaller, cost-effective open-weight models, reserving expensive frontier models only for the narrow class of tasks that demand deep planning and control. The research also highlights that there is no "universal" fix for model failures; interventions that improve one model often have no effect on others. By providing this capability-and-cost map, the authors offer a practical guide for engineers to optimize their systems for both performance and efficiency.
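The routing principle above can be sketched as a simple tier-based dispatcher. The model names and the tier threshold here are illustrative assumptions, not the paper's implementation; a production router would likely use richer signals than a precomputed tier label.

```python
# Hedged sketch of the suggested design principle: route routine,
# short-horizon calls to a small open-weight model, and escalate only the
# narrow class of long-horizon planning tasks to an expensive frontier model.
SMALL_MODEL = "open-weight-27b"   # assumption: cheap local model for Tiers A0-D
FRONTIER_MODEL = "frontier-api"   # assumption: costly API model for Tier E

LONG_HORIZON_TIERS = {"E"}

def route(task_tier: str) -> str:
    """Pick a model based on how much sustained planning the call needs."""
    if task_tier in LONG_HORIZON_TIERS:
        return FRONTIER_MODEL
    return SMALL_MODEL

# Most calls in a real pipeline are short and structured, so the small
# model handles the broad base while only one call here escalates.
calls = ["A0", "A", "B", "B", "C", "E"]
routed = [route(t) for t in calls]
assert routed.count(SMALL_MODEL) == 5
assert routed.count(FRONTIER_MODEL) == 1
```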
