AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
Production agentic systems often rely on massive, expensive "frontier" models for every step of a workflow, even for simple tasks like looking up a record or extracting data. This paper investigates whether this "frontier-first" approach is actually necessary, or whether smaller, more efficient open-weight models can handle the majority of these routine actions. The authors introduce AgentFloor, a benchmark designed to map out exactly where smaller models succeed and where they fall short compared to flagship models like GPT-5.
A Six-Tier Capability Ladder
To test model performance, the researchers created a deterministic 30-task benchmark organized into six tiers of increasing cognitive difficulty. These tiers range from basic instruction following (Tier A0) and single-tool use (Tier A) to more complex requirements like sequential tool chaining (Tier B), conditional branching (Tier C), multi-source synthesis (Tier D), and finally, long-horizon planning under persistent constraints (Tier E). By using an abstract, in-memory database rather than live web APIs, the researchers ensured the benchmark was stable, repeatable, and free from the noise or contamination often found in real-world environments.
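The tiered structure and the deterministic, in-memory setup described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual harness: the tier labels mirror the A0–E ladder, but the `Task` record, the `lookup` tool, and the exact-match scoring are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical labels mirroring the paper's A0..E difficulty ladder.
class Tier(Enum):
    A0 = "instruction_following"
    A = "single_tool_use"
    B = "sequential_chaining"
    C = "conditional_branching"
    D = "multi_source_synthesis"
    E = "long_horizon_planning"

@dataclass(frozen=True)
class Task:
    tier: Tier
    prompt: str
    expected: str  # a deterministic gold answer, checkable by exact match

# An in-memory "database" stands in for live web APIs, so every run is
# stable and repeatable, with no real-world noise or contamination.
DB = {"rec-17": {"name": "Ada", "balance": 42}}

def lookup(record_id: str) -> dict:
    """A single deterministic tool call (a Tier A-style action)."""
    return DB[record_id]

# A Tier A task: one tool call, scored by exact match against the gold answer.
task = Task(Tier.A, "Look up rec-17 and report the name.", "Ada")
result = lookup("rec-17")["name"]
print(result == task.expected)  # deterministic scoring: True or False, never a judgment call
```

Because both the tool's behavior and the gold answer are fixed, a model's score on such a task is fully reproducible across runs.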
The Performance Boundary
The study evaluated 16 open-weight models, ranging from 0.27B to 32B parameters, against GPT-5. The results reveal a clear, stratified boundary of model necessity. Small and mid-sized open-weight models are highly capable of handling the short-horizon, structured tool use that makes up the bulk of most agentic pipelines. In fact, the strongest open-weight model tested, gemma4:26b, matched the performance of GPT-5 across the benchmark while being significantly faster and cheaper to operate. However, a clear gap emerges at the highest tier (Tier E). When tasks require sustained coordination and reliable constraint tracking over many steps, frontier models still hold an advantage, though the authors note that even these larger models struggle to reach high levels of reliability in these complex scenarios.
Practical Implications for System Design
The findings suggest a new design principle for building agentic systems: developers should route routine, short-horizon tasks to smaller, cost-effective open-weight models, reserving expensive frontier models only for the narrow class of tasks that demand deep planning and control. The research also highlights that there is no "universal" fix for model failures; interventions that improve one model often have no effect on others. By providing this capability-and-cost map, the authors offer a practical guide for engineers to optimize their systems for both performance and efficiency.
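The routing principle above can be expressed as a small dispatcher: short-horizon tiers go to a cheap open-weight model, and only long-horizon planning escalates to a frontier model. This is a sketch under stated assumptions; the model names, the tier set, and the `call_model` stub are illustrative, not part of the paper.

```python
# Hypothetical model identifiers; substitute your own local and hosted models.
SMALL_MODEL = "open-weight-local"
FRONTIER_MODEL = "frontier-api"

# Only the long-horizon planning tier (Tier E in the paper's ladder)
# escalates to the expensive frontier model.
LONG_HORIZON_TIERS = {"E"}

def pick_model(tier: str) -> str:
    """Route a task to a model based on its difficulty tier."""
    return FRONTIER_MODEL if tier in LONG_HORIZON_TIERS else SMALL_MODEL

def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real inference call (local server or hosted API).
    return f"[{model}] {prompt}"

# Routine lookups stay on the small model; deep planning escalates.
for tier, prompt in [("A", "look up record rec-17"),
                     ("E", "plan a 20-step workflow under constraints")]:
    print(tier, "->", pick_model(tier))
```

Note that per the finding about non-universal fixes, any prompt or scaffolding interventions layered on top of this router would need to be validated per model rather than assumed to transfer.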