AI Research

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Paper Abstract

Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

Large language models (LLMs) are increasingly used to interact with external tools, but they often struggle to balance the need for deep reasoning with the requirement for precise, structured execution. The paper "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use" introduces CAST, a framework that treats past tool-use experiences as "cases" to help models learn when to think deeply and how to avoid structural errors. By analyzing historical successes and failures, CAST enables models to autonomously adjust their reasoning effort and improve the accuracy of their tool invocations.

Learning from Past Execution

Rather than treating every task with the same level of effort, CAST organizes historical execution data into structured cases. Each case includes the original query, the reasoning steps taken, the tool call made, and the final outcome. From this data, the framework extracts two key signals: a "complexity profile" that estimates how much reasoning is necessary for a specific task, and a "failure profile" that identifies common structural pitfalls, such as incorrect function names or parameter mismatches.
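The case structure and the two profiles described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Case` fields, the error labels, the schema format (tool name mapped to a list of required parameter names), and the use of mean reasoning length as a complexity estimate are all assumptions made here for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Case:
    """One historical execution trajectory (field names are hypothetical)."""
    query: str
    reasoning: str        # reasoning text produced before the call
    tool_name: str        # function the model tried to invoke
    arguments: dict       # arguments the model supplied
    success: bool         # final outcome of the execution


def classify_failure(case: Case, schemas: Dict[str, List[str]]) -> Optional[str]:
    """Return a coarse structural error label, or None if the call is well-formed.

    Assumption: every parameter listed in the schema is required.
    """
    if case.tool_name not in schemas:
        return "wrong_function_name"
    required = set(schemas[case.tool_name])
    supplied = set(case.arguments)
    if required - supplied:
        return "missing_parameter"
    if supplied - required:
        return "unexpected_parameter"
    return None


def failure_profile(cases: List[Case], schemas: Dict[str, List[str]]) -> Counter:
    """Count how often each structural error type occurs across the cases."""
    labels = (classify_failure(c, schemas) for c in cases)
    return Counter(label for label in labels if label is not None)


def complexity_profile(cases: List[Case]) -> float:
    """Assumed heuristic: mean reasoning length (whitespace tokens) of successful cases."""
    lengths = [len(c.reasoning.split()) for c in cases if c.success]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Toy data, invented for illustration only.
SCHEMAS = {"get_weather": ["city", "unit"]}
CASES = [
    Case("Weather in Oslo?", "Need city and unit.", "get_weather",
         {"city": "Oslo", "unit": "C"}, True),
    Case("Weather in Rome?", "Call weather tool.", "fetch_weather",
         {"city": "Rome"}, False),
    Case("Weather in Lima?", "Use get_weather with the city only.", "get_weather",
         {"city": "Lima"}, False),
]

if __name__ == "__main__":
    print(failure_profile(CASES, SCHEMAS))
    print(complexity_profile(CASES))
```

On the toy data, the failure profile records one wrong function name and one missing parameter, and the complexity estimate is 4.0 tokens, taken from the single successful case. A real system would presumably group cases by task type before computing either profile.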

Adaptive Reasoning and Optimization

CAST uses these profiles to guide the model during reinforcement learning. For simpler tasks, the model is encouraged to be concise, reducing unnecessary deliberation. For more complex tasks, it is incentivized to maintain a longer reasoning process to ensure constraints are met and arguments are normalized. Simultaneously, the failure profile provides granular feedback on the structure of the tool calls. This dual approach allows the model to learn a more efficient and reliable policy, effectively internalizing the lessons from past experiences to perform better on new, unseen tasks.
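One way to picture how such profiles could feed a reward signal is the sketch below. It is an assumption-laden illustration rather than CAST's actual reward: the function names, the linear length penalty, the three-way structural score, and the 0.8/0.2 weighting are all hypothetical choices made for this example.

```python
def length_reward(reasoning_tokens: int, target_tokens: float,
                  tolerance: float = 0.5) -> float:
    """1.0 at the target length, falling linearly to 0.0 once the relative
    deviation from the target reaches `tolerance` (an assumed shape)."""
    if target_tokens <= 0:
        return 0.0
    deviation = abs(reasoning_tokens - target_tokens) / target_tokens
    return max(0.0, 1.0 - deviation / tolerance)


def structure_reward(tool_name: str, arguments: dict,
                     expected_name: str, expected_args: dict) -> float:
    """Average of three partial scores: function name, parameter keys, values."""
    name_score = 1.0 if tool_name == expected_name else 0.0

    got_keys, want_keys = set(arguments), set(expected_args)
    union = got_keys | want_keys
    # Jaccard overlap of parameter names; two empty sets count as a match.
    key_score = len(got_keys & want_keys) / len(union) if union else 1.0

    if expected_args:
        matches = sum(1 for k, v in expected_args.items()
                      if k in arguments and arguments[k] == v)
        value_score = matches / len(expected_args)
    else:
        value_score = 1.0

    return (name_score + key_score + value_score) / 3.0


def total_reward(reasoning_tokens: int, target_tokens: float,
                 tool_name: str, arguments: dict,
                 expected_name: str, expected_args: dict,
                 w_structure: float = 0.8, w_length: float = 0.2) -> float:
    """Weighted sum of structural and length terms (weights are assumptions)."""
    s = structure_reward(tool_name, arguments, expected_name, expected_args)
    l = length_reward(reasoning_tokens, target_tokens)
    return w_structure * s + w_length * l


if __name__ == "__main__":
    expected = {"city": "Oslo", "unit": "C"}
    # Correct call, reasoning length on target.
    print(total_reward(10, 10.0, "get_weather", dict(expected),
                       "get_weather", expected))
    # Missing "unit" and reasoning twice as long as the target.
    print(total_reward(20, 10.0, "get_weather", {"city": "Oslo"},
                       "get_weather", expected))
```

Here `target_tokens` stands in for the complexity profile's estimate of how much reasoning a task needs, so a simple task penalizes long traces while a complex task penalizes short ones. The structural term gives partial credit so that a call with the right function name but a missing argument scores above a call to the wrong function.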

Performance Gains

Experiments conducted on the BFCLv2 and ToolBench benchmarks demonstrate that the CAST framework significantly improves tool-use performance. The model achieved up to a 5.85 percentage point increase in overall execution accuracy while reducing the average length of reasoning traces by 26%. These results suggest that by shifting from a "one-size-fits-all" approach to a case-based, adaptive strategy, LLMs can become more efficient and less prone to high-impact structural errors when interacting with external tools.
