This paper introduces a new framework for managing "LLM cascading," a process where a system sequentially queries different AI models to balance the cost of an API call against the quality of the generated output. The authors propose an online contextual "Pandora’s Box" model, which treats the decision to query an API as opening a box with an associated cost, and the resulting output as a potential reward. Unlike traditional models where opening a box immediately reveals the reward, this system separates the process into a query phase—where costs are incurred to see outputs—and a selection phase—where the best output is chosen and the actual reward is finally revealed.
The Challenge of Cost-Quality Trade-offs
In modern AI applications, relying on a single model is often inefficient. High-end models produce high-quality results but are expensive, while smaller models are cheaper but less reliable. LLM cascading lets a system start with a low-cost model and escalate to more expensive ones only if the initial output is insufficient. The authors argue that this is not merely a routing problem but a sequential search problem. Because the quality of an output depends on the context of the specific request, the system must learn to make these decisions dynamically over time, without knowing the performance or cost distributions of the APIs in advance.
A New Learning Approach
Instead of trying to map out the entire probability distribution of every possible output for every API—which is computationally difficult—the authors focus on modeling "reservation indices." These indices represent the threshold at which an API’s potential output is worth the cost of querying it. By applying a parametric structure to these indices, the decision-maker can use a combination of Generalized Method of Moments (GMM) estimation and Upper Confidence Bound (UCB) strategies. This allows the system to learn which APIs to query and when to stop, while simultaneously refining its understanding of the shared reward function that evaluates the quality of the final output.
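To make the idea of a reservation index concrete, here is a minimal sketch based on the classical (non-contextual) Pandora's Box rule, not the paper's parametric, learned version. It assumes the standard definition in which the index $\sigma$ of a box solves $\mathbb{E}[\max(V - \sigma, 0)] = c$ for reward $V$ and query cost $c$. The function names `reservation_index` and `run_cascade`, the use of empirical reward samples, and the `score_fn` stand-in for an estimated evaluator are all illustrative assumptions.

```python
from typing import Any, Callable, List, Sequence, Tuple


def reservation_index(samples: Sequence[float], cost: float, tol: float = 1e-9) -> float:
    """Solve mean(max(v - sigma, 0)) = cost for sigma by bisection.

    `samples` are hypothetical past reward scores for one API; `cost` must be > 0.
    """
    def expected_gain(sigma: float) -> float:
        return sum(max(v - sigma, 0.0) for v in samples) / len(samples)

    lo = min(samples) - cost  # expected gain here is at least `cost`
    hi = max(samples)         # expected gain here is 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(mid) > cost:
            lo = mid          # index is larger: the gain still exceeds the cost
        else:
            hi = mid
    return (lo + hi) / 2


def run_cascade(
    indices: Sequence[float],
    query_fns: Sequence[Callable[[], Any]],
    score_fn: Callable[[Any], float],
) -> Tuple[Any, List[int]]:
    """Query APIs in descending index order; stop once the best score so far
    is at least the highest remaining index. Returns (best output, queried ids)."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    best_score, best_output, queried = float("-inf"), None, []
    for i in order:
        if best_score >= indices[i]:
            break             # no remaining API is worth its query cost
        output = query_fns[i]()
        queried.append(i)
        score = score_fn(output)  # estimated quality, not the realized reward
        if score > best_score:
            best_score, best_output = score, output
    return best_output, queried
```

In this sketch the stopping rule uses estimated scores, since in the setting described above the actual reward is revealed only after an output is selected. The paper's policy additionally learns the indices and the evaluator online, which this example does not attempt.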
Performance and Regret
The authors provide a formal mathematical analysis of their policy, proving that it achieves a cumulative regret of $\widetilde{O}(\sqrt{T})$ over a horizon of $T$ periods. This means the system’s performance gap compared to an ideal, "all-knowing" oracle grows at a controlled, sublinear rate. The analysis holds both for scenarios where the system already has a good idea of how to evaluate rewards (the known-evaluator regime) and for more challenging scenarios where both the reward evaluator and the reservation indices must be learned from scratch.
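As a rough illustration of what this bound implies, regret can be written in a standard form. The notation here is an assumption for exposition and may differ from the paper's: $\pi$ is the learned policy, $\pi^\ast$ the oracle, $R_t$ the reward of the selected output in period $t$, $S_t$ the set of APIs queried, and $c_{t,i}$ the cost of querying API $i$.

$$
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Big( \mathbb{E}\big[U_t(\pi^\ast)\big] - \mathbb{E}\big[U_t(\pi)\big] \Big), \qquad U_t(\pi) \;=\; R_t(\pi) - \sum_{i \in S_t(\pi)} c_{t,i}.
$$

Dividing the $\widetilde{O}(\sqrt{T})$ bound by the horizon gives the average per-period gap:

$$
\frac{\mathrm{Regret}(T)}{T} \;=\; \widetilde{O}\!\left(\frac{1}{\sqrt{T}}\right) \;\longrightarrow\; 0 \quad \text{as } T \to \infty.
$$

In other words, under this reading, the policy's average net payoff per request approaches that of the oracle as more requests are served.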
Key Takeaways
The primary contribution of this work is the formalization of LLM cascading as an online contextual Pandora’s Box problem. By shifting the focus from estimating full distributions to learning reservation indices and shared reward functions, the authors provide a practical, theoretically grounded way to automate cost-effective AI deployment. This approach is designed to handle the reality of business requests where the system must balance the immediate cost of computation against the downstream value of the information generated.