Back to AI Research

AI Research

Online Pandora's Box for Contextual LLM Cascading | AI Research

Key Takeaways

  • This paper introduces a new framework for managing "LLM cascading," a process where a system sequentially queries different AI models to balance the cost of...
  • Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs.
  • In each period, a decision-maker observes a request context and faces a two-phase decision problem.
  • In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost.
  • In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output.
Paper AbstractExpand

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.

This paper introduces a new framework for managing "LLM cascading," a process where a system sequentially queries different AI models to balance the cost of an API call against the quality of the generated output. The authors propose an online contextual "Pandora’s Box" model, which treats the decision to query an API as opening a box with an associated cost, and the resulting output as a potential reward. Unlike traditional models where opening a box immediately reveals the reward, this system separates the process into a query phase—where costs are incurred to see outputs—and a selection phase—where the best output is chosen and the actual reward is finally revealed.

The Challenge of Cost-Quality Trade-offs

In modern AI applications, choosing a single model is often inefficient. High-end models provide high-quality results but are expensive, while smaller models are cheaper but less reliable. LLM cascading allows a system to start with a low-cost model and only escalate to more expensive ones if the initial output is insufficient. The authors identify that this is not just a routing problem, but a complex sequential search problem. Because the quality of an output depends on the specific request context, the system must learn how to make these decisions dynamically over time without knowing the exact performance or cost distributions of the APIs in advance.

A New Learning Approach

Instead of trying to map out the entire probability distribution of every possible output for every API—which is computationally difficult—the authors focus on modeling "reservation indices." These indices represent the threshold at which an API’s potential output is worth the cost of querying it. By applying a parametric structure to these indices, the decision-maker can use a combination of Generalized Method of Moments (GMM) estimation and Upper Confidence Bound (UCB) strategies. This allows the system to learn which APIs to query and when to stop, while simultaneously refining its understanding of the shared reward function that evaluates the quality of the final output.

Performance and Regret

The authors provide a formal mathematical analysis of their policy, proving that it achieves a cumulative regret of $\widetilde{O}(\sqrt{T})$ over a horizon of $T$ periods. This means the system’s performance gap compared to an ideal, "all-knowing" oracle grows at a controlled, sublinear rate. The analysis holds both for scenarios where the system already has a good idea of how to evaluate rewards (the known-evaluator regime) and for more challenging scenarios where both the reward evaluator and the reservation indices must be learned from scratch.

Key Takeaways

The primary contribution of this work is the formalization of LLM cascading as an online contextual Pandora’s Box problem. By shifting the focus from estimating full distributions to learning reservation indices and shared reward functions, the authors provide a practical, theoretically grounded way to automate cost-effective AI deployment. This approach is designed to handle the reality of business requests where the system must balance the immediate cost of computation against the downstream value of the information generated.

Comments (0)

No comments yet

Be the first to share your thoughts!