Natural Language Query to Configuration for Retriev... | AI Research

Key Takeaways

  • Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost.
  • Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped.
  • We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time.
  • At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining.
Paper Abstract

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.

What it covers

Natural Language Query to Configuration for Retrieval Agents

Melissa Z. Pan (1), Negar Arabzadeh (1), Mathew Jacob (2), Fiodar Kazhamiaka (3), Esha Choukse (3), Matei Zaharia (1)
(1) UC Berkeley, (2) University of Washington, (3) Microsoft Azure Research - Systems

Abstract

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate Query2Conf: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost (or maximizes accuracy) at inference time. We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.

1 Introduction

Modern AI pipelines have grown complex, combining extensive information retrieval with LLM calls [21, 30, 3, 24]. Customer-support agents look up product documentation [23, 21, 26], research assistants pull from scientific literature [3, 31], and commercial offerings such as Perplexity, ChatGPT, Gemini, and Claude retrieve from web corpora before answering [12, 20, 11, 22].
Even reasoning-intensive tasks now ground multi-step inference in external evidence [2]. A pipeline exposes a large configuration space, including but not limited to a retriever, a synthesis workflow, and an LLM. Each knob shapes both end-to-end dollar cost and answer quality, so configuring these pipelines well unlocks not only large gains in answer quality (accuracy) but also substantial headroom for cost optimization.

Although LLM routers [29, 27, 5, 19] have recently been used to choose the right LLM based on the semantic information in a query, using that semantic information to select a configuration across the full pipeline remains unexplored. Systematically configuring such a pipeline is, however, non-trivial: the search space is orders of magnitude larger than in LLM routing, and the pipeline knobs interact combinatorially and sometimes non-monotonically. For example, a richer synthesis strategy with extra summarization steps may enhance downstream generation, but adding documents can push the LLM out of distribution, and a stronger model may not compensate for a weaker retriever. A simple query like “Provide the name of the director for the show that…?” can be served by many pipelines at vastly different accuracy and cost (Figure 1 and Figure 3). As a result, modern AI systems are still built by hand-tuning a pipeline against a few benchmarks and shipping a fixed system for a given workload [21].

Figure 1: Cost-quality design space of knowledge-search pipelines on BrowseComp-Plus across 60 profiled configurations. Circles denote static pipelines, colored by synthesis method (LLM-only, per-chunk summary, agent loop). Each configuration also varies the LLM, retriever, and number of retrieved documents; Appendix A.1 lists the full space. Squares mark prior-work baselines. Pipeline cost spans roughly three orders of magnitude on a single workload.
BRANE’s per-query Pareto trace (yellow diamonds; star marks the headline operating point) dominates the static Pareto frontier (black dashed line) across the full cost-quality range (green region): BRANE exceeds the most accurate static pipeline by 0.7% in accuracy at 81.7% lower cost.

To exploit this cost-optimization headroom, we formulate a learning problem, Query2Conf: given a natural-language query and an accuracy target, dynamically select the pipeline that minimizes cost from a predefined design space. Three empirical observations (\(\mathcal{O}\)) motivate Query2Conf.

\(\mathcal{O}1\): Per-query variance is large. A workload is a set of queries on the same corpus. We observe that different queries within a single workload have different optimal configurations (§ 3.1), and no single configuration is cost-quality optimal across an entire workload. Given the large cost-quality trade-off space in configuring the full pipeline (Figure 1), this opens an opportunity to configure the pipeline per query at inference time.

\(\mathcal{O}2\): The LLM is one knob among many. We demonstrate on FinanceBench [10] that varying the synthesis method and retriever spans a larger cost-quality range than varying only the LLM (§ 3.2). Choosing only the LLM leaves much of the optimization opportunity untapped.

\(\mathcal{O}3\): Useful query characteristics differ across workloads. Queries within a workload share common structure, but the structure itself differs across workloads. Prior per-query work uses a small, fixed set of general query characteristics (e.g., need_joint_reasoning) to pick a synthesis strategy [25]. These general properties collapse most queries within a workload onto the same values, mapping the bulk of the workload to a single configuration. The distinguishing signal lies in finer-grained, workload-specific predicates (§ 3.3).

Predicting how well a pipeline configuration works on a natural-language query is, however, challenging.
First, modeling the combinatorial effect of these knobs is hard: the joint accuracy surface across the LLM, retriever, depth, and synthesis strategy has no closed form, can be non-monotonic, and spans a space too large to profile exhaustively. Second, raw natural-language queries are noisy: most of their lexical, stylistic, and semantic content is irrelevant to the configuration choice, and the relevant signals have no direct mapping to the configuration space. Third, collecting profiling data on such a large space is costly in both time and dollars; for example, profiling just 600 queries against 60 configurations on a single benchmark for Figure 1 cost about $11,000 USD, and in practice the search space can be even larger.

Figure 2: BRANE framework. BRANE selects a pipeline configuration per query to minimize cost at a target accuracy. (1) Configuration profiling: we run each query against every candidate configuration on the workload, recording correctness and dollar cost. (2) Predictor training: a frontier LLM proposes workload-specific binary characteristics (e.g., requires multi-hop reasoning, involves people) from a small sample of queries; a cheaper LLM then labels all profiled queries against these characteristics. We train one lightweight predictor per Pareto configuration to estimate the probability that it answers a query correctly given its characteristics. Pareto configurations are an order of magnitude fewer than the full set. (3) Inference (reconfigurable): given an accuracy or budget target, BRANE characterizes the incoming query once, scores every Pareto configuration, and selects the one that best trades predicted accuracy against cost. The engineer retargets at any time without retraining. See § 4.2 for the selection rule.

To address these challenges, we present BRANE: Building Retrieval Agents via Natural language Expressions. BRANE consists of two key ideas.
First, an LLM extracts workload-specific characteristics from each query (\(\mathcal{F}(q)\)); these characteristics act as a translation layer between query semantics and the pipeline space. Query characteristics can range from general properties such as requires_multihop to workload-specific ones like involves_regional_cuisine. Second, one lightweight predictor \(\hat{p}_{\mathsf{c}}\) per configuration estimates \(\mathbb{P}(y_{\mathsf{c}}(q)=1 \mid \mathcal{F}(q))\), the probability that pipeline \(\mathsf{c}\) answers query \(q\) correctly given its characterization. At inference time, BRANE picks the pipeline that maximizes \(\hat{p}_{\mathsf{c}}(\mathcal{F}(q)) - \lambda \cdot \overline{\text{cost}}(\mathsf{c})\) (formalized as Equation 3 in § 4.2), where the user tunes \(\lambda \geq 0\) to land anywhere on the cost-quality Pareto frontier. Training labels come from a single offline profiling pass: a few hundred queries, each executed against every configuration.

BRANE matches the accuracy of the best fixed pipeline at up to 89% lower cost on MuSiQue, and pushes the cost-quality Pareto frontier across all three benchmarks (MuSiQue [28], BrowseComp-Plus [7], and FinanceBench [10]) beyond the best fixed pipeline [4], the state-of-the-art LLM router Carrot [27], the rule-based query router METIS [25], the T5-scale router Adaptive-RAG [15], and fine-tuned end-to-end BERT and Qwen3-4B models. Ablations show that LLM-proposed binary characteristics beat embeddings and that the framework is robust to the choice of characterizer LLM and the size of the characteristic set.

We make the following contributions:

1. We formulate Query2Conf: given a natural-language query and an accuracy target, select the pipeline from a predefined design space that minimizes cost.
   Three empirical observations motivate this formulation: per-query variance, full-pipeline impact, and workload-specific signal.
2. We show that LLM-proposed workload-specific binary characteristics outperform generic embeddings as the predictor input on every benchmark, and serve as an effective transformation from natural-language queries to pipeline configurations.
3. We build BRANE, a framework that solves Query2Conf end-to-end via offline profiling, per-configuration predictor training, and Lagrangian routing at inference time. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE matches the best static pipeline’s accuracy at up to 89% lower cost, beats fine-tuned end-to-end BERT and Qwen3-4B baselines, and pushes the cost-quality Pareto frontier.

We will open-source BRANE and release all 526 profiling traces (150–600 queries each).

2 Related Work

We group prior work into three approaches to configuring LLM-based knowledge-search systems. § 3 shows what each leaves unaddressed and motivates Query2Conf.

LLM routing. A well-studied approach picks which LLM to call per query [18, 13]. FrugalGPT [5] cascades models from cheap to expensive. Carrot [27] and R2-Router [29] predict per-model cost and accuracy and select via a weighted objective. RouteLLM [19] learns a win-probability model from preference data, and vLLM Semantic Router [17] composes Boolean rules over heuristic and neural signals. These methods cast configuration as model selection over a small set (typically 2 to 15 LLMs) and treat the rest of the pipeline as a black box. This narrow scope lets a single router generalize across workloads, but training still requires \(10^{4}\) to \(10^{5}\) labeled queries. We extend this line to the full pipeline, whose configuration space is orders of magnitude larger than model choice alone (§ 3.2).

Workload-level system optimization. A second line of work picks one configuration for an entire workload, at different layers of the stack.
Murakkab [4] solves a Mixed-Integer Linear Program across the serving stack. Syftr [8] runs multi-objective Bayesian optimization over the application pipeline. HedraRAG [14] accelerates serving via graph transformations and dynamic scheduling on a RAGraph abstraction. All three apply one configuration to every query request, ignoring per-query variance. BRANE instead selects per query and matches the accuracy of the best static configuration at up to 89% lower cost.

Per-query pipeline with static rules. The closest line of work configures the pipeline per query, but over a smaller space than Query2Conf. METIS [25] prunes a RAG configuration space with one hand-coded rule over four LLM-generated query labels (complexity, joint-reasoning need, information pieces, and summary length). A resource-aware scheduler then picks the final configuration. Adaptive-RAG [15] trains a T5-scale classifier to choose among three retrieval strategies (no-retrieval, single-step, multi-step) over a fixed downstream LLM. Both demonstrate that per-query configuration beyond the LLM is feasible. However, both operate over a much smaller design space, and neither reduces cost while matching the best static accuracy, whereas BRANE does (§ 5).

3 Query2Conf: Motivations and Formulation

We make three observations on standard knowledge-search pipelines that motivate Query2Conf, each surfacing an opportunity to adapt pipeline configuration per query. The configuration space and cost model are in Appendix A.1.

| Query | LLM Only (accuracy, cost) | Per-chunk Summary (accuracy, cost) | Iterative Retrieval (accuracy, cost) |
| Q1 (id 486) | 100%, $0.0002 | 80%, $0.1033 | 100%, $0.1332 |
| Q2 (id 1208) | 0%, $0.0001 | 100%, $0.0497 | 10%, $0.7941 |
| Q3 (id 694) | 0%, $0.0001 | 0%, $0.0514 | 100%, $0.1595 |

Figure 3: Three observations from knowledge-search pipelines.
(a) Per-query variance on BrowseComp-Plus: three queries × three synthesis strategies (LLM only, per-chunk summary, iterative retrieval), each averaged over 10 runs with GPT-5-mini. Cells show accuracy/cost (↑ better); row-wise best in green. The best strategy varies by query (results on more queries in Appendix D). (b) Pipeline vs. LLM on FinanceBench (\(N=200\)): scaling the LLM (blue: GPT-5-nano → mini → 5.4) covers a narrow band, while sweeping pipeline knobs at fixed GPT-5-nano (orange) covers a roughly 10× wider cost range and a roughly 2× wider accuracy span. (c) General query characteristics on BrowseComp-Plus collapse all queries onto the same values (first two bars), so all 600 sampled queries map to the same configuration (third bar). Each bar reports the fraction of queries with the matching value.

3.1 Observation 1: Query-Level Variance Within a Workload

A workload is a set of queries over a common corpus. The corpus is fixed; the queries are not. Queries within a workload differ in structure: factual lookups, multi-hop comparisons, and disambiguation queries can co-occur, and the best-serving configuration depends on the query’s structure. Iterative retrieval is overkill for a factual lookup; an LLM-only call leaves a multi-hop comparison underserved. Fixing one configuration for the whole workload (§ 2) compromises either accuracy or cost.

As an example, we take three queries from BrowseComp-Plus [7] and run each through three synthesis strategies: 1) no retrieval (LLM only); 2) retrieve documents and feed them in as-is; and 3) retrieve documents and feed them in as summarized chunks. Each query’s performance is averaged over 10 runs with GPT-5-mini fixed. Figure 3 shows the per-pair accuracy/cost. We make two observations. First, queries within a workload land at very different cost-quality points; no single strategy is row-best across all three, so the optimal pipeline differs query to query.
Second, the cheapest correct configuration varies by orders of magnitude across queries, so even within one query, the right pipeline depends on the available budget, and a single workload-level choice will be too expensive for some queries and too weak for others (per-cell breakdown and full query text in Appendix D). The right per-query configuration is not visible from the surface form: Q1, Q2, and Q3 read similarly yet land on different row-best strategies. Per-query configuration is therefore a learning problem.

3.2 Observation 2: Configuring the Full Pipeline Matters, Not Just the LLM

The pipeline has many knobs beyond the LLM (Appendix A.1), and they interact with the query and the corpus. A policy that picks only the LLM (§ 2) cannot reach most of this design space.

As an example, we sweep the design space on FinanceBench [10] and isolate two illustrative cases (Figure 3). The full-pipeline sweep fixes the LLM to GPT-5-nano and varies the retriever, retrieval depth \(k\), and synthesis strategy; it spans 38% accuracy and 93× cost. The LLM-only sweep fixes the pipeline to direct generation (no retrieval) and varies the LLM across GPT-5-nano, GPT-5-mini, and GPT-5.4; it spans only 17% accuracy and 8× cost. The larger axis of cost-quality variation sits outside the LLM.

Configuring the full pipeline is nontrivial. Cost scales monotonically with retrieval depth \(k\) and with denser retrievers; accuracy does not. Irrelevant documents push the generator out of distribution [9], so accuracy can fall as \(k\) grows [1]. The number of relevant documents a retriever surfaces at a given \(k\) depends on the query and the corpus jointly, and per-chunk summarization helps for long, redundant documents but hurts for short, dense ones. These interactions are workload-specific, which we demonstrate in \(\mathcal{O}3\).
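As a worked illustration of Observation 1 (§ 3.1), the sketch below uses only the nine accuracy/cost cells reported in Figure 3(a) and compares serving all three queries with one fixed synthesis strategy against a hindsight per-query choice. The "oracle" here is an illustrative assumption, not part of BRANE: it picks strategies after seeing the outcomes, so it only bounds what a learned policy could achieve on these three queries. The function and key names are hypothetical.

```python
# Accuracy (fraction of 10 runs) and dollar cost per (query, strategy),
# copied from the Figure 3(a) table.
FIG3A = {
    "Q1": {"llm_only": (1.00, 0.0002),
           "per_chunk_summary": (0.80, 0.1033),
           "iterative_retrieval": (1.00, 0.1332)},
    "Q2": {"llm_only": (0.00, 0.0001),
           "per_chunk_summary": (1.00, 0.0497),
           "iterative_retrieval": (0.10, 0.7941)},
    "Q3": {"llm_only": (0.00, 0.0001),
           "per_chunk_summary": (0.00, 0.0514),
           "iterative_retrieval": (1.00, 0.1595)},
}


def fixed_strategy(table, strategy):
    """Mean accuracy and total cost when one strategy serves every query."""
    cells = [row[strategy] for row in table.values()]
    mean_acc = sum(acc for acc, _ in cells) / len(cells)
    total_cost = sum(cost for _, cost in cells)
    return mean_acc, total_cost


def oracle_per_query(table):
    """Hindsight choice: per query, highest accuracy, ties broken by lower cost."""
    picks = {
        q: max(row, key=lambda s: (row[s][0], -row[s][1]))
        for q, row in table.items()
    }
    cells = [table[q][s] for q, s in picks.items()]
    mean_acc = sum(acc for acc, _ in cells) / len(cells)
    total_cost = sum(cost for _, cost in cells)
    return picks, mean_acc, total_cost


if __name__ == "__main__":
    for name in ("llm_only", "per_chunk_summary", "iterative_retrieval"):
        acc, cost = fixed_strategy(FIG3A, name)
        print(f"fixed {name}: accuracy={acc:.2f}, total cost=${cost:.4f}")
    picks, acc, cost = oracle_per_query(FIG3A)
    print(f"per-query oracle {picks}: accuracy={acc:.2f}, total cost=${cost:.4f}")
```

On these three queries, the most accurate fixed strategy (iterative retrieval) averages 70% accuracy for a total of about $1.09, while the hindsight per-query choice reaches 100% for about $0.21. Three queries are far too few to generalize from; the point is only that the gap a per-query policy targets is visible even in this small table.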
3.3 Observation 3: The Salient Query Characteristics Differ Across Workloads

While individual queries within a workload differ (§ 3.1), the workload as a whole exhibits shared structure that can guide system configuration. Structured financial filings (FinanceBench) reward different signals than open-domain Wikipedia questions (MuSiQue), so the structure is workload-specific. This within-workload regularity makes learning a configuration policy from each workload’s data viable.

Identifying which characteristics matter is itself a learning problem. Prior per-query work [25] hand-codes dispatch rules over a small fixed set of query characteristics (reasoning need, joint-reasoning need). While this is a useful start, the characteristics are too general and can collapse all queries in a benchmark into the same cluster, as shown in Figure 3. Thus, query characteristics as the configuration signal must be learned from each workload’s own data.

3.4 Problem Formulation

We now formalize Query2Conf as the learning problem implied by \(\mathcal{O}1\) through \(\mathcal{O}3\).

Setup. A workload \(\mathcal{W}\) is a distribution over natural-language queries on a fixed corpus; \(q \sim \mathcal{W}\) denotes a query sampled from the workload. A configuration \(\mathsf{c} \in \mathcal{C}\) is a complete pipeline that takes a query and produces an answer, specifying every knob the developer exposes: a choice of LLM, retriever, number of documents \(k\), number of hops, synthesis strategy, and so on. The set \(\mathcal{C}\) of configurations is finite; Appendix A.1 lists the configuration space we use. Running \(\mathsf{c}\) on \(q\) produces a binary correctness outcome \(y_{\mathsf{c}}(q) \in \{0,1\}\) (1 if the answer is correct, 0 otherwise) and a non-negative dollar cost \(\text{cost}(q,\mathsf{c}) \in \mathbb{R}_{\geq 0}\).

Problem.
Given an accuracy target \(A \in [0,1]\), Query2Conf is the problem of learning a policy \(\pi\) that maps each query \(q\) to a configuration \(\pi(q) \in \mathcal{C}\) so as to minimize expected dollar cost on \(\mathcal{W}\) while meeting the accuracy target:

\[
\min_{\pi}\; \mathbb{E}_{q\sim\mathcal{W}}\!\left[\text{cost}(q,\pi(q))\right]
\quad\text{subject to}\quad
\mathbb{E}_{q\sim\mathcal{W}}\!\left[y_{\pi(q)}(q)\right] \geq A. \tag{1}
\]

The same framework also solves the symmetric variant: given a cost budget \(B\), maximize expected correctness. We solve both by Lagrangian relaxation (§ 4.2, Equation 3).

4 BRANE: Methodology

4.1 Methodology

Per-configuration correctness as the prediction target. A direct approach to Equation 1 would model how the configuration knobs interact and search the joint space. We instead score each configuration end-to-end: for every \(\mathsf{c} \in \mathcal{C}\) we train one lightweight predictor

\[
\hat{p}_{\mathsf{c}}(q) \;\approx\; \mathbb{P}\bigl(y_{\mathsf{c}}(q)=1\bigr),
\]

the estimated probability that \(\mathsf{c}\) on \(q\) produces the correct answer. Combinatorial knob interactions are absorbed into the profiled correctness signal, so the policy reduces to an argmax over \(\mathcal{C}\).

Workload-specific characterization as the predictor input. A raw query is a noisy input for a small classifier. We first transform each query into a workload-specific binary feature vector \(\mathcal{F}(q)\) (top-right of Figure 2; intuition and construction in § 4.3):

\[
\mathcal{F}(q) \;=\; \bigl(\mathcal{F}_{1}(q),\dots,\mathcal{F}_{d}(q)\bigr) \;\in\; \{0,1\}^{d}, \tag{2}
\]

where each component \(\mathcal{F}_{j}\) is a yes-or-no answer to a workload-specific question, computed by an LLM from the query text alone (e.g., requires_multi_hop, involves_regional_cuisine).
The predictor takes this vector as input: \(\hat{p}_{\mathsf{c}} : \{0,1\}^{d} \to [0,1]\), and \(\hat{p}_{\mathsf{c}}(\mathcal{F}(q))\) approximates \(\mathbb{P}(y_{\mathsf{c}}(q)=1 \mid \mathcal{F}(q))\).

Lagrangian routing. BRANE routes each query to the configuration that maximizes a per-query Lagrangian score:

\[
\pi_{\lambda}(q) \;=\; \arg\max_{\mathsf{c}\,\in\,\mathcal{C}}\; \hat{p}_{\mathsf{c}}\!\bigl(\mathcal{F}(q)\bigr) \;-\; \lambda\cdot\text{cost}(q,\mathsf{c}), \tag{3}
\]

where \(\lambda \geq 0\) trades off accuracy against cost: small \(\lambda\) favors the most accurate configuration, large \(\lambda\) favors the cheapest. Sweeping \(\lambda\) on a log scale traces the cost-quality Pareto frontier; we map a user’s accuracy target \(A\) (or cost budget \(B\)) to an operating \(\lambda\) via offline calibration (§ 4.2, Stage 3). Equation 3 is the per-query optimum of the Lagrangian relaxation of Equation 1; because the policy can pick a different configuration for each query independently, the relaxation decomposes pointwise and the per-query argmax is exact.

4.2 BRANE Framework Overview

BRANE has three stages, illustrated in Figure 2: configuration profiling, predictor training, and inference-time selection.

Stage 1: Configuration profiling (offline, one-time). We sample \(N\) queries from the workload and run each query through every configuration \(\mathsf{c} \in \mathcal{C}\), recording the pair \((y_{\mathsf{c}}(q),\, \text{cost}(q,\mathsf{c}))\). This yields an \(N \times |\mathcal{C}|\) correctness matrix and an \(N \times |\mathcal{C}|\) cost matrix. Profiling runs once per workload and is amortized across all subsequent inference; we report it separately from BRANE’s runtime cost. The sample size \(N\) varies by workload (§ 5.1).

Stage 2: Predictor training (offline, one-time).
For each configuration \(\mathsf{c}\) that survives Pareto pruning (§ 4.4), we train one lightweight binary classifier \(\hat{p}_{\mathsf{c}}\) on the profiled pairs \(\{(\mathcal{F}(q_{i}),\, y_{\mathsf{c}}(q_{i}))\}_{i=1}^{N}\), holding out a disjoint set of queries for evaluation. Each classifier learns how query characteristics map to correctness for one fixed pipeline; adding a new configuration trains exactly one new predictor without touching the rest. Per configuration, we run automated model selection independently over a small family of tabular classifiers (logistic regression, decision tree, random forest, gradient boosting, XGBoost [6], and LightGBM [16]), chosen for fast training and a low memory footprint, and we keep the model with the best inner cross-validated negative log-loss. After training, we sweep \(\lambda\) on a log scale over the profiling sample to trace the cost-quality Pareto frontier under Equation 3 and to calibrate the mapping from a user’s accuracy target \(A\) (or cost budget \(B\)) to its operating \(\lambda\).

Stage 3: Inference-time selection. At inference, BRANE takes a query \(q\), computes \(\mathcal{F}(q)\) with a single LLM call that scores all \(d\) characteristics, evaluates \(\hat{p}_{\mathsf{c}}(\mathcal{F}(q))\) for every \(\mathsf{c}\) that survived Pareto pruning, and returns \(\pi_{\lambda}(q)\) given by Equation 3. Per-query cost is unknown ahead of execution, so we substitute the profiling-sample mean \(\overline{\text{cost}}(\mathsf{c}) = \tfrac{1}{N}\sum_{i=1}^{N}\text{cost}(q_{i},\mathsf{c})\) for \(\text{cost}(q,\mathsf{c})\). The cost savings reported in § 5 already include the characterization LLM call.
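The Stage 3 selection rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the configuration names, the two-component feature vector, the hand-written predictor functions (stand-ins for the trained classifiers \(\hat{p}_{\mathsf{c}}\)), and all cost numbers are invented for the example. Only the scoring rule itself, predicted correctness minus \(\lambda\) times the profiling-sample mean cost, comes from Equation 3 and the Stage 3 description.

```python
from typing import Callable, Dict, Sequence

# A predictor maps the binary vector F(q) to an estimated P(correct).
Predictor = Callable[[Sequence[int]], float]


def mean_costs(cost_rows: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Profiling-sample mean cost per configuration (stand-in for cost(q, c))."""
    return {c: sum(costs) / len(costs) for c, costs in cost_rows.items()}


def select_config(features: Sequence[int],
                  predictors: Dict[str, Predictor],
                  avg_cost: Dict[str, float],
                  lam: float) -> str:
    """Equation 3 with mean cost: argmax_c p_hat_c(F(q)) - lam * mean_cost(c)."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    return max(predictors,
               key=lambda c: predictors[c](features) - lam * avg_cost[c])


# Hypothetical toy setup. features = [requires_multi_hop, involves_people].
PREDICTORS: Dict[str, Predictor] = {
    "llm_only": lambda f: 0.20 if f[0] else 0.90,
    "per_chunk_summary": lambda f: 0.60 if f[0] else 0.90,
    "iterative_retrieval": lambda f: 0.95,
}
# Hypothetical profiled dollar costs, two profiled queries per configuration.
AVG_COST = mean_costs({
    "llm_only": [0.0001, 0.0003],
    "per_chunk_summary": [0.04, 0.06],
    "iterative_retrieval": [0.10, 0.20],
})

if __name__ == "__main__":
    multi_hop_query = [1, 0]
    # Sweeping lambda on a log scale moves the choice from the most
    # accurate configuration toward the cheapest one.
    for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
        choice = select_config(multi_hop_query, PREDICTORS, AVG_COST, lam)
        print(f"lambda={lam:g}: {choice}")
```

Because each configuration is scored independently, adding a configuration means adding one entry to the predictor and mean-cost dictionaries, and retargeting means changing only `lam`; nothing is retrained. In the framework described above, the operating \(\lambda\) would come from the offline calibration sweep rather than being set by hand as it is here.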
4.3 Workload-Specific Query Characterization

We propose LLM-generated workload-specific characteristics to exploit \(\mathcal{O}3\) (§ 3.3): a workload exhibits common characteristics learnable from a small sample of past queries. This section describes how we construct the characterization map \(\mathcal{F}\) from Equation 2. Figure 2 (top right) illustrates this process.

Offline, we prompt a frontier LLM (GPT-5-mini) with a batch of example queries from the workload and ask it to propose \(d\) binary characteristics \(\{\mathcal{F}_{j}\}_{j=1}^{d}\) that distinguish them (e.g., involves_people, involves_astronomical_object); each \(\mathcal{F}_{j}\) must be answerable yes-or-no from the query text alone. We find that as few as \(d=10\) characteristics per workload work well in practice, and we ablate \(d\) in Appendix B.2. At training and inference time, a smaller and cheaper LLM labels each query against all \(d\) characteristics and returns the binary vector \(\mathcal{F}(q) \in \{0,1\}^{d}\). Restricting the frontier LLM to the one-time offline proposal step keeps per-query characterization cost low. Before training the predictors, we drop constant components and one of any pair with absolute correlation above 0.99 on the profiling sample (e.g., involves_health_medical and involves_disease_outbreak on a near-duplicate pair).

Our intuition for workload-specific characterization is that it bridges the gap between the query semantic space and the configuration space. Raw query embeddings do not align with the configuration choice: two queries with near-identical embeddings can require very different pipeline configurations. This problem sharpens in topically homogeneous workloads.
On a domain-specific benchmark like FinanceBench, every query is about finance, so queries collapse into a tight cluster in embedding space and a generic embedding has little discriminative power left; the signal that separates queries lies in finer-grained predicates the embedding does not surface. The configuration space itself is large and combinatorial, with knob interactions that no single textual signal exposes. Workload-specific characteristics act as a translation layer between the two, surfacing the query signals that correlate with the right configuration while preserving the per-query variance documented in § 3.1. § 5.3 ablates the characterizer LLM choice and compares against semantic embeddings as the input representation.

4.4 Fuzzy Pareto Pruning

BRANE trains one predictor per configuration, so both predictor training and per-query inference scale linearly in \(|\mathcal{C}|\). Our experiments cover up to 335 configurations from a modest knob set, yet production systems can have even more knobs. Based on the observation that configurations not on the Pareto frontier in cost and accuracy are rarely selected, we train predictors only for configurations near the cost-quality Pareto frontier of the profiling sample. The strict frontier is brittle under a small profiling sample: a near-frontier configuration can land on the frontier due to sampling noise, and a true-frontier configuration can be incorrectly dropped.
We instead use fuzzy Pareto pruning, which retains a configuration \(\mathsf{c}'\) whenever some strict-frontier vertex \(\mathsf{c}^{\star}\) satisfies

\[
\overline{y}(\mathsf{c}^{\star}) - \overline{y}(\mathsf{c}') \;\leq\; \tau_{\text{acc}}
\quad\text{and}\quad
\overline{\text{cost}}(\mathsf{c}') \;\leq\; (1+\tau_{\text{cost}})\,\overline{\text{cost}}(\mathsf{c}^{\star}),
\]

where \(\overline{y}(\mathsf{c}) = \tfrac{1}{N}\sum_{i} y_{\mathsf{c}}(q_{i})\) and \(\overline{\text{cost}}\) is the per-configuration profiling-sample mean cost. The accuracy tolerance \(\tau_{\text{acc}}\) guards against sample noise; the cost tolerance \(\tau_{\text{cost}}\) retains low-cost candidates that broaden the achievable cost-saving range. We train predictors and evaluate Equation 3 only on the fuzzy Pareto set in § 5.

5 Evaluations and Ablations

We evaluate BRANE on three knowledge-search benchmarks against six baselines spanning the three families of prior work (Section 2): Murakkab [4] (workload-level static), Carrot [27] (LLM-only per-query), METIS [25] and Adaptive-RAG [15] (retrieval-strategy per-query), an end-to-end fine-tuned LLM, and the full static-configuration sweep as the non-adaptive reference. We extend the LLM-only and rule-based baselines to a broader search space f
