
Key Takeaways

  • Text-to-SQL (T2SQL) systems let users query databases in natural language, but evaluating them in production poses challenges that existing benchmarks do not address.
  • Current methods assume access to ground-truth queries and database schemas, constraints rarely satisfied in real deployments; this leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement.
  • STEF (Schema-agnostic Text-to-SQL Evaluation Framework) evaluates generated SQL using only natural language inputs, enabling continuous production monitoring and agent improvement feedback loops without schema dependency.
Paper Abstract

Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or schema-dependent semantic parsers, assume access to ground-truth queries and a structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs (the user question, an enriched reformulation, and the generated SQL) without requiring database schema or reference queries. STEF extracts semantic specifications from both natural language and SQL representations, performs normalized feature alignment, and produces an interpretable 0 to 100 accuracy score via a composite metric that encompasses filter alignment, semantic verdict, and evaluator confidence. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops without schema dependency, making structured query evaluation viable at scale for the first time.

Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems
Text-to-SQL (T2SQL) systems allow users to query databases using natural language, but evaluating how well these systems perform in real-world production environments is difficult. Current methods rely on "ground-truth" reference queries or access to the underlying database schema, both of which are rarely available once a system is live. This creates a "silent quality degradation" problem where developers cannot track if their agents are failing or improving over time. STEF (Schema-agnostic Text-to-SQL Evaluation Framework) solves this by evaluating T2SQL outputs using only the user's question, an enriched reformulation, and the generated SQL, without needing access to the database itself.
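The three inputs can be pictured as a simple record. This is a minimal sketch assuming a plain Python dataclass; the field names are illustrative, not STEF's actual API:

```python
from dataclasses import dataclass

@dataclass
class EvalInput:
    """The three natural-language-side inputs STEF operates on.
    Field names are illustrative, not the framework's actual API."""
    user_question: str      # original question as typed by the user
    enriched_question: str  # agent's reformulation with resolved context
    generated_sql: str      # SQL the T2SQL agent produced

# Note: no schema object and no reference query -- the point of a
# schema-agnostic evaluator is that neither is required.
sample = EvalInput(
    user_question="total revenue by region last quarter",
    enriched_question="Sum of revenue grouped by region for Q3 2024",
    generated_sql=(
        "SELECT region, SUM(revenue) FROM sales "
        "WHERE quarter = 'Q3-2024' GROUP BY region"
    ),
)
```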

How STEF Works

STEF functions as a multi-stage pipeline that breaks down both the natural language question and the generated SQL into structured semantic components. It extracts key elements like projections, aggregations, filters, and grouping instructions from both sides. By normalizing these features—such as resolving column aliases or mapping natural language terms like "total" to SQL functions like "SUM"—the framework can align the intent of the user with the logic of the generated query. This allows the system to identify exactly which part of a query is misaligned, providing clear diagnostic feedback rather than just a pass/fail result.
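The normalization and alignment step can be sketched as follows. This is a toy illustration of the idea, assuming a hand-built cue-word table (`NL_AGGREGATES`) and regex extraction; the real pipeline's extraction logic is not published in this summary:

```python
import re

# Hypothetical normalization table: natural-language cue -> SQL aggregate.
NL_AGGREGATES = {"total": "SUM", "average": "AVG", "count": "COUNT",
                 "highest": "MAX", "lowest": "MIN"}

def aggregates_from_question(question: str) -> set:
    """Map aggregate cue words in the question to SQL function names."""
    words = re.findall(r"[a-z]+", question.lower())
    return {NL_AGGREGATES[w] for w in words if w in NL_AGGREGATES}

def aggregates_from_sql(sql: str) -> set:
    """Extract aggregate function names actually used in the SQL."""
    return set(re.findall(r"\b(SUM|AVG|COUNT|MAX|MIN)\s*\(", sql.upper()))

def align(question: str, sql: str) -> dict:
    """Report which intended aggregates are matched, missing, or extra,
    giving a diagnostic rather than a bare pass/fail."""
    wanted = aggregates_from_question(question)
    found = aggregates_from_sql(sql)
    return {"matched": wanted & found,
            "missing": wanted - found,
            "extra": found - wanted}

report = align(
    "average order value per customer",
    "SELECT customer_id, AVG(order_value) FROM orders GROUP BY customer_id",
)
# report["matched"] == {"AVG"}
```

A mismatch surfaces as a named feature (e.g. a `missing` aggregate), which is what makes the feedback diagnostic rather than binary.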

Customization Through Rule Injection

Because different enterprise environments have unique SQL conventions, STEF includes a configurable rule injection mechanism. Using a JSON-based configuration, developers can define application-specific rules, such as mapping custom column names to database fields or identifying "benign filters" (default settings like "is_deleted = 0" that should not be penalized). This allows the evaluation framework to adapt to different business contexts without requiring code changes or retraining the underlying models, making it highly flexible for diverse production deployments.
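A rule configuration of this kind might look like the following. The JSON key names here are assumptions for illustration, not STEF's documented configuration schema:

```python
import json

# Illustrative rule config: the key names are assumptions, not the
# framework's documented schema.
RULES_JSON = """
{
  "column_aliases": {"client": "customer_id", "turnover": "revenue"},
  "benign_filters": ["is_deleted = 0", "tenant_id = :tenant"]
}
"""

rules = json.loads(RULES_JSON)

def is_benign(filter_expr: str, rules: dict) -> bool:
    """A filter present in the SQL but absent from the question is not
    penalized if the config marks it as a benign default."""
    normalized = " ".join(filter_expr.split()).lower()
    return normalized in (f.lower() for f in rules["benign_filters"])
```

Because the rules live in configuration rather than code, adapting the evaluator to a new tenant's conventions is a data change, not a deployment.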

Scoring and Confidence

To provide an actionable metric, STEF produces an interpretable accuracy score on a 0 to 100 scale. This score is calculated using a composite approach that combines the filter alignment status, a semantic verdict from an LLM-as-judge, and a confidence multiplier. By incorporating a confidence score, the framework accounts for evaluator uncertainty, ensuring that the final accuracy rating reflects how certain the system is in its assessment. This allows teams to monitor their T2SQL agents continuously and identify specific categories of queries that consistently underperform.
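The composite scoring described above can be sketched as a small function. The equal weighting and multiplicative confidence scaling here are assumptions for illustration; the paper's exact formula is not given in this summary:

```python
def composite_score(filter_alignment: float,
                    semantic_verdict: float,
                    confidence: float) -> float:
    """Hypothetical composite: blend filter alignment with the LLM
    judge's verdict, then scale by evaluator confidence.
    All inputs are in [0, 1]; output is on the paper's 0-100 scale.
    The 50/50 weights are illustrative, not the paper's formula."""
    base = 0.5 * filter_alignment + 0.5 * semantic_verdict
    return round(100 * base * confidence, 1)

# Perfect filters and a positive verdict, but a slightly uncertain judge:
score = composite_score(1.0, 1.0, 0.9)  # -> 90.0
```

Scaling by confidence means an uncertain judge pulls the score down, so a high rating requires both a positive verdict and a confident evaluator.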

Practical Benefits

By removing the dependency on database schemas and reference queries, STEF makes structured query evaluation viable at scale for the first time. It enables developers to implement feedback loops for continuous improvement, detect accuracy drops in real-time, and compare the performance of different model versions. This approach bridges the gap between academic benchmarks and the operational realities of enterprise data assistants, providing a robust way to ensure that T2SQL outputs remain accurate and reliable.
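One way such a feedback loop could consume STEF scores is a rolling-window drop detector. This is a generic monitoring sketch, not part of STEF itself; the baseline and tolerance values are arbitrary examples:

```python
from collections import deque

class AccuracyMonitor:
    """Illustrative rolling-window monitor: flags when the mean score
    over a recent window falls more than `tolerance` points below a
    fixed baseline. Not part of STEF; a generic pattern for consuming
    its per-query scores."""
    def __init__(self, baseline: float, window: int = 50,
                 tolerance: float = 5.0):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add a new score; return True if an accuracy drop is detected."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance
```

The same stream of scores, tagged by model version, also supports A/B comparison between agent releases.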
