Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems
Text-to-SQL (T2SQL) systems allow users to query databases using natural language, but evaluating how well these systems perform in real-world production environments is difficult. Current methods rely on "ground-truth" reference queries or access to the underlying database schema, both of which are rarely available once a system is live. This creates a "silent quality degradation" problem: developers cannot track whether their agents are failing or improving over time. STEF (Schema-agnostic Text-to-SQL Evaluation Framework) solves this by evaluating T2SQL outputs using only the user's question, an enriched reformulation, and the generated SQL, without needing access to the database itself.
How STEF Works
STEF functions as a multi-stage pipeline that breaks down both the natural language question and the generated SQL into structured semantic components. It extracts key elements like projections, aggregations, filters, and grouping instructions from both sides. By normalizing these features—such as resolving column aliases or mapping natural language terms like "total" to SQL functions like "SUM"—the framework can align the intent of the user with the logic of the generated query. This allows the system to identify exactly which part of a query is misaligned, providing clear diagnostic feedback rather than just a pass/fail result.
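The decomposition-and-alignment idea can be sketched in a few lines. This is an illustrative toy, not STEF's actual implementation: the feature names, the `NL_TO_SQL_AGG` cue table, and the regex-based SQL parsing are all simplifying assumptions standing in for the framework's richer extraction stages.

```python
import re

# Hypothetical mapping of natural-language cue words to SQL aggregate
# functions, standing in for STEF's normalization step ("total" -> SUM).
NL_TO_SQL_AGG = {"total": "SUM", "average": "AVG", "count": "COUNT", "highest": "MAX"}

def extract_question_features(question: str) -> dict:
    """Pull aggregation cues out of the user's question (toy heuristic)."""
    aggs = {fn for word, fn in NL_TO_SQL_AGG.items() if word in question.lower()}
    return {"aggregations": aggs}

def extract_sql_features(sql: str) -> dict:
    """Pull aggregate function calls out of the generated SQL."""
    aggs = re.findall(r"\b(SUM|AVG|COUNT|MAX|MIN)\s*\(", sql, re.IGNORECASE)
    return {"aggregations": {a.upper() for a in aggs}}

def align(question: str, sql: str) -> dict:
    """Compare the two feature sets and name any misaligned component."""
    q = extract_question_features(question)
    s = extract_sql_features(sql)
    missing = q["aggregations"] - s["aggregations"]
    return {"aligned": not missing, "missing_aggregations": sorted(missing)}

# "total" implies SUM, but the SQL uses AVG, so alignment fails on the
# aggregation component -- a diagnostic, not just a pass/fail verdict.
result = align("What is the total revenue per region?",
               "SELECT region, AVG(revenue) FROM sales GROUP BY region")
```

Even this toy version shows the payoff of component-level alignment: the output names the specific mismatched element (the aggregation) rather than collapsing everything into a single boolean.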
Customization Through Rule Injection
Because different enterprise environments have unique SQL conventions, STEF includes a configurable rule injection mechanism. Using a JSON-based configuration, developers can define application-specific rules, such as mapping custom column names to database fields or identifying "benign filters" (default settings like "is_deleted = 0" that should not be penalized). This allows the evaluation framework to adapt to different business contexts without requiring code changes or retraining the underlying models, making it highly flexible for diverse production deployments.
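A rule-injection configuration might look like the following. The key names (`column_mappings`, `benign_filters`) and values are hypothetical, chosen to mirror the examples in the text rather than STEF's actual schema:

```json
{
  "column_mappings": {
    "customer name": "cust_nm",
    "order date": "ord_dt"
  },
  "benign_filters": [
    "is_deleted = 0",
    "tenant_id = :current_tenant"
  ]
}
```

Because the rules live in configuration rather than code, each deployment can ship its own file and the evaluator's behavior changes without redeploying or retraining anything.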
Scoring and Confidence
To provide an actionable metric, STEF produces an interpretable accuracy score on a 0 to 100 scale. This score is calculated using a composite approach that combines the filter alignment status, a semantic verdict from an LLM-as-judge, and a confidence multiplier. By incorporating a confidence score, the framework accounts for evaluator uncertainty, ensuring that the final accuracy rating reflects how certain the system is in its assessment. This allows teams to monitor their T2SQL agents continuously and identify specific categories of queries that consistently underperform.
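A minimal sketch of such a composite score follows. The exact weights and combination STEF uses are not specified here, so the 50/50 split between filter alignment and the judge verdict, and the multiplicative use of confidence, are illustrative assumptions:

```python
def composite_score(filters_aligned: bool,
                    judge_verdict: float,
                    confidence: float) -> float:
    """Combine evaluation signals into a 0-100 accuracy score (sketch).

    filters_aligned: did the extracted filters match the question intent?
    judge_verdict:   LLM-as-judge semantic agreement, in [0, 1].
    confidence:      evaluator certainty in [0, 1], applied as a multiplier
                     so uncertain assessments pull the score down.
    """
    # Assumed equal weighting of the two signals; real weights may differ.
    base = (0.5 if filters_aligned else 0.0) + 0.5 * judge_verdict
    return round(100 * base * confidence, 1)

# Aligned filters, strong judge verdict, slightly uncertain evaluator:
score = composite_score(True, judge_verdict=0.9, confidence=0.95)
```

The confidence multiplier is what makes the metric safe to monitor continuously: a low-confidence evaluation dampens the score instead of reporting a spuriously precise number.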
Practical Benefits
By removing the dependency on database schemas and reference queries, STEF makes structured query evaluation viable at scale for the first time. It enables developers to implement feedback loops for continuous improvement, detect accuracy drops in real-time, and compare the performance of different model versions. This approach bridges the gap between academic benchmarks and the operational realities of enterprise data assistants, providing a robust way to ensure that T2SQL outputs remain accurate and reliable.