Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Key Takeaways

  • Evaluating the output of Large Language Models (LLMs) is a significant challenge.
  • Traditional automated metrics often fail to capture the nuance of human language.
  • We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores.
  • Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores.
  • This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement.
Paper Abstract

Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BINEVAL better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BINEVAL provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.

Evaluating the output of Large Language Models (LLMs) is a significant challenge. Traditional automated metrics often fail to capture the nuance of human language, while asking an LLM for a single "holistic" score often yields opaque feedback that is difficult to debug. The paper "Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement" introduces BinEval, a framework that replaces broad, subjective scoring with a series of targeted, atomic yes/no questions. By breaking complex evaluation criteria into simple, checkable components, BinEval provides a transparent and diagnostic way to assess model performance.

How BinEval Works

The framework operates through a three-part process. First, a "meta-prompt" analyzes the task prompt (for example, a summarization or dialogue task) and decomposes the evaluation criteria into a set of fine-grained, atomic binary questions. Second, an evaluator LLM answers each question independently for every output, giving a "yes" or "no" verdict. Third, the individual verdicts are aggregated into multi-dimensional scores. Because each score is tied to specific binary questions and natural-language explanations, developers can inspect exactly why a model received a particular rating, turning a black-box score into a clear diagnostic signal.
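
The aggregation step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Verdict` structure, the dimension names, and the rule of scoring each dimension as the fraction of "yes" answers are all assumptions made here for clarity, and the verdicts are hand-written rather than produced by an LLM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Verdict:
    """One answered binary question (hypothetical structure)."""
    dimension: str         # e.g. "consistency"; dimension names are assumed
    question: str          # the atomic yes/no question
    answer: bool           # True for "yes", False for "no"
    explanation: str = ""  # natural-language justification from the evaluator


def aggregate(verdicts):
    """Score each dimension as the fraction of 'yes' answers (assumed rule)."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for v in verdicts:
        totals[v.dimension] += 1
        passed[v.dimension] += int(v.answer)
    return {dim: passed[dim] / totals[dim] for dim in totals}


def failed_questions(verdicts):
    """Return the verdicts answered 'no', i.e. the diagnostic signal."""
    return [v for v in verdicts if not v.answer]


# Hand-written verdicts standing in for evaluator LLM answers.
example_verdicts = [
    Verdict("consistency", "Does every named entity appear in the source?", True),
    Verdict("consistency", "Are all numbers supported by the source?", False,
            "The summary says 12; the source says 21."),
    Verdict("coherence", "Do the sentences follow a logical order?", True),
    Verdict("coherence", "Is the summary free of abrupt topic shifts?", True),
]

scores = aggregate(example_verdicts)
for v in failed_questions(example_verdicts):
    print(f"[{v.dimension}] {v.question} -> no ({v.explanation})")
```

Keeping the failed questions alongside the scores is what makes the result inspectable: a low "consistency" score can be traced back to the specific question that was answered "no".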

Iterative Improvement

Beyond simple evaluation, BinEval supports a two-phase optimization loop. Because the system identifies specific failures (the "no" answers), it can generate "lessons" that explain why an output fell short. These lessons are used to automatically refine prompts. The framework supports two types of updates:

  • Cross-model update: A stronger, more capable model acts as a reference, and a weaker model updates its own prompts until its evaluation behavior aligns with the stronger model.

  • Self-update: A model uses its own evaluation failures to iteratively improve its generation prompts, allowing it to learn from its mistakes without requiring human intervention.
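
As a rough sketch of the self-update setting, the loop below generates an output, collects the failed binary questions, turns them into lessons, and revises the prompt. The function names (`self_update`, `toy_generate`, `toy_evaluate`, `toy_revise`), the stopping rule, and the lesson format are hypothetical; the toy callables stand in for real LLM calls so the example runs on its own.

```python
def self_update(prompt, generate, evaluate, revise, max_rounds=3):
    """Refine a prompt until no binary question fails or rounds run out.

    generate(prompt) -> output text
    evaluate(output) -> list of failed question strings ("no" answers)
    revise(prompt, lessons) -> updated prompt
    All three callables are assumptions standing in for LLM calls.
    """
    history = [prompt]
    for _ in range(max_rounds):
        output = generate(prompt)
        failures = evaluate(output)
        if not failures:
            break  # every question was answered "yes"
        # Turn each failed question into a short lesson (assumed format).
        lessons = [f"The output failed the check: {q}" for q in failures]
        prompt = revise(prompt, lessons)
        history.append(prompt)
    return prompt, history


# Toy stand-ins so the loop is runnable without a model.
def toy_generate(prompt):
    return f"OUTPUT following: {prompt}"


def toy_evaluate(output):
    failures = []
    if "cite the source" not in output:
        failures.append("Does the answer cite the source?")
    return failures


def toy_revise(prompt, lessons):
    if any("cite the source" in lesson for lesson in lessons):
        prompt += " Always cite the source."
    return prompt


final_prompt, history = self_update(
    "Summarize the article.", toy_generate, toy_evaluate, toy_revise
)
print(final_prompt)
```

A cross-model variant could reuse the same loop with `evaluate` backed by the stronger reference model, though how the paper wires this up is not described here.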

Performance and Results

The researchers tested BinEval on several benchmarks, including SummEval, Topical-Chat, and QAGS. BinEval matches or outperforms strong baselines such as G-Eval and UniEval in agreement with human judgments. It is particularly effective on factual consistency benchmarks such as QAGS, where breaking information down into verifiable facts provides a more robust signal than holistic scoring. BinEval also avoids the "ceiling effects" common in prior LLM judges: it discriminates better between borderline and clearly flawed outputs, and its scores more closely match human score distributions.
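
To make the ceiling effect concrete, the sketch below measures the share of outputs that receive the maximum possible score. The `ceiling_rate` helper is hypothetical and the score lists are invented for illustration only; they are not results from the paper.

```python
def ceiling_rate(scores, max_score):
    """Fraction of outputs that receive the maximum possible score."""
    return sum(1 for s in scores if s >= max_score) / len(scores)


# Invented scores for six outputs (illustrative only, not paper data).
holistic_scores = [5, 5, 4, 5, 5, 5]            # a 1-5 holistic judge
binary_scores = [1.0, 0.8, 0.6, 0.9, 1.0, 0.4]  # fraction of "yes" answers

holistic_rate = ceiling_rate(holistic_scores, max_score=5)
binary_rate = ceiling_rate(binary_scores, max_score=1.0)
print(f"holistic judge at ceiling: {holistic_rate:.2f}")
print(f"binary questions at ceiling: {binary_rate:.2f}")
```

When most outputs sit at the top score, a judge cannot separate borderline outputs from good ones, which is the failure mode the paragraph above describes.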

Why This Matters

The primary advantage of BinEval is its interpretability. By moving away from a single scalar score, the framework provides actionable feedback that is directly usable for debugging and prompt engineering. Because the method is task-agnostic and requires no task-specific training, it offers a flexible, "plug-and-play" solution for developers who need to evaluate complex LLM outputs across a variety of domains, from summarization to instruction following.
