Paper Abstract

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

Scaling inference-time reasoning is a popular way to improve how Large Language Models (LLMs) solve complex problems. A common method, Self-Consistency, samples multiple candidate answers and picks the most frequent one. A more advanced variant, Confidence-Informed Self-Consistency (CISC), uses a "think twice" approach: a critic LLM evaluates the reasoning behind each answer and assigns it a confidence score, and the answer with the largest accumulated score wins a weighted vote. This tends to improve accuracy, but it is expensive and slow because the critic LLM must process every reasoning trace. VecCISC is a lightweight framework that makes this process more efficient by filtering out redundant or low-quality reasoning traces before they reach the critic.
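
The difference between the two voting schemes can be illustrated with a minimal sketch. The function names, the sample answers, and the confidence values below are made up for illustration; in CISC the confidences would come from the critic LLM, and any normalization the method applies to them is not shown here.

```python
from collections import Counter, defaultdict


def self_consistency(answers):
    """Plain majority vote: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, confidences):
    """Weighted majority vote: sum the confidences per answer, return the largest total."""
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Hypothetical samples: five candidate answers with invented critic confidences.
answers = ["42", "42", "17", "17", "17"]
confidences = [0.9, 0.8, 0.3, 0.2, 0.4]

print(self_consistency(answers))            # 17 (three votes against two)
print(weighted_vote(answers, confidences))  # 42 (total 1.7 against roughly 0.9)
```

In this toy case the two schemes disagree: the less frequent answer wins once confidence is taken into account. The cost is one critic call per candidate, which is the overhead VecCISC targets.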

How VecCISC Works

The framework works on candidate answers together with their reasoning traces. First, an embedding model turns each reasoning trace into a numerical vector that captures its semantic meaning. The traces are then grouped by the final answer they produced. Within each group, VecCISC applies a clustering algorithm (K-Means or Hierarchical Agglomerative Clustering) to identify traces that reason in similar ways. Instead of sending every trace to the critic, the system sends only one representative trace per cluster, which reduces the number of critic LLM calls.
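
The grouping and clustering steps might look roughly like the sketch below. It is not the authors' implementation: it assumes the embeddings are already computed (the two-dimensional vectors are toy values), uses a simple average-linkage agglomerative procedure written in plain Python, and introduces a hypothetical `min_similarity` stopping threshold that the article does not mention.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def group_by_answer(answers):
    """Map each distinct final answer to the indices of the traces that produced it."""
    groups = defaultdict(list)
    for index, answer in enumerate(answers):
        groups[answer].append(index)
    return dict(groups)


def agglomerative_clusters(indices, embeddings, min_similarity=0.9):
    """Average-linkage agglomerative clustering on cosine similarity.

    Repeatedly merges the most similar pair of clusters and stops once no
    pair reaches min_similarity (an assumed, illustrative threshold).
    """
    clusters = [[i] for i in indices]

    def linkage(c1, c2):
        sims = [cosine_similarity(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_sim < min_similarity:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data: four traces, three of which reached answer "A".
answers = ["A", "A", "A", "B"]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.5, 0.5]]

clusters_by_answer = {
    answer: agglomerative_clusters(indices, embeddings)
    for answer, indices in group_by_answer(answers).items()
}
print(clusters_by_answer)  # {'A': [[0, 1], [2]], 'B': [[3]]}
```

Here traces 0 and 1 are near-duplicates and fall into one cluster, so the critic would see three representatives instead of four traces. In practice a library implementation of K-Means or agglomerative clustering would replace the hand-written loop.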

Selecting the Best Reasoning

To ensure the representative traces are high-quality, VecCISC calculates the centroid (the mathematical center) of each cluster. It then selects the reasoning trace closest to this centroid, that is, the trace whose embedding has the highest cosine similarity to it. This approach assumes that the trace closest to the cluster's center is the most representative and the least likely to contain hallucinations or degenerate logic. Once the representative traces are selected, the critic LLM evaluates them to generate confidence scores, which are then used in a weighted majority vote to determine the final answer.
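
A minimal sketch of the selection and voting steps follows. The embeddings and critic scores are invented, `critic_score` is a hypothetical stand-in for the critic LLM call, and counting each representative's score exactly once in the vote is an assumption: the article does not say how scores are aggregated across clusters of different sizes.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def representative(cluster, embeddings):
    """Index of the cluster member most cosine-similar to the cluster centroid."""
    center = centroid([embeddings[i] for i in cluster])
    return max(cluster, key=lambda i: cosine_similarity(embeddings[i], center))


def vote(clusters_by_answer, embeddings, critic_score):
    """Score one representative per cluster and return the answer with the largest total.

    Assumption: each representative's score is added once per cluster.
    """
    totals = defaultdict(float)
    for answer, clusters in clusters_by_answer.items():
        for cluster in clusters:
            totals[answer] += critic_score(representative(cluster, embeddings))
    return max(totals, key=totals.get)


# Toy data: one three-trace cluster for answer "A", one single-trace cluster for "B".
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.6], [0.0, 1.0]]
clusters_by_answer = {"A": [[0, 1, 2]], "B": [[3]]}

# Invented critic scores, keyed by trace index; only representatives are looked up.
fake_scores = {1: 0.8, 3: 0.6}

print(representative([0, 1, 2], embeddings))                       # 1
print(vote(clusters_by_answer, embeddings, fake_scores.__getitem__))  # A
```

Only two critic scores are needed for four traces in this example, since traces 0 and 2 are never sent to the critic.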

Efficiency and Performance

The researchers evaluated VecCISC on five datasets spanning mathematics, chemistry, biology, commonsense reasoning, and the humanities, using several different LLMs. The results show that VecCISC maintains or exceeds the accuracy of standard CISC while reducing total token usage by 47% on average across the entire inference pipeline. By skipping the evaluation of redundant or poor-quality samples, VecCISC shows that strong reasoning performance is achievable without the heavy overhead typically associated with "think twice" methods.
