VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection
Scaling inference-time reasoning is a popular way to improve how Large Language Models (LLMs) solve complex problems. A common method, Self-Consistency, involves sampling multiple answers and picking the most frequent one. A more advanced version, Confidence-Informed Self-Consistency (CISC), uses a "think twice" approach: a second, critic LLM evaluates the reasoning behind each answer to assign a confidence score, allowing for a weighted vote. While this improves accuracy, it is expensive and slow because it requires the critic LLM to process every single reasoning trace. VecCISC is a new, lightweight framework designed to make this process more efficient by filtering out redundant or low-quality reasoning traces before they reach the critic.
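The difference between the two voting schemes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes the critic's confidence scores are summed as raw weights (the normalization actually used is not described here).

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Self-Consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """CISC-style vote: each answer accumulates the confidence of its traces.

    Assumption: raw confidence scores are used directly as weights.
    """
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Three sampled answers with hypothetical critic confidences.
answers = ["A", "A", "B"]
confidences = [0.2, 0.2, 0.9]
plain = majority_vote(answers)
weighted = confidence_weighted_vote(answers, confidences)
```

Here the plain vote picks "A" (two samples against one), while the weighted vote picks "B", because the single high-confidence trace (0.9) outweighs the two low-confidence ones (0.4 in total).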
How VecCISC Works
The framework works in three steps. First, an embedding model turns each reasoning trace into a numerical vector that captures its semantic meaning. Second, the traces are grouped by the final answer they produced. Third, within each group, VecCISC applies a clustering algorithm (K-Means or Hierarchical Agglomerative Clustering) to identify similar reasoning patterns. Instead of sending every trace to the critic, the system sends only one representative trace per cluster, which significantly reduces the number of calls made to the critic LLM.
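The grouping and clustering steps might look like the sketch below. It assumes the embeddings have already been computed by some embedding model, and it uses a tiny, deterministic K-Means written in plain Python purely for illustration. The function names, the seeding strategy, and the parameter `k` are assumptions, not details taken from the paper.

```python
from collections import defaultdict

def group_by_answer(answers):
    """Map each distinct final answer to the indices of the traces that produced it."""
    groups = defaultdict(list)
    for i, answer in enumerate(answers):
        groups[answer].append(i)
    return dict(groups)

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iters=20):
    """Toy K-Means seeded with the first k vectors; returns one label per vector."""
    k = min(k, len(vectors))
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels

def cluster_within_groups(answers, embeddings, k=2):
    """Return {answer: {cluster_label: [trace indices]}}."""
    result = {}
    for answer, idxs in group_by_answer(answers).items():
        labels = kmeans([embeddings[i] for i in idxs], k)
        clusters = defaultdict(list)
        for i, label in zip(idxs, labels):
            clusters[label].append(i)
        result[answer] = dict(clusters)
    return result

# Four traces: three answered "42", one answered "7" (toy 2-D embeddings).
answers = ["42", "42", "42", "7"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
clusters = cluster_within_groups(answers, embeddings, k=2)
n_critic_calls = sum(len(c) for c in clusters.values())
```

In this toy input, traces 0 and 1 have nearly identical embeddings and fall into the same cluster, so the critic would be called three times (once per cluster) rather than four times (once per trace).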
Selecting the Best Reasoning
To ensure the representative traces are high-quality, VecCISC calculates the centroid of each cluster, that is, the mean of its embedding vectors. It then selects the reasoning trace closest to this centroid, as measured by cosine similarity. The underlying assumption is that the trace nearest the cluster's center is the most representative and the least likely to contain hallucinations or degenerate logic. The critic LLM then evaluates only these representative traces to generate confidence scores, which are used in a weighted majority vote to determine the final answer.
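Representative selection and the final vote can be sketched as follows. The helper names are hypothetical and the confidence values are made up for illustration. The sketch also assumes each representative contributes its confidence exactly once; whether the weight is scaled by cluster size is not specified in this summary.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def representative(indices, embeddings):
    """Index of the trace most cosine-similar to its cluster's centroid."""
    center = centroid([embeddings[i] for i in indices])
    return max(indices, key=lambda i: cosine_similarity(embeddings[i], center))

def vote_over_representatives(scored):
    """scored: (answer, critic_confidence) pairs, one per representative trace."""
    totals = defaultdict(float)
    for answer, confidence in scored:
        totals[answer] += confidence
    return max(totals, key=totals.get)

# One cluster of three traces; the middle embedding sits at the centroid.
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]]
rep = representative([0, 1, 2], embeddings)

# Hypothetical critic scores for three representatives across two answers.
final = vote_over_representatives([("42", 0.9), ("42", 0.4), ("7", 0.8)])
```

In the example, trace 1 is selected because its embedding coincides with the cluster mean, and "42" wins the vote with a combined weight of 1.3 against 0.8.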
Efficiency and Performance
The researchers evaluated VecCISC on five datasets spanning mathematics, chemistry, biology, and the humanities, using several different LLMs. The results show that VecCISC matches or exceeds the accuracy of standard CISC while significantly reducing costs. Specifically, the framework achieved an average reduction in total token usage of 47% across the entire inference pipeline. By avoiding the evaluation of redundant or poor-quality samples, VecCISC shows that high-performance reasoning is achievable without the heavy overhead typically associated with "think twice" methods.