Reasoning Structure of Large Language Models
Large reasoning models (LRMs) are typically judged by simple metrics like whether they get the right answer or how many tokens they use. However, these metrics can be misleading because two models might arrive at the same answer through entirely different logical paths. This paper introduces a new way to evaluate these models by looking at the "topology" of their reasoning, turning unstructured text into a measurable, structured map of claims and dependencies.
Moving Beyond Accuracy and Token Counts
Current evaluation methods often conflate different types of model behavior. A model that guesses correctly is scored the same as one that follows a rigorous logical path, and a verbose model is judged differently from a concise one on length alone, regardless of the quality of either model's logic. By focusing on the underlying structure of the reasoning process, the authors aim to provide a more nuanced view of how models actually "think" when solving complex logic puzzles.
From Unstructured Text to Reasoning Graphs
To analyze how models reason, the researchers developed a pipeline that transforms a model’s unstructured output into a verifiable reasoning graph that maps out the individual claims and the dependencies between them. This approach treats reasoning as a structured object, allowing researchers to apply quantitative analysis to the logical flow of the model.
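To make the idea of a reasoning graph concrete, here is a minimal Python sketch of one possible representation. It is not the authors' pipeline: the `Claim` class, its field names, and the `build_graph` helper are hypothetical, and the sketch assumes the claims have already been extracted from the model's text. It only shows how claims and dependencies can be stored and how a basic structural check (no missing references, no circular justification) might be run.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Claim:
    """One statement made by the model (hypothetical schema)."""
    claim_id: str
    text: str
    depends_on: list[str] = field(default_factory=list)


def build_graph(claims):
    """Return (graph, order).

    graph maps each claim id to the set of claim ids it depends on.
    order lists claim ids so that every claim appears after its dependencies.
    Raises ValueError on unknown dependencies or circular reasoning.
    """
    graph = {c.claim_id: set(c.depends_on) for c in claims}
    for claim_id, deps in graph.items():
        missing = deps.difference(graph)
        if missing:
            raise ValueError(f"{claim_id} depends on unknown claims: {sorted(missing)}")
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("circular dependency between claims") from exc
    return graph, order


# Illustrative claims for a made-up logic puzzle.
example_claims = [
    Claim("p1", "The red house is not next to the blue house."),
    Claim("p2", "The blue house is in position 1."),
    Claim("s1", "The red house is not in position 2.", ["p1", "p2"]),
    Claim("ans", "The red house is in position 3.", ["s1"]),
]
example_graph, example_order = build_graph(example_claims)
```

Storing dependencies as a mapping from each claim to its prerequisites keeps the structure easy to traverse from the final answer back to the premises, which is the direction most structural measurements need.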
Measuring Reasoning Efficiency
Building on these reasoning graphs, the authors defined a new metric called "reasoning efficiency." This metric quantifies how concentrated or focused a model’s logical flow is. By measuring the topology of the reasoning graph, the researchers can determine if a model is following a direct, logical path or if its reasoning is scattered and inefficient.
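This summary does not give the paper's formula for reasoning efficiency, so the following Python sketch uses a stand-in proxy rather than the authors' metric: the fraction of claims that the final answer actually depends on, directly or indirectly. The function names `ancestors` and `efficiency_proxy`, and the dictionary format of the graph, are assumptions made for illustration.

```python
def ancestors(graph, target):
    """Return the target plus every claim it transitively depends on.

    graph maps each claim id to the set of claim ids it depends on.
    """
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def efficiency_proxy(graph, answer):
    """Share of claims that contribute to the answer (assumed proxy, not the paper's metric)."""
    if answer not in graph:
        raise KeyError(f"unknown answer claim: {answer}")
    return len(ancestors(graph, answer)) / len(graph)


# A focused trace: every claim feeds into the answer.
focused = {"p1": set(), "p2": set(), "s1": {"p1", "p2"}, "ans": {"s1"}}

# A scattered trace: d1 and d2 are a dead-end detour the answer never uses.
scattered = dict(focused, d1={"p1"}, d2={"d1"})

focused_score = efficiency_proxy(focused, "ans")      # 4 of 4 claims used
scattered_score = efficiency_proxy(scattered, "ans")  # 4 of 6 claims used
```

Under this proxy, two traces that reach the same answer can score differently: the focused trace scores 1.0, while the trace with an unused detour scores lower, which is the kind of distinction that accuracy and token counts alone do not capture.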
Insights into Model Behavior
When applied to open-source reasoning models, this structural analysis revealed behaviors that traditional metrics miss. The findings show that structural measurements are a practical tool for diagnosing why models fail and for understanding how their reasoning capabilities scale as the difficulty of a puzzle increases. By separating structural quality from simple accuracy, this research provides a clearer picture of how different models handle complex logical tasks.