Large Language Models (LLMs) are increasingly applied to complex reasoning tasks via "test-time scaling" (TTS), in which the model generates additional tokens to "think" through a problem before producing an answer. This improves performance but demands significant compute. This paper investigates whether pruning—removing redundant or less influential parameters—can reduce the size and inference cost of these models without harming their reasoning capabilities.
Challenging the Conventional Wisdom
Prior research suggested that pruning LLMs, particularly through "structured" methods that remove entire layers, significantly degrades their ability to perform complex, multi-step reasoning. This paper challenges that assumption by testing "unstructured" pruning, which removes individual, less important weights rather than entire blocks. By experimenting with two reasoning-focused models, s1.1-7B and Qwen3-8B, the authors demonstrate that unstructured pruning does not suffer from the same performance pitfalls as structured pruning.
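The distinction between the two pruning styles can be made concrete with a minimal numpy sketch (an illustration, not the paper's code): unstructured pruning zeros individual low-magnitude entries of a weight matrix while keeping its shape, whereas structured pruning deletes whole rows or layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # toy weight matrix from one layer

# Unstructured pruning: zero the 50% of individual weights with the
# smallest absolute value; the layer's shape is unchanged.
sparsity = 0.5
k = int(W.size * sparsity)                       # number of weights to zero
threshold = np.sort(np.abs(W), axis=None)[k - 1]
W_unstructured = np.where(np.abs(W) <= threshold, 0.0, W)

# Structured pruning, by contrast, removes entire blocks of the network,
# e.g. dropping half the output rows (or, at larger scale, whole layers).
W_structured = W[: W.shape[0] // 2, :]

print(W_unstructured.shape, W_structured.shape)  # (8, 8) (4, 8)
```

The unstructured variant preserves the network's architecture and only sparsifies it, which is why it can target exactly the weights that matter least; the structured variant changes the architecture itself.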
How Unstructured Pruning Works
The researchers evaluated two primary unstructured pruning techniques: Magnitude pruning, which removes weights with the smallest values, and Wanda, which considers both weight magnitude and the importance of input activations. They also tested different "sparsity allocation strategies"—methods for deciding which layers should be pruned more aggressively than others—including Uniform allocation, OWL (Outlier Weighted Layerwise Sparsity), and LayerIF (an influence-based approach). These strategies allow for a more surgical removal of parameters compared to simply cutting out entire layers.
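The difference between the two scoring rules can be sketched as follows (a simplified illustration, not the authors' implementation): magnitude pruning ranks weights by |W| alone, while Wanda ranks them by |W| scaled by the L2 norm of the corresponding input activation over a calibration set, comparing weights within each output row.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))   # weights: (out_features, in_features)
X = rng.normal(size=(32, 6))  # calibration activations: (tokens, in_features)
sparsity = 0.5

# Magnitude pruning: score each weight by its absolute value alone.
mag_score = np.abs(W)

# Wanda: scale |W| by the L2 norm of the matching input feature, so a
# small weight feeding a large activation can still be kept.
act_norm = np.linalg.norm(X, axis=0)   # (in_features,)
wanda_score = np.abs(W) * act_norm     # broadcasts over output rows

def prune_rowwise(W, score, sparsity):
    """Zero the lowest-scoring weights within each output row."""
    k = int(W.shape[1] * sparsity)
    drop = np.argsort(score, axis=1)[:, :k]  # indices of weights to remove
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return W * mask

W_mag = prune_rowwise(W, mag_score, sparsity)
W_wanda = prune_rowwise(W, wanda_score, sparsity)
```

A uniform allocation strategy would apply the same `sparsity` to every layer; OWL and LayerIF instead vary it per layer, pruning layers judged less influential more aggressively.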
Key Findings
The results were surprising: unstructured pruning consistently allowed the models to maintain their reasoning performance. In many cases, the pruned models actually outperformed the original, unpruned versions across four challenging reasoning benchmarks (MATH500, AIME24, AMC23, and GPQA-Diamond). While structured pruning consistently led to performance drops, unstructured pruning proved to be a highly effective way to reduce model size while simultaneously preserving or even boosting the effectiveness of test-time scaling.
Implications for Future Research
The study suggests that the way a model is pruned is just as important as the amount of pruning performed. By carefully selecting which weights to remove, researchers can create leaner, more efficient models that are better suited for the high-compute demands of modern reasoning tasks. These findings provide a new path for scaling next-generation LLMs, proving that it is possible to do more with less by focusing on the specific weights that contribute to—or detract from—a model's reasoning process.