Large Language Models (LLMs) are increasingly applied to complex reasoning tasks via "test-time scaling" (TTS), in which the model generates additional tokens to "think" through a problem before producing an answer. This improves performance but demands significant compute. This paper investigates whether pruning—removing redundant or less influential parameters—can reduce the size and inference cost of these models without harming their reasoning capabilities.
Challenging the Conventional Wisdom
Prior research suggested that pruning LLMs, particularly through "structured" methods that remove entire layers, significantly degrades their ability to perform complex, multi-step reasoning. This paper challenges that assumption by testing "unstructured" pruning, which removes individual, less important weights rather than entire blocks. By experimenting with two reasoning-focused models, s1.1-7B and Qwen3-8B, the authors demonstrate that unstructured pruning does not suffer from the same performance pitfalls as structured pruning.
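The distinction between the two pruning styles can be made concrete with a minimal numpy sketch (an illustration, not the paper's code): unstructured pruning zeros individual low-magnitude entries of a weight matrix while keeping its shape, whereas structured pruning deletes whole rows or layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # toy weight matrix from one layer

# Unstructured pruning: zero the 50% of individual weights with the
# smallest absolute value; the layer's shape is unchanged.
sparsity = 0.5
k = int(W.size * sparsity)                       # number of weights to zero
threshold = np.sort(np.abs(W), axis=None)[k - 1]
W_unstructured = np.where(np.abs(W) <= threshold, 0.0, W)

# Structured pruning, by contrast, removes entire blocks of the network,
# e.g. dropping half the output rows (or, at larger scale, whole layers).
W_structured = W[: W.shape[0] // 2, :]

print(W_unstructured.shape, W_structured.shape)  # (8, 8) (4, 8)
```

The unstructured variant preserves the network's architecture and only sparsifies it, which is why it can target exactly the weights that matter least; the structured variant changes the architecture itself.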
How Unstructured Pruning Works
The researchers evaluated two primary unstructured pruning techniques: Magnitude pruning, which removes weights with the smallest values, and Wanda, which considers both weight magnitude and the importance of input activations. They also tested different "sparsity allocation strategies"—methods for deciding which layers should be pruned more aggressively than others—including Uniform allocation, OWL (Outlier Weighted Layerwise Sparsity), and LayerIF (an influence-based approach). These strategies allow for a more surgical removal of parameters compared to simply cutting out entire layers.
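The difference between the two scoring rules can be sketched as follows (a simplified illustration, not the authors' implementation): magnitude pruning ranks weights by |W| alone, while Wanda ranks them by |W| scaled by the L2 norm of the corresponding input activation over a calibration set, comparing weights within each output row.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))   # weights: (out_features, in_features)
X = rng.normal(size=(32, 6))  # calibration activations: (tokens, in_features)
sparsity = 0.5

# Magnitude pruning: score each weight by its absolute value alone.
mag_score = np.abs(W)

# Wanda: scale |W| by the L2 norm of the matching input feature, so a
# small weight feeding a large activation can still be kept.
act_norm = np.linalg.norm(X, axis=0)   # (in_features,)
wanda_score = np.abs(W) * act_norm     # broadcasts over output rows

def prune_rowwise(W, score, sparsity):
    """Zero the lowest-scoring weights within each output row."""
    k = int(W.shape[1] * sparsity)
    drop = np.argsort(score, axis=1)[:, :k]  # indices of weights to remove
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return W * mask

W_mag = prune_rowwise(W, mag_score, sparsity)
W_wanda = prune_rowwise(W, wanda_score, sparsity)
```

A uniform allocation strategy would apply the same `sparsity` to every layer; OWL and LayerIF instead vary it per layer, pruning layers judged less influential more aggressively.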
Key Findings
The results were surprising: unstructured pruning consistently allowed the models to maintain their reasoning performance. In many cases, the pruned models actually outperformed the original, unpruned versions across four challenging reasoning benchmarks (MATH500, AIME24, AMC23, and GPQA-Diamond). While structured pruning consistently led to performance drops, unstructured pruning proved to be a highly effective way to reduce model size while simultaneously preserving or even boosting the effectiveness of test-time scaling.
Implications for Future Research
The study suggests that the way a model is pruned is just as important as the amount of pruning performed. By carefully selecting which weights to remove, researchers can create leaner, more efficient models that are better suited for the high-compute demands of modern reasoning tasks. These findings provide a new path for scaling next-generation LLMs, proving that it is possible to do more with less by focusing on the specific weights that contribute to—or detract from—a model's reasoning process.