Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Search-E1 is a method for improving how language models perform complex, search-augmented reasoning. Many such models are currently trained with "outcome-level" reinforcement learning, in which the model is rewarded only if it reaches the correct final answer. This approach is effective, but researchers often add complex, resource-heavy machinery (external teachers, extra reward models, or custom search structures) to boost performance. Search-E1 demonstrates that this extra machinery is not necessary. Instead, the agent improves itself by alternating between standard training and a process called offline self-distillation, which gives the model dense, step-by-step guidance using only its own previous attempts.

The Problem with Current Training

Most search-augmented agents are trained with Group Relative Policy Optimization (GRPO), which scores the entire reasoning chain by its final result. Because the reward is applied to the whole sequence, the model struggles to distinguish high-quality search queries from redundant or ineffective ones. To fix this, recent research has introduced auxiliary modules, hand-crafted reward shaping, or external supervision from much larger models. These additions make the training pipeline significantly more complex and resource-intensive, and they often tie the model to specific designs or external systems that may not always be available.
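
The credit-assignment issue described above can be illustrated with a small sketch. It assumes one common form of the group-relative advantage (reward minus the group mean, divided by the group standard deviation); the function names and the toy rewards are hypothetical and are not taken from the Search-E1 implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts for the same question.

    Assumption: population standard deviation plus a small eps; real
    implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_token_advantages(advantage, num_tokens):
    """Broadcast a single trajectory-level advantage to every token."""
    return [advantage] * num_tokens

# Toy group: four rollouts, reward 1.0 if the final answer was correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# Every token of the first rollout, including any redundant search query,
# receives exactly the same advantage.
token_advs = per_token_advantages(advs[0], num_tokens=5)
```

Because `token_advs` holds one repeated value, a useful search query and a wasted one inside the same successful rollout are reinforced equally, which is the gap that step-level guidance is meant to close.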

How Search-E1 Works

Search-E1 simplifies the process by using the model's own rollouts to create a "privileged" learning signal. After a round of standard GRPO training, the model generates multiple attempts at answering a question. The system then identifies pairs of trajectories: a "reference" trajectory that reached the correct answer efficiently, and a "student" trajectory that was less effective.
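
A minimal sketch of the pairing step follows. The text does not give the exact selection criteria, so this assumes that "efficient" means the fewest search calls among correct rollouts, and that a "student" is any rollout that is either incorrect or correct with more search calls. The `Rollout` class and `select_pairs` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rollout:
    text: str          # full trajectory: reasoning, search queries, answer
    correct: bool      # did the final answer match the gold answer?
    num_searches: int  # number of search calls issued in the trajectory

def select_pairs(rollouts: List[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Return (reference, student) pairs for one question.

    Assumed rule: the reference is the correct rollout with the fewest
    search calls; students are rollouts that failed or used more searches.
    """
    correct = [r for r in rollouts if r.correct]
    if not correct:
        return []  # no successful attempt, so no privileged signal
    reference = min(correct, key=lambda r: r.num_searches)
    students = [
        r for r in rollouts
        if r is not reference
        and (not r.correct or r.num_searches > reference.num_searches)
    ]
    return [(reference, s) for s in students]

group = [
    Rollout("a", correct=True, num_searches=2),
    Rollout("b", correct=True, num_searches=4),
    Rollout("c", correct=False, num_searches=3),
    Rollout("d", correct=True, num_searches=2),
]
pairs = select_pairs(group)
```

In this toy group, rollout `a` becomes the reference, `b` and `c` become students, and `d` is skipped because it is just as efficient as the reference.
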
The model then undergoes an offline self-distillation (OFSD) phase, in which it acts as both the teacher and the student. The teacher is given the "privileged" context of the successful reference trajectory, while the student is given only the standard question. A token-level objective aligns the student’s behavior with the teacher’s efficient reasoning steps, so the model learns to replicate the successful patterns found in its own best work. This cycle of exploration (GRPO) and consolidation (OFSD) can be repeated to steadily improve performance.
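
The token-level objective can be sketched as follows. The text does not specify the exact loss, so this assumes a per-token KL divergence from the teacher distribution (same model, conditioned on the question plus the reference trajectory) to the student distribution (conditioned on the question only), averaged over positions. The logits here are toy values and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def distillation_loss(teacher_seq, student_seq):
    """Average per-token KL over a sequence of aligned positions."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student must cover the same tokens")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)

# Toy example: two token positions, vocabulary of size three.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # privileged context
student = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # question only
loss = distillation_loss(teacher, student)
```

Unlike the single outcome reward, this loss yields a separate value at every token position, so the positions where the privileged teacher disagrees with the student contribute the most to the update.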

Results and Performance

Search-E1 was evaluated on seven question-answering benchmarks covering both single-hop and multi-hop tasks. With Qwen2.5-3B as the base model, it achieved an average Exact Match (EM) score of 0.440, outperforming existing open-source baselines. The improvements were particularly notable on multi-hop reasoning tasks, where the model must perform multiple search steps to find an answer. By focusing on the quality of individual steps rather than just the final outcome, Search-E1 allows the model to become more efficient and accurate without requiring external annotations or complex auxiliary reward systems.
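
For readers unfamiliar with the metric, Exact Match can be sketched as below. This assumes the widely used normalization for open-domain QA (lowercasing, stripping punctuation and English articles, collapsing whitespace); the benchmarks in the paper may apply slightly different rules, and the example predictions are made up.

```python
import re
import string

_PUNCT = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

def average_em(predictions, gold_lists):
    """Mean EM over a dataset of (prediction, gold answers) pairs."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return sum(scores) / len(scores)

em_single = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
em_avg = average_em(["Paris", "London"], [["paris"], ["Berlin"]])
```

An average EM of 0.440 therefore means that, averaged over the benchmarks, roughly 44% of predictions matched a gold answer exactly after normalization.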

Key Takeaways

The primary contribution of Search-E1 is its ability to achieve state-of-the-art results using a self-contained pipeline. Because the OFSD stage operates offline on existing rollouts, it does not add overhead to the core reinforcement learning loop. This modularity makes the method flexible and easy to integrate with future improvements in base models or search strategies. By relying on the contrast between its own successful and unsuccessful attempts, the model effectively "self-evolves," turning its own experience into a dense, actionable guide for better reasoning.
