EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Current methods for improving language models after their initial training often rely on external guidance, such as human feedback, proprietary AI models, or rigid reward systems. Because the model cannot easily surpass the quality of its supervisors, these approaches impose a performance ceiling. This paper introduces EvoLM, a post-training framework that lets a language model improve itself by tapping its own internal evaluative knowledge, removing the need for human annotation or external supervision.
The Self-Evolving Mechanism
EvoLM functions by training two distinct capabilities within a single language model in an alternating cycle. First, the model acts as a "rubric generator," creating specific evaluation criteria for a given task. These rubrics are optimized to help a small, frozen judge model effectively distinguish between high-quality and low-quality responses. Second, the model acts as a policy that is trained using the scores derived from these rubrics as a reward signal. By using temporal contrast—comparing the model's current outputs against its own earlier versions—the system generates its own preference signals, allowing it to refine its performance autonomously.
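To make the alternating cycle concrete, below is a minimal, self-contained Python sketch of the loop's structure. It is a toy under stated assumptions, not the paper's implementation: `ToyModel`, `judge_score`, the scalar `skill`, and the margin-based update are hypothetical stand-ins, and the rubric phase is reduced to a static stub, whereas in EvoLM the rubrics are themselves optimized so the frozen judge can separate high-quality from low-quality responses.

```python
# Toy sketch of an EvoLM-style alternating loop. Assumption-heavy:
# ToyModel, judge_score, and the update rule below are hypothetical
# stand-ins, not the paper's actual components.

import copy
import random

class ToyModel:
    """Stand-in for one language model playing two roles."""

    def __init__(self, skill=0.0):
        # A single scalar "skill" substitutes for model weights.
        self.skill = skill

    def generate_rubric(self, task):
        # Rubric-generator role. In EvoLM these criteria are themselves
        # optimized so the frozen judge separates good from bad responses;
        # here the phase is reduced to a static stub for brevity.
        return f"Criteria for '{task}': relevance, correctness, clarity"

    def generate_response(self, task):
        # Policy role: response quality loosely tracks current skill.
        return {"text": f"response to '{task}'",
                "quality": self.skill + random.gauss(0, 0.1)}

def judge_score(rubric, response):
    # Small frozen judge: scores a response against the rubric.
    # The toy judge just reads the latent quality; a real judge would
    # grade the text against each criterion in the rubric.
    return response["quality"]

def evolve(model, tasks, rounds=5, lr=0.5):
    for r in range(rounds):
        # Phase 1: generate a rubric per task.
        rubrics = {t: model.generate_rubric(t) for t in tasks}

        # Phase 2: temporal contrast. Freeze a snapshot of the model's
        # earlier self and score both versions' outputs with the judge.
        past = copy.deepcopy(model)
        for t in tasks:
            current = model.generate_response(t)
            earlier = past.generate_response(t)
            margin = (judge_score(rubrics[t], current)
                      - judge_score(rubrics[t], earlier))
            # Toy "policy update": nudge skill along the preference
            # margin; a real system would run preference optimization
            # on the text pairs instead.
            model.skill += lr * margin
        print(f"round {r}: skill = {model.skill:.3f}")

if __name__ == "__main__":
    random.seed(0)
    evolve(ToyModel(), tasks=["summarize a paper", "explain recursion"])
```

The two structural points the sketch preserves are that a single model plays both the rubric-generator and policy roles, and that the preference signal comes from temporal contrast against a frozen earlier snapshot of itself rather than from an external supervisor.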
Performance and Results
The researchers tested EvoLM using a Qwen3-8B model, with significant improvements over existing baselines. The model's self-generated rubrics outperformed GPT-4.1 on the RewardBench-2 dataset by 25.7%. Furthermore, the policy trained with this method achieved a 69.3% average on the OLMo3-Adapt suite, surpassing policies guided by GPT-4.1-prompted rubrics by 3.9% and outperforming SkyWork-RM, a state-of-the-art 8B reward model, by 16%.
Why This Matters
The primary contribution of EvoLM is the demonstration that language models can effectively "self-supervise" their own improvement. By structuring evaluative capacity into explicit, co-evolving rubrics, the model moves beyond the limitations of human-dependent training. This approach suggests that as models grow more capable, their ability to evaluate and improve their own outputs can scale alongside them, potentially unlocking new levels of performance without the bottlenecks of traditional, externally supervised training methods.