CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Language models often struggle to improve their reasoning without large amounts of data and compute. Existing methods, such as updating model weights or optimizing prompts, can help, but they are often expensive, slow, or require thousands of attempts to make meaningful progress. This paper introduces Contrastive Reflection (CORE), an approach that lets a frozen language model learn from its own successes and failures by distilling them into concise, reusable "insights." Because these insights capture abstract strategies rather than raw examples, CORE aims to learn from fewer problems and fewer attempts.
How CORE Works
CORE functions as a non-parametric learning algorithm, meaning it does not change the underlying weights of the language model. Instead, it maintains two external memory stores: one for successful past attempts (rollout memory) and one for learned strategies (insight memory).
When the model fails to solve a problem, it triggers a "contrastive reflection" step. The model compares its failed attempt against a similar, successful one from its memory. By analyzing the differences between these two traces, the model generates a short, natural-language insight—such as a specific strategy or constraint—that explains why one approach worked while the other did not. These insights are then tested; only those that demonstrably improve performance are added to the memory store.
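The reflection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Memory`, `learn_from_attempt`, `reflect`, and `improves` are hypothetical, the word-overlap similarity is a toy stand-in for whatever semantic similarity the authors use, and the language-model call and validation test are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rollout:
    problem: str
    trace: str


@dataclass
class Insight:
    text: str


@dataclass
class Memory:
    rollouts: List[Rollout] = field(default_factory=list)  # successful attempts
    insights: List[Insight] = field(default_factory=list)  # validated strategies


def word_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity; an assumption standing in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def nearest_success(memory: Memory, problem: str) -> Optional[Rollout]:
    """Return the stored successful rollout whose problem is most similar."""
    if not memory.rollouts:
        return None
    return max(memory.rollouts, key=lambda r: word_overlap(r.problem, problem))


def learn_from_attempt(
    memory: Memory,
    problem: str,
    trace: str,
    solved: bool,
    reflect: Callable[[str, str], str],   # hypothetical: (failed, successful) -> insight text
    improves: Callable[[Insight], bool],  # hypothetical: validation check for a candidate
) -> Optional[Insight]:
    """Store successes; on failure, contrast with a similar success and keep validated insights."""
    if solved:
        memory.rollouts.append(Rollout(problem, trace))
        return None
    reference = nearest_success(memory, problem)
    if reference is None:
        return None  # nothing to contrast against yet
    candidate = Insight(reflect(trace, reference.trace))
    if improves(candidate):
        memory.insights.append(candidate)
        return candidate
    return None  # candidate rejected; the frozen model itself is never modified
```

In a real system, `reflect` would prompt the frozen model with both traces and `improves` would re-run held-out problems with the candidate insight in context. Both are left abstract here because the summary does not specify them.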
Intelligent Retrieval and Utility
A key feature of CORE is how it decides which insights to use. When the model encounters a new problem, it does not pull in stored insights indiscriminately. It retrieves them based on two factors: how semantically similar the new problem is to past experiences, and the "utility" of each insight.
The system tracks the empirical success of each insight, assigning it a utility score based on how often it has helped solve problems in the past. This allows the model to selectively apply the most relevant and effective strategies for the specific task at hand, ensuring that the context provided to the model remains compact and highly focused.
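One way to picture this retrieval is the sketch below. It rests on assumptions not stated in the summary: similarity and utility are combined as a weighted sum with a hypothetical weight `alpha`, utility is a Laplace-smoothed success rate, and similarity is a toy word-overlap measure rather than a real embedding model. The actual scoring rule in CORE may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredInsight:
    text: str
    source_problem: str  # problem the insight was derived from
    uses: int = 0        # how many times it was placed in context
    successes: int = 0   # how many of those attempts succeeded

    @property
    def utility(self) -> float:
        # Assumption: smoothed success rate, so unused insights start at 0.5.
        return (self.successes + 1) / (self.uses + 2)


def word_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity; an assumption standing in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(insights: List[ScoredInsight], problem: str,
             k: int = 2, alpha: float = 0.5) -> List[ScoredInsight]:
    """Return the top-k insights by a weighted mix of similarity and utility."""
    def score(ins: ScoredInsight) -> float:
        similarity = word_overlap(ins.source_problem, problem)
        return alpha * similarity + (1 - alpha) * ins.utility
    return sorted(insights, key=score, reverse=True)[:k]


def record_outcome(used: List[ScoredInsight], solved: bool) -> None:
    """Update usage statistics after an attempt that had these insights in context."""
    for ins in used:
        ins.uses += 1
        ins.successes += int(solved)
```

Keeping `k` small reflects the point made above that the context handed to the model should stay compact. The utility counters let insights that have not been helping fall in the ranking over time.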
Performance and Efficiency
Across four reasoning tasks, including logic puzzles, planning, and arithmetic, CORE outperformed existing methods. It consistently achieved higher performance while using fewer training problems and fewer total attempts (rollouts) than both parametric approaches (like GRPO) and other non-parametric methods (like GEPA, episodic RAG, and MemRL).
The results suggest that CORE is particularly effective in data-constrained environments. Even with as few as five training samples, the model was able to show substantial gains. Because the learned knowledge is stored as interpretable, natural-language insights rather than opaque parameter updates, the process is not only more efficient but also more transparent, allowing researchers to inspect exactly what the model has learned.