Key Takeaways

  • Language models can use verifiable rewards to improve at a wide variety of reasoning tasks.
  • Both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst.
  • Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts.
  • Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline.
Paper Abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Language models often struggle to improve their reasoning capabilities without massive amounts of data and compute. Existing methods, such as updating model weights or optimizing prompts, can help, but they typically require hundreds of training samples and thousands of model rollouts to achieve meaningful progress. This paper introduces Contrastive Reflection (CORE), an approach that allows a frozen language model to learn from its own successes and failures by distilling them into concise, reusable "insights." By storing abstract strategies rather than raw reasoning traces, CORE aims to let models improve with fewer samples and fewer rollouts.

How CORE Works

CORE functions as a non-parametric learning algorithm, meaning it does not change the underlying weights of the language model. Instead, it maintains two external memory stores: one for successful past attempts (rollout memory) and one for learned strategies (insight memory).
When the model fails to solve a problem, it triggers a "contrastive reflection" step. The model compares its failed attempt against a similar, successful one from its memory. By analyzing the differences between these two traces, the model generates a short, natural-language insight—such as a specific strategy or constraint—that explains why one approach worked while the other did not. These insights are then tested; only those that demonstrably improve performance are added to the memory store.
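The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `Rollout`, `CoreMemory`, and `core_step` are hypothetical, the model calls are passed in as plain callables, and the acceptance test (retry the same problem with the candidate insight and keep it only if the retry succeeds) is an assumed stand-in for however the paper actually validates insights.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rollout:
    problem: str
    trace: str
    success: bool


@dataclass
class CoreMemory:
    rollouts: List[Rollout] = field(default_factory=list)  # successful attempts only
    insights: List[str] = field(default_factory=list)      # accepted insights only


def core_step(
    problem: str,
    memory: CoreMemory,
    solve: Callable[[str, List[str]], Rollout],
    find_similar: Callable[[str, List[Rollout]], Optional[Rollout]],
    reflect: Callable[[Rollout, Rollout], str],
) -> Rollout:
    """One learning step on one training problem (hypothetical interface)."""
    attempt = solve(problem, memory.insights)
    if attempt.success:
        memory.rollouts.append(attempt)
        return attempt

    # Failure: look for a similar successful trace to contrast against.
    reference = find_similar(problem, memory.rollouts)
    if reference is None:
        return attempt  # nothing to compare with yet

    # Contrastive reflection: describe what the successful trace did differently.
    candidate = reflect(attempt, reference)

    # Assumed acceptance test: keep the insight only if it fixes this problem.
    retry = solve(problem, memory.insights + [candidate])
    if retry.success:
        memory.insights.append(candidate)
        memory.rollouts.append(retry)
        return retry
    return attempt


# Toy stand-ins for the model calls, for illustration only.
def toy_solve(problem: str, insights: List[str]) -> Rollout:
    ok = problem == "easy" or "check the carry" in insights
    return Rollout(problem, f"trace for {problem}", ok)


def toy_similar(problem: str, rollouts: List[Rollout]) -> Optional[Rollout]:
    return rollouts[0] if rollouts else None


def toy_reflect(failed: Rollout, succeeded: Rollout) -> str:
    return "check the carry"


memory = CoreMemory()
for p in ["hard", "easy", "hard"]:
    core_step(p, memory, toy_solve, toy_similar, toy_reflect)
print(memory.insights)
```

In the toy run, the first attempt at the hard problem fails with nothing to contrast against, the easy problem seeds the rollout memory, and the second attempt at the hard problem produces an insight that passes the retry check and is stored.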

Intelligent Retrieval and Utility

A key feature of CORE is how it decides which insights to use. When the model encounters a new problem, it retrieves insights based on two factors: how semantically similar the new problem is to past experiences, and the "utility" of each insight.
The system tracks the empirical success of each insight, assigning it a utility score based on how often it has helped solve problems in the past. This allows the model to selectively apply the most relevant and effective strategies for the specific task at hand, ensuring that the context provided to the model remains compact and highly focused.
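A retrieval rule of this kind might look like the sketch below. The summary does not give the scoring formula, so everything specific here is an assumption: the `Insight` class, the Laplace-smoothed success rate used as the utility, the cosine similarity over toy embedding vectors, and the weighted sum controlled by `weight` are all illustrative choices rather than the paper's definitions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Insight:
    text: str
    embedding: Sequence[float]  # assumed: embedding of the problem that produced it
    uses: int = 0
    successes: int = 0

    def utility(self) -> float:
        # Assumed utility: smoothed success rate, 0.5 for an untested insight.
        return (self.successes + 1) / (self.uses + 2)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query: Sequence[float], insights: List[Insight], k: int = 2, weight: float = 0.5
) -> List[Insight]:
    """Return the top-k insights by a blend of similarity and utility."""

    def score(insight: Insight) -> float:
        similarity = cosine(query, insight.embedding)
        return weight * similarity + (1 - weight) * insight.utility()

    return sorted(insights, key=score, reverse=True)[:k]


def record_outcome(insight: Insight, solved: bool) -> None:
    """Update the empirical track record after an insight was used."""
    insight.uses += 1
    insight.successes += int(solved)


# Toy example: two insights close to the query, one unrelated and untested.
a = Insight("a", [1.0, 0.0], uses=10, successes=9)
b = Insight("b", [1.0, 0.0], uses=10, successes=1)
c = Insight("c", [0.0, 1.0])
top = retrieve([1.0, 0.0], [c, b, a], k=2)
print([i.text for i in top])
```

Capping retrieval at `k` insights is what keeps the prompt short in this sketch: only the highest-scoring insights are placed in the model's context.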

Performance and Efficiency

Across four reasoning tasks, including logic puzzles, planning, and arithmetic, CORE improved more rapidly than both parametric approaches (like GRPO) and other non-parametric methods (like GEPA, episodic RAG, and MemRL) while using fewer rollouts. Under fixed rollout budgets, it achieved performance gains comparable to or greater than each baseline.
The results suggest that CORE is particularly effective in data-constrained environments: these gains held with as few as five training samples. CORE is also more context-efficient than the non-parametric baselines, requiring fewer prompt tokens. Because the learned knowledge is stored as interpretable, natural-language insights rather than opaque parameter updates, the process is not only more efficient but also more transparent, allowing researchers to inspect exactly what the model has learned.
