FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
This paper introduces FORGE (Failure-Optimized Reflective Graduation and Evolution), a method that lets AI agents improve their decision-making in complex, stochastic environments without expensive model training or weight updates. Using a population-based approach, FORGE converts failed attempts into reusable "memory artifacts" (such as rules or examples) and shares them across a group of agents to improve overall performance.
How FORGE Works
The system uses a hierarchical agent structure where a "Planner" delegates tasks to specialized sub-agents. When an agent fails a task, a reflection mechanism analyzes the failure and creates a knowledge artifact. These artifacts are stored in the agent's memory and injected into future prompts.
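The store-and-inject step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Artifact` schema, the `AgentMemory` class, the `build_prompt` helper, and the prompt wording are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Artifact:
    """One knowledge artifact distilled from a failed episode (hypothetical schema)."""
    kind: str  # "rule" or "example"
    text: str


@dataclass
class AgentMemory:
    """Per-agent store of artifacts (hypothetical; the paper's format may differ)."""
    artifacts: List[Artifact] = field(default_factory=list)

    def add(self, artifact: Artifact) -> None:
        # Skip exact duplicates so repeated failures do not bloat the prompt.
        if artifact not in self.artifacts:
            self.artifacts.append(artifact)

    def render(self) -> str:
        # Format the stored artifacts as one line each, tagged by kind.
        return "\n".join(f"[{a.kind}] {a.text}" for a in self.artifacts)


def build_prompt(task: str, memory: AgentMemory) -> str:
    """Prepend stored artifacts to the task; return the bare task if memory is empty."""
    if not memory.artifacts:
        return task
    return f"Lessons from earlier failures:\n{memory.render()}\n\nTask: {task}"
```

In this sketch the reflection step itself (the LLM call that turns a failed trajectory into an `Artifact`) is left out; only the storage and prompt-injection side is shown.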
FORGE organizes these agents into a population that evolves over several stages. After each stage, the system identifies the best-performing agent (the "champion") and broadcasts its memory to all other agents in the population. To ensure efficiency, the system also uses a "graduation" criterion: once an agent reaches a certain level of performance, it is frozen and excluded from later stages, which saves computational resources.
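A rough sketch of the staged loop with champion broadcast and graduation might look like the following. Several details are assumptions made for illustration and are not confirmed by the summary above: the `Agent` fields, the `run_stage` callback, the choice to overwrite (rather than merge) a receiving agent's memory, and the choice to leave graduated agents' memory untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Hypothetical agent record: a name, a memory of artifact strings, and a score."""
    name: str
    memory: List[str] = field(default_factory=list)
    score: float = float("-inf")
    graduated: bool = False


def evolve(population: List[Agent],
           run_stage: Callable[[Agent], float],
           n_stages: int,
           graduation_score: float) -> Agent:
    """Run the staged loop and return the highest-scoring agent at the end."""
    for _ in range(n_stages):
        active = [a for a in population if not a.graduated]
        if not active:
            break  # everyone has graduated; nothing left to evaluate
        for agent in active:
            # run_stage plays episodes and may append reflections to agent.memory.
            agent.score = run_stage(agent)
        champion = max(population, key=lambda a: a.score)
        for agent in active:
            if agent.score >= graduation_score:
                agent.graduated = True  # frozen: skipped in all later stages
            elif agent is not champion:
                # Broadcast: overwrite with a copy of the champion's memory
                # (assumption; merging would be another reasonable choice).
                agent.memory = list(champion.memory)
    return max(population, key=lambda a: a.score)
```

Copying the champion's memory, rather than sharing one list object, keeps later reflections by one agent from silently changing another agent's memory.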
Comparing Memory Strategies
The researchers tested three ways to represent memory:
Rules: Textual heuristics or conditional instructions.
Examples: Few-shot demonstrations of successful interactions.
Mixed: A combination of both rules and examples.
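One way to picture the three representations is as a filter over the stored artifacts before they are injected into a prompt. The sketch below is illustrative only; `MemoryMode`, `select_artifacts`, and the `(kind, text)` pair layout are hypothetical names, not taken from the paper.

```python
from enum import Enum
from typing import List, Tuple


class MemoryMode(Enum):
    RULES = "rules"
    EXAMPLES = "examples"
    MIXED = "mixed"


def select_artifacts(artifacts: List[Tuple[str, str]], mode: MemoryMode) -> List[str]:
    """Keep only the artifact texts that the chosen mode allows.

    Each artifact is a (kind, text) pair with kind "rule" or "example".
    """
    allowed = {
        MemoryMode.RULES: {"rule"},
        MemoryMode.EXAMPLES: {"example"},
        MemoryMode.MIXED: {"rule", "example"},
    }[mode]
    return [text for kind, text in artifacts if kind in allowed]
```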
The study found that while "Examples" often led to the highest performance, "Rules" were the most efficient, requiring about 40% fewer tokens while maintaining high reliability.
Key Findings
The researchers evaluated FORGE across four different LLM families (Gemini, Grok, Llama, and Qwen) using the CybORG CAGE-2 cyber-defense environment. The results showed that FORGE significantly outperformed both zero-shot baselines and standard single-stream reflection methods. Specifically, FORGE improved average returns by 1.7 to 7.7 times compared to zero-shot performance and by 29% to 72% over standard reflection. Furthermore, the population-broadcast mechanism was identified as the critical driver of these gains, helping to reduce the rate of catastrophic failures to as low as 1%.
Important Considerations
The study highlights that FORGE is particularly beneficial for weaker baseline models, suggesting that the method can help bridge capability gaps rather than only enhancing already powerful models. However, these findings are based specifically on the CAGE-2 B-line cyber-defense scenario. While the results across different model families provide strong directional evidence, the researchers emphasize that further work is needed to determine how well these improvements generalize to other environments and other types of challenges.