FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Key Takeaways

Can LLM agents improve decision-making through self-generated memory without gradient updates?
We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents.
All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

Paper Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

This paper introduces FORGE (Failure-Optimized Reflective Graduation and Evolution), a method that lets LLM agents improve their decision-making in complex, stochastic environments without model training or weight updates. FORGE takes a population-based approach: agents convert failed attempts into reusable "memory artifacts" (rules, examples, or both), and the best-performing agent's memory is shared across the population between stages.

How FORGE Works

The system uses a hierarchical agent structure where a "Planner" delegates tasks to specialized sub-agents. When an episode fails, a dedicated reflection agent (running on the same underlying LLM, with no distillation from a stronger model) analyzes the failed trajectory and produces a knowledge artifact. These artifacts are stored in the agent's memory and injected into future prompts.
FORGE organizes these agents into a population that evolves over several stages. After each stage, the system identifies the best-performing agent (the "champion") and broadcasts its memory to the other agents in the population. A "graduation" criterion freezes agents once they have converged, so they are excluded from further learning stages, which saves compute.
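
The staged outer loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Instance` class, the `evolve` function, the `InnerLoop` interface, and the use of a fixed score threshold as the graduation criterion are all assumptions made for the example, and the inner reflection loop is replaced by a toy stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Instance:
    """One population member: its prompt-injected memory and latest score."""
    memory: List[str] = field(default_factory=list)
    score: float = float("-inf")
    graduated: bool = False


# Hypothetical interface: run the inner reflection loop for one stage and
# return (updated memory, mean evaluation return for that stage).
InnerLoop = Callable[[List[str]], Tuple[List[str], float]]


def evolve(population: List[Instance], inner_loop: InnerLoop,
           stages: int, graduation_threshold: float) -> Instance:
    """Staged loop: learn, pick a champion, graduate or broadcast."""
    for _ in range(stages):
        active = [inst for inst in population if not inst.graduated]
        if not active:
            break  # every instance is frozen, so no further compute is spent
        for inst in active:
            inst.memory, inst.score = inner_loop(inst.memory)
        champion = max(population, key=lambda inst: inst.score)
        for inst in active:
            if inst.score >= graduation_threshold:
                # Assumed graduation rule: freeze once the score is good enough.
                inst.graduated = True
            elif inst is not champion:
                # Broadcast: overwrite memory with a copy of the champion's.
                inst.memory = list(champion.memory)
    return max(population, key=lambda inst: inst.score)


def toy_inner_loop(memory: List[str]) -> Tuple[List[str], float]:
    # Toy stand-in: each stage adds one lesson, and more lessons score better.
    updated = memory + [f"lesson {len(memory)}"]
    return updated, -100.0 + 30.0 * len(updated)


population = [Instance(), Instance(memory=["seed a", "seed b"])]
best = evolve(population, toy_inner_loop, stages=3, graduation_threshold=-20.0)
print(best.score, len(best.memory))
```

In this toy run the weaker instance inherits the champion's memory after the first stage and then builds on it, which is the effect the broadcast step is meant to have. Graduated instances are skipped in later stages, so the loop stops early once everyone is frozen.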

Comparing Memory Strategies

The researchers tested three ways to represent memory:

Rules: Textual heuristics or conditional instructions.
Examples: Few-shot demonstrations of successful interactions.
Mixed: A combination of both rules and examples.

The study found that "Examples" achieved the strongest returns for three of the four models, while "Rules" offered the best cost-reliability profile, using about 40% fewer tokens.
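
To make the three representations concrete, here is a minimal sketch of how stored artifacts might be rendered into a prompt under each mode. The `Artifact` class, the `render_memory` and `build_prompt` functions, the section headings, and the sample artifact texts are all hypothetical; the paper's actual prompt format is not given in this summary. Word count is used only as a crude proxy for token cost.

```python
from dataclasses import dataclass
from typing import List

# Which artifact kinds each (assumed) memory mode injects into the prompt.
MODES = {
    "rules": {"rule"},
    "examples": {"example"},
    "mixed": {"rule", "example"},
}


@dataclass
class Artifact:
    kind: str  # "rule" or "example"
    text: str


def render_memory(artifacts: List[Artifact], mode: str) -> str:
    """Render the memory block for one mode; empty string if nothing applies."""
    if mode not in MODES:
        raise ValueError(f"unknown memory mode: {mode!r}")
    kinds = MODES[mode]
    lines: List[str] = []
    rules = [a.text for a in artifacts if a.kind == "rule"]
    examples = [a.text for a in artifacts if a.kind == "example"]
    if "rule" in kinds and rules:
        lines.append("Lessons learned:")
        lines.extend(f"- {text}" for text in rules)
    if "example" in kinds and examples:
        lines.append("Worked examples:")
        lines.extend(examples)
    return "\n".join(lines)


def build_prompt(task: str, artifacts: List[Artifact], mode: str) -> str:
    """Prepend the rendered memory to the task prompt."""
    memory = render_memory(artifacts, mode)
    return f"{memory}\n\n{task}" if memory else task


# Made-up artifacts, for illustration only.
artifacts = [
    Artifact("rule", "If a host shows suspicious activity, analyse it first."),
    Artifact("example", "Observation: alert on host A\nAction: analyse host A"),
]
task = "Choose the next defensive action."
rules_prompt = build_prompt(task, artifacts, "rules")
mixed_prompt = build_prompt(task, artifacts, "mixed")
print(len(rules_prompt.split()), len(mixed_prompt.split()))
```

A Rules-only prompt omits the demonstrations entirely, which is one intuitive reason it can be cheaper in tokens than Examples or Mixed.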

Key Findings

The researchers evaluated FORGE across four LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, and Qwen3-235B) on CybORG CAGE-2, a stochastic network-defense environment, at a 30-step horizon against the B-line attacker. In all 12 model-representation conditions, FORGE outperformed both the zero-shot baseline and a Reflexion baseline (isolated single-stream learning): average evaluation return improved by 1.7 to 7.7 times over zero-shot and by 29% to 72% over Reflexion, and the rate of major failures (returns below -100) fell to as low as about 1%. A no-graduation ablation indicated that population broadcast carries the performance gains, while graduation primarily saves compute.
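
The two headline metrics, average evaluation return and major-failure rate, are simple to compute from per-episode returns. The sketch below assumes a plain list of episode returns and uses the "below -100" threshold quoted in the abstract; the `summarize` function and the sample numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, Sequence

# Threshold taken from the abstract's definition of a major failure.
MAJOR_FAILURE_THRESHOLD = -100.0


def summarize(returns: Sequence[float]) -> Dict[str, float]:
    """Mean episode return and the fraction of episodes below the threshold."""
    if not returns:
        raise ValueError("need at least one episode return")
    failures = sum(1 for r in returns if r < MAJOR_FAILURE_THRESHOLD)
    return {
        "mean_return": mean(returns),
        "major_failure_rate": failures / len(returns),
    }


# Made-up episode returns, not data from the paper.
zero_shot = [-150.0, -40.0, -220.0, -30.0]
with_memory = [-20.0, -35.0, -110.0, -15.0]
print(summarize(zero_shot))
print(summarize(with_memory))
```

Because the zero-shot rewards are described as heavy-tailed, reporting the failure rate alongside the mean helps show whether a gain comes from fewer catastrophic episodes or from a general shift in returns.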

Important Considerations

The study finds that weaker baseline models benefit disproportionately from FORGE, suggesting that the method may help close capability gaps rather than amplify already strong models. However, all of the evidence comes from a single scenario, the CAGE-2 B-line cyber-defense setting. The authors describe the results across model families as directional evidence only, so it remains open how well these improvements generalize to other environments and other types of tasks.