Joint Learning of Experiential Rules and Policies for Large Language Model Agents
Large Language Model (LLM) agents often struggle to improve in complex, multi-step environments where rewards are sparse and strategies must be adjusted over time. Existing methods typically handle experience in one of two ways: storing it as external natural-language rules for the model to read, or updating the model’s internal parameters through reinforcement learning. The researchers introduce Joint Learning of Experiential Rules and Policies (JERP), a framework that integrates both approaches into a single training loop. By updating the rule pool and the model parameters together, JERP aims to keep the agent’s external guidance aligned with its evolving internal policy.
A Dual-Learning Approach
JERP treats interaction experience as a dual resource. When an agent performs a task, it retrieves a set of relevant rules from a long-term pool to guide its decision-making. Once the episode concludes, the resulting trajectory data is used for two distinct purposes: first, to optimize the model’s internal parameters using group-relative reinforcement learning, and second, to refine the rule pool itself. By comparing current performance against successful past trajectories, the system can add new insights, merge overlapping rules, or remove outdated instructions, ensuring the rule pool stays relevant as the agent learns.
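The two uses of a trajectory batch described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the advantage formula is the common group-relative normalization (reward minus group mean, divided by group standard deviation), and the names `group_relative_advantages`, `RulePool`, and the edit-operation dictionaries with `op`, `text`, `id`, and `ids` keys are hypothetical. The source only says that rules can be added, merged, or removed.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts of the same task.

    Assumption: standard group normalization; the article does not give
    the exact formula used by JERP.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


@dataclass
class RulePool:
    """Hypothetical long-term pool of natural-language rules."""
    rules: dict = field(default_factory=dict)  # rule id -> rule text
    next_id: int = 0

    def add(self, text):
        rule_id = self.next_id
        self.rules[rule_id] = text
        self.next_id += 1
        return rule_id

    def remove(self, rule_id):
        # Removing an unknown id is a no-op.
        self.rules.pop(rule_id, None)

    def merge(self, rule_ids, merged_text):
        # Replace several overlapping rules with one combined rule.
        for rule_id in rule_ids:
            self.remove(rule_id)
        return self.add(merged_text)

    def apply(self, edits):
        """Apply a list of edit operations (format is an assumption)."""
        for edit in edits:
            if edit["op"] == "add":
                self.add(edit["text"])
            elif edit["op"] == "remove":
                self.remove(edit["id"])
            elif edit["op"] == "merge":
                self.merge(edit["ids"], edit["text"])
            else:
                raise ValueError(f"unknown edit operation: {edit['op']}")
```

In a loop of this shape, the advantages would weight the policy-gradient update while the edit list, produced by reflecting on the same trajectories, would be passed to `RulePool.apply`, so both updates draw on one batch of experience.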
Keeping Rules and Policy in Sync
A primary challenge in agent training is that external rules can become "stale" if the agent’s internal capabilities change. If a model improves its reasoning, a rule that was once helpful might become redundant or even misleading. JERP addresses this by coupling the rule-update process with the policy-update process. Because the rule pool is revised based on the same trajectories used to train the model, the external guidance evolves alongside the agent’s internal logic. This allows stable, effective behaviors to be gradually "absorbed" into the model’s parameters over time, reducing the reliance on external prompts for basic tasks.
Performance in Complex Tasks
The researchers evaluated JERP on ALFWorld and WebShop, two environments known for requiring multi-step planning and navigation. According to the reported experiments, JERP consistently improves decision-making performance over methods that treat rule learning and parameter optimization as separate processes. By maintaining an inspectable, editable rule pool while simultaneously fine-tuning the model, the framework balances the need for explicit, human-readable guidance with the efficiency of internal policy optimization.
Key Considerations
While JERP shows promise, it relies on the agent’s ability to interpret and apply the retrieved rules correctly. The framework uses a score-based selection process to limit the size of the rule pool, so the model is not overwhelmed by too much information. Because the rule-update mechanism uses an LLM to generate edit operations based on contrastive reflection, the quality of the rule pool depends on the model’s ability to reflect accurately on its own successes and failures. This structured approach to rule maintenance allows the agent to build a long-term memory of effective strategies that are both useful for future tasks and aligned with the model's current skill level.
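As a rough illustration of score-based selection, the sketch below keeps only the highest-scoring rules up to a fixed capacity. The function name `prune_rules`, the `max_size` parameter, and the idea of a single numeric score per rule are assumptions made for this example; the article does not say how JERP computes or uses its scores.

```python
import heapq


def prune_rules(scored_rules, max_size):
    """Keep the `max_size` highest-scoring rules.

    scored_rules: dict mapping rule text -> numeric score (hypothetical).
    Returns a new dict and leaves the input unchanged.
    """
    if max_size < 0:
        raise ValueError("max_size must be non-negative")
    kept = heapq.nlargest(max_size, scored_rules.items(), key=lambda item: item[1])
    return dict(kept)


# Usage: a pool of three rules trimmed to a capacity of two.
pool = {
    "Check the receptacle before moving on": 0.9,
    "Always click the first search result": 0.1,
    "Compare the price with the budget before buying": 0.5,
}
trimmed = prune_rules(pool, max_size=2)
```

A hard cap like this is only one option. A threshold on the score, or a cap applied at retrieval time rather than on the pool itself, would serve the same goal of keeping the prompt short.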