
Paper Abstract

For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typically separated two uses of such experience: keeping it outside the model as natural-language rules for later prompting, or using trajectories and feedback to update the model parameters. The former is easy to interpret but can fall out of sync with the evolving policy; the latter improves the policy more broadly but provides only limited correction for local mistakes in sparse-reward settings. We present Joint Learning of Experiential Rules and Policies for LLM Agents (JERP), which updates a long-term experiential-rule pool and the policy from the same interaction trajectories. At decision time, JERP retrieves task-relevant rules and conditions the agent on them together with the interaction history. After each episode, it uses the collected trajectories both to optimize the policy and to revise the rule pool by comparing current rollouts with reference successful trajectories. This coupling keeps the rule pool aligned with the evolving policy while allowing stable and effective behaviors to be gradually absorbed into the model itself. Experiments on AlfWorld and WebShop show that JERP yields consistent gains in decision performance for complex interactive tasks.

Joint Learning of Experiential Rules and Policies for Large Language Model Agents
Large Language Model (LLM) agents often struggle to improve in complex, multi-step environments where rewards are sparse and strategies must be adjusted over time. Existing methods typically handle experience in one of two ways: storing it as external natural-language rules that are supplied to the model in its prompt, or using it to update the model’s parameters through reinforcement learning. The researchers introduce Joint Learning of Experiential Rules and Policies (JERP), a framework that folds both approaches into a single training loop. Because the rule pool and the model parameters are updated from the same interaction trajectories, the agent’s external guidance stays aligned with its evolving policy.

A Dual-Learning Approach

JERP treats interaction experience as a dual resource. When an agent performs a task, it retrieves a set of relevant rules from a long-term pool and is conditioned on them together with the interaction history. Once the episode concludes, the collected trajectories serve two distinct purposes: first, to optimize the model’s parameters using group-relative reinforcement learning, and second, to refine the rule pool itself. By comparing current rollouts with reference successful trajectories, the system can add new insights, merge overlapping rules, or remove outdated instructions, so the rule pool stays relevant as the agent learns.
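The decision-time step described above (retrieve relevant rules, then condition the agent on them plus the history) can be sketched roughly as follows. This is only an illustration: the `Rule` class, the word-overlap ranking in `retrieve_rules`, and the prompt layout in `build_prompt` are all hypothetical stand-ins, since the summary does not say which retriever or prompt format JERP actually uses.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Rule:
    text: str  # one natural-language rule from the long-term pool


def words(s: str) -> Set[str]:
    """Lowercased whitespace tokens; a deliberately crude tokenizer."""
    return set(s.lower().split())


def retrieve_rules(pool: List[Rule], task: str, k: int = 2) -> List[Rule]:
    """Return the k rules sharing the most words with the task.

    Word overlap is an assumed placeholder for the real retriever.
    Ties keep their original pool order because sorted() is stable.
    """
    task_words = words(task)
    ranked = sorted(pool, key=lambda r: len(words(r.text) & task_words), reverse=True)
    return ranked[:k]


def build_prompt(task: str, rules: List[Rule], history: List[str]) -> str:
    """Condition the agent on the task, the retrieved rules, and the history."""
    lines = ["Task: " + task, "Rules:"]
    lines += ["- " + r.text for r in rules]
    lines.append("History:")
    lines += history
    return "\n".join(lines)


pool = [
    Rule("clean the mug at the sink before placing it"),
    Rule("compare prices before buying"),
    Rule("open the cabinet before putting objects in"),
]
task = "put a clean mug in the cabinet"
print(build_prompt(task, retrieve_rules(pool, task), ["> go to sink", "You see a mug."]))
```

In practice an embedding-based retriever would likely replace the overlap count, but the shape of the step (rank, truncate, prepend to the context) would stay the same.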

Keeping Rules and Policy in Sync

A primary challenge in agent training is that external rules can become "stale" if the agent’s internal capabilities change. If a model improves its reasoning, a rule that was once helpful might become redundant or even misleading. JERP addresses this by coupling the rule-update process with the policy-update process. Because the rule pool is revised based on the same trajectories used to train the model, the external guidance evolves alongside the agent’s internal logic. This allows stable, effective behaviors to be gradually "absorbed" into the model’s parameters over time, reducing the reliance on external prompts for basic tasks.
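To make the coupling concrete, the sketch below derives both learning signals from one group of rollouts for the same task. It rests on several assumptions: the summary only says "group-relative reinforcement learning", so the mean/standard-deviation normalization here is a GRPO-style guess rather than the paper's stated formula, and the trajectory dictionaries and the `joint_signals` helper are hypothetical names introduced for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each rollout relative to its own group (assumed GRPO-style).

    With sparse rewards this turns a 0/1 outcome into a signed signal;
    if every rollout gets the same reward, all advantages are zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_signals(
    group: List[Dict], reference: List[str]
) -> Tuple[List[float], List[Tuple[List[str], List[str]]]]:
    """Derive both update signals from the SAME batch of rollouts.

    Returns per-rollout advantages (for the policy update) and
    (rollout, reference) pairs (for contrastive rule revision).
    """
    advantages = group_relative_advantages([t["reward"] for t in group])
    contrast_pairs = [(t["actions"], reference) for t in group]
    return advantages, contrast_pairs


group = [
    {"actions": ["go to sink", "clean mug", "put mug in cabinet"], "reward": 1.0},
    {"actions": ["go to cabinet", "put mug in cabinet"], "reward": 0.0},
]
reference = ["go to sink", "clean mug", "go to cabinet", "put mug in cabinet"]
advantages, pairs = joint_signals(group, reference)
print(advantages, len(pairs))
```

Since both outputs come from the same batch, the rule revision always sees the behavior of the policy that is currently being trained, which is the alignment property this section describes.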

Performance in Complex Tasks

The researchers evaluated JERP on AlfWorld, a text-based household-task environment, and WebShop, a simulated online-shopping environment, both of which require multi-step decision-making. The experiments show consistent gains in decision performance over methods that treat rule learning and parameter optimization as separate processes. The framework keeps an inspectable, editable rule pool while the model is fine-tuned, so it retains explicit, human-readable guidance without giving up the broader improvement that policy optimization provides.

Key Considerations

JERP depends on the agent’s ability to interpret and apply the retrieved rules correctly. The framework uses a score-based selection process to bound the size of the rule pool, so that the model is not overloaded with guidance. Because the rule-update mechanism uses an LLM to generate edit operations through contrastive reflection, the quality of the rule pool depends on how accurately that model reflects on its own successes and failures. Within those limits, this structured maintenance lets the agent build a long-term memory of strategies that are useful for future tasks and matched to the model’s current skill level.
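The rule-maintenance mechanics mentioned above (add, merge, and remove edits followed by score-based selection) might look something like the sketch below. Everything in it is an assumption made for illustration: the `RulePool` class, the dictionary format of the edit operations, the choice to give a merged rule the higher of its parents' scores, and the top-`max_size` pruning rule are not taken from the paper. In JERP the edit operations are generated by an LLM; here they are written by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RulePool:
    max_size: int = 3                        # hypothetical cap on pool size
    rules: Dict[int, dict] = field(default_factory=dict)
    next_id: int = 0

    def add(self, text: str, score: float = 0.0) -> int:
        rid = self.next_id
        self.rules[rid] = {"text": text, "score": score}
        self.next_id += 1
        return rid

    def remove(self, rid: int) -> None:
        self.rules.pop(rid, None)            # removing a missing id is a no-op

    def merge(self, rid_a: int, rid_b: int, merged_text: str) -> int:
        a = self.rules.pop(rid_a)
        b = self.rules.pop(rid_b)
        # Assumed policy: the merged rule inherits the higher score.
        return self.add(merged_text, max(a["score"], b["score"]))

    def prune(self) -> None:
        """Keep only the max_size highest-scoring rules."""
        keep = sorted(self.rules, key=lambda rid: self.rules[rid]["score"], reverse=True)
        keep = keep[: self.max_size]
        self.rules = {rid: self.rules[rid] for rid in sorted(keep)}

    def apply(self, ops: List[dict]) -> None:
        """Apply a batch of edit operations, then enforce the size cap."""
        for op in ops:
            if op["op"] == "add":
                self.add(op["text"], op.get("score", 0.0))
            elif op["op"] == "remove":
                self.remove(op["id"])
            elif op["op"] == "merge":
                self.merge(op["ids"][0], op["ids"][1], op["text"])
            else:
                raise ValueError("unknown edit operation: " + str(op["op"]))
        self.prune()
```

A usage example under the same assumptions: with `max_size=2`, merging two rules and adding a low-scoring third leaves only the two highest-scoring entries after `apply` returns.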
