Key Takeaways

  • Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation.
  • We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner.
  • PACT invokes the SLM asynchronously to generate and validate candidate action plans.
  • Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it.

Paper Abstract

Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning
Reinforcement Learning (RL) agents are excellent at reacting to familiar situations but often fail when they encounter novel environments because they lack the ability to "think ahead." This paper introduces PACT (Plan, Align, Commit, Think), a hybrid architecture that pairs a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. By combining these two approaches, the system can handle complex, long-horizon tasks that neither component could solve effectively on its own.

How PACT Works

PACT draws on dual-process theory, which distinguishes fast, automatic reactions from slow, goal-directed reasoning. The agent uses a standard RL policy for routine decision-making. When the agent’s uncertainty about a situation exceeds a set threshold, it triggers the SLM to generate a multi-step plan.
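
The trigger can be illustrated with a minimal sketch. The summary does not say how uncertainty is measured, so this example assumes it is the Shannon entropy of the policy's action distribution; the function names and the threshold value are hypothetical.

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

def should_deliberate(action_probs, threshold):
    """Return True when the policy looks uncertain enough to call the planner.

    Assumption: uncertainty is the entropy of the policy's action
    distribution; the source does not specify the measure or the threshold.
    """
    return policy_entropy(action_probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]  # policy strongly prefers one action
uncertain = [0.25, 0.25, 0.25, 0.25]  # policy has no preference

THRESHOLD = 1.0  # hypothetical value in nats; the maximum for 4 actions is ln(4)

print(should_deliberate(confident, THRESHOLD))  # reactive policy keeps control
print(should_deliberate(uncertain, THRESHOLD))  # planner would be invoked
```

With four actions the entropy ranges from 0 to ln(4), so any threshold in that interval separates confident states from uncertain ones; where to set it is a tuning decision.
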
Planning then proceeds in three stages:

  • Plan Generation: The SLM proposes a sequence of actions to reach the goal.

  • Verification: The system simulates the plan to ensure it is safe, feasible, and complete. If the plan fails these checks, the SLM is prompted to try again.

  • Commitment: Once a valid plan is verified, the agent "commits" to it, bypassing the reactive RL policy and executing the plan directly until completion or until a significant change in the environment requires a new plan.
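
The three stages above can be sketched as a generate, verify, retry loop. This is an illustration under stated assumptions, not the authors' implementation: the SLM is replaced by a stub that returns canned plans, moves are deterministic, and the names `verify`, `plan_and_commit`, and `stub_planner` are hypothetical. Treating a move blocked by a wall as infeasible is also an assumption.

```python
# A 4x4 map in FrozenLake letter notation: S start, F frozen, H hole, G goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
MOVES = {"LEFT": (0, -1), "DOWN": (1, 0), "RIGHT": (0, 1), "UP": (-1, 0)}

def step(pos, action):
    """Deterministic move; walking into a wall leaves the agent in place."""
    r, c = pos
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]):
        return (nr, nc)
    return pos

def verify(plan, start):
    """Simulate the plan and check that it is feasible, safe, and complete."""
    pos = start
    for action in plan:
        if action not in MOVES:
            return False, "infeasible: unknown action"
        nxt = step(pos, action)
        if nxt == pos:
            return False, "infeasible: move blocked by wall"
        if GRID[nxt[0]][nxt[1]] == "H":
            return False, "unsafe: plan enters a hole"
        pos = nxt
    if GRID[pos[0]][pos[1]] != "G":
        return False, "incomplete: plan stops short of the goal"
    return True, "ok"

def plan_and_commit(propose, start, max_attempts=3):
    """Ask the planner for plans until one verifies, feeding back failures."""
    feedback = None
    for _ in range(max_attempts):
        plan = propose(start, feedback)
        ok, feedback = verify(plan, start)
        if ok:
            return plan  # committed plan, executed in place of the RL policy
    return None          # no valid plan: fall back to the reactive policy

# Stub standing in for the SLM: the first proposal is unsafe, the second valid.
_candidates = iter([
    ["RIGHT", "DOWN", "DOWN", "DOWN", "RIGHT", "RIGHT"],
    ["DOWN", "DOWN", "RIGHT", "DOWN", "RIGHT", "RIGHT"],
])

def stub_planner(start, feedback):
    return next(_candidates)

committed = plan_and_commit(stub_planner, start=(0, 0))
print(committed)
```

Returning the failure reason from `verify` lets the retry prompt tell the planner what went wrong, which is one plausible way to implement "prompted to try again".
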

Bridging the Gap with Alignment

A key innovation in PACT is the alignment module. Because the environment might change or include stochastic (random) elements, the agent’s actual position may drift from the path proposed by the SLM. Instead of abandoning the plan entirely or continuing blindly, the alignment module guides the agent back to a "waypoint" in the original plan. This allows the agent to remain goal-directed while being flexible enough to handle unexpected transitions.
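
One way to picture the alignment step is the sketch below. The summary does not describe how the module selects a waypoint or routes back to it, so this example substitutes a simple technique: a breadth-first search for the shortest hole-free route to any waypoint the plan has not yet passed. The names `plan_waypoints` and `realign` and the example map are hypothetical.

```python
from collections import deque

GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
MOVES = {"LEFT": (0, -1), "DOWN": (1, 0), "RIGHT": (0, 1), "UP": (-1, 0)}

def plan_waypoints(start, plan):
    """Cells the plan expects to visit, in order, under deterministic moves."""
    cells = [start]
    for action in plan:
        dr, dc = MOVES[action]
        r, c = cells[-1]
        cells.append((r + dr, c + dc))
    return cells

def realign(pos, waypoints, next_index):
    """Breadth-first search from `pos` to the nearest remaining waypoint.

    Returns (recovery_actions, waypoint_index), or (None, None) when no
    hole-free route exists and a fresh plan would be needed.
    """
    targets = {cell: i for i, cell in enumerate(waypoints) if i >= next_index}
    queue = deque([(pos, [])])
    seen = {pos}
    while queue:
        cell, path = queue.popleft()
        if cell in targets:
            return path, targets[cell]
        for action, (dr, dc) in MOVES.items():
            nr, nc = cell[0] + dr, cell[1] + dc
            inside = 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
            if inside and GRID[nr][nc] != "H" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [action]))
    return None, None

plan = ["DOWN", "DOWN", "RIGHT", "DOWN", "RIGHT", "RIGHT"]
waypoints = plan_waypoints((0, 0), plan)

# The agent stood on waypoints[3] = (2, 1) and tried DOWN, but slipped to (2, 2).
recovery, resume_at = realign((2, 2), waypoints, next_index=4)
print(recovery, plan[resume_at:])
```

Here the agent rejoins the plan one waypoint ahead of where it slipped, so the committed plan is resumed instead of discarded. On a slippery map the recovery actions can themselves slip, so a real system would presumably re-check its position after each step.
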

Performance and Results

The researchers tested PACT on FrozenLake, a classic environment for evaluating navigation and decision-making. Across three configurations of increasing difficulty (simple maps, maps with slippery, unpredictable surfaces, and larger, more complex layouts), PACT consistently outperformed standard RL agents and other language-model-based approaches.
Notably, in the most difficult 8x8 grid, PACT achieved a perfect success rate. The results suggest that the effectiveness of the system does not come from how often the language model is consulted, but rather from the ability to structure, verify, and commit to a coherent plan.

Considerations and Future Directions

While PACT demonstrates that smaller, 2-billion-parameter models can be highly effective planners, the current implementation relies on a hand-crafted transition function to simulate and verify plans. Future research aims to replace this with a learned dynamics model, which would allow the system to function in environments where the rules are not explicitly known. Additionally, while PACT was tested on direct-goal tasks, the authors believe the architecture is well-suited for more complex tasks involving sub-goal decomposition, which remains a focus for future development.
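
To make the idea of a learned dynamics model concrete, the sketch below shows one very simple possibility: a count-based tabular model estimated from observed transitions. The summary only says "a learned dynamics model", so this is not the authors' proposal; the class `TabularDynamics` and the logged transitions are hypothetical.

```python
from collections import Counter, defaultdict

class TabularDynamics:
    """Count-based estimate of P(next_state | state, action)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, state, action, next_state):
        """Record one observed transition."""
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Most frequently observed successor, or None if never tried."""
        seen = self.counts.get((state, action))
        if not seen:
            return None
        return seen.most_common(1)[0][0]

    def prob(self, state, action, next_state):
        """Empirical probability of reaching next_state."""
        seen = self.counts.get((state, action))
        if not seen:
            return 0.0
        return seen[next_state] / sum(seen.values())

model = TabularDynamics()
# Hypothetical logged outcomes of taking RIGHT in state 0 on a slippery map.
for nxt in [1, 1, 4, 1, 0]:
    model.update(0, "RIGHT", nxt)

print(model.predict(0, "RIGHT"), model.prob(0, "RIGHT", 1))
```

A plan verifier could call `predict` where it currently calls the hand-crafted transition function, and could use `prob` to reject plans whose steps are too unreliable. A tabular model only covers state-action pairs it has seen, which is why `predict` returns `None` for untried pairs.
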
