Sample-efficient Neuro-symbolic Proximal Policy Optimization

Key Takeaways

  • Deep Reinforcement Learning (DRL) algorithms, such as Proximal Policy Optimization (PPO), are powerful tools for decision-making but often struggle in complex environments.
  • DRL algorithms typically require large amounts of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals.
  • The paper proposes a neuro-symbolic extension of PPO that transfers partial logical policy specifications, learned in easier instances, to guide learning in more challenging settings.
  • Two integrations of symbolic guidance are introduced: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization term.
Paper Abstract

Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications learned in easier instances to guide learning in more challenging settings. We introduce two integrations of symbolic guidance: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization term. We evaluate our methods on three benchmarks (OfficeWorld, WaterWorld, and DoorKey), showing consistently faster learning and higher return at convergence than PPO and a Reward Machine baseline, also under imperfect symbolic knowledge.

Deep Reinforcement Learning (DRL) algorithms, such as Proximal Policy Optimization (PPO), are powerful tools for decision-making but often struggle in complex environments. They typically require vast amounts of data to learn and frequently fail in tasks where rewards are sparse or the path to a goal involves many steps. This paper introduces a neuro-symbolic extension to PPO that improves sample efficiency by transferring logical rules—learned from simpler tasks—to guide the agent in more challenging settings. By combining neural learning with symbolic reasoning, the agent can navigate complex sub-goals more effectively than standard methods.

Integrating Symbolic Guidance

The researchers propose two ways to incorporate logical "rules of thumb" into the PPO training process. These rules are expressed as Horn clauses, which map environmental features to recommended actions.
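A minimal sketch of what such rules might look like, using hypothetical predicates and action names (the paper's actual predicates and rule-learning procedure are not reproduced here): each rule reads "if all body predicates hold in the current symbolic state, then the head action is recommended."

```python
# Sketch of Horn-clause-style action rules (hypothetical predicates/actions).
from dataclasses import dataclass

@dataclass
class HornRule:
    body: frozenset[str]   # predicates that must hold, e.g. {"has_key", "at_door"}
    head: str              # recommended action, e.g. "open_door"

    def fires(self, facts: set[str]) -> bool:
        # The rule fires when every body predicate is among the current facts.
        return self.body <= facts

RULES = [
    HornRule(frozenset({"sees_key", "not_has_key"}), "move_to_key"),
    HornRule(frozenset({"has_key", "at_door"}), "open_door"),
]

def recommended_actions(facts: set[str]) -> set[str]:
    """Actions suggested by the symbolic rules for the current facts."""
    return {r.head for r in RULES if r.fires(facts)}

# Example: the agent holds the key and stands at the door.
print(recommended_actions({"has_key", "at_door"}))  # {'open_door'}
```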

  • H-PPO-Product: This method adjusts the agent’s action selection during training. It biases the probability distribution toward actions that are logically recommended by the symbolic rules. As training progresses, the influence of these rules is gradually reduced, allowing the agent to rely more on its own learned experience (see the sketch after this list).

  • H-PPO-SymLoss: This approach modifies the optimization process itself. It adds a symbolic penalty term to the standard PPO loss function. This encourages the neural network to align its policy with a reference policy derived from the symbolic rules, effectively steering the agent toward logically sound behaviors during the learning updates.
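A minimal sketch, assuming a categorical PPO policy in PyTorch, of how these two hooks could be wired in. The geometric mixing coefficient `beta`, the uniform rule prior, the KL-style penalty, and the weight `lam` are illustrative assumptions rather than the paper's exact formulation:

```python
# Illustrative sketch (not the paper's code) of the two symbolic-guidance hooks.
# `logits` come from the policy network; `rule_mask` marks rule-recommended actions.
import torch
import torch.nn.functional as F

def product_sampling(logits, rule_mask, beta):
    """H-PPO-Product-style bias: mix the policy with a rule-derived prior at
    sampling time. `beta` in [0, 1] controls rule influence and is assumed to
    decay toward 0 over training."""
    pi = F.softmax(logits, dim=-1)
    # Rule prior: most mass on recommended actions, a little everywhere else.
    prior = torch.where(rule_mask.bool(), torch.ones_like(pi), torch.full_like(pi, 1e-3))
    prior = prior / prior.sum(dim=-1, keepdim=True)
    mixed = pi ** (1.0 - beta) * prior ** beta          # geometric "product" mix
    mixed = mixed / mixed.sum(dim=-1, keepdim=True)
    return torch.distributions.Categorical(probs=mixed)

def symbolic_loss(logits, rule_mask, ppo_loss, lam):
    """H-PPO-SymLoss-style regularizer: penalize divergence from a reference
    distribution derived from the rules, added to the usual PPO loss."""
    log_pi = F.log_softmax(logits, dim=-1)
    ref = torch.where(rule_mask.bool(), torch.ones_like(log_pi), torch.full_like(log_pi, 1e-3))
    ref = ref / ref.sum(dim=-1, keepdim=True)
    kl = F.kl_div(log_pi, ref, reduction="batchmean")   # KL(ref || pi) in PyTorch convention
    return ppo_loss + lam * kl

# Example: 4 actions, rules recommend actions 1 and 3.
logits = torch.zeros(1, 4)
mask = torch.tensor([[0, 1, 0, 1]])
dist = product_sampling(logits, mask, beta=0.8)
print(dist.probs)        # most probability mass on the recommended actions
action = dist.sample()
```

In the Product variant the bias only affects which actions are sampled during rollouts, while in the SymLoss variant the gradient of the penalty term pulls the policy toward rule-consistent behavior during the updates themselves.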

Performance and Scalability

The researchers tested their methods on three benchmarks: OfficeWorld, WaterWorld, and DoorKey. These environments are characterized by sparse rewards and long sequences of sub-goals. The results demonstrate that both H-PPO-Product and H-PPO-SymLoss consistently outperform standard PPO and a Reward Machine baseline.
The proposed methods show faster learning speeds and higher returns at convergence. Notably, the agents were able to succeed even when the transferred symbolic knowledge was imperfect or incomplete. Furthermore, the framework allows for scaling to larger, more complex versions of these tasks without needing to retune the underlying PPO hyperparameters, which is a significant advantage over traditional DRL approaches.

Key Considerations

A major strength of this approach is its ability to bridge the gap between human-readable logic and neural network performance. By using symbolic guidance as an action-level prior, the agent avoids the pitfalls of traditional reward shaping, which can be computationally expensive and sensitive to inaccurate heuristics. While the methods are designed to handle imperfect rules by decaying their influence over time, the effectiveness of the system still relies on the initial definition of logical predicates and the ability to map environmental states to these symbolic representations.
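As a hypothetical illustration of that grounding step, the mapping from environment states to predicates can be viewed as a labelling function; the predicate names and observation fields below are assumptions for illustration, not the paper's actual interface:

```python
# Hypothetical labelling function: maps a raw observation to the symbolic facts
# that the Horn-clause rules are written over. If this mapping is wrong or
# missing, the rules cannot fire usefully.
def label_state(obs: dict) -> set[str]:
    facts = set()
    if obs.get("carrying") == "key":
        facts.add("has_key")
    else:
        facts.add("not_has_key")
    if obs.get("front_cell") == "door":
        facts.add("at_door")
    if obs.get("key_visible", False):
        facts.add("sees_key")
    return facts

print(label_state({"carrying": "key", "front_cell": "door"}))  # {'has_key', 'at_door'}
```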
