Hierarchical Behaviour Spaces (HBS) is a new approach to reinforcement learning designed to help agents navigate environments that require reasoning over very long time horizons. While traditional hierarchical methods often struggle in online settings, HBS improves performance by allowing a high-level controller to dynamically combine multiple predefined reward functions. This creates a flexible "space" of possible behaviors, enabling the agent to adapt its strategy more effectively than if it were restricted to a small, fixed set of discrete options.
How HBS Works
In standard hierarchical reinforcement learning, an agent might choose between a few rigid options, each optimized for a specific, singular goal. HBS changes this by allowing the controller to specify a "linear combination" of several reward functions. For example, if an agent has reward functions for finding food, gaining experience, and exploring new areas, the controller can blend these together in varying amounts. This creates a continuous spectrum of behaviors, allowing the agent to perform complex tasks that might not be captured by any single reward function on its own. The controller and the low-level policy are trained simultaneously, with the controller operating on a compressed timescale to manage long-term goals.
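The blending idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the reward functions (food, experience, exploration) and the state dictionary are hypothetical stand-ins, and the weight vector stands for a single choice the high-level controller would make on its compressed timescale.

```python
def blended_reward(weights, reward_fns, state):
    """Scalar reward the low-level policy trains on: a weighted sum
    of the primitive reward functions, with controller-chosen weights."""
    return sum(w * r(state) for w, r in zip(weights, reward_fns))

# Hypothetical primitive rewards, one per "axis of behavior".
def food_reward(state):
    return 1.0 if state.get("found_food") else 0.0

def xp_reward(state):
    return float(state.get("xp_gained", 0))

def explore_reward(state):
    return 1.0 if state.get("new_tile") else 0.0

reward_fns = [food_reward, xp_reward, explore_reward]

# The controller would emit a new weight vector every k low-level steps;
# here we show a single blend favoring experience gain.
weights = [0.2, 0.5, 0.3]

state = {"found_food": True, "xp_gained": 2, "new_tile": False}
r = blended_reward(weights, reward_fns, state)  # 0.2*1 + 0.5*2 + 0.3*0 = 1.2
```

Because the weights are continuous, the controller is not limited to the three base behaviors: any point in the weight space defines a distinct blended objective for the low-level policy.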
Performance in NetHack
The researchers tested HBS on the NetHack Learning Environment, a notoriously difficult benchmark that requires agents to make decisions over thousands of steps to succeed. HBS outperformed existing methods, showing a superior ability to reach key milestones and navigate different branches of the game’s map. Notably, as the researchers added more reward functions to the system, HBS became more effective, demonstrating that the method successfully scales by utilizing additional "axes of behavior" to improve its decision-making.
Rethinking Hierarchy and Exploration
A key finding of this research challenges the conventional wisdom regarding why hierarchical methods work. It is often assumed that hierarchy helps agents solve problems by improving "long-term reasoning"—the ability to plan far into the future. However, the experiments with HBS suggest that its success is primarily driven by enhanced exploration. By providing the agent with a diverse, expressive space of behaviors to choose from, the agent is better equipped to discover and navigate different parts of the environment. This suggests that the primary benefit of this hierarchical structure is not necessarily better credit assignment over long periods, but rather the ability to automatically tune and apply intrinsic rewards to encourage more effective exploration.
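One way to picture the exploration claim: every weight vector in the behavior space induces a different blended objective, so sampling weights broadly yields a diverse set of behaviors to try. The sketch below draws weight vectors uniformly from the probability simplex; the sampling scheme is purely illustrative and is not the controller described in the paper.

```python
import random

def sample_simplex_weights(n, rng):
    # Uniform sample from the probability simplex via sorted uniform cut points
    # (equivalent to a flat Dirichlet distribution over n components).
    cuts = sorted(rng.random() for _ in range(n - 1))
    points = [0.0] + cuts + [1.0]
    return [points[i + 1] - points[i] for i in range(n)]

rng = random.Random(0)
# Five candidate weight vectors over three reward axes; each sums to 1
# and defines a distinct blended objective, hence a distinct behavior.
ws = [sample_simplex_weights(3, rng) for _ in range(5)]
```

In this view, the hierarchy's payoff is breadth: a learned controller replaces hand-tuned intrinsic-reward coefficients by adjusting the blend online.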