
On-line Learning in Tree MDPs by Treating Policies...

Key Takeaways

  • This paper introduces a new approach for on-line learning in Tree Markov Decision Problems (T-MDPs), a specific type of decision-making environment where every state is reachable from the start state through exactly one state-action trajectory.
  • A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory.
  • T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents.
  • We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes.
  • We show that well-known bandit algorithms, LUCB and UCB, can be applied to T-MDPs by treating each policy as an arm.
Paper Abstract

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms, LUCB and UCB, can be applied to T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can nevertheless be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a "gap term" from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.

This paper introduces a new approach for on-line learning in Tree Markov Decision Problems (T-MDPs), a specific type of decision-making environment where every state can be reached from the start through exactly one path. These problems are common in sequential games with perfect recall, such as poker or reconnaissance-style board games. The authors demonstrate that standard bandit algorithms can be effectively adapted to these problems, providing a way to learn optimal strategies against stationary opponents.

Treating Policies as Bandit Arms

The core challenge in T-MDPs is that the number of possible strategies (policies) is exponential in the number of states, making it computationally infeasible to treat every policy as an independent "arm" in a traditional bandit algorithm. The authors overcome this by designing a framework in which policies share data. Because different policies often overlap in the paths they take through the tree, the algorithm can use information gathered while playing one policy to inform the evaluation of others. As a result, the system needs only polynomial, rather than exponential, memory and per-step computation.
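A minimal sketch of this idea in Python, assuming per-edge transition counts as the shared statistic and a known return at each terminal state; the class and function names here are illustrative, not the paper's API:

```python
from collections import defaultdict

class SharedTreeStats:
    """Transition counts stored per (state, action) edge of the tree.

    Every policy whose unique trajectory passes through an edge reuses the
    same counts, so memory stays polynomial in the number of states even
    though the number of policies is exponential.
    """

    def __init__(self):
        self.visits = defaultdict(int)                     # (s, a) -> plays
        self.succ = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}

    def record_episode(self, trajectory):
        """trajectory: iterable of (state, action, next_state) triples."""
        for s, a, s2 in trajectory:
            self.visits[(s, a)] += 1
            self.succ[(s, a)][s2] += 1

def estimate_value(stats, policy, state, terminal_returns):
    """Backward-induction estimate of a policy's value from the shared counts.

    policy maps each decision state to an action; terminal_returns maps each
    terminal state to its known return (rewards are assumed known, matching
    the paper's setting). Both interfaces are assumptions of this sketch.
    """
    if state in terminal_returns:          # terminal state: return is known
        return terminal_returns[state]
    a = policy[state]
    n = stats.visits[(state, a)]
    if n == 0:
        return 0.0                         # no data on this edge yet
    # Empirical transition frequencies on (state, a) are shared by every
    # policy that plays action a in this state.
    return sum((c / n) * estimate_value(stats, policy, s2, terminal_returns)
               for s2, c in stats.succ[(state, a)].items())
```

Because the counts live on edges rather than on policies, a single recorded episode simultaneously updates the value estimate of every policy consistent with that trajectory.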

Efficient Confidence Bounds

A key technical innovation in the paper is the development of new confidence bounds for policy values. The researchers show that the value of any policy can be written as a sum of terminal-state returns, weighted by the probabilities of reaching those terminal states under the policy. By tracking how often each terminal state is reached, the algorithm can compute empirical value estimates and confidence intervals. The authors prove that these estimates are reliable by establishing that the underlying data points, despite being dependent on one another, exhibit a property called "negative cylinder dependence", which licenses the use of standard concentration inequalities to bound the estimation error.
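A hedged sketch of what such an interval might look like, using a plain Hoeffding bound; the paper's actual bounds rest on the negative cylinder dependence argument, while the function name and the r_max normalisation below are assumptions of this illustration:

```python
import math

def policy_value_bounds(leaf_counts, leaf_returns, n_episodes, delta, r_max=1.0):
    """Empirical value and a Hoeffding-style confidence interval for one policy.

    leaf_counts: {terminal state z: times z was reached} over n_episodes of
    relevant data; leaf_returns: {z: known return of z}. Illustrative only:
    negative cylinder dependence is what justifies applying a standard
    concentration inequality to these dependent leaf-visit counts.
    """
    # Value estimate: terminal returns weighted by empirical reach frequencies.
    v_hat = sum((n_z / n_episodes) * leaf_returns[z]
                for z, n_z in leaf_counts.items())
    # Hoeffding radius for returns bounded in [0, r_max], failure prob. delta.
    radius = r_max * math.sqrt(math.log(2.0 / delta) / (2.0 * n_episodes))
    return v_hat - radius, v_hat + radius

# Example: a policy whose play reached three terminal states over 100 episodes.
lo, hi = policy_value_bounds(
    leaf_counts={"z1": 60, "z2": 30, "z3": 10},
    leaf_returns={"z1": 1.0, "z2": 0.5, "z3": 0.0},
    n_episodes=100,
    delta=0.05,
)
```

In the standard UCB template, the learner would then play, in each episode, a policy maximising the upper bound; the paper's contribution is that such an index can be maintained over exponentially many policies with only polynomial memory and per-step computation.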

Performance in Complex Games

To validate their approach, the authors implemented their generalized algorithms, dubbed LUCB-T and UCB-T, on three games of increasing complexity: Kuhn Poker, Leduc Poker, and Reconnaissance Blind Tic-Tac-Toe, the last of which is a substantially larger environment with millions of states. The empirical results show that these algorithms consistently outperform existing alternatives, particularly as the scale of the problem grows. The study also notes that while the theoretical guarantees provide a strong foundation, there are observable gaps between the theoretical bounds and practical performance, particularly in the PAC (Probably Approximately Correct) setting.

Key Considerations

The approach relies on the assumption that the reward function of the T-MDP is known, with the agent only needing to estimate the transition probabilities. While this is a common and reasonable assumption for many imperfect-information games, it is a specific constraint of the current model. The authors have made their code publicly available to encourage further research and reproduction of their results, providing a scalable tool for researchers working on sequential decision-making in large-scale, tree-structured environments.
