
On-line Learning in Tree MDPs by Treating Policies...

Key Takeaways

  • This paper introduces a new approach for on-line learning in Tree Markov Decision Problems (T-MDPs), a specific type of decision-making environment where every state is reachable from the start state through exactly one state-action trajectory.
  • A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory.
  • T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents.
  • We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes.
  • We show that well-known bandit algorithms, LUCB and UCB, can be applied to T-MDPs by treating each policy as an arm.
Paper Abstract

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms, LUCB and UCB, can be applied to T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can nevertheless be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a "gap term" from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.

This paper introduces a new approach for on-line learning in Tree Markov Decision Problems (T-MDPs), a specific type of decision-making environment where every state can be reached from the start through exactly one path. These problems are common in sequential games with perfect recall, such as poker or reconnaissance-style board games. The authors demonstrate that standard bandit algorithms can be effectively adapted to these problems, providing a way to learn optimal strategies against stationary opponents.

Treating Policies as Bandit Arms

The core challenge in T-MDPs is that the number of possible strategies (policies) is exponential in the number of states, making it computationally infeasible to treat every policy as an independent "arm" in a traditional bandit algorithm. The authors overcome this by designing a framework in which policies share data. Because different policies often overlap in the paths they take through the tree, the algorithm can use information gathered while playing one policy to inform the evaluation of others. As a result, the system needs only polynomial, rather than exponential, memory and per-step computation.
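A minimal sketch of this idea in Python, assuming per-edge transition counts as the shared statistic and a known return at each terminal state; the class and function names here are illustrative, not the paper's API:

```python
from collections import defaultdict

class SharedTreeStats:
    """Transition counts stored per (state, action) edge of the tree.

    Every policy whose unique trajectory passes through an edge reuses the
    same counts, so memory stays polynomial in the number of states even
    though the number of policies is exponential.
    """

    def __init__(self):
        self.visits = defaultdict(int)                     # (s, a) -> plays
        self.succ = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}

    def record_episode(self, trajectory):
        """trajectory: iterable of (state, action, next_state) triples."""
        for s, a, s2 in trajectory:
            self.visits[(s, a)] += 1
            self.succ[(s, a)][s2] += 1

def estimate_value(stats, policy, state, terminal_returns):
    """Backward-induction estimate of a policy's value from the shared counts.

    policy maps each decision state to an action; terminal_returns maps each
    terminal state to its known return (rewards are assumed known, matching
    the paper's setting). Both interfaces are assumptions of this sketch.
    """
    if state in terminal_returns:          # terminal state: return is known
        return terminal_returns[state]
    a = policy[state]
    n = stats.visits[(state, a)]
    if n == 0:
        return 0.0                         # no data on this edge yet
    # Empirical transition frequencies on (state, a) are shared by every
    # policy that plays action a in this state.
    return sum((c / n) * estimate_value(stats, policy, s2, terminal_returns)
               for s2, c in stats.succ[(state, a)].items())
```

Because the counts live on edges rather than on policies, a single recorded episode simultaneously updates the value estimate of every policy consistent with that trajectory.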

Efficient Confidence Bounds

A key technical innovation in the paper is the development of new confidence bounds for policy values. The researchers show that the value of any policy can be written as a sum of terminal-state returns, weighted by the probabilities of reaching those terminal states under the policy. By tracking how often each terminal state is reached, the algorithm can compute empirical value estimates and confidence intervals. The authors prove that these estimates are reliable by establishing that the underlying data points, despite being dependent on one another, exhibit a property called "negative cylinder dependence", which licenses the use of standard concentration inequalities to bound the estimation error.
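A hedged sketch of what such an interval might look like, using a plain Hoeffding bound; the paper's actual bounds rest on the negative cylinder dependence argument, while the function name and the r_max normalisation below are assumptions of this illustration:

```python
import math

def policy_value_bounds(leaf_counts, leaf_returns, n_episodes, delta, r_max=1.0):
    """Empirical value and a Hoeffding-style confidence interval for one policy.

    leaf_counts: {terminal state z: times z was reached} over n_episodes of
    relevant data; leaf_returns: {z: known return of z}. Illustrative only:
    negative cylinder dependence is what justifies applying a standard
    concentration inequality to these dependent leaf-visit counts.
    """
    # Value estimate: terminal returns weighted by empirical reach frequencies.
    v_hat = sum((n_z / n_episodes) * leaf_returns[z]
                for z, n_z in leaf_counts.items())
    # Hoeffding radius for returns bounded in [0, r_max], failure prob. delta.
    radius = r_max * math.sqrt(math.log(2.0 / delta) / (2.0 * n_episodes))
    return v_hat - radius, v_hat + radius

# Example: a policy whose play reached three terminal states over 100 episodes.
lo, hi = policy_value_bounds(
    leaf_counts={"z1": 60, "z2": 30, "z3": 10},
    leaf_returns={"z1": 1.0, "z2": 0.5, "z3": 0.0},
    n_episodes=100,
    delta=0.05,
)
```

In the standard UCB template, the learner would then play, in each episode, a policy maximising the upper bound; the paper's contribution is that such an index can be maintained over exponentially many policies with only polynomial memory and per-step computation.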

Performance in Complex Games

To validate their approach, the authors implemented their generalized algorithms, dubbed LUCB-T and UCB-T, on three games of increasing complexity: Kuhn Poker, Leduc Poker, and Reconnaissance Blind Tic-Tac-Toe, the last of which is a substantially larger environment with millions of states. The empirical results show that these algorithms consistently outperform existing alternatives, particularly as the scale of the problem grows. The study also notes that while the theoretical guarantees provide a strong foundation, there are observable gaps between the theoretical bounds and practical performance, particularly in the PAC (Probably Approximately Correct) setting.

Key Considerations

The approach relies on the assumption that the reward function of the T-MDP is known, with the agent only needing to estimate the transition probabilities. While this is a common and reasonable assumption for many imperfect-information games, it is a specific constraint of the current model. The authors have made their code publicly available to encourage further research and reproduction of their results, providing a scalable tool for researchers working on sequential decision-making in large-scale, tree-structured environments.
