
Key Takeaways

  • StratFormer is a transformer-based meta-agent that learns to simultaneously model and exploit opponents in imperfect-information games through a two-phase curriculum.
  • Phase one trains an opponent modeling head on action histories while the agent plays a game-theoretic optimal (GTO) policy; phase two progressively shifts the policy toward best-response (BR) exploitation under a per-opponent regularization schedule tied to exploitability.
  • The architecture introduces dual-turn tokens, feature vectors built at both agent and opponent decision points, coupled with bucket-rate features that encode opponent tendencies across five strategic contexts.
  • On Leduc Hold'em, StratFormer gains an average of +0.106 Big Blinds (BB) per hand over GTO, with peak gains of +0.821 BB against highly exploitable opponents, while maintaining near-equilibrium safety against strong play.
Paper Abstract

We present StratFormer, a transformer-based meta-agent that learns to simultaneously model and exploit opponents in imperfect-information games through a two-phase curriculum. The first phase trains an opponent modeling head to identify behavioral patterns from action histories while the agent plays a game-theoretic optimal (GTO) policy. The second phase progressively shifts the policy toward best-response (BR) exploitation, guided by a per-opponent regularization schedule tied to exploitability. Our architecture introduces dual-turn tokens -- feature vectors constructed at both agent and opponent decision points -- coupled with bucket-rate features that encode opponent tendencies across five strategic contexts. On Leduc Hold'em, a small poker variant with six cards and two betting rounds, we test against six opponent archetypes at two strength levels each, with exploitability ranging from 0.15 to 1.26 Big Blinds (BB) per hand. StratFormer achieves an average exploitation gain of +0.106 BB per hand over GTO, with peak gains of +0.821 against highly exploitable opponents, while maintaining near-equilibrium safety.

StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games
StratFormer is a new AI agent designed to solve a classic dilemma in game theory: how to play safely against perfect opponents while simultaneously exploiting the predictable mistakes of weaker ones. In games with hidden information, such as poker, agents typically choose between playing a "Game-Theoretic Optimal" (GTO) strategy—which is safe but never wins extra value—or a "Best-Response" (BR) strategy, which exploits an opponent but can be easily defeated if the model of that opponent is wrong. StratFormer uses a transformer-based architecture to learn both how to model an opponent's behavior and how to adjust its own strategy in real-time, bridging the gap between safety and exploitation.

A Two-Phase Learning Curriculum

The researchers trained StratFormer using a two-phase curriculum that separates understanding from acting. In the first phase, the agent plays a GTO policy while training an "opponent modeling head." This component learns to identify patterns in the opponent's actions based on the history of the game. In the second phase, the agent begins to shift its strategy toward a best-response approach. Crucially, the degree of this shift is controlled by a regularization schedule: if the opponent is highly exploitable, the agent plays more aggressively; if the opponent plays near-perfectly, the agent remains tethered to a safer, GTO-like strategy.
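The exploitability-tied schedule described above can be sketched as a simple convex mixture between the GTO and best-response policies. The functional form below (a clipped linear map from estimated exploitability to a mixing weight, with `eps_max` and `floor` as illustrative constants) is an assumption for illustration, not the paper's exact schedule:

```python
def mixing_weight(exploitability_bb, eps_max=1.26, floor=0.05):
    """Hypothetical schedule: map an opponent's estimated exploitability
    (in big blinds per hand) to a best-response mixing weight in [floor, 1].
    Near-GTO opponents keep the agent pinned to the equilibrium policy;
    highly exploitable opponents unlock full best-response play."""
    w = exploitability_bb / eps_max
    return max(floor, min(1.0, w))

def blended_policy(pi_gto, pi_br, w):
    """Convex mixture of the GTO and best-response action distributions.
    With w = 0 the agent plays pure GTO; with w = 1, pure best response."""
    return [(1 - w) * g + w * b for g, b in zip(pi_gto, pi_br)]
```

Because the mixture is convex, the blended action probabilities always remain a valid distribution, and the `floor` keeps a small amount of adaptation active even against near-perfect opponents.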

Architecture and Feature Engineering

The agent’s intelligence is powered by a causal transformer encoder that processes "dual-turn tokens." These tokens are feature vectors created at every decision point, whether it is the agent's turn or the opponent's. To help the model understand the opponent's tendencies, the researchers included "bucket-rate features." These track running statistics of the opponent's behavior across five different strategic contexts, such as how often they fold, call, or raise when facing pressure. By using these features, the transformer can dynamically weight past observations to make informed predictions about future moves.
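A minimal sketch of the bucket-rate idea follows: running per-context action frequencies flattened into a fixed-length feature vector. The paper specifies five strategic contexts, but the context names, the fold/call/raise action set, and the uniform prior for unseen contexts are all assumptions made here for illustration:

```python
from collections import defaultdict

# Hypothetical context labels; the paper's five strategic contexts are
# not spelled out in this summary, so these names are placeholders.
CONTEXTS = ["open", "facing_bet", "facing_raise", "round_start", "final_round"]
ACTIONS = ["fold", "call", "raise"]

class BucketRates:
    """Running action-frequency statistics for one opponent,
    bucketed by strategic context."""

    def __init__(self):
        self.counts = {c: defaultdict(int) for c in CONTEXTS}

    def observe(self, context, action):
        """Record one observed opponent action in a given context."""
        self.counts[context][action] += 1

    def features(self):
        """Flatten to a fixed-length vector of per-context action rates,
        falling back to a uniform prior for contexts not yet observed."""
        vec = []
        for c in CONTEXTS:
            total = sum(self.counts[c].values())
            for a in ACTIONS:
                rate = self.counts[c][a] / total if total else 1.0 / len(ACTIONS)
                vec.append(rate)
        return vec
```

Appending such a vector to each dual-turn token gives the transformer a compact, always-current summary of opponent tendencies alongside the raw action history.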

Performance and Results

The researchers tested StratFormer on Leduc Hold'em, a small poker variant, against six opponent archetypes at two strength levels each (twelve opponents in total) with exploitability ranging from 0.15 to 1.26 Big Blinds per hand. The agent demonstrated a strong ability to adapt, achieving an average exploitation gain of +0.106 Big Blinds per hand over the GTO baseline. Against the most predictable, "maniacal" opponents, the agent achieved peak gains of +0.821 Big Blinds per hand. Importantly, when the agent faced a GTO opponent, it maintained near-equilibrium safety, showing that the model successfully balances the upside of exploitation against the need for a solid defensive foundation.

Key Considerations

While StratFormer shows significant promise, its design relies on a tractable equilibrium baseline to function effectively. The architecture is domain-general, meaning the dual-turn tokens and the two-phase curriculum could theoretically be applied to other sequential games where opponent actions are observable. The researchers noted that separating the policy and modeling heads was essential to prevent "gradient interference," ensuring that the effort to learn about the opponent does not negatively impact the agent's ability to make optimal strategic decisions.
