Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
Large Language Models (LLMs) are highly capable at logical tasks, but they often struggle in multi-agent games like poker or diplomatic negotiations. In these environments, success depends on how well an agent can anticipate and react to the strategies of others. Because other agents are constantly changing their tactics, traditional single-agent training methods fail to provide the necessary guidance for complex, multi-turn strategic reasoning. Strat-Reasoner is a new reinforcement learning framework designed to solve this by teaching LLMs to "think about what others think," leading to more effective and human-like strategic decision-making.
Recursive Reasoning
The core of the framework is a "Recursive Reasoning" module. Instead of acting in isolation, the agent is trained to follow a structured, multi-step thought process that mirrors the alternating nature of these games. At each turn, the agent is prompted to analyze the opponent’s past intent, predict how the opponent perceives the agent’s current move, formulate its own strategy, and finally predict the opponent’s next move. This "Past-Present-Future" loop ensures that the model’s reasoning is deeply integrated with the game's dynamics rather than being a generic response.
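The four-step loop described above can be sketched as a prompting pipeline. This is a minimal illustration, not the paper's actual templates: `query_llm` is a hypothetical placeholder for any chat-completion call, and the prompt wording and returned field names are assumptions.

```python
# Hypothetical sketch of the "Past-Present-Future" recursive reasoning loop.
# `query_llm` stands in for a real LLM endpoint; prompts are illustrative.

def query_llm(prompt: str) -> str:
    # Placeholder: in practice this would call an actual model API.
    return f"<response to: {prompt[:40]}...>"

def recursive_reasoning_turn(game_state: str, opponent_history: list[str]) -> dict:
    """One turn of structured reasoning before the agent acts."""
    history = "\n".join(opponent_history)

    # Past: infer the opponent's intent from their previous moves.
    past_intent = query_llm(
        f"Game state:\n{game_state}\nOpponent moves so far:\n{history}\n"
        "What strategy is the opponent most likely pursuing?"
    )
    # Present: predict how the opponent will perceive our candidate move.
    perceived = query_llm(
        f"Opponent intent: {past_intent}\n"
        "How would the opponent interpret our strongest available move?"
    )
    # Formulate our own strategy given both beliefs.
    strategy = query_llm(
        f"Given intent ({past_intent}) and perception ({perceived}), "
        "choose our move and justify it."
    )
    # Future: predict the opponent's next move in response.
    predicted_next = query_llm(
        f"If we play: {strategy}\nPredict the opponent's next move."
    )
    return {
        "past_intent": past_intent,
        "perceived_self": perceived,
        "strategy": strategy,
        "predicted_next": predicted_next,
    }
```

Keeping the four beliefs in a structured record like this is what lets a downstream training signal score each reasoning step separately rather than only the final action.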
Centralized CoT Comparison
To guide the model, Strat-Reasoner uses a "Centralized Chain-of-Thought (CoT) Comparison" module. During training, the system treats the reasoning processes of both agents as global information. It evaluates the ego agent’s performance by checking how well its internal beliefs align with the opponent’s actual thoughts and actions. By comparing the agent’s predictions against the ground truth of the opponent’s behavior, the framework provides fine-grained, turn-by-turn feedback that is much more informative than simply waiting for a win or loss at the end of the game.
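One way to turn this comparison into a dense reward is to score how closely the ego agent's predicted opponent reasoning matches the opponent's actual logged chain-of-thought, turn by turn. The sketch below uses token-overlap F1 as a simple alignment proxy; the paper's actual scoring function is not specified here and may differ.

```python
# Illustrative stand-in for the centralized CoT comparison: score the
# alignment between the ego agent's prediction of the opponent's reasoning
# and the opponent's actual logged chain-of-thought. Token-level F1 is an
# assumed proxy metric, chosen for simplicity.

from collections import Counter

def cot_alignment_reward(predicted_cot: str, actual_cot: str) -> float:
    """Token-overlap F1 between predicted and actual opponent reasoning."""
    pred = Counter(predicted_cot.lower().split())
    actual = Counter(actual_cot.lower().split())
    overlap = sum((pred & actual).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(actual.values())
    return 2 * precision * recall / (precision + recall)

def turn_rewards(predictions: list[str], actual_cots: list[str]) -> list[float]:
    """Dense, per-turn feedback: one alignment score for every game turn."""
    return [cot_alignment_reward(p, a) for p, a in zip(predictions, actual_cots)]
```

Because both agents' reasoning traces are available centrally during training, this signal arrives at every turn instead of only at the end of the game.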
Hybrid Advantage Estimation
One of the biggest challenges in multi-agent reinforcement learning is the high level of uncertainty and the difficulty of assigning credit for a specific action. Strat-Reasoner addresses this by using a "Hybrid Advantage" approach. It combines the immediate, dense feedback from the CoT comparison with the long-term, outcome-based rewards of the game. By using "micro-rollouts"—where the model generates multiple potential reasoning paths in parallel—the framework creates a stable, low-variance baseline that helps the model learn which reasoning steps actually lead to better strategic outcomes.
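The hybrid advantage can be sketched as follows, assuming each of the parallel micro-rollouts for a turn yields both a dense CoT-alignment reward and a terminal game reward, and that the baseline is the group mean over those rollouts (in the style of group-relative estimators). The mixing weight `alpha` is an assumed hyperparameter, not a value from the paper.

```python
# Minimal sketch of hybrid advantage estimation over micro-rollouts.
# Assumptions: K parallel reasoning paths for the same turn each produce a
# dense CoT reward and an outcome reward; the baseline is the group mean.

from statistics import mean

def hybrid_advantages(
    cot_rewards: list[float],      # dense CoT-alignment reward per micro-rollout
    outcome_rewards: list[float],  # terminal win/loss reward per micro-rollout
    alpha: float = 0.5,            # assumed weight on the dense signal
) -> list[float]:
    """Advantage of each rollout relative to the group-mean baseline."""
    hybrid = [
        alpha * c + (1 - alpha) * o
        for c, o in zip(cot_rewards, outcome_rewards)
    ]
    baseline = mean(hybrid)  # low-variance baseline from the parallel rollouts
    return [h - baseline for h in hybrid]
```

Subtracting a baseline computed from sibling rollouts of the same turn, rather than from unrelated games, is what keeps the variance of the estimate low: every rollout in the group shares the same state, so differences in reward are attributable to differences in reasoning.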
Performance and Impact
Experimental results demonstrate that Strat-Reasoner significantly enhances the strategic capabilities of LLMs. By moving beyond simple outcome-based learning and incorporating explicit opponent modeling, the framework achieved an average performance improvement of 22.1% across various competitive and cooperative multi-agent games. This suggests that teaching models to explicitly model the cognitive states of others is a powerful way to improve their performance in complex, real-world strategic environments.