Teaching AI agents to ask better questions by playing “Battleship”
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences have developed a new method to improve how artificial intelligence agents gather information. By using the classic game of “Battleship” as a test bed, the team discovered that equipping AI models with a Monte Carlo inference strategy allows them to ask more informative questions, enabling smaller, more efficient models to outperform larger, frontier-scale systems at a fraction of the cost.
Refining the art of inquiry
While language models are often optimized to answer complex queries, they frequently struggle to formulate effective questions in uncertain environments, a critical skill for fields like scientific discovery and medical diagnosis. To study this, researchers created “Collaborative Battleship,” a game where a “captain” asks questions to locate hidden ships and a “spotter” provides real-time answers. By analyzing data from over 40 human players, the team found that while top-tier models could technically beat humans, they often lacked the rational inquiry skills necessary for efficient exploration.
To address this, the researchers implemented a Monte Carlo inference strategy. This approach allows models to treat potential guesses as individual particles, weighting them based on the likelihood of being correct after each response. This calculated, adaptive style of questioning allows agents to extract significantly more information from the spotter. As a result, the Llama 4 Scout model saw its win rate against humans jump from 8 percent to 82 percent, allowing it to outpace the larger GPT-5 while operating at roughly 1 percent of the cost.
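The particle idea can be illustrated with a minimal sketch. It is not the researchers' implementation: the board size, the single two-cell ship, the 10 percent answer-noise rate, and the names `sample_board`, `reweight`, and `in_top_half` are all assumptions chosen for illustration. Each particle is one hypothesized ship placement, and each answer from the spotter rescales the particle weights by how likely that answer would be under the hypothesis.

```python
import random

GRID = 4        # hypothetical board size
SHIP_LEN = 2    # hypothetical single ship
NOISE = 0.1     # assumed chance that the spotter's answer is wrong

def sample_board(rng):
    """Sample one hypothesis: the set of cells occupied by a single ship."""
    if rng.random() < 0.5:  # horizontal placement
        r, c = rng.randrange(GRID), rng.randrange(GRID - SHIP_LEN + 1)
        return frozenset((r, c + i) for i in range(SHIP_LEN))
    r, c = rng.randrange(GRID - SHIP_LEN + 1), rng.randrange(GRID)
    return frozenset((r + i, c) for i in range(SHIP_LEN))

def reweight(particles, weights, question, answer):
    """Scale each weight by the likelihood of the observed answer, then normalize."""
    scaled = []
    for board, w in zip(particles, weights):
        agrees = question(board) == answer
        scaled.append(w * ((1 - NOISE) if agrees else NOISE))
    total = sum(scaled)
    return [w / total for w in scaled]

def in_top_half(board):
    """Example question: 'Is any part of the ship in the top half?'"""
    return any(r < GRID // 2 for r, _ in board)

rng = random.Random(0)
particles = [sample_board(rng) for _ in range(500)]
weights = [1 / len(particles)] * len(particles)

# The spotter answers "yes"; hypotheses that agree gain weight.
weights = reweight(particles, weights, in_top_half, True)
belief = sum(w for b, w in zip(particles, weights) if in_top_half(b))
print(round(belief, 3))
```

After the update, most of the probability mass sits on placements consistent with the answer, while the noise term keeps inconsistent placements from being ruled out entirely.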
Bridging the gap with code
The research team also addressed the difficulty models face when acting as the “spotter.” Smaller systems often struggled to provide accurate information about hidden ships. To improve this, the researchers introduced a technique in which each question is automatically translated into a Python command. The spotter can then verify its answer by running that command as a quick search of the game area rather than relying solely on language generation.
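A rough sketch of what answering by code could look like follows. The board encoding, the two example questions, and the function names are hypothetical, and the translation step itself (a language model turning the question into code) is not shown; the two functions stand in for what such a step might produce.

```python
# Hypothetical true board: "." is water, letters mark ship segments.
BOARD = [
    "..RR",
    "....",
    "G...",
    "G...",
]

def red_ship_in_row_a(board):
    """Assumed translation of: 'Is any part of the red ship in row A?'"""
    return "R" in board[0]

def green_ship_is_horizontal(board):
    """Assumed translation of: 'Is the green ship horizontal?'"""
    cells = [(r, c) for r, row in enumerate(board)
             for c, ch in enumerate(row) if ch == "G"]
    return len({r for r, _ in cells}) == 1

def spotter_answer(program, board):
    """Answer by executing the program on the true board instead of guessing in text."""
    return "yes" if program(board) else "no"

print(spotter_answer(red_ship_in_row_a, BOARD))         # yes
print(spotter_answer(green_ship_is_horizontal, BOARD))  # no
```

Because the answer comes from executing a check against the actual board, its accuracy depends on the translation being right rather than on the model's ability to read the grid in prose.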
This auto-formalization strategy led to significant performance gains across the board. The lightweight GPT-4o-mini model saw a nearly 30 percent performance increase, while the larger Claude 4 Opus model improved by about eight points. These techniques were further validated in the game “Guess Who?”, where the refined models demonstrated a similar ability to narrow down options more effectively.
Future implications for discovery
The study suggests that the ability to ask informative questions is tied to an agent’s capacity to simulate and predict the world. Agents given access to a “world model” can make discoveries more efficiently. While “Collaborative Battleship” serves as a controlled environment, the researchers believe these findings have significant potential for “needle-in-a-haystack” discovery tasks, such as identifying molecular structures in scientific research.
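One common way to make “informative” concrete is expected information gain: the more evenly a yes/no question splits the hypotheses an agent still considers possible, the more it is expected to reveal. The article does not say which score the researchers used, so the sketch below is only an illustration under that assumption; the eight-column world and the two questions are hypothetical, and answers are assumed to be noise-free.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with probability p of 'yes'."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(hypotheses, weights, question):
    """With noise-free answers, the gain equals the entropy of the predicted answer."""
    p_yes = sum(w for h, w in zip(hypotheses, weights) if question(h))
    return binary_entropy(p_yes)

# Hypothetical world model: the hidden ship is in one of 8 equally likely columns.
hypotheses = list(range(8))
weights = [1 / 8] * 8

questions = {
    "Is it in column 0?": lambda col: col == 0,
    "Is it in the left half?": lambda col: col < 4,
}
scores = {text: expected_information_gain(hypotheses, weights, q)
          for text, q in questions.items()}
best = max(scores, key=scores.get)
print(best)  # Is it in the left half?
```

Scoring a question this way requires simulating its answer under every hypothesis, which is one reading of why questioning ability would depend on having a world model.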
The research team, which includes Gabriel Grand, Valerio Pepe, Jacob Andreas, and Joshua Tenenbaum, presented their findings in an oral session at the International Conference on Learning Representations in April. Moving forward, the team aims to test these models in more complex settings and explore how humans and AI agents can collaborate more effectively to resolve misunderstandings and track common ground.
