MaD Physics: Evaluating information seeking under c...

Paper Abstract

Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.

Scientific discovery is a process defined by constraints. Scientists must constantly balance the quality of their measurements against the costs of time, energy, and money. While AI has made strides in static reasoning and unconstrained experimental design, it often struggles with the practical reality of navigating these trade-offs. The paper introduces MaD Physics (Measuring and Discovering Physics), a new benchmark designed to evaluate how well AI agents can perform scientific discovery when they must operate under strict resource budgets and incomplete information.

The Challenge of Empirical Discovery

Existing AI benchmarks for science tend to test either static knowledge, asking models questions about facts they have already memorized, or experimental design with no resource limits. MaD Physics instead forces agents to act as experimentalists. The agent is placed in a system governed by "altered" physical laws, such as modified gravity or non-standard fluid dynamics, that differ from the physics we know. Because these laws are unfamiliar, the agent cannot rely on pre-existing knowledge; it must actively collect data, observe the system, and infer the underlying rules to make accurate predictions about the future.
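
To make the inference step concrete, here is a minimal sketch, assuming a hypothetical altered law of the form a(r) = k * r**(-p) with an unfamiliar exponent. The functional form, the value p = 2.3, the multiplicative noise model, and all function names are illustrative assumptions, not details taken from the benchmark. The sketch recovers the exponent from noisy readings with an ordinary least-squares fit in log-log space.

```python
import math
import random


def simulate_measurements(k, p, radii, rel_noise, rng):
    """Noisy readings of a hypothetical altered law a(r) = k * r**(-p).

    rel_noise is the standard deviation of a multiplicative Gaussian error;
    keep it small so that every reading stays positive.
    """
    return [k * r ** (-p) * (1.0 + rng.gauss(0.0, rel_noise)) for r in radii]


def fit_power_law(radii, readings):
    """Least-squares line fit of log(a) against log(r); returns (k_hat, p_hat)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(a) for a in readings]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(a) = log(k) - p * log(r), so the slope is -p.
    return math.exp(intercept), -slope


rng = random.Random(0)
radii = [float(r) for r in range(1, 11)]
readings = simulate_measurements(k=5.0, p=2.3, radii=radii, rel_noise=0.01, rng=rng)
k_hat, p_hat = fit_power_law(radii, readings)
print(f"estimated exponent: {p_hat:.3f}, estimated constant: {k_hat:.3f}")
```

An agent that assumed the familiar inverse-square exponent would extrapolate poorly here, which is the reason altered laws make memorized physics insufficient.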

How the Benchmark Works

The evaluation process is split into two distinct phases:

  • Measurement: The agent is given a fixed budget. It must decide what to measure, when to measure it, and how much "fidelity" (precision) to pay for. Because every action consumes part of the budget, the agent must plan strategically to maximize its information gain.

  • Prediction: Once the budget is exhausted, the agent must use the data it collected to predict the state of the system at a future time.

The benchmark tests these capabilities across three core domains: classical mechanics (e.g., particle motion with anisotropic mass), fluid mechanics (e.g., turbulent flow with "alien" forcing terms), and quantum mechanics (e.g., systems with non-linear entanglement).
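
The two phases can be sketched as a toy trial. This is a sketch under stated assumptions, not the benchmark's actual interface: the class `BudgetedEnvironment`, the hidden linear law, and the cost model (price inversely proportional to the noise level) are all hypothetical and chosen only to show how fidelity and measurement count compete for the same budget.

```python
import random


class BudgetedEnvironment:
    """Toy stand-in for one trial; every name and number here is hypothetical."""

    def __init__(self, budget, rng):
        self.budget = budget
        self._rng = rng
        self._slope = 1.7    # hidden "law": x(t) = slope * t + offset
        self._offset = 0.4

    def true_state(self, t):
        return self._slope * t + self._offset

    def cost(self, sigma):
        # Assumed cost model: halving the noise doubles the price.
        return 1.0 / sigma

    def measure(self, t, sigma):
        """Spend budget on one noisy reading; return None if it is unaffordable."""
        price = self.cost(sigma)
        if price > self.budget:
            return None
        self.budget -= price
        return self.true_state(t) + self._rng.gauss(0.0, sigma)


def run_trial(env, times, sigma, t_future):
    """Measurement phase followed by prediction phase."""
    data = []
    for t in times:
        reading = env.measure(t, sigma)
        if reading is None:      # budget exhausted: stop measuring
            break
        data.append((t, reading))
    if len(data) < 2:
        raise ValueError("need at least two measurements to fit a line")

    # Prediction phase: fit a line by least squares and extrapolate.
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_x = sum(x for _, x in data) / n
    stt = sum((t - mean_t) ** 2 for t, _ in data)
    stx = sum((t - mean_t) * (x - mean_x) for t, x in data)
    slope = stx / stt
    prediction = mean_x + slope * (t_future - mean_t)
    return prediction, n


env = BudgetedEnvironment(budget=105.0, rng=random.Random(1))
schedule = [0.5 * i for i in range(20)]
prediction, used = run_trial(env, schedule, sigma=0.1, t_future=20.0)
print(f"measurements taken: {used}, prediction at t=20: {prediction:.2f}")
```

In this toy setup the agent asks for 20 readings but can afford only 10 at the chosen fidelity. Picking a larger `sigma` would buy more readings at lower precision, which is the quality-versus-quantity decision the measurement phase is meant to probe.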

Evaluating AI Performance

The researchers used MaD Physics to test four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash). The results highlight significant gaps in the current generation of AI agents. Specifically, the models often struggle with structured exploration (the ability to gather data systematically) and with collecting data efficiently. While the models show promise in reasoning, they frequently fail to spend their limited budget effectively, suggesting that current AI still lacks the sophisticated planning required for autonomous scientific research.

Implications for Future Research

MaD Physics serves as a diagnostic tool to help researchers identify where AI agents fall short in scientific reasoning. By highlighting the difficulty of balancing cost and precision, the benchmark points toward a need for better planning and data-collection strategies in AI. The authors suggest that by focusing on these "active discovery" tasks, future AI development can move beyond simple question-answering and toward systems capable of performing genuine, resource-conscious scientific inquiry.
