Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

Key Takeaways

  • The paper investigates the reliability of agentic AI systems tasked with complex, multi-step scientific work.
  • Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood.
  • We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks.
  • In the One-Shot setting, access to domain-specific context yields roughly a 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results.
Paper Abstract

Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure, but confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.

This paper investigates the reliability of agentic AI systems when they are tasked with complex scientific work. While these AI agents are increasingly used to automate data analysis and research pipelines, their performance in realistic, multi-step scientific environments is not well understood. The authors evaluate an existing framework, CMBAgent, across eighteen astrophysical tasks to determine if these systems can perform accurate scientific reasoning or if they are prone to subtle, dangerous errors that go unnoticed.

Evaluating Scientific Reliability

The researchers tested the AI using two distinct operational modes. The "One-Shot" mode focuses on single-pass computational tasks, such as configuring cosmological solvers, to see if the agent can correctly use specialized scientific tools. The "Deep Research" mode utilizes a more advanced planning architecture designed for complex, multi-step problems like Bayesian parameter estimation. By using astrophysics as a testbed—a field with well-defined physical models and clear reference solutions—the team could precisely measure where the AI succeeded and where it failed.
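
To make the One-Shot setting concrete, here is a minimal sketch of a solver-configuration task of the kind described, assuming CAMB as the cosmological solver. The parameter values, the reference file, and the tolerance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import camb

# Configure a flat LCDM cosmology -- the kind of single-pass solver setup
# a One-Shot task asks the agent to produce. Parameter values are illustrative.
pars = camb.set_params(H0=67.5, ombh2=0.022, omch2=0.122,
                       As=2e-9, ns=0.965, lmax=2500)
results = camb.get_results(pars)

# CMB temperature power spectrum D_ell = ell(ell+1) C_ell / (2 pi), in muK^2.
powers = results.get_cmb_power_spectra(pars, CMB_unit='muK')
dl_tt = powers['total'][:, 0]

# Grading against a known reference solution: the code above executes either
# way, so only this numerical comparison catches a silent miscalculation.
reference = np.load('reference_tt_spectrum.npy')  # hypothetical test fixture
assert np.allclose(dl_tt[2:len(reference)], reference[2:], rtol=1e-2)
```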

The Problem of Silent Failures

The study reveals a concerning trend: the most significant risk in agentic scientific workflows is not the system crashing or producing an obvious error, but the generation of "silent" failures. In the One-Shot setting, the AI often produces syntactically correct code that runs without issue yet generates inaccurate numerical results. In the Deep Research setting, the agents frequently produce physically inconsistent posteriors without ever flagging a problem or attempting self-diagnosis. The system appears to be working perfectly while delivering incorrect or physically impossible conclusions.
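
This distinction matters operationally: because the generated code executes cleanly, only a comparison against a trusted reference separates a correct run from a silent miscalculation. Below is a minimal sketch of such a grading check, with illustrative function names and tolerances rather than the paper's actual framework API.

```python
import numpy as np

def grade_run(agent_output: np.ndarray, reference: np.ndarray,
              rtol: float = 1e-3) -> str:
    """Classify an agent run; a simplified version of the distinction
    the paper draws. Execution success alone says nothing about
    numerical correctness."""
    if agent_output.shape != reference.shape:
        return "overt failure"    # wrong output structure, easy to spot
    if np.allclose(agent_output, reference, rtol=rtol):
        return "correct"
    return "silent failure"       # ran fine, plausible values, wrong numbers
```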

Impact of Context and Complexity

The researchers found that providing the AI with domain-specific context, such as documentation for scientific tools, dramatically improved performance: roughly a 6x gain in the One-Shot setting (0.85 versus near zero without context). However, performance consistently degraded when tasks were designed to probe the limits of the AI’s reasoning. In particular, when faced with "under-constrained" problems, where the data are insufficient to reach a definitive conclusion, the agents tended to ignore this limitation and confidently reported results the data could not support.
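
One way to make that failure concrete: in Bayesian terms, if the posterior is barely narrower than the prior, the data have not constrained the parameter, and confidently reporting a best-fit value is unwarranted. Below is a hedged sketch of such a sanity check; the shrink threshold, function name, and sample values are illustrative assumptions, not part of the paper's framework.

```python
import numpy as np

def is_data_constrained(posterior_samples: np.ndarray,
                        prior_std: float,
                        shrink_factor: float = 0.8) -> bool:
    """Flag parameters the data actually constrain: the posterior should be
    meaningfully narrower than the prior. If it is not, reporting a best-fit
    value is exactly the overconfident behavior the paper observed."""
    return np.std(posterior_samples) < shrink_factor * prior_std

# Example: a posterior on H0 barely narrower than its prior.
rng = np.random.default_rng(0)
posterior = rng.normal(70.0, 9.5, size=10_000)          # illustrative samples
print(is_data_constrained(posterior, prior_std=10.0))   # False: under-constrained
```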

Key Takeaways for Scientific AI

The findings suggest that current agentic systems are capable of handling well-specified, routine tasks but struggle with the nuanced, critical thinking required for genuine scientific discovery. Because these agents often fail by producing plausible-looking but wrong outputs, they cannot be fully trusted to operate without human oversight in scientific pipelines. The authors have released their evaluation framework to help other researchers systematically test the reliability of AI agents before they are deployed in sensitive scientific research.
