Root cause analysis (RCA) is a critical task for LLM agents, requiring them to pinpoint the source of failures in complex, interconnected software systems. Existing benchmarks evaluate whether an agent can identify the correct root cause, but they often ignore the "how": the causal path that links a fault to the observed symptoms. The paper introduces OpenRCA 2.0, a benchmark that moves beyond outcome labels to evaluate the reasoning process of LLM agents, and it reveals that many models identify the right service for the wrong reasons.
From Outcome Labels to Causal Supervision
Current RCA benchmarks rely on fault injection, where researchers deliberately break a component and label it as the root cause. Scoring against that label only checks whether the agent named the right component; it does not verify that the agent understood how the fault propagated through the system. To address this, the authors developed PAVE (Path Annotation via Verified Effects), a protocol that reconstructs the actual causal propagation path. Starting from the known intervention in the fault injection, PAVE performs "forward verification": it checks which downstream effects were actually caused by the fault, rather than reasoning backward from symptoms.
How PAVE Works
PAVE reconstructs verified causal paths by applying three strict conditions to candidate paths:
Structural Conformance: The path must follow the known dependency graph of the software system.
Statistical Deviation: Every node in the path must show a statistically significant change in telemetry compared to a normal, pre-injection baseline.
Temporal Alignment: The timing of anomalies must flow logically from the upstream cause to the downstream effect.
By filtering candidate paths through these three lenses, PAVE creates a "process-level" ground truth. This allows researchers to see not just if an agent found the root cause, but if it correctly mapped the chain of events leading to the failure.
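The three conditions above can be sketched as a filter over one candidate path. This is a minimal illustration, not the authors' implementation: the function names, the z-score test with a threshold of 3.0, the use of a single anomaly onset time per service, and the toy telemetry are all assumptions made for the example.

```python
from statistics import mean, stdev

def is_deviant(baseline, observed, z_threshold=3.0):
    """Hypothetical deviation test: z-score of the post-injection mean against the baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(observed) != mu
    return abs(mean(observed) - mu) / sigma > z_threshold

def verify_path(path, propagation_edges, baseline, observed, onset):
    """Return True only if the candidate path passes all three checks."""
    hops = list(zip(path, path[1:]))
    return (
        # Structural conformance: every hop is an edge of the dependency graph.
        all(hop in propagation_edges for hop in hops)
        # Statistical deviation: every node departs from its pre-injection baseline.
        and all(is_deviant(baseline[n], observed[n]) for n in path)
        # Temporal alignment: an upstream anomaly starts no later than its downstream effect.
        and all(onset[up] <= onset[down] for up, down in hops)
    )

# Toy system (assumed): a fault injected in "db" propagates to "orders", then "frontend".
propagation_edges = {("db", "orders"), ("orders", "frontend")}
baseline = {
    "db": [10, 11, 9, 10, 10],
    "orders": [20, 21, 19, 20, 20],
    "frontend": [30, 31, 29, 30, 30],
}
observed = {"db": [50, 55, 52], "orders": [60, 62, 61], "frontend": [90, 95, 92]}
onset = {"db": 0.0, "orders": 2.0, "frontend": 5.0}

forward_ok = verify_path(["db", "orders", "frontend"], propagation_edges, baseline, observed, onset)
reversed_ok = verify_path(["frontend", "orders", "db"], propagation_edges, baseline, observed, onset)
late_onset = {"db": 4.0, "orders": 2.0, "frontend": 5.0}
misordered_ok = verify_path(["db", "orders", "frontend"], propagation_edges, baseline, observed, late_onset)
print(forward_ok, reversed_ok, misordered_ok)
```

In this sketch the reversed path fails the structural check, and the path with a late onset at "db" fails the temporal check even though its structure and deviations are fine.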
The "Ungrounded Diagnosis" Problem
When testing 11 frontier LLMs on the 500 instances in OpenRCA 2.0, the researchers found a gap between outcome and process. Agents identified at least one correct root-cause service in 76.0% of cases, but they grounded that identification in a verified causal path only 61.5% of the time. The authors call this discrepancy an "ungrounded diagnosis": the agent gets the right answer by luck or pattern matching but fails to provide a valid, logical explanation for how the failure occurred.
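One way to make the distinction concrete is to label each agent answer as a miss, a grounded diagnosis, or an ungrounded one. The sketch below is illustrative only. The grounding rule it uses (a correctly named root cause must be the source of at least one predicted edge that also appears in the verified path) is an assumption for the example, not the benchmark's actual criterion, and all names are hypothetical.

```python
def classify_diagnosis(pred_roots, pred_edges, true_roots, verified_edges):
    """Label one agent answer as 'miss', 'grounded', or 'ungrounded'."""
    # Outcome check: at least one predicted root cause is correct.
    hits = set(pred_roots) & set(true_roots)
    if not hits:
        return "miss"
    # Process check (assumed rule): a correct root cause must start
    # at least one predicted edge that the verified path also contains.
    supported = set(pred_edges) & set(verified_edges)
    grounded = any(src in hits for src, _ in supported)
    return "grounded" if grounded else "ungrounded"

# Toy ground truth (assumed): the fault in "db" reaches "frontend" via "orders".
true_roots = {"db"}
verified_edges = {("db", "orders"), ("orders", "frontend")}

grounded_case = classify_diagnosis({"db"}, {("db", "orders")}, true_roots, verified_edges)
ungrounded_case = classify_diagnosis({"db"}, {("db", "frontend")}, true_roots, verified_edges)
miss_case = classify_diagnosis({"cache"}, {("cache", "frontend")}, true_roots, verified_edges)
print(grounded_case, ungrounded_case, miss_case)
```

The second case names the right service but explains it with an edge that does not appear in the verified path, which is the pattern an outcome-only score would count as a success.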
Key Findings and Limitations
The study highlights that outcome-only evaluation is insufficient for building trustworthy AI agents. Key takeaways include:
Reasoning vs. Guessing: Agents are much better at identifying a faulty service than they are at mapping the directed dependencies between services.
Common Failure Modes: Agents often suffer from "salience capture," where they focus on the loudest or most obvious error signal rather than the actual root cause, or "premature commitment," where they stop investigating once they find one plausible cause, ignoring other complexities.
Evaluation Gap: Because Edge F1 scores (which measure the accuracy of the causal path) are consistently lower than Node F1 scores (which measure the accuracy of identifying the service), the authors argue that future RCA research must prioritize evaluating the derivation process, not just the final answer.
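To illustrate why Edge F1 can lag behind Node F1, the sketch below computes both as ordinary set-based F1 scores on a toy example. It assumes nodes are the services appearing on a path and edges are directed (cause, effect) pairs; the benchmark's exact metric definitions may differ, and the helper names are hypothetical.

```python
def set_f1(predicted, reference):
    """Standard F1 between two sets of items."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

def nodes_of(edges):
    """Collect every service that appears in a set of directed edges."""
    return {node for edge in edges for node in edge}

# Toy example (assumed): the agent names the right services but wires one hop incorrectly.
verified_edges = {("db", "orders"), ("orders", "frontend")}
predicted_edges = {("db", "frontend"), ("orders", "frontend")}

node_f1 = set_f1(nodes_of(predicted_edges), nodes_of(verified_edges))
edge_f1 = set_f1(predicted_edges, verified_edges)
print(node_f1, edge_f1)  # 1.0 0.5
```

Here the predicted path touches exactly the right services, so the node score is perfect, while only one of the two directed dependencies is correct, so the edge score is halved.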