OpenRCA 2.0: From Outcome Labels to Causal Process...

Key Takeaways

  • Root cause analysis (RCA) is a critical task for LLM agents, requiring them to pinpoint the source of failures in complex, interconnected software systems.
  • RCA is a holistic test of LLM agentic capabilities such as long-context understanding, multi-step reasoning, and tool use.
  • Existing datasets label only the root cause, not the propagation path connecting it to the observed symptom, which largely reduces the task to naive pattern matching.
  • To support rigorous evaluation, the paper introduces PAVE, a step-wise labeling protocol that uses the known interventions from fault injection to reconstruct causal propagation paths.
  • PAVE works by forward verification: it reasons from cause to effect rather than inferring backward from symptoms.

Paper Abstract

Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection to reconstruct causal propagation paths. The mechanism is forward verification: reasoning from cause to effect rather than inferring backward from symptoms. Applying PAVE yields OpenRCA 2.0 (500 instances), the first cross-system RCA benchmark with step-wise causal annotations for LLM agents. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7% of cases on average. To locate where this difficulty lies, we relax the criterion and find what we call the ungrounded diagnosis: agents identify at least one correct root-cause service in 76.0% of cases, but ground that service in a verified causal propagation path to the observed symptom in only 61.5%. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is the missing piece for trustworthy LLM-based RCA agents.

Root cause analysis (RCA) is a critical task for LLM agents, requiring them to pinpoint the source of failures in complex, interconnected software systems. While existing benchmarks evaluate whether an agent can identify the correct root cause, they often ignore the "how"—the causal path that links a fault to the observed symptoms. This paper introduces OpenRCA 2.0, a new benchmark that moves beyond simple outcome labels to evaluate the actual reasoning process of LLM agents, revealing that many models identify the right service for the wrong reasons.

From Outcome Labels to Causal Supervision

Current RCA benchmarks rely on fault injection, where researchers deliberately break a component and label it as the root cause. Scoring against that label alone checks only whether the agent named the right component; it does not verify whether the agent understood how the fault propagated through the system. To address this, the authors developed PAVE (Path Annotation via Verified Effects), a protocol that reconstructs the actual causal propagation path. Using the known intervention from the fault injection as a starting point, PAVE performs "forward verification": it checks which downstream effects were actually caused by the fault, rather than inferring backward from symptoms.
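The forward direction can be illustrated with a small sketch. It is not the authors' implementation; it only assumes that the system's dependencies are available as a mapping from each service to the services its faults can reach, and it enumerates candidate paths starting at the injected service. The function name, the `max_len` cap, and the service names are all hypothetical.

```python
from collections import deque


def forward_candidate_paths(edges, injected, symptom, max_len=6):
    """Enumerate simple paths from the injected service to the symptom service.

    `edges` maps a service to the services a fault in it can propagate to
    (an assumed representation, not taken from the paper).
    """
    paths = []
    queue = deque([[injected]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == symptom:
            paths.append(path)
            continue
        if len(path) >= max_len:
            continue  # hypothetical cap to keep the search bounded
        for nxt in edges.get(node, []):
            if nxt not in path:  # keep paths simple (no cycles)
                queue.append(path + [nxt])
    return paths


# Hypothetical propagation graph for illustration only.
edges = {
    "db": ["orders", "cart"],
    "orders": ["checkout"],
    "cart": ["checkout"],
    "checkout": ["frontend"],
}
candidates = forward_candidate_paths(edges, injected="db", symptom="frontend")
```

Because the search starts from the known intervention, every candidate is anchored at the true fault; what remains is to decide which candidates the telemetry actually supports.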

How PAVE Works

PAVE reconstructs verified causal paths by applying three strict conditions to candidate paths:

  • Structural Conformance: The path must follow the known dependency graph of the software system.

  • Statistical Deviation: Every node in the path must show a statistically significant change in telemetry compared to a normal, pre-injection baseline.

  • Temporal Alignment: The timing of anomalies must flow logically from the upstream cause to the downstream effect.

By filtering candidate paths through these three lenses, PAVE creates a "process-level" ground truth. This allows researchers to see not just if an agent found the root cause, but if it correctly mapped the chain of events leading to the failure.
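A minimal sketch of the three conditions follows. It rests on several assumptions: dependency edges are stored as (upstream, downstream) pairs, each service has a single metric, and a simple z-score stands in for whatever significance test the authors use. The function names, the threshold, and the sample data are hypothetical.

```python
from statistics import mean, stdev


def deviates(baseline_values, incident_values, z_threshold=3.0):
    """Stand-in significance test: z-score of the incident mean vs. the baseline."""
    mu = mean(baseline_values)
    sigma = stdev(baseline_values)  # needs at least two baseline samples
    shift = abs(mean(incident_values) - mu)
    if sigma == 0:
        return shift > 0
    return shift / sigma >= z_threshold


def is_verified_path(path, dependency_edges, baseline, incident, onset):
    hops = list(zip(path, path[1:]))
    # 1. Structural conformance: every hop is an edge in the dependency graph.
    if any(hop not in dependency_edges for hop in hops):
        return False
    # 2. Statistical deviation: every node departs from its pre-injection baseline.
    if not all(deviates(baseline[n], incident[n]) for n in path):
        return False
    # 3. Temporal alignment: anomaly onsets never run backward along the path.
    return all(onset[a] <= onset[b] for a, b in hops)


# Hypothetical telemetry for illustration only.
dependency_edges = {("db", "orders"), ("db", "cart")}
baseline = {"db": [10, 11, 9, 10], "orders": [20, 21, 19, 20], "cart": [5, 5, 6, 4]}
incident = {"db": [50, 55], "orders": [80, 90], "cart": [5, 6]}
onset = {"db": 0, "orders": 3, "cart": 1}

kept = is_verified_path(["db", "orders"], dependency_edges, baseline, incident, onset)
dropped = is_verified_path(["db", "cart"], dependency_edges, baseline, incident, onset)
```

In this toy data the path through "orders" passes all three checks, while the path through "cart" is rejected because the "cart" metric stays within its normal range.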

The "Ungrounded Diagnosis" Problem

When the researchers tested 11 frontier LLMs on the 500 instances in OpenRCA 2.0, they found a significant gap in agent performance. Recovering the exact root-cause set succeeded in only 20.7% of cases on average. Under a relaxed criterion, agents identified at least one correct root-cause service in 76.0% of cases, but grounded that identification in a verified causal path in only 61.5%, a gap of 14.5 percentage points. The authors call this an "ungrounded diagnosis": the agent names a correct service, possibly by luck or pattern matching, but fails to provide a valid explanation of how the failure occurred.
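The relationship between these three rates can be made concrete with a short sketch. The record layout is an assumption made for illustration: each result holds the predicted and true root-cause sets plus a hypothetical `path_verified` flag saying whether the agent's claimed path from a correct root cause to the symptom matched a verified path. The paper's actual scoring may differ in detail.

```python
def diagnosis_rates(results):
    """Compute exact-match, at-least-one, and grounded rates over a list of results."""
    n = len(results)
    exact = sum(r["predicted"] == r["truth"] for r in results)
    hit = sum(bool(r["predicted"] & r["truth"]) for r in results)
    grounded = sum(
        bool(r["predicted"] & r["truth"]) and r["path_verified"] for r in results
    )
    return {
        "exact": exact / n,
        "at_least_one": hit / n,
        "grounded": grounded / n,
        "ungrounded_gap": (hit - grounded) / n,
    }


# Four hypothetical agent outputs for illustration only.
results = [
    {"predicted": {"db"}, "truth": {"db"}, "path_verified": True},
    {"predicted": {"db", "cart"}, "truth": {"db"}, "path_verified": True},
    {"predicted": {"db"}, "truth": {"db"}, "path_verified": False},
    {"predicted": {"cart"}, "truth": {"db"}, "path_verified": False},
]
rates = diagnosis_rates(results)
```

The third record is the ungrounded case: the predicted service is correct, yet no verified path backs it up, so it counts toward "at_least_one" but not toward "grounded".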

Key Findings and Limitations

The study highlights that outcome-only evaluation is insufficient for building trustworthy AI agents. Key takeaways include:

  • Reasoning vs. Guessing: Agents are much better at identifying a faulty service than they are at mapping the directed dependencies between services.

  • Common Failure Modes: Agents often suffer from "salience capture," where they focus on the loudest or most obvious error signal rather than the actual root cause, or "premature commitment," where they stop investigating once they find one plausible cause, ignoring other complexities.

  • Evaluation Gap: Because Edge F1 scores (which measure the accuracy of the causal path) are consistently lower than Node F1 scores (which measure the accuracy of identifying the service), the authors argue that future RCA research must prioritize evaluating the derivation process, not just the final answer.
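The difference between the two scores can be sketched as follows, assuming that Node F1 is computed over the set of services on a path and Edge F1 over the set of directed (upstream, downstream) hops. This set-based definition and the example paths are assumptions made for illustration, not the paper's exact formulation.

```python
def f1(predicted, truth):
    """Set-based F1 between predicted and ground-truth items."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def path_edges(path):
    """Directed hops along a path, as (upstream, downstream) pairs."""
    return set(zip(path, path[1:]))


# Hypothetical ground-truth path and an agent path that skips one hop.
truth_path = ["db", "orders", "checkout", "frontend"]
agent_path = ["db", "checkout", "frontend"]

node_f1 = f1(set(agent_path), set(truth_path))
edge_f1 = f1(path_edges(agent_path), path_edges(truth_path))
```

Here the agent names three of the four services correctly, so Node F1 is high, but only one of its two claimed hops exists in the ground truth, so Edge F1 falls well below it. A single skipped intermediate service costs one node but two edges, which is one reason edge-level scoring is the stricter test.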
