Back to AI Research

AI Research

Beyond Task Success: Measuring Workflow Fidelity in... | AI Research

Key Takeaways

  • Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems This paper addresses a critical blind spot in the deployment of LLM-bas...
  • LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions.
  • Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR.
  • Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems This paper addresses a critical blind spot in the deployment of LLM-based multi-agent systems for financial services.
  • # Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
Paper AbstractExpand

LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

This paper addresses a critical blind spot in the deployment of LLM-based multi-agent systems for financial services. While current evaluation methods focus primarily on whether a task is completed successfully, they often fail to verify if the agent followed the required procedural steps. The authors introduce a new metric designed to ensure that AI agents adhere to strict, regulated workflows, even when they appear to reach the correct final outcome.

The Problem with Current Metrics

In automated payment systems, developers typically rely on Task Success Rate (TSR) and Agent Handoff F1-Score (HF1). While these metrics confirm that a payment was processed or that agents communicated with one another, they are insufficient for regulated environments. They do not account for the specific sequence of actions taken. An agent might successfully complete a payment, but if it skips a mandatory security or confirmation checkpoint, the system has failed from a compliance and safety perspective.

Introducing Agentic Success Rate (ASR)

To solve this, the authors propose the Agentic Success Rate (ASR). Unlike previous metrics, ASR evaluates "trajectory fidelity." It compares the actual sequence of steps an agent takes against the expected, compliant execution path. By breaking performance down into Transition Recall and Transition Precision, ASR allows developers to see exactly where an agent deviates from the required workflow, providing a granular view of agent behavior that TSR and HF1 cannot capture.

Key Findings in Payment Workflows

The researchers tested 18 different LLMs across 90,000 payment tasks using the Hierarchical Multi-Agent System for Payments (HMASP). The results were revealing: 10 of the 18 models systematically skipped a mandatory confirmation checkpoint during checkout. Crucially, these models still achieved perfect scores in traditional metrics like TSR and HF1, meaning their "shortcuts" were invisible to standard testing. For example, GPT-4.1 was found to be taking these hidden shortcuts, whereas GPT-5.2 demonstrated perfect adherence to the required workflow.

Improving System Reliability

The study demonstrates that using ASR as a diagnostic tool leads to significantly better system performance. By identifying where agents were deviating, the researchers were able to implement prompt refinements and deterministic routing guards. These adjustments resulted in substantial improvements, with some previously struggling models seeing their Task Success Rate increase by up to 93.8 percentage points. This highlights that for high-stakes, regulated domains, evaluating the path taken—not just the final result—is essential for building safe and reliable AI systems.

Comments (0)

No comments yet

Be the first to share your thoughts!