Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
This paper addresses a critical blind spot in the deployment of LLM-based multi-agent systems for financial services. While current evaluation methods focus primarily on whether a task is completed successfully, they often fail to verify if the agent followed the required procedural steps. The authors introduce a new metric designed to ensure that AI agents adhere to strict, regulated workflows, even when they appear to reach the correct final outcome.
The Problem with Current Metrics
In automated payment systems, developers typically rely on Task Success Rate (TSR) and Agent Handoff F1-Score (HF1). While these metrics confirm that a payment was processed or that agents communicated with one another, they are insufficient for regulated environments. They do not account for the specific sequence of actions taken. An agent might successfully complete a payment, but if it skips a mandatory security or confirmation checkpoint, the system has failed from a compliance and safety perspective.
Introducing Agentic Success Rate (ASR)
To solve this, the authors propose the Agentic Success Rate (ASR). Unlike previous metrics, ASR evaluates "trajectory fidelity." It compares the actual sequence of steps an agent takes against the expected, compliant execution path. By breaking performance down into Transition Recall and Transition Precision, ASR allows developers to see exactly where an agent deviates from the required workflow, providing a granular view of agent behavior that TSR and HF1 cannot capture.
Key Findings in Payment Workflows
The researchers tested 18 different LLMs across 90,000 payment tasks using the Hierarchical Multi-Agent System for Payments (HMASP). The results were revealing: 10 of the 18 models systematically skipped a mandatory confirmation checkpoint during checkout. Crucially, these models still achieved perfect scores in traditional metrics like TSR and HF1, meaning their "shortcuts" were invisible to standard testing. For example, GPT-4.1 was found to be taking these hidden shortcuts, whereas GPT-5.2 demonstrated perfect adherence to the required workflow.
Improving System Reliability
The study demonstrates that using ASR as a diagnostic tool leads to significantly better system performance. By identifying where agents were deviating, the researchers were able to implement prompt refinements and deterministic routing guards. These adjustments resulted in substantial improvements, with some previously struggling models seeing their Task Success Rate increase by up to 93.8 percentage points. This highlights that for high-stakes, regulated domains, evaluating the path taken—not just the final result—is essential for building safe and reliable AI systems.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!