TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix is a verification-first pipeline for the coordination failures common in multi-agent systems built on Large Language Models (LLMs). When multiple independent agents work together, they can run into concurrency hazards such as deadlocks, race conditions, or missed messages, which are difficult to predict at design time. TraceFix uses formal verification to check and automatically repair the coordination protocols that LLMs generate, so that agents interact safely before they are deployed.
How the Pipeline Works
The TraceFix process follows a structured loop to move from a natural-language task description to a verified, executable protocol. First, an orchestration agent creates a "protocol topology," which acts as a blueprint defining the agents, shared resources (like locks), and communication channels.
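To make the idea of a protocol topology concrete, here is a minimal Python sketch of such a blueprint. The class names, field names, and the `validate` check are illustrative assumptions, not TraceFix's actual schema; the sketch only shows agents, locks, and channels as declared data that can be sanity-checked before any coordination logic is generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """A one-way message channel between two agents (hypothetical shape)."""
    name: str
    sender: str
    receiver: str


@dataclass
class Topology:
    """Hypothetical blueprint: who exists, what is shared, who talks to whom."""
    agents: set[str]
    locks: set[str]
    channels: dict[str, Channel]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the blueprint is consistent."""
        errors = []
        for ch in self.channels.values():
            for endpoint in (ch.sender, ch.receiver):
                if endpoint not in self.agents:
                    errors.append(f"channel {ch.name}: unknown agent {endpoint}")
        return errors


# Example: two agents, one shared lock, one channel from planner to worker.
topology = Topology(
    agents={"planner", "worker"},
    locks={"db"},
    channels={"tasks": Channel("tasks", "planner", "worker")},
)
print(topology.validate())
```

Keeping the topology as plain data means the same declaration can feed both the later verification step and a runtime check.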
Next, the system generates coordination logic in PlusCal, an algorithm language that translates to TLA+ and describes how each agent should behave. This logic is fed into TLC, the TLA+ model checker, which exhaustively explores the possible interleavings of events to find bugs. If the checker finds a flaw, it produces a counterexample: a specific trace of events that leads to the failure. The system uses this trace as evidence to repair the protocol automatically, and repeats the check until the model checker reports no violations.
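The check-and-repair loop can be illustrated with a toy explicit-state checker in Python. This is not TLC or PlusCal, and it is not TraceFix's implementation: `find_deadlock`, `order_locks`, and `repair_loop` are hypothetical names, the repair step is a hand-written stand-in for what the article describes as an automatic repair, and the budget of four repairs is an assumption chosen for the sketch. It only shows the shape of the idea: explore every interleaving, return a counterexample trace on failure, repair, and re-check.

```python
from collections import deque


def find_deadlock(programs):
    """Breadth-first search over all interleavings of lock operations.

    Returns the shortest trace leading to a deadlock, or None if none exists.
    """
    agents = sorted(programs)
    locks = sorted({lock for prog in programs.values() for _, lock in prog})
    init = (tuple(0 for _ in agents), tuple(None for _ in locks))
    parent = {init: None}  # state -> (previous state, action label)
    queue = deque([init])
    while queue:
        state = queue.popleft()
        pcs, holders = state
        successors = []
        for i, agent in enumerate(agents):
            if pcs[i] == len(programs[agent]):
                continue  # this agent has finished
            op, lock = programs[agent][pcs[i]]
            j = locks.index(lock)
            if op == "acquire" and holders[j] is not None:
                continue  # blocked: the lock is held
            new_holders = list(holders)
            new_holders[j] = agent if op == "acquire" else None
            new_pcs = list(pcs)
            new_pcs[i] += 1
            successors.append(
                ((tuple(new_pcs), tuple(new_holders)), f"{agent}: {op} {lock}")
            )
        done = all(pcs[i] == len(programs[a]) for i, a in enumerate(agents))
        if not successors and not done:
            # Deadlock: walk parent links back to the initial state.
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for nxt, label in successors:
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None


def order_locks(programs, trace):
    """Stand-in repair: acquire locks in one global order, release in reverse."""
    repaired = {}
    for agent, prog in programs.items():
        wanted = sorted({lock for _, lock in prog})
        repaired[agent] = [("acquire", l) for l in wanted] + [
            ("release", l) for l in reversed(wanted)
        ]
    return repaired


def repair_loop(programs, repair, max_repairs=4):
    """Check, repair on counterexample, and re-check until verified."""
    repairs = 0
    while True:
        trace = find_deadlock(programs)
        if trace is None:
            return programs, repairs
        if repairs == max_repairs:
            raise RuntimeError("no verified protocol within the repair budget")
        programs = repair(programs, trace)
        repairs += 1


# Two agents take the same two locks in opposite order: a classic deadlock.
buggy = {
    "A": [("acquire", "L1"), ("acquire", "L2"), ("release", "L2"), ("release", "L1")],
    "B": [("acquire", "L2"), ("acquire", "L1"), ("release", "L1"), ("release", "L2")],
}
print(find_deadlock(buggy))  # ['A: acquire L1', 'B: acquire L2']
verified, repairs_used = repair_loop(buggy, order_locks)
print(repairs_used)  # 1
```

The counterexample is the useful part: it names the exact interleaving that fails, which gives the repair step something concrete to act on.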
From Verification to Execution
Once a protocol is verified, TraceFix compiles the logic into system prompts for each agent. To keep these agents on the verified plan during operation, the system employs a runtime monitor. This monitor acts as a gatekeeper, rejecting any coordination action (such as sending a message or acquiring a lock) that falls outside the rules defined in the original topology. Agents therefore keep their autonomy in domain-specific work while adhering to a safe coordination framework.
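A gatekeeper of this kind might look like the following Python sketch. `CoordinationMonitor` and its methods are hypothetical, and whether the real monitor tracks lock ownership the way this one does is an assumption. The sketch checks each action against the declared agents, locks, and channels, and refuses anything outside them.

```python
class CoordinationMonitor:
    """Hypothetical gatekeeper: admits only actions the topology declares."""

    def __init__(self, agents, locks, channels):
        self.agents = set(agents)
        self.locks = set(locks)
        self.channels = dict(channels)  # channel name -> (sender, receiver)
        self.holders = {}  # lock name -> agent currently holding it

    def send(self, agent, channel):
        """Allow a send only on a declared channel, by its declared sender."""
        endpoints = self.channels.get(channel)
        return endpoints is not None and endpoints[0] == agent

    def acquire(self, agent, lock):
        """Allow an acquire only for a known agent on a declared, free lock."""
        if agent not in self.agents or lock not in self.locks:
            return False
        if self.holders.get(lock) is not None:
            return False
        self.holders[lock] = agent
        return True

    def release(self, agent, lock):
        """Allow a release only by the agent that holds the lock."""
        if self.holders.get(lock) != agent:
            return False
        del self.holders[lock]
        return True


monitor = CoordinationMonitor(
    agents={"planner", "worker"},
    locks={"db"},
    channels={"tasks": ("planner", "worker")},
)
print(monitor.send("planner", "tasks"))  # True: declared sender on a declared channel
print(monitor.send("worker", "tasks"))   # False: worker is not the sender
print(monitor.acquire("worker", "cache"))  # False: "cache" is not a declared lock
```

Note that this sketch validates each action in isolation; it does not check that actions arrive in the order the verified protocol prescribes.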
Performance and Reliability
In tests across 48 diverse tasks, every task reached full verification: 62.5% (30 of 48) passed on the first attempt, and none required more than four repair iterations. Even for complex scenarios with millions of possible states, verification completed in under 60 seconds per task.
A comparative study showed that using these verified protocols significantly improved task completion rates and made the system more resilient. When model capabilities were reduced, systems using TraceFix degraded at roughly half the rate of standard prompt-only or chat-only baselines. Furthermore, the use of verified protocols cut the occurrence of deadlocks and livelocks by more than half, particularly in scenarios where faults were intentionally injected.
Important Considerations
While TraceFix is effective at ensuring coordination safety, it has specific limitations. The system focuses on coordination correctness (ensuring that agents do not deadlock or violate resource access rules) rather than on the quality of the agents' output or the accuracy of their domain-specific work. The current implementation also does not verify liveness properties, such as guaranteeing that a task will always finish or that an agent will never be starved of resources. Finally, the runtime monitor enforces the coordination interface but not the specific sequence of steps, so some coordination issues can still occur if agents deviate from their intended logic.