Key Takeaways

  • We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination.
  • Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations.
  • On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations.
  • State spaces span six orders of magnitude yet verification completes in under 60 s for every task.
Paper Abstract

We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix is a verification-first pipeline that targets the coordination failures common in multi-agent systems built on Large Language Models (LLMs). When multiple independent agents work together, they can run into concurrency hazards such as deadlocks, race conditions, and missed messages, which are difficult to predict at design time. TraceFix uses formal verification to check and repair the coordination protocols that LLMs generate, so that the coordination logic is model-checked before the agents are deployed.

How the Pipeline Works

The TraceFix process follows a structured loop to move from a natural-language task description to a verified, executable protocol. First, an orchestration agent creates a "protocol topology," which acts as a blueprint defining the agents, shared resources (like locks), and communication channels.
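As a rough illustration of what such a topology might contain, here is a minimal sketch in Python. The source does not describe the IR's actual schema, so the class names (`Topology`, `Channel`), their fields, and the example agents are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Channel:
    """A one-way message channel between two agents (hypothetical schema)."""
    name: str
    sender: str
    receiver: str


@dataclass
class Topology:
    """Illustrative protocol topology: agents, shared locks, and channels."""
    agents: set[str]
    locks: set[str] = field(default_factory=set)
    channels: list[Channel] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the topology is well-formed."""
        problems = []
        for ch in self.channels:
            for endpoint in (ch.sender, ch.receiver):
                if endpoint not in self.agents:
                    problems.append(f"channel {ch.name}: unknown agent {endpoint}")
        return problems


# Example topology with made-up agent, lock, and channel names.
topo = Topology(
    agents={"planner", "coder", "reviewer"},
    locks={"repo_lock"},
    channels=[
        Channel("plan", "planner", "coder"),
        Channel("patch", "coder", "reviewer"),
    ],
)
print(topo.validate())
```

Keeping the topology as plain data makes it usable both as input to protocol generation and as the rule set a runtime monitor can check against.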
Next, the system generates coordination logic in PlusCal, an algorithm language that translates to TLA+. The TLA+ model checker (TLC) then exhaustively explores the reachable states of the resulting model to find violations. If the checker finds one, it produces a counterexample: a specific trace of events that leads to the failure. The system uses this trace to automatically repair the protocol and repeats the process until TLC verification succeeds.
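The check-and-repair loop can be sketched with a toy stand-in for TLC. The Python below is not the TraceFix implementation: `find_deadlock` is a small breadth-first state explorer written for this example, the "protocol" is just the order in which each agent acquires locks, and `impose_global_order` is a hand-coded substitute for the LLM repair step. The budget of four repairs mirrors the maximum reported in the abstract.

```python
from collections import deque


def held(orders, state, i):
    """Locks agent i holds: the first pc locks of its order, or none once done."""
    pc = state[i]
    return set(orders[i][:pc]) if pc <= len(orders[i]) else set()


def successors(orders, state):
    """Yield (label, next_state) for every step enabled in `state`."""
    for i, pc in enumerate(state):
        order = orders[i]
        if pc < len(order):
            wanted = order[pc]
            taken = any(
                wanted in held(orders, state, j)
                for j in range(len(orders)) if j != i
            )
            if not taken:
                yield f"agent{i} acquires {wanted}", state[:i] + (pc + 1,) + state[i + 1:]
        elif pc == len(order):
            yield f"agent{i} releases all", state[:i] + (pc + 1,) + state[i + 1:]


def find_deadlock(orders):
    """Explore all reachable states; return a trace to a deadlock, or None."""
    init = tuple(0 for _ in orders)
    done = tuple(len(o) + 1 for o in orders)
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        nexts = list(successors(orders, state))
        if not nexts and state != done:
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in nexts:
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None


def impose_global_order(orders, trace):
    """Hand-coded repair (stand-in for the LLM): acquire locks in sorted order."""
    return [sorted(order) for order in orders]


def verify_and_repair(orders, repair, max_repairs=4):
    """Check, repair on counterexample, and repeat until the check passes."""
    for used in range(max_repairs + 1):
        trace = find_deadlock(orders)
        if trace is None:
            return orders, used
        if used < max_repairs:
            orders = repair(orders, trace)
    raise RuntimeError("repair budget exhausted")


buggy = [["A", "B"], ["B", "A"]]
print(find_deadlock(buggy))  # ['agent0 acquires A', 'agent1 acquires B']
print(verify_and_repair(buggy, impose_global_order))  # ([['A', 'B'], ['A', 'B']], 1)
```

The counterexample is the shortest path to a stuck state, which is the kind of concrete evidence a repair step can act on. In this toy the repair ignores the trace contents and applies a fixed lock ordering.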

From Verification to Execution

Once a protocol is verified, TraceFix compiles the verified process bodies into a system prompt for each agent. To keep the agents within the verified plan during operation, the system runs them under a runtime monitor. The monitor acts as a gatekeeper: it rejects any coordination operation, such as sending a message or acquiring a lock, that falls outside the topology. Agents keep their autonomy in domain-specific work while their coordination stays inside a checked framework.
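A gatekeeper of this kind can be sketched as follows. This is a minimal illustration, not the TraceFix monitor: the class name `TopologyMonitor`, the `RejectedOperation` exception, and the way channels and lock permissions are represented are all assumptions.

```python
from dataclasses import dataclass, field


class RejectedOperation(Exception):
    """Raised when an agent attempts an operation outside the topology."""


@dataclass
class TopologyMonitor:
    channels: set[tuple[str, str]]       # allowed (sender, receiver) pairs
    lock_users: dict[str, set[str]]      # lock name -> agents allowed to use it
    holders: dict[str, str] = field(default_factory=dict)  # lock name -> current holder

    def send(self, sender: str, receiver: str, payload: str) -> dict:
        """Forward a message only if the topology declares this channel."""
        if (sender, receiver) not in self.channels:
            raise RejectedOperation(f"no channel {sender} -> {receiver}")
        return {"from": sender, "to": receiver, "payload": payload}

    def acquire(self, agent: str, lock: str) -> bool:
        """Grant the lock if the agent may use it and it is free."""
        if agent not in self.lock_users.get(lock, set()):
            raise RejectedOperation(f"{agent} may not use lock {lock}")
        if lock in self.holders:
            return False  # lock is busy; the caller may retry later
        self.holders[lock] = agent
        return True

    def release(self, agent: str, lock: str) -> None:
        """Release a lock; only the current holder may do so."""
        if self.holders.get(lock) != agent:
            raise RejectedOperation(f"{agent} does not hold lock {lock}")
        del self.holders[lock]


monitor = TopologyMonitor(
    channels={("planner", "coder")},
    lock_users={"repo_lock": {"coder"}},
)
monitor.send("planner", "coder", "implement step 1")  # allowed
monitor.acquire("coder", "repo_lock")                 # allowed, returns True
try:
    monitor.send("coder", "planner", "done")          # channel not declared
except RejectedOperation as err:
    print("rejected:", err)
```

The monitor only needs the topology, not the agents' internal reasoning, so it can sit between any agent runtime and the shared resources.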

Performance and Reliability

Across 48 tasks spanning 16 scenario families, every task reached full TLC verification, with 62.5% passing on the first attempt and none requiring more than four repair iterations. Although the state spaces span six orders of magnitude, verification completed in under 60 seconds for every task.
A 3,456-run runtime comparison showed that topology-monitored execution achieved the highest task completion (89.4% average, 81.5% full). When model capability was reduced, runtimes using the verified protocol degraded at roughly half the rate of prompt-only and chat-only baselines. In a paired ablation under a fixed runtime, TLC-verified protocols cut deadlock/livelock from 31.1% to 14.1%, with the largest separation under fault injection.
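The reported percentages translate into counts and relative changes with a few lines of arithmetic. Only the figures 48, 62.5%, 31.1%, and 14.1% come from the abstract; the variable names are illustrative.

```python
tasks = 48
first_attempt = round(tasks * 0.625)    # tasks verified with no repair
needed_repair = tasks - first_attempt   # tasks that needed 1 to 4 repair iterations

dl_ll_before = 31.1                     # deadlock/livelock rate without TLC verification (%)
dl_ll_after = 14.1                      # deadlock/livelock rate with TLC-verified protocols (%)
absolute_drop = round(dl_ll_before - dl_ll_after, 1)              # percentage points
relative_drop = round(100 * (dl_ll_before - dl_ll_after) / dl_ll_before, 1)  # percent

print(first_attempt, needed_repair)     # 30 18
print(absolute_drop, relative_drop)     # 17.0 54.7
```

So 30 of the 48 tasks verified immediately, 18 needed at least one repair, and the deadlock/livelock rate fell by 17.0 percentage points, a relative reduction of about 55%.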

Important Considerations

TraceFix has specific limitations. It targets coordination correctness, meaning that agents do not deadlock or violate resource access rules, and it does not assess the quality of the agents' output or the accuracy of their domain-specific work. The current implementation also does not verify liveness properties, such as guaranteeing that a task will always finish or that an agent will never be starved of resources. Finally, the runtime monitor enforces the coordination interface but not the specific sequence of steps, so some coordination issues can still occur if agents deviate from their intended logic.
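The gap between enforcing the interface and enforcing the sequence can be shown with a small sketch. Both checks below are hypothetical and use made-up agent and resource names. The first models a monitor that only asks whether each operation is permitted by the topology, and the second models a stricter check that the source says the monitor does not perform.

```python
# Operations the topology permits, as (agent, action, target) triples.
ALLOWED_OPS = {
    ("worker", "acquire", "db_lock"),
    ("worker", "send", "reviewer"),
    ("worker", "release", "db_lock"),
}

# The order of steps in the (hypothetical) verified process body.
VERIFIED_ORDER = [
    ("worker", "acquire", "db_lock"),
    ("worker", "send", "reviewer"),
    ("worker", "release", "db_lock"),
]


def interface_ok(ops):
    """Interface-level check: every operation is one the topology allows."""
    return all(op in ALLOWED_OPS for op in ops)


def sequence_ok(ops):
    """Stricter check: operations match the verified order exactly."""
    return list(ops) == VERIFIED_ORDER


# An agent that sends before taking the lock uses only allowed operations,
# but not in the verified order.
deviating = [
    ("worker", "send", "reviewer"),
    ("worker", "acquire", "db_lock"),
    ("worker", "release", "db_lock"),
]

print(interface_ok(deviating))  # True
print(sequence_ok(deviating))   # False
```

The deviating run passes the interface check even though it departs from the verified order, which is the kind of residual issue the limitation describes.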
