Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing
Multi-Agent Systems (MAS) rely on routers to decide which specialized agent should handle a user's request. Traditionally, these routers function by reading natural-language descriptions of an agent's capabilities. This reliance on text creates a significant security vulnerability: malicious agents can embed "jailbreak" instructions or deceptive metadata into their descriptions to hijack the router and force it to assign them tasks. This paper introduces ANTAP (Automatic Non-Textual Agent Picker), a new routing architecture that replaces text-based decision-making with empirical, geometric evaluation to eliminate this attack surface.

Moving Beyond Text-Based Routing

Current routing methods are inherently insecure because they treat agent descriptions as active instructions. When a router reads a description, it is processing untrusted text written by the agent itself, so instructions embedded in that text can steer the routing decision. This is "Router Hijacking." ANTAP addresses the problem by discarding textual metadata entirely at the inference stage. Instead of interpreting what an agent claims it can do, ANTAP measures what an agent actually does. By testing agents on trusted benchmark tasks, the system builds a mathematical profile of each agent's true performance. The result is a "linguistic firewall": because the router never reads descriptions, injected text has no path to the routing decision.

How ANTAP Works

The ANTAP framework operates in two distinct phases. During the offline registration phase, the system evaluates each agent on a set of benchmark queries. It records whether the agent provides a correct answer and avoids unauthorized actions, assigning a binary success or failure score. These results are used to create a "behavioral operator"—a fixed linear mathematical representation of the agent’s competence.
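
The registration phase can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `TrialResult`, `binary_score`, and `fit_operator`, the toy two-dimensional embeddings, and the signed-average fitting rule are all hypothetical. The source only says that binary benchmark outcomes are turned into a fixed linear representation; it does not say how that representation is fitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialResult:
    """One agent run on one trusted benchmark query (hypothetical record)."""
    query_embedding: List[float]
    answer_correct: bool
    unauthorized_action: bool


def binary_score(result: TrialResult) -> int:
    """Return 1 only if the answer is correct AND no unauthorized action occurred."""
    return 1 if result.answer_correct and not result.unauthorized_action else 0


def fit_operator(results: List[TrialResult]) -> List[float]:
    """Assumed fitting rule: average the benchmark query embeddings,
    signed +1 for a success and -1 for a failure."""
    dim = len(results[0].query_embedding)
    operator = [0.0] * dim
    for r in results:
        sign = 1.0 if binary_score(r) == 1 else -1.0
        for j, x in enumerate(r.query_embedding):
            operator[j] += sign * x / len(results)
    return operator


# Toy benchmark: axis 0 stands for "math-like" queries, axis 1 for "code-like"
# queries. These embeddings are illustrative, not real model outputs.
math_agent_trials = [
    TrialResult([1.0, 0.0], answer_correct=True, unauthorized_action=False),
    TrialResult([0.0, 1.0], answer_correct=False, unauthorized_action=False),
    # Correct answer, but an unauthorized action was taken: still a failure.
    TrialResult([0.0, 1.0], answer_correct=True, unauthorized_action=True),
    TrialResult([1.0, 0.0], answer_correct=True, unauthorized_action=False),
]

math_operator = fit_operator(math_agent_trials)
print(math_operator)  # [0.5, -0.5]
```

The resulting vector points toward the region of embedding space where the agent succeeded and away from where it failed. Nothing in it is derived from the agent's description.
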
At inference time, the online routing phase is extremely efficient. When a user submits a query, the system converts it into a numerical embedding. It then performs a single matrix-vector multiplication to project the query against the precomputed behavioral operators. The router simply selects the agent whose mathematical profile best aligns with the query, without ever reading or processing any natural-language descriptions.
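
The online step is small enough to show in full. The sketch below assumes the behavioral operators are stacked as rows of a matrix and that "best aligns" means the largest dot product with the query embedding. The agent names, the operator values, and the `route` and `matvec` helpers are hypothetical, and the embedding model is out of scope, so the query is passed in already embedded.

```python
from typing import List

# Hypothetical precomputed behavioral operators, one row per agent.
# In a real deployment these would come from the offline registration phase.
AGENT_NAMES: List[str] = ["math_agent", "code_agent", "sleeper_agent"]
OPERATORS: List[List[float]] = [
    [0.5, -0.5],   # succeeded on math-like benchmark queries
    [-0.4, 0.6],   # succeeded on code-like benchmark queries
    [-0.5, -0.5],  # failed the benchmark on both kinds of query
]


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    """Plain matrix-vector product: one alignment score per agent."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def route(query_embedding: List[float]) -> str:
    """Pick the agent whose operator scores highest for this query.
    No agent description is read anywhere in this function."""
    scores = matvec(OPERATORS, query_embedding)
    best = max(range(len(scores)), key=scores.__getitem__)
    return AGENT_NAMES[best]


print(route([0.9, 0.1]))  # math_agent
print(route([0.1, 0.9]))  # code_agent
```

Because `route` takes only a numeric vector and a numeric matrix, text placed in an agent's description has no argument through which to reach it. An agent that performed poorly at registration, like the hypothetical `sleeper_agent` row, scores low for both example queries.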

Security and Performance Results

In experimental testing, ANTAP improved both security and reliability. Description-based injection attacks succeeded against traditional routers as often as 75.3% of the time, while the attack success rate against ANTAP was near zero (0.2%). The system also proved resilient against "sleeper agents," malicious entities that hide their intent until triggered: their failure to perform correctly during the offline benchmark phase ensures they are not selected for real-world tasks. Beyond security, ANTAP also outperformed baseline models in overall routing accuracy, which suggests that empirical, behavior-based selection is more reliable than relying on potentially biased or misleading agent descriptions.

Key Considerations

While ANTAP provides a robust defense against metadata-based attacks, it is important to note that the system relies on a linear approximation of agent behavior. This means the router is designed to capture coarse-grained capability alignment rather than predicting the exact output of an agent. Additionally, the security of the system depends on the quality of the offline benchmark data; because the router is trained on these trusted inputs, the integrity of the benchmark process itself remains a critical foundation for the system's overall effectiveness.
