Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing

Paper Abstract

The rapid integration of Large Language Models (LLMs) has driven the evolution of Multi-Agent Systems (MAS), where specialized agents collaborate to execute complex workflows. Effective orchestration in these environments requires robust routing mechanisms to efficiently allocate tasks to the most suitable agent. However, existing routers fundamentally rely on unverified proxies, ranging from textual self-descriptions to static surrogate representations, to gauge an agent's competence. This reliance on non-empirical data creates a critical gap between an agent's projected profile and its actual operational capabilities, introducing severe security vulnerabilities. Malicious agents can easily misrepresent their proficiencies or harbor covert backdoors that evade both standard external analysis and static representation-learning techniques. In this work, we introduce ANTAP (Automatic Non-Textual Agent Picker), an evaluation-driven routing architecture that discards indirect proxies in favor of active capability testing. By dynamically querying agents to ascertain their true competencies empirically, ANTAP distills performance into fixed behavioral operators within a shared semantic space. At inference time, routing is performed via a purely non-textual algebraic projection, establishing a "linguistic firewall" that renders metadata-based attacks inexpressible. In our experiments, ANTAP achieves a near-zero attack success rate (ASR) against description-based injection attacks, compared to 67.3% and above for the description-based router baseline. Against adaptive embedding attacks, ANTAP achieves substantially lower ASR than the embedding-based baseline, with a 20% reduction, while remaining resilient to description manipulation by design.

Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing
Multi-Agent Systems (MAS) rely on routers to decide which specialized agent should handle a user's request. Traditionally, these routers function by reading natural-language descriptions of an agent's capabilities. This reliance on text creates a significant security vulnerability: malicious agents can embed "jailbreak" instructions or deceptive metadata into their descriptions to hijack the router and force it to assign them tasks. This paper introduces ANTAP (Automatic Non-Textual Agent Picker), a new routing architecture that replaces text-based decision-making with empirical, geometric evaluation to eliminate this attack surface.

Moving Beyond Text-Based Routing

Routers that read agent descriptions are exposed because they feed agent-supplied text into the component that makes the routing decision. That text is untrusted input, and processing it can lead to "Router Hijacking." ANTAP closes this channel by discarding textual metadata entirely at inference time. Instead of interpreting what an agent claims it can do, ANTAP measures what an agent actually does. By testing agents on trusted benchmark tasks, the system builds a mathematical profile of each agent’s true performance. The result is a "linguistic firewall": because the router never reads descriptions when routing, injected text has no channel through which to influence the decision.

How ANTAP Works

The ANTAP framework operates in two distinct phases. During the offline registration phase, the system evaluates each agent on a set of benchmark queries. For each query it assigns a binary score: success if the agent provides a correct answer and avoids unauthorized actions, failure otherwise. These results are used to create a "behavioral operator," a fixed linear mathematical representation of the agent’s competence.
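
The registration phase can be pictured with the short sketch below. The summary only states that binary benchmark outcomes are distilled into a fixed linear operator, so the fitting rule used here (the mean embedding of passed queries minus the mean embedding of failed ones) is an assumption chosen for simplicity and is not the paper's method. The function names and the toy 2-dimensional embeddings are likewise hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def binary_score(correct: bool, unauthorized_action: bool) -> bool:
    """Success only if the answer is correct AND no unauthorized action occurred."""
    return correct and not unauthorized_action


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Component-wise mean; returns the zero vector for an empty input."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fit_behavioral_operator(
    query_embeddings: Sequence[Vector], passed: Sequence[bool]
) -> Vector:
    """ASSUMED fitting rule (not from the paper): centroid of passed benchmark
    queries minus centroid of failed ones. The result is a fixed linear
    functional over the query embedding space."""
    if not query_embeddings or len(query_embeddings) != len(passed):
        raise ValueError("need one pass/fail outcome per benchmark query")
    dim = len(query_embeddings[0])
    wins = [e for e, ok in zip(query_embeddings, passed) if ok]
    losses = [e for e, ok in zip(query_embeddings, passed) if not ok]
    win_centroid = mean_vector(wins, dim)
    loss_centroid = mean_vector(losses, dim)
    return [w - l for w, l in zip(win_centroid, loss_centroid)]


# Toy benchmark: axis 0 stands for "math-like" queries, axis 1 for "code-like".
benchmark: List[Vector] = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# (correct, unauthorized_action) per query; made-up outcomes for illustration.
outcomes: List[Tuple[bool, bool]] = [
    (True, False),
    (True, False),
    (True, True),   # correct answer, but an unauthorized action: counts as failure
    (False, False),
]
passed = [binary_score(c, u) for c, u in outcomes]
math_operator = fit_behavioral_operator(benchmark, passed)
```

In this toy run the operator points toward the region of embedding space where the agent succeeded and away from where it failed. Note that the third query counts as a failure even though the answer was correct, because the agent took an unauthorized action.
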
At inference time, the online routing phase is computationally cheap. When a user submits a query, the system converts it into a numerical embedding. It then performs a single matrix-vector multiplication to project the query against the precomputed behavioral operators. The router selects the agent whose operator yields the highest score for the query, without ever reading or processing any natural-language descriptions.
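
A minimal sketch of the routing step follows, assuming each behavioral operator is a single vector and that "best aligns" means the largest dot product with the query embedding. The agent names, operator values, and the `route` function are hypothetical, and a real system would obtain the query embedding from an embedding model.

```python
from typing import Dict, List

Vector = List[float]


def route(query_embedding: Vector, operators: Dict[str, Vector]) -> str:
    """Return the agent whose precomputed operator scores the query highest.

    Stacking the operators as rows of a matrix W turns the scoring into one
    matrix-vector product W @ q, followed by an argmax. Agent descriptions
    are not an input, so description text cannot affect the result.
    """
    if not operators:
        raise ValueError("no registered agents")
    scores: Dict[str, float] = {}
    for name, op in operators.items():
        if len(op) != len(query_embedding):
            raise ValueError(f"dimension mismatch for agent {name!r}")
        scores[name] = sum(w * q for w, q in zip(op, query_embedding))
    return max(scores, key=lambda name: scores[name])


# Hypothetical operators produced by an offline registration phase.
operators = {
    "math_agent": [0.9, -0.9],
    "code_agent": [-0.8, 0.8],
}
query = [0.2, 0.7]  # stand-in for the output of a real embedding model
chosen = route(query, operators)
```

Because `route` takes only numbers, an agent that rewrites its description changes nothing the router can see, which is the property the "linguistic firewall" refers to.
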

Security and Performance Results

In experimental testing, ANTAP improved both security and reliability. Against description-based injection attacks, where the description-based router baseline saw attack success rates as high as 75.3%, ANTAP achieved a near-zero attack success rate (0.2%). The system also proved resilient against "sleeper agents," malicious entities that hide their intent until triggered, because their failure to perform correctly during the offline benchmark phase keeps them from being selected for real-world tasks. Beyond security, ANTAP also outperformed baseline models in overall routing accuracy, which suggests that empirical, behavior-based selection is more reliable than relying on potentially biased or misleading agent descriptions.
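
For readers unfamiliar with the metric, attack success rate can be read as the fraction of attacked queries that the router sends to the attacker's agent. The sketch below assumes that definition, which the summary does not spell out, and uses made-up routing decisions that are not the paper's data.

```python
from typing import Sequence


def attack_success_rate(routed_to: Sequence[str], malicious_agent: str) -> float:
    """Fraction of attacked queries routed to the malicious agent (assumed definition)."""
    if not routed_to:
        raise ValueError("need at least one routing decision")
    hits = sum(1 for agent in routed_to if agent == malicious_agent)
    return hits / len(routed_to)


# Illustrative routing log only; "evil" is a hypothetical malicious agent name.
decisions = ["evil", "math", "evil", "code"]
asr = attack_success_rate(decisions, "evil")
```
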

Key Considerations

ANTAP provides a robust defense against metadata-based attacks, but it relies on a linear approximation of agent behavior. The router is therefore designed to capture coarse-grained capability alignment, not to predict an agent's exact output. The system's security also depends on the quality of the offline benchmark data: because the behavioral operators are built from these trusted inputs, the integrity of the benchmark process is a critical foundation for the system's overall effectiveness.
