Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents
As AI agents move from simple tools to active participants in human negotiations—such as negotiating bills, screening emails, or mediating workplace disputes—they face a new challenge: "multi-party loyalty." In these scenarios, an agent must represent a "principal" (the user who briefed them) while interacting with a "counterparty" whose interests may conflict with the principal’s. The paper argues that the common instinct to simply "help whoever you are talking to" is fundamentally flawed, as it often leads agents to leak private information or concede to pressure. The authors introduce a framework to measure this loyalty and propose methods to ensure agents remain faithful to their principals without becoming unhelpfully rigid.
Measuring Loyalty with PrincipalBench
To evaluate how well agents handle these conflicting interests, the authors developed "PrincipalBench," a benchmark consisting of 75 multi-turn scenarios. These scenarios test an agent's ability to protect private information and maintain a principal's stated position while under pressure from a counterparty. The benchmark uses a "dual-judge" system and an integrity-audit gate to score performance across six specific failure modes, including leaking secrets, folding under pressure, and "over-refusing"—where an agent becomes so defensive that it even rejects the principal’s own legitimate requests.
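One way to picture how a dual-judge score with an audit gate could be combined is sketched below. This is a minimal illustration, not the benchmark's actual scoring code: the names `score_scenario` and `failure_rates` are hypothetical, the rule that a failure counts only when both judges agree is an assumption, and only the three failure modes named above are included because the summary does not list the other three.

```python
from typing import Dict, List, Optional

# Only the failure modes the summary names; the remaining three are not listed there.
FAILURE_MODES = ("leak", "fold", "over_refusal")

Verdict = Dict[str, bool]  # failure mode -> did this judge flag it?
Scored = Optional[Dict[str, Optional[bool]]]


def score_scenario(judge_a: Verdict, judge_b: Verdict, audit_passed: bool) -> Scored:
    """Combine two judge verdicts for one scenario (assumed combination rule).

    Returns None when the integrity audit fails, so the scenario is excluded.
    Otherwise maps each mode to True/False when the judges agree, or to None
    when they disagree (left unresolved in this sketch).
    """
    if not audit_passed:
        return None
    combined: Dict[str, Optional[bool]] = {}
    for mode in FAILURE_MODES:
        a = judge_a.get(mode, False)
        b = judge_b.get(mode, False)
        combined[mode] = a if a == b else None
    return combined


def failure_rates(scored: List[Scored]) -> Dict[str, float]:
    """Per-mode failure rate over audited scenarios where both judges agreed."""
    rates = {}
    for mode in FAILURE_MODES:
        resolved = [s[mode] for s in scored if s is not None and s[mode] is not None]
        rates[mode] = sum(resolved) / len(resolved) if resolved else 0.0
    return rates
```

Keeping disagreements as unresolved rather than forcing a verdict is a design choice of this sketch; a real pipeline might instead send them to a third judge or a human reviewer.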
The Selective vs. Over-Refusing Split
When testing 13 frontier AI models, the researchers discovered a sharp, bimodal divide. Models fell into one of two clusters: a "selective" group that successfully declined adversarial probes while still assisting the principal, and an "over-refusing" group that failed by refusing to perform even basic, authorized tasks. This split is significant because it is invisible to standard, single-turn safety evaluations. The authors found that this behavior is intrinsic to the models' training and is not merely a result of the prompts they are given.
Improving Agent Performance
The authors propose two primary mechanisms to improve loyalty:
Prompt-Time Loyalty Scaffold: For models accessed via API, the researchers created a system prompt based on seven prioritized rules derived from analyzing over 50 failure cases. This scaffold helps agents recognize that a counterparty’s urgency or "reasonable" tone is often a negotiation tactic, not a reason to abandon the principal’s instructions.
Per-Token-KL Distillation: For open-weight models, the authors developed a training recipe that transfers the loyalty behaviors of a "teacher" model (a prompted Qwen3-32B) into smaller, more efficient student models like Llama-3.1.
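The distillation idea can be illustrated with a small per-token KL loss. The sketch below is written under stated assumptions and is not the authors' recipe: it uses the forward direction KL(teacher || student), averages uniformly over token positions, and assumes teacher and student logits are aligned position by position over the same vocabulary, which the summary does not address for models with different tokenizers. The function names are hypothetical.

```python
import math
from typing import List

Logits = List[List[float]]  # one row of vocabulary logits per token position


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def per_token_kl(teacher_logits: Logits, student_logits: Logits) -> List[float]:
    """KL(teacher || student) at each token position (assumed direction)."""
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        kls.append(sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp)))
    return kls


def distillation_loss(teacher_logits: Logits, student_logits: Logits) -> float:
    """Mean per-token KL over the sequence; the student is trained to minimize it."""
    kls = per_token_kl(teacher_logits, student_logits)
    return sum(kls) / len(kls)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two distributions diverge, which is what lets the student absorb the prompted teacher's behavior without seeing the prompt itself.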
The Structural Trade-off
Despite these improvements, the research highlights a persistent structural limit. Both the prompt-based scaffold and the distillation method move along a "leak/over-refusal" trade-off curve. Essentially, improving an agent’s ability to protect secrets often increases the likelihood that it will over-refuse legitimate requests, and vice versa. The authors conclude that a "jointly favorable" outcome—where an agent is perfectly loyal without ever being unhelpfully defensive—remains out of reach with current methods, suggesting that this tension is a fundamental challenge in agent design.
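The trade-off can be made concrete by treating each configuration as a point (leak rate, over-refusal rate) and asking which points are not beaten on both axes at once. The sketch below is illustrative only: `pareto_front` and `jointly_favorable` are hypothetical helpers, the threshold is an arbitrary assumption, and the numbers in the usage example are made up rather than taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, leak_rate, over_refusal_rate); lower is better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the configurations that no other configuration dominates."""
    front = []
    for name, leak, refuse in points:
        dominated = any(
            l2 <= leak and r2 <= refuse and (l2 < leak or r2 < refuse)
            for _, l2, r2 in points
        )
        if not dominated:
            front.append((name, leak, refuse))
    return sorted(front, key=lambda p: p[1])


def jointly_favorable(points: List[Point], threshold: float) -> List[str]:
    """Names of configurations below the (assumed) threshold on both rates."""
    return [n for n, leak, refuse in points if leak < threshold and refuse < threshold]


# Made-up numbers for illustration; not results from the paper.
example = [
    ("baseline", 0.40, 0.40),
    ("scaffold", 0.10, 0.50),
    ("distilled", 0.30, 0.20),
    ("strict", 0.50, 0.10),
]
print(pareto_front(example))
print(jointly_favorable(example, threshold=0.15))
```

In this toy data every point on the front trades one rate for the other, and none falls below the threshold on both, which is the shape of the limit described above.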