Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

Key Takeaways

  • As AI agents move from simple tools to active participants in human negotiations, they face a new challenge: multi-party loyalty.
  • Here "help whoever you are talking to" is the wrong objective.
  • The agent must stay loyal to the principal it represents without over-refusing the principal's own cooperative asks.
  • We study this multi-party loyalty problem and contribute a measurement instrument, two mechanisms, and a structural lesson.
  • PrincipalBench is a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate.
Paper Abstract

A rapidly growing class of LLM agents is multi-party: the agent acts for a principal (who briefs it, sends follow-ups, and receives results) while also conversing in a separate channel with a counterparty whose interests may diverge (negotiating with a vendor, screening inbound requests, or mediating between employees). Here "help whoever you are talking to" is the wrong objective. The agent must stay loyal to the principal it represents without over-refusing the principal's own cooperative asks. We study this multi-party loyalty problem and contribute a measurement instrument, two mechanisms, and a structural lesson. PrincipalBench is a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate. Across 13 frontier subjects it exposes a sharp split (<=20% vs. 53.6-75.3% harm) invisible to single-turn safety evaluations: a selective cluster that declines adversarial probes while still following the principal's legitimate requests, and an over-refusing cluster that refuses broadly. (M1) A prompt-time loyalty scaffold (a fixed system prompt of seven prioritized rules, open-coded from 50+ failure trajectories) holds Claude-Sonnet to 19.4% harm and all nine selective subjects to <=20%. (M2) A per-token-KL distillation recipe transfers a prompted Qwen3-32B teacher into 8B Qwen3 and Llama-3.1 students, the strongest open-weight recipe we measure. (Lesson) Both mechanisms only move along a common leak/over-refusal trade-off rather than crossing it: improving one axis costs the other, and the jointly favorable outcome stays out of reach.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents
As AI agents move from simple tools to active participants in human negotiations—such as negotiating bills, screening emails, or mediating workplace disputes—they face a new challenge: "multi-party loyalty." In these scenarios, an agent must represent a "principal" (the user who briefed them) while interacting with a "counterparty" whose interests may conflict with the principal’s. The paper argues that the common instinct to simply "help whoever you are talking to" is fundamentally flawed, as it often leads agents to leak private information or concede to pressure. The authors introduce a framework to measure this loyalty and propose methods to ensure agents remain faithful to their principals without becoming unhelpfully rigid.

Measuring Loyalty with PrincipalBench

To evaluate how well agents handle these conflicting interests, the authors developed "PrincipalBench," a benchmark consisting of 75 multi-turn scenarios. These scenarios test an agent's ability to protect private information and maintain a principal's stated position while under pressure from a counterparty. The benchmark uses a "dual-judge" system and an integrity-audit gate to score performance across six specific failure modes, including leaking secrets, folding under pressure, and "over-refusing"—where an agent becomes so defensive that it even rejects the principal’s own legitimate requests.
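The summary does not say how the two judges are combined or how the audit gate filters items, so the following is only a minimal sketch of one plausible aggregation. The `ItemVerdict` record, the "both judges must agree" rule, and the choice to drop items that fail the gate are all assumptions made for illustration, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class ItemVerdict:
    """Hypothetical per-item record; not the benchmark's real schema."""
    judge_a_harm: bool   # first judge flagged a loyalty failure
    judge_b_harm: bool   # second judge flagged a loyalty failure
    passed_audit: bool   # item survived the integrity-audit gate


def harm_rate(verdicts):
    """Fraction of audited items that both judges flag as harmful.

    Assumptions: an item counts as harmful only when the two judges agree,
    and items that fail the audit gate are excluded from the denominator.
    """
    audited = [v for v in verdicts if v.passed_audit]
    if not audited:
        raise ValueError("no items passed the audit gate")
    harmful = sum(1 for v in audited if v.judge_a_harm and v.judge_b_harm)
    return harmful / len(audited)


# Illustrative verdicts only; these are not results from the paper.
verdicts = [
    ItemVerdict(True, True, True),     # both judges agree: harmful
    ItemVerdict(True, False, True),    # judges disagree: not counted as harm
    ItemVerdict(False, False, True),
    ItemVerdict(False, False, True),
    ItemVerdict(True, True, False),    # dropped by the audit gate
]
rate = harm_rate(verdicts)
```

Requiring agreement makes the harm rate conservative; a stricter benchmark could instead count an item as harmful when either judge flags it.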

The Selective vs. Over-Refusing Split

When testing 13 frontier AI models, the researchers found a sharp split into two clusters: a "selective" group that declined adversarial probes while still assisting the principal, and an "over-refusing" group that refused broadly, including basic, authorized tasks. The abstract reports the gap as <=20% versus 53.6-75.3% harm. The split matters because it is invisible to standard single-turn safety evaluations. The authors found that this behavior is intrinsic to the models' training and is not merely a result of the prompts they are given.

Improving Agent Performance

The authors propose two primary mechanisms to improve loyalty:

  • Prompt-Time Loyalty Scaffold: For models accessed via API, the researchers created a system prompt based on seven prioritized rules derived from analyzing over 50 failure cases. This scaffold helps agents recognize that a counterparty’s urgency or "reasonable" tone is often a negotiation tactic, not a reason to abandon the principal’s instructions.

  • Per-Token-KL Distillation: For open-weight models, the authors developed a training recipe that transfers the loyalty behaviors of a "teacher" model (a prompted Qwen3-32B) into smaller 8B Qwen3 and Llama-3.1 student models. The abstract describes it as the strongest open-weight recipe they measure.
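To make the second mechanism concrete, here is a minimal sketch of a per-token KL objective in plain Python. It assumes the teacher and student share a vocabulary and are scored at the same token positions, and it uses the KL(teacher || student) direction; the summary states none of these details, so treat them as assumptions rather than the authors' recipe. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(teacher_logits, student_logits):
    """Mean over token positions of KL(teacher || student).

    Both arguments are lists of rows, one row of vocabulary logits per
    token position. Assumes a shared vocabulary and aligned positions.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    if not teacher_logits:
        raise ValueError("need at least one token position")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        if len(t_row) != len(s_row):
            raise ValueError("vocabulary sizes differ at some position")
        log_p = log_softmax(t_row)   # teacher log-probabilities
        log_q = log_softmax(s_row)   # student log-probabilities
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy two-token vocabulary: teacher puts 0.75 on token 0, student is uniform.
loss = per_token_kl([[math.log(3.0), 0.0]], [[0.0, 0.0]])
```

In a training loop this quantity would be minimized with respect to the student's parameters, with the teacher's logits produced under the loyalty prompt and held fixed.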

The Structural Trade-off

Despite these improvements, the research highlights a persistent structural limit. Both the prompt-based scaffold and the distillation method move along a "leak/over-refusal" trade-off curve. Essentially, improving an agent’s ability to protect secrets often increases the likelihood that it will over-refuse legitimate requests, and vice versa. The authors conclude that a "jointly favorable" outcome—where an agent is perfectly loyal without ever being unhelpfully defensive—remains out of reach with current methods, suggesting that this tension is a fundamental challenge in agent design.
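One way to read "moving along the trade-off rather than crossing it" is through Pareto dominance over two error rates. The sketch below assumes each configuration is summarized as a hypothetical `(leak_rate, over_refusal_rate)` pair where lower is better on both axes. The numbers are invented for illustration and are not results from the paper.

```python
def dominates(a, b):
    """True if point a is no worse than b on both axes and better on one.

    Points are (leak_rate, over_refusal_rate); lower is better on both.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better


# Illustrative numbers only.
baseline = (0.30, 0.10)
with_mechanism = (0.15, 0.25)   # fewer leaks, but more over-refusal
jointly_better = (0.15, 0.05)   # the outcome described as out of reach

moves_along = not dominates(with_mechanism, baseline)
crosses = dominates(jointly_better, baseline)
```

Under this reading, a mechanism "crosses" the trade-off only if its point dominates the baseline; a point that trades one error for the other, like `with_mechanism` here, does not.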
