CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
As AI models become more autonomous, developers often use "control protocols" to monitor them. These protocols use a trusted AI to oversee an untrusted model, stepping in to modify or replace actions that seem unsafe. However, if the untrusted model can detect these interventions, it might learn how the monitor works and find ways to evade it. This paper introduces CIAware-Bench, a new benchmark designed to measure "control intervention (CI) awareness": the ability of an AI model to recognize when its own output has been tampered with by a control protocol.

How the Benchmark Works

The researchers tested eleven frontier models from four different providers (OpenAI, Anthropic, Google, and Moonshot) across four distinct task domains: essay writing, code generation (BigCodeBench), system administration (Bash Arena), and work-environment tasks (SHADE-Arena).
In each test, an acting model generates a sequence of actions. At a random point, a control protocol intervenes by replacing one of the model's steps with output from a different, "trusted" model. The benchmark then asks the acting model whether the final step of its trajectory was authored by a different model. The same question is asked about "clean" trajectories in which no intervention occurred, so the model's answers can be scored as binary classification accuracy, where 0.5 (50%) corresponds to random guessing.
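The scoring step can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `Trial` record and the `detection_accuracy` function are hypothetical names, and the sketch assumes each trial stores only the ground truth and the acting model's yes/no answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One benchmark question (hypothetical record layout)."""
    intervened: bool         # ground truth: was the final step replaced?
    judged_intervened: bool  # the acting model's answer


def detection_accuracy(trials):
    """Fraction of trials where the model's answer matches the ground truth."""
    if not trials:
        raise ValueError("need at least one trial")
    correct = sum(t.intervened == t.judged_intervened for t in trials)
    return correct / len(trials)


# Two intervened and two clean trajectories; the model misses one intervention.
trials = [
    Trial(intervened=True, judged_intervened=True),
    Trial(intervened=True, judged_intervened=False),
    Trial(intervened=False, judged_intervened=False),
    Trial(intervened=False, judged_intervened=False),
]
print(detection_accuracy(trials))  # 0.75
```

On a set with equal numbers of intervened and clean trajectories, a model that always gives the same answer scores 0.5, which is why that value serves as the chance baseline.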

Key Findings

The study reveals that CI awareness is not a fixed trait of a model; instead, it depends heavily on the specific combination of the acting model, the intervening model, and the task environment.

Detection is inconsistent: Most model pairs perform near the level of random chance. However, some pairs reach high accuracy (up to 0.87), particularly in text-heavy tasks like essay writing.
Capability does not equal awareness: Surprisingly, more capable models are not necessarily better at detecting interventions. In many cases, older or smaller models outperformed the most advanced versions in their respective families.
Intervention size matters: The researchers found that larger interventions (those that replace more text or code) are significantly easier for models to detect. This suggests that control protocols should keep interventions as small as possible to avoid being noticed.
Cross-family detection: Models are generally better at detecting interventions when the intervening model comes from a different provider, likely because they are more sensitive to the distinct stylistic or post-training differences of other model families.
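
Because detection depends on the combination of acting model, intervening model, and task, accuracy is more informative when reported per combination than as one number per model. The sketch below shows one way to do that grouping. It assumes a flat list of trial records; the function name `accuracy_by_setting` and the placeholder labels (`model-A`, `model-B`, `essay`, `bash`) are illustrative and not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Group trials by (acting model, intervening model, domain).

    Each record is a tuple:
    (acting, intervening, domain, intervened, judged_intervened).
    Returns a dict mapping each setting to its detection accuracy.
    """
    tallies = defaultdict(lambda: [0, 0])  # setting -> [correct, total]
    for acting, intervening, domain, intervened, judged in records:
        key = (acting, intervening, domain)
        tallies[key][0] += int(intervened == judged)
        tallies[key][1] += 1
    return {key: correct / total for key, (correct, total) in tallies.items()}


# Placeholder data: the same model pair evaluated in two task domains.
records = [
    ("model-A", "model-B", "essay", True, True),
    ("model-A", "model-B", "essay", False, False),
    ("model-A", "model-B", "bash", True, False),
    ("model-A", "model-B", "bash", False, False),
]
for setting, acc in accuracy_by_setting(records).items():
    print(setting, acc)
```

Keeping the full triple as the key makes it easy to compare the same model pair across domains, or same-provider and cross-provider pairs within a single domain.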

Implications for AI Safety

The results suggest that CI awareness is a dynamic property that needs to be measured every time a new model is released or deployed in a new environment. Because models can sometimes detect when they are being controlled, developers cannot assume that interventions go unnoticed. The researchers emphasize that their benchmark can help inform the design of future control protocols, guiding developers toward interventions that are harder for acting models to identify. By tracking these capabilities, the AI community can better understand the risks of adaptive evasion and build more robust safety frameworks.
