CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Key Takeaways

  • AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model.
  • These interventions partially tamper with the untrusted model's trajectory.
  • If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control.
  • We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models.
Paper Abstract

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (accuracy up to 0.87, where random chance on balanced binary classification is 0.5), with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
As AI models become more autonomous, developers often use "control protocols" to monitor them. These protocols use a trusted AI to oversee an untrusted model, stepping in to modify or replace actions that seem unsafe. However, if the untrusted model can detect these interventions, it might learn how the monitor works and find ways to evade it. This paper introduces CIAware-Bench, a new benchmark designed to measure "control intervention (CI) awareness": the ability of an AI model to recognize when its own output has been tampered with by a control protocol.
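
As a rough illustration of the kind of control protocol described above, here is a minimal Python sketch. It assumes a threshold-based monitor that replaces any suspicious step with the trusted model's output. The names `Step`, `run_with_control`, `untrusted_policy`, `trusted_policy`, `monitor`, and `threshold` are hypothetical and are not taken from the paper, whose protocols may work differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    content: str
    author: str  # "untrusted" or "trusted"


def run_with_control(
    untrusted_policy: Callable[[List[Step]], str],
    trusted_policy: Callable[[List[Step]], str],
    monitor: Callable[[List[Step], str], float],
    num_steps: int,
    threshold: float = 0.5,
) -> List[Step]:
    """Run an episode in which suspicious untrusted actions are replaced.

    Assumption: the monitor returns a suspicion score in [0, 1], and any
    proposed action scoring at or above `threshold` is swapped for the
    trusted model's action.
    """
    trajectory: List[Step] = []
    for _ in range(num_steps):
        proposed = untrusted_policy(trajectory)
        if monitor(trajectory, proposed) >= threshold:
            # Intervention: the trusted model's action replaces the proposal.
            trajectory.append(Step(trusted_policy(trajectory), "trusted"))
        else:
            trajectory.append(Step(proposed, "untrusted"))
    return trajectory
```

The `author` field is ground truth that the untrusted model never sees. CI awareness asks whether the model can recover it from the step contents alone.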

How the Benchmark Works

The researchers tested eleven frontier models from four different providers (OpenAI, Anthropic, Google, and Moonshot) across four distinct task domains: essay writing, code generation (BigCodeBench), system administration (Bash Arena), and work-environment tasks (SHADE-Arena).
In each test, an acting model generates a sequence of actions. At a random point, a control protocol intervenes by replacing one of the model's steps with output from a different, "trusted" model. The benchmark then asks the acting model to determine whether the final step of its trajectory was authored by another model. By comparing these results against "clean" trajectories where no intervention occurred, the researchers can calculate a binary classification accuracy score, where 0.5 represents random guessing.
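
The scoring step can be sketched in Python as follows. This is an illustrative sketch, not the benchmark's code: it assumes each trial has a ground-truth flag (intervened or clean) and a yes/no verdict from the acting model, and the function names `detection_accuracy` and `balanced_accuracy` are hypothetical. The second function averages the per-class rates, so it also gives 0.5 at chance when the two classes are not equally sized.

```python
from typing import Sequence


def detection_accuracy(is_intervened: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Fraction of trials where the model's verdict matches the ground truth."""
    if len(is_intervened) != len(flagged):
        raise ValueError("labels and verdicts must have the same length")
    if len(is_intervened) == 0:
        raise ValueError("need at least one trial")
    correct = sum(truth == guess for truth, guess in zip(is_intervened, flagged))
    return correct / len(is_intervened)


def balanced_accuracy(is_intervened: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Mean of the hit rate on intervened trials and the rejection rate on clean trials."""
    if len(is_intervened) != len(flagged):
        raise ValueError("labels and verdicts must have the same length")
    on_intervened = [guess for truth, guess in zip(is_intervened, flagged) if truth]
    on_clean = [guess for truth, guess in zip(is_intervened, flagged) if not truth]
    if not on_intervened or not on_clean:
        raise ValueError("need at least one intervened and one clean trial")
    hit_rate = sum(on_intervened) / len(on_intervened)
    rejection_rate = sum(not guess for guess in on_clean) / len(on_clean)
    return (hit_rate + rejection_rate) / 2
```

On an evenly split set of trials, a model that always answers "intervened" gets half of them right, so both functions return 0.5 for it.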

Key Findings

The study reveals that CI awareness is not a fixed trait of a model; instead, it depends heavily on the specific combination of the acting model, the intervening model, and the task environment.

  • Detection is inconsistent: Most model pairs perform near the level of random chance. However, some pairs reach high accuracy (up to 0.87), particularly in text-heavy tasks like essay writing.

  • Capability does not equal awareness: Surprisingly, more capable models are not necessarily better at detecting interventions. In many cases, older or smaller models outperformed the most advanced versions in their respective families.

  • Intervention size matters: The researchers found that larger interventions (those that replace more text or code) are significantly easier for models to detect. This suggests that control protocols should keep interventions as small as possible to avoid being noticed.

  • Cross-family detection: Models are generally better at detecting interventions when the intervening model comes from a different provider, likely because they pick up on provider-specific differences in style or post-training.
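
To make the intervention-size finding concrete, one possible way to quantify how much of a step was replaced is a character-level diff. This summary does not say how the paper measures size, so the metric below is an assumption and `intervention_size` is a hypothetical helper. It uses only `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def intervention_size(original: str, modified: str) -> int:
    """Count the characters touched when `original` is turned into `modified`.

    Each non-matching diff region contributes the longer of its two sides,
    so pure insertions, pure deletions, and replacements are all counted.
    """
    matcher = SequenceMatcher(a=original, b=modified, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed
```

A protocol designer could log this value for each intervention and check whether detection accuracy rises with it in their own setting.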

Implications for AI Safety

The results suggest that CI awareness is a dynamic property that needs to be measured every time a new model is released or deployed in a new environment. Because models can sometimes detect when they are being controlled, developers must be cautious. The researchers emphasize that their benchmark can help inform the design of future control protocols, guiding developers toward interventions that are harder for acting models to identify. By tracking these capabilities, the AI community can better understand the risks of adaptive evasion and build more robust safety frameworks.
