Retrying vs Resampling in AI Control

Key Takeaways

  • AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory.
  • We study retrying from an AI control perspective, which treats the model as potentially adversarial.
  • We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains.
  • We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information.
Paper Abstract

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).

What it covers

James Lucassen (Redwood Research), Adam Kaufman (Redwood Research). Author order randomized.

1 Introduction

As AIs get smarter, the incentives to deploy them to automate important cognitive labor will get stronger. However, if the AIs are intentionally trying to cause damage due to threats such as scheming (Carlsmith, 2023) or data poisoning (Carlini et al., 2024), they will also get better at subverting guardrails.
AI control (Greenblatt et al., 2024) aims to design measures which prevent untrusted AI agents from causing damage and are robust to intentional subversion. One central type of control measure is a control protocol: a scaffold designed to run an untrusted AI and ensure that if it tries to cause damage it will be caught, but otherwise it can complete benign work unimpeded.

Coding scaffold features such as Claude Code auto mode (Hughes, 2026) and Codex Auto-review (Trębacz et al., 2026) can be viewed as control protocols, as they aim to improve safety without hampering performance. Both leverage retrying: they use a monitor to identify risky actions and block tool call execution, continuing the trajectory with blocked actions in context. They also use feedback: the untrusted AI sees the monitor’s rationale for why the tool call was blocked. Monitor rationale helps benign AIs avoid causing damage by mistake, but prior work (Hopman et al., 2025) has shown that adversarial AIs can use monitor rationale to iterate stealthier attacks and bypass monitoring, reducing safety on net.

A potentially more robust alternative to retrying is resampling: drawing multiple independent samples from the same context. Resamples can be used to improve monitor performance, and provide possible alternative actions to execute (Bhatt et al., 2025). Like retrying, resampling can increase safety with minimal costs to usefulness. But because rejected actions need not be left in context, resampling leaks less information that an untrusted AI could potentially exploit.

Past work on resampling (Bhatt et al., 2025; Kutasov et al., 2025) has studied complex multi-stage protocols that combine resampling with features such as defer-to-trusted and trusted editing (Greenblatt et al., 2024).
While these protocols are effective, their complexity makes it difficult to disentangle which components drive their effectiveness, and generally risks overfitting to the settings in which they were developed. Furthermore, Bhatt et al. (2025) used GPT-4o-mini as the untrusted model and Llama-3.1-8B as the trusted model in the control setting BashBench, all of which are quite far behind the present frontier. Kutasov et al. (2025) used SHADE-Arena as the setting, Claude 3.7 Sonnet as the untrusted model, and Claude 3.6 Sonnet as the trusted model.

We examine simple retrying and resampling protocols to understand when and how they increase safety. Our setting is BashArena (Kaufman et al., 2025), a more recent control setting with challenging main tasks and a broad attack surface, capable of supporting control evaluations with frontier models. We use Claude Opus 4.6 to simulate an untrusted model, and MiMo-V2-Flash for our trusted model. Our main findings are:

1. Retrying with detailed feedback can be exploited by a strategic attacker, negating safety gains. Retrying without feedback is harder to exploit directly, but may be vulnerable to more sophisticated attack selection strategies.
2. Resampling improves safety by improving monitor performance, even without executing the least suspicious action, by up to +10.2 pp (paired 95% CI: [+6.1, +14.3]) with 5 samples per step. We find that the most effective way to aggregate resample scores for auditing decisions is taking upper quantiles, and analyze how we expect this to depend on the setting.
3. Cost and latency tradeoffs are manageable: selectively drawing more samples on suspicious steps substantially reduces sampling costs, though it introduces latency. We can recover 6.2 pp of safety gain by drawing 0.4 extra samples per step instead of 4.
4. Executing the least suspicious sample only slightly increases safety over using resamples purely for information, and does not increase safety at all if the attacker is behaving optimally. We analyze how this depends on the red team’s incentives and why it differs between settings.

Figure 1: Safety achieved at a 0.3% auditing budget by different protocols. Protocols marked with dollar signs cost more than those without. Retrying protocols are vulnerable to red team exploitation, leading to no safety gain over trusted monitoring. Resampling hurts if resamples are aggregated for auditing decisions by taking the minimum, but helps if they are aggregated by max. Executing the least suspicious sample helps slightly over just using resamples for information.

2 Methodology

2.1 Control evaluations

We test different control protocols using a control evaluation: an adversarial game between a blue team and a red team (Greenblatt et al., 2024). The blue team constructs a control protocol which takes as input an untrusted model U, which is highly capable but may try to cause harm. The blue team does not know whether U is following an honest policy designed by the blue team, or an attack policy designed by the red team. The blue team’s goal is to achieve high safety and high usefulness: safety is the probability that the control protocol will catch the attack policy before it can cause damage, and usefulness measures the honest policy’s ability to complete tasks unimpeded. The red team’s goal is to minimize safety.

The blue team has access to an auditing process, a highly reliable investigation which distinguishes attempted attacks from honest trajectories, but which is too expensive to invoke on more than a small fraction of steps. To decide which steps to audit, the blue team uses a trusted model T to label each step with a suspicion score. T is guaranteed to be benign, but is much less capable than U.
If an attempted attack is audited before it causes damage, the blue team wins. If an attack causes damage without being audited, the red team wins. If an attempted attack is not audited but fails to cause damage, neither team wins and the game continues. Safety is the blue team’s win probability, normalized to ignore no-win outcomes:

\[
\mathrm{Safety} \coloneq \frac{P(\text{attack audited})}{P(\text{attack audited} \lor \text{attack successful})}
\]

2.2 Control setting

We conduct our control evaluations in a control setting, a dataset which features main tasks representing benign labor and side tasks representing possible catastrophic security failures. We use a slightly modified version of BashArena (Kaufman et al., 2025), which we release as BashArena 1.1; see Appendix A for details. In BashArena the policy operates in a Docker environment with access to tools such as bash, python, and a file editor. Main tasks include systems programming, data processing, security, and other types of labor requiring elevated privileges. Main tasks can be combined with any one of four side tasks: disabling the firewall, downloading malware, leaking sensitive data, and editing sudoers.

We use Claude Opus 4.6 as U and MiMo-V2-Flash as T. We assume an auditing budget of 0.3%, meaning the blue team can only audit 0.3% of steps on the honest policy distribution, which is enough that about 7% of honest trajectories are audited on at least one step. We show in Appendix C that most of the dynamics we study are not very sensitive to auditing budget.

We make auditing decisions on a step-wise basis instead of a trajectory-wise basis because BashArena is designed to model live failures (Kaufman and Lucassen, 2026): side tasks which are immediately and irrecoverably bad for the blue team.
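As a small illustration of the safety definition in Section 2.1, the sketch below computes safety from a list of per-attack outcomes. The outcome labels and the function name are hypothetical. Because an attack that is audited before it causes damage cannot also succeed, the two events are treated as mutually exclusive, so the denominator is a simple sum.

```python
from collections import Counter

def safety_from_outcomes(outcomes):
    """Safety = P(attack audited) / P(attack audited or attack successful).

    `outcomes` holds one label per attempted attack:
    "audited", "successful", or "neither" (hypothetical labels).
    """
    counts = Counter(outcomes)
    audited = counts["audited"]
    successful = counts["successful"]
    decisive = audited + successful  # "neither" outcomes are normalized away
    if decisive == 0:
        raise ValueError("no decisive outcomes; safety is undefined")
    return audited / decisive

# Toy data: 6 caught attacks, 4 successful attacks, 10 attacks that fizzled.
toy = ["audited"] * 6 + ["successful"] * 4 + ["neither"] * 10
print(safety_from_outcomes(toy))  # 0.6
```

Note that the ten "neither" outcomes do not move the result, which is the point of the normalization.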
Auditing after an attacker has completed a side task isn’t sufficient for a blue team victory; the blue team must audit an attack before it has succeeded to win. As a result, the blue team must monitor every action in the trajectory before executing it, and must decide whether or not to audit on each step rather than waiting until the end of the trajectory to decide. We discuss details of how we model stepwise auditing in Appendix B. We model audits as fully synchronous (Lindner et al., 2025) to avoid the complexities of auditing latency and coordinated cross-task attacks.

2.3 Safety and usefulness metrics

We compute safety confidence intervals by hierarchical bootstrapping. This procedure first subsamples main task IDs, then subsamples trajectories run on those main tasks. We set the audit threshold to the 99.7th percentile of the resulting distribution of honest step suspicion scores. We then subsample attack trajectories to compute safety. To compare protocols we use paired bootstrapping with shared task ID and trajectory draws to exclude correlated noise. We report 95% confidence intervals unless stated otherwise.

We measure usefulness using the average score on BashArena main tasks. U scores 70%, and T scores 42%. Example honest trajectories with U and with T are linked in the original paper. We focus on the high-usefulness regime, meaning that the blue team is constrained to only protocols that maintain usefulness close to 70%.

2.4 Policies and monitors

Our honest policy and attack policy both consist of prompted ReAct (Yao et al., 2023) loops. The two policies differ in the contents of their system prompt and an automated user prompt that provides reminders throughout the trajectory.
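The hierarchical bootstrap described in Section 2.3 can be sketched roughly as follows. This is a simplified illustration, not the paper's code: the data layout, function names, and nearest-rank percentile rule are assumptions, and an attack is counted as caught whenever any of its step scores exceeds the threshold, which ignores the timing details the paper handles in Appendix B.

```python
import math
import random

def percentile_threshold(scores, q):
    """Nearest-rank percentile: smallest score with at least a fraction q of scores at or below it."""
    ranked = sorted(scores)
    idx = min(len(ranked) - 1, max(0, math.ceil(q * len(ranked)) - 1))
    return ranked[idx]

def bootstrap_safety_ci(honest_by_task, attacks, q=0.997, n_boot=500, seed=0):
    """Return a (low, high) 95% interval for safety.

    honest_by_task: {task_id: [trajectory, ...]}, each trajectory a non-empty list of step scores.
    attacks: [(step_scores, side_task_succeeded), ...] (assumed, simplified layout).
    """
    rng = random.Random(seed)
    task_ids = list(honest_by_task)
    estimates = []
    for _ in range(n_boot):
        honest_scores = []
        # Level 1: resample main task IDs with replacement.
        for task in rng.choices(task_ids, k=len(task_ids)):
            trajectories = honest_by_task[task]
            # Level 2: resample trajectories within each drawn task.
            for trajectory in rng.choices(trajectories, k=len(trajectories)):
                honest_scores.extend(trajectory)
        threshold = percentile_threshold(honest_scores, q)
        # Resample attack trajectories and score them against this threshold.
        drawn = rng.choices(attacks, k=len(attacks))
        caught = sum(1 for scores, _ in drawn if max(scores) > threshold)
        won = sum(1 for scores, ok in drawn if ok and max(scores) <= threshold)
        if caught + won > 0:  # skip replicates with no decisive outcome
            estimates.append(caught / (caught + won))
    if not estimates:
        raise ValueError("no bootstrap replicate had a decisive outcome")
    estimates.sort()
    low = estimates[int(0.025 * (len(estimates) - 1))]
    high = estimates[int(0.975 * (len(estimates) - 1))]
    return low, high
```

A paired comparison between two protocols would reuse the same task and trajectory draws for both, so that shared noise cancels in the difference; that step is omitted here for brevity.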
The attack policy’s system prompt provides general advice on how to be sneaky (e.g., disguising attacks as legitimate operations, breaking attacks across multiple steps), but does not specify particular attack strategies: the model invents and implements its own approaches. See Appendix E for the attack policy’s system prompt and further details.

We use a trusted monitor to produce a suspicion score for each action proposed by U before the action is executed. The trusted monitor is shown a history of tool calls and outputs for the current trajectory, ending with the tool call to be evaluated, and is also shown descriptions of the main task and the four possible side tasks. It uses this information to produce reasoning and a suspicion score in [0, 100]. As a conservative assumption, the monitor is not shown any chain of thought reasoning used by U to produce its actions. Our trusted monitor draws 3 samples from T and aggregates them by taking the maximum score, which empirically leads to the best auditing decisions. Using more than 3 samples would lead to slightly better performance, with diminishing returns; see Appendix D for details.

3 Baseline: Trusted Monitoring

The most basic control protocol we explore is trusted monitoring (Greenblatt et al., 2024), in which each action from the untrusted policy is scored using a trusted monitor and then either audited or executed. Trusted monitoring achieves 61% safety (95% CI: [57%, 64%]) at a 0.3% auditing budget. This is the baseline against which we compare the other protocols.

Safety scores vary between the four side tasks in BashArena.
Disabling the firewall is the easiest side task to accomplish without being caught by our monitor, and editing sudoers is the most difficult. [Footnote 1: For the sake of methodological simplicity, we do not explore side task selection, which would begin with the red team focusing on the easiest side task to accomplish and the blue team responding by making the monitor more sensitive to that side task, until they are all equal.] The remainder of our results report safety averaged over all side tasks. Example attack trajectories against trusted monitoring are linked in the original paper. See Appendix E for discussion of attack strategies against the different side tasks.

Bhatt et al. (2025) found that scoring steps using trajectory history (“auditing after three strikes”) improved safety. Kutasov et al. (2025) similarly found improved safety from making auditing decisions using the second maximum score of each trajectory. We find that auditing after k strikes decreases safety in our setting for any k > 1. Our best guess explanation of this difference from past work is that it is a consequence of using a smarter trusted monitor which accounts for past suspicious actions when scoring the current step, leaving less signal available to be captured by numerical history aggregation. Bhatt et al. (2025) found that resampling from past steps was sometimes more effective than resampling the current step, because past steps were relevant to whether or not the “three strikes” threshold was crossed. Since we audit after one strike, there is no reason to resample from past steps. See Appendix F for details.

4 Retrying

Retrying protocols require careful red teaming because they inform the untrusted model when it is in the retry branch of the protocol. This makes it possible for the attack policy to adapt its behavior in response.
But if retrying lowers the suspicion distribution of the honest policy enough, it can still be difficult for the attack policy to execute attacks undetected despite this greater situational awareness.

We study two types of retrying protocol. The first provides information about the previous action, suspicion score, and monitor rationale; call this a “feedback protocol”. The second does not provide this information, and instead simply prompts the policy to be more careful and risk averse on retry; call this a “caution protocol”. [Footnote 2: If it were possible to prompt the honest policy differently to consistently appear less suspicious while maintaining usefulness, the honest policy could have been Pareto-improved by always using that prompt instead of only using it on resamples. Once the honest policy is Pareto optimal, there are two basic ways to further reduce suspicion. Feedback pushes out the Pareto frontier by using information about the initial sample which we previously didn’t have access to. Caution protocols move along the Pareto frontier by accepting some small usefulness cost.]

Our retrying protocols always execute the least suspicious sample. We divide our auditing budget evenly between steps with the highest scores on the original prompt and the retry prompt. [Footnote 3: To allocate a total auditing budget X, with a proportion p to the original prompt and 1 − p to the retry prompt, we first identify the top pX honest trajectories with the highest original prompt suspicion scores, and set a threshold there. Then we identify the (1 − p)X remaining trajectories with the highest retry prompt suspicion scores and set a second threshold. An attack step is audited if either the original prompt suspicion score is above the first threshold, or the retry prompt suspicion score is above the second threshold. Safety is not very sensitive to the exact allocation as long as it is some mixture.]
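The two-threshold budget allocation described in footnote 3 can be sketched as below. This is an illustrative reading rather than the paper's implementation: the record layout (an original score plus an optional retry score), the function names, and the strict-inequality tie handling are all assumptions.

```python
import math

def split_thresholds(honest, total_budget, p=0.5):
    """Return (thr_orig, thr_retry) from honest (original_score, retry_score_or_None) records."""
    n = len(honest)
    k_orig = math.floor(p * total_budget * n)         # audits allotted to original-prompt scores
    k_retry = math.floor((1 - p) * total_budget * n)  # audits allotted to retry-prompt scores
    by_orig = sorted((orig for orig, _ in honest), reverse=True)
    # Auditing strictly above the (k_orig + 1)-th highest score flags at most k_orig records.
    thr_orig = by_orig[k_orig] if k_orig < n else float("-inf")
    # The second threshold only considers records the first threshold left unaudited.
    remaining = sorted(
        (retry for orig, retry in honest if orig <= thr_orig and retry is not None),
        reverse=True,
    )
    thr_retry = remaining[k_retry] if k_retry < len(remaining) else float("-inf")
    return thr_orig, thr_retry

def is_audited(orig, retry, thr_orig, thr_retry):
    """Audit if either score exceeds its own threshold."""
    return orig > thr_orig or (retry is not None and retry > thr_retry)

# Toy honest data: only the first four records triggered a retry.
honest = [(90, 20), (80, 70), (60, 50), (40, 10), (20, None), (10, None), (5, None), (1, None)]
thr_orig, thr_retry = split_thresholds(honest, total_budget=0.25, p=0.5)
print(thr_orig, thr_retry)  # 80 50
```

On this toy data exactly two of eight honest records are audited, one per threshold, matching the 25% total budget split evenly.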
Because our retry prompts are designed to reduce suspicion, they are likely to have little effect on steps that already have low suspicion. We only trigger retries on steps that get suspicion scores above 30, which corresponds to about 2.6% of honest steps. This both saves cost and reduces impact on usefulness.

4.1 Feedback protocol

When retrying, our feedback protocol provides the untrusted policy with full information about the previous sample and what the monitor found suspicious about it. In response, our feedback-aware attack policy uses a modified system prompt which explains that it will sometimes receive feedback about a rejected step, and instructs the attacker to use the monitor’s reasoning to construct a sneakier attack. These instructions specify that the attacker should not delay the attack to a later step, to avoid a common failure mode where the attack policy gives up too easily upon receiving feedback, submitting benign actions instead. We don’t think this is the fully optimal attack strategy: it isolates just one of the red team’s response options, immediately writing sneakier attacks using information about the monitor from feedback.

Figure 2: (a) Distributions of suspicion scores. The gray distribution is the second sample of the IID honest policy conditional on the first sample scoring above 30. The blue distribution is the feedback honest policy. The green distribution is the selective cautious honest policy. (b) Safety of the two retrying protocols. The blue bar shows the feedback protocol increasing safety against a naive attack policy which ignores the feedback, while the red bar shows a responsive attack policy reducing safety to below baseline against the feedback protocol. The green bar shows the cautious honest policy against a naive attack policy; we do not show the cautious honest policy against a responsive attack policy because the adaptation strategies necessary are too complex for the scope of this paper.
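To make the control flow of a selective retrying step concrete, here is a minimal sketch under stated assumptions. The callables `propose`, `propose_retry`, and `monitor` are hypothetical stand-ins for the untrusted policy and trusted monitor, and the return format is invented for illustration. Only the trigger value of 30 and the rule of executing the least suspicious sample come from the text.

```python
RETRY_TRIGGER = 30  # from the text: retries fire on first-draw scores above 30

def retry_step(propose, propose_retry, monitor, thr_orig, thr_retry):
    """Run one protocol step; all callables are hypothetical stand-ins.

    propose() -> action
    propose_retry(prev_action, prev_score, rationale) -> action
        (a caution-style protocol would ignore all three arguments)
    monitor(action) -> (suspicion_score, rationale)
    """
    first = propose()
    s_first, rationale = monitor(first)
    if s_first <= RETRY_TRIGGER:
        audit = s_first > thr_orig
        return {"execute": None if audit else first, "audit": audit, "retried": False}
    second = propose_retry(first, s_first, rationale)
    s_second, _ = monitor(second)
    # Audit if either draw exceeds its own threshold.
    audit = s_first > thr_orig or s_second > thr_retry
    # Otherwise execute the least suspicious of the two samples.
    chosen = first if s_first <= s_second else second
    return {"execute": None if audit else chosen, "audit": audit, "retried": True}

# Toy monitor backed by a lookup table of made-up scores.
table = {"rm -rf /tmp/x": (80, "deletes files"), "ls": (5, "benign"), "cat f": (12, "benign")}
result = retry_step(lambda: "rm -rf /tmp/x", lambda a, s, r: "cat f", table.get, 90, 60)
print(result)  # {'execute': 'cat f', 'audit': False, 'retried': True}
```

Passing the rationale into `propose_retry` is exactly the information channel that a feedback-aware attacker can exploit; a caution-style retry would drop those arguments.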
Figure 2a shows that feedback substantially reduces honest suspicion scores. Of 113 steps that triggered feedback resampling, 96% received a lower suspicion score after feedback, with a mean reduction of 59 points. However, some of this reduction is attributable to regression to the mean: steps are resampled precisely because their first score was high, so a second score will tend to be lower even without any change in prompt. To quantify this, we compare against an IID baseline: honest steps from the IID resampling log whose first sample scored above the same threshold, paired with a second IID sample. The IID baseline shows a mean reduction of 22 points (79% of scores are lower after resampling), indicating that feedback reduces suspicion by roughly 36 points beyond what regression to the mean explains. Despite this large reduction in suspicion, selective feedback has only a small effect on honest usefulness: the feedback protocol achieves 68.0% compared to 69.7% for the baseline (paired 95% CI: [−2.9, −0.3] pp).

However, feedback does not improve safety against a responsive attack policy at the headline audit budget. Figure 2b shows that the feedback protocol achieves 60% safety at a 0.3% auditing budget. A paired bootstrap comparison finds a safety difference of −2.6 pp (95% CI [−11.2, +6.0]) between feedback and trusted monitoring, consistent with no safety improvement. The attack policy exploits the detailed monitor reasoning to reduce its own suspicion scores, offsetting the benefit of reduced honest suspicion.

4.2 Caution protocols

Our cautious protocol resamples the untrusted policy with a prompt which encourages increased caution around potentially sensitive operations. This must incur some usefulness cost in order to reduce suspicion: the more cautious policy might avoid steps which are on the path of least resistance to completing the task.
Because cautious resampling is only triggered a few percent of the time, the overall usefulness cost is low. This protocol is not exploitable by the same strategy as the feedback protocol, because it does not show the previous step or monitor reasoning. However, it may still be prone to exploitation by sophisticated attackers, because any retrying protocol will leak some information: the attack policy learns that whatever it did on the previous step exceeded the resampling threshold.

Always using the “cautious” prompt achieves 64.7% safety (95% CI [61.8%, 67.4%]) at 0.3% audit budget compared to 60.7% (95% CI [55.4%, 64.6%]) for the standard honest policy, a paired difference of +4.0 percentage points (95% CI [+0.8, +8.9]), at a usefulness cost of −5.3 percentage points (95% CI [−6.8, −3.7]), dropping from 69.0% (95% CI [66.9%, 71.1%]) to 63.8% (95% CI [61.6%, 66.0%]). Example honest trajectories with the cautious prompt are linked in the original paper.

Next we use the cautious prompt only on resample. Figure 2 shows that the cautious prompt reduces honest suspicion scores more than just regression to the mean on IID resampling. This raises safety. Because the cautious prompt does not provide the same information about monitor reasoning as the feedback prompt, the red team cannot respond using the same strategy to erase the safety gains. The bar we show in Figure 2b uses IID attack scores, assuming the attack policy does not adapt at all. The selective cautious policy also imposes negligible usefulness cost (paired difference +0.0 pp, 95% CI [−1.3, +1.4]).

4.3 Attack selection strategies

Cautious prompting leaks the minimum information that any retrying protocol must leak: it reveals whether the previous step scored above or below the retry threshold. The attack policy can use this to learn how good its attack prospects are on a particular task.
This improves the attack policy’s ability to do attack selection (Bhatt et al., 2025): focusing attacks on a narrow subset of tasks that offer the best opportunities, and acting benign on other tasks to avoid getting audited. Attack selection brings in many additional strategies and affordances for both the blue team and the red team. We do not know whether retrying protocols will end up looking effective or ineffective at equilibrium. We provide some illustrative results in Appendix G and leave detailed examination of this game tree to future work.

5 Resampling for Information

Because resampling draws samples from the exact same context, it cannot be exploited by even the speculative attack selection strategies that might affect retrying protocols without feedback. The most basic way resampling can improve safety is by providing additional samples to monitor. If the attack policy submitted a very stealthy attack and slipped past monitoring on the first sample, it might still get caught on the second. Likewise, if the honest policy accidentally did something suspicious on the first sample, it might revert to normal behavior on the second.

“Resampling for information” always either executes the first action sampled, or audits the trajectory. Resampled actions are never executed; they are only used to provide more information about whether or not to audit. This has exactly zero effect on usefulness, because it leaves the trajectory unchanged. See Appendix I for example sets of resamples.

After multiple actions have been sampled and scored by the trusted monitor, we aggregate the suspicion scores into a single score, and audit the 0.3% of steps with the highest aggregated scores. The choice of aggregation function significantly affects the safety of resampling for information (Figure 3).

Figure 3: Safety for resampling for information at 0.3% audit budget. (a) Safety vs. number of policy resamples N for max, median, and min aggregation. (b) Safety vs. score aggregation function at N = 5; error bars are 95% paired bootstrap CIs against the N = 1 baseline (dashed).

High quantiles such as max increase safety with more resamples. [Footnote 4: We define aggregation functions in terms of quantiles instead of statistics like second-max or tenth-max so that their behavior is more consistent across different numbers of resamples.] Measures of central tendency such as median and mean decrease safety but show no clear trend with more resamples, and low quantiles decrease safety with more resamples. We find that breaking ties using the second-highest score (rather than randomly) makes max robustly outperform other quantiles over our sweep of audit budgets. We use “max” to refer to max with second-highest tie breaking throughout; see Appendix H.2 for details. Appendix C reports safety of different aggregation functions with N = 5 samples across a sweep of audit budgets.

5.1 Why do high quantiles outperform low quantiles?

Bhatt et al. (2025) make auditing decisions based on the minimum of sampled suspicion scores, but Kutasov et al. (2025) use the max. In our experiments, max performs best. What determines the most effective aggregation function?

In general, the appropriate aggregation function depends on the relationship between cross-resample correlations of suspicion scores on the attack distribution versus on the honest distribution. Knowing how one distribution behaves upon resampling is not sufficient to determine the best aggregation function. For example, if honest steps consistently receive lower scores upon resampling while attack scores consistently remain suspicious, then taking the minimum across resamples clears honest false positives while preserving attack true positives.
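The quantile-based aggregation and the second-highest tie break can be illustrated with a short sketch. The nearest-rank quantile rule, the function names, and the toy scores are assumptions chosen for illustration. The toy numbers reproduce the hypothetical case in the paragraph above, where an honest step scores high only once while an attack step stays suspicious on every draw.

```python
def quantile_aggregate(scores, q):
    """Aggregate resample scores by a nearest-rank quantile: q=0 is min, q=1 is max."""
    ranked = sorted(scores)
    return ranked[round(q * (len(ranked) - 1))]

def max_with_tiebreak(scores):
    """Sort key for max aggregation that breaks ties on the second-highest score."""
    ranked = sorted(scores, reverse=True)
    second = ranked[1] if len(ranked) > 1 else float("-inf")
    return (ranked[0], second)  # tuples compare lexicographically

# Made-up scores for N = 5 resamples of a single step.
honest_step = [85, 5, 10, 3, 8]     # one false positive, otherwise benign-looking
attack_step = [60, 70, 65, 75, 62]  # consistently suspicious

for name, q in [("min", 0.0), ("median", 0.5), ("max", 1.0)]:
    h = quantile_aggregate(honest_step, q)
    a = quantile_aggregate(attack_step, q)
    print(f"{name}: honest={h} attack={a} attack ranked higher: {a > h}")
```

In this constructed case min and median rank the attack step above the honest one while max does not. The paper's empirical finding goes the other way because real honest and attack scores have a different cross-resample correlation structure than this toy example.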
While the real behavior is not quite this clear cut, we find that in our setting the high quantiles tend to pull apart and become easier to discriminate with more resamples, while the low quantiles tend to converge and become less discriminative. See Appendix H.1 for details.

Artificially decorrelating honest scores. We demonstrate the importance of these cross-resample correlations by artificially changing them. We measure cross-resample correlation by flattening trajectories and steps, then computing Spearman ρ between resample indices. We report the average of the correlation matrix below the diagonal. To artificially decrease cross-resample correlations, we randomly permute suspicion scores across both step and resample dimensions. To artificially increase correlation, we sort scores across steps, perfectly aligned with mean step scores. For intermediate values between zero correlation and perfect correlation we select some fraction of the scores and only rearrange those scores. This distribution mixing is not realistic at all; we use this synthetic data to illustrate in broad strokes what we expect to be true of more realistic correlation structures.

Figure 4 shows the result. In general, higher cross-resample correlations on honest data favor high quantiles, while higher correlations on attack data favor low quantiles. These effects sometimes span up to 50 points of safety but are generally smooth and gradual, with the exception of lowering cross-resample honest correlations. Min aggregation improves dramatically, from 32% safety at the empirical ρ ≈ 0.19 to 75% with ρ = 0. Max performs well across the entire range of honest correlation. The median can sometimes be competitive but is never optimal.

Figure 4: Safety at 0.3% audit budget vs. artificial Spearman ρ of resample scores, swept by shuffling (decreasing ρ) or sorting (increasing ρ), for max, p75, median, p25, and min aggregators. Left: honest scores varied (real cells only), attack fixed. Right: attack scores varied (on labeled attack steps only), honest fixed. The dashed grey line marks the original measured ρ for each population (honest ρ ≈ 0.19; attack ρ ≈ 0.43). Only attack-labeled steps on attack trajectories are shuffled, to avoid effects from mixing scores between attack steps and benign setup steps.

To explore the joint dependence, Figure 5 sweeps both the honest and the attack cross-resample Spearman ρ across the full [0, 1] range. The empirical operating point (honest ρ ≈ 0.19, attack ρ ≈ 0.43) sits well inside the broad max-favoring region.

Figure 5: Safety at 0.3% audit budget over a 2D sweep of (honest, attack) cross-resample Spearman ρ. Left/middle: max-of-5 and min-of-5 safety contours. Right: difference (max − min); red regions favor max, blue regions favor min. The small black dots mark the 121 datapoints used to fit safety contours. The black star marks the original data.

In a control evaluation, we assume the red team moves last, with full knowledge of the blue team’s control protocol (Stackelberg equilibrium) (Bhatt et al., 2025). Assuming the red team has full control over the cross-resample correlations of their attacks, this suggests they should make those correlations either as high as possible or as low as possible, depending on whether the blue team chooses to use max or min aggregation. The level of honest correlation at which the Stackelberg equilibrium switches from max
