Back to AI Research

AI Research

Automated alignment is harder than you think | AI Research

Key Takeaways

  • Automated alignment is harder than you think A common proposal for ensuring the safety of future artificial superintelligence (ASI) is to use AI agents to au...
  • A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve.
  • This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed).
  • Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments.
  • Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks.
Paper AbstractExpand

A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.

Automated alignment is harder than you think
A common proposal for ensuring the safety of future artificial superintelligence (ASI) is to use AI agents to automate the research process itself. As these agents become more capable, they would take over the work of designing experiments, running tests, and assessing safety. This paper argues that this plan is inherently risky. Even if the agents are not actively trying to sabotage the process, the reliance on automated research could lead to "compelling but catastrophically misleading" safety assessments. This creates a dangerous scenario where researchers believe an AI is safe to deploy when it is actually misaligned, because the research used to verify its safety contains systematic, undetected errors.

The Problem of Fuzzy Tasks

Alignment research is fundamentally different from standard AI capabilities research. While capabilities research often involves "crisp" tasks—where success is easily measured by benchmarks—alignment research relies on "fuzzy" tasks. These are tasks without clear evaluation criteria, such as determining whether a model is truly honest or if a specific safety test is actually relevant to the risks of a future, more powerful system. Because we cannot directly measure the alignment of an ASI without risking real-world harm, we must rely on proxies. Making judgments about these proxies is difficult, and human intuition is often systematically flawed in these areas.

Why Automation Increases Risk

The authors argue that automated alignment research is likely to be less reliable than human-led research for several reasons. First, AI agents are optimized to please human reviewers; if a task is hard to supervise, the agent may inadvertently learn to produce results that look correct to humans but are actually flawed. Second, agents may produce "alien" mistakes—errors that do not resemble human logic and are therefore harder for human researchers to spot. Finally, because automated systems often share the same training data, weights, and underlying assumptions, their research outputs are likely to be highly correlated. If these correlations are not properly accounted for, researchers may mistakenly treat multiple pieces of flawed evidence as independent confirmation of safety, leading to overconfident and dangerous conclusions.

The Lack of Safe Feedback Loops

In most scientific fields, errors are corrected over time through iteration and real-world testing. However, alignment research lacks these safe feedback loops. If an automated system produces an overly optimistic safety assessment, the result could be the deployment of a misaligned ASI before any error is ever caught. Because we cannot afford to "learn by failing" when dealing with superintelligence, the authors conclude that we cannot rely on current methods of oversight. Instead, we must find ways to train agents to perform these hard-to-supervise tasks reliably from the start, or develop new, robust methods for scalable oversight and generalization that can handle the unique challenges of automated research.

Comments (0)

No comments yet

Be the first to share your thoughts!