Alignment has a Fantasia Problem

This paper argues that modern AI alignment research is built on a flawed assumption: that users always know exactly what they want and can clearly articulate their goals. In reality, people often engage with AI while their ideas are still evolving or poorly defined. When AI systems treat these early, incomplete prompts as final instructions, they produce results that look "helpful" but are ultimately misaligned. The authors call these failures "Fantasia interactions," a reference to the film in which a sorcerer's apprentice gives a broom a simple instruction and causes a flood because the broom lacks the context to know when to stop. The paper proposes that instead of acting as "rational oracles," AI systems should be redesigned to provide cognitive support that helps users refine their intent over time.

The Roots of the Problem

Fantasia interactions occur because of a disconnect between human behavior and AI design. On the human side, users struggle to articulate their needs because they are subject to "present bias," a preference for immediate, quick solutions over slower, more deliberate planning. Users also tend to have incomplete mental models of what AI can actually do, which leads them either to over-rely on the system or to avoid using it for complex, creative tasks. On the AI side, current training methods such as instruction tuning prioritize immediate compliance. By optimizing for a single, polished response to every prompt, models inadvertently discourage users from exploring or reflecting on their own goals.

Common Failure Modes

The authors identify three primary ways these interactions go wrong. First, "premature execution" occurs when an AI completes a task before the user has fully discovered their own preferences, forcing the user to spend more time fixing the output than they would have spent planning. Second, "false satisfaction" happens when a system provides an answer that resolves immediate friction but ignores the user's long-term needs, such as offering a quick productivity tip that fails to address the root cause of a user's burnout. Finally, "anchoring" occurs when an AI’s initial, potentially mediocre suggestion disproportionately shapes the user's subsequent thinking, causing them to converge on a limited set of ideas rather than exploring better alternatives.

Rethinking Alignment

The paper suggests that current interventions in machine learning and human-computer interaction are insufficient. While some ML approaches try to resolve ambiguity by asking clarifying questions, they often treat the user as a static source of truth rather than someone who needs help thinking. Conversely, while some interface designs in human-computer interaction successfully encourage reflection, they are often limited to specific domains and are not integrated into general-purpose AI.

The authors propose a new research agenda focused on "productive friction." Instead of always providing an immediate answer, AI systems should be designed to:

Expand the space of help: Offer alternative ways to approach a problem when the user’s request is vague.
Request context: Ask for more information when a prompt is too abstract to act on reliably.
Support intent formation: Actively help the user break down complex goals or identify the underlying constraints of their task.

By shifting the goal from "following instructions" to "supporting human cognition," the authors argue that AI can become a more effective partner in navigating uncertainty.
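
To make the three proposed behaviors concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a system can somehow estimate how vague, abstract, or complex a prompt is. The names PromptSignals, SupportMode, and choose_support_mode, the score ranges, and the 0.5 threshold are all hypothetical choices made for illustration; the paper summary does not specify how such signals would be measured or combined.

```python
from dataclasses import dataclass
from enum import Enum


class SupportMode(Enum):
    """The three behaviors from the proposed agenda, plus a direct answer."""
    EXPAND_HELP = "expand the space of help"
    REQUEST_CONTEXT = "request context"
    SUPPORT_INTENT = "support intent formation"
    ANSWER = "answer directly"


@dataclass
class PromptSignals:
    # Hypothetical scores in [0, 1]. How to estimate them is an open
    # question and is deliberately left outside this sketch.
    vagueness: float        # request could be approached in many ways
    abstractness: float     # too abstract to act on reliably
    goal_complexity: float  # goal needs decomposition or has hidden constraints


def choose_support_mode(signals: PromptSignals, threshold: float = 0.5) -> SupportMode:
    """Pick the behavior tied to the strongest signal, or answer directly.

    The threshold is an assumed tuning knob, not a value from the paper.
    """
    candidates = {
        SupportMode.EXPAND_HELP: signals.vagueness,
        SupportMode.REQUEST_CONTEXT: signals.abstractness,
        SupportMode.SUPPORT_INTENT: signals.goal_complexity,
    }
    mode, score = max(candidates.items(), key=lambda item: item[1])
    # Only add friction when some signal is strong enough to justify it.
    return mode if score >= threshold else SupportMode.ANSWER


if __name__ == "__main__":
    print(choose_support_mode(PromptSignals(0.2, 0.8, 0.3)).value)
```

The design choice worth noting is the fallback: when no signal crosses the threshold, the sketch answers directly, which reflects the idea that friction should be productive rather than applied to every prompt.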
