The Human Creativity Benchmark (HCB) addresses a fundamental flaw in how we evaluate generative AI: the tendency to treat professional disagreement as "noise" or measurement error. In creative fields, experts often disagree not because they are wrong, but because they have different tastes, aesthetic goals, and creative intents. This research argues that to build better creative AI, we must stop trying to collapse all feedback into a single "quality" score and instead distinguish between where professionals agree (convergence) and where they legitimately differ (divergence).
Measuring Convergence and Divergence
The researchers propose a framework that separates evaluation into two distinct signals. "Convergence" occurs when experts align around shared, verifiable best practices, such as technical correctness, readable typography, or functional layout. "Divergence" occurs when experts legitimately differ on subjective dimensions like aesthetic direction, mood, and conceptual risk. By collecting 15,000 professional judgments across five creative domains, the study demonstrates that these two signals provide different, actionable insights. For instance, experts might agree closely on whether a model's output adheres to the prompt (convergence) while differing on its visual style (divergence).
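One way to make the two signals concrete is to measure how often experts agree with each other on each evaluation dimension. The sketch below is an illustration only, not the study's actual method: the judgment format, the dimension names, the sample votes, and the 0.75 cutoff are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(votes):
    """Fraction of rater pairs that gave the same vote on one item."""
    pairs = list(combinations(votes, 2))
    if not pairs:
        raise ValueError("need at least two votes per item")
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_by_dimension(judgments):
    """Average pairwise agreement per dimension.

    judgments: iterable of (item_id, dimension, vote) tuples.
    This tuple layout is hypothetical, not the benchmark's schema.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for item_id, dimension, vote in judgments:
        grouped[dimension][item_id].append(vote)

    result = {}
    for dimension, items in grouped.items():
        scores = [pairwise_agreement(votes) for votes in items.values()]
        result[dimension] = sum(scores) / len(scores)
    return result


def label_dimensions(agreement, threshold=0.75):
    """Label each dimension; the threshold is an arbitrary assumption."""
    return {
        dim: "convergent" if score >= threshold else "divergent"
        for dim, score in agreement.items()
    }


# Made-up votes from four experts on a single output.
judgments = [
    ("img1", "prompt_adherence", "pass"),
    ("img1", "prompt_adherence", "pass"),
    ("img1", "prompt_adherence", "pass"),
    ("img1", "prompt_adherence", "pass"),
    ("img1", "visual_style", "A"),
    ("img1", "visual_style", "B"),
    ("img1", "visual_style", "A"),
    ("img1", "visual_style", "B"),
]

agreement = agreement_by_dimension(judgments)
labels = label_dimensions(agreement)
print(agreement)
print(labels)
```

In this toy data the experts are unanimous on prompt adherence and split evenly on visual style, so the first dimension is labeled convergent and the second divergent. A real analysis would likely use a chance-corrected agreement statistic and far more items per dimension.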
The Creative Workflow
The study organizes the creative process into three distinct phases: Ideation, Mockup, and Refinement. Each phase requires different capabilities from an AI model. During Ideation, the goal is discovery and exploration; during Mockup, the goal is to bring a specific vision to life; and during Refinement, the goal is to make targeted, consistent adjustments. The benchmark uses these stages to test how well models perform across the entire lifecycle of a project. The results show that no single model excels uniformly across all three phases, suggesting that developers should optimize models differently depending on the intended stage of use.
Why "Good" is Not Enough
The findings reveal that reducing creative output to a single metric of "good" or "bad" discards the most important information for model developers. When evaluators disagree on aesthetic quality, it is often a sign that the model is providing a range of creative options rather than failing to meet a standard. The researchers argue that AI models should be designed to be reliably consistent on convergent axes—where there is a "correct" way to do things—while remaining steerable on divergent axes, allowing human creatives to apply their own professional taste and judgment.
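A small numeric illustration of what a single score can hide, using made-up ratings on an assumed 1-to-5 scale (the scale and the numbers are not from the study): two outputs can share the same average while the experts behind them behave very differently.

```python
from statistics import mean, pstdev

# Hypothetical ratings from four experts for two outputs.
ratings = {
    "output_a": [3, 3, 3, 3],  # everyone agrees it is middling
    "output_b": [1, 5, 1, 5],  # experts are split between love and hate
}

# Keep both the average and the spread instead of the average alone.
summary = {
    name: {"mean": mean(scores), "spread": pstdev(scores)}
    for name, scores in ratings.items()
}

for name, stats in summary.items():
    print(name, stats)
```

Both outputs average 3, so a one-number leaderboard would treat them as equivalent. The spread separates them: the first reflects consensus, the second reflects a split in taste that a developer might want to preserve as a steerable option.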
Implications for AI Development
The HCB provides a roadmap for moving beyond simple, one-dimensional benchmarks. By analyzing where experts converge and diverge, the study highlights that the value of an AI tool is highly task-dependent. For model developers, this means the goal should not be to eliminate disagreement, but to understand it. A system that forces all outputs toward a single, average-quality "truth" risks homogenizing creative work, whereas a system that preserves the plurality of expert judgment supports the nuanced, collaborative nature of professional design.