The Human Creativity Benchmark

Key Takeaways

  • Modern AI evaluation frameworks treat evaluator disagreement as noise to be resolved.
  • In creative domains, professional disagreement reflects genuine differences in taste, not measurement error.
  • We argue that evaluating creative AI requires preserving two distinct signals: convergence, where professionals align around shared best practices, and divergence, where individual taste legitimately varies.
  • No model excels uniformly across all phases.
Paper Abstract

Modern AI evaluation frameworks treat evaluator disagreement as noise to be resolved. In creative domains, professional disagreement reflects genuine differences in taste, not measurement error. We argue that evaluating creative AI requires preserving two distinct signals: convergence, where professionals align around shared best practices, and divergence, where individual taste legitimately varies. We present the Human Creativity Benchmark (HCB), a benchmark that operationalizes this separation by collecting pairwise preferences, scalar ratings on prompt adherence, usability, and visual appeal, and qualitative rationale from domain professionals. Across 15,000 professional judgments spanning five creative domains and three workflow phases (ideation, mockup, refinement), we find that convergence concentrates on verifiable dimensions like technical correctness and visual hierarchy, while divergence concentrates on taste-driven dimensions like aesthetic direction and conceptual risk. No model excels uniformly across all phases. Collapsing these signals into a single quality metric discards the most actionable information: where models must be correct versus where they should remain steerable.

The Human Creativity Benchmark (HCB) addresses a fundamental flaw in how we evaluate generative AI: the tendency to treat professional disagreement as "noise" or measurement error. In creative fields, experts often disagree not because they are wrong, but because they have different tastes, aesthetic goals, and creative intents. This research argues that to build better creative AI, we must stop trying to collapse all feedback into a single "quality" score and instead distinguish between where professionals agree (convergence) and where they legitimately differ (divergence).

Measuring Convergence and Divergence

The researchers propose a framework that separates evaluation into two distinct signals. "Convergence" occurs when experts align around shared, verifiable best practices, such as technical correctness, readable typography, or functional layout. "Divergence" occurs when experts evaluate taste-driven dimensions like aesthetic direction, mood, and conceptual risk, where individual judgment legitimately varies. Drawing on 15,000 professional judgments across five creative domains, the study shows that the two signals yield different, actionable insights. For instance, evaluators might agree closely on whether an output adheres to the prompt (convergence) while splitting on its visual style (divergence).
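One way to make the convergence/divergence split concrete is to measure how often professionals agree with each other on each dimension. The following is a minimal sketch, not the paper's method: the record layout, the idea of collecting a preference per dimension, the helper names, and the 0.75 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (comparison_id, dimension, rater_id, preferred_output).
# The layout and the values are illustrative, not data from the benchmark.
judgments = [
    ("pair1", "prompt_adherence", "r1", "A"),
    ("pair1", "prompt_adherence", "r2", "A"),
    ("pair1", "prompt_adherence", "r3", "A"),
    ("pair1", "visual_appeal", "r1", "A"),
    ("pair1", "visual_appeal", "r2", "B"),
    ("pair1", "visual_appeal", "r3", "A"),
    ("pair2", "prompt_adherence", "r1", "B"),
    ("pair2", "prompt_adherence", "r2", "B"),
    ("pair2", "prompt_adherence", "r3", "B"),
    ("pair2", "visual_appeal", "r1", "A"),
    ("pair2", "visual_appeal", "r2", "B"),
    ("pair2", "visual_appeal", "r3", "B"),
]


def pairwise_agreement(votes):
    """Fraction of rater pairs that made the same choice (None if < 2 votes)."""
    pairs = list(combinations(votes, 2))
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_by_dimension(records):
    """Average rater agreement per dimension across all comparisons."""
    votes_per_item = defaultdict(list)
    for comparison_id, dimension, _rater, choice in records:
        votes_per_item[(comparison_id, dimension)].append(choice)

    scores = defaultdict(list)
    for (_comparison_id, dimension), votes in votes_per_item.items():
        score = pairwise_agreement(votes)
        if score is not None:
            scores[dimension].append(score)
    return {dim: sum(vals) / len(vals) for dim, vals in scores.items()}


def label_signal(agreement, threshold=0.75):
    """Assumed cutoff: high agreement reads as convergence, low as divergence."""
    return "convergence" if agreement >= threshold else "divergence"


agreement = agreement_by_dimension(judgments)
labels = {dim: label_signal(score) for dim, score in agreement.items()}
```

In this toy data every rater agrees on prompt adherence, so it is labeled as convergence, while the split votes on visual appeal label it as divergence. A real analysis would likely use a chance-corrected agreement statistic rather than raw agreement.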

The Creative Workflow

The study organizes the creative process into three distinct phases: Ideation, Mockup, and Refinement. Each phase requires different capabilities from an AI model. During Ideation, the goal is discovery and exploration; during Mockup, the goal is to bring a specific vision to life; and during Refinement, the goal is to make targeted, consistent adjustments. The benchmark uses these stages to test how well models perform across the entire lifecycle of a project. The results show that no single model excels uniformly across all three phases, suggesting that developers should optimize models differently depending on the intended stage of use.
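The per-phase comparison can be illustrated by computing win rates separately for each workflow phase instead of pooling them. This is a sketch under assumptions: the model names, the outcome tuples, and the helper functions are hypothetical, and the numbers are made up rather than taken from the benchmark.

```python
from collections import defaultdict

# Hypothetical pairwise outcomes: (phase, winning_model, losing_model).
outcomes = [
    ("ideation", "model_x", "model_y"),
    ("ideation", "model_x", "model_y"),
    ("ideation", "model_y", "model_x"),
    ("mockup", "model_y", "model_x"),
    ("mockup", "model_y", "model_x"),
    ("refinement", "model_x", "model_y"),
    ("refinement", "model_y", "model_x"),
    ("refinement", "model_y", "model_x"),
]


def win_rates_by_phase(records):
    """Return {phase: {model: wins / comparisons}} without pooling phases."""
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    for phase, winner, loser in records:
        wins[phase][winner] += 1
        games[phase][winner] += 1
        games[phase][loser] += 1
    return {
        phase: {model: wins[phase][model] / count for model, count in models.items()}
        for phase, models in games.items()
    }


def leader_by_phase(rates):
    """Pick the model with the highest win rate in each phase."""
    return {phase: max(models, key=models.get) for phase, models in rates.items()}


rates = win_rates_by_phase(outcomes)
leaders = leader_by_phase(rates)
# More than one distinct leader means no model is best in every phase.
no_uniform_winner = len(set(leaders.values())) > 1
```

Keeping the phase as a grouping key is the design choice that matters here: a single pooled win rate would hide the fact that the leading model changes between ideation and the later phases in this toy data.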

Why "Good" is Not Enough

The findings reveal that reducing creative output to a single metric of "good" or "bad" discards the most important information for model developers. When evaluators disagree on aesthetic quality, it is often a sign that the model is providing a range of creative options rather than failing to meet a standard. The researchers argue that AI models should be designed to be reliably consistent on convergent axes—where there is a "correct" way to do things—while remaining steerable on divergent axes, allowing human creatives to apply their own professional taste and judgment.
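A small numeric example shows what a single averaged score can hide. The sketch below assumes a hypothetical 1 to 7 rating scale and invented ratings; it only illustrates the argument and does not reproduce the benchmark's scoring.

```python
from statistics import mean, pstdev

# Hypothetical visual-appeal ratings from four professionals (assumed 1-7 scale).
ratings = {
    "output_a": [4, 4, 4, 4],  # everyone finds it middling
    "output_b": [1, 7, 1, 7],  # raters split sharply on taste
}

summary = {
    name: {"mean": mean(values), "spread": pstdev(values)}
    for name, values in ratings.items()
}

# Both outputs collapse to the same single score...
same_mean = summary["output_a"]["mean"] == summary["output_b"]["mean"]
# ...but only the spread reveals that output_b divides professional opinion.
spread_gap = summary["output_b"]["spread"] - summary["output_a"]["spread"]
```

Both outputs average 4, yet one reflects consensus and the other a genuine split in taste. Reporting the spread alongside the mean preserves that distinction.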

Implications for AI Development

The HCB provides a roadmap for moving beyond simple, one-dimensional benchmarks. By analyzing where experts converge and diverge, the study highlights that the value of an AI tool is highly task-dependent. For model developers, this means the goal should not be to eliminate disagreement, but to understand it. A system that forces all outputs toward a single, average-quality "truth" risks homogenizing creative work, whereas a system that preserves the plurality of expert judgment supports the nuanced, collaborative nature of professional design.
