AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Key Takeaways

  • Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts.
  • Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured.
  • We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks.
  • Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs.
Paper Abstract

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
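The agreement figure quoted in the abstract is a Spearman rank correlation between judge scores and human expert scores. As a minimal sketch of what that statistic measures (this is not the authors' evaluation code, and the function names `average_ranks` and `spearman` are illustrative), it can be computed as the Pearson correlation of the two rank vectors, with tied values sharing their average rank:

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5
```

A value of 1 means the judge orders the items exactly as the experts do, even if the raw scores sit on different scales, which is why a rank statistic suits comparing an automatic grader with human raters.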

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters introduces a way to measure how well humans and AI models translate user intent into effective text-to-image prompts. Current benchmarks fix the prompt and test only the image generation model, so they leave out the "prompter": the person or AI that writes the instructions. The paper sets up a standardized, cognitively grounded framework for evaluating this upstream skill, so that the ability to turn abstract ideas into high-quality visual results can be measured and compared.

A Cognitive Approach to Prompting

To create a fair and rigorous evaluation, the researchers categorized 360 expert-crafted tasks into three cognitive groups: Open-ended creation (translating abstract requests), Constrained creation (following specific rules or requirements), and Imitation (reproducing visual features from a target image). By grounding these tasks in cognitive science, the benchmark moves away from trial-and-error prompting and toward a structured, diagnostic assessment of how well a prompter can handle different types of creative and technical challenges.
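The three categories can be pictured as a small data model. The sketch below is an assumption for illustration only: the class names, field names, and the validation rule are hypothetical and are not taken from the AtelierEval release.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskCategory(Enum):
    OPEN_ENDED = "open-ended creation"
    CONSTRAINED = "constrained creation"
    IMITATION = "imitation"


@dataclass(frozen=True)
class PromptingTask:
    """Hypothetical record for one benchmark task."""
    task_id: str
    category: TaskCategory
    user_intent: str                     # what the prompter is asked to achieve
    constraints: tuple[str, ...] = ()    # explicit requirements, if any
    target_image: Optional[str] = None   # reference image path, imitation only

    def __post_init__(self) -> None:
        # Assumed rule: a target image is present exactly when the task is imitation.
        is_imitation = self.category is TaskCategory.IMITATION
        if is_imitation != (self.target_image is not None):
            raise ValueError("target_image is required for imitation tasks only")


# Example: a constrained-creation task with two checkable requirements.
poster_task = PromptingTask(
    task_id="demo-001",
    category=TaskCategory.CONSTRAINED,
    user_intent="A poster for a jazz night",
    constraints=("exactly three saxophones", "title text reads 'Blue Hour'"),
)
```

Keeping the category explicit on each task is what allows results to be broken down per cognitive group rather than reported as a single aggregate score.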

AtelierJudge: The Agentic Evaluator

The core of the evaluation system is "AtelierJudge," an AI agent designed to act as a fair and consistent grader. Inspired by Dual-Process Theory, the judge splits its evaluation into two distinct paths:

  • Subjective Evaluation: Using memory-augmented retrieval, the judge compares prompts and images against expert-curated examples to score them on aesthetic and creative qualities.

  • Objective Verification: The judge uses a checklist to verify whether specific constraints, such as object counts, spatial relationships, or text accuracy, were met.

This dual-process design lets the system reach a Spearman correlation of 0.79 with human expert judgment, which the authors describe as approaching human performance.
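The two paths described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not AtelierJudge itself: the embeddings are hand-written vectors, the `observations` dictionary stands in for whatever a vision model would extract from the image, and every name (`MemoryEntry`, `subjective_score`, `objective_score`) is hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable


@dataclass(frozen=True)
class MemoryEntry:
    embedding: tuple[float, ...]  # hypothetical feature vector of an expert-rated example
    expert_score: float           # the expert's rating for that example


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def subjective_score(query, memory, k=2):
    """Subjective path: similarity-weighted mean of the k nearest expert scores."""
    if not memory:
        raise ValueError("memory must contain at least one expert example")
    scored = sorted(((cosine(query, m.embedding), m) for m in memory),
                    key=lambda pair: pair[0], reverse=True)[:k]
    weights = [max(sim, 0.0) for sim, _ in scored]
    total = sum(weights)
    if total == 0:
        # No positively similar example: fall back to a plain average.
        return sum(m.expert_score for _, m in scored) / len(scored)
    return sum(w * m.expert_score for w, (_, m) in zip(weights, scored)) / total


Check = Callable[[dict], bool]


def objective_score(observations: dict, checklist: dict[str, Check]):
    """Objective path: fraction of checklist items that pass, plus per-item results."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    results = {name: bool(check(observations)) for name, check in checklist.items()}
    return sum(results.values()) / len(results), results
```

For example, with `observations = {"cat_count": 3, "sign_text": "OPEN"}` and a checklist asking for three cats and a sign reading "CLOSED", `objective_score` reports that half of the constraints were met and shows which one failed. Keeping the paths separate means a striking image cannot mask a missed constraint, and a compliant but dull image cannot earn a high aesthetic score.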

Key Findings and Future Directions

The researchers benchmarked 8 MLLMs against 48 human users across 4 T2I backends. The results highlighted a significant gap in prompting proficiency across different skill levels. A notable finding is that while advanced AI middleware can sometimes smooth out quality differences, it can also create logical conflicts when combined with external AI reasoning. The study suggests that "imitation" prompting, which focuses on visual alignment rather than complex symbolic planning, is a more reliable strategy for future prompting agents, and it advocates an image-augmented direction for them. By releasing the benchmark, the authors aim to help both human users improve their skills and developers build more effective AI-driven prompting assistants.
