Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
As AI tools become standard in education and professional work, students are increasingly taught how to use them for productivity—such as summarizing, coding, or searching. However, this focus on efficiency often ignores the need for students to act as responsible judges of AI-generated content. This paper introduces a new educational approach where students learn to evaluate AI by building their own benchmarks. By turning their disciplinary expertise into rigorous, verifiable questions, students move from being passive users of AI to active, accountable overseers of machine-produced knowledge.
From Tool Users to Responsible Judges
The core of this practice is "benchmark construction." Instead of simply prompting an AI for answers, students are tasked with creating the very tests that determine whether an AI’s output is trustworthy. This involves defining what constitutes a valid answer, identifying which sources are authoritative in their specific field, and creating "anti-shortcut" questions that prevent the AI from relying on superficial patterns. By designing these challenges, students learn to recognize the subtle ways AI can fail—such as using outdated terminology, misinterpreting evidence, or providing fluent but inaccurate summaries.
The QuestBench Project
To put this into practice, students from various humanities and social-science disciplines at Peking University collaborated to create QuestBench. This benchmark consists of 256 expert-level questions across 14 domains, including law, history, literature, and international relations. The construction process was rigorous: students subjected each other’s questions to multiple rounds of peer review, auditing them for clarity, accuracy, and the ability to expose AI flaws. This process ensures that the benchmark is not just a collection of trivia, but a set of professional-grade challenges that require deep, domain-specific reasoning to solve.
Performance and Hidden Failures
When researchers tested thirteen state-of-the-art deep research systems against QuestBench, the results revealed significant limitations in current technology. The mean pass rate across all systems was only 16.85%, and even the best-performing system, GPT-5.5, reached a pass rate of just 57.58%. These failures were not just technical glitches; they showed that even advanced systems often struggle with the nuances of query formulation, source navigation, and the precise standards of evidence required in professional fields.
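As a rough illustration of how figures like these are typically derived, the sketch below computes a per-system pass rate and then averages those rates across systems. It is not the authors' evaluation code: the system names, the pass/fail data, and the assumption that the mean is an unweighted average over systems are all hypothetical.

```python
from statistics import mean

def pass_rate(results):
    """Return the percentage of questions a system passed."""
    return 100.0 * sum(results) / len(results)

# Hypothetical pass/fail outcomes per question (NOT real QuestBench data).
runs = {
    "system_a": [True, False, False, False],
    "system_b": [True, True, False, False],
}

# Pass rate for each system, in percent.
per_system = {name: pass_rate(outcomes) for name, outcomes in runs.items()}

# Unweighted mean across systems (assumed averaging scheme).
mean_pass_rate = mean(per_system.values())

# System with the highest pass rate.
best_system = max(per_system, key=per_system.get)

print(per_system, mean_pass_rate, best_system)
```

Averaging per-system rates gives each system equal weight; because every system answers the same question set, this coincides with pooling all outcomes before dividing.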
Why This Matters for Education
The authors argue that this pedagogical model is essential for the future of knowledge work. As AI becomes more capable, the ability to "prompt" a system will be less important than the ability to judge its output. By engaging in benchmark construction, students gain a clearer understanding of their own role: they are the ones who must decide which questions are worth asking and which answers meet the standards of their profession. This approach transforms disciplinary knowledge from mere content to be retrieved into a vital tool for maintaining accountability in an AI-driven world.