Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Key Takeaways

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently.
We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge.
To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work.
Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks.

Paper Abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available online.

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
As AI tools become standard in education and professional work, students are increasingly taught how to use them for productivity tasks such as summarizing, coding, or searching. However, this focus on efficiency often overlooks the need for students to act as responsible judges of AI-generated content. This paper introduces an educational approach in which students learn to evaluate AI by building their own benchmarks. By turning their disciplinary expertise into rigorous, verifiable questions, students move from being passive users of AI to active, accountable overseers of machine-produced knowledge.

From Tool Users to Responsible Judges

The core of this practice is "benchmark construction." Instead of simply prompting an AI for answers, students create the tests that determine whether an AI’s output is trustworthy. This involves defining what constitutes a valid answer, identifying which sources are authoritative in their specific field, and writing "anti-shortcut" questions that cannot be solved by relying on superficial patterns. By designing these challenges, students learn to recognize the subtle ways AI can fail, such as using outdated terminology, misinterpreting evidence, or providing fluent but inaccurate summaries.

The QuestBench Project

To put this into practice, students from various humanities and social-science disciplines at Peking University collaborated to create QuestBench. The benchmark consists of 256 expert-level questions across 14 domains, including law, history, literature, and international relations. Students subjected each other’s questions to multiple rounds of peer review, auditing them for clarity, accuracy, and the ability to expose AI flaws. The review process is meant to make the benchmark more than a collection of trivia: a set of professional-grade challenges that require deep, domain-specific reasoning to solve.
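
As a rough illustration of how such a review workflow could be tracked, the sketch below models one question and its review rounds. Everything in it is an assumption made for illustration: the class names, the field names, the three criteria (taken from the wording above), and the acceptance rule (at least two rounds, with the latest round passing every criterion) are hypothetical and are not the QuestBench schema or the authors' procedure.

```python
from dataclasses import dataclass, field

# Hypothetical review criteria, named after the wording of the summary above.
CRITERIA = ("clarity", "accuracy", "exposes_ai_flaws")


@dataclass
class ReviewRound:
    reviewer: str
    checks: dict  # criterion name -> True/False

    def passed(self) -> bool:
        # A round passes only if every criterion was explicitly approved.
        return all(self.checks.get(name, False) for name in CRITERIA)


@dataclass
class BenchmarkQuestion:
    domain: str
    prompt: str
    reference_answer: str
    authoritative_sources: list
    reviews: list = field(default_factory=list)

    def is_accepted(self, min_rounds: int = 2) -> bool:
        # Assumed rule: enough rounds have happened and the latest one passed.
        if len(self.reviews) < min_rounds:
            return False
        return self.reviews[-1].passed()


# Placeholder content only; no real benchmark question is reproduced here.
question = BenchmarkQuestion(
    domain="history",
    prompt="(question text)",
    reference_answer="(verifiable answer)",
    authoritative_sources=["(primary source)"],
)
question.reviews.append(
    ReviewRound("reviewer_1", {"clarity": False, "accuracy": True, "exposes_ai_flaws": True})
)
accepted_after_first_round = question.is_accepted()
question.reviews.append(
    ReviewRound("reviewer_2", {"clarity": True, "accuracy": True, "exposes_ai_flaws": True})
)
accepted_after_second_round = question.is_accepted()
```

In this sketch the question is rejected after the first round, which flags a clarity problem, and accepted after the second round, which passes all three checks. A real course would likely also record the revisions made between rounds.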

Performance and Hidden Failures

When researchers tested thirteen state-of-the-art deep research systems against QuestBench, the results revealed significant limitations in current technology. The mean question-level pass rate across the thirteen systems was only 16.85%, and even the best-performing system, GPT-5.5, reached a pass rate of just 57.58%. These failures were not just technical glitches; they showed that even advanced systems often struggle with query formulation, source navigation, and the precise standards of evidence required in professional fields.
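
To make the metric concrete, here is a minimal sketch of how pass rates of this kind could be computed from a table of pass/fail outcomes. The system names, question IDs, and outcomes are invented toy data, and the aggregation shown (average the per-question pass rates) is one plausible reading of "mean question-level pass rate", not necessarily the paper's exact definition.

```python
from statistics import mean

# Toy data (hypothetical): results[system][question_id] is True if the system passed.
results = {
    "system_a": {"q1": True, "q2": False, "q3": False, "q4": True},
    "system_b": {"q1": False, "q2": False, "q3": False, "q4": True},
    "system_c": {"q1": False, "q2": False, "q3": False, "q4": False},
}


def system_pass_rate(outcomes):
    """Fraction of attempted questions that one system passed."""
    return sum(outcomes.values()) / len(outcomes)


def question_pass_rate(all_results, question_id):
    """Fraction of systems that passed one question, among those that attempted it."""
    attempts = [o[question_id] for o in all_results.values() if question_id in o]
    return sum(attempts) / len(attempts)


per_system = {name: system_pass_rate(o) for name, o in results.items()}
question_ids = sorted({qid for o in results.values() for qid in o})
per_question = {qid: question_pass_rate(results, qid) for qid in question_ids}

# Assumed aggregation: average the per-question pass rates.
mean_question_level = mean(per_question.values())
best_system = max(per_system, key=per_system.get)

print(f"mean question-level pass rate: {mean_question_level:.2%}")
print(f"best system: {best_system} ({per_system[best_system]:.2%})")
```

When every system is evaluated on every question, averaging per-question rates and averaging per-system rates give the same number, because both reduce to the overall fraction of passed attempts. The two can differ if some systems are run on only a subset of questions, so a report should state which aggregation it uses.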

Why This Matters for Education

The authors argue that this pedagogical model is essential for the future of knowledge work. As AI becomes more capable, the ability to "prompt" a system will be less important than the ability to judge its output. By engaging in benchmark construction, students gain a clearer understanding of their own role: they are the ones who must decide which questions are worth asking and which answers meet the standards of their profession. This approach transforms disciplinary knowledge from mere content to be retrieved into a vital tool for maintaining accountability in an AI-driven world.