Paper Abstract

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

X+Slides: Benchmarking Audience-Conditioned Slide Generation

Generating slide decks from documents is a common task for large language models, but current evaluation methods often miss a vital component: the target audience. Existing benchmarks focus on how complete and how technically deep the slides are, but they do not account for the fact that different listeners need different things. For example, specialists require rigorous proofs, while decision-makers need actionable conclusions. The paper introduces X+Slides, a new benchmark designed to evaluate how well slide-generation systems tailor content to specific audiences.

A New Way to Evaluate Presentations

To bridge the gap between generic slide generation and audience-specific needs, the authors built X+Slides on a diverse corpus covering 113 topics and seven presentation scenes. The benchmark uses a dynamic evaluation framework built from 8,133 deduplicated, source-grounded probes. By assigning audience-specific "utility weights" to the same probes, the benchmark can determine whether the generated slides actually contain the information that matters most to the intended viewers.
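One way to picture the utility-weighting idea is a single shared pool of probes with a separate weight table per audience. The sketch below is illustrative only: the probe texts, audience names, weight values, and the `essential_probes` helper are hypothetical and do not come from the benchmark. Treating $\tau_A$ as a cutoff on those weights is also an assumption about the paper's notation, not something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    probe_id: str
    question: str  # a source-grounded question the slide deck should answer


# Hypothetical probes for illustration (the real benchmark has 8,133).
PROBES = [
    Probe("p1", "What assumption does the main proof rely on?"),
    Probe("p2", "What action does the report recommend?"),
    Probe("p3", "Which dataset was used for evaluation?"),
]

# Hypothetical audience-specific utility weights in [0, 1].
# Note that every audience scores the SAME probes, only the weights differ.
WEIGHTS = {
    "specialist": {"p1": 0.9, "p2": 0.3, "p3": 0.8},
    "decision_maker": {"p1": 0.2, "p2": 0.95, "p3": 0.4},
}


def essential_probes(audience: str, tau: float) -> list[str]:
    """Return ids of probes whose weight for this audience is at least tau."""
    weights = WEIGHTS[audience]
    return [p.probe_id for p in PROBES if weights[p.probe_id] >= tau]


specialist_essential = essential_probes("specialist", 0.7)
decision_maker_essential = essential_probes("decision_maker", 0.7)
```

With these made-up weights, the proof-related probe counts as essential for the specialist but not for the decision-maker, which is the kind of divergence the benchmark is designed to expose.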

Measuring Performance with Four Metrics

X+Slides evaluates AI-generated presentations using four complementary metrics:

Audience Coverage: Measures the extent to which essential information for a specific audience is included.
Domain-wise Coverage: Tracks which specific types of information are successfully captured.
Efficiency: Calculates the utility delivered relative to the amount of attention required by the audience.
Correctness: Verifies that the claims made in the slides are accurately supported by the original source document.
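The four metrics above can be sketched in code under some loud assumptions. The summary does not give the exact formulas, so everything here is a guess at one plausible reading: Audience Coverage is taken as the weighted share of probes at or above a threshold `tau` that the deck answers, Domain-wise Coverage as an unweighted per-domain hit rate, Efficiency as total covered weight divided by a word-count style attention cost, and Correctness as the fraction of slide claims judged supported. All function names, inputs, and example values are hypothetical.

```python
def audience_coverage(weights, covered, tau):
    """Weighted share of audience-essential probes (weight >= tau) that are covered."""
    essential = {pid: w for pid, w in weights.items() if w >= tau}
    total = sum(essential.values())
    if total == 0:
        return 0.0
    return sum(w for pid, w in essential.items() if pid in covered) / total


def domain_coverage(domains, covered):
    """Fraction of probes covered within each information domain (unweighted)."""
    totals, hits = {}, {}
    for pid, dom in domains.items():
        totals[dom] = totals.get(dom, 0) + 1
        hits[dom] = hits.get(dom, 0) + (1 if pid in covered else 0)
    return {dom: hits[dom] / totals[dom] for dom in totals}


def efficiency(weights, covered, attention_cost):
    """Delivered utility (sum of covered weights) per unit of attention cost."""
    if attention_cost <= 0:
        raise ValueError("attention_cost must be positive")
    return sum(w for pid, w in weights.items() if pid in covered) / attention_cost


def correctness(claim_supported):
    """Fraction of slide claims that a verifier marked as supported by the source."""
    if not claim_supported:
        return 0.0
    return sum(1 for ok in claim_supported if ok) / len(claim_supported)


# Hypothetical inputs for one audience and one generated deck.
weights = {"p1": 1.0, "p2": 0.25, "p3": 0.75, "p4": 0.75}
domains = {"p1": "method", "p2": "result", "p3": "method", "p4": "result"}
covered = {"p1", "p2", "p4"}            # probes the deck answers
claims = [True, True, False, True]      # per-claim support judgments

cov = audience_coverage(weights, covered, tau=0.7)
by_domain = domain_coverage(domains, covered)
eff = efficiency(weights, covered, attention_cost=400)  # e.g. 400 words on slides
corr = correctness(claims)
```

Keeping the four functions separate mirrors the claim that the metrics are complementary: in this toy case a deck can cover a probe that carries little weight for the audience (`p2`), which raises its attention cost without raising Audience Coverage.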

Insights from Current AI Systems

The researchers tested three systems, DeepPresenter, SlideTailor, and NotebookLM, using this new framework. The results show that these systems recover a substantial but still incomplete part of audience-essential information. At a threshold of $\tau_A=0.7$, DeepPresenter reached a best Audience Coverage of 0.714 and SlideTailor reached 0.594, while the NotebookLM ablation reached 0.853 and showed clear grounding differences.

The Importance of Grounding

A key takeaway from this research is that visual quality and broad topic coverage are not evidence that a deck's claims are supported by its source. The authors emphasize that slide decks should not be judged solely on their appearance or the range of topics they cover. Instead, they argue that rigorous, source-grounded evaluation is necessary to ensure that the information presented is both relevant to the audience and factually supported by the source material.
