Key Takeaways

  • Multi-frame story illustration requires long-horizon coherence beyond single-image text-to-image generation, including narrative decomposition and persistent character identity, layout, and affect across frames.
  • We also deploy S2ED in an end-to-end story-to-storybook system for children's illustrated stories, with a supplementary video.
  • Multi-frame story illustration—the process of turning a narrative into a sequence of coherent images—is a difficult task for current AI.
Paper Abstract

Multi-frame story illustration requires long-horizon coherence beyond single-image text-to-image generation, including narrative decomposition and persistent character identity, layout, and affect across frames. We propose Story-to-Executable Descriptions (S2ED), a training-free, model-agnostic, prompt-layer framework that converts a full story into a sequence of explicit, editable executable descriptions for more consistent rendering. S2ED coordinates three agents to segment the narrative, ground canonical character attributes, and enrich spatial and affective cues, enabling interpretable prompt-carried state propagation and local edits to repair drift without retraining the generator. Experiments on Flintstones and Shakoo Maku show that S2ED improves sequence-level consistency and character fidelity over strong prompting, large-model planning, and a reference training-based method, under both automatic metrics and human judgments. We also deploy S2ED in an end-to-end story-to-storybook system for children's illustrated stories, with a supplementary video.

S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration

Multi-frame story illustration—the process of turning a narrative into a sequence of coherent images—is a difficult task for current AI. While modern text-to-image models can generate high-quality single images, they often struggle to maintain consistency across a series of frames. This leads to "drift," where characters change their appearance, clothing, or spatial layout as the story progresses. The paper introduces Story-to-Executable Descriptions (S2ED), a training-free framework that acts as a bridge between a story and an image generator, ensuring that identity, layout, and mood remain stable throughout a narrative.

How S2ED Works

S2ED treats story illustration as a "compilation" problem. Instead of asking an AI to generate images directly from raw text, the framework uses three specialized agents to process the story into a structured, persistent format:

  1. Narrative Segmenter: Breaks the full story into individual, frame-aligned captions.
  2. Character Consistency Grounder: Identifies characters and retrieves their canonical appearance (like clothing and hair) from a knowledge base. It ensures these traits are carried forward from one frame to the next.
  3. Visual Enricher: Adds specific details about the spatial layout and the emotional "affect" of the scene.

By combining these elements, S2ED creates an "executable description" for each frame. Crucially, this process is recurrent; the description for the current frame is built upon the commitments made in the previous one, allowing the system to remember what a character looks like without needing to retrain the underlying image generator.
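The recurrent, prompt-carried state described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the paper's three agents are replaced by trivial rule-based stand-ins, and the names `CHARACTER_KB`, `FrameState`, `segment`, `ground`, `enrich`, and `compile_story` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical knowledge base of canonical appearances (illustrative values only).
CHARACTER_KB = {
    "Mia": "a girl with short black hair and a red raincoat",
    "Bo": "a small brown dog with a blue collar",
}


@dataclass
class FrameState:
    """One executable description: caption plus explicit, editable state."""
    caption: str
    characters: dict = field(default_factory=dict)  # name -> appearance committed so far
    layout: str = ""
    affect: str = ""

    def to_prompt(self) -> str:
        cast = "; ".join(f"{name}: {look}" for name, look in self.characters.items())
        return f"{self.caption} | characters: {cast} | layout: {self.layout} | mood: {self.affect}"


def segment(story: str) -> list[str]:
    """Stand-in for the Narrative Segmenter: one caption per sentence."""
    return [s.strip() + "." for s in story.split(".") if s.strip()]


def ground(caption: str, previous: dict) -> dict:
    """Stand-in for the Character Consistency Grounder: keep earlier commitments
    and add knowledge-base entries for characters first named in this caption."""
    characters = dict(previous)
    for name, look in CHARACTER_KB.items():
        if name in caption and name not in characters:
            characters[name] = look
    return characters


def enrich(state: FrameState) -> FrameState:
    """Stand-in for the Visual Enricher: attach placeholder layout and affect cues."""
    state.layout = "characters centered, medium shot"
    state.affect = "sad" if "lost" in state.caption.lower() else "cheerful"
    return state


def compile_story(story: str) -> list[FrameState]:
    """Build each frame's description on top of the previous frame's commitments."""
    frames, carried = [], {}
    for caption in segment(story):
        carried = ground(caption, carried)
        frames.append(enrich(FrameState(caption, dict(carried))))
    return frames


frames = compile_story("Mia walks in the rain. She finds Bo, who is lost. They go home together.")
for frame in frames:
    print(frame.to_prompt())
```

In this toy version the third caption names no one, yet its prompt still carries both canonical appearances forward, which is the property the recurrence is meant to provide. A real system would presumably also decide which characters are actually present in each frame.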

Why This Approach Matters

Traditional methods for maintaining consistency often require expensive model retraining or fine-tuning, which can be slow and difficult to manage. S2ED is model-agnostic and training-free, meaning it can be used with existing image generators without the need for high-compute updates. Because the system uses an explicit, text-based state, it also allows for "human-in-the-loop" edits, where a user can manually adjust the description of a frame to fix errors without breaking the rest of the sequence.
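A small sketch of what such a local edit could look like when the state is plain text. The dictionary layout and the helper names `edit_frame` and `frames_to_rerender` are hypothetical; the source does not describe the editing interface in this detail.

```python
import copy

# Hypothetical per-frame descriptions; field names are illustrative.
# Frame 1 contains a drift error: the raincoat colour changed.
sequence = [
    {"caption": "Mia walks in the rain.", "characters": {"Mia": "red raincoat"}, "affect": "cheerful"},
    {"caption": "She finds a lost dog.", "characters": {"Mia": "blue raincoat"}, "affect": "sad"},
    {"caption": "They go home together.", "characters": {"Mia": "red raincoat"}, "affect": "cheerful"},
]


def edit_frame(seq, index, character, appearance):
    """Return a new sequence with one character's appearance patched in one frame."""
    patched = copy.deepcopy(seq)
    patched[index]["characters"][character] = appearance
    return patched


def frames_to_rerender(before, after):
    """Indices whose description changed, i.e. the only frames that need regenerating."""
    return [i for i, (old, new) in enumerate(zip(before, after)) if old != new]


fixed = edit_frame(sequence, 1, "Mia", "red raincoat")
print(frames_to_rerender(sequence, fixed))  # [1]
```

Because the edit touches only the text of one frame, the other descriptions stay byte-for-byte identical and their images do not need to be regenerated.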

Key Results

The researchers tested S2ED on two datasets: Flintstones (a collection of classic animated stories) and Shakoo Maku (a series of family-oriented narratives). On both, S2ED outperformed strong prompting baselines, large-model planning, and a reference training-based method.

Automatic metrics showed improvements in character identity consistency, event alignment, and spatial layout. Human evaluators also preferred the sequences generated by S2ED, rating them higher for character consistency, story relevance, and overall visual quality.
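The summary does not give the exact metric definitions, but the evaluation is described as relying largely on CLIP-based scores. One plausible form of a sequence-level consistency score is the mean pairwise cosine similarity between per-frame embeddings. The sketch below assumes the embeddings have already been produced by some image encoder (for example CLIP); the toy 3-dimensional vectors and the function names are illustrative only, and this is not claimed to be the paper's metric.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def sequence_consistency(embeddings):
    """Mean cosine similarity over all pairs of frame embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors standing in for image-encoder outputs of a character crop per frame.
stable = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]]
drifting = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(sequence_consistency(stable) > sequence_consistency(drifting))  # True
```

A score like this only captures coarse visual similarity, so it would miss finer properties of a sequence such as emotional flow.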

Limitations and Future Considerations

While S2ED improves consistency, the authors note a few areas for improvement:

  • Multi-entity interference: In scenes with many characters, the system can sometimes struggle, leading to appearance swaps or identity drift.

  • Attribute leakage: Occasionally, specific traits (like a prop or color) might accidentally transfer from one character to another during the enrichment phase.

  • Evaluation scope: The current metrics rely heavily on CLIP-based scores, which measure coarse alignment but may not fully capture the nuance of emotional flow or long-term narrative coherence.
