S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration
Multi-frame story illustration, the process of turning a narrative into a sequence of coherent images, remains difficult for current text-to-image models. They can generate high-quality single images, but they often struggle to maintain consistency across a series of frames. This leads to "drift," where a character's appearance or clothing, or the scene's spatial layout, changes as the story progresses. The paper introduces Story-to-Executable Descriptions (S2ED), a training-free framework that acts as a bridge between a story and an image generator, ensuring that identity, layout, and mood remain stable throughout a narrative.
How S2ED Works
S2ED treats story illustration as a "compilation" problem. Instead of asking an AI to generate images directly from raw text, the framework uses three specialized agents to process the story into a structured, persistent format:
1. Narrative Segmenter: Breaks the full story into individual, frame-aligned captions.
2. Character Consistency Grounder: Identifies characters and retrieves their canonical appearance (like clothing and hair) from a knowledge base. It ensures these traits are carried forward from one frame to the next.
3. Visual Enricher: Adds specific details about the spatial layout and the emotional "affect" of the scene.
By combining these elements, S2ED creates an "executable description" for each frame. Crucially, this process is recurrent; the description for the current frame is built upon the commitments made in the previous one, allowing the system to remember what a character looks like without needing to retrain the underlying image generator.
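The recurrent compilation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the three agents are replaced by trivial rule-based stand-ins, and the names `KNOWLEDGE_BASE`, `ExecutableDescription`, `segment`, `ground`, `enrich`, and `compile_story`, along with the example characters, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical knowledge base of canonical appearances (illustrative values only).
KNOWLEDGE_BASE: Dict[str, str] = {
    "Mira": "green coat, short black hair",
    "Tom": "red scarf, curly brown hair",
}


@dataclass
class ExecutableDescription:
    """Structured, text-based description of one frame."""
    caption: str
    characters: Dict[str, str] = field(default_factory=dict)  # name -> appearance
    layout: str = ""
    affect: str = ""

    def to_prompt(self) -> str:
        who = "; ".join(f"{name}: {look}" for name, look in self.characters.items())
        return (f"{self.caption} | characters: {who} | "
                f"layout: {self.layout} | affect: {self.affect}")


def segment(story: str) -> List[str]:
    """Stand-in for the Narrative Segmenter: one caption per sentence."""
    return [s.strip() for s in story.split(".") if s.strip()]


def ground(caption: str, committed: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for the Character Consistency Grounder."""
    present: Dict[str, str] = {}
    for name, canonical in KNOWLEDGE_BASE.items():
        if name in caption:
            # Reuse an earlier commitment; fall back to the knowledge base
            # only the first time the character appears.
            present[name] = committed.setdefault(name, canonical)
    return present


def enrich(desc: ExecutableDescription) -> ExecutableDescription:
    """Stand-in for the Visual Enricher (placeholder rules, assumed)."""
    names = list(desc.characters)
    desc.layout = ("left to right: " + ", ".join(names)) if names else "empty scene"
    desc.affect = "cheerful" if "smiles" in desc.caption else "neutral"
    return desc


def compile_story(story: str) -> List[ExecutableDescription]:
    committed: Dict[str, str] = {}  # persistent state carried across frames
    frames: List[ExecutableDescription] = []
    for caption in segment(story):
        desc = ExecutableDescription(caption, ground(caption, committed))
        frames.append(enrich(desc))
    return frames


frames = compile_story("Mira walks into the kitchen. Tom greets Mira. Mira smiles.")
for frame in frames:
    print(frame.to_prompt())
```

Each printed line is the kind of string that could be handed to any text-to-image generator. The `committed` dictionary plays the role of the persistent state: a character's appearance is fixed on first mention and reused in every later frame, with no change to the image model itself.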
Why This Approach Matters
Traditional methods for maintaining consistency often require expensive model retraining or fine-tuning, which can be slow and difficult to manage. S2ED is model-agnostic and training-free, meaning it can be used with existing image generators without the need for high-compute updates. Because the system uses an explicit, text-based state, it also allows for "human-in-the-loop" edits, where a user can manually adjust the description of a frame to fix errors without breaking the rest of the sequence.
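Because the per-frame state is plain text, a human-in-the-loop edit can be as simple as overriding one field of one frame. The sketch below is an assumption about how such an edit might look, not an interface taken from the paper; `FrameState` and `edit_frame` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class FrameState:
    """Hypothetical text-based state for a single frame."""
    caption: str
    appearance: str
    layout: str


def edit_frame(frames: List[FrameState], index: int, **changes: str) -> List[FrameState]:
    """Return a new sequence with one frame's text fields overridden."""
    edited = list(frames)
    edited[index] = replace(frames[index], **changes)
    return edited


frames = [
    FrameState("Mira enters the kitchen", "Mira: green coat", "Mira at the door"),
    FrameState("Mira sits down", "Mira: green coat", "Mira at the door"),  # layout is wrong
    FrameState("Mira drinks tea", "Mira: green coat", "Mira at the table"),
]

# A user corrects only the layout of the second frame.
fixed = edit_frame(frames, 1, layout="Mira at the table")
```

Only the edited frame differs in `fixed`; the other frames and the original list are left as they were, so the corrected frame alone would need to be regenerated.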
Key Results
The researchers tested S2ED using two datasets: Flintstones (a collection of classic animated stories) and Shakoo Maku (a series of family-oriented narratives). In both cases, S2ED outperformed standard prompting techniques and large-model baselines.
Quantitative metrics showed significant improvements in character identity consistency, event alignment, and spatial layout. Human evaluators also consistently preferred the sequences generated by S2ED, rating them higher than those of other methods for character consistency, story relevance, and overall visual quality.
Limitations and Future Considerations
While S2ED improves consistency, the authors note a few areas for improvement:
- Multi-entity interference: In scenes with many characters, the system can sometimes struggle, leading to appearance swaps or identity drift.
- Attribute leakage: Occasionally, specific traits (like a prop or color) might accidentally transfer from one character to another during the enrichment phase.
- Evaluation scope: The current metrics rely heavily on CLIP-based scores, which measure coarse alignment but may not fully capture the nuance of emotional flow or long-term narrative coherence.