
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation (2405.01434v1)

Published 2 May 2024 in cs.CV

Abstract: For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.


Summary

  • The paper introduces zero-shot Consistent Self-Attention to preserve character identity across image sequences, enabling coherent visual storytelling.
  • It adds a Semantic Motion Predictor that models transitions in a semantic space, yielding smoother and more consistent long-range video generation.
  • The method has practical applications in digital storytelling, education, and the creative industries by producing more consistent AI-generated visual content.

Enhancing Visual Storytelling with AI: Insights from the StoryDiffusion Approach

StoryDiffusion: Enhancing Text-to-Image and Video Consistency

StoryDiffusion is a new technique that tackles the challenging task of creating consistent, high-quality image and video sequences from textual descriptions. Its main thrust lies in two key components: Consistent Self-Attention and the Semantic Motion Predictor.

Consistent Self-Attention

  • Purpose: The primary goal of Consistent Self-Attention is to maintain the identity and attire of characters throughout a series of generated images, which is crucial for coherent visual storytelling.
  • How it Works: The mechanism samples features from the other images in a batch and mixes them into the self-attention computation of each image, so that characteristics such as a character's appearance stay consistent across the sequence (see the sketch after this list).
  • Zero-shot Operation: This approach requires no additional training or fine-tuning; it plugs directly into existing pretrained diffusion models and reuses their learned weights.
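
To make the idea concrete, here is a minimal PyTorch sketch of the Consistent Self-Attention idea as described above. It is not the authors' implementation: the class name, the sample_ratio parameter, and the token-sampling details are illustrative assumptions, and the shapes assume a single batch that holds every image of one story.

```python
# Hedged sketch of Consistent Self-Attention (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConsistentSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, sample_ratio: float = 0.5):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.sample_ratio = sample_ratio
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), where the batch contains all images of one story.
        b, n, d = x.shape

        # Sample a subset of tokens from every image and share them across the
        # whole batch, so each image can attend to the others' features.
        num_sampled = max(1, int(n * self.sample_ratio))
        idx = torch.randperm(n, device=x.device)[:num_sampled]
        shared = x[:, idx, :].reshape(1, b * num_sampled, d).expand(b, -1, -1)

        # Queries come from the image itself; keys/values additionally see the
        # shared tokens, which is what ties identity and attire together.
        q = self.to_q(x)
        kv = torch.cat([x, shared], dim=1)
        k, v = self.to_k(kv), self.to_v(kv)

        def split(t):  # (b, length, dim) -> (b, heads, length, dim_per_head)
            return t.unflatten(-1, (self.heads, -1)).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).flatten(-2)  # back to (b, n, dim)
        return self.to_out(out)
```

The only change relative to ordinary self-attention is that the keys and values also see tokens shared from the other images in the batch, which is why the mechanism can be dropped into a pretrained model without retraining.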

Semantic Motion Predictor

  • Challenge Addressed: Transitioning smoothly between images in video generation, especially over longer sequences, poses a substantial challenge.
  • Novel Strategy: Unlike typical methods that predict intermediate frames directly from latent image features, the Semantic Motion Predictor operates in a semantic space: it uses a pretrained CLIP image encoder to map the start and end images into semantic features before predicting the transitional states (a sketch follows this list).
  • Outcome: By focusing on semantic details, this predictor manages to create transitions that are not only smoother but also more logically coherent with the starting and ending frames, thus enhancing the video's overall fluidity and realism.
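
The sketch below illustrates one way such a predictor could look in PyTorch. It is a hedged approximation rather than the paper's architecture: the module name, the learned per-frame queries, and the transformer-decoder layout are assumptions, and the start/end embeddings are presumed to come from a pretrained CLIP image encoder.

```python
# Hedged sketch of a Semantic Motion Predictor (illustrative, not the paper's code).
import torch
import torch.nn as nn


class SemanticMotionPredictor(nn.Module):
    def __init__(self, dim: int = 768, num_frames: int = 16,
                 num_layers: int = 4, heads: int = 8):
        super().__init__()
        # One learned query per intermediate frame; each query is refined by
        # attending to the semantic features of the two conditioning frames.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, start_emb: torch.Tensor, end_emb: torch.Tensor) -> torch.Tensor:
        # start_emb, end_emb: (batch, tokens, dim) semantic features of the
        # first and last frame, e.g. patch tokens from a CLIP image encoder.
        b = start_emb.shape[0]
        memory = torch.cat([start_emb, end_emb], dim=1)               # (b, 2*tokens, dim)
        queries = self.frame_queries.unsqueeze(0).expand(b, -1, -1)   # (b, frames, dim)

        # Predict one semantic embedding per intermediate frame; a downstream
        # video decoder would turn these embeddings into latents or pixels.
        return self.decoder(tgt=queries, memory=memory)               # (b, frames, dim)
```

Predicting in this semantic space, rather than directly in the latent space, is what the paper credits for the smoother and more stable transitions in long videos; the predicted per-frame embeddings would then condition a video decoder that renders the final frames.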

Practical Applications and Implications

StoryDiffusion's novel components open up exciting possibilities:

  • Digital Storytelling: Enhanced consistency in characters and settings can significantly improve the quality of visual narratives, making them more engaging and easier to follow.
  • Educational Content: In educational settings, consistent and clear visual content is crucial for maintaining engagement and understanding. StoryDiffusion can be used to create educational videos that maintain continuity in visual themes.
  • Creative Industries: The film and animation industries could use these techniques to draft visual storyboards or generate pre-visualizations for projects, saving time and resources in the creative process.

Speculations on Future Development

The introduction of zero-shot, plug-and-play components like these in diffusion models could herald a new phase of AI-driven content creation where customization and consistency are achieved more effortlessly. Future advancements might include further fine-tuning of these methods to handle even more complex sequences and interactions within stories, potentially integrating other forms of media to create mixed-media narratives.

Conclusion

StoryDiffusion represents a significant stride forward in the generation of coherent visual stories from text. Its innovative use of Consistent Self-Attention for image consistency and Semantic Motion Predictor for smooth video transitions sets a new bar for what's possible in the field of AI-enabled content creation. The ability to generate long, coherent video narratives from textual descriptions without extensive computational overhead could greatly expand both creative and practical applications of AI in various fields. As this technology continues to evolve, we may soon see AI playing a bigger role in content creation, offering tools that empower creators while enhancing the viewer's experience with rich, consistent storytelling.
