StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation (2405.01434v1)
Abstract: For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new self-attention mechanism, termed Consistent Self-Attention, that significantly boosts the consistency of generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic-space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion between two provided images in a semantic space, and it converts the generated sequence of images into videos with smooth transitions and consistent subjects that are markedly more stable than those produced by modules operating in latent space alone, especially for long videos. Merging these two components, our framework, referred to as StoryDiffusion, can depict a text-based story with consistent images or videos spanning a rich variety of content. StoryDiffusion constitutes a pioneering exploration of visual story generation in image and video form, and we hope it will inspire further research on architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.
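To make the Consistent Self-Attention idea concrete, the sketch below shows one way such a training-free, batch-wide attention could be wired into the self-attention layers of a text-to-image model. It is a minimal illustration under stated assumptions, not the paper's exact implementation: the `sample_ratio` parameter, the identity q/k/v projections, and the choice to append sampled tokens to keys and values (leaving queries per-image) are all simplifications for readability.

```python
import torch
import torch.nn.functional as F


def consistent_self_attention(x, num_heads=8, sample_ratio=0.5):
    """Minimal sketch of batch-wide token sharing in self-attention.

    x: (B, N, C) hidden states of B images being generated together.
    Tokens randomly sampled from the images in the batch are appended
    to each image's keys and values, so all images attend to a shared
    pool of subject features. Queries stay per-image and the output
    shape is unchanged, so a pretrained layer's weights (replaced by
    identity projections here, for brevity) could be reused untouched.
    """
    B, N, C = x.shape
    q = k = v = x  # stand-ins for the pretrained q/k/v projections

    # Sample a fraction of each image's tokens into a shared pool.
    n_ref = max(1, int(N * sample_ratio))
    idx = torch.randint(0, N, (B, n_ref), device=x.device)
    ref = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    shared = ref.reshape(1, B * n_ref, C).expand(B, -1, -1)

    # Extend keys/values only. A more faithful version might sample
    # only from the *other* images in the batch; this sketch pools
    # over the whole batch for simplicity.
    k = torch.cat([k, shared], dim=1)
    v = torch.cat([v, shared], dim=1)

    def heads(t):  # (B, L, C) -> (B, H, L, C // H)
        return t.view(B, t.shape[1], num_heads, C // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(B, N, C)
```

Because only the keys and values are extended, the output shape and the pretrained weights are untouched, which is what would allow a mechanism of this form to be dropped into existing text-to-image models zero-shot, e.g. via the attention-processor hooks that libraries such as diffusers expose.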
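The Semantic Motion Predictor is described only at a high level in the abstract; the sketch below is one plausible reading, assuming the two provided keyframes are first embedded by a pretrained image encoder (e.g., CLIP) into a semantic space and a small transformer then predicts the embeddings of the intermediate frames. The dimensions, depth, and learned per-frame queries are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class SemanticMotionPredictor(nn.Module):
    """Sketch of a semantic-space motion predictor.

    Given semantic embeddings of a start and end frame, a small
    transformer predicts embeddings for the intermediate frames.
    These per-frame embeddings would then condition a video
    generator; all hyperparameters here are illustrative.
    """

    def __init__(self, dim=768, num_frames=16, depth=4, heads=8):
        super().__init__()
        # One learned query per intermediate frame to predict.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, start_emb, end_emb):
        # start_emb, end_emb: (B, dim) embeddings of the two keyframes.
        B = start_emb.shape[0]
        queries = self.frame_queries.unsqueeze(0).expand(B, -1, -1)
        # Let the per-frame queries attend to both endpoint embeddings.
        tokens = torch.cat(
            [start_emb.unsqueeze(1), queries, end_emb.unsqueeze(1)], dim=1)
        hidden = self.transformer(tokens)
        # Drop the endpoint tokens; return intermediate-frame predictions.
        return self.out(hidden[:, 1:-1, :])  # (B, num_frames, dim)
```

At inference, the predicted per-frame embeddings would condition a video decoder or diffusion model to render the transition between the two keyframes; predicting in a semantic space rather than interpolating latents directly is, per the abstract, what yields the more stable long-video transitions.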