TrailBlazer: Trajectory Control for Diffusion-Based Video Generation (2401.00896v2)

Published 31 Dec 2023 in cs.CV

Abstract: Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is constructed upon a pre-trained (T2V) model, and easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.


Summary

  • The paper introduces TrailBlazer, which uses bounding boxes as a high-level interface to control object trajectories in video generation.
  • It edits spatial and temporal attention maps in a pre-trained diffusion model to facilitate smooth keyframe interpolation and efficient guidance.
  • Quantitative evaluations demonstrate competitive performance with natural object movements, though challenges with multi-object generation persist.

Overview

Text-to-video (T2V) generation has advanced rapidly, making it possible to create videos directly from textual descriptions. A persistent challenge in this domain is controllability: ensuring that objects follow specific spatial and temporal paths in the generated video. This paper introduces TrailBlazer, a method that provides high-level control over object trajectories in video synthesis without requiring detailed per-frame guidance such as edge maps, depth maps, or an existing video to edit.

Methodology

The novelty of TrailBlazer lies in its use of bounding boxes as a simple, high-level interface for guiding object trajectories, an approach accessible even to casual users. Instead of relying on detailed masks or other dense control signals, users only provide bounding boxes and text prompts at a few key points in the video. The underlying mechanism edits the spatial and temporal attention maps of a pre-trained denoising diffusion model, allowing control over both the subject's trajectory and its appearance. Keyframing is introduced to interpolate the bounding-box positions and text prompts between these key points, producing smooth transitions with negligible computational overhead, as sketched below.
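To make the keyframing idea concrete, here is a minimal sketch (not the authors' code) of how per-frame bounding boxes could be produced by linearly interpolating user-specified keyframe boxes. The function name and the normalized (x0, y0, x1, y1) box convention are illustrative assumptions.

```python
# Minimal sketch: linear keyframe interpolation of bounding boxes.
# Boxes are assumed to be normalized (x0, y0, x1, y1) coordinates in [0, 1].
from bisect import bisect_right

def interpolate_boxes(keyframes: dict[int, tuple[float, float, float, float]],
                      num_frames: int) -> list[tuple[float, ...]]:
    """Return one bounding box per frame by linearly blending keyframe boxes."""
    frames = sorted(keyframes)
    boxes = []
    for f in range(num_frames):
        if f <= frames[0]:
            boxes.append(tuple(keyframes[frames[0]]))
        elif f >= frames[-1]:
            boxes.append(tuple(keyframes[frames[-1]]))
        else:
            # Find the two keyframes bracketing frame f and blend between them.
            hi = frames[bisect_right(frames, f)]
            lo = frames[bisect_right(frames, f) - 1]
            t = (f - lo) / (hi - lo)
            a, b = keyframes[lo], keyframes[hi]
            boxes.append(tuple((1 - t) * ai + t * bi for ai, bi in zip(a, b)))
    return boxes

# Example: a subject box sliding from the left to the right side over 24 frames.
trajectory = interpolate_boxes({0: (0.0, 0.3, 0.3, 0.7),
                                23: (0.7, 0.3, 1.0, 0.7)}, num_frames=24)
```

The same interpolation idea extends to blending text-prompt embeddings between keyframes, which is how appearance changes can be keyframed alongside position.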

Implementation

TrailBlazer is built on a pre-trained T2V model and requires no additional training, fine-tuning, or inference-time optimization. The attention edits are applied only during the initial denoising steps, guiding activations toward the desired object location while preserving the model's learned text-image associations. The core algorithm has low complexity and is highly efficient, reportedly implementable in fewer than 200 lines of code. A key factor in the approach is careful tuning of parameters such as the trailing attention-map indices and the number of temporal denoising steps; these choices balance adherence to the bounding-box guidance against the naturalness of the resulting motion.
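The following is a hedged sketch of the general idea behind spatial attention editing during early denoising steps; the function name, tensor layout, and the `boost` constant are illustrative assumptions rather than the paper's exact implementation. The intent is to bias cross-attention for the subject's prompt tokens toward positions inside the bounding box and away from positions outside it.

```python
# Illustrative sketch of bounding-box-guided cross-attention editing.
import torch

def edit_cross_attention(attn_logits: torch.Tensor,   # (heads, H*W, num_tokens)
                         box: tuple[float, float, float, float],  # normalized x0, y0, x1, y1
                         subject_token_ids: list[int],
                         latent_h: int, latent_w: int,
                         boost: float = 4.0) -> torch.Tensor:
    """Bias attention logits for the subject tokens toward the box region."""
    x0, y0, x1, y1 = box
    ys = torch.arange(latent_h).float() / latent_h
    xs = torch.arange(latent_w).float() / latent_w
    inside = ((ys[:, None] >= y0) & (ys[:, None] < y1) &
              (xs[None, :] >= x0) & (xs[None, :] < x1))   # (H, W) spatial mask
    inside = inside.flatten()                              # (H*W,), row-major
    edited = attn_logits.clone()
    for tok in subject_token_ids:
        # Strengthen the subject token inside the box, weaken it outside.
        edited[:, inside, tok] += boost
        edited[:, ~inside, tok] -= boost
    return edited
```

In practice such edits would be applied only for the first few denoising steps and only to selected (including trailing) token indices, after which the unmodified model takes over so the overall image statistics are preserved.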

Results and Evaluations

TrailBlazer yields surprisingly natural object motion, with emergent effects such as perspective shifts and objects approaching or receding from the virtual camera as the box size changes. The system was tested in a variety of scenarios, including single and multiple subjects and differing environmental conditions. Quantitative evaluations using metrics such as the Fréchet Inception Distance (FID) show performance comparable to or better than alternative approaches. The method does have limitations: failure modes inherited from the underlying diffusion model, such as object deformation and difficulty generating multiple objects, persist. Nonetheless, it lays the groundwork for user-friendly, controllable text-to-video synthesis that should continue to improve alongside advances in generative models.
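As a rough illustration of an FID-style comparison between generated and reference frames, the snippet below uses torchmetrics; this is an assumption about tooling, not a reproduction of the paper's evaluation pipeline, and the random tensors are placeholders for real frame batches.

```python
# Sketch: FID between reference frames and generated frames (uint8, N x 3 x H x W).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Placeholders; in practice these would be decoded video frames.
real_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
generated_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(generated_frames, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower is better
```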

For detailed visuals and supportive materials, readers can visit the provided project page, which includes comprehensive ablations and examples of TrailBlazer's capabilities in practice.
