Tora: Trajectory-oriented Diffusion Transformer for Video Generation (2407.21705v4)

Published 31 Jul 2024 in cs.CV

Abstract: Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D motion compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that accurately follow designated trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate that Tora excels in achieving high motion fidelity compared to the foundational DiT model, while also accurately simulating the complex movements of the physical world. Code is made available at https://github.com/alibaba/Tora .

Summary

  • The paper presents a novel DiT framework integrating trajectory extraction and adaptive fusion, achieving 3–5x improvement in trajectory adherence over U-Net models.
  • It leverages a 3D VAE for encoding motion patches and uses adaptive normalization to enhance motion consistency in long, high-resolution videos.
  • A two-stage dense-to-sparse training strategy and evaluation on the FVD, CLIPSIM, and TrajError metrics demonstrate robust scalability and motion fidelity.

Tora: Trajectory-Oriented Diffusion Transformer for Video Generation

Introduction and Motivation

The paper introduces Tora, a trajectory-oriented Diffusion Transformer (DiT) framework for video generation, designed to address the limitations of prior video diffusion models in motion controllability, scalability, and fidelity. While previous approaches—primarily based on U-Net architectures—have demonstrated competence in short, low-resolution video synthesis, they exhibit significant degradation in motion consistency and visual quality as video length and resolution increase. Tora leverages the scalability of DiT architectures and introduces explicit trajectory conditioning, enabling precise, long-range, and physically plausible motion control in generated videos.

Architectural Innovations

Tora is built upon the OpenSora DiT backbone and introduces two key modules: the Trajectory Extractor (TE) and the Motion-guidance Fuser (MGF). The TE encodes user-specified trajectories into hierarchical spacetime motion patches using a 3D VAE, ensuring that motion information is embedded in the same latent space as video patches. The MGF injects these motion conditions into the DiT blocks via adaptive normalization, facilitating seamless integration of trajectory guidance at multiple abstraction levels.

Figure 1: Overview of the Tora architecture, highlighting the Trajectory Extractor and Motion-guidance Fuser for trajectory-controlled video generation.

The TE first transforms trajectory vectors into dense trajectory maps, applies Gaussian filtering, and visualizes them in the RGB domain. These are then compressed by a 3D VAE (initialized with SDXL weights for spatial compression), yielding motion latents that are patchified and processed through stacked convolutional layers with skip connections to extract multi-level motion features. The MGF explores several fusion strategies (extra channel, cross-attention, adaptive norm), with adaptive normalization empirically demonstrating the best trade-off between performance and computational efficiency.
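
As a concrete illustration of the TE's input preparation, the sketch below rasterizes a single point trajectory into Gaussian-filtered displacement maps of the kind that would then be fed to the 3D compression network. The function name, the single-trajectory case, and the Gaussian width are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def trajectory_to_maps(points, num_frames, height, width, sigma=3.0):
    """Rasterize per-frame (x, y) trajectory positions into dense displacement maps.

    points: array of shape (num_frames, 2) with pixel coordinates of the
            tracked point in each frame (illustrative single-trajectory case).
    Returns maps of shape (num_frames - 1, height, width, 2), where each map
    stores the (dx, dy) displacement splatted at the trajectory location and
    spread with a Gaussian filter, mimicking a dense flow-like condition.
    """
    maps = np.zeros((num_frames - 1, height, width, 2), dtype=np.float32)
    for t in range(num_frames - 1):
        x, y = np.round(points[t]).astype(int)
        dx, dy = points[t + 1] - points[t]
        if 0 <= y < height and 0 <= x < width:
            maps[t, y, x, 0] = dx
            maps[t, y, x, 1] = dy
        # Gaussian filtering turns the single splat into a smooth local field.
        maps[t, ..., 0] = gaussian_filter(maps[t, ..., 0], sigma=sigma)
        maps[t, ..., 1] = gaussian_filter(maps[t, ..., 1], sigma=sigma)
    return maps

# Example: a point moving diagonally across a 16-frame, 64x64 clip.
traj = np.stack([np.linspace(8, 56, 16), np.linspace(8, 56, 16)], axis=1)
motion_maps = trajectory_to_maps(traj, num_frames=16, height=64, width=64)
print(motion_maps.shape)  # (15, 64, 64, 2)
```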

Figure 2: Comparison of different MGF designs; adaptive normalization yields the best performance for trajectory conditioning.
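
A minimal PyTorch sketch of adaptive-normalization fusion in the spirit of the MGF: a motion embedding predicts per-channel scale and shift terms that modulate the normalized hidden states of a DiT block. The module layout, the token-wise alignment of motion and video features, and the zero initialization are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class AdaptiveNormFuser(nn.Module):
    """Inject motion-condition features into a DiT block via adaptive LayerNorm.

    The motion embedding predicts a per-channel (scale, shift) pair that
    modulates the normalized hidden states, similar in spirit to AdaLN.
    """
    def __init__(self, hidden_dim: int, motion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Zero-init so the fuser starts as an identity mapping on the backbone.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, hidden: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, hidden_dim); motion: (batch, tokens, motion_dim)
        scale, shift = self.to_scale_shift(motion).chunk(2, dim=-1)
        return hidden + self.norm(hidden) * scale + shift

# Example usage with toy dimensions.
fuser = AdaptiveNormFuser(hidden_dim=1152, motion_dim=256)
h = torch.randn(2, 1024, 1152)   # video tokens inside a DiT block
m = torch.randn(2, 1024, 256)    # aligned motion patches from the TE
print(fuser(h, m).shape)         # torch.Size([2, 1024, 1152])
```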

Training Pipeline and Data Processing

Tora's training pipeline is designed for robust motion controllability across diverse conditions. The model is trained on a curated dataset of 630k video clips, each annotated with captions and dense/sparse motion trajectories. The training employs a two-stage strategy: initial training with dense optical flow to accelerate motion learning, followed by fine-tuning with sparse, user-friendly trajectories. This hybrid approach improves adaptability to various motion patterns and enhances the model's ability to generalize to arbitrary trajectory inputs.
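
The dense-to-sparse idea can be pictured as subsampling a handful of point tracks from dense optical flow to stand in for user-drawn trajectories during the second stage. The seeding heuristic and track count below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def sparsify_flow(flow, num_tracks=4):
    """Turn dense optical flow into a few point trajectories.

    flow: array of shape (T - 1, H, W, 2) of (dx, dy) displacements.
    Seeds are chosen where the first-frame motion magnitude is largest,
    then each seed is propagated forward by following the local flow.
    Returns an array of shape (num_tracks, T, 2) of (x, y) positions.
    """
    t_minus_1, height, width, _ = flow.shape
    magnitude = np.linalg.norm(flow[0], axis=-1)
    # Pick the strongest-moving pixels as seeds (illustrative heuristic).
    seed_idx = np.argsort(magnitude.ravel())[-num_tracks:]
    ys, xs = np.unravel_index(seed_idx, (height, width))

    tracks = np.zeros((num_tracks, t_minus_1 + 1, 2), dtype=np.float32)
    tracks[:, 0, 0], tracks[:, 0, 1] = xs, ys
    for t in range(t_minus_1):
        for k in range(num_tracks):
            x, y = tracks[k, t]
            xi = int(np.clip(round(x), 0, width - 1))
            yi = int(np.clip(round(y), 0, height - 1))
            dx, dy = flow[t, yi, xi]
            tracks[k, t + 1] = (x + dx, y + dy)
    return tracks

# Example with random flow standing in for a 16-frame, 64x64 clip.
tracks = sparsify_flow(np.random.randn(15, 64, 64, 2).astype(np.float32))
print(tracks.shape)  # (4, 16, 2)
```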

The data processing pipeline includes scene segmentation, aesthetic scoring, optical flow-based filtering, and motion segmentation to ensure high-quality, object-centric motion annotations. Captioning is performed using PLLaVA-13B, and static or camera-motion-dominated clips are filtered out to focus on object motion.
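
One plausible (assumed, not paper-specified) realization of the flow-based filtering step is to reject clips whose mean flow magnitude is too small, or whose motion is nearly uniform across the frame, which typically indicates camera rather than object motion. The thresholds below are illustrative.

```python
import numpy as np

def keep_clip(flow, min_motion=0.5, max_uniformity=0.8):
    """Heuristic filter for object-centric motion (thresholds are illustrative).

    flow: dense optical flow of shape (T - 1, H, W, 2).
    Rejects near-static clips (tiny mean magnitude) and clips where motion is
    nearly uniform across the frame, which usually indicates camera movement
    rather than object movement.
    """
    magnitude = np.linalg.norm(flow, axis=-1)   # (T-1, H, W)
    mean_mag = magnitude.mean()
    if mean_mag < min_motion:
        return False                            # static clip
    # How much of the per-pixel flow is explained by a single global flow vector.
    global_flow = flow.mean(axis=(1, 2), keepdims=True)
    residual = np.linalg.norm(flow - global_flow, axis=-1).mean()
    uniformity = 1.0 - residual / (mean_mag + 1e-6)
    return uniformity < max_uniformity          # reject camera-dominated motion

print(keep_clip(np.random.randn(15, 64, 64, 2)))  # True for this random "motion"
```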

Quantitative and Qualitative Evaluation

Tora is evaluated against state-of-the-art motion-controllable video generation models (e.g., VideoComposer, DragNUWA, AnimateAnything, TrailBlazer, MotionCtrl) on standard metrics: Fréchet Video Distance (FVD), CLIP Similarity (CLIPSIM), and Trajectory Error (TrajError). Tora consistently outperforms baselines, especially as video length increases. For 128-frame sequences, Tora achieves a TrajError of 11.72, compared to 38.39–58.76 for U-Net-based methods, representing a 3–5x improvement in trajectory adherence. FVD and CLIPSIM metrics also indicate superior visual quality and semantic alignment.

Figure 3: Trajectory Error across resolutions and durations; Tora maintains gradual error increase, unlike U-Net models which degrade rapidly.
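
The Trajectory Error values can be read as an average pixel distance between the requested and the achieved motion. A minimal sketch of such a metric follows, assuming the generated video's trajectories have already been recovered with an off-the-shelf point tracker (not shown here); the exact protocol used in the paper may differ.

```python
import numpy as np

def trajectory_error(target, tracked):
    """Mean Euclidean distance between target and tracked point positions.

    target, tracked: arrays of shape (num_tracks, num_frames, 2) holding
    (x, y) pixel coordinates per trajectory and frame.
    """
    return float(np.linalg.norm(target - tracked, axis=-1).mean())

# Toy example: tracked points deviate from the target by ~2 pixels on average.
target = np.random.rand(4, 16, 2) * 256
tracked = target + np.random.randn(4, 16, 2) * 2
print(round(trajectory_error(target, tracked), 2))
```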

Qualitative results further demonstrate that Tora generates smoother, more physically plausible motion, with reduced artifacts such as motion blur, object deformation, and unintended camera movement. The model is capable of handling multiple objects, diverse aspect ratios, and long durations (up to 204 frames at 720p), with robust trajectory following.

Figure 4: Tora-generated samples under various visual and trajectory conditions, including multi-object and multi-condition scenarios.

Figure 5: Qualitative comparison of trajectory control; Tora produces smoother, more realistic motion than competing methods.

Ablation Studies

Ablation experiments validate the design choices in trajectory compression and motion fusion. The 3D VAE-based trajectory encoding outperforms frame sampling and average pooling, preserving global motion context and yielding lower FVD and TrajError. Among fusion strategies, adaptive normalization in the MGF module achieves the best results, attributed to its dynamic feature adaptation and temporal consistency. Integrating MGF within the Temporal DiT block further enhances motion fidelity. The two-stage training strategy (dense-to-sparse flow) is shown to be critical for effective learning of both detailed and user-friendly trajectory conditions.
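
To make the compared compression strategies concrete, the sketch below contrasts temporal average pooling and frame sampling, which discard most temporal structure, with a strided 3D convolutional encoder standing in for the paper's 3D VAE. The encoder architecture and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Trajectory maps as a (batch, channels, frames, height, width) tensor.
maps = torch.randn(1, 2, 16, 64, 64)

# Baseline 1: temporal average pooling collapses motion over time entirely.
avg_pooled = maps.mean(dim=2, keepdim=True)   # (1, 2, 1, 64, 64)

# Baseline 2: frame sampling keeps a few time steps but drops the rest.
sampled = maps[:, :, ::4]                     # (1, 2, 4, 64, 64)

# Stand-in for the 3D compression path: strided 3D convolutions reduce both
# space and time while still producing a spatiotemporal latent (an illustrative
# encoder, not the paper's 3D VAE).
encoder = nn.Sequential(
    nn.Conv3d(2, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
    nn.SiLU(),
    nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
)
latent = encoder(maps)
print(avg_pooled.shape, sampled.shape, latent.shape)
# torch.Size([1, 2, 1, 64, 64]) torch.Size([1, 2, 4, 64, 64]) torch.Size([1, 128, 4, 16, 16])
```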

Implications and Future Directions

Tora establishes a new baseline for trajectory-conditioned video generation, demonstrating that transformer-based diffusion models, when equipped with explicit motion conditioning and scalable architectures, can achieve high-fidelity, long-duration, and physically consistent video synthesis. The explicit separation of motion encoding and fusion enables flexible integration of diverse control signals (text, image, trajectory), paving the way for more interactive and user-driven video generation systems.

Practically, Tora's approach is well-suited for applications in animation, simulation, robotics, and content creation, where precise motion control and high visual fidelity are required. Theoretically, the work highlights the importance of hierarchical motion representation and adaptive conditioning in generative video models. Future research may explore more efficient trajectory encoding schemes, improved temporal consistency mechanisms, and integration with reinforcement learning or physics-based priors for even more realistic world modeling.

Conclusion

Tora advances the state of motion-controllable video generation by introducing a trajectory-oriented DiT framework with hierarchical motion encoding and adaptive fusion. The model achieves strong numerical results in both trajectory adherence and visual quality, particularly for long, high-resolution videos. The architectural and training innovations presented in Tora provide a robust foundation for future research in controllable, scalable, and physically grounded video generation.
