- The paper introduces a novel two-stage pipeline that first predicts pixel-wise motion fields using diffusion models and then synthesizes video frames with a motion-augmented temporal attention mechanism.
- The paper achieves superior temporal consistency and dynamic motion fidelity, outperforming state-of-the-art methods like VideoComposer and DynamiCrafter.
- The paper offers fine-grained control features, including sparse trajectory editing and region-specific animation, enabling precise user-directed video synthesis.
Introduction
The paper introduces Motion-I2V, a novel image-to-video (I2V) generation framework that addresses the challenges of temporal consistency and controllability in animating still images. Traditional methods have typically been narrow in scope, handling only specific categories such as human portraits or fluid dynamics, which limits their applicability to general I2V tasks. Recent diffusion models, while impressive in the diversity of the images they generate, struggle to preserve temporal coherence across a sequence, especially under large motion and viewpoint changes. Motion-I2V circumvents these limitations by decomposing image-to-video generation into a two-stage pipeline with an explicit motion prediction step.
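The overall flow of the two stages can be summarized in a short sketch. This is a high-level illustration only; the module names (motion_predictor, renderer) and the tensor shapes are assumptions made for exposition, not the authors' released interface.

```python
# High-level sketch of the two-stage decomposition.
# `motion_predictor` and `renderer` are hypothetical stand-ins for the
# first- and second-stage diffusion models; shapes are illustrative.
import torch

def animate_image(motion_predictor, renderer, image: torch.Tensor,
                  prompt: str, num_frames: int = 16) -> torch.Tensor:
    """image: reference frame of shape (3, H, W); returns video of shape (T, 3, H, W)."""
    # Stage 1: sample pixel-wise motion fields (one displacement map per
    # target frame, relative to the reference frame) from the motion predictor.
    motion_fields = motion_predictor.sample(image, prompt, num_frames)   # (T, 2, H, W)

    # Stage 2: render temporally consistent frames, using the predicted
    # motion fields to guide the second-stage video diffusion model.
    video = renderer.sample(image, prompt, motion_fields)                # (T, 3, H, W)
    return video
```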
Motion Modeling and Generation
The first stage of Motion-I2V is devoted to predicting motions that can plausibly animate the static image. A diffusion-based motion field predictor estimates the pixel-wise trajectories that describe how the input image should evolve over time. Key to this stage is a pre-trained video diffusion model fine-tuned for motion prediction, which conditions on the textual instruction and the reference image to predict pixel trajectories. By encoding the motion fields into a latent representation, the model learns to produce dynamic, realistic motion while preserving the visual priors inherited from the pre-trained diffusion model. Training proceeds in two steps: the predictor is first trained to predict a single displacement field and is then extended to video-length motion fields.
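To make the first stage concrete, the following is a minimal sketch of one denoising-training step for a latent motion-field diffusion model. It assumes a hypothetical flow autoencoder and an image-conditioned UNet; the module names and call signatures are assumptions for illustration, not the authors' implementation.

```python
# Sketch of one training step: diffuse latent motion fields and train the
# UNet to predict the added noise, conditioned on reference image and text.
import torch
import torch.nn.functional as F

def motion_diffusion_loss(unet, flow_encoder, image_encoder, text_emb,
                          flows, ref_image, alphas_cumprod):
    """flows: (B, T, 2, H, W) ground-truth displacement fields w.r.t. the reference frame."""
    b = flows.shape[0]
    z0 = flow_encoder(flows)                               # latent motion fields, e.g. (B, T, C, h, w)
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=flows.device)
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise            # forward diffusion on motion latents

    cond = image_encoder(ref_image)                        # reference-image conditioning
    pred = unet(zt, t, context=text_emb, image_cond=cond)  # hypothetical conditioned UNet call
    return F.mse_loss(pred, noise)
```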
Video Rendering with Predicted Motion
The second stage takes the predicted motion fields as input and synthesizes temporally consistent video frames. Here Motion-I2V introduces motion-augmented temporal attention, which injects the motion guidance from the first stage into the video rendering model. The predicted flows are used to warp the features of the reference frame, and the temporal attention layers attend to these motion-aligned features, effectively enlarging the temporal receptive field. This is a significant improvement over plain 1-D temporal attention, whose restricted modeling capacity often leads to limited temporal consistency.
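A minimal sketch of such a motion-augmented temporal attention layer is given below. It assumes the first-stage flows have already been resized to the feature resolution; the backward warping and the single-head attention are simplified for clarity, and the function signatures are illustrative rather than the authors' exact layer. Here to_q, to_k, to_v would be learned linear projections, e.g. torch.nn.Linear(C, C).

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp features with a per-pixel displacement field.
    feat: (B, C, H, W); flow: (B, 2, H, W) displacements in pixels (x, y)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()                      # (2, H, W) pixel coordinates
    grid = base.unsqueeze(0) + flow                                  # sampling positions, (B, 2, H, W)
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0                            # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def motion_augmented_temporal_attention(frame_feats, ref_feat, flows, to_q, to_k, to_v):
    """Per-pixel temporal attention whose keys/values also include the
    reference-frame features warped by the predicted motion fields.
    frame_feats: (T, C, H, W); ref_feat: (C, H, W); flows: (T, 2, H, W)."""
    T, C, H, W = frame_feats.shape
    warped_ref = warp(ref_feat.expand(T, -1, -1, -1), flows)         # motion-aligned reference context

    q_src = frame_feats.permute(2, 3, 0, 1).reshape(H * W, T, C)     # queries: each pixel's features over time
    kv = torch.cat([frame_feats, warped_ref], dim=0)                 # keys/values: frames + warped reference
    kv = kv.permute(2, 3, 0, 1).reshape(H * W, 2 * T, C)

    q, k, v = to_q(q_src), to_k(kv), to_v(kv)
    attn = torch.softmax(q @ k.transpose(-1, -2) / C ** 0.5, dim=-1)  # (HW, T, 2T)
    out = attn @ v                                                    # (HW, T, C)
    return out.reshape(H, W, T, C).permute(2, 3, 0, 1)                # back to (T, C, H, W)
```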
Fine-Grained Control Mechanisms
Motion-I2V does not only improve consistency; it also gives users control over the animation process. Integrating a ControlNet into the first-stage motion predictor allows sparse trajectory annotations, letting users dictate precise movements in the generated video. The framework also supports region-specific animation, where selected parts of an image are animated while the rest remains static. In addition, Motion-I2V extends to zero-shot video-to-video translation: users can restyle the first frame of a video and propagate that transformation through the sequence using the predicted motions.
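As a rough illustration of how such control signals could be represented, the sketch below builds a sparse flow map from user-dragged trajectories and applies a binary mask for region-specific animation. The tensor layout and helper names are assumptions made for exposition, not the paper's exact interface.

```python
import torch

def sparse_trajectory_map(trajectories, H, W, T):
    """Rasterize user-dragged point trajectories into a sparse conditioning flow.
    trajectories: list of length-T sequences of (x, y) positions for dragged points."""
    flow = torch.zeros(T, 2, H, W)       # per-frame displacement w.r.t. frame 0 at annotated pixels
    valid = torch.zeros(T, 1, H, W)      # marks pixels where a trajectory is given
    for traj in trajectories:
        x0, y0 = int(traj[0][0]), int(traj[0][1])
        for t in range(T):
            flow[t, 0, y0, x0] = traj[t][0] - traj[0][0]   # x-displacement
            flow[t, 1, y0, x0] = traj[t][1] - traj[0][1]   # y-displacement
            valid[t, 0, y0, x0] = 1.0
    return flow, valid

def apply_region_mask(motion_fields, mask):
    """Zero out motion outside the user-selected region so it stays static.
    motion_fields: (T, 2, H, W); mask: (H, W) with 1 inside the animated region."""
    return motion_fields * mask.view(1, 1, *mask.shape)
```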
Comparative Analysis
Assessed quantitatively and qualitatively against state-of-the-art approaches such as VideoComposer and DynamiCrafter, Motion-I2V demonstrates superior performance in following textual instructions and maintaining temporal consistency without sacrificing the range of motion. The controlled experiments show improved robustness, with generated videos exhibiting larger and more consistent motion than those of competing methods. This sets a new benchmark for open-domain I2V tasks.
Conclusion
In summary, Motion-I2V successfully addresses pivotal shortcomings in prior image-to-video methods by splitting the task into dedicated stages for motion prediction and video synthesis. Its explicit motion modeling component ensures larger, more realistic motions, while the second-stage video rendering maintains high fidelity and consistency. Moreover, the incorporated fine-grained control features, from sparse trajectory editing to region-specific animation, point towards a future where users can seamlessly steer the narrative of their generated video content. In the field of I2V synthesis, Motion-I2V represents a significant leap forward.
Acknowledgements
The paper was supported in part by the National Key R&D Program of China Project and the General Research Fund of Hong Kong RGC Project.