Abstract

We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to the synthesized frames under the guidance of the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V enables users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations, offering finer control of the I2V process than textual instructions alone. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/.

Figure: Motion-I2V overview. The first stage deduces plausible motions for animating the input image; the second stage synthesizes frames with a motion-augmented temporal layer.

Overview

  • Introduces Motion-I2V, a novel image-to-video generation framework that enhances temporal consistency and controllability.

  • Utilizes a diffusion-based motion field predictor to deduce realistic motion trajectories for animating images.

  • Employs a motion-augmented temporal attention framework to produce temporally consistent video frames.

  • Allows fine-grained user control over the animation of static images, and supports zero-shot video-to-video style translation.

  • Demonstrates superior performance compared to state-of-the-art methods in terms of motion dynamics and fidelity.

Introduction

The study introduces Motion-I2V, a novel image-to-video generation (I2V) framework that addresses the challenges of temporal consistency and controllability in animating still images. Traditional methods have typically been narrow in scope, handling only specific categories such as human portraits or fluid dynamics, which limits their applicability to general I2V tasks. Meanwhile, recent diffusion models, although impressive in the diversity of the images they generate, struggle to preserve temporal coherence across sequences, especially under significant motion and viewpoint changes. Motion-I2V circumvents these limitations by decomposing the process into a two-stage pipeline with an explicit focus on motion prediction.
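To make the factorization concrete, the following is a minimal sketch of how the two stages compose. The callables motion_predictor and video_renderer are illustrative placeholders standing in for the stage-1 and stage-2 models, not the authors' actual API.

```python
import torch


def motion_i2v_pipeline(ref_image: torch.Tensor,
                        prompt: str,
                        motion_predictor,
                        video_renderer,
                        num_frames: int = 16) -> torch.Tensor:
    """Illustrative composition of the two Motion-I2V stages.

    ref_image: (3, H, W) reference frame
    prompt:    textual instruction describing the desired motion
    The motion_predictor and video_renderer interfaces are hypothetical.
    """
    # Stage 1: diffusion-based motion field prediction.
    # Output: per-frame displacement fields mapping reference pixels to
    # their positions in each target frame, shape (T, 2, H, W).
    motion_fields = motion_predictor(ref_image, prompt, num_frames=num_frames)

    # Stage 2: motion-augmented video rendering.
    # Reference-image features are propagated along the predicted
    # trajectories to synthesize temporally consistent frames.
    video = video_renderer(ref_image, prompt, motion_fields)  # (T, 3, H, W)
    return video
```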

Motion Modeling and Generation

The foundational first stage of Motion-I2V is devoted to deducing motions that can plausibly animate a static image. A diffusion-based motion field predictor determines the pixel-wise trajectories describing how the input image should evolve over time. Key to this stage is a fine-tuned, pre-trained video diffusion model that takes textual instructions alongside the reference image to predict the pixel trajectories. By encoding these motion fields into a latent representation, the model learns to produce dynamic, realistic motion while preserving the visual semantics inherited from the pre-trained diffusion model. The training strategy first targets single displacement fields before scaling up to the video level.
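As a hedged illustration of what the first-stage training objective might look like under a standard latent-diffusion setup, the sketch below encodes the ground-truth displacement field into a latent, corrupts it with noise, and trains the denoiser to recover the noise conditioned on the reference-image latent and the text prompt. The flow_encoder, denoiser, and scheduler interfaces are assumptions for illustration, not the released code.

```python
import torch
import torch.nn.functional as F


def motion_diffusion_loss(flow_encoder, denoiser, scheduler,
                          ref_latent, text_emb, gt_flow):
    """One training step for the stage-1 motion field predictor (sketch).

    gt_flow:    (B, 2, H, W) ground-truth displacement field
    ref_latent: (B, C, h, w) latent of the reference image
    text_emb:   (B, L, D)    text-prompt embeddings
    All module interfaces here are placeholders for illustration.
    """
    # Encode the motion field into a latent representation.
    flow_latent = flow_encoder(gt_flow)                        # (B, C, h, w)

    # Sample a diffusion timestep and corrupt the latent with noise.
    b = flow_latent.shape[0]
    t = torch.randint(0, scheduler.num_train_timesteps, (b,),
                      device=flow_latent.device)
    noise = torch.randn_like(flow_latent)
    noisy = scheduler.add_noise(flow_latent, noise, t)

    # Denoise conditioned on the reference image (channel concatenation)
    # and the text prompt (e.g. via cross-attention inside the denoiser).
    pred = denoiser(torch.cat([noisy, ref_latent], dim=1), t, text_emb)

    # Standard epsilon-prediction objective.
    return F.mse_loss(pred, noise)
```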

Video Rendering with Predicted Motion

The second stage takes the motion fields predicted in the first stage and synthesizes temporally consistent video frames. This is where Motion-I2V introduces motion-augmented temporal attention, which improves the video's fidelity using the motion guidance from the first stage. The predicted trajectories let the temporal attention draw on warped features of the reference image, substantially enlarging its temporal receptive field. This is a notable advance over conventional 1-D temporal attention, whose restricted modeling capacity often leads to limited temporal consistency.
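The sketch below gives one plausible reading of motion-augmented temporal attention, assuming the predicted displacement fields are used to warp reference-frame features into each target frame and append them as extra key/value tokens in the 1-D temporal attention; the exact layer design in the paper may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn


def warp_by_flow(ref_feat, flow):
    """Sample reference-frame features (B, C, H, W) at locations given by a
    displacement field (B, 2, H, W), so each target-frame pixel sees the
    reference feature that the stage-1 motion maps it to."""
    b, _, h, w = ref_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=ref_feat.device),
                            torch.arange(w, device=ref_feat.device),
                            indexing="ij")
    base = torch.stack([xs, ys], dim=0).float()                 # (2, H, W)
    coords = base.unsqueeze(0) + flow                           # (B, 2, H, W)
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    grid = torch.stack([2 * coords[:, 0] / (w - 1) - 1,
                        2 * coords[:, 1] / (h - 1) - 1], dim=-1)  # (B, H, W, 2)
    return F.grid_sample(ref_feat, grid, align_corners=True)


class MotionAugmentedTemporalAttention(nn.Module):
    """Sketch: 1-D temporal attention whose key/value set is augmented
    with reference-frame features warped along the predicted motion."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats, ref_feat, flows):
        """frame_feats: (B, T, C, H, W) per-frame features
        ref_feat:    (B, C, H, W)    reference-frame features
        flows:       (B, T, 2, H, W) stage-1 displacement fields"""
        b, t, c, h, w = frame_feats.shape
        # Warp the reference features into every target frame.
        warped = torch.stack([warp_by_flow(ref_feat, flows[:, i])
                              for i in range(t)], dim=1)        # (B, T, C, H, W)

        # Attend along the temporal axis at each spatial location, with
        # the warped reference features appended as extra tokens.
        q = frame_feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        kv = torch.cat([frame_feats, warped], dim=1)            # (B, 2T, C, H, W)
        kv = kv.permute(0, 3, 4, 1, 2).reshape(b * h * w, 2 * t, c)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

Appending the warped reference tokens, rather than replacing the per-frame tokens, keeps the original temporal pathway intact while giving every frame direct, trajectory-aligned access to the reference image's features.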

Fine-Grained Control Mechanisms

Motion-I2V doesn't just improve consistency; it also provides mechanisms for user control over the animation process. A sparse trajectory ControlNet trained on the first stage lets users dictate precise movements within generated videos using sparse trajectory annotations. The framework also supports region-specific animation, so selected parts of an image can be animated while the rest remains static. Additionally, Motion-I2V extends to zero-shot video-to-video translation: users can transform the style of a video's first frame and propagate that transformation through the sequence using the predicted motions.
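A hedged sketch of how such sparse user controls could be rasterized into conditioning maps for the trajectory ControlNet is shown below; the 4-channel encoding (sparse displacement, validity mask, region mask) is an assumption for illustration, not the paper's exact format.

```python
import torch


def encode_sparse_controls(trajectories, region_mask, h, w):
    """Rasterize user hints into conditioning maps for a sparse trajectory
    ControlNet (illustrative 4-channel encoding, not the paper's format).

    trajectories: list of ((x0, y0), (dx, dy)) pairs: a start pixel and
                  its desired total displacement in pixels
    region_mask:  (H, W) bool tensor, True where motion is allowed
    """
    sparse_flow = torch.zeros(2, h, w)   # desired displacement at annotated pixels
    valid = torch.zeros(1, h, w)         # 1 where a user hint exists
    for (x0, y0), (dx, dy) in trajectories:
        sparse_flow[0, y0, x0] = dx
        sparse_flow[1, y0, x0] = dy
        valid[0, y0, x0] = 1.0

    # The region mask zeroes out motion targets outside the selected area,
    # so everything else stays static.
    region = region_mask.float().unsqueeze(0)                       # (1, H, W)
    return torch.cat([sparse_flow * region, valid, region], dim=0)  # (4, H, W)


# Example: drag the pixel at (64, 80) 20 px to the right inside a permitted region.
mask = torch.zeros(128, 128, dtype=torch.bool)
mask[40:120, 40:120] = True
cond = encode_sparse_controls([((64, 80), (20, 0))], mask, 128, 128)
print(cond.shape)  # torch.Size([4, 128, 128])
```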

Comparative Analysis

Assessed quantitatively and qualitatively against state-of-the-art approaches such as VideoComposer and DynamiCrafter, Motion-I2V demonstrates superior performance in following textual instructions and sustaining temporal consistency without sacrificing the range of motion. Controlled experiments show improved robustness, with generated videos exhibiting larger and more consistent motion dynamics than competing methods. This sets a new benchmark for open-domain I2V tasks.

Conclusion

In summary, Motion-I2V successfully addresses pivotal shortcomings in prior image-to-video methods by splitting the task into dedicated stages for motion prediction and video synthesis. Its explicit motion modeling component ensures larger, more realistic motions, while the second-stage video rendering maintains high fidelity and consistency. Moreover, the incorporated fine-grained control features, from sparse trajectory editing to region-specific animation, point towards a future where users can seamlessly steer the narrative of their generated video content. In the realm of I2V synthesis, Motion-I2V represents a significant leap forward.

Acknowledgements

The study was supported in part by the National Key R&D Program of China Project and the General Research Fund of Hong Kong RGC Project.
