- The paper introduces a novel two-stage pipeline that first predicts pixel-wise motion fields using diffusion models and then synthesizes video frames with a motion-augmented temporal attention mechanism.
- The paper achieves superior temporal consistency and dynamic motion fidelity, outperforming state-of-the-art methods like VideoComposer and DynamiCrafter.
- The paper offers fine-grained control features, including sparse trajectory editing and region-specific animation, enabling precise user-directed video synthesis.
Introduction
The paper introduces Motion-I2V, a novel image-to-video (I2V) generation framework that addresses the challenges of temporal consistency and controllability in animating still images. Traditional methods have typically been narrow in scope, handling only specific categories such as human portraits or fluid dynamics, which limits their applicability to general I2V tasks. Recent diffusion models, while impressive in the diversity of the images they generate, struggle to preserve temporal coherence across a sequence, especially under large motion and viewpoint changes. Motion-I2V circumvents these limitations by decomposing image-to-video generation into a two-stage pipeline with an explicit motion prediction step.
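The overall flow of the two stages can be summarized in a short sketch. This is a high-level illustration only; the module names (motion_predictor, renderer) and the tensor shapes are assumptions made for exposition, not the authors' released interface.

```python
# High-level sketch of the two-stage decomposition.
# `motion_predictor` and `renderer` are hypothetical stand-ins for the
# first- and second-stage diffusion models; shapes are illustrative.
import torch

def animate_image(motion_predictor, renderer, image: torch.Tensor,
                  prompt: str, num_frames: int = 16) -> torch.Tensor:
    """image: reference frame of shape (3, H, W); returns video of shape (T, 3, H, W)."""
    # Stage 1: sample pixel-wise motion fields (one displacement map per
    # target frame, relative to the reference frame) from the motion predictor.
    motion_fields = motion_predictor.sample(image, prompt, num_frames)   # (T, 2, H, W)

    # Stage 2: render temporally consistent frames, using the predicted
    # motion fields to guide the second-stage video diffusion model.
    video = renderer.sample(image, prompt, motion_fields)                # (T, 3, H, W)
    return video
```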
Motion Modeling and Generation
The first stage of Motion-I2V is devoted to predicting motions that can plausibly animate the static image. A diffusion-based motion field predictor estimates the pixel-wise trajectories that describe how the input image should evolve over time. Key to this stage is a pre-trained video diffusion model fine-tuned for motion prediction, which conditions on the textual instruction and the reference image to predict pixel trajectories. By encoding the motion fields into a latent representation, the model learns to produce dynamic, realistic motion while preserving the visual priors inherited from the pre-trained diffusion model. Training proceeds in two steps: the predictor is first trained to predict a single displacement field and is then extended to video-length motion fields.
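To make the first stage concrete, the following is a minimal sketch of one denoising-training step for a latent motion-field diffusion model. It assumes a hypothetical flow autoencoder and an image-conditioned UNet; the module names and call signatures are assumptions for illustration, not the authors' implementation.

```python
# Sketch of one training step: diffuse latent motion fields and train the
# UNet to predict the added noise, conditioned on reference image and text.
import torch
import torch.nn.functional as F

def motion_diffusion_loss(unet, flow_encoder, image_encoder, text_emb,
                          flows, ref_image, alphas_cumprod):
    """flows: (B, T, 2, H, W) ground-truth displacement fields w.r.t. the reference frame."""
    b = flows.shape[0]
    z0 = flow_encoder(flows)                               # latent motion fields, e.g. (B, T, C, h, w)
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=flows.device)
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise            # forward diffusion on motion latents

    cond = image_encoder(ref_image)                        # reference-image conditioning
    pred = unet(zt, t, context=text_emb, image_cond=cond)  # hypothetical conditioned UNet call
    return F.mse_loss(pred, noise)
```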
Video Rendering with Predicted Motion
The second stage takes the predicted motion fields as input and synthesizes temporally consistent video frames. Here Motion-I2V introduces motion-augmented temporal attention, which injects the motion guidance from the first stage into the video rendering model. The predicted flows are used to warp the features of the reference frame, and the temporal attention layers attend to these motion-aligned features, effectively enlarging the temporal receptive field. This is a significant improvement over plain 1-D temporal attention, whose restricted modeling capacity often leads to limited temporal consistency.
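A minimal sketch of such a motion-augmented temporal attention layer is given below. It assumes the first-stage flows have already been resized to the feature resolution; the backward warping and the single-head attention are simplified for clarity, and the function signatures are illustrative rather than the authors' exact layer. Here to_q, to_k, to_v would be learned linear projections, e.g. torch.nn.Linear(C, C).

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp features with a per-pixel displacement field.
    feat: (B, C, H, W); flow: (B, 2, H, W) displacements in pixels (x, y)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()                      # (2, H, W) pixel coordinates
    grid = base.unsqueeze(0) + flow                                  # sampling positions, (B, 2, H, W)
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0                            # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def motion_augmented_temporal_attention(frame_feats, ref_feat, flows, to_q, to_k, to_v):
    """Per-pixel temporal attention whose keys/values also include the
    reference-frame features warped by the predicted motion fields.
    frame_feats: (T, C, H, W); ref_feat: (C, H, W); flows: (T, 2, H, W)."""
    T, C, H, W = frame_feats.shape
    warped_ref = warp(ref_feat.expand(T, -1, -1, -1), flows)         # motion-aligned reference context

    q_src = frame_feats.permute(2, 3, 0, 1).reshape(H * W, T, C)     # queries: each pixel's features over time
    kv = torch.cat([frame_feats, warped_ref], dim=0)                 # keys/values: frames + warped reference
    kv = kv.permute(2, 3, 0, 1).reshape(H * W, 2 * T, C)

    q, k, v = to_q(q_src), to_k(kv), to_v(kv)
    attn = torch.softmax(q @ k.transpose(-1, -2) / C ** 0.5, dim=-1)  # (HW, T, 2T)
    out = attn @ v                                                    # (HW, T, C)
    return out.reshape(H, W, T, C).permute(2, 3, 0, 1)                # back to (T, C, H, W)
```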
Fine-Grained Control Mechanisms
Motion-I2V does not only improve consistency; it also gives users control over the animation process. Integrating a ControlNet into the first-stage motion predictor allows sparse trajectory annotations, letting users dictate precise movements in the generated video. The framework also supports region-specific animation, where selected parts of an image are animated while the rest remains static. In addition, Motion-I2V extends to zero-shot video-to-video translation: users can restyle the first frame of a video and propagate that transformation through the sequence using the predicted motions.
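As a rough illustration of how such control signals could be represented, the sketch below builds a sparse flow map from user-dragged trajectories and applies a binary mask for region-specific animation. The tensor layout and helper names are assumptions made for exposition, not the paper's exact interface.

```python
import torch

def sparse_trajectory_map(trajectories, H, W, T):
    """Rasterize user-dragged point trajectories into a sparse conditioning flow.
    trajectories: list of length-T sequences of (x, y) positions for dragged points."""
    flow = torch.zeros(T, 2, H, W)       # per-frame displacement w.r.t. frame 0 at annotated pixels
    valid = torch.zeros(T, 1, H, W)      # marks pixels where a trajectory is given
    for traj in trajectories:
        x0, y0 = int(traj[0][0]), int(traj[0][1])
        for t in range(T):
            flow[t, 0, y0, x0] = traj[t][0] - traj[0][0]   # x-displacement
            flow[t, 1, y0, x0] = traj[t][1] - traj[0][1]   # y-displacement
            valid[t, 0, y0, x0] = 1.0
    return flow, valid

def apply_region_mask(motion_fields, mask):
    """Zero out motion outside the user-selected region so it stays static.
    motion_fields: (T, 2, H, W); mask: (H, W) with 1 inside the animated region."""
    return motion_fields * mask.view(1, 1, *mask.shape)
```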
Comparative Analysis
Assessed quantitatively and qualitatively against state-of-the-art approaches such as VideoComposer and DynamiCrafter, Motion-I2V demonstrates superior performance in following textual instructions and maintaining temporal consistency without sacrificing the range of motion. The controlled experiments show improved robustness, with generated videos exhibiting larger and more consistent motion than those of competing methods. This sets a new benchmark for open-domain I2V tasks.
Conclusion
In summary, Motion-I2V successfully addresses pivotal shortcomings in prior image-to-video methods by splitting the task into dedicated stages for motion prediction and video synthesis. Its explicit motion modeling component ensures larger, more realistic motions, while the second-stage video rendering maintains high fidelity and consistency. Moreover, the incorporated fine-grained control features, from sparse trajectory editing to region-specific animation, point towards a future where users can seamlessly steer the narrative of their generated video content. In the field of I2V synthesis, Motion-I2V represents a significant leap forward.
Acknowledgements
The paper was supported in part by the National Key R&D Program of China Project and the General Research Fund of Hong Kong RGC Project.