PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting (2405.19957v4)
Abstract: Previous text-to-4D methods have leveraged multiple Score Distillation Sampling (SDS) techniques, combining motion priors from video-based diffusion models (DMs) with geometric priors from multiview DMs to implicitly guide 4D renderings. However, differences in these priors result in conflicting gradient directions during optimization, causing trade-offs between motion fidelity and geometry accuracy, and requiring substantial optimization time to reconcile the models. In this paper, we introduce \textbf{P}ixel-\textbf{L}evel \textbf{A}lignment for text-driven \textbf{4D} Gaussian splatting (PLA4D) to resolve this motion-geometry conflict. PLA4D provides an anchor reference, i.e., text-generated video, to align the rendering process conditioned by different DMs in pixel space. For static alignment, our approach introduces a focal alignment method and Gaussian-Mesh contrastive learning to iteratively adjust focal lengths and provide explicit geometric priors at each timestep. At the dynamic level, a motion alignment technique and T-MV refinement method are employed to enforce both pose alignment and motion continuity across unknown viewpoints, ensuring intrinsic geometric consistency across views. With such pixel-level multi-DM alignment, our PLA4D framework is able to generate 4D objects with superior geometric, motion, and semantic consistency. Fully implemented with open-source tools, PLA4D offers an efficient and accessible solution for high-quality 4D digital content creation with significantly reduced generation time.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.