- The paper presents a diffusion-based framework that bypasses precise intrinsic decomposition, achieving high-fidelity face relighting.
- The method encodes 2D facial images into feature vectors and modifies lighting information using both spatial and non-spatial conditioning.
- Experimental results on Multi-PIE benchmarks show that DiFaReli reliably outperforms state-of-the-art techniques in preserving facial details and relighting quality.
Background
Conventional face relighting methods often require complex estimations of facial geometry, albedo, and lighting parameters, as well as an understanding of the interaction between these components, such as cast shadows and global illumination. Prior approaches have faced challenges in handling non-diffuse effects and are typically dependent on the accuracy of estimated intrinsic components, which can be error-prone, particularly in real-world scenarios.
Diffusion-Based Approach
The paper "DiFaReli: Diffusion Face Relighting" introduces a novel framework that bypasses the need for precise intrinsic decomposition by leveraging diffusion models. The authors propose a conditional denoising diffusion implicit model (DDIM) that combines spatial and non-spatial conditioning to relight faces without accurately estimated intrinsic components or 3D and lighting ground truth.
The primary innovation of the paper lies in using a modified DDIM, trained solely on 2D images, to both decode and implicitly learn the complex interactions between light and facial geometry. The approach utilizes off-the-shelf estimators for input encoding, avoiding the need for multi-view or light stage data typically required by traditional methods.
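To make the role of the DDIM concrete, here is a minimal sketch of a single deterministic DDIM reverse step (the eta = 0 case). It assumes the conditional network's noise prediction `eps` is given; in DiFaReli that prediction would be conditioned on the light and identity encodings, which are abstracted away here.

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM (eta = 0) reverse step.

    x_t:       current noisy sample
    eps:       noise predicted by the conditional network (in DiFaReli the
               conditioning would carry the light encoding; abstracted here)
    abar_t:    cumulative alpha-bar product at step t
    abar_prev: cumulative alpha-bar product at step t-1
    """
    # Predicted clean image implied by the noise estimate.
    x0 = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # Move deterministically toward step t-1 along the same noise direction.
    return np.sqrt(abar_prev) * x0 + np.sqrt(1.0 - abar_prev) * eps
```

Because the step is deterministic, running it forward and backward with the same conditioning is what lets the model encode an image and decode a modified version of it.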
Methodology
DiFaReli's approach relies on encoding the input image into a feature vector that disentangles the light information from other facial attributes. During relighting, the light encoding within this vector is modified, and the modified vector is then decoded to obtain the relit image, preserving the subject's identity and details. Spatial conditioning takes the form of a shading reference image, spatially aligned with the input's geometry and lighting, while non-spatial conditioning incorporates facial identity and cast shadow intensity.
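The encode-modify-decode loop described above can be sketched schematically. The encoder and decoder names below are hypothetical stand-ins, not the paper's actual networks; the point is only that relighting reduces to swapping one entry of the disentangled conditioning.

```python
def encode(image, light_estimator, identity_estimator):
    """Disentangle the input into a light code and identity features.

    Both estimators are placeholders for the off-the-shelf networks
    the paper uses for input encoding.
    """
    return {
        "light": light_estimator(image),        # e.g. SH lighting coefficients
        "identity": identity_estimator(image),  # identity/detail features
    }

def relight(image, target_light, estimators, ddim_decode):
    """Relight by swapping only the light encoding, then decoding."""
    cond = encode(image, *estimators)
    cond["light"] = target_light  # everything else is left untouched
    return ddim_decode(image, cond)
```

Because only the light entry changes, the decoded image inherits the subject's identity and fine details from the unmodified parts of the conditioning.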
Key to this method is the use of spherical harmonic lighting to condition the generative process, alongside shape and camera parameters inferred from 3D estimators. Unlike direct rendering, this conditioning only approximates the target illumination, leaving the diffusion model to implicitly capture complex illumination effects. The authors also introduce spatial modulation weights that correlate the conditioning with pixel intensities, giving the diffusion model an easier conditioning signal to learn from.
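A spherical-harmonic shading reference like the one used for spatial conditioning can be computed from surface normals and nine SH lighting coefficients. The sketch below uses the standard real SH normalization constants; the exact basis and conventions in the paper may differ.

```python
import numpy as np

def sh_basis(normals):
    """First 9 real spherical-harmonic basis values, (N, 3) -> (N, 9).

    Constants follow the standard real SH normalization.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    c0 = 0.28209479177  # 1 / (2 * sqrt(pi))
    c1 = 0.48860251190  # sqrt(3 / (4 * pi))
    c2 = 1.09254843059  # sqrt(15 / (4 * pi))
    c3 = 0.31539156525  # sqrt(5 / (16 * pi))
    c4 = 0.54627421529  # sqrt(15 / (16 * pi))
    return np.stack([
        np.full_like(x, c0),          # l=0
        c1 * y, c1 * z, c1 * x,       # l=1
        c2 * x * y, c2 * y * z,       # l=2
        c3 * (3 * z**2 - 1),
        c2 * x * z,
        c4 * (x**2 - y**2),
    ], axis=1)

def shade(normals, sh_coeffs):
    """Diffuse shading: per-pixel dot product of SH basis with lighting."""
    return sh_basis(normals) @ sh_coeffs
```

Applied to normals from an estimated 3D face shape, this yields a shading image aligned with the input's geometry, which is what the spatial conditioning branch consumes.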
Results
Experimental evaluations on standard benchmarks like Multi-PIE demonstrate that DiFaReli can photorealistically relight images, significantly outperforming state-of-the-art models in both qualitative and quantitative evaluations. The approach provides high fidelity in relighting and shadow manipulation while maintaining the subject's original facial details, which are often compromised by alternative methods.
Conclusion
The "DiFaReli: Diffusion Face Relighting" paper presents a groundbreaking diffusion-based framework that tackles the longstanding challenges in face relighting with state-of-the-art performance. By leveraging the power of diffusion models calibrated by light and shadow encodings, this method promises significant advancements in applications requiring photorealistic illumination conditions on faces, such as augmented reality and portrait photography.