- The paper introduces a nonlinear 3DMM that learns from in-the-wild images, eliminating the need for expensive 3D face scans.
- The proposed method utilizes an encoder-decoder architecture with a differentiable rendering layer to accurately convert 2D images into detailed 3D faces.
- Quantitative and qualitative evaluations highlight superior performance in face alignment and reconstruction compared to traditional linear models.
Learning a Nonlinear 3D Morphable Model from In-the-wild Images
The paper "On Learning 3D Face Morphable Model from In-the-wild Images" presents an approach for learning a nonlinear 3D Morphable Model (3DMM) using only in-the-wild images, circumventing the traditional requirement for 3D face scans, which are expensive and laborious to collect. The framework leverages Deep Neural Networks (DNNs) and is trainable end-to-end in a weakly supervised manner, thereby addressing the limitations of previous linear models.
Proposed Nonlinear 3D Morphable Model
Framework Overview
The framework comprises an encoder and two decoders that together form the nonlinear 3DMM. The encoder estimates projection, lighting, shape, and albedo parameters from a 2D face image. The two decoders, acting as nonlinear mappings, convert the shape and albedo parameters into a 3D shape and an albedo map, respectively. Crucially, a differentiable rendering layer reconstructs the input face image by combining the estimated 3D shape and albedo with the lighting and projection parameters. This rendering layer is what makes end-to-end training possible in a weakly supervised setting.
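The pipeline above can be sketched as follows. This is a toy illustration only: the dimensions, the random dense layers standing in for the deep decoder networks, and the placeholder encoder are all assumptions for the sketch, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
N_VERT = 4       # number of mesh vertices (tiny for the sketch)
L_S, L_A = 8, 8  # latent sizes of the shape and albedo codes

rng = np.random.default_rng(0)

def encoder(image):
    """Stand-in for the CNN encoder: maps a 2D face image to
    projection m, lighting L, and latent codes f_S, f_A."""
    feat = image.mean()  # placeholder for learned features
    return {
        "m": np.full(8, feat),   # weak-perspective projection params
        "L": np.full(9, feat),   # spherical-harmonics lighting coeffs
        "f_S": rng.standard_normal(L_S),
        "f_A": rng.standard_normal(L_A),
    }

# The two decoders are nonlinear mappings; a single dense layer
# with a tanh stands in for each deep decoder network here.
W_S = rng.standard_normal((N_VERT * 3, L_S))
W_A = rng.standard_normal((N_VERT * 3, L_A))

def decode_shape(f_S):
    return np.tanh(W_S @ f_S).reshape(N_VERT, 3)  # per-vertex (x, y, z)

def decode_albedo(f_A):
    return np.tanh(W_A @ f_A).reshape(N_VERT, 3)  # per-vertex RGB albedo

params = encoder(np.ones((32, 32)))
S = decode_shape(params["f_S"])   # 3D shape
A = decode_albedo(params["f_A"])  # albedo
# A differentiable renderer would now combine S, A, params["L"],
# and params["m"] to reconstruct the input image.
```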
Figure 1: Conventional 3DMM employs linear bases models for shape/albedo, which are trained with 3D face scans and associated controlled 2D images. We propose a nonlinear 3DMM to model shape/albedo via deep neural networks~(DNNs). It can be trained from in-the-wild face images without 3D scans, and also better reconstruct the original images due to the inherent nonlinearity.
Shape and Albedo Representation
The nonlinear 3DMM learns representations directly from large collections of in-the-wild images, circumventing the traditional need for 3D face scans. The proposed model enhances the representation power by replacing the PCA-based linear bases with deep convolutional networks. The shape and albedo are represented as 2D images, maintaining spatial relationships and leveraging CNNs' capability in image synthesis.
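Storing shape as a 2D image means scattering per-vertex 3D positions into a UV "position map" that a CNN can decode. The sketch below shows the idea with a nearest-pixel scatter; a real pipeline would rasterize the mesh into UV space, and the grid size and UV coordinates here are illustrative assumptions.

```python
import numpy as np

def vertices_to_uv_map(vertices, uv_coords, size=8):
    """Scatter per-vertex 3D positions (N, 3) into a (size, size, 3)
    position map, so shape is stored as an ordinary 2D image.
    uv_coords are per-vertex coordinates in [0, 1]^2."""
    uv_map = np.zeros((size, size, 3))
    # Nearest-pixel scatter; real pipelines rasterize the UV triangles.
    px = np.clip((uv_coords * (size - 1)).round().astype(int), 0, size - 1)
    uv_map[px[:, 1], px[:, 0]] = vertices
    return uv_map

# Two toy vertices mapped to opposite corners of the UV grid.
verts = np.array([[0.1, 0.2, 0.3], [-0.1, 0.0, 0.5]])
uvs = np.array([[0.0, 0.0], [1.0, 1.0]])
uv_map = vertices_to_uv_map(verts, uvs)
```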
Figure 2: Jointly learning a nonlinear 3DMM and its fitting algorithm from unconstrained 2D in-the-wild face image collection, in a weakly supervised fashion.
Differentiable Rendering Layer
A novel differentiable rendering layer is introduced, which facilitates the accurate reconstruction of the face images. This layer integrates shading and albedo information using spherical harmonics to approximate lighting effects. By doing so, the layer ensures that the networks can be trained using 2D image supervision, allowing for realistic texture generation and reconstruction.
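The spherical-harmonics shading model is simple enough to write out. With unit surface normals n and a 9-dimensional lighting vector L, the rendered color at a point is the albedo modulated by the SH shading, I = A * (H(n) . L). The sketch below uses the standard first nine real SH basis functions up to constant factors (the exact normalization constants are omitted).

```python
import numpy as np

def sh_basis(normals):
    """First 9 real spherical-harmonics basis functions (up to
    constant factors), evaluated at unit normals of shape (N, 3)."""
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        np.ones_like(nx), nx, ny, nz,
        nx * ny, nx * nz, ny * nz,
        nx**2 - ny**2, 3 * nz**2 - 1,
    ], axis=1)  # (N, 9)

def shade(albedo, normals, light):
    """I = A * (H(n) . L): per-vertex color from albedo (N, 3),
    normals (N, 3), and SH lighting coefficients (9,)."""
    shading = sh_basis(normals) @ light  # (N,)
    return albedo * shading[:, None]     # (N, 3)
```

Because every operation here is differentiable in the albedo, normals, and lighting, gradients from a 2D image loss flow back into all three, which is what allows training from 2D supervision alone.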
Model Learning and Regularization
The network is trained end-to-end by minimizing a combination of losses: reconstruction, landmark, and regularization terms. The regularizations, namely albedo symmetry, albedo constancy, and shape smoothness, keep the reconstructions plausible. Training starts with intermediate supervision from pseudo-groundtruth on the 300W dataset, then switches to optimizing the full model for improved performance.
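The overall objective combines the terms above. The sketch below shows a plausible form of this combination; the specific loss norms and weights (`lam_lan`, `lam_reg`) are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

def reconstruction_loss(I_hat, I, mask):
    """L1 image loss over pixels covered by the rendered face (mask)."""
    return np.abs((I_hat - I) * mask[..., None]).sum() / max(mask.sum(), 1)

def landmark_loss(U_hat, U):
    """Squared error between projected and annotated 2D landmarks (N, 2)."""
    return np.mean(np.sum((U_hat - U) ** 2, axis=1))

def total_loss(I_hat, I, mask, U_hat, U, L_reg, lam_lan=1.0, lam_reg=1.0):
    """Weighted sum of reconstruction, landmark, and regularization losses."""
    return (reconstruction_loss(I_hat, I, mask)
            + lam_lan * landmark_loss(U_hat, U)
            + lam_reg * L_reg)
```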
Figure 3: Effect of albedo regularizations: albedo symmetry (sym) and albedo constancy (const). When there is no regularization being used, shading is mostly baked into the albedo. Using the symmetry property helps to resolve the global lighting. Using constancy constraint further removes shading from the albedo, which results in a better 3D shape.
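The two albedo regularizers described above can be sketched directly on a UV albedo map. Symmetry compares the albedo with its horizontal mirror; constancy penalizes albedo gradients so shading cannot be baked in. The paper additionally down-weights the constancy penalty near chromaticity edges, which is omitted in this simplified sketch.

```python
import numpy as np

def symmetry_loss(albedo_uv):
    """|| A - flip(A) ||_1: the albedo should match its horizontal
    mirror in UV space, since faces are roughly bilaterally symmetric."""
    return np.abs(albedo_uv - albedo_uv[:, ::-1]).mean()

def constancy_loss(albedo_uv):
    """Penalize spatial gradients of the albedo so that shading ends
    up in the lighting term instead of being baked into the albedo.
    (Simplified: no chromaticity-based edge weighting.)"""
    gx = np.abs(np.diff(albedo_uv, axis=1)).mean()
    gy = np.abs(np.diff(albedo_uv, axis=0)).mean()
    return gx + gy
```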
Applications and Comparisons
Applications
The nonlinear 3DMM framework supports various applications, such as 2D face alignment, 3D reconstruction, and face editing. For instance, the model produces realistic face reconstructions even under extreme poses and lighting conditions, demonstrating its robustness.
Qualitative and Quantitative Comparisons
The paper performs extensive evaluations, showcasing the superiority of the nonlinear 3DMM over traditional linear models in terms of expressiveness and representation power. Quantitative analyses, such as NME comparisons on AFLW2000 and Florence datasets, highlight significant improvements in face alignment and reconstruction tasks over existing methods.
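For reference, the NME metric used in these comparisons is the mean landmark error divided by a normalization factor, commonly the square root of the face bounding-box area on AFLW2000-style benchmarks. The normalization choice below is that common convention, stated here as an assumption rather than the paper's exact protocol.

```python
import numpy as np

def nme(pred, gt, norm_factor):
    """Normalized Mean Error: mean Euclidean distance between predicted
    and ground-truth landmarks (N, 2), divided by a normalization
    factor such as sqrt(bounding-box area)."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean() / norm_factor
```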
Figure 4: Shape representation power comparison on Basel scans. Our nonlinear model is able to reconstruct input 3D scans with smaller errors than the linear model (l_S = 160 for both models). The error map shows the normalized per-vertex errors.
Conclusion
The paper establishes a new paradigm for learning 3DMMs, efficiently using in-the-wild face images and deep neural networks to achieve impressive gains in representation power and model fitting. It indicates a promising direction for future research in unsupervised or weakly supervised learning of 3D models from large-scale 2D datasets, potentially expanding applications to further domains outside facial analysis. The results demonstrate the potential of nonlinear models to overcome the limitations of linear methods, especially for tasks involving complex real-world data.