Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency (2007.12494v1)

Published 24 Jul 2020 in cs.CV

Abstract: Recent learning-based approaches, in which models are trained by single-view images have shown promising results for monocular 3D face reconstruction, but they suffer from the ill-posed face pose and depth ambiguity issue. In contrast to previous works that only enforce 2D feature constraints, we propose a self-supervised training architecture by leveraging the multi-view geometry consistency, which provides reliable constraints on face pose and depth estimation. We first propose an occlusion-aware view synthesis method to apply multi-view geometry consistency to self-supervised learning. Then we design three novel loss functions for multi-view consistency, including the pixel consistency loss, the depth consistency loss, and the facial landmark-based epipolar loss. Our method is accurate and robust, especially under large variations of expressions, poses, and illumination conditions. Comprehensive experiments on the face alignment and 3D face reconstruction benchmarks have demonstrated superiority over state-of-the-art methods. Our code and model are released in https://github.com/jiaxiangshang/MGCNet.

Abstract PDF Chat (Pro)

Citations (127)

View on Semantic Scholar

Summary

The paper introduces MGCNet, a self-supervised framework that addresses pose and depth ambiguities in monocular 3D face reconstruction.
It leverages occlusion-aware view synthesis and multi-view consistency loss functions to refine facial alignment and improve reconstruction accuracy.
Experimental results show a more than 12% improvement in normalized mean error and a 17% reduction in RMSE, outperforming state-of-the-art methods.

Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency

The paper presents MGCNet, a self-supervised approach for monocular 3D face reconstruction by exploiting multi-view geometry consistency to address the inherent challenges in monocular approaches, particularly the ambiguity in face pose and depth estimation. Unlike previous methods that rely primarily on 2D features, this paper leverages multi-view constraints to provide more reliable supervision during training.

Key Contributions

Self-Supervised Architecture: The authors introduce MGCNet, an end-to-end self-supervised framework for 3D face reconstruction and alignment. It is designed to address pose and depth ambiguities by employing multi-view geometry consistency. This is achieved through occlusion-aware view synthesis and novel consistency loss functions.
Occlusion-Aware View Synthesis: A significant innovation of the work is the development of a differentiable covisible map that handles self-occlusion, thereby enhancing view synthesis. The map ensures that only pixels visible in both target and source views contribute to the consistency losses.
Novel Loss Functions: The authors design three loss functions for multi-view geometry consistency: pixel consistency loss, depth consistency loss, and facial landmark-based epipolar loss. These losses collectively ensure improved 3DMM parameter consistency across views.
Experimental Superiority: The approach is demonstrated to outperform state-of-the-art methods significantly. For face alignment, MGCNet improves normalized mean error (NME) by more than 12%, and for 3D face reconstruction, it achieves a substantial 17% reduction in root mean squared error (RMSE) on challenging datasets.

Methodology

The paper leverages the 3D Morphable Model (3DMM) to parameterize face shape and texture, integrating these into an end-to-end framework. The authors use a pinhole camera model and spherical harmonics for illumination modeling, crafting an architecture that synthesizes target views and reinforces consistency via multi-view losses.

Key steps include:

Co-visible Map Generation: By projecting covisible triangles, the method effectively identifies visible regions across multiple views to mitigate occlusion challenges.
Multi-View Consistency Losses: These include pixel consistency loss to minimize the error across synthesized target images, depth consistency loss ensuring robust depth alignment, and facial epipolar loss accommodating landmark-based pose errors with essential matrices.

Implications and Future Directions

The work sets a new benchmark for monocular 3D face reconstruction by effectively incorporating multi-view constraints, and establishing that such constraints profoundly mitigate the ambiguities found in monocular estimation tasks. The self-supervised nature of MGCNet is promising for reducing reliance on substantial amounts of annotated data.

Looking ahead, the implications for the broader AI community include the potential adaptation of this multi-view consistency framework to other domains, such as video-based facial analysis and real-time avatar generation. Future research might explore integrating more sophisticated face models or enhancing the robustness of the method under varying environmental conditions or more extreme facial expressions.

In summary, this paper offers a comprehensive approach that advances the capabilities of 3D face reconstruction from single images, addressing long-standing challenges in pose and depth ambiguity via innovative multi-view consistency techniques.