Generating 3D faces using Convolutional Mesh Autoencoders

(arXiv:1807.10267)
Published Jul 26, 2018 in cs.CV

Abstract

Learned 3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower reconstruction error. Our data, model and code are available at http://github.com/anuragranj/coma

Figure: Sampling around the mean face in the mesh autoencoder’s latent space along three different components.

Overview

  • The paper introduces Convolutional Mesh Autoencoder (CoMA), a novel 3D face modeling approach leveraging convolutional neural networks specifically adapted for 3D mesh data to effectively capture extreme deformations and non-linear facial expressions.

  • CoMA utilizes spectral convolutional operations, biased ReLU activations, and hierarchical mesh sampling to create a compact, efficient model that outperforms traditional PCA-based models in terms of reconstruction error and robustness to new expressions.

  • Experimental results demonstrate a significant reduction in reconstruction errors and performance improvements in extrapolation tasks, highlighting CoMA's potential for high-fidelity facial animations and realistic tracking systems in computationally constrained environments.


In the paper titled "Generating 3D faces using Convolutional Mesh Autoencoders," the authors introduce a novel approach to model the highly variable shapes and non-linear expressions of human faces using convolutional neural networks (CNNs) specifically adapted for 3D mesh data. This method, named Convolutional Mesh Autoencoder (CoMA), demonstrates a significant advance over traditional linear models by effectively capturing extreme deformations and non-linear facial expressions with fewer parameters and lower reconstruction errors.

Problem Statement

Traditional 3D face models, such as those based on principal component analysis (PCA) or higher-order tensor representations, suffer from limitations in capturing the non-linear deformations inherent in extreme facial expressions. These models are essential in various computer vision and graphics applications, including face tracking, 3D reconstruction, character generation, and animation. However, due to their linear nature, they fall short of reflecting the nuanced and pronounced variations caused by facial expressions.

Methodology

CoMA leverages spectral convolutional operations on a mesh surface to address these limitations. The authors present novel mesh sampling operations that enable a hierarchical, multi-scale representation, which preserves topological structure during down-sampling and up-sampling processes. The network architecture consists of an encoder and a decoder, with the encoder transforming the 3D face mesh into a low-dimensional latent space and the decoder reconstructing it.
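The down- and up-sampling layers can be viewed as multiplying vertex features by precomputed sparse transform matrices (in the original work these matrices come from quadric-error mesh decimation and barycentric interpolation; the tiny matrices below are toy placeholders for illustration):

```python
import numpy as np
from scipy import sparse

def downsample(x, D):
    """Down-sample vertex features x (n_verts, 3) with a sparse matrix D of shape (m, n)."""
    return D @ x

def upsample(x, U):
    """Up-sample vertex features back to the finer mesh with U of shape (n, m)."""
    return U @ x

# Toy example: a 4-vertex "mesh" reduced to 2 vertices and expanded back.
x = np.arange(12, dtype=float).reshape(4, 3)         # (n=4 vertices, xyz)
D = sparse.csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0],   # coarse vertex 0 averages fine 0, 1
                                [0.0, 0.0, 0.5, 0.5]]))  # coarse vertex 1 averages fine 2, 3
U = sparse.csr_matrix(np.array([[1.0, 0.0],              # fine vertices copy their coarse parent
                                [1.0, 0.0],
                                [0.0, 1.0],
                                [0.0, 1.0]]))

x_coarse = downsample(x, D)      # (2, 3) coarse-mesh features
x_fine = upsample(x_coarse, U)   # (4, 3) approximate reconstruction on the fine mesh
```

Because the matrices are fixed per resolution level, the same vertex-wise associations are preserved across every forward pass, which is what lets the decoder restore the original topology.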

Key architectural choices include:

  • Convolutional Layers: Utilization of fast spectral convolutions approximated by Chebyshev polynomials, which make the convolutions memory efficient and feasible for high-resolution mesh processing.
  • Sampling Operations: Introduction of mesh down-sampling and up-sampling layers that maintain vertex-wise associations, allowing the network to capture both global and local facial features.
  • Non-linear Activation: Application of biased ReLU activations after each convolution, enabling the model to capture non-linearities effectively.
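The Chebyshev convolution above filters vertex features with polynomials of the graph Laplacian, avoiding an explicit eigendecomposition. A minimal single-channel NumPy sketch (the real layers operate on multiple feature channels and add the bias and ReLU afterwards):

```python
import numpy as np

def chebyshev_conv(x, L, theta):
    """Spectral convolution via Chebyshev polynomials of the scaled Laplacian.

    x: (n, f) vertex features; L: (n, n) graph Laplacian;
    theta: (K,) filter coefficients, K >= 2.
    """
    n = L.shape[0]
    lmax = np.linalg.eigvalsh(L).max()       # largest Laplacian eigenvalue
    L_hat = 2.0 * L / lmax - np.eye(n)       # rescale spectrum into [-1, 1]
    T_prev, T_curr = x, L_hat @ x            # T_0(L~)x = x,  T_1(L~)x = L~ x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2.0 * L_hat @ T_curr - T_prev
        out = out + theta[k] * T_curr
    return out

# Laplacian of a 3-vertex path graph and a toy feature column.
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
x = np.array([[1.0], [2.0], [3.0]])
y = chebyshev_conv(x, L, np.array([1.0, 0.0, 0.0]))  # identity filter: returns x
```

Each term in the recurrence needs only a sparse matrix-vector product, which is what makes the convolutions memory-efficient at high mesh resolutions.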

The training dataset consists of 20,466 high-resolution meshes covering a range of extreme facial expressions from 12 subjects, ensuring a diverse set of training examples to enhance the model's generalization capability.

Results

The experimental evaluations reveal that CoMA achieves superior performance compared to state-of-the-art PCA-based models and the FLAME model. Notably:

  • Reconstruction Error: CoMA demonstrates a 50% reduction in reconstruction error compared to PCA while employing 75% fewer parameters.
  • Extrapolation Capability: In extrapolation experiments involving expressions unseen during training, CoMA outperforms PCA and FLAME, indicating its robustness in generalizing to novel expressions.
  • Model Compactness: The hierarchical structure and locally invariant convolutional filters contribute to the compact nature of CoMA, facilitating easier training and deployment.
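The reconstruction-error comparisons above are typically reported as a mean per-vertex Euclidean distance between the reconstructed and ground-truth meshes. A minimal sketch of that metric, assuming both meshes share the same topology:

```python
import numpy as np

def mean_vertex_error(pred, gt):
    """Mean per-vertex Euclidean distance between two meshes.

    pred, gt: (n_verts, 3) arrays of vertex positions with identical topology.
    """
    return np.linalg.norm(pred - gt, axis=1).mean()

# Toy check: every predicted vertex displaced by a 3-4-5 offset.
gt = np.zeros((4, 3))
pred = np.array([[3.0, 4.0, 0.0]] * 4)
err = mean_vertex_error(pred, gt)  # -> 5.0
```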

Implications

Practically, the enhanced ability to model and reconstruct 3D faces with extreme non-linear expressions opens new avenues in applications demanding high-fidelity facial animations and realistic tracking systems. The reduced parameter footprint also implies increased efficiency in real-world deployments where computational resources might be constrained.

Theoretically, CoMA sets a precedent for applying spectral convolutional operations and hierarchical mesh sampling in learning representations of structured non-Euclidean data. This approach could stimulate further research into adapting CNNs for other types of graph-structured data, broadening the applicability of deep learning in 3D modeling tasks.

Future Directions

While the results are promising, the authors note potential improvements with access to larger datasets, as current data limitations may hinder the full potential of CoMA for higher-dimensional latent spaces. Additionally, integrating CoMA with image-based convolutional networks to derive 3D mesh representations directly from 2D images presents a compelling direction for future research.

In conclusion, this paper introduces a methodologically sound and practically impactful model for 3D facial representation, significantly improving upon existing techniques in accuracy and efficiency through innovative use of spectral convolutions and hierarchical mesh processing. By providing both the dataset and code, the authors also contribute valuable resources to the research community, encouraging further advancements in the domain.
