Robust Face Recognition via Multimodal Deep Face Representation

Published 1 Sep 2015 in cs.CV | (1509.00244v1)

Abstract: Face images appeared in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by SAE. All the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, 98.43% verification rate is achieved on the LFW database. Benefited from the complementary information contained in multimodal data, our small ensemble system achieves higher than 99.0% recognition rate on LFW using publicly available training set.

Abstract PDF Upgrade to Chat

Authors (2)

Citations (391)

View on Semantic Scholar

Summary

The paper introduces a multimodal framework that fuses features from multiple CNNs via a stacked auto-encoder to produce robust face representations.
It employs an optimized CNN architecture with adaptive pooling and multi-stage training to handle variations in pose, illumination, and expression.
Evaluations on LFW and CASIA-WebFace databases show over 99% verification accuracy, highlighting the framework's competitive performance in real-world conditions.

Robust Face Recognition via Multimodal Deep Face Representation

The paper "Robust Face Recognition via Multimodal Deep Face Representation" by Changxing Ding and Dacheng Tao addresses the challenges inherent in face recognition tasks within multimedia applications, such as social networks and digital entertainment, where images frequently demonstrate substantial variability in pose, illumination, and expression. Traditional face recognition algorithms are susceptible to performance degradation under such conditions, driving the need for a comprehensive framework capable of improving recognition robustness.

Key Contributions and Methodology

The authors introduce a novel multimodal deep learning framework designed to enhance face recognition accuracy by leveraging complementary data extracted from multiple modalities. The framework comprises an ensemble of convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE) for feature extraction and dimensionality reduction.

Multimodal Feature Extraction: The system utilizes multiple CNNs to extract features from different modalities, including holistic face images, pose-corrected images through 3D modeling, and uniformly sampled face patches. This approach aims to capture diverse information that single modality systems might miss, thereby enriching the representation with complementary features.
Deep CNN Architecture: The proposed CNNs are structurally inspired by existing architectures but include crucial modifications such as replacing larger filters with multiple convolutions using small filters, optimizing ReLU non-linearity applications, and employing adaptive pooling strategies. The CNN design also involves multi-stage training with softmax and triplet loss functions, improving learning capacity and generalization.
Feature Fusion via SAE: The extracted high-dimensional features from multiple CNNs are concatenated and compressed using a stacked auto-encoder. This allows for learning a compact, non-linear transformation to produce a robust face representation. The authors evaluated different non-linearities within the SAE, finding the hyperbolic tangent activation function yielded superior results.
Performance Evaluation: The paper benchmarks the proposed system against state-of-the-art algorithms using the Labeled Faces in the Wild (LFW) database and conducts additional identification experiments on the CASIA-WebFace database. The framework achieves over 99.0% verification accuracy on LFW, asserting its competitive performance against existing models.

Numerical Results and Robustness

The authors report a verification rate of 98.43% using a single CNN on the LFW database and exceed a 99.0% recognition rate with the MM-DFR framework using publicly available training data. This demonstrates the efficacy of the proposed multimodal approach, particularly when JB modeling is employed for supervised learning of identity variations. Notably, these results are achieved without access to private datasets, underscoring the framework's reproducibility and practical applicability.

Theoretical and Practical Implications

The study illustrates that robust face representation benefits significantly from incorporating multimodal data, capable of addressing nonlinear appearance variations often problematic in images subjected to real-world conditions. This work suggests further exploration into the integration of additional multimodal cues and enhancements to single deep architectures like NN2 for improved performance.

Future Directions

The framework presents a robust baseline for future work in multimodal deep learning for face recognition. Potential advancements could include expanding the system to encompass more complex modalities and diversifying the training data to enhance the model's adaptability across a broader spectrum of multimedia applications. The research also invites further investigation into optimizing deep architectures to manage increased complexity without compromising efficiency.

In summary, this paper provides a substantial contribution to the methodologies underpinning face recognition systems. By demonstrating the synergistic effect of multiple data modalities and refining deep learning architectures, it offers a promising roadmap for advancing both the reliability and accuracy of face recognition in challenging environments.

Markdown Report Issue