Learning to Amend Facial Expression Representation via De-albino and Affinity (2103.10189v3)

Published 18 Mar 2021 in cs.CV

Abstract: Facial Expression Recognition (FER) is a classification task that points to face variants. Hence, there are certain affinity features between facial expressions, receiving little attention in the FER literature. Convolution padding, despite helping capture the edge information, causes erosion of the feature map simultaneously. After multi-layer filling convolution, the output feature map named albino feature definitely weakens the representation of the expression. To tackle these challenges, we propose a novel architecture named Amending Representation Module (ARM). ARM is a substitute for the pooling layer. Theoretically, it can be embedded in the back end of any network to deal with the Padding Erosion. ARM efficiently enhances facial expression representation from two different directions: 1) reducing the weight of eroded features to offset the side effect of padding, and 2) decomposing facial features to simplify representation learning. Experiments on public benchmarks prove that our ARM boosts the performance of FER remarkably. The validation accuracies are respectively 90.42% on RAF-DB, 65.2% on Affect-Net, and 58.71% on SFEW, exceeding current state-of-the-art methods. Our implementation and trained models are available at https://github.com/JiaweiShiCV/Amend-Representation-Module.

Summary

The paper introduces an Amending Representation Module (ARM) that mitigates the feature erosion caused by convolution padding, improving ResNet-18's mean accuracy on RAF-DB from 77.63% to 82.77%.
The ARM incorporates a De-albino block that reduces the weight of eroded features, supported by an auxiliary block that rearranges feature maps so that severely eroded pixels sit at the periphery.
The ARM exploits intrinsic expression affinities via a Sharing Affinity block, decomposing features into generic and unique components to optimize FER learning.

Review of "Learning to Amend Facial Expression Representation via De-albino and Affinity"

The paper "Learning to Amend Facial Expression Representation via De-albino and Affinity" introduces a novel approach to improve facial expression recognition (FER) by addressing specific challenges inherent in convolutional neural networks (CNNs). The authors propose an Amending Representation Module (ARM) as a substitute for the pooling layer within CNN architectures, which aims to mitigate feature erosion caused by convolution padding. Moreover, the module leverages intrinsic affinities between facial expressions to bolster representation learning.

Key Contributions

Padding Erosion in CNNs: The paper identifies convolution padding as a source of information distortion that is strongest at the edges of the feature maps. After many padded convolutional layers, the resulting output feature map, termed the "albino feature," represents the expression more weakly, which hurts FER performance. The ARM introduces a De-albino block that reduces the weight of eroded features, offsetting padding's adverse effects while enhancing facial expression representations.
Feature Arrangement: To facilitate the de-albino process, the ARM features an auxiliary block that rearranges feature maps. This block repositions severely eroded pixels to the periphery, utilizing convolutional perception bias to amplify the de-albino effect efficiently.
Affinity-Based Feature Decomposition: The ARM exploits the natural affinity between facial expressions by incorporating a Sharing Affinity (SA) block. This block decomposes facial features into generic and unique components, simplifying representation learning and improving FER accuracy.
Empirical Validation: The ARM model demonstrates superior performance across multiple FER benchmarks. It reaches validation accuracies of 90.42% on RAF-DB, 65.2% on AffectNet, and 58.71% on SFEW, exceeding the methods it is compared against. The results on the small SFEW dataset suggest that the module can learn effective representations from limited data.
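
The padding erosion described in the first contribution can be illustrated with a toy example. The sketch below is not the paper's ARM or De-albino block; it only assumes a constant 7x7 feature map and a fixed 3x3 mean filter with zero padding (both chosen here for illustration) and shows how repeated padded convolutions weaken border activations while the center is untouched.

```python
def conv3x3_mean_zero_pad(fmap):
    """Apply a 3x3 mean filter with zero padding (stride 1, same output size)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    # Positions outside the map are the zero padding:
                    # they add nothing to the sum but still count in the mean.
                    if 0 <= y < h and 0 <= x < w:
                        total += fmap[y][x]
            out[i][j] = total / 9.0
    return out


def erode(size, layers):
    """Start from an all-ones map and stack `layers` padded convolutions."""
    fmap = [[1.0] * size for _ in range(size)]
    for _ in range(layers):
        fmap = conv3x3_mean_zero_pad(fmap)
    return fmap


eroded = erode(size=7, layers=3)
center = eroded[3][3]  # far enough from the border to see no padding
edge = eroded[0][3]    # middle of the top border
corner = eroded[0][0]  # affected by padding along both axes

print(f"center={center:.3f} edge={edge:.3f} corner={corner:.3f}")
```

Because the input is constant, any drop below 1.0 in the output comes purely from the zero padding, and the corner is weakened more than the edge. A weighting scheme that trusts such border positions less is the kind of correction the De-albino block is described as learning.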

Numerical Results

The experiments emphasize realistic settings: the ARM achieves state-of-the-art (SOTA) results on standard benchmark datasets and significantly outperforms baselines such as ResNet-18. On RAF-DB, it raises ResNet-18's mean accuracy from 77.63% to 82.77%. It also improves accuracy on AffectNet, where a minimal random resampling scheme is used to counter class imbalance, bringing eight-category classification accuracy to 61.33%.
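
The review does not spell out the resampling scheme, so the following is only a generic sketch of random oversampling, one common way to rebalance classes. The function name `random_oversample`, the toy labels, and the choice to match the largest class are assumptions made here, not details taken from the paper.

```python
import random
from collections import Counter, defaultdict


def random_oversample(labels, seed=0):
    """Return sample indices in which every class appears as often as the largest one.

    Hypothetical helper: minority classes are padded by drawing their own
    indices at random with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        resampled.extend(idxs)  # keep every original sample once
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))  # pad minorities
    rng.shuffle(resampled)
    return resampled


# Toy, deliberately imbalanced label list (illustrative only).
labels = ["happy"] * 6 + ["fear"] * 2 + ["contempt"] * 1
indices = random_oversample(labels)
counts = Counter(labels[i] for i in indices)
print(counts)
```

In a training loop, `indices` would be used to draw samples for an epoch, so that rare expressions are seen as often as frequent ones.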

Implications and Future Directions

The ARM's development invites broader implications for FER and CNN design. By addressing padding erosion, it highlights the need for re-evaluating core convolutional operations in general image classification tasks. The paper further suggests that exploiting inherent affinities in categorical data can substantially enhance model training dynamics.

Future research could extend ARM's principles to other domains where representation learning suffers from data limitations or intrinsic feature correlations. Moreover, adapting the ARM framework to different CNN architectures could unveil additional performance improvements, fostering more efficient models across machine learning fields.

In conclusion, the ARM represents a significant addition to FER methodologies, providing an effective strategy to circumvent convolutional layer pitfalls while harnessing expression affinities. Its practical applications promise enhanced human-computer interaction systems capable of nuanced emotion understanding, a pivotal aspect of AI-driven user experiences.