Autoencoding beyond pixels using a learned similarity metric
(arXiv:1512.09300)

Abstract
We present an autoencoder that leverages learned representations to better measure similarities in data space. By combining a variational autoencoder with a generative adversarial network, we can use learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective. We thereby replace element-wise errors with feature-wise errors, which better capture the data distribution while offering invariance to, e.g., translation. We apply our method to images of faces and show that it outperforms VAEs with element-wise similarity measures in terms of visual fidelity. Moreover, we show that the method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.
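The central mechanism can be illustrated with a short sketch. The code below is not the authors' released implementation; it is a minimal illustration, assuming PyTorch, of a VAE encoder/decoder paired with a GAN discriminator whose intermediate activations define the similarity metric: the reconstruction error is measured between discriminator features of the input and of the reconstruction rather than between pixels. All module names and layer sizes are illustrative placeholders, not the paper's exact architecture.

# Minimal sketch (not the authors' code): feature-wise VAE reconstruction loss
# computed in the feature space of a GAN discriminator. Assumes PyTorch;
# network sizes and layer choices are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(512, latent_dim)   # log-variance of q(z|x)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, 64 * 64 * 3), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

class Discriminator(nn.Module):
    """GAN discriminator whose intermediate features define the learned similarity metric."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 512), nn.ReLU())
        self.classifier = nn.Linear(512, 1)        # real/fake logit

    def forward(self, x):
        f = self.features(x)
        return self.classifier(f), f

def vae_gan_losses(x, enc, dec, dis):
    """Return the KL term and the feature-wise reconstruction term."""
    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
    x_tilde = dec(z)
    _, feat_real = dis(x)
    _, feat_rec = dis(x_tilde)
    # Feature-wise error replaces the usual pixel-wise (element-wise) error.
    rec_loss = F.mse_loss(feat_rec, feat_real)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return kl, rec_loss

if __name__ == "__main__":
    enc, dec, dis = Encoder(), Decoder(), Discriminator()
    x = torch.rand(4, 3, 64, 64)        # dummy batch of 64x64 RGB images
    kl, rec = vae_gan_losses(x, enc, dec, dis)
    print(f"KL: {kl.item():.4f}  feature-wise reconstruction: {rec.item():.4f}")

In a full training loop these two terms would be combined with the usual GAN adversarial losses for the decoder (as generator) and the discriminator; the sketch above isolates only the learned-similarity reconstruction objective that distinguishes this approach from a plain VAE.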