Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance

Published 18 Jun 2018 in cs.CV | (1806.06503v1)

Abstract: In this work we introduce Deforming Autoencoders, a generative model for images that disentangles shape from appearance in an unsupervised manner. As in the deformable template paradigm, shape is represented as a deformation between a canonical coordinate system ('template') and an observed image, while appearance is modeled in 'canonical', template, coordinates, thus discarding variability due to deformations. We introduce novel techniques that allow this approach to be deployed in the setting of autoencoders and show that this method can be used for unsupervised group-wise image alignment. We show experiments with expression morphing in humans, hands, and digits, face manipulation, such as shape and appearance interpolation, as well as unsupervised landmark localization. A more powerful form of unsupervised disentangling becomes possible in template coordinates, allowing us to successfully decompose face images into shading and albedo, and further manipulate face images.


Citations (198)


Summary

The paper introduces the Deforming Autoencoder (DAE), a model that disentangles shape deformations from appearance using a spatial transformation layer and a differential decoder.
It achieves competitive unsupervised landmark localization and image manipulation across faces, hands, and digits.
The method reduces reliance on manual annotations, offering enhanced explainability and controllability in generative image tasks.

Evaluation of "Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance"

The paper "Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance" presents an approach for disentangling shape and appearance in image data, termed Deforming Autoencoders (DAE). The work advances unsupervised image analysis by taking the deformable template paradigm, long used in computer vision, and integrating it into an unsupervised deep learning framework.

The core contribution is a DAE architecture that learns, without supervision, to separate an image's deformation from its appearance. The model synthesizes appearance (often called texture) in a canonical coordinate system, so that variability due to deformation is factored out of the texture. A spatial transformation layer then warps the canonical texture to align with the observed image, and a compact latent vector for texture encourages shape variability to be captured by the deformation branch.
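The warping step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_texture`, the pixel-unit sampling grid, and the choice of bilinear interpolation with border clamping are all assumptions made for illustration.

```python
import numpy as np

def warp_texture(texture, grid_x, grid_y):
    """Bilinearly sample a canonical texture (H, W) at the given coordinates.

    grid_x and grid_y have the shape of the output image; for every output
    pixel they hold the (x, y) location, in pixel units, to read from the
    canonical texture. (Hypothetical helper, for illustration only.)
    """
    h, w = texture.shape
    # Clamp sampling locations to the valid texture area.
    x = np.clip(grid_x, 0, w - 1)
    y = np.clip(grid_y, 0, h - 1)
    # Integer corners surrounding each sampling location.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets act as interpolation weights.
    wx = x - x0
    wy = y - y0
    top = (1 - wx) * texture[y0, x0] + wx * texture[y0, x1]
    bottom = (1 - wx) * texture[y1, x0] + wx * texture[y1, x1]
    return (1 - wy) * top + wy * bottom

texture = np.arange(16, dtype=float).reshape(4, 4)
ys, xs = np.meshgrid(np.arange(4.0), np.arange(4.0), indexing="ij")
identity = warp_texture(texture, xs, ys)        # identity grid: texture unchanged
shifted = warp_texture(texture, xs + 0.5, ys)   # read half a pixel to the right
```

In a trainable model the grid would come from the deformation branch and the sampling would need to be differentiable; the sketch only shows how a deformation field and a canonical texture combine into an output image.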

The authors refine this framework with several techniques. A notable one is the differential decoder, which predicts spatial gradients of the deformation field rather than the field itself, encouraging the warp to be smooth and invertible. This addresses a common failure mode, non-diffeomorphic (folding) transformations, and makes unsupervised learning of deformation models more reliable.
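One way to see why predicting gradients helps is the following NumPy sketch. It assumes the raw decoder outputs are made non-negative with a ReLU and then integrated with a cumulative sum; the function name `integrate_warp` and the choice to start each coordinate at zero are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def integrate_warp(raw_dx, raw_dy):
    """Turn unconstrained per-pixel increments into a non-folding sampling grid.

    raw_dx, raw_dy: (H, W) arrays of unconstrained values (stand-ins for
    decoder outputs). Returns grid_x, grid_y of the same shape.
    """
    # ReLU keeps every increment non-negative.
    dx = np.maximum(raw_dx, 0.0)
    dy = np.maximum(raw_dy, 0.0)
    # Cumulative sums integrate the increments along each axis; subtracting
    # the first increment makes every row/column start at coordinate 0.
    grid_x = np.cumsum(dx, axis=1) - dx[:, :1]
    grid_y = np.cumsum(dy, axis=0) - dy[:1, :]
    return grid_x, grid_y

rng = np.random.default_rng(0)
raw_dx = rng.normal(size=(4, 5))
raw_dy = rng.normal(size=(4, 5))
grid_x, grid_y = integrate_warp(raw_dx, raw_dy)

# Unit increments reproduce the identity grid 0, 1, 2, ...
id_x, id_y = integrate_warp(np.ones((3, 3)), np.ones((3, 3)))
```

Because each coordinate is a running sum of non-negative increments, x never decreases from left to right and y never decreases from top to bottom, so the grid cannot fold over itself. Strict invertibility would additionally require strictly positive increments.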

Experimentally, the method is evaluated on a range of tasks: expression morphing for human faces, hands, and digits; face manipulation such as shape and appearance interpolation; and unsupervised landmark localization. Particularly noteworthy is the model's ability to further decompose appearance into intrinsic components, shading and albedo, which demonstrates its utility for more complex image decomposition. Quantitatively, the unsupervised landmark localization accuracy is competitive and is reported to outperform prior self-supervised correspondence estimation methods.
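The shading and albedo decomposition can be pictured with a small NumPy sketch. It assumes the multiplicative image formation model common in intrinsic image decomposition (texture = albedo × shading, per pixel); this summary does not state how the paper combines the two components, and `compose_texture` is a hypothetical helper.

```python
import numpy as np

def compose_texture(albedo, shading):
    """Recombine intrinsic components into a texture.

    albedo:  (H, W, 3) reflectance colors in [0, 1]
    shading: (H, W) scalar illumination per pixel
    Assumes texture = albedo * shading (an illustrative assumption).
    """
    return albedo * shading[..., None]

albedo = np.full((2, 2, 3), 0.8)
shading = np.array([[1.0, 0.5], [0.25, 0.0]])
texture = compose_texture(albedo, shading)

# Editing one factor while holding the other fixed, e.g. flat lighting:
relit = compose_texture(albedo, np.ones_like(shading))
```

Keeping the two factors separate is what makes manipulations such as relighting possible: the shading map can be changed while the albedo, and therefore the surface colors, stays fixed.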

The implications of this research are significant for fields requiring advanced image manipulation and interpretation. By extracting these distinct components from images without relying on predetermined labels or annotations, the DAE framework can substantially reduce data preparation costs while improving model explainability. Furthermore, the disentangled representation offers finer control over generative processes, which is beneficial for applications in graphics, augmented reality, and potentially fields such as computational biology, where image data is plentiful but often lacks detailed annotations.

While the study successfully demonstrates the application and benefits of deforming autoencoders, opportunities for future exploration remain. Expanding upon these preliminary yet promising benchmarks, subsequent development might consider broader datasets and varied real-world conditions to determine model robustness across different domains. Additionally, extending the disentangled representation to include three-dimensional information could further tighten the alignment between model-driven inferences and physical reality, potentially leading to breakthroughs in 3D reconstruction and related applications.

Overall, "Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance" provides a compelling framework and a set of techniques for unsupervised learning, opening avenues for more detailed and scalable image analysis.
