Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Published 8 Nov 2019 in stat.ML and cs.LG | (1911.03393v1)

Abstract: Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfillment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image-language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.




Summary

The paper introduces a novel MMVAE framework that leverages a mixture-of-experts strategy to combine unimodal variational posteriors and overcome modality dominance.
It applies an IWAE estimator to tighten the marginal likelihood bound, leading to higher entropy variational posteriors and improved learning across modalities.
Experiments on MNIST-SVHN and CUB datasets demonstrate that MMVAE achieves superior joint and cross-modal generation coherence compared to baseline models.

The paper presents a novel approach to multi-modal generative modeling, focusing on learning complex data distributions across different modalities such as image and text. The authors introduce the Mixture-of-Experts Multimodal Variational Autoencoder (MMVAE), which aims to overcome the limitations observed in existing methods by effectively learning a joint distribution over different modalities. The MMVAE framework addresses four critical criteria for multi-modal learning: latent factorization, coherent joint generation, coherent cross-generation, and improved model learning for individual modalities.
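The two ways of combining unimodal posteriors that this summary contrasts can be written compactly. The notation below is an assumption made for illustration and is not quoted from this page: $M$ modalities $x_1, \dots, x_M$, a shared latent $z$, unimodal encoders $q_{\phi_m}(z \mid x_m)$, and uniform mixture weights $1/M$.

```latex
% Mixture of experts (MMVAE): an average of the unimodal posteriors
q_\Phi(z \mid x_{1:M}) = \sum_{m=1}^{M} \frac{1}{M}\, q_{\phi_m}(z \mid x_m)

% Product of experts (as in MVAE): a normalised product with the prior
q_\Phi(z \mid x_{1:M}) \propto p(z) \prod_{m=1}^{M} q_{\phi_m}(z \mid x_m)
```

In a product, one sharply peaked expert can dominate the result, whereas in the mixture every expert keeps the same weight regardless of how confident it is.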

Key Contributions

Mixture-of-Experts Framework: The authors propose using a mixture of experts (MoE) strategy to model the joint variational posterior. This approach combines unimodal variational posteriors, ensuring that the model can leverage the strengths of each modality while addressing potential modality dominance, a common issue with product of experts (PoE) models.
Flexible Inference Methodology: By utilizing the importance weighted autoencoder (IWAE) estimator, the framework provides tighter bounds on the marginal likelihood, leading to higher entropy in the variational posteriors. This characteristic is beneficial for multi-modal learning, encouraging each modality to contribute meaningfully to the joint representation.
Addressing Coherence in Generations: MMVAE explicitly targets coherence in both joint and cross-generation. The paper proposes quantitative coherence measures, including digit classification accuracy on MNIST-SVHN data and canonical correlation analysis (CCA) on the CUB image-caption dataset.
Experimentation and Results: The paper demonstrates the effectiveness of MMVAE through experiments on MNIST-SVHN (an image-image setup) and CUB (an image-caption setup). Qualitative and quantitative analyses show that MMVAE outperforms previous models such as MVAE in both latent factorization and generative coherence.
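The first two contributions above can be illustrated with a minimal NumPy sketch of an importance-weighted bound under a mixture-of-experts posterior. It is not the authors' implementation: the toy linear-Gaussian model (z ~ N(0, 1), x_m | z ~ N(z, lik_std_m^2)), the function names, and all numeric values are assumptions chosen so that the exact log marginal likelihood is available for comparison. K samples are drawn from each expert in turn (stratified sampling), and each sample is weighted against the full mixture density.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def log_mean_exp(a, axis):
    """Numerically stable log(mean(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def moe_iwae_bound(x, lik_std, q_mean, q_std, K, rng):
    """IWAE-style bound with a uniform mixture-of-experts posterior.

    For each expert m, draw K samples from q_m, weight them by
    p(z, x_1..x_M) / q_mix(z), and average the per-expert bounds.
    """
    M = len(x)
    per_expert = []
    for m in range(M):
        z = rng.normal(q_mean[m], q_std[m], size=K)
        # Density of the full mixture at each sample: log (1/M) sum_j q_j(z).
        log_q_each = np.stack([log_normal(z, q_mean[j], q_std[j]) for j in range(M)])
        log_q_mix = log_mean_exp(log_q_each, axis=0)
        # Joint density of the toy model: prior times both likelihoods.
        log_joint = log_normal(z, 0.0, 1.0)
        for j in range(M):
            log_joint = log_joint + log_normal(x[j], z, lik_std[j])
        per_expert.append(log_mean_exp(log_joint - log_q_mix, axis=0))
    return float(np.mean(per_expert))

# Toy usage: two "modalities" observing the same latent with noise.
x = [1.0, 1.2]
lik_std = [0.5, 0.5]
# Each expert is the exact single-modality posterior p(z | x_m) of the toy model.
q_mean = [0.8, 0.96]
q_std = [5 ** -0.5, 5 ** -0.5]

rng = np.random.default_rng(0)
bound = moe_iwae_bound(x, lik_std, q_mean, q_std, K=5000, rng=rng)

# Exact log p(x_1, x_2): x is jointly Gaussian with covariance 11^T + diag(lik_std^2).
cov = np.ones((2, 2)) + np.diag(np.square(lik_std))
xv = np.array(x)
true_log_px = -0.5 * (xv @ np.linalg.solve(cov, xv) + np.linalg.slogdet(2 * np.pi * cov)[1])
print(bound, true_log_px)
```

With a large K the estimate should sit close to the exact log marginal; with K = 1 the same function reduces to a standard ELBO-style estimate, which is looser.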

Numerical Results and Claims

MMVAE outperforms the baseline models on joint and cross-modal generation coherence. On MNIST-SVHN, digit-matching accuracy between generated MNIST and SVHN digits was reported as 42.1% for joint generation and considerably higher for cross-generation (86.4% for MNIST to SVHN, 69.1% for SVHN to MNIST).
In contrast, the MVAE baseline yielded digit-matching accuracies close to chance (about 10% for ten digit classes), indicating that it fails to maintain cross-modality coherence.
On the CUB dataset, the average correlation of generated image-caption pairs was reported as 0.263, close to the correlation observed in true data pairs (0.273).
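The digit-matching accuracies above can be read as two simple agreement rates. The sketch below is an assumption about how such scores are computed, not the paper's evaluation code: it presumes that pre-trained digit classifiers have already produced integer predictions for the generated images, and the function names are hypothetical.

```python
import numpy as np

def joint_coherence(pred_mnist, pred_svhn):
    """Fraction of jointly generated pairs whose two images get the same digit."""
    pred_mnist = np.asarray(pred_mnist)
    pred_svhn = np.asarray(pred_svhn)
    return float(np.mean(pred_mnist == pred_svhn))

def cross_coherence(pred_generated, source_labels):
    """Fraction of cross-generated images classified as the source image's digit."""
    pred_generated = np.asarray(pred_generated)
    source_labels = np.asarray(source_labels)
    return float(np.mean(pred_generated == source_labels))

# Toy usage with made-up classifier outputs (not data from the paper).
joint_score = joint_coherence([3, 1, 4, 1], [3, 1, 5, 1])
cross_score = cross_coherence([7, 2, 2, 9], [7, 2, 3, 9])
print(joint_score, cross_score)
```

With ten digit classes, two unrelated generators would agree about one time in ten, which is the chance level the baseline is compared against.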

Implications and Future Directions

The work makes a case for MoE structures in multi-modal generative models and emphasizes the importance of coherence across modalities. Its treatment of the variational posterior suggests broader applicability to data spanning other modality pairs, such as video-audio and text-graph.

Future research could extend MMVAE to further modality combinations, such as three-dimensional point clouds and natural language. Exploring non-uniform expert contributions, potentially guided by attention mechanisms, could also improve model robustness.

In conclusion, MMVAE provides a systematically evaluated framework for generative modeling across modalities, laying groundwork for both theoretical work and practical applications in multi-modal machine learning.
