
Handling Incomplete Heterogeneous Data using VAEs (1807.03653v4)

Published 10 Jul 2018 in cs.LG, cs.AI, and stat.ML

Abstract: Variational autoencoders (VAEs), as well as other generative models, have been shown to be efficient and accurate for capturing the latent structure of vast amounts of complex high-dimensional data. However, existing VAEs still cannot directly handle data that are heterogeneous (mixed continuous and discrete) or incomplete (with missing data at random), which is indeed common in real-world applications. In this paper, we propose a general framework to design VAEs suitable for fitting incomplete heterogeneous data. The proposed HI-VAE includes likelihood models for real-valued, positive real-valued, interval, categorical, ordinal and count data, and allows accurate estimation (and potentially imputation) of missing data. Furthermore, HI-VAE presents competitive predictive performance in supervised tasks, outperforming supervised models when trained on incomplete data.

Authors (4)
  1. Pablo M. Olmos (45 papers)
  2. Zoubin Ghahramani (108 papers)
  3. Isabel Valera (46 papers)
  4. Alfredo Nazabal (7 papers)
Citations (308)

Summary

  • The paper introduces a novel HI-VAE that generalizes VAEs to effectively handle incomplete, heterogeneous datasets.
  • It details a factorized recognition model and tailored DNN configurations to accurately process continuous, discrete, and categorical data types.
  • Empirical results demonstrate that HI-VAE outperforms traditional imputation methods, maintaining strong classification performance even with missing data.

Handling Incomplete Heterogeneous Data using VAEs: A Summary

This paper introduces an innovative approach for applying Variational Autoencoders (VAEs) to the challenge of incomplete and heterogeneous datasets. Typically, VAEs have been utilized to model homogeneous and complete datasets efficiently, but these constraints limit their application to real-world datasets characterized by missing entries and mixed data types. The authors propose a framework, HI-VAE, that generalizes VAEs to handle these scenarios, showing compelling performance over alternative methods.

Theoretical and Methodological Contributions

The authors extend the standard VAE architecture to address the issues of incomplete and mixed-type data. They introduce a generative model that can process various data types, such as continuous, discrete, categorical, and ordinal data. This flexibility allows HI-VAE to model datasets with significant variability in attribute characteristics.
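One way to picture this flexibility is as a set of per-type likelihood "heads" sitting on top of a shared decoder representation. The sketch below is illustrative only; the function names, shapes, and parameterizations are assumptions for exposition, not the paper's exact architecture.

```python
import numpy as np

# Illustrative per-type likelihood heads: each maps a shared decoder
# representation h to the parameters of an attribute-specific likelihood.
# All names and shapes here are assumptions, not the paper's exact API.

def real_head(h, W_mu, W_logvar):
    # Real-valued attribute: Gaussian mean and log-variance.
    return {"mu": h @ W_mu, "logvar": h @ W_logvar}

def positive_head(h, W_mu, W_logvar):
    # Positive real attribute: log-normal (a Gaussian on log x).
    return {"log_mu": h @ W_mu, "log_var": h @ W_logvar}

def count_head(h, W):
    # Count attribute: Poisson rate, kept positive via softplus.
    return {"rate": np.log1p(np.exp(h @ W))}

def categorical_head(h, W):
    # Categorical attribute: class probabilities via softmax.
    logits = h @ W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return {"probs": e / e.sum(axis=-1, keepdims=True)}
```

Each head contributes its own log-likelihood term to the ELBO, which is how a single latent code can explain attributes of very different types.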

Key contributions of the methodology include:

  1. Handling Incomplete Data: The HI-VAE framework implements a factorized distribution that separately handles observed and missing data points. This decoupling is achieved through a newly designed recognition model that utilizes input dropout strategies, enabling the inference of missing values by focusing only on observed attributes.
  2. Generalization to Heterogeneous Data: The framework incorporates different likelihood models for various data types by employing a distinct Deep Neural Network (DNN) configuration per attribute type. This approach ensures the accurate modeling of real, positive real, count, categorical, and ordinal data.
  3. Latent Variable Structure: To better capture the latent structure of the datasets, a Gaussian mixture model is proposed as an alternative to the standard Gaussian prior in VAEs. This enhancement prevents the Kullback-Leibler divergence from overwhelming the ELBO during training and enables richer posterior distributions.
  4. Model Normalization: To address the disparity in data value ranges across attributes, the authors introduce batch normalization and denormalization layers to stabilize and expedite training convergence, preserving numerical robustness in the process.

Empirical Evaluation

The paper evaluates HI-VAE on several real-world datasets from the UCI repository, comparing its imputation performance against contemporary baselines including Mean Imputation, Multiple Imputation by Chained Equations (MICE), the General Latent Feature Model (GLFM), and Generative Adversarial Imputation Nets (GAIN). HI-VAE consistently exhibits superior performance in imputing missing data, particularly for nominal attributes where statistical dependencies between variables are prevalent.
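Imputation quality in this setting is typically scored with a normalized error for numerical attributes and an error rate for nominal ones. The two metrics below are a sketch of that convention (assumed for illustration; consult the paper for its exact definitions).

```python
import numpy as np

# Two imputation-error metrics commonly used for mixed-type data
# (illustrative definitions, not necessarily the paper's exact ones).

def nrmse(true_vals, imputed_vals):
    # RMSE normalized by the range of the true values, so attributes
    # with different scales are comparable.
    rmse = np.sqrt(np.mean((true_vals - imputed_vals) ** 2))
    rng = true_vals.max() - true_vals.min()
    return rmse / rng if rng > 0 else rmse

def nominal_error_rate(true_labels, imputed_labels):
    # Fraction of imputed categorical entries that disagree with the truth.
    return float(np.mean(true_labels != imputed_labels))
```

Reporting both kinds of error separately is what allows a single model to be compared fairly against baselines that specialize in either numerical or categorical imputation.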

In classification tasks, HI-VAE demonstrates competitive performance compared to deep supervised models, even without requiring data imputation for missing values. Its ability to infer and use all available information efficiently results in less performance degradation when the dataset includes missing entries.

Implications and Future Directions

The paper makes a significant contribution to handling incomplete heterogeneous data by proposing an adaptable generative model that preserves the intrinsic statistical relationships among variables. Beyond missing-data imputation, the HI-VAE framework's ability to model complex, diverse datasets holds promise for uncovering latent patterns in noisy, incomplete environmental or health data.

Future research could delve into expanding this framework to incorporate sequential data, such as time series, and further explore its applicability in dynamic domains like finance or real-time monitoring systems. Additionally, integration with existing efforts in implicit generative models like GANs could extend its capabilities and performance further, particularly in capturing intricate latent structures.

Overall, the HI-VAE framework stands as a robust tool for data scientists and researchers, addressing prevalent challenges in data preprocessing and analysis for mixed-domain datasets.
