- The paper introduces stochastic trace estimators based on Chebyshev and Lanczos approximations, along with a surrogate model, to approximate log determinants and their derivatives at substantially reduced computational cost.
- The paper demonstrates faster and more accurate hyperparameter recovery in GP kernel learning across diverse real-world applications.
- The paper highlights the potential for scalable inference and suggests future research in GPU acceleration and deep kernel learning integration.
Scalable Log Determinants for Gaussian Process Kernel Learning
The paper "Scalable Log Determinants for Gaussian Process Kernel Learning" presents a series of novel and efficient methods for approximating the log determinant of large positive definite matrices, specifically in the context of Gaussian Process (GP) kernel learning. This task, essential for many machine learning applications such as Bayesian neural networks and graphical models, typically incurs prohibitive computational costs, scaling as O(n3) for matrices of size n×n. The authors propose a suite of stochastic approaches, leveraging fast matrix-vector multiplications (MVMs), that reduce the complexity to O(n).
Methodology
The core innovation of the paper lies in the use of stochastic trace estimators in conjunction with Chebyshev approximations, Lanczos quadrature, and surrogate models. These techniques enable efficient computation of both log determinants and their derivatives, which are fundamental to GP kernel learning via marginal likelihood optimization.
- Chebyshev and Lanczos Approximations: Chebyshev expands the matrix logarithm in a fixed polynomial basis, while Lanczos builds a quadrature rule adapted to the spectrum; both require only MVMs and converge rapidly even in challenging scenarios. Lanczos generally outperforms Chebyshev because it handles the rapidly decaying eigenvalues characteristic of kernel matrices more effectively (see the sketch following this list).
- Surrogate Models: These offer an efficient alternative by precomputing the log determinant at selected hyperparameter settings and interpolating with radial basis functions across the rest of hyperparameter space (a brief interpolation sketch also follows the list).
- Kernel Approximations: Emphasis is placed on flexible kernel learning scenarios, including cases where fast MVMs are feasible but direct eigenvalue computations are inefficient. The authors advocate using the structured kernel interpolation (SKI) framework to generalize kernel approximation methods beyond grid-structured data.
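To make the Lanczos route concrete, below is a minimal stochastic Lanczos quadrature sketch in NumPy. It is not the authors' implementation: the function names, probe count, and number of Lanczos steps are illustrative assumptions, and the kernel matrix is accessed only through a user-supplied matvec, mirroring the paper's MVM-based setting.

```python
import numpy as np

def lanczos_tridiag(matvec, z, num_steps):
    """Run `num_steps` Lanczos iterations from probe z (no reorthogonalization,
    omitted for brevity). Returns the small tridiagonal matrix T whose
    eigendecomposition defines a Gauss quadrature rule for z^T f(K) z."""
    n = z.shape[0]
    alphas, betas = [], []
    q_prev = np.zeros(n)
    q = z / np.linalg.norm(z)
    beta = 0.0
    for _ in range(num_steps):
        w = matvec(q) - beta * q_prev
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-12:                      # invariant subspace found; stop early
            break
        q_prev, q = q, w / beta
    k = len(alphas)
    return np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)

def slq_logdet(matvec, n, num_probes=30, num_steps=25, seed=0):
    """Estimate log|K| = tr(log K) using only matrix-vector products with K."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe, E[z z^T] = I
        T = lanczos_tridiag(matvec, z, num_steps)
        evals, evecs = np.linalg.eigh(T)
        # Gauss quadrature: z^T log(K) z  ~  ||z||^2 * e_1^T log(T) e_1
        total += n * (evecs[0, :] ** 2) @ np.log(evals)
    return total / num_probes

# Example: compare against the exact value on a small RBF kernel matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists) + 0.1 * np.eye(500)   # jitter keeps K positive definite
print(slq_logdet(lambda v: K @ v, n=500), np.linalg.slogdet(K)[1])
```

The surrogate approach admits an equally brief sketch. The snippet below is an illustrative assumption about how such a surrogate might be assembled with SciPy's RBFInterpolator, not the paper's code: precompute (approximate) log determinants at a handful of hyperparameter settings, then interpolate with radial basis functions everywhere else.

```python
from scipy.interpolate import RBFInterpolator

def build_logdet_surrogate(hyper_points, logdet_values):
    """RBF surrogate over hyperparameter space.

    hyper_points: (p, d) array of hyperparameter settings.
    logdet_values: (p,) precomputed log|K(theta)| estimates (e.g. from slq_logdet).
    Returns a callable that maps new (q, d) hyperparameter settings to estimates."""
    return RBFInterpolator(hyper_points, logdet_values)
```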
Experimental Results
The proposed methods demonstrate remarkable scalability and accuracy across a diverse set of experiments involving large datasets. Notably, the Lanczos approach and the surrogate model excel, providing faster and more precise hyperparameter recovery than traditional methods such as the scaled eigenvalue technique or FITC. Significant improvements are observed in applications like sound modeling, crime prediction, and precipitation forecasting, where the number of inducing points can be vastly increased without impacting computational feasibility.
Implications and Future Directions
This paper underscores the potential of stochastic MVM-based methods in addressing computational bottlenecks in GP kernel learning. The practical implications are significant, suggesting that these techniques could be widely adopted in real-world applications requiring large-scale inference and learning. Future research may explore applications beyond GPs, enhancing tasks that similarly benefit from log determinant computations, such as fast posterior sampling and diagonal estimation.
Further development may focus on integrating these methods with GPU acceleration, maximizing efficiency in computational environments conducive to parallel processing. Exploration of multi-layer architectures, such as deep kernel learning, might also benefit significantly from these scalable techniques, potentially yielding breakthroughs in high-dimensional data analysis and neural network integration with probabilistic models.
In conclusion, the paper offers compelling evidence for the viability of scalable, stochastic approximations in Gaussian process kernel learning, inviting substantial advancements in both theoretical understanding and practical machine learning capabilities.