Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates (1305.5029v2)

Published 22 May 2013 in math.ST, cs.LG, stat.ML, and stat.TH

Abstract: We establish optimal convergence rates for a decomposition-based scalable approach to kernel ridge regression. The method is simple to describe: it randomly partitions a dataset of size N into m subsets of equal size, computes an independent kernel ridge regression estimator for each subset, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all N samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples. As concrete examples, our theory guarantees that the number of processors m may grow nearly linearly for finite-rank kernels and Gaussian kernels and polynomially in N for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach.

Summary

The paper introduces a divide-and-conquer strategy that averages local kernel ridge regression (KRR) estimators while retaining minimax optimal convergence rates.
The method randomly partitions the dataset, fits an independent KRR estimator on each subset using the regularization parameter appropriate for the full sample size, and averages the local estimators.
Experiments on simulated data and a music-prediction task indicate that the distributed algorithm attains near-optimal rates with substantial computational savings on large datasets.

The paper presents a novel approach to Kernel Ridge Regression (KRR) that leverages a divide-and-conquer strategy to achieve computational efficiency without sacrificing statistical accuracy. This is particularly relevant for large-scale data problems, where the computational cost of traditional KRR methods can be prohibitive.

The core contribution of the paper is a distributed algorithm that partitions a large dataset into multiple smaller subsets, performs KRR on each subset independently, and then combines the results to form a global predictor. The proposed method ensures that even though each subproblem is solved independently with fewer data points, the aggregated solution retains the minimax optimal convergence rates of the full data KRR. This is contingent on the number of partitions being appropriately bounded relative to the sample size and the complexity of the underlying function space.
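In symbols, the estimator described above can be written as follows. This is a sketch of the standard KRR objective applied per subset; the notation ($S_i$ for the $i$-th subset, $\mathcal{H}$ for the reproducing kernel Hilbert space, $\lambda$ for the regularization parameter, $\hat f_i$ and $\bar f$ for the local and averaged estimators) is chosen here for illustration and may differ from the paper's.

$$
\hat f_i = \arg\min_{f \in \mathcal{H}} \; \frac{1}{|S_i|} \sum_{(x, y) \in S_i} \bigl(f(x) - y\bigr)^2 + \lambda \, \|f\|_{\mathcal{H}}^2,
\qquad
\bar f = \frac{1}{m} \sum_{i=1}^{m} \hat f_i .
$$

The same $\lambda$ is used in every subproblem, and it is the value one would pick for the full dataset rather than for a subset of size $|S_i|$.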

Methodology

The distributed algorithm is both conceptually simple and scalable:

Partitioning: The dataset of size $n$ (written $N$ in the abstract) is randomly divided into $m$ subsets of equal size. The number of partitions $m$ controls the trade-off between statistical accuracy and computational efficiency.
Local Estimation: Each subset is used to compute an independent KRR estimator. Crucially, the regularization parameter for each local KRR is chosen as though the full sample size $n$ were available, not the smaller subset size $n/m$. Each local estimator is therefore deliberately under-regularized relative to its own sample size: its bias stays small, at the price of higher variance.
Aggregation: The final predictor is the average of the $m$ local estimators. Averaging reduces the inflated variance of the local estimators while preserving their low bias.
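The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth, the synthetic data, the choice `lam = 1 / n`, and all function names (`gaussian_kernel`, `fit_local_krr`, `fit_dc_krr`, `predict_dc_krr`, `demo`) are hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def fit_local_krr(X, y, lam, bandwidth):
    # Minimizing (1/n_i) * sum (f(x) - y)^2 + lam * ||f||^2 over the RKHS
    # gives coefficients alpha = (K + n_i * lam * I)^{-1} y.
    n_i = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + n_i * lam * np.eye(n_i), y)
    return X, alpha

def fit_dc_krr(X, y, m, lam, bandwidth=1.0, seed=0):
    # Step 1: random partition into m (near-)equal subsets.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), m)
    # Step 2: independent local fits, all sharing the same lam.
    return [fit_local_krr(X[idx], y[idx], lam, bandwidth) for idx in parts]

def predict_dc_krr(models, X_new, bandwidth=1.0):
    # Step 3: average the local predictions.
    preds = [gaussian_kernel(X_new, Xi, bandwidth) @ a for Xi, a in models]
    return np.mean(preds, axis=0)

def demo(n=600, m=6, bandwidth=0.2, seed=0):
    # Synthetic 1-D regression problem (an assumption for illustration).
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)
    # lam is set from the FULL sample size n, not the subset size n / m.
    lam = 1.0 / n
    models = fit_dc_krr(X, y, m, lam, bandwidth, seed)
    X_test = np.linspace(0.0, 1.0, 200)[:, None]
    preds = predict_dc_krr(models, X_test, bandwidth)
    mse = float(np.mean((preds - np.sin(2 * np.pi * X_test[:, 0])) ** 2))
    return mse, preds
```

Calling `demo()` returns the mean squared error against the noiseless target together with the averaged predictions. In a real deployment each call to `fit_local_krr` would run on a separate machine; the list comprehension here only stands in for that parallel step.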

Theoretical Results

The paper offers rigorous theoretical guarantees demonstrating that the proposed method achieves the minimax rate of convergence for several classes of kernels, including finite-rank, Gaussian, and Sobolev kernels. Specifically, it shows that:

For finite-rank kernels, the algorithm achieves the optimal rate as long as the number of partitions $m$ grows no faster than nearly linearly in $n$.
For Gaussian kernels, whose eigenvalues decay exponentially, $m$ may likewise grow nearly linearly in $n$.
For Sobolev spaces, whose kernel eigenvalues decay polynomially, the optimal rate is maintained when $m$ grows polynomially in $n$.

The theoretical analysis reveals an interesting interplay between computation and statistics, showing that with appropriate regularization, parallel computation can lead to both statistical efficiency and computational savings.
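For orientation, the minimax rates for the mean-squared error that are usually quoted for these three kernel classes are listed below. This is a hedged summary rather than a restatement of the paper's theorems: constants, the noise level, and the moment conditions on the kernel eigenfunctions are omitted, and the symbols $r$ (kernel rank) and $\nu$ (Sobolev smoothness order) are notation introduced here. The exact statements and the corresponding upper bounds on $m$ should be checked against the paper's corollaries.

$$
\text{finite rank } r:\; \frac{r}{n},
\qquad
\text{Sobolev, order } \nu:\; n^{-\frac{2\nu}{2\nu + 1}},
\qquad
\text{Gaussian}:\; \frac{\sqrt{\log n}}{n}.
$$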

Numerical Results and Implications

The practical value of the divide-and-conquer strategy is corroborated through experiments. The simulation studies exhibit the estimator's capacity to deliver near-optimal convergence rates while significantly reducing computational load. The algorithm is also evaluated on a music-prediction task with real-world data, where it displays competitive performance against state-of-the-art approximation methods like Nyström sampling and random feature approximations.

The algorithm's strength lies in its simplicity and parallelizability, allowing it to naturally leverage modern distributed computing environments. This scalability is particularly advantageous for dealing with massive datasets where traditional kernel methods are not feasible.
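A back-of-the-envelope calculation shows where the savings come from. The sketch below assumes a dense direct solver, for which fitting KRR on $k$ points costs on the order of $k^3$ time and $k^2$ memory; it ignores constants, communication, and prediction cost. The function names `krr_cost` and `speedup` are hypothetical.

```python
def krr_cost(n, m=1):
    # Order-of-magnitude cost of one KRR fit on n / m points
    # (dense solver assumption: cubic time, quadratic memory).
    n_local = n / m
    return {"time": n_local**3, "memory": n_local**2}

def speedup(n, m):
    full = krr_cost(n)
    local = krr_cost(n, m)
    return {
        # Wall-clock gain when the m fits run in parallel: factor m^3.
        "per_machine_time": full["time"] / local["time"],
        # Gain in total work summed over all m machines: factor m^2.
        "total_work": full["time"] / (m * local["time"]),
        # Gain in kernel-matrix memory per machine: factor m^2.
        "per_machine_memory": full["memory"] / local["memory"],
    }
```

Under these assumptions, splitting into $m$ subsets cuts the per-machine fitting time by a factor of $m^3$ and the per-machine memory by a factor of $m^2$, which is why even a moderate number of partitions makes otherwise infeasible problems tractable.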

Future Directions

The paper's findings open several avenues for future research in distributed non-parametric regression. One potential direction is exploring adaptive schemes for automatically choosing regularization parameters within the distributed setting. Another area of interest is extending the divide-and-conquer framework to other kernel methods and broader classes of machine learning problems.

Overall, this paper provides valuable insights and methods for overcoming computational bottlenecks in kernel methods, making it a significant contribution to the field of large-scale non-parametric regression.
