
Practical Federated Gradient Boosting Decision Trees (1911.04206v2)

Published 11 Nov 2019 in cs.LG and stat.ML

Abstract: Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years, with many awards in machine learning and data mining competitions. There have been several recent studies on how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use. They suffer either from the inefficiency due to the usage of costly data transformations such as secret sharing and homomorphic encryption, or from the low model accuracy due to differential privacy designs. In this paper, we study a practical federated environment with relaxed privacy constraints. In this environment, a dishonest party might obtain some information about the other parties' data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure without exposing the original record to other parties, while the computation overhead in the training process is kept low. Our experimental studies show that, compared with normal training with the local data of each party, our approach can significantly improve the predictive accuracy, and achieve comparable accuracy to the original GBDT with the data from all parties.

Citations (171)

Summary

  • The paper introduces SimFL, a framework that leverages locality-sensitive hashing to enhance privacy while maintaining computational efficiency.
  • It details a two-stage process where preprocessing builds global hash tables and training uses weighted gradients for decision tree construction.
  • Experimental results show that SimFL improves predictive accuracy over local models and nearly matches fully aggregated data performance.

Practical Federated Gradient Boosting Decision Trees

The paper "Practical Federated Gradient Boosting Decision Trees" by Qinbin Li, Zeyi Wen, and Bingsheng He presents a federated learning framework designed specifically for Gradient Boosting Decision Trees (GBDTs). The authors address horizontal federated learning, where data samples with identical features are distributed across multiple parties. Existing methods in this setting are limited either by the inefficiency of computationally expensive techniques such as secret sharing and homomorphic encryption, or by the accuracy loss incurred by differential privacy designs.

Framework Overview

The authors propose the SimFL framework, which exploits similarity information based on Locality-Sensitive Hashing (LSH) without exposing original records to other parties. Unlike previous approaches that rely heavily on cryptographic transformations, SimFL balances privacy with computational efficiency and model accuracy by adopting a relaxed privacy model: a dishonest party may learn some information about other parties' data, but deriving the actual raw data remains infeasible.
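To illustrate the LSH building block, the snippet below is a minimal, hypothetical sketch of p-stable LSH for Euclidean distance (the function names and parameters are my own, not from the paper): records that are close in feature space collide in the same hash buckets with high probability, so parties can compare hash values instead of raw features.

```python
import numpy as np

def make_lsh_family(dim, num_hashes, bucket_width=4.0, seed=0):
    # p-stable LSH for Euclidean distance: h(x) = floor((a . x + b) / w),
    # with a drawn from a standard normal and b uniform in [0, w).
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((num_hashes, dim))
    b = rng.uniform(0, bucket_width, size=num_hashes)
    def hash_fn(x):
        return np.floor((a @ x + b) / bucket_width).astype(int)
    return hash_fn

hash_fn = make_lsh_family(dim=3, num_hashes=5)
x = np.array([1.0, 2.0, 3.0])
y = x + 0.001                       # near-duplicate of x
z = np.array([-5.0, 9.0, 0.5])      # dissimilar point
# Similar points collide on most of the 5 hashes; dissimilar ones rarely do.
print(np.sum(hash_fn(x) == hash_fn(y)))
print(np.sum(hash_fn(x) == hash_fn(z)))
```

The number of shared hash values serves as the similarity proxy that SimFL exploits, without either party revealing the underlying feature vectors.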

Technical Contributions

SimFL consists of two main stages: preprocessing and training. During preprocessing, the parties compute hash values of their instances using LSH functions, build global hash tables, and broadcast them, gathering similarity information while keeping individual records concealed. In the training stage, each party in turn builds a number of decision trees using weighted gradients that incorporate the similarity information from the other parties; the weighting emphasizes instances that are representative of the broader joint dataset, which is intended to improve model performance.
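The training-stage bookkeeping described above can be sketched as follows. This is a rough, assumed data layout (names and structure are illustrative, not the paper's code): each remote instance is matched, via shared hash values, to the local instance it most resembles, and the resulting match counts then scale that local instance's gradients when the party builds its trees.

```python
import numpy as np

def similarity_weights(local_hashes, remote_hashes_by_party):
    """For each remote instance, find the local instance sharing the most
    LSH hash values (a proxy for its nearest neighbour), and count how many
    remote instances each local instance represents. A local instance's
    gradients can then be scaled by 1 + count during boosting."""
    counts = np.zeros(len(local_hashes), dtype=int)
    for remote in remote_hashes_by_party:
        for r in remote:
            collisions = (local_hashes == r).sum(axis=1)  # shared buckets
            counts[np.argmax(collisions)] += 1
    return counts

rng = np.random.default_rng(1)
local = rng.integers(0, 3, size=(4, 6))     # 4 local instances, 6 hashes each
remote = [rng.integers(0, 3, size=(5, 6))]  # one other party, 5 instances
w = similarity_weights(local, remote)
print(w.sum())  # 5: every remote instance is assigned to exactly one local one
```

Note that only hash values cross party boundaries here; the raw feature vectors never leave their owner.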

Additionally, the paper provides a theoretical analysis: it characterizes the privacy level, bounds the approximation error introduced by weighted gradient boosting, and evaluates computational efficiency. The privacy model guarantees that any observed output is consistent with infinitely many possible inputs, protecting against inference of exact records, while the derived bounds suggest that the approximation error is manageable and the computational overhead remains low.
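To make the weighting concrete, here is a hedged sketch of how similarity weights could enter an XGBoost-style leaf-value computation. The formula is the standard second-order boosting leaf value with each instance's gradient statistics scaled by its weight; this is a reading of the paper's weighted gradient boosting, not its exact implementation.

```python
import numpy as np

def weighted_leaf_value(grads, hess, weights, reg_lambda=1.0):
    """Second-order (XGBoost-style) optimal leaf value with per-instance
    similarity weights w_i scaling the gradient statistics:
        leaf = -sum(w_i * g_i) / (sum(w_i * h_i) + lambda)
    An instance that represents many similar remote instances (large w_i)
    pulls the leaf value more strongly."""
    grads, hess, weights = map(np.asarray, (grads, hess, weights))
    return -np.sum(weights * grads) / (np.sum(weights * hess) + reg_lambda)

# Toy check: the heavily weighted instance dominates the other.
v = weighted_leaf_value(grads=[1.0, -1.0], hess=[1.0, 1.0], weights=[3.0, 1.0])
print(v)  # -(3*1 + 1*(-1)) / (3*1 + 1*1 + 1) = -0.4
```

Under this scheme an unweighted model (all w_i = 1) reduces to ordinary local GBDT training, which is consistent with the paper's framing of the weights as a way to approximate training on the joint data.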

Experimental Evaluation

The experimental results show that SimFL significantly improves predictive accuracy over models trained on each party's local data alone (denoted SOLO), and its test errors are comparable to those of models trained on the aggregated data of all parties without privacy constraints (denoted ALL-IN). Unlike TFL (Tree-based Federated Learning), which failed to improve prediction accuracy, SimFL consistently produced better results. Furthermore, SimFL exhibited stable performance across various data partitioning scenarios, including balanced data distributions.

Implications and Future Perspectives

From a practical standpoint, SimFL's efficient design and robust accuracy gains make it a viable federated learning approach for settings where data privacy is a critical concern. By leveraging similarity information rather than relying solely on cryptographic methods, practitioners can build models that benefit from collaborative data analysis while limiting the exposure of local data.

The theoretical contributions provide insights into addressing privacy and efficiency trade-offs in federated learning, opening pathways for further exploration in secure model development. Future developments may focus on enhancing the privacy model to tackle inference attacks more rigorously, as well as adapting the methodology for other machine learning tasks in federated setups.

Overall, SimFL represents a significant step towards practical, efficient, and accurate federated learning, applicable to real-world scenarios with stringent data privacy requirements.