
Privacy-Preserving Sharing of Data Analytics Runtime Metrics for Performance Modeling (2403.05692v2)

Published 8 Mar 2024 in cs.DC

Abstract: Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime metrics between collaborating organizations. Yet, not all organizations may be inclined to publicly disclose such metadata. We present a privacy-preserving approach for sharing runtime metrics based on differential privacy and data synthesis. Our evaluation on performance data from 736 Spark job executions indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when there is minimal original data available. With 30 or fewer original data samples available, using synthetic training data reduced performance model accuracy by only one percent on average.
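The abstract's core mechanism, differentially private release of runtime metrics, can be illustrated with a minimal sketch. This is not the paper's actual pipeline (the paper additionally uses data synthesis); it only shows the standard Laplace mechanism for perturbing a single numeric metric, such as a job's runtime in seconds. The function names `laplace_noise` and `privatize_metric` and the parameter choices are illustrative assumptions.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Sample from Laplace(0, scale) via inverse-CDF transform.
    # rng.random() is in [0, 1), so u is in [-0.5, 0.5); the log
    # argument is clamped to avoid log(0) at the boundary.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(max(1e-300, 1.0 - 2.0 * abs(u)))

def privatize_metric(value: float, sensitivity: float, epsilon: float,
                     rng: random.Random) -> float:
    # Laplace mechanism: adding noise with scale = sensitivity / epsilon
    # yields an epsilon-differentially-private release of the metric.
    return value + laplace_noise(sensitivity / epsilon, rng)

# Example: release a 100-second job runtime with sensitivity 1 and epsilon 1.
rng = random.Random(42)
noisy_runtime = privatize_metric(100.0, 1.0, 1.0, rng)
```

Smaller epsilon means stronger privacy but larger noise; the paper's finding is that models trained on such anonymized (and synthesized) data lose little accuracy when original samples are scarce.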

