Communication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved Rates (2310.09804v2)

Published 15 Oct 2023 in math.OC and cs.LG

Abstract: Byzantine robustness is an essential feature of algorithms for certain distributed optimization problems, typically encountered in collaborative/federated learning. These problems are usually huge-scale, implying that communication compression is also imperative for their resolution. These factors have spurred recent algorithmic and theoretical developments in the literature on Byzantine-robust learning with compression. In this paper, we contribute to this research area in two main directions. First, we propose a new Byzantine-robust method with compression, Byz-DASHA-PAGE, and prove that the new method has a better convergence rate (for non-convex and Polyak-Lojasiewicz smooth optimization problems) and a smaller neighborhood size in the heterogeneous case, and tolerates more Byzantine workers under over-parametrization than the previous method with state-of-the-art theoretical convergence guarantees (Byz-VR-MARINA). Second, we develop the first Byzantine-robust method with communication compression and error feedback, Byz-EF21, along with its bidirectional compression version, Byz-EF21-BC, and derive convergence rates for these methods in the non-convex and Polyak-Lojasiewicz smooth cases. We test the proposed methods and illustrate our theoretical findings in numerical experiments.
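
To make the abstract's ingredients concrete, below is a minimal sketch (Python/NumPy) of one round of distributed training that combines the two mechanisms named above: communication compression with an EF21-style error-feedback gradient estimator on each worker, and a Byzantine-robust aggregation rule on the server. All names, shapes, and constants here are assumptions made for illustration, not the paper's code; a coordinate-wise median stands in for the robust aggregators analyzed in this line of work, and the actual Byz-DASHA-PAGE and Byz-EF21/Byz-EF21-BC methods use different estimators and come with the guarantees stated in the paper.

import numpy as np

def top_k(v, k):
    # Biased Top-K compressor: keep the k largest-magnitude coordinates, zero the rest.
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_message(grad, g_prev, k):
    # EF21-style error feedback: compress the difference between the fresh gradient and
    # the previously communicated estimate, then shift the estimate by that message.
    delta = top_k(grad - g_prev, k)
    return delta, g_prev + delta

def robust_aggregate(estimates):
    # Coordinate-wise median as a simple stand-in for a Byzantine-robust aggregator.
    return np.median(estimates, axis=0)

# Toy round-based training loop (all numbers are assumptions for this illustration):
# n workers, one of them Byzantine, and the quadratic objective f(x) = ||x||^2 / 2
# shared by every honest worker, so the honest gradient at x is simply x.
rng = np.random.default_rng(0)
n, d, k, lr, rounds = 10, 20, 5, 0.05, 400
x = rng.normal(size=d)                 # current model
g_est = np.zeros((n, d))               # per-worker gradient estimates; in a real system the
                                       # worker and the server each hold this estimate and only
                                       # the compressed delta is ever transmitted between them.
print("initial ||x|| =", float(np.linalg.norm(x)))
for _ in range(rounds):
    honest_grad = x
    for i in range(n):
        grad = -100.0 * honest_grad if i == 0 else honest_grad  # worker 0 sends a poisoned gradient
        delta, g_est[i] = ef21_message(grad, g_est[i], k)
    x = x - lr * robust_aggregate(g_est)                        # median filters out the outlier
print("final   ||x|| =", float(np.linalg.norm(x)))              # norm shrinks despite the attack

In this toy setting the median is only trivially robust because the honest workers are identical; the heterogeneous and stochastic regimes discussed in the abstract are exactly where variance-reduced estimators (as in DASHA/PAGE) and carefully chosen robust aggregators become necessary, which is what the paper's analysis addresses.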
