Communication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved Rates (2310.09804v2)
Abstract: Byzantine robustness is an essential feature of algorithms for certain distributed optimization problems, typically encountered in collaborative/federated learning. These problems are usually huge-scale, implying that communication compression is also imperative for their resolution. These factors have spurred recent algorithmic and theoretical developments in the literature on Byzantine-robust learning with compression. In this paper, we contribute to this research area in two main directions. First, we propose a new Byzantine-robust method with compression, Byz-DASHA-PAGE, and prove that the new method has a better convergence rate (for non-convex and Polyak-Lojasiewicz smooth optimization problems), a smaller neighborhood size in the heterogeneous case, and tolerates more Byzantine workers under over-parametrization than the previous method with SOTA theoretical convergence guarantees (Byz-VR-MARINA). Second, we develop the first Byzantine-robust method with communication compression and error feedback, Byz-EF21, along with its bidirectional-compression version, Byz-EF21-BC, and derive convergence rates for these methods in the non-convex and Polyak-Lojasiewicz smooth cases. We test the proposed methods and illustrate our theoretical findings in numerical experiments.
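Both families of methods described in the abstract combine the same two ingredients: compressed communication from workers to the server and a robust aggregation rule that limits the influence of Byzantine workers. The NumPy sketch below illustrates this general pattern only; it is not the paper's Byz-DASHA-PAGE or Byz-EF21 update. It assumes a Top-K compressor applied to gradient differences (an EF21-style estimator) and a coordinate-wise median as the aggregator, and every name in it (`top_k`, `coordinate_median`, `ef21_byzantine_sketch`, `lr`, `k`) is illustrative.

```python
import numpy as np

def top_k(v, k):
    """Top-K sparsifier: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def coordinate_median(vectors):
    """Coordinate-wise median: a simple robust aggregation rule."""
    return np.median(np.stack(vectors), axis=0)

def ef21_byzantine_sketch(grad_fns, x0, lr=0.02, k=10, steps=500):
    """EF21-style iteration with compressed gradient differences and robust
    aggregation (illustrative only, not the paper's exact methods)."""
    x = x0.astype(float)
    g = [gf(x) for gf in grad_fns]          # workers send full gradients once
    for _ in range(steps):
        x = x - lr * coordinate_median(g)   # server aggregates robustly and takes a step
        for i, gf in enumerate(grad_fns):
            # each worker compresses only the difference between its fresh gradient
            # and its current estimate, then updates the estimate shared with the server
            g[i] = g[i] + top_k(gf(x) - g[i], k)
    return x

# Toy run: nine honest workers share a quadratic objective, one faulty worker
# reports scaled, sign-flipped gradients; the median keeps the iterates on track.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
A = A.T @ A / 20 + np.eye(20)               # positive-definite quadratic f(x) = x^T A x / 2
honest = [lambda x: A @ x for _ in range(9)]
faulty = [lambda x: -10.0 * (A @ x)]
x_final = ef21_byzantine_sketch(honest + faulty, x0=np.ones(20))
print(np.linalg.norm(x_final))              # small norm: convergence despite the faulty worker
```

Compressing gradient differences rather than raw gradients lets the compression error shrink as the iterates stabilize, which is the property error-feedback estimators exploit; the paper's methods pair this idea with provably robust aggregation rules rather than the plain coordinate-wise median used above.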
- Byzantine stochastic gradient descent. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4618–4628.
- QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30.
- The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems, 31.
- Byzantine-resilient non-convex stochastic gradient descent. In International Conference on Learning Representations.
- Fixing by mixing: A recipe for optimal Byzantine ML under heterogeneity. In International Conference on Artificial Intelligence and Statistics, pages 1232–1300. PMLR.
- Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1-2):165–214.
- A little is enough: Circumventing defenses for distributed learning. Advances in Neural Information Processing Systems, 32.
- Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32.
- signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291.
- Stochastic gradient descent-ascent: Unified theory and new efficient methods. arXiv preprint arXiv:2202.07262.
- On biased compression for distributed learning. arXiv preprint arXiv:2002.12410.
- Distributed methods with compressed communication for solving variational inequalities, with theoretical guarantees. arXiv preprint arXiv:2110.03313.
- Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in Neural Information Processing Systems, 30.
- LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27.
- DRACO: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pages 903–912. PMLR.
- Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25.
- Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32.
- Aggregathor: Byzantine machine learning via robust gradient aggregation. Proceedings of Machine Learning and Systems, 1:81–106.
- SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27.
- Distributed deep learning in open collaborations. Advances in Neural Information Processing Systems, 34:7879–7897.
- Adaptive gradient quantization for data-parallel SGD. Advances in Neural Information Processing Systems, 33:3174–3185.
- SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31.
- Local model poisoning attacks to Byzantine-robust federated learning. In 29th USENIX Security Symposium (USENIX Security 20), pages 1605–1622.
- EF21 with bells & whistles: Practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294.
- Communication-efficient and Byzantine-robust distributed learning with error feedback. IEEE Journal on Selected Areas in Information Theory, 2(3):942–953.
- Distributed Newton can communicate less and resist Byzantine workers. Advances in Neural Information Processing Systems, 33:18028–18038.
- Goodall, W. (1951). Television by pulse code modulation. Bell System Technical Journal, 30(1):33–49.
- Secure distributed training at scale. arXiv preprint arXiv:2106.11257.
- MARINA: Faster non-convex distributed learning with compression. In International Conference on Machine Learning, pages 3788–3798. PMLR.
- Variance reduction is an antidote to Byzantines: Better rates, weaker assumptions and communication compression as a cherry on the top.
- Linearly converging error compensated SGD. Advances in Neural Information Processing Systems, 33:20889–20900.
- EF21-P and friends: Improved theoretical communication complexity for distributed optimization with bidirectional compression. In International Conference on Machine Learning, pages 11761–11807. PMLR.
- The hidden vulnerability of distributed learning in Byzantium. In International Conference on Machine Learning, pages 3521–3530. PMLR.
- Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350–2358. PMLR.
- Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988.
- Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115.
- Distributed second order methods with fast rates and compressed communication. In International Conference on Machine Learning, pages 4617–4628. PMLR.
- Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210.
- Learning from history for Byzantine robust optimization. In International Conference on Machine Learning, pages 5311–5319. PMLR.
- Byzantine-robust learning on heterogeneous datasets via bucketing. In International Conference on Learning Representations.
- Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR.
- Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573.
- A hybrid GPU cluster and volunteer computing platform for scalable deep learning. The Journal of Supercomputing, 74(7):3236–3263.
- Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pages 3478–3487. PMLR.
- Federated learning: Strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop.
- A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics, pages 4087–4095. PMLR.
- The byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401.
- Li, C. (2020). Demystifying GPT-3 language model: A technical overview.
- An experimental study of Byzantine-robust aggregation schemes in federated learning. IEEE Transactions on Big Data.
- PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pages 6286–6295. PMLR.
- Acceleration for compressed gradient descent in distributed and federated optimization. In International Conference on Machine Learning, pages 5895–5904. PMLR.
- CANITA: Faster rates for distributed convex optimization with communication compression. Advances in Neural Information Processing Systems, 34.
- An optimal hybrid variance-reduced algorithm for stochastic composite nonconvex optimization. arXiv preprint arXiv:2008.09055.
- Privacy and robustness in federated learning: Attacks and defenses. arXiv preprint arXiv:2012.06337.
- Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269.
- SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621. PMLR.
- OpenAI (2023). GPT-4 technical report.
- Byzantines can also learn from history: Fall of centered clipping in federated learning. IEEE Transactions on Information Forensics and Security.
- Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees. arXiv preprint arXiv:2006.14591.
- Preserved central model for faster bidirectional compression in distributed settings. Advances in Neural Information Processing Systems, 34.
- Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142–1154.
- Error compensated distributed SGD can be accelerated. Advances in Neural Information Processing Systems, 34.
- DETOX: A redundancy-based framework for faster and more robust gradient aggregation. Advances in Neural Information Processing Systems, 32.
- ByGARS: Byzantine SGD with arbitrary number of attackers. arXiv preprint arXiv:2006.13421.
- EF21: A new, simpler, theoretically better, and practically faster error feedback. Advances in Neural Information Processing Systems, 34.
- Roberts, L. (1962). Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154.
- Dynamic federated learning model for identifying adversarial clients. arXiv preprint arXiv:2007.15030.
- Federated optimization algorithms with random reshuffling and gradient compression. arXiv preprint arXiv:2206.07021.
- FedNL: Making Newton-type methods applicable to federated learning. arXiv preprint arXiv:2106.02969.
- Rethinking gradient sparsification as total error minimization. Advances in Neural Information Processing Systems, 34.
- 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association. Citeseer.
- Sparsified SGD with memory. Advances in Neural Information Processing Systems, 31.
- Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms. In Proceedings of the 2016 ACM symposium on principles of distributed computing, pages 425–434.
- Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337. PMLR.
- Permutation compressors for provably faster distributed nonconvex optimization. arXiv preprint arXiv:2110.03300.
- DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pages 6155–6165.
- Byzantine-resilient federated learning at edge. IEEE Transactions on Computers.
- A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005–1071.
- 2direction: Theoretically faster distributed training with bidirectional communication compression. arXiv preprint arXiv:2305.12379.
- DASHA: Distributed nonconvex optimization with communication compression, optimal oracle complexity, and no client synchronization. In International Conference on Learning Representations.
- Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR.
- PowerSGD: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32.
- Weiszfeld, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, First Series, 43:355–386.
- TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30.
- Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks. IEEE Transactions on Signal Processing, 68:4583–4596.
- Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation. In Uncertainty in Artificial Intelligence, pages 261–270. PMLR.
- Towards building a robust and fair federated learning system. arXiv preprint arXiv:2011.10464.
- Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR.
- BROADCAST: Reducing both stochastic and compression noise to robustify communication-efficient federated learning. arXiv preprint arXiv:2104.06685.
- Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23.