The Future of Large Language Model Pre-training is Federated (2405.10853v3)
Abstract: Generative pre-trained LLMs have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvements depend on the amount of computing and data sources they can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, and reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations wishing to collaborate, with their private data sources and computational resources, to pre-train LLMs with billions of parameters. This paradigm would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show that the effectiveness of federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources. Thus far, we have used Photon to train LLMs of up to 7B parameters and anticipate larger models being completed in the near future. Finally, we show that LLM training is highly resilient to the classical challenges of federated statistical and hardware heterogeneity. Furthermore, we show that convergence is robust to partial participation, opening the avenue for compute-efficient collaborative training. Photon will help data-rich actors to become the protagonists of LLM pre-training, instead of leaving the stage to compute-rich actors alone.
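The abstract describes federated pre-training in which participating organizations run local updates on private data and periodically aggregate model weights, with convergence remaining robust under partial participation. The sketch below is a minimal, illustrative federated-averaging loop in that spirit (local SGD with a sampled subset of clients per round); it is not the Photon system, and all names here (`TinyLM`, `local_train`, `fed_avg_round`, the participation fraction) are hypothetical stand-ins chosen only to make the paradigm concrete.

```python
# Minimal sketch of federated LLM pre-training with local updates and partial
# participation. NOT the Photon system; model, data, and function names are
# hypothetical illustrations of the general FedAvg / local-SGD paradigm.
import random
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A toy next-token predictor standing in for a billion-parameter LLM."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

def local_train(global_state, tokens, local_steps=8, lr=1e-3):
    """One client's round: load the global weights, run local steps on private data."""
    model = TinyLM()
    model.load_state_dict(global_state)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(local_steps):
        x, y = tokens[:, :-1], tokens[:, 1:]
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.state_dict()

def fed_avg_round(global_state, client_corpora, participation=0.5):
    """Sample a fraction of clients (partial participation) and average their weights."""
    k = max(1, int(len(client_corpora) * participation))
    sampled = random.sample(client_corpora, k)
    client_states = [local_train(global_state, tokens) for tokens in sampled]
    return {
        name: torch.stack([s[name].float() for s in client_states]).mean(0)
        for name in global_state
    }

if __name__ == "__main__":
    torch.manual_seed(0)
    # Each "client" holds a private corpus, simulated here with random token ids.
    corpora = [torch.randint(0, 256, (4, 33)) for _ in range(8)]
    global_state = TinyLM().state_dict()
    for rnd in range(3):  # a few communication rounds
        global_state = fed_avg_round(global_state, corpora)
        print(f"completed federated round {rnd}")
```

In a real deployment the per-client work would be a full distributed training job rather than a single process, and the averaging step would be replaced by a server-side optimizer over the aggregated updates; the communication-round structure, however, is the same.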