LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models (2309.14393v2)
Abstract: The carbon footprint of LLMs is a significant concern, encompassing emissions from their training, inference, experimentation, and storage, including both operational and embodied carbon. An essential capability is accurately estimating the carbon impact of emerging LLMs before they are trained, a process that depends heavily on GPU usage. Existing studies have reported the carbon footprint of LLM training, but only one tool, mlco2, can predict the carbon footprint of new neural networks prior to physical training. However, mlco2 has several serious limitations: it cannot extend its estimates to dense or mixture-of-experts (MoE) LLMs, disregards critical architectural parameters, considers only GPUs, and cannot model embodied carbon footprints. Addressing these gaps, we introduce LLMCarbon, an end-to-end carbon footprint projection model designed for both dense and MoE LLMs. Compared to mlco2, LLMCarbon significantly improves the accuracy of carbon footprint estimates for various LLMs. The source code is released at https://github.com/SotaroKaneda/MLCarbon.
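The flavor of such a projection can be illustrated with a back-of-the-envelope operational-carbon estimate. The sketch below is not LLMCarbon's actual API; it assumes a simple dense-LLM compute model (roughly 6 × parameters × tokens FLOPs) and illustrative hardware, efficiency, PUE, and grid-intensity figures, all of which are stand-in assumptions rather than values from the paper.

```python
# Minimal sketch (NOT LLMCarbon's API): operational training carbon from
# model size, token count, and assumed hardware/datacenter parameters.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-LLM training compute as ~6 * parameters * tokens."""
    return 6.0 * n_params * n_tokens

def operational_carbon_kg(n_params: float,
                          n_tokens: float,
                          peak_flops_per_device: float,
                          hardware_efficiency: float,
                          device_power_watts: float,
                          num_devices: int,
                          pue: float,
                          grid_intensity_kg_per_kwh: float) -> float:
    """Estimate operational CO2e (kg) for one training run."""
    flops = training_flops(n_params, n_tokens)
    # Achieved throughput = peak * assumed utilization fraction, summed over devices.
    achieved_flops = peak_flops_per_device * hardware_efficiency * num_devices
    train_seconds = flops / achieved_flops
    # Joules -> kWh, then scale by datacenter power usage effectiveness (PUE).
    energy_kwh = (device_power_watts * num_devices * train_seconds
                  / 3600.0 / 1000.0) * pue
    return energy_kwh * grid_intensity_kg_per_kwh

if __name__ == "__main__":
    # Hypothetical 175B-parameter model trained on 300B tokens; all figures illustrative.
    kg_co2e = operational_carbon_kg(
        n_params=175e9,
        n_tokens=300e9,
        peak_flops_per_device=312e12,   # e.g. A100 BF16 peak (assumed)
        hardware_efficiency=0.35,       # assumed achieved fraction of peak
        device_power_watts=400.0,       # assumed average draw per device
        num_devices=1024,
        pue=1.1,                        # assumed datacenter PUE
        grid_intensity_kg_per_kwh=0.4,  # assumed grid carbon intensity
    )
    print(f"Estimated operational footprint: {kg_co2e / 1000:.1f} tCO2e")
```

Embodied carbon (from chip and server manufacturing) and MoE-specific architectural parameters, which LLMCarbon also models, are omitted from this sketch.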
- Carbon explorer: A holistic framework for designing carbon aware datacenters. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 118–132, 2023.
- PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051, 2020.
- Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021.
- Green cloud computing: Balancing energy in processing, storage, and transport. Proceedings of the IEEE, 99(1):149–167, 2011.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901, 2020.
- Broken neural scaling laws. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sckjveqlCZ.
- Are the new AIs smart enough to steal your job? IQ scores for ChatGPT, Microsoft Bing, Google Bard and Quora Poe. April 7, 2023.
- Pipeline MoE: A flexible MoE implementation with pipeline parallelism. arXiv preprint arXiv:2304.11414, 2023.
- Jeongdong Choe. Memory technology 2021: Trends & challenges. In 2021 International Conference on Simulation of Semiconductor Processes and Devices (SISPAD), pp. 111–115. IEEE, 2021.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Unsupervised cross-lingual representation learning at scale. In Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, July 2020.
- Measuring the carbon intensity of AI in cloud instances. In ACM Conference on Fairness, Accountability, and Transparency, pp. 1877–1894, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522.
- GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547–5569. PMLR, 2022.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
- DTCO including sustainability: Power-performance-area-cost-environmental score (PPACE) analysis for logic technologies. In IEEE International Electron Devices Meeting, pp. 41.4.1–41.4.4, 2020.
- Chasing carbon: The elusive environmental footprint of computing. IEEE Micro, 42(4):37–47, July 2022.
- Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research, 21(1), January 2020. ISSN 1532-4435.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- In-datacenter performance analysis of a tensor processing unit. In IEEE/ACM International Symposium on Computer Architecture, pp. 1–12, 2017.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Scalable and efficient MoE training for multitask multilingual models. arXiv preprint arXiv:2109.10465, 2021.
- Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.
- A holistic assessment of the carbon footprint of Noor, a very large Arabic language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pp. 84–94, May 2022.
- GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.
- Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1, 2021.
- Energy consumption and emission mitigation prediction based on data center traffic and PUE for global data centers. Global Energy Interconnection, 3(3):272–282, 2020.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. In ACM International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
- Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
- The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022.
- The carbon footprint of distributed cloud storage. arXiv preprint arXiv:1803.06973, 2018.
- Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, pp. 18332–18346, 2022.
- Katharine Sanderson. GPT-4 is here: What scientists think. Nature, 615(7954):773, 2023.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- Green AI. Communications of the ACM, 63(12):54–63, November 2020.
- Zen 2: The AMD 7nm energy-efficient high-performance x86-64 microprocessor core. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 42–44. IEEE, 2020.
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
- Energy and policy considerations for deep learning in NLP. In Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, 2019.
- The dirty secret of SSDs: Embodied carbon. In The 1st Workshop on Sustainable Computer Systems Design and Implementation, 2022.
- Deep learning’s diminishing returns: The cost of improvement is becoming unsustainable. IEEE Spectrum, 58(10):50–55, 2021. doi: 10.1109/MSPEC.2021.9563954.
- LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
- TSMC. TSMC Corporate Social Responsibility Report. https://esg.tsmc.com/download/file/2019-csr-report/english/pdf/e-all.pdf, 2019.
- Wiki. Ampere (microarchitecture). http://en.wikipedia.org/w/index.php?title=Ampere%20(microarchitecture)&oldid=1160464393, 2023a.
- Wiki. Tensor Processing Unit. http://en.wikipedia.org/w/index.php?title=Tensor%20Processing%20Unit&oldid=1158650479, 2023b.
- Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795–813, 2022.
- Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
- Yandex. YaLM 100B. https://github.com/yandex/YaLM-100B, 2022.
- Orca: A distributed serving system for Transformer-based generative models. In USENIX Symposium on Operating Systems Design and Implementation, pp. 521–538, 2022.
- GLM-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, 2023.
- ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.