
The Falcon Series of Open Language Models

(2311.16867)
Published Nov 28, 2023 in cs.CL and cs.AI

Abstract

We introduce the Falcon series: 7B-, 40B-, and 180B-parameter causal decoder-only models trained on diverse, high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, which allowed us to efficiently pretrain these models on up to 4,096 A100s on AWS cloud infrastructure with limited interconnect. We release a 600B-token extract of our web dataset, as well as the Falcon-7/40/180B models, under a permissive license to foster open science and accelerate the development of an open ecosystem of LLMs.

Overview

  • The Falcon series includes three open language models, Falcon-7B, Falcon-40B, and Falcon-180B, with the largest being trained on 3.5 trillion tokens.

  • The Falcon models challenge the assumption that curated corpora are required by training predominantly on high-quality, filtered web data, and introduce multigroup attention for efficient inference.

  • The models are trained on cloud infrastructure with A100-40GB GPUs, using 3D parallelism, ZeRO sharding, and FlashAttention kernels for efficiency.

  • Falcon-180B exhibits strong performance across various NLP tasks, showing promise for specialization in chatbot and code-related tasks.

  • The release under open licenses promotes AI research democratization and responsible use of LLMs.

The Falcon series, introduced by the Technology Innovation Institute, comprises three models, Falcon-7B, Falcon-40B, and Falcon-180B, each trained with progressively more data and compute. The largest, Falcon-180B, is notable for being trained on roughly 3,500 billion tokens of text, the largest openly documented pretraining run to date. The models are presented as significant contributions to open language modeling, with the 180B variant released under a responsible-AI license and the smaller models under the Apache 2.0 license.

The research behind the Falcon models involved extensive experimentation to settle the architecture and pretraining data mix. The team relied heavily on web data that was carefully filtered and deduplicated, challenging the belief that hand-curated corpora are necessary for training strong language models. The scale of this web corpus also allowed them to avoid repeating data during training, sidestepping the memorization and performance degradation associated with multiple passes over the same text. Architecturally, the team adopted multigroup attention, a variant of multiquery attention, to improve inference efficiency by shrinking the key-value (KV) cache that must be kept in memory during generation.
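
To make the memory argument concrete, here is a minimal, hedged sketch of multigroup (grouped-query) attention in PyTorch. It is not the authors' implementation; the tensor shapes, head counts, and weight layout are illustrative assumptions, chosen only to show how sharing each key/value head across a group of query heads shrinks the KV cache.

```python
import torch
import torch.nn.functional as F


def multigroup_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Minimal multigroup (grouped-query) attention sketch.

    Each of the n_kv_heads key/value heads is shared by a group of
    n_q_heads // n_kv_heads query heads, so the inference-time KV cache
    only has to store n_kv_heads heads per layer instead of n_q_heads.
    """
    bsz, seq, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads

    q = (x @ wq).view(bsz, seq, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(bsz, seq, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(bsz, seq, n_kv_heads, head_dim).transpose(1, 2)

    # Broadcast each KV head to its group of query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    # scaled_dot_product_attention dispatches to fused (FlashAttention-style)
    # kernels on supported hardware.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(bsz, seq, n_q_heads * head_dim)


# Illustrative sizes: 8 query heads sharing 2 KV heads -> 4x smaller KV cache.
x = torch.randn(1, 16, 512)
wq = torch.randn(512, 8 * 64)   # 8 query heads of width 64
wk = torch.randn(512, 2 * 64)   # only 2 KV heads are projected...
wv = torch.randn(512, 2 * 64)   # ...so far fewer keys/values are cached
y = multigroup_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2)
print(y.shape)  # torch.Size([1, 16, 512])
```

With 8 query heads sharing 2 KV heads, the cache stores a quarter as many key/value vectors per token, which is exactly the inference-time saving the multigroup design targets.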

Implementation-wise, the Falcon models are trained on AWS cloud infrastructure using cost-efficient A100-40GB GPUs. Training is enabled by a custom distributed framework, Gigatron, which combines 3D parallelism (data, tensor, and pipeline) with ZeRO sharding of optimizer state to stay within per-GPU memory while keeping throughput high. FlashAttention kernels further speed up the attention computation.
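
Gigatron itself is not publicly released, so the following is only a rough sketch of the ZeRO idea (sharding optimizer state across data-parallel ranks) using stock PyTorch primitives; the model, sizes, and hyperparameters are placeholders. FlashAttention-style fused kernels are reachable in recent PyTorch through torch.nn.functional.scaled_dot_product_attention, as used in the attention sketch above.

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

# Illustration only: Falcon's Gigatron framework is custom and unreleased.
# This shows the general ZeRO idea (sharding optimizer state across
# data-parallel ranks) with stock PyTorch. Launch with torchrun so the
# process-group environment variables are already set.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Placeholder model; a real 3D-parallel stack would also split the model
# across GPUs with tensor and pipeline parallelism.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# Each rank keeps only a shard of the AdamW state (ZeRO stage-1 style),
# cutting per-GPU optimizer memory roughly by the data-parallel degree.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=1e-4,
)

x = torch.randn(32, 8, 512, device="cuda")  # (seq, batch, d_model)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Run under torchrun so each process joins the NCCL group; every rank then holds only its shard of the AdamW moments instead of a full copy.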

Upon evaluation, Falcon-180B demonstrates competitive performance across a variety of natural language processing tasks, placing it alongside top models such as OpenAI's GPT-4 and Google's PaLM-2-Large. In evaluations run with the EleutherAI Evaluation Harness, the Falcon models not only score well on standard NLP benchmarks but also show potential for specialization in areas such as chatbot development and code-related tasks.
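
For readers who want to run comparable benchmarks, below is a hedged sketch of an EleutherAI lm-evaluation-harness call on the public Falcon-7B checkpoint. The entry point and model-type string shown match recent harness releases (v0.4.x) and differ in older versions, so treat the exact arguments as assumptions rather than a prescribed recipe.

```python
# Sketch of an lm-evaluation-harness run on a Falcon checkpoint.
# The API shown follows harness v0.4.x; older releases use a different
# model-type string and entry point, so adjust to your installed version.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/falcon-7b,dtype=bfloat16",
    tasks=["hellaswag", "winogrande", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```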

The authors acknowledge limitations in their research, including the potential for different results at larger scales and the possible need to decouple training from inference compute to manage downstream deployment costs. Moreover, Falcon models, predominantly trained on English web data, may struggle with out-of-scope languages and domains.

The release of the Falcon models and a 600B-token portion of the RefinedWeb dataset under open licenses represents a push towards the democratization of AI research, fostering collaboration and encouraging responsible use of LLMs. The models and accompanying research documentation have been made publicly available with the intention of contributing to collective advancement in AI technology.
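
As a practical starting point, the released artifacts are hosted on the Hugging Face Hub under the tiiuae organization. The snippet below is a minimal sketch using the standard transformers and datasets APIs; the RefinedWeb text column name is taken from the public dataset card and should be treated as an assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Falcon-7B weights and tokenizer (Apache 2.0); Falcon-40B/180B load the same
# way. Recent transformers versions include native Falcon support.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", torch_dtype="auto", device_map="auto"
)

prompt = "The Falcon series of language models was trained on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))

# The released 600B-token RefinedWeb extract, streamed to avoid a full
# download; "content" is the text field listed on the dataset card.
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
print(next(iter(refinedweb))["content"][:200])
```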

