Gemma 2: Improving Open Language Models at a Practical Size (2408.00118v3)
Published 31 Jul 2024 in cs.CL and cs.AI
Abstract: In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020) and grouped-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
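To make the distillation objective concrete, here is a minimal sketch in PyTorch of Hinton-style knowledge distillation as cited in the abstract: the student is trained to match the teacher's next-token distribution rather than the one-hot next token. The function name, tensor shapes, and the `temperature` parameter are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary at each position.

    Both logit tensors are assumed to have shape
    (batch, seq_len, vocab_size). `temperature` softens both
    distributions, as in Hinton et al. (2015).
    """
    # The teacher provides soft targets (probabilities); the student
    # contributes log-probabilities, which is what F.kl_div expects.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable
    # across temperature settings (standard in distillation setups).
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Per the abstract, this kind of soft-target loss replaces the standard next-token cross-entropy when training the 2B and 9B models, with a larger model serving as the teacher.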
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.
- AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- The Falcon series of open language models, 2023.
- PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
- Pathways: Asynchronous distributed dataflow for ML, 2022.
- Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016. URL http://arxiv.org/abs/1611.09940.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. URL https://arxiv.org/abs/2004.05150.
- Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
- Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
- Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
- Chatbot Arena: An open platform for evaluating LLMs by human preference, 2024.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019. URL http://arxiv.org/abs/1905.10044.
- Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
- Gemini Team. Gemini: A family of highly capable multimodal models, 2023.
- Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- Gemma Team. Gemma: Open models based on Gemini research and technology, 2024.
- MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.
- Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.
- Mistral 7B, 2023.
- LLM Comparator: Visual analytics for side-by-side evaluation of large language models, 2024. URL https://arxiv.org/abs/2402.10524.
- Evaluating language-model agents on realistic autonomous tasks, 2024. URL https://arxiv.org/abs/2312.11671.
- T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.
- MADLAD-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023.
- Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
- Malla: Demystifying real-world large language model integrated malicious services, 2024. URL https://arxiv.org/abs/2401.03315.
- Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015. URL http://arxiv.org/abs/1508.04025.
- Personal Communication, 2024.
- Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541.
- Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
- Evaluating frontier models for dangerous capabilities, 2024. URL https://arxiv.org/abs/2403.13793.
- Language models are unsupervised multitask learners, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.
- WARP: On the benefits of weight averaged rewarded policies, 2024.
- ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
- Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023.
- WINOGRANDE: An adversarial Winograd schema challenge at scale. CoRR, abs/1907.10641, 2019. URL http://arxiv.org/abs/1907.10641.
- N. Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
- Model evaluation for extreme risks, 2023. URL https://arxiv.org/abs/2305.15324.
- RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022.
- Qwen Team. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/.
- The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models, 2020. URL https://arxiv.org/abs/2008.05122.
- LLaMA: Open and efficient foundation language models, 2023.
- Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
- Ethical and social risks of harm from language models, 2021. URL https://arxiv.org/abs/2112.04359.
- xAI. Grok-1, 2024. URL https://github.com/xai-org/grok-1.
- XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla.
- GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.
- InterCode: Standardizing and benchmarking interactive coding with execution feedback, 2023. URL https://arxiv.org/abs/2306.14898.
- B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, abs/1910.07467, 2019. URL http://arxiv.org/abs/1910.07467.
- LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. arXiv preprint arXiv:2309.11998, 2023.