
Gemma: Open Models Based on Gemini Research and Technology (2403.08295v4)

Published 13 Mar 2024 in cs.CL and cs.AI

Abstract: This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.

References (57)
  1. The falcon series of open language models, 2023.
  2. Concrete problems in AI safety. arXiv preprint, 2016.
  3. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  4. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
  5. Constitutional AI: Harmlessness from AI feedback, 2022.
  6. Pathways: Asynchronous distributed dataflow for ml, 2022.
  7. PIQA: reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019. URL http://arxiv.org/abs/1911.11641.
  8. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39, 1952.
  9. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
  10. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
  11. PaLM: Scaling language modeling with Pathways, 2022.
  12. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
  13. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019. URL http://arxiv.org/abs/1905.10044.
  14. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
  15. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
  16. Large scale distributed deep networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf.
  17. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
  18. Gemini Team. Gemini: A family of highly capable multimodal models, 2023.
  19. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.
  20. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.
  21. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.
  22. Mistral 7B, 2023.
  23. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551, 2017. URL http://arxiv.org/abs/1705.03551.
  24. How our principles helped define alphafold’s release, 2022.
  25. T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.
  26. MADLAD-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023.
  27. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
  28. Deep learning. Nature, 521(7553):436–444, 2015.
  29. Efficient estimation of word representations in vector space. In Y. Bengio and Y. LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781.
  30. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  31. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022.
  32. How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions, 2023.
  33. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, abs/1606.06031, 2016. URL http://arxiv.org/abs/1606.06031.
  34. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.
  35. Scaling up models and data with t5x and seqio, 2022.
  36. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023.
  37. WINOGRANDE: an adversarial winograd schema challenge at scale. CoRR, abs/1907.10641, 2019. URL http://arxiv.org/abs/1907.10641.
  38. SocialIQA: Commonsense reasoning about social interactions. CoRR, abs/1904.09728, 2019. URL http://arxiv.org/abs/1904.09728.
  39. N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019. URL http://arxiv.org/abs/1911.02150.
  40. N. Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  41. Defining and characterizing reward gaming. In NeurIPS, 2022.
  42. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  43. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.
  44. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022.
  45. CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019.
  46. LLaMA: Open and efficient foundation language models, 2023a.
  47. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  48. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
  49. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903, 2022. URL https://arxiv.org/abs/2201.11903.
  50. Ethical and social risks of harm from language models. CoRR, abs/2112.04359, 2021. URL https://arxiv.org/abs/2112.04359.
  51. R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8, 1992.
  52. XLA. Xla: Optimizing compiler for tensorflow, 2019. URL https://www.tensorflow.org/xla.
  53. GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.
  54. B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, abs/1910.07467, 2019. URL http://arxiv.org/abs/1910.07467.
  55. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
  56. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.
  57. Representation engineering: A top-down approach to ai transparency, 2023.

Summary

  • The paper introduces the Gemma family of open-source 2B and 7B language models built on Gemini technology, demonstrating strong NLP capabilities.
  • The methodology includes a transformer decoder enhanced by multi-query attention, RoPE embeddings, and GeGLU activations, trained on trillions of tokens.
  • Evaluation results show Gemma 7B achieving superior scores on MMLU, GSM8K, and safety benchmarks while maintaining low memorization rates.

Gemma: An Open Model Based on Gemini Technology

The paper "Gemma: Open Models Based on Gemini Research and Technology" (2403.08295) introduces Gemma, a family of open LLMs derived from the same research and technology used to develop the Gemini models. The Gemma family comes in two sizes, 2B and 7B parameters, each released as both pretrained and fine-tuned checkpoints. These models perform strongly across benchmarks for language understanding, reasoning, and safety.

Model Architecture and Training

The Gemma models are based on the transformer decoder architecture [DBLP:journals/corr/VaswaniSPUJGKP17] and incorporate several improvements, including multi-query attention (MQA) [mqa] in the 2B model, RoPE embeddings [rope], GeGLU activations [geglu], and RMSNorm [rmsnorm]. The models were trained on 3T (2B) and 6T (7B) tokens of primarily English data, using TPUv5e infrastructure. The training process leverages JAX [bradburyJAX] and Pathways [barham2022pathways] for distributed training and optimization.
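The architectural ingredients above can be illustrated with a minimal NumPy sketch of one pre-norm decoder block combining RMSNorm, rotary embeddings, multi-query attention, and a GeGLU feed-forward layer. This is an illustrative toy, not the released implementation: the dimensions, parameter names, and initialization are hypothetical, and the real models are written in JAX at far larger scale.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # RMSNorm: rescale by root mean square, with no mean centering (Zhang & Sennrich, 2019)
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def geglu(x, w_gate, w_up, w_down):
    # GeGLU feed-forward: GELU-gated linear unit (Shazeer, 2020)
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

def rope(x, base=10000.0):
    # Rotary position embeddings (Su et al., 2021): rotate channel pairs by
    # position-dependent angles so attention depends on relative offsets
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), freqs)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

def multi_query_attention(x, p):
    # MQA (Shazeer, 2019): all query heads share a single key/value head,
    # shrinking the KV cache versus standard multi-head attention
    seq, d = x.shape
    n_heads = p["heads"]
    hd = d // n_heads
    q = (x @ p["wq"]).reshape(seq, n_heads, hd)
    k = rope(x @ p["wk"])                            # single shared key head
    v = x @ p["wv"]                                  # single shared value head
    mask = np.triu(np.full((seq, seq), -1e9), k=1)   # causal mask
    out = np.empty_like(q)
    for h in range(n_heads):
        scores = rope(q[:, h]) @ k.T / np.sqrt(hd) + mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        out[:, h] = (w / w.sum(-1, keepdims=True)) @ v
    return out.reshape(seq, d) @ p["wo"]

def decoder_block(x, p):
    # Pre-norm residual block: attention sublayer, then GeGLU MLP sublayer
    x = x + multi_query_attention(rms_norm(x, p["g1"]), p)
    return x + geglu(rms_norm(x, p["g2"]), p["wg"], p["wu"], p["wd"])

# Toy dimensions (the real 2B/7B configurations are far larger)
rng = np.random.default_rng(0)
d, seq, heads, ff = 16, 8, 4, 32
hd = d // heads
params = {"heads": heads, "g1": np.ones(d), "g2": np.ones(d),
          "wq": 0.1 * rng.normal(size=(d, d)), "wk": 0.1 * rng.normal(size=(d, hd)),
          "wv": 0.1 * rng.normal(size=(d, hd)), "wo": 0.1 * rng.normal(size=(d, d)),
          "wg": 0.1 * rng.normal(size=(d, ff)), "wu": 0.1 * rng.normal(size=(d, ff)),
          "wd": 0.1 * rng.normal(size=(ff, d))}
y = decoder_block(rng.normal(size=(seq, d)), params)
```

Note that the 7B model uses standard multi-head attention; the shared-K/V variant shown here corresponds to the 2B configuration.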

Pretraining and Instruction Tuning

The pretraining data was filtered to remove unwanted or unsafe utterances and to minimize the risk of memorization. Instruction tuning was performed using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) [christiano2017deep, ouyang2022training]. The SFT data mixtures were selected based on LM-based side-by-side evaluations [zheng2023judging], and the RLHF reward model was trained on English-only preference data.
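The RLHF stage depends on a reward model fit to pairwise preferences. Such models are commonly trained with a Bradley-Terry objective (reference 8 above); the paper does not spell out its exact loss, so the sketch below shows the standard pairwise formulation with hypothetical scalar reward scores.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log P(chosen preferred over rejected),
    # where the preference probability is sigmoid of the reward gap
    return -math.log(sigmoid(r_chosen - r_rejected))

# Hypothetical rewards for a preferred/dispreferred response pair
well_separated = preference_loss(2.0, -1.0)  # reward model agrees with the label
inverted = preference_loss(-1.0, 2.0)        # reward model contradicts the label
```

Minimizing this loss pushes the reward of the human-preferred response above that of the rejected one; the trained reward model then scores candidate generations during the RL phase.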

Evaluation Results

The Gemma models were evaluated across a range of benchmarks, including automated benchmarks and human preference evaluations. The results indicate that Gemma models outperform similarly sized open models on a variety of tasks. For example, the Gemma 7B model achieves 64.3% on MMLU [mmlu], surpassing other open-source alternatives. In human evaluations, Gemma 7B IT demonstrates a 61.2% win rate against Mistral v0.2 7B Instruct on instruction following and a 63.5% win rate on safety.

Figure 1: Language understanding and generation performance of Gemma 7B across different capabilities compared to similarly sized open models, showcasing strong performance across a range of NLP tasks.

The models also exhibit strong performance on mathematics and coding benchmarks, outperforming other models by at least 10 points on GSM8K [gsm8k] and MATH [hendrycksmath2021].

Memorization Analysis

The paper includes an analysis of memorization in the Gemma pretrained models, using the methodology described in [anil2023palm]. The results show that Gemma models have low rates of memorization, comparable to PaLM [chowdhery2022palm] and PaLM 2 [anil2023palm]. Furthermore, the models do not appear to memorize sensitive data.

Figure 2: Comparing average memorization rates across model families, demonstrating that Gemma models have similarly low rates of memorization compared to PaLM and PaLM 2 models of comparable size.

Figure 3: Measuring personal and sensitive data memorization rates, showing no cases of memorized sensitive data.

Figure 4: Comparing exact and approximate memorization, finding that roughly 50% more data is approximately memorized compared to exact memorization.
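The exact-versus-approximate distinction can be made concrete: the model is prompted with a training-set prefix, and its continuation is compared to the true continuation either verbatim (exact) or within an edit-distance budget (approximate). The sketch below shows this shape of measurement; the 10% normalized-edit-distance threshold is an assumption for illustration, not necessarily the paper's exact setting.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance via a one-row dynamic program
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete from a
                                     dp[j - 1] + 1,       # insert into a
                                     prev + (ca != cb))   # substitute
    return dp[-1]

def memorization_rates(generated, reference, approx_threshold=0.1):
    # exact: generation identical to the training continuation
    # approximate: normalized edit distance within the threshold
    exact = approx = 0
    for gen, ref in zip(generated, reference):
        if gen == ref:
            exact += 1
        if edit_distance(gen, ref) / max(len(ref), 1) <= approx_threshold:
            approx += 1
    n = len(reference)
    return exact / n, approx / n

# Token sequences here are stand-ins; a real evaluation compares model
# generations against held-out slices of the training corpus
exact_rate, approx_rate = memorization_rates(
    generated=[["the", "cat", "sat"], ["the", "dog", "sat"]],
    reference=[["the", "cat", "sat"], ["the", "cat", "sat"]])
```

Because approximate memorization only requires the continuation to be close rather than identical, its rate is always at least the exact rate, consistent with Figure 4's finding that approximate memorization is roughly 50% more common.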

Responsible Deployment

The paper outlines a structured approach to responsible development and deployment of the Gemma models, including filtering and measuring biases in pre-training data, assessing safety through standardized AI safety benchmarks, and internal red teaming. The authors acknowledge the risks associated with malicious uses of LLMs but argue that the benefits of open access to these models outweigh the risks.

Conclusion

The Gemma models represent a significant advancement in open-source LLMs, offering strong performance, safety, and responsible development practices. The release of these models is intended to encourage further AI safety research, community innovation, and the development of beneficial applications in various domains. While limitations exist, the authors express confidence that Gemma models will provide a net benefit to the AI community.
