LLaMA Pro: Progressive LLaMA with Block Expansion (2401.02415v2)

Published 4 Jan 2024 in cs.CL

Abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for LLMs, e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

Summary

  • The paper introduces block expansion, a method that preserves pretrained knowledge while integrating new domain-specific skills.
  • It copies existing Transformer blocks, fine-tunes only the copies on code and math data, and keeps the original blocks frozen to avoid catastrophic forgetting.
  • It reports strong performance among open LLaMA-family models on benchmarks such as HumanEval and GSM8K, showing the approach's applicability to specialized tasks.

Introduction to LLaMA Pro

The development of LLMs has been marked by increasingly impressive performance across a wide range of tasks, yet these models struggle to acquire new domain-specific skills without losing their existing general abilities. This phenomenon, known as catastrophic forgetting, is a significant barrier when fine-tuning LLMs for domains such as programming and mathematics. The paper introduces a method called block expansion, aimed at preserving and augmenting the capabilities of LLMs by adding new Transformer blocks while retaining the existing knowledge base. The resulting model, LLaMA Pro-8.3B, performs well across varied benchmarks when compared with other models in the LLaMA series.

Methodology

Block expansion operates during a post-pretraining phase: copies of existing Transformer blocks are interleaved into a pretrained LLM, with each copy initialized so that it acts as an identity mapping. LLaMA2-7B serves as the base model. The newly added blocks are then tuned on a domain-specific corpus while the inherited blocks remain frozen, preserving the model's original capabilities. To produce LLaMA Pro, the expanded blocks are trained on datasets concentrated on code and mathematical content. The authors also release LLaMA Pro-Instruct, a variant that undergoes instruction-following fine-tuning to improve its ability to understand and execute user instructions. A minimal sketch of the expansion procedure is given below.
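The copied blocks can behave as identity functions at initialization if their output projections are zero-initialized, so the residual stream passes hidden states through unchanged until the new blocks are trained. The following is a minimal sketch of this idea, not the authors' released code: it assumes the Hugging Face transformers LLaMA implementation and an illustrative configuration of 8 groups with one copied block appended per group (32 → 40 layers); the exact expansion ratio and training setup are assumptions, not taken from the summary above.

```python
# Sketch of block expansion on a Hugging Face LLaMA checkpoint (assumed setup).
import copy
import torch
from transformers import LlamaForCausalLM

def expand_blocks(model, num_groups=8):
    """Interleave identity-initialized copies of decoder blocks into a LLaMA model."""
    old_layers = list(model.model.layers)
    group_size = len(old_layers) // num_groups   # e.g. 32 layers -> groups of 4
    new_layers = []
    for g in range(num_groups):
        group = old_layers[g * group_size:(g + 1) * group_size]
        new_layers.extend(group)
        # Copy the last block of the group; zeroing its output projections makes
        # the copy an identity mapping, because the residual connections then
        # pass the hidden states through unchanged.
        block = copy.deepcopy(group[-1])
        torch.nn.init.zeros_(block.self_attn.o_proj.weight)
        torch.nn.init.zeros_(block.mlp.down_proj.weight)
        new_layers.append(block)
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)

    # Freeze the inherited blocks and train only the newly added ones
    # (every (group_size + 1)-th layer) on the new domain corpus.
    for p in model.parameters():
        p.requires_grad = False
    for i, layer in enumerate(model.model.layers):
        if (i + 1) % (group_size + 1) == 0:
            for p in layer.parameters():
                p.requires_grad = True
    return model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = expand_blocks(model, num_groups=8)   # 32 -> 40 layers
```

Because only the newly inserted blocks receive gradients, training on the code-and-math corpus cannot overwrite the weights that encode the base model's general knowledge, which is the mechanism the paper credits for avoiding catastrophic forgetting.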

Performance and Evaluation

LLaMA Pro's performance is evaluated on a broad set of tasks, where it compares favorably against other models in the LLaMA family. This is particularly evident on programming benchmarks such as HumanEval and math-focused tasks such as GSM8K. The model is also tested in agent-style scenarios, including tool use and responding to human feedback. In addition, LLaMA Pro is compared with other LLMs under an LLM-as-a-judge evaluation, confirming its strong overall performance and adaptability.

Conclusion and Future Directions

The results underline the effectiveness of block expansion as a post-pretraining method for extending the skill set of LLMs without catastrophic forgetting. LLaMA Pro performs well on both general language tasks and specialized domains such as programming and mathematics. The authors point to future work on adapting the method to other areas, including multimodal applications, and emphasize the importance of balancing domain-specific learning with the retention of general competencies in LLMs.
