LLaMA Pro: Progressive LLaMA with Block Expansion (2401.02415v2)

Published 4 Jan 2024 in cs.CL

Abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for LLMs, e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

Summary

  • The paper introduces block expansion, a method that preserves pretrained knowledge while integrating new domain-specific skills.
  • It copies existing Transformer blocks, fine-tunes only the copies on code and math data, and keeps the original blocks frozen to avoid catastrophic forgetting.
  • It reports strong performance among open LLaMA-family models on benchmarks such as HumanEval and GSM8K, showing the approach's applicability to specialized tasks.

Introduction to LLaMA Pro

The development of LLMs has been marked by increasingly impressive performance across a wide range of tasks, yet these models struggle to acquire new domain-specific skills without losing their existing general abilities. This phenomenon, known as catastrophic forgetting, is a significant barrier when fine-tuning LLMs for domains such as programming and mathematics. The paper introduces a method called block expansion, aimed at preserving and augmenting the capabilities of LLMs by adding new Transformer blocks while retaining the existing knowledge base. The resulting model, LLaMA Pro-8.3B, performs well across varied benchmarks when compared with other models in the LLaMA series.

Methodology

Block expansion operates during a post-pretraining phase: copies of existing Transformer blocks are interleaved into a pretrained LLM, with each copy initialized so that it acts as an identity mapping. LLaMA2-7B serves as the base model. The newly added blocks are then tuned on a domain-specific corpus while the inherited blocks remain frozen, preserving the model's original capabilities. To produce LLaMA Pro, the expanded blocks are trained on datasets concentrated on code and mathematical content. The authors also release LLaMA Pro-Instruct, a variant that undergoes instruction-following fine-tuning to improve its ability to understand and execute user instructions. A minimal sketch of the expansion procedure is given below.
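The copied blocks can behave as identity functions at initialization if their output projections are zero-initialized, so the residual stream passes hidden states through unchanged until the new blocks are trained. The following is a minimal sketch of this idea, not the authors' released code: it assumes the Hugging Face transformers LLaMA implementation and an illustrative configuration of 8 groups with one copied block appended per group (32 → 40 layers); the exact expansion ratio and training setup are assumptions, not taken from the summary above.

```python
# Sketch of block expansion on a Hugging Face LLaMA checkpoint (assumed setup).
import copy
import torch
from transformers import LlamaForCausalLM

def expand_blocks(model, num_groups=8):
    """Interleave identity-initialized copies of decoder blocks into a LLaMA model."""
    old_layers = list(model.model.layers)
    group_size = len(old_layers) // num_groups   # e.g. 32 layers -> groups of 4
    new_layers = []
    for g in range(num_groups):
        group = old_layers[g * group_size:(g + 1) * group_size]
        new_layers.extend(group)
        # Copy the last block of the group; zeroing its output projections makes
        # the copy an identity mapping, because the residual connections then
        # pass the hidden states through unchanged.
        block = copy.deepcopy(group[-1])
        torch.nn.init.zeros_(block.self_attn.o_proj.weight)
        torch.nn.init.zeros_(block.mlp.down_proj.weight)
        new_layers.append(block)
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)

    # Freeze the inherited blocks and train only the newly added ones
    # (every (group_size + 1)-th layer) on the new domain corpus.
    for p in model.parameters():
        p.requires_grad = False
    for i, layer in enumerate(model.model.layers):
        if (i + 1) % (group_size + 1) == 0:
            for p in layer.parameters():
                p.requires_grad = True
    return model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = expand_blocks(model, num_groups=8)   # 32 -> 40 layers
```

Because only the newly inserted blocks receive gradients, training on the code-and-math corpus cannot overwrite the weights that encode the base model's general knowledge, which is the mechanism the paper credits for avoiding catastrophic forgetting.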

Performance and Evaluation

LLaMA Pro's performance is evaluated on a broad set of tasks, where it compares favorably against other models in the LLaMA family. This is particularly evident on programming benchmarks such as HumanEval and math-focused tasks such as GSM8K. The model is also tested in agent-style scenarios, including tool use and responding to human feedback. In addition, LLaMA Pro is compared with other LLMs under an LLM-as-a-judge evaluation, confirming its strong overall performance and adaptability.

Conclusion and Future Directions

The results underline the effectiveness of block expansion as a post-pretraining method for extending the skill set of LLMs without catastrophic forgetting. LLaMA Pro performs well on both general language tasks and specialized domains such as programming and mathematics. The authors point to future work on adapting the method to other areas, including multimodal applications, and emphasize the importance of balancing domain-specific learning with the retention of general competencies in LLMs.
