LLaMA Pro: Progressive LLaMA with Block Expansion (2401.02415v2)
Abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for LLMs, e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs based on an expansion of Transformer blocks. We tune the expanded blocks using only the new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on corpora of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B that excels in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance across various benchmarks, demonstrating superiority over existing open models in the LLaMA family and immense potential for reasoning and for addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.
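The core idea of block expansion is to interleave newly added blocks among the pretrained ones, initialize the new blocks so that they compute the identity (leaving the expanded model's outputs unchanged at initialization), and then train only the new blocks on the new corpus. The following is a minimal, framework-free sketch of that idea, assuming a toy model where each block is a residual map `x -> x + g(x)`; the function names (`make_block`, `identity_block`, `expand`) are illustrative, not from the paper's code.

```python
def make_block(scale):
    """A toy pretrained residual block: x -> x + scale * x."""
    return lambda x: x + scale * x

def identity_block():
    """An expanded block whose residual branch is zero-initialized,
    so at initialization it is exactly the identity map."""
    return lambda x: x + 0.0 * x

def expand(blocks, interval=1):
    """Insert one identity-initialized block after every `interval`
    original blocks; only these inserted blocks would be trained."""
    expanded = []
    for i, block in enumerate(blocks, start=1):
        expanded.append(block)
        if i % interval == 0:
            expanded.append(identity_block())
    return expanded

def forward(blocks, x):
    """Run the input through the block stack in order."""
    for block in blocks:
        x = block(x)
    return x
```

Because each inserted block starts as the identity, the expanded stack reproduces the original model's outputs exactly before any tuning, which is what protects the old capabilities while the new blocks absorb the new corpus.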