LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (2403.15042v2)

Published 22 Mar 2024 in cs.CL

Abstract: Pretrained LLMs are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at https://github.com/SqueezeAILab/LLM2LLM .


Summary

  • The paper introduces an iterative teacher-student paradigm to synthesize training examples addressing model errors.
  • The method achieves performance gains of up to 24.2% on GSM8K and 32.6% on CaseHOLD over regular fine-tuning.
  • The approach minimizes reliance on manual annotations and outperforms existing techniques like EDA and AugGPT.

Enhancing LLMs with LLM2LLM

The paper "LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement" proposes a novel iterative data augmentation technique designed to enhance the performance of LLMs in low-data regimes. By leveraging the strengths of both teacher and student LLMs, this approach addresses a critical bottleneck in fine-tuning LLMs for specific tasks where limited annotated data is available.

Methodology

LLM2LLM utilizes a teacher-student model paradigm to iteratively generate synthetic data. The process begins by fine-tuning a smaller LLM, termed the student model, on an initial seed dataset. Upon completion, the student model is evaluated on this seed dataset, and incorrect predictions are identified. The teacher LLM is then employed to generate new examples similar to these incorrect instances, which are integrated back into the training dataset (Figure 1).

Figure 1: LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement. One iteration of LLM2LLM begins with training and evaluating the model on the training data. Incorrect answers from the training data are passed to the teacher model, which generates extra samples in a similar style.

This iterative process of training, evaluating, and augmenting continues, resulting in a progressively refined dataset tailored to the student's deficiencies. Such a strategy is not only effective but also scalable, minimizing the dependency on extensive human-annotated datasets.
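
To make the loop concrete, here is a minimal Python sketch of how one such iteration could be wired together. It is not the authors' implementation: `finetune`, `teacher_generate`, and `is_correct` are placeholder callables standing in for the task-specific pieces (e.g., fine-tuning the Llama-2-7B student and prompting a stronger teacher model), and the stopping criterion is simplified to a fixed number of iterations.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, reference answer)

def llm2llm_loop(
    seed_data: List[Example],
    finetune: Callable[[List[Example]], Callable[[str], str]],  # returns a student predictor
    teacher_generate: Callable[[Example], List[Example]],       # teacher writes similar examples
    is_correct: Callable[[str, str], bool],                     # task-specific answer checker
    n_iterations: int = 3,
) -> List[Example]:
    """Iteratively grow `seed_data` with teacher-generated examples that target
    the student's mistakes; returns the final augmented training set."""
    train_data = list(seed_data)
    for _ in range(n_iterations):
        # (1) Fine-tune the student on the current training data.
        student = finetune(train_data)

        # (2) Evaluate the student on the seed examples and keep its failures.
        wrong = [(q, a) for (q, a) in seed_data if not is_correct(student(q), a)]
        if not wrong:
            break  # the student already answers every seed example correctly

        # (3) Ask the teacher for new examples in the style of each failure and
        #     fold them back into the training set for the next iteration.
        for example in wrong:
            train_data.extend(teacher_generate(example))
    return train_data
```

Following the description above, the student is evaluated on the seed examples and only those failures are sent to the teacher, so each iteration concentrates the synthetic data on what the student currently gets wrong.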

Experimental Results

The efficacy of LLM2LLM was evaluated across diverse datasets, including GSM8K and CaseHOLD. Significant improvements were reported, with increases of up to 24.2% on GSM8K and 32.6% on CaseHOLD over baseline fine-tuning approaches. These results underscore the framework's capability to operate effectively in data-constrained environments and efficiently scale task-specific datasets with synthesized data (Figure 2).

Figure 2: LLM2LLM on GSM8K (left) and CaseHOLD (right) with various seed data sizes. Each line shows the test accuracy of the fine-tuned LLaMA-2-7B model at each LLM2LLM step for a given seed dataset size.

Comparison with Baselines

LLM2LLM was compared against several baseline augmentation methods, including EDA and AugGPT. The framework consistently outperformed these baselines, demonstrating its ability to generate more relevant and challenging examples for the student model, ultimately leading to superior performance.
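
For context, EDA-style baselines augment by applying simple, untargeted token-level edits to existing examples rather than generating new examples conditioned on the student's failures. Below is a minimal sketch of two of EDA's four operations (random swap and random deletion); it is illustrative only and omits synonym replacement and random insertion, which typically require a thesaurus.

```python
import random

def eda_swap(tokens: list, n_swaps: int = 1) -> list:
    """EDA random swap: exchange the positions of two randomly chosen tokens."""
    out = tokens[:]
    for _ in range(n_swaps):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def eda_delete(tokens: list, p: float = 0.1) -> list:
    """EDA random deletion: drop each token independently with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]  # never return an empty sentence

# Example: an untargeted perturbation of a SNIPS-style training utterance.
sentence = "book a table for two at an italian restaurant".split()
print(" ".join(eda_swap(sentence)))
print(" ".join(eda_delete(sentence)))
```

Because such edits ignore which examples the student actually gets wrong, they add surface diversity but no targeted signal, which is the gap LLM2LLM's teacher-driven generation is designed to fill.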

Teacher Model Selection

Ablation studies reveal that the choice of teacher model significantly impacts the quality of augmentation. Clear differences in performance were observed across teacher LLMs such as GPT-3.5, GPT-4-Turbo, and LLaMA2-70B; for example, using GPT-4-Turbo as the teacher yielded markedly higher accuracy, attributed to its stronger reasoning capabilities.
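
In the loop sketched earlier, swapping teachers amounts to swapping the `teacher_generate` callable. The snippet below is one hypothetical way to build that callable for a given teacher; the prompt wording, the `Q ||| A` output format, and the `complete(model_name, prompt)` client function are assumptions for illustration, not details from the paper.

```python
def make_teacher_generate(complete, model_name: str, n_new: int = 1):
    """Build a `teacher_generate` callable (as used in the loop above) for a
    given teacher model. `complete(model_name, prompt) -> str` is a placeholder
    for whatever API client is in use (a hosted endpoint, a local server, ...)."""
    def teacher_generate(example):
        question, answer = example
        prompt = (
            "The student model answered the following question incorrectly.\n"
            f"Question: {question}\n"
            f"Reference answer: {answer}\n"
            f"Write {n_new} new question(s) of similar style and difficulty, "
            "each followed by its answer, one per line in the form 'Q ||| A'."
        )
        raw = complete(model_name, prompt)
        pairs = [line.split(" ||| ", 1) for line in raw.splitlines() if " ||| " in line]
        return [(q.strip(), a.strip()) for q, a in pairs]
    return teacher_generate
```

Re-running the same loop with `model_name` set to, say, GPT-3.5 versus GPT-4-Turbo reproduces the kind of teacher-model ablation described above.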

Practical Implications and Future Directions

LLM2LLM provides a robust framework for enhancing LLMs in low-resource settings, reducing the need for labor-intensive data curation. Its iterative and targeted approach ensures that generated data addresses specific weaknesses of the student model, making it an invaluable tool for domain-specific LLM applications.

Future research could explore tuning hyperparameters within this framework or integrating it with complementary AI techniques, such as prompt tuning or few-shot learning. Additionally, examining the application of LLM2LLM to other modalities and tasks beyond NLP could further extend its utility.

Conclusion

LLM2LLM presents an effective strategy for augmenting training datasets in data-scarce scenarios, using iterative LLM-based data generation. Its deployment can substantially enhance task-specific LLMs, accelerating their adaptation and performance in specialized domains. Such advancements pave the way for more accessible and efficient AI solutions across various applications.
