LESS: Selecting Influential Data for Targeted Instruction Tuning (2402.04333v3)
Abstract: Instruction tuning has unlocked powerful capabilities in LLMs, effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features, and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
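To make the pipeline described above concrete, the sketch below illustrates the two stages at the level the abstract describes: compressing per-example gradients into low-dimensional features (here via a random projection), then ranking training examples by cosine similarity to the gradient features of the few-shot target examples. The function names, projection dimension, and max-aggregation over target examples are illustrative assumptions, and the sketch omits the paper's Adam-aware influence correction and any per-checkpoint averaging.

```python
# Minimal sketch of a LESS-style selection loop, assuming per-example gradients
# are already available as flat vectors. Names and defaults are hypothetical.
import numpy as np


def project_gradients(grads: np.ndarray, proj_dim: int = 8192, seed: int = 0) -> np.ndarray:
    """Compress per-example gradients with a random (Johnson-Lindenstrauss style) projection."""
    rng = np.random.default_rng(seed)
    d = grads.shape[1]
    proj = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    return grads @ proj  # (n_examples, proj_dim) low-dimensional gradient features


def select_top_k(train_features: np.ndarray, target_features: np.ndarray, k: int) -> np.ndarray:
    """Score each training example by cosine similarity to the few-shot target gradients."""
    def normalize(x: np.ndarray) -> np.ndarray:
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    # (n_train, n_target) similarity matrix between training and target gradient features
    scores = normalize(train_features) @ normalize(target_features).T
    agg = scores.max(axis=1)  # aggregate over target examples; max is one reasonable choice
    return np.argsort(-agg)[:k]  # indices of the k highest-scoring training examples
```

In this reading, the projected training-set features play the role of the reusable gradient datastore: they are computed once and can be queried repeatedly with gradient features from different target tasks.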