Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs (2312.05934v3)
Abstract: LLMs encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their ability to answer diverse questions across different domains. However, this knowledge is inherently limited, depending heavily on the characteristics of the training data. Consequently, using external datasets to incorporate new information, or to refine the capabilities of LLMs on previously seen information, poses a significant challenge. In this study, we compare two common approaches: unsupervised fine-tuning and retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for knowledge encountered during training and for entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training may alleviate this problem.
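The RAG setup compared in the abstract can be summarized as: retrieve the passages most relevant to a query from an external corpus, then prepend them to the prompt before generation. The sketch below illustrates this with a simple bag-of-words similarity; the corpus, scoring function, and prompt template are illustrative assumptions, not the paper's actual pipeline (which uses dense embeddings and nearest-neighbour search).

```python
# Minimal RAG sketch: rank corpus passages by similarity to the query,
# then build a prompt with the retrieved context prepended.
# Bag-of-words cosine similarity stands in for a dense embedding model.
from collections import Counter
import math


def bow_vector(text):
    """Term-frequency vector over whitespace tokens (toy embedder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=1):
    """Return the top-k passages ranked by similarity to the query."""
    q = bow_vector(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, bow_vector(p)),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, corpus, k=1):
    """Prepend retrieved context to the question, as in a basic RAG prompt."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In the paper's framing, the fine-tuning baseline instead updates model weights on the corpus directly, with no retrieval step at inference time.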