Improving Domain Adaptation through Extended-Text Reading Comprehension (2401.07284v2)
Abstract: To enhance the domain-specific capabilities of LLMs, continued pre-training on a domain-specific corpus is a prevalent method. Recent work demonstrates that adapting models using reading comprehension data formatted by regex-based patterns can significantly improve performance on domain-specific tasks. However, regex-based patterns cannot parse raw corpora using domain-specific knowledge. Furthermore, question-answer pairs extracted directly from the corpus in predefined formats offer limited context. To address these limitations, we improve reading comprehension via an LLM and clustering: the LLM leverages domain knowledge within the corpus to refine the comprehension stage, while clustering supplies relevant knowledge by extending the context to enrich the reading stage. Additionally, our method incorporates parameter-efficient fine-tuning to improve the efficiency of domain adaptation. Compared to AdaptLLM, our method achieves an improvement exceeding 5% on domain-specific tasks. Our code will be available at https://github.com/microsoft/LMOps.
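The abstract names three moving parts: an LLM that turns raw domain text into comprehension tasks, clustering that extends each passage's context with related corpus documents, and parameter-efficient fine-tuning. Below is a minimal sketch of the latter two steps only; the embedding model, cluster count, same-cluster concatenation strategy, and LoRA hyperparameters are all illustrative assumptions, not details specified by the paper.

```python
# Hedged sketch of clustering-based context extension plus a LoRA config.
# All model choices and hyperparameters here are assumptions for illustration.
from collections import defaultdict

from peft import LoraConfig
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Stand-in for the raw domain corpus (one string per document).
docs = [
    "The central bank raised interest rates by 25 basis points.",
    "Quarterly earnings beat analyst expectations on strong revenue.",
    "The patient presented with elevated blood pressure and fatigue.",
    "A randomized trial compared two antihypertensive treatments.",
]

# Embed documents so semantically related passages land near each other.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs)

# Group documents by topic; the cluster count is a tunable assumption.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

groups = defaultdict(list)
for doc, label in zip(docs, labels):
    groups[label].append(doc)

# Extend each reading passage with its same-cluster neighbors, yielding
# the longer contexts used for continued pre-training.
extended_passages = ["\n\n".join(group) for group in groups.values()]

# Parameter-efficient fine-tuning via LoRA (hyperparameters are assumptions).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```

In practice, the extended passages would be paired with LLM-generated comprehension tasks before continued pre-training with the LoRA adapter attached; the sketch above only illustrates how clustering can supply the surrounding context.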
- DISC-FinLLM: A Chinese financial large language model based on multiple experts fine-tuning. arXiv preprint arXiv:2310.15205.
- ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849.
- Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530.
- Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.
- Franck Dernoncourt and Ji Young Lee. 2017. PubMed 200k RCT: A dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
- MedAlpaca: An open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577.
- The inductive bias of in-context learning: Rethinking pretraining example design. arXiv preprint arXiv:2110.04541.
- ChipNeMo: Domain-adapted LLMs for chip design. arXiv preprint arXiv:2311.00176.
- FinGPT: Democratizing internet-scale data for financial large language models. arXiv preprint arXiv:2307.10485.
- Kai Lu. 2023. Can ChatGPT help college instructors generate high-quality quiz questions? Human Interaction and Emerging Technologies (IHIET-AI 2023): Artificial Intelligence and Future Applications, 70(70).
- WWW'18 open challenge: Financial opinion mining and question answering. In Companion Proceedings of The Web Conference 2018, pages 1941–1942.
- Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4):782–796.
- Effective transfer learning for identifying similar questions: Matching user questions to COVID-19 FAQs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3458–3465.
- Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
- MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- Andrew M. Olney. 2023. Generating multiple choice questions from a textbook: LLMs match human performance on most metrics. In AIED Workshops.
- Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 84–90, Parramatta, Australia.
- In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638.
- Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2, pages 589–601. Springer.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554.
- LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206.