How Important is Domain Specificity in Language Models and Instruction Finetuning for Biomedical Relation Extraction? (2402.13470v1)
Abstract: Cutting-edge techniques developed in the general NLP domain are often subsequently applied to the high-value, data-rich biomedical domain. The past few years have seen generative language models (LMs), instruction finetuning, and few-shot learning become foci of NLP research. As such, generative LMs pretrained on biomedical corpora have proliferated, and biomedical instruction finetuning has been attempted as well, all in the hope that domain specificity improves performance on downstream tasks. Given the nontrivial effort required to train such models, we investigate what benefits, if any, they confer on the key biomedical NLP task of relation extraction. Specifically, we address two questions: (1) Do LMs trained on biomedical corpora outperform those trained on general-domain corpora? (2) Do models instruction-finetuned on biomedical datasets outperform those finetuned on assorted datasets or those that are simply pretrained? We tackle these questions using existing LMs, testing across four datasets. In a surprising result, general-domain models typically outperformed biomedical-domain models. However, biomedical instruction finetuning improved performance to a similar degree as general instruction finetuning, despite using orders of magnitude fewer instructions. Our findings suggest it may be more fruitful to focus research effort on larger-scale biomedical instruction finetuning of general LMs rather than on building domain-specific biomedical LMs.
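To make the instruction-finetuning setup concrete, the sketch below shows one way a biomedical relation extraction instance (e.g., a drug-drug interaction) could be cast as an (instruction, target) pair for a generative LM. The field names, prompt wording, and linearized output format are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch: turning a relation extraction example into an
# (instruction, target) pair for instruction finetuning a generative LM.
# The prompt text and "head | relation | tail" output format are assumed
# for illustration only.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class REExample:
    text: str                               # input passage
    relations: List[Tuple[str, str, str]]   # (head entity, relation, tail entity)


def to_instruction_pair(example: REExample) -> Tuple[str, str]:
    """Linearize one example into an instruction prompt and a target string."""
    instruction = (
        "Extract all drug-drug interaction relations from the passage. "
        "List each as: head entity | relation | tail entity.\n\n"
        f"Passage: {example.text}\nRelations:"
    )
    target = "\n".join(f"{h} | {r} | {t}" for h, r, t in example.relations)
    return instruction, target


if __name__ == "__main__":
    ex = REExample(
        text="Concomitant use of warfarin and aspirin increases bleeding risk.",
        relations=[("warfarin", "interacts_with", "aspirin")],
    )
    prompt, target = to_instruction_pair(ex)
    print(prompt)
    print(target)  # warfarin | interacts_with | aspirin
```

A generative LM would be finetuned to produce the target string given the prompt, and predictions would be parsed back into triples for evaluation against gold relations.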
Authors: Aviv Brokman, Ramakanth Kavuluru