Matching domain experts by training from scratch on domain knowledge (2405.09395v2)
Abstract: Recently, LLMs have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, rather than emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model via next-word prediction on 1.3 billion tokens of domain-specific text. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, these small models achieved expert-level performance in predicting neuroscience results. They succeeded whether they were trained from scratch on the neuroscience literature with a tokenizer fit to neuroscience text or created by finetuning a pretrained GPT-2 on that same literature. Our results indicate that expert-level performance may be attained even by small LLMs through domain-specific, autoregressive training.
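The two training regimes the abstract describes can be outlined in code. The sketch below is not the authors' released code; it assumes the Hugging Face `datasets`, `tokenizers`, and `transformers` libraries, and the corpus file name, batch size, learning rate, and epoch count are illustrative placeholders rather than the paper's settings. It fits a byte-level BPE tokenizer to a neuroscience corpus and trains a 124M-parameter GPT-2 from scratch with the causal (next-word) objective; the pretrained-GPT-2 finetuning variant is indicated in a comment.

```python
# Minimal sketch of domain-specific autoregressive training, assuming the
# Hugging Face stack. "neuro_abstracts.txt" and all hyperparameters are
# hypothetical placeholders, not the paper's actual settings.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

CORPUS = "neuro_abstracts.txt"  # hypothetical neuroscience text corpus

# 1) Fit a byte-level BPE tokenizer to the domain text (the "neuroscience tokenizer").
bpe = ByteLevelBPETokenizer()
bpe.train(files=[CORPUS], vocab_size=50257, special_tokens=["<|endoftext|>"])
bpe.save_model("neuro_tokenizer")
tokenizer = GPT2TokenizerFast(vocab_file="neuro_tokenizer/vocab.json",
                              merges_file="neuro_tokenizer/merges.txt")
tokenizer.pad_token = tokenizer.eos_token

# 2) Tokenize the corpus for causal language modeling (next-word prediction).
dataset = load_dataset("text", data_files={"train": CORPUS})["train"]
dataset = dataset.filter(lambda ex: len(ex["text"]) > 0)  # drop blank lines
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                      batched=True, remove_columns=["text"])

# 3a) From scratch: GPT2Config defaults give the 124M-parameter GPT-2 architecture.
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size))
# 3b) Finetuning variant: start from the pretrained checkpoint instead.
# model = GPT2LMHeadModel.from_pretrained("gpt2")

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(output_dir="neuro-gpt2",
                         per_device_train_batch_size=8,  # assumed
                         num_train_epochs=1,             # assumed
                         learning_rate=6e-4,             # assumed
                         weight_decay=0.1)               # AdamW with decoupled weight decay
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```

In the finetuning variant one would typically keep GPT-2's original tokenizer rather than the domain-trained one, since the pretrained embeddings are tied to its vocabulary.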
- Gunasekar, S., Zhang, Y., Aneja, J., et al. Textbooks Are All You Need, 2023. arXiv:2306.11644. URL https://arxiv.org/abs/2306.11644.
- Hendrycks, D., Burns, C., Basart, S., et al. Measuring Massive Multitask Language Understanding, 2021. arXiv:2009.03300. URL https://arxiv.org/abs/2009.03300.
- Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering, 2019. arXiv:1909.06146. URL https://arxiv.org/abs/1909.06146.
- Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization, 2019. arXiv:1711.05101. URL https://arxiv.org/abs/1711.05101.
- Luo, X., et al. Large language models surpass human experts in predicting neuroscience results, 2024. arXiv:2403.03230. URL https://arxiv.org/abs/2403.03230.
- Pal, A., Umapathi, L. K., and Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering, 2022. arXiv:2203.14371. URL https://arxiv.org/abs/2203.14371.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners, 2019.
- Srivastava, A., et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022. arXiv:2206.04615. URL https://arxiv.org/abs/2206.04615.
- Strack, R. Visual proteomics. Nature Methods, 20(12):1868, December 2023. doi: 10.1038/s41592-023-02104-6. URL https://www.nature.com/articles/s41592-023-02104-6.
- exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies. PeerJ Computer Science, 10:e1888, February 2024. doi: 10.7717/peerj-cs.1888. URL https://peerj.com/articles/cs-1888.