BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models (2403.18365v1)
Abstract: LLMs such as ChatGPT and GPT-4 are versatile and capable of addressing a diverse range of tasks. However, general LLMs, which are developed on open-domain data, may lack the domain-specific knowledge essential for tasks in vertical domains such as law and medicine. To address this issue, previous approaches either conduct continual pre-training with domain-specific data or employ retrieval augmentation to support general LLMs. Unfortunately, these strategies are either cost-intensive or unreliable in practical applications. To this end, we present a novel framework named BLADE, which enhances Black-box LLMs with small Domain-spEcific models. BLADE consists of a black-box LLM and a small domain-specific LM. The small LM preserves domain-specific knowledge and offers specialized insights, while the general LLM contributes robust language comprehension and reasoning capabilities. Specifically, our method involves three steps: 1) pre-training the small LM on domain-specific data, 2) fine-tuning this model using knowledge instruction data, and 3) jointly optimizing the general LLM and the small LM via Bayesian optimization. Extensive experiments on public legal and medical benchmarks show that BLADE significantly outperforms existing approaches, demonstrating its potential as an effective and cost-efficient solution for adapting general LLMs to vertical domains.
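The abstract outlines a pipeline in which a small domain-specific LM supplies specialized knowledge that a black-box general LLM then reasons over. The sketch below illustrates that inference-time composition only; it is a minimal illustration assuming a plain text-in/text-out interface, and every name in it (SmallDomainLM, BlackBoxLLM, blade_answer) is a hypothetical placeholder rather than the authors' code or any real API.

```python
# Minimal sketch of the inference-time composition described in the abstract:
# a small domain-specific LM produces specialized knowledge, and a black-box
# general LLM answers conditioned on it. All names are hypothetical placeholders.


class SmallDomainLM:
    """Stands in for the small LM that was pre-trained on domain-specific data
    and fine-tuned on knowledge instruction data (steps 1 and 2 of the abstract)."""

    def generate_knowledge(self, question: str) -> str:
        # A real model would generate domain-specific background for the question;
        # a placeholder string is returned so the sketch runs end to end.
        return f"[domain knowledge relevant to: {question}]"


class BlackBoxLLM:
    """Stands in for the general black-box LLM (e.g. ChatGPT or GPT-4),
    reachable only through a text-in/text-out API."""

    def complete(self, prompt: str) -> str:
        # A real call would go through the provider's API; placeholder only.
        return f"[answer produced from a prompt of {len(prompt)} characters]"


def blade_answer(question: str, small_lm: SmallDomainLM, general_llm: BlackBoxLLM) -> str:
    """Compose the two models: knowledge generation, then answer generation."""
    knowledge = small_lm.generate_knowledge(question)
    prompt = (
        "Use the following domain-specific knowledge to answer the question.\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return general_llm.complete(prompt)


if __name__ == "__main__":
    print(blade_answer("What is the statute of limitations for contract disputes?",
                       SmallDomainLM(), BlackBoxLLM()))
```

Step 3 of the abstract, the joint Bayesian optimization, would tune how the small LM's knowledge is generated and injected; it is needed because gradients cannot flow through the black-box LLM, and that offline loop is deliberately omitted from this sketch.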