DrugAssist: A Large Language Model for Molecule Optimization (2401.10334v1)
Abstract: Recently, the impressive performance of LLMs on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process actually requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model that performs optimization through human-machine dialogue, leveraging the strong interactivity and generalizability of LLMs. DrugAssist achieves leading results in both single- and multi-property optimization, while also showing strong potential for transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning LLMs on molecule optimization tasks. We have made our code and data publicly available at https://github.com/blazerye/DrugAssist, which we hope will pave the way for future research on applying LLMs to drug discovery.
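The iterate-and-feed-back pattern the abstract describes is easier to see in code. The sketch below is a hypothetical illustration, not the released DrugAssist implementation: the `ask_drugassist` stub, the logP target, and the prompt wording are all assumptions standing in for a call to the fine-tuned model, while property scoring uses RDKit, which is a real dependency.

```python
# Minimal sketch (not the authors' code) of an interactive optimization loop:
# propose a modified molecule, score it, and feed the result back to the model
# as dialogue. `ask_drugassist` is a hypothetical stand-in for the LLM call.
from rdkit import Chem
from rdkit.Chem import Descriptors


def ask_drugassist(prompt: str) -> str:
    """Hypothetical placeholder: query the fine-tuned LLM and return a SMILES."""
    raise NotImplementedError("plug in the fine-tuned model here")


def logp(smiles: str) -> float | None:
    """Return Crippen logP for a valid SMILES, or None if it does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Descriptors.MolLogP(mol) if mol is not None else None


def optimize(seed_smiles: str, target_logp: float, max_rounds: int = 5) -> str:
    """Iteratively ask the model for modifications until the target is met."""
    prompt = (
        f"Modify the molecule {seed_smiles} so that its logP is at least "
        f"{target_logp}, changing the structure as little as possible."
    )
    best = seed_smiles
    for _ in range(max_rounds):
        candidate = ask_drugassist(prompt)
        score = logp(candidate)
        if score is None:
            # Expert-style feedback for an invalid molecule.
            prompt = f"{candidate} is not a valid SMILES string. Please try again."
            continue
        if score >= target_logp:
            return candidate
        best = candidate
        # Feed the measured property back into the dialogue for the next round.
        prompt = (
            f"Your molecule {candidate} has logP {score:.2f}, still below "
            f"{target_logp}. Please increase it further."
        )
    return best
```

In practice the feedback messages would come from a human expert or an automated evaluator, which is the "human-machine dialogue" the abstract emphasizes; this loop only shows where such feedback enters the conversation.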