Small Molecule Optimization with Large Language Models

(arXiv:2407.18897)
Published Jul 26, 2024 in cs.LG, cs.NE, and q-bio.QM

Abstract

Recent advancements in LLMs have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.

Figure: Optimization process for the sitagliptin_mpo task using the Chemlactica-1.3B model with four different seeds.

Overview

  • The paper introduces two fine-tuned LLMs, Chemlactica and Chemma, designed for molecular optimization using a dataset from PubChem comprising 110 million molecules with computed properties.

  • A novel optimization algorithm combining genetic algorithms, rejection sampling, and prompt optimization is presented to efficiently generate and refine candidate molecules for drug discovery.

  • Benchmark results demonstrate the models' state-of-the-art performance in various molecular optimization tasks, including practical molecular optimization, multi-property optimization with docking, and QED maximization, showcasing their potential in computational drug discovery.

Small Molecule Optimization with LLMs

The paper "Small Molecule Optimization with LLMs," authored by Philipp Guevorguian and colleagues, introduces a novel approach to molecular optimization leveraging LLMs. The authors present two models, Chemlactica and Chemma, which have been fine-tuned on a comprehensive molecular dataset. The dataset itself is derived from PubChem and includes 110 million molecules with computed properties, totaling 40 billion tokens. The motivation behind this work is grounded in the need for efficient drug discovery methodologies that can navigate the vast chemical space with higher efficacy than traditional methods.

Model and Dataset

The core contributions of the paper include the development of two language models, Chemlactica (fine-tuned from Galactica) and Chemma (fine-tuned from Gemma), which demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. The models are trained on a dataset derived from PubChem that pairs SMILES representations with enriched computed molecular properties, allowing them to develop a nuanced understanding of molecular structures and their associated properties.
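
To make "computed properties" concrete, here is a minimal RDKit sketch of the kind of descriptors that can be attached to a corpus molecule; the exact property set used in the paper's corpus may differ, so treat this as illustrative:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def computed_properties(smiles: str) -> dict:
    """Illustrative descriptor set; the paper's corpus may use a
    different selection of computed properties."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return {
        "canonical_smiles": Chem.MolToSmiles(mol),
        "mol_weight": Descriptors.MolWt(mol),  # g/mol
        "clogp": Descriptors.MolLogP(mol),     # Crippen LogP
        "tpsa": Descriptors.TPSA(mol),         # topological polar surface area
        "qed": QED.qed(mol),                   # drug-likeness in [0, 1]
    }

# Example: computed_properties("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```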

Optimization Algorithm

A significant highlight of the paper is the introduction of a novel optimization algorithm that combines ideas from genetic algorithms, rejection sampling, and prompt optimization. This algorithm is designed to efficiently traverse the chemical space by leveraging LLMs to generate candidate molecules and optimize them for arbitrary properties. The optimization algorithm is tailored to work with a black-box oracle, making it versatile for various molecular design tasks.
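
In this setting, a black-box oracle is simply a function that maps a molecule to a scalar score. A minimal sketch, assuming a SMILES-to-score interface and an illustrative LogP-proximity reward of our own choosing (not one of the paper's benchmark oracles):

```python
from typing import Callable

from rdkit import Chem
from rdkit.Chem import Crippen

# The optimizer only ever calls the oracle; it never sees its internals.
Oracle = Callable[[str], float]

def logp_oracle(smiles: str, target: float = 2.5) -> float:
    """Illustrative oracle rewarding molecules whose LogP is near `target`."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # invalid SMILES get the lowest score
        return 0.0
    return max(0.0, 1.0 - abs(Crippen.MolLogP(mol) - target) / 10.0)
```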

Key steps in the optimization process, sketched in code after this list, include:

  1. Generating prompts for molecule generation using selected molecules from a pool.
  2. Using the language model to generate new molecule candidates.
  3. Evaluating these candidates with an oracle function and updating the pool with high-performing molecules.
  4. Fine-tuning the model on high-performing molecules when progress stagnates, to maintain diversity and improve candidate quality.
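
A minimal sketch of this loop, assuming a generic `lm_generate` callable for the language model and the oracle interface above (all names and simplifications are ours):

```python
import random

def optimize(lm_generate, oracle, seed_smiles, n_iters=100, pool_size=50):
    """Minimal sketch of the optimization loop; sampling, acceptance,
    and fine-tuning details from the paper are simplified away."""
    # Pool of (score, smiles) pairs, highest score first.
    pool = sorted(((oracle(s), s) for s in seed_smiles), reverse=True)

    for _ in range(n_iters):
        # Step 1: build a prompt from molecules sampled from the pool
        # (the paper enriches prompts with property tags; see below).
        parents = random.sample(pool, k=min(2, len(pool)))
        prompt = " ".join(smi for _, smi in parents)

        # Step 2: let the language model propose a candidate molecule.
        candidate = lm_generate(prompt)

        # Step 3: score it with the oracle; keep only the top pool_size.
        pool.append((oracle(candidate), candidate))
        pool = sorted(pool, reverse=True)[:pool_size]

        # Step 4 (fine-tuning on stagnation) is omitted in this sketch.

    return pool[0]  # best (score, smiles) found
```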

The authors demonstrate that integrating numerical property descriptions into the prompt during the molecule generation phase can significantly enhance the performance of the optimization algorithm.
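
As an illustration, a prompt builder that injects such numerical targets might look like the following; the tag names ([SIMILAR], [QED], [START_SMILES]) mirror the style of the paper's corpus but should be treated as assumptions rather than the exact format:

```python
def make_prompt(similar_smiles=None, qed_target=None):
    """Build a conditioned generation prompt. Tag names are assumptions
    about the corpus format, not confirmed syntax."""
    parts = []
    if similar_smiles is not None:
        # Condition on similarity to a known molecule (SMILES, target sim).
        parts.append(f"[SIMILAR]{similar_smiles} 0.90[/SIMILAR]")
    if qed_target is not None:
        # Numerical property target that steers generation.
        parts.append(f"[QED]{qed_target:.2f}[/QED]")
    parts.append("[START_SMILES]")  # the model completes the molecule
    return "".join(parts)

# make_prompt(similar_smiles="CCO", qed_target=0.88)
# -> "[SIMILAR]CCO 0.90[/SIMILAR][QED]0.88[/QED][START_SMILES]"
```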

Benchmark Results

The efficacy of the proposed models and optimization algorithm is validated on several benchmark tasks:

  1. Practical Molecular Optimization (PMO): The models achieve state-of-the-art performance, with Chemlactica-1.3B and Chemma-2B substantially outperforming existing methods on multiple tasks (an 8% aggregate improvement over previous methods). Notably, in tasks like sitagliptin_mpo, the models show significant improvement, highlighting their practical relevance in drug discovery scenarios.
  2. Multi-property Optimization with Docking: The models excel in drug discovery case studies by optimizing docking scores for specific protein targets such as DRD2, MK2-kinase, and acetylcholinesterase. The results illustrate that Chemma-2B, in particular, outperforms existing approaches in generating viable drug candidates, thus proving effective in complex molecular design tasks.
  3. QED Maximization with Similarity-Constrained Molecular Design: The Chemlactica-125M model achieves a high success rate in optimizing molecules for high QED while maintaining structural similarity to given reference molecules, demonstrating its robustness and efficiency; a constraint-checking sketch follows this list.
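
For the similarity-constrained QED task in item 3, success can be checked with RDKit as below; the thresholds (QED >= 0.9, Morgan-fingerprint Tanimoto similarity >= 0.4) follow a common formulation of this benchmark and are assumptions rather than values confirmed from the paper:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def passes_constraint(candidate: str, reference: str,
                      qed_min: float = 0.9, sim_min: float = 0.4) -> bool:
    """Check QED and similarity constraints; thresholds are illustrative."""
    cand = Chem.MolFromSmiles(candidate)
    ref = Chem.MolFromSmiles(reference)
    if cand is None or ref is None:
        return False
    fp_c = AllChem.GetMorganFingerprintAsBitVect(cand, 2, nBits=2048)
    fp_r = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
    similarity = DataStructs.TanimotoSimilarity(fp_c, fp_r)
    return QED.qed(cand) >= qed_min and similarity >= sim_min
```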

Model Calibration and Property Prediction

The paper also addresses the calibration of the models, showing that both Chemlactica and Chemma produce well-calibrated outputs across multiple computed properties. This calibration is crucial for accurate prediction and reliable conditional generation. Additionally, the models exhibit competitive performance on property prediction tasks such as ESOL, FreeSolv, and Lipophilicity, outperforming existing models like Chemformer and MolT5.
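
As an illustration of how prompt-based property prediction can work with such models, one can append a property tag to the molecule and parse the model's numeric completion; the tag format here is an assumption, not the corpus's confirmed syntax:

```python
import re

def predict_property(generate, smiles: str, prop_tag: str = "ESOL") -> float:
    """Query a causal LM for a property value and parse the number.
    `generate` is any fn(prompt) -> completion string; the tag names
    are assumptions about the prompt format."""
    prompt = f"[START_SMILES]{smiles}[END_SMILES][{prop_tag}]"
    completion = generate(prompt)  # e.g. "-2.31[/ESOL]..."
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        raise ValueError(f"no numeric value in completion: {completion!r}")
    return float(match.group(1))
```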

Practical Implications and Future Work

The implications of this research are substantial for the field of computational drug discovery. The proposed models and algorithms facilitate the efficient exploration and optimization of chemical space, potentially accelerating the drug development process. The adaptability of the models to fine-tuning for various properties suggests that these tools could be seamlessly integrated into existing drug discovery pipelines.

Future work may explore expanding the models' capabilities to include 3D molecular structures and interactions with biological entities such as proteins, enhancing their applicability in more complex biochemical environments. Additionally, further refinement of the optimization algorithm and exploration of its integration with other computational methods could yield even more robust performance.

Conclusion

The authors provide a comprehensive framework for small molecule optimization using advanced LLMs. The results presented in the paper underscore the potential of the Chemlactica and Chemma models to meaningfully advance molecular drug design. By open-sourcing their training corpus, models, and optimization algorithm, the authors also pave the way for future collaborations and advancements in computational chemistry.
