Small Molecule Optimization with Large Language Models

(arXiv:2407.18897)
Published Jul 26, 2024 in cs.LG, cs.NE, and q-bio.QM

Abstract

Recent advancements in LLMs have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.

Figure: Optimization process for the sitagliptin_mpo task using the Chemlactica-1.3B model with four different seeds.

Overview

  • The paper introduces two fine-tuned LLMs, Chemlactica and Chemma, designed for molecular optimization using a dataset from PubChem comprising 110 million molecules with computed properties.

  • A novel optimization algorithm combining genetic algorithms, rejection sampling, and prompt optimization is presented to efficiently generate and refine candidate molecules for drug discovery.

  • Benchmark results demonstrate the models' state-of-the-art performance in various molecular optimization tasks, including practical molecular optimization, multi-property optimization with docking, and QED maximization, showcasing their potential in computational drug discovery.

Small Molecule Optimization with LLMs

The paper "Small Molecule Optimization with LLMs," authored by Philipp Guevorguian and colleagues, introduces a novel approach to molecular optimization leveraging LLMs. The authors present two models, Chemlactica and Chemma, which have been fine-tuned on a comprehensive molecular dataset. The dataset itself is derived from PubChem and includes 110 million molecules with computed properties, totaling 40 billion tokens. The motivation behind this work is grounded in the need for efficient drug discovery methodologies that can navigate the vast chemical space with higher efficacy than traditional methods.

Model and Dataset

The core contributions of the paper include the development of two language models, Chemlactica (fine-tuned from Galactica) and Chemma (fine-tuned from Gemma), which demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. The models are trained on a dataset derived from PubChem that pairs SMILES representations with enriched computed molecular properties, allowing them to develop a nuanced understanding of molecular structures and their associated properties.
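
To make "computed properties" concrete, here is a minimal RDKit sketch of the kind of descriptors that can be attached to a corpus molecule; the exact property set used in the paper's corpus may differ, so treat this as illustrative:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def computed_properties(smiles: str) -> dict:
    """Illustrative descriptor set; the paper's corpus may use a
    different selection of computed properties."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return {
        "canonical_smiles": Chem.MolToSmiles(mol),
        "mol_weight": Descriptors.MolWt(mol),  # g/mol
        "clogp": Descriptors.MolLogP(mol),     # Crippen LogP
        "tpsa": Descriptors.TPSA(mol),         # topological polar surface area
        "qed": QED.qed(mol),                   # drug-likeness in [0, 1]
    }

# Example: computed_properties("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```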

Optimization Algorithm

A significant highlight of the paper is the introduction of a novel optimization algorithm that combines ideas from genetic algorithms, rejection sampling, and prompt optimization. This algorithm is designed to efficiently traverse the chemical space by leveraging LLMs to generate candidate molecules and optimize them for arbitrary properties. The optimization algorithm is tailored to work with a black-box oracle, making it versatile for various molecular design tasks.
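
In this setting, a black-box oracle is simply a function that maps a molecule to a scalar score. A minimal sketch, assuming a SMILES-to-score interface and an illustrative LogP-proximity reward of our own choosing (not one of the paper's benchmark oracles):

```python
from typing import Callable

from rdkit import Chem
from rdkit.Chem import Crippen

# The optimizer only ever calls the oracle; it never sees its internals.
Oracle = Callable[[str], float]

def logp_oracle(smiles: str, target: float = 2.5) -> float:
    """Illustrative oracle rewarding molecules whose LogP is near `target`."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # invalid SMILES get the lowest score
        return 0.0
    return max(0.0, 1.0 - abs(Crippen.MolLogP(mol) - target) / 10.0)
```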

Key steps in the optimization process, sketched in code after this list, include:

  1. Generating prompts for molecule generation using selected molecules from a pool.
  2. Using the language model to generate new molecule candidates.
  3. Evaluating these candidates with an oracle function and updating the pool with high-performing molecules.
  4. Fine-tuning the model on high-performing molecules when progress stagnates, to maintain diversity and improve candidate quality.
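
A minimal sketch of this loop, assuming a generic `lm_generate` callable for the language model and the oracle interface above (all names and simplifications are ours):

```python
import random

def optimize(lm_generate, oracle, seed_smiles, n_iters=100, pool_size=50):
    """Minimal sketch of the optimization loop; sampling, acceptance,
    and fine-tuning details from the paper are simplified away."""
    # Pool of (score, smiles) pairs, highest score first.
    pool = sorted(((oracle(s), s) for s in seed_smiles), reverse=True)

    for _ in range(n_iters):
        # Step 1: build a prompt from molecules sampled from the pool
        # (the paper enriches prompts with property tags; see below).
        parents = random.sample(pool, k=min(2, len(pool)))
        prompt = " ".join(smi for _, smi in parents)

        # Step 2: let the language model propose a candidate molecule.
        candidate = lm_generate(prompt)

        # Step 3: score it with the oracle; keep only the top pool_size.
        pool.append((oracle(candidate), candidate))
        pool = sorted(pool, reverse=True)[:pool_size]

        # Step 4 (fine-tuning on stagnation) is omitted in this sketch.

    return pool[0]  # best (score, smiles) found
```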

The authors demonstrate that integrating numerical property descriptions into the prompt during the molecule generation phase can significantly enhance the performance of the optimization algorithm.
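
As an illustration, a prompt builder that injects such numerical targets might look like the following; the tag names ([SIMILAR], [QED], [START_SMILES]) mirror the style of the paper's corpus but should be treated as assumptions rather than the exact format:

```python
def make_prompt(similar_smiles=None, qed_target=None):
    """Build a conditioned generation prompt. Tag names are assumptions
    about the corpus format, not confirmed syntax."""
    parts = []
    if similar_smiles is not None:
        # Condition on similarity to a known molecule (SMILES, target sim).
        parts.append(f"[SIMILAR]{similar_smiles} 0.90[/SIMILAR]")
    if qed_target is not None:
        # Numerical property target that steers generation.
        parts.append(f"[QED]{qed_target:.2f}[/QED]")
    parts.append("[START_SMILES]")  # the model completes the molecule
    return "".join(parts)

# make_prompt(similar_smiles="CCO", qed_target=0.88)
# -> "[SIMILAR]CCO 0.90[/SIMILAR][QED]0.88[/QED][START_SMILES]"
```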

Benchmark Results

The efficacy of the proposed models and optimization algorithm is validated on several benchmark tasks:

  1. Practical Molecular Optimization (PMO): The models achieve state-of-the-art performance, with Chemlactica-1.3B and Chemma-2B substantially outperforming existing methods on multiple tasks (an 8% aggregate improvement over previous methods). Notably, in tasks like sitagliptin_mpo, the models show significant improvement, highlighting their practical relevance in drug discovery scenarios.
  2. Multi-property Optimization with Docking: The models excel in drug discovery case studies by optimizing docking scores for specific protein targets such as DRD2, MK2-kinase, and acetylcholinesterase. The results illustrate that Chemma-2B, in particular, outperforms existing approaches in generating viable drug candidates, thus proving effective in complex molecular design tasks.
  3. QED Maximization with Similarity-Constrained Molecular Design: The Chemlactica-125M model achieves a high success rate in optimizing molecules for high QED while maintaining structural similarity to given reference molecules, demonstrating its robustness and efficiency; a constraint-checking sketch follows this list.
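
For the similarity-constrained QED task in item 3, success can be checked with RDKit as below; the thresholds (QED >= 0.9, Morgan-fingerprint Tanimoto similarity >= 0.4) follow a common formulation of this benchmark and are assumptions rather than values confirmed from the paper:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def passes_constraint(candidate: str, reference: str,
                      qed_min: float = 0.9, sim_min: float = 0.4) -> bool:
    """Check QED and similarity constraints; thresholds are illustrative."""
    cand = Chem.MolFromSmiles(candidate)
    ref = Chem.MolFromSmiles(reference)
    if cand is None or ref is None:
        return False
    fp_c = AllChem.GetMorganFingerprintAsBitVect(cand, 2, nBits=2048)
    fp_r = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
    similarity = DataStructs.TanimotoSimilarity(fp_c, fp_r)
    return QED.qed(cand) >= qed_min and similarity >= sim_min
```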

Model Calibration and Property Prediction

The paper also addresses the calibration of the models, showing that both Chemlactica and Chemma produce well-calibrated outputs across multiple computed properties. This calibration is crucial for accurate prediction and reliable conditional generation. Additionally, the models exhibit competitive performance on property prediction tasks such as ESOL, FreeSolv, and Lipophilicity, outperforming existing models like Chemformer and MolT5.
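
As an illustration of how prompt-based property prediction can work with such models, one can append a property tag to the molecule and parse the model's numeric completion; the tag format here is an assumption, not the corpus's confirmed syntax:

```python
import re

def predict_property(generate, smiles: str, prop_tag: str = "ESOL") -> float:
    """Query a causal LM for a property value and parse the number.
    `generate` is any fn(prompt) -> completion string; the tag names
    are assumptions about the prompt format."""
    prompt = f"[START_SMILES]{smiles}[END_SMILES][{prop_tag}]"
    completion = generate(prompt)  # e.g. "-2.31[/ESOL]..."
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        raise ValueError(f"no numeric value in completion: {completion!r}")
    return float(match.group(1))
```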

Practical Implications and Future Work

The implications of this research are substantial for the field of computational drug discovery. The proposed models and algorithms facilitate the efficient exploration and optimization of chemical space, potentially accelerating the drug development process. The adaptability of the models to fine-tuning for various properties suggests that these tools could be seamlessly integrated into existing drug discovery pipelines.

Future work may explore expanding the models' capabilities to include 3D molecular structures and interactions with biological entities such as proteins, enhancing their applicability in more complex biochemical environments. Additionally, further refinement of the optimization algorithm and exploration of its integration with other computational methods could yield even more robust performance.

Conclusion

The authors provide a comprehensive framework for small molecule optimization using advanced LLMs. The results presented in the paper underscore the potential of the Chemlactica and Chemma models to meaningfully advance molecular drug design. By open-sourcing their training corpus, models, and optimization algorithm, the authors also pave the way for future collaborations and advancements in computational chemistry.
