Efficient Evolutionary Search Over Chemical Space with Large Language Models (2406.16976v3)

Published 23 Jun 2024 in cs.NE, cs.AI, cs.LG, and physics.chem-ph

Abstract: Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware LLMs into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations. Our code is available at http://github.com/zoom-wang112358/MOLLEO

Summary

The paper introduces MolLEO, a hybrid framework combining evolutionary algorithms and language models to enhance molecular discovery by reducing computational costs.
It leverages domain-specific prompts and refined selection mechanisms to generate chemically relevant mutations, outperforming traditional methods in property and multi-objective tasks.
Empirical results on PMO and TDC benchmarks demonstrate improved drug-likeness, protein inhibition, and docking scores, highlighting its potential in accelerating drug discovery.

Efficient Evolutionary Search Over Chemical Space with LLMs

Interest in applying AI to molecular discovery has grown in recent years, and with it the need for robust methods that can navigate a vast chemical space where each evaluation is computationally expensive. The paper "Efficient Evolutionary Search Over Chemical Space with Large Language Models" integrates Evolutionary Algorithms (EAs) with chemistry-aware LLMs to improve the discovery and optimization of novel molecular structures.

Introduction and Motivation

Molecular discovery, driven by the need for novel compounds in sectors such as pharmaceuticals and materials science, poses considerable computational challenges. Traditional EAs, typically employed for optimizing black-box objectives, traverse chemical space through random mutations and crossovers, so they often require a large number of expensive objective evaluations. This inefficiency is a significant bottleneck in the practical application of EAs to molecular discovery. LLMs trained on extensive corpora of chemical and scientific literature offer a way to give EAs domain-specific knowledge, potentially reducing the number of evaluations required and accelerating convergence towards optimal solutions.

Methodology

The proposed Molecular Language-Enhanced Evolutionary Optimization (MolLEO) framework redefines the crossover and mutation operations within EAs using LLMs. The approach comprises several key components:

Crossover and Mutation via LLMs: LLMs like GPT-4, BioT5, and MoleculeSTM are utilized to perform crossover and mutation operations. By leveraging the extensive chemical knowledge embedded within these models, the framework aims to generate more chemically relevant and high-fitness offspring compared to random alterations.
Task-Specific Prompts: Each LLM is provided with carefully crafted prompts describing the specific chemical optimization objective. This aids the models in generating proposals that are more likely to meet the desired criteria.
Selection Mechanism: To address the issue of invalid or low-fitness molecules, a selection mechanism is employed where the proposed offspring are filtered based on their structural similarity to the top-performing molecules in the population.
Adaptation and Integration: Different versions of LLMs are integrated and adapted to work within the MolLEO framework, and their performance is empirically validated across multiple molecular optimization tasks.
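
The components above can be sketched as a small evolutionary loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` stands in for an LLM call, the prompt wording is invented for the example, and the character-bigram "fingerprint" is a crude placeholder for a real molecular fingerprint. All function names and parameters here are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

def build_prompt(objective: str, parents: List[Tuple[str, float]]) -> str:
    """Format a task-specific request for the LLM (wording is illustrative only)."""
    lines = [f"Objective: {objective}", "Parent molecules (SMILES, score):"]
    lines += [f"  {smi}  {score:.3f}" for smi, score in parents]
    lines.append("Propose one new SMILES string that should score higher.")
    return "\n".join(lines)

def bigrams(smiles: str) -> set:
    """Placeholder fingerprint (assumption): the set of character bigrams."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)} or {smiles}

def tanimoto(a: str, b: str) -> float:
    """Tanimoto (Jaccard) similarity between the two placeholder fingerprints."""
    fa, fb = bigrams(a), bigrams(b)
    return len(fa & fb) / len(fa | fb)

def evolve(population: List[str],
           oracle: Callable[[str], float],
           propose: Callable[[str], str],
           objective: str,
           generations: int = 5,
           n_offspring: int = 8,
           keep: int = 4,
           pop_size: int = 10,
           seed: int = 0) -> List[Tuple[str, float]]:
    """Evolutionary loop in which offspring come from `propose` instead of random edits."""
    rng = random.Random(seed)
    scored: Dict[str, float] = {smi: oracle(smi) for smi in population}
    for _ in range(generations):
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        best_smiles = ranked[0][0]
        candidates: List[str] = []
        for _ in range(n_offspring):
            parents = rng.sample(ranked, k=min(2, len(ranked)))
            child = propose(build_prompt(objective, parents))
            if child and child not in scored:
                candidates.append(child)
        # Selection filter: rank unique offspring by similarity to the current
        # best molecule and keep only a few, so oracle calls are not wasted.
        unique = list(dict.fromkeys(candidates))
        unique.sort(key=lambda c: tanimoto(c, best_smiles), reverse=True)
        for child in unique[:keep]:
            scored[child] = oracle(child)
        # Survivor selection: keep the top `pop_size` molecules.
        top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:pop_size]
        scored = dict(top)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def toy_propose(prompt: str) -> str:
    """Toy stand-in for an LLM: append a carbon to the first parent in the prompt."""
    parents = [ln.split()[0] for ln in prompt.splitlines() if ln.startswith("  ")]
    return parents[0] + "C"

def toy_oracle(smiles: str) -> float:
    """Toy objective (assumption): number of 'C' characters in the string."""
    return float(smiles.count("C"))

if __name__ == "__main__":
    result = evolve(["CC", "CCO", "C"], toy_oracle, toy_propose, "maximize carbon count")
    print(result[0])
```

In a real setting one would swap `toy_propose` for a call to the chosen model and replace `bigrams` with a chemistry-aware fingerprint; the loop structure stays the same.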

Empirical Results

The efficacy of MolLEO is demonstrated through extensive experiments on several tasks within the Practical Molecular Optimization (PMO) and Therapeutics Data Commons (TDC) benchmarks, including property optimization, similarity-based rediscovery, and structure-based drug design. Key findings from the performance evaluations include:

Property Optimization: MolLEO consistently outperformed baseline models such as Graph-GA, REINVENT, and Gaussian Process Bayesian Optimization (GP BO) in optimizing properties like drug-likeness (QED), protein inhibition (JNK3, GSK3β, DRD2), and synthetic accessibility (SAscore).
Multi-Objective Optimization: The framework demonstrated superior performance in various multi-objective tasks by achieving higher summation and hypervolume scores. For example, in tasks involving simultaneous optimization of QED, JNK3 inhibition, and SAscore, MolLEO achieved higher diversity and coverage in the Pareto frontier.
Protein-Ligand Docking: MolLEO showed significant improvements in docking scores against proteins such as DRD3, EGFR, and the adenosine A2A receptor, indicating its potential utility in drug design applications.
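
To make the multi-objective metrics mentioned above concrete, the following sketch computes a summation score, a Pareto front, and a hypervolume for two objectives. It is an illustration only: it assumes both objectives are scaled so that larger is better, that every point dominates the reference point, and it handles two objectives rather than the three used in the example task. The function names are hypothetical and not taken from the paper's code.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def summation_score(point: Point) -> float:
    """Sum of the objective values (both assumed to be maximized)."""
    return sum(point)

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, sorted by first objective."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))

def hypervolume_2d(points: List[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the Pareto front, measured from the reference point."""
    # Walk the front from largest to smallest first objective; along a Pareto
    # front the second objective then increases, so each point adds one strip.
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area

if __name__ == "__main__":
    # Hypothetical (objective 1, objective 2) pairs for four candidate molecules.
    candidates = [(0.9, 0.2), (0.6, 0.6), (0.3, 0.8), (0.5, 0.5)]
    print(pareto_front(candidates))
    print(hypervolume_2d(candidates))
    print(max(summation_score(p) for p in candidates))
```

A larger hypervolume means the front covers more of the objective space, which is why it rewards both quality and spread, whereas the summation score only rewards a high total.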

Implications and Future Directions

The demonstrated capability of MolLEO in efficiently navigating chemical space with fewer evaluations has several important implications:

Acceleration of Molecular Discovery: By reducing the computational resources and time required for molecule optimization, MolLEO can expedite the discovery of novel compounds, aiding in faster deployment for practical applications.
Enhanced Design of Experimental Protocols: The ability to provide high-quality molecular candidates with fewer evaluations aligns well with experimental settings where resources are limited, thus optimizing the allocation of experimental efforts.
Generalizability: Because the framework draws its task-specific knowledge from LLMs, it could plausibly be adapted to other domains within chemistry and materials science, promoting interdisciplinary applications of AI.

Conclusion

MolLEO represents a significant advancement in the integration of machine learning and evolutionary algorithms for molecular optimization. By leveraging the domain-specific insights encapsulated in LLMs, the framework offers a more efficient pathway to high-fitness molecular candidates, reducing dependency on computationally expensive evaluations. Future developments may focus on refining the integration of LLMs into broader chemical informatics pipelines and exploring the combination of MolLEO with other generative models, potentially unlocking new frontiers in computational chemistry and drug discovery.

GitHub

GitHub - zoom-wang112358/MOLLEO: Source code of MOLLEO (31 stars)
