QComp: A QSAR-Based Data Completion Framework for Drug Discovery (2405.11703v1)

Published 20 May 2024 in cs.LG

Abstract: In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.

Summary

The paper introduces QComp, a data completion framework that improves QSAR predictions by leveraging correlations in sparse experimental data.
It employs a probabilistic model with one-shot data completion to update biochemical activity predictions based on new data.
Experimental results demonstrate significant accuracy improvements: on the ADMET-750k dataset, the mean $r^2$ rises from 0.487 (base QSAR) to 0.620.

Streamlined Drug Discovery with QComp: Enhancing QSAR Models

Background on QSAR Models

In drug discovery, predicting a molecule's biochemical activities, such as its efficacy and toxicity, is crucial. This is the task of Quantitative Structure-Activity Relationship (QSAR) models, which correlate the chemical structure of compounds with their biological activities and are therefore essential for high-throughput screening in materials and drug discovery.

However, these models face significant challenges. Experimental data keep accumulating and evolving, so historical datasets become massive yet sparse. Simply retraining QSAR models on new data isn't always feasible or cost-effective, especially when the new experimental data are few relative to the pre-existing data.

Introducing QSAR-Complete (QComp)

Enter QSAR-Complete, or QComp, a novel data completion framework designed to meet these challenges head-on. QComp enhances traditional QSAR models by leveraging correlations inherent in the available experimental data, improving prediction accuracy across various tasks. Its benefits extend to guiding the sequence of experiments in drug discovery, reducing uncertainties and supporting more rational decision-making.

Methodology

Probabilistic Framework: At the core of QComp is a probabilistic model that treats the biochemical activities of a molecule as a probability distribution influenced by the molecule's chemical structure. This model accounts for both the known and unknown biochemical activities, updating predictions based on new experimental data. The underlying assumption is that the deviations of these activities from QSAR model predictions follow a normal distribution.
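
In symbols, this assumption can be written as follows. The notation is ours and is only a sketch of the idea, not necessarily the paper's exact formulation: $\mathbf{y}$ is the vector of all assay values for a molecule with structure $x$, $f(x)$ is the vector of QSAR predictions, and $\Sigma$ is a covariance matrix over assays.

$$
p(\mathbf{y} \mid x) = \mathcal{N}\big(\mathbf{y};\, f(x),\, \Sigma\big),
\qquad
\mathbf{y} - f(x) \sim \mathcal{N}(\mathbf{0}, \Sigma)
$$

Under this form, any subset of assays that has actually been measured is again normally distributed, with the corresponding sub-vector of $f(x)$ as its mean and the corresponding sub-block of $\Sigma$ as its covariance. That property is what lets sparse records, where each molecule has only a few measured assays, contribute to a likelihood.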

Training and Data Completion: The QComp model is trained using a log-likelihood loss function. Once trained, it can perform one-shot data completion for missing biochemical activity data by estimating the most probable values based on observed data and pre-existing QSAR models.
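
The two steps above can be illustrated with a minimal NumPy sketch. It assumes the deviations from the QSAR predictions are jointly Gaussian with a single covariance matrix `cov`, and applies the standard Gaussian marginal and conditioning formulas. The function names `neg_log_likelihood` and `complete`, and the use of NaN to mark missing entries, are illustrative choices of ours and are not taken from the paper.

```python
import numpy as np

def neg_log_likelihood(qsar_pred, observed, cov):
    """Gaussian negative log-likelihood of the observed entries (NaN = missing)."""
    qsar_pred = np.asarray(qsar_pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = ~np.isnan(observed)
    if not obs.any():
        return 0.0
    resid = observed[obs] - qsar_pred[obs]          # deviation from the QSAR prediction
    s_oo = cov[np.ix_(obs, obs)]                    # covariance block of observed assays
    _, logdet = np.linalg.slogdet(s_oo)
    quad = resid @ np.linalg.solve(s_oo, resid)
    return 0.5 * (quad + logdet + obs.sum() * np.log(2.0 * np.pi))

def complete(qsar_pred, observed, cov):
    """Fill missing entries with their conditional mean; also return their conditional variances."""
    qsar_pred = np.asarray(qsar_pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = ~np.isnan(observed)
    mis = ~obs
    completed = observed.copy()
    if not mis.any():
        return completed, np.zeros(0)
    if not obs.any():
        # Nothing measured yet: fall back to the plain QSAR prediction.
        completed[mis] = qsar_pred[mis]
        return completed, np.diag(cov)[mis]
    s_oo = cov[np.ix_(obs, obs)]
    s_mo = cov[np.ix_(mis, obs)]
    s_mm = cov[np.ix_(mis, mis)]
    resid = observed[obs] - qsar_pred[obs]
    gain = np.linalg.solve(s_oo, s_mo.T).T          # equals s_mo @ inv(s_oo) for symmetric s_oo
    completed[mis] = qsar_pred[mis] + gain @ resid  # QSAR prediction shifted by observed residuals
    cond_cov = s_mm - gain @ s_mo.T
    return completed, np.diag(cond_cov)
```

In this sketch the completion is "one-shot" in the sense that it is a single closed-form linear solve per molecule, with no iterative imputation. In a training loop, `neg_log_likelihood` would be summed over molecules and minimized with respect to the covariance parameters; how the paper parameterizes and optimizes the covariance is not specified here.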

Experimental Results

Data and Models: QComp was evaluated using several datasets, including three proprietary ADMET datasets and one public dataset. For instance, the ADMET-750k dataset comprises data from 32 assays related to small molecules, while the public dataset involves data from 25 assays for over 114,000 small molecules. QComp utilized multi-task Chemprop models and other baseline models for data completion.

Benchmark Performance:

Improvement Across the Board: In the ADMET-750k dataset, QComp systematically outperformed baseline data completion methods such as MICE, MissForest, and Macau. It improved the mean squared Pearson correlation coefficient ($r^2$) from 0.487 (base QSAR) to 0.620. Compared to other methods, QComp was the most robust, maintaining or improving accuracy for nearly all assays tested.
Human vs. Animal Data: In the fup dataset, which contains fraction unbound in plasma data for humans, rats, and dogs, QComp significantly enhanced human assay predictions when animal data were available. The $r^2$ score improved from 0.494 (base QSAR) to 0.751 when using both rat and dog data.
Peptide Dataset Performance: QComp also showed its versatility by improving predictions in a peptide dataset, increasing the average $r^2$ score from 0.428 to 0.673 for assays with adequate data.
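
For readers who want to reproduce this kind of metric on their own data, the following is a minimal sketch of a mean squared Pearson correlation over assays. The function name `mean_squared_pearson`, the NaN convention for unmeasured entries, and the rule for skipping assays with too few or constant values are our assumptions; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mean_squared_pearson(y_true, y_pred):
    """Average r^2 over assays (columns), using only rows where the true value is measured."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = []
    for j in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, j])
        if mask.sum() < 2:
            continue  # correlation is undefined with fewer than two points
        t, p = y_true[mask, j], y_pred[mask, j]
        if t.std() == 0 or p.std() == 0:
            continue  # correlation is undefined for constant vectors
        r = np.corrcoef(t, p)[0, 1]
        scores.append(r ** 2)
    return float(np.mean(scores)) if scores else float("nan")
```

Averaging per-assay scores, rather than pooling all entries, keeps assays with many measurements from dominating the result.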

Practical Implications

Data-Driven Decision-Making: Beyond prediction accuracy, QComp's ability to quantify the gain of certainty (GOC) in predictions makes it invaluable for guiding experimental design. For example, when focusing on "MRT, rat" assays in the ADMET-750k dataset, QComp could prioritize which in vitro assays to measure first, optimizing resource allocation in drug discovery.
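
A greedy prioritization of this kind can be sketched as follows. Here the gain of certainty is taken to be the reduction in the conditional variance of the target assay under a Gaussian model with a known covariance matrix; this is our simplifying assumption, and the paper's precise GOC definition may differ. The names `conditional_variance` and `greedy_order` are hypothetical.

```python
import numpy as np

def conditional_variance(cov, target, measured):
    """Variance of the target assay given that the assays in `measured` are known."""
    if not measured:
        return float(cov[target, target])
    m = list(measured)
    s_mm = cov[np.ix_(m, m)]
    s_tm = cov[target, m]
    return float(cov[target, target] - s_tm @ np.linalg.solve(s_mm, s_tm))

def greedy_order(cov, target, candidates):
    """Rank candidate assays by how much each one, in turn, reduces the target's variance."""
    cov = np.asarray(cov, dtype=float)
    measured, order = [], []
    remaining = list(candidates)
    current = conditional_variance(cov, target, measured)
    while remaining:
        # Pick the assay that leaves the smallest remaining variance on the target.
        best = min(
            remaining,
            key=lambda c: conditional_variance(cov, target, measured + [c]),
        )
        new_var = conditional_variance(cov, target, measured + [best])
        order.append((best, current - new_var))  # (assay index, variance reduction)
        measured.append(best)
        remaining.remove(best)
        current = new_var
    return order
```

Because the Gaussian conditional variance depends only on the covariance and not on the measured values, this ranking can be computed before any experiment is run. It ignores the cost of each assay, so it is a ranking by information alone.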

Limitations and Future Work

Homogeneity of Covariance Assumption: Currently, QComp assumes a uniform covariance matrix across all compounds, which might not capture the nuanced variations in some datasets.

Integration with QSAR Models: Future work could explore the benefits of concurrently training QSAR and QComp models, potentially yielding even greater improvements in prediction accuracy.

Cost-Effective Experimentation: The proposed greedy scheme for experiment prioritization doesn't yet consider the economic or ethical costs. Future developments could incorporate these factors to provide more nuanced guidance.

Conclusion

QComp represents a significant step forward in leveraging sparse and evolving experimental data within the framework of traditional QSAR models. By intelligently integrating new data and providing a structured approach to experimental design, QComp can effectively streamline the drug discovery process, saving both time and resources.