QComp: A QSAR-Based Data Completion Framework for Drug Discovery (2405.11703v1)
Abstract: In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.
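The abstract does not spell out the statistical model behind QComp, so the following is only an illustrative sketch of the general idea and not the authors' implementation. It assumes that the experimental values of one compound across several endpoints follow a multivariate Gaussian whose mean is the vector of QSAR predictions and whose covariance describes how the experimental-minus-QSAR residuals co-vary between endpoints. It also assumes that measurements are noise-free. The function names `complete` and `variance_reduction`, and all numbers, are hypothetical.

```python
import numpy as np


def complete(mu, sigma, observed):
    """Condition a multivariate Gaussian on the endpoints already measured.

    mu       : QSAR predictions for all endpoints (used as prior mean), shape (d,)
    sigma    : assumed covariance of experimental-minus-QSAR residuals, shape (d, d)
    observed : dict {endpoint index: measured value}
    Returns the posterior mean and posterior variance of every endpoint.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.size
    obs = np.array(sorted(observed), dtype=int)
    mis = np.array([i for i in range(d) if i not in observed], dtype=int)

    mean = mu.copy()
    var = np.diag(sigma).copy()
    if obs.size == 0:
        return mean, var  # nothing measured: fall back to plain QSAR

    y = np.array([observed[i] for i in obs], dtype=float)
    mean[obs] = y      # measured endpoints are taken at face value
    var[obs] = 0.0     # assumption: no measurement noise
    if mis.size:
        s_oo = sigma[np.ix_(obs, obs)]
        s_mo = sigma[np.ix_(mis, obs)]
        gain = np.linalg.solve(s_oo, s_mo.T).T  # S_mo @ inv(S_oo)
        mean[mis] = mu[mis] + gain @ (y - mu[obs])
        var[mis] = np.diag(sigma)[mis] - np.einsum("ij,ij->i", gain, s_mo)
    return mean, var


def variance_reduction(sigma, target, candidate):
    """Drop in the variance of `target` if `candidate` were measured.

    For a Gaussian the posterior variance does not depend on the measured
    value, so a placeholder value of 0.0 is sufficient here.
    """
    sigma = np.asarray(sigma, dtype=float)
    _, var = complete(np.zeros(len(sigma)), sigma, {candidate: 0.0})
    return sigma[target, target] - var[target]


# Hypothetical example: three endpoints; 0 and 1 are correlated, 2 is independent.
mu = [1.0, 2.0, 0.5]
sigma = [[1.0, 0.8, 0.0],
         [0.8, 1.0, 0.0],
         [0.0, 0.0, 1.0]]

# Endpoint 0 is measured at 2.0, one unit above its QSAR prediction.
mean, var = complete(mu, sigma, {0: 2.0})
# Endpoint 1 shifts to 2.0 + 0.8 * (2.0 - 1.0) = 2.8, variance 1 - 0.8**2 = 0.36.
# Endpoint 2 is uncorrelated with endpoint 0, so it keeps mean 0.5 and variance 1.0.
```

In this toy setting, `variance_reduction(sigma, 1, 0)` is about 0.64 while `variance_reduction(sigma, 1, 2)` is 0, so measuring endpoint 0 first would be the more informative experiment for endpoint 1. This is the kind of comparison the abstract alludes to when it mentions guiding the sequence of experiments.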