Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection (2312.16624v3)
Abstract: The challenge in biomarker discovery using machine learning from omics data lies in the abundance of molecular features combined with the scarcity of samples. Most feature selection methods in machine learning require evaluating various sets of features (models) to determine the most effective combination. This process, typically conducted using a validation dataset, involves testing different feature sets to optimize the model's performance. Every evaluation carries performance estimation error, and when the selection involves many models, the best ones are almost certainly overestimated. Biomarker identification with feature selection methods can be addressed as a multi-objective problem with trade-offs between predictive ability and parsimony in the number of features. Genetic algorithms are a popular tool for multi-objective optimization, but they evolve numerous solutions and are thus prone to overestimation. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no existing algorithm was capable of reducing the overestimation during the optimization, improving model selection, or operating in the more general multi-objective domain. We propose DOSA-MO, a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
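The overestimation described in the abstract can be illustrated with a small simulation. The sketch below is not DOSA-MO; it only shows the effect that motivates it, under simplifying assumptions that are ours rather than the paper's: every hypothetical model has the same true accuracy, validation outcomes are independent coin flips, and the "best" model is the one with the highest validation accuracy. All function names and parameter values are illustrative.

```python
import random
import statistics


def simulate_selection(n_models, n_val_samples, true_acc=0.7, seed=0):
    """Pick the best of n_models by validation accuracy.

    Assumption: all models share the same true accuracy, so any gap between
    the winner's validation accuracy and true_acc is caused by selection alone.
    Returns (best validation estimate, true accuracy).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_models):
        # Each validation sample is classified correctly with prob. true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_val_samples))
        estimates.append(correct / n_val_samples)
    return max(estimates), true_acc


def mean_overestimation(n_models, n_val_samples, repeats=100):
    """Average (selected estimate - true accuracy) over repeated simulations."""
    gaps = []
    for r in range(repeats):
        best, truth = simulate_selection(n_models, n_val_samples, seed=r)
        gaps.append(best - truth)
    return statistics.mean(gaps)


if __name__ == "__main__":
    # More candidate models -> larger overestimation of the selected one.
    for n_models in (1, 10, 100):
        gap = mean_overestimation(n_models, n_val_samples=30)
        print(f"models={n_models:>3}  mean overestimation={gap:+.3f}")
```

In this toy setting the gap grows with the number of candidates and shrinks as the validation set grows, that is, as the variance of each estimate falls. This is consistent with the abstract's statement that the variance of the original estimate helps predict the overestimation. How DOSA-MO actually models the overestimation is described in the paper itself and is not reproduced here.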