Self-Normalizing Foundation Model for Enhanced Multi-Omics Data Analysis in Oncology (2405.08226v2)
Abstract: Multi-omics research has enhanced our understanding of cancer heterogeneity and progression. Investigating molecular data through multi-omics approaches is crucial for unraveling the complex biological mechanisms underlying cancer, thereby enabling more effective diagnosis, treatment, and prevention strategies. However, predicting patient outcomes by integrating all available multi-omics data remains an understudied research direction. Here, we present SeNMo, a foundation model trained on multi-omics data across 33 cancer types. SeNMo is particularly efficient at handling multi-omics data characterized by high width (many features) and low length (few samples). We trained SeNMo to predict patients' overall survival using pan-cancer multi-omics data from 33 cancer sites in the Genomic Data Commons (GDC). The training data include gene expression, DNA methylation, miRNA expression, DNA mutations, and protein expression modalities, together with clinical data. SeNMo was validated on two independent cohorts: Moffitt Cancer Center and CPTAC lung squamous cell carcinoma. We evaluated the model's performance in predicting overall survival using the concordance index (C-Index). SeNMo performed consistently well in the training regime, with a validation C-Index of 0.76 on GDC's public data, and achieved a C-Index of 0.758 on a held-out test set. On the task of classifying patients' primary cancer type, the model showed an average accuracy of 99.8% on the pan-cancer test cohort. SeNMo further performed well in predicting tertiary lymphoid structures from multi-omics data, showing generalizability across cancer types, molecular data types, and clinical endpoints.
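The C-Index reported above measures how well predicted risk scores rank patients by survival time: among all comparable patient pairs, it is the fraction in which the patient with the shorter observed survival was assigned the higher risk. The following is a minimal sketch of Harrell's concordance index in plain Python, not the authors' evaluation code. The function name `concordance_index`, its argument names, and the toy data are illustrative assumptions, and pairs with tied survival times are simply skipped for brevity.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Simplified Harrell's C-index (illustrative sketch).

    times  : observed follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : predicted risk score (higher = shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # higher risk for the earlier event: concordant
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up values): four patients, the third is censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.4, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported C-Index values of 0.76 and 0.758 mean that roughly three out of four comparable patient pairs were ordered correctly.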