Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data (2306.11157v2)

Published 19 Jun 2023 in stat.ML, cs.LG, and stat.AP

Abstract: The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models, in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level, or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.
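
For readers unfamiliar with the normalization mentioned in the abstract, the sketch below shows what total sum scaling does to a table of taxon counts: each sample's counts are divided by that sample's total, giving relative abundances. It also shows the centered log-ratio transform as one commonly used alternative for compositional data. This is an illustrative sketch only, not the authors' code: the function names, the toy count matrix, and the pseudocount of 1 are all assumptions, and the abstract does not say which alternatives the paper found to work better.

```python
import numpy as np

def total_sum_scaling(counts):
    """Divide each sample (row) by its total count to get relative abundances."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty samples: a zero total would cause division by zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals

def centered_log_ratio(counts, pseudocount=1.0):
    """Centered log-ratio transform; the pseudocount (an assumption) handles zeros."""
    shifted = np.asarray(counts, dtype=float) + pseudocount
    logs = np.log(shifted)
    # Subtract each sample's mean log value so every row is centered at zero.
    return logs - logs.mean(axis=1, keepdims=True)

# Hypothetical toy data: 2 samples (rows) by 3 taxa (columns).
counts = np.array([[10, 0, 90],
                   [5, 5, 0]])

tss = total_sum_scaling(counts)
clr = centered_log_ratio(counts)
print(tss)
print(clr)
```

After total sum scaling, every row sums to 1, so differences in sequencing depth between samples are removed but the values remain constrained to a simplex. The centered log-ratio rows instead sum to 0, which is one reason it is often considered for models that assume unconstrained inputs.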
