Is machine learning good or bad for the natural sciences? (2405.18095v2)
Abstract: Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology, in which only the data exist, and a strong epistemology, in which a model is considered good if it performs well on held-out data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine-learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without incurring uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields. The (partial) answers we give here come from the particular perspective of physics.
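The abstract's claim about label biases can be illustrated with a toy simulation (a minimal sketch, not taken from the paper; the shrinkage factor, population means, and noise levels below are all hypothetical). A regression trained on one population acts like a shrinkage estimator toward that training population's mean; when its labels are applied to a shifted science sample and then averaged in an ensemble analysis, the per-object bias does not average away, even though the raw measurement noise would:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training population: true labels z ~ N(0, 1); observations y = z + N(0, 1).
# The minimum-mean-squared-error regression on this population is a
# shrinkage toward the training mean: z_hat = 0.5 * y.
shrink = 1.0 / (1.0 + 1.0)

# Apply the trained regressor to a *shifted* science sample whose true
# mean is 1.0 (a distribution shift the regressor never saw).
n = 100_000
z_true = rng.normal(1.0, 1.0, n)
y = z_true + rng.normal(0.0, 1.0, n)
z_hat = shrink * y  # ML-style labels for each object

# Each label looks individually reasonable, but the ensemble mean is
# biased: averaging many objects converges to 0.5, not the true 1.0.
print(np.mean(z_hat))  # close to 0.5: the bias does not average away
print(np.mean(y))      # close to 1.0: raw noisy data stay unbiased in the mean
```

This is one concrete mechanism behind "uncontrolled biases" in downstream joint analyses: noise averages down with sample size, but a systematic shrinkage (or any training-set-dependent bias) does not.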
Authors: David W. Hogg and Soledad Villar