
Is machine learning good or bad for the natural sciences? (2405.18095v2)

Published 28 May 2024 in stat.ML, astro-ph.IM, cs.LG, and physics.data-an

Abstract: Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.

Authors (2)
  1. David W. Hogg (189 papers)
  2. Soledad Villar (45 papers)
Citations (4)

Summary

  • The paper contrasts ML’s data-centric, performance-driven methods with natural sciences’ theory-based approaches.
  • It highlights risks such as confirmation bias when ML emulates physical simulations and estimator bias when ML-generated labels enter downstream analyses.
  • The study calls for critical evaluation and conservative ML integration to uphold scientific rigor in research.

An Expert Review of "Position: Is machine learning good or bad for the natural sciences?"

The paper by Hogg and Villar investigates the complex interfaces between ML and the natural sciences, posing critical questions about the appropriateness and impact of ML methodologies in scientific research. With affiliations spanning the Center for Cosmology and Particle Physics at NYU, the Max-Planck-Institut für Astronomie, the Flatiron Institute, and Johns Hopkins University, the authors offer a robust, multidisciplinary perspective.

ML Philosophies vs. Natural Sciences Philosophies

The paper begins by contrasting the ontological and epistemological foundations of ML and the natural sciences. ML operates with a strong ontology that privileges data over latent structures, and it subscribes to a performance-centric epistemology that values a model according to how well it performs on data held out from training. In contrast, the natural sciences prioritize understanding underlying mechanisms and latent structures, valuing theories for their explanatory power and their integration with wider scientific knowledge.

Definitions and Orientations

The authors provide working definitions of ML and natural science for the purposes of their argument. They define ML as methods whose capabilities improve significantly with increased data exposure, a definition that spans classical and contemporary techniques, from principal component analysis (PCA) to convolutional neural networks (CNNs). They delineate the natural sciences as fields aimed primarily at understanding natural phenomena, setting aside engineering-oriented questions, for which ML application is less contentious.

Core Contributions

Hogg and Villar's principal contributions can be summarized as:

  1. Philosophical Contrast:
    • The paper lucidly details the fundamental contrasts between the ontologies and epistemologies of ML and the natural sciences.
  2. Statistical Biases:
    • The authors highlight the introduction of confirmation biases when ML models replace physical simulations and estimator biases when ML-generated dataset labels are used in downstream analyses.
  3. Identifying Safe ML Applications:
    • Several scenarios where ML can be effectively and conservatively applied are discussed.
    • Causal contexts and operational parts of scientific projects are noted as particularly amenable to ML integration.
  4. Call to Action:
    • The paper urges scientific communities to critically evaluate the role and value of ML in their disciplines.

Discussion of Technological Integration

Beneficial Applications

The paper outlines several domains where ML's data-centric philosophy can provide substantial benefits:

  • Label Transfer and Classification:
    • Efficiently predicting labels for large, unlabeled datasets when labels are computationally expensive to obtain.
  • Speeding up Decisions:
    • Applications that require rapid real-time decisions such as in high-energy particle physics experiments.
  • Modeling Nuisances:
    • ML's utility in modeling foregrounds and backgrounds, focusing on effective, rather than detailed, model comprehension.
  • Outlier Detection and Information Theoretic Insights:
    • Identifying anomalies and providing insights into data’s informational content.
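To make the label-transfer use case concrete, here is a minimal, hypothetical NumPy sketch (my illustration, not the paper's): an expensive pipeline labels a small subset of a survey, and a simple least-squares regression, standing in for the ML model, transfers those labels to the full unlabeled catalog.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: expensive "pipeline" labels exist for 500 objects;
# the remaining survey objects have only cheap features.
n_labeled, n_survey = 500, 50_000
true_coef = np.array([1.0, -0.5, 2.0])

X_lab = rng.normal(size=(n_labeled, 3))
y_lab = X_lab @ true_coef + rng.normal(scale=0.1, size=n_labeled)

X_survey = rng.normal(size=(n_survey, 3))

# Ordinary least squares stands in for the expressive ML regression.
coef, *_ = np.linalg.lstsq(X_lab, y_lab, rcond=None)
y_transfer = X_survey @ coef  # cheap predicted labels for the whole survey
```

The speed-up is the point: once trained, prediction is a single matrix multiply, which is why label transfer is singled out as a relatively safe, operational use of ML, provided the predicted labels are not later treated as measurements in population-level analyses.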

Problematic Applications

Conversely, the paper identifies potential pitfalls:

  • Simulation Emulation:
    • The use of ML to augment or replace physical simulations may result in confirmation biases, jeopardizing scientific integrity.
  • ML-based Labeling:
    • When ML-generated labels are fed into joint or ensemble analyses, they introduce uncontrolled estimator biases.
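The estimator-bias pitfall can be demonstrated in a few lines. In this hypothetical NumPy sketch (my illustration, not the paper's), the optimal regression of noisy measurements shrinks each predicted label toward the population mean; every prediction is individually reasonable, yet the ensemble of predicted labels is too narrow, so any downstream population study inherits the bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: true labels y with unit variance,
# observed features x = y + unit-variance measurement noise.
n = 200_000
y_true = rng.normal(size=n)
x = y_true + rng.normal(size=n)

# The optimal (least-squares) predictor E[y|x] shrinks toward the mean:
# slope = var(y) / (var(y) + var(noise)) = 0.5 here.
slope = np.cov(x, y_true)[0, 1] / np.var(x)
y_hat = slope * x

# Each y_hat is a good point estimate, but the predicted-label ensemble
# is too narrow: var(y_hat) ≈ 0.5 versus var(y_true) ≈ 1.0.
print(np.var(y_true), np.var(y_hat))
```

A joint or ensemble analysis built on `y_hat` would underestimate the population scatter by a factor of two here; this is the kind of uncontrolled bias the authors warn about when ML-generated labels are reused downstream as if they were measurements.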

Future Directions

The paper advances a forward-looking discussion of the epistemic role of ML within the broader goals of the natural sciences, highlighting areas such as symbolic regression and foundation models. While the authors note that major discoveries directly facilitated by ML remain elusive, they acknowledge the potential for future breakthroughs.

Implications

Practical Implications

The practical implications are substantial: large-scale scientific projects increasingly rely on ML methodologies. The paper serves as a call to integrate ML prudently, ensuring coherence with the traditional epistemological rigor of the natural sciences.

Theoretical Implications

The paper advances the discussion on the intersection of ML and the natural sciences, prompting researchers to consider the philosophical alignment of their methodologies. It underscores the need for balance: while ML excels at handling and interpreting large datasets, its integration must not compromise scientific standards.

Conclusion

Hogg and Villar’s paper is a critical reflection on the role of ML in the natural sciences, offering both a philosophical dissection and practical guidance. Its call to consider carefully where ML use is appropriate, and to guard against statistical biases, is essential for preserving scientific integrity. The paper will serve as a useful reference for researchers navigating the balance between leveraging ML's capabilities and adhering to the epistemic standards of the natural sciences.