
Is machine learning good or bad for the natural sciences? (2405.18095v2)

Published 28 May 2024 in stat.ML, astro-ph.IM, cs.LG, and physics.data-an

Abstract: Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.

Authors (2)
  1. David W. Hogg (189 papers)
  2. Soledad Villar (45 papers)
Citations (4)

Summary

  • The paper contrasts ML’s data-centric, performance-driven methods with natural sciences’ theory-based approaches.
  • It highlights risks such as confirmation bias when ML emulates physical simulations and estimator bias when ML-generated labels enter downstream analyses.
  • The study calls for critical evaluation and conservative ML integration to uphold scientific rigor in research.

An Expert Review of "Position: Is machine learning good or bad for the natural sciences?"

The paper by Hogg and Villar investigates the complex interfaces between ML and the natural sciences, posing critical questions about the appropriateness and impact of ML methodologies in scientific research. With affiliations spanning the Center for Cosmology and Particle Physics at NYU, the Max-Planck-Institut für Astronomie, the Flatiron Institute, and Johns Hopkins University, the authors offer a robust, multidisciplinary perspective.

ML Philosophies vs. Natural Sciences Philosophies

The paper begins by contrasting the ontological and epistemological foundations of ML and the natural sciences. ML operates with a strong ontology that privileges data over latent structures, and it subscribes to a performance-centric epistemology that values a model according to how well it performs on data held out from training. In contrast, the natural sciences prioritize understanding underlying mechanisms and latent structures, valuing theories for their explanatory power and their integration with wider scientific knowledge.

Definitions and Orientations

The authors provide working definitions of ML and natural science for the purposes of their argument. They define ML as methods whose capabilities improve significantly with increased data exposure, a definition that spans classical and contemporary techniques, from principal component analysis (PCA) to convolutional neural networks (CNNs). They delineate the natural sciences as fields aimed primarily at understanding natural phenomena, setting aside engineering-oriented questions, for which ML application is less contentious.

Core Contributions

Hogg and Villar's principal contributions can be summarized as:

  1. Philosophical Contrast:
    • The paper lucidly details the fundamental contrasts between the ontologies and epistemologies of ML and the natural sciences.
  2. Statistical Biases:
    • The authors highlight the introduction of confirmation biases when ML models replace physical simulations and estimator biases when ML-generated dataset labels are used in downstream analyses.
  3. Identifying Safe ML Applications:
    • Several scenarios where ML can be effectively and conservatively applied are discussed.
    • Causal contexts and operational parts of scientific projects are noted as particularly amenable to ML integration.
  4. Call to Action:
    • The paper urges scientific communities to critically evaluate the role and value of ML in their disciplines.

Discussion of Technological Integration

Beneficial Applications

The paper outlines several domains where ML's data-centric philosophy can provide substantial benefits:

  • Label Transfer and Classification:
    • Efficiently predicting labels for large, unlabeled datasets when labels are computationally expensive to obtain.
  • Speeding up Decisions:
    • Applications that require rapid real-time decisions such as in high-energy particle physics experiments.
  • Modeling Nuisances:
    • ML's utility in modeling foregrounds and backgrounds, focusing on effective, rather than detailed, model comprehension.
  • Outlier Detection and Information Theoretic Insights:
    • Identifying anomalies and providing insights into data’s informational content.
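To make the label-transfer use case concrete, here is a minimal, hypothetical NumPy sketch (my illustration, not the paper's): an expensive pipeline labels a small subset of a survey, and a simple least-squares regression, standing in for the ML model, transfers those labels to the full unlabeled catalog.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: expensive "pipeline" labels exist for 500 objects;
# the remaining survey objects have only cheap features.
n_labeled, n_survey = 500, 50_000
true_coef = np.array([1.0, -0.5, 2.0])

X_lab = rng.normal(size=(n_labeled, 3))
y_lab = X_lab @ true_coef + rng.normal(scale=0.1, size=n_labeled)

X_survey = rng.normal(size=(n_survey, 3))

# Ordinary least squares stands in for the expressive ML regression.
coef, *_ = np.linalg.lstsq(X_lab, y_lab, rcond=None)
y_transfer = X_survey @ coef  # cheap predicted labels for the whole survey
```

The speed-up is the point: once trained, prediction is a single matrix multiply, which is why label transfer is singled out as a relatively safe, operational use of ML, provided the predicted labels are not later treated as measurements in population-level analyses.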

Problematic Applications

Conversely, the paper identifies potential pitfalls:

  • Simulation Emulation:
    • The use of ML to augment or replace physical simulations may result in confirmation biases, jeopardizing scientific integrity.
  • ML-based Labeling:
    • When ML-generated labels are fed into joint or ensemble analyses, they introduce uncontrolled estimator biases.
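The estimator-bias pitfall can be demonstrated in a few lines. In this hypothetical NumPy sketch (my illustration, not the paper's), the optimal regression of noisy measurements shrinks each predicted label toward the population mean; every prediction is individually reasonable, yet the ensemble of predicted labels is too narrow, so any downstream population study inherits the bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: true labels y with unit variance,
# observed features x = y + unit-variance measurement noise.
n = 200_000
y_true = rng.normal(size=n)
x = y_true + rng.normal(size=n)

# The optimal (least-squares) predictor E[y|x] shrinks toward the mean:
# slope = var(y) / (var(y) + var(noise)) = 0.5 here.
slope = np.cov(x, y_true)[0, 1] / np.var(x)
y_hat = slope * x

# Each y_hat is a good point estimate, but the predicted-label ensemble
# is too narrow: var(y_hat) ≈ 0.5 versus var(y_true) ≈ 1.0.
print(np.var(y_true), np.var(y_hat))
```

A joint or ensemble analysis built on `y_hat` would underestimate the population scatter by a factor of two here; this is the kind of uncontrolled bias the authors warn about when ML-generated labels are reused downstream as if they were measurements.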

Future Directions

The paper advances a forward-looking discussion of the epistemic role of ML within the broader goals of the natural sciences, highlighting areas such as symbolic regression and foundation models. While the authors note that major discoveries directly facilitated by ML remain elusive, they acknowledge the potential for future breakthroughs.

Implications

Practical Implications

The practical implications are substantial: large-scale scientific projects increasingly rely on ML methodologies. The paper serves as a call to integrate ML prudently, ensuring coherence with the traditional epistemological rigor of the natural sciences.

Theoretical Implications

The paper advances the discussion on the intersection of ML and the natural sciences, prompting researchers to consider the philosophical alignment of their methodologies. It underscores the need for balance: while ML excels at handling and interpreting large datasets, its integration must not compromise scientific standards.

Conclusion

Hogg and Villar’s paper is a critical reflection on the role of ML in the natural sciences, offering both a philosophical dissection and practical guidance. Its call to consider carefully where ML use is appropriate, and to guard against statistical biases, is essential for preserving scientific integrity. The paper will serve as a useful reference for researchers navigating the balance between leveraging ML's capabilities and adhering to the epistemic standards of the natural sciences.