
Abstract

One of the most promising applications of ML in computational physics is to accelerate the solution of partial differential equations (PDEs). The key objective of ML-based PDE solvers is to output a sufficiently accurate solution faster than standard numerical methods, which are used as a baseline comparison. We first perform a systematic review of the ML-for-PDE solving literature. Of articles that use ML to solve a fluid-related PDE and claim to outperform a standard numerical method, we determine that 79% (60/76) compare to a weak baseline. Second, we find evidence that reporting biases, especially outcome reporting bias and publication bias, are widespread. We conclude that ML-for-PDE solving research is overoptimistic: weak baselines lead to overly positive results, while reporting biases lead to underreporting of negative results. To a large extent, these issues appear to be caused by factors similar to those of past reproducibility crises: researcher degrees of freedom and a bias towards positive results. We call for bottom-up cultural changes to minimize biased reporting as well as top-down structural reforms intended to reduce perverse incentives for doing so.

Overview

  • The paper by McGreivy and Hakim critically examines the use of ML for solving fluid-related partial differential equations (PDEs), identifying weak baselines and reporting biases as primary sources of overly optimistic results.

  • Analysis of 82 articles reveals that, among the 76 claiming to outperform a standard numerical method, 79% (60/76) compare against a weak baseline, typically an outdated or inefficient numerical method or a comparison made at unequal accuracy.

  • The authors find significant reporting bias, with 94.8% of the reviewed articles reporting only positive results, leading to an inflated perception of ML's effectiveness; they recommend rigorous standards for fair comparisons and transparent reporting in future research.

Weak Baselines and Reporting Biases Lead to Overoptimism in Machine Learning for Fluid-Related PDEs

The paper under review, authored by McGreivy and Hakim, provides a critical analysis of the current literature on using ML for solving partial differential equations (PDEs) related to fluid mechanics. The primary focus is on identifying the causes of overoptimistic results in this domain, specifically citing weak baselines and reporting biases as major concerns.

Main Points

Reproducibility Crisis in ML-Based Science

The paper begins by situating its discussion within the broader reproducibility crisis affecting many scientific fields. It notes that ML and ML-based scientific research are not immune to these issues. This is further corroborated by large-scale analyses documenting reproducibility concerns in various subfields of ML, such as medical applications.

Scope and Methodology

The authors focus on fluid-related PDEs, analyzing research that employs ML to solve these equations more efficiently than traditional numerical methods. They consider 82 articles and identify common pitfalls like weak baselines and reporting biases.

Weak Baselines

Two primary rules are established for ensuring fair comparisons between ML-based solvers and traditional numerical methods:

  1. Rule 1: Compare at Equal Accuracy or Equal Runtime: Comparing speed is only meaningful when both methods reach the same accuracy (or, equivalently, when accuracy is compared at equal runtime). Violations occur when a high-accuracy traditional method is timed against a less accurate ML approach.
  2. Rule 2: Compare to an Efficient Numerical Method: ML-based solvers should be compared against state-of-the-art, highly efficient numerical methods rather than older or less efficient ones.

The paper reports that 79% of the reviewed studies claiming to outperform a standard numerical method violated at least one of these rules. For instance, many articles compared their ML-based solvers against outdated or suboptimal numerical baselines, thereby overestimating the advantage of the ML approach.
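To make Rule 1 concrete, a minimal benchmarking sketch in Python is shown below. It is not taken from the paper; the `ml_solver`, `numerical_solver`, and reference solution are hypothetical placeholders, and the idea is simply to match the two methods on accuracy before timing them.

```python
import time
import numpy as np

def relative_error(u, u_ref):
    """L2 relative error against a trusted high-resolution reference solution."""
    return np.linalg.norm(u - u_ref) / np.linalg.norm(u_ref)

def median_runtime(solver, n_repeats=5):
    """Run `solver()` several times; return its output and the median wall-clock time."""
    times, u = [], None
    for _ in range(n_repeats):
        start = time.perf_counter()
        u = solver()
        times.append(time.perf_counter() - start)
    return u, float(np.median(times))

def speedup_at_equal_accuracy(ml_solver, numerical_solver, u_ref, resolutions):
    """Rule 1 sketch: find the coarsest traditional-solver resolution whose error
    matches the ML solver's error, then compare runtimes at that operating point."""
    u_ml, t_ml = median_runtime(ml_solver)
    err_ml = relative_error(u_ml, u_ref)
    for n in sorted(resolutions):  # sweep from coarse to fine
        u_num, t_num = median_runtime(lambda: numerical_solver(n))
        if relative_error(u_num, u_ref) <= err_ml:
            # Both methods now reach (at least) the ML solver's accuracy,
            # so their runtimes can be compared fairly.
            return {"ml_error": err_ml, "ml_runtime": t_ml,
                    "baseline_resolution": n, "baseline_runtime": t_num,
                    "speedup": t_num / t_ml}
    raise ValueError("Baseline never matched the ML solver's accuracy; "
                     "compare accuracy at equal runtime instead.")
```

In this setup the speedup is reported only at the resolution where the numerical baseline first matches the ML solver's error, which is the operating point Rule 1 requires.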

Reporting Biases

The analysis also identifies a pervasive bias towards reporting positive results: 94.8% of the reviewed articles reported only positive results, 5.2% reported both positive and negative results, and none reported solely negative outcomes. This imbalance points to significant publication and outcome reporting bias, where negative results are discouraged or omitted, leading to an inflated perception of the effectiveness of ML tools.

Statistical and Anecdotal Evidence

The paper presents both statistical and anecdotal evidence to support its claims. For example, some ML methods that show promising results in one context perform poorly when tested under different conditions. This is indicative of selective reporting and outcome switching, practices that further contribute to the reproducibility crisis.

Reproducing Results with Stronger Baselines

The authors attempted to replicate results from ten highly cited articles using stronger baselines. In most cases, the more efficient numerical methods outperformed the ML-based solvers. Notably, only three out of ten ML-based methods remained competitive when compared against these optimized traditional methods.
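The gap between weak and strong baselines is easy to reproduce on a toy problem. The sketch below is not from the paper; it solves the 1D periodic heat equation with two baselines of very different efficiency, a naive explicit finite-difference scheme whose time step is stability-limited and a pseudo-spectral method that is essentially exact for this linear problem. A speedup claimed over the former can evaporate against the latter.

```python
import time
import numpy as np

def heat_spectral(u0, nu, t_final, L=2 * np.pi):
    """Strong baseline: pseudo-spectral solution of u_t = nu*u_xx with periodic
    boundaries, exact in time for each Fourier mode."""
    n = u0.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)           # angular wavenumbers
    u_hat = np.fft.fft(u0) * np.exp(-nu * k**2 * t_final)
    return np.fft.ifft(u_hat).real

def heat_explicit_fd(u0, nu, t_final, L=2 * np.pi):
    """Weak baseline: explicit finite differences, with the time step capped by
    the stability condition dt <= dx^2 / (2*nu)."""
    n = u0.size
    dx = L / n
    dt = 0.4 * dx**2 / nu
    u = u0.copy()
    for _ in range(int(np.ceil(t_final / dt))):
        u = u + (nu * dt / dx**2) * (np.roll(u, 1) - 2 * u + np.roll(u, -1))
    return u

if __name__ == "__main__":
    x = np.linspace(0, 2 * np.pi, 1024, endpoint=False)
    u0, nu, t_final = np.sin(x) + 0.5 * np.sin(4 * x), 1.0, 0.1
    u_ref = heat_spectral(u0, nu, t_final)                # exact for this problem
    for name, solver in [("spectral (strong)", heat_spectral),
                         ("explicit FD (weak)", heat_explicit_fd)]:
        start = time.perf_counter()
        u = solver(u0, nu, t_final)
        err = np.linalg.norm(u - u_ref) / np.linalg.norm(u_ref)
        print(f"{name:20s} runtime {time.perf_counter() - start:8.4f} s, rel. error {err:.2e}")
```

The point of the contrast is methodological: whatever method one benchmarks against the weak baseline will look far faster than it would against the strong one, even though both baselines solve the same equation to comparable accuracy.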

Implications and Future Directions

Practical Implications

The findings caution against overly optimistic assessments of ML for fluid-related PDEs. The tendency to use weak baselines and to report results selectively can mislead subsequent research and applications, potentially leading to suboptimal solutions in real-world scenarios.

Theoretical Implications

The paper highlights the need for more rigorous standards in ML-based scientific research. Ensuring fair comparisons and complete reporting would help provide a more accurate picture of an ML model's true efficacy.

Recommendations for Best Practices

  1. Fair Comparisons: Compare ML-based solvers with both traditional numerical methods and other ML-based methods.
  2. Adherence to Rule 1: Always ensure comparisons are made at equal accuracy or runtime.
  3. Multiple Baselines: Employ multiple numerical methods where possible to ensure robust baselines.
  4. Transparency: Explicitly discuss how baselines were chosen and justify their efficiency.
  5. Report Efficiency Metrics: Besides runtime and accuracy, include the computational cost of generating training data and of training the model (a cost-accounting sketch follows this list).
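For the last recommendation, a hedged sketch of what such cost accounting could look like is given below. The `MLSolverCost` structure and the numbers are illustrative assumptions, not from the paper; the point is that per-solve inference speedups should be weighed against one-time data-generation and training costs.

```python
from dataclasses import dataclass

@dataclass
class MLSolverCost:
    """Hypothetical bookkeeping for the full cost of an ML-based PDE solver."""
    data_generation_hours: float   # compute-hours running traditional solvers for training data
    training_hours: float          # compute-hours spent training the model
    inference_seconds: float       # wall-clock time per solve with the ML model
    baseline_seconds: float        # wall-clock time per solve with an efficient numerical method

    def break_even_solves(self) -> float:
        """Number of solves needed before per-solve savings repay the up-front
        cost (infinite if the ML solver is not faster at all)."""
        per_solve_saving = self.baseline_seconds - self.inference_seconds
        if per_solve_saving <= 0:
            return float("inf")
        upfront_seconds = 3600 * (self.data_generation_hours + self.training_hours)
        return upfront_seconds / per_solve_saving

# Example with made-up numbers: a 10x inference speedup still needs about
# 400,000 solves before 100 hours of up-front compute are amortized.
cost = MLSolverCost(data_generation_hours=60, training_hours=40,
                    inference_seconds=0.1, baseline_seconds=1.0)
print(f"break-even after {cost.break_even_solves():,.0f} solves")
```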

Conclusion

The paper provides a nuanced and empirical critique of current ML practices in solving fluid-related PDEs. By identifying and quantifying issues around weak baselines and reporting biases, it sets a high bar for future research methodologies in this domain. The recommendations made by McGreivy and Hakim aim to foster a more rigorous and transparent approach, ultimately benefitting both the ML community and the broader scientific landscape.
