Reproducibility Issues for BERT-based Evaluation Metrics (2204.00004v3)

Published 30 Mar 2022 in cs.CL and cs.LG

Abstract: Reproducibility is of utmost concern in machine learning and NLP. In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) has pointed out problems of reproducibility of the dominant metric, BLEU, at the time of publication. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating not to human scores but to a wrong column in the csv file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study where we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover Distance).

Citations (16)

Summary

  • The paper evaluates reproducibility issues in four BERT-based metrics, highlighting how subtle preprocessing changes affect results.
  • Methodology involved empirical sensitivity analyses of tokenization, punctuation removal, and dataset-specific IDF-weighting across language pairs.
  • Findings emphasize the need for transparent documentation and standardized practices to improve reliability in natural language generation evaluations.

Analyzing "Reproducibility Issues for BERT-based Evaluation Metrics"

The paper "Reproducibility Issues for BERT-based Evaluation Metrics" seeks to explore the reproducibility challenges inherent in BERT-based evaluation metrics particularly used in natural language generation (NLG) tasks, including machine translation (MT). It investigates the reproducibility of four well-regarded BERT-based metrics, identifies preprocessing and infrastructural issues affecting reproducibility, and conducts sensitivity analyses of preprocessing effects.

Introduction to Reproducibility in BERT-based Metrics

Reproducibility is a critical issue in ML and NLP, affecting the trust placed in experimental results and their reliability. The paper acknowledges the reproducibility crisis identified by several sources and emphasizes the need for reproducible results from NLG metrics. While BLEU, the long-dominant traditional metric, is known to suffer from reproducibility problems (Post, 2018), newer BERT-based metrics are often assumed to fare better. However, this paper identifies several obstacles to reproducing their reported results and verifies these challenges through empirical examination.

Reproduction of BERT-based Metrics

Metrics Under Review

The paper evaluates four BERT-based metrics: BERTScore, MoverScore, BaryScore, and SentSim, which are used across various NLG tasks such as MT, summarization, and image captioning. These metrics reportedly outperform lexical metrics like BLEU. The focus is placed on their reproducibility across different tasks:

  • BERTScore: Calculates token-level similarities between candidate and reference using BERT embeddings, aggregated by greedy alignment (see the sketch after this list).
  • MoverScore: Employs Word Mover Distance to gauge semantic similarities.
  • BaryScore: Utilizes Wasserstein Barycenter for embeddings.
  • SentSim: Evaluates cross-lingual semantic alignment.
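
To make the aggregation step concrete, below is a minimal, self-contained sketch of BERTScore-style greedy alignment over token embeddings. It uses random vectors in place of contextual BERT embeddings; the real metrics additionally depend on model choice, tokenization, layer selection, and optional IDF weighting, which are exactly the kinds of details the paper finds under-documented.

```python
import numpy as np

def greedy_alignment_score(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """BERTScore-style greedy alignment over token embeddings.

    cand_emb: (n_cand, d) candidate token embeddings
    ref_emb:  (n_ref, d)  reference token embeddings
    Returns precision, recall, and F1 based on cosine similarity.
    """
    # Normalize rows so that dot products equal cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # (n_cand, n_ref) token-to-token similarity matrix

    precision = sim.max(axis=1).mean()  # each candidate token matched to its best reference token
    recall = sim.max(axis=0).mean()     # each reference token matched to its best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: random vectors standing in for contextual BERT embeddings.
rng = np.random.default_rng(0)
cand_emb = rng.normal(size=(7, 768))  # 7 candidate tokens
ref_emb = rng.normal(size=(9, 768))   # 9 reference tokens
print(greedy_alignment_score(cand_emb, ref_emb))
```

MoverScore and BaryScore consume the same kind of token-level similarity structure but replace the greedy max with optimal-transport-style aggregation (Word Mover Distance and Wasserstein barycenters, respectively).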

Experimental Results and Observations

The paper rigorously tests these metrics, revealing several reproducibility issues most often caused by:

  • Undocumented preprocessing
  • Variability in reported competitor scores
  • Missing code/documentation

The experiments demonstrate that even minor undocumented preprocessing changes can substantially affect evaluation outcomes, complicating reproducibility on MT datasets, as exemplified in Figure 1.

Figure 1: CV_STOP for different BERT-based metrics across various language pairs.
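
Concretely, the claims being reproduced are correlations between metric scores and human judgments. The sketch below shows the Pearson variant of that computation on synthetic toy data and illustrates the pitfall reported for one metric, where scores were correlated against the wrong column of the results file, inflating the reported correlation by about 5 points; all column names and data here are illustrative, not taken from the paper.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic segment-level table standing in for a WMT-style results CSV;
# column names and values are illustrative only.
rng = np.random.default_rng(0)
human = rng.normal(size=200)                             # human judgment per segment
metric = human + rng.normal(scale=0.8, size=200)         # a noisy automatic metric
other = 0.9 * metric + rng.normal(scale=0.3, size=200)   # a second metric, highly correlated with the first

df = pd.DataFrame({
    "human_score": human,
    "metric_score": metric,
    "other_metric_score": other,
})

# Intended evaluation: correlate the metric with human judgments.
r_human, _ = pearsonr(df["metric_score"], df["human_score"])

# Pitfall: accidentally correlating against a different column still returns
# a plausible-looking coefficient -- here an inflated one.
r_wrong, _ = pearsonr(df["metric_score"], df["other_metric_score"])

print(f"correlation with human judgments: {r_human:.3f}")
print(f"correlation with the wrong column: {r_wrong:.3f}")
```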

Sensitivity Analysis

Preprocessing Impact

The paper provides an extensive analysis of preprocessing techniques such as tokenization, punctuation removal, stopword filtering, and IDF-weighting (a minimal sketch of these steps follows Figure 2). It highlights that:

  • Preprocessing significantly affects metric performance, especially in inflectional languages.
  • Using dataset-specific IDF weights generally yields better correlations with human judgments (see Figure 2).

Figure 2: RD(dis, ori) and RD(dis, pr), indicating sensitivity to punctuation and stopword removal.
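
To make these effects tangible, here is a minimal sketch of the preprocessing switches involved (punctuation removal, stopword filtering) and of dataset-specific IDF weighting. The whitespace tokenizer, the tiny stopword list, and the smoothed IDF formula are simplified placeholders, not the exact implementations used by any of the four metrics.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # placeholder list, not a metric's actual list

def tokenize(text: str, strip_punct: bool = True, drop_stopwords: bool = False):
    """Whitespace tokenization with optional punctuation and stopword removal."""
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.lower().split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def idf_weights(corpus, **tok_kwargs):
    """Dataset-specific IDF weights using one common smoothed variant:
    idf(t) = log((N + 1) / (df(t) + 1)), where df counts documents containing t."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc, **tok_kwargs)))
    return {t: math.log((n_docs + 1) / (c + 1)) for t, c in df.items()}

refs = ["The cat sat on the mat .", "A dog barked in the yard ."]
print(tokenize(refs[0], strip_punct=True, drop_stopwords=True))
print(idf_weights(refs, strip_punct=True))
```

Because an implementation may apply or skip such steps silently, and possibly only for some languages, two runs of nominally the same metric can diverge noticeably.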

Variability Across Languages

Experiments across multiple language pairs suggest that reproducibility issues are exacerbated in morphologically rich languages such as Finnish and Turkish.

Discussion

The paper critiques the gap between the reported performance of BERT-based metrics and what can actually be reproduced when the underlying methodologies and configurations are not clearly accessible or consistently applied. Practical recommendations are provided for improving reproducibility standards, including:

  • Emphasizing transparency in preprocessing documentation.
  • Encouraging authors to release complete implementations, including scripts and configurations.
  • Recommending uniform preprocessing practices across different metric evaluations.

Figure 3: Variations in performance metrics (WMT17-19) with and without proper IDF-weighting.

Conclusion

The paper concludes that although some progress has been made towards reproducibility in NLP metrics, considerable work is still necessary. The strong sensitivity of results to undocumented preprocessing choices points to deeper systemic issues in reproducibility. Moving forward, a consolidated approach involving detailed documentation, standard practices, and open access to resources is essential for reliable evaluation metrics in AI research, ensuring consistency across different experiments and studies.
