- The paper demonstrates that fine-tuning BERT on the expert-labeled BABE dataset combined with distant supervision achieves a macro F1-score of 0.804.
- It introduces the BABE dataset with 3,700 expert-annotated sentences and utilizes bias-specific word embeddings to enhance classification accuracy.
- The approach outperforms traditional feature-based methods, offering a robust framework for analyzing media bias in politically polarized content.
The paper addresses the challenge of detecting media bias in online news articles by introducing a new dataset, BABE (Bias Annotations By Experts), as a robust benchmark and by evaluating neural network classifiers built primarily on the BERT architecture. The significant improvement in bias detection performance comes from combining expert-labeled data with a pre-training step that uses distant supervision to learn bias-specific word embeddings.
Motivation and Background
Media bias, particularly through word choice, significantly influences public perceptions and can shape societal discourse. Detecting such bias automatically is complex due to the subjective nature of language and the absence of high-quality labeled datasets. Previous attempts often relied on crowd-sourced annotations, resulting in varying quality due to the annotators' lack of domain expertise. The paper presents BABE as an updated resource with annotations by trained experts, improving annotation quality and inter-annotator agreement.
BABE Dataset and Methodology
The BABE dataset comprises 3,700 annotated sentences, collected and labeled by experts to capture bias at both the word and sentence level. Annotations focused on controversial topics from diverse U.S. media outlets during a period of heightened political polarization. The data collection and annotation process emphasized training annotators to distinguish bias expertly and reliably.
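The summary notes that expert annotation improved inter-annotator agreement. As an illustration only (the annotator data and the specific agreement metric reported in the paper are not given here), a minimal sketch of one common pairwise agreement measure, Cohen's kappa, with hypothetical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of items where both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations: 1 = biased, 0 = neutral, for 8 sentences
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 0, 1, 1]
print(cohens_kappa(a, b))  # → 0.5
```

Values near 1 indicate strong agreement beyond chance; expert annotators are expected to score higher than untrained crowd workers on a subjective task like bias labeling.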
Figure 1: Data collection and annotation workflow highlighting expert-based bias annotation methodology.
Distant Supervision
To augment the classifier's ability to detect bias, a distant supervision approach is employed in pre-training. This involves utilizing a large corpus of news headlines, automatically labeled based on the political leaning of their source, to embed bias information into word vectors. This approach allows leveraging large-scale, noisily-labeled data to improve the initial embedding representations for more effective downstream task learning.
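The labeling step described above can be sketched in a few lines. The outlet names and leaning map below are hypothetical placeholders, not the paper's actual source list; the point is only that labels come from the publishing source, not from human annotation, which is why they are noisy:

```python
# Hypothetical outlet-to-leaning map (assumed for illustration; the paper
# derives labels from the political leaning of each headline's source).
OUTLET_LEANING = {
    "outlet_left": "left",
    "outlet_right": "right",
}

def distant_labels(headlines):
    """Attach a noisy leaning label to each headline based on its outlet."""
    return [
        (text, OUTLET_LEANING.get(source, "unknown"))
        for source, text in headlines
    ]

corpus = [
    ("outlet_left", "Senator slams reckless new policy"),
    ("outlet_right", "Radical plan threatens taxpayers"),
    ("outlet_center", "Committee votes on budget bill"),
]
print(distant_labels(corpus))
```

Every headline from a given outlet inherits that outlet's label, so individual labels are frequently wrong, but at scale the corpus still carries enough signal to shape bias-aware embeddings before fine-tuning on BABE.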
Neural Network Models
The paper evaluates several transformer-based architectures, including BERT, its variants (DistilBERT, RoBERTa), and other models such as ELECTRA and XLNet. The primary focus is on fine-tuning these models on BABE and assessing their performance through macro F1-scores.
Implementation Details
The models were implemented using the HuggingFace Transformers library and fine-tuned by optimizing a cross-entropy loss. Training ran on a Tesla T4 GPU, with models converging within approximately five hours. BERT and RoBERTa showed the most promise, benefiting substantially from the distant supervision pre-training.
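The cross-entropy objective mentioned above has a simple closed form for the binary biased/neutral case. A minimal pure-Python sketch (the probabilities and labels are illustrative, not from the paper):

```python
import math

def binary_cross_entropy(p_biased, label):
    """Cross-entropy loss for one sentence: label 1 = biased, 0 = neutral."""
    # Clamp the predicted probability to avoid log(0)
    p = min(max(p_biased, 1e-12), 1 - 1e-12)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# A confident correct prediction incurs low loss...
low = binary_cross_entropy(0.9, 1)
# ...while the same confidence on the wrong class is penalized heavily.
high = binary_cross_entropy(0.9, 0)
print(low < high)  # → True
```

During fine-tuning, this loss is averaged over a batch and minimized by gradient descent over the transformer's weights plus a small classification head.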
Results and Analysis
The BERT model, especially when enhanced with distant supervision, significantly outperforms traditional feature-based approaches. BERT achieves a macro F1-score of 0.804 on the expanded dataset (SG2), showcasing the potential of integrating distant supervision to learn more nuanced bias-specific embeddings. Meanwhile, the baseline model using feature engineering techniques shows inferior performance, highlighting the limitations of traditional methods in handling complex bias phenomena.
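Macro F1, the evaluation metric reported throughout, averages per-class F1 without weighting by class frequency, so the rarer "biased" class counts as much as the "neutral" one. A self-contained sketch with made-up predictions:

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical labels: 1 = biased, 0 = neutral
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(macro_f1(y_true, y_pred))
```

This is equivalent to scikit-learn's `f1_score(..., average="macro")`; the pure-Python version is shown only to make the averaging explicit.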
Conclusions
The paper confirms that neural models, given high-quality annotations and task-specific pre-training, can substantially advance the state-of-the-art in media bias detection. The combination of distant supervision and expert-labeled datasets forms a compelling strategy, yielding superior representation learning and offering a robust framework for bias analysis in media content. The strategies proposed in this paper, including dataset publication and methodological innovations, pave the way for further advancements in analyzing subjective content using AI.