DADIT: A Dataset for Demographic Classification of Italian Twitter Users and a Comparison of Prediction Methods (2403.05700v1)

Published 8 Mar 2024 in cs.CL

Abstract: Social scientists increasingly use demographically stratified social media data to study the attitudes, beliefs, and behavior of the general public. To facilitate such analyses, we construct, validate, and release publicly the representative DADIT dataset of 30M tweets of 20k Italian Twitter users, along with their bios and profile pictures. We enrich the user data with high-quality labels for gender, age, and location. DADIT enables us to train and compare the performance of various state-of-the-art models for the prediction of the gender and age of social media users. In particular, we investigate if tweets contain valuable information for the task, since popular classifiers like M3 don't leverage them. Our best XLM-based classifier improves upon the commonly used competitor M3 by up to 53% F1. Especially for age prediction, classifiers profit from including tweets as features. We also confirm these findings on a German test set.

Summary

The paper introduces DADIT, a comprehensive dataset of Italian Twitter users enriched with rigorously verified demographic labels.
The paper demonstrates that incorporating tweet content improves prediction, especially for age classification, with the best XLM-based classifier improving on M3 by up to 53% F1.
The paper shows that the findings hold on a German test set and that ensembling XLM with M3 can further improve gender prediction.

The paper, titled "DADIT: A Dataset for Demographic Classification of Italian Twitter Users and a Comparison of Prediction Methods," introduces a new dataset named DADIT, which encompasses around 30 million tweets from 20,000 Italian Twitter users. The dataset is enriched with high-quality demographic labels for gender, age, and location, addressing crucial needs in social science research that relies on stratified social media data.

Dataset Overview

DADIT is a representative dataset built for demographic analysis. It includes users' tweets as well as their bios and profile pictures. The gender and age labels were manually verified to ensure their accuracy. Gender labels were derived from the full name field, exploiting the fact that Italian first names are usually gender-specific. Age labels were generated with regex patterns that match statements of age or birth year in a user's bio or tweets. The authors validate that the dataset is representative of the broader Italian Twitter user base in terms of demographic characteristics.
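To make the regex-based age labeling concrete, here is a minimal sketch in Python. The two patterns and the function name extract_birth_year are illustrative assumptions; the paper's actual patterns are not given in this summary.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not the paper's actual regexes.
# "ho 25 anni" = "I am 25 years old"
AGE_STATEMENT = re.compile(r"\bho\s+(\d{1,2})\s+anni\b", re.IGNORECASE)
# "nato/nata nel 1990" = "born in 1990"; "classe 1985" = "born in 1985"
BIRTH_YEAR = re.compile(
    r"\b(?:nat[oa]\s+nel|classe)\s+(19\d{2}|20\d{2})\b", re.IGNORECASE
)


def extract_birth_year(text: str, text_year: int) -> Optional[int]:
    """Return an estimated birth year, or None if no pattern matches.

    text_year is the year the bio or tweet was written; it is needed to
    convert a stated age into a birth year (accurate to within one year).
    """
    match = BIRTH_YEAR.search(text)
    if match:
        return int(match.group(1))
    match = AGE_STATEMENT.search(text)
    if match:
        return text_year - int(match.group(1))
    return None


print(extract_birth_year("Ho 25 anni e amo il calcio", 2020))
print(extract_birth_year("Nata nel 1988, romana", 2020))
```

Patterns like these produce false positives (for example, "ho 3 anni di esperienza" states years of experience, not age), which is one reason a manual verification step is useful.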

Methodology for Gender and Age Classification

The paper highlights the performance of various models trained on this dataset for predicting user demographics, with a focus on gender and age. The primary models considered include:

M3 Classifier: A state-of-the-art multimodal model that uses user bios, profile pictures, and optionally usernames for demographic prediction. It does not use tweets, and its performance suffers when some of this profile information is missing.
CV (Computer Vision) Model: Utilizes profile pictures to infer demographic attributes.
XLM (Transformer Model): A classifier based on a fine-tuned twitter-XLM-roberta-base, which makes extensive use of users' bios and tweets for prediction.
Flan-T5 and GPT3.5: State-of-the-art LLMs tested in zero-shot and few-shot settings to classify demographics based on text inputs.
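As a rough illustration of the zero-shot and few-shot settings, the sketch below assembles a text prompt from a user's bio and tweets. The prompt wording, the label names, and the function build_prompt are assumptions made for illustration; the prompts actually used in the paper are not given in this summary.

```python
from typing import Sequence, Tuple


def build_prompt(
    bio: str,
    tweets: Sequence[str],
    examples: Sequence[Tuple[str, str]] = (),
    max_tweets: int = 5,
) -> str:
    """Build a classification prompt; no examples means zero-shot."""
    lines = ["Classify the gender of the Twitter user as 'male' or 'female'.", ""]
    # Few-shot: prepend labeled demonstrations in the same format as the query.
    for example_text, example_label in examples:
        lines += [f"User: {example_text}", f"Gender: {example_label}", ""]
    # Join the bio and the first few tweets into one text field.
    user_text = " | ".join([bio, *tweets[:max_tweets]])
    lines += [f"User: {user_text}", "Gender:"]
    return "\n".join(lines)


zero_shot = build_prompt("Mamma di due", ["Buongiorno!"])
few_shot = build_prompt(
    "Mamma di due", ["Buongiorno!"], examples=[("Papà e tifoso", "male")]
)
print(few_shot)
```

The resulting string would then be sent to the language model, and the generated continuation parsed as the predicted label.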

The findings suggest that incorporating tweet data significantly improves model performance, particularly for age classification. The best-performing model was the XLM-based classifier, which achieved an F1-score improvement of up to 53% over the M3 classifier for age prediction and remained effective even when evaluated on German Twitter data.
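For readers unfamiliar with the metric, the following sketch computes a macro-averaged F1-score in plain Python. Macro averaging is an assumption here, since this summary does not state which F1 variant the paper reports, and the age-bracket labels in the example are made up.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero because
        # every label occurs at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Made-up example with two age brackets.
y_true = ["19-29", "19-29", "30-39", "30-39"]
y_pred = ["19-29", "30-39", "30-39", "30-39"]
print(macro_f1(y_true, y_pred))
```

Macro averaging weights every class equally, which matters when some age groups are much rarer than others.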

Experimental Results

The experimental results underscore three crucial aspects:

Significant Improvement with Tweet Inclusion: Both gender and age classifiers improved when tweets were included as features, with the gains most pronounced for age. The fine-tuned XLM model achieved higher F1-scores than the M3 model on both tasks.
Model Robustness: The XLM model trained on Italian data generalized well to a German test set, which suggests that the approach is robust across languages and highlights the value of pairing a demographically annotated dataset like DADIT with a multilingual base model.
Ensemble Learning: Further performance gains, particularly in gender classification, were observed through ensemble methods, combining the predictions from XLM and M3 models. However, such gains were not evident in age predictions.
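One simple way to combine two classifiers is soft voting, that is, a weighted average of their class probabilities. The sketch below illustrates that idea only; it is not necessarily the ensembling scheme used in the paper, and the function name, weight parameter, and probabilities are hypothetical.

```python
from typing import Dict, Tuple


def ensemble_predict(
    p_xlm: Dict[str, float],
    p_m3: Dict[str, float],
    weight_xlm: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Soft voting: average class probabilities, then take the argmax."""
    if p_xlm.keys() != p_m3.keys():
        raise ValueError("both models must score the same set of labels")
    combined = {
        label: weight_xlm * p_xlm[label] + (1.0 - weight_xlm) * p_m3[label]
        for label in p_xlm
    }
    best_label = max(combined, key=combined.get)
    return best_label, combined


# Made-up probabilities where the two models disagree.
text_probs = {"female": 0.6, "male": 0.4}
profile_probs = {"female": 0.3, "male": 0.7}
print(ensemble_predict(text_probs, profile_probs))
print(ensemble_predict(text_probs, profile_probs, weight_xlm=0.8))
```

The weight would normally be tuned on a validation set; a weight close to 1 reduces the ensemble to the text model alone.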

Implications and Future Directions

The construction and release of DADIT have broad implications for computational social science and NLP research. The dataset provides a rich resource for developing models that require robust demographic information. Its potential extends beyond Italy, as demonstrated by the successful application of models trained on Italian data to German Twitter users.

Future research could explore the following avenues:

Enhanced Multimodal Approaches: Further improving multimodal models by training vision models on DADIT directly, rather than relying on pre-trained weights.
Advanced Ensemble Methods: Developing sophisticated ensemble techniques that synergize text and image models more effectively.
Broader Application: Extending demographic prediction models to other languages and cultural contexts using similar datasets.

Conclusion

DADIT not only fills a gap in the resources needed for demographic analysis on social media but also provides evidence that tweet content matters for predicting demographic attributes accurately. The paper demonstrates that a transformer language model fine-tuned on bios and tweets can significantly outperform the commonly used multimodal classifier M3, paving the way for more accurate and inclusive social media analytics.

The released dataset and the accompanying findings give researchers in computational social science, NLP, and related fields a resource for stratifying user data demographically and conducting more fine-grained analyses.

Overall, the paper makes the case for including tweet text in demographic prediction models, which in turn improves the reliability of demographically stratified studies of behavior on social media platforms.
