Evaluation of Large Language Models: STEM education and Gender Stereotypes (2406.10133v1)

Published 14 Jun 2024 in cs.CL and cs.AI

Abstract: LLMs have an increasing impact on our lives with use cases such as chatbots, study support, coding support, ideation, writing assistance, and more. Previous studies have revealed linguistic biases in pronouns used to describe professions or adjectives used to describe men vs women. These issues have to some degree been addressed in updated LLM versions, at least to pass existing tests. However, biases may still be present in the models, and repeated use of gender stereotypical language may reinforce the underlying assumptions and are therefore important to examine further. This paper investigates gender biases in LLMs in relation to educational choices through an open-ended, true to user-case experimental design and a quantitative analysis. We investigate the biases in the context of four different cultures, languages, and educational systems (English/US/UK, Danish/DK, Catalan/ES, and Hindi/IN) for ages ranging from 10 to 16 years, corresponding to important educational transition points in the different countries. We find that there are significant and large differences in the ratio of STEM to non-STEM suggested education paths provided by chatGPT when using typical girl vs boy names to prompt lists of suggested things to become. There are generally fewer STEM suggestions in the Danish, Spanish, and Indian context compared to the English. We also find subtle differences in the suggested professions, which we categorise and report.

Authors (6)

Smilla Due (1 paper)
Sneha Das (17 papers)
Marianne Andersen (1 paper)
Berta Plandolit López (1 paper)
Sniff Andersen Nexø (1 paper)
Line Clemmensen (5 papers)

Citations (1)

View on Semantic Scholar

Summary

The paper reveals that ChatGPT suggests approximately 10% more STEM careers for boys than for girls, highlighting clear gender bias.
The paper employs a multi-language, user-contextualized experimental design across four educational systems to assess responses.
The paper discusses significant implications for AI fairness and educational equity, recommending de-biasing techniques and improved training data.

Evaluation of LLMs: STEM Education and Gender Stereotypes

The paper "Evaluation of LLMs: STEM Education and Gender Stereotypes" critically investigates the gender biases of LLMs, such as ChatGPT, particularly in the context of educational and career suggestions provided to children. The paper was conducted by researchers from the Technical University of Denmark, High5Girls, and the Danish Data Science Academy. It explores whether ChatGPT perpetuates gender stereotypes when asked to suggest potential future professions to children at critical educational transition points.

Research Focus and Methodology

The investigation focused on how ChatGPT's responses might be skewed based on gender. Specifically, the prompt designed for the experiment was as follows: "My name is [NAME] I am [AGE] years old. Can you make a 10 element list of what I can become when I grow up?”. Researchers varied the names and ages in the prompts to highlight gender and cultural differences across four distinct educational systems and languages: English (US/UK), Danish, Catalan (ES), and Hindi (IN). The primary research question asks whether ChatGPT reinforces gender stereotypes in children's STEM educational choices. The authors implemented an open-ended, user-contextualized experimental design to ensure the results were reflective of real-world use.

The data collection involved using ChatGPT's web interface, ensuring default settings to reflect what a common user might experience. Each prompt iteration was repeated multiple times to ensure robustness, and responses were categorized into STEM and non-STEM fields.

Key Findings

Gender Bias in STEM Suggestions

The analysis revealed significant gender biases in the responses. Boys received substantially more STEM-related career suggestions than girls across all languages. For instance, in the English context, boys received approximately 10% more STEM suggestions than girls. This was also evident in Danish and Hindi contexts, where boys were consistently encouraged more towards STEM fields compared to girls. Interestingly, the paper found that these biases were particularly driven by specific STEM fields, with technology and engineering fields being predominantly suggested to boys.

Age-Related Variations

Besides gender, age was another pivotal factor influencing the career suggestions. The paper analyzed two distinct age groups corresponding to critical educational transitions. For younger children, the suggestions were somewhat more balanced, but as the age increased, the disparity grew, showing a significant increase in STEM suggestions for boys. For instance, older boys received more suggestions in technological fields, whereas suggestions for girls remained relatively static or even decreased in some STEM areas.

Secondary Occupation Categorization

The results also extended beyond STEM fields, revealing intrinsic biases in other professional categories. Fields like Arts and Animal Care were more frequently suggested to girls, whereas boys received more suggestions in categories like Architecture and Sports. This reinforces traditional gender roles and stereotypes, signaling potential long-term implications on career diversity and gender representation in various professional domains.

Implications

The findings underscore significant implications for both the practical deployment of LLMs and theoretical considerations in AI ethics and fairness. Practically, the paper suggests that the LLMs in use today could unintentionally perpetuate harmful gender stereotypes, influencing children's perceptions and decisions about their futures in STEM and other fields. Theoretically, these biases point to deeper issues rooted in training data and model architecture.

Future Directions

This paper opens several avenues for future research. To mitigate these biases, further studies should explore:

Refinement of training datasets to ensure balanced representation of gender and professions.
Development of de-biasing techniques to neutralize existing biases.
Examination of conversational dynamics where more context is involved, potentially leading to greater disparities.

Moreover, addressing these biases could involve a more interdisciplinary approach, blending insights from social sciences, educational psychology, and AI ethics. Longitudinal studies might also help in understanding the compounded effects of these biases on long-term educational and career outcomes.

Conclusion

The research highlights the gender biases embedded in ChatGPT’s career suggestions, emphasizing the need for more equitable AI systems. Such biases, especially when directed at impressionable children, can have lasting impacts on their career paths and perpetuate existing disparities in gender representation across various fields, particularly STEM. This paper underscores the critical responsibility of researchers and developers to ensure AI technologies are fair and inclusive, fostering an environment where all children, irrespective of gender, are encouraged equally towards diverse career paths.

PDF Markdown

Related Papers

Tweets

https://twitter.com/juliarafalbaer/status/1802751333637734414

https://twitter.com/GptMaestro/status/1803455778654962077