BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams

Published 11 Jul 2023 in cs.CL | (2307.05410v1)

Abstract: One common trend in recent studies of LMs is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, few such evaluations have been conducted in Portuguese. This is mainly due to the lack of high-quality datasets available to the community for carrying out evaluations in Portuguese. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX

Summary

The paper presents a benchmark built from the entrance exams of two leading Brazilian universities to assess NLP models in Portuguese, with annotations that also support multimodal evaluation.
The dataset comprises over 1,000 annotated multiple-choice questions, including image-based queries that test text comprehension, mathematical reasoning, and cultural knowledge.
Experimental results show that GPT-4 outperforms the other models tested yet falls short of the cutoff scores for the most competitive courses, highlighting key challenges for future NLP research.

Evaluating LLMs with the BLUEX Benchmark

The paper "BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams" presents a dataset that addresses the shortage of high-quality standardized resources for evaluating NLP models in Portuguese. The dataset, BLUEX, is built from the entrance exams of two leading Brazilian universities, UNICAMP and USP, and covers exams administered from 2018 to 2023. Because Portuguese is the fifth most spoken language worldwide, BLUEX is a significant contribution to NLP research for this language.

Significance of the BLUEX Dataset

The motivation behind BLUEX is to provide a rigorous benchmark for evaluating LMs on a variety of subjects in a real-world educational setting. The dataset contains over 1,000 multiple-choice questions, each annotated so that LMs can be evaluated along separate dimensions such as text comprehension, image understanding, and mathematical reasoning. The metadata flags the capabilities a question requires, including domain-specific knowledge, reasoning skills, and knowledge of Brazilian culture and history, so that results can be broken down by capability rather than reported only as a single score.

The dataset's uniqueness lies in its incorporation of multimodal elements, as it includes image-based questions that require models to process and interpret visual as well as textual information. This feature is particularly crucial given the increasing interest in developing multimodal models capable of integrating diverse data formats for more effective reasoning and understanding.
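To make the annotation scheme concrete, here is a minimal sketch of how one annotated question could be represented and filtered. The class name, field names, and the `select` helper are hypothetical and chosen for illustration; they are not the schema of the released dataset. The tags "MR" and "BK" follow the abbreviations this summary uses for Mathematical Reasoning and Brazilian Knowledge, and the sample questions are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ExamQuestion:
    """One multiple-choice question with capability flags (hypothetical schema)."""

    question: str
    alternatives: list[str]
    answer: str  # letter of the correct alternative, e.g. "B"
    subject: str
    has_images: bool = False
    # Capability tags such as "MR" (Mathematical Reasoning) or
    # "BK" (Brazilian Knowledge); the set-of-strings encoding is an assumption.
    capabilities: set[str] = field(default_factory=set)


def select(questions, capability=None, text_only=False):
    """Return questions matching an optional capability tag and modality filter."""
    selected = []
    for q in questions:
        if capability is not None and capability not in q.capabilities:
            continue
        if text_only and q.has_images:
            continue
        selected.append(q)
    return selected


# Invented sample items, for illustration only.
sample = [
    ExamQuestion("Quanto é 2 + 2?", ["3", "4", "5", "6"], "B",
                 "mathematics", capabilities={"MR"}),
    ExamQuestion("Em que ano o Brasil declarou sua independência?",
                 ["1500", "1822", "1889", "1922"], "B",
                 "history", capabilities={"BK"}),
    ExamQuestion("Qual é a área da figura?", ["2", "4", "6", "8"], "C",
                 "mathematics", has_images=True, capabilities={"MR"}),
]

math_questions = select(sample, capability="MR")
text_only_questions = select(sample, text_only=True)
```

Filtering on the image flag is what lets a text-only model be scored on a fair subset, while the full set remains available for multimodal models.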

Experimental Validation and Results

Experiments with several state-of-the-art LMs, including OpenAI's GPT-4 and GPT-3.5-Turbo and several open-source models, establish BLUEX as a benchmark for measuring LM performance in Portuguese. GPT-4 performed best but still fell short of the scores required for admission to the most competitive courses. This shows that the dataset exposes the difficulties LMs still face on Portuguese exam questions.

The results show that although LMs like GPT-4 achieve impressive scores, they have yet to reach the cutoff scores needed for the most competitive courses, such as medicine, which leaves room for the benchmark to push current LM capabilities further. Moreover, breaking model performance down by the annotated metadata, for example by questions tagged as requiring Mathematical Reasoning (MR) or Brazilian Knowledge (BK), reveals the specific areas where models need substantial improvement. For instance, questions that depend on mathematical reasoning remain challenging across models, which makes them a promising target for future research.
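The per-capability analysis described above can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the `(tags, is_correct)` input format, the function names, and all numbers are assumptions made up for the example, and no figure here comes from the paper.

```python
from collections import defaultdict


def accuracy_by_tag(results):
    """Compute accuracy per capability tag.

    results: iterable of (tags, is_correct) pairs, where tags is a set of
    capability labels such as {"MR"} and is_correct is a bool.
    A question with several tags counts toward each of them.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for tags, is_correct in results:
        for tag in tags:
            total[tag] += 1
            correct[tag] += int(is_correct)
    return {tag: correct[tag] / total[tag] for tag in total}


def meets_cutoff(num_correct, num_questions, cutoff_fraction):
    """Check whether a score reaches a course cutoff, given as a fraction."""
    return num_correct / num_questions >= cutoff_fraction


# Invented outcomes, for illustration only.
results = [
    ({"MR"}, False),
    ({"MR"}, True),
    ({"BK"}, True),
    ({"MR", "BK"}, False),
]
per_tag = accuracy_by_tag(results)

# Hypothetical exam of 90 questions with a hypothetical 80% cutoff.
passes = meets_cutoff(70, 90, 0.8)
```

Reporting one accuracy per tag makes weaknesses such as mathematical reasoning visible even when the overall score looks strong.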

Future Directions

The paper outlines several avenues for future research. The authors propose further exploration of few-shot and zero-shot settings to assess whether they improve model performance. In addition, chain-of-thought prompting, which related studies have found beneficial, might yield improvements when applied to BLUEX.
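The three prompting settings mentioned above differ only in how the prompt is assembled, which a short sketch can show. The template, the Portuguese cue phrases, and the function names below are assumptions for illustration; the paper's actual prompts may differ, and the example questions are invented.

```python
def format_question(question, alternatives):
    """Render a question and its lettered alternatives as plain text."""
    letters = "ABCDE"
    lines = [f"Questão: {question}"]
    lines += [f"{letters[i]}) {alt}" for i, alt in enumerate(alternatives)]
    return "\n".join(lines)


def build_prompt(question, alternatives, examples=(), chain_of_thought=False):
    """Assemble a prompt for one multiple-choice question.

    examples: solved (question, alternatives, answer_letter) triples.
        Empty gives a zero-shot prompt; non-empty gives a few-shot prompt.
    chain_of_thought: if True, end with a cue asking for step-by-step
        reasoning instead of asking for the answer letter directly.
    """
    parts = []
    for ex_question, ex_alternatives, ex_answer in examples:
        solved = format_question(ex_question, ex_alternatives)
        parts.append(f"{solved}\nResposta: {ex_answer}")
    cue = "Vamos pensar passo a passo." if chain_of_thought else "Resposta:"
    parts.append(format_question(question, alternatives) + "\n" + cue)
    return "\n\n".join(parts)


# Invented items, for illustration only.
target = ("Quanto é 3 x 3?", ["6", "9", "12", "15"])
solved_example = ("Quanto é 2 + 2?", ["3", "4", "5", "6"], "B")

zero_shot = build_prompt(*target)
few_shot = build_prompt(*target, examples=[solved_example])
cot = build_prompt(*target, chain_of_thought=True)
```

Keeping the question formatting in one function means the zero-shot, few-shot, and chain-of-thought variants differ only in the solved examples and the final cue, so a comparison between them isolates the prompting strategy.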

Because it includes multimodal components, BLUEX also supports the development of models that interpret and integrate textual and visual data. Such models matter for applications that require reasoning across more than one type of data.

Conclusion

BLUEX fills a significant gap in contemporary NLP research by providing a structured, annotated benchmark for evaluating LMs in Portuguese. By covering diverse subjects and modalities, the dataset offers insight into the strengths and limitations of current models and guides researchers toward more capable and culturally aware LMs. As an openly available resource, BLUEX is expected to support progress in NLP, particularly for languages that have so far been underserved by evaluation benchmarks.
