Emergent Mind

GPT-4 passes most of the 297 written Polish Board Certification Examinations

(2405.01589)
Published Apr 29, 2024 in cs.CL and cs.AI

Abstract

Introduction: Recently, the effectiveness of LLMs has increased rapidly, enabling their use in a wide range of applications. However, the risks posed by the generation of false information through LLMs significantly limit their applications in sensitive areas such as healthcare, highlighting the necessity for rigorous validation to determine their utility and reliability. To date, no study has extensively compared the performance of LLMs on Polish medical examinations across a broad spectrum of specialties on a very large dataset. Objectives: This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Państwowy Egzamin Specjalizacyjny, PES) dataset, which consists of 297 tests. Methods: We developed a software program to download and process PES exams and tested the performance of GPT models using the OpenAI Application Programming Interface. Results: Our findings reveal that GPT-3.5 did not pass any of the analyzed exams. In contrast, the GPT-4 models demonstrated the capability to pass the majority of the exams evaluated, with the most recent model, gpt-4-0125, successfully passing 222 (75%) of them. The performance of the GPT models varied significantly, displaying excellence in exams related to certain specialties while completely failing others. Conclusions: The significant progress and impressive performance of LLM models hold great promise for the increased application of AI in the field of medicine in Poland. For instance, this advancement could lead to the development of AI-based medical assistants for healthcare professionals, enhancing the efficiency and accuracy of medical services.

Overview

  • The paper examines GPT models' performance on the Polish Board Certification Exam across 57 medical specialties, focusing on adaptability and accuracy in a non-English language context.

  • GPT-4 showed significant improvements over its predecessors, with two versions passing a majority of the exams, whereas GPT-3.5 failed all attempts.

  • The study highlighted performance variability across medical specialties, potential data contamination issues, and the impact of test settings on reproducibility.

Understanding GPT-4's Performance on Polish Medical Board Exams

Overview of GPT Models on the Polish Board Certification Exam (PES)

The study assessed the performance of several Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam across 57 medical specialties. The evaluation spanned a large dataset of 297 exams, measuring how accurately these models answer medical exam questions in Polish, a non-English context that adds a layer of linguistic complexity.

Evolution and Capabilities of GPT Models

  • GPT-3.5: The earlier model, GPT-3.5, failed every exam attempt. This result underscores the scale of the improvements that were subsequently integrated into later versions.
  • GPT-4 Versions: In contrast, two versions of GPT-4 were tested. The variant labeled gpt-4-0613 passed 184 of the 297 exams, a clear majority. The more recent gpt-4-0125-preview passed 222 of the 297 tests, a notable improvement amounting to roughly a 75% success rate.
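As a quick check on the reported figures, the pass rates follow directly from the counts in the summary above (a minimal sketch; the model labels and counts are taken from this article):

```python
# Pass counts reported for the 297 PES exams.
TOTAL_EXAMS = 297

results = {
    "gpt-3.5": 0,              # failed every exam
    "gpt-4-0613": 184,         # passed a clear majority
    "gpt-4-0125-preview": 222, # best-performing variant
}

for model, passed in results.items():
    rate = 100 * passed / TOTAL_EXAMS
    print(f"{model}: {passed}/{TOTAL_EXAMS} = {rate:.1f}%")
# gpt-4-0613 works out to 62.0%, gpt-4-0125-preview to 74.7% (~75%).
```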

Specialty-Specific Performance Insights

The discrepancies in performance across specialties highlight GPT-4's uneven understanding and response capability. Areas such as Family Medicine saw better results, potentially due to the broader and more general nature of the questions, whereas the model performed poorly in more specialized fields such as Maxillofacial Surgery and Orthodontics. This suggests that GPT-4's training data may cover general medicine more comprehensively than specialized domains.

Methodological Approach in Testing

The models were tested in a blind setup with multiple-choice questions and no access to real-time internet data, which isolates the model's stored knowledge and its ability to apply learned information logically and contextually to medical scenarios. The exclusion of some exams due to digital-conversion issues underscores the need for robust data preparation in future testing.
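A minimal sketch of such a blind, offline testing loop using the OpenAI API. The prompt layout, answer-letter parsing, and helper names here are illustrative assumptions, not the paper's exact implementation:

```python
import re

def build_prompt(question, choices):
    """Render one multiple-choice question as plain text (illustrative format)."""
    lines = [question] + [f"{letter}. {text}" for letter, text in sorted(choices.items())]
    lines.append("Answer with a single letter (A-E).")
    return "\n".join(lines)

def extract_answer(reply):
    """Pull the first standalone answer letter out of the model's reply."""
    match = re.search(r"\b([A-E])\b", reply)
    return match.group(1) if match else None

def ask_model(question, choices, model="gpt-4-0125-preview"):
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce sampling randomness, as in the study
        messages=[{"role": "user", "content": build_prompt(question, choices)}],
    )
    return extract_answer(response.choices[0].message.content or "")
```

Scoring is then a matter of comparing the extracted letter against the official answer key for each exam and applying the PES pass threshold.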

Potential Limitations and Challenges

  • Data Contamination: A critical point to consider is data contamination. The most recent model, gpt-4-0125-preview, trained on data up to December 2023, may have had access to the PES questions inadvertently through large-scale data scraping, potentially skewing its higher success rate.
  • Reproducibility and Consistency: Variability in model performance due to settings like 'temperature' during testing may affect reproducibility. The use of a zero temperature setting aimed to reduce randomness but does not guarantee absolute determinism in responses.
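Because temperature 0 does not guarantee identical outputs, one simple mitigation is to repeat each question several times and take the majority answer while recording the agreement rate. A minimal sketch (the answer list is illustrative; in practice it would come from repeated API calls):

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common answer and its agreement rate across repeated runs."""
    counts = Counter(answers)
    answer, hits = counts.most_common(1)[0]
    return answer, hits / len(answers)

# e.g. five repeated runs of the same question at temperature 0:
runs = ["B", "B", "B", "D", "B"]
answer, agreement = majority_answer(runs)
print(answer, agreement)  # B 0.8
```

A low agreement rate flags questions whose reported result is sensitive to sampling noise and may not reproduce.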

Implications and Theoretical Applications

Major implications of this study include the prospective application of GPT-4 in supporting medical education and practice, particularly in non-English-speaking settings. While AI models shouldn't replace medical professionals, they can serve as valuable educational aids, quick reference tools, or preliminary diagnostic support, thereby enhancing medical service efficiency, especially in resource-limited or linguistically diverse settings.

Future Directions

Further research could involve more expansive and rigorous testing across all available specialties, potentially integrating OCR technologies to include tests not currently in machine-readable format. Another interesting avenue could be testing AI models with internet access during exams to simulate real-world data usage and decision-making processes.

Concluding Thoughts

GPT-4's ability to pass a significant portion of Polish medical specialty exams is promising for the future role of AI in medical education and assistance. Continued advances in AI training methodology, transparency in data usage, and bias mitigation are essential for expanding these technologies into practical healthcare applications responsibly. As AI grows more capable of assisting in medical decision-making, human oversight remains indispensable, ensuring that AI supports rather than supplants human expertise.
