ChatGPT's Potential in Navigating the Complexity of the Polish Anaesthesiology Specialist Examination
Ain-Shams Journal of Anesthesiology
Volume 17, Issue 1, January 2025, Pages 1-4
Document Type: Original Article
DOI: 10.21608/asja.2024.290716.1111
Authors
Michał Bielowka
1 Student Scientific Association of Computer Analysis and Artificial Intelligence at the Department of Radiology and Nuclear Medicine of the Medical University of Silesia in Katowice
2 Department of Radiodiagnostics, Interventional Radiology and Nuclear Medicine
3 Dr B. Hager Memorial Multi-specialty District Hospital, Pyskowicka 47-51, 42-600 Tarnowskie Góry, Poland
4 Faculty of Medical Sciences in Katowice, Medical University of Silesia, 40-752 Katowice, Poland
Abstract
Purpose: This study assesses the capability of an artificial intelligence (AI) model, specifically ChatGPT-3.5, to answer questions from the test section of the Polish National Specialist Examination (PES) in anaesthesiology and intensive care.
Materials and Methods: A pool of 118 questions from the spring 2023 PES exam was used. Bloom's classification was employed to categorize questions by the cognitive skill required: comprehension, critical thinking, or memory. The questions were then presented to ChatGPT-3.5 in five independent sessions, and statistical analyses were conducted to assess correlations between the model's confidence, question difficulty, and the correctness of its answers.
Results: ChatGPT-3.5 achieved an overall accuracy of 47.5%, with variations across question types and subtypes. A significant correlation was found between the model's confidence and the correctness of its answers. However, no correlation was observed between the certainty index and either question difficulty or answer correctness when broken down by category or subcategory.
Conclusions: ChatGPT-3.5 exhibited moderate performance but fell short of the 60% threshold required to pass the PES exam. Comparison with similar AI studies conducted on Japanese examinations suggests stronger performance on the Polish exam, albeit still below expert level. Human candidates consistently outperformed the AI model, indicating the current superiority of human expertise in this domain. Despite these limitations, continued research and collaboration offer promising prospects for AI integration in medical practice, supporting diagnostics, therapeutics, and patient care.
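To make the evaluation protocol concrete, the sketch below shows one plausible way to run the pipeline the abstract describes: each question is submitted in a fresh session, repeated five times, answers are scored against the key, and a point-biserial correlation relates self-reported confidence to correctness. This is a minimal illustration, not the authors' code; the use of the OpenAI chat API, the prompt format, the 1-5 confidence scale standing in for the paper's "certainty index", and all helper names are assumptions.

```python
"""Illustrative sketch of the evaluation described in the abstract.
Assumptions (not from the paper): questions go through the OpenAI chat
API, and the certainty index is a self-reported 1-5 confidence score."""
import re
from openai import OpenAI
from scipy.stats import pointbiserialr

client = OpenAI()  # reads OPENAI_API_KEY from the environment
N_SESSIONS = 5     # the study used five independent sessions

PROMPT = (
    "Answer the multiple-choice question with a single letter (A-E), "
    "then state your confidence from 1 (guess) to 5 (certain), "
    "e.g. 'B 4'.\n\n{q}"
)

def ask(question: str) -> tuple[str, int]:
    """Send one question in a fresh chat; parse 'LETTER CONFIDENCE'."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(q=question)}],
    )
    text = resp.choices[0].message.content.strip()
    m = re.search(r"\b([A-E])\b\D*([1-5])", text)
    if m is None:
        return "?", 1  # unparseable reply counts as a low-confidence miss
    return m.group(1), int(m.group(2))

def evaluate(questions: list[dict]) -> None:
    """questions: [{'text': ..., 'key': 'C'}, ...] (118 items in the study)."""
    correct, confidence = [], []
    for _ in range(N_SESSIONS):       # each pass is an independent session
        for q in questions:
            answer, conf = ask(q["text"])
            correct.append(int(answer == q["key"]))
            confidence.append(conf)
    accuracy = sum(correct) / len(correct)
    r, p = pointbiserialr(correct, confidence)  # confidence vs. correctness
    print(f"accuracy={accuracy:.1%}  point-biserial r={r:.2f} (p={p:.3f})")
```

A point-biserial coefficient is used here because correctness is binary while confidence is (quasi-)continuous; the paper does not state which correlation statistic was applied.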
Keywords
Anaesthesiology; artificial intelligence; ChatGPT; intensive care; medical education; specialty examinations