Discussion
In this study, we evaluated ChatGPT’s performance on a neurology board-like test using the BV online question bank. BV provides commercially available question banks for a variety of examinations and medical specialties, including neurology, and is accredited by the Accreditation Council for Continuing Medical Education to provide continuing medical education to physicians.16 A third-party survey found that BV users had a 95% pass rate on the Neurology Board Exam, compared with the national average of 89%, and that 70% of respondents thought BV helped improve their Neurology Board Exam score.17
The results of this study demonstrate reasonable performance on the first attempt, which improved further with subsequent attempts. ChatGPT’s performance falls within the expected ranges established by neurology learners using the BV question bank. Our results highlight ChatGPT’s ability to interpret and respond appropriately to clinical questions and vignettes. Previous work has assessed ChatGPT’s performance on various medical examinations. On practice questions simulating the USMLE Step exams, ChatGPT correctly answered between 55.8% and 61.3% of questions, and on practice questions for the American Board of Neurological Surgery board exam, it answered 53.2% correctly on the first attempt.10 18 Our study’s first-attempt accuracy of 65.82% is slightly higher in comparison, which could reflect differences in the availability of specialty-specific materials in ChatGPT’s database, as well as potential improvements or patches to the existing ChatGPT model or database. ChatGPT performed statistically worse than BV users after a single attempt at answering the questions; this difference was not present after ChatGPT was given three attempts.
It is interesting to note that ChatGPT was sensitive to questions regarding depression and suicide and referred us to a suicide hotline. This suggests either an emergent sensitivity towards certain issues such as suicide, or ‘guardrails’ implemented in a top-down fashion. If the latter, more hard-coded guidance could be implemented to tailor the system towards healthcare, perhaps even narrowing its scope to individual healthcare specialties. Furthermore, it may be possible to have Large Language Models (LLMs) act as guides or tutors alongside question banks, or as interactive chatbots for reviewing medical concepts in online reference materials.
One criticism of ChatGPT is the application’s tendency to ‘hallucinate’: to give a factually incorrect answer that is phrased so reasonably and convincingly that it almost always ‘looks’ correct. No ‘hallucinations’ were detected in our trials per se; this may be due to several factors, including the limited, constrained scenario (in the form of a question stem) posed to ChatGPT, and the difficulty of discerning a merely incorrect answer from a ‘hallucination’. Because this distinction is difficult to draw, any medical use of ChatGPT must incorporate steps to verify the accuracy of its answers.
ChatGPT is still a relatively young technology at the time of this writing, and we anticipate room for improvement. For example, the integration of plugins could enhance its current abilities; incorporating the WolframAlpha plugin could address its weaknesses in mathematics.19 The WolframAlpha plugin is an add-on tool that turns ChatGPT into a powerful computational engine, enabling it to perform accurate mathematics, curate more precise knowledge and draw on real-time data. Although the base model lacks visual input and is therefore currently unable to answer image-based questions, collaborations with visual accessibility companies such as Be My Eyes could yield exciting results.20 Be My Eyes is a first-ever digital visual assistant powered by the ChatGPT language model, providing blind people with a powerful new resource to navigate their physical environments, address their activities of daily living and gain more independence. Users can send images via the Be My Eyes application, which can answer questions about the image and provide immediate visual assistance for an array of tasks. Once the technology matures, it could have significant implications for the medical field.
There are several limitations associated with this study. We used a commercial question bank without access to the underlying data to verify the provided statistics. Moreover, the study did not involve official (mock) exams provided by the American Academy of Neurology, so predictions about ChatGPT’s performance on the actual neurology board exams would be speculative. There is also concern about the model’s performance in clinical reasoning: ChatGPT, as an LLM, produces structured text based on probabilities but is also prone to state ‘facts’ that are untrue, without awareness that it is doing so. This makes it difficult to trust at higher levels of clinical decision making. Last, the comparison between human candidates and ChatGPT may be more complicated than it appears. The board exam tests candidates’ ability to draw on their own memory; they must rely on unaided recall to answer questions in real time. ChatGPT, by contrast, has access to large amounts of online information that it can curate for the best answer to the question. Human candidates freely allowed to use electronic devices might perform much better than the results quoted in our paper.
Finally, future studies could address ChatGPT’s ability to simply recall facts versus synthesise information to perform next-step thinking. Multiple-choice questions test two things: (1) possession of a key piece of information for factual recall and (2) the ability to synthesise those facts to solve a problem. The second is considered higher-order thinking that requires the learner to think more critically. Ultimately, residency training programmes and curricula aim to develop physicians who can think critically, and sometimes out of the box, when faced with clinical patient scenarios. While we did not specifically test ChatGPT’s critical thinking skills in this manuscript, questions about its higher-order thinking can be investigated in the future.