Artificial intelligence–powered chatbots, such as ChatGPT and Google Bard, have been heralded as turning points in the current AI revolution. These chatbots are based on LLMs that employ machine learning and natural language processing to interact with people via text or voice interfaces. The possible application of LLMs in medicine has attracted considerable attention in recent months [2, 26]. Healthcare professionals may obtain evidence-based, real-time advice from chatbots to improve patient outcomes [27]. For complicated medical situations, chatbots may provide clinical recommendations, identify possible medication interactions, and recommend suitable courses of action [28]. Thanks to their rapid access to vast volumes of data and their processing speed, chatbots can identify possible issues that human providers might not immediately notice. They might also provide the most recent information on recommendations and treatment alternatives, helping patients receive the best possible care [27, 28].
In this research, we gathered 60 medical records to examine the capacity of ChatGPT-4 and Google Gemini to define a correct surgical plan for glaucomatous patients. In particular, to test the two chatbots’ ability to perform correct differential diagnoses and derive a coherent surgical plan, we divided our sample into “ordinary” and “challenging” scenarios and compared their answers with those given by glaucoma experts, who were asked to analyze the same pool of clinical cases. Furthermore, we used the 5-point Global Quality Score (GQS) as a subjective measure of chatbot quality in terms of user-friendliness, speed of use, and accuracy and exhaustiveness of responses.
Overall, ChatGPT showed an acceptable rate of agreement with the glaucoma specialists (58%), provided a coherent list of answers in another 35% of cases, and gave completely incorrect answers in only 7%. ChatGPT significantly outperformed Gemini in this setting. Google’s chatbot proved highly unreliable, with only a 32% rate of agreement with specialists and a 27% rate of non-completed tasks. In particular, Gemini often stated “As a large language model, I am not a medical professional and cannot perform surgery” or “I cannot definitively recommend one specific treatment for glaucoma surgery.” Moreover, while GPT-4 was almost always able to analyze the clinical case in detail and develop coherent reasoning leading to an unambiguous choice, Gemini’s answers were much more generic and brief, even though the cited literature appeared coherent.
As expected, in the analysis of “ordinary” clinical cases, both LLMs agreed more often with the specialists’ opinion, reaching a 65% rate of agreement for ChatGPT and 38% for Google Gemini. However, ordinary cases often allow multiple surgical or parasurgical options, depending on the surgeon’s preference. When only clearly incorrect answers were considered, GPT-4 missed just 8% of cases, while Gemini had higher rates of errors (13%) and unanswered cases (15%).
When focusing on “challenging” scenarios, the performance of both chatbots declined, but ChatGPT proved much more accurate than Gemini (45% vs. 20% rate of agreement with specialists). It is important to acknowledge that complex cases require multimodal management and that different kinds of treatment should therefore be evaluated. Nevertheless, even when not concordant with the specialists’ opinions, ChatGPT showed a remarkable ability to analyze all parts of the scenario, as visible in the answers in Table 3, and was able to propose combined surgeries (e.g., cataract, vitrectomy, artificial iris implantation) as part of a comprehensive treatment. Gemini’s analysis in these cases, on the other hand, was often sparse and incomplete, and thus failed to define a thorough surgical plan.
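As a rough consistency check on the figures above, the overall agreement rate should equal the case-weighted average of the rates in the “ordinary” and “challenging” subgroups. The sketch below illustrates this arithmetic. The split of the 60 cases into 40 ordinary and 20 challenging is an assumption made purely for illustration, since it is not stated in this section, and the helper name `pooled_rate` is hypothetical.

```python
def pooled_rate(rates_and_sizes):
    """Return the case-weighted average of subgroup rates.

    rates_and_sizes: list of (rate, number_of_cases) pairs.
    """
    total_cases = sum(n for _, n in rates_and_sizes)
    return sum(rate * n for rate, n in rates_and_sizes) / total_cases


# ASSUMPTION: 40 ordinary + 20 challenging cases (illustrative split of the
# 60 records; the actual split is not given in this section).
N_ORDINARY, N_CHALLENGING = 40, 20

# Subgroup agreement rates with specialists, as reported in the text.
chatgpt = pooled_rate([(0.65, N_ORDINARY), (0.45, N_CHALLENGING)])
gemini = pooled_rate([(0.38, N_ORDINARY), (0.20, N_CHALLENGING)])

print(f"ChatGPT pooled agreement: {chatgpt:.0%}")
print(f"Gemini pooled agreement:  {gemini:.0%}")
```

Under this assumed split, the pooled rates come out at roughly 58% for ChatGPT and 32% for Gemini, in line with the overall agreement rates reported above. A different split would give different pooled values.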
These results were also reflected in the GQS, which was significantly higher for ChatGPT (3.5 ± 1.2) than for Gemini (2.1 ± 1.5); this difference increased when focusing on “challenging” cases. ChatGPT was indeed more user-friendly and able to perform a more in-depth analysis of the clinical cases, giving specific and coherent answers even in the highly specific context of surgical glaucoma. Moreover, when asked to choose only one treatment, it was always able to pick one of the treatments it had previously listed, unlike Gemini.
The effectiveness of ChatGPT in ophthalmology has already been investigated: a recent study showed that ChatGPT achieved an accuracy of 59.4% when answering questions from the Ophthalmic Knowledge Assessment Program (OKAP) test, while on the OphthoQuestions testing set it achieved 49.2% [6]. ChatGPT was also much more accurate than Isabel Pro, one of the most popular and accurate diagnostic assistance systems, in diagnosing ophthalmology cases [29]. Similarly, Delsoz et al. studied the diagnostic capabilities of this LLM in glaucoma patients, reporting a 72.7% accuracy in preliminary diagnosis [30]. Furthermore, compared to ophthalmology residents, ChatGPT consistently provided a higher number of differential diagnoses [30]. Recently, Kianian et al. highlighted ChatGPT’s effectiveness in creating content and rewriting information regarding uveitis, making it easier to digest and helping patients learn more about this condition and manage it more appropriately [31].
Because Google Bard (now Gemini) was introduced only recently, few studies have compared its performance with that of ChatGPT in healthcare settings. Gan et al. compared the performance of the two chatbots in triaging patients in mass casualty incidents, showing that Google Bard achieved a 60% rate of correct triage, similar to that of medical students. ChatGPT, on the other hand, had a significantly higher rate of over-triage [9]. Koga et al. analyzed LLMs’ ability to generate differential diagnoses of neurodegenerative disorders based on clinical summaries. ChatGPT-3.5, ChatGPT-4, and Google Bard included the correct diagnosis in 76%, 84%, and 76% of cases, respectively, demonstrating that LLMs can predict pathological diagnoses with reasonable accuracy [8].
The theoretical advantage of using an LLM in a clinical setting lies in more accurate, faster, and more impartial answers that are available at any time. Moreover, these models can train actively through reinforcement learning, which allows them to improve over time and rectify prior errors [30]. Although these models require less human monitoring and supervision for active training than supervised learning models, their training data contain non-peer-reviewed sources, which could include factual inaccuracies [32]. Importantly, because of the large overlap between physiological and pathological parameters, glaucoma management is extremely subjective and shows low agreement even among highly competent glaucoma experts [33]. In this setting, AI chatbots may not be able to handle circumstances requiring human judgment, empathy, or specialized knowledge; regular human oversight and evaluation are therefore essential before any final judgment is made and the necessary steps are taken [34]. In our research, ChatGPT outperformed Google Gemini in terms of surgical choice, giving more specific answers in the majority of clinical cases. These results are consistent with previous research conducted by our group, in which ChatGPT and Google Gemini were asked to address surgical cases of retinal detachment. In that setting, GPT-4 reached an 84% rate of agreement with vitreoretinal surgeons, while Gemini reached only 70% [35]. Notably, those rates are considerably higher than the ones observed here in a glaucoma setting. We hypothesize that surgical decisions in vitreoretinal surgery are much more limited and follow more precise criteria than the wide range of glaucoma surgical treatments, with their possible overlap.
However, in both studies, ChatGPT showed better performance in scenario analysis and surgical planning, suggesting that the newly released Google Gemini still lacks optimization for the medical field.
Furthermore, although ChatGPT has demonstrated encouraging results, its immediate application in clinical care settings may be constrained. Its inability to interpret diagnostic data, such as fundus photographs or visual fields, may impair its ability to conduct comprehensive examinations and furnish glaucoma specialists with accurate diagnoses. Considering how much ophthalmology depends on visual examination and imaging for patient diagnosis and treatment, it appears necessary to include additional transformer models that can handle different data sources, such as the Contrastive Language-Image Pretraining model [36].
To the best of our knowledge, this is the first investigation in which AI chatbots were asked to outline a surgical approach in a glaucoma setting. Moreover, we focused on a very specific subject, surgical glaucoma, to analyze the performance of ChatGPT and Google Gemini. Nevertheless, our research has several limitations. First, since the case descriptions were retrospective in nature, missing data could have influenced the chatbots’ answers. Second, we relied on a limited sample size, so further studies are needed to assess the large-scale applicability of our findings. Additionally, the glaucoma surgeons’ choices used for comparison do not always define the best possible treatment, possibly limiting the reproducibility of these results.
In conclusion, LLMs have the potential to revolutionize ophthalmology. In the future, particularly with the implementation of new inputs, such as video or images, AI-based chatbots might become reliable aids in clinical and surgical practice. LLMs have already shown their role in ophthalmology teaching; here, we demonstrated that ChatGPT-4 has the potential to coherently analyze medical records of glaucomatous patients, showing a good level of agreement with knowledgeable glaucoma experts. Google Gemini, on the other hand, showed strong limitations in this setting, with high rates of imprecise or missing answers, and thus still requires significant updates before effective application in the clinic.