The efficacy of artificial intelligence in urology: a detailed analysis of kidney stone-related queries

The increasing prominence of AI in various fields raises the question of its efficacy in delivering accurate and comprehensive responses, particularly in complex areas such as healthcare. This study evaluated the performance of an AI language model, OpenAI's GPT-4, in addressing questions related to urological kidney stones. The investigation focused on the model's performance on binary and descriptive questions of varying difficulty levels, assessed in terms of accuracy and completeness. The results offer intriguing insights into the model's capability to process and respond to complex medical queries.

The application of artificial intelligence in medicine, and specifically in urology, is a rapidly growing field of interest, particularly with regard to ChatGPT, a natural language processing tool developed by OpenAI [19]. Several studies have sought to investigate this model's utility, quality, and limitations in providing medical advice and patient information, alongside its role in academic medicine [19]. It is clear from these studies that while ChatGPT possesses considerable potential, significant concerns remain regarding its accuracy, the quality of its content, and the ethical implications of its use [8]. Cocci et al. investigated the use of ChatGPT in diagnosing urological conditions, comparing its responses to those provided by a board-certified urologist [21]. ChatGPT provided appropriate responses for non-oncology conditions, but its performance fell short in oncology and emergency urology cases. Furthermore, the quality of the information provided was deemed poor, underlining the need to evaluate any medical information provided by AI carefully. They found that the appropriateness of ChatGPT's responses in urology was around 52%, markedly lower than our findings of 80% accuracy for binary responses and 93.3% for descriptive responses.

Similarly, studies conducted by Huynh et al. [17] and Deebel et al. [18], which focused on evaluating the utility of ChatGPT as a tool for education and self-assessment for urology trainees, found it wanting. While there were instances of ChatGPT providing correct responses and reasonable rationales, its overall performance was lackluster, with persistent justifications for incorrect responses potentially leading to misinformation [17, 18]. A similar disparity is seen when we compare our findings to those of Huynh et al., who found that ChatGPT was correct on 26.7% of open-ended and 28.2% of multiple-choice questions [17]. Our study's accuracy rate was substantially higher, suggesting that ChatGPT may have more practical applications in specific contexts and modes of questioning. Deebel et al. found that ChatGPT's performance improved when dealing with lower-order questions [18], a finding echoed by our results, which showed high accuracy for both easy and moderate-level questions.

The results of our study echo those of previous research into the accuracy and quality of ChatGPT's responses in the field of urology, albeit with some differences. Compared with previous studies, our research has shown a higher degree of accuracy in both binary and descriptive responses [17, 18, 21, 22]. Coskun et al. found that while ChatGPT was able to respond to all prostate cancer-related queries, its responses were less than optimal, often lacking in accuracy and quality [23]. This suggests that caution should be exercised when relying on ChatGPT for accurate patient information. The study by Coskun et al. thus underscores the limitations of AI-generated patient information, which are also echoed in our study. Although our accuracy scores were higher, the quality and completeness of the responses provided by ChatGPT were lower than desired, highlighting the need for improvements in the model's performance, particularly in the context of more complex or challenging queries.

Based on the results of this study, the performance of the artificial intelligence model in answering questions about urological kidney stones can generally be considered high. The model typically received above-average scores for accuracy and completeness across questions of varying difficulty levels. Responses to easy questions generally received high scores in accuracy and completeness, while the scores for responses to more challenging questions were slightly lower, though still within acceptable levels. The standard deviations of the accuracy and completeness scores indicate a degree of variability in the model's performance from question to question. Additionally, the results of the Kruskal–Wallis tests suggest that the difficulty level of the questions does not significantly affect the accuracy or completeness of the responses, implying that the model handles questions of varying difficulty with similar capability.
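For readers who wish to reproduce this type of analysis, the brief sketch below illustrates how a Kruskal–Wallis test can compare scores across the three difficulty groups. This is not the study's analysis code; the score values, the 1-5 rating scale, and the use of Python's scipy library are illustrative assumptions only.

# Illustrative sketch (assumed values and scale): Kruskal-Wallis test comparing
# accuracy scores across the three question-difficulty groups.
from scipy.stats import kruskal

# Hypothetical accuracy scores on an assumed 1-5 scale, grouped by difficulty.
easy_scores = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5]
moderate_scores = [4, 5, 4, 4, 5, 4, 5, 4, 4, 5]
complex_scores = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4]

# A p-value above 0.05 would be consistent with the finding that difficulty
# level does not significantly affect accuracy.
h_statistic, p_value = kruskal(easy_scores, moderate_scores, complex_scores)
print(f"Kruskal-Wallis H = {h_statistic:.2f}, p = {p_value:.3f}")

An analogous test can be run on the completeness scores by substituting those values for the accuracy scores above.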

Despite the valuable insights derived from this study, certain limitations should be acknowledged. First, the question set was limited to 90 questions on urological kidney stones, which does not fully capture the extensive range of medical queries the model may encounter. Moreover, the categorization of questions as 'easy,' 'moderate,' or 'complex' is somewhat subjective and may be interpreted differently by different healthcare professionals. While the accuracy and completeness of the responses were evaluated, other essential factors, such as relevance, coherence, and the ability to interact in a real-time clinical context, were not considered. Future research should therefore employ more comprehensive study designs, incorporating larger and more diverse datasets as well as additional evaluative parameters, in order to more fully ascertain the capabilities and limitations of the GPT-4 model within a healthcare context.

In conclusion, the ChatGPT model is generally capable of answering questions about urological kidney stones accurately and comprehensively, showing promising results in terms of its learning capacity and adaptability over time. However, its performance shows some variability from question to question, especially when dealing with more complex or challenging questions. These findings highlight areas for learning and improvement, underscoring the importance of continuous training and updates.
