How accurate are ChatGPT-4 responses in chronic urticaria? A critical analysis with information quality metrics

Background

The increasing use of artificial intelligence (AI) in healthcare, especially for delivering medical information, raises concerns about the reliability and accuracy of AI-generated responses. This study evaluates the quality, reliability, and readability of ChatGPT-4 responses for chronic urticaria (CU) care, considering the potential implications of inaccurate medical information.

Objective

The goal of this study was to assess the quality, reliability, and readability of ChatGPT-4 responses to questions on CU management, benchmarked against international guidelines, using validated metrics to evaluate ChatGPT-4 as a resource for obtaining medical information.

Methods

Twenty-four questions were derived from the EAACI/GA²LEN/EuroGuiDerm/APAAACI recommendations and used as prompts for ChatGPT-4, with responses obtained in a separate chat for each question. The questions were grouped into 3 categories: (A) Classification and Diagnosis, (B) Assessment and Monitoring, and (C) Treatment and Management Recommendations. The responses were independently evaluated by allergy specialists using the DISCERN instrument for quality, the Journal of the American Medical Association (JAMA) benchmark criteria for reliability, and Flesch scores for readability. Scores were summarized as medians with interquartile ranges, and inter-rater agreement was assessed with the intraclass correlation coefficient (ICC).
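For illustration only, the following is a minimal Python sketch of the kind of analysis described above (Flesch Reading Ease, median/IQR summaries, and an ICC for inter-rater agreement). The syllable heuristic, the example scores, and the choice of the pingouin package are assumptions for this sketch and are not taken from the study.

```python
# Illustrative sketch, not the study's actual analysis pipeline.
import re
import numpy as np
import pandas as pd
import pingouin as pg  # one possible library providing ICC estimates


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; dedicated readability tools are more precise."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


# Hypothetical DISCERN scores from two raters (R1, R2) for five responses.
ratings = pd.DataFrame({
    "response": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":    ["R1", "R2"] * 5,
    "discern":  [48, 52, 55, 51, 40, 44, 60, 57, 46, 49],
})

# Median and interquartile range, as reported per question category.
median = ratings["discern"].median()
q1, q3 = np.percentile(ratings["discern"], [25, 75])
print(f"DISCERN median {median}, IQR {q3 - q1}")

# Inter-rater agreement via ICC (long-format data: targets, raters, ratings).
icc = pg.intraclass_corr(data=ratings, targets="response",
                         raters="rater", ratings="discern")
print(icc[["Type", "ICC"]])

# Readability of a sample response text.
print(flesch_reading_ease("Chronic urticaria is characterized by recurrent wheals."))
```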

Results

Categories A and C showed insufficient reliability on the JAMA benchmark criteria, with median scores of 1 and 0, respectively. Category B showed low reliability (median 2, interquartile range [IQR] 2). The quality of information for category C questions was satisfactory (median 51.5, IQR 12.5). All 3 categories fell in the "confusing" readability range on the Flesch assessment.

Limitations

Limitations include the focus on CU, possible bias in question selection, reliance on specific instruments (DISCERN, the JAMA benchmark criteria, and Flesch scores), and dependence on expert opinion for scoring.

Conclusion

ChatGPT-4 shows potential for producing medical content; however, its reliability is inconsistent, underscoring the need for caution and verification when using AI-generated medical information, especially in the management of CU.
