Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study

Adapting previous methods by Kirchner et al., online PEMs pertaining to AS were collected on September 10, 2023, in a private browsing window on Google Chrome (version 116.0.5845.179) via web searches of the top 20 leading academic cardiology, heart, and vascular surgery institutions in the USA per the US News and World Report (USNWR) 2023 hospital rankings [13, 14]. Additionally, PEMs on AS were gathered from available online patient resources provided by a professional cardiothoracic surgical society. Browser history and cache were cleared prior to the web searches. PEMs that were video-based, or that were already written at or below the 6th-grade reading level according to at least two of the four readability measures used, were excluded from analysis. The present study was exempt from institutional review as it did not involve human subjects research. This article does not contain any new studies with human participants or animals performed by any of the authors.
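For illustration only, the exclusion rule above can be expressed as a short decision function. This is a hypothetical sketch, not the authors' code; in particular, the mapping of FRE ≥ 80 to a roughly 6th-grade level is our assumption based on the FRE scale described below.

```python
def should_exclude(fre: float, fkgl: float, gfi: float, smogi: float) -> bool:
    """Exclude a PEM if at least two of the four readability measures
    already place it at or below a 6th-grade reading level."""
    at_or_below_6th_grade = [
        fre >= 80.0,    # assumed FRE cutoff for ~6th grade or easier
        fkgl <= 6.0,    # FKGL, GFI, and SMOGI report grade levels directly
        gfi <= 6.0,
        smogi <= 6.0,
    ]
    return sum(at_or_below_6th_grade) >= 2
```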

Once collected, online PEMs were reviewed for patient education descriptions of AS through group discussion among multiple investigators. The identified descriptions were then assessed for readability via an online application (https://readable.com, Added Bytes Ltd., Brighton, England) using four validated readability measures: the Flesch Reading Ease (FRE) score, Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Simple Measure of Gobbledygook Index (SMOGI) [15,16,17,18]. These measures are the most commonly used readability tests in the health literacy literature and are calculated using the following formulas [9]:

$$\text{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

$$\text{GFI} = 0.4\left[\left(\frac{\text{total words}}{\text{total sentences}}\right) + 100\left(\frac{\text{complex words}}{\text{total words}}\right)\right]$$

$$\text{SMOGI} = 1.0430\sqrt{\text{polysyllables}\times\frac{30}{\text{total sentences}}} + 3.1291$$

The FRE score ranges from 0 to 100 and corresponds to an American educational level, with higher scores indicating easier reading material (100–90, very easy to read/5th-grade level; 90–80, easy to read or conversational English/6th-grade level; 80–70, fairly easy to read/7th-grade level; 70–60, plain English/8th- to 9th-grade level; 60–50, fairly difficult/10th- to 12th-grade level; 50–30, difficult/college level; 30–10, very difficult/college graduate level; 10–0, extremely difficult/professional level) [15]. FKGL, GFI, and SMOGI range from 0 to 18, 0 to 20, and 5 to 18, respectively, and each score indicates the number of years of education needed to understand the assessed reading material [16,17,18]. Thus, PEMs with FKGL, GFI, and SMOGI scores of approximately 7 or lower and FRE scores greater than 80 are generally considered congruent with AMA and NIH readability recommendations [4, 5]. The FRE and SMOGI measures have previously been highlighted for their utility in patient education and health literacy research [9, 19].
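As a worked illustration of the formulas above, the following sketch scores a text in Python. The syllable and complex-word counts are crude heuristics; the study itself used readable.com, not this code.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text: str) -> dict:
    """Compute FRE, FKGL, and GFI from raw text counts."""
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    n_complex = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        "FRE": 206.835 - 1.015 * (n_words / n_sentences)
               - 84.6 * (n_syllables / n_words),
        "FKGL": 0.39 * (n_words / n_sentences)
                + 11.8 * (n_syllables / n_words) - 15.59,
        "GFI": 0.4 * ((n_words / n_sentences) + 100 * (n_complex / n_words)),
    }
```

For example, a passage built from long sentences and polysyllabic medical terms (e.g., "transcatheter aortic valve replacement") will yield a low FRE and high FKGL/GFI, flagging it as above the recommended reading level.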

Next, patient education descriptions of AS from each PEM were entered into the freely available LLMs ChatGPT-3.5 (https://chat.openai.com/chat; Version August 3, 2023, OpenAI, San Francisco, CA, USA) and Bard (https://bard.google.com; Version July 13, 2023, Google, Mountain View, CA, USA), preceded by the prompt “translate to 5th-grade reading level”. These AI dialogue platforms were prompted to translate text to a lower reading level than recommended by the AMA and NIH to account for potential variability in the interpretation of reading skill levels by the LLMs. The AI-generated text material from ChatGPT-3.5 and Bard was then reevaluated for readability and accuracy.
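The study used the public web chat interfaces of both platforms. Purely for illustration, the same prompt could be scripted; the sketch below assumes the legacy OpenAI Python library (openai < 1.0, current in 2023) and is not the authors' workflow.

```python
import openai  # legacy openai < 1.0 interface

openai.api_key = "YOUR_API_KEY"  # placeholder credential

def convert_pem(pem_text: str) -> str:
    """Ask GPT-3.5 to rewrite a PEM at a 5th-grade reading level,
    mirroring the prompt used in the study."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"translate to 5th-grade reading level:\n\n{pem_text}",
        }],
    )
    return response["choices"][0]["message"]["content"]
```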

Our primary endpoint was the absolute difference in FRE, FKGL, GFI, and SMOGI scores of each PEM before and after conversion by both AI platforms. To evaluate conversion consistency, four independent conversions of PEMs by each AI dialogue platform were performed [13]. Mean readability scores were recorded and used for subsequent analysis. Percentage change in readability scores before and after AI conversion was calculated. The accuracy of medical content from converted PEMs was secondarily assessed through independent review by multiple investigators. Time (seconds) elapsed between original prompting of each AI platform and completion of text generation was also recorded for each PEM.
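A minimal sketch of this consistency procedure, assuming `convert` wraps an AI platform and `score` wraps a readability measure (both hypothetical stand-ins for the web tools actually used):

```python
from statistics import mean

N_RUNS = 4  # four independent conversions per PEM and platform

def mean_converted_score(pem_text: str, convert, score) -> float:
    """Average readability score across N_RUNS independent AI conversions."""
    return mean(score(convert(pem_text)) for _ in range(N_RUNS))

def percent_change(before: float, after: float) -> float:
    """Percentage change in a readability score after AI conversion."""
    return 100.0 * (after - before) / before
```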

Continuous variables were presented as means ± standard deviation (SD) or, if non-normally distributed, as medians with interquartile range (IQR). The normality of variables was assessed using quantile–quantile plots and Shapiro–Wilk testing. Student's t tests or Wilcoxon rank-sum tests were subsequently used, as appropriate, to compare text characteristics; FRE, FKGL, GFI, and SMOGI scores before and after conversion; and percentage change in readability scores by each AI platform. A two-sided p value ≤ 0.05 was considered statistically significant. Statistical analyses were performed using Stata version 18.0 (StataCorp, College Station, TX, USA).
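The analyses were run in Stata; for orientation, an approximate Python/SciPy rendering of the described workflow (Shapiro–Wilk normality check, then Student's t test or Wilcoxon rank-sum test) might look like the following sketch:

```python
from scipy import stats

def compare_scores(before, after, alpha: float = 0.05):
    """Compare readability scores before and after AI conversion.

    Uses Student's t test when both samples pass Shapiro-Wilk normality
    testing, otherwise a Wilcoxon rank-sum test; a two-sided p <= 0.05 is
    treated as statistically significant, as in the study.
    """
    normal = (stats.shapiro(before).pvalue > alpha
              and stats.shapiro(after).pvalue > alpha)
    test = stats.ttest_ind if normal else stats.ranksums
    result = test(before, after)
    return result.statistic, result.pvalue
```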
