In this study, we developed and evaluated a novel method that uses a locally hosted, open-source large language model (LLM), Llama-2-70B, to convert clinical free-text reports into template-based, machine-readable structured reports. Structured reporting offers many advantages, yet its adoption in clinical practice remains limited. An LLM-based approach could enable radiologists to preserve their traditional narrative reporting style. By design, our approach guarantees the creation of valid, fully structured reports in all instances. The LLM achieved mean Matthews correlation coefficients (MCCs) of 0.75 (94% HDI: 0.70–0.80) in the MIMIC-CXR cohort and 0.68 (94% HDI: 0.64–0.72) in the UH cohort. Our results show evidence for parity with human readers in both cohorts, with the majority of the probability density lying inside the regions of practical equivalence (ROPEs).
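For readers less familiar with the metric, the MCC can be computed directly from confusion-matrix counts. The following minimal sketch uses illustrative counts, not the study's data:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example with hypothetical counts: 3 true positives, 3 true
# negatives, 1 false positive, 1 false negative.
score = mcc(tp=3, tn=3, fp=1, fn=1)  # → 0.5
```

Unlike accuracy, the MCC accounts for all four cells of the confusion matrix, which makes it a robust summary statistic for the imbalanced finding distributions discussed below.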
Interestingly, both the LLM and the human readers achieved a high specificity of over 0.95 but a lower sensitivity, ranging from 0.66 to 0.80. This result can primarily be attributed to the imbalance between positive and negative findings in the reports. Nevertheless, a lower sensitivity poses a risk of missing critical findings and indicates room for improvement.
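The effect of this imbalance can be illustrated with hypothetical counts (not the study's data): when negative findings dominate, a handful of missed positives lowers sensitivity sharply while specificity remains high.

```python
def sensitivity(tp, fn):
    """True-positive rate: share of actual positives detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: share of actual negatives correctly ruled out."""
    return tn / (tn + fp)

# Hypothetical imbalanced set: 10 positive vs 90 negative findings.
# Missing just 3 positives drops sensitivity to 0.70, while 2 false
# positives barely dent specificity.
sens = sensitivity(tp=7, fn=3)   # 0.70
spec = specificity(tn=88, fp=2)  # ≈ 0.978
```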
The MCCs for German reports were noticeably lower than for English reports. Although German was the second most common language in Llama-2’s training corpus, it constituted only 0.17% of the dataset, whereas English covered 89.7% [9]. In addition, German reports sometimes included abbreviations whose meaning differs from that of their English counterparts. For instance, “PE” typically stands for pulmonary embolism in English but for “Pleuraerguss” (pleural effusion) in German. Another challenge was the vague wording of free-text reports, which use terms such as “small to moderate,” “possible,” and “cannot be excluded” that cannot be unequivocally mapped onto a fully structured template. This problem was more pronounced for the German reports and could also explain the lower MCC scores of the human readers in the UH cohort.

Hallucinations, in which the LLM produces factually incorrect or nonsensical outputs, are a significant concern for clinical applications. We observed that the LLM sometimes incorrectly predicted the presence of pleural drainage catheters when they were not explicitly mentioned in the report but were plausible given other findings such as pneumonia and a large pleural effusion.

Further research should prospectively evaluate the performance of LLMs for structured reporting when the reporting radiologist is aware of the template and its items. Other potential performance improvements include developing feedback loops for continual learning from radiologists’ corrections and curating guidelines for interpreting vague language to refine prompts. Additionally, integrating multi-modal learning, in which the LLM can access both textual reports and the associated medical images, could enhance its ability to cross-reference findings and improve accuracy.
Automatic retrospective structuring of radiology reports within an institution has many potential applications. It can support research by enabling faster retrieval of relevant cases for a query and automatic label extraction for imaging-based AI algorithms. It can simplify longitudinal patient monitoring and can be used to extract information from trial reports, e.g., according to RECIST or Lugano criteria [19]. Finally, it may aid communication with other specialties and non-clinical researchers. On the other hand, retrospectively creating a fully structured report from a free-text report might entail some information loss, typically for two reasons: (1) the free-text report may include a finding not covered by the template, and (2) a free-text finding may be ambiguous or impossible to map unequivocally onto the options specified in the template.
Several recent publications have explored the post-hoc structuring of radiology reports using LLMs. Adams et al [11] devised a two-step approach in which GPT-4 automatically selected an appropriate template and then created a structured report. Mukherjee et al [20] used single- and multi-step prompting of a locally hosted Vicuna-13B model to extract 13 binary findings from chest radiography reports. Our approach has several important advantages over those mentioned above. First, because constrained generation limits the LLM inference to predefined answers, the structuring is immune to adversarial attacks, including prompt injection, which could be attempted to extract sensitive information [21]. Second, using a local model ensures that data do not leave the institution, satisfying data protection regulations. Finally, we evaluated a fine-grained and fully structured template and a dataset of internal, real-world non-English reports.
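The core idea of constrained generation can be sketched as follows: for each template item, the model may only choose among the predefined options, so the resulting report is valid by construction. This is a conceptual illustration, not the study's implementation; the stub scores stand in for the LLM's token log-probabilities, and the template items are hypothetical.

```python
# Hypothetical two-item template; each finding has a closed option set.
TEMPLATE = {
    "pleural_effusion": ["present", "absent"],
    "pneumothorax": ["present", "absent"],
}

def structure_report(score_fn):
    """For each template item, pick the allowed option the model scores
    highest. Free-form output is impossible, so every report is valid."""
    return {
        finding: max(options, key=lambda opt: score_fn(finding, opt))
        for finding, options in TEMPLATE.items()
    }

# Stub scores standing in for LLM log-probabilities (assumption).
stub_scores = {
    ("pleural_effusion", "present"): -0.2,
    ("pleural_effusion", "absent"): -1.5,
    ("pneumothorax", "present"): -3.0,
    ("pneumothorax", "absent"): -0.1,
}
report = structure_report(lambda f, o: stub_scores[(f, o)])
# → {"pleural_effusion": "present", "pneumothorax": "absent"}
```

Because the output vocabulary is restricted to the template's options, an injected instruction in the input text cannot make the model emit anything outside the schema.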
The most advanced LLMs are primarily accessible through cloud-based services, offering access via graphical or application programming interfaces (APIs). These models remain a cost-effective and scalable solution for scenarios in which data protection is not a primary concern. However, the gap in performance and speed between them and their open-access alternatives is narrowing. For instance, in a recent study, Llama-2-70B achieved an accuracy of 63.8% on USMLE-style questions from the MedQA dataset, a higher score than that of GPT-3.5, which was the state-of-the-art commercial model upon its release nine months before Llama-2 [22]. In comparison, human experts scored 87%. Several fine-tuned models have been derived from Llama-2, including the Vicuna, Alpaca, and other model families [20]. Llama-2 models have also been successfully fine-tuned on medicine-specific data sources, demonstrating a moderate performance improvement in medical reasoning tasks over the base models [22].
Local, open-source LLMs give the user full control over the model and eliminate the risk of its discontinuation by an external provider. A rich open-source software ecosystem simplifies switching between models and setups, including widely adopted libraries such as Huggingface, LangChain, and Guidance. Although computing requirements remain a limitation for running LLMs locally, smaller LLMs such as Mistral 7B, which can run on consumer-grade machines, are showing performance improvements [23]. Furthermore, by applying a technique called quantization, which reduces the number of bits used to store the model’s weights [24], LLMs become faster and incur lower computational and memory costs; this technique was also used in our study. To improve sensitivity and reduce the risk of missing critical findings, strategies such as domain-specific fine-tuning on medical datasets, prompt optimization, and optimizing template wording could be applied. Finally, ensuring that radiologists are aware of and include all required information in their reports could significantly enhance performance.
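The principle behind quantization can be sketched with symmetric int8 quantization of a weight vector: each weight is rescaled to an integer in [-127, 127] plus a single shared scale factor, so storage drops from 32 or 16 bits per weight to 8 at the cost of a small rounding error. This toy example illustrates the idea only; production schemes (e.g., the 4-bit quantization used for large models) are considerably more elaborate.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: store weights as integers in
    [-127, 127] plus one float scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]

# Toy weight vector (illustrative values).
weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the original weights
```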
As with all non-experimental studies, the ability to draw causal conclusions is limited; the results should hence be interpreted as hypothesis-generating. Our study was performed on a relatively small sample, and because the effect sizes within our cohorts are expected to remain small, more data would be needed to reach binary decisions in the Bayesian ROPE framework. Further, defining the reference standard was challenging owing to the rigidity of the templates and the ambiguity of retrospectively collected free-text reports. Our evaluation focused on a single modality and anatomic region, and we evaluated only the performance of top-level findings; statistics on nested elements involve conditional probabilities that lie beyond the scope of this manuscript. Finally, we assessed only one prompt, and the LLM was not fine-tuned to medical and German contexts. Potential improvements through prompt engineering, model fine-tuning, and preprocessing of reports with machine translation remain subjects for future research. To enhance the reliability of LLM-generated reports, future research could leverage Bayesian techniques, model ensembling, and Monte Carlo dropout to quantify and improve confidence estimates.
In conclusion, on-premise, open-source LLMs can automatically structure free-text radiology reports using a fine-grained template with approximately human-level accuracy, highlighting their potential to significantly enhance the efficiency and reliability of processing and interpreting clinical reports. However, their understanding of semantics varied across languages and imaging findings. Further evaluations in larger clinical cohorts are needed to establish their usefulness in the clinical setting.