Real-life benefit of artificial intelligence-based fracture detection in a pediatric emergency department

This study evaluated the performance of a commercially available AI-supported software for fracture detection in children and adolescents, as well as its impact on the diagnostic accuracy and confidence of inexperienced residents in a real-life clinical setting. To date, this is the largest study of its kind. The AI demonstrated a good stand-alone performance; however, the improvement in the diagnostic accuracy of inexperienced physicians when using this AI assistance was limited.

Stand-alone performance

In our cohort, AI achieved a sensitivity of 92%, specificity of 83%, and diagnostic accuracy of 87%. In the real-life cohort, a particularly high number of FP findings was identified at the apophysis of the 5th metatarsal bone. Empirically, this region often presents challenges in differentiating between a fracture and an apophysis in children, particularly for inexperienced physicians. Additionally, the often-irregular shape of toes was frequently misinterpreted as a fracture—a distinction that, in our experience, is particularly difficult for beginners. As anticipated, the physis gap, resembling a fracture line or avulsion, accounted for more than half of all FP findings, significantly contributing to misdiagnoses.
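The reported figures follow the standard confusion-matrix definitions of sensitivity, specificity, and accuracy. As a minimal illustration, the hypothetical counts below (not the study's actual case numbers) are chosen to reproduce the reported percentages:

```python
def fracture_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary fracture classifier."""
    sensitivity = tp / (tp + fn)                 # true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # overall agreement
    return sensitivity, specificity, accuracy

# Hypothetical counts (400 fractures, 500 non-fractures) chosen so the
# metrics match the percentages reported in this cohort.
sens, spec, acc = fracture_metrics(tp=368, fp=85, tn=415, fn=32)
print(f"sens={sens:.0%} spec={spec:.0%} acc={acc:.0%}")  # sens=92% spec=83% acc=87%
```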

The stand-alone performance of the AI in our study is comparable to that reported in the few peer-reviewed studies of (predominantly other) AI software solutions in children [7,8,9,10,11,12]. However, the validity of such comparisons across different software tools and cohorts is limited. For example, it is important to consider whether the cohort includes polytrauma patients, as their absence could result in an underrepresentation of more severe (and typically easier to detect) fractures. Additionally, the age distribution within the cohort is crucial, as certain fracture types are age-specific (e.g., toddler fractures in infants or triplane fractures in adolescents).

For RBFracture, the AI software used in this study, only one peer-reviewed publication including a small set of children has appeared to date. In this manufacturer-associated study by Bachmann et al, a stratified cohort of 98 children (generously defined as individuals up to 21 years of age) was employed [9]. They reported an exceptionally high sensitivity of 100% (95% CI: 100–100%) and a moderate specificity of 79%. The higher sensitivity compared to our study and other AI-based programs could be attributed to their small sample size or potentially higher median patient age, which may have resulted in a greater prevalence of closed growth plates—the most common source of error in pediatric fracture detection.

Fractures of particular medicolegal relevance

Trauma-related complications have been described as the leading cause of malpractice lawsuits against pediatric surgeons in some countries, accounting for 27% of all cases [13]. A study by Vinz et al retrospectively analyzed 189 cases investigated before a court of arbitration for potential treatment errors in pediatric fractures [14]. They found that 67% involved medical malpractice, often due to misinterpretation of radiographs, with 44% of these patients suffering permanent damage due to incorrect treatment.

Van Laer identified five trauma sequelae that are often overlooked in children and that, if not treated appropriately, can lead to permanent damage, such as valgus deformity of the leg or growth disturbances at the elbow [15]. These trauma sequelae comprise three types of fractures—non-displaced radial condyle fractures, proximal tibia fractures, and medial malleolus fractures—and two malalignments, which are not part of this fracture study.

For these three fracture types, characterized by subtle fracture morphology but high clinical impact, achieving the highest possible AI sensitivity is crucial. In our study, the AI demonstrated a sensitivity of 100% for proximal tibia fractures, 96% for medial ankle fractures, but only 68% for radial condyle fractures. This latter finding is concerning, as the elbow is a frequent site of medical errors in pediatric fractures, accounting for 77% of all cases, according to Vinz et al [14]. Consequently, more intensive AI training focused on pediatric elbow fractures, particularly this type of subtle fracture, would be highly beneficial.

Impact on inexperienced physicians

According to the hierarchical efficiency model for AI software in imaging proposed by Leeuwen et al, pure stand-alone performance represents only the second-lowest level [16]. The next level involves the impact of the software on the physician’s performance, which we also evaluated. To the best of the authors’ knowledge, this is the first study to provide such evidence in a real-life pediatric cohort. In our study, AI improved initially incorrect diagnoses in 4.1% of cases. However, it is noteworthy that in 1.5% of cases, AI led to the incorrect alteration of an initially correct diagnosis, resulting in a net improvement of 2.6% in reader accuracy due to AI.
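The net figure follows from simple bookkeeping of the AI-induced diagnosis changes; a minimal illustration using the percentages reported above:

```python
# Bookkeeping of AI-induced diagnosis changes (percentages from this study).
improved = 4.1   # initially incorrect diagnoses corrected with AI (% of cases)
worsened = 1.5   # initially correct diagnoses incorrectly altered (% of cases)

net_change = improved - worsened  # net gain in reader accuracy
print(f"net improvement: {net_change:.1f}%")  # net improvement: 2.6%
```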

Apart from our study, only two other studies have investigated the added value of AI for human readers. Neither was independent of the manufacturer, and both involved cohorts stratified by fracture location rather than a real-life cohort. In the aforementioned study by Bachmann et al [9], using the AI software RBFracture, the proportion of missed fractures was assessed instead of accuracy. With AI assistance, the rate of missed fractures decreased by 29%, exceeding the reduction in our study (21%). This difference may primarily be due to the fact that the readers in that study were mostly paramedical staff, who likely have a lower performance level in initial fracture detection compared to physicians. For example, the rate of missed fractures among advanced trauma nurses was 79.0% (compared to 14.3% in our study), which is an unacceptably high figure in practice. Additional factors that may contribute to the higher reduction in missed fractures in that study include cohort stratification, the lack of clinical history available to the readers, the potentially higher sensitivity of the software version used, and the very small proportion of children in the analyzed cohort.

In the second of these studies, by Nguyen et al, incorporating the AI software BoneView improved the sensitivity of junior radiologists by 10.3% and their specificity by 0.7% (3.6% and 1.7%, respectively, in our study) [17]. Whether this difference is due to the varying training levels of the readers, the stratified patient dataset, or the differing performance of the software used remains speculative.

While no further studies have specifically examined AI’s impact on human reader performance in children and adolescents, similar research exists for stratified adult cohorts. Guermazi et al reported that the addition of the BoneView software improved the sensitivity of emergency physicians by 9.9% and the specificity by 3.4% [18]. In Duron et al, the same AI software increased sensitivity by 7.5% and specificity by 4.1% [19].

Whether purchasing fracture detection software is worthwhile depends not only on its stand-alone performance and its impact on human readers but also on a complex question that cannot be answered unequivocally: What level of economic investment is justified for a defined increase in performance and a potential improvement in patient safety? In clinical practice, AI-based software should aim to (1) improve patient safety by acting as a “second pair of eyes” (also known as augmented AI [5]), helping to prevent harm caused by undiagnosed fractures or unnecessary immobilization, (2) reduce the number of return visits due to initial misdiagnoses, (3) assist in training less experienced junior doctors, and (4) enable the automatic triage of patients to ensure timely and appropriate care. Despite the limited overall increase in accuracy, our experience with fracture detection AI software suggests that even experienced pediatric radiologists benefit from this “second pair of eyes,” leading to a reduction in satisfaction-of-search errors. This is especially true when fractures are located in unexpected areas, such as a partially imaged bony ligament tear in the upper ankle joint on a radiograph whose primary focus was the forefoot.

Future studies, preferably prospective and multicenter, should aim to achieve the next highest level of evidence: the effect on treatment outcomes and follow-up examinations, the impact on quality of life, morbidity, and survival, and cost-effectiveness in terms of quality-adjusted life years and the incremental cost per quality-adjusted life year [16].

Limitations

In addition to its retrospective, monocentric nature, this study has some other limitations. For example, updates to the AI software used may have been published between the start of the study and its publication, which could yield different results. Due to the large cohort, a certain learning effect among readers cannot be ruled out despite blinding to the gold standard. Our real-life design, in which the diagnosis was established first without AI and immediately followed by the same case with AI, could have resulted in readers being less willing to change their initial diagnosis. However, this approach most accurately reflects the decision-making process in real-life scenarios. Additionally, the performance of initially inexperienced doctors would likely have increased steeply over a wash-out phase of several months, an effect that was largely circumvented by our study design. Finally, an expert consensus as the ground truth is not perfect, and this applies not only to our study but also to all comparable studies (and to the AI training data as well). Nevertheless, pediatric radiologists have demonstrated high diagnostic accuracy in evaluating pediatric trauma cases [10].
