We developed a DL tool to detect SC MS lesions, and we performed a multi-reader, multi-case retrospective study with a fully crossed design. The study was approved by the local ethics advisory committee (no. 24.100). All data were extracted (de-identified) from the French MS registry in October 2023 (www.ofsep.org) and included clinical and imaging data from 38 expert MS centers collected during patients’ routine follow-up visits [14, 15] (NCT02889965).
Characteristics of the segmentation model
Details about the segmentation model are provided in the Supplemental Material. Briefly, it is based on a five-level U-net architecture [16] with one input channel for each of the two sequences of interest and a single output channel. The model was trained with a combined Dice and cross-entropy loss function, with deep supervision, on an annotated dataset of 140 cases from 40 scanners, extracted from the OFSEP database and covering a diverse range of lesion presentations. The post-processing of the model output was optimized on a dataset of 21 cases for the task assessed in our work, favoring good sensitivity at the risk of reduced precision. The resulting model achieved a lesion-wise sensitivity of 0.89 and a precision of 0.64 on a test set of 40 cases.
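As a rough illustration of what a combined Dice and cross-entropy loss looks like for a single output channel, the sketch below works on flattened voxel lists. The equal weighting of the two terms, the smoothing constant, and the clipping value are assumptions made for this example; the actual training configuration is described in the Supplemental Material, and deep supervision is not shown.

```python
import math


def dice_ce_loss(probs, targets, smooth=1e-6, eps=1e-7):
    """Soft Dice loss plus binary cross-entropy over flattened voxels.

    probs   : predicted foreground probabilities in [0, 1]
    targets : binary ground-truth labels (0 or 1)
    The 1:1 weighting, `smooth`, and `eps` are illustrative assumptions.
    """
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and of equal length")

    # Soft Dice coefficient: 2 * |P . T| / (|P| + |T|), smoothed to avoid 0/0.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

    # Mean binary cross-entropy, with probabilities clipped away from 0 and 1.
    ce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        ce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    ce /= len(probs)

    return (1.0 - dice) + ce
```

A perfect prediction gives a loss close to zero, whereas an uninformative prediction of 0.5 everywhere on a half-foreground volume gives roughly 0.5 + ln 2.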
The multi-reader study design
The multi-reader study was conducted between December 2023 and June 2024 (Fig. 1). Twenty clinicians annotated cervical and thoracic SC sagittal T2 acquisitions from 50 pwMS, with the systematic help of sagittal STIR. A web-based annotation tool resembling a standard picture archiving and communication system was used to annotate lesions by clicking once on each lesion (Supplementary Fig. 1). Each reader underwent a standardized training protocol, including a 20-min presentation and a written tutorial. A timer recorded the time spent on each MRI volume. Each patient was analyzed twice by each reader, with and without the tool, in a randomized order across two sessions at least 15 days apart. At the end of the experiment, we asked each reader to complete a questionnaire concerning their potential expectations of a tool to assist in the detection of SC MS lesions in clinical practice. Each item was scored from 1 to 5, ranging from “completely disagree” to “completely agree”.
Fig. 1 Summary of the multi-reader study design
Dataset
SC MRI acquisitions were extracted from the OFSEP database. Inclusion criteria were images acquired from 2017 onward, with both sagittal T2 and STIR acquisitions available at both upper and lower SC levels, and at least one other SC MRI time point available. Patients already included in the training dataset were not included. Exclusion criteria were a pathologic condition of the SC other than MS and poor imaging quality. Initially, 55 patients were selected, five of whom were excluded due to poor imaging quality, resulting in a final cohort of 50 patients.
Readers
Radiologists and neurologists with varying levels of expertise and who were not involved in the tool development process were recruited for the experiment. Senior neuroradiologists and neurologists were recruited at the meeting of the OFSEP imaging group and came from French MS expert centers. Junior radiologists and neurologists were residents from the same centers. The general radiologists had no particular specialization in the follow-up of MS patients.
Ground truth
Ground truth segmentation was established independently by a neurologist with 10 years of experience in MS SC MRI reading (A.K.) and a radiology resident (B.L.) using ITK-SNAP software (version 4.0.1). They had more data at their disposal than the readers (SC MRI data on at least one other time-point and axial acquisitions). Discrepancies between them were resolved through adjudication with a third expert, a neuroradiologist with six years of experience (N.L.).
Secondarily, the annotations of the 20 readers were pooled on each MRI volume and independently reviewed by B.L. and A.K. to construct a revised ground truth, so that lesions missed in the initial ground truth could be reintegrated. We considered that a reader had correctly identified a lesion if they had clicked on any voxel within the ground truth lesion mask.
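The hit rule above can be sketched as follows. The representation of the ground truth as a mapping from voxel coordinates to lesion identifiers, the treatment of clicks outside any lesion as false positives, and the single counting of repeated clicks on the same lesion are assumptions made for illustration, not details taken from the study.

```python
def score_clicks(lesion_labels, clicks):
    """Count true positives, false positives, and false negatives for one volume.

    lesion_labels : dict mapping a voxel (z, y, x) to a ground-truth lesion id;
                    background voxels are simply absent (hypothetical layout).
    clicks        : iterable of (z, y, x) voxels clicked by the reader.
    """
    hit_lesions = set()
    false_positives = 0
    for click in clicks:
        lesion_id = lesion_labels.get(tuple(click))
        if lesion_id is None:
            # Assumption: a click outside every lesion mask is a false positive.
            false_positives += 1
        else:
            # Assumption: several clicks on one lesion count as a single hit.
            hit_lesions.add(lesion_id)

    all_lesions = set(lesion_labels.values())
    true_positives = len(hit_lesions)
    false_negatives = len(all_lesions) - true_positives
    return true_positives, false_positives, false_negatives


def sensitivity_precision(tp, fp, fn):
    """Return (sensitivity, precision); None where the denominator is zero."""
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return sensitivity, precision
```

For a volume with two lesions, a reader who clicks twice inside the first lesion and once in the background scores one true positive, one false positive, and one false negative.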
Statistical analysis
Statistical analyses were performed using R (version 4.2.2).
Primary outcome
The primary outcome was the difference in mean reader sensitivity and/or precision for detecting SC lesions with versus without the tool. Prior to the study, we established through simulation that, under simplifying assumptions and for a Type I error of 0.05, a Type II error of 0.20, and an improvement in sensitivity of 10% with the tool (based on an average performance of experts alone of 0.60), a set of 50 patients with at least seven experts was sufficient. The individual mean sensitivity, together with its 95% confidence interval (CI), was computed for each condition, with or without the tool. The overall mean sensitivity was tested for equality between the two conditions using a regression including condition, acquisition, and reader fixed effects. The same analysis was conducted for the mean precision.
Secondary outcomes
We studied supplemental metrics as an exploratory analysis. The corresponding p-values, which test the null hypothesis of no effect, are reported without correction for multiple comparisons.
Supplemental assessment of difference in mean sensitivity and precision
The main analysis was repeated according to reader experience (< 10 and ≥ 10 years), reader specialty (neurologists with expertise in MS, radiologists, and neuroradiologists), acquisition coverage (upper and lower SC), MRI magnetic field, slice thickness, and image in-plane resolution. Then, sensitivity and precision were computed for the model alone and compared with the performance of the readers alone using a linear model with acquisition and condition (reader without the tool vs model alone) fixed effects.
Lesion count and lesion-wise sensitivity and precision
The average number of true positive (TP) lesions detected per reader was computed for each condition. This number was tested for equality between the two conditions using a paired t-test. Lesion-wise sensitivity and precision were analyzed by aggregating all lesions from all scans. The individual lesion-wise sensitivity, together with its 95% CI, was computed for each condition and was then tested for equality between conditions using a logistic regression including condition and acquisition fixed effects. The same analysis was conducted for the lesion-wise precision. These analyses were repeated according to lesion volume.
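For readers less familiar with the paired t-test used for the TP lesion counts, the statistic can be sketched as below, with one pair of values per reader (with and without the tool). This is a minimal illustration and not the R code used in the study; the function name and inputs are hypothetical, and the p-value, which requires the Student t distribution, is not computed here.

```python
import math
import statistics


def paired_t_statistic(with_tool, without_tool):
    """Return (t, degrees_of_freedom) for paired per-reader values.

    with_tool, without_tool : sequences of equal length, one value per reader
    (for example, the mean number of TP lesions detected in each condition).
    """
    if len(with_tool) != len(without_tool):
        raise ValueError("both conditions must have one value per reader")
    differences = [a - b for a, b in zip(with_tool, without_tool)]
    n = len(differences)
    if n < 2:
        raise ValueError("at least two paired observations are required")

    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

The statistic is the mean within-reader difference divided by its standard error, so it uses the pairing of the two conditions rather than treating them as independent samples.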
Acquisition-wise categorization
Each acquisition was categorized into three classes: no lesion, 1 or 2 lesions, and ≥ 3 lesions. The overall accuracy associated with this categorization was computed for the two conditions and tested for equality using a logistic regression including condition, patient, and reader effects.
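The three-class categorization and its accuracy can be sketched as follows. Deriving the reader's class from the number of lesions they annotated, and the class labels themselves, are assumptions made for this example.

```python
def lesion_class(n_lesions):
    """Map a lesion count to one of the three acquisition-level classes."""
    if n_lesions < 0:
        raise ValueError("lesion count cannot be negative")
    if n_lesions == 0:
        return "none"
    if n_lesions <= 2:
        return "1-2"
    return ">=3"


def categorization_accuracy(reader_counts, truth_counts):
    """Fraction of acquisitions where the reader's class matches the ground truth.

    reader_counts : lesions annotated by the reader, one value per acquisition
                    (assumption: the class is derived from this count).
    truth_counts  : ground-truth lesion counts for the same acquisitions.
    """
    if len(reader_counts) != len(truth_counts) or not reader_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        lesion_class(r) == lesion_class(t)
        for r, t in zip(reader_counts, truth_counts)
    )
    return matches / len(reader_counts)
```

With this scheme, a reader who annotates one lesion where the ground truth contains two is still counted as correct, because both counts fall into the same class.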
Task duration
The mean time elapsed for each acquisition, each reader, and each condition was estimated after filtering out the upper 5% of values, to avoid outliers due to delays caused by external interruptions during the task. It was tested for equality between conditions using a Wilcoxon rank sum test.
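The filtering step can be sketched as below. How the 5% threshold was computed (which quantile definition, and over which pool of timings) is not specified here, so this example makes the simplest assumption: it drops the largest 5% of the values it is given, rounded down.

```python
def mean_after_dropping_top(times, top_percent=5):
    """Mean task duration after discarding the largest `top_percent` % of values.

    times       : recorded durations (for example, seconds per MRI volume)
    top_percent : integer percentage of the largest values to discard
    Assumption: the number of discarded values is floor(n * top_percent / 100).
    """
    if not times:
        raise ValueError("at least one duration is required")
    ordered = sorted(times)
    n_drop = len(ordered) * top_percent // 100  # integer arithmetic, rounds down
    kept = ordered[: len(ordered) - n_drop]
    return sum(kept) / len(kept)
```

With 20 timings, exactly one value (the longest) is removed, so a single reading inflated by an interruption no longer dominates the mean.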
Inter-reader variability
The inter-reader differences in detected lesions within each condition were reported as multi-rater Light’s kappa. The pooled inter-reader standard deviation associated with the number of lesions detected in each condition was computed and compared using a linear model with a condition and an acquisition fixed effect.
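Light's kappa is the mean of Cohen's kappa over all pairs of raters. The sketch below is a minimal illustration and not the R implementation used in the study; it assumes that each reader's ratings are binary (lesion detected or missed) and listed in the same lesion order for every reader.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters rating the same items."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both raters must rate the same non-empty set of items")
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    expected = sum(
        counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def light_kappa(ratings_by_reader):
    """Light's kappa: mean of Cohen's kappa over all reader pairs.

    ratings_by_reader : list of rating sequences, one per reader
    (assumption: 1 = lesion detected, 0 = lesion missed).
    """
    pairs = list(combinations(ratings_by_reader, 2))
    if not pairs:
        raise ValueError("at least two readers are required")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

For three readers where two agree perfectly and the third agrees with them only at chance level, the pairwise kappas are 1, 0, and 0, giving a Light's kappa of 1/3.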
Revised ground truth
The main analyses were repeated with the revised ground truth.