Reference standard for the evaluation of automatic segmentation algorithms: Quantification of inter-observer variability of manual delineation of prostate contour on MRI

Repeatable and accurate prostate and/or lesion segmentation is crucial for comparing images acquired at multiple time points, multimodal fusion, prostate-guided biopsy, and radiotherapy planning. It is also essential for deriving precise biomarkers, such as prostate-specific antigen (PSA) density or quantitative magnetic resonance imaging (MRI) parameters, to improve prostate cancer (PCa) characterization and prognosis [1,2]. There is currently no consensus on the optimal technique for delineating prostate contours. Manual segmentation is one approach, but it is a tedious and time-consuming process that is subject to variability [3].
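As a concrete illustration of one such biomarker: PSA density is the serum PSA concentration divided by the prostate volume, and the volume is supplied directly by the segmentation. The sketch below (illustrative Python; names such as `psa_density` and `voxel_spacing_mm` are hypothetical and not from this study) shows how the volume might be derived from a binary mask and the voxel spacing.

```python
import numpy as np

def psa_density(psa_ng_ml, mask, voxel_spacing_mm):
    """PSA density (ng/mL per mL of gland) from a binary prostate mask.

    `mask` is a 3D boolean array (True inside the prostate) and
    `voxel_spacing_mm` the (z, y, x) voxel dimensions in millimetres;
    both names are illustrative, not taken from the study.
    """
    voxel_volume_mm3 = float(np.prod(voxel_spacing_mm))
    # 1 mL = 1000 mm^3, so gland volume in mL is voxel count times voxel volume / 1000
    prostate_volume_ml = mask.sum() * voxel_volume_mm3 / 1000.0
    return psa_ng_ml / prostate_volume_ml
```

An error of a few millilitres in the segmented volume propagates directly into the density value, which is why segmentation variability matters for this biomarker.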

Automated segmentation tools have been developed to remove the variability introduced by human readers [4]. The performance of such tools, like that of human readers, is difficult to quantify because a true segmentation is often not available. Phantom studies do not reflect the full range of anatomical variability encountered in clinical imaging, and an anatomical reference may not be easily obtained [5], [6], [7]. An alternative approach is to generate a consensus segmentation by combining the masks provided by multiple readers. However, this option raises two important issues. The first is determining which readers should participate in the segmentation process, and how many, to obtain a representative sample of observations that accurately reflects reality. The second is producing a consensus segmentation that captures the variability present in the individual observations. While the latter is typically approached from a computer science perspective, the former poses a clinical challenge.
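To make the fusion step concrete, one simple and widely used rule is majority voting, where a voxel enters the consensus when more than half of the readers segmented it. The minimal sketch below assumes co-registered binary masks of identical shape (function and variable names are hypothetical, and this is not necessarily the fusion method used in this study); probabilistic approaches such as STAPLE, which weight readers by estimated performance rather than counting votes equally, are a common alternative.

```python
import numpy as np

def majority_vote_consensus(masks):
    """Consensus mask: a voxel is labelled prostate when a strict
    majority of the co-registered reader masks include it
    (exact ties, possible with an even reader count, are excluded)."""
    votes = np.stack([m.astype(bool) for m in masks]).sum(axis=0)
    return votes > len(masks) / 2
```

Note that the consensus produced by such a rule changes as readers are added or removed, which is precisely why the number of readers is a substantive design choice rather than a detail.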

Previous studies have assessed the impact of readers' expertise on their segmentations, but the influence of the number of readers on the resulting consensus segmentation remains unknown [8], [9], [10]. In practice, the number of readers is limited by factors such as time, funding, and the availability of experienced and motivated individuals.

Recognizing the significant issue of inter-reader variability, a recent consensus reached by the European Society of Radiology and the European Organisation for Research and Treatment of Cancer recommends that the reference standard used for training algorithms be based on segmentations generated by “multiple” expert observers [11]. However, the ideal number of readers remains unclear. Addressing this challenge requires an objective assessment of the consistency between readers' segmentations and of their agreement with the consensus.
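One standard way to carry out such an assessment is the Dice similarity coefficient, an overlap metric between two binary masks that equals 1 for perfect agreement and 0 for no overlap. The sketch below (hypothetical helper names, shown only to illustrate the idea rather than this study's actual protocol) computes the mean pairwise Dice between readers alongside each reader's mean Dice against a consensus mask.

```python
from itertools import combinations

import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    # Two empty masks are treated as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def agreement_summary(masks, consensus):
    """Mean pairwise Dice across readers, and mean Dice of each
    reader against the consensus mask."""
    pairwise = [dice(a, b) for a, b in combinations(masks, 2)]
    vs_consensus = [dice(m, consensus) for m in masks]
    return float(np.mean(pairwise)), float(np.mean(vs_consensus))
```

Tracking how these two summary values evolve as readers are added is one way to judge when additional readers stop changing the consensus appreciably.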

The purpose of this study was to investigate the relationship between inter-reader variability in manual prostate contour segmentation and the number of readers, in order to determine the optimal number of readers required to establish a reliable reference standard for the evaluation of automatic segmentation algorithms.
