Identifying sources of bias when testing three available algorithms for quantifying white matter lesions: BIANCA, LPA and LGA

For this study, we used three established segmentation algorithms, BIANCA, LPA and LGA, since they are freely available, widely used in the community [19, 21, 27, 32, 34, 36, 37, 38, 39, 40, 41, 42, 43] and take different methodological approaches to segmenting WML.

Participants

The data used in this study is sourced from the population-based 1000BRAINS cohort [44], which focuses on investigating structural and functional variations in the normal aging brain. Participants were recruited from the Heinz-Nixdorf-Recall (HNR) study and the related HNR-Multi-Generation-study [45]. This dataset includes a wide range of epidemiological information, such as neuropsychological tests, life quality, mood and daily activities, as well as laboratory, genetic, clinical, socioeconomic and environmental data.

Inclusion criteria for the present study required participants to be free from stroke and to provide complete data, including FLAIR and T1 images, along with age, sex, blood glucose levels, systolic and diastolic blood pressure levels, diabetes diagnosis and hypertension diagnosis. Initially, there were 1314 participants, but 55 were excluded due to incomplete laboratory information (see ‘Influencing factors’ section), 65 lacked the complete set of T1 + FLAIR modalities and 28 were excluded because of post-stroke tissue defects of any type. In total, 1166 participants (mean age = 60, range 18–87, female (F):male (M) 523:643) from the first visit, as detailed in [44], were included in this study. Family relationships among participants were accounted for in the statistical analysis described in a later section. All participants provided informed consent in accordance with the Declaration of Helsinki. The study protocol of 1000BRAINS was approved by the ethics committee of the University of Duisburg-Essen.

MRI data

MRI brain scans were conducted using a 3-T MR scanner (Siemens Tim-TRIO; for the whole protocol, see Caspers et al. 2014). The sequences included in this study were as follows: a 3D T1-weighted MPRAGE (176 slices, TR = 2.25 s, TE = 3.03 ms, TI = 900 ms, FoV = 256 × 256 mm², flip angle = 9°, voxel resolution = 1 × 1 × 1 mm³) and a clinical T2-weighted FLAIR (25 slices, TR = 9 s, TE = 100 ms, FoV = 220 × 220 mm², flip angle = 150°, voxel resolution = 0.9 × 0.9 × 4 mm³).

Relevant influencing factors

Given the associations between WML and age, sex, high blood glucose levels, diabetes, high blood pressure and hypertension demonstrated in previous studies [29, 31, 46,47,48], we included these influencing factors in the present study. The following information was considered for each participant: age, sex, systolic blood pressure (mmHg), diastolic blood pressure (mmHg), whether the participant had received a diagnosis of hypertension, any medication taken against hypertension (if applicable), blood glucose level (mg/dL) and whether the participant had a diagnosis of diabetes mellitus. Participants with a confirmed diagnosis of diabetes were considered to have diabetes. Participants with a confirmed diagnosis of hypertension or taking medication against hypertension were considered to have hypertension.
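The two classification rules above can be written down compactly. The following is a minimal sketch, assuming a hypothetical `Participant` record whose field names are illustrative and not taken from the 1000BRAINS database:

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical record; field names are illustrative only."""
    age: int
    sex: str                          # "F" or "M"
    systolic_mmhg: float
    diastolic_mmhg: float
    glucose_mg_dl: float
    hypertension_diagnosis: bool
    antihypertensive_medication: bool
    diabetes_diagnosis: bool


def has_hypertension(p: Participant) -> bool:
    # Rule from the text: confirmed diagnosis OR medication against hypertension.
    return p.hypertension_diagnosis or p.antihypertensive_medication


def has_diabetes(p: Participant) -> bool:
    # Rule from the text: confirmed diagnosis only.
    return p.diabetes_diagnosis
```

Under these rules, a participant without a hypertension diagnosis who takes antihypertensive medication is still counted as hypertensive.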

Evaluating the impact of relevant influencing factors on automatic WML estimations

Aim 1: influence of training data

To determine whether the composition of the training data influences BIANCA’s WML estimations, we created 15 training datasets, each characterised by one factor that could potentially influence BIANCA’s outcome. We then trained the algorithm 15 times, once per training dataset, and compared the resulting WML estimations for the whole 1000BRAINS cohort.

First, we selected a subsample characterised solely by participants’ age, ensuring that no other relevant factor was present. Specifically, we excluded participants with hypertension, systolic blood pressure exceeding 140 mmHg [49], diabetes or a blood glucose level above 126 mg/dL [50]. We grouped these participants into six training datasets: age 18 (18–37 years), age 37 (37–47 years), age 47 (47–57 years), age 57 (57–67 years), age 67 (67–77 years) and age 77 (77–87 years) (as shown in Table 1). We followed the same approach for all other relevant factors, i.e. selecting one factor of interest and keeping all other factors stable in the respective subsample.

To explore the influence of sex, we selected two subsamples of participants above 60 years old [51]: one without cardiovascular factors (hypertension and diabetes) and one with these factors. We then grouped the participants by sex, resulting in four more training datasets. Two were ‘healthy’ older datasets: ‘males with no cardiovascular factors’ (male–no CF), i.e. males with neither hypertension nor diabetes, with blood glucose level below 126 mg/dL and with systolic blood pressure below 140 mmHg, and ‘females without cardiovascular factors’ (female–no CF). The other two were older datasets presenting relevant cardiovascular factors: ‘males with cardiovascular factors’ (male–CF), i.e. with hypertension or diabetes or with blood glucose level above 126 mg/dL or with systolic blood pressure above 140 mmHg, and ‘females with cardiovascular factors’ (female–CF) (see Table 1).

To explore the influence of particular cardiovascular factors in older participants, we created four training datasets with individuals above 60 years old [51]. The ‘diabetes’ dataset comprised participants with a confirmed diabetes diagnosis or with blood glucose level exceeding 126 mg/dL [31, 50]. The ‘hypertension’ dataset comprised participants diagnosed with hypertension or with systolic blood pressure above 140 mmHg [49]. Two ‘control’ datasets, ‘control diabetes’ and ‘control hypertension’, served as comparisons for the ‘diabetes’ and ‘hypertension’ training datasets. The ‘control’ datasets consisted of ‘healthy’ participants with blood glucose level below 126 mg/dL and systolic blood pressure below 140 mmHg.

Finally, we created one last training dataset comprising participants with a uniform age distribution ranging from 18 to 87 years, of mixed sex, with and without cardiovascular factors (hypertension and diabetes). We called this training dataset ‘TD120’ (as it comprises 120 participants); see Table 1 for a summary of the training datasets. The number of participants in each training dataset was selected to maintain the same prevalence of these factors as in the 1000BRAINS cohort.
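The age grouping and the exclusion rule used for the age-only subsample can be sketched as follows. This is an illustration only: the text lists age ranges that share their boundaries (e.g. 18–37 and 37–47), so the half-open intervals below are an assumption, as are the function names.

```python
# Age ranges as listed in the text; the label is the lower bound.
AGE_BINS = [(18, 37), (37, 47), (47, 57), (57, 67), (67, 77), (77, 87)]


def age_group(age):
    """Return a label such as 'age 47', or None if age is outside 18-87."""
    last = len(AGE_BINS) - 1
    for i, (low, high) in enumerate(AGE_BINS):
        # Assumption: bins are half-open [low, high); only the last bin
        # includes its upper bound so that 87-year-olds are assigned.
        if low <= age < high or (i == last and age == high):
            return f"age {low}"
    return None


def free_of_risk_factors(hypertension, diabetes, systolic_mmhg, glucose_mg_dl):
    """True if none of the exclusion criteria named in the text applies."""
    return not (
        hypertension
        or diabetes
        or systolic_mmhg > 140   # systolic blood pressure exceeding 140 mmHg
        or glucose_mg_dl > 126   # blood glucose above 126 mg/dL
    )
```

With this convention a 37-year-old falls into ‘age 37’ rather than ‘age 18’; the opposite convention would be equally compatible with the ranges as printed.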

Table 1 Description of the training and validation datasets

Aim 2: impact of relevant factors on test data

To identify whether the presence of specific factors in the test data leads to (in)accurate delineation, we created 13 test datasets, each displaying a unique characteristic. We then compared the WML estimations and performance of BIANCA, LPA and LGA within each group of individuals. As for the training datasets in aim 1, we first selected a subsample characterised solely by participants’ age, ensuring that no other relevant factors were present. We grouped these participants into five test datasets: age 18 (18–37 years), age 37 (37–47 years), age 47 (47–57 years), age 57 (57–67 years) and age 67 (67–87 years) (as shown in Table 2). Secondly, we grouped participants above 60 years old [51] into four test datasets based on their sex and presence of cardiovascular factors: ‘males with no cardiovascular factors’ (male–no CF), ‘females without cardiovascular factors’ (female–no CF), ‘males with cardiovascular factors’ (male–CF), i.e. with hypertension or diabetes or with blood glucose level above 126 mg/dL or with systolic blood pressure above 140 mmHg, and ‘females with cardiovascular factors’ (female–CF) (see Table 2). Lastly, to explore the impact of particular cardiovascular factors in older participants, we created four test datasets. The first, comprising participants with a confirmed diabetes diagnosis or a blood glucose level exceeding 126 mg/dL [50], was called ‘diabetes’. The second, comprising participants diagnosed with hypertension or with a systolic blood pressure above 140 mmHg [49], was called ‘hypertension’. The remaining two ‘control’ test datasets were called ‘control diabetes’ and ‘control hypertension’. These ‘control’ test datasets consisted of ‘healthy’ participants with blood glucose level below 126 mg/dL and systolic blood pressure below 140 mmHg. All participants in these test datasets were above 60 years old [51] (see Table 2 for a summary of the test datasets).

Table 2 Description of the 13 test datasets

Additional step: repeating aim 1 on characterised subsamples

As an additional step, after creating the characterised subsamples in aim 2, we repeated our analyses for aim 1 on each distinctive test dataset. This means we repeated the analyses for aim 1 13 times, applying them to each characterised test dataset instead of to the entire 1000BRAINS cohort. This approach allowed us to analyse whether the influence of the relevant factors in the training data differed when the characteristics of the test subsamples changed.

Manual WML segmentation

To examine whether BIANCA’s WML estimations are influenced by the presence of relevant factors in the training data, we manually segmented the WML on the FLAIR images of the participants comprising each training dataset shown in Table 1 (total of 120 participants). We did this with FSLeyes, a tool from FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki). Binary masks were generated with a value of 0 for non-WML voxels and 1 for WML voxels. An example of a manually segmented mask is shown in Fig. 1.

Fig. 1

Example of a manual segmentation on a FLAIR image

To analyze the performance of BIANCA, LPA and LGA on distinct characterised subsamples, we examined the degree of overlap between WML masks provided by the algorithms and manually segmented WML masks. For this purpose, we employed the same data created in the previous step as validation data, i.e. the 120 manually segmented scans in FLAIR space. Since LGA uses T1 modality as base reference, we also required validation data in T1 space. Therefore, we co-registered the manually segmented WML masks in FLAIR space to T1 space using FLIRT from FSL for modality co-registration (FLAIR to T1). Detailed information on the number of participants in each validation dataset is provided in Table 1.
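The text does not spell out how the masks were moved to T1 space. One common way to do this with FLIRT is to estimate the FLAIR-to-T1 transform on the images themselves and then apply it to the mask with nearest-neighbour interpolation so that the mask stays binary. The sketch below only builds the two command lines; every file name is hypothetical, and the rigid-body setting (`-dof 6`) is an assumption carried over from the pre-processing description, not a stated parameter of this step.

```python
def flair_to_t1_commands(flair, t1, mask_flair, mask_t1, matrix="flair2t1.mat"):
    """Build (but do not run) two FLIRT calls; all file names are hypothetical."""
    # Step 1: estimate the FLAIR -> T1 transform and save it as a matrix file.
    register = [
        "flirt", "-in", flair, "-ref", t1,
        "-omat", matrix, "-dof", "6",     # assumption: rigid-body registration
    ]
    # Step 2: apply the saved transform to the manual mask.
    resample_mask = [
        "flirt", "-in", mask_flair, "-ref", t1,
        "-applyxfm", "-init", matrix,
        "-interp", "nearestneighbour",    # avoids fractional values in a 0/1 mask
        "-out", mask_t1,
    ]
    return register, resample_mask
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where FSL is installed.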

Automated WML segmentation algorithms

To address the first aim, we opted for one of the most widely used and established algorithms in the literature [27, 32, 39, 40, 52, 53], BIANCA [19]. BIANCA requires data pre-processing, selection of initial parameters and training.

The pre-processing steps involved tools from FSL (http://fsl.fmrib.ox.ac.uk/fsl) [54]. We used BET to produce brain-extracted images in the FLAIR and T1 modalities, and FLIRT both for modality co-registration (T1 to FLAIR) using linear rigid-body registration (6 degrees of freedom) and for normalisation to the MNI152 standard template [55].

Regarding the initial parameters, we used T1 and FLAIR modalities, with FLAIR as the reference base modality. We followed the options recommended by [19] to optimise the dice similarity index (DSI) and false positive ratio. This included setting spatial weighting (sw) to 1 (default), no patch and selecting no border (excluding three voxels close to the lesion’s edge) for the location of non-lesion training points. We used a fixed and unbalanced (FU) number of training points, with 5000 for the number of lesion points and 25,000 for non-lesion points per training subject. The total WML volume, and hence the BIANCA estimation, was obtained by applying a threshold of 0.9 [19] to the lesion probability map, which constitutes the output of the algorithm. Further details can be found in [19].
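As an illustration of the last step, the sketch below turns a lesion probability map into a total WML volume. It is a simplified stand-in, not BIANCA's own code: the map is a flat list of probabilities, voxels at or above the threshold are kept (an assumption about how ties are handled), and the default voxel volume is taken from the FLAIR protocol above (0.9 × 0.9 × 4 mm), which assumes the volume is measured in native FLAIR space.

```python
def wml_volume_mm3(prob_map, threshold=0.9, voxel_volume_mm3=0.9 * 0.9 * 4.0):
    """Count voxels at or above the threshold and convert the count to mm^3."""
    # Assumption: a voxel exactly at the threshold counts as lesion.
    n_lesion = sum(1 for p in prob_map if p >= threshold)
    return n_lesion * voxel_volume_mm3


# Hypothetical probabilities for five voxels.
example_volume = wml_volume_mm3([0.95, 0.50, 0.91, 0.90, 0.10])
```

In this example three of the five voxels pass the threshold, so the volume is three times the voxel volume.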

BIANCA is trained by creating a feature space that includes both intensity and spatial features from the lesion and non-lesion voxels determined in the training data. Feature vectors for both classes, WML and non-WML, are created for each of the selected number of training points. Once the ‘training’ vectors are established in the feature space, classification of unseen voxels (unseen images) is performed by creating its own feature vector and measuring the distance to the 40 nearest training feature vectors (k-nearest neighbour). Therefore, by using each specific characterised training data (shown in Table 1), BIANCA generated specific training feature vectors linked to each relevant factor. This approach allowed us to assess WML estimation differences when training data characteristics changed.
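The classification step can be sketched as a plain k-nearest-neighbour vote. This is a toy illustration rather than BIANCA's implementation: Euclidean distance and the use of the lesion fraction among the neighbours as a probability are assumptions, and the feature vectors are hypothetical.

```python
import math


def knn_lesion_probability(query, training_vectors, training_labels, k=40):
    """Fraction of the k nearest training vectors labelled as lesion (1)."""
    # Pair each training vector's distance to the query with its label,
    # then sort so that the closest training points come first.
    distances = sorted(
        (math.dist(query, vec), label)
        for vec, label in zip(training_vectors, training_labels)
    )
    nearest = distances[:k]   # fewer than k if the training set is small
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical 2-D feature vectors: two non-lesion (0) and two lesion (1) points.
train_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
train_y = [0, 0, 1, 1]
p_lesion = knn_lesion_probability((0.95, 1.0), train_x, train_y, k=2)
```

Because the neighbours are drawn from the training data, changing which participants contribute training points changes which labels a query voxel sees, which is the effect examined in aim 1.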

To address the second aim, we selected algorithms that, like BIANCA, are well established and broadly used in the community [19, 21, 27, 32, 34, 36,37,38,39,40,41,42,43]. These algorithms include BIANCA itself, the Lesion Prediction Algorithm (LPA) [23] and the Lesion Growth Algorithm (LGA) [22].

In this case, BIANCA was trained on participants with a uniform age distribution ranging from 18 to 87 years, of mixed sex, with and without cardiovascular factors, since this characterised dataset yielded the highest performance in aim 1.

LPA requires only a FLAIR image as input. It also offers the option to include another modality as the base reference image. In this study, we tested the algorithm with two options: using the FLAIR modality only, and using a combination of FLAIR + T1 with T1 as the base modality. We applied a threshold of 0.5 [23] to the lesion probability maps.

LGA requires T1 modality as a reference image along with FLAIR images. An initial threshold (kappa), user-determined, is needed. In our study, we selected a kappa value of 0.25 [22] and applied a threshold of 0.3 [22] to the lesion probability maps [22, 23].

We applied BIANCA, LPA and LGA to the 13 characterised test datasets (described in Table 2) and then compared their outputs and performance within each characteristic.

Statistical analysis

For aim 1, to determine whether BIANCA’s estimations are influenced by the presence of relevant factors in the training data, we compared the estimated WML volumes obtained with each training dataset when the algorithm was applied to the 1000BRAINS cohort and to each characterised subsample depicting a different age distribution, stratified by sex, with and without cardiovascular factors (as shown in Table 2). Specifically, we examined how different characteristics present in the training data influenced the results obtained by BIANCA. We conducted these comparisons using mixed ANOVAs with Bonferroni post hoc tests. This design accounts for participants who are related to each other (between-subjects factor) and for the variance introduced by the different training datasets (within-subjects factor).
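For readers unfamiliar with the Bonferroni correction, the textbook form is sketched below. The text does not state how many post hoc comparisons were made or which software was used, so the count shown is only the number of possible pairs among 15 training datasets, and the p-values in the example are hypothetical.

```python
from math import comb


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Number of distinct pairs among 15 training datasets.
n_pairs = comb(15, 2)

# Hypothetical raw p-values from three comparisons.
adjusted = bonferroni([0.01, 0.04, 0.5])
```

The correction is conservative: the more pairs are compared, the smaller a raw p-value must be to remain significant after adjustment.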

We also analysed the performance in each case, to identify which composition of training data yielded the highest performance. We measured the degree of overlap between the WML masks provided by BIANCA and the manually segmented WML masks (see validation datasets in Table 1) using the dice similarity index (DSI). The DSI is calculated as twice the number of voxels in the intersection of the manual and algorithm masks, divided by the sum of the number of voxels in the manual mask and the number of voxels in the algorithm mask. This choice aligns with previous studies, which have identified the DSI as the most robust indicator of overlap between the manual mask and the estimated mask [19].
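Written as a formula, with M the set of manually segmented voxels and A the set of voxels segmented by the algorithm, DSI = 2 |M ∩ A| / (|M| + |A|). A minimal sketch on flattened binary masks follows; the function name and the handling of two empty masks are assumptions made for illustration.

```python
def dice_similarity_index(manual_mask, algorithm_mask):
    """DSI = 2 * |M ∩ A| / (|M| + |A|) for two equally sized binary masks."""
    if len(manual_mask) != len(algorithm_mask):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as WML in both masks.
    intersection = sum(1 for m, a in zip(manual_mask, algorithm_mask) if m and a)
    # Sum of the WML voxel counts of the two masks.
    total = sum(1 for m in manual_mask if m) + sum(1 for a in algorithm_mask if a)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical five-voxel masks: 2 shared WML voxels, 3 + 3 WML voxels in total.
dsi = dice_similarity_index([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

The index ranges from 0 (no overlap) to 1 (identical masks); the example evaluates to 2 × 2 / 6.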

For aim 2, to identify whether the presence of specific factors in the test data leads to (in)accurate delineation, we analysed the output differences of BIANCA, LPA and LGA, as well as their performance when applied to individuals exhibiting different age distributions, stratified by sex, with and without cardiovascular factors (details of the individuals’ characteristics are shown in Table 2).

Regarding the output differences, we compared the outcomes of BIANCA versus LPA when using the FLAIR modality only, versus LPA when using both T1 and FLAIR modalities, and versus LGA, within each characterised test dataset described in Table 2. For instance, we considered the test subgroup ‘hypertension’ as the explanatory factor and the WML volume estimations as the dependent variable (within-subjects). We analysed these differences using mixed between-within participants ANOVAs with Bonferroni post hoc tests.

Regarding the performance, we measured the degree of overlap between WML masks provided by the different algorithms and manually segmented WML masks (see validation datasets in Table 1) employing the DSI.

GEROSCIENCE
