A deep-learning algorithm (AIFORIA) for classification of hematopoietic cells in bone marrow aspirate smears based on nine cell classes—a feasible approach for routine screening?

Bone marrow aspirate smears were scanned (Pannoramic 1000, 3DHISTECH Ltd, Budapest, Hungary) and uploaded to the Aiforia Create platform (Aiforia Technologies Plc, Helsinki, Finland) for the development of a convolutional neural network (CNN)-based algorithm for the detection and classification of hematopoietic cells. External human validation was performed independently by three experts in bone marrow cytology on a separate set of digitized bone marrow images.

Bone marrow aspirate smears

May-Grünwald Giemsa (MGG)-stained bone marrow aspirate (BMA) smears were used for the training (n = 30), testing (n = 20), and validation (n = 30) of the AI model, without duplication across datasets. The samples were collected from the archives of the Department of Clinical Pathology and Cancer Diagnostics, Karolinska University Laboratory (KUL), Solna. All samples were from untreated, non-cytopenic patients (n = 80) with normal or reactive marrow findings and were taken as part of a routine bone marrow (BM) staging examination. The BMA smears were uniformly prepared using the same staining protocol according to the manufacturer's guidelines (Sigma-Aldrich), and all included cellular marrow particles with the presence of megakaryocytes. The slides were digitized using a Pannoramic 1000 whole-slide scanner (3DHISTECH Ltd, Budapest, Hungary) at an output magnification of 63.06× (40× objective with a 1.6× camera adapter) and an image resolution of 0.158309 μm/pixel in the X plane and 0.158834 μm/pixel in the Y plane.

Cell classification and annotation

The hematopoietic cells in BMA smears were assigned to nine major cell classes: blast, promyelocyte, myelocyte/metamyelocyte, proerythroblast, erythroblast (basophilic, polychromatic, and orthochromatic), mature granulocyte (segmented/band neutrophil, eosinophil, basophil), lymphocyte, monocyte, and plasma cell. Mature granulocytes were combined into one class, as only a few eosinophils and very few basophils were present in our dataset. The maturation stages within erythropoiesis were divided into two classes: the more immature proerythroblasts and the normoblasts (erythroblasts). The BM differential cell count (DCC) did not include mast cells, megakaryocytes, smudge cells, or mesenchymal stromal cells. Representative examples of the cell classes used for training of the AI model are illustrated in Fig. 1.

Fig. 1

Training annotations based on nine cell classes in bone marrow aspirate smears. Single-cell annotations for nine cell classes: blast, red ring; promyelocyte, purple ring; myelocyte/metamyelocyte, turquoise ring; granulocyte (neutrophil, eosinophil, basophil), brown ring; lymphocyte, green ring; monocyte, yellow ring; plasma cell, blue ring; normoblast (orthochromatic, polychromatic, basophilic), orange ring; pro-normoblast, burgundy red ring. The training regions are indicated by a black line, and all cells within these areas were annotated, except smudge cells, thrombocyte aggregates, and artefacts.
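For reference, the nine-class scheme can be summarized as a simple mapping. The following Python sketch is purely descriptive; the identifiers are ours and are not part of the AI model or the Aiforia platform:

```python
# Descriptive summary of the nine-class scheme used for training.
# Keys are the model's cell classes; values list the morphological
# subtypes pooled into each class (names as used in the text).
CELL_CLASSES = {
    "blast": [],
    "promyelocyte": [],
    "myelocyte/metamyelocyte": ["myelocyte", "metamyelocyte"],
    "proerythroblast": [],
    "erythroblast": ["basophilic", "polychromatic", "orthochromatic"],
    "mature_granulocyte": ["segmented neutrophil", "band neutrophil",
                           "eosinophil", "basophil"],
    "lymphocyte": [],
    "monocyte": [],
    "plasma_cell": [],
}

# Cell types deliberately excluded from the BM differential cell count:
EXCLUDED = ["mast cell", "megakaryocyte", "smudge cell",
            "mesenchymal stromal cell"]

assert len(CELL_CLASSES) == 9  # nine major classes
```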

Regions of interest (ROI) for annotation were first selected and drawn manually in areas where cells were evenly distributed, cytologically intact, non-overlapping, and most representative of the spectrum of hematopoiesis. Individual cells were annotated according to well-established cytomorphological criteria for each cell type [6], using a consistent cell size for each class with the whole target (nucleus and surrounding cytoplasm) centered (Fig. 1, Table 1). Advanced parameters were used to allow for object overlap and object size differences. All annotations were reviewed for appropriateness of classification ("ground truth") by two experienced hematopathologists. Cells of uncertain class, smudge cells, naked nuclei, and thrombocyte aggregates were not annotated but were included in the training regions.

Table 1 Cell classes in the AI model and number of annotations used for training

Training and verification

Cell annotations were performed in a stepwise process following the recommended workflow (Aiforia), starting with a smaller number of annotations for each class, followed by repeated training to guide new annotations until the desired AI model performance was obtained. The selected layer complexity for the model was set to "extra complex." Advanced training parameters included the maximum object overlap and the minimal object size difference, set in our model at 0.5 and 0.25, respectively (Suppl. Table 1). The maximum object overlap prevents the neural network from finding two overlapping objects and, for example, detecting one object twice. Image augmentation was used to add variability to the training data during training, i.e., additional training data were created from the actual annotations. The augmentation parameters included scale (min/max variation of training regions), luminance (min/max variation in brightness within the same and between different images), and contrast (color variation in the target regions between images), all three set at a min/max of −10 to 10; maximum image shear (set at 10); maximum white balance change (set at 5); white noise, i.e., noise and artefacts in the image background (set at 2); and rotation angle (min/max rotation angle used in augmentation, set at −180 to 180). A total of 3056 (out of 7000) iterations were executed on all training regions (1 h 46 min 36 s), with an overall training loss of 0.2258.
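As a rough illustration, the training and augmentation settings described above can be collected into a single configuration. The following Python sketch uses our own shorthand key names, which do not necessarily correspond to the actual parameter names in the Aiforia Create platform:

```python
# Illustrative summary of the training configuration described above.
# Key names are our shorthand, not Aiforia's parameter names.
TRAINING_CONFIG = {
    "layer_complexity": "extra complex",
    "max_object_overlap": 0.5,        # prevents double-detection of one object
    "min_object_size_difference": 0.25,
    "iterations_planned": 7000,
    "iterations_executed": 3056,       # overall training loss: 0.2258
}

AUGMENTATION = {
    "scale": (-10, 10),                # min/max variation of training regions
    "luminance": (-10, 10),            # min/max brightness variation
    "contrast": (-10, 10),             # color variation in target regions
    "max_shear": 10,
    "max_white_balance_change": 5,
    "white_noise": 2,                  # background noise/artefacts
    "rotation_angle": (-180, 180),     # min/max rotation in augmentation
}
```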

Verification of the AI model was performed on the training regions and on selected areas outside the training regions to assess the generalizability of the classification. Verification results were sorted by error rate (high to low) for review. Annotations were improved by identifying misclassified cells and by adding annotations that had been missed. Smudge cells, naked nuclei, cells in mitosis, thrombocyte aggregates, and cells that were not clearly identifiable were not annotated but were intentionally included in the training regions. The training was repeated several times with adjustment of the training parameters, and the AI model was further refined by altering the "gain values" for certain cell classes when the model recognized too few or too many cells of that class. A total of 1950 single-cell annotations were performed for the training (Table 1). The final total class error for all training regions was 0.15%, with 99.9% precision and sensitivity (F1-score 99.2%). Visual inspection of the classification results on a separate slide set that was not used for training indicated good performance of the AI model.

External validation of the AI model

The AI model was validated against three external human validators, all experienced in bone marrow cytology, using a separate set of digitized whole-slide images (WSI) from normal hospital controls (n = 20). The validation regions were areas in which cells were well dispersed, with good cytological detail and a low number of smudge (lysed) cells. The external validators used their own computer screens and had access to the Aiforia Create platform. An average of 2048 cell annotations in 515 validation regions were performed independently on two separate occasions. Annotations made by the human experts were considered the "gold standard": the classification results of the three external validators were averaged for comparison with the AI model ("AI vs human") and were also compared with each other ("human vs human"), with respect to the "ground truth" generated by the training and testing of the AI algorithm.

WSI analysis vs automated classification in regions of interest

In clinical routine, areas of well-spread marrow cells with good cytological detail and a paucity of artefacts are selected for the cytomorphological assessment of BMA smears and for performing the DCC [4]. However, representative areas are not always found in the cellular trails of the BMA smear behind the particles. For example, groups of blast cells can sometimes be detected in the tail or at the edges of the microscopic slide. Therefore, deep-learning models should either be applied to WSI or be trained to select ROI that are both informative and reflective of the spectrum of hematopoietic cells present. Alternatively, a semi-automated approach could be used, with ROI selected by human experts following WSI analysis for visual control of the output data in non-hemodiluted areas with good cytological detail. To test the appropriateness of the latter approach, 16 normal BMA smears were selected for WSI analysis, and the results were compared to the classification results in one larger ROI of equal size for all 16 samples vs ten smaller, randomly selected ROI per slide. The reason for also including smaller ROI was that they better reflect the routine clinical approach when performing manual DCCs at high-power magnification in different areas of a bone marrow aspirate smear.

Statistical analyses

The classification results of the external validation were exported from the Aiforia Create platform for statistical analysis. Statistical analyses were performed using R Statistical Software version 4.3.3.

False positive (FP) refers to objects that were not annotated by the external validator but were detected by the AI model, and false negative (FN) refers to objects that were annotated by the external validator but not detected by the AI model. The false positive error was calculated as FP/(FP + TN), the false negative error as FN/(TP + FN), and the total class error as (FP + FN)/P, where P is the sum of TP + TN + FP + FN. Precision is the percentage of the analysis findings that overlap with annotated objects, calculated as TP/(TP + FP). Sensitivity is the percentage of annotated objects that were found by the analysis, calculated as TP/(TP + FN). The results of the external validation (Table 2) were calculated using a two-step averaging process: the average FP%, FN%, total error %, precision %, sensitivity %, and F1-score were first calculated over the nine cell classes per validator; these values were then averaged across the three validators and compared to the AI model. The reported F1-score is the average of the F1-scores from the three validators.
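For clarity, the per-class metrics and the two-step averaging can be expressed in code. The following Python sketch assumes a hypothetical layout of the exported counts and is not the authors' analysis code (which was written in R):

```python
import numpy as np

def class_metrics(tp, tn, fp, fn):
    """Per-class metrics as defined above (counts for one cell class)."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "fp_error": fp / (fp + tn),        # false positive error
        "fn_error": fn / (tp + fn),        # false negative error
        "total_error": (fp + fn) / total,  # total class error
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def two_step_average(counts):
    """Two-step averaging: first average each metric over the nine cell
    classes per validator, then average across the three validators.
    `counts` is a list (one entry per validator) of dicts mapping each
    cell class to a (tp, tn, fp, fn) tuple -- an illustrative layout,
    not the exported format."""
    per_validator = []
    for validator in counts:
        per_class = [class_metrics(*validator[c]) for c in validator]
        per_validator.append(
            {k: np.mean([m[k] for m in per_class]) for k in per_class[0]})
    return {k: np.mean([pv[k] for pv in per_validator])
            for k in per_validator[0]}
```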

Table 2 External validation of the AI model (“AI vs human”) and comparison of classification results between experts

The Shapiro–Wilk test was used to assess whether the datasets from the WSI analysis and the ROI were normally distributed. The Spearman rank correlation test was used to assess the correlation between the classification results from the WSI analysis and the ROI.
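The analyses were performed in R; purely as an illustration, the equivalent tests can be run in Python with scipy on hypothetical data (the values below are simulated, not the study's measurements):

```python
import numpy as np
from scipy.stats import shapiro, spearmanr

rng = np.random.default_rng(0)
# Hypothetical per-slide class fractions (e.g., % blasts) from WSI
# analysis and from the matched regions of interest:
wsi_fraction = rng.uniform(0.5, 5.0, size=16)   # 16 slides, as in the study
roi_fraction = wsi_fraction + rng.normal(0, 0.3, size=16)

# Shapiro-Wilk: test each dataset for normality
print(shapiro(wsi_fraction))   # ShapiroResult(statistic=..., pvalue=...)
print(shapiro(roi_fraction))

# Spearman rank correlation between WSI and ROI classification results
rho, p = spearmanr(wsi_fraction, roi_fraction)
print(f"Spearman rho = {rho:.3f}, p = {p:.4g}")
```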
