In this section, we present an analysis of the manual annotations obtained for the GoldHamster corpus and the results of our experiments on the automatic prediction of the labels.
Corpus analysis

After the first round of annotation, in which two annotators screened each of the articles, we obtained 7,737 annotations (Footnote 18): 1,970 for “in vivo”, 1,397 for “human”, 1,171 for “invertebrates”, 892 for “others”, 740 for “organs”, 663 for “in silico”, 455 for “primary cell lines”, and 449 for “immortal cell lines”. These values include duplicates, i.e., when both annotators assigned the same label to an abstract, it is counted twice. In addition, in 190 cases, an annotator did not assign any label to an abstract. Of the 1,600 abstracts, 899 had full agreement, i.e., exactly the same set of labels was assigned by both annotators. The remaining 701 abstracts without full agreement were reviewed in the additional rounds (r2.1, r2.2, r2.3, and r2.4).
Table 2 Statistics of the corpus in terms of the number of abstracts per label. We present statistics for all labels (cf. 3.1) and all rounds of annotation (cf. 3.3), both when considering all annotations (All, left side) and when considering only those for which the two annotators agree (Agree, right side). The comparison for each round in terms of equality (=), increase (\(\bigtriangleup\)), or decrease (\(\bigtriangledown\)) is with respect to the previous column (round), i.e., after the addition of the corresponding round. The values do not include duplicates, i.e., a label is counted only once when there is an agreement for it

We summarize the number of abstracts per label obtained after each of the annotation rounds in Table 2. We show the impact of each additional round as compared to the previous column. Furthermore, we present results when considering all annotations (“All”) and when considering only those for which there was an agreement between the two annotators (“Agree”). As expected, when full agreement is required, the number of articles with any annotation drops from 1,600 to 1,168 for the first round. While only 899 abstracts had full agreement on all labels, 1,168 abstracts had an agreement for at least one of the labels. Further, the number of abstracts with agreement increases after the annotation of the additional rounds, namely to 1,189 (after r2.1), 1,257 (after r2.2), 1,289 (after r2.3), and 1,436 (after r2.4). For all rounds, the number of documents without any annotation remained equal to 25.
In the four additional rounds, the selected annotators received the anonymized annotations from the two annotators of the first round and had to decide which annotation was correct. If no first-round annotation was judged correct, the respective abstract was removed from the corpus. We removed the following numbers of abstracts in the additional rounds: 11 (of 71) in r2.1, 20 (of 114) in r2.2, 23 (of 148) in r2.3, and 77 (of 368) in r2.4. We present the list of the corresponding PMIDs in the supplementary material (Section 5).
Table 3 Agreement in terms of Cohen’s \(\kappa\) between annotators for the first round, for each label (cf. 3.1), and for the overall corpus. An agreement is moderate if \(\kappa\) is higher than 0.6, and strong if higher than 0.8 (cf. [29])

In Table 3 we show the level of agreement between the annotators with respect to the individual labels in terms of the kappa coefficient (\(\kappa\)) [28]. For the first round, we obtained an almost perfect agreement for “invertebrates” (0.82), substantial agreement for “in vivo” (0.78), “in silico” (0.72), “human” (0.63), and “organs” (0.62), and moderate agreement for “immortal cell lines” (0.49), “primary cell lines” (0.45), and “others” (0.42). Two of the lowest agreements were for the “primary cell lines” and “immortal cell lines” labels, which are indeed difficult to distinguish (if not using Cellosaurus).
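Per-label agreement values like those in Table 3 can be computed by treating each label as a binary decision per abstract. The following is a minimal sketch using scikit-learn's cohen_kappa_score; the input format (two dictionaries mapping a PMID to the set of labels assigned by each annotator) and all variable names are illustrative assumptions, not necessarily the format used for the corpus.

```python
from sklearn.metrics import cohen_kappa_score

LABELS = ["in vivo", "human", "invertebrates", "others", "organs",
          "in silico", "primary cell lines", "immortal cell lines"]

def per_label_kappa(ann_a, ann_b):
    """Cohen's kappa per label for two annotators.

    ann_a, ann_b: dicts mapping a PMID to the set of labels assigned
    by annotator A and annotator B, respectively (illustrative format).
    """
    pmids = sorted(ann_a)
    kappas = {}
    for label in LABELS:
        y_a = [int(label in ann_a[pmid]) for pmid in pmids]
        y_b = [int(label in ann_b[pmid]) for pmid in pmids]
        kappas[label] = cohen_kappa_score(y_a, y_b)
    return kappas
```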
Corpus evaluation

We ran various experiments to evaluate the corpus for predicting the labels. All results are reported in terms of the standard metrics of precision, recall, and f-score. In order to perform a 10-fold cross-validation, we split the collection of abstracts into 10 parts in a stratified way, i.e., such that each part has a distribution of labels similar to the one in the complete corpus (cf. corpus statistics in Section 6 of the supplementary material).
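Because each abstract may carry several labels, a plain stratified k-fold is not directly applicable; one possible implementation is iterative stratification. The sketch below uses the iterative-stratification package (MultilabelStratifiedKFold), which is an assumption, as the text does not state which tool was used for the split.

```python
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold  # pip install iterative-stratification

def stratified_folds(texts, label_matrix, n_splits=10, seed=42):
    """Yield (train, test) index pairs such that the label distribution
    in each fold is similar to the one of the whole corpus.

    texts: list or array of abstract texts (illustrative).
    label_matrix: binary indicator matrix of shape (n_abstracts, n_labels).
    """
    mskf = MultilabelStratifiedKFold(n_splits=n_splits, shuffle=True,
                                     random_state=seed)
    for train_idx, test_idx in mskf.split(texts, label_matrix):
        yield train_idx, test_idx
```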
The hyperparameters that we considered were decided based on an evaluation with a preliminary version of our corpus (after round 2.3) and with BioBERT. The sets of values that we considered were the following: learning rates of \(1 \times 10^, 5 \times 10^, 1 \times 10^\), and \(5 \times 10^\); batch sizes of 16 and 32; and 10, 20, 30, 40, and 50 epochs. We show the results for the 40 experiments that we ran with all combinations of these hyperparameters in the supplementary material (Section 7). We made the final choice of hyperparameters based on this analysis and on our constraints in terms of time and memory. For all further experiments, we used the following hyperparameters: maximum length of 256, learning rate of \(1 \times 10^\), batch size of 32, and 10 epochs.
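For completeness, the grid of 40 combinations can be enumerated as follows. Note that the learning-rate exponents are not rendered in the text above, so the values in the sketch are placeholders only and should not be taken as the ones actually evaluated.

```python
from itertools import product

# Placeholder learning rates: the exponents of the values listed above are
# not recoverable here, so these four numbers are purely illustrative.
LEARNING_RATES = [1e-4, 5e-5, 1e-5, 5e-6]
BATCH_SIZES = [16, 32]
NUM_EPOCHS = [10, 20, 30, 40, 50]

grid = list(product(LEARNING_RATES, BATCH_SIZES, NUM_EPOCHS))
assert len(grid) == 40  # 4 learning rates x 2 batch sizes x 5 epoch settings

for learning_rate, batch_size, epochs in grid:
    # A hypothetical run_finetuning(...) call would fine-tune BioBERT with a
    # maximum sequence length of 256 and the current hyperparameter combination.
    pass
```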
We compared the various language models based on our best run in the 10-fold cross-validation. Table 4 summarizes the results. Further, we compared our best performing model to SVM classifiers with various kernel functions (cf. Section 8 in the supplementary material).
Table 4 Performance (in f-score) for the various pre-trained language models

On average, PubMedBERT performed slightly better than the other language models. It provided scores of at least 0.75 for all labels, while every other model scored below 0.70 for one or more labels. Indeed, PubMedBERT has obtained very good performance on other text classification tasks, as listed in the BLURB leaderboard (Footnote 19).
Evaluation with the addition of MeSH terms

For the best performing model that we obtained with PubMedBERT (cf. Table 4), we ran experiments for all the proposed threshold values (cf. Section “MeSH terms”). Table 5 summarizes the results. Only one threshold, namely the value of 10, reached a performance as high as the experiment without MeSH terms (overall f-score of 0.83). However, for every individual label, there was always a threshold value that outperformed the experiment without the MeSH terms.
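In code, this variant amounts to keeping only the MeSH terms above the frequency threshold and appending them to the classifier input. The sketch below assumes that each article comes with its list of MeSH descriptors and that the kept terms are simply concatenated to the title and abstract; the concatenation format is an assumption, only the frequency thresholds are specified in the text.

```python
from collections import Counter

def mesh_frequencies(articles):
    """Count, for each MeSH term, the number of articles in which it appears.

    articles: iterable of dicts with a 'mesh_terms' list (illustrative format).
    """
    freq = Counter()
    for article in articles:
        freq.update(set(article["mesh_terms"]))
    return freq

def build_input(title, abstract, mesh_terms, term_freq, threshold):
    """Append the MeSH terms that appear in at least `threshold` articles."""
    kept = [term for term in mesh_terms if term_freq[term] >= threshold]
    return " ".join([title, abstract] + kept)
```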
Table 5 Prediction of the labels with or without (w/o) adding the MeSH terms. We considered various thresholds for the minimum frequency of the terms, i.e., the number of articles in which they appear

Evaluation with the addition of discourse elements

For the best performing model that we obtained with PubMedBERT (cf. Table 4), we ran experiments with each of the sections as well as some of their combinations. Table 6 summarizes the results. None of these experiments outperformed the one that considered the titles and abstracts. Some sections achieved a better performance than others, namely “Background”, “Methods”, and “Conclusions”. Therefore, we ran further experiments only for combinations of these three sections.
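Selecting discourse elements reduces, in code, to building the classifier input from the chosen sections only. The sketch below assumes that structured abstracts are available as a mapping from section label to text, and that the tested combinations are the pairs and the triple of the three best performing sections; the exact combinations evaluated are the ones reported in Table 6.

```python
# Individual sections plus the pairs and the triple of the three best
# performing ones (Background, Methods, Conclusions); see Table 6.
SECTION_SETS = [
    ("BACKGROUND",), ("OBJECTIVE",), ("METHODS",), ("RESULTS",), ("CONCLUSIONS",),
    ("BACKGROUND", "METHODS"), ("BACKGROUND", "CONCLUSIONS"),
    ("METHODS", "CONCLUSIONS"), ("BACKGROUND", "METHODS", "CONCLUSIONS"),
]

def input_from_sections(structured_abstract, sections):
    """Build the classifier input from the selected sections only.

    structured_abstract: dict mapping a section label to its text
    (illustrative format); sections: tuple of labels to keep.
    """
    parts = [structured_abstract.get(section, "") for section in sections]
    return " ".join(part for part in parts if part)
```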
Table 6 Performance (in f-score) when using individual and combined sections for the prediction of the abstract labels. We show results from considering each section separately, i.e., Background (B), Objective (O), Methods (M), Results (R), and Conclusions (C), as well as combinations of the three best performing ones

Comparison to available NER tools

We compared our results to two available tools for NER and entity normalization, namely BERN2 [18] and PubTator [17]. Both tools extract a variety of entity types, but we considered only predictions for species and cell lines, i.e., the types “species” and “cell_line” in BERN2, and “Species” and “CellLine” in PubTator. The remaining entity types, e.g., DNA, drugs, diseases, and mutations, were not relevant for our schema. Both tools provide identifiers for the selected types, namely from the NCBI Taxonomy [30] for species and from Cellosaurus [22] for cell lines.
We used the same mapping strategy for both tools. The “human” label was assigned if any prediction linked to H. sapiens (NCBITaxon:9606) was found. Abstracts with predictions for either D. melanogaster (NCBITaxon:7227) or C. elegans (NCBITaxon:6239) were assigned the “invertebrates” label. For all remaining species, we assigned the abstracts to “in vivo”. Unfortunately, not all predictions for cell lines are mapped to an identifier in Cellosaurus. For those with an available identifier, we retrieved the corresponding species from Cellosaurus. Analogously, abstracts with cell lines from H. sapiens were assigned the “human” label, and cell lines from D. melanogaster the “invertebrates” label. For all remaining species, i.e., cell lines from other vertebrates, we assigned the abstracts to the “immortal cell lines” label. We present results for the prediction of four of our labels in Table 7.
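This mapping can be expressed compactly. The sketch below assumes that the output of either tool has already been parsed into (entity type, identifier) pairs per abstract, which abstracts away the different output formats of BERN2 and PubTator; the mapping rules themselves follow the description above.

```python
HUMAN = "NCBITaxon:9606"
FLY = "NCBITaxon:7227"
INVERTEBRATE_SPECIES = {"NCBITaxon:7227", "NCBITaxon:6239"}  # D. melanogaster, C. elegans

def labels_from_ner(entities, cellosaurus_species):
    """Map normalized NER predictions to GoldHamster labels.

    entities: iterable of (entity_type, identifier) pairs with entity_type
    in {"species", "cell_line"}; identifier may be None when the tool could
    not normalize the mention (e.g., "CUI-less" in BERN2).
    cellosaurus_species: dict mapping a Cellosaurus ID to the NCBI Taxonomy
    ID of the cell line's species (illustrative lookup table).
    """
    labels = set()
    for entity_type, identifier in entities:
        if identifier is None:
            continue
        if entity_type == "species":
            if identifier == HUMAN:
                labels.add("human")
            elif identifier in INVERTEBRATE_SPECIES:
                labels.add("invertebrates")
            else:
                labels.add("in vivo")
        elif entity_type == "cell_line":
            species = cellosaurus_species.get(identifier)
            if species == HUMAN:
                labels.add("human")
            elif species == FLY:
                labels.add("invertebrates")
            elif species is not None:
                labels.add("immortal cell lines")  # cell lines from other vertebrates
    return labels
```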
Table 7 Prediction of the labels based on the named-entity recognition provided by BERN2 and PubTator. The scores are an average over the 10-fold cross-validation

For all labels, the scores were lower than the ones we obtained with PubMedBERT. Both NER tools obtained similar results for the “in vivo” and “human” labels. However, BERN2 got some true positives for “immortal cell lines” (as opposed to none from PubTator), while PubTator scored much higher for “invertebrates”. When comparing the number of mentions predicted by each tool, BERN2 predicted, on average, a total of 127 cell lines for the test set, as opposed to fewer than three from PubTator. This probably explains why BERN2 performed better for this label. Indeed, for this label BERN2 scored 1.0 for recall, while its precision was low, around 0.15 (results not shown).
For the prediction of species, both tools returned approximately the same number of mentions. We checked a couple of abstracts to investigate the sources of error, namely PMIDs 26845534 and 28916802. BERN2 correctly detected the “Drosophila” mention and assigned the correct type (species). However, no identifier was associated with it, only the text “CUI-less”, thus preventing a mapping to the “invertebrates” label.
We also evaluated the two NER tools when considering only annotations that occurred in particular discourse elements (cf. Section 9 in the supplementary material). We observed only a very small improvement when considering only the background section, namely for the “invertebrates” and “immortal cell lines” labels with BERN2 and for the “human” label with PubTator.
Sentence-level prediction

We annotated the corpus at the level of entity spans, since this was required when working with the TeamTat tool. However, we did not give the annotators detailed instructions on how to delimit the text spans. Further, we did not ask them to annotate all mentions of a particular method, e.g., every occurrence of a species name. Finally, the annotators did not highlight mentions of other methods, i.e., methods that were not the experimental method proposed in the study. Therefore, our corpus cannot be directly compared to corpora built for NER tasks. Nevertheless, we experimented with using the annotations for training a sentence-level model.
The training data consisted of sentences and the labels of the annotations they contain. We used the same code as for training BioBERT at the document level; the only parameter changes were a maximum length of 128 (instead of 256), a learning rate of \(5 \times 10^\) (instead of \(1 \times 10^\)), and a batch size of 16 (instead of 32). We did not use a sequence model, as would be usual for NER tasks, i.e., the prediction made for the previous sentence does not influence the prediction for the current sentence. Table 8 summarizes the results. In general, the performance was much higher for the document-level prediction. Curiously, the sentence-level prediction scored better for the “in silico” label.
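The construction of the sentence-level training data can be sketched as follows, assuming the span annotations carry character offsets over the abstract text and assigning to each sentence the labels of all spans that overlap it. The sentence splitter (NLTK's pre-trained Punkt model) is an assumption, as the text does not state which tool was used.

```python
import nltk  # requires: nltk.download("punkt")

def sentence_examples(text, span_annotations):
    """Build (sentence, label set) training pairs from span annotations.

    span_annotations: list of (start, end, label) character-offset triples
    over `text` (illustrative format). Sentences without any overlapping
    annotated span receive an empty label set.
    """
    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
    examples = []
    for sent_start, sent_end in tokenizer.span_tokenize(text):
        labels = {label for (start, end, label) in span_annotations
                  if start < sent_end and end > sent_start}
        examples.append((text[sent_start:sent_end], labels))
    return examples
```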
Table 8 Performance (in f-score) for the prediction of the labels on document and sentence level