Our main experiment is a comparative evaluation using knowledge graphs with different levels of semantic richness, resulting from the use of one or more ontologies, logical definitions (LDs), and mappings. Additional experiments consisted of ablation studies that removed either gene annotations to HP, or gene and disease annotations from specific branches of GO and HP.
Vector combination approaches for embeddings
One of the challenges in achieving a rich semantic representation of genes and diseases when using knowledge graph embeddings is to define a suitable approach to combine the gene and disease vectors.
Initial experiments used a stratified split (70% training, 30% testing) to compare the five chosen vector operations by AUC-ROC. They used the three best knowledge graph embedding methods (RDF2Vec, OPA2Vec, and DistMult) coupled with a Random Forest classifier (one of the best-performing machine learning algorithms) on the richest knowledge graph (HP-simple + LD + GO). The results are summarized in Fig. 5.
Fig. 5 ROC curves and AUC values obtained for different vector operators with the RF classifier for the HP-simple + LD + GO knowledge graph
The Hadamard operator outperforms other operators when using RDF2Vec, OPA2Vec, and TransE, whereas Concatenation works best with OWL2Vec* and DistMult. Overall, Hadamard and Concatenation are the top two performing combination approaches, with Hadamard achieving the best prediction results when combined with OPA2Vec and Random Forest or XGB. While Hadamard, Average, Weighted-L1, and Weighted-L2 all produce vectors of the same size (200), Concatenation produces double-sized vectors (400). This impacts the training time of the machine learning algorithms. Going forward, all reported experiments employ the Hadamard operator.
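The five operators can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the Weighted-L1 and Weighted-L2 definitions are assumed to follow the element-wise absolute and squared differences commonly used for edge features in node2vec-style work.

```python
from typing import Callable, Dict, List

Vector = List[float]


def hadamard(g: Vector, d: Vector) -> Vector:
    # Element-wise product.
    return [a * b for a, b in zip(g, d)]


def average(g: Vector, d: Vector) -> Vector:
    # Element-wise mean.
    return [(a + b) / 2 for a, b in zip(g, d)]


def weighted_l1(g: Vector, d: Vector) -> Vector:
    # Element-wise absolute difference (assumed definition).
    return [abs(a - b) for a, b in zip(g, d)]


def weighted_l2(g: Vector, d: Vector) -> Vector:
    # Element-wise squared difference (assumed definition).
    return [(a - b) ** 2 for a, b in zip(g, d)]


def concatenation(g: Vector, d: Vector) -> Vector:
    # Gene vector followed by disease vector: the output is twice as long.
    return list(g) + list(d)


OPERATORS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "hadamard": hadamard,
    "average": average,
    "weighted_l1": weighted_l1,
    "weighted_l2": weighted_l2,
    "concatenation": concatenation,
}


def combine(gene_vec: Vector, disease_vec: Vector, operator: str = "hadamard") -> Vector:
    """Build the feature vector of a gene-disease pair from its two embeddings."""
    if len(gene_vec) != len(disease_vec):
        raise ValueError("gene and disease vectors must have the same length")
    return OPERATORS[operator](gene_vec, disease_vec)
```

With 200-dimensional embeddings, `combine(..., "concatenation")` returns 400 values while the other four operators return 200, which is the size difference that affects classifier training time.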
Impact of semantic richness of the knowledge graphs
Table 3 illustrates the impact of employing knowledge graphs with varying degrees of semantic richness with different embedding methods (see Footnote 3).
Table 3 Median WAF scores for the combinations of knowledge graph embeddings with Cosine similarity, RF or XGB for the different knowledge graphs using the Hadamard operator. Best result for each knowledge graph embeddings method and machine learning algorithm or CS is in bold. Results that are statistically significantly different when compared to HPf are underlined
Predictions made with machine learning algorithms achieve better results than cosine similarity. This is unsurprising, since reducing the representation of a gene-disease association to a single similarity score may be too limiting. A model learned on multi-dimensional representations is much better at capturing the complexity of the associations.
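The contrast between the two prediction strategies can be made concrete with a small sketch. The cosine baseline collapses a gene-disease pair into one number, whereas a classifier receives one feature per embedding dimension. The function names and the decision threshold below are illustrative assumptions; the source does not state how the similarity score is turned into a prediction.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_by_similarity(gene_vec: List[float], disease_vec: List[float],
                          threshold: float = 0.5) -> int:
    # Hypothetical decision rule: the whole pair is summarized by one scalar.
    return 1 if cosine_similarity(gene_vec, disease_vec) >= threshold else 0
```

A classifier trained on the combined vector instead sees every dimension separately, so it can weight some dimensions more than others, which a single similarity score cannot do.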
We can also observe performance differences between knowledge graph embeddings methods. OPA2Vec achieves the best results, with a maximum performance of 0.775 in WAF, followed by RDF2Vec with 0.753. DistMult and OWL2Vec* lag behind with 0.734 and 0.715, respectively.
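For readers unfamiliar with the metric, the sketch below computes a WAF score, assuming WAF denotes the weighted average F-measure, that is, the per-class F-measures averaged with weights proportional to class support. The `median_waf` helper and its `runs` argument are hypothetical and only illustrate how a median over repeated evaluations could be taken.

```python
from collections import Counter
from statistics import median
from typing import List, Sequence, Tuple


def weighted_average_f(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F-measure, averaged with weights equal to class support."""
    support = Counter(y_true)
    total = len(y_true)
    waf = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        waf += support[label] / total * f_measure
    return waf


def median_waf(runs: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    # `runs` holds one (y_true, y_pred) pair per repeated evaluation.
    return median(weighted_average_f(t, p) for t, p in runs)
```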
Multiple factors can explain the better performance of OPA2Vec: it uses a reasoner to exploit both asserted and inferred logical axioms in ontologies, and it combines them with vector representations of the lexical component of the ontologies, learned over PubMed abstracts using the word2vec model. A clear difference between OPA2Vec and RDF2Vec is the use of rich OWL axioms and word embeddings, which may explain the observed differences. Biomedical ontologies are rich in synonyms, and exploring their similarities in the context of scientific literature can be immensely informative. In other words, this algorithm shows better results because it is better tailored to the specifics of bio-ontologies. Path-based methods appear to perform better than DistMult, TransE, and HAKE; however, OWL2Vec* presents worse results than RDF2Vec and OPA2Vec. OWL2Vec* is based on a deeper exploration of OWL axioms, which counterintuitively does not improve performance, possibly because it introduces noise into the representations. All embedding methods employed receive literals and deal with them differently.
Curiously, knowledge graph embeddings methods show different behaviours depending on the knowledge graph they are applied to. For RDF2Vec, performance is significantly improved over the baseline HPf when using HPf+GO, but this is not the case for the other knowledge graph embeddings methods. A possible reason is that when a knowledge graph with richer semantics is processed by methods that can explore them, the resulting entity vectors capture many different aspects that may not be relevant for gene-disease association prediction. Another reason could be the distance in the graph between the HP class declaration and the related GO class. Logical definitions can be quite complex and include many different entities from different ontologies as well as semantic constructs (Fig. 3). In triple-oriented methods, such as OPA2Vec and DistMult, the relation between the HP class and the GO class is not directly encoded at the triple level, and it needs to be learned by jointly training on all triples. In random-walk-based methods, such as RDF2Vec, paths linking both classes can be found, making the relation more explicit.
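The difference between the triple-level and the path-level view can be illustrated on a toy graph. The sketch below is a strong simplification: the identifiers are placeholders, and the chain of blank nodes only mimics the general shape of a logical definition, not the actual HP axioms.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Toy logical definition: the HP class reaches the GO class only through
# intermediate (blank) nodes. All identifiers are hypothetical.
TRIPLES: List[Triple] = [
    ("hp:ExamplePhenotype", "owl:equivalentClass", "_:definition"),
    ("_:definition", "owl:intersectionOf", "_:restriction"),
    ("_:restriction", "owl:someValuesFrom", "go:ExampleProcess"),
]


def directly_linked(triples: List[Triple], a: str, b: str) -> bool:
    """True if a single triple connects the two entities."""
    return any({s, o} == {a, b} for s, _, o in triples)


def walks_from(triples: List[Triple], start: str, max_hops: int) -> List[List[str]]:
    """Enumerate every walk of at most `max_hops` edges starting at `start`."""
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    walks: List[List[str]] = []

    def extend(path: List[str], node: str, hops: int) -> None:
        successors = adjacency.get(node, [])
        if hops == 0 or not successors:
            walks.append(path)
            return
        for p, o in successors:
            extend(path + [p, o], o, hops - 1)

    extend([start], start, max_hops)
    return walks
```

Here no single triple contains both classes, so a triple-oriented model has to infer the relation from several training examples, whereas a walk of three hops places both classes in the same sequence.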
To delve deeper into this issue, the logical definitions declared in the HP ontology were analyzed. A total of 3203 definitions were identified, but only around 10% of those (350) are related to the Gene Ontology. This motivated the creation of another knowledge graph, HPs+GO+LD, that addresses both challenges: it only includes logical definitions with GO (potentially removing noise), and it establishes direct links between HP and GO classes (making the relation more explicit in the graph). We also created two more variants, HPs+GO+Map and HPs+GO+LD+Map, where mappings between HP and GO found through ontology matching are added to the knowledge graph. When using the three HPs+GO variants, both OPA2Vec and OWL2Vec* show significant improvements over the baseline, but DistMult performance is never significantly improved over the baseline regardless of the knowledge graph employed.
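The two ideas behind HPs+GO+LD, filtering the logical definitions and linking HP classes directly to GO classes, can be sketched as follows. This is only an illustration under stated assumptions: the input format (a mapping from each HP class to the entities its definition mentions), the `GO:` prefix test, and the `ex:relatedTo` predicate are all hypothetical and not taken from the source.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def simplify_definitions(
    logical_definitions: Dict[str, List[str]],
    target_prefix: str = "GO:",
    predicate: str = "ex:relatedTo",  # hypothetical linking predicate
) -> Tuple[Dict[str, List[str]], List[Triple]]:
    """Keep definitions that mention a target-ontology class and link them directly."""
    # Step 1: drop definitions that do not reference the target ontology.
    kept = {
        hp_class: entities
        for hp_class, entities in logical_definitions.items()
        if any(e.startswith(target_prefix) for e in entities)
    }
    # Step 2: emit one direct triple per (HP class, target class) pair.
    links = [
        (hp_class, predicate, entity)
        for hp_class, entities in kept.items()
        for entity in entities
        if entity.startswith(target_prefix)
    ]
    return kept, links
```

For example, a definition of `HP:A` that mentions `GO:X` and `OTHER:Y` is kept and yields the single triple `("HP:A", "ex:relatedTo", "GO:X")`, while a definition that mentions only `OTHER:Z` is dropped.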
To better understand the impact of semantic richness, we compared precision and recall values for the five knowledge graphs using OPA2Vec and RDF2Vec embeddings combined with the Hadamard operator and a Random Forest model (Fig. 6). In general, for both OPA2Vec and RDF2Vec, performance increases with semantic richness, and HPf is the knowledge graph with the lowest performance for both methods. For both methods, the greatest recall gains are seen with HPf+GO, at the cost of some precision. Precision is improved overall when using the HPs+GO variants, with a greater effect for RDF2Vec.
Fig. 6 Recall-Precision diagram including f-measure values as height-lines. The diagram uses all knowledge graphs for OPA2Vec and RDF2Vec with RF using a 70-30 split
Overall, RDF2Vec, OPA2Vec, and OWL2Vec* are all able to produce richer semantic representations when given richer knowledge graphs, which in turn improves the prediction of gene-disease associations.
Ablation studies
We performed two types of ablation studies to study the impact that a richer ontological layer can have on missing data: (1) removal of the gene annotations using HP; (2) removal of gene and disease annotations of specific branches of the ontologies.
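A branch ablation of the second type can be sketched as a filter over annotation lists. The data structures and function names below are assumptions made for illustration: `children` maps each ontology class to its direct subclasses, and `annotations` maps each gene or disease to the classes it is annotated with.

```python
from typing import Dict, List, Set


def descendants(children: Dict[str, List[str]], root: str) -> Set[str]:
    """All classes in the branch rooted at `root`, including the root itself."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return seen


def remove_branch(annotations: Dict[str, List[str]],
                  children: Dict[str, List[str]],
                  branch_root: str) -> Dict[str, List[str]]:
    """Drop every annotation to a class inside the given branch."""
    branch = descendants(children, branch_root)
    return {entity: [c for c in classes if c not in branch]
            for entity, classes in annotations.items()}


def keep_only_branch(annotations: Dict[str, List[str]],
                     children: Dict[str, List[str]],
                     branch_root: str) -> Dict[str, List[str]]:
    """Keep only the annotations to classes inside the given branch."""
    branch = descendants(children, branch_root)
    return {entity: [c for c in classes if c in branch]
            for entity, classes in annotations.items()}
```

Entities whose annotations all fall in the removed branch end up with an empty list, which is the missing-data situation the ablation is meant to probe.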
The predictive performance is considerably impacted by the removal of HP gene annotations (Table 4). However, OPA2Vec is still able to achieve WAF values above 0.7. This prediction scenario is perhaps the most realistic one, where the phenotypes caused by genes are not yet known, but disease phenotypes and gene functions are.
Table 4 Median WAF scores for the HP gene annotation ablation study. Best result for each knowledge graph embeddings approach and machine learning algorithm or CS is in bold
Table 5 presents the ontology branch annotations ablation studies, taking as a baseline HPs+GO+LD and using RDF2Vec and OPA2Vec with the Hadamard operator for RF and XGB.
Table 5 Median WAF scores for the ontology ablation studies. Comparison of the best knowledge graph embeddings methods RDF2Vec and OPA2Vec with Random Forest or XGB for the knowledge graph HPs+GO+LD. Results that are statistically significantly different when compared to HPs+GO+LD are underlined. Best results in bold
The GO ablation studies show that, in most cases, removing the annotations of a single branch, or considering only biological process (BP) annotations, has little to no impact on prediction. The exception is the removal of cellular component (CC) annotations, which positively impacts predictions made by XGB coupled with RDF2Vec. It appears that the removal of any branch of GO is at least partly compensated by the inclusion of logical definitions.
The HP ablation studies show that removing the annotations of any branch significantly lowers performance, with the removal of phenotypic abnormality annotations producing the largest decrease. When considering only phenotypic abnormality annotations, performance is less affected. This indicates that HP annotations of any branch are essential for the prediction and cannot be compensated by logical definitions.
Scalability study
As knowledge graphs grow larger and more complex, ensuring that knowledge graph embedding methods can handle them efficiently becomes increasingly important. We investigate the scalability of the knowledge graph embedding methods by analyzing their runtime when applied to differently sized knowledge graphs.
Figure 7 shows the computational time of the best embedding methods on two knowledge graphs, where the smaller one is obtained by removing the main branch of the Human Phenotype Ontology (Phenotypic abnormality). RDF2Vec and OPA2Vec are the fastest methods, while OWL2Vec* and DistMult are slower. We can also observe that computational time increases in proportion to the size of the knowledge graph.
Fig. 7 Computational time for each embedding method with two knowledge graphs where the smallest size corresponds to removing the main branch of the human phenotype ontology (Phenotypic abnormality)
When comparing different embedding methods, we must consider whether they rely on path-based strategies (random walks) or operate directly on triples. For OPA2Vec, TransE, and DistMult, embeddings were generated using triples. In contrast, RDF2Vec and OWL2Vec* used random walks to generate embeddings. Specifically, 500 random walks were generated for each knowledge graph for RDF2Vec and OWL2Vec*. Furthermore, the entities used for learning the embeddings varied among the different methods. RDF2Vec and OWL2Vec* only generated embeddings for the requested entities, whereas OPA2Vec, TransE, and DistMult generated embeddings for all entities in the knowledge graph.
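The cost difference between the two families partly follows from how much of the graph each one touches. The sketch below samples a bounded number of walks for the requested entities only. It is a simplified stand-in for the walk generation inside RDF2Vec-style tools, not their actual code. The names `random_walks`, `max_walks`, and `depth` are illustrative, and applying the walk limit per entity is an assumption.

```python
import random
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]

# Toy graph with hypothetical identifiers.
TOY_TRIPLES: List[Triple] = [
    ("gene:A", "ex:hasAnnotation", "hp:P1"),
    ("gene:A", "ex:hasAnnotation", "go:F1"),
    ("hp:P1", "rdfs:subClassOf", "hp:P0"),
    ("go:F1", "rdfs:subClassOf", "go:F0"),
    ("disease:D", "ex:hasAnnotation", "hp:P1"),
]


def random_walks(triples: Sequence[Triple], entities: Sequence[str],
                 max_walks: int = 500, depth: int = 4,
                 seed: int = 0) -> Dict[str, List[Tuple[str, ...]]]:
    """Sample up to `max_walks` distinct walks for each requested entity."""
    rng = random.Random(seed)
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    walks: Dict[str, List[Tuple[str, ...]]] = {}
    for entity in entities:
        sampled = set()
        for _walk in range(max_walks):
            walk = [entity]
            node = entity
            for _hop in range(depth):
                successors = adjacency.get(node)
                if not successors:
                    break  # dead end: stop this walk early
                predicate, node = rng.choice(successors)
                walk.extend([predicate, node])
            sampled.add(tuple(walk))
        walks[entity] = sorted(sampled)
    return walks
```

Calling `random_walks(TOY_TRIPLES, ["gene:A"])` produces sequences for `gene:A` alone, so the amount of work depends on the number of requested entities and the walk budget, whereas a triple-based model iterates over every triple and learns a vector for every entity.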