Our aim is to find a formal ontological representation of phenotypes of collections of cells (and other entities) that is as close as possible to the EQ formalism used in the phenotype ontologies yet avoids the problematic inferences we identified. To achieve this goal, we reuse an ontological theory of collections and collectives [15] which introduces different properties of collections and collectives. Here, we are primarily concerned with defining collections of entities (such as cells) that are either contained in or part of an organism.
We focus on biomedical applications for our formulation where we are interested in collections of entities that are part of a body. As such, members of a collection can change over time, and collections can be empty (such as in the case of absent T cells). Empty collections are important as they are used to signify disorders, such as those resulting from congenital abnormalities, where certain types of cells or types of chemicals cannot be produced and therefore these collections are empty.
We can define mereological relations between collections [34]. Of particular importance for us is how the subclass relation between member types relates to parthood between the respective collections. For example, every T cell is a kind of lymphocyte; consequently, not only is a collection of T cells a kind of collection of lymphocytes, but every collection of T cells is also a part of a collection of lymphocytes.
This may be trivially true due to the reflexivity of part-of as long as we do not restrict these collections further. However, we are not interested in merely defining collections of types of cells; there are many collections of T cells that are part of a human body. Instead, we are interested in the notion of a “maximal” collection of entities that are a part of a body, i.e., the collection of all entities of type X that are a part of a single (instance of a) body. We call this the maximal collection of X within a Y (where Y is a class representing an organism or the body of an organism). We can define this class in first order logic (where \(\leftrightarrow\) denotes the biconditional, read as “if and only if”, \(\wedge\) denotes conjunction, and \(\exists\) and \(\forall\) are the existential and universal quantifiers):
$$\begin{aligned} \text{X\_Collection}(x) \leftrightarrow (\exists y( Y(y) \wedge ( \forall a ( ( X(a) \wedge part\_of(a,y) ) \leftrightarrow has\_member(a,x))))) \end{aligned}$$
(12)
or, using temporalized parthood and membership relations (such as used in the Basic Formal Ontology, BFO [35]):
$$\begin{aligned} \text{X\_Collection}(x,t) \leftrightarrow (\exists y( Y(y) \wedge ( \forall a ( ( X(a) \wedge part\_of(a,y,t) ) \leftrightarrow has\_member(a,x,t))))) \end{aligned}$$
(13)
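The set-theoretic reading of Eqn. 12 can be illustrated with a minimal Python sketch. The individuals, type labels, body names, and the `maximal_collection` helper are hypothetical and serve only as an illustration; parthood between collections is approximated here by the subset relation between their member sets, which is an assumption of the sketch rather than part of the formal theory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    name: str
    types: frozenset  # classes the individual instantiates
    part_of: str      # name of the body the individual is a part of


def maximal_collection(individuals, cls, body):
    """Members of the maximal collection of `cls` within `body`.

    An individual a is a member if and only if X(a) and part_of(a, y)
    hold, mirroring the biconditional in Eqn. 12.
    """
    return frozenset(
        i for i in individuals if cls in i.types and i.part_of == body
    )


# Hypothetical toy data: two bodies and a handful of cells.
individuals = [
    Individual("t1", frozenset({"T cell", "lymphocyte"}), "mouse_1"),
    Individual("t2", frozenset({"T cell", "lymphocyte"}), "mouse_1"),
    Individual("b1", frozenset({"B cell", "lymphocyte"}), "mouse_1"),
    Individual("b2", frozenset({"B cell", "lymphocyte"}), "mouse_2"),
]

t_cells_m1 = maximal_collection(individuals, "T cell", "mouse_1")
lymph_m1 = maximal_collection(individuals, "lymphocyte", "mouse_1")
t_cells_m2 = maximal_collection(individuals, "T cell", "mouse_2")  # empty
```

In this toy setting, the T cell collection of `mouse_1` is a subset of its lymphocyte collection, and the T cell collection of `mouse_2` is empty, which corresponds to the case of absent T cells.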
We cannot equivalently represent these axioms in the Description Logics used to represent phenotype ontologies. However, we may be able to assume that the universe over which we quantify ranges only over entities that are a part of a single body, allowing us to omit the condition involving Y on the right-hand side of Eqn. 12. We then define \(\text{X\_Collection}\) as the collection of all the individuals of type X (where “all” ranges over parts of Y, e.g., the parts of a body):
$$\begin{aligned} \text{X\_Collection}(xc) \leftrightarrow (\forall a ( X(a) \leftrightarrow has\_member(a, xc))) \end{aligned}$$
(14)
While this is an axiom in first order logic, we are mainly interested in an implementation in a Description Logic such as the one underlying OWL 2 DL [36] so that our results are compatible with the MP and HP. In Description Logic, we assert two axioms for these collection classes containing X:
$$\begin{aligned} \text{X\_Collection} \sqsubseteq \forall has\_member. X \end{aligned}$$
(15)
$$\begin{aligned} X \sqsubseteq \exists member\_of. \text{X\_Collection} \end{aligned}$$
(16)
These axioms do not yet capture the intuition that an \(\text{X\_Collection}\) should be the collection of all X in the domain of discourse; we can further strengthen these axioms by asserting that there is only one such collection:
$$\begin{aligned} \text{X\_Collection} \equiv \{ \text{xc} \} \end{aligned}$$
(17)
Here, \(\text{xc}\) is a new individual name that is not used anywhere else, and \(\{\cdot\}\) is the Description Logic constructor for nominals (class descriptions defined by enumerating the class members). Because every instance of X will be a member of this collection (Eqn. 16), \(\text{X\_Collection}\) will approximate the notion of the maximal collection of Xs within a body.
Nevertheless, this is only a weak approximation of the first order logic axiom. In particular, we can infer from the first order logic axioms that, if X is a subclass of Y, then every \(X\_Collection\) is a part of some \(Y\_Collection\). In Description Logic, this is not inferred and we instead assert this consequence directly as a set of axioms: given an ontology O and its deductive closure \(O^\vdash\), for every pair X and Y such that \(X \sqsubseteq Y \in O^\vdash\), we assert \(\text{X\_Collection} \sqsubseteq \exists part\_of. \text{Y\_Collection}\).
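The axiom generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the deductive closure is approximated by the transitive closure of a few asserted subclass pairs (in practice it would be obtained from a Description Logic reasoner), reflexive pairs are left out, the class names are hypothetical, and the axioms are emitted as Manchester-syntax-like strings rather than added to an actual ontology.

```python
def subclass_closure(asserted):
    """Transitive closure of asserted (subclass, superclass) pairs.

    Stands in for the deductive closure of the ontology; a reasoner
    would be used for a real ontology.
    """
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def collection_axioms(asserted):
    """One part_of axiom between collection classes per inferred pair."""
    return sorted(
        f"{x}_Collection SubClassOf part_of some {y}_Collection"
        for (x, y) in subclass_closure(asserted)
    )


# Hypothetical fragment of a cell type hierarchy.
asserted = {("NK_T_cell", "T_cell"), ("T_cell", "lymphocyte")}
axioms = collection_axioms(asserted)
for axiom in axioms:
    print(axiom)
```

The inferred pair (NK_T_cell, lymphocyte) yields an axiom even though it was never asserted, which is why the closure, and not only the asserted hierarchy, is used.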
Representing phenotypes of collections
Our aim is to identify a set of axioms for representing quantitative phenotypes (phenotypes of collections) such that the inferences drawn from these axioms more accurately reflect the intended ones, while preserving interoperability with other axioms in phenotype ontologies that do not pertain to collections; consequently, we still have to follow the EQ formalism and the way it is implemented in phenotype ontologies.
Qualities of cells and qualities of collections
We will use the following terms to refine the formal characterization of cardinality phenotypes in phenotype ontologies:
X and Y are classes from an anatomy or cell type ontology, such as the class T cell or NK T cell;
\(\text{X\_Collection}\) and \(\text{Y\_Collection}\) are classes representing (maximal) collections where all the members of these collections are instances of X and Y, respectively;
amount is a quality defined in the PATO ontology, including the classes amount (PATO:0000070), increased amount (PATO:0000470), decreased amount (PATO:0001997), absent (PATO:0000462), and duplicated (PATO:0001473).
The current phenotype ontologies represent phenotypes of collections in the EQ formalism where the Entity E is a cell class and the quality Q is a phenotype class from PATO (Eqn. 1); the class from PATO will be a subclass of a quantitative quality in PATO, such as amount. We reformulate these phenotype classes using the collection classes we defined earlier. We define a class CP that represents a cardinality phenotype on a collection of cells, employing an EQ pattern where the entity is the collection of cells, \(\text{X\_Collection}\), as follows:
$$\begin{aligned} CP \sqsubseteq {}& \exists has\_part.( amount \sqcap (\exists characteristic\_of. \text{X\_Collection} ) \sqcap \\ & (\exists has\_modifier. abnormal)) \end{aligned}$$
(18)
Specifically, for a phenotype of the collection of T cells, we first define the class \(\text{T\_cell\_Collection}\) and then an Abnormality of T cell number as:
$$\begin{aligned} \text{Abnormality of T cell number} \sqsubseteq {}& \exists has\_part.( amount \sqcap \\ & (\exists characteristic\_of. \text{T\_cell\_Collection} )\sqcap (\exists has\_modifier. abnormal)) \end{aligned}$$
(19)
Another type of cardinality abnormality is the absence of a certain entity X. These absence phenotypes are currently formulated using the same EQ pattern, with Q being the class absent (PATO:0000462), which leads to the consequence that an absence of NK T cells is a subclass of an absence of T cells. We can use the notion of the empty collection to formulate absence:
$$\begin{aligned} absent\_X \equiv {}& \exists has\_part.(quality \sqcap \exists characteristic\_of. \\ & (X\_Collection \sqcap \forall has\_member. \bot ) \sqcap (\exists has\_modifier.abnormal)) \end{aligned}$$
(20)
Here, \(\bot\) represents the bottom concept interpreted as an empty set. While we can use this notion of an empty collection, we still have to establish a relation between the empty collection of X and a body not having any instance of X as part; this would be possible in first order logic but not easy in Description Logic. Consequently, we also use the following formulation to relate absence to the parthood relation (where \(\lnot\) represents negation):
$$\begin{aligned} absent\_X \equiv \lnot \exists has\_part. (quality \sqcap \exists characteristic\_of. X) \end{aligned}$$
(21)
By defining \(absent\_X\) twice we also make the right-hand sides of the definitions equivalent and thereby can infer that having a quality of an empty collection of X is equivalent to not having a quality of X, i.e., we ensure equivalence between the two distinct formulations of absence.
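Written out, the equivalence that follows from the two definitions of \(absent\_X\) (Eqns. 20 and 21) is obtained by equating their right-hand sides; it is restated here only as a worked consequence of those two axioms:

$$\begin{aligned} & \exists has\_part.(quality \sqcap \exists characteristic\_of. (X\_Collection \sqcap \forall has\_member. \bot ) \sqcap (\exists has\_modifier.abnormal)) \\ & \quad \equiv \lnot \exists has\_part. (quality \sqcap \exists characteristic\_of. X) \end{aligned}$$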
We further define grouping classes, based on collections and based on qualities. For instance, any cardinality abnormality, whether it is a decrease or an increase in the number of T cells, can be classified as a cardinality abnormality of the collection of T cells. Therefore, we create the class CXP to group all the abnormalities of a certain collection \(X\_Collection\), defined as follows:
$$\begin{aligned} CXP \sqsubseteq {}& \exists has\_part. (quality \sqcap \exists characteristic\_of. X\_Collection\sqcap \\ & \exists has\_modifier. abnormal) \end{aligned}$$
(22)
Another way to classify cardinality phenotypes is to group them based on qualities. For instance, we create a class that groups all the “increased cardinality” phenotypes. To this end, we create the class CQ to group abnormalities whose quality Q is a kind of amount, for any collection \(X\_Collection\), using the root collection class C. CQ is defined as follows:
$$\begin{aligned} CQ \sqsubseteq \exists has\_part. (Q \sqcap \exists characteristic\_of.C \sqcap \exists has\_modifier. abnormal) \end{aligned}$$
(23)
Figure 3 illustrates the use of these grouping classes.
Fig. 3 Illustration of grouping classes. The green class is an example of a quality-based grouping class, decreased cardinality. This class will be inferred to be the superclass of every abnormality of a decreased cardinality of any collection of cells, including decreased cardinality of B cells, decreased cardinality of T cells, decreased cardinality of lymphocytes, etc. The blue classes are examples of grouping based on the entities collection of T cell, collection of B cell, and collection of lymphocytes. For instance, the class abnormality of collection of T cell will be inferred to be the superclass of any abnormality of collection of T cells, including decreased cardinality of T cells and absent T cells.
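The two grouping strategies can be mimicked with a small Python sketch. It is only an illustration: in the ontology, the groupings are inferred by a reasoner from the axioms for CXP and CQ, whereas here the phenotype labels and their (quality, member class) pairs are hypothetical and the grouping is done with plain dictionaries. The sketch also ignores parthood between collections (e.g., between the collections of T cells and of lymphocytes).

```python
from collections import defaultdict

# Hypothetical phenotype labels mapped to (quality, member class of the collection).
phenotypes = {
    "decreased cardinality of T cells": ("decreased cardinality", "T cell"),
    "decreased cardinality of B cells": ("decreased cardinality", "B cell"),
    "decreased cardinality of lymphocytes": ("decreased cardinality", "lymphocyte"),
    "absent T cells": ("absent", "T cell"),
}


def grouping_classes(phenotypes):
    """Group phenotype labels by quality and by collection."""
    by_quality = defaultdict(set)
    by_collection = defaultdict(set)
    for label, (quality, member_class) in phenotypes.items():
        # Quality-based grouping, analogous to CQ.
        by_quality[quality].add(label)
        # Collection-based grouping, analogous to CXP.
        by_collection[f"abnormality of collection of {member_class}"].add(label)
    return dict(by_quality), dict(by_collection)


by_quality, by_collection = grouping_classes(phenotypes)
```

Here, `by_quality["decreased cardinality"]` plays the role of the green class in Fig. 3, and `by_collection["abnormality of collection of T cell"]` plays the role of one of the blue classes.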
A revised hierarchy of cardinality phenotypes improves prediction of genes associated with rare disease
We quantitatively evaluate the reclassified phenotype ontologies that result from our new formulation of cardinality phenotypes. The approach we use follows a task-based evaluation [37, 38]. In a task-based evaluation, we apply different variants of an ontology and evaluate their performance with respect to a specific task. We utilize an ontology-based phenotypic similarity measure to predict the association between genes and diseases. For this experiment, we utilized a dataset from the Mouse Genome Informatics (MGI) database [21] which includes associations between human genes and Mendelian diseases as reported in the OMIM database. Using phenotypes associated with mouse orthologs of human genes (from MGI) and human disease phenotypes from the HP database [7], we calculate the degree of similarity between their phenotypes, rank genes for each disease, and determine whether we can identify the correct disease-associated gene at a certain rank; we quantify the performance using the area under the receiver operating characteristic (ROC) curve [39], similar to other studies [3, 4]. To directly compare human and mouse phenotypes, we use an integrated ontology consisting of HP and MP, where equivalences between HP and MP classes have been determined using an automated ontology alignment tool (see the “Integrating HP and MP with corrected cardinality phenotypes” section). The use of ontology alignment is in contrast to an integration based on axioms as used in the integrated Monarch knowledge graph [40] or the PhenomeNET ontology [4]; while the integrated ontologies may provide more alignments between classes, relying exclusively on ontology alignment allows us to evaluate the modifications to HP and MP directly without the need to rewrite or add further axioms. Figure 4 illustrates the steps of our evaluation.
Fig. 4 Workflow of this study, illustrated with the phenotype (increased number of T cells) and the phenotype (absent T cells). In this example, we created the class (collection of T cells) representing all the T cells. Then we created the phenotype classes (increased number of T cells within a collection) and (absence of all T cells). We added two structuring classes, one based on the quality (increased amount) and one based on the collection (collection of T cell). For the evaluation, we applied a quantitative evaluation based on a biomedical task, namely the gene–disease association prediction task using semantic similarity between phenotypes.
We compare only diseases and genes which are annotated with at least one cardinality phenotype, using 425 diseases and 4,471 genes. We first compare their phenotype similarity only based on cardinality phenotype classes, i.e., ignoring all other phenotypes; we compare their similarity twice: first we use the original classification of phenotype classes in HP and MP, and, second, we use the revised classification of the cardinality phenotype classes based on our ontological analysis. For each disease, we rank all genes based on their similarity (the gene with the highest phenotype similarity is ranked first), and evaluate where we rank the correct disease-associated gene using the area under the ROC curve (ROCAUC).
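The ranking-based evaluation for a single disease can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used in this work: the gene names and similarity scores are made up, the similarity measure itself is not shown, and how per-disease results are aggregated is left open. The area under the ROC curve is computed through its rank interpretation, i.e., the probability that a correct gene receives a higher similarity score than an incorrect one (ties count as one half).

```python
def roc_auc(similarity, positives):
    """ROC AUC for one disease from gene similarity scores.

    `similarity` maps gene -> similarity to the disease;
    `positives` is the set of genes truly associated with the disease.
    """
    pos = [s for g, s in similarity.items() if g in positives]
    neg = [s for g, s in similarity.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    # Count positive/negative pairs in which the positive gene is ranked higher.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical similarity scores between one disease and four genes.
similarity = {"gene_a": 0.9, "gene_b": 0.4, "gene_c": 0.7, "gene_d": 0.1}
auc = roc_auc(similarity, positives={"gene_c"})
```

In this toy case the correct gene is ranked second of four, so it outranks two of the three incorrect genes and the AUC is 2/3.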
Using the original classification of cardinality phenotypes in HP and MP, we obtain a ROCAUC of 0.6931, whereas the ROCAUC increases to 0.7384 with our revised hierarchy. While this demonstrates an improvement, it is not a realistic scenario for finding gene–disease associations because the majority of phenotypes are omitted. As a second test, we compared the same set of genes and diseases using all their phenotype annotations (cardinality phenotypes and non-cardinality phenotypes). Again, the ROCAUC improves, from 0.9166 with the original classification of phenotypes to 0.9265 with the revised classification.