In this paper, we applied six metrics to assess the quality characteristics of RDF resources in the rare-disease domain. The assessment revealed several issues: eleven out of sixteen resources have non-resolvable URIs; seven resources have undefined URIs; and two resources have inconsistencies related to ‘owl:ObjectProperty’ properties. Individual findings are discussed in more depth in the sections that follow.
Insights into errors
Numerous resources such as the ORDO used the property <https://creativecommons.org/licenses/permits> and the class <http://web.resource.org/cc/Attribution> to describe the Creative Commons licenses. However, both of these are non-resolvable. The correct ones are <https://creativecommons.org/ns#permits> and <https://creativecommons.org/ns#Attribution> [22]. This implies that there is a lack of up-to-date communication between ontology creators and the Creative Commons organization.
There are some URIs that are classified by the algorithm as ‘undefined’ that are actually ‘defined’, according to the definition of the ‘undefined URIs’ metric. For example, the URI <http://www.w3.org/ns/prov-o> in the EJP RD Resource Metadata Ontology (see Table 6) is described in the triple: <http://www.w3.org/ns/prov-o#> rdf:type owl:Ontology.
Both URIs point to the same resource but are syntactically different (i.e., a URI with a hashtag compared to one without a hashtag). These examples show that any approach or technique based on pattern matching is heavily reliant on the accuracy of URIs. Also classified as ‘undefined’ are the other two ontology URIs <http://www.w3.org/1999/02/22-rdf-syntax-ns> and <http://www.w3.org/2000/01/rdf-schema> in the UniProt Ontology without hashtags. Another example is the URI <https://doi.org/10.1186/s13326-017-0126-0>. It is classified as ‘undefined’ in GO, because it does not exist in its triples that were parsed. The URI <http://dx.doi.org/10.1186/s13326-017-0126-0> is however defined. One should not use one URI for definition whilst using another URI for referencing it.
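The hashtag and DOI examples above suggest comparing URIs on a normalized key rather than on the raw string. The following is a minimal sketch of that idea, not the tool used in this study. The function names, the `HOST_ALIASES` table, and the decision to treat `http` and `https` as equivalent are assumptions made for illustration, and the key is meant for matching only, not as a replacement for the original URI.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: hosts treated as equivalent for comparison only.
HOST_ALIASES = {"dx.doi.org": "doi.org"}


def normalize_uri(uri: str) -> str:
    """Return a comparison key for a URI (assumption-laden, for matching only)."""
    parts = urlsplit(uri.strip())
    host = parts.netloc.lower()
    host = HOST_ALIASES.get(host, host)
    scheme = parts.scheme.lower()
    # Assumption: http and https variants identify the same resource.
    if scheme in ("http", "https"):
        scheme = "https"
    # An empty fragment (a trailing '#') disappears here; a non-empty
    # fragment such as '#Catalog' is kept, including its case.
    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))


def same_resource(a: str, b: str) -> bool:
    """True when two URIs share the same normalized key."""
    return normalize_uri(a) == normalize_uri(b)
```

Under these assumptions, `same_resource("http://www.w3.org/ns/prov-o", "http://www.w3.org/ns/prov-o#")` holds, while URIs that differ in a non-empty fragment remain distinct.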
In addition, URIs whose ‘path’ part contains letters are more susceptible to any operation that is affected by case sensitivity. For example, ‘dcat:catalog’ (<http://www.w3.org/ns/dcat#catalog>) is a property while ‘dcat:Catalog’ (<http://www.w3.org/ns/dcat#Catalog>) is a class. Their ‘path’ parts, ‘#catalog’ (lowercase) versus ‘#Catalog’ (uppercase), differ only in the case of one letter, which makes them easy to confuse. This issue can be alleviated by incorporating codes into the naming, for example, the ‘is located in’ property <http://semanticscience.org/resource/SIO_000061> and the class ‘female’ <http://purl.bioontology.org/ontology/SNOMEDCT/248152002>, which rely on numeric codes.
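Case-only differences of this kind can be flagged automatically. The sketch below groups URIs by their lowercased form and reports any group with more than one distinct spelling. The function name `case_collisions` is hypothetical and the check is an illustration, not one of the six metrics applied in this work.

```python
from collections import defaultdict


def case_collisions(uris):
    """Return groups of distinct URIs that are identical when case is ignored."""
    groups = defaultdict(set)
    for uri in uris:
        groups[uri.lower()].add(uri)
    # Keep only groups where at least two different spellings occur.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A reported group is not necessarily an error, since ‘dcat:catalog’ and ‘dcat:Catalog’ are both legitimate terms; it only marks places where a reviewer should confirm that the intended term was used.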
Mismatched prefixes or terms are a common cause of undefined URIs. One example is <http://purl.org/dc/elements/1.1/license>, which is used in the HPO. This term is not defined, whereas <http://purl.org/dc/terms/license> is, although both URIs are resolvable. This is because two Dublin Core Metadata Initiative (DCMI) namespaces [23] were mixed up: ‘http://purl.org/dc/elements/1.1/’ and ‘http://purl.org/dc/terms/’. Another example is ‘rdfs:source’ (<https://www.w3.org/2000/01/rdf-schema#source>) used in the hPSCreg vocabulary. This term does not exist; however, ‘rdfs:Resource’ does. This is probably due to a misinterpretation of existing terms. Both examples demonstrate the need for automated quality assessment by machines to detect errors that are often hard for humans to detect.
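A namespace mix-up such as the Dublin Core case can be caught by checking the local name of a URI against the terms its namespace actually defines. The sketch below is illustrative only: `KNOWN_TERMS` is a hand-written table (the ‘http://purl.org/dc/terms/’ entry is a small excerpt, not the full vocabulary), and in practice these sets would be extracted from the parsed namespace documents. The function name `check_term` is hypothetical.

```python
# Illustrative term table. The dc/terms entry is an excerpt, not the
# complete vocabulary; real use would load terms from the parsed namespace.
KNOWN_TERMS = {
    "http://purl.org/dc/elements/1.1/": {
        "contributor", "coverage", "creator", "date", "description",
        "format", "identifier", "language", "publisher", "relation",
        "rights", "source", "subject", "title", "type",
    },
    "http://purl.org/dc/terms/": {"license", "title", "creator", "rights"},
    "http://www.w3.org/2000/01/rdf-schema#": {
        "Resource", "Class", "Literal", "Datatype", "Container",
        "ContainerMembershipProperty", "label", "comment", "subClassOf",
        "subPropertyOf", "domain", "range", "seeAlso", "isDefinedBy", "member",
    },
}


def check_term(uri):
    """Return (is_defined, suggestions) for a URI in a known namespace.

    is_defined is None when the URI belongs to no namespace in the table.
    suggestions lists URIs in other namespaces that define the same local name.
    """
    for namespace, terms in KNOWN_TERMS.items():
        if uri.startswith(namespace):
            local = uri[len(namespace):]
            if local in terms:
                return True, []
            suggestions = [
                other + local
                for other, other_terms in KNOWN_TERMS.items()
                if other != namespace and local in other_terms
            ]
            return False, suggestions
    return None, []
```

With this table, `check_term("http://purl.org/dc/elements/1.1/license")` reports the term as undefined and suggests <http://purl.org/dc/terms/license>.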
Importantly, we do not regard a URI that does not have content-type RDF to be an error because such a URI already indicates that it does not provide an RDF representation. For instance, the URI <https://www.ietf.org/rfc/rfc3986.txt> with the ‘text/plain’ content-type in the ‘rare-disease biobank and registries’ resource and the URI <https://github.com/geneontology/go-ontology/issues/7549> with the ‘text/html’ content-type in HPO properly use non-RDF content. It is also essential to emphasize that the purpose of identifying errors in these resources is not to dissuade people from using them, but rather to suggest areas for improvement so that the rare-disease community can benefit from ‘linked data’ and RDF.
Strengths and limitations
Our effort to assess the quality of RDF resources in the domain of rare diseases has several strengths. First, the metrics applied are objective and automatable, so the quality assessment scales easily to other RDF resources while yielding reliable results. Second, the assessment report is generated in the form of RDF, allowing the quality information to be shared and reused in the future to accommodate the dynamic nature of resources in the world of Linked Data.
There are limitations in the implementation of the metric assessment. First, the current evaluation tool relies on pattern matching and is limited to the syntactic level; it therefore does not treat two URIs with and without a hashtag as identical. Second, the current version of the tool does not adequately handle instances. One example is the URI <http://purl.obolibrary.org/obo/IAO_0000120>, which stands for ‘metadata complete’ and is an instance of ‘curation status specification’ (<http://purl.obolibrary.org/obo/IAO_0000078>); that is, it is defined as ‘owl:NamedIndividual’ rather than ‘owl:Class’ or ‘rdfs:Class’. Only the metrics regarding resolvability and parsability are applicable to instances, so the tool tested instances on these two metrics only. Additional metrics that measure other aspects of instances are needed and should be the subject of future work. One example of such a metric is detecting an instance typed as a member of two disjoint classes, which can lead to inconsistency.
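The disjoint-class metric proposed above as future work could be approximated as follows. This is a sketch under simplifying assumptions: triples are plain `(subject, predicate, object)` string tuples, only directly asserted ‘rdf:type’ and ‘owl:disjointWith’ statements are considered, and no subclass reasoning is performed. The function name is hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_DISJOINT_WITH = "http://www.w3.org/2002/07/owl#disjointWith"


def disjoint_type_violations(triples):
    """Return (instance, class_a, class_b) for instances typed with two disjoint classes."""
    types = {}        # instance URI -> set of asserted class URIs
    disjoint = set()  # unordered pairs of classes declared disjoint
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
        elif p == OWL_DISJOINT_WITH:
            disjoint.add(frozenset((s, o)))

    violations = []
    for instance, classes in types.items():
        ordered = sorted(classes)
        # Check every unordered pair of asserted types once.
        for i, class_a in enumerate(ordered):
            for class_b in ordered[i + 1:]:
                if frozenset((class_a, class_b)) in disjoint:
                    violations.append((instance, class_a, class_b))
    return violations
```

A fuller implementation would also follow ‘rdfs:subClassOf’ chains, since an instance may belong to disjoint classes only through inherited types.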
Lessons learned for quality assessment
Given the size of biomedical ontologies, it is necessary to design methods that are computationally efficient in terms of memory consumption and time cost before implementing the metrics, especially for a large ontology (e.g., NCIT with over 170,000 terms) or when an ontology server has a blocking mechanism to prevent repeated external requests. For example, the assessment of SNOMED CT revealed that all the URIs stemming from SNOMED CT (i.e., those starting with ‘http://snomed.info/’) return the status code 423 Locked. This is not a quality issue of these URIs but is attributed to a blocking mechanism, which persisted despite retry and sleep functions being applied in the software. Both functions also increase the total running time of the assessment. To enable consistency assessment in such cases, one potential approach is to retrieve a complete RDF representation of the resource, such as an ontology, a schema, or a (meta)dataset, and make it available in a triplestore as a temporary RDF graph to be referred to by assessed URIs.
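The retry-and-sleep behaviour mentioned above can be sketched as a bounded retry loop with exponential backoff. This is not the software used in the study. The function name, the default parameters, and the inclusion of status codes 429 and 503 alongside 423 are assumptions, and `fetch` is a caller-supplied function that returns an HTTP status code for a URI.

```python
import time


def fetch_with_retry(fetch, uri, max_attempts=4, base_delay=1.0,
                     retry_statuses=(423, 429, 503), sleep=time.sleep):
    """Call fetch(uri) until it returns a non-retryable status or attempts run out.

    Returns (last_status, attempts_made). The delay doubles after each
    retryable response: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    status = None
    for attempt in range(max_attempts):
        status = fetch(uri)
        if status not in retry_statuses:
            return status, attempt + 1
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, max_attempts
```

When the last status is still 423, a caller could record the URI as blocked rather than as non-resolvable, in line with the observation that the response does not reflect a quality issue of the URI itself. With the default parameters, a persistently blocked URI costs about seven seconds of waiting, which illustrates how retries add to the total running time.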
Even though the current quality model is adequate for representing the quality metadata in RDF, amendments or extensions may be required as more resources are investigated. DCAT <https://www.w3.org/ns/dcat#>, for instance, supports multiple RDF serialization formats, such as JSON-LD and Turtle. DCAT in JSON-LD <https://www.w3.org/ns/dcat2.jsonld> and Turtle <https://www.w3.org/ns/dcat2.ttl> are likely to produce different data quality measures because the graphs parsed from the two URIs are not identical. A potential solution is to treat (resources in) each serialization format as an individual resource and link the quality measures to the particular format assessed. Through the property ‘dcat:distribution’, each (resource in) serialization format can be linked to the original resource URI, such as DCAT <https://www.w3.org/ns/dcat#>.
Recommendation for creation of high-quality rare-disease resources
In this paper, we consider a resource to be of high quality if it does not have any foundational quality issues. Although some [16, 24] argue that resource quality is subjective and in the eye of the beholder, the foundational quality aspects emphasized in this work remain objective and fundamental for all resources. Here are some recommendations learned from this study for the creation of high-quality RDF resources in the domain of rare diseases:
Non-resolvable URIs: (1) If one creates URIs, ensure that they are resolvable. Non-resolvable URIs need to be corrected and all URIs need to be tested periodically. (2) Avoid using URIs from external resources that are non-resolvable. Even within a commonly used ontology such as the ORDO, there are 42 non-resolvable URIs, which are used to describe rare-disease conditions.
Undefined URIs: (1) If one creates URIs, it is recommended to only include digits in their naming so that they are case insensitive [25]. (2) If one reuses URIs from external resources, make sure to comprehend their namespaces and apply them correctly. Keep in mind that URIs of terminology may be case-sensitive, which can result in different resources being referenced when the capitalization of URIs is altered.
Inconsistent URIs: (1) If one creates classes or properties, ensure that they adhere to their intrinsic characteristics, such as ‘owl:Class’ or ‘rdf:Property’, ‘owl:ObjectProperty’ or ‘owl:DatatypeProperty’. (2) If one reuses existing classes or properties, ensure that they adhere to the same intrinsic characteristics and that they are not deprecated.
Related work and future work
There are some studies that investigated the quality issues related to foundational quality. Johannes et al. [26, 27] highlighted that the availability of (terms of) ontologies could significantly influence the reusability of resources that reference these ontologies. They conducted the ontology accessibility study on 1,439 ontologies on the DBpedia Archivo [28] platform, and found that 709 (46%) of these ontologies were not accessible at least once. Being inaccessible means that the ontology URI and all URIs defined in ontologies were non-resolvable, and they found that these non-resolvable ontologies have impacted 32% of linked data on the same platform. This finding based on ontologies on the Archivo platform is in line with our findings based on the rare-disease resources (including ontologies), indicating that non-resolvable URIs continue to be a problem in the Semantic Web community. Such a problem should be ‘resolved’, given the important role of identifiers in making data Findable, Accessible, Interoperable, and Reusable [29,30,31,32]. Identifiers (e.g., URIs) can make it easier to find resources in an unambiguous manner (F), ensure reliable access if resolvable and authorized (A), enable databases and repositories to recognize and computers to interpret the referred resources (I), altogether contributing to the reuse of resources (R).
Given the objective and automatable nature of the foundational quality metrics, it will be necessary in the future to assess resources in other domains to identify more quality issues in the real world, and accordingly to develop domain-specific guidelines.