The secondary use of health data offers several advantages for addressing the challenges facing our health care system: it could increase patient safety, provide insights into person-centered care, and foster innovation and clinical research. To maximize these benefits, the health care ecosystem is investing rapidly in primary sources, such as electronic health records (EHRs) and personalized health monitoring, as well as in secondary sources, such as health registries, health information systems, and digital health technologies, to effectively manage illnesses and health risks and improve health care outcomes []. These investments have led to large volumes of complex real-world data. However, health care is not realizing the full potential of the secondary use of health data [,] because of, among other issues, concerns about the quality of the data being used [,]. Errors in the collection of health data are common: studies have reported that at least half of EHR notes may contain an error, leading to low-quality data [-]. The transition to digital health has increased the volume of health data, but this has not been matched by a comparable increase in data quality []. This impedes the potentially positive impact of digitalization on patient safety [], patient care [], decision-making [], and clinical research [].
The literature is replete with various definitions of data quality. One of the most widely used definitions of data quality comes from Juran et al [], who defined data quality as “data that are fit for use in their intended operational, decision-making, planning, and strategic roles.” According to the International Organization for Standardization (ISO) definition, quality is “the capacity of an ensemble of intrinsic characteristics to satisfy requirements” (ISO 9000:2015). DAMA International (The Global Data Management Community: a leading international association involving both business and technical data management professionals) adapts this definition to a data context: “data quality is the degree to which the data dimensions meet requirements.” These definitions emphasize the subjectivity and context dependency of data quality []. Owing to this “fit for purpose” principle, the quality of data may be adequate when used for one specific task but not for another.
For example, when health data collected in a primary use setting, such as blood pressure measurements, are reused for different purposes, the adequacy of their quality can vary. For managing hypertension, the data’s accuracy and completeness may be considered adequate. However, if the same data are reused for research, for example, in a clinical trial evaluating the effectiveness of an antihypertensive, more precise and standardized measurement methods are needed. From the perspective of secondary use, data are of sufficient quality when they serve the needs of the specific goals of the reuser [].
To ensure that the data are of high quality, they must meet certain fundamental, measurable characteristics (eg, data must be complete, correct, and up to date). These characteristics are called data quality dimensions, and several authors have attempted to formulate a multidimensional framework of data quality. Kahn et al [] developed a data quality framework containing conformance, completeness, and plausibility as the main data quality dimensions. This framework was the result of 2 stakeholder meetings in which data quality terms and definitions were grouped into an overall conceptual framework. The i~HD (European Institute for Innovation through Health Data) prioritized 9 data quality dimensions as most important to assess the quality of health data []. These dimensions were selected during a series of workshops with clinical care, clinical research, and ICT leads from 70 European hospitals. In addition, several published reviews have collated the results of individual quality assessment studies into new single frameworks of data quality dimensions. However, the results of these reviews have not yet been evaluated. Therefore, answering the “fit for purpose” question and establishing effective methods to assess data quality remain a challenge [].
The primary objective of this review is to provide a thorough overview of data quality frameworks and their associated assessment methods, with a specific focus on the secondary use of health data, as presented in published reviews. As a secondary aim, we seek to align and consolidate the findings into a unified framework that captures the most crucial aspects of quality, each with a definition, along with their corresponding assessment methods and requirements for testing.
We conducted a review of reviews to gain insights into data quality related to the secondary use of health data. In this review of reviews, we applied the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines proposed by Page et al [], as recommended by the EQUATOR Network. As our work is primarily a review of reviews, we included only the items from these guidelines that were applicable. Abstracts were sourced by searching the PubMed, Embase, Web of Science, and SAGE databases. The search was conducted in April 2023, and only reviews published between 1995 and April 2023 were included. We used specific search terms that were aligned with the aim of our study. To ensure comprehensiveness, the search terms were expanded by searching for synonyms and relevant key terms. The following concepts were used: “data quality” or “data accuracy,” combined with “dimensions,” “quality improvement,” “data collection,” “health information interoperability,” “health information systems,” “public health information,” “quality assurance,” and “delivery of health care.” Textbox 1 illustrates an example of the search strategy used in PubMed. To ensure the completeness of the review, the literature search spanned multiple databases. All keywords and search queries were adapted and modified to suit the requirements of these various databases ().
Textbox 1. Search query used.
(“data quality” OR “Data Accuracy”[Mesh]) AND (dimensions OR “Quality Improvement”[Mesh] OR “Data Collection/standards”[Mesh] OR “Health Information Interoperability/standards”[Mesh] OR “Health Information Systems/standards”[Mesh] OR “Public Health Informatics/standards” OR “Quality Assurance, Health Care/standards”[Mesh] OR “Delivery of Health Care/standards”[Mesh]) Filters: Review, Systematic Review
Inclusion and Exclusion Criteria
We included review articles that described and discussed frameworks of data quality dimensions and their assessment methods, especially from a secondary use perspective. Reviews were excluded if they (1) were not specifically related to the health care ecosystem, (2) lacked relevant information related to our research objective (ie, provided no definition of dimensions), or (3) were published in languages other than English.
Selection of Articles
One reviewer (JD) screened the titles and abstracts of 982 articles from the literature searches and excluded 940 reviews. Two reviewers (RVS and JD) independently performed full-text screening of the remaining 42 reviews. Disagreements between the 2 reviewers were resolved by consulting a third reviewer (DK). After full-text screening, 20 articles were excluded because they did not meet the inclusion criteria. A total of 22 articles were included in this review.
Data Extraction
All included articles were imported into EndNote 20 (Clarivate). Data abstraction was conducted independently by 2 reviewers (RVS and JD). Disagreements between the 2 reviewers were resolved by consulting a third reviewer (DK). The information extracted from the reviews included the authors, publication year, research objectives, specific data source used, scope of secondary use, terminology used for the data quality dimensions, their corresponding definitions, and the measurement methods used.
Data Synthesis
To bring clarity to the diverse dimensions and definitions scattered throughout the literature, we labeled the observed definitions of dimensions from the reviews as “aspects.” We then used the framework of the i~HD, which underwent extensive validation through a large-scale exercise and has been published []. It served as the reference framework for mapping the diverse literature in the field. This overarching framework comprises 9 loosely delineated data quality dimensions (Textbox 2 []). Each observed definition of a data quality dimension was mapped onto a dimension of this reference framework. The mapping process was collaborative and required consensus among the reviewers. This consolidation is intended to offer a more coherent and unified perspective on data quality for secondary use.
Textbox 2. Consolidated data quality framework of the European Institute for Innovation through Health Data [20].
Data quality dimension and definition
Completeness: the extent to which data are present
Consistency: the extent to which data satisfy constraints
Correctness: the extent to which data are true and unbiased
Timeliness: the extent to which data are promptly processed and up to date
Stability: the extent to which data are comparable among sources and over time
Contextualization: the extent to which data are annotated with acquisition context
Representativeness: the extent to which data are representative of intended use
Trustworthiness: the extent to which data can be trusted based on the owner’s reputation
Uniqueness: the extent to which data are not duplicated

Figure 1 summarizes the literature review process and the articles included and excluded at every stage of the review using the PRISMA guidelines. It is important to note that this was not a systematic review of clinical trials; rather, it was an overview of existing reviews. As such, it synthesizes and analyzes the findings from multiple reviews on the topic of interest. A total of 22 articles were included in this review. The 22 reviews included systematic reviews (4/22, 18%) [-], scoping reviews (2/22, 9%) [,], and narrative reviews (16/22, 73%) [,-]. All the reviews were published between 1995 and 2023. Of the 20 excluded reviews, 5 (25%) were excluded because they were not specific to the health care ecosystem [,-], 13 (65%) lacked relevant information related to our research objective [-], and 2 (10%) were published in a language other than English [,].
Figure 1. The process of selecting articles.

Data Sources
Of the 22 reviews, 10 (45%) discussed data quality pertaining to a registry [-,-,-] and 4 (18%) to a network of EHRs [,,,]. Of the 22 reviews, 4 (18%) discussed the quality of public health informatics systems [,], real-world data repositories [], and clinical research informatics tools []. Of the 22 reviews, 4 (18%) did not specify their data source [,,,].
Observed Frameworks for Data Quality Dimensions
In the initial phase of our study, we conducted a comprehensive review of the 22 selected reviews, each presenting a distinct framework for understanding data quality dimensions. Across these reviews, the number of dimensions varied widely, ranging from 1 to 14 (median 4, IQR 2-5). The terminology used was diverse, yielding 23 different terms for dimensions and 62 unique definitions. A detailed overview, including data sources, data quality dimensions, and definitions, is provided in [,-]. Figure S1 presents the frequency of all dimensions in each review along with the variety of definitions associated with each dimension.
Data Synthesis: Constructing a Consolidated Data Quality Framework for Secondary Use
Overview
Table 1 presents all dimensions mentioned in the included reviews, with their definitions, mapped toward each of the 9 data quality dimensions in the i~HD framework.
Table 1. Mapping of data quality aspects toward the i~HD (European Institute for Innovation through Health Data) data quality framework (columns: i~HD data quality dimensions and aspects as mentioned in the reviews; definition). aEHR: electronic health record.
Completeness
The first data quality dimension relates to the completeness of data. Among the 22 reviews included, 20 (91%) highlighted the significance of completeness [,-,-,,-]. Of these 20 reviews, 17 (85%) used the term completeness to refer to this dimension [,-,-,,-], whereas the remaining 3 (15%) used the terms plausibility [] and capture [,].
On the basis of the definitions of completeness, we can conclude that this dimension contains 2 main aspects. The first relates to the data level: the most used definition for this aspect is the extent to which information is not missing [,,,]. Other reviews focused more on features that describe the frequencies of data attributes present in a data set without reference to data values [,,]. Shivasabesan et al [], for example, defined completeness as the presence of recorded data points for each variable. The second aspect relates more to the case level, in which all the incident cases occurring in the population are included [,,,].
Consistency
The second data quality dimension concerns the consistency of the data. Among the 22 selected reviews, 11 (50%) highlighted the importance of consistency [,,,-,,,,]. Although various frameworks acknowledge this as a crucial aspect of data quality, achieving a consensus on terminology and definition has proven challenging. Notably, some reviews used different terminologies to describe identical concepts associated with consistency [,,,]. Of the 11 reviews, 6 (55%) used the term consistency to describe this dimension [,,,,,], whereas 3 (27%) used conformance [,,] and 2 (18%) referred to comparability [,]. Of the 11 reviews, 3 (27%) used distinct terms: accuracy [], validity [], and concordance []. Most definitions focus on data quality features that describe the compliance of the representation of data with internal or external formatting, relational, or computational definitions [,]. Of the 11 reviews, 2 (18%) provided a specific definition of consistency concerning registry data, concentrating on the extent to which coding and classification procedures, along with the definitions or recording and reporting of specific data terms, adhere to the agreed international guidelines [,]. Furthermore, Bian et al [] concentrated on whether the values present meet syntactic or structural constraints in their definition, whereas Liaw et al [] defined consistency as the extent to which the representation of data values is consistent across all cases.
Correctness
The third data quality dimension relates to the correctness of the data. Of the 22 reviews, 14 (64%) highlighted the importance of correctness [,,,,,,,-,,,]. Of the 14 reviews, 2 (14%) used 2 different dimensions to describe the same concept of correctness [,]. Accuracy was the most frequently used term within these frameworks [,,,,,,]. In addition, other terms used included correctness [,,], plausibility [,,], and validity [,]. In general, this dimension assesses the degree to which the recorded data align with the truth [,,], ensuring correctness and reliability [,]. Of the 14 reviews, 2 (14%) provided a specific definition of correctness concerning EHR data, emphasizing that the element collected is true [,]. Furthermore, of the 14 reviews, 2 (14%) defined correctness more at a data set level, as the proportion of cases in a data set with a given characteristic that genuinely possess the attribute [,]. These reviews specifically referred to this measure as validity. Nevertheless, the use of the term validity was not consistent across the literature; it was also used to define consistency. For instance, AbuHalimeh [] used validity to describe the degree to which information adheres to a predefined format or complies with the established business rules.
Timeliness
The fourth data quality dimension concerns the timeliness of the data. Among the 22 selected reviews, 11 (50%) underscored the importance of this data quality dimension [,,,,,,-,,]. Of the 11 reviews, 7 (64%) explicitly used the term timeliness [,,,,,,], whereas 4 (36%) referred to it as currency [,,,]. Mashoufi et al [] used the terms accessibility and timeliness to explain the same concept. Broadly, timeliness describes how promptly information is processed or how up to date the information is. Most reviews emphasized timeliness as the extent to which information is up to date for the task at hand [,,]. For instance, Weiskopf and Weng [] provided a specific definition for EHR data, stating that an element should be a relevant representation of the patient’s state at a given point in time. Other reviews defined timeliness as the speed at which data can be collected, processed, and reported [,,]. Similarly, Porgo et al [] defined timeliness as the extent to which data are available when needed.
Stability
The fifth data quality dimension concerns the stability of the data. Among the 22 included reviews, 4 (18%) acknowledged the significance of stability [,,,]. The most frequently used terms for this dimension are consistency [,] and concordance []. In addition, other terms used include currency [], comparability [], and information loss and degradation []. Bian et al [] explored this aspect of data quality by using multiple terminologies to capture its multifaceted nature: stability, consistency, concordance, and information loss and degradation. This dimension, in general, encompasses 2 distinct aspects. First, it underscores the importance of data values that remain consistent across multiple sources and locations [,,]. Alternatively, as described by Bian et al [], it refers to the similarity in data quality for specific data elements used in measurements across different entities, such as health plans, physicians, or other data sources. Second, it addresses temporal changes in data that are collected over time. For instance, Lindquist [] highlighted the importance of stability in data fields that involve information that may change over time. The term consistency is used across different data quality dimensions, but it holds different meanings depending on the context. When discussing the dimension of stability, consistency refers to the comparability of data across different sources, ensuring that information remains uniform when aggregated or compared. Within the consistency dimension, by contrast, the term refers to the internal coherence of data within a single data set, that is, the absence of contradiction and compliance with certain constraints. The results indicate the same ambiguity for the term currency. When associated with stability, currency refers to the longitudinal aspect of variables. In contrast, within the dimension of timeliness, currency concerns whether the data are up to date.
Contextualization
The sixth data quality dimension revolves around the context of the data. Of the 22 reviews analyzed, 3 (14%) specifically addressed this aspect within their framework [,,]. The most used term was understandability [,]. In contrast, Syed et al [] used the term contextual validity, and Bian et al [] referred to flexibility and understandability for defining the same concept. Broadly speaking, contextualization pertains to whether the data are annotated with their acquisition context, which is a crucial factor for the correct interpretation of results. As defined by Bian et al [], this dimension relates to the ease with which a user can understand data. In addition, AbuHalimeh [] refers to the degree to which data can be comprehended.
Representation
The seventh dimension of data quality focuses on the representation of the data. Of the 22 reviews examined, 3 (14%) specifically highlighted the importance of this dimension [,,]. Of the 3 reviews, 2 (67%) used the term relevance [,], whereas Porgo et al [] used the term precision. Broadly speaking, representativeness assesses whether the information is applicable and helpful for the task at hand [,]. In more specific terms, as defined by Porgo et al [], representativeness relates to the extent to which data values are specific to the task at hand.
Trustworthiness
The eighth dimension of data quality relates to the trustworthiness of the data. Of the 22 reviews, only 2 (9%) considered this dimension in their review [,]. In both cases, trustworthiness was defined as the extent to which data are free from corruption and access is appropriately controlled to ensure privacy and confidentiality.
Uniqueness
The final dimension of data quality relates to the uniqueness of the data. Of the 22 reviews, only 1 (5%) referred to this aspect []. Uniqueness is evaluated based on whether there are no duplications or redundant data present in a data set.
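As noted later, none of the included reviews reported an assessment method for this dimension. Purely as an illustration, a duplication rate could be computed along the lines of the following minimal sketch, in which the data frame and the identifying fields are hypothetical assumptions rather than elements taken from the reviews.

```python
import pandas as pd

# Hypothetical records; the identifying fields chosen here are
# illustrative assumptions, not fields prescribed by any review.
records = pd.DataFrame({
    "national_id": ["A1", "A2", "A2", "A3"],
    "date_of_birth": ["1950-02-01", "1962-07-14", "1962-07-14", "1971-11-30"],
})

# Uniqueness check: share of records that repeat an earlier combination
# of the identifying fields (0% would indicate no duplication).
duplicate_rate = records.duplicated(subset=["national_id", "date_of_birth"]).mean()
print(f"Duplicated records: {duplicate_rate:.1%}")
```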
Observed Data Quality Assessment Methods
Overview
Of the 22 selected reviews, only 8 (36%) mentioned data quality assessment methods [,,,,,-]. Assessment methods were defined for 15 (65%) of the 23 data quality dimensions. The number of assessment methods per dimension ranged from 1 to 15 (median 3, IQR 1-5). There was no consensus on which method to use for assessing data quality dimensions. Figure S2 presents the frequency of the dimensions assessed in each review, along with the number of different data quality assessment methods.
In the following section, we harmonize these assessment methods with our consolidated framework, providing a comprehensive overview that links the assessment methods to the primary data quality dimensions from the previous section. Table 2 provides an overview of all data quality assessment techniques and their definitions. Textbox 3 presents an overview of all assessment methods mentioned in the literature, mapped toward the i~HD data quality framework.
Table 2. Overview of all data quality assessment methods with definitions (columns: assessment Ma, assessment technique in reviews, explanation). M1: linkages—other data sets (percentage of eligible population included in the data set).
aM: method.
bM:I: mortality:incidence ratio.
Textbox 3. Mapping of assessment methods (Ms) toward the data quality framework of the European Institute for Innovation through Health Data.

Completeness
- Capture []: M1: linkages—other data sets; M2: comparison of distributions; M3: case duplication
- Completeness []: M4: completeness of variables; M5: completeness of cases
- Completeness []: M4: completeness of variables; M6: distribution comparison; M7: gold standard; M5: completeness of cases
- Completeness []: M8: historic data methods; M9: mortality:incidence ratio (M:I); M10: number of sources and notifications per case; M11: capture-recapture method; M12: death certificate method
- Completeness []: M8: historic data methods; M9: M:I; M10: number of sources and notifications per case; M11: capture-recapture method; M12: death certificate method; M13: histological verification of diagnosis; M14: independent case ascertainment
- Completeness []: M4: completeness of variables; M6: distribution comparison; M7: gold standard; M15: data element agreement; M16: data source agreement
- Completeness []: M4: completeness of variables; M6: distribution comparison; M7: gold standard; M17: conformance check

Consistency
- Conformance []: M18: element presence; M17: conformance check
- Concordance []: M15: data element agreement; M19: not specified
- Consistency []: M16: data source agreement
- Comparability []: M20: international standards for classification and coding; M21: incidence rate; M22: multiple primaries; M23: incidental diagnosis; M24: not specified
- Comparability []: M20: international standards for classification and coding
- Consistency []

Correctness
- Correctness []: M7: gold standard; M15: data element agreement
- Plausibility []: M6: distribution comparison; M25: validity check; M31: log review; M16: data source agreement
- Validity []: M26: reabstracting and recoding; M13: histological verification of diagnosis; M27: missing information; M28: internal consistency; M12: death certificate method
- Validity []: M13: histological verification of diagnosis; M12: death certificate method
- Accuracy []: M7: gold standard; M28: internal consistency; M29: domain check; M30: interrater variability
- Correctness []
- Accuracy []: M7: gold standard; M32: syntactic accuracy

Stability
- Concordance []: M15: data element agreement; M16: data source agreement; M6: distribution comparison
- Comparability []
- Consistency []
- Consistency []: M15: data element agreement; M16: data source agreement

Timeliness
- Currency []
- Currency []
- Timeliness []: M34: not specified; M35: not specified
- Currency []
- Timeliness []: M36: time to availability

Trustworthiness

Representation
Completeness
Among the 20 reviews that defined data quality dimensions related to completeness, 6 (30%) incorporated data quality assessment methods into their framework [,,,,,]. These 6 reviews collectively introduced 17 different data quality assessment methods. Some reviews (4/6, 67%) mentioned multiple methods to evaluate completeness, which highlights the absence of a consensus within the literature regarding the most suitable approach. The most frequently used method in the literature for assessing completeness was the examination of variable completeness [,,,]. This method involved calculating the percentage of cases that had complete observations for each variable within the data set. In 3 reviews [,,], researchers opted to compare the distributions or summary statistics of aggregated data from the data set with the expected distributions for the clinical concepts of interest. Another approach found in 3 reviews involved the use of a gold standard to evaluate completeness [,,]. This method relied on external knowledge and entailed comparing the data set under examination with data drawn from other sources or multiple sources.
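To make the two most frequently reported checks concrete, the following minimal sketch computes variable completeness (M4) and case completeness (M5) on a hypothetical pandas data frame; the column names and the list of required variables are illustrative assumptions, not elements taken from the included reviews.

```python
import pandas as pd

# Hypothetical registry extract; column names are illustrative only.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "diagnosis_code": ["C50.9", None, "C34.1", "C18.7", None],
    "date_of_birth": ["1950-02-01", "1962-07-14", None, "1971-11-30", "1980-05-05"],
})

# M4, completeness of variables: percentage of records with a
# non-missing value, computed per variable.
variable_completeness = df.notna().mean() * 100
print(variable_completeness)

# M5, completeness of cases: percentage of records in which every
# required variable is present.
required = ["diagnosis_code", "date_of_birth"]
case_completeness = df[required].notna().all(axis=1).mean() * 100
print(f"Complete cases: {case_completeness:.1f}%")
```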
Consistency
Among the 15 reviews highlighting the significance of consistency, 6 (40%) defined data quality assessment methods [,,,,,]. In these 6 reviews, a total of 10 distinct data quality assessment methods were defined. The most used method involved calculating the ratio of violations of specific consistency types to the total number of consistency checks [,]. Two categories were established for this assessment: internal consistency, which focuses on the most commonly used data type, format, or label within the data set, and external consistency, which centers on whether data types, formats, or labels can be mapped to a relevant reference terminology or data dictionary. Another common assessment method was the implementation of international standards for classification and coding [,]. This was specific to oncology and suggested coding for topography, morphology, behavior, and grade. Liaw et al [] defined an assessment method in which ≥2 elements within a data set are compared to check if they report compatible information.
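The ratio of violations to the total number of consistency checks could be operationalized as in the sketch below; the simplified ICD-10 format rule and the column name are illustrative assumptions, not constraints prescribed by the reviews.

```python
import pandas as pd

# Hypothetical coded values; the format rule below is a deliberately
# simplified stand-in for an external formatting constraint.
codes = pd.DataFrame({"icd10_code": ["C50.9", "c34.1", "C18.7", "12345", "C61"]})
pattern = r"^[A-Z]\d{2}(?:\.\d{1,2})?$"

# Count rule violations and express consistency as
# 1 - (violations / total checks).
violations = (~codes["icd10_code"].str.match(pattern)).sum()
total_checks = len(codes)
consistency_score = 1 - violations / total_checks
print(f"Consistency: {consistency_score:.2f} ({violations}/{total_checks} violations)")
```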
Correctness
Among the 16 reviews underscoring the importance of correctness, 6 (38%) detailed data quality assessment methods [,,,,,]. Collectively, these 6 reviews proposed 15 different techniques. Prominent among these was histological verification [,], in which the percentage of morphologically verified values served as an indicator of diagnosis correctness. Another frequently used technique was the use of validity checks [], involving various methods to assess whether the data set values “make sense.” Three additional reviews opted for a comparative approach, benchmarking data against a gold standard and calculating sensitivity, specificity, and accuracy scores [,,]. Interestingly, there is an overlap between consistency and completeness as data quality dimensions in the assessment of correctness. For instance, Weiskopf and Weng [] defined data element agreement as an assessment for this dimension, whereas Bray and Parkin [] evaluated the proportion of registered cases with unknown values for specific items as a correctness assessment method.
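As a sketch of the comparative, gold standard approach, the example below computes sensitivity, specificity, and accuracy for a hypothetical registry flag against manually abstracted reference data; the variable names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical comparison of a registry flag with a gold standard
# obtained, for example, by manual chart abstraction.
data = pd.DataFrame({
    "registry_flag": [1, 0, 1, 1, 0, 0, 1, 0],
    "gold_standard": [1, 0, 0, 1, 0, 1, 1, 0],
})

tp = ((data.registry_flag == 1) & (data.gold_standard == 1)).sum()
tn = ((data.registry_flag == 0) & (data.gold_standard == 0)).sum()
fp = ((data.registry_flag == 1) & (data.gold_standard == 0)).sum()
fn = ((data.registry_flag == 0) & (data.gold_standard == 1)).sum()

sensitivity = tp / (tp + fn)   # true cases correctly recorded
specificity = tn / (tn + fp)   # true non-cases correctly recorded
accuracy = (tp + tn) / len(data)
print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, accuracy {accuracy:.2f}")
```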
Stability
Among the 7 reviews emphasizing the importance of stability of the data, only 3 (43%) discussed assessment techniques that address this dimension [,,]. These 3 reviews collectively outlined 5 different techniques, with no single technique predominating. Specifically, Weiskopf and Weng [] used several techniques to assess data stability, including data element agreement, which overlaps with other dimensions. Another technique introduced in the same review was data source agreement, which involves comparing data from different data sets from distinct sources.
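Data source agreement could be sketched as follows, assuming two hypothetical extracts of the same data element keyed on a shared patient identifier; the sources, element, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical extracts of the same element from two different sources.
ehr = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "smoking_status": ["never", "former", "current", "never"],
})
registry = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "smoking_status": ["never", "current", "current", "never"],
})

# Data source agreement: share of linked records in which both sources
# report the same value for the element of interest.
merged = ehr.merge(registry, on="patient_id", suffixes=("_ehr", "_registry"))
agreement = (merged["smoking_status_ehr"] == merged["smoking_status_registry"]).mean()
print(f"Data source agreement: {agreement:.0%}")
```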
Timeliness
Of the 12 reviews focusing on the timeliness of data, 5 (42%) delved into assessment techniques for this data quality dimension [,,,,]. Across these reviews, 5 distinct assessment techniques were discussed. The most commonly used technique was the use of a log review [,]. This method involved collecting information that provides details on data entry, the time of data storage, the last update of the data, or when the data were accessed. In addition, Bray and Parkin [] assessed timeliness by calculating the interval between the date of diagnosis (or date of incidence) and the date the case was available in the registry or data set.
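The interval-based assessment (time to availability) could be approximated as in the following sketch; the date columns and their values are hypothetical.

```python
import pandas as pd

# Hypothetical dates of diagnosis and of availability in the registry.
cases = pd.DataFrame({
    "date_of_diagnosis": pd.to_datetime(["2023-01-10", "2023-02-01", "2023-03-15"]),
    "date_available": pd.to_datetime(["2023-03-01", "2023-02-20", "2023-07-01"]),
})

# Timeliness as the reporting delay in days for each case; summary
# statistics describe how promptly cases become available.
cases["delay_days"] = (cases["date_available"] - cases["date_of_diagnosis"]).dt.days
print(cases["delay_days"].describe())
```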
Trustworthiness
The 2 reviews that considered trustworthiness as a data quality dimension both used the same assessment technique [,]. This method involves the analysis of access reports as a security analysis, providing insight into the trustworthiness of the data.
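A minimal, illustrative sketch of such an access report analysis is shown below; the log structure and the approved-role list are assumptions, as the reviews do not prescribe a specific implementation.

```python
import pandas as pd

# Hypothetical access report; columns and values are illustrative only.
access_log = pd.DataFrame({
    "user_role": ["clinician", "researcher", "unknown", "clinician"],
    "action": ["read", "read", "read", "update"],
})

# One simple security signal: accesses performed by roles that are not
# on the approved list for this data set.
approved_roles = {"clinician", "researcher"}
unapproved = access_log[~access_log["user_role"].isin(approved_roles)]
print(f"{len(unapproved)} of {len(access_log)} accesses by unapproved roles")
```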
Representation
Only 1 review that addressed the representation dimension as a data quality aspect mentioned an assessment method: Liaw et al [] introduced descriptive qualitative measures through group interviews to determine whether the data accurately represented the intended use.
Uniqueness and Contextualization
No assessment methods were mentioned for these data quality dimensions.
This first review of reviews regarding the quality of health data for secondary use offers an overview of the frameworks of data quality dimensions and their assessment methods, as presented in published reviews. There is no consensus in the literature on the specific terminology and definitions of terms. Similarly, the methodologies used to assess these terms vary widely and are often not described in sufficient detail. Comparability, plausibility, validity, and concordance are the 4 aspects classified under different consolidated dimensions, depending on their definitions. This variability underscores the prevailing discrepancies and the urgent need for harmonized definitions. Almost none of the reviews explicitly refer to quality requirements related to the context of data collection. Building on the insights gathered from these reviews, our consolidated framework organizes the numerous observed definitions into 9 main data quality dimensions, aiming to bring coherence to a fragmented landscape.
Health data in primary sources refer to data produced in the process of providing real-time and direct care to an individual [], with the purpose of improving the care process. A secondary source captures data collected by someone other than the primary user and can be used for other purposes (eg, research, quality measurement, and public health) []. The included reviews discussed data quality for secondary use. However, the quality of health data in secondary systems is a function of the primary sources from which they originate, the quality of the process used to transfer and transform the primary data to the secondary source, and the quality of the secondary source itself. The transfer and transformation of primary data to secondary sources imply the standardization, aggregation, and streamlining of health data. This can be considered an extract-transform-load (ETL) process with its own data quality implications. When discussing data quality dimensions and assessment methods, research should consider these different stages within the data life cycle, a distinction seldom made in the literature. For example, Prang et al [] defined completeness within the context of a registry, which can be regarded as a secondary source. In this context, completeness was defined as the degree to which all potentially registrable data had been registered. The definition of completeness by Bian et al [] pertains to an EHR, which is considered a primary source. Here, the emphasis was on describing the frequencies of data attributes. Both papers emphasized the importance of completeness, but they approached this dimension from different perspectives within the data life cycle.
This fragmented landscape regarding the terminology and definition of data quality dimensions, the lack of distinction between quality in primary and secondary data and in the ETL process, and the lack of consideration for context allow room for interpretation, leading to difficulties in developing assessment methods. In our included articles, only 8 (36%) out of 22 reviews mentioned and defined assessment methods [,,,,,-]. However, the results showed that the described assessment methods are limited by a lack of well-defined and standardized metrics that