Embedding routine health care data in clinical trials: with great power comes great responsibility

Data protection, privacy, informed consent

Embedding RHCD in RCTs has great potential, as explained in the previous section. However, ‘with great power comes great responsibility’. Privacy and confidentiality are core principles of a safe patient-physician relationship. Electronic health records have made sensitive medical information relatively easily accessible, but that accessibility extends to malicious parties as well. Data protection, therefore, is more relevant than ever. Databases must comply with international frameworks for information security. In 2018, a new European law on privacy, the General Data Protection Regulation (Dutch: Algemene Verordening Gegevensbescherming), came into force. This law acknowledges that secondary analyses of RHCD are of interest to the public but, at the same time, it is very strict on which data are allowed to be processed. The collection and use of personal data are only acceptable when patient data are used for monitoring quality of care. However, if data are used for medical research, a stricter legal regime must be followed [13]. The law does not consider pseudonymised data to be anonymous, and in these cases informed consent is required; a consent waiver may only apply in exceptional cases. However, the standard method to obtain informed consent via an opt-in procedure, whereby each person is explicitly asked for permission in advance, is considered highly impractical by the research community, and therefore a public debate is currently taking place.

Fortunately, the law provides some leeway for informed consent via the opt-out procedure, whereby a person is informed that their data may also be used for (observational) medical research and is reminded of their right to revoke their consent. A few RCTs have compared the traditional opt-in to the opt-out approach in order to obtain informed consent for (research) registries [14]. As expected, the participation rate using the opt-out method was much higher (96% vs 21%). More importantly, the population in the opt-out group was more representative [15]. An additional survey confirmed that patients and caregivers support the opt-out approach and prefer it over the opt-in method [16]. Also, the (research) registries that obtain informed consent by the opt-out approach are generally of higher quality than their counterparts [17]. It therefore seems reasonable to use health records for observational medical research, provided the public is informed and offered the choice of opting out, in order to comply with the current privacy legislation. For clarification, if patients are identified by RHCD, informed consent by the opt-in approach remains the preferred method before randomisation into an RCT.

Finally, there is the issue of access to the data. Public trust in medical research must always be upheld. As a general rule, it is currently not possible to directly invite potentially eligible patients to participate in a study if there is no previous treatment relationship. To overcome this barrier, it would be prudent if a dedicated, independent medical ethics committee could consider applications for waiving consent in order to invite people to participate. Such a committee would have to carefully weigh the benefits against other ethical values such as autonomy, fidelity and justice [18].

Data linkage: technical and practical issues

RHCD cannot be used for RCTs if the variables of interest are not routinely evaluated. For example, quality-of-life questionnaires are not structurally assessed, which may hamper cost-effectiveness analysis. On the other hand, linkage allows several different sources of data (preferably discrete data instead of free text) to be combined, enriching the data with multiple facets and making RHCD use in RCTs a very powerful approach. Ideally, a unique identifier (e.g. social security number) is used in combination with several strong identifiers (e.g. name, date of birth). Also, linkage of different systems may be necessary to accomplish nationwide coverage.
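The linkage rule described above, a unique identifier confirmed by strong identifiers, can be sketched as follows. This is a minimal illustration only: the `Record` fields, the `normalise` helper and the `link` function are hypothetical names chosen for the example, and real registries would typically need probabilistic matching and far more careful handling of duplicates and formatting differences.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    unique_id: Optional[str]  # hypothetical field, e.g. a social security number
    name: str
    date_of_birth: str        # assumed ISO format, e.g. "1950-03-14"


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(text.lower().split())


def is_same_person(a: Record, b: Record) -> bool:
    """Accept a link only if the unique identifier and both strong identifiers agree."""
    if a.unique_id is None or b.unique_id is None:
        return False  # no unique identifier: leave for manual or probabilistic review
    return (
        a.unique_id == b.unique_id
        and normalise(a.name) == normalise(b.name)
        and a.date_of_birth == b.date_of_birth
    )


def link(registry_a: List[Record], registry_b: List[Record]) -> List[Tuple[Record, Record]]:
    """Return pairs of records from two sources that refer to the same person.

    Assumes each unique identifier occurs at most once in registry_b.
    """
    index = {r.unique_id: r for r in registry_b if r.unique_id is not None}
    pairs = []
    for rec in registry_a:
        candidate = index.get(rec.unique_id)
        if candidate is not None and is_same_person(rec, candidate):
            pairs.append((rec, candidate))
    return pairs
```

Requiring the strong identifiers to agree as well trades some missed links for fewer false links, which is usually the safer error when trial outcomes depend on the linked record.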

Nonetheless, different data formats and inconsistencies within datasets may severely complicate linkage [19]. In addition, data are owned and managed by different parties who may not be inclined to collaborate. Finally, because resources are limited, data linkage often simply has low priority. However, at a time of crisis, and with the coordinating help of the Health Data Research Hub for Clinical Trials, RECOVERY was able to link 25 different registries in record time [20]. This demonstrates that ‘where there’s a will, there’s a way’. There are also successful examples of registry linkage in the Netherlands, despite its decentralised organisation [21, 22].

Is the quality of RHCD sufficient for use in RCTs?

It is fair to question the quality of RHCD since, by definition, the data are not collected with the specifics of clinical trials in mind. Depending on the source of RHCD, missing data, misclassification, underreporting and overreporting are possible causes of reduced accuracy. However, if errors are random and datasets sufficiently large, RHCD are surprisingly robust against bias, thanks to the ‘magic of randomisation’ [1].
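The reasoning behind this robustness can be illustrated with a small calculation. The sketch below assumes the simplest case only: events are missed (underreported) with a fixed capture probability per arm, no false events are recorded, and sampling variability is ignored. The function name and all numbers are hypothetical and not taken from any trial.

```python
def observed_risk_ratio(risk_control: float, risk_treated: float,
                        capture_control: float, capture_treated: float) -> float:
    """Expected risk ratio when each true event is recorded with a given probability.

    Assumption: only underreporting occurs (no falsely recorded events).
    """
    observed_control = risk_control * capture_control
    observed_treated = risk_treated * capture_treated
    if observed_control == 0:
        raise ValueError("no observed events in the control arm")
    return observed_treated / observed_control


# Hypothetical true risks: 10% in the control arm, 8% in the treated arm (true ratio 0.8).
# Same capture in both arms (random, non-differential underreporting):
equal_capture = observed_risk_ratio(0.10, 0.08, 0.7, 0.7)

# Capture differs between arms (differential underreporting):
unequal_capture = observed_risk_ratio(0.10, 0.08, 0.7, 0.5)
```

With equal capture in both arms the capture probability cancels and the expected ratio stays at the true value; only the number of observed events, and hence the precision, falls. When capture differs between arms the ratio is distorted, which is why the errors need to be random with respect to treatment allocation.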

In ASCEND, outcome measures from UK HES were validated against trial-specific, adjudicated data [22]. The primary outcome was a composite of different severe cardiovascular events (i.e. non-fatal myocardial infarction, ischaemic stroke, transient ischaemic attack, vascular cardiovascular death, excluding haemorrhagic stroke). Overall, there was underreporting of events (1009 vs 1401). Nonetheless, the investigators observed strong agreement between the two methods of follow-up (kappa: 0.78; a kappa value > 0.75 represents excellent agreement), and consequently the rate ratios for the aspirin-randomised comparison did not differ between the two methods. Interestingly, the extent of agreement was very high among the different components of the primary endpoint (kappa values varied between 0.73 and 0.94), except for transient ischaemic attacks (kappa: 0.43). From a clinical perspective, transient ischaemic attacks are relatively poorly defined, which may explain the modest agreement.
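For readers less familiar with the kappa statistic, the following minimal sketch shows how Cohen's kappa is computed from a two-by-two table that compares two methods of follow-up per participant. The function name and the counts in the usage line are hypothetical and are not the ASCEND data.

```python
def cohens_kappa(both_yes: int, only_a: int, only_b: int, both_no: int) -> float:
    """Cohen's kappa for two binary classifications of the same participants.

    both_yes: event recorded by method A and method B
    only_a:   event recorded by method A only
    only_b:   event recorded by method B only
    both_no:  event recorded by neither method
    """
    total = both_yes + only_a + only_b + both_no
    if total == 0:
        raise ValueError("empty table")
    observed = (both_yes + both_no) / total          # observed agreement
    a_yes = (both_yes + only_a) / total              # proportion positive by A
    b_yes = (both_yes + only_b) / total              # proportion positive by B
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # chance agreement
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical counts: observed agreement 0.7, chance agreement 0.5, kappa about 0.4.
example_kappa = cohens_kappa(both_yes=20, only_a=5, only_b=10, both_no=15)
```

Because kappa corrects for agreement expected by chance, it can be modest even when raw agreement looks high, particularly for rare events.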

Heart failure is also a clinical syndrome where variable clinical presentation may undermine consistent classification. In addition, classification is further complicated by the changing nomenclature over time. Blecker et al. evaluated different automated algorithms to identify heart failure cases in a large local dataset containing almost 50,000 hospitalisations [23]. When relying on structured data, the positive predictive value for identifying heart failure events was high (96%) but sensitivity was poor (40%). However, when machine learning techniques and natural language processing were used to analyse data, sensitivity more than doubled (83%) whereas the positive predictive value remained acceptable (90%). The findings demonstrate that advanced algorithms could significantly increase the usefulness of RHCD, but are probably more difficult to implement at a national level.
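The two metrics reported above can be computed from a simple cross-tabulation of algorithm output against a reference standard. The sketch below is illustrative only; the function names and the counts are hypothetical and are not taken from the study by Blecker et al.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Share of cases flagged by the algorithm that are true cases."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("the algorithm flagged no cases")
    return true_positives / flagged


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of true cases that the algorithm managed to flag."""
    actual = true_positives + false_negatives
    if actual == 0:
        raise ValueError("the reference standard contains no cases")
    return true_positives / actual


# Hypothetical example: 100 true cases, of which the algorithm flags 40,
# plus 2 flagged hospitalisations that are not true cases.
ppv = positive_predictive_value(true_positives=40, false_positives=2)
sens = sensitivity(true_positives=40, false_negatives=60)
```

In this made-up example almost every flagged case is genuine, yet more than half of the true cases are missed, which mirrors the pattern described for structured data alone.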

In summary, the quality of RHCD is sufficient to capture clinical events, noting that external validation is necessary before use. Underreporting by RHCD is observed, particularly when the clinical event of interest cannot be clearly defined. (Very) large datasets can compensate for underreporting and randomisation makes them almost impervious to bias.

Consequence of time lag

RHCD are usually significantly time-lagged, and this is especially true for secondary care records such as health insurance claims, where coding is typically done by non-clinicians weeks after the event. This may not be a major issue for interim analyses performed by a data monitoring committee; however, for pharmacovigilance or reporting on intervention or device-related adverse events, RHCD will most likely not suffice.

Nevertheless, there are exceptional cases where RHCD did suffice for rapid safety reporting. One example is the Salford Lung Study, in which the efficacy and safety of fluticasone/vilanterol inhalation was evaluated against standard of care in a primary care setting [24]. Patients were continuously monitored via real-time data collection from general practices and hospitals, among other systems. A safety alerting and reporting system was established based on serious adverse events that were initially flagged in the electronic health record. A flag would then prompt rapid evaluation and, if appropriate, safety reporting.

Unless real-time data collection and processing is available (or specifically organised for the study), the time lag of RHCD should be acknowledged. For surveillance, methods other than RHCD are required for rapid safety reporting (e.g. 24‑h telephone service), especially for trials with longer follow-up.
