Deduplicating the FDA adverse event reporting system with a novel application of network-based grouping

Spontaneous reporting systems like the Food and Drug Administration's (FDA) Adverse Event Reporting System (FAERS) are critical in pharmacovigilance, allowing for ongoing safety monitoring that can identify even very rare potential drug safety concerns [1], [2], [3], [4]. The FDA monitors FAERS data in multiple ways, but two of the most important are case series analysis and data mining. Case series analysis involves an in-depth review of a particular product, sometimes including all reports (in the case of a scheduled review for a recently approved product), and often focused on a specific adverse event of interest [5], [6], [7]. These reviews typically involve considerable manual work and the application of clinical knowledge by trained pharmacovigilance safety reviewers. Data mining, on the other hand, is based on large-scale monitoring of the entirety of FAERS data: observed-to-expected ratios are calculated for each drug-event combination, comparing its reporting experience with the background reporting experience across a selected set of drugs and events [8]. In both settings, duplicate reports describing the same patient experience can seriously affect the review process. Duplicates within a case series create additional work for the safety reviewer, who must identify and account for them. Data mining calculations based on duplicate reports can misrepresent the count of true patient cases, potentially making a drug-adverse event association look stronger or weaker than it actually is [9], [10], [11], [12].
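
To make the observed-to-expected idea concrete, the sketch below computes a simple relative reporting ratio for each drug-event combination, with the expected count derived under the assumption that drugs and events are reported independently. This is an illustrative simplification rather than the proprietary method used in production data mining (which typically applies Bayesian shrinkage on top of such ratios); the function name and input format are hypothetical.

```python
from collections import Counter

def observed_to_expected(reports):
    """Relative reporting ratio for each drug-event combination.

    `reports` is a list of (drug, event) pairs, one per report.
    Expected counts assume drug and event report independently:
    E[n_ij] = (n_i. * n_.j) / n, so RR_ij = n_ij / E[n_ij].
    """
    observed = Counter(reports)                   # n_ij: reports pairing drug i with event j
    drug_totals = Counter(d for d, _ in reports)  # n_i.: reports mentioning drug i
    event_totals = Counter(e for _, e in reports) # n_.j: reports mentioning event j
    n = len(reports)                              # n..: total reports
    return {
        (drug, event): count / (drug_totals[drug] * event_totals[event] / n)
        for (drug, event), count in observed.items()
    }
```

A cluster of duplicate reports inflates the observed count n_ij directly while barely moving the expected count, which is exactly how duplicates distort these calculations.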

One of the data mining tools used at the FDA is Empirica Signal™, which includes a proprietary deduplication algorithm. The algorithm appears designed to minimize false positive duplicate detections (i.e., it has high precision), and it applies a crude rule-based approach that examines only a fixed set of structured fields within each report. Further, the duplicate determinations made by the algorithm are not readily available to safety reviewers working on case series or other analyses, so reviewers lack automated support for identifying or viewing duplicate reports during their work. Other approaches to deduplicating FAERS reports have been proposed but have focused on publicly released data that do not include the free-text report narratives. Recent efforts by Hung et al. [11] and Han et al. [13] have highlighted how citations provided in the Literature Reference field can be used to identify likely duplicates, while Khaleel et al. [14] used purely structured data fields, some with high levels of missingness.

The size of FAERS exceeds that of many other surveillance systems: it contains close to 29 million reports and receives over two million reports of suspected adverse events per year. The reports comprise several structured data fields and lengthy narrative descriptions, with current reports following the ICH E2B(R3) format [15]. Human experts, and the case series-based deduplication algorithm, rely on both structured fields and narrative text when determining report duplicate status [16], [17]. In theory, a complete deduplication approach for FAERS has to consider that every report could be a duplicate of any other report; with 29 million FAERS reports, that would mean over 400 trillion pairings to evaluate. Making this many direct comparisons while including narrative text is computationally prohibitive and impractical [16]. Crucially, the volume of reports also poses problems for any chain-based grouping strategy in which groups merge based on pairwise duplicate determinations, even when not all members of the merged group have been found to be duplicates of each other. To prevent the formation of unrealistically large groups, we applied a community detection approach (from the network analysis field, roughly following the design of Ebeid et al. [18]) to split some groups that did not represent truly densely connected sets of duplicate reports. This approach satisfies the FDA's need for a high-performing and transparent solution that can process all historical FAERS report data as well as the newly incoming daily FAERS reports (on average, 8,000 per day).
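
As a concrete illustration of the grouping step, the sketch below first forms chain-based groups as the connected components of a graph whose edges are pairwise duplicate determinations, then applies Louvain community detection (via networkx) to split components that are not densely connected. Louvain is one common choice from the network analysis literature and is used here only for illustration; the specific algorithm, resolution parameter, and function names below are assumptions, not the exact configuration described in this work.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def duplicate_groups(report_ids, duplicate_pairs, resolution=1.0):
    """Group reports from pairwise duplicate determinations, splitting
    over-merged chain-based groups with community detection.

    Without the splitting step, transitive chaining (A~B and B~C imply
    the group {A, B, C}) lets a few spurious pairwise calls bridge
    unrelated clusters into one unrealistically large group.
    """
    g = nx.Graph()
    g.add_nodes_from(report_ids)
    g.add_edges_from(duplicate_pairs)

    groups = []
    for component in nx.connected_components(g):
        sub = g.subgraph(component).copy()
        # Louvain keeps densely connected sets of duplicates together
        # while cutting the sparse bridges between them.
        groups.extend(louvain_communities(sub, resolution=resolution, seed=0))
    return groups
```

Applied to a component consisting of two cliques of duplicate reports joined by a single spurious edge, this sketch returns the two cliques as separate groups rather than one merged group.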
