In the field of epidemic intelligence, systematic information analysis is essential for detecting, verifying, evaluating, and investigating health events and risks with the aim of early warning. This process relies on two complementary surveillance strategies: event-based and indicator-based surveillance. Electronic Health Records (EHRs) serve as valuable sources of information for event-based surveillance strategies (World Health Organization, 2014).
The emergence of the COVID-19 pandemic in 2020 presented significant challenges, characterized by uncertainty stemming from the limited understanding of the virus’s transmission dynamics and associated symptoms. In response, national and international organizations developed and updated recommendations and surveillance strategies based on the best available information on the virus. The World Health Organization (WHO) established emergency ICD-10 (International Classification of Diseases, 10th Revision) codes for COVID-19 infection in February 2020, when cases were already documented on four continents. The ICD-10 system standardizes diagnostic codes globally.
Identifying diagnoses from EHRs usually involves manually reviewing the EHR texts and assigning standardized codes to them. Implementing a new ICD-10 code during the pandemic resulted in a lack of standardized coding for some time, exacerbating the complexities of identifying and analyzing epidemiological trends. This situation potentially delayed crucial public health interventions and resource allocation. During the COVID-19 pandemic, healthcare providers faced challenges in accurately documenting and reporting cases, particularly in resource-constrained settings where PCR and antigen tests were scarce due to cost and limited availability of reagents. This scarcity disproportionately affected low-resource countries compared to their high-resource counterparts, highlighting significant disparities in access to diagnostic tools. Importantly, all these considerations remain relevant after COVID-19, as two new epidemics have emerged since then: monkeypox and Oropouche fever.
In response to the pressing need for timely surveillance at that time, we pursued the automatic classification of unstructured data from electronic health records from La Rioja, Argentina, into five categories: no-COVID, COVID-confirmed, COVID-suspected, COVID-rejected, and insufficient information. Automating this process through the use of natural language processing (NLP) techniques offers several advantages, including the ability to generate more timely statistics than indicator-based surveillance, thus complementing data provided by the National Health Surveillance System (SNVS 2.0 from its initials in Spanish). Automation also frees skilled personnel from manual classification for other critical tasks, and early detection could potentially anticipate a pandemic not yet classified in the ICD-10 system.
There are several challenges in applying NLP to EHRs, including dataset imbalance; the scarcity of publicly available corpora, particularly in Spanish (mainly due to the sensitivity of medical information and the need to safeguard patient privacy); and the abundance and ambiguity of abbreviations in clinical notes (Chen et al., 2021; Aliabadi et al., 2020; Miotto et al., 2018; Wu et al., 2022; Hossain et al., 2023). An additional challenge is assembling an interdisciplinary team to address this problem (Miotto et al., 2018). Although interdisciplinary collaboration offers several advantages, achieving a common understanding among different areas of knowledge requires considerable effort.
Previous studies have shown that simple, interpretable approaches with rapid deployment capabilities can yield promising results for this task, although such insights are predominantly available for clinical notes in English (Sheikhalishahi et al., 2019; Chen et al., 2021; Meystre et al., 2022). Furthermore, in many cases, comprehensive descriptions of the methodologies employed and their performance metrics are lacking. Some works also discuss the importance of automating ICD-10 coding (Li et al., 2018; Nigam, 2016). To bridge these gaps, this work offers a detailed description and evaluation of machine-learning solutions that classify clinical notes into COVID-related categories, with the goal of providing decision-makers in Spanish-speaking contexts with vital information.
Given the time and resource constraints, particularly prevalent in low- and middle-income countries, as well as the urgency of a pandemic, our approach prioritized rapid and cost-effective methodologies. We focused on factors such as the availability of pre-trained models and the cost associated with training them, including both human resources and electricity consumption (de Vries, 2023; Samsi et al., 2023). Additionally, the absence of high-cost resources such as GPUs (Graphics Processing Units) further constrained our approach. Our study evaluates a range of algorithms, including Naïve Bayes; Logistic Regression; Gradient Boosting Classifier (XGBoost); a bidirectional recurrent neural network (BiLSTM); and two transformer-based approaches: BETO Clínico and RoBERTa Clínico, which we developed by further pre-training existing models with an unannotated dataset derived from our corpus. For training and evaluating our algorithms, we required annotated data. The limited availability of annotated resources, the shortage of specialized annotators, and the minimal resources allocated for annotation posed additional challenges.
Given these challenges, our study aims to assess whether accurate and actionable classification results can be achieved at a lower cost than with state-of-the-art approaches. Our findings reveal promising results achieved within a short timeframe. Our best-performing model was BETO Clínico, a state-of-the-art transformer model based on BETO (Cañete et al., 2023). BETO is an adaptation to Spanish of BERT, a pre-trained deep learning model that captures the context of words in a sentence (Devlin, 2018). Notably, however, the simpler and more resource-efficient Logistic Regression performed comparably to that state-of-the-art approach. Both methods correlated well with the case confirmations for the province under study, provided by the National Health Surveillance System, and with ICD-10 codes, when available. Furthermore, Logistic Regression offers the advantage of being more accessible to a wider pool of personnel, facilitating implementation in resource-constrained settings.
Furthermore, our results show that, for our problem, the automatic coding of EHRs into ICD-10 codes, a labor-intensive task, could be achieved by annotating only a portion of the complete dataset and then training a simple machine learning model to classify the rest. However, an expert should still review the classification results afterwards.
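The annotate-a-portion, classify-the-rest workflow described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the notes are invented toy examples (not drawn from the La Rioja corpus), whitespace tokenization and Laplace smoothing are assumptions, and a small multinomial Naïve Bayes written with the Python standard library stands in for the evaluated models. The names `ANNOTATED`, `UNLABELED`, `train`, and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy notes (assumption: NOT from the study's corpus), labeled with
# the five categories named in the text.
ANNOTATED = [
    ("paciente con fiebre y tos hisopado positivo covid", "COVID-confirmed"),
    ("pcr positivo para covid aislamiento", "COVID-confirmed"),
    ("fiebre tos contacto estrecho se solicita hisopado", "COVID-suspected"),
    ("sospecha de covid se solicita pcr", "COVID-suspected"),
    ("hisopado negativo se descarta covid", "COVID-rejected"),
    ("pcr negativo covid descartado", "COVID-rejected"),
    ("control de hipertension arterial", "no-COVID"),
    ("dolor lumbar se indica reposo", "no-COVID"),
    ("consulta", "insufficient information"),
    ("control", "insufficient information"),
]

# The remaining, unannotated portion that the trained model will classify.
UNLABELED = ["pcr positivo covid", "dolor lumbar reposo"]


def tokenize(note):
    """Lowercase and split on whitespace (a deliberately naive assumption)."""
    return note.lower().split()


def train(annotated, alpha=1.0):
    """Collect per-class document and token counts for multinomial Naive Bayes."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for note, label in annotated:
        tokens = tokenize(note)
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"doc_counts": doc_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, note):
    """Return the class with the highest log-posterior under Laplace smoothing."""
    n_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n_label in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / n_docs)  # log prior
        for token in tokenize(note):
            if token in model["vocab"]:  # tokens unseen in training are skipped
                score += math.log((counts[token] + alpha)
                                  / (total + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(ANNOTATED)
# Predictions for the unannotated portion; an expert should still review them.
PREDICTIONS = [(note, predict(MODEL, note)) for note in UNLABELED]
```

In practice the annotated portion would be far larger than this toy set, and part of it would be held out to estimate performance before the predictions are passed to an expert for review.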
This study is significant because it addresses the urgent need for low-cost, efficient machine learning (ML) methods that can operate effectively in resource-constrained environments, such as those found in many low- and middle-income countries. This need is especially pressing in light of the recent emergence of new epidemics, and the study provides solutions to enhance decision-making in these contexts. It also focuses on Spanish EHRs, which are underrepresented in biomedical NLP. By guiding improvements in pandemic preparedness and response across global settings, we aim to enhance the utility and accessibility of automated surveillance solutions within the broader context of epidemic intelligence.