Phenotype fingerprinting of bipolar disorder prodrome

Dataset

The data source is the U.S. Veterans Affairs (VA) Corporate Data Warehouse (CDW), a large database containing the EHR data of the U.S. Veterans receiving care from the Veterans Health Administration (VHA). VHA is the largest integrated health care system in the U.S., providing outpatient and inpatient facilities across the nation. VHA offers primary and specialty services such as home health, women’s health, and mental health care. The CDW is administered by VA Informatics and Computing Infrastructure (VINCI), who provides a secure environment for researchers to access Veterans health data.

We first identified BD patients with 2 or more International Classification of Diseases (ICD) codes of BD (ICD 9 codes: 296.4X, 296.5X, 296.6X, 296.7, 296.80, and 296.89; ICD 10 codes: F31.X (PsychCentral)) in the CDW database (n = 346,511). We required 2 or more BD diagnosis codes to minimize the chance of misclassification. The dataset was further restricted to patients with (1) first BD diagnosis dated between 01/01/2001 and 09/30/2015; (2) evidence of being in the VA healthcare system ≥ 12 months prior to the first BD diagnosis; and (3) evidence of being in the VA healthcare system ≥ 12 months after the first BD diagnosis. The refined dataset has 207,838 patients. From the refined dataset, we randomly selected 20,000 patients for the prodrome fingerprinting analysis. The cohort selection is shown in Additional file 1:  Appendix I.

Features

We experimented with large number of potential features including hospitalizations, diagnoses, procedures, medications, note types, vital signs, lab results and BD symptoms. In the identification of BD symptoms, we explored both topic modeling (Shao et al. 2015) and keyword extraction. Note types were excluded as they were often correlated with diagnoses. Thus, for the phenotype fingerprinting effort, we focused on hospitalizations, diagnoses, medications, vitals, lab results and BD symptoms. We also leveraged existing terminologies and instruments to group individual data items into groups or themes. Hospitalization: the beginning and end of a hospitalization is defined by the admission and discharge data from the hospitalization table. Diagnosis groups: We used both primary and secondary diagnostic ICD 9 codes. The ICD codes were collapsed to the first level categories (e.g., cardiovascular system disorder, etc.) Given our focus on BD, the mental illness group included one additional level of details (e.g., dementias, alcohol-induced mental disorders, etc.) (Free and ICD-9-CM Medical Coding Reference, http://www.icd9data.com). Current Procedural Terminology (CPT) groups: We grouped the CPT codes into seven categories including anesthesia, surgery, radiology, pathology and laboratory, evaluation and management, medicine, and HCPCS level II (Intro To and Coding: Medical Billing Codig Certification 2017). Lab results: The top 10 most frequent labs were used as the features, which included glucose, hemoglobin, erythrocytes, creatinine, leukocytes, carbon dioxide, potassium, sodium, chloride, and urea nitrogen. For each lab, according to the value distribution among the 20,000 patients, we standardized it into a score in the range 0 to 1, with 0 meaning the population mean and higher score meaning farther away from the population mean. Vital signs: As for the vital signs, we focused on blood pressure, pain score, pulse rate, body temperature, and body mass index (BMI). For each vital sign, we set standardize score as 0 for the value in the normal range, 1 for the extreme value, and unified the score in the range of > 0 and < 1 for other values. BD symptoms: These features were obtained from unstructured text data. Because the documentation of BD symptoms prior to diagnosis are not systematic, our initial unsupervised topic modeling effort yielded only a few topics directly associated with BD. We then identified a set of instruments that measure BD symptoms and compiled a list of symptom keywords. The instruments included depression (HAM-D), mania (YMRS, HAM-D), sleep (MADRS), psychosis (PANSS), drug-induced movement disorders (SAS, BARS, and AIMS), and suicide risk (C-SSRS), etc. (POSITIVE AND NEGATIVE SYNDROME SCALE; Montgomery-Åsberg Depression Rating Scale (MADRS); Young et al. 2000; Hamilton 1960; Barnes 1989; Simpson and Angus 1970; Rush 2000; Posner et al. 2009) The keywords were manually grouped into "BD symptoms" based on how they were labeled in the instruments. The presence of a BD symptom was determined by the presence of a keyword representing the symptom in the notes.

Temporal data representation

To represent the onset and duration of features as well as potential temporal relationships (e.g., co-occurrence), the features were represented in a temporal image. The x axis of image represents the time of the feature being reported, measured in weeks. In this study, we included 1-year data prior to the first diagnosis, so there are 52 weeks prior to the diagnosis. We also included the data during the week of the first diagnosis. In most cases, the Y axis represents the presence or absence of the feature (e.g. diagnosis). Vitals and lab results are represented using grey scale based on the deviation from the normal range (the darker the more extreme abnormality). Figure 1 is an example of temporal data representation from an individual patient. On the x-axis, 0 is the index week of BD diagnosis. Scale with minus sign means the time before the index week. For example, “-50” indicates the week of 50 before the index. On the y-axis, the numbers are the identifier of features with the order as shown in Additional file 1: Appendix II. For example, 1 is the identifier for hospitalization, 30 is ICD diagnosis of Disease of Digestive System (520–579).

Fig. 1figure 1

Temporal data representation from an individual patient

Clustering analysis

We performed K-means clustering analyses of the temporal images. K-means is a simple and fast unsupervised machine learning algorithm that groups a dataset into a pre-defined number (k) of clusters. It requires choosing a distance metric on the dataset: more similar pairs should have smaller distances. We chose Euclidean distance, but it was not applied directly on the temporal images: an operation which we call “temporal blurring” was first applied to the images, and then the Euclidean distances on the “blurred” images were calculated. The temporal blurring was mathematically defined as follows: let \(A\) denote the matrix of the temporal image, then the blurring produces a matrix \(B\) of the same size with

where \(A(i,k)\) is the element of \(A\) in \(i\)th row and \(k\)th column, and similarly \(B(i,j)\) is the element of \(B\) in \(i\)th row and \(j\)th column. The effect of this operation was blurring in the time direction (Fig. 2).

Fig. 2figure 2

The temporal blurring operation

The reason for the temporal blurring was from clinical consideration. With Euclidean distance, two patients who have the same diagnosis at two different times have the same distance as two patients with different diagnoses (whether at the same or different times). However, the former two patients should be more similar clinically than the latter two patients, and the similarity of the former two patients should be higher if the two times are closer. When Euclidean distance was applied to the blurred image instead of the original, the desired similarity can be achieved.

One challenge in K-means analysis is the determination of the appropriate number of clusters (i.e., k). Although there exist many mathematical tools for this problem (Everitt et al. 2011), different tools often produce inconsistent results, and none of them can work universally (Ünlü and Xanthopoulos 2019; Akhanli and Hennig 2020). Therefore, we treated the determination of k more as a clinical problem rather than a pure mathematical problem. As a clinical problem, the clusters need to be clinically reasonable, which can be validated through analysis of the clustering results. However, we still needed some initial guidance from a mathematical tool, for which we chose the elbow method, as it is simple but also widely used. We gradually increased the number of k from 2 to 20 and calculated the sum of squared errors (SSE) for the corresponding clustering result (Fig. 3). The SSE curve did not show a clear "elbow point", but we observed that SSE decreased fast initially (k = 2 to 6), then slower (k = 6 to 10), and eventually steadily in a linear decreasing manner (k = 10 to 20). Therefore, rather than looking for a single "elbow point", we considered the "elbow area", which appeared to be between 6 and 10. Finally, we chose k = 8, but note that this choice was not considered as the only "correct" answer, as the nearby k values may also be good choices. For each cluster, the mean temporal image, whose pixel values were the mean of the corresponding pixels in the original temporal images within the cluster, was used as the phenotype fingerprint of the cluster. The clinical reasonableness of this clustering result was validated in the next step as described below.

Fig. 3figure 3

K-Means analysis to determine appropriate number of clusters

Associated clinical outcomes

We calculated the mortality rate, hospitalization rate, mean number of hospitalizations, mean length of stay and psychosis diagnosis with 1 year after the diagnosis. The mean age and gender distribution of each cluster was calculated. We also collected data on the type of medications received by each cluster. On each of outcomes, statistical testing (ANOVA or Chi-square test) was done for the significance of their differences. Considering the large sample size would cause significant p-value even for tiny differences between groups, we also calculated the absolute standardized difference (ASD) for each of the other clusters compared to Cluster 1 (reference group), for which a value of 10% or above indicated the difference between the two groups (Austin 2009).

Comments (0)

No login
gif