Assessing clusters of comorbidities in rheumatoid arthritis: a machine learning approach

Study population and design

We used the CorEvitas RA registry to identify a cohort of potentially eligible patients. From this group, patients were required to have at least 6 years of follow-up in the registry between 2011 and 2021, although they could have entered CorEvitas before 2011. The first visit in the CorEvitas RA registry was considered baseline, with follow-up through the last visit in the registry. The full longitudinal dataset was used to assign patients to comorbidity clusters during the first phase of these analyses. For the second phase, the comorbidity clusters were assessed using the first 3 years of consecutive available data, with the next 3 consecutive years used to determine clinical outcomes.

Comorbidities of interest

The comorbidities of interest are collected at baseline and then updated at follow-up visits in CorEvitas; they are summarized in Supplemental Table 1. The list is similar to those reported in prior papers examining frequent comorbidities in RA [2, 10], and this grouping of comorbidities has been found to be associated with relevant clinical outcomes in RA.

Comorbidities are recorded at the time of enrollment in the registry and updated by patients and clinicians at subsequent visits, which typically occur twice per year. Because we focused on chronic comorbidities, comorbidities were treated as accumulating over time. In other words, if one of these chronic comorbidities (e.g., diabetes or coronary artery disease) was reported, it was assumed to be ongoing at all subsequent visits.
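The carry-forward rule described above can be illustrated with a minimal Python sketch. This is not the registry's or the authors' code; the function name `carry_forward`, the field names, and the example visits are all hypothetical, and visits are assumed to be supplied in chronological order.

```python
from typing import Dict, List


def carry_forward(visits: List[Dict[str, bool]]) -> List[Dict[str, bool]]:
    """Return per-visit comorbidity flags where a condition, once reported,
    is treated as present at every later visit.

    Assumes `visits` is in chronological order (hypothetical input layout).
    """
    ever_reported: Dict[str, bool] = {}
    accumulated = []
    for visit in visits:
        for name, present in visit.items():
            # A condition stays True once any earlier visit reported it.
            ever_reported[name] = ever_reported.get(name, False) or bool(present)
        accumulated.append(dict(ever_reported))
    return accumulated


# Hypothetical example: diabetes reported at visit 2 but not re-reported at visit 3.
visits = [
    {"diabetes": False, "coronary_artery_disease": False},
    {"diabetes": True, "coronary_artery_disease": False},
    {"diabetes": False, "coronary_artery_disease": True},
]
accumulated = carry_forward(visits)
```

In this example, diabetes remains flagged at the third visit even though it was not re-reported there.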

Specific questions on comorbidities in CorEvitas changed in 2011. To determine the impact of changes in the collection of comorbidities, a secondary analysis was conducted using only participants who entered in 2011 or after (see Supplemental Table 2). The reporting of comorbidities in this subgroup appeared similar to that in the total cohort. Thus, this sub-analysis was not pursued further.

Outcomes

The first phase of analyses focused on deriving comorbidity clusters using ML algorithms; therefore, the clusters were the outcomes. The second phase focused on whether comorbidity clusters were associated with future clinical outcomes. The clinical outcomes of interest in phase two were the clinical disease activity index (CDAI) and function as measured by the Health Assessment Questionnaire Disability Index (HAQ-DI) [11, 12].

CDAI and HAQ-DI are measured at almost all visits in CorEvitas. CDAI is a continuous scale from 0 to 76 with well-accepted thresholds for different levels of disease activity [11]. CDAI is the sum of four components: patient global arthritis activity (0–10), physician (assessor) global arthritis activity (0–10), tender joint count (0–28), and swollen joint count (0–28). Since we assessed the outcomes during the final 3 years of the study period, the time-averaged CDAI from those years was used as the primary disease activity outcome. The time-averaged CDAI was calculated as a weighted average of the CDAI, using the number of months between visits as the weighting factor. In other words, the CDAI at a given visit was multiplied by the number of months between that visit and the next; the resulting segments (CDAI × months) were added together and then divided by 36 months. A secondary outcome was the change in time-averaged CDAI between the first 3 years and the second 3 years of the study period.
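As an illustration of this calculation, the following Python sketch computes a CDAI from its four components and a time-averaged score over a 36-month window. It rests on assumptions not stated in the text: visits are given as hypothetical (month offset, score) pairs, the first visit falls at the start of the window, and the last visit's score is carried to the end of the window. The function names are illustrative only.

```python
from typing import List, Tuple


def cdai(patient_global: float, physician_global: float,
         tender_joints: int, swollen_joints: int) -> float:
    """Sum of the four CDAI components (range 0 to 76)."""
    return patient_global + physician_global + tender_joints + swollen_joints


def time_averaged(visits: List[Tuple[float, float]],
                  window_months: float = 36.0) -> float:
    """Weighted average of visit scores, weighted by months until the next visit.

    `visits` holds (month_offset, score) pairs measured from the window start.
    Assumption: the last score applies until the end of the window.
    """
    if not visits:
        raise ValueError("at least one visit is required")
    ordered = sorted(visits)
    total = 0.0
    for i, (month, score) in enumerate(ordered):
        next_month = ordered[i + 1][0] if i + 1 < len(ordered) else window_months
        total += score * (next_month - month)  # one CDAI x months segment
    return total / window_months


# Hypothetical patient with visits at months 0, 6, 12, and 24 of the window.
visits = [(0, 20.0), (6, 14.0), (12, 10.0), (24, 8.0)]
avg_cdai = time_averaged(visits)  # (20*6 + 14*6 + 10*12 + 8*12) / 36
```

The same function applies unchanged to HAQ-DI scores, since only the per-visit value differs.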

The HAQ-DI encompasses 20 items across eight domains (dressing and grooming, arising, eating, walking, hygiene, reach, grip, and activities), with each item scored 0–3 based on how much help is required to complete a given task [12]. The average score for each domain is calculated, and then the average across the eight domains is used as a summary. As with the CDAI, a time-averaged HAQ-DI was used to assess outcomes during the final 3 years of the study period, and the change in time-averaged HAQ-DI was considered a secondary outcome.
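A minimal Python sketch of the summary score as described here (mean of items within each domain, then mean across the eight domains) is given below. The domain keys, the function name `haq_di`, and the example item scores are hypothetical, and the sketch follows only the averaging described in this section rather than any other scoring convention.

```python
from statistics import mean
from typing import Dict, List

DOMAINS = (
    "dressing_and_grooming", "arising", "eating", "walking",
    "hygiene", "reach", "grip", "activities",
)


def haq_di(item_scores: Dict[str, List[int]]) -> float:
    """Average the item scores within each domain, then average the domains."""
    domain_means = []
    for domain in DOMAINS:
        items = item_scores.get(domain)
        if not items:
            raise ValueError(f"no item scores for domain: {domain}")
        if any(score < 0 or score > 3 for score in items):
            raise ValueError(f"item scores must be between 0 and 3: {domain}")
        domain_means.append(mean(items))
    return float(mean(domain_means))


# Hypothetical responses for one visit (20 items in total).
example = {
    "dressing_and_grooming": [1, 1],
    "arising": [0, 2],
    "eating": [0, 0, 0],
    "walking": [2, 2],
    "hygiene": [1, 1, 1],
    "reach": [3, 3],
    "grip": [0, 0, 0],
    "activities": [0, 0, 0],
}
score = haq_di(example)
```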

Statistical analyses

We assessed patient characteristics at baseline and at year 3 of follow-up and then examined the comorbidity distribution across the population throughout 6 years of longitudinal follow-up. During the first phase of this work, the results of five different ML algorithms for clustering the patients’ comorbidities over the 6-year period were examined; the ML algorithms included K-modes, K-means, divisive analysis clustering (DIANA), agglomerative nesting clustering (AGNES), and model-based clustering (VarSelLCM) [13, 14]. Three, four, five, and six clusters were each assessed. We chose 5 as the number of clusters for all clustering algorithms based on the “elbow” method from the K-modes clustering [15]. The data were clustered by patient. (For K-means, center = 5; for K-modes, modes = 5; for AGNES and DIANA, the tree was cut at k = 5. For VarSelCluster, we selected all comorbidity variables and chose the highest probability group among 5 groups as the patient’s cluster group.)
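To make the K-modes step and the elbow comparison concrete, here is a small self-contained Python sketch. It is not the implementation used in the study, whose analyses were run in R and SAS: the deterministic initialization (first k distinct rows), the function names, and the toy data are assumptions made for illustration. Each row is one patient's binary comorbidity vector, and the cost is the total Hamming distance from each patient to the mode of the assigned cluster.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]


def hamming(a: Row, b: Row) -> int:
    """Number of positions at which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(rows: Sequence[Row], k: int,
            max_iter: int = 100) -> Tuple[List[int], List[Row], int]:
    """Cluster categorical rows into k groups; return labels, modes, and cost."""
    # Assumed initialization: the first k distinct rows serve as starting modes.
    modes: List[Row] = []
    for row in rows:
        if row not in modes:
            modes.append(row)
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("fewer than k distinct rows")

    labels = [-1] * len(rows)
    for _ in range(max_iter):
        # Assign each row to the nearest mode (ties go to the lowest index).
        new_labels = [min(range(k), key=lambda j: hamming(row, modes[j]))
                      for row in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Update each mode to the most frequent value in every column.
        for j in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == j]
            if members:
                modes[j] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    cost = sum(hamming(row, modes[lab]) for row, lab in zip(rows, labels))
    return labels, modes, cost


# Toy data: columns might stand for four hypothetical comorbidity flags.
rows = [
    (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0),
    (1, 1, 1, 0), (0, 0, 1, 1), (0, 1, 1, 1),
]
labels, modes, cost = k_modes(rows, k=2)
costs = {k: k_modes(rows, k)[2] for k in (1, 2, 3)}  # inputs for an elbow plot
```

Plotting `costs` against k and looking for the point where the decrease flattens is the idea behind the elbow method; in the study this comparison covered three to six clusters.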

In the second phase of this work, we compared the performance of the different clustering algorithms with respect to their association with clinical outcomes. For all ML algorithms, the five-cluster solution was chosen based on statistical methods that look for an inflection point in the sum of squares [13] (see Supplemental Fig. 1). The two clinical outcomes selected were the time-averaged CDAI and time-averaged HAQ-DI. The clusters were defined using data from the first 3 years of follow-up, and the clinical outcomes were defined in the next 3 years.

To understand the value of the different clusters, we compared the model fit for three sets of models. These included the following as independent variables: (a) only demographics and RA variables; (b) demographics, RA variables, and each comorbidity; (c) demographics, RA variables, and the clusters. This was repeated for each of the ML clustering algorithms. Sensitivity analyses considered sex-stratified models, models with only comorbidities recorded since baseline, and the secondary outcome (change in CDAI or HAQ-DI).

R (version 4.3.0) and SAS (version 9.4) statistical computing packages were used for all analyses.
