This study was guided by the PROGRESS framework (Steyerberg et al. 2013) and reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement (Collins et al. 2015) (Online Resource S1 Checklist). The statistical analysis plan was prospectively registered at ClinicalTrials.gov (registration number: NCT04196634). This study was approved by the Norwegian Centre for Research Data (NSD 861249) and the Regional Committees for Medical and Health Research Ethics in South East Norway (No: 2016/2300), and was conducted in accordance with the Declaration of Helsinki and the General Data Protection Regulation (GDPR). Written consent was obtained from all participants.
Study population and data collection
We developed the prognostic prediction models using data from two sources: a prospective cohort study (Tveter et al. 2020) and a multi-arm randomised controlled trial (RCT) (the MI-NAV trial) (Aanesen et al. 2022). The cohort study, conducted between November 2018 and February 2019, included 549 workers aged 18–67 years on sick leave for at least 4 weeks due to musculoskeletal disorders, as classified by the International Classification of Primary Care (ICPC-2) (WONCA International Classification Committee 1998). Participants were recruited via the Norwegian Labour and Welfare Service’s (NAV) website; individuals who were unemployed or had insufficient Norwegian or English language skills were excluded. The MI-NAV trial, conducted from April 2019 to October 2020, included 514 workers on sick leave for at least 7 weeks due to musculoskeletal disorders, recruited by telephone; individuals with serious health conditions, pregnancy, unemployment, or inadequate language skills were excluded. For the prediction models, we used data from the initial 4 to 12 weeks of each participant's sickness absence; consequently, individuals whose sickness absence spell exceeded 12 weeks at the time of data collection were excluded. External validation was conducted using data from a second RCT (Aasdahl et al. 2018), conducted between August 2017 and October 2022 in a different geographical region of Norway. This trial included 865 participants aged 18–60 years who had been on sick leave for more than 8 weeks. Individuals who were unemployed or pregnant were excluded, and we further excluded those on sick leave for reasons other than musculoskeletal disorders and those whose sickness absence exceeded 12 weeks.
Across these data sources, participants completed a digital questionnaire consisting of validated instruments (Øiestad et al. 2020; Tveter et al. 2020). A small subset of participants was offered a paper-based version where digital access was not feasible. These data were linked to the National Sick Leave Registry, providing a 24-month record of sickness absence for each individual, covering the 12 months before and the 12 months after completion of the baseline questionnaire. These registry data included information on sick leave benefits and certificates, work assessment allowance (WAA), disability pension (DP), and contracted work hours.
Outcomes
To operationalise prolonged sickness absence, we defined three outcomes: (1) more than 90 days of work absence during the first sick leave spell; (2) more than 180 days of work absence during the first sick leave spell; and (3) any new episode of, or increase in, WAA/DP during the 1-year follow-up. A continuous sick leave spell was measured by counting consecutive calendar days absent from work (including part-time sick leave). A spell ended once the participant had returned to at least 80% of their contracted working hours for a minimum of four consecutive weeks. These dichotomous outcomes were defined identically in the development and validation cohorts.
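As an illustration, the sketch below (Stata, with hypothetical variable names not taken from the study) shows how the three dichotomous outcomes could be derived once the registry data have been reduced to one record per participant.

```stata
* A minimal sketch of the outcome definitions. `spell_days` is assumed to
* hold the length (calendar days) of the first sick leave spell, already
* closed once the worker was back at >=80% of contracted hours for four
* consecutive weeks; `waa_dp_new` is assumed to flag any new episode of,
* or increase in, WAA/DP during the 12-month follow-up.
generate byte y_90  = (spell_days > 90)     // outcome 1: >90 days absent
generate byte y_180 = (spell_days > 180)    // outcome 2: >180 days absent
generate byte y_waa = (waa_dp_new == 1)     // outcome 3: new/increased WAA/DP
```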
Model predictors
We developed one multivariable prediction model for each outcome and selected predictors based on a systematic review of the literature and expert opinion, using sociodemographic and clinical data from patient-reported questionnaires. We used the ‘full model approach’ for model development, meaning all pre-selected predictors were retained without exclusion at later stages (Steyerberg 2019). This method is less reliant on data-driven selection and reduces the risk of omitting clinically relevant predictors (Royston et al. 2009; Harrell 2015). Pre-selected predictors were evaluated blinded to the outcome (i.e., no exploratory unadjusted analyses were performed to inform predictor selection). We followed recent sample size recommendations by Riley et al. (2019, 2020) (Online Resource S1 Supporting Information). For external validation, we adhered to the recommendation of a minimum of 100 outcome events when interpreting the results (Collins et al. 2016). Based on our sample size calculation, we limited the number of predictors to 11 for model development, while accounting for a total of 19 predictor parameters. These predictors, including age, education level, return-to-work expectancy, pain intensity, depression/anxiety, general health, fear avoidance, pain catastrophising, workability, and previous sick leave, have robust evidence for predicting sickness absence (Tousignant-Laflamme et al. 2023). All predictor variables were available in the same form and measurement unit in both datasets, except for the fear-avoidance predictor, which differed somewhat between the development and validation samples. The measurement method and variable type of all predictors are presented in Online Resource S1 Table.
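For readers wishing to reproduce a calculation of this kind, the Riley et al. criteria are implemented in the user-written pmsampsize package for Stata. The invocation below is a hypothetical sketch with placeholder inputs, not the values used in this study (those are reported in Online Resource S1 Supporting Information), and option names may vary between package versions.

```stata
* Hypothetical pmsampsize call (ssc install pmsampsize) for a binary
* outcome model with 19 candidate predictor parameters. The anticipated
* Cox-Snell R-squared (0.15) and outcome prevalence (0.30) are
* placeholders only.
pmsampsize, type(b) rsquared(0.15) parameters(19) prevalence(0.30)
```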
Missing data
The proportion of missing data for each predictor is provided in Table 1. No missing data were observed for the three outcomes. As there were very few missing data on predictors in the development sample (1.9%), we conducted complete case analyses for model development. Out of 336 individuals in the validation sample, 91 had all their predictor data missing and were therefore excluded, leaving 244 (72.6%) individuals for the analyses. Of the 244 participants available, 18.4% had missing data on predictor variables, which were handled using multiple imputation by chained equations. We created 40 imputed datasets, and estimates were subsequently pooled across datasets using Rubin’s rules (Marshall et al. 2009) (Online Resource S2 Supporting Information).
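A minimal sketch of this imputation step is shown below, assuming hypothetical variable names (`pain`, `workability`, and `fearavoid` as incomplete predictors; `age` and one outcome as complete variables); the study's actual imputation model is not reproduced here.

```stata
* Multiple imputation by chained equations in the validation sample:
* continuous predictors imputed by linear regression, a binary predictor
* by logistic regression, creating 40 imputed datasets. Models fitted
* afterwards with -mi estimate- are pooled by Rubin's rules automatically.
mi set wide
mi register imputed pain workability fearavoid
mi register regular age y_90
mi impute chained (regress) pain workability (logit) fearavoid ///
    = age y_90, add(40) rseed(12345)
```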
Table 1 Characteristics of the development and validation samples and the proportion of missing data

Statistical analysis

Sample relatedness
To assess the case-mix and relatedness of the development and validation samples, we followed the framework by Debray et al. (2015). Specifically, we fitted a ‘membership model’ using logistic regression to predict the probability of an individual belonging to the development sample rather than the validation sample. Independent variables in this membership model were the outcomes, gender, and all predictors from the prediction models. We used the c-statistic of this membership model to quantify case-mix differences between the samples, where low values (close to 0.5) indicate a similar case-mix and related samples.
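A sketch of this membership model, assuming the two samples have been appended with an indicator `dev` (1 = development, 0 = validation) and using hypothetical predictor names:

```stata
* Membership model (Debray et al. 2015): regress sample membership on
* the outcomes, gender, and all model predictors, then quantify the
* separation between samples with the c-statistic.
logit dev y_90 y_180 y_waa gender age educ rtw_expect pain depr_anx ///
    gen_health fearavoid catastroph workability prev_sick
lroc, nograph    // c-statistic close to 0.5 suggests related samples
```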
Model development
We fitted three multivariable logistic regression models using the pre-selected predictors, without additional statistical selection processes. To account for data sourced from RCTs, intervention beyond usual care (presence or absence) was included as a predictor to mitigate bias from trial effects (Pajouheshnia et al. 2019). Continuous predictors were kept continuous, and non-linearity and functional form were assessed using fractional polynomials (Sauerbrei and Royston 1999); all continuous variables were better modelled with a linear function. Model performance was evaluated using both apparent and optimism-corrected measures. Internal validation involved bootstrap resampling (500 samples) to adjust for overfitting, yielding shrunken regression coefficients and optimism-corrected performance measures (Harrell Jr et al. 1996). Model performance was reported across three dimensions: overall performance, discrimination, and calibration. Overall performance was measured by Nagelkerke’s R2, which estimates the proportion of variance explained by the model, ranging from 0 to 1 (1 = perfect). Model discrimination (i.e., the ability to distinguish between those with positive and negative outcomes) was assessed using the c-statistic, equivalent to the area under the receiver operating characteristic (ROC) curve (AUC), with values ranging from 0.5 (no better than chance) to 1 (perfect discrimination). Model calibration (i.e., how well predicted probabilities agree with observed events) was assessed using a calibration plot, calibration-in-the-large, the calibration slope, and the ratio of expected to observed event probabilities (E/O ratio). The calibration plot shows the observed and predicted probabilities of incident long-term sick leave by deciles of predicted risk, and we applied a locally weighted scatterplot smoother (lowess) curve to show calibration across the entire range of predicted probabilities. Calibration-in-the-large (intercept) is the difference between the mean observed risk and the mean predicted risk and measures systematic overprediction or underprediction, with an ideal value of 0. The calibration slope measures agreement between observed and predicted risk across the whole range of predicted values, with a slope of 1 indicating perfect calibration; a slope > 1 indicates that the risk predictions are too narrow, whereas a slope < 1 indicates that they are too extreme. Lastly, we assessed the E/O ratio to summarise overall calibration, with perfect calibration represented by a value of 1.
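A condensed sketch of these development steps for one outcome is given below (hypothetical variable names). The bootstrap optimism correction itself is automated by the bsvalidation package named further down; the sketch shows only the functional-form check and the apparent model fit.

```stata
* Functional form: the built-in mfp prefix fits fractional polynomials
* for the continuous predictors; in this study a linear term was
* retained for all of them.
mfp: logit y_90 age educ rtw_expect pain depr_anx gen_health ///
    fearavoid catastroph workability prev_sick intervention

* Apparent performance of the final (linear) model
logit y_90 age educ rtw_expect pain depr_anx gen_health ///
    fearavoid catastroph workability prev_sick intervention
lroc, nograph            // apparent c-statistic (AUC)
predict double lp, xb    // linear predictor, re-used at validation
```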
External validation
We externally validated the optimism-adjusted prediction models by applying them to the external validation sample and assessing their performance. Model coefficients were averaged over the imputed datasets using Rubin’s rules (Steyerberg 2019). Performance criteria comprised overall performance, discrimination, and calibration, as described above.
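A sketch of this validation step with placeholder coefficients is shown below; in practice the study's optimism-adjusted coefficients are used and estimates are pooled over the imputed datasets, both omitted here for readability.

```stata
* Apply the development model to the validation data: rebuild the linear
* predictor from the published coefficients, then re-estimate
* discrimination and calibration.
scalar b0     = -1.20    // placeholder intercept
scalar b_age  =  0.02    // placeholder coefficients; in practice all
scalar b_pain =  0.15    // 11 predictors enter the linear predictor
generate double lp = b0 + b_age*age + b_pain*pain
generate double pr = invlogit(lp)

roctab y_90 pr              // c-statistic in the validation sample
logit y_90 lp               // calibration slope (ideal = 1)
logit y_90, offset(lp)      // calibration-in-the-large (ideal = 0)
quietly summarize pr
scalar E = r(mean)          // mean predicted risk
quietly summarize y_90
scalar O = r(mean)          // mean observed risk
display "E/O ratio = " E/O  // overall calibration (ideal = 1)
```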
Clinical usefulness
We used decision curve analysis to evaluate and compare the clinical usefulness of our models in the external validation sample. A decision curve plots the net benefit (analogous to profit) of a prognostic model against a range of threshold probabilities. The net benefit is the difference between the proportion of true positives and the proportion of false positives, the latter weighted by the odds of the selected threshold probability (Vickers and Elkin 2006; Vickers et al. 2019). To assess clinical usefulness, we plotted decision curves for the models and for two default strategies: ‘treating all’ and ‘treating none’. The strategy with the highest net benefit at any given risk threshold is considered to have the most clinical value (Vickers et al. 2019).
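In formula form (Vickers and Elkin 2006), the net benefit at a threshold probability p_t is

```latex
\mathrm{NB}(p_t) \;=\; \frac{\mathrm{TP}(p_t)}{n} \;-\; \frac{\mathrm{FP}(p_t)}{n} \cdot \frac{p_t}{1 - p_t}
```

where TP(p_t) and FP(p_t) are the numbers of true and false positives when everyone with a predicted risk of at least p_t is classified as high risk, and n is the sample size. ‘Treating all’ corresponds to classifying everyone as high risk, while ‘treating none’ has a net benefit of 0 by definition.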
All statistical analyses were conducted in Stata version 17.0 (StataCorp, College Station, TX, USA), using the bsvalidation, mfpmi, pmcalplot, and dca packages. All analyses were supervised by a biostatistician.
Sensitivity analyses
To evaluate the impact of multiple imputation, we repeated the external validation using complete case analyses. To assess the predictive ability of the models in a broader sample, we performed sensitivity analyses on all participants in the external validation cohort who were on sick leave between 4 and 12 weeks at baseline, irrespective of diagnosis. Similar analyses were conducted including only participants with musculoskeletal or mental disorders.