Nox BodySleep 2.0 (NBS2) is an AI algorithm designed to determine sleep states and detect arousals using abdominal and thoracic RIP signals. The input signals are processed by a neural network, which assigns probabilities to each sleep state as well as to the likelihood of an arousal. An extensive dataset was used to assess the algorithm's efficacy, with careful attention to ensure that the subjects and data reflect the algorithm's intended application in standard clinical settings.
NBS2 is designed to aid in the scoring of home sleep tests that include no electrophysiological signals (e.g., EEG, EOG, and EMG). However, to rigorously evaluate NBS2, it was necessary to compare it with "gold standard" sleep and arousal scoring, which requires in-laboratory PSG recordings with high-quality electrophysiological signals. Accordingly, in the current study, PSG recordings were manually scored using all available signals, while only RIP data were input to NBS2 to provide AI scoring for comparison (see Table 1). Of note, since our goal was to discriminate between three distinct states, namely Wake, REM, and NREM, the gold standard labels for N1, N2, and N3 were combined into a single NREM state.
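The label merging described above can be sketched as a simple lookup. This is a minimal illustration only: the stage codes (`"W"`, `"N1"`, `"R"`, and so on) and the function name `collapse_stages` are hypothetical and are not taken from the study's software.

```python
# Hypothetical stage codes; the text only states that N1, N2, and N3
# are merged into a single NREM state.
STAGE_TO_STATE = {
    "W": "Wake",
    "N1": "NREM",
    "N2": "NREM",
    "N3": "NREM",
    "R": "REM",
}

def collapse_stages(stages):
    """Map per-epoch manual stage labels to the three-state scheme."""
    return [STAGE_TO_STATE[s] for s in stages]

print(collapse_stages(["W", "N1", "N2", "N3", "R"]))
```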
Table 1 Comparison of signals recorded during in-laboratory polysomnography (PSG) and home sleep tests, along with the required inputs for the current algorithm under evaluation
Algorithm development
The NBS2 algorithm was developed using a dataset of PSG recordings from clinical and research settings across seven sites in Europe, Asia, and the United States. In total, 3,185 recordings were used: 2,216 for training, 482 for tuning, and 487 for testing. Note that the results presented in this study are based on an entirely separate dataset, unrelated to those used for algorithm training and development.
Training was performed using cross-entropy loss, optimized with the AdamW method as implemented in TensorFlow. Overfitting was monitored, and two regularization techniques were applied: spatial dropout and batch normalization. Data augmentation consisted of scaling the input signals and introducing random temporal shifts. Training samples were selected so that each had a 50% probability of containing a scored arousal.
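The augmentation and sampling steps can be illustrated with a short sketch in plain Python. The scale range, the maximum shift, the circular-shift implementation, and the function names are all assumptions made for illustration; the text does not give these details. In a real pipeline the shift would more likely be applied by re-cropping the window from the recording, with the labels shifted accordingly.

```python
import random

def augment(window, rng, scale_range=(0.8, 1.2), max_shift=25):
    """Scale a signal window and shift it in time by a random number of samples.

    scale_range and max_shift are assumed values, not taken from the paper.
    """
    scale = rng.uniform(*scale_range)
    shift = rng.randint(-max_shift, max_shift)
    n = len(window)
    # A circular shift keeps the window length fixed (an illustrative choice).
    return [scale * window[(i - shift) % n] for i in range(n)]

def pick_training_sample(with_arousal, without_arousal, rng, p_arousal=0.5):
    """Draw a sample that contains a scored arousal with probability p_arousal."""
    pool = with_arousal if rng.random() < p_arousal else without_arousal
    return rng.choice(pool)

rng = random.Random(0)
print(augment([1.0, 2.0, 3.0, 4.0], rng, max_shift=2))
```

Sampling from two pools in this way balances the training batches even when arousals are rare in the underlying recordings.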
Participants
A large set of in-lab PSG recordings was used to comparatively evaluate the algorithm’s performance. The PSG studies were drawn from a dataset provided by FusionSleep clinics in Atlanta, Georgia, USA, between 2020 and 2023 (inclusive). This dataset is representative of the algorithm’s intended use in standard clinical practice in sleep clinics and hospitals, ensuring that the comparative results accurately reflect its real-world performance. Data were recorded using the Nox A1 sleep recorder (Nox Medical, Iceland). All recordings underwent standardized quality control and were scored by qualified sleep technologists according to the AASM manual for scoring sleep and associated events [2]. Finally, the scoring was reviewed and approved by a medical doctor.
The PSG recordings used for evaluation were selected based on predefined inclusion criteria, requiring that each study be manually scored, conducted during nighttime, and include an adult participant (≥ 18 years old). Additionally, both RIP belts had to be connected for at least half of the recorded sleep time. Studies not meeting these criteria were excluded from the analysis. No information was available on subjects’ medication use or preexisting health conditions for this study.
Algorithm description
NBS2 requires only the abdomen and thorax RIP belt inductance signals as input. These signals are sampled at 25 Hz with a bandwidth of 0–12.5 Hz, measuring the loop inductance of the wire in each belt encircling the patient. A nonlinear adaptive filter is used to remove the baseline, gradually suppressing frequencies below 0.1 Hz while adapting to larger changes, such as patient movement. Nine consecutive signal epochs (where each epoch is a 30 s temporal window) are employed to process each target epoch, with four epochs preceding and four epochs following the target, as depicted in Fig. 1. For each target epoch, there are two outputs: 1) The sleep state in each epoch is derived according to the highest probability, followed by a holistic regression analysis that refines it, i.e., smooths consecutive sleep state outputs, which is a common regularization technique for time-series data [22]; 2) The algorithm computes the probability of an arousal event for each second within the target epoch. An arousal is subsequently identified if at least three consecutive seconds exceed a predefined threshold, in which case an arousal is scored along with its duration (see Fig. 2).
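As a worked illustration of the windowing arithmetic, nine 30 s epochs at 25 Hz give 9 × 30 × 25 = 6,750 samples per belt, i.e., 4.5 min of signal. The sketch below computes the sample range for a given target epoch. The function name and the choice to return `None` near the start and end of a recording are assumptions; the text does not describe how edge epochs are handled.

```python
FS_HZ = 25          # sampling rate stated in the text
EPOCH_S = 30        # epoch length stated in the text
CONTEXT_EPOCHS = 4  # epochs on each side of the target epoch

def context_window(target_epoch, n_epochs):
    """Return (start, stop) sample indices of the nine-epoch window.

    Returns None when the target lacks four epochs of context on either
    side (assumed behaviour; edge handling is not described in the source).
    """
    first = target_epoch - CONTEXT_EPOCHS
    last = target_epoch + CONTEXT_EPOCHS
    if first < 0 or last >= n_epochs:
        return None
    samples_per_epoch = FS_HZ * EPOCH_S
    return first * samples_per_epoch, (last + 1) * samples_per_epoch

print(context_window(4, 100))  # (0, 6750)
```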
Fig. 1 Overview of the Nox BodySleep 2.0 (NBS2) algorithm. The input consists of nine consecutive 30 s respiratory inductance plethysmography (RIP) signal epochs. NBS2 predicts the probability of sleep states (Wake, NREM, REM) and arousal (1 s resolution) for the center epoch. Sleep state is scored by selecting the state with the highest probability. An arousal is scored using a rule-based post-processing step, which requires that the arousal probability exceeds a specific threshold for at least 3 s. The output probabilities shown in the figure are illustrative and do not represent actual algorithm outputs
Fig. 2 The algorithm analyses 4.5 min data segments at a time. Subplot A shows breathing movements recorded by two respiratory inductance plethysmography (RIP) belts placed around the abdomen and chest, which serve as inputs to the algorithm. Subplots B and C display the neural-network outputs: arousal probabilities and sleep state probabilities. Arousals are marked with red squares when the arousal probability exceeds a threshold for at least 3 s. The sleep state prediction in subplot C is presented as a hypnodensity graph [53], predicting the likelihood of each sleep state (Wake, NREM, or REM) across epochs. The depicted window of data captures a transition from REM sleep, characterized by reduced chest wall movement, to NREM sleep, where chest movement increases. This transition is associated with arousals and a higher probability of wakefulness
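The rule-based arousal post-processing can be sketched as a run-length check on the per-second probabilities. The threshold value of 0.5 and the function name are assumptions for illustration; the text states only that a predefined threshold must be exceeded for at least three consecutive seconds.

```python
def score_arousals(probs, threshold=0.5, min_seconds=3):
    """Return (start_second, duration_seconds) for each run of per-second
    arousal probabilities above threshold lasting at least min_seconds.

    The threshold of 0.5 is an assumed value, not taken from the paper.
    """
    events = []
    run_start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for t, p in enumerate(list(probs) + [float("-inf")]):
        if p > threshold:
            if run_start is None:
                run_start = t
        elif run_start is not None:
            if t - run_start >= min_seconds:
                events.append((run_start, t - run_start))
            run_start = None
    return events

print(score_arousals([0.1, 0.9, 0.9, 0.2, 0.8, 0.8, 0.8, 0.8, 0.1]))  # [(4, 4)]
```

In the example, the two-second run starting at second 1 is discarded, while the four-second run starting at second 4 is scored as one arousal with a duration of 4 s.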
The neural network architecture is based on temporal convolutional networks (TCN), following the suggestions made by S. Bai et al. [23] to improve the performance of sequence modelling tasks. The architecture combines residual connections, regularization, and activation functions to form a chain of residual blocks [24,25,26]. A schematic of the model architecture is available in the supplementary materials (Figures S1-S2).
TCNs possess several features well suited to the task at hand. They efficiently handle long-range dependencies using dilated convolutions, effectively capturing the necessary context surrounding the window of interest. Furthermore, their ability to extract time-invariant features is particularly important for precisely localizing sleep-related events like arousals. Using a multi-task learning approach, we train a single model to handle both arousal detection and sleep state classification, enabling shared feature learning with separate outputs for each task (see Figure S2 in the supplementary materials).
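To illustrate why dilated convolutions handle long context efficiently, the sketch below computes the receptive field of a generic stack of residual blocks with stride 1, where each convolution with kernel size k and dilation d adds (k − 1)·d samples. The kernel size, the doubling dilation schedule, and two convolutions per block are assumptions typical of TCNs in general; they are not the NBS2 architecture, which is given in Figures S1-S2 of the supplementary materials.

```python
def receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field (in samples) of stacked dilated convolutions, stride 1."""
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

def blocks_needed(target_samples, kernel_size=3):
    """Smallest number of blocks with dilations 1, 2, 4, ... whose
    receptive field covers target_samples (illustrative schedule only)."""
    n = 1
    while receptive_field(kernel_size, [2 ** i for i in range(n)]) < target_samples:
        n += 1
    return n

print(receptive_field(3, [1, 2, 4, 8]))  # 61
print(blocks_needed(6750))               # 11
```

Under these assumptions, the receptive field grows exponentially with depth, so 11 blocks would suffice to span a 6,750-sample window without any downsampling.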
Statistical analysis
To evaluate the NBS2 algorithm's performance, we use both epoch-level agreement metrics and patient-level measures. Sensitivity, specificity, accuracy, and F1 score are calculated to assess epoch-level agreement between NBS2 outputs and manual scoring. Additionally, Cohen’s Kappa is computed to quantify agreement, corrected for chance. Epochs containing manually scored signal artifacts are excluded from analysis. For patient-level measures, specifically TST and ArI, agreement is evaluated using Bland–Altman analysis and intraclass correlation coefficients (ICC).
Epoch-level sensitivity, specificity, accuracy, and F1 score are defined as follows: sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), accuracy = (TP + TN)/(TP + TN + FP + FN), and F1 score = 2 TP/(2 TP + FP + FN), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. All four metrics are calculated with 95% confidence intervals. The sleep state metrics are calculated for each class using a one-vs-rest method, while for arousals, each epoch is labeled by the presence or absence of an arousal in the NBS2 output and in the manual scoring. Arousal detection performance is also analyzed separately for REM and NREM sleep states. Epoch-level measures of arousal detection are used because they allow comparison with other medical devices through their FDA 510(k) summaries [27,28,29].
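The epoch-level definitions above can be written out as a short sketch, together with the standard formula for Cohen's Kappa, κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function names and the toy labels are illustrative only, and zero denominators are not handled.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and F1 score as defined in the text."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

manual = ["Wake", "NREM", "NREM", "REM", "NREM", "Wake"]  # toy data
auto = ["Wake", "NREM", "REM", "REM", "NREM", "NREM"]     # toy data
print(binary_metrics(*one_vs_rest_counts(manual, auto, "NREM")))
print(cohens_kappa(manual, auto))
```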
For patient-level measures, Bland–Altman analysis is used to calculate mean difference (bias) and limits of agreement (LoA), both with 95% confidence intervals. Similarly, ICC is computed, with corresponding 95% confidence intervals. Additionally, the algorithm's arousal scoring performance is evaluated across obstructive sleep apnea (OSA) severity levels, specifically examining ArI bias, LoA and ICC within defined apnea–hypopnea index (AHI) groups: normal (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), and severe (AHI ≥ 30).
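A minimal sketch of the patient-level agreement calculation follows. It uses the conventional Bland–Altman definition of the limits of agreement as bias ± 1.96 standard deviations of the paired differences; the text does not state the exact formula, and the direction of the difference (NBS2 minus manual) is likewise an assumption. The AHI thresholds are those given above, and the function names and toy values are illustrative.

```python
import statistics

def bland_altman(auto, manual):
    """Return (bias, lower LoA, upper LoA) for paired per-patient values.

    Differences are taken as auto minus manual (assumed direction), and
    LoA as bias +/- 1.96 SD (conventional definition).
    """
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def ahi_group(ahi):
    """Assign an OSA severity group from the apnea-hypopnea index."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

print(bland_altman([10.0, 12.0, 14.0], [9.0, 12.0, 13.0]))  # toy ArI values
print([ahi_group(x) for x in (4.9, 5, 15, 30)])
```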
In a supplementary analysis, all epoch- and patient-level measures are provided for each OSA severity group, as well as for periodic limb movement of sleep index (PLMSI) subgroups (PLMSI < 15 and PLMSI ≥ 15).
All confidence intervals are determined via bootstrapping with 10,000 iterations.
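The bootstrap procedure can be sketched as follows. The percentile method is assumed here, since the text does not specify the bootstrap variant or the resampling unit (e.g., patients or epochs); the function name, seed, and toy values are illustrative.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_iter=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values).

    Resamples `values` with replacement n_iter times. The percentile
    method is an assumption; the paper states only the iteration count.
    """
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_iter))
    lo_idx = round((alpha / 2) * n_iter)
    hi_idx = round((1 - alpha / 2) * n_iter) - 1
    return stats[lo_idx], stats[hi_idx]

per_patient_diffs = [1.2, -0.4, 0.8, 2.1, -1.0, 0.3, 0.9, 1.5]  # toy data
print(bootstrap_ci(per_patient_diffs))
```

The same function can wrap any of the reported statistics (for example, the Bland–Altman bias or an epoch-level metric) by passing a different `stat` callable.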