This human subjects research was approved by the Washington State University Institutional (ethics) Review Board (IRB# 17442). All participants provided written informed consent. The study recruited a total of twenty healthy pregnant women (29.9 ± 4.9 years, 66.0 ± 10.5 kg, 166.0 ± 6.7 cm). All participants were at approximately their 13th week of gestation (± 1 week). Recruitment was conducted through flyers distributed during the initial obstetrician visit, and volunteers contacted an enrollment researcher by telephone for screening. Participants were excluded if they had a high-risk pregnancy, were unable to walk unassisted, were cognitively unable to read and understand instructions, or could not commit to longitudinal testing for the entirety of the pregnancy.
Protocol
Participants were enrolled as part of a larger longitudinal (follow-up) study examining fall risk in pregnancy. Each of the follow-up timepoints (n = 5) comprised a wide testing battery and lasted approximately 60 min; accordingly, about 100 h of video data were accumulated. During each testing session, participants wore eye-tracking glasses (Tobii Pro2, Stockholm, https://www.tobii.com), which captured environment/lab video at 24 frames per second (fps) from test start to finish, along with the participant's gaze location in each frame (1920 × 1080 px). The lab comprised two walking tracks/paths within a continuous loop with intentionally placed hurdles and distractors (Fig. 1). Six fully visible white PVC pipe obstacle hurdles were placed 3 m apart along the 12 m long, 0.92 m wide, two-sided black walking path. Some hurdles were always set to 10% of body height, while surrounding hurdles were randomly assigned to 5%, 7.5%, 10%, or 12.5% of the tested participant's body height.
Fig. 1 The walking path route undertaken by all participants during testing. Upon starting, participants were tasked with crossing obstacles at 3 m intervals. Obstacle heights were set at a percentage (%) of the participant's height during each walking trial
Dataset
Video data spanning the full 60 min per participant were used to train the proposed model, because the footage outside the direct 2-min walk tests frequently contained more obscure angles and head positions, for example when a participant stood talking to a researcher between tasks while looking around the lab. This helped the model generalize to more diverse scenarios, such as rare head angles, during the test. A key advantage of data captured within a controlled setting is homogeneity: across all participant videos, variables such as lighting, hurdles, and video quality remained similar. To produce the dataset, frames of the captured footage were extracted and labelled using a Python-based tool [47], with example classes being: hurdle, tennis ball, animate distractor, bucket.
The labelling process resulted in a dataset of 987 labelled frames covering 18 classes. These classes represent a variety of objects or obstacles that are pertinent to determining whether a participant is paying attention (i.e., to obstacles along the path) or distracted. Of these 18 classes, 3 are defined as "core objects", namely tennis ball, support, and hurdle, as these are the direct objects and points of investigation for the task. The labelled information was extracted using the inbuilt functionality of the labelling software. The images folder contained the full-resolution raw images, with accompanying label information stored in the annotations folder in .txt format. Each annotation file contains one line per object detected within the scene, holding the object class id and the object bounding box coordinates in the x mid, y mid, width, height format.
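To illustrate this annotation format, a minimal sketch of reading one such .txt file is given below; the class ids shown in the comment are illustrative only, and coordinates are assumed to be normalized to the image width and height.

```python
from pathlib import Path

def read_annotations(txt_path: str):
    """Return a list of (class_id, x_mid, y_mid, width, height) tuples."""
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        if not line.strip():
            continue
        class_id, x_mid, y_mid, width, height = line.split()
        boxes.append((int(class_id), float(x_mid), float(y_mid),
                      float(width), float(height)))
    return boxes

# Example annotation line: "1 0.512 0.604 0.031 0.027"
# -> class 1 (hypothetical id for tennis ball), centred at ~51%/60% of the
#    frame, occupying ~3% of its width and height.
```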
Object detection model (ODM)
Model implementation was performed using the Python-based deep learning library PyTorch and the Ultralytics suite of available Yolo algorithms. The final object detection algorithm used was the latest YoloV8 network [46]. That version was chosen because it has been shown to produce more accurate results on images and video, at an inference speed within 1.3 ms per image at a 640 × 640 input size (as used in this study), compared to YoloV7 [48]. This architecture (Fig. 2) takes the image as input and feeds it through a series of convolution, pooling, and batch-normalization layers before outputting predicted classes and bounding box coordinates from the extracted features.
Fig. 2 Example backbone architecture of the YoloV8 feature extraction model
The output from the model was further refined using non-maximum suppression, which removes duplicate bounding boxes and reduces detection noise based on intersection over union (IoU) metrics (i.e., the overlap between predicted bounding boxes and ground truth annotations). Given the small number of pixels available to accurately classify important obstacles within the track, the model was trained on images resized to 640 × 640 px to retain ample image information while balancing performance. This requirement was further supported by the distributed focal loss (DFL), a custom loss function used to improve the model's ability to identify small objects within images (Eqs. 1–4); this was a core requirement for the dataset and also helped with class imbalances within the training data.
$$s_{j} = \frac{1}{N_{j}} \mathop{\sum}\limits_{i=1}^{N_{j}} \left[ \sqrt{h_{ij}} \right]$$
(1)
$$w_{j} = \frac{1}{s_{j} + \epsilon}$$
(2)
$$focal\_loss\_scaled_{ij} = focal\_loss_{ij} \times \sqrt{\frac{h_{ij}}{s_{j}}}$$
(3)
$$DFL_{j}\left( p_{j} \right) = - \alpha \left( \frac{1}{s_{j} + \epsilon} \right)\left(1 - p_{j}\right)^{\gamma} \log p_{j}$$
(4)
Equations: DFL loss, where $s_{j}$ is the average object size for the batch with label $j$, $N_{j}$ is the number of anchor boxes in the batch with ground-truth label $j$, $h_{ij}$ is the height of the ground-truth bounding box for anchor box $i$ with label $j$, $\epsilon$ is a small constant to avoid division by zero, $focal\_loss_{ij}$ is the focal loss for anchor box $i$ with label $j$, and $\alpha$ and $\gamma$ are the standard focal-loss weighting and focusing parameters. The DFL loss is computed for each class $j$ separately, and the final DFL loss is the sum of the DFL losses over all classes.
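As a simple illustration of the IoU overlap criterion that the non-maximum suppression step described above relies on, a minimal sketch is given below; the function name and coordinate convention (x1, y1, x2, y2) are ours.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```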
The training process was conducted within a Windows-based Python 3.8 environment, on a system containing an RTX 3070 graphics card, a Ryzen 7 3700X CPU and 24 GB of RAM, and took ~ 3 h over 100 epochs. The dataset was split using a pragmatic 80:20 train-test ratio, and evaluation metrics were output across both training and validation examples: train/box_loss, train/cls_loss, train/dfl_loss, precision, recall, mean average precision (mAP50 and mAP50-95), val/box_loss, val/cls_loss and val/dfl_loss.
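A minimal sketch of this training setup using the Ultralytics API is shown below; the dataset configuration file name (dataset.yaml) and the choice of pretrained checkpoint are illustrative assumptions, not the exact files used in this study.

```python
from ultralytics import YOLO

# Load a pretrained YoloV8 checkpoint (weights file name is illustrative)
model = YOLO("yolov8n.pt")

# Train on the labelled lab dataset; dataset.yaml (hypothetical) lists the
# image/annotation paths and the 18 class names described above.
model.train(data="dataset.yaml", epochs=100, imgsz=640)

# Evaluate on the held-out split, reporting precision, recall and mAP metrics
metrics = model.val()
```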
ODM: eye location
Classification of the objects provided context to the video data but, considered in isolation, provided little meaningful information. To automate the detection of where visual attention lies, a mechanism is required to combine object detections with gaze information (Fig. 3). Within the model, an algorithm was implemented to detect overlaps between the bounding box coordinates, using the x1, y1, x2, y2 format. Algorithm 1 outlines the process: looping over each detected object, the coordinates are passed as arguments to the overlap detection function, which compares them with the stored eye tracker coordinates and returns true if an overlap is detected.
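A minimal sketch of this overlap check is given below, assuming the gaze is a single (x, y) pixel coordinate and boxes are in x1, y1, x2, y2 format; the function and variable names are ours.

```python
def gaze_in_box(gaze_x, gaze_y, box):
    """Return True if the gaze point falls inside a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return x1 <= gaze_x <= x2 and y1 <= gaze_y <= y2

def detect_overlaps(gaze_x, gaze_y, detections):
    """detections: list of (class_name, (x1, y1, x2, y2)) tuples.

    Returns, for each detected object, whether the stored eye-tracker
    coordinates overlap its bounding box.
    """
    return [(name, gaze_in_box(gaze_x, gaze_y, box)) for name, box in detections]
```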
Fig. 3 Flow diagram illustrating the deployment of the proposed AI model and its accompanying mechanics throughout a video
ODM: object row mechanic
Whilst the proposed lab setup mimics that of an optimal walking path for gait assessment [25], it also includes potential distractors and hurdles along the path. The spatial context of these distractors and hurdles is vitally important to include within the model, given their clinical significance and implications for assessing a participant's visuospatial attention and ability to navigate environmental obstacles. To achieve this, detection of what the participant is looking at is performed first, followed by classification of which row the object belongs to, with this context appended to the CSV file (e.g., tennis ball, row 2) upon completion.
Algorithm 1 Algorithm for detecting bounding-box overlap
When navigating the hurdles, a participant encounters up to three sequential hurdles along each track/path, and it is important to understand which row the participant's attention is on. For example, if it is known that the participant is looking at the immediate hurdle, it can be inferred they are paying attention to the hurdle and planning a safe crossing (no contact). This assumption is based on typical gaze behavior observed in most individuals; however, we acknowledge there may be exceptions, particularly among experienced participants or those familiar with the path. It can also be inferred that if the participant is not paying attention to the nearest hurdle before crossing, they are distracted. Across all scenes involving obstacle crossing, the same core objects are present and organized along the walking path into rows (Fig. 4): (i) a set of tennis balls (at each side of the walking track, used by the participant to judge horizontal opening size, defined as the horizontal distance between the two balls), (ii) supports (used to hold up the tennis balls, which can also be used by the participant to judge opening size), and (iii) a hurdle (the obstacle to be navigated by the participant).
Fig. 4 VGG image annotation tool, used to create the segmentation masks
Given the consistent spatial relationship of these objects, the vertical pixel coordinates can be used to cluster the objects into their respective rows (Algorithm 2). The algorithm first sorts the detected tennis ball objects by their y positions. It then iterates through the sorted list, calculating the distance between each consecutive pair of balls. If the distance is less than 50 pixels, the balls are considered to belong to the same row and are added to the current row array. If the distance exceeds 50 pixels, the current row is appended to the rows array, and a new current row is initiated to begin capturing the next row of balls.
Algorithm 2 Algorithm for clustering balls
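A minimal sketch of this clustering step is shown below, assuming each ball is represented by the y pixel coordinate of its bounding-box midpoint and using the 50-pixel threshold described above.

```python
def cluster_balls_into_rows(ball_y_positions, threshold=50):
    """Group detected tennis balls into rows based on vertical (y) proximity."""
    rows = []
    current_row = []
    for y in sorted(ball_y_positions):
        if current_row and (y - current_row[-1]) > threshold:
            # Gap exceeds the threshold: close the current row, start a new one
            rows.append(current_row)
            current_row = []
        current_row.append(y)
    if current_row:
        rows.append(current_row)
    return rows

# Example: three rows of balls at roughly 200, 450 and 700 px
# cluster_balls_into_rows([195, 205, 448, 452, 698, 702])
# -> [[195, 205], [448, 452], [698, 702]]
```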
Once the algorithm for clustering the balls into rows was established, the loop responsible for classifying the row of each object was created. Algorithm 3 assigns each detected object a row by looping over every detected object and determining its midpoint. With each ball clustered into its respective row, the y coordinate of the object's midpoint is compared with the detected row lines, and the row with the least absolute difference is classified as the row of the object.
Algorithm 3 Algorithm for assessing object row
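A minimal sketch of this row-assignment loop is given below, assuming each row is represented by a single y value (e.g., the mean y coordinate of its clustered balls) and each object by the y coordinate of its bounding-box midpoint.

```python
def assign_object_rows(objects, row_lines):
    """objects: list of (class_name, (x1, y1, x2, y2)); row_lines: one y value per row."""
    assignments = []
    for name, (x1, y1, x2, y2) in objects:
        mid_y = (y1 + y2) / 2
        # Row with the smallest absolute vertical distance to the object midpoint
        row_index = min(range(len(row_lines)), key=lambda r: abs(row_lines[r] - mid_y))
        assignments.append((name, row_index + 1))  # rows numbered from 1
    return assignments
```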
Track segmentation model (TSM)
If participants are looking downwards at the track/path ahead of their immediate foot placement, this can provide context (i.e., thinking about foot placement). Determining the exact spatial location of the participant's walking path is more difficult, because a more detailed classification is required compared to general object detection. To address this, a further segmentation model was developed and deployed to provide a pixel-wise segmentation mask for the exact track location, meaning the exact location of the tracks themselves was detected rather than just a general bounding box. To develop this model, the same dataset collection process was used as for the object detection tool: the videos were broken down into component frames to be used as images within the dataset. To create the segmentation masks (black-and-white images containing white pixels only in the regions the AI should detect), the VGG image annotation tool [49] was used (https://www.robots.ox.ac.uk/~vgg/software/via/) (Fig. 4).
Using VGG, a dataset of 388 frames and accompanying binary segmentation masks was created. This dataset was then used to train a U-Net based segmentation network (Fig. 5) with PyTorch, within the same Python 3.8 environment (Ryzen 3700X, 24 GB of RAM, RTX 3070ti) over 100 epochs. Once a binary (white/black) segmentation mask of the track location is obtained, overlap between the eye location and the track mask (where only the track pixels are white) can be identified (Algorithm 4).
Fig. 5 Visualization of the U-Net architecture that depicts how an image passes through the network
Algorithm 4 Algorithm for detecting track overlap
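A minimal sketch of this track-overlap check is given below, assuming the mask is a 2-D array the same size as the video frame, with nonzero (white) pixels marking the track, and the gaze point is given in pixel coordinates.

```python
import numpy as np

def gaze_on_track(gaze_x, gaze_y, track_mask):
    """Return True if the gaze point lands on a white (track) pixel of the mask."""
    h, w = track_mask.shape[:2]
    x, y = int(round(gaze_x)), int(round(gaze_y))
    if not (0 <= x < w and 0 <= y < h):
        return False  # gaze point falls outside the frame
    return track_mask[y, x] > 0  # masks are indexed (row = y, column = x)
```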
TSM: left/right object direction
With a methodology in place to assess an object's row, a methodology for detecting which track (left or right) a set of objects belongs to is required. Understanding which side of the tracks an object belongs to is an important classification for assessing whether or not the participant is distracted (like paying attention to your driving lane versus oncoming traffic on a two-lane road). The track being actively navigated is always on the right from the participant's perspective, meaning any attention paid to objects on the right track is relevant to navigation planning, either immediately or in the near future. Conversely, attention paid to obstacles on the left track indicates a distraction, as when they are visible they are beyond the immediate area of the participant. A further algorithm (Algorithm 5) can be implemented to determine which side an object is on relative to the participant by inferring the mid-point between the different segmented track points.
Algorithm 5 Algorithm for left/right side detection
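A minimal sketch of this left/right classification is shown below, assuming the dividing line is taken as the mean x coordinate of the segmented track pixels and objects are compared against it by their bounding-box midpoints; the function name is ours.

```python
import numpy as np

def classify_object_sides(objects, track_mask):
    """objects: list of (class_name, (x1, y1, x2, y2)); track_mask: binary mask."""
    # Mid-point between the two track sides, inferred from the segmented pixels
    track_columns = np.where(track_mask > 0)[1]
    divide_x = track_columns.mean()
    sides = []
    for name, (x1, y1, x2, y2) in objects:
        mid_x = (x1 + x2) / 2
        sides.append((name, "right" if mid_x >= divide_x else "left"))
    return sides
```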