
Development of Machine Learning Copilot to Assist Novices in Learning Flexible Laryngoscopy

by Mattea E. Miller, MD, Dan Witte, MSIS, Ioan Lina, MD, Jonathan Walsh, MD, Anaïs Rameau, MD, MPhil, MS, FACS, and Nasir I. Bhatti, MBBS, MD • November 4, 2025


INTRODUCTION


It has been well documented that most otolaryngologists practice in densely populated metropolitan areas. Two-thirds of U.S. counties lack a practicing otolaryngology-head and neck surgery (OHNS) specialist, and the density of otolaryngologists per county is greatest in counties in the highest education and income quartiles. Using the Centers for Medicare and Medicaid Services (CMS) Provider Utilization and Payment Data Physician and Other Supplier PUF dataset, Davis et al. (J Voice. doi.org/10.1016/j.jvoice.2021.05.002) observed a direct association between the number of Medicare enrollees and access to otolaryngologists able to perform flexible fiberoptic laryngoscopy (FFL), based on billing codes, further demonstrating the disproportionate clustering of otolaryngologists in urban areas, consistent with prior studies. The lack of access to OHNS care in low-income rural areas further exacerbates healthcare disparities across the U.S. Telemedicine has been proposed as a way to alleviate some of this socioeconomic inequity, but it has several limitations within OHNS, chief among them the inability to perform FFL remotely as part of a physical examination.

FFL is one of the most common procedures performed by otolaryngologists and is an early core competency that OHNS residents are expected to acquire. It allows detailed examination of the sinonasal cavity, nasopharynx, pharynx, and larynx, and is crucial in the evaluation of patients with voice, swallowing, and airway symptoms, patients needing airway management such as awake fiberoptic nasotracheal intubation, and those with head and neck cancer. Several steps comprise a successful FFL, including identification of key structures, navigation of the patient’s anatomy, patient comfort and coaching, and visual capture of each anatomical area of interest; these vary in difficulty depending on patient anatomy, pathology, and learner experience. Recently, interest has grown within other specialties in training residents and advanced practice providers (APPs) to perform FFL. For instance, Price et al. (Int J Radiat Oncol Biol Phys. doi.org/10.1016/j.ijrobp.2020.05.009) demonstrated that a simulation-based training (SBT) workshop on FFL for radiation oncology residents was feasible and increased confidence in head and neck anatomy and FFL procedural skills.

Previous work has shown that a learning curve exists for novices learning to become competent in performing FFL using a manikin model, with a mean of six attempts required to achieve competency based on a validated checklist. This learning curve has historically been addressed by having residents perform the procedure on patients needing evaluation; however, this approach risks discomfort, nasal bleeding, and mucosal injury to the patient. These risks are greatest before competence is achieved by the learner. SBT has been proposed as a tool for novices to achieve competence in a variety of otolaryngologic procedures, and both trainers and trainees have ranked FFL as a skill where SBT would be highly useful.

Artificial intelligence (AI) applications are rapidly being investigated within the field of OHNS, and AI-based simulation tools may be able to bridge the gap between learning to perform FFL during SBT and performing FFL on a patient. Here, we describe the development and prospective pilot testing of a machine learning (ML) software tool, “Copilot,” that uses a pretrained convolutional neural network for image processing of diagnostic laryngoscopy to help train novice medical students to competently perform FFL on a manikin and improve their uptake of FFL skills.

METHODS

This study was approved by the Johns Hopkins Institutional Review Board: IRB00343343.

OBJECTIVE

In defining the requirements of the AI Copilot, a team of two experienced otolaryngologists determined that, when performed on a simplified model (AirSim Combo Bronchi X manikin, TruCorp Ltd., United Kingdom; Fig. 1), a basic FFL procedure consisted of:

  • Entering the nasal cavity and navigating the nasal passage to reach the nasopharynx;
  • Visualizing the soft palate;
  • Visualizing the epiglottis and vallecula;
  • Visualizing the vocal folds; and
  • Withdrawing the scope.

A computer scientist then translated these high-level requirements into the specific capabilities the AI Copilot would need:
  • Identifying the optimal scoping path in the nasal passage;
  • Identifying the scope’s location;
  • Highlighting key anatomical structures; and
  • Providing real-time feedback and navigational cues.

AI Copilot Architecture

To develop the AI Copilot, we used supervised machine learning, in which neural networks learn to predict output labels from human-labeled data. The AI Copilot consisted of two key machine learning components:

1. An image classifier model dubbed the “anatomical region classifier,” responsible for predicting the location of a camera in the upper airway.

2. An object detection model dubbed the “anatomical structure detector,” responsible for locating and identifying key anatomical structures in images.

The outputs of these models were filtered and time-averaged to reduce noise in the system. These outputs were then incorporated as inputs into the logic of a larger system that tracked the state of the procedure and the camera location. Based on these inputs and the state of the system, instructions and cues were provided to the user via an overlay on the video feed.
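The filtering and time-averaging step can be sketched as a majority vote over a sliding window of recent frame predictions. This is an illustrative reconstruction, not the authors' exact implementation; the class names and window size are assumptions:

```python
from collections import Counter, deque

class RegionSmoother:
    """Majority-vote smoothing over the last `window` frame predictions.

    Reduces flicker when the per-frame classifier briefly mispredicts,
    e.g., a single whiteout frame in a run of nasal-cavity frames.
    """
    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)

    def update(self, predicted_region: str) -> str:
        self.history.append(predicted_region)
        # Return the most common region seen in the current window.
        return Counter(self.history).most_common(1)[0][0]

smoother = RegionSmoother(window=5)
for frame_pred in ["nasal_cavity", "nasal_cavity", "whiteout",
                   "nasal_cavity", "nasopharynx"]:
    stable = smoother.update(frame_pred)
# The lone whiteout frame is voted out; `stable` is "nasal_cavity".
```

A real system would likely also weight by model confidence, but a fixed-window vote already suppresses single-frame misclassifications.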

To run the AI Copilot, the live video feed from an Ambu aView 2 Advance was transferred to a computer via an HDMI connection and a video capture card. Running a local FastAPI web server, the computer read in the video feed, processed it, and sent the modified feed, along with metadata, to a web browser through a webhook. This new video feed was displayed on the computer monitor, allowing the machine learning copilot to provide real-time cues to the user, overlaid on top of the raw FFL video.
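One common way to push an annotated video feed to a browser, as described above, is an MJPEG-style multipart stream, which FastAPI can serve via a StreamingResponse. The sketch below shows only the frame-packaging generator; the `annotate` callable and boundary name are illustrative, not taken from the authors' code:

```python
from typing import Callable, Iterator

def mjpeg_stream(frames: Iterator[bytes],
                 annotate: Callable[[bytes], bytes]) -> Iterator[bytes]:
    """Wrap each JPEG frame in multipart headers.

    A FastAPI StreamingResponse with media_type
    "multipart/x-mixed-replace; boundary=frame" can serve this generator,
    and the browser renders it as live video.
    """
    for jpeg in frames:
        jpeg = annotate(jpeg)  # e.g., draw navigational cues on the frame
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpeg + b"\r\n")

# Demo with fake frame payloads and a pass-through annotator.
chunks = list(mjpeg_stream(iter([b"\xff\xd8fake1", b"\xff\xd8fake2"]),
                           annotate=lambda f: f))
```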

Data Collection

Figure 1. Setup for pilot testing of the machine learning copilot. Panel A: Ambu aView 2 Advance; Panel B: Ambu aScope 4 RhinoLaryngo Slim; Panel C: AirSim Combo Bronchi X.

The training data were collected by performing FFL on an AirSim Combo Bronchi X manikin (TruCorp Ltd., United Kingdom) using an Ambu aScope 4 RhinoLaryngo Slim connected to an Ambu aView 2 Advance displaying unit (Ambu A/S, Ballerup, Denmark) (Fig. 1).

Data for training both models were selected from a dataset generated by performing 20 flexible laryngoscopies on the manikin. All videos were recorded by a single experienced otolaryngologist using an Ambu aScope 4 RhinoLaryngo Slim (Fig. 1). Varying levels of expertise were simulated during recording to ensure that a variety of angles and views were captured. The AI Copilot was trained on the left nasal cavity only, to keep the machine learning consistent and to prevent confusion or misclassification of structures based on sidedness.

Anatomical Region Classifier

The anatomical region classifier was used to identify the physical location of the camera at the time of image capture. The classes used were Larynx, Nasal Cavity, Nasopharynx, Oropharynx, Out of Body, and Whiteout. Videos were split into image frames, and each image was assigned to a class by a trained graduate student. The classes were defined in the manikin model as follows:

  • Out of body: Any image from outside of the manikin.
  • Nasal cavity: The scope had entered the manikin but had not yet moved past the posterior nasal septum.
  • Nasopharynx: Past the distal-most end of the septum but not yet past the clear delineation in the material where the velum would be in a human.
  • Oropharynx: Past the velum delineation but not yet past the superior border of the epiglottis.
  • Larynx: Past the superior edge of the epiglottis but not yet through the vocal folds to the trachea.
  • Whiteout: The scope was so close to the mucosa that at least 80% of the image was red or whited out, and a human observer could not reasonably determine where the scope was located without information from previous frames.
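Whiteout frames were labeled by a human annotator. For intuition only, a crude automated version of the "80% red or whited out" criterion might look like the following; the pixel thresholds are invented for illustration:

```python
def looks_like_whiteout(pixels, threshold=0.8):
    """pixels: iterable of (r, g, b) tuples in 0-255.

    A pixel counts as 'obscured' when it is strongly red-dominant
    (mucosal contact) or near-white (lens whiteout). All cutoffs here
    are hypothetical; the study relied on human labeling.
    """
    obscured = total = 0
    for r, g, b in pixels:
        total += 1
        if (r > 150 and r > 1.5 * g and r > 1.5 * b) or min(r, g, b) > 220:
            obscured += 1
    return total > 0 and obscured / total >= threshold
```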

In total, there were 56,262 images in the anatomical region dataset. They were broken down into 42,971 training images, 3,596 validation images, and 9,695 testing images. To prevent data leakage due to the high correlation between consecutive frames, each video and all its images were assigned exclusively to a single split: train, validation, or test.
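Assigning every frame of a given video to exactly one split can be sketched as a group-wise split keyed on video ID. Function and parameter names here are mine; the study's exact proportions follow from the image counts reported above:

```python
import random

def split_by_video(frame_labels, train=0.8, val=0.06, seed=42):
    """frame_labels: list of (video_id, frame) pairs.

    All frames from one video land in exactly one split, so highly
    correlated consecutive frames cannot leak between train and test.
    """
    videos = sorted({v for v, _ in frame_labels})
    rng = random.Random(seed)
    rng.shuffle(videos)
    n_train = round(len(videos) * train)
    n_val = round(len(videos) * val)
    train_v = set(videos[:n_train])
    val_v = set(videos[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for v, f in frame_labels:
        key = "train" if v in train_v else "val" if v in val_v else "test"
        splits[key].append((v, f))
    return splits
```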

The region classifier used a ResNet-18 convolutional neural network pretrained on ImageNet and fine-tuned on the new anatomical region classification dataset from the manikin. The final linear layer of the ResNet-18 was replaced with a new linear layer with six outputs, one per anatomical region. Models were trained using a hyperparameter sweep over the number of frozen layers, learning rate, image transforms, and batch size. A confusion matrix was generated to characterize the classifier’s errors.

Anatomical Structure Detector

The anatomical structure object detector was used to identify key anatomical structures in the manikin and place a bounding box around each. The structures labeled were the inferior turbinate, middle turbinate, uvula, vallecula, epiglottis, and vocal folds. In addition, the desired scoping path through the nasal cavity was labeled as two separate classes: one for the path leading up to the middle turbinate (Path 1) and one for the path after passing the proximal end of the inferior turbinate (Path 2). Images were labeled with bounding boxes by trained graduate students using Label Studio, a data annotation tool. Some structures lacked clear boundaries, which led to noisy bounding box labels.

The anatomical structure detector used a YOLOv7 model that was fine-tuned on a dataset made of 11,337 images from a subset of 16 videos and evaluated against 3,096 images from four videos. No validation set was used due to the time-intensive nature of labeling the data. Models were trained using a hyperparameter sweep of image size, learning rate, image transforms, and batch size. Because multiple structures could be identified within a single frame, predicting the maximum likelihood of a class was not viable when integrating the anatomical structure detector into the AI Copilot. Instead, each class had its own confidence threshold that was hand-tuned based on human judgment after interacting with the complete system.
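Because several structures can appear in one frame, detections are filtered per class rather than by a single argmax. A minimal sketch of that filtering, with invented threshold values (the study hand-tuned its thresholds by human judgment):

```python
# Hypothetical per-class confidence thresholds; the actual values were
# hand-tuned after interacting with the complete system.
CLASS_THRESHOLDS = {
    "inferior_turbinate": 0.35, "middle_turbinate": 0.35,
    "uvula": 0.45, "vallecula": 0.40, "epiglottis": 0.55,
    "vocal_folds": 0.60, "path_1": 0.30, "path_2": 0.30,
}

def filter_detections(detections, thresholds=CLASS_THRESHOLDS):
    """detections: list of (class_name, confidence, bbox) tuples.

    Keep every detection that clears its class-specific threshold, so
    multiple structures can survive in a single frame; a global argmax
    would keep only the single most confident one.
    """
    return [d for d in detections if d[1] >= thresholds.get(d[0], 0.5)]
```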

The performance of the model was measured using mean average precision (mAP). Mean average precision is a standard metric used in object detection and has the benefit of balancing precision and recall. An intersection-over-union threshold of 0.5 was used to calculate the mAP because the ground truth bounding box labels for some classes were noisy.
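The intersection-over-union underlying the mAP@0.5 metric is the overlap area of a predicted and a ground-truth box divided by the area of their union; a minimal implementation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    A prediction counts as a true positive for mAP@0.5 when its IoU with
    a same-class ground-truth box is at least 0.5.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A looser IoU threshold, as used here, forgives small localization errors, which is appropriate when the ground-truth boxes themselves are noisy.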

The AI Copilot was pilot tested prospectively by having 64 medical students naïve to FFL use it to perform FFL on the AirSim Combo Bronchi X manikin (Fig. 1). Anonymous surveys were distributed to the students after they had performed FFL, asking them to rate the ease of using the machine learning Copilot during FFL and to self-rate their FFL skills with and without the Copilot, both on a 5-point Likert scale. Descriptive statistics were used to analyze responses on ease of use and on subjective skill before and after use of the tool. This was a proof-of-concept study to test the feasibility of the AI Copilot; the authors plan a formal study evaluating its impact on novice learners.

RESULTS

Anatomical Region Classifier

The best-performing model had six frozen layers and was trained with a batch size of 128 for four epochs at a learning rate of 0.0002. This model achieved an overall accuracy of 91.9% on the validation set and 80.1% on the test set. The confusion matrix showed that the whiteout class was most frequently misclassified as the nasal cavity, which is where most of the whiteouts in the training data occurred. Anatomical regions were most likely to be misclassified as an adjacent region, because the transition from one region to the next is not discrete.
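The adjacent-region confusions described above can be read directly off a confusion matrix; a minimal stdlib version (the example labels are illustrative):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Map each (true_label, predicted_label) pair to its count."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: one whiteout frame mistaken for the nasal cavity.
cm = confusion_matrix(
    ["whiteout", "whiteout", "nasal_cavity", "larynx"],
    ["nasal_cavity", "whiteout", "nasal_cavity", "larynx"])
```

Off-diagonal entries concentrated on neighboring regions (or on whiteout vs. nasal cavity) reproduce the error pattern the authors report.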

Anatomical Structure Detector

The best-performing model was trained for 80 epochs with a batch size of 128 and a learning rate of 0.001, achieving an overall mean average precision of 0.642. The mAP varied considerably across classes. Structures with strong delineations, such as the vocal folds and epiglottis, likely had more consistent, less noisy bounding box labels and were therefore easier for the model to learn.

AI Copilot Computer Performance

Initially, the AI Copilot was designed to run on a computer with a graphics processing unit (GPU), but we felt this limited the potential environments in which it might eventually be used. To make the software more broadly usable, we focused on making it run on a MacBook Pro M1. Through various optimizations, we were able to run the AI Copilot at approximately 28 frames per second (FPS), which is perceptually indistinguishable from real time and nearly matches the 30-FPS frame rate of the video feed.
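Throughput figures like the 28 FPS reported here can be measured with a simple timing harness around the per-frame processing callable (names are mine, not the authors'):

```python
import time

def measure_fps(process_frame, frames, warmup=5):
    """Average frames-per-second of `process_frame` over `frames`.

    A short warmup is excluded from timing, since first calls often pay
    one-time costs (caching, JIT, model load) that distort the average.
    """
    for f in frames[:warmup]:
        process_frame(f)
    start = time.perf_counter()
    for f in frames[warmup:]:
        process_frame(f)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed if elapsed > 0 else float("inf")

# Demo with a trivial stand-in for the real inference pipeline.
fps = measure_fps(lambda f: f, list(range(55)))
```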

Ninety point nine percent of medical students naïve to FFL strongly agreed or agreed that the AI Copilot was easy to use. Medical students’ self-rating of FFL skills after using the AI Copilot, however, was equivocal overall compared to their self-rating without the Copilot.

CONCLUSIONS

We described the development and pilot testing of the first AI Copilot to help train novices to competently perform FFL on a manikin. The AI Copilot tracked successful capture of diagnosable views of key anatomical structures, effectively guiding users through FFL to ensure that all anatomical structures were sufficiently captured. This tool has the potential to help novices gain competence in FFL efficiently.

Filed Under: How I Do It, Tech Talk | Tagged With: machine learning in laryngology | Issue: November 2025



ENTtoday is a publication of The Triological Society.


Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial technologies or similar technologies. ISSN 1559-4939