Endotracheal intubation (ETI) is an emergency procedure performed in civilian
and combat casualty care settings to establish an airway. Objective and
automated assessment of ETI skills is essential for the training and
certification of healthcare providers. However, the current approach is based
on manual feedback by an expert, which is subjective, time- and
resource-intensive, and prone to poor inter-rater reliability and halo
effects. This work proposes a framework to evaluate ETI skills using single and
multi-view videos. The framework consists of two stages. First, a 2D
convolutional autoencoder (AE) and a pre-trained self-supervision network
extract features from videos. Second, a 1D convolutional network enhanced with a
cross-view attention module takes the features from the AE as input and outputs
predictions for skill evaluation. The ETI datasets were collected in two
phases. In the first phase, ETI was performed by two subject cohorts: Experts
and Novices. In the second phase, novice subjects performed ETI under time
pressure, and each trial was labeled as either Successful or Unsuccessful. A third dataset
of videos from a single head-mounted camera for Experts and Novices is also
analyzed. The framework achieved 100% accuracy in classifying Expert/Novice
trials in the first phase. In the second phase, the model achieved 85% accuracy
in classifying Successful/Unsuccessful procedures. Using head-mounted cameras
alone, the model reached 96% accuracy on Expert/Novice classification while
maintaining 85% accuracy on Successful/Unsuccessful classification. In
addition, Grad-CAM visualizations are presented to explain the differences
between Expert and Novice behavior and Successful and Unsuccessful trials. The
approach offers a reliable and objective method for automated assessment of ETI
skills.
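
To make the two-stage design concrete, the sketch below outlines stage 2 in PyTorch: 1D convolutions over per-frame features produced by stage 1, fused across camera views with a cross-view attention module. The feature dimension, kernel sizes, attention head count, residual fusion, and mean pooling are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Hypothetical cross-view attention: each view's temporal features
    attend over the concatenated features of the other views."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, views):
        # views: list of (batch, time, dim) tensors, one per camera view
        if len(views) == 1:  # single-view input: no cross-view context
            return views
        fused = []
        for i, query in enumerate(views):
            context = torch.cat([v for j, v in enumerate(views) if j != i], dim=1)
            out, _ = self.attn(query, context, context)
            fused.append(out + query)  # residual fusion (assumption)
        return fused

class SkillClassifier(nn.Module):
    """Stage 2 sketch: 1D convolutions over per-frame stage-1 features,
    cross-view attention, then pooling and a linear classification head."""
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.xview = CrossViewAttention(dim=128)
        self.head = nn.Linear(128, num_classes)

    def forward(self, views):
        # views: list of (batch, time, feat_dim) feature sequences from stage 1
        h = [self.conv(v.transpose(1, 2)).transpose(1, 2) for v in views]
        h = self.xview(h)
        # mean-pool over time, then average across views (assumption)
        pooled = torch.stack([v.mean(dim=1) for v in h]).mean(dim=0)
        return self.head(pooled)  # logits: Expert/Novice or Successful/Unsuccessful

# Example: three camera views, 120 frames each, 256-d per-frame features
views = [torch.randn(2, 120, 256) for _ in range(3)]
logits = SkillClassifier()(views)  # shape: (2, 2)
```

The same module handles the head-mounted-camera setting by passing a single-element view list, in which case the cross-view attention reduces to an identity pass-through in this sketch.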