Automated pharyngeal phase detection and bolus localization in videofluoroscopic swallowing study: Killing two birds with one stone?
The videofluoroscopic swallowing study (VFSS) is a gold-standard imaging
technique for assessing swallowing, but the analysis and rating of VFSS
recordings are time-consuming and require specialized training and expertise. Researchers
have recently demonstrated that it is possible to automatically detect the
pharyngeal phase of swallowing and to localize the bolus in VFSS recordings via
computer vision, fostering the development of novel techniques for automatic
VFSS analysis. However, training of algorithms to perform these tasks requires
large amounts of annotated data that are seldom available. We demonstrate that
the challenges of pharyngeal phase detection and bolus localization can be
solved together using a single approach. We propose a deep-learning framework
that jointly tackles pharyngeal phase detection and bolus localization in a
weakly-supervised manner, requiring only the initial and final frames of the
pharyngeal phase as ground truth annotations for the training. Our approach
stems from the observation that bolus presence in the pharynx is the most
prominent visual feature upon which to infer whether individual VFSS frames
belong to the pharyngeal phase. We conducted extensive experiments with
multiple convolutional neural networks (CNNs) on a dataset of 1245 bolus-level
clips from 59 healthy subjects. We demonstrated that the pharyngeal phase can
be detected with an F1-score higher than 0.9. Moreover, by processing the class
activation maps of the CNNs, we were able to localize the bolus with promising
results, obtaining correlations with ground-truth trajectories higher than 0.9
without using any manual annotations of bolus location during training.
Once validated on a larger sample of participants with swallowing disorders,
our framework will pave the way for the development of intelligent tools for
VFSS analysis to support clinicians in swallowing assessment.
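As an illustration of the class-activation-map idea this abstract relies on, a CAM is a classifier-weight-weighted sum of the final convolutional feature maps, and its peak gives a crude location estimate. The sketch below is a minimal NumPy stand-in for the concept, not the authors' implementation; all function names and shapes are assumptions.

```python
import numpy as np

def class_activation_map(features, weights, class_idx):
    """CAM for one class: weighted sum of the final conv feature maps.

    features: (C, H, W) activations from the last conv layer.
    weights:  (num_classes, C) weights of the linear classifier.
    Returns an (H, W) map normalized to [0, 1].
    """
    cam = np.tensordot(weights[class_idx], features, axes=1)  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

def localize_peak(cam):
    """(row, col) of the strongest activation -- a crude position estimate."""
    return np.unravel_index(np.argmax(cam), cam.shape)
```

Tracking the peak of such maps across frames yields a trajectory that can then be correlated with ground-truth bolus positions, as the abstract reports.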
Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation
This paper presents a deep learning framework for medical video segmentation.
Convolutional neural network (CNN) and transformer-based methods have achieved
great milestones in medical image segmentation tasks owing to their strong
semantic feature encoding and global information comprehension abilities.
However, most existing approaches ignore a salient aspect of medical video data
- the temporal dimension. Our proposed framework explicitly extracts features
from neighbouring frames across the temporal dimension and incorporates them
with a temporal feature blender, which then tokenises the high-level
spatio-temporal feature to form a strong global feature encoded via a Swin
Transformer. The final segmentation results are produced via a UNet-like
encoder-decoder architecture. Our model outperforms other approaches by a
significant margin and improves the segmentation benchmarks on the VFSS2022
dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets
tested. Our studies also show the efficacy of the temporal feature blending
scheme and the cross-dataset transferability of the learned capabilities. Code and
models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet
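The temporal feature blender is described only at a high level here; one simple reading is a (possibly learned) weighted average of per-frame feature maps over the temporal window. The NumPy sketch below encodes that assumption and nothing more; for the actual module, see the linked repository.

```python
import numpy as np

def blend_temporal_features(frame_feats, weights=None):
    """Blend per-frame features (T, C, H, W) into one (C, H, W) map
    by a weighted average over time; uniform weights by default."""
    frame_feats = np.asarray(frame_feats, dtype=float)
    t = frame_feats.shape[0]
    weights = np.ones(t) / t if weights is None else np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Contract the time axis: sum_t weights[t] * frame_feats[t]
    return np.tensordot(weights, frame_feats, axes=1)
```

In the framework described above, the blended spatio-temporal feature is then tokenised and passed to the Swin Transformer encoder.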
Noninvasive Dynamic Characterization of Swallowing Kinematics and Impairments in High Resolution Cervical Auscultation via Deep Learning
Swallowing is a complex sensorimotor activity by which food and liquids are transferred from the oral cavity to the stomach. It requires coordination between multiple subsystems, which makes it subject to impairment secondary to a variety of medical or surgical conditions. Dysphagia refers to any swallowing disorder and is common in patients with head and neck cancer and neurological conditions such as stroke. Dysphagia affects nearly 9 million adults and causes more than 60,000 deaths yearly in the US. In this research, we utilize advanced signal processing techniques, sensor technology, and deep learning methods to develop a noninvasive and widely available tool for the evaluation and diagnosis of swallowing problems. We investigate the use of modern spectral estimation methods, in addition to convolutional recurrent neural networks, to demarcate and localize the important swallowing physiological events that contribute to airway protection, based solely on signals collected from non-invasive sensors attached to the anterior neck. These events include the full swallowing activity, upper esophageal sphincter opening duration and maximal opening diameter, and aspiration. We believe that combining sensor technology with state-of-the-art deep learning architectures specialized in time-series analysis will achieve great advances in dysphagia detection and management in terms of non-invasiveness, portability, and availability. Such advances will enable patients to get continuous feedback about their swallowing outside standard clinical care settings, which will greatly facilitate their daily activities and enhance their quality of life.
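The spectral-estimation step can be pictured as a short-time Fourier transform of the neck-sensor signal. The windowed-FFT sketch below (NumPy only, with assumed window and hop sizes) shows the basic idea, not the specific estimators used in this research.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude short-time Fourier transform of a 1-D sensor signal.

    Returns an (n_frames, win // 2 + 1) array of spectral magnitudes,
    one row per Hann-windowed frame.
    """
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))
```

Time-frequency maps of this kind are a natural input for the convolutional recurrent networks the abstract mentions, which then demarcate the swallowing events along the time axis.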
Multi-modal and multi-dimensional biomedical image data analysis using deep learning
There is a growing need for computational methods and tools for automated, objective, and quantitative analysis of biomedical signal and image data to facilitate disease and treatment monitoring, early diagnosis, and scientific discovery. Recent advances in artificial intelligence and machine learning, particularly in deep learning, have revolutionized computer vision and image analysis for many application areas. While deep learning methods have been very successful on non-biomedical signal, image, and video data, high-stakes biomedical applications present unique challenges that must be addressed, such as diverse image modalities, limited training data, and the need for explainability and interpretability. In this dissertation, we developed novel, explainable, and attention-based deep learning frameworks for objective, automated, and quantitative analysis of biomedical signal, image, and video data. The proposed solutions involve multi-scale signal analysis for oral diadochokinesis studies; an ensemble of deep learning cascades using global soft attention mechanisms for segmentation of meningeal vascular networks in confocal microscopy; spatial attention and spatio-temporal data fusion for detection of rare and short-term video events in laryngeal endoscopy videos; and a novel discrete Fourier transform driven class activation map for explainable AI and weakly-supervised object localization and segmentation for detailed vocal fold motion analysis using laryngeal endoscopy videos. Experiments on the proposed methods showed robust and promising results towards automated, objective, and quantitative analysis of biomedical data, which is of great value for potential early diagnosis and effective monitoring of disease progression and treatment.
Automatic Detection of the Pharyngeal Phase in Raw Videos for the Videofluoroscopic Swallowing Study Using Efficient Data Collection and 3D Convolutional Networks
Videofluoroscopic swallowing study (VFSS) is a standard diagnostic tool for dysphagia. To detect the presence of aspiration during a swallow, a manual search is commonly used to mark the time intervals of the pharyngeal phase in the corresponding VFSS images. In this study, we present a novel approach that uses 3D convolutional networks to detect the pharyngeal phase in raw VFSS videos without manual annotations. For efficient collection of training data, we propose a cascade framework that requires neither the time intervals of the swallowing process nor manual marking of anatomical positions for detection. For video classification, we applied the Inflated 3D convolutional network (I3D), one of the state-of-the-art networks for action classification, as a baseline architecture. We also present a modified 3D convolutional network architecture derived from the baseline I3D architecture. The classification and detection performance of the two architectures were evaluated for comparison. The experimental results show that the proposed model outperformed the baseline I3D model when both models were trained from random weights. We conclude that the proposed method greatly reduces the examination time of VFSS images with a low miss rate.
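Once a network scores each frame, turning per-frame probabilities into pharyngeal-phase time intervals is a simple thresholding-plus-grouping step. The sketch below is a generic post-processing illustration; the threshold and minimum run length are assumed parameters, not values from this study.

```python
def detect_intervals(scores, thresh=0.5, min_len=3):
    """Group per-frame phase probabilities into (start, end) frame intervals.

    Runs of frames with score >= thresh shorter than min_len are discarded
    as likely false positives.
    """
    intervals, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                intervals.append((start, i - 1))
            start = None
    if start is not None and len(scores) - start >= min_len:
        intervals.append((start, len(scores) - 1))
    return intervals
```

The resulting intervals are what a clinician would otherwise locate by manually scrubbing through the recording, which is where the reported reduction in examination time comes from.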