Skill, or style? Classification of fetal sonography eye-tracking data
We present a method for classifying human skill at fetal ultrasound scanning from eye-tracking and pupillary data of sonographers. Human skill characterization for this clinical task typically groups clinicians into categories such as expert and beginner based on years of professional experience; experts typically have more than 10 years and beginners 0-5 years. In some cases, such groupings also include trainees who are not yet fully-qualified professionals. Prior work has relied on eye movements, which necessitates separating the eye-tracking data into events such as fixations and saccades. Our method makes no prior assumption about the relationship between years of experience and skill, and does not require this separation of the eye-tracking data. Our best-performing skill classification model achieves F1 scores of 98% and 70% for the expert and trainee classes, respectively. We also show that years of experience, used as a direct measure of skill, is significantly correlated with a sonographer's expertise.
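The snippet below is not the paper's model; it is a minimal sketch of the idea of classifying skill from raw eye-tracking and pupillary windows without fixation/saccade segmentation, with per-class F1 reported as in the abstract. The window size, summary features, classifier, and toy data are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def window_features(samples, win=500):
    """Summarise fixed-length windows of raw (x, y, pupil_diameter) samples
    without segmenting them into fixations or saccades."""
    feats = []
    for start in range(0, len(samples) - win + 1, win):
        w = samples[start:start + win]
        speed = np.linalg.norm(np.diff(w[:, :2], axis=0), axis=1)   # sample-to-sample gaze speed
        feats.append([w[:, 0].std(), w[:, 1].std(),                 # gaze dispersion
                      speed.mean(), speed.std(),                    # speed statistics
                      w[:, 2].mean(), w[:, 2].std()])               # pupillary statistics
    return np.array(feats)

# Synthetic recordings standing in for raw eye-tracking streams of two groups.
rng = np.random.default_rng(0)
recordings = [rng.normal(scale=1.0 + 0.5 * (i % 2), size=(5000, 3)) for i in range(40)]
labels = [i % 2 for i in range(40)]                                 # 0 = trainee, 1 = expert

feats = [window_features(r) for r in recordings]
X, y = np.vstack(feats), np.repeat(labels, [len(f) for f in feats])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("per-class F1:", f1_score(y_te, clf.predict(X_te), average=None))
```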
Towards standard plane prediction of fetal head ultrasound with domain adaption
Fetal Standard Plane (SP) acquisition is a key step in ultrasound-based assessment of fetal health. The task is to detect an ultrasound (US) image containing predefined anatomy. However, acquiring a good SP requires skill in practice, and trainees and occasional users of ultrasound devices can find this challenging. In this work, we consider the task of automatically predicting the fetal head SP from the video approaching the SP. We adopt a domain transfer learning approach that maps the encoded spatial and temporal features of video in the source domain to the spatial representation of the desired SP image in the target domain, together with adversarial training to preserve the quality of the resulting image. Experimental results show that the predicted head plane is plausible and consistent with the anatomical features expected in a real SP. The proposed approach is motivated by the need to support non-experts in finding and analysing a trans-ventricular (TV) plane, but it could also be generalized to other planes, trimesters, and ultrasound imaging tasks for which standard planes are defined.
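As a rough illustration of the kind of pipeline described above (not the published network), the sketch below maps a short video clip to a single standard-plane image and adds an adversarial term to keep the synthesised image realistic. The layer sizes, losses, and toy tensors are assumptions.

```python
import torch
import torch.nn as nn

class VideoToPlane(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # spatio-temporal features of the approach clip
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(              # spatial decoding into the target SP image
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, clip):                       # clip: (B, 1, T, H, W)
        z = self.encoder(clip)                     # (B, 32, T/4, H/4, W/4)
        z = z.flatten(1, 2)                        # fold the temporal axis into channels
        return self.decoder(z)

disc = nn.Sequential(nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                     nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                     nn.Flatten(), nn.LazyLinear(1))

gen, bce, l1 = VideoToPlane(), nn.BCEWithLogitsLoss(), nn.L1Loss()
clip, target_sp = torch.randn(2, 1, 8, 64, 64), torch.randn(2, 1, 64, 64)   # toy data

fake = gen(clip)
g_loss = l1(fake, target_sp) + bce(disc(fake), torch.ones(2, 1))            # image fidelity + realism
d_loss = bce(disc(target_sp), torch.ones(2, 1)) + bce(disc(fake.detach()), torch.zeros(2, 1))
print(g_loss.item(), d_loss.item())
```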
D2ANET: Densely Attentional-Aware Network for first trimester ultrasound CRL and NT segmentation
Manual annotation of medical images is time-consuming for clinical experts; therefore, reliable automatic segmentation would be the ideal way to handle large medical datasets. In this paper, we are interested in the detection and segmentation of two fundamental measurements in the first-trimester ultrasound (US) scan: Nuchal Translucency (NT) and Crown Rump Length (CRL). There can be significant variation in the shape, location, or size of the anatomical structures in fetal US scans. We propose a new approach, the Densely Attentional-Aware Network for First Trimester Ultrasound CRL and NT Segmentation (D2ANet), to encode variation in feature size by relying on a powerful attention mechanism and densely connected networks. Our results show that the proposed D2ANet offers high pixel agreement (mean JSC = 84.21) with expert manual annotations.
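The pixel-agreement figure above is a Jaccard similarity coefficient (JSC). The short sketch below shows how such a score can be computed between a predicted mask and an expert annotation, assuming the value is reported on a 0-100 scale; the toy masks are illustrative.

```python
import numpy as np

def jaccard(pred, gt):
    """JSC = |pred AND gt| / |pred OR gt| for binary masks, scaled to 0-100."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both masks empty: treat as perfect agreement
        return 100.0
    return 100.0 * np.logical_and(pred, gt).sum() / union

pred = np.zeros((128, 128), dtype=np.uint8); pred[30:90, 40:100] = 1   # toy predicted NT/CRL mask
gt   = np.zeros((128, 128), dtype=np.uint8); gt[35:95, 45:105] = 1     # toy expert annotation
print(f"JSC = {jaccard(pred, gt):.2f}")
```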
Gaze-probe joint guidance with multi-task learning in obstetric ultrasound scanning
In this work, we exploit multi-task learning to jointly predict the two decision-making processes of gaze movement and probe manipulation that an experienced sonographer would perform in routine obstetric scanning. A multimodal guidance framework, Multimodal-GuideNet, is proposed to detect the causal relationship between a real-world ultrasound video signal, synchronized gaze, and probe motion. The association between the multi-modality inputs is learned and shared through a modality-aware spatial graph that leverages useful cross-modal dependencies. By estimating the probability distribution of probe and gaze movements in real scans, the predicted guidance signals also accommodate inter- and intra-sonographer variation and avoid a fixed scanning path. We validate the new multi-modality approach on three types of obstetric scanning examinations, and the results consistently outperform single-task learning under various guidance policies. To simulate a sonographer's attention on multi-structure images, we also explore multi-step estimation in gaze guidance; its visual results show that the prediction allows multiple gaze centers that are substantially aligned with the underlying anatomical structures.
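A minimal multi-task sketch of the idea, not Multimodal-GuideNet itself: one shared image encoder with two heads that each predict a Gaussian over the next gaze shift and the next probe motion, so the guidance signal admits sonographer-to-sonographer variation. The modality-aware graph is omitted, and the backbone, head dimensions (a 2-D gaze shift and a 4-D probe rotation), and toy data are assumptions.

```python
import torch
import torch.nn as nn

class GazeProbeHeads(nn.Module):
    def __init__(self, dims=(2, 4)):                    # assumed 2-D gaze shift, 4-D probe quaternion
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gaze_head = nn.Linear(32, 2 * dims[0])     # predicts mean and log-variance
        self.probe_head = nn.Linear(32, 2 * dims[1])

    def forward(self, frame):
        z = self.backbone(frame)
        return self.gaze_head(z).chunk(2, dim=-1), self.probe_head(z).chunk(2, dim=-1)

model, nll = GazeProbeHeads(), nn.GaussianNLLLoss()
frame = torch.randn(8, 1, 64, 64)                       # toy ultrasound frames
gaze_target, probe_target = torch.randn(8, 2), torch.randn(8, 4)

(g_mu, g_logvar), (p_mu, p_logvar) = model(frame)
loss = nll(g_mu, gaze_target, g_logvar.exp()) + nll(p_mu, probe_target, p_logvar.exp())
loss.backward()                                         # both tasks share the backbone gradients
print(float(loss))
```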
Automating the human action of first-trimester biometry measurement from real-world freehand ultrasound
Objective: Automated medical image analysis solutions should closely mimic complete human actions to be useful in clinical practice. However, more often an automated image analysis solution represents only part of a human task, which restricts its practical utility. In the case of ultrasound-based fetal biometry, an automated solution should ideally recognize key fetal structures in freehand video guidance, select a standard plane from a video stream and perform biometry. A complete automated solution should automate all three subactions.
Methods: In this article, we consider how to automate the complete human action of first-trimester biometry measurement from real-world freehand ultrasound. In the proposed hybrid convolutional neural network (CNN) architecture design, a classification regression-based guidance model detects and tracks fetal anatomical structures (using visual cues) in the ultrasound video. Several high-quality standard planes that contain the mid-sagittal view of the fetus are sampled at multiple time stamps (using a custom-designed confident-frame detector) based on the estimated probability values associated with predicted anatomical structures that define the biometry plane. Automated semantic segmentation is performed on the selected frames to extract fetal anatomical landmarks. A crown–rump length (CRL) estimate is calculated as the mean CRL from these multiple frames.
Results: Our fully automated method has a high correlation with clinical expert CRL measurement (Pearson's ρ = 0.92, R² = 0.84) and a low mean absolute error of 0.834 weeks for fetal age estimation on a test data set of 42 videos.
Conclusion: The novel standard plane detection algorithm employs a quality detection mechanism defined by clinical standards, ensuring precise biometric measurements.
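A minimal sketch of the confident-frame selection and CRL averaging described in the Methods above: frames whose predicted structure probabilities all exceed a threshold are treated as confident biometry frames, and the final CRL estimate is their mean. The threshold, landmark format, and pixel spacing are illustrative assumptions.

```python
import numpy as np

def select_confident_frames(frame_probs, threshold=0.9):
    """frame_probs: (n_frames, n_structures) probabilities for the structures
    that define the mid-sagittal biometry plane."""
    return np.where(frame_probs.min(axis=1) >= threshold)[0]

def crl_mm(landmarks, mm_per_pixel):
    """Crown-rump length from (crown_xy, rump_xy) pixel landmarks."""
    crown, rump = np.asarray(landmarks[0]), np.asarray(landmarks[1])
    return float(np.linalg.norm(crown - rump)) * mm_per_pixel

rng = np.random.default_rng(0)
probs = rng.uniform(0.7, 1.0, size=(120, 3))            # toy per-frame structure probabilities
frames = select_confident_frames(probs)
landmarks_per_frame = {i: [(100, 120), (260, 180)] for i in frames}   # toy crown/rump points
crl = np.mean([crl_mm(landmarks_per_frame[i], mm_per_pixel=0.3) for i in frames])
print(f"{len(frames)} confident frames, mean CRL = {crl:.1f} mm")
```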
Self-supervised Representation Learning for Ultrasound Video
Recent advances in deep learning have achieved promising performance for medical image analysis, while in most cases ground-truth annotations from human experts are necessary to train the deep model. In practice, such annotations are expensive to collect and can be scarce for medical imaging applications. Therefore, there is significant interest in learning representations from unlabelled raw data. In this paper, we propose a self-supervised learning approach to learn meaningful and transferable representations from medical imaging video without any type of human annotation. We assume that in order to learn such a representation, the model should identify anatomical structures from the unlabelled data. Therefore we force the model to address anatomy-aware tasks with free supervision from the data itself. Specifically, the model is designed to correct the order of a reshuffled video clip and at the same time predict the geometric transformation applied to the video clip. Experiments on fetal ultrasound video show that the proposed approach can effectively learn meaningful and strong representations, which transfer well to downstream tasks like standard plane detection and saliency prediction.
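A minimal sketch of the two pretext tasks named above: the model receives a reshuffled, geometrically transformed clip and must predict both the shuffle permutation and the transformation, with labels generated from the data itself. The tiny 3D CNN, 3-frame clips, and 90-degree rotations are assumptions standing in for the paper's setup.

```python
import itertools
import torch
import torch.nn as nn

PERMS = list(itertools.permutations(range(3)))          # 6 possible orders of a 3-frame clip
ROTS = [0, 1, 2, 3]                                     # rotation in multiples of 90 degrees

def make_pretext_sample(clip):
    """clip: (1, 3, H, W) greyscale frames -> transformed clip plus free labels."""
    p = torch.randint(len(PERMS), ()).item()
    r = torch.randint(len(ROTS), ()).item()
    shuffled = clip[:, list(PERMS[p])]                  # reorder the temporal axis
    rotated = torch.rot90(shuffled, ROTS[r], dims=(-2, -1))
    return rotated, p, r

class PretextNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.order_head = nn.Linear(16, len(PERMS))     # which permutation was applied
        self.rot_head = nn.Linear(16, len(ROTS))        # which rotation was applied

    def forward(self, x):                               # x: (B, 1, 3, H, W)
        z = self.features(x)
        return self.order_head(z), self.rot_head(z)

model, ce = PretextNet(), nn.CrossEntropyLoss()
clip = torch.randn(1, 3, 64, 64)                        # one toy ultrasound clip
x, p, r = make_pretext_sample(clip)
order_logits, rot_logits = model(x.unsqueeze(0))
loss = ce(order_logits, torch.tensor([p])) + ce(rot_logits, torch.tensor([r]))
loss.backward()
print(float(loss))
```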
Dual Representation Learning From Fetal Ultrasound Video and Sonographer Audio
This paper tackles the challenging problem of real-world data self-supervised representation learning from two modalities: fetal ultrasound (US) video and the corresponding speech acquired when a sonographer performs a pregnancy scan. We propose to transfer knowledge between the different modalities, even though the sonographer's speech and the US video may not be semantically correlated. We design a network architecture capable of learning useful representations, such as those of anatomical features and structures, while recognising the correlation between a US video scan and the sonographer's speech. We introduce dual representation learning from US video and audio, which consists of two concepts, Multi-Modal Contrastive Learning and Multi-Modal Similarity Learning, in a latent feature space. Experiments show that the proposed architecture learns powerful representations and transfers well to two downstream tasks. Furthermore, we experiment with two different pretraining datasets, which differ in size and in the length of the video clips (as well as of the sonographer speech), to show that the quality of the sonographer's speech plays an important role in the final performance.
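A minimal sketch of the contrastive component only: video and speech clips recorded together form positive pairs, and all other pairings in the batch act as negatives (a standard InfoNCE-style objective). The encoders, embedding size, and temperature are assumptions, and the similarity-learning component is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

video_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 64 * 64, 128))   # toy video encoder
audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))           # toy spectrogram encoder

def info_nce(v, a, temperature=0.07):
    """Symmetric contrastive loss between paired video and audio embeddings."""
    v, a = F.normalize(v, dim=-1), F.normalize(a, dim=-1)
    logits = v @ a.t() / temperature                     # cosine similarities of all pairs
    targets = torch.arange(v.size(0))                    # i-th video matches i-th audio
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

video = torch.randn(8, 3, 16, 64, 64)                    # batch of scan clips
audio = torch.randn(8, 64, 100)                          # co-occurring speech spectrograms
loss = info_nce(video_enc(video), audio_enc(audio))
loss.backward()
print(float(loss))
```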
Show from Tell: Audio-Visual Modelling in Clinical Settings
Auditory and visual signals are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals, which are usually speech. In this paper, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference. Experimental evaluations on a large-scale clinical multi-modal ultrasound video dataset show that the proposed self-supervised method learns good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions.
Audio-visual modelling in a clinical setting
Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals, which are usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focus of the sonographer's gaze, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset shows that the proposed representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects of practice, including a better understanding of obstetric imaging, the training of new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow.
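A minimal sketch of speech-referenced localisation, assuming shared-dimension embeddings: an embedded speech segment is compared with every spatial position of the frame's feature map, giving a heat map over candidate anatomical regions. The encoders and shapes are illustrative, not the framework described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

video_backbone = nn.Conv2d(1, 128, 3, padding=1)          # toy per-frame feature extractor
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 50, 128))   # toy speech encoder

frame = torch.randn(1, 1, 64, 64)                          # one ultrasound frame
speech = torch.randn(1, 64, 50)                            # spectrogram of the co-occurring speech

fmap = F.normalize(video_backbone(frame), dim=1)           # (1, 128, 64, 64)
query = F.normalize(audio_encoder(speech), dim=1)          # (1, 128)

heatmap = torch.einsum('bchw,bc->bhw', fmap, query)        # cosine similarity at every position
y, x = divmod(int(heatmap.flatten(1).argmax()), heatmap.shape[-1])
print(f"speech-referenced region of interest centred near pixel ({x}, {y})")
```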
Discovering Salient Anatomical Landmarks by Predicting Human Gaze
Anatomical landmarks are a crucial prerequisite for many medical imaging
tasks. Usually, the set of landmarks for a given task is predefined by experts.
The landmark locations for a given image are then annotated manually or via
machine learning methods trained on manual annotations. In this paper, in
contrast, we present a method to automatically discover and localize anatomical
landmarks in medical images. Specifically, we consider landmarks that attract
the visual attention of humans, which we term visually salient landmarks. We
illustrate the method for fetal neurosonographic images. First, full-length
clinical fetal ultrasound scans are recorded with live sonographer
gaze-tracking. Next, a convolutional neural network (CNN) is trained to predict
the gaze point distribution (saliency map) of the sonographers on scan video
frames. The CNN is then used to predict saliency maps of unseen fetal
neurosonographic images, and the landmarks are extracted as the local maxima of
these saliency maps. Finally, the landmarks are matched across images by
clustering the landmark CNN features. We show that the discovered landmarks can
be used within affine image registration, with average landmark alignment
errors between 4.1% and 10.9% of the fetal head long axis length.Comment: Accepted at IEEE International Symposium on Biomedical Imaging 2020
(ISBI 2020
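A minimal sketch of the landmark-extraction step only: visually salient landmarks are taken as local maxima of the predicted gaze saliency map. The synthetic saliency map, neighbourhood size, and threshold are assumptions, and the cross-image matching and affine registration steps are omitted.

```python
import numpy as np
from scipy.ndimage import maximum_filter, gaussian_filter

def salient_landmarks(saliency, size=15, min_value=0.3):
    """Return (row, col) peaks of a saliency map above a relative threshold."""
    peaks = (saliency == maximum_filter(saliency, size=size))
    peaks &= saliency > min_value * saliency.max()
    return np.argwhere(peaks)

# Toy saliency map with two blurred hot spots standing in for the CNN prediction.
saliency = np.zeros((128, 128))
saliency[40, 40] = 1.0
saliency[90, 100] = 0.8
saliency = gaussian_filter(saliency, sigma=5)

for r, c in salient_landmarks(saliency):
    print(f"landmark at row {r}, col {c}")
```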