SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound
Identifying and interpreting fetal standard scan planes during 2D ultrasound
mid-pregnancy examinations are highly complex tasks which require years of
training. Apart from guiding the probe to the correct location, it can be
equally difficult for a non-expert to identify relevant structures within the
image. Automatic image processing can provide tools to help experienced as well
as inexperienced operators with these tasks. In this paper, we propose a novel
method based on convolutional neural networks which can automatically detect 13
fetal standard views in freehand 2D ultrasound data as well as provide a
localisation of the fetal structures via a bounding box. An important
contribution is that the network learns to localise the target anatomy using
weak supervision based on image-level labels only. The network architecture is
designed to operate in real-time while providing optimal output for the
localisation task. We present results for real-time annotation, retrospective
frame retrieval from saved videos, and localisation on a very large and
challenging dataset consisting of images and video recordings of full clinical
anomaly screenings. We found that the proposed method achieved an average
F1-score of 0.798 in a realistic classification experiment modelling real-time
detection, and obtained a 90.09% accuracy for retrospective frame retrieval.
Moreover, an accuracy of 77.8% was achieved on the localisation task.
Comment: 12 pages, 8 figures, published in IEEE Transactions on Medical Imaging
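The abstract describes localisation learned from image-level labels only. The following is a minimal sketch of that general idea in PyTorch, assuming a class-map plus global-pooling formulation; the layer sizes, the `WeakLocNet` name, and the thresholding rule are illustrative assumptions, not the actual SonoNet architecture.

```python
# Hedged sketch: weakly supervised localisation from image-level labels.
# NOT the exact SonoNet architecture; it only illustrates producing per-class
# confidence maps, pooling them to image-level scores for training, and
# thresholding a map at test time to obtain a coarse bounding box.
import torch
import torch.nn as nn

class WeakLocNet(nn.Module):
    def __init__(self, num_classes: int = 13):
        super().__init__()
        # Small convolutional backbone (placeholder for the real feature extractor).
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        # 1x1 convolution yields one spatial confidence map per class.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        maps = self.classifier(self.features(x))   # (B, C, H', W') class maps
        scores = maps.flatten(2).mean(dim=2)        # global pooling -> image-level logits
        return scores, maps

def box_from_map(conf_map: torch.Tensor, thresh: float = 0.5):
    """Threshold one class confidence map and return an (x0, y0, x1, y1) box."""
    mask = conf_map > conf_map.max() * thresh
    ys, xs = mask.nonzero(as_tuple=True)
    if len(xs) == 0:
        return None
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

# Training uses only image-level labels via the pooled scores:
model = WeakLocNet()
images = torch.randn(4, 1, 128, 128)           # dummy grayscale ultrasound frames
labels = torch.randint(0, 13, (4,))
scores, maps = model(images)
loss = nn.CrossEntropyLoss()(scores, labels)   # no bounding-box annotations needed
box = box_from_map(maps[0, labels[0]])         # coarse localisation at inference
```

Because the supervision signal is only the image-level class, the spatial maps are a by-product of classification; this is what allows localisation without box annotations, at the cost of coarser boxes than fully supervised detectors.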
Excitation Backprop for RNNs
Deep models are state-of-the-art for many vision tasks including video action
recognition and video captioning. Models are trained to caption or classify
activity in videos, but little is known about the evidence used to make such
decisions. Grounding decisions made by deep networks has been studied in
spatial visual content, giving more insight into model predictions for images.
However, such studies are relatively lacking for models of spatiotemporal
visual content - videos. In this work, we devise a formulation that
simultaneously grounds evidence in space and time, in a single pass, using
top-down saliency. We visualize the spatiotemporal cues that contribute to a
deep model's classification/captioning output using the model's internal
representation. Based on these spatiotemporal cues, we are able to localize
segments within a video that correspond with a specific action, or phrase from
a caption, without explicitly optimizing/training for these tasks.
Comment: CVPR 2018 Camera Ready Version
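The abstract describes grounding a video model's decision in space and time in a single pass. Below is a minimal sketch, assuming a per-frame CNN feeding a GRU classifier; plain input-gradient saliency is used as a simple stand-in, since the paper's actual method propagates excitation (top-down winning probabilities) rather than gradients, and the model, shapes, and names here are illustrative assumptions.

```python
# Hedged sketch: spatiotemporal top-down saliency for a recurrent video model.
# Gradient-based saliency is a stand-in for the paper's excitation backprop;
# the goal is only to show per-frame, per-pixel evidence from one backward pass.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, num_classes: int = 10, feat_dim: int = 64):
        super().__init__()
        self.frame_encoder = nn.Sequential(              # per-frame CNN features
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(8),
        )
        self.rnn = nn.GRU(feat_dim * 8 * 8, 128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, video):                            # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1))  # (B*T, F, 8, 8)
        feats = feats.view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out[:, -1])                     # classify from last step

model = VideoClassifier()
video = torch.randn(1, 16, 3, 64, 64, requires_grad=True)  # dummy 16-frame clip
logits = model(video)
target = logits.argmax(dim=1).item()
# One backward pass from the target class score to the input: the gradient
# magnitude per frame and pixel serves as a spatiotemporal saliency map.
logits[0, target].backward()
saliency = video.grad.abs().amax(dim=2)                  # (1, T, H, W) cue maps
```

Frames with high saliency mass indicate when the evidence for the predicted action occurs, and the per-pixel maps indicate where, which is the kind of spatiotemporal grounding the abstract refers to.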