Comparative evaluation of instrument segmentation and tracking methods in minimally invasive surgery
Intraoperative segmentation and tracking of minimally invasive instruments is
a prerequisite for computer- and robot-assisted surgery. Because additional
hardware such as tracking systems or robot encoders is cumbersome and lacks
accuracy, surgical vision is evolving into a promising technique for segmenting
and tracking instruments using only endoscopic images. What has been missing so
far, however, are common image data sets for consistent evaluation and
benchmarking of algorithms against each other. This paper presents a comparative
validation study of different vision-based methods for instrument segmentation
and tracking in the context of robotic as well as conventional laparoscopic
surgery. The contribution of the paper is twofold: we introduce a comprehensive
validation data set that was provided to the study participants and present the
results of the comparative validation study. Based on the results of the
validation study, we arrive at the conclusion that modern deep learning
approaches outperform other methods in instrument segmentation tasks, but the
results are still not perfect. Furthermore, we show that merging the results of
different methods significantly increases accuracy compared with the best
stand-alone method. The results of the instrument tracking task, on the other
hand, show that tracking remains an open problem, especially in difficult
scenarios in conventional laparoscopic surgery.
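The finding that merging results from different methods beats the best stand-alone method can be illustrated with a per-pixel majority vote over binary instrument masks. The sketch below is a minimal illustration under that assumption; the study's actual fusion rule is not specified in the abstract.

```python
# Sketch: fuse binary instrument masks from several methods by per-pixel
# majority vote. Inputs are hypothetical; the study's fusion rule may differ.
import numpy as np

def fuse_masks(masks: list[np.ndarray]) -> np.ndarray:
    """Per-pixel strict-majority vote over binary masks of identical shape."""
    stack = np.stack(masks, axis=0).astype(np.uint8)  # (n_methods, H, W)
    votes = stack.sum(axis=0)                         # methods voting "instrument"
    return (votes * 2 > len(masks)).astype(np.uint8)  # strict majority wins

# Example: three methods disagree on one pixel; the vote settles it.
a = np.array([[1, 0], [1, 1]], dtype=np.uint8)
b = np.array([[1, 0], [0, 1]], dtype=np.uint8)
c = np.array([[0, 0], [1, 1]], dtype=np.uint8)
print(fuse_masks([a, b, c]))  # [[1 0] [1 1]]
```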
ToolNet: Holistically-Nested Real-Time Segmentation of Robotic Surgical Tools
Real-time tool segmentation from endoscopic videos is an essential part of
many computer-assisted robotic surgical systems and of critical importance in
robotic surgical data science. We propose two novel deep learning architectures
for automatic segmentation of non-rigid surgical instruments. Both methods take
advantage of automated deep-learning-based multi-scale feature extraction while
trying to maintain an accurate segmentation quality at all resolutions. The two
proposed methods encode the multi-scale constraint inside the network
architecture. The first proposed architecture enforces it by cascaded
aggregation of predictions and the second proposed network does it by means of
a holistically-nested architecture where the loss at each scale is taken into
account for the optimization process. As the proposed methods are for real-time
semantic labeling, both present a reduced number of parameters. We propose the
use of parametric rectified linear units for semantic labeling in these small
architectures to increase the regularization ability of the design and maintain
the segmentation accuracy without overfitting the training sets. We compare the
proposed architectures against state-of-the-art fully convolutional networks.
We validate our methods using existing benchmark datasets, including ex vivo
cases with phantom tissue and different robotic surgical instruments present in
the scene. Our results show a statistically significant improved Dice
Similarity Coefficient over previous instrument segmentation methods. We
analyze our design choices and discuss the key drivers for improving accuracy.
Comment: Paper accepted at IROS 2017
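As a rough illustration of the holistically-nested idea described above (a loss at each scale feeding the optimization), the following PyTorch sketch applies deep supervision with PReLU units; the layer sizes and structure are illustrative assumptions, not the actual ToolNet design.

```python
# Minimal sketch of deep supervision across scales, with PReLU activations
# as advocated in the abstract. Not the actual ToolNet architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegHead(nn.Module):
    """One conv + PReLU per scale, each emitting a side prediction."""
    def __init__(self, channels=(64, 32, 16)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.PReLU(c))
            for c in channels
        )
        self.side_outs = nn.ModuleList(nn.Conv2d(c, 1, 1) for c in channels)

    def forward(self, feats):  # feats: list of tensors, coarse to fine
        return [s(b(f)) for b, s, f in zip(self.blocks, self.side_outs, feats)]

def nested_loss(side_logits, target):
    """Sum BCE over all scales; the target is resized to each side output."""
    loss = 0.0
    for logits in side_logits:
        t = F.interpolate(target, size=logits.shape[-2:], mode="nearest")
        loss = loss + F.binary_cross_entropy_with_logits(logits, t)
    return loss
```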
Surgical Phase Recognition of Short Video Shots Based on Temporal Modeling of Deep Features
Recognizing the phases of a laparoscopic surgery (LS) operation from its
video constitutes a fundamental step for efficient content representation,
indexing and retrieval in surgical video databases. In the literature, most
techniques focus on phase segmentation of the entire LS video using
hand-crafted visual features, instrument usage signals, and recently
convolutional neural networks (CNNs). In this paper we address the problem of
phase recognition of short video shots (10s) of the operation, without
utilizing information about the preceding/forthcoming video frames, their phase
labels or the instruments used. We investigate four state-of-the-art CNN
architectures (Alexnet, VGG19, GoogleNet, and ResNet101), for feature
extraction via transfer learning. Visual saliency was employed for selecting
the most informative region of the image as input to the CNN. Video shot
representation was based on two temporal pooling mechanisms. Most importantly,
we investigate the role of 'elapsed time' (from the beginning of the
operation), and we show that inclusion of this feature can increase performance
dramatically (from 69% to 75% mean accuracy). Finally, a long short-term memory
(LSTM) network was trained for video shot classification based on the fusion of
CNN features with 'elapsed time', increasing the accuracy to 86%. Our results
highlight the prominent role of visual saliency, long-range temporal recursion
and 'elapsed time' (a feature so far ignored) for surgical phase recognition.
Comment: 6 pages, 4 figures, 6 tables
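The described fusion of CNN features with 'elapsed time' in an LSTM can be sketched as follows; the dimensions, the normalization of elapsed time, and the classification head are assumptions for illustration.

```python
# Sketch: per-frame CNN features concatenated with normalized elapsed time,
# fed to an LSTM shot classifier. Names and sizes are illustrative.
import torch
import torch.nn as nn

class ShotPhaseLSTM(nn.Module):
    def __init__(self, feat_dim=2048, hidden=256, n_phases=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, cnn_feats, elapsed):
        # cnn_feats: (batch, frames, feat_dim); elapsed: (batch, 1),
        # elapsed time normalized to [0, 1] and repeated for every frame.
        t = elapsed.unsqueeze(1).expand(-1, cnn_feats.size(1), -1)
        x = torch.cat([cnn_feats, t], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # classify from the last time step

model = ShotPhaseLSTM()
feats = torch.randn(4, 25, 2048)  # e.g. 25 frames from a 10 s shot
elapsed = torch.rand(4, 1)        # fraction of the operation elapsed
logits = model(feats, elapsed)    # (4, n_phases)
```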
2017 Robotic Instrument Segmentation Challenge
In mainstream computer vision and machine learning, public datasets such as
ImageNet, COCO and KITTI have helped drive enormous improvements by enabling
researchers to understand the strengths and limitations of different algorithms
via performance comparison. However, this type of approach has had limited
translation to problems in robotic assisted surgery as this field has never
established the same level of common datasets and benchmarking methods. In 2015
a sub-challenge was introduced at the EndoVis workshop where a set of robotic
images were provided with automatically generated annotations from robot
forward kinematics. However, there were issues with this dataset due to the
limited background variation, lack of complex motion and inaccuracies in the
annotation. In this work we present the results of the 2017 challenge on
robotic instrument segmentation, which involved 10 teams participating in
binary, parts-based, and type-based segmentation of articulated da Vinci
robotic instruments.
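A typical evaluation for such a challenge scores predicted masks against reference annotations with intersection over union; the sketch below shows the general idea, though the 2017 challenge's exact protocol (per-frame averaging, parts and type variants) may differ.

```python
# Sketch of challenge-style scoring: mean IoU of predicted vs. reference
# binary instrument masks. The actual challenge protocol may differ.
import numpy as np

def iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection over union for binary masks; empty-vs-empty scores 1."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, ref).sum() / union)

def mean_iou(preds, refs):
    """Average IoU over a sequence of (prediction, reference) pairs."""
    return float(np.mean([iou(p, r) for p, r in zip(preds, refs)]))
```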
Automatic Detection of Out-of-body Frames in Surgical Videos for Privacy Protection Using Self-supervised Learning and Minimal Labels
Endoscopic video recordings are widely used in minimally invasive
robot-assisted surgery, but when the endoscope is outside the patient's body,
it can capture irrelevant segments that may contain sensitive information. To
address this, we propose a framework that accurately detects out-of-body frames
in surgical videos by leveraging self-supervision with minimal data labels. We
use a massive amount of unlabeled endoscopic images to learn meaningful
representations in a self-supervised manner. Our approach, which involves
pre-training on an auxiliary task and fine-tuning with limited supervision,
outperforms previous methods for detecting out-of-body frames in surgical
videos captured from da Vinci X and Xi surgical systems. The average F1 scores
range from 96.00 to 98.02. Remarkably, using only 5% of the training labels,
our approach still maintains an average F1 score above 97,
outperforming fully-supervised methods with 95% fewer labels. These results
demonstrate the potential of our framework to facilitate the safe handling of
surgical video recordings and enhance data privacy protection in minimally
invasive surgery.
Comment: A 15-page journal article submitted to the Journal of Medical Robotics Research (JMRR)
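The pretrain-then-fine-tune regime with 5% of the labels can be sketched as follows; the encoder, the pretext task, and the training loop details are assumptions, not the paper's implementation.

```python
# Sketch: reuse a self-supervised encoder and fine-tune a small head on a
# 5% label subset. Model and loop details are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

def finetune(encoder: nn.Module, feat_dim: int, dataset, label_fraction=0.05):
    n = int(len(dataset) * label_fraction)
    subset = Subset(dataset, torch.randperm(len(dataset))[:n].tolist())
    head = nn.Linear(feat_dim, 2)  # in-body vs. out-of-body
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    encoder.eval()  # keep the self-supervised weights frozen
    for images, labels in DataLoader(subset, batch_size=32, shuffle=True):
        with torch.no_grad():
            feats = encoder(images)
        loss = nn.functional.cross_entropy(head(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head
```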
De-smokeGCN: Generative Cooperative Networks for Joint Surgical Smoke Detection and Removal
Surgical smoke removal algorithms can improve the quality of intra-operative imaging and reduce hazards in image-guided surgery, a highly desirable post-processing step for many clinical applications. These algorithms also enable effective computer vision tasks for future robotic surgery. In this paper, we present a new unsupervised learning framework for high-quality pixel-wise smoke detection and removal. One of the well-recognized grand challenges in using convolutional neural networks (CNNs) for medical image processing is obtaining intra-operative medical imaging datasets for network training and validation, as the availability and quality of such datasets are limited. Our novel training framework does not require ground-truth image pairs. Instead, it learns purely from computer-generated simulation images. This approach opens up new avenues and bridges a substantial gap between conventional non-learning-based methods and those requiring prior knowledge gained from extensive training datasets. Inspired by the Generative Adversarial Network (GAN), we have developed a novel generative-cooperative learning scheme that decomposes the de-smoking process into two separate tasks: smoke detection and smoke removal. The detection network serves as prior knowledge and also as a loss function that maximizes its support for training the smoke removal network. Quantitative and qualitative studies show that the proposed training framework outperforms state-of-the-art de-smoking approaches, including the latest GAN framework (such as PIX2PIX). Although trained on synthetic images, the proposed network proved effective at detecting and removing surgical smoke on both simulated and real-world laparoscopic images.
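One way to realize the described cooperation, where the detection network acts as a loss for the removal network, is to weight the reconstruction loss by the predicted smoke mask; the snippet below is a minimal sketch under that assumption, with the weighting scheme itself assumed.

```python
# Sketch: the detection network's smoke mask weights the removal network's
# reconstruction loss so training concentrates on smoky pixels.
import torch
import torch.nn.functional as F

def cooperative_loss(desmoked, clean_ref, smoke_mask):
    """desmoked/clean_ref: (N, 3, H, W); smoke_mask: (N, 1, H, W) in [0, 1]."""
    per_pixel = (desmoked - clean_ref).abs()            # L1 residual
    weighted = (per_pixel * (1.0 + smoke_mask)).mean()  # emphasize smoke
    return weighted
```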
Uncertainty-Aware Organ Classification for Surgical Data Science Applications in Laparoscopy
Objective: Surgical data science is evolving into a research field that aims
to observe everything occurring within and around the treatment process to
provide situation-aware data-driven assistance. In the context of endoscopic
video analysis, the accurate classification of organs in the field of view of
the camera poses a technical challenge. Herein, we propose a new approach to
anatomical structure classification and image tagging that features an
intrinsic measure of confidence to estimate its own performance with high
reliability and which can be applied to both RGB and multispectral imaging (MI)
data. Methods: Organ recognition is performed using a superpixel classification
strategy based on textural and reflectance information. Classification
confidence is estimated by analyzing the dispersion of class probabilities.
Assessment of the proposed technology is performed through a comprehensive in
vivo study with seven pigs. Results: When applied to image tagging, mean
accuracy in our experiments increased from 65% (RGB) and 80% (MI) to 90% (RGB)
and 96% (MI) with the confidence measure. Conclusion: Results showed that the
confidence measure had a significant influence on the classification accuracy,
and MI data are better suited for anatomical structure labeling than RGB data.
Significance: This work significantly enhances the state of the art in automatic
labeling of endoscopic videos by introducing the use of the confidence metric,
and by being the first study to use MI data for in vivo laparoscopic tissue
classification. The data of our experiments will be released as the first in
vivo MI dataset upon publication of this paper.
Comment: 7 pages, 6 images, 2 tables
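A common way to turn the dispersion of class probabilities into a confidence score is normalized entropy; the sketch below illustrates that idea, though the paper's exact dispersion statistic is not given in the abstract.

```python
# Sketch: confidence as one minus the normalized Shannon entropy of the
# class-probability vector. An illustration, not the paper's exact measure.
import numpy as np

def confidence(probs: np.ndarray) -> float:
    """1 minus normalized entropy; peaked distributions score near 1."""
    p = np.clip(probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum()
    return 1.0 - entropy / np.log(len(p))

print(confidence(np.array([0.9, 0.05, 0.05])))  # confident, ~0.64
print(confidence(np.array([0.4, 0.3, 0.3])))    # uncertain, ~0.01
```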
Image Compositing for Segmentation of Surgical Tools Without Manual Annotations
Producing manual, pixel-accurate image segmentation labels is tedious and time-consuming. This is often a rate-limiting factor when large amounts of labeled images are required, such as for training deep convolutional networks for instrument-background segmentation in surgical scenes. No large datasets comparable to industry standards in the computer vision community are available for this task. To circumvent this problem, we propose to automate the creation of a realistic training dataset by exploiting techniques stemming from special effects and harnessing them to target training performance rather than visual appeal. Foreground data is captured by placing sample surgical instruments over a chroma key (a.k.a. green screen) in a controlled environment, thereby making extraction of the relevant image segment straightforward. Multiple lighting conditions and viewpoints can be captured and introduced in the simulation by moving the instruments and camera and modulating the light source. Background data is captured by collecting videos that do not contain instruments. In the absence of pre-existing instrument-free background videos, minimal labeling effort is required: one need only select frames that do not contain surgical instruments from videos of surgical interventions freely available online. We compare different methods of blending instruments over tissue and propose a novel data augmentation approach that takes advantage of the plurality of options. We show that by training a vanilla U-Net on semi-synthetic data only and applying simple post-processing, we are able to match the results of the same network trained on a publicly available, manually labeled real dataset.
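The chroma-key extraction and blending step can be sketched with OpenCV as follows; the HSV thresholds and the blending are illustrative and would need tuning to a specific green-screen setup.

```python
# Sketch: extract the instrument from a green-screen frame and blend it
# over an instrument-free background. Thresholds are assumed, not tuned.
import cv2
import numpy as np

def composite(fg_bgr: np.ndarray, bg_bgr: np.ndarray) -> np.ndarray:
    """Assumes fg_bgr and bg_bgr have the same height and width."""
    hsv = cv2.cvtColor(fg_bgr, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, (40, 60, 60), (80, 255, 255))  # green-screen pixels
    mask = (green == 0).astype(np.float32)                  # 1 = instrument
    mask = cv2.GaussianBlur(mask, (5, 5), 0)                # soften the edges
    mask = mask[..., None]                                  # (H, W, 1) for broadcasting
    return (fg_bgr * mask + bg_bgr * (1.0 - mask)).astype(np.uint8)
```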