    Comparative evaluation of instrument segmentation and tracking methods in minimally invasive surgery

    Intraoperative segmentation and tracking of minimally invasive instruments is a prerequisite for computer- and robotic-assisted surgery. Since additional hardware such as tracking systems or the robot encoders is cumbersome and lacks accuracy, surgical vision is evolving as a promising technique for segmenting and tracking the instruments using only the endoscopic images. However, common image data sets for consistent evaluation and benchmarking of algorithms against each other have so far been missing. This paper presents a comparative validation study of different vision-based methods for instrument segmentation and tracking in the context of robotic as well as conventional laparoscopic surgery. The contribution of the paper is twofold: we introduce a comprehensive validation data set that was provided to the study participants, and we present the results of the comparative validation study. Based on these results, we conclude that modern deep learning approaches outperform other methods in instrument segmentation tasks, although the results are still not perfect. Furthermore, we show that merging results from different methods significantly increases accuracy in comparison to the best stand-alone method. The results of the instrument tracking task, on the other hand, show that tracking is still an open challenge, especially in challenging scenarios in conventional laparoscopic surgery.
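
    The finding that merging the outputs of several methods beats the best stand-alone method suggests a simple mask-level ensemble. The sketch below is a minimal illustration of pixel-wise majority voting over binary instrument masks; it is not the fusion scheme used in the study, and the mask shapes and tie-breaking rule are assumptions.

```python
import numpy as np

def fuse_masks_majority(masks):
    """Fuse binary instrument masks (each H x W, values 0/1) produced by
    several methods using a pixel-wise majority vote; ties count as instrument."""
    stack = np.stack(masks).astype(np.uint8)   # shape (K, H, W)
    votes = stack.sum(axis=0)                  # methods voting "instrument" per pixel
    return (2 * votes >= stack.shape[0]).astype(np.uint8)

# Three hypothetical per-method predictions for one frame
preds = [(np.random.rand(480, 640) > 0.5).astype(np.uint8) for _ in range(3)]
fused = fuse_masks_majority(preds)
```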

    ToolNet: Holistically-Nested Real-Time Segmentation of Robotic Surgical Tools

    Real-time tool segmentation from endoscopic videos is an essential component of many computer-assisted robotic surgical systems and of critical importance in robotic surgical data science. We propose two novel deep learning architectures for automatic segmentation of non-rigid surgical instruments. Both methods take advantage of automated deep-learning-based multi-scale feature extraction while trying to maintain accurate segmentation quality at all resolutions. The two proposed methods encode the multi-scale constraint inside the network architecture: the first enforces it by cascaded aggregation of predictions, and the second does so by means of a holistically-nested architecture in which the loss at each scale is taken into account during optimization. As the proposed methods target real-time semantic labeling, both have a reduced number of parameters. We propose the use of parametric rectified linear units for semantic labeling in these small architectures to increase the regularization ability of the design and maintain segmentation accuracy without overfitting the training sets. We compare the proposed architectures against state-of-the-art fully convolutional networks. We validate our methods using existing benchmark datasets, including ex vivo cases with phantom tissue and different robotic surgical instruments present in the scene. Our results show a statistically significant improvement in Dice Similarity Coefficient over previous instrument segmentation methods. We analyze our design choices and discuss the key drivers for improving accuracy. Comment: paper accepted at IROS 2017.
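
    As a rough illustration of the holistically-nested idea described above, the following PyTorch sketch supervises side outputs at two scales and uses PReLU activations; the layer sizes, the two-scale design, and the equal loss weighting are illustrative assumptions, not the published ToolNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultiScaleSeg(nn.Module):
    """Toy encoder with a side output at each scale, using PReLU activations."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.PReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.PReLU())
        self.side1 = nn.Conv2d(8, 1, 1)    # full-resolution side output
        self.side2 = nn.Conv2d(16, 1, 1)   # half-resolution side output

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        return self.side1(f1), self.side2(f2)

def multiscale_loss(side_outputs, target):
    """Add a binary cross-entropy term at every scale so that each
    resolution contributes to the optimization."""
    loss = 0.0
    for s in side_outputs:
        t = F.interpolate(target, size=s.shape[-2:], mode="nearest")
        loss = loss + F.binary_cross_entropy_with_logits(s, t)
    return loss

# Usage: sides = TinyMultiScaleSeg()(images); loss = multiscale_loss(sides, masks)
# where masks is a float tensor of shape (B, 1, H, W) with values in {0, 1}.
```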

    Surgical Phase Recognition of Short Video Shots Based on Temporal Modeling of Deep Features

    Recognizing the phases of a laparoscopic surgery (LS) operation from its video constitutes a fundamental step for efficient content representation, indexing, and retrieval in surgical video databases. In the literature, most techniques focus on phase segmentation of the entire LS video using hand-crafted visual features, instrument usage signals, and, recently, convolutional neural networks (CNNs). In this paper we address the problem of phase recognition of short video shots (10 s) of the operation, without utilizing information about the preceding/forthcoming video frames, their phase labels, or the instruments used. We investigate four state-of-the-art CNN architectures (AlexNet, VGG19, GoogLeNet, and ResNet101) for feature extraction via transfer learning. Visual saliency was employed for selecting the most informative region of the image as input to the CNN. Video shot representation was based on two temporal pooling mechanisms. Most importantly, we investigate the role of 'elapsed time' (from the beginning of the operation) and show that inclusion of this feature can increase performance dramatically (69% vs. 75% mean accuracy). Finally, a long short-term memory (LSTM) network was trained for video shot classification based on the fusion of CNN features with 'elapsed time', increasing the accuracy to 86%. Our results highlight the prominent role of visual saliency, long-range temporal recursion, and 'elapsed time' (a feature ignored so far) for surgical phase recognition. Comment: 6 pages, 4 figures, 6 tables.
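
    To make the fusion step concrete, here is a minimal sketch of an LSTM shot classifier whose per-frame CNN features are concatenated with a normalized 'elapsed time' scalar. The feature dimension, hidden size, and number of phases are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ShotPhaseLSTM(nn.Module):
    """Classify a short video shot from per-frame CNN features fused with
    a normalized 'elapsed time' value."""
    def __init__(self, feat_dim=2048, hidden=256, n_phases=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, feats, elapsed):
        # feats: (B, T, feat_dim) CNN features; elapsed: (B,) in [0, 1]
        t = elapsed.view(-1, 1, 1).expand(-1, feats.size(1), 1)
        x = torch.cat([feats, t], dim=-1)   # append elapsed time to every frame
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])             # phase logits for the whole shot
```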

    2017 Robotic Instrument Segmentation Challenge

    In mainstream computer vision and machine learning, public datasets such as ImageNet, COCO, and KITTI have helped drive enormous improvements by enabling researchers to understand the strengths and limitations of different algorithms via performance comparison. However, this type of approach has had limited translation to problems in robot-assisted surgery, as this field has never established the same level of common datasets and benchmarking methods. In 2015 a sub-challenge was introduced at the EndoVis workshop in which a set of robotic images was provided with automatically generated annotations from robot forward kinematics. However, there were issues with this dataset due to the limited background variation, lack of complex motion, and inaccuracies in the annotation. In this work we present the results of the 2017 challenge on robotic instrument segmentation, which involved 10 teams participating in binary, parts, and type-based segmentation of articulated da Vinci robotic instruments.

    Automatic Detection of Out-of-body Frames in Surgical Videos for Privacy Protection Using Self-supervised Learning and Minimal Labels

    Endoscopic video recordings are widely used in minimally invasive robot-assisted surgery, but when the endoscope is outside the patient's body it can capture irrelevant segments that may contain sensitive information. To address this, we propose a framework that accurately detects out-of-body frames in surgical videos by leveraging self-supervision with minimal data labels. We use a massive amount of unlabeled endoscopic images to learn meaningful representations in a self-supervised manner. Our approach, which involves pre-training on an auxiliary task and fine-tuning with limited supervision, outperforms previous methods for detecting out-of-body frames in surgical videos captured from da Vinci X and Xi surgical systems. The average F1 scores range from 96.00 to 98.02. Remarkably, using only 5% of the training labels, our approach still maintains an average F1 score above 97, outperforming fully-supervised methods with 95% fewer labels. These results demonstrate the potential of our framework to facilitate the safe handling of surgical video recordings and enhance data privacy protection in minimally invasive surgery. Comment: a 15-page journal article submitted to the Journal of Medical Robotics Research (JMRR).
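
    The low-label regime described above can be sketched as training only a small classification head on top of a frozen, self-supervised backbone. The snippet below assumes a hypothetical `backbone` object exposing an `out_dim` attribute and returning per-frame feature vectors; it illustrates fine-tuning with a 5% label subset and is not the authors' pipeline.

```python
import torch
import torch.nn as nn

def finetune_head(backbone, frames, labels, label_fraction=0.05, epochs=5):
    """Train a small linear head on top of a frozen, self-supervised backbone
    using only a fraction of the available frame labels.
    frames: (N, C, H, W) tensor; labels: (N,) tensor of {0, 1}
    (in-body vs. out-of-body); backbone(x) is assumed to return
    a (batch, backbone.out_dim) feature tensor."""
    n = max(1, int(len(frames) * label_fraction))
    idx = torch.randperm(len(frames))[:n]          # hypothetical 5% label subset
    head = nn.Linear(backbone.out_dim, 2)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    backbone.eval()
    with torch.no_grad():                          # backbone stays frozen
        feats = backbone(frames[idx])
    for _ in range(epochs):
        loss = loss_fn(head(feats), labels[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head
```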

    De-smokeGCN: Generative Cooperative Networks for Joint Surgical Smoke Detection and Removal

    Surgical smoke removal algorithms can improve the quality of intra-operative imaging and reduce hazards in image-guided surgery, a highly desirable post-process for many clinical applications. These algorithms also enable effective computer vision tasks for future robotic surgery. In this paper, we present a new unsupervised learning framework for high-quality pixel-wise smoke detection and removal. One of the well-recognized grand challenges in using convolutional neural networks (CNNs) for medical image processing is obtaining intra-operative medical imaging datasets for network training and validation, but the availability and quality of such datasets are limited. Our novel training framework does not require ground-truth image pairs; instead, it learns purely from computer-generated simulation images. This approach opens up new avenues and bridges a substantial gap between conventional non-learning-based methods and those requiring prior knowledge gained from extensive training datasets. Inspired by the Generative Adversarial Network (GAN), we have developed a novel generative-collaborative learning scheme that decomposes the de-smoke process into two separate tasks: smoke detection and smoke removal. The detection network serves as prior knowledge and also as a loss function to maximize its support for training of the smoke removal network. Quantitative and qualitative studies show that the proposed training framework outperforms state-of-the-art de-smoking approaches, including the latest GAN frameworks such as PIX2PIX. Although the network is trained on synthetic images, experimental results on clinical images prove its effectiveness for detecting and removing surgical smoke on both simulated and real-world laparoscopic images.
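
    One way to read "the detection network is used ... as a loss function" is to weight the removal network's reconstruction error by the predicted smoke probability. The sketch below shows that idea under assumed tensor shapes and an assumed weighting; the actual cooperative loss of De-smokeGCN may differ. Clean targets are available here because, as the abstract notes, training uses computer-generated simulation images.

```python
import torch.nn.functional as F

def cooperative_desmoke_loss(removal_out, clean_target, smoke_prob):
    """Weight the removal network's per-pixel L1 error by the detection
    network's smoke probability so that smoky regions dominate training.
    removal_out, clean_target: (B, 3, H, W); smoke_prob: (B, 1, H, W) in [0, 1]."""
    per_pixel = F.l1_loss(removal_out, clean_target, reduction="none")
    weighted = per_pixel * (1.0 + smoke_prob)   # emphasize detected smoke
    return weighted.mean()
```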

    Uncertainty-Aware Organ Classification for Surgical Data Science Applications in Laparoscopy

    Objective: Surgical data science is evolving into a research field that aims to observe everything occurring within and around the treatment process to provide situation-aware, data-driven assistance. In the context of endoscopic video analysis, the accurate classification of organs in the field of view of the camera poses a technical challenge. Herein, we propose a new approach to anatomical structure classification and image tagging that features an intrinsic measure of confidence to estimate its own performance with high reliability and which can be applied to both RGB and multispectral imaging (MI) data. Methods: Organ recognition is performed using a superpixel classification strategy based on textural and reflectance information. Classification confidence is estimated by analyzing the dispersion of class probabilities. Assessment of the proposed technology is performed through a comprehensive in vivo study with seven pigs. Results: When applied to image tagging, mean accuracy in our experiments increased from 65% (RGB) and 80% (MI) to 90% (RGB) and 96% (MI) with the confidence measure. Conclusion: Results showed that the confidence measure had a significant influence on the classification accuracy, and MI data are better suited for anatomical structure labeling than RGB data. Significance: This work significantly enhances the state of the art in automatic labeling of endoscopic videos by introducing the use of the confidence metric and by being the first study to use MI data for in vivo laparoscopic tissue classification. The data of our experiments will be released as the first in vivo MI dataset upon publication of this paper. Comment: 7 pages, 6 images, 2 tables.
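
    The abstract does not specify how the "dispersion of class probabilities" is measured; a common choice, shown below purely as an assumed example, is a normalized-entropy confidence score with a rejection threshold for low-confidence superpixels.

```python
import numpy as np

def confidence_from_probs(probs):
    """Confidence of a class-probability vector: 1.0 for a one-hot
    prediction, 0.0 for a maximally dispersed (uniform) one."""
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return 1.0 - entropy / np.log(len(probs))

def tag_image(superpixel_probs, threshold=0.6):
    """Keep only the superpixel predictions whose confidence exceeds a threshold."""
    return [int(np.argmax(p)) for p in superpixel_probs
            if confidence_from_probs(p) >= threshold]

# confidence_from_probs([1.0, 0.0, 0.0]) ~= 1.0; a uniform vector gives ~0.0
```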

    Image Compositing for Segmentation of Surgical Tools Without Manual Annotations

    Producing manual, pixel-accurate image segmentation labels is tedious and time-consuming. This is often a rate-limiting factor when large amounts of labeled images are required, such as for training deep convolutional networks for instrument-background segmentation in surgical scenes. No large datasets comparable to industry standards in the computer vision community are available for this task. To circumvent this problem, we propose to automate the creation of a realistic training dataset by exploiting techniques stemming from special effects and harnessing them to target training performance rather than visual appeal. Foreground data is captured by placing sample surgical instruments over a chroma key (a.k.a. green screen) in a controlled environment, making extraction of the relevant image segment straightforward. Multiple lighting conditions and viewpoints can be captured and introduced in the simulation by moving the instruments and camera and by modulating the light source. Background data is captured by collecting videos that do not contain instruments. In the absence of pre-existing instrument-free background videos, only minimal labeling effort is required: selecting frames that do not contain surgical instruments from videos of surgical interventions freely available online. We compare different methods of blending instruments over tissue and propose a novel data augmentation approach that takes advantage of the plurality of options. We show that by training a vanilla U-Net on semi-synthetic data only and applying simple post-processing, we are able to match the results of the same network trained on a publicly available, manually labeled real dataset.
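
    A minimal version of the chroma-key extraction and blending steps could look like the OpenCV/NumPy sketch below; the HSV thresholds and the Gaussian edge feathering are assumptions for illustration, not the paper's blending or augmentation pipeline.

```python
import cv2
import numpy as np

def chroma_key_mask(fg_bgr, lower=(35, 80, 80), upper=(85, 255, 255)):
    """Binary instrument mask: everything that is NOT green-screen."""
    hsv = cv2.cvtColor(fg_bgr, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
    return (green == 0).astype(np.uint8)          # 1 = instrument pixel

def composite(fg_bgr, bg_bgr, mask, ksize=5):
    """Blend the extracted instrument over a tissue background with a
    feathered alpha edge; the mask itself is the free segmentation label."""
    alpha = cv2.GaussianBlur(mask.astype(np.float32), (ksize, ksize), 0)[..., None]
    return (alpha * fg_bgr + (1.0 - alpha) * bg_bgr).astype(np.uint8)
```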