2018 Robotic Scene Segmentation Challenge
In 2015 we began a sub-challenge at the EndoVis workshop at MICCAI in Munich using endoscope images of ex-vivo tissue with automatically generated annotations from robot forward kinematics and instrument CAD models. However, the limited background variation and simple motion made the dataset uninformative for learning which techniques would be suitable for segmentation in real surgery. In 2017, at the same workshop in Quebec, we introduced the robotic instrument segmentation dataset, with 10 teams participating in the challenge to perform binary, articulating-parts, and type segmentation of da Vinci instruments. This challenge included realistic instrument motion and more complex porcine tissue as background, and was widely addressed with modifications of U-Net and other popular CNN architectures. In 2018 we added to the complexity by introducing a set of anatomical objects and medical devices to the segmented classes. To avoid over-complicating the challenge, we continued with porcine data, which is dramatically simpler than human tissue due to the lack of fatty tissue occluding many organs.
Scalable Joint Detection and Segmentation of Surgical Instruments with Weak Supervision
Computer vision based models, such as object segmentation, detection and tracking, have the potential to assist surgeons intra-operatively and improve the quality and outcomes of minimally invasive surgery. Different work streams towards instrument detection include segmentation, bounding box localisation and classification. While segmentation models offer much more granular results, bounding box annotations are easier to produce at scale. To combine the granularity of segmentation approaches with the scalability of bounding box-based models, a multi-task model for joint bounding box detection and segmentation of surgical instruments is proposed. The model consists of a shared backbone and three independent heads for the tasks of classification, bounding box regression, and segmentation. Using adaptive losses together with simple yet effective weakly-supervised label inference, the proposed model uses weak labels to learn to segment surgical instruments with only a fraction of the dataset requiring segmentation masks. Results suggest that instrument detection and segmentation tasks share intrinsic challenges and that jointly learning from both reduces the burden of annotating masks at scale. Experimental validation shows that the proposed model obtains results comparable to those of single-task state-of-the-art detection and segmentation models, while requiring only a fraction of the dataset to be annotated with masks. Specifically, the proposed model obtained 0.81 weighted average precision (wAP) and 0.73 mean intersection-over-union (IoU) on the Endovis2018 dataset with 1% of masks annotated, while performing joint detection and segmentation at more than 20 frames per second.
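The shared-backbone, three-head layout with a segmentation loss applied only to the subset of samples that carry masks can be pictured with a short PyTorch sketch. The backbone choice, head shapes, and loss weighting below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a shared-backbone multi-task model with classification,
# box-regression, and segmentation heads. Assumed: ResNet-50 backbone, one box
# per image, equal loss weights; none of these are taken from the paper.
import torch
import torch.nn as nn
import torchvision

class JointDetSegModel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        feat_dim = 2048
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(feat_dim, num_classes))
        self.box_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(feat_dim, 4))
        self.seg_head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False))

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.box_head(feats), self.seg_head(feats)

def joint_loss(outputs, targets, has_mask):
    """Weakly supervised setup: the segmentation term is only applied to the
    (small) subset of samples that actually carry mask annotations."""
    logits, boxes, seg = outputs
    loss = nn.functional.cross_entropy(logits, targets["label"])
    loss = loss + nn.functional.smooth_l1_loss(boxes, targets["box"])
    if has_mask.any():
        loss = loss + nn.functional.cross_entropy(seg[has_mask], targets["mask"][has_mask])
    return loss
```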
From Generalization to Precision: Exploring SAM for Tool Segmentation in Surgical Environments
Purpose: Accurate tool segmentation is essential in computer-aided procedures. However, this task is challenging due to the presence of artifacts and the limited training data in medical scenarios. Methods that generalize to unseen data represent an interesting avenue, where zero-shot segmentation offers an option to account for data limitations. Initial exploratory work with the Segment Anything Model (SAM) shows that bounding-box-based prompting presents notable zero-shot generalization. However, point-based prompting leads to degraded performance that further deteriorates under image corruption. We argue that SAM drastically over-segments images with high corruption levels, resulting in degraded performance when only a single segmentation mask is considered, while the combination of the masks overlapping the object of interest generates an accurate prediction. Methods: We use SAM to generate the over-segmented prediction of endoscopic frames. Then, we employ the ground-truth tool mask to analyze the results of SAM when the best single mask is selected as the prediction and when all the individual masks overlapping the object of interest are combined to obtain the final predicted mask. We analyze the Endovis18 and Endovis17 instrument segmentation datasets using synthetic corruptions of various strengths and an in-house dataset featuring counterfactually created real-world corruptions. Results: Combining the over-segmented masks contributes to improvements in the IoU. Furthermore, selecting the best single segmentation presents a competitive IoU score for clean images. Conclusions: Combined SAM predictions present improved results and robustness up to a certain corruption level. However, appropriate prompting strategies are fundamental for implementing these models in the medical domain.
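The best-single versus combined-mask analysis described above can be sketched with the segment-anything package: run the automatic mask generator on a frame, then either keep the single proposal that best overlaps the ground-truth tool mask or take the union of all overlapping proposals. The checkpoint path and the overlap threshold are illustrative assumptions.

```python
# Sketch of the mask-selection/combination analysis, assuming the
# segment-anything package and a locally downloaded ViT-H checkpoint.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed local checkpoint
generator = SamAutomaticMaskGenerator(sam)

def best_and_combined(image: np.ndarray, gt_mask: np.ndarray, overlap_thr: float = 0.1):
    """Return (best single mask, union of overlapping masks) for one RGB frame."""
    proposals = [m["segmentation"] for m in generator.generate(image)]
    overlapping = [p for p in proposals if iou(p, gt_mask) > overlap_thr]
    if not overlapping:
        empty = np.zeros_like(gt_mask, dtype=bool)
        return empty, empty
    best = max(overlapping, key=lambda p: iou(p, gt_mask))
    combined = np.logical_or.reduce(overlapping)
    return best, combined
```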
Towards Holistic Surgical Scene Understanding
Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations, benefiting from the representation learned on the instrument detection task to improve its classification capacity. Our experimental results on both PSI-AVA and other publicly available databases demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
Comment: MICCAI 2022 Oral
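The core idea of reusing detection features for classification can be illustrated with a highly simplified PyTorch sketch: pooled box features from an instrument detector act as tokens for a temporal transformer that classifies phase, step, or action. The dimensions, token construction, and detector are assumptions for illustration, not the actual TAPIR architecture.

```python
# Simplified sketch: detector box features -> transformer -> clip-level label.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=11, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls = nn.Linear(feat_dim, num_classes)

    def forward(self, tokens):                # tokens: (batch, num_box_tokens, feat_dim)
        encoded = self.encoder(tokens)
        return self.cls(encoded.mean(dim=1))  # one label per clip (e.g. phase or step)

# Usage with dummy data standing in for pooled instrument-detector features.
detector_features = torch.randn(2, 16, 256)  # 2 clips, 16 box tokens each
phase_logits = TemporalHead(num_classes=11)(detector_features)
```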
Generalizing Surgical Instruments Segmentation to Unseen Domains with One-to-Many Synthesis
Despite their impressive performance in various surgical scene understanding tasks, deep learning-based methods are frequently hindered from deployment in real-world surgical applications for various reasons. In particular, data collection, annotation, and domain shift between sites and patients are the most common obstacles. In this work, we mitigate data-related issues by efficiently leveraging minimal source images to generate synthetic surgical instrument segmentation datasets and achieve outstanding generalization performance on unseen real domains. Specifically, in our framework, only one background tissue image and at most three images of each foreground instrument are taken as the seed images. These source images are extensively transformed and employed to build up the foreground and background image pools, from which randomly sampled tissue and instrument images are composed with multiple blending techniques to generate new surgical scene images. In addition, we introduce hybrid training-time augmentations to further diversify the training data. Extensive evaluation on three real-world datasets, i.e., Endo2017, Endo2018, and RoboTool, demonstrates that our one-to-many synthetic surgical instrument dataset generation and segmentation framework achieves encouraging performance compared with training on real data. Notably, on the RoboTool dataset, where a more significant domain gap exists, our framework shows superior generalization by a considerable margin. We expect that these encouraging results will attract research attention to improving model generalization with data synthesis.
Comment: First two authors contributed equally. Accepted by IROS202
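The composition step sketched below pairs a sampled background with a sampled instrument crop and alpha-blends them while emitting the corresponding mask. The pool contents, the single blending rule, and the scale/placement ranges are illustrative assumptions; the paper mixes several blending techniques and richer augmentations.

```python
# Loose sketch of composing one synthetic surgical scene from image pools.
import random
import numpy as np
import cv2

def compose_scene(background_pool, instrument_pool, out_size=(512, 512)):
    """background_pool: list of HxWx3 uint8 tissue images.
    instrument_pool: list of (rgb, alpha) pairs, alpha in [0, 1] with shape HxW."""
    bg = cv2.resize(random.choice(background_pool), out_size).astype(np.float32)
    tool_rgb, tool_alpha = random.choice(instrument_pool)

    # scale the instrument so it always fits inside the background, then place it randomly
    scale = random.uniform(0.3, 0.8) * min(out_size[0] / tool_rgb.shape[1],
                                           out_size[1] / tool_rgb.shape[0])
    w, h = max(1, int(tool_rgb.shape[1] * scale)), max(1, int(tool_rgb.shape[0] * scale))
    tool_rgb = cv2.resize(tool_rgb, (w, h)).astype(np.float32)
    tool_alpha = cv2.resize(tool_alpha, (w, h))
    x = random.randint(0, out_size[0] - w)
    y = random.randint(0, out_size[1] - h)

    # simple alpha blending of the instrument onto the tissue background
    a = tool_alpha[..., None]
    bg[y:y + h, x:x + w] = a * tool_rgb + (1 - a) * bg[y:y + h, x:x + w]

    mask = np.zeros((out_size[1], out_size[0]), dtype=np.uint8)
    mask[y:y + h, x:x + w] = (tool_alpha > 0.5).astype(np.uint8)
    return bg.astype(np.uint8), mask
```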
Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still heavily rely on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic workloads and have limited time for answering. For this purpose, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most existing visual question answering (VQA) methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (i) surgical object detection models are scarce due to small datasets and a lack of bounding box annotations; (ii) the current fusion strategy for heterogeneous modalities like text and image is naive; (iii) localized answering is missing, which is crucial in complex surgical scenarios. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during answer prediction. To deal with the fusion of heterogeneous modalities, we design a gated vision-language embedding (GVLE) to build input patches for the Language Vision Transformer (LViT) to predict the answer. To obtain localization, we add a detection head in parallel with the prediction head of the LViT. We also integrate a generalized intersection over union (GIoU) loss to boost localization performance while preserving the accuracy of the question-answering model. We annotate two VQLA datasets by utilizing publicly available surgical videos from the EndoVis-17 and 18 MICCAI challenges. Our validation results suggest that Surgical-VQLA can better understand the surgical scene and localize the specific area related to the question and answer. GVLE presents an efficient vision-language embedding technique, showing superior performance over existing benchmarks.
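The ingredients named above (gated fusion of vision and text embeddings, an answer head and a box head in parallel, and a GIoU term in the loss) can be pictured with a short PyTorch sketch. The embedding size, the gating form, the assumption that vision and text token sequences are length-aligned, and the loss weighting are all illustrative assumptions, not the paper's exact model.

```python
# Rough sketch of gated vision-language fusion with parallel answer and box heads.
import torch
import torch.nn as nn
from torchvision.ops import generalized_box_iou_loss

class GatedVQLA(nn.Module):
    def __init__(self, dim=768, num_answers=18):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.answer_head = nn.Linear(dim, num_answers)  # question answering
        self.box_head = nn.Linear(dim, 4)               # localization, normalized (x1, y1, x2, y2)

    def forward(self, vis_tokens, txt_tokens):
        # assumes vis_tokens and txt_tokens share shape (batch, seq, dim);
        # the gate decides per token how much vision vs. text to pass through
        g = self.gate(torch.cat([vis_tokens, txt_tokens], dim=-1))
        fused = g * vis_tokens + (1 - g) * txt_tokens
        pooled = self.encoder(fused).mean(dim=1)
        return self.answer_head(pooled), self.box_head(pooled).sigmoid()

def vqla_loss(ans_logits, boxes, target_ans, target_boxes, giou_weight=1.0):
    # a real model would also enforce valid box ordering before the GIoU term
    ce = nn.functional.cross_entropy(ans_logits, target_ans)
    giou = generalized_box_iou_loss(boxes, target_boxes, reduction="mean")
    return ce + giou_weight * giou
```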
Task-Aware Asynchronous Multi-Task Model with Class Incremental Contrastive Learning for Surgical Scene Understanding
Purpose: Surgical scene understanding with tool-tissue interaction recognition and automatic report generation can play an important role in intra-operative guidance, decision-making and postoperative analysis in robotic surgery. However, domain shifts between different surgeries, with inter- and intra-patient variation and the appearance of novel instruments, degrade the performance of model predictions. Moreover, these tasks typically require outputs from multiple models, which can be computationally expensive and affect real-time performance.
Methodology: A multi-task learning (MTL) model is proposed for surgical report generation and tool-tissue interaction prediction that deals with domain shift problems. The model consists of a shared feature extractor, a mesh-transformer branch for captioning, and a graph-attention branch for tool-tissue interaction prediction. The shared feature extractor employs class-incremental contrastive learning (CICL) to tackle intensity shift and the appearance of novel classes in the target domain. We design Laplacian-of-Gaussian (LoG) based curriculum learning into both the shared and task-specific branches to enhance model learning. We incorporate a task-aware asynchronous MTL optimization technique to fine-tune the shared weights and converge both tasks optimally.
Results: The proposed MTL model trained using task-aware optimization and fine-tuning techniques reported a balanced performance (BLEU score of 0.4049 for scene captioning and accuracy of 0.3508 for interaction detection) for both tasks on the target domain and performed on par with single-task models in domain adaptation.
Conclusion: The proposed multi-task model was able to adapt to domain shifts, incorporate novel instruments in the target domain, and perform tool-tissue interaction detection and report generation on par with single-task models.
Comment: Manuscript accepted in the International Journal of Computer Assisted Radiology and Surgery. Code available: https://github.com/lalithjets/Domain-adaptation-in-MT
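One way to picture "task-aware asynchronous MTL optimization" is alternating updates of the two task branches while the shared feature extractor is fine-tuned on every step. The sketch below is a loose interpretation under that assumption; the hypothetical `shared`, `caption_branch`, and `interaction_branch` modules (each exposing a `.loss(...)` method), the even/odd alternation rule, and the optimizer setup are all illustrative, not the paper's recipe.

```python
# Loose PyTorch sketch of alternating (asynchronous) multi-task updates with a
# shared feature extractor fine-tuned at every step.
import torch

def asynchronous_mtl_step(step, batch, shared, caption_branch, interaction_branch,
                          opt_shared, opt_caption, opt_interaction):
    feats = shared(batch["image"])
    if step % 2 == 0:   # even steps: report-generation (captioning) task
        loss = caption_branch.loss(feats, batch["caption"])
        opt_caption.zero_grad(); opt_shared.zero_grad()
        loss.backward()
        opt_caption.step(); opt_shared.step()
    else:               # odd steps: tool-tissue interaction task
        loss = interaction_branch.loss(feats, batch["interaction"])
        opt_interaction.zero_grad(); opt_shared.zero_grad()
        loss.backward()
        opt_interaction.step(); opt_shared.step()
    return loss.item()
```

In this reading, the shared optimizer would typically use a smaller learning rate than the task-specific ones so that fine-tuning the shared weights does not destabilize either branch.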