Frequency Dropout: Feature-Level Regularization via Randomized Filtering
Deep convolutional neural networks have shown remarkable performance on
various computer vision tasks, and yet they are susceptible to picking up
spurious correlations from the training signal. So-called 'shortcuts' can occur
during learning, for example, when specific frequencies present in the image
data correlate with the output predictions. Both high and low frequencies can
be characteristic of the underlying noise distribution introduced by image
acquisition rather than of the task-relevant image content. Models that learn
features related to this characteristic noise will not generalize well to new
data.
In this work, we propose a simple yet effective training strategy, Frequency
Dropout, to prevent convolutional neural networks from learning
frequency-specific imaging features. We employ randomized filtering of feature
maps during training, which acts as a feature-level regularization. In this
study, we consider common image processing filters such as Gaussian smoothing,
Laplacian of Gaussian, and Gabor filtering. Our training strategy is
model-agnostic and can be used for any computer vision task. We demonstrate the
effectiveness of Frequency Dropout on a range of popular architectures and
multiple tasks including image classification, domain adaptation, and semantic
segmentation using both computer vision and medical imaging datasets. Our
results suggest that the proposed approach not only improves predictive
accuracy but also improves robustness against domain shift.
Comment: 15 pages
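As a rough illustration of the idea, a feature-level frequency dropout layer could look like the following PyTorch sketch. The Gaussian-only filtering, dropout probability, and sigma range here are illustrative placeholders rather than the paper's exact configuration (which also considers Laplacian-of-Gaussian and Gabor filters).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyDropout(nn.Module):
    """Randomly low-pass filters feature maps during training.

    Illustrative sketch: the paper also considers Laplacian-of-Gaussian and
    Gabor filters; the probability p and sigma range are placeholders.
    """
    def __init__(self, p=0.5, sigma_range=(0.5, 2.0), kernel_size=5):
        super().__init__()
        self.p = p
        self.sigma_range = sigma_range
        self.kernel_size = kernel_size

    def forward(self, x):
        # Identity at inference time or when the dropout coin flip fails.
        if not self.training or torch.rand(()).item() > self.p:
            return x
        # Sample a random Gaussian bandwidth for this forward pass.
        sigma = torch.empty(()).uniform_(*self.sigma_range).item()
        k = self.kernel_size
        coords = torch.arange(k, dtype=x.dtype, device=x.device) - (k - 1) / 2
        g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        g = g / g.sum()
        kernel = torch.outer(g, g).view(1, 1, k, k)
        # Depthwise convolution: smooth each channel independently.
        weight = kernel.expand(x.shape[1], 1, k, k).contiguous()
        return F.conv2d(x, weight, padding=k // 2, groups=x.shape[1])
```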
Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery
The visual-question localized-answering (VQLA) system can serve as a
knowledgeable assistant in surgical education. Besides providing text-based
answers, the VQLA system can highlight the region of interest for better
surgical scene understanding. However, deep neural networks (DNNs) suffer from
catastrophic forgetting when learning new knowledge. Specifically, when DNNs
learn on incremental classes or tasks, their performance on old tasks drops
dramatically. Furthermore, due to medical data privacy and licensing issues, it
is often difficult to access old data when updating continual learning (CL)
models. Therefore, we develop a non-exemplar continual surgical VQLA framework,
to explore and balance the rigidity-plasticity trade-off of DNNs in a
sequential learning paradigm. We revisit the distillation loss in CL tasks, and
propose rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated
heterogeneous distillation (SH-Dist) to preserve the old knowledge. The weight
aligning (WA) technique is also integrated to adjust the weight bias between
old and new tasks. We further establish a CL framework on three public surgical
datasets, in settings where old and new surgical VQLA tasks share overlapping
classes. With extensive experiments, we demonstrate that our proposed method
reconciles learning and forgetting on continual surgical VQLA better than
conventional CL methods. Our
code is publicly accessible.
Comment: To appear in MICCAI 2023. Code availability:
https://github.com/longbai1006/CS-VQL
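The abstract does not spell out the exact form of RP-Dist and SH-Dist; as a hedged sketch, the snippet below shows the generic temperature-scaled logit distillation that such losses typically build on, together with the weight aligning (WA) step that rescales new-class classifier weights to reduce bias toward recent tasks. Function names and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, T=2.0):
    # Temperature-scaled KL between the frozen old model's predictions and
    # the current model's; the T*T factor keeps gradient magnitudes stable.
    p_old = F.softmax(old_logits / T, dim=1)
    log_p_new = F.log_softmax(new_logits / T, dim=1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * (T * T)

@torch.no_grad()
def weight_align(fc_weight, old_idx, new_idx):
    # Weight aligning (WA): rescale new-class weight vectors so their mean
    # norm matches the old classes', counteracting bias toward new tasks.
    norm_old = fc_weight[old_idx].norm(dim=1).mean()
    norm_new = fc_weight[new_idx].norm(dim=1).mean()
    fc_weight[new_idx] *= norm_old / norm_new
```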
Robustness Stress Testing in Medical Image Classification
Deep neural networks have shown impressive performance for image-based
disease detection. Performance is commonly evaluated through clinical
validation on independent test sets to demonstrate clinically acceptable
accuracy. Reporting good performance metrics on test sets, however, is not
always a sufficient indication of the generalizability and robustness of an
algorithm. In particular, when the test data is drawn from the same
distribution as the training data, the i.i.d. test set performance can be an
unreliable estimate of accuracy on new data. In this paper, we employ
stress testing to assess model robustness and subgroup performance disparities
in disease detection models. We design progressive stress testing using five
different bidirectional and unidirectional image perturbations with six
different severity levels. As a use case, we apply stress tests to measure the
robustness of disease detection models for chest X-ray and skin lesion images,
and demonstrate the importance of studying class and domain-specific model
behaviour. Our experiments indicate that some models may yield more robust and
equitable performance than others. We also find that pretraining
characteristics play an important role in downstream robustness. We conclude
that progressive stress testing is a viable and important tool and should
become standard practice in the clinical validation of image-based disease
detection models.
Comment: 11 pages
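A progressive stress test of this kind reduces to a simple evaluation loop. The sketch below assumes a hypothetical model.predict interface and perturbation functions parameterized by severity; these are placeholders, not the paper's actual harness.

```python
import numpy as np

def stress_test(model, images, labels, perturbations, severities=(1, 2, 3, 4, 5, 6)):
    # `perturbations` maps a name to a function fn(image, severity) -> image;
    # `model.predict` is a hypothetical batched-inference interface.
    results = {}
    for name, perturb in perturbations.items():
        for s in severities:
            x = np.stack([perturb(img, s) for img in images])
            preds = model.predict(x)
            # Record accuracy per (perturbation, severity) condition.
            results[(name, s)] = float((preds == labels).mean())
    return results
```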
Angular Gap: Reducing the Uncertainty of Image Difficulty through Model Calibration
Curriculum learning needs a measure of example difficulty to proceed from easy to hard. However, the credibility of image difficulty is rarely investigated, which can seriously affect the effectiveness of curricula. In this work, we propose Angular Gap, a measure of difficulty based on the difference in angular distance between feature embeddings and class-weight embeddings built by hyperspherical learning. To validate the difficulty estimates, we introduce class-wise model calibration, as a post-training technique, to the learnt hyperspherical space. This bridges the gap between probabilistic model calibration and the angular distance estimation of hyperspherical learning. We show the superiority of our calibrated Angular Gap over recent difficulty metrics on CIFAR10-H and ImageNetV2. We further propose a curriculum based on Angular Gap for unsupervised domain adaptation that can translate from learning easy samples to mining hard samples. We combine this curriculum with a state-of-the-art self-training method, Cycle Self-Training (CST). The proposed Curricular CST learns robust representations and outperforms recent baselines on Office31 and VisDA 2017.
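As a hedged sketch of the core quantity (the paper's exact, calibrated formulation may differ), one plausible reading of the angular gap is the angle between a sample's feature embedding and its true-class weight, minus the smallest angle to any competing class weight:

```python
import torch
import torch.nn.functional as F

def angular_gap(features, class_weights, labels):
    # Cosine similarities between L2-normalized embeddings and class weights.
    f = F.normalize(features, dim=1)                      # (N, D)
    w = F.normalize(class_weights, dim=1)                 # (C, D)
    angles = torch.acos((f @ w.t()).clamp(-1.0, 1.0))     # (N, C) angles
    true_angle = angles.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Mask out the true class, then take the nearest competing class.
    others = angles.scatter(1, labels.unsqueeze(1), float("inf"))
    nearest_other = others.min(dim=1).values
    # Positive gap: the sample sits closer to a wrong class (harder).
    return true_angle - nearest_other
```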
Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Medical students and junior surgeons often rely on senior surgeons and
specialists to answer their questions when learning surgery. However, experts
are often busy with clinical and academic work, and have little time to give
guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question
Answering (VQA) systems can only provide simple answers without the location of
the answers. In addition, vision-language (ViL) embedding is still a
less-explored research area in these kinds of tasks. Therefore, a surgical Visual
Question Localized-Answering (VQLA) system would be helpful for medical
students and junior surgeons to learn and understand from recorded surgical
videos. We propose an end-to-end Transformer with a Co-Attention gaTed
Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does
not require feature extraction through detection models. The CAT-ViL embedding
module is designed to fuse heterogeneous features from visual and textual
sources. The fused embedding is then fed into a standard Data-Efficient Image
Transformer (DeiT) module before the parallel classifier and detector for
joint prediction. We conduct experimental validation on public surgical
videos from MICCAI EndoVis Challenge 2017 and 2018. The experimental results
highlight the superior performance and robustness of our proposed model
compared to the state-of-the-art approaches. Ablation studies further
demonstrate the contribution of each proposed component. The proposed method
provides a promising solution for surgical scene understanding, and takes a
first step toward an Artificial Intelligence (AI)-based VQLA system for surgical
training. Our code is publicly available.
Comment: To appear in MICCAI 2023. Code availability:
https://github.com/longbai1006/CAT-Vi
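The abstract does not detail the fusion internals; as a minimal sketch, a gated vision-language fusion of the flavor CAT-ViL builds on could look like the following, where the co-attention stage is omitted and the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GatedVisionLanguageFusion(nn.Module):
    # A learned sigmoid gate decides, per dimension, how much visual versus
    # textual signal enters the fused embedding.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, text):
        g = torch.sigmoid(self.gate(torch.cat([visual, text], dim=-1)))
        return g * visual + (1.0 - g) * text
```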
Generalizing Surgical Instruments Segmentation to Unseen Domains with One-to-Many Synthesis
Despite their impressive performance in various surgical scene understanding
tasks, deep learning-based methods are frequently hindered from deployment in
real-world surgical applications for various reasons. In particular, data
collection, annotation, and domain shift between sites and patients are the
most common obstacles. In this work, we mitigate data-related issues by
efficiently leveraging minimal source images to generate synthetic surgical
instrument segmentation datasets and achieve outstanding generalization
performance on unseen real domains. Specifically, in our framework, only one
background tissue image and at most three images of each foreground instrument
are taken as the seed images. These source images are extensively transformed
and employed to build up the foreground and background image pools, from which
randomly sampled tissue and instrument images are composed with multiple
blending techniques to generate new surgical scene images. In addition, we
introduce hybrid training-time augmentations to further diversify the training
data. Extensive evaluation on three real-world datasets, i.e., Endo2017,
Endo2018, and RoboTool, demonstrates that our one-to-many framework for
synthetic surgical instrument dataset generation and segmentation can achieve
encouraging performance compared with training on real data. Notably, on the
RoboTool dataset, where a more significant domain gap exists, our framework
demonstrates superior generalization by a considerable margin. We expect that
our results will attract research attention to improving model generalization
with data synthesis.
Comment: First two authors contributed equally. Accepted by IROS202
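As an illustrative sketch of the compositing step (assuming precomputed foreground and background pools of equally sized images and binary instrument masks; the paper uses multiple blending modes, of which only plain alpha blending is shown):

```python
import random
import numpy as np

def compose_scene(background_pool, instrument_pool, mask_pool):
    # Sample a tissue background and an instrument with its binary mask,
    # then alpha-blend the instrument onto the background.
    bg = random.choice(background_pool).astype(np.float32)
    i = random.randrange(len(instrument_pool))
    fg = instrument_pool[i].astype(np.float32)
    alpha = mask_pool[i].astype(np.float32)[..., None]    # (H, W, 1), in [0, 1]
    image = alpha * fg + (1.0 - alpha) * bg
    label = (alpha[..., 0] > 0.5).astype(np.uint8)        # segmentation mask
    return image.astype(np.uint8), label
```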
Confidence-Aware Paced-Curriculum Learning by Label Smoothing for Surgical Scene Understanding
Curriculum learning and self-paced learning are training strategies that gradually feed samples from easy to more complex. They have attracted increasing attention due to their excellent performance in robotic vision. Most recent works focus on designing curricula based on the difficulty levels of input samples or on smoothing the feature maps. However, smoothing labels to control the learning utility in a curriculum manner is still unexplored. In this work, we design a paced curriculum by label smoothing (P-CBLS), using paced learning with uniform label smoothing (ULS) for classification tasks and fusing uniform and spatially varying label smoothing (SVLS) for semantic segmentation tasks in a curriculum manner. In ULS and SVLS, a bigger smoothing factor enforces a heavier smoothing penalty on the true label, forcing the model to learn less information from it. We therefore design the curriculum by label smoothing (CBLS): we set a bigger smoothing value at the beginning of training and gradually decrease it to zero, controlling the model's learning utility from lower to higher. We also design a confidence-aware pacing function and combine it with our CBLS to investigate the benefits of various curricula. The proposed techniques are validated on four robotic surgery datasets covering multi-class classification, multi-label classification, captioning, and segmentation tasks. We also investigate the robustness of our method by corrupting the validation data at different severity levels. Our extensive analysis shows that the proposed method improves prediction accuracy and robustness. The code is publicly available at https://github.com/XuMengyaAmy/P-CBLS.

Note to Practitioners: The motivation of this article is to improve the performance and robustness of deep neural networks in safety-critical applications such as robotic surgery by controlling the learning ability of the model in a curriculum learning manner, allowing the model to imitate the cognitive process of humans and animals. The designed approaches do not add parameters that require additional computational resources.
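As a minimal sketch of the CBLS schedule described above (the initial smoothing value and the linear decay are illustrative assumptions; the paper additionally pairs this with a confidence-aware pacing function):

```python
import torch.nn.functional as F

def cbls_epsilon(epoch, total_epochs, initial_eps=0.3):
    # Linear decay from a large smoothing factor to zero over training,
    # so the model learns from heavily smoothed labels first.
    return initial_eps * max(0.0, 1.0 - epoch / total_epochs)

def smoothed_ce(logits, targets, eps):
    # Cross-entropy with uniform label smoothing (ULS) of strength eps.
    return F.cross_entropy(logits, targets, label_smoothing=eps)
```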