Scalable Joint Detection and Segmentation of Surgical Instruments with Weak Supervision
Computer vision models, such as object segmentation, detection and tracking, have the potential to assist surgeons intra-operatively and improve the quality and outcomes of minimally invasive surgery. Work streams towards instrument detection include segmentation, bounding box localisation and classification. While segmentation models offer much more granular results, bounding box annotations are easier to produce at scale. To combine the granularity of segmentation approaches with the scalability of bounding box-based models, a multi-task model for joint bounding box detection and segmentation of surgical instruments is proposed. The model consists of a shared backbone and three independent heads for the tasks of classification, bounding box regression, and segmentation. Using adaptive losses together with simple yet effective weakly-supervised label inference, the proposed model uses weak labels to learn to segment surgical instruments with only a fraction of the dataset requiring segmentation masks. Results suggest that instrument detection and segmentation tasks share intrinsic challenges, and that jointly learning from both reduces the burden of annotating masks at scale. Experimental validation shows that the proposed model obtains results comparable to those of single-task state-of-the-art detection and segmentation models, while requiring only a fraction of the dataset to be annotated with masks. Specifically, the proposed model obtained 0.81 weighted average precision (wAP) and 0.73 mean intersection-over-union (IoU) on the EndoVis2018 dataset with 1% annotated masks, while performing joint detection and segmentation at more than 20 frames per second.
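The shared-backbone, three-head design described above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the feature size, the 7-class count, and the 16x16 mask resolution are all assumptions, and random weights stand in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random, untrained weight matrix standing in for a learned layer.
    return rng.standard_normal((in_dim, out_dim)) * 0.01

# Shared backbone: one projection whose output feeds all three heads.
W_backbone = linear(32 * 32 * 3, 64)
# Three independent heads: classification, box regression, segmentation.
W_cls = linear(64, 7)            # 7 instrument classes (illustrative)
W_box = linear(64, 4)            # (x, y, w, h) regression targets
W_mask = linear(64, 16 * 16)     # coarse 16x16 mask logits

def forward(image):
    feat = image.reshape(-1) @ W_backbone          # shared features
    logits = feat @ W_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over classes
    box = feat @ W_box                             # box head output
    mask = 1 / (1 + np.exp(-(feat @ W_mask)))      # sigmoid mask probs
    return probs, box, mask.reshape(16, 16)

probs, box, mask = forward(rng.standard_normal((32, 32, 3)))
```

Because the backbone features are shared, a mask-annotation-poor batch can still update the backbone through the classification and box losses, which is the intuition behind training with only a fraction of masks.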
Proposal Flow: Semantic Correspondences from Object Proposals
Finding image correspondences remains a challenging problem in the presence
of intra-class variations and large changes in scene layout. Semantic flow
methods are designed to handle images depicting different instances of the same
object or scene category. We introduce a novel approach to semantic flow,
dubbed proposal flow, that establishes reliable correspondences using object
proposals. Unlike prevailing semantic flow approaches that operate on pixels or
regularly sampled local regions, proposal flow benefits from the
characteristics of modern object proposals, that exhibit high repeatability at
multiple scales, and can take advantage of both local and geometric consistency
constraints among proposals. We also show that the corresponding sparse
proposal flow can effectively be transformed into a conventional dense flow
field. We introduce two new challenging datasets that can be used to evaluate
both general semantic flow techniques and region-based approaches such as
proposal flow. We use these benchmarks to compare different matching
algorithms, object proposals, and region features within proposal flow, to the
state of the art in semantic flow. This comparison, along with experiments on
standard datasets, demonstrates that proposal flow significantly outperforms
existing semantic flow methods in various settings.
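The core matching step can be sketched as follows. This is a toy illustration of scoring proposal pairs by appearance similarity plus a geometric-consistency term, not the paper's actual algorithm; the cosine appearance score, the box-scale penalty, and the weight `w_geom` are all assumptions.

```python
import numpy as np

def match_proposals(feat_a, feat_b, boxes_a, boxes_b, w_geom=0.5):
    # Toy proposal matching: appearance (cosine) similarity plus a
    # geometric-consistency penalty on box-scale mismatch, then the
    # best-scoring proposal in image B for each proposal in image A.
    feat_a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    feat_b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    appearance = feat_a @ feat_b.T                    # (Na, Nb) similarities
    area_a = boxes_a[:, 2] * boxes_a[:, 3]            # boxes are (x, y, w, h)
    area_b = boxes_b[:, 2] * boxes_b[:, 3]
    scale_gap = np.abs(np.log(area_a[:, None] / area_b[None, :]))
    return np.argmax(appearance - w_geom * scale_gap, axis=1)

rng = np.random.default_rng(0)
feat_a = rng.standard_normal((4, 16))
feat_b = feat_a[[2, 0, 3, 1]]                  # image B: same regions, shuffled
boxes = np.tile([0.0, 0.0, 10.0, 10.0], (4, 1))
matches = match_proposals(feat_a, feat_b, boxes, boxes)   # recovers the shuffle
```

The sparse matches returned here are what a dense-flow step would then interpolate into a conventional flow field.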
Discovery-and-Selection: Towards Optimal Multiple Instance Learning for Weakly Supervised Object Detection
Weakly supervised object detection (WSOD) is a challenging task that requires
simultaneously learning object classifiers and estimating object locations under the
supervision of image category labels. A major line of WSOD methods roots in
multiple instance learning which regards images as bags of instances and
selects positive instances from each bag to learn the detector. However, a
major challenge arises: the detector tends to converge to
discriminative object parts rather than whole objects. In this paper,
under the hypothesis that optimal solutions are included in local minima, we
propose a discovery-and-selection approach fused with multiple instance
learning (DS-MIL), which finds rich local minima and selects the optimal solution
from multiple local minima. To implement DS-MIL, an attention module is
proposed so that more context information can be captured by feature maps and
more valuable proposals can be collected during training. With proposal
candidates, a selection module is proposed to select informative instances for
the object detector. Experimental results on commonly used benchmarks show that our
proposed DS-MIL approach can consistently improve the baselines, reporting
state-of-the-art performance.
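The bag-of-instances selection at the heart of MIL-style WSOD can be sketched in a few lines. This is a generic illustration of the paradigm, not DS-MIL itself; the top-k rule and the score layout are assumptions for the example.

```python
import numpy as np

def mil_select(instance_scores, image_label, k=3):
    # Toy multiple-instance selection: an image is a "bag" of proposals
    # supervised only by its image-level label; keep the k proposals
    # scoring highest for that label as pseudo-positive instances
    # for training the detector.
    order = np.argsort(instance_scores[:, image_label])[::-1]
    return order[:k]

rng = np.random.default_rng(1)
scores = rng.random((10, 5))       # 10 proposals x 5 classes
scores[4, 2] = 2.0                 # proposal 4 clearly fits class 2
picked = mil_select(scores, image_label=2)
```

The part-domination problem the abstract describes shows up exactly here: if the highest-scoring proposal covers only a discriminative part, naive top-k selection locks the detector onto it, which is what the discovery-and-selection modules aim to avoid.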
Self-supervised object detection from audio-visual correspondence
We tackle the problem of learning object detectors without supervision.
Unlike weakly-supervised object detection, we do not assume
image-level class labels. Instead, we extract a supervisory signal from
audio-visual data, using the audio component to "teach" the object detector.
While this problem is related to sound source localisation, it is considerably
harder because the detector must classify the objects by type, enumerate each
instance of the object, and do so even when the object is silent. We tackle
this problem by first designing a self-supervised framework with a contrastive
objective that jointly learns to classify and localise objects. Then, without
using any supervision, we simply use these self-supervised labels and boxes to
train an image-based object detector. With this, we outperform previous
unsupervised and weakly-supervised detectors for the task of object detection
and sound source localization. We also show that we can align this detector to
ground-truth classes with as little as one label per pseudo-class, and show how
our method can learn to detect generic objects that go beyond instruments, such
as airplanes and cats.
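A contrastive objective over paired audio-visual clips, of the kind mentioned above, can be sketched as a standard InfoNCE loss. This is a generic illustration, not the paper's exact objective; the temperature value and embedding sizes are assumptions.

```python
import numpy as np

def contrastive_loss(vis, aud, temperature=0.1):
    # InfoNCE-style contrastive objective over a batch of paired
    # visual/audio embeddings: a clip's own audio (same row) is the
    # positive; every other clip's audio is a negative.
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    aud = aud / np.linalg.norm(aud, axis=1, keepdims=True)
    logits = vis @ aud.T / temperature               # (B, B) similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # cross-entropy on diagonal

rng = np.random.default_rng(0)
vis = rng.standard_normal((8, 32))
aud_paired = vis + 0.01 * rng.standard_normal((8, 32))   # aligned audio
aud_random = rng.standard_normal((8, 32))                # unrelated audio
```

Minimising this loss pulls each visual embedding towards its own clip's audio and away from other clips', which is the "teaching" signal the audio stream provides.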
Pixel is All You Need: Adversarial Trajectory-Ensemble Active Learning for Salient Object Detection
Although weakly-supervised techniques can reduce the labeling effort, it is
unclear whether a saliency model trained with weakly-supervised data (e.g.,
point annotation) can achieve the equivalent performance of its
fully-supervised version. This paper attempts to answer this unexplored
question by proving a hypothesis: there exists a point-labeled dataset on which
saliency models can achieve performance equivalent to models trained on the
densely annotated dataset. To prove this conjecture, we propose a novel
yet effective adversarial trajectory-ensemble active learning (ATAL) approach. Our
contributions are three-fold: 1) Our proposed adversarial attack triggering
uncertainty can conquer the overconfidence of existing active learning methods
and accurately locate these uncertain pixels. 2) Our proposed
trajectory-ensemble uncertainty estimation method maintains the advantages of
the ensemble networks while significantly reducing the computational cost. 3)
Our proposed relationship-aware diversity sampling algorithm can conquer
oversampling while boosting performance. Experimental results show that our
ATAL can find such a point-labeled dataset, where a saliency model trained on
it obtains performance comparable to its fully-supervised version with
only ten annotated points per image.
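The trajectory-ensemble idea of estimating uncertainty from one network's training run, rather than a full ensemble, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's method; the variance criterion and checkpoint count are assumptions.

```python
import numpy as np

def trajectory_uncertainty(snapshot_preds):
    # Trajectory-ensemble flavoured uncertainty: stack per-pixel saliency
    # predictions saved at several checkpoints of ONE training run and
    # use their variance, instead of training several separate networks.
    return np.var(snapshot_preds, axis=0)

def most_uncertain_pixels(snapshot_preds, budget):
    # Active-learning query: the `budget` pixels whose predictions
    # disagree most across the trajectory.
    unc = trajectory_uncertainty(snapshot_preds)
    flat = np.argsort(unc.ravel())[::-1][:budget]
    return np.unravel_index(flat, unc.shape)

rng = np.random.default_rng(0)
stable = np.repeat(rng.random((1, 8, 8)), 5, axis=0)  # 5 checkpoints agree...
stable[:, 3, 3] = [0.1, 0.9, 0.2, 0.8, 0.5]           # ...except at one pixel
rows, cols = most_uncertain_pixels(stable, budget=1)
```

The cost saving comes from reusing checkpoints that training produces anyway, so the "ensemble" is free relative to training N independent saliency networks.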
Re-Attention Transformer for Weakly Supervised Object Localization
Weakly supervised object localization is a challenging task which aims to
localize objects with coarse annotations such as image categories. Existing
deep network approaches are mainly based on class activation map, which focuses
on highlighting discriminative local region while ignoring the full object. In
addition, emerging transformer-based techniques tend to place excessive
attention on the background, which impedes their ability to identify complete objects.
To address these issues, we present a re-attention mechanism termed token
refinement transformer (TRT) that captures the object-level semantics to guide
the localization well. Specifically, TRT introduces a novel module named token
priority scoring module (TPSM) to suppress the effects of background noise
while focusing on the target object. Then, we incorporate the class activation
map as the semantically aware input to restrain the attention map to the target
object. Extensive experiments on two benchmarks showcase the superiority of our
proposed method against existing methods with image category annotations.
Source code is available in
\url{https://github.com/su-hui-zz/ReAttentionTransformer}.
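The step of using a class activation map to restrain the attention map can be sketched as a simple re-weighting. This is a toy reading of the mechanism, not TRT's actual modules; the min-max normalisation and elementwise product are assumptions.

```python
import numpy as np

def restrain_attention(attn, cam, eps=1e-8):
    # Toy CAM-guided refinement: normalise the class activation map to
    # [0, 1], use it to down-weight attention on the background, and
    # renormalise the result back into a distribution.
    cam = (cam - cam.min()) / (cam.max() - cam.min() + eps)
    refined = attn * cam
    return refined / (refined.sum() + eps)

attn = np.full((8, 8), 1 / 64)       # attention spread over everything
cam = np.zeros((8, 8))
cam[2:5, 2:5] = 1.0                  # CAM fires only on the object region
refined = restrain_attention(attn, cam)
```

After refinement, all attention mass sits inside the CAM's object region, which is the behaviour the abstract describes as suppressing background noise while focusing on the target object.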
ENInst: Enhancing Weakly-supervised Low-shot Instance Segmentation
We address weakly-supervised low-shot instance segmentation, an
annotation-efficient training setting for handling novel classes effectively.
Since it is an under-explored problem, we first investigate the difficulty of
the problem and identify the performance bottleneck by conducting systematic
analyses of model components and individual sub-tasks with a simple baseline
model. Based on the analyses, we propose ENInst with sub-task enhancement
methods: instance-wise mask refinement for enhancing pixel localization quality
and novel classifier composition for improving classification accuracy. Our
proposed method lifts the overall performance by enhancing the performance of
each sub-task. We demonstrate that our ENInst is 7.5 times more efficient in
achieving comparable performance to the existing fully-supervised few-shot
models and even outperforms them at times.Comment: Accepted at Pattern Recognition (PR)