Constrained Sampling for Class-Agnostic Weakly Supervised Object Localization
Self-supervised vision transformers can generate accurate localization maps
of the objects in an image. However, since they decompose the scene into
multiple maps containing various objects, and they do not rely on any explicit
supervisory signal, they cannot distinguish the object of interest from
other objects, as required in weakly-supervised object localization (WSOL). To
address this issue, we propose leveraging the multiple maps generated by the
different transformer heads to acquire pseudo-labels for training a WSOL model.
In particular, a new discriminative proposal sampling method is introduced
that relies on a pretrained CNN classifier to identify discriminative regions.
Then, foreground and background pixels are sampled from these regions in order
to train a WSOL model for generating activation maps that can accurately
localize objects belonging to a specific class. Empirical results on the
challenging CUB benchmark dataset indicate that our proposed approach can
outperform state-of-the-art methods over a wide range of threshold values. Our
method provides class activation maps with better coverage of foreground
object regions w.r.t. the background.
Comment: 3 pages, 2 figures
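The sampling step described above can be illustrated with a minimal numpy sketch: foreground pseudo-label pixels are seeded from the highest activations of a map and background pixels from the lowest. This is only an illustration of the general idea, not the paper's method (which uses a pretrained CNN classifier to identify discriminative regions); all names here are hypothetical.

```python
import numpy as np

def sample_fg_bg_pixels(attention_map, n_fg=10, n_bg=10):
    """Sample foreground seeds from the highest-activation pixels and
    background seeds from the lowest, as pseudo-labels for a WSOL model.

    attention_map: 2-D array of per-pixel activations.
    Returns two (k, 2) arrays of (row, col) coordinates.
    """
    flat = attention_map.ravel()
    order = np.argsort(flat)          # indices sorted by ascending activation
    bg_idx = order[:n_bg]             # least active pixels -> background seeds
    fg_idx = order[-n_fg:]            # most active pixels  -> foreground seeds
    shape = attention_map.shape
    to_coords = lambda idx: np.stack(np.unravel_index(idx, shape), axis=1)
    return to_coords(fg_idx), to_coords(bg_idx)
```

The sampled coordinates would then serve as sparse pixel-level supervision when training the localization model.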
DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization
Self-supervised vision transformers (SSTs) have shown great potential to
yield rich localization maps that highlight different objects in an image.
However, these maps remain class-agnostic since the model is unsupervised. They
often tend to decompose the image into multiple maps containing different
objects while being unable to distinguish the object of interest from
background noise objects. In this paper, Discriminative Pseudo-label Sampling
(DiPS) is introduced to leverage these class-agnostic maps for
weakly-supervised object localization (WSOL), where only image-class labels are
available. Given multiple attention maps, DiPS relies on a pre-trained
classifier to identify the most discriminative regions of each attention map.
This ensures that the selected ROIs cover the correct image object while
discarding the background ones, and, as such, provides a rich pool of diverse
and discriminative proposals to cover different parts of the object.
Subsequently, these proposals are used as pseudo-labels to train our new
transformer-based WSOL model designed to perform classification and
localization tasks. Unlike standard WSOL methods, DiPS optimizes performance in
both tasks by using a transformer encoder and a dedicated output head for each
task, each trained using dedicated loss functions. To avoid overfitting to a
single proposal and to promote better object coverage, one proposal is randomly
selected from among the top-ranked ones for a training image at each training
step. Experimental results on the challenging CUB, ILSVRC, OpenImages, and
TelDrone datasets indicate that our architecture, in combination with our
transformer-based proposals, can yield better localization performance than
state-of-the-art methods.
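The random selection among top proposals mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions (proposals already scored by a classifier); the function name and signature are hypothetical, not the paper's code.

```python
import random

def select_training_proposal(proposals, scores, top_k=5, rng=None):
    """Keep the top-k proposals by classifier score, then pick one at
    random for this training step, so training does not overfit to a
    single proposal and covers more of the object over time."""
    rng = rng or random.Random()
    ranked = sorted(zip(scores, proposals), key=lambda p: p[0], reverse=True)
    top = [proposal for _, proposal in ranked[:top_k]]
    return rng.choice(top)
```

Because a different proposal can be drawn at each step, the pseudo-labels seen by the model vary across epochs, which is what encourages broader object coverage.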
Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization
Self-supervised vision transformers can generate accurate localization maps
of the objects in an image. However, since they decompose the scene into
multiple maps containing various objects, and they do not rely on any explicit
supervisory signal, they cannot distinguish the object of interest from
other objects, as required in weakly-supervised object localization (WSOL). To
address this issue, we propose leveraging the multiple maps generated by the
different transformer heads to acquire pseudo-labels for training a WSOL model.
In particular, a new Discriminative Proposals Sampling (DiPS) method is
introduced that relies on a pretrained CNN classifier to identify
discriminative regions. Then, foreground and background pixels are sampled from
these regions in order to train a WSOL model for generating activation maps
that can accurately localize objects belonging to a specific class. Empirical
results on the challenging CUB, OpenImages, and ILSVRC benchmark datasets
indicate that our proposed approach can outperform state-of-the-art methods over
a wide range of threshold values. DiPS provides class activation maps with
better coverage of foreground object regions w.r.t. the background.
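Evaluation "over a wide range of threshold values" typically means binarizing the activation map at each threshold and comparing the resulting box against ground truth. A minimal numpy sketch of the thresholding step (an illustration of the standard procedure, not code from the paper):

```python
import numpy as np

def cam_to_bbox(cam, threshold):
    """Binarize a class activation map at `threshold` (relative to its peak)
    and return the tight bounding box (x0, y0, x1, y1) of the active region,
    or None when the map has no positive activation."""
    peak = cam.max()
    if peak <= 0:
        return None
    mask = cam >= threshold * peak
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Sweeping `threshold` and measuring box accuracy at each value is what makes a method's robustness (or sensitivity) to the choice of threshold visible.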
Deep weakly-supervised learning methods for classification and localization in histology images: a survey
Using state-of-the-art deep learning models for cancer diagnosis presents
several challenges related to the nature and availability of labeled histology
images. In particular, cancer grading and localization in these images normally
relies on both image- and pixel-level labels, the latter requiring a costly
annotation process. In this survey, deep weakly-supervised learning (WSL)
models are investigated to identify and locate diseases in histology images,
without the need for pixel-level annotations. Given training data with global
image-level labels, these models can simultaneously classify histology
images and yield pixel-wise localization scores, thereby identifying the
corresponding regions of interest (ROI). Since relevant WSL models have mainly
been investigated within the computer vision community, and validated on
natural scene images, we assess the extent to which they apply to histology
images which have challenging properties, e.g. very large size, similarity
between foreground/background, highly unstructured regions, stain
heterogeneity, and noisy/ambiguous labels. The most relevant models for deep
WSL are compared experimentally in terms of accuracy (classification and
pixel-wise localization) on several public benchmark histology datasets for
breast and colon cancer -- BACH ICIAR 2018, BreaKHis, CAMELYON16, and GlaS.
Furthermore, for large-scale evaluation of WSL models on histology images, we
propose a protocol to construct WSL datasets from Whole Slide Imaging. Results
indicate that several deep learning models can provide a high level of
classification accuracy, although accurate pixel-wise localization of cancer
regions remains an issue for such images. Code is publicly available.
Comment: 35 pages, 18 figures
CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos
Weakly supervised video object localization (WSVOL) methods often rely on
visual and motion cues only, making them susceptible to inaccurate
localization. Recently, discriminative models have been explored using a
temporal class activation mapping (CAM) method. Although their results are
promising, objects are assumed to have limited movement from frame to frame,
leading to degradation in performance for relatively long-term dependencies. In
this paper, a novel CoLo-CAM method for WSVOL is proposed that leverages
spatiotemporal information in activation maps during training without making
assumptions about object position. Given a sequence of frames, localization is
learned jointly from color cues across the corresponding maps, under the
assumption that an object keeps a similar color across adjacent frames. CAM
activations are constrained to respond similarly over pixels with similar
colors, achieving co-localization. This joint learning creates direct
communication among pixels across all image locations and over all frames,
allowing for transfer, aggregation, and correction of learned localization,
leading to better localization performance. This is achieved by minimizing the
color term of a conditional random field (CRF) loss over a sequence of
frames/CAMs. Empirical experiments on two challenging YouTube-Objects datasets
of unconstrained videos show the merits of our method, and its
robustness to long-term dependencies, leading to new state-of-the-art
performance for WSVOL.
Comment: 16 pages, 8 figures
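The color term of the CRF loss described above can be sketched as a pairwise penalty: pixels with similar colors are pushed toward similar CAM activations, within and across frames. This is a dense O(N^2) illustration on flattened pixels (real implementations use efficient filtering); the function name, `sigma` value, and normalization are illustrative assumptions, not the paper's code.

```python
import numpy as np

def color_crf_loss(cams, colors, sigma=0.1):
    """Pairwise color-affinity penalty on CAM activations.

    cams:   (N,) activations, flattened over all pixels of all frames.
    colors: (N, 3) the corresponding pixel colors.
    Pixel pairs with similar colors get high affinity, so differing
    activations on them are penalized, which drives co-localization.
    """
    diff_c = colors[:, None, :] - colors[None, :, :]             # (N, N, 3)
    affinity = np.exp(-np.sum(diff_c ** 2, axis=-1) / (2 * sigma ** 2))
    diff_a = cams[:, None] - cams[None, :]                       # (N, N)
    return float(np.sum(affinity * diff_a ** 2) / cams.size)
```

Minimizing this term over a window of frames is what lets localization evidence propagate between pixels of the same color, even when they are far apart in space or time.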