Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence
Sparse labels have attracted much attention in recent years. However, the
performance gap between weakly supervised and fully supervised salient
object detection methods remains large, and most previous weakly supervised
works adopt complex training schemes with many bells and whistles. In this work, we
propose a one-round end-to-end training approach for weakly supervised salient
object detection via scribble annotations without pre/post-processing
operations or extra supervision data. Since scribble labels fail to offer
detailed salient regions, we propose a local coherence loss to propagate the
labels to unlabeled regions based on image features and pixel distance, so as
to predict integral salient regions with complete object structures. We design
a saliency structure consistency loss as a self-consistency mechanism to ensure that
consistent saliency maps are predicted with different scales of the same image
as input, which could be viewed as a regularization technique to enhance the
model generalization ability. Additionally, we design an aggregation module
(AGGM) that lets the decoder better integrate high-level features, low-level
features, and global context information. Extensive
experiments show that our method achieves a new state-of-the-art performance on
six benchmarks (e.g., for the ECSSD dataset: $F_\beta = 0.8995$, $E_\xi = 0.9079$,
and $MAE = 0.0489$), with average gains of 4.60% for F-measure, 2.05% for
E-measure, and 1.88% for MAE over the previous best method on this task. Source
code is available at http://github.com/siyueyu/SCWSSOD.
Comment: Accepted by AAAI 2021.
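To make the self-consistency idea concrete, here is a minimal PyTorch sketch of a scale-consistency loss in the spirit of this abstract: the same image is fed to the network at two scales, and the two predicted saliency maps are encouraged to agree. The model stub, L1 penalty, and scale factor are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def structure_consistency_loss(model, image, scale=0.3):
    """Penalize disagreement between saliency maps predicted from the
    same image at two input scales."""
    pred_full = model(image)                      # (B, 1, H, W), values in [0, 1]
    small = F.interpolate(image, scale_factor=scale,
                          mode='bilinear', align_corners=False)
    pred_small = model(small)                     # (B, 1, h, w)
    # Bring the full-resolution prediction down to the small scale,
    # then compare the two maps pixel-wise.
    pred_full_ds = F.interpolate(pred_full, size=pred_small.shape[-2:],
                                 mode='bilinear', align_corners=False)
    return F.l1_loss(pred_full_ds, pred_small)

# Toy usage with a stand-in "model" (a blur over the gray-level image):
model = lambda x: torch.sigmoid(F.avg_pool2d(x.mean(1, keepdim=True), 3, 1, 1))
loss = structure_consistency_loss(model, torch.rand(2, 3, 64, 64))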
A Visual Representation-guided Framework with Global Affinity for Weakly Supervised Salient Object Detection
Fully supervised salient object detection (SOD) methods have made
considerable progress in performance, yet these models rely heavily on
expensive pixel-wise labels. Recently, to achieve a trade-off between labeling
burden and performance, scribble-based SOD methods have attracted increasing
attention. Previous scribble-based models implement the SOD task based only
on SOD training data, which carries limited information, so it is extremely
difficult for them to understand images well enough to excel at SOD. In
this paper, we propose a simple yet effective framework guided by general
visual representations with rich contextual semantic knowledge for
scribble-based SOD. These general visual representations are generated by
self-supervised learning based on large-scale unlabeled datasets. Our framework
consists of a task-related encoder, a general visual module, and an information
integration module to efficiently combine the general visual representations
with task-related features to perform the SOD task based on understanding the
contextual connections of images. Meanwhile, we propose a novel global semantic
affinity loss to guide the model to perceive the global structure of the
salient objects. Experimental results on five public benchmark datasets
demonstrate that our method, which only utilizes scribble annotations without
introducing any extra label, outperforms the state-of-the-art weakly supervised
SOD methods. Specifically, it outperforms the previous best scribble-based
method on all datasets, with average gains of 5.5% for max F-measure, 5.8% for
mean F-measure, and 3.1% for E-measure, and a 24% reduction in MAE. Moreover, our method
achieves comparable or even superior performance to the state-of-the-art fully
supervised models.
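As a rough illustration of the integration idea, the sketch below fuses features from a frozen self-supervised ("general visual") encoder with task-related encoder features before predicting saliency. The channel sizes and the simple concatenate-and-convolve fusion are assumptions; the paper's information integration module may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, task_ch, general_ch, out_ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(task_ch + general_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, 1, 1))    # single-channel saliency logits

    def forward(self, task_feat, general_feat):
        # Align spatial sizes before channel-wise concatenation.
        general_feat = F.interpolate(general_feat, size=task_feat.shape[-2:],
                                     mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([task_feat, general_feat], dim=1))

head = FusionHead(task_ch=256, general_ch=384)   # e.g., CNN + ViT channel sizes
logits = head(torch.rand(1, 256, 32, 32), torch.rand(1, 384, 16, 16))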
Energy-Based Generative Cooperative Saliency Prediction
Conventional saliency prediction models typically learn a deterministic
mapping from images to the corresponding ground truth saliency maps. In this
paper, we study the saliency prediction problem from the perspective of
generative models by learning a conditional probability distribution over
saliency maps given an image, and treating the prediction as a sampling
process. Specifically, we propose a generative cooperative saliency prediction
framework based on generative cooperative networks, in which a conditional
latent variable model and a conditional energy-based model are jointly trained
to predict saliency in a cooperative manner. We call our model SalCoopNets.
The latent variable model serves as a fast but coarse predictor to efficiently
produce an initial prediction, which is then refined by the iterative Langevin
revision of the energy-based model that serves as a fine predictor. Such a
coarse-to-fine cooperative saliency prediction strategy offers the best of both
worlds. Moreover, we generalize our framework to the scenario of weakly
supervised saliency prediction, where saliency annotation of training images is
partially observed, by proposing a cooperative learning-while-recovering
strategy. Lastly, we show that the learned energy function can serve as a
refinement module that can refine the results of other pre-trained saliency
prediction models. Experimental results show that our generative model can
achieve state-of-the-art performance. Our code is publicly available at:
https://github.com/JingZhang617/SalCoopNets
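The coarse-to-fine cooperative sampling can be sketched as follows: a fast predictor proposes an initial saliency map, and iterative Langevin revision then refines it by descending a learned conditional energy. The toy energy function, step size, and step count are illustrative assumptions only.

import torch

def langevin_refine(energy_fn, image, init_pred, steps=10, step_size=0.01):
    """Refine a coarse prediction by Langevin dynamics on a conditional
    energy E(image, pred): gradient descent on the energy plus noise."""
    pred = init_pred.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(image, pred).sum()
        grad, = torch.autograd.grad(energy, pred)
        noise = torch.randn_like(pred)
        # Langevin update: half-squared-step gradient term plus Gaussian noise.
        pred = (pred - 0.5 * step_size ** 2 * grad
                + step_size * noise).detach().requires_grad_(True)
    return pred.detach()

# Toy energy: favors predictions close to the image's gray-level map.
energy_fn = lambda img, p: ((p - img.mean(1, keepdim=True)) ** 2).flatten(1).sum(1)
image = torch.rand(2, 3, 32, 32)
refined = langevin_refine(energy_fn, image, init_pred=torch.rand(2, 1, 32, 32))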
Transformer Transforms Salient Object Detection and Camouflaged Object Detection
Transformer networks are particularly good at modeling long-range
dependencies within a long sequence. In this paper, we conduct research on
applying the transformer networks for salient object detection (SOD). We adopt
the dense transformer backbone for fully supervised RGB image based SOD, RGB-D
image pair based SOD, and weakly supervised SOD within a unified framework. This
design is based on the observation that the transformer backbone provides accurate
structure modeling, which makes it powerful in learning from weak labels that carry
less structure information. Further, we find that the vision transformer
architectures do not offer direct spatial supervision, instead encoding
position as a feature. Therefore, we investigate the contributions of two
strategies to provide stronger spatial supervision through the transformer
layers within our unified framework, namely deep supervision and
difficulty-aware learning. We find that deep supervision can propagate gradients
back into the higher-level features, thus leading to uniform activation within the
same semantic object. Difficulty-aware learning, on the other hand, is capable of
identifying the hard pixels for effective hard negative mining. We also
visualize features of the conventional and transformer backbones before and
after fine-tuning them for SOD, and find that the transformer backbone encodes more
accurate object structure information and more distinct semantic information
within the lower- and higher-level features, respectively. We also apply our
model to camouflaged object detection (COD) and make observations similar to
those on the above three SOD tasks. Extensive experimental results on various SOD and
COD tasks illustrate that transformer networks can transform SOD and COD,
leading to new benchmarks for each related task. The source code and
experimental results are available via our project page:
https://github.com/fupiao1998/TrasformerSOD.
Comment: Technical report, 18 pages, 22 figures.
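As an illustration of the deep supervision strategy discussed above, the sketch below applies the loss to resized side outputs from several decoder stages, so gradients reach higher-level features directly rather than only through the final output. Stage resolutions and weights are assumptions, not the authors' configuration.

import torch
import torch.nn.functional as F

def deep_supervision_loss(side_outputs, target, weights=None):
    """side_outputs: list of (B, 1, h_i, w_i) logits, shallow to deep."""
    weights = weights or [1.0] * len(side_outputs)
    total = 0.0
    for w, logits in zip(weights, side_outputs):
        # Resize each side output to the target resolution before the loss,
        # so every decoder stage receives a direct supervision signal.
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode='bilinear', align_corners=False)
        total = total + w * F.binary_cross_entropy_with_logits(logits, target)
    return total

target = (torch.rand(2, 1, 64, 64) > 0.5).float()
sides = [torch.randn(2, 1, s, s) for s in (16, 32, 64)]
loss = deep_supervision_loss(sides, target)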
Mutual Information Regularization for Weakly-supervised RGB-D Salient Object Detection
In this paper, we present a weakly-supervised RGB-D salient object detection
model via scribble supervision. Specifically, as a multimodal learning task, we
focus on effective multimodal representation learning via inter-modal mutual
information regularization. In particular, following the principle of
disentangled representation learning, we introduce a mutual information upper
bound with a mutual information minimization regularizer to encourage the
disentangled representation of each modality for salient object detection.
Based on our multimodal representation learning framework, we introduce an
asymmetric feature extractor for our multimodal data, which proves more
effective than the conventional symmetric backbone setting. We also introduce a
multimodal variational auto-encoder as a stochastic prediction refinement
technique, which takes pseudo labels from the first training stage as
supervision and generates refined predictions. Experimental results on benchmark
RGB-D salient object detection datasets verify the effectiveness of both our
explicit multimodal disentangled representation learning method and the
stochastic prediction refinement strategy, achieving performance comparable
with the state-of-the-art fully supervised models. Our code and data are
available at: https://github.com/baneitixiaomai/MIRV.
Comment: IEEE Transactions on Circuits and Systems for Video Technology, 2023.
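For intuition, here is a hedged sketch of inter-modal mutual information minimization via a CLUB-style variational upper bound, one common way to realize such a regularizer; the paper's exact bound may differ. The feature dimension and the unit-variance Gaussian q are assumptions.

import torch
import torch.nn as nn

class MIUpperBound(nn.Module):
    """Estimates an upper bound on I(z_rgb; z_depth); minimizing it pushes
    the two modality representations toward disentanglement."""
    def __init__(self, dim):
        super().__init__()
        # Variational net q(z_depth | z_rgb), predicting a Gaussian mean.
        self.q_mu = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, z_rgb, z_depth):
        mu = self.q_mu(z_rgb)                        # (B, D)
        positive = -((z_depth - mu) ** 2).mean()     # log q on paired samples
        perm = torch.randperm(z_depth.size(0))       # shuffle approximates marginals
        negative = -((z_depth[perm] - mu) ** 2).mean()
        return positive - negative                   # CLUB-style upper bound

mi = MIUpperBound(dim=128)
loss_mi = mi(torch.randn(8, 128), torch.randn(8, 128))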
Exploiting saliency for object segmentation from image level labels
There have been remarkable improvements in the semantic labelling task in
recent years. However, state-of-the-art methods rely on large-scale
pixel-level annotations. This paper studies the problem of training a
pixel-wise semantic labeller network from image-level annotations of the
present object classes. Recently, it has been shown that high quality seeds
indicating discriminative object regions can be obtained from image-level
labels. Without additional information, obtaining the full extent of the object
is an inherently ill-posed problem due to co-occurrences. We propose using a
saliency model as additional information and thereby exploit prior knowledge of
the object extent and image statistics. We show how to combine both information
sources in order to recover 80% of the fully supervised performance, which is
the new state of the art in weakly supervised training for pixel-wise semantic
labelling. The code is available at https://goo.gl/KygSeb.
Comment: CVPR 2017.
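One plausible way to combine the two information sources named above: class-discriminative seeds localize objects, while a class-agnostic saliency map suggests their full extent. The fusion rule and threshold below are illustrative assumptions (including a single-salient-object simplification), not the paper's exact procedure.

import numpy as np

def fuse_seeds_and_saliency(seed_mask, saliency, sal_thresh=0.5):
    """seed_mask: (H, W) ints, 0 = background, k = class-k seed pixels.
    saliency: (H, W) floats in [0, 1] from a pretrained saliency model."""
    pseudo = np.zeros_like(seed_mask)
    salient = saliency > sal_thresh
    for cls in np.unique(seed_mask):
        if cls == 0:
            continue
        # Extend a seeded class over the salient region it overlaps
        # (simplification: assumes one salient object per image).
        if (salient & (seed_mask == cls)).any():
            pseudo[salient] = cls
    pseudo[seed_mask > 0] = seed_mask[seed_mask > 0]  # keep confident seeds
    return pseudo

pseudo = fuse_seeds_and_saliency(np.random.randint(0, 3, (64, 64)),
                                 np.random.rand(64, 64))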
Weakly Supervised Video Salient Object Detection via Point Supervision
Video salient object detection models trained on pixel-wise dense annotation
have achieved excellent performance, yet obtaining pixel-by-pixel annotated
datasets is laborious. Several works attempt to use scribble annotations to
mitigate this problem, but point supervision, an even more labor-saving
annotation method (indeed the most labor-saving among manual annotation
methods for dense prediction), has not been explored. In this paper, we propose a strong
baseline model based on point supervision. To infer saliency maps with temporal
information, we mine inter-frame complementary information from both short-term
and long-term perspectives. Specifically, we propose a hybrid token
attention module, which mixes optical flow and image information from
orthogonal directions, adaptively highlighting critical optical flow
information (channel dimension) and critical token information (spatial
dimension). To exploit long-term cues, we develop the Long-term Cross-Frame
Attention module (LCFA), which assists the current frame in inferring salient
objects based on multi-frame tokens. Furthermore, we label two point-supervised
datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and DAVSOD datasets.
Experiments on six benchmark datasets illustrate that our method outperforms
the previous state-of-the-art weakly supervised methods and is even comparable
with some fully supervised approaches. Source code and datasets are available.
Comment: Accepted by ACM MM 2022.