53 research outputs found
Fast Adversarial Training with Smooth Convergence
Fast adversarial training (FAT) is beneficial for improving the adversarial
robustness of neural networks. However, previous FAT work has encountered a
significant issue known as catastrophic overfitting when dealing with large
perturbation budgets, \ie the adversarial robustness of models declines to near
zero during training.
To address this, we analyze the training process of prior FAT work and
observe that catastrophic overfitting is accompanied by the appearance of loss
convergence outliers.
Therefore, we argue that a moderately smooth loss convergence process indicates
a stable FAT process that avoids catastrophic overfitting.
To obtain a smooth loss convergence process, we propose a novel oscillatory
constraint (dubbed ConvergeSmooth) to limit the loss difference between
adjacent epochs. A convergence stride is introduced to balance convergence and
smoothing. We likewise design weight centralization, which requires no
additional hyperparameters beyond the loss balance coefficient.
Our proposed methods are attack-agnostic and thus can improve the training
stability of various FAT techniques.
Extensive experiments on popular datasets show that the proposed methods
efficiently avoid catastrophic overfitting and outperform all previous FAT
methods. Code is available at \url{https://github.com/FAT-CS/ConvergeSmooth}
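The epoch-level constraint can be sketched in plain Python. This is a minimal illustration of the idea of limiting the loss difference between adjacent epochs; the function name `smooth_loss` and the hard clamping are assumptions for exposition, not the paper's exact mechanism:

```python
def smooth_loss(curr_loss: float, prev_loss: float, stride: float) -> float:
    """Keep the current epoch's loss within `stride` of the previous
    epoch's loss, suppressing the convergence outliers that accompany
    catastrophic overfitting while leaving small changes untouched."""
    lower, upper = prev_loss - stride, prev_loss + stride
    return min(max(curr_loss, lower), upper)
```

A loss spike from 1.0 to 2.5 with a stride of 0.3 would thus be limited to 1.3, while a change from 1.0 to 1.1 passes through unchanged.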
ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer
Deep learning (DL) has advanced the field of dense prediction, while
gradually dissolving the inherent barriers between different tasks. However,
most existing works focus on designing architectures and constructing visual
cues only for the specific task, which ignores the potential uniformity
introduced by the DL paradigm. In this paper, we attempt to construct a novel
\underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse
bi-source dense prediction tasks. Specifically, unlike existing methods that
over-specialize in a single task or a subset of tasks, ComPtr starts from the
more general concept of bi-source dense prediction. Based on the basic
dependence on information complementarity, we propose consistency enhancement
and difference awareness components, with which ComPtr can extract and collect
important visual semantic cues from different image sources for diverse tasks,
respectively. ComPtr treats different inputs equally and builds an efficient
dense interaction model in the form of sequence-to-sequence on top of the
transformer. This task-generic design provides a smooth foundation for
constructing the unified model that can simultaneously deal with various
bi-source information. In extensive experiments across several representative
vision tasks, i.e. remote sensing change detection, RGB-T crowd counting,
RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed
method consistently obtains favorable performance. The code will be available
at \url{https://github.com/lartpang/ComPtr}
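The roles of the two components can be caricatured on toy feature vectors. The functions below are illustrative assumptions (agreement as an element-wise product, disagreement as an absolute difference), not ComPtr's actual implementation:

```python
def consistency_enhancement(a, b):
    """Emphasize cues that both sources agree on: the element-wise
    product acts as a soft logical AND over feature activations."""
    return [x * y for x, y in zip(a, b)]

def difference_awareness(a, b):
    """Highlight cues present in one source but not the other."""
    return [abs(x - y) for x, y in zip(a, b)]
```

For features `a = [1.0, 0.0, 0.5]` and `b = [1.0, 1.0, 0.5]`, the first function keeps the shared activations while the second isolates the middle channel where the sources disagree.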
Multi-scale Interactive Network for Salient Object Detection
Deep-learning-based salient object detection methods have achieved great
progress. However, the variable scale and unknown category of salient objects
remain persistent challenges. These are closely related to the utilization of
multi-level and multi-scale features. In this paper, we propose the aggregate
interaction modules to integrate the features from adjacent levels, which
introduce less noise because only small up-/down-sampling rates are used.
To obtain more efficient multi-scale features from the integrated features, the
self-interaction modules are embedded in each decoder unit. Besides, the class
imbalance issue caused by the scale variation weakens the effect of the binary
cross entropy loss and results in the spatial inconsistency of the predictions.
Therefore, we exploit the consistency-enhanced loss to highlight the
fore-/back-ground difference and preserve the intra-class consistency.
Experimental results on five benchmark datasets demonstrate that the proposed
method without any post-processing performs favorably against 23
state-of-the-art approaches. The source code will be publicly available at
https://github.com/lartpang/MINet.
Comment: Accepted by CVPR 202
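The consistency-enhanced loss works on regions rather than independent pixels. As a rough illustration only, an IoU-style region loss over a flattened probability map, a stand-in for the spirit of the loss and not the paper's exact formulation, could look like:

```python
def consistency_enhanced_loss(pred, gt, eps=1e-6):
    """Region-level loss that penalizes fore-/back-ground confusion
    jointly rather than per pixel, encouraging intra-class consistency.
    `pred` holds probabilities in [0, 1]; `gt` holds {0, 1} labels."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(p + g for p, g in zip(pred, gt)) - inter
    return 1.0 - (inter + eps) / (union + eps)
```

Unlike per-pixel binary cross entropy, this term couples all pixels of a prediction, so a spatially inconsistent map is penalized as a whole.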
CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection
Most of the existing bi-modal (RGB-D and RGB-T) salient object detection
methods utilize the convolution operation and construct complex interweave
fusion structures to achieve cross-modal information integration. The inherent
local connectivity of the convolution operation constrains the performance of
the convolution-based methods to a ceiling. In this work, we rethink these
tasks from the perspective of global information alignment and transformation.
Specifically, the proposed \underline{c}ross-mod\underline{a}l
\underline{v}iew-mixed transform\underline{er} (CAVER) cascades several
cross-modal integration units to construct a top-down transformer-based
information propagation path. CAVER treats the multi-scale and multi-modal
feature integration as a sequence-to-sequence context propagation and update
process built on a novel view-mixed attention mechanism. Besides, considering
the quadratic complexity w.r.t. the number of input tokens, we design a
parameter-free patch-wise token re-embedding strategy to simplify operations.
Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that
such a simple two-stream encoder-decoder framework can surpass recent
state-of-the-art methods when it is equipped with the proposed components.
Comment: Updated version, more flexible structure, better performance
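The parameter-free re-embedding idea can be sketched in plain Python. Assuming (as an illustration, not CAVER's exact operator) that re-embedding means averaging non-overlapping groups of tokens, the sequence length, and thus the quadratic attention cost, shrinks by the group size:

```python
def patchwise_reembed(tokens, patch=2):
    """Parameter-free token reduction: average each non-overlapping
    group of `patch` tokens into one token. `tokens` is a list of
    equal-length feature vectors; the output is `patch`x shorter."""
    assert len(tokens) % patch == 0, "token count must divide evenly"
    out = []
    for i in range(0, len(tokens), patch):
        group = tokens[i:i + patch]
        dim = len(group[0])
        out.append([sum(v[d] for v in group) / patch for d in range(dim)])
    return out
```

Because the operation has no learned weights, it adds no parameters while cutting attention cost roughly by `patch` squared.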
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning
Semi-supervised learning is attracting growing attention, due to its success
in leveraging unlabeled data. To mitigate potentially incorrect pseudo labels,
recent frameworks mostly set a fixed confidence threshold to discard uncertain
samples. This practice ensures high-quality pseudo labels, but incurs a
relatively low utilization of the whole unlabeled set. In this work, our key
insight is that these uncertain samples can be turned into certain ones, as
long as the confusion classes for the top-1 class are detected and removed.
Motivated by this, we propose a novel method dubbed ShrinkMatch to learn from
uncertain samples. For each uncertain sample, it adaptively seeks a shrunk
class space, which merely contains the original top-1 class, as well as
remaining less likely classes. Since the confusion ones are removed in this
space, the re-calculated top-1 confidence can satisfy the pre-defined
threshold. We then impose a consistency regularization between a pair of
strongly and weakly augmented samples in the shrunk space to strive for
discriminative representations. Furthermore, considering the varied reliability
among uncertain samples and the gradually improved model during training, we
correspondingly design two reweighting principles for our uncertain loss. Our
method exhibits impressive performance on widely adopted benchmarks. Code is
available at https://github.com/LiheYoung/ShrinkMatch.
Comment: Accepted by ICCV 202
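The shrinking step can be caricatured on a single probability vector. The greedy removal order and the function name below are assumptions for illustration, not ShrinkMatch's implementation: drop the most confusing runner-up classes until the renormalized top-1 confidence meets the threshold:

```python
def shrink_class_space(probs, tau):
    """Return indices of a shrunk class space: the top-1 class plus the
    remaining least-likely classes, with the strongest competitors
    removed until the renormalized top-1 confidence reaches `tau`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top1, kept = order[0], order[1:]
    # Remove confusion classes (highest-probability competitors) first.
    while kept and probs[top1] / (probs[top1] + sum(probs[i] for i in kept)) < tau:
        kept.pop(0)  # `kept` is sorted by descending probability
    return [top1] + kept
```

For `probs = [0.5, 0.3, 0.15, 0.05]` and `tau = 0.9`, classes 1 and 2 are removed as confusion classes; in the remaining space {0, 3} the renormalized top-1 confidence 0.5/0.55 ≈ 0.91 now passes the threshold.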
FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models
Semantic segmentation has witnessed tremendous progress due to the proposal
of various advanced network architectures. However, they are extremely hungry
for delicate annotations to train, and the acquisition is laborious and
unaffordable. Therefore, we present FreeMask in this work, which resorts to
synthetic images from generative models to ease the burden of both data
collection and annotation procedures. Concretely, we first synthesize abundant
training images conditioned on the semantic masks provided by realistic
datasets. This yields extra well-aligned image-mask training pairs for semantic
segmentation models. We surprisingly observe that, solely trained with
synthetic images, we already achieve comparable performance with real ones
(e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we
investigate the role of synthetic images by joint training with real images, or
pre-training for real images. Meanwhile, we design a robust filtering principle
to suppress incorrectly synthesized regions. In addition, we propose to treat
different semantic masks unequally, so as to prioritize harder ones and
sample more corresponding synthetic images for them. As a result, either
jointly trained or pre-trained with our filtered and re-sampled synthesized
images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on
ADE20K. Code is available at https://github.com/LiheYoung/FreeMask.
Comment: Accepted by NeurIPS 202
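One simple way to realize the unequal treatment of masks is loss-proportional sampling: allocate more of the synthesis budget to masks the model currently finds hard. This allocation rule is an illustrative assumption, not FreeMask's exact scheme:

```python
def resample_counts(mask_losses, total):
    """Allocate a budget of `total` synthetic images across semantic
    masks in proportion to their observed loss, so harder masks get
    more synthetic images. Returns integer counts summing to at most
    `total` (remainders from truncation are simply dropped)."""
    s = sum(mask_losses)
    return [int(total * l / s) for l in mask_losses]
```

With per-mask losses `[1.0, 3.0]` and a budget of 8 images, the harder mask receives three times as many synthetic samples.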
ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection
Recent camouflaged object detection (COD) attempts to segment objects
visually blended into their surroundings, which is extremely complex and
difficult in real-world scenarios. Apart from the high intrinsic similarity
between camouflaged objects and their background, objects are usually diverse
in scale, fuzzy in appearance, and even severely occluded. To this end, we
propose an effective unified collaborative pyramid network which mimics human
behavior when observing vague images and videos, \textit{i.e.}, zooming in and
out. Specifically, our approach employs the zooming strategy to learn
discriminative mixed-scale semantics by the multi-head scale integration and
rich granularity perception units, which are designed to fully explore
imperceptible clues between candidate objects and background surroundings. The
former's intrinsic multi-head aggregation provides more diverse visual
patterns. The latter's routing mechanism can effectively propagate inter-frame
differences in spatiotemporal scenarios and adaptively ignore static
representations. Together, they provide a solid foundation for realizing a unified
architecture for static and dynamic COD. Moreover, considering the uncertainty
and ambiguity derived from indistinguishable textures, we construct a simple
yet effective regularization, uncertainty awareness loss, to encourage
predictions with higher confidence in candidate regions. Our highly
task-friendly framework consistently outperforms existing state-of-the-art
methods in image and video COD benchmarks. The code will be available at
\url{https://github.com/lartpang/ZoomNeXt}.
Comment: Extensions to the conference version: arXiv:2203.02688; fixed some
word errors
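A common way to encourage confident predictions, which may or may not match the paper's exact uncertainty awareness loss, is to penalize probabilities near 0.5, where `p * (1 - p)` peaks. A minimal sketch under that assumption:

```python
def uncertainty_awareness_loss(probs):
    """Mean ambiguity penalty over foreground probabilities: the term
    p * (1 - p) is maximal at p = 0.5 and vanishes as predictions
    approach a confident 0 or 1."""
    return sum(p * (1.0 - p) for p in probs) / len(probs)
```

Confident maps therefore incur near-zero loss, while predictions hovering around 0.5 in candidate regions are pushed toward a decision.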
Augmentation Matters: A Simple-yet-Effective Approach to Semi-supervised Semantic Segmentation
Recent studies on semi-supervised semantic segmentation (SSS) have seen fast
progress. Despite their promising performance, current state-of-the-art methods
tend toward increasingly complex designs, at the cost of introducing more
network components and additional training procedures. In contrast, in this work, we
follow a standard teacher-student framework and propose AugSeg, a simple and
clean approach that focuses mainly on data perturbations to boost the SSS
performance. We argue that various data augmentations should be adjusted to
better adapt to the semi-supervised scenarios instead of directly applying
these techniques from supervised learning. Specifically, we adopt a simplified
intensity-based augmentation that selects a random number of data
transformations, with distortion strengths uniformly sampled from a continuous
space. Based on the estimated confidence of the model on different unlabeled
samples, we also randomly inject labelled information to augment the unlabeled
samples in an adaptive manner. Without bells and whistles, our simple AugSeg
can readily achieve new state-of-the-art performance on SSS benchmarks under
different partition protocols.
Comment: 10 pages, 8 tables
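The sampling described above, a random number of transformations with continuously sampled strengths, can be sketched with the standard library. The candidate operation names are placeholders, not AugSeg's actual augmentation list:

```python
import random

def sample_augmentations(candidates, max_ops, seed=None):
    """Pick a random number (1..max_ops) of distinct augmentation ops,
    each paired with a distortion strength drawn uniformly from the
    continuous range [0, 1]."""
    rng = random.Random(seed)
    k = rng.randint(1, max_ops)
    return [(op, rng.uniform(0.0, 1.0)) for op in rng.sample(candidates, k)]
```

Calling `sample_augmentations(["blur", "color", "contrast"], 2)` yields one or two distinct operations, each with its own continuous strength, rather than a fixed pipeline with discrete magnitudes.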
- …