On the Importance of Backbone to the Adversarial Robustness of Object Detectors
Object detection is a critical component of various security-sensitive
applications, such as autonomous driving and video surveillance. However,
existing deep learning-based object detectors are vulnerable to adversarial
attacks, which poses a significant challenge to their reliability and safety.
Through experiments, we found that existing works on improving the adversarial
robustness of object detectors have given a false sense of security. We argue
that using adversarially pre-trained backbone networks is essential for
enhancing the adversarial robustness of object detectors. We propose a simple
yet effective recipe for fast adversarial fine-tuning on object detectors with
adversarially pre-trained backbones. Without any modifications to the structure
of object detectors, our recipe achieved significantly better adversarial
robustness than previous works. Moreover, we explore the potential of different
modern object detectors to improve adversarial robustness using our recipe and
demonstrate several interesting findings. Our empirical results set a new
milestone and deepen the understanding of adversarially robust object
detection. Code and trained checkpoints will be publicly available. Comment: 12 pages
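The core of such a recipe lends itself to a short illustration. Below is a minimal PyTorch-style sketch of adversarial fine-tuning starting from an adversarially pre-trained backbone, assuming a torchvision-style detector that returns a dict of losses when called with images and targets; `detector`, `dataloader`, and the checkpoint filename are hypothetical placeholders, and the hyperparameters are illustrative rather than the paper's settings.

```python
import torch

def pgd_attack(detector, images, targets, eps=8/255, alpha=2/255, steps=3):
    """Craft PGD adversarial examples against the detector's training loss."""
    adv = images + torch.empty_like(images).uniform_(-eps, eps)  # random start
    adv = adv.clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = sum(detector(adv, targets).values())   # total detection loss
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()      # ascend the loss
        adv = images + (adv - images).clamp(-eps, eps)  # project to eps-ball
        adv = adv.clamp(0, 1).detach()
    return adv

# Fine-tune the whole detector on adversarial examples, initializing the
# backbone from an adversarially pre-trained checkpoint (name hypothetical).
detector.backbone.load_state_dict(torch.load("adv_pretrained_backbone.pth"))
detector.train()
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01, momentum=0.9)
for images, targets in dataloader:
    adv_images = pgd_attack(detector, images, targets)
    loss = sum(detector(adv_images, targets).values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```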
DFormer: Diffusion-guided Transformer for Universal Image Segmentation
This paper introduces an approach, named DFormer, for universal image
segmentation. The proposed DFormer views the universal image segmentation task as a
denoising process using a diffusion model. DFormer first adds various levels of
Gaussian noise to ground-truth masks, and then learns a model to predict
denoising masks from corrupted masks. Specifically, we take deep pixel-level
features along with the noisy masks as inputs to generate mask features and
attention masks, employing a diffusion-based decoder to perform mask prediction
gradually. At inference, our DFormer directly predicts the masks and
corresponding categories from a set of randomly-generated masks. Extensive
experiments reveal the merits of our proposed contributions on different image
segmentation tasks: panoptic segmentation, instance segmentation, and semantic
segmentation. Our DFormer outperforms the recent diffusion-based panoptic
segmentation method Pix2Seq-D by 3.6% on the MS COCO val2017 set. Further, DFormer achieves promising semantic segmentation performance, outperforming the recent diffusion-based method by 2.2% on the ADE20K val set. Our source code and models will be made publicly available at https://github.com/cp3wan/DFormer.
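To make the denoising formulation concrete, here is a minimal sketch of a training step under standard DDPM assumptions: ground-truth masks are corrupted with step-dependent Gaussian noise, and the model learns to predict the clean masks and their categories from pixel features plus the noisy masks. The schedule values are illustrative, `model` is a hypothetical stand-in, and the simple per-query losses stand in for DFormer's actual matching-based losses.

```python
import torch
import torch.nn.functional as F

# Standard DDPM noise schedule (values illustrative).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def corrupt_masks(gt_masks, t):
    """Forward diffusion: add level-t Gaussian noise to ground-truth masks.
    gt_masks: (B, N, H, W) binary masks rescaled to roughly [-1, 1]."""
    noise = torch.randn_like(gt_masks)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * gt_masks + (1.0 - a).sqrt() * noise

def training_step(model, pixel_feats, gt_masks, gt_labels):
    """Train the model to recover clean masks and categories from noisy ones."""
    t = torch.randint(0, T, (gt_masks.shape[0],))
    noisy = corrupt_masks(gt_masks * 2.0 - 1.0, t)    # rescale to [-1, 1]
    pred_masks, pred_logits = model(pixel_feats, noisy, t)
    mask_loss = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    cls_loss = F.cross_entropy(pred_logits.flatten(0, 1), gt_labels.flatten())
    return mask_loss + cls_loss
```

At inference, the same model would be applied iteratively, starting from randomly generated masks rather than noised ground truth.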
H2RBox-v2: Boosting HBox-supervised Oriented Object Detection via Symmetric Learning
With the increasing demand for oriented object detection, e.g. in autonomous driving and remote sensing, oriented annotation has become labor-intensive work. To make full use of existing horizontally annotated datasets and to reduce the annotation cost, a weakly-supervised detector, H2RBox, which learns the rotated box (RBox) from the horizontal box (HBox), has been proposed and has received great attention. This paper presents a new version,
H2RBox-v2, to further bridge the gap between HBox-supervised and
RBox-supervised oriented object detection. Our theoretical analysis shows that object axisymmetry can be exploited through flipping and rotating consistencies. Accordingly, H2RBox-v2 combines a weakly-supervised branch similar to that of H2RBox with a novel self-supervised branch that learns orientations from the symmetry inherent in the image of objects. Complemented by modules that cope with peripheral issues, e.g. angular periodicity, it achieves a stable and effective solution. To our knowledge, H2RBox-v2 is the first symmetry-supervised paradigm for oriented object detection. Compared to H2RBox, our method is less susceptible to low annotation quality and insufficient training data, and in such cases it is expected to deliver competitive performance much closer to that of fully-supervised oriented object detectors. Specifically, the performance
comparison between H2RBox-v2 and Rotated FCOS on DOTA-v1.0/1.5/2.0 is
72.31%/64.76%/50.33% vs. 72.44%/64.53%/51.77%, 89.66% vs. 88.99% on HRSC, and
42.27% vs. 41.25% on FAIR1M. Comment: 13 pages, 4 figures, 7 tables; the source code is available at https://github.com/open-mmlab/mmrotate
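A minimal sketch of the symmetry idea, under our own assumptions rather than the paper's implementation: consistency losses tie the orientations predicted for rotated and flipped views of an image back to the original prediction, with angle differences wrapped to respect angular periodicity. `predict_angle` is a hypothetical orientation head (images to per-image angle in radians), and the rotation sign convention is assumed.

```python
import math
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def wrap_angle(d):
    """Wrap angle differences to (-pi/2, pi/2] to handle the pi-periodicity
    of rotated-box orientations."""
    return torch.atan2(torch.sin(2 * d), torch.cos(2 * d)) / 2

def symmetry_consistency_loss(predict_angle, images):
    """Self-supervised branch (illustrative): rotating the image by delta
    should shift predicted orientations by delta, and horizontal flipping
    should negate them."""
    theta = predict_angle(images)                               # (B,)

    delta = torch.empty(1).uniform_(-math.pi, math.pi).item()   # random view
    theta_rot = predict_angle(TF.rotate(images, math.degrees(delta)))
    loss_rot = F.smooth_l1_loss(wrap_angle(theta_rot - (theta + delta)),
                                torch.zeros_like(theta))

    theta_flip = predict_angle(torch.flip(images, dims=[-1]))   # h-flip
    loss_flip = F.smooth_l1_loss(wrap_angle(theta_flip + theta),
                                 torch.zeros_like(theta))
    return loss_rot + loss_flip
```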
Dynamic Focus-aware Positional Queries for Semantic Segmentation
Most recent top-performing semantic segmentation approaches are based on vision
Transformers, particularly DETR-like frameworks, which employ a set of queries
in the Transformer decoder. Each query is composed of a content query that
preserves semantic information and a positional query that provides positional
guidance for aggregating the query-specific context. However, the positional
queries in the Transformer decoder layers are typically represented as fixed
learnable weights, which often encode dataset statistics for segments and can
be inaccurate for individual samples. Therefore, in this paper, we propose to
generate positional queries dynamically conditioned on the cross-attention
scores and the localization information of the preceding layer. By doing so,
each query is aware of its previous focus, thus providing more accurate
positional guidance and encouraging the cross-attention consistency across the
decoder layers. In addition, we propose an efficient way to deal with
high-resolution cross-attention by dynamically determining the contextual
tokens based on the low-resolution cross-attention maps to perform local
relation aggregation. Our overall framework, termed FASeg (Focus-Aware semantic Segmentation), provides a simple yet effective solution for semantic
segmentation. Extensive experiments on ADE20K and Cityscapes show that our
FASeg achieves state-of-the-art performance, e.g., obtaining 48.3% and 49.6% mIoU for single-scale inference on the ADE20K validation set with ResNet-50 and Swin-T backbones respectively, while barely increasing the computational cost over Mask2Former. Source code will be made publicly available at https://github.com/zip-group/FASeg. Comment: Tech report
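The dynamic positional query can be illustrated with a short sketch: instead of fixed learnable weights, each layer's positional query is regenerated from the preceding layer's cross-attention map and the resulting expected focus location. Shapes and module names here are assumptions for illustration, not FASeg's actual code.

```python
import torch
import torch.nn as nn

class DynamicPositionalQuery(nn.Module):
    """Sketch: build positional queries for decoder layer l from the
    cross-attention scores of layer l-1, replacing fixed learnable weights
    with sample-dependent ones."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim + 2, dim)  # fuse context with (x, y) focus

    def forward(self, attn_prev, pixel_feats, coords):
        # attn_prev:   (B, Q, HW) cross-attention scores from previous layer
        # pixel_feats: (B, HW, C) flattened pixel features, C == dim
        # coords:      (B, HW, 2) normalized (x, y) position of each token
        w = attn_prev.softmax(dim=-1)            # where each query focused
        focus_feat = torch.bmm(w, pixel_feats)   # (B, Q, C) attended context
        focus_xy = torch.bmm(w, coords)          # (B, Q, 2) expected location
        return self.proj(torch.cat([focus_feat, focus_xy], dim=-1))
```

Each decoder layer would then consume these sample-specific positional queries in place of fixed embeddings, so every query is aware of its previous focus.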
AIMS: All-Inclusive Multi-Level Segmentation
Despite the progress of image segmentation toward accurate visual entity segmentation, meeting the diverse requirements of image editing applications for region-of-interest selection at different levels remains unsolved. In this
paper, we propose a new task, All-Inclusive Multi-Level Segmentation (AIMS),
which segments visual regions into three levels: part, entity, and relation
(two entities with some semantic relationships). We also build a unified AIMS
model through multi-dataset multi-task training to address the two major
challenges of annotation inconsistency and task correlation. Specifically, we
propose task complementarity, task association, and a prompt mask encoder for the three-level predictions. Extensive experiments demonstrate the effectiveness
and generalization capacity of our method compared to other state-of-the-art
methods trained on a single dataset, as well as to the concurrent work on segmenting anything. We will make our code and trained models publicly available. Comment: Technical report
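One plausible way to realize level-conditioned prediction, purely as an illustrative assumption rather than the AIMS design: a learned prompt embedding per granularity (part, entity, relation) is added to a shared query set, so a single decoder can serve all three levels.

```python
import torch
import torch.nn as nn

class ThreeLevelHead(nn.Module):
    """Hypothetical level-conditioned head: a learned prompt per granularity
    is added to shared queries, so one decoder predicts parts, entities, or
    relations depending on the prompt. Not the actual AIMS architecture."""

    LEVELS = {"part": 0, "entity": 1, "relation": 2}

    def __init__(self, dim, num_queries):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)          # shared queries
        self.level_prompt = nn.Embedding(len(self.LEVELS), dim)

    def forward(self, decoder, pixel_feats, level):
        # pixel_feats: (B, HW, C); `decoder` is a hypothetical mask decoder.
        idx = torch.tensor(self.LEVELS[level], device=pixel_feats.device)
        q = self.queries.weight + self.level_prompt(idx)       # condition on level
        q = q.unsqueeze(0).expand(pixel_feats.shape[0], -1, -1)
        return decoder(q, pixel_feats)                         # masks + classes
```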