Backpropagation Path Search On Adversarial Transferability
Deep neural networks are vulnerable to adversarial examples, making it
imperative to test a model's robustness before deployment. Transfer-based
attackers craft adversarial examples against surrogate models and transfer them
to victim models deployed in black-box settings. To enhance the
adversarial transferability, structure-based attackers adjust the
backpropagation path to avoid the attack from overfitting the surrogate model.
However, existing structure-based attackers fail to explore the convolution
module in CNNs and adjust the backpropagation graph only heuristically, leading to
limited effectiveness. In this paper, we propose backPropagation pAth Search
(PAS), solving the aforementioned two problems. We first propose SkipConv to
adjust the backpropagation path of convolution by structural
reparameterization. To overcome the drawback of heuristically designed
backpropagation paths, we further construct a DAG-based search space, utilize
one-step approximation for path evaluation and employ Bayesian Optimization to
search for the optimal path. We conduct comprehensive experiments in a wide
range of transfer settings, showing that PAS improves the attack success rate
by a huge margin for both normally trained and defense models.
Comment: Accepted by ICCV 2023
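The abstract does not spell out SkipConv's exact form. As a rough illustration of the general idea, the sketch below reparameterizes a convolution into an identity-like skip branch plus a residual branch and scales the gradient flowing back through the residual branch, which is one way a backpropagation path through convolutions can be adjusted; the class name `SkipConvSketch` and the `gamma` weighting are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipConvSketch(nn.Module):
    """Reparameterize a conv into an identity-like 'skip' kernel plus a
    residual kernel, and scale the gradient through the residual branch.
    A sketch of the general idea only, not the paper's SkipConv."""

    def __init__(self, conv: nn.Conv2d, gamma: float = 0.5):
        super().__init__()
        assert conv.in_channels == conv.out_channels, "identity branch needs matching channels"
        self.conv = conv
        self.gamma = gamma
        c, _, kh, kw = conv.weight.shape
        ident = torch.zeros_like(conv.weight)
        ident[torch.arange(c), torch.arange(c), kh // 2, kw // 2] = 1.0  # per-channel identity
        self.register_buffer("identity_kernel", ident)

    def forward(self, x):
        residual_w = self.conv.weight - self.identity_kernel
        skip_out = F.conv2d(x, self.identity_kernel, None,
                            stride=self.conv.stride, padding=self.conv.padding)
        res_out = F.conv2d(x, residual_w, self.conv.bias,
                           stride=self.conv.stride, padding=self.conv.padding)
        # Forward value equals the original convolution (skip + residual),
        # but the gradient through the residual branch is scaled by gamma.
        res_out = self.gamma * res_out + (1.0 - self.gamma) * res_out.detach()
        return skip_out + res_out
```

Wrapping the surrogate's convolutions this way would give a path-search procedure a continuous knob (`gamma`) per convolution; this matches the abstract's description only loosely.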
Segment Anything Model Meets Image Harmonization
Image harmonization is a crucial technique in image composition that aims to
seamlessly match the background by adjusting the foreground of composite
images. Current methods adopt either global-level or pixel-level feature
matching. Global-level feature matching ignores the proximity prior, treating
foreground and background as separate entities. On the other hand, pixel-level
feature matching loses contextual information. Therefore, it is necessary to
use the information from semantic maps that describe different objects to guide
harmonization. In this paper, we propose Semantic-guided Region-aware Instance
Normalization (SRIN) that can utilize the semantic segmentation maps output by
a pre-trained Segment Anything Model (SAM) to guide the visual consistency
learning of foreground and background features. Extensive experiments
demonstrate the superiority of our method for image harmonization over
state-of-the-art methods.
Comment: Accepted by ICASSP 2024
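The abstract does not give SRIN's formula. The function below is a minimal sketch of segmentation-guided, region-aware normalization in the same spirit: within each SAM region, foreground features are whitened with their own statistics and re-styled with the background statistics of that region. The name `region_aware_in` and the per-region loop are my own simplification, not the paper's module.

```python
import torch

def region_aware_in(feat, fg_mask, seg_masks, eps=1e-5):
    """Illustrative region-aware instance normalization.

    feat:      (B, C, H, W) composite-image features
    fg_mask:   (B, 1, H, W) binary foreground mask
    seg_masks: (B, K, H, W) binary semantic-region masks (e.g., from SAM)
    """
    out = feat.clone()
    bg_mask = 1.0 - fg_mask
    for k in range(seg_masks.shape[1]):
        region = seg_masks[:, k:k + 1]                                   # (B,1,H,W)
        fg = region * fg_mask
        bg = region * bg_mask
        fg_n = fg.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        bg_n = bg.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        fg_mu = (feat * fg).sum(dim=(2, 3), keepdim=True) / fg_n
        bg_mu = (feat * bg).sum(dim=(2, 3), keepdim=True) / bg_n
        fg_var = ((feat - fg_mu) ** 2 * fg).sum(dim=(2, 3), keepdim=True) / fg_n
        bg_var = ((feat - bg_mu) ** 2 * bg).sum(dim=(2, 3), keepdim=True) / bg_n
        # Whiten foreground features of this region, re-style with background stats.
        normalized = (feat - fg_mu) / (fg_var + eps).sqrt()
        restyled = normalized * (bg_var + eps).sqrt() + bg_mu
        out = torch.where(fg.bool(), restyled, out)
    return out
```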
DiffusionInst: Diffusion Model for Instance Segmentation
Diffusion frameworks have achieved comparable performance with previous
state-of-the-art image generation models. Researchers are curious about their
variants for discriminative tasks because of the powerful noise-to-image
denoising pipeline. This paper proposes DiffusionInst, a novel framework that
represents instances as instance-aware filters and formulates instance
segmentation as a noise-to-filter denoising process. The model is trained to
reverse the noisy ground truth without any inductive bias from RPN. During
inference, it takes randomly generated filters as input and outputs masks via
one-step or multi-step denoising. Extensive experimental results on COCO and
LVIS show that DiffusionInst achieves competitive performance compared to
existing instance segmentation models with various backbones, such as ResNet
and Swin Transformers. We hope our work can serve as a strong baseline and
inspire the design of more efficient diffusion frameworks for challenging
discriminative tasks. Our code is available at
https://github.com/chenhaoxing/DiffusionInst
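As a rough sketch of what a noise-to-filter inference loop could look like (the `denoiser` and `mask_branch` interfaces, the filter dimensionality, and the crude re-noising schedule are all assumptions rather than the released code's API):

```python
import torch

@torch.no_grad()
def noise_to_filter_inference(denoiser, mask_branch, feats,
                              num_instances=100, filter_dim=64, steps=4):
    """Illustrative multi-step 'noise-to-filter' denoising.

    denoiser(filters_t, feats, t) -> estimate of the clean instance-aware filters
    mask_branch(filters, feats)   -> instance masks produced from those filters
    """
    schedule = torch.linspace(1.0, 0.0, steps)           # noise levels from pure noise to clean
    filters_t = torch.randn(num_instances, filter_dim)   # start from random filters
    for i, t in enumerate(schedule):
        filters_0 = denoiser(filters_t, feats, t)        # predict clean filters at this step
        if i + 1 < steps:
            # Crude stand-in for a DDIM-style update: blend the prediction
            # with fresh noise at the next (smaller) noise level.
            next_t = schedule[i + 1]
            filters_t = (1 - next_t) * filters_0 + next_t * torch.randn_like(filters_0)
        else:
            filters_t = filters_0
    return mask_branch(filters_t, feats)                 # one mask per denoised filter
```

Setting `steps=1` collapses the loop into the one-step variant mentioned in the abstract.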
DiffUTE: Universal Text Editing Diffusion Model
Diffusion model based language-guided image editing has achieved great
success recently. However, existing state-of-the-art diffusion models struggle
with rendering correct text and text style during generation. To tackle this
problem, we propose a universal self-supervised text editing diffusion model
(DiffUTE), which aims to replace or modify words in a source image while
maintaining its realistic appearance. Specifically, we build
our model on a diffusion model and carefully modify the network structure to
enable the model to draw multilingual characters with the help of glyph and
position information. Moreover, we design a self-supervised learning framework
to leverage large amounts of web data to improve the representation ability of
the model. Experimental results show that our method achieves an impressive
performance and enables controllable editing on in-the-wild images with high
fidelity. Our code will be available at
\url{https://github.com/chenhaoxing/DiffUTE}
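The abstract only says the network is modified to take glyph and position information. One common way to feed such conditions to a latent-diffusion UNet is to stack a position mask and a masked source latent onto the noisy latent and to pass an encoded rendering of the target glyphs through cross-attention; the sketch below follows that pattern, and its names and channel layout are assumptions, not DiffUTE's actual design.

```python
import torch

def build_text_editing_inputs(noisy_latent, source_latent, edit_mask, glyph_embed):
    """Illustrative conditioning for a text-editing diffusion model.

    noisy_latent:  (B, 4, h, w) latent being denoised
    source_latent: (B, 4, h, w) latent of the source image
    edit_mask:     (B, 1, h, w) 1 where text should be rewritten
    glyph_embed:   (B, L, D)    embedding of the rendered target glyphs
    """
    masked_source = source_latent * (1.0 - edit_mask)       # hide the region to be edited
    # Stack noisy latent, masked source latent, and position mask along channels.
    unet_input = torch.cat([noisy_latent, masked_source, edit_mask], dim=1)  # (B, 9, h, w)
    # The glyph embedding would replace the usual text-encoder context in cross-attention.
    cross_attention_context = glyph_embed
    return unet_input, cross_attention_context
```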
Multi‐mode neural network for human action recognition
Video data have two intrinsic modes, in‐frame and temporal. It is beneficial to incorporate static in‐frame features when acquiring dynamic features for video applications. However, some existing methods, such as recurrent neural networks, do not perform well, while others, such as 3D convolutional neural networks (CNNs), are both memory‐ and time‐consuming. This study proposes an effective framework that takes advantage of deep learning for static image feature extraction to tackle video data. After extracting in‐frame feature vectors with a pretrained deep network, the authors integrate them into a multi‐mode feature matrix, which preserves the multi‐mode structure and high‐level representation. They propose two models for the follow‐up classification. The authors first introduce a temporal CNN, which feeds the multi‐mode feature matrix directly into a CNN. However, they show that the characteristics of the multi‐mode features differ significantly across modes. The authors therefore further propose the multi‐mode neural network (MMNN), in which different modes deploy different types of layers. They evaluate their algorithm on the task of human action recognition. The experimental results show that the MMNN achieves much better performance than existing long short‐term memory‐based methods and consumes far fewer resources than existing 3D end‐to‐end models.
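A minimal sketch of the pipeline as described: per-frame features from a pretrained image CNN are stacked into a frames-by-features matrix, and the two modes are then handled by different layer types (here a 1-D temporal convolution followed by a linear layer). The backbone choice, layer sizes, and pooling are illustrative assumptions, not the authors' MMNN configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiModeSketch(nn.Module):
    """Illustrative two-mode model over a (features x frames) matrix."""

    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        self.extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        for p in self.extractor.parameters():
            p.requires_grad = False                        # frozen, pretrained in-frame features
        self.temporal_conv = nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, clip):                               # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                        # (B*T, 3, H, W)
        feats = self.extractor(frames).flatten(1)          # (B*T, feat_dim)
        matrix = feats.view(b, t, -1).transpose(1, 2)      # (B, feat_dim, T): multi-mode matrix
        temporal = torch.relu(self.temporal_conv(matrix))  # convolve along the temporal mode
        pooled = temporal.mean(dim=2)                      # (B, 256)
        return self.classifier(pooled)
```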
Mobile User Interface Element Detection Via Adaptively Prompt Tuning
Recent object detection approaches rely on pretrained vision-language models
for image-text alignment. However, they fail to detect Mobile User Interface
(MUI) elements, which carry additional OCR information that describes their
content and function but is often ignored. In this paper, we
develop a new MUI element detection dataset named MUI-zh and propose an
Adaptively Prompt Tuning (APT) module to take advantage of the discriminative OCR
information. APT is a lightweight and effective module to jointly optimize
category prompts across different modalities. For every element, APT uniformly
encodes its visual features and OCR descriptions to dynamically adjust the
representation of frozen category prompts. We evaluate the effectiveness of our
plug-and-play APT upon several existing CLIP-based detectors for both standard
and open-vocabulary MUI element detection. Extensive experiments show that our
method achieves considerable improvements on two datasets. The dataset is
available at \url{github.com/antmachineintelligence/MUI-zh}.
Comment: Accepted by CVPR2
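The abstract describes APT as fusing an element's visual features with its OCR description to adjust frozen category prompts. The sketch below shows one plausible shape of that idea; the fusion MLP, the residual shift onto the prompts, and the cosine-similarity scoring are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePromptSketch(nn.Module):
    """Illustrative prompt adjustment from visual + OCR features (not the paper's APT)."""

    def __init__(self, embed_dim, category_prompts):
        super().__init__()
        # category_prompts: (num_classes, embed_dim), e.g. CLIP text embeddings; kept frozen.
        self.register_buffer("category_prompts", category_prompts)
        self.fuse = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, visual_feat, ocr_feat):
        # visual_feat, ocr_feat: (N, embed_dim) for N detected elements.
        shift = self.fuse(torch.cat([visual_feat, ocr_feat], dim=-1))       # (N, D)
        prompts = self.category_prompts.unsqueeze(0) + shift.unsqueeze(1)   # (N, num_classes, D)
        prompts = F.normalize(prompts, dim=-1)
        visual = F.normalize(visual_feat, dim=-1).unsqueeze(-1)             # (N, D, 1)
        return torch.bmm(prompts, visual).squeeze(-1)                       # (N, num_classes) scores
```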
Hierarchical Dynamic Image Harmonization
Image harmonization is a critical task in computer vision, which aims to
adjust the foreground to make it compatible with the background. Recent works
mainly focus on using global transformations (i.e., normalization and color
curve rendering) to achieve visual consistency. However, these models ignore
local visual consistency and their huge model sizes limit their harmonization
ability on edge devices. In this paper, we propose a hierarchical dynamic
network (HDNet) to adapt features from local to global view for better feature
transformation in efficient image harmonization. Inspired by the success of
various dynamic models, a local dynamic (LD) module and a mask-aware global
dynamic (MGD) module are proposed in this paper. Specifically, LD matches local
representations between the foreground and background regions based on semantic
similarities, then adaptively adjusts every foreground local representation
according to the appearance of its $k$-nearest-neighbor background regions. In
this way, LD can produce more realistic images at a more fine-grained level,
while enjoying the benefit of semantic alignment. The MGD
applies distinct convolutions to the foreground and background, learning the
representations of the two regions as well as their correlations for global
harmonization, which facilitates local visual consistency much more
efficiently. Experimental results
demonstrate that the proposed HDNet significantly reduces the total model
parameters by more than 80\% compared to previous methods, while still
attaining state-of-the-art performance on the popular iHarmony4 dataset.
Notably, the HDNet achieves a 4\% improvement in PSNR and a 19\% reduction in
MSE compared to the prior state-of-the-art methods.
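As a rough sketch of the LD module's matching step as the abstract describes it (the cosine similarity, the softmax weighting over the $k$ neighbors, and the gated blend are my own simplifications, not HDNet's exact formulation):

```python
import torch
import torch.nn.functional as F

def local_dynamic_sketch(fg_feats, bg_feats, k=4):
    """Adjust each foreground local representation toward its k most
    semantically similar background representations (illustrative only).

    fg_feats: (Nf, C) foreground patch/region features
    bg_feats: (Nb, C) background patch/region features, with Nb >= k
    """
    sim = F.normalize(fg_feats, dim=-1) @ F.normalize(bg_feats, dim=-1).t()  # (Nf, Nb) cosine
    topk_sim, topk_idx = sim.topk(k, dim=-1)                                  # k nearest bg regions
    weights = torch.softmax(topk_sim, dim=-1).unsqueeze(-1)                   # (Nf, k, 1)
    neighbors = bg_feats[topk_idx]                                            # (Nf, k, C)
    bg_context = (weights * neighbors).sum(dim=1)                             # (Nf, C)
    gate = torch.sigmoid((fg_feats * bg_context).sum(dim=-1, keepdim=True))   # similarity-based gate
    return fg_feats + gate * (bg_context - fg_feats)                          # adjusted foreground
```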