Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
Text-based person search aims to retrieve the corresponding person images from
an image database given a sentence describing the person, and it has great
potential for applications such as video surveillance.
Extracting visual contents corresponding to the human description is the key to
this cross-modal matching problem. Moreover, correlated images and descriptions
involve different granularities of semantic relevance, which is usually ignored
in previous methods. To exploit the multilevel corresponding visual contents,
we propose a pose-guided multi-granularity attention network (PMA). First, we
propose a coarse alignment network (CA) that selects the image regions related
to the global description through similarity-based attention. To further capture
the phrase-related visual body parts, we propose a fine-grained alignment
network (FA), which employs pose information to learn latent semantic alignment
between visual body parts and textual noun phrases. To verify the effectiveness
of our model, we perform extensive experiments on the CUHK Person Description
Dataset (CUHK-PEDES), which is currently the only available dataset for
text-based person search. Experimental results show that our approach
outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
Comment: published in AAAI 2020 (oral)
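The similarity-based attention used for coarse alignment can be illustrated with a minimal sketch. It assumes that image regions and the sentence are already embedded in a shared space, and it uses cosine similarity with a softmax over regions; the function names, the `temperature` parameter, and the choice of softmax are illustrative assumptions, not details taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def attend_regions(region_feats, text_feat, temperature=0.1):
    """Weight image regions by their similarity to a sentence embedding.

    region_feats: list of region vectors (hypothetical shared embedding space)
    text_feat:    one sentence vector of the same dimension
    temperature:  assumed softmax sharpness, not specified in the abstract
    """
    sims = [cosine(r, text_feat) for r in region_feats]
    # Numerically stable softmax over the region similarities.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the region features.
    dim = len(region_feats[0])
    attended = [
        sum(w * r[d] for w, r in zip(weights, region_feats)) for d in range(dim)
    ]
    return weights, attended


weights, attended = attend_regions([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

Regions that point in the same direction as the sentence embedding receive most of the attention mass, so the attended feature is dominated by text-relevant regions.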
FreeU: Free Lunch in Diffusion U-Net
In this paper, we uncover the untapped potential of diffusion U-Net, which
serves as a "free lunch" that substantially improves the generation quality on
the fly. We initially investigate the key contributions of the U-Net
architecture to the denoising process and identify that its main backbone
primarily contributes to denoising, whereas its skip connections mainly
introduce high-frequency features into the decoder module, causing the network
to overlook the backbone semantics. Capitalizing on this discovery, we propose
a simple yet effective method, termed "FreeU", that enhances generation quality
without additional training or finetuning. Our key insight is to strategically
re-weight the contributions sourced from the U-Net's skip connections and
backbone feature maps, to leverage the strengths of both components of the
U-Net architecture. Promising results on image and video generation tasks
demonstrate that FreeU can be readily integrated into existing diffusion
models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion,
to improve the generation quality with only a few lines of code. All you need
is to adjust two scaling factors during inference. Project page:
https://chenyangsi.top/FreeU/.
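The idea of re-weighting backbone and skip features with two scaling factors can be sketched as follows. This is a deliberately simplified illustration: it scales each feature map uniformly before the decoder concatenates them. The abstract does not say which channels or frequency components are scaled, so the function name, the uniform scaling, and the default values of `b` and `s` are all assumptions made for the example.

```python
def reweight_and_concat(backbone, skip, b=1.2, s=0.9):
    """Scale backbone and skip feature maps, then concatenate along channels.

    backbone, skip: lists of channels, each channel a flat list of floats
    b: assumed scaling factor for backbone features
    s: assumed scaling factor for skip-connection features
    """
    scaled_backbone = [[b * v for v in channel] for channel in backbone]
    scaled_skip = [[s * v for v in channel] for channel in skip]
    # Channel-wise concatenation, as a U-Net decoder block does before its convs.
    return scaled_backbone + scaled_skip


fused = reweight_and_concat([[1.0, 2.0]], [[10.0]], b=2.0, s=0.5)
```

Because only the two factors change and no weights are updated, a re-weighting of this kind can be applied at inference time without any training.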
Exploring Semantic Attributes from A Foundation Model for Federated Learning of Disjoint Label Spaces
Conventional centralised deep learning paradigms are not feasible when data
from different sources cannot be shared due to data privacy or transmission
limitations. To resolve this problem, federated learning has been introduced to
transfer knowledge across multiple sources (clients) with non-shared data while
optimising a globally generalised central model (server). Existing federated
learning paradigms mostly focus on transferring holistic high-level knowledge
(such as class) across models, which is closely tied to specific objects of
interest and so may be vulnerable to inversion attacks. In contrast, in this work, we
consider transferring mid-level semantic knowledge (such as attribute) which is
not sensitive to specific objects of interest and therefore is more
privacy-preserving and scalable. To this end, we formulate a new Federated
Zero-Shot Learning (FZSL) paradigm to learn mid-level semantic knowledge at
multiple local clients with non-shared local data and cumulatively aggregate a
globally generalised central model for deployment. To improve model
discriminative ability, we propose to explore semantic knowledge augmentation
from external knowledge for enriching the mid-level semantic space in FZSL.
Extensive experiments on five zero-shot learning benchmark datasets validate the
effectiveness of our approach for optimising a generalisable federated learning
model with mid-level semantic knowledge transfer.
Comment: Under Review
Scaling Supervised Local Learning with Augmented Auxiliary Networks
Deep neural networks are typically trained using global error signals that
backpropagate (BP) end-to-end, which is not only biologically implausible but
also suffers from the update locking problem and consumes a large amount of
memory. Local learning, which updates each layer independently with a
gradient-isolated auxiliary network, offers a promising alternative to address
the above problems. However, existing local learning methods show a large
accuracy gap relative to their BP counterparts, particularly for large-scale
networks. This is due to the weak coupling between local layers and their
subsequent network layers, as there is no gradient communication across layers.
To tackle this issue, we put forward an augmented local learning method, dubbed
AugLocal. AugLocal constructs each hidden layer's auxiliary network by
uniformly selecting a small subset of layers from its subsequent network layers
to enhance their synergy. We also propose to linearly reduce the depth of
auxiliary networks as the hidden layer goes deeper, ensuring sufficient network
capacity while reducing the computational cost of auxiliary networks. Our
extensive experiments on four image classification datasets (i.e., CIFAR-10,
SVHN, STL-10, and ImageNet) demonstrate that AugLocal can effectively scale up
to tens of local layers with a comparable accuracy to BP-trained networks while
reducing GPU memory usage by around 40%. The proposed AugLocal method,
therefore, opens up a myriad of opportunities for training high-performance
deep neural networks on resource-constrained platforms. Code is available at
https://github.com/ChenxiangMA/AugLocal.
Comment: Accepted by ICLR 202
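The two construction rules described above (uniformly picking layers from the subsequent network, and linearly shrinking auxiliary depth for deeper hidden layers) can be sketched at the level of layer indices. The function names, the `d_max` and `d_min` parameters, and the rounding scheme are assumptions for illustration; the abstract does not give these details.

```python
def aux_depth(layer_idx, num_layers, d_max, d_min=1):
    """Linearly reduce auxiliary depth from d_max (first hidden layer)
    to d_min (last hidden layer). Layers are 1-indexed; layer num_layers
    is the output layer and needs no auxiliary network."""
    if num_layers <= 2:
        return d_min
    frac = (layer_idx - 1) / (num_layers - 2)
    return max(d_min, int(d_max - frac * (d_max - d_min) + 0.5))


def select_aux_layers(layer_idx, num_layers, depth):
    """Pick `depth` layer indices, evenly spaced, from the layers that
    follow `layer_idx` in the primary network."""
    subsequent = list(range(layer_idx + 1, num_layers + 1))
    depth = min(depth, len(subsequent))
    if depth == 0:
        return []
    if depth == 1:
        return [subsequent[-1]]
    step = (len(subsequent) - 1) / (depth - 1)
    return [subsequent[int(i * step + 0.5)] for i in range(depth)]


# Hypothetical 10-layer network with at most 4 auxiliary layers per hidden layer.
plan = {
    l: select_aux_layers(l, 10, aux_depth(l, 10, d_max=4)) for l in range(1, 10)
}
```

In this sketch the first hidden layer borrows the structure of four evenly spaced later layers, while the last hidden layer keeps only a single one, which is how the auxiliary cost shrinks with depth.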
MetaFormer Baselines for Vision
MetaFormer, the abstracted architecture of Transformer, has been found to
play a significant role in achieving competitive performance. In this paper, we
further explore the capacity of MetaFormer, again, without focusing on token
mixer design: we introduce several baseline models under MetaFormer using the
most basic or common mixers, and summarize our observations as follows: (1)
MetaFormer ensures a solid lower bound of performance. By merely adopting
identity mapping as the token mixer, the MetaFormer model, termed
IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works
well with arbitrary token mixers. When specifying the token mixer as even a
random matrix to mix tokens, the resulting model RandFormer yields an accuracy
of >81%, outperforming IdentityFormer. MetaFormer's results can be relied on
when new token mixers are adopted. (3) MetaFormer effortlessly offers
state-of-the-art results. With just conventional token mixers dating back five
years, the models instantiated from MetaFormer already beat the state of the
art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable
convolutions as the token mixer, the model termed ConvFormer, which can be
regarded as a pure CNN, outperforms the strong CNN model ConvNeXt. (b) CAFormer
sets a new record on ImageNet-1K. By simply applying depthwise separable
convolutions as token mixer in the bottom stages and vanilla self-attention in
the top stages, the resulting model CAFormer sets a new record on ImageNet-1K:
it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised
training without external data or distillation. In our expedition to probe
MetaFormer, we also find that a new activation, StarReLU, reduces activation
FLOPs by 71% compared with GELU yet achieves better performance. We expect
StarReLU to find great potential in MetaFormer-like models alongside other
neural networks.
Comment: Accepted to TPAMI. Code: https://github.com/sail-sg/metaforme
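The abstract does not define StarReLU. As a scalar sketch, it is usually written as a squared ReLU with a scale and a bias, s * ReLU(x)^2 + b. The formula and the default constants below are stated from memory of the paper, not from this listing, so treat them as assumptions; in the paper the scale and bias are described as learnable.

```python
def star_relu(x, scale=0.8944, bias=-0.4472):
    """Squared ReLU with a scale and bias: scale * relu(x)**2 + bias.

    The default scale and bias are assumed initial values chosen so that
    the output is roughly zero-mean and unit-variance for standard normal
    input; they are not given in the abstract.
    """
    r = max(x, 0.0)
    return scale * r * r + bias


values = [star_relu(v) for v in (-1.0, 0.0, 1.0, 2.0)]
```

The cheapness comes from using only a comparison, a square, a multiply, and an add per element, with no exponential or error-function evaluation as in GELU.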
