Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions
Visual crowd counting has been recently studied as a way to enable people
counting in crowd scenes from images. Albeit successful, vision-based crowd
counting approaches could fail to capture informative features in extreme
conditions, e.g., imaging at night and occlusion. In this work, we introduce a
novel task of audiovisual crowd counting, in which visual and auditory
information are integrated for counting purposes. We collect a large-scale
benchmark, named auDiovISual Crowd cOunting (DISCO) dataset, consisting of
1,935 images and the corresponding audio clips, and 170,270 annotated
instances. In order to fuse the two modalities, we make use of a linear
feature-wise fusion module that carries out an affine transformation on visual
and auditory features. Finally, we conduct extensive experiments using the
proposed dataset and approach. Experimental results show that introducing
auditory information can benefit crowd counting under different illumination,
noise, and occlusion conditions. The dataset and code will be released. Code
and data have been made availabl
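The feature-wise affine fusion mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the audio features predict one scale (gamma) and one shift (beta) per visual channel through a plain linear layer, and that these are then applied to the visual feature map. The function names, weight matrices, and toy values are all hypothetical.

```python
# Hypothetical sketch of feature-wise affine fusion; not the DISCO authors' code.
# Assumption: audio features predict a per-channel scale and shift for the
# visual features through a single linear layer each.


def linear(weights, bias, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def featurewise_affine_fusion(visual, audio, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each visual channel with parameters predicted from audio.

    visual: list of C channels, each a flat list of spatial activations.
    audio:  flat list of D audio features.
    """
    gamma = linear(w_gamma, b_gamma, audio)  # one scale per visual channel
    beta = linear(w_beta, b_beta, audio)     # one shift per visual channel
    return [[g * v + b for v in channel]
            for channel, g, b in zip(visual, gamma, beta)]


# Toy inputs: 2 visual channels with 2 spatial positions, 2 audio features.
visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [0.5, -1.0]
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]
w_beta, b_beta = [[0.0, 0.0], [2.0, 0.0]], [0.0, 0.0]

fused = featurewise_affine_fusion(visual, audio, w_gamma, b_gamma, w_beta, b_beta)
print(fused)
```

With these toy values the second channel receives a scale of zero, so its output depends only on the audio-derived shift. This mirrors the intuition that audio can carry the signal when the image is uninformative.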
TiC: Exploring Vision Transformer in Convolution
While models derived from Vision Transformers (ViTs) have been phenomenally
surging, pre-trained models cannot seamlessly adapt to arbitrary resolution
images without altering the architecture and configuration, such as sampling
the positional encoding, limiting their flexibility for various vision tasks.
For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all
input images to be resized to 1024×1024. To overcome this limitation, we
propose the Multi-Head Self-Attention Convolution (MSA-Conv) that incorporates
Self-Attention within generalized convolutions, including standard, dilated,
and depthwise ones. Enabling transformers to handle images of varying sizes
without retraining or rescaling, the use of MSA-Conv further reduces
computational costs compared to global attention in ViT, which grows costly as
image size increases. We then present the Vision Transformer in Convolution
(TiC) as a proof of concept for image classification with MSA-Conv, together
with two capacity-enhancing strategies, namely the Multi-Directional Cyclic
Shifted Mechanism and the Inter-Pooling Mechanism, which establish
long-distance connections between tokens and enlarge the effective receptive
field. Extensive experiments have been carried out to validate the overall
effectiveness of TiC. Additionally, ablation studies confirm the performance
improvement made by MSA-Conv and the two capacity enhancing strategies
separately. Note that our proposal aims at studying an alternative to the
global attention used in ViT, while MSA-Conv meets our goal by making TiC
comparable to state-of-the-art on ImageNet-1K. Code will be released at
https://github.com/zs670980918/MSA-Conv
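The idea of restricting self-attention to a convolution-style window can be sketched as follows. This is an illustrative simplification, not the MSA-Conv code from the repository above. It assumes a single head, identity query/key/value projections, and a window that is simply clipped at the image border. The helper `score_counts` is a hypothetical utility that compares the number of query-key scores under global and windowed attention.

```python
import math


def local_window_attention(grid, k=3):
    """Single-head self-attention restricted to a k x k neighbourhood per token.

    grid: H x W nested list of d-dimensional token vectors (lists of floats).
    Queries, keys and values are the tokens themselves (identity projections),
    which is a simplification made for illustration only.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    r = k // 2
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            q = grid[i][j]
            # In-bounds neighbours inside the window centred on (i, j).
            neigh = [grid[a][b]
                     for a in range(max(0, i - r), min(h, i + r + 1))
                     for b in range(max(0, j - r), min(w, j + r + 1))]
            scores = [sum(x * y for x, y in zip(q, n)) / math.sqrt(d)
                      for n in neigh]
            m = max(scores)                      # subtract max for stability
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            out[i][j] = [sum(e * n[c] for e, n in zip(exps, neigh)) / z
                         for c in range(d)]
    return out


def score_counts(h, w, k):
    """Return (global, windowed) upper bounds on query-key score evaluations."""
    n = h * w
    return n * n, n * k * k


# The same function handles grids of different sizes without any resizing.
small = [[[1.0, 2.0] for _ in range(5)] for _ in range(4)]
large = [[[1.0, 2.0] for _ in range(9)] for _ in range(7)]
out_small = local_window_attention(small)
out_large = local_window_attention(large)

global_cost, local_cost = score_counts(64, 64, 3)
print(global_cost, local_cost)
```

Because each token only attends within its window, the number of scores grows linearly with the number of tokens rather than quadratically, and no positional table tied to a fixed resolution is needed in this sketch.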
Face.evoLVe: A High-Performance Face Recognition Library
In this paper, we develop face.evoLVe -- a comprehensive library that
collects and implements a wide range of popular deep learning-based methods for
face recognition. First of all, face.evoLVe is composed of key components that
cover the full process of face analytics, including face alignment, data
processing, various backbones, losses, and alternatives with bags of tricks for
improving performance. Second, face.evoLVe supports multi-GPU training on top of
different deep learning platforms, such as PyTorch and PaddlePaddle, which
helps researchers work on both large-scale datasets with millions of
images and low-shot counterparts with limited well-annotated data. More
importantly, along with face.evoLVe, images before & after alignment in the
common benchmark datasets are released with source code and trained models
provided. All these efforts lower the technical burdens in reproducing the
existing methods for comparison, while users of our library could focus on
developing advanced approaches more efficiently. Last but not least,
face.evoLVe is well designed and vibrantly evolving, so that new face
recognition approaches can be easily plugged into our framework. Note that we
have used face.evoLVe to participate in a number of face recognition
competitions and secured the first place. The version that supports PyTorch is
publicly available at https://github.com/ZhaoJ9014/face.evoLVe.PyTorch and the
PaddlePaddle version is available at
https://github.com/ZhaoJ9014/face.evoLVe.PyTorch/tree/master/paddle.
Face.evoLVe has been widely used for face analytics, receiving 2.4K stars and
622 forks. Comment: A short version is accepted by Neurocomputing
(https://www.sciencedirect.com/science/article/pii/S0925231222005057?via%3Dihub).
Primary corresponding author is Dr. Jian Zha