Backbone Can Not be Trained at Once: Rolling Back to Pre-trained Network for Person Re-Identification
In the person re-identification (ReID) task, because of the shortage of
training data, it is common to fine-tune a classification network
pre-trained on a large dataset. However, it is relatively difficult to
sufficiently fine-tune the low-level layers of the network due to the
vanishing gradient problem. In this work, we propose a novel fine-tuning strategy that
allows low-level layers to be sufficiently trained by rolling back the weights
of high-level layers to their initial pre-trained weights. Our strategy
alleviates the problem of gradient vanishing in low-level layers and robustly
trains the low-level layers to fit the ReID dataset, thereby increasing the
performance of ReID tasks. The improved performance of the proposed strategy is
validated via several experiments. Furthermore, without any add-ons such as
pose estimation or segmentation, our strategy exhibits state-of-the-art
performance using only a vanilla deep convolutional neural network architecture.
Comment: Accepted to AAAI 201
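The rolling-back idea above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the layer names, values, and the low/high split point are made up, and real weights would be tensors in a deep-learning framework.

```python
# Hedged sketch of the rollback strategy: after a fine-tuning phase, the
# high-level layers are reset to their pre-trained weights while the
# low-level layers keep their fine-tuned state, so that subsequent training
# pushes more of the adaptation into the low-level layers.

def roll_back(current, pretrained, high_level):
    """Return a weight dict where the named high-level layers are reset."""
    return {
        name: (pretrained[name] if name in high_level else weights)
        for name, weights in current.items()
    }

# Toy "weights": one scalar per layer (illustrative only).
pretrained = {"conv1": 0.10, "conv2": 0.20, "fc": 0.90}
fine_tuned = {"conv1": 0.15, "conv2": 0.27, "fc": 0.55}

rolled = roll_back(fine_tuned, pretrained, high_level={"fc"})
# "conv1"/"conv2" keep their fine-tuned values; "fc" returns to 0.90.
```

In a real framework this would amount to copying the relevant entries of the pre-trained checkpoint back into the model's state between training phases.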
Adaptive Weighted Attention Network with Camera Spectral Sensitivity Prior for Spectral Reconstruction from RGB Images
Recent promising efforts for spectral reconstruction (SR) focus on learning
a complicated mapping using deeper and wider convolutional neural networks
(CNNs). Nevertheless, most CNN-based SR algorithms neglect to exploit the
camera spectral sensitivity (CSS) prior and the interdependencies among
intermediate features, thus limiting the representation ability of the network
and the performance of SR. To address these issues, we propose a novel adaptive
weighted attention network (AWAN) for SR, whose backbone is stacked with
multiple dual residual attention blocks (DRAB), decorated with long and short
skip connections to form the dual residual learning. Concretely, we investigate
an adaptive weighted channel attention (AWCA) module to reallocate channel-wise
feature responses via integrating correlations between channels. Furthermore, a
patch-level second-order non-local (PSNL) module is developed to capture
long-range spatial contextual information by second-order non-local operations
for more powerful feature representations. Based on the fact that the recovered
RGB images can be projected by the reconstructed hyperspectral image (HSI) and
the given CSS function, we incorporate the discrepancies of the RGB images and
HSIs as a finer constraint for more accurate reconstruction. Experimental
results demonstrate the effectiveness of our proposed AWAN network in terms of
quantitative comparison and perceptual quality over other state-of-the-art SR
methods. In the NTIRE 2020 Spectral Reconstruction Challenge, our entries
obtain the 1st ranking on the Clean track and the 3rd place on the Real World
track. Code is available at https://github.com/Deep-imagelab/AWAN.
Comment: 1st place on the Clean track and 3rd place (only 1.59106e-4 behind
the 1st) on the Real World track of the NTIRE 2020 Spectral Reconstruction
Challenge
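The channel-reweighting idea behind the AWCA module can be sketched as a toy, assuming the usual squeeze-and-excite pattern: pool each channel to a scalar, squash it to (0, 1), and rescale the channel. The real module learns adaptive pooling weights and uses an excitation sub-network; the pooling and squashing below are stand-ins.

```python
import math

# Illustrative channel-wise reweighting in the spirit of adaptive weighted
# channel attention: each channel's response strength determines how much
# that channel is amplified or suppressed.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps):
    """feature_maps: list of channels, each a flat list of activations."""
    out = []
    for channel in feature_maps:
        pooled = sum(channel) / len(channel)       # global average pool
        weight = sigmoid(pooled)                   # attention weight in (0, 1)
        out.append([v * weight for v in channel])  # rescale the channel
    return out

features = [[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]]
reweighted = channel_attention(features)
# The second channel pools to 0, so its weight is exactly 0.5.
```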
Improving Object Detection with Inverted Attention
Improving the robustness of object detectors against occlusion, blur and
noise is a critical step toward deploying detectors in real applications.
Since it is not possible to exhaust all image defects through data collection,
many researchers seek to generate hard samples during training. The generated
hard samples are either images or feature maps with coarse patches dropped out
in the spatial dimensions. Significant overheads are required for training on
the extra hard samples and/or estimating drop-out patches using extra network
branches. In this paper, we improve object detectors using a highly efficient and fine-grained mechanism
called Inverted Attention (IA). Different from the original detector network
that only focuses on the dominant part of objects, the detector network with IA
iteratively inverts attention on feature maps and puts more attention on
complementary object parts, feature channels and even context. Our approach (1)
operates along both the spatial and channel dimensions of the feature maps;
(2) requires no extra training on hard samples, no extra network parameters for
attention estimation, and no testing overhead. Experiments show that our
approach consistently improves both two-stage and single-stage detectors on
benchmark databases.
Comment: 9 pages, 7 figures, 6 tables
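The inversion step can be sketched as follows. This is a hedged toy: the paper derives its attention from gradients and inverts it along both spatial and channel dimensions, whereas here a 1-D attention vector and a fixed threshold stand in for that machinery.

```python
# Illustrative Inverted Attention: suppress the locations the network
# already attends to, and highlight the complementary, under-attended
# locations so training is forced to use them.

def invert_attention(attention, threshold=0.5):
    """Zero out dominant locations, put full weight on complementary ones."""
    return [0.0 if a >= threshold else 1.0 for a in attention]

attn = [0.9, 0.2, 0.7, 0.1]        # dominant parts get high attention
inverted = invert_attention(attn)  # complementary parts now get the focus
```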
Spatial and Temporal Mutual Promotion for Video-based Person Re-identification
Video-based person re-identification is a crucial task of matching video
sequences of a person across multiple camera views. Generally, features
directly extracted from a single frame suffer from occlusion, blur,
illumination and posture changes. This leads to false activation or missing
activation in some regions, which corrupts the appearance and motion
representation. How to explore the abundant spatial-temporal information in
video sequences is the key to solve this problem. To this end, we propose a
Refining Recurrent Unit (RRU) that recovers the missing parts and suppresses
noisy parts of the current frame's features by referring to historical frames.
With RRU, the quality of each frame's appearance representation is improved.
Then we use the Spatial-Temporal clues Integration Module (STIM) to mine the
spatial-temporal information from those upgraded features. Meanwhile, the
multi-level training objective is used to enhance the capability of RRU and
STIM. Through the cooperation of those modules, the spatial and temporal
features mutually promote each other and the final spatial-temporal feature
representation is more discriminative and robust. Extensive experiments are
conducted on three challenging datasets, i.e., iLIDS-VID, PRID-2011 and MARS.
The experimental results demonstrate that our approach outperforms existing
state-of-the-art methods of video-based person re-identification on iLIDS-VID
and MARS and achieves favorable results on PRID-2011.
Comment: Accepted by AAAI19 as spotlight
VGR-Net: A View Invariant Gait Recognition Network
Biometric identification systems have become immensely popular and important
because of their high reliability and efficiency. However, person
identification at a distance still remains a challenging problem. Gait can be
seen as an essential biometric feature for human recognition and
identification. It can be easily acquired from a distance and does not require
any user cooperation, thus making it suitable for surveillance. But the task
of recognizing an individual using gait can be adversely affected by varying
viewpoints, making this task increasingly challenging. Our proposed approach
tackles this problem by identifying spatio-temporal features through extensive
experimentation and a carefully designed training mechanism. In this paper, we
propose a 3-D Convolution Deep Neural Network for person identification using
gait under multiple views. It is a
2-stage network, in which we have a classification network that initially
identifies the viewing point angle. After that another set of networks (one for
each angle) has been trained to identify the person under a particular viewing
angle. We have tested this network on the publicly available CASIA-B database
and have achieved state-of-the-art results. The proposed system is much more
efficient in terms of time and space, and performs better for almost all
angles.
Comment: Accepted in ISBA (IEEE International Conference on Identity, Security
and Behaviour Analysis)-201
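The 2-stage routing described above can be sketched as plain control flow. In this hedged toy the classifiers are stand-in callables and the "sample" is a tuple of hints; in the paper both stages are 3-D CNNs operating on gait volumes.

```python
# Illustrative two-stage pipeline: first predict the viewing angle, then
# dispatch to the identity classifier trained for that angle.

def two_stage_identify(sample, angle_classifier, identity_classifiers):
    angle = angle_classifier(sample)            # stage 1: which view?
    return identity_classifiers[angle](sample)  # stage 2: who is it?

# Toy stand-ins: a sample is (true_angle, person_hint), and each
# "classifier" just reads the relevant field.
angle_clf = lambda s: s[0]
id_clfs = {
    0: lambda s: "person_a" if s[1] == 0 else "person_b",
    90: lambda s: "person_c",
}

result = two_stage_identify((90, 1), angle_clf, id_clfs)
```

The design choice this illustrates is specialization: each second-stage network only ever sees gait sequences from one viewing angle, simplifying its learning problem.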
Re-Identification with Consistent Attentive Siamese Networks
We propose a new deep architecture for person re-identification (re-id).
While re-id has seen much recent progress, spatial localization and
view-invariant representation learning for robust cross-view matching remain
key, unsolved problems. We address these questions by means of a new
attention-driven Siamese learning architecture, called the Consistent Attentive
Siamese Network. Our key innovations compared to existing, competing methods
include (a) a flexible framework design that produces attention with only
identity labels as supervision, (b) explicit mechanisms to enforce attention
consistency among images of the same person, and (c) a new Siamese framework
that integrates attention and attention consistency, producing principled
supervisory signals as well as the first mechanism that can explain the
reasoning behind the Siamese framework's predictions. We conduct extensive
evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets and
report competitive performance.
Comment: 10 pages, 8 figures, 3 tables; to appear in CVPR 201
Identify Speakers in Cocktail Parties with End-to-End Attention
In scenarios where multiple speakers talk at the same time, it is important
to be able to identify the talkers accurately. This paper presents an
end-to-end system that integrates speech source extraction and speaker
identification, and proposes a new way to jointly optimize these two parts by
max-pooling the speaker predictions along the channel dimension. Residual
attention permits us to learn spectrogram masks that are optimized for the
purpose of speaker identification, while residual forward connections permit
dilated convolution with a sufficiently large context window to guarantee
correct streaming across syllable boundaries. End-to-end training results in a
system that recognizes one speaker in a two-speaker broadcast speech mixture
with 99.9% accuracy and both speakers with 93.9% accuracy, and that recognizes
all speakers in three-speaker scenarios with 81.2% accuracy.
Comment: Accepted by Interspeech 2020 for presentation;
https://github.com/JunzheJosephZhu/Identify-Speakers-in-Cocktail-Parties-with-E2E-Attentio
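The max-pooling step that joins the two parts can be sketched as a toy. Score values here are made up; in the system each extracted source channel emits per-speaker scores, and pooling the maximum across channels means a speaker counts as detected if any channel recognizes them.

```python
# Illustrative max-pooling of speaker predictions along the channel
# dimension: combine per-channel speaker scores into one score per speaker.

def pool_speaker_predictions(channel_scores):
    """channel_scores: one score list per extracted source channel."""
    num_speakers = len(channel_scores[0])
    return [max(scores[i] for scores in channel_scores)
            for i in range(num_speakers)]

# Two extracted channels, three candidate speakers (scores are made up).
scores = [[0.9, 0.1, 0.2],   # channel 1 is confident about speaker 0
          [0.2, 0.1, 0.8]]   # channel 2 is confident about speaker 2
pooled = pool_speaker_predictions(scores)
```

Because the max is taken per speaker, the gradient of an identification loss flows back only through the channel that produced the winning score, which is what lets the two parts be optimized jointly.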
Crossing Generative Adversarial Networks for Cross-View Person Re-identification
Person re-identification (\textit{re-id}) refers to matching pedestrians
across disjoint yet non-overlapping camera views. The most effective way to
match these pedestrians undertaking significant visual variations is to seek
reliably invariant features that can describe the person of interest
faithfully. Most existing methods are presented in a supervised manner to
produce discriminative features by relying on labeled image pairs in
correspondence. However, annotating pair-wise images is prohibitively
expensive in labor, and thus not practical for large-scale camera networks.
Moreover, seeking comparable representations across camera views demands a
flexible model to address the complex distributions of images. In this work,
we study the co-occurrence statistic patterns between pairs of images, and
propose a crossing Generative Adversarial Network (Cross-GAN) for learning a
joint distribution of cross-image representations in an unsupervised manner. Given a
pair of person images, the proposed model consists of the variational
auto-encoder to encode the pair into respective latent variables, a proposed
cross-view alignment to reduce the view disparity, and an adversarial layer to
seek the joint distribution of latent representations. The learned latent
representations are well-aligned to reflect the co-occurrence patterns of
paired images. We empirically evaluate the proposed model on challenging
datasets, and our results show the importance of joint invariant features in
improving the matching rates of person re-id in comparison to semi-supervised
and unsupervised state-of-the-art methods.
Comment: 12 pages. arXiv admin note: text overlap with arXiv:1702.03431 by
other authors
Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video based person Re-Identification (ReID). A temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN network. The resulting M3D convolution network
introduces a fraction of parameters into the 2D CNN, but gains the ability of
multi-scale temporal feature learning. With this compact architecture, M3D
convolution network is also more efficient and easier to optimize than existing
3D convolution networks. The temporal stream further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from two streams are finally fused for the video
based person ReID. Evaluations on three widely used benchmark datasets, i.e.,
MARS, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of our
method over existing 3D convolution networks and state-of-the-art methods.
Comment: AAAI 201
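The multi-scale temporal idea can be sketched in one dimension. This is a hedged toy: the kernel values and the set of dilation rates are made up, and the paper's M3D layers are full 3-D convolutions inserted into a 2D CNN, not the 1-D filters below.

```python
# Illustrative multi-scale temporal filtering: apply the same small kernel
# along the time axis at several dilation rates, so temporal patterns at
# different time scales are captured with very few extra parameters, then
# sum the responses.

def dilated_conv1d(seq, kernel, dilation):
    span = (len(kernel) - 1) * dilation
    return [sum(k * seq[t + i * dilation] for i, k in enumerate(kernel))
            for t in range(len(seq) - span)]

def multi_scale_temporal(seq, kernel, dilations=(1, 2, 3)):
    outs = [dilated_conv1d(seq, kernel, d) for d in dilations]
    n = min(len(o) for o in outs)                # align output lengths
    return [sum(o[t] for o in outs) for t in range(n)]

# Per-frame feature values for a 7-frame clip (made up).
frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
response = multi_scale_temporal(frames, kernel=[0.5, 0.5])
```

Sharing one small kernel across dilation rates is what keeps the parameter overhead a fraction of the 2D backbone's size.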
Adaptation and Re-Identification Network: An Unsupervised Deep Transfer Learning Approach to Person Re-Identification
Person re-identification (Re-ID) aims at recognizing the same person from
images taken across different cameras. To address this task, one typically
requires a large amount of labeled data to train an effective Re-ID model,
which might not be practical for real-world applications. To alleviate this
limitation, we choose to exploit a sufficient amount of pre-existing labeled
data from a different (auxiliary) dataset. By jointly considering such an
auxiliary dataset and the dataset of interest (but without label information),
our proposed adaptation and re-identification network (ARN) performs
unsupervised domain adaptation, which leverages information across datasets and
derives domain-invariant features for Re-ID purposes. In our experiments, we
verify that our network performs favorably against state-of-the-art
unsupervised Re-ID approaches, and even outperforms a number of baseline Re-ID
methods which require fully supervised data for training.
Comment: 7 pages, 3 figures. CVPR 2018 workshop paper