Semantics-Aligned Representation Learning for Person Re-identification
Person re-identification (reID) aims to match person images to retrieve the
ones with the same identity. This is a challenging task, as the images to be
matched are generally semantically misaligned due to the diversity of human
poses and capture viewpoints, incompleteness of the visible bodies (due to
occlusion), etc. In this paper, we propose a framework that drives the reID
network to learn semantics-aligned feature representation through delicate
supervision designs. Specifically, we build a Semantics Aligning Network (SAN),
which consists of a base network as encoder (SA-Enc) for reID and a decoder
(SA-Dec) for reconstructing/regressing the densely semantics-aligned full
texture image. We jointly train the SAN under the supervision of person
re-identification and aligned texture generation. Moreover, at the decoder,
besides the reconstruction loss, we add Triplet ReID constraints over the
feature maps as perceptual losses. The decoder is discarded at inference,
so our scheme is computationally efficient. Ablation studies
demonstrate the effectiveness of our design. We achieve state-of-the-art
performance on the benchmark datasets CUHK03, Market1501, MSMT17, and the
partial person reID dataset Partial REID. Code for our proposed method is
available at:
https://github.com/microsoft/Semantics-Aligned-Representation-Learning-for-Person-Re-identification
Comment: Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20);
code has been released
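As a rough illustration of the triplet constraint used as a perceptual loss over decoder feature maps, a minimal sketch follows. The function name, the margin value, and the flattened-feature-map setup are all assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def triplet_perceptual_loss(anchor, positive, negative, margin=0.3):
    # Triplet ReID constraint over (flattened) feature maps: pull
    # same-identity features together, push different-identity
    # features apart, up to a margin.
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)

# Toy flattened feature maps (e.g. C*H*W -> 64 values).
rng = np.random.default_rng(0)
a = rng.normal(size=64)
p = a + 0.01 * rng.normal(size=64)   # near-duplicate: same identity
n = rng.normal(size=64)              # unrelated: different identity
loss = triplet_perceptual_loss(a, p, n)
```

With a well-separated positive and negative, the hinge is inactive and the loss is zero; swapping the roles of positive and negative yields a positive loss.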
Confounder Identification-free Causal Visual Feature Learning
Confounders in deep learning are generally detrimental to a model's
generalization, as they infiltrate the learned feature representations.
Learning causal features that are free of interference from confounders is
therefore important. Most previous causal-learning approaches employ the
back-door criterion to mitigate the adverse effect of certain specific
confounders, which requires identifying those confounders explicitly. However,
in real scenarios, confounders are typically diverse and difficult to identify. In this
paper, we propose a novel Confounder Identification-free Causal Visual Feature
Learning (CICF) method, which obviates the need for identifying confounders.
CICF models the interventions among different samples based on front-door
criterion, and then approximates the global-scope intervening effect upon the
instance-level interventions from the perspective of optimization. In this way,
we aim to find a reliable optimization direction, which avoids the intervening
effects of confounders, to learn causal features. Furthermore, we uncover the
relation between CICF and the popular meta-learning strategy MAML, and provide
an interpretation of why MAML works from the theoretical perspective of causal
learning for the first time. Thanks to the effective learning of causal
features, our CICF enables models to have superior generalization capability.
Extensive experiments on domain generalization benchmark datasets demonstrate
the effectiveness of our CICF, which achieves state-of-the-art performance.
Comment: 25 pages
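One way to picture "approximating the global-scope intervening effect from instance-level interventions" is as aggregating per-sample gradient directions into a single update direction, so that no individual (possibly confounded) sample dominates the step. This is only an interpretive sketch of the optimization view, with all names hypothetical; the abstract does not specify the actual procedure:

```python
import numpy as np

def global_intervention_direction(per_sample_grads, weights=None):
    # Aggregate instance-level gradient directions into one global
    # update direction; uniform weights by default, so the step
    # reflects the whole batch rather than any single sample.
    g = np.asarray(per_sample_grads, dtype=float)
    if weights is None:
        weights = np.full(len(g), 1.0 / len(g))
    return np.asarray(weights, dtype=float) @ g

# Four per-sample gradients in a 2-D parameter space; the last is an outlier.
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, -10.0]]
direction = global_intervention_direction(grads)  # uniform average
```

Non-uniform weights could further downweight samples whose instance-level intervention looks unreliable; the uniform case above is just the simplest aggregate.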
MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild
Detecting small scene text instances in the wild is particularly challenging,
where the influence of irregular positions and nonideal lighting often leads to
detection errors. We present MixNet, a hybrid architecture that combines the
strengths of CNNs and Transformers, capable of accurately detecting small text
from challenging natural scenes, regardless of the orientations, styles, and
lighting conditions. MixNet incorporates two key modules: (1) the Feature
Shuffle Network (FSNet) to serve as the backbone and (2) the Central
Transformer Block (CTBlock) to exploit the 1D manifold constraint of the scene
text. We first introduce a novel feature shuffling strategy in FSNet to
facilitate the exchange of features across multiple scales, generating
high-resolution features superior to those of the popular ResNet and HRNet
backbones. The FSNet
backbone has achieved significant improvements over many existing text
detection methods, including PAN, DB, and FAST. We then design a complementary
CTBlock that leverages center-line-based features, akin to the medial axis of
text regions, and show that it outperforms contour-based approaches in
challenging cases where small scene texts appear close together. Extensive experimental
results show that MixNet, which mixes FSNet with CTBlock, achieves
state-of-the-art results on multiple scene text detection datasets.
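A feature-shuffling strategy of the kind FSNet describes can be sketched as redistributing channel groups across multi-scale feature maps (assumed here to be resized to a common resolution first). The function name and the group-split scheme are illustrative assumptions, not the published architecture:

```python
import numpy as np

def shuffle_across_scales(feats):
    # feats: list of (C, H, W) feature maps, one per scale, already
    # resized to the same spatial resolution. Split each map's
    # channels into len(feats) groups, then rebuild each output so
    # it carries one channel group from every scale.
    n = len(feats)
    groups = [np.array_split(f, n, axis=0) for f in feats]
    return [np.concatenate([groups[s][i] for s in range(n)], axis=0)
            for i in range(n)]

low  = np.zeros((4, 8, 8))   # upsampled low-resolution branch
high = np.ones((4, 8, 8))    # high-resolution branch
mixed = shuffle_across_scales([low, high])
```

Each output map keeps its original channel count and resolution but now mixes information from every scale, which is the cross-scale exchange the backbone is built around.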