3 research outputs found
Semantics-Aligned Representation Learning for Person Re-identification
Person re-identification (reID) aims to match person images to retrieve the
ones with the same identity. This is a challenging task, as the images to be
matched are generally semantically misaligned due to the diversity of human
poses and capture viewpoints, incompleteness of the visible bodies (due to
occlusion), etc. In this paper, we propose a framework that drives the reID
network to learn semantics-aligned feature representations through carefully
designed supervision. Specifically, we build a Semantics Aligning Network (SAN)
which consists of a base network as encoder (SA-Enc) for reID, and a decoder
(SA-Dec) for reconstructing/regressing the densely semantics-aligned full
texture image. We jointly train the SAN under the supervision of person
re-identification and aligned texture generation. Moreover, at the decoder,
besides the reconstruction loss, we add Triplet ReID constraints over the
feature maps as perceptual losses. The decoder is discarded at inference,
so our scheme is computationally efficient. Ablation studies
demonstrate the effectiveness of our design. We achieve state-of-the-art
performance on the benchmark datasets CUHK03, Market1501, MSMT17, and the
partial person reID dataset Partial REID. Code for our proposed method is
available at:
https://github.com/microsoft/Semantics-Aligned-Representation-Learning-for-Person-Re-identification
Comment: Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20);
code has been released
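As a rough illustration of the triplet constraint mentioned in the abstract above, the following is a minimal sketch of a standard margin-based triplet loss on plain feature vectors. The Euclidean distance, the margin value of 0.3, and the function names are assumptions made for illustration; the abstract does not specify the exact formulation used in SAN.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard margin-based triplet loss (margin value is an assumption).

    The loss is zero once the negative (different identity) is at least
    `margin` farther from the anchor than the positive (same identity).
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Usage: a well-separated triplet versus a violated one.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

In this sketch the loss only pushes on triplets that violate the margin, which is the usual reason such constraints are used to shape identity-discriminative features.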
3D Human Pose, Shape and Texture from Low-Resolution Images and Videos
3D human pose and shape estimation from monocular images has been an active
research area in computer vision. Existing deep learning methods for this task
rely on high-resolution input, which, however, is not always available in many
scenarios such as video surveillance and sports broadcasting. Two common
approaches to deal with low-resolution images are applying super-resolution
techniques to the input, which may result in unpleasant artifacts, or simply
training one model for each resolution, which is impractical in many realistic
applications.
To address the above issues, this paper proposes a novel algorithm called
RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss,
and a Contrastive learning scheme. The proposed method is able to learn 3D body
pose and shape across different resolutions with a single model. The
self-supervision loss enforces scale-consistency of the output, and the
contrastive learning scheme enforces scale-consistency of the deep features. We
show that both these new losses provide robustness when learning in a
weakly-supervised manner. Moreover, we extend the RSC-Net to handle
low-resolution videos and apply it to reconstruct textured 3D pedestrians from
low-resolution input. Extensive experiments demonstrate that the RSC-Net can
achieve consistently better results than the state-of-the-art methods for
challenging low-resolution images.
Comment: arXiv admin note: substantial text overlap with arXiv:2007.1366
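The scale-consistency idea described in the abstract above can be sketched as a penalty on the disagreement between predictions made from the same image at different resolutions. This is a hypothetical simplification: the mean squared error, the choice of the highest-resolution prediction as the reference, and the function name are all assumptions, since the abstract does not give the actual form of the RSC-Net loss.

```python
def scale_consistency_loss(predictions):
    """Mean squared deviation of lower-resolution predictions from a reference.

    `predictions` is a list of equal-length parameter vectors (for example
    pose and shape parameters), one per input resolution, with the
    highest-resolution prediction first. Using that first entry as the
    reference is an assumption of this sketch.
    """
    reference = predictions[0]
    others = predictions[1:]
    if not others:
        return 0.0
    total = 0.0
    for pred in others:
        squared = sum((p - r) ** 2 for p, r in zip(pred, reference))
        total += squared / len(reference)
    return total / len(others)


# Usage: identical predictions incur no penalty, diverging ones do.
consistent = scale_consistency_loss([[1.0, 2.0], [1.0, 2.0]])
inconsistent = scale_consistency_loss([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
```

A loss of this kind needs no ground-truth labels for the low-resolution inputs, which is consistent with the abstract's description of it as self-supervised.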