Appearance-and-Relation Networks for Video Classification
Spatiotemporal feature learning in videos is a fundamental problem in
computer vision. This paper presents a new architecture, termed the
Appearance-and-Relation Network (ARTNet), to learn video representations in an
end-to-end manner. ARTNets are constructed by stacking multiple generic
building blocks, called SMART, whose goal is to simultaneously model
appearance and relation from RGB input in a separate and explicit manner.
Specifically, SMART blocks decouple the spatiotemporal learning module into an
appearance branch for spatial modeling and a relation branch for temporal
modeling. The appearance branch is implemented based on the linear combination
of pixels or filter responses in each frame, while the relation branch is
designed based on the multiplicative interactions between pixels or filter
responses across multiple frames. We perform experiments on three action
recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART
blocks obtain an evident improvement over 3D convolutions for spatiotemporal
feature learning. Under the same training setting, ARTNets achieve superior
performance on these three datasets to the existing state-of-the-art methods.Comment: CVPR18 camera-ready version. Code & models available at
https://github.com/wanglimin/ARTNe
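The decoupling described above can be illustrated with a minimal numpy sketch. This is a toy simplification under stated assumptions: the real SMART block operates on convolutional feature maps inside a deep network, whereas here each frame is a flat feature vector, the weights are random, and the fusion-by-concatenation step is hypothetical. Only the core idea is kept: the appearance branch is a linear combination of per-frame responses, while the relation branch is built from multiplicative interactions between responses of adjacent frames.

```python
import numpy as np

def smart_block_sketch(clip, w_app, w_rel):
    """Toy sketch of a SMART-style block (hypothetical simplification).

    clip:  (T, D) array -- T frames, each a D-dim vector of filter responses.
    w_app: (D, K) weights for the appearance branch (per-frame linear combination).
    w_rel: (D, K) weights applied to multiplicative cross-frame interactions.
    """
    # Appearance branch: spatial modeling via a linear combination of
    # filter responses within each frame independently.
    appearance = clip @ w_app                 # (T, K)

    # Relation branch: temporal modeling via multiplicative interactions
    # between filter responses of consecutive frames.
    interactions = clip[:-1] * clip[1:]       # (T-1, D) elementwise products
    relation = interactions @ w_rel           # (T-1, K)

    # Hypothetical fusion: temporal average pooling, then concatenation.
    return np.concatenate([appearance.mean(axis=0), relation.mean(axis=0)])

rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 16))           # 8 frames, 16 responses each
out = smart_block_sketch(clip,
                         rng.standard_normal((16, 4)),
                         rng.standard_normal((16, 4)))
print(out.shape)  # (8,): 4 appearance features + 4 relation features
```

The multiplicative interaction is the key design choice: products of responses across frames capture correlations (motion-like relations) that a purely additive per-frame branch cannot express.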
Re-ID done right: towards good practices for person re-identification
Training a deep architecture using a ranking loss has become standard for the
person re-identification task. Increasingly, these deep architectures include
additional components that leverage part detections, attribute predictions,
pose estimators and other auxiliary information, in order to more effectively
localize and align discriminative image regions. In this paper we adopt a
different approach and carefully design each component of a simple deep
architecture and, critically, the strategy for training it effectively for
person re-identification. We extensively evaluate each design choice, leading
to a list of good practices for person re-identification. By following these
practices, our approach outperforms the state of the art, including more
complex methods with auxiliary components, by large margins on four benchmark
datasets. We also provide a qualitative analysis of our trained representation
which indicates that, while compact, it is able to capture information from
localized and discriminative regions, in a manner akin to an implicit attention
mechanism.
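The "ranking loss" that the abstract describes as standard for re-identification is typically a triplet margin loss: embeddings of the same person should be closer than embeddings of different people by at least a margin. The sketch below shows the generic form; the margin value and Euclidean distance are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet ranking loss for a single (a, p, n) triplet.

    anchor/positive: embeddings of the same identity.
    negative:        embedding of a different identity.
    margin:          illustrative value; papers tune this per dataset.
    """
    d_pos = np.linalg.norm(anchor - positive)  # same-identity distance
    d_neg = np.linalg.norm(anchor - negative)  # cross-identity distance
    # Zero loss once the positive is closer than the negative by >= margin.
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # near the anchor: triplet already satisfied
n = np.array([-1.0, 0.0])  # far from the anchor
easy = triplet_loss(a, p, n)   # 0.0 -- no gradient signal
hard = triplet_loss(a, n, p)   # > 0 -- violating triplet incurs loss
print(easy, hard)
```

During training the loss is averaged over mined triplets in a batch; the paper's point is that with careful training practices this simple objective alone can outperform architectures with auxiliary part or pose components.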