Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video-based person Re-Identification (ReID). The temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN. The resulting M3D convolution network
introduces only a small fraction of additional parameters into the 2D CNN, but
gains the ability of multi-scale temporal feature learning. With this compact
architecture, the M3D convolution network is also more efficient and easier to
optimize than existing
3D convolution networks. The temporal stream further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from the two streams are finally fused for
video-based person ReID. Evaluations on three widely used benchmark datasets,
i.e., MARS, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of
our method over existing 3D convolution networks and state-of-the-art methods.
Comment: AAAI, 201
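As a rough illustration of the idea, the sketch below shows how a few lightweight temporal convolutions with different dilation rates could be attached to a spatial convolution so that a 2D backbone gains multi-scale temporal features. The class name, channel sizes, and the residual branch arrangement are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class M3DLayer(nn.Module):
    """Hypothetical multi-scale temporal layer: a spatial 1x3x3 convolution
    augmented with lightweight 3x1x1 temporal convolutions at several
    dilation rates, added residually to capture multi-scale temporal cues."""

    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        # Spatial branch: behaves like an ordinary 2D convolution applied per frame.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        # Temporal branches: one 3x1x1 convolution per dilation rate (multi-scale).
        self.temporal = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(d, 0, 0), dilation=(d, 1, 1), bias=False)
            for d in dilations
        ])

    def forward(self, x):              # x: (batch, channels, time, height, width)
        out = self.spatial(x)
        for conv in self.temporal:     # add each temporal scale residually
            out = out + conv(x)
        return out

# Toy usage: an 8-frame clip with 64 feature channels keeps its shape.
clip = torch.randn(2, 64, 8, 56, 56)
print(M3DLayer(64)(clip).shape)        # torch.Size([2, 64, 8, 56, 56])
```

Because the temporal kernels span only the time axis, the extra parameter count grows with the number of dilation rates rather than with the spatial kernel size, which is what keeps such a layer compact relative to full 3D convolutions.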
Appearance-and-Relation Networks for Video Classification
Spatiotemporal feature learning in videos is a fundamental problem in
computer vision. This paper presents a new architecture, termed
Appearance-and-Relation Network (ARTNet), to learn video representation in an
end-to-end manner. ARTNets are constructed by stacking multiple generic
building blocks, called SMART blocks, whose goal is to simultaneously model
appearance and relation from RGB input in a separate and explicit manner.
Specifically, SMART blocks decouple the spatiotemporal learning module into an
appearance branch for spatial modeling and a relation branch for temporal
modeling. The appearance branch is implemented based on the linear combination
of pixels or filter responses in each frame, while the relation branch is
designed based on the multiplicative interactions between pixels or filter
responses across multiple frames. We perform experiments on three action
recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART
blocks obtain an evident improvement over 3D convolutions for spatiotemporal
feature learning. Under the same training setting, ARTNets achieve performance
on these three datasets superior to the existing state-of-the-art methods.
Comment: CVPR18 camera-ready version. Code & models available at
https://github.com/wanglimin/ARTNe
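To make the branch decomposition concrete, here is a hypothetical sketch of such a two-branch block: a per-frame 2D convolution stands in for the appearance branch (a linear combination of pixels within each frame), and the square of a 3D-convolution response stands in for the multiplicative cross-frame interactions of the relation branch. The class name, channel split, and normalization choices are assumptions, not the released ARTNet code.

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Hypothetical appearance/relation block: a per-frame spatial branch
    concatenated with a squared 3D-convolution response that serves as a
    simple proxy for multiplicative interactions across frames."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.appearance = nn.Sequential(
            nn.Conv3d(in_ch, half, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(half), nn.ReLU(inplace=True))
        self.relation_conv = nn.Conv3d(in_ch, half, kernel_size=3, padding=1, bias=False)
        self.relation_post = nn.Sequential(nn.BatchNorm3d(half), nn.ReLU(inplace=True))

    def forward(self, x):                       # x: (batch, channels, time, H, W)
        app = self.appearance(x)                # spatial (appearance) features
        rel = self.relation_post(self.relation_conv(x) ** 2)  # cross-frame (relation) features
        return torch.cat([app, rel], dim=1)     # fuse the two branches channel-wise

clip = torch.randn(2, 3, 16, 112, 112)          # a 16-frame RGB clip
print(TwoBranchBlock(3, 64)(clip).shape)        # torch.Size([2, 64, 16, 112, 112])
```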
About Pyramid Structure in Convolutional Neural Networks
Deep convolutional neural networks (CNNs) have undoubtedly revolutionized
various challenging tasks, mainly in computer vision. However, their design
still requires attention to reducing the number of learnable parameters without
any meaningful reduction in performance. In this paper we investigate to what
extent CNNs may take advantage of the pyramid structure typical of biological
neurons. A generalized statement over the convolutional layers, from the input
up to the fully connected layer, is introduced that further helps in
understanding and designing a successful deep network. It reduces ambiguity,
the number of parameters, and their size on disk without degrading overall
accuracy.
Performance is reported on state-of-the-art models for the MNIST, CIFAR-10,
CIFAR-100, and ImageNet-12 datasets. Despite a reduction of more than 80% in
parameters for Caffe_LENET, comparable results are obtained. Further, despite a
10-20% reduction in training data along with a 10-40% reduction in parameters
for the AlexNet model and its variations, results remain competitive with
similarly well-engineered deeper architectures.
Comment: Published in 2016 International Joint Conference on Neural Networks
(IJCNN)
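A minimal sketch of the underlying intuition, assuming a simple comparison of parameter counts: a channel plan that tapers toward the classifier (a pyramid-like profile) uses far fewer convolutional parameters than one that keeps widening. The channel widths below are invented for illustration and are not taken from the paper.

```python
import torch.nn as nn

def make_cnn(channel_plan):
    """Build a small CNN from a list of per-layer output channel widths."""
    layers, in_ch = [], 3
    for out_ch in channel_plan:
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True), nn.MaxPool2d(2)]
        in_ch = out_ch
    return nn.Sequential(*layers)

def num_params(model):
    return sum(p.numel() for p in model.parameters())

# A plan that keeps widening vs. a pyramid-like plan that tapers toward the
# classifier; both channel plans are purely illustrative.
widening  = make_cnn([64, 128, 256, 256])
pyramidal = make_cnn([64, 96, 64, 32])

print(num_params(widening))    # roughly 9.6e5 convolutional parameters
print(num_params(pyramidal))   # roughly 1.3e5, an ~86% reduction
```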