Structure-Preserving Binary Representations for RGB-D Action Recognition
In this paper, we propose a novel binary local representation for RGB-D video data fusion with a structure-preserving projection. Our contribution consists of two aspects. To obtain a general feature for the video data, we recast the problem as describing the gradient fields of the RGB and depth channels of the video sequences. From the local fluxes of the gradient fields, which encode the orientation and magnitude of the field in the neighborhood of each point, a new kind of continuous local descriptor called the Local Flux Feature (LFF) is obtained. The LFFs from the RGB and depth channels are then fused into a Hamming space via the Structure Preserving Projection (SPP). Specifically, an orthogonal projection matrix is applied to preserve the pairwise structure, with a shape constraint to avoid the collapse of the data structure in the projected space. Furthermore, a bipartite graph structure over the data is taken into consideration, regarded as a higher-level connection between samples and classes than the pairwise structure of local features. Extensive experiments on several RGB-D action recognition benchmarks show not only the high efficiency of the binary codes and the effectiveness of combining LFFs from the RGB and depth channels via SPP, but also the potential of LFF for general action recognition.
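The abstract does not spell out the SPP optimization, but the binarization step it describes can be illustrated with a minimal sketch: fuse the RGB and depth descriptors, apply an orthogonal projection, and take signs to land in Hamming space. Everything here (the random orthogonal matrix standing in for the learned structure-preserving one, the function name, the dimensions) is an illustrative assumption, not the paper's method.

```python
import numpy as np

def binarize_with_orthogonal_projection(feats, n_bits, seed=0):
    """Project descriptors with an orthogonal matrix and take signs.

    A stand-in for the paper's learned Structure Preserving Projection:
    here the orthogonal matrix is random rather than optimized to
    preserve pairwise structure.
    """
    rng = np.random.default_rng(seed)
    # Orthonormal columns via QR decomposition of a random matrix.
    q, _ = np.linalg.qr(rng.standard_normal((feats.shape[1], n_bits)))
    centered = feats - feats.mean(axis=0)        # zero-center first
    return (centered @ q > 0).astype(np.uint8)   # binary codes in Hamming space

# Toy usage: fuse RGB and depth descriptors by concatenation,
# then hash to 64-bit codes.
rgb = np.random.randn(100, 128)
depth = np.random.randn(100, 128)
codes = binarize_with_orthogonal_projection(np.hstack([rgb, depth]), 64)
```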
Graph Distillation for Action Detection with Privileged Modalities
We propose a technique that tackles action detection in multimodal videos
under a realistic and challenging condition in which only limited training data
and partially observed modalities are available. Common methods in transfer
learning do not take advantage of the extra modalities potentially available in
the source domain. On the other hand, previous work on multimodal learning only
focuses on a single domain or task and does not handle the modality discrepancy
between training and testing. In this work, we propose a method termed graph
distillation that incorporates rich privileged information from a large-scale
multimodal dataset in the source domain, and improves the learning in the
target domain where training data and modalities are scarce. We evaluate our
approach on action classification and detection tasks in multimodal videos, and
show that our model outperforms the state-of-the-art by a large margin on the
NTU RGB+D and PKU-MMD benchmarks. The code is released at
http://alan.vision/eccv18_graph/. Comment: ECCV 2018
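As a rough illustration of the distillation idea (not the paper's exact formulation), one can treat each modality as a graph node and let edge weights gate how strongly each privileged modality's softened predictions supervise the target modality. The sketch below uses fixed edge weights where the paper learns them; the function name, temperature, and weighting scheme are all assumptions.

```python
import torch
import torch.nn.functional as F

def graph_distillation_loss(logits_by_modality, target_idx, edge_weights, T=2.0):
    """Distill soft predictions from privileged modalities into the target.

    `logits_by_modality` is a list of (batch, classes) tensors, one per
    modality; `edge_weights[j]` gates how much modality j supervises the
    target modality. Fixed weights here; the paper learns the graph.
    """
    student = F.log_softmax(logits_by_modality[target_idx] / T, dim=-1)
    loss = 0.0
    for j, logits in enumerate(logits_by_modality):
        if j == target_idx:
            continue
        teacher = F.softmax(logits.detach() / T, dim=-1)  # privileged modality
        loss = loss + edge_weights[j] * F.kl_div(
            student, teacher, reduction="batchmean") * T * T
    return loss
```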
Generalized Rank Pooling for Activity Recognition
Most popular deep models for action recognition split video sequences into
short sub-sequences consisting of a few frames; frame-based features are then
pooled for recognizing the activity. Usually, this pooling step discards the
temporal order of the frames, which could otherwise be used for better
recognition. Towards this end, we propose a novel pooling method, generalized
rank pooling (GRP), which takes as input features from the intermediate layers
of a CNN trained on tiny sub-sequences, and produces as output the
parameters of a subspace which (i) provides a low-rank approximation to the
features and (ii) preserves their temporal order. We propose to use these
parameters as a compact representation for the video sequence, which is then
used in a classification setup. We formulate an objective for computing this
subspace as a Riemannian optimization problem on the Grassmann manifold, and
propose an efficient conjugate gradient scheme for solving it. Experiments on
several activity recognition datasets show that our scheme leads to
state-of-the-art performance. Comment: Accepted at the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2017
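To make the objective concrete, a rough sketch follows: find an orthonormal basis U that both reconstructs the frame features and assigns increasing projected energy over time. The paper solves this with Riemannian conjugate gradient on the Grassmann manifold; the sketch below substitutes plain gradient steps with a QR retraction, and all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def generalized_rank_pooling(X, k, lam=1.0, lr=1e-4, iters=200, margin=1e-3):
    """Pool a time-ordered feature matrix X of shape (T, d) into a
    k-dimensional subspace.

    The basis U should (i) reconstruct the frames well and (ii) give
    growing projected energy ||U^T x_t||^2 over time, a simple proxy
    for preserving temporal order.
    """
    T, d = X.shape
    S = X.T @ X
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(iters):
        # Reconstruction term: for orthonormal U, minimizing
        # ||X - X U U^T||^2 maximizes tr(U^T S U), with gradient -2 S U.
        grad = -2.0 * S @ U
        e = np.sum((X @ U) ** 2, axis=1)        # projected energy per frame
        for t in range(T - 1):                  # temporal-order hinge
            if e[t] - e[t + 1] + margin > 0:
                diff = np.outer(X[t], X[t]) - np.outer(X[t + 1], X[t + 1])
                grad += 2.0 * lam * diff @ U
        U, _ = np.linalg.qr(U - lr * grad)      # retraction to the manifold
    return U  # used (e.g. flattened) as the compact video representation
```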
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique makes the classifier
robust to missing signals in one or several channels, so that it produces
meaningful predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio. Comment: 14 pages, 7 figures
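A minimal sketch of the ModDrop idea, assuming per-sample dropping of whole modality streams ahead of a concatenation-based fusion; the class name, drop probability, and feature shapes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModDrop(nn.Module):
    """Randomly zero whole modality streams during training.

    Each modality's feature tensor is dropped independently with
    probability p before fusion, so the fusion layers learn cross-modal
    correlations that survive missing channels. At test time all
    modalities pass through unchanged.
    """
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, modality_feats):
        if not self.training:
            return modality_feats
        kept = []
        for x in modality_feats:
            # Drop the entire stream per sample, not per unit.
            mask = (torch.rand(x.shape[0], 1, device=x.device) > self.p).float()
            kept.append(x * mask)
        return kept

# Usage sketch: fuse RGB, depth, and audio features by concatenation.
moddrop = ModDrop(p=0.1)
rgb, depth, audio = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 32)
fused = torch.cat(moddrop([rgb, depth, audio]), dim=-1)
```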