Space-Time Representation of People Based on 3D Skeletal Data: A Review
Spatiotemporal human representation based on 3D visual perception data is a
rapidly growing research area. According to their information source, these
representations can be broadly categorized into two groups: those based on
RGB-D information and those based on 3D skeleton data. Recently, skeleton-based
human representations have been intensively studied and continue to attract
increasing attention, owing to their robustness to variations in viewpoint,
human body scale, and motion speed, as well as their real-time, online
performance. This paper presents a
comprehensive survey of existing space-time representations of people based on
3D skeletal data, and provides an informative categorization and analysis of
these methods from several perspectives, including information modality,
representation encoding, structure and transition, and feature engineering. We
also provide a brief overview of skeleton acquisition devices and construction
methods, enlist a number of public benchmark datasets with skeleton data, and
discuss potential future research directions.
Comment: Our paper has been accepted by the journal Computer Vision and Image Understanding; see http://www.sciencedirect.com/science/article/pii/S1077314217300279
HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images
Co-saliency detection aims to discover common and salient objects in an image
group containing more than two relevant images. Moreover, depth information has
been demonstrated to be effective for many computer vision tasks. In this
paper, we propose a novel co-saliency detection method for RGBD images based on
hierarchical sparsity reconstruction and energy function refinement. With the
assistance of the intra saliency map, the inter-image correspondence is
formulated as a hierarchical sparsity reconstruction framework. The global
sparsity reconstruction model with a ranking scheme focuses on capturing the
global characteristics among the whole image group through a common foreground
dictionary. The pairwise sparsity reconstruction model aims to explore the
corresponding relationship between pairwise images through a set of pairwise
dictionaries. In order to improve the intra-image smoothness and inter-image
consistency, an energy function refinement model is proposed, which includes
the unary data term, spatial smooth term, and holistic consistency term.
Experiments on two RGBD co-saliency detection benchmarks demonstrate that the
proposed method outperforms the state-of-the-art algorithms both qualitatively
and quantitatively.
Comment: 11 pages, 5 figures. Accepted by IEEE Transactions on Multimedia, https://rmcong.github.io
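The paper's exact models are not reproduced here, but the general idea of sparse-reconstruction-based saliency can be sketched: regions whose features are well reconstructed by a common foreground dictionary are likely to belong to the co-salient object. A minimal illustration in NumPy, assuming ISTA for the sparse coding step and a hypothetical dictionary `D` whose columns are foreground feature atoms:

```python
import numpy as np

def ista_code(D, y, lam=0.1, n_iter=200):
    """Sparse-code y over dictionary D (columns = atoms) via ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ x - y)              # gradient of 0.5 * ||Dx - y||^2
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

def reconstruction_errors(D, features):
    """Per-region sparse reconstruction error w.r.t. a foreground
    dictionary D: a LOW error suggests the region matches the common
    foreground (the actual HSCS model adds ranking and refinement)."""
    errs = np.array([np.linalg.norm(D @ ista_code(D, y) - y)
                     for y in features])
    return errs / (errs.max() + 1e-12)     # normalise to [0, 1]
```

In this toy setup, region features lying in the span of the dictionary atoms obtain a markedly lower error than unrelated features.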
Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images
As a fundamental and challenging problem in computer vision, hand pose
estimation aims to estimate the hand joint locations from depth images.
Typically, the problem is modeled as learning a mapping function from images to
hand joint coordinates in a data-driven manner. In this paper, we propose
Context-Aware Deep Spatio-Temporal Network (CADSTN), a novel method to jointly
model the spatio-temporal properties for hand pose estimation. Our proposed
network is able to learn the representations of the spatial information and the
temporal structure from the image sequences. Moreover, by adopting an adaptive
fusion method, the model is capable of dynamically weighting different
predictions to emphasize sufficient context. Our method is evaluated on two
common benchmarks; the experimental results demonstrate that our proposed
approach achieves the best or second-best performance compared with
state-of-the-art methods and runs at 60 fps.
Comment: IEEE Transactions on Cybernetics
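The adaptive fusion step can be illustrated generically (the branch names and confidence logits below are hypothetical; the paper's actual fusion network is not reproduced): softmax weights derived from per-branch confidence scores combine several joint-position predictions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_fuse(predictions, logits):
    """Fuse per-branch hand-joint predictions with softmax weights.

    predictions: (B, J, 3) array -- B branch outputs, J joints, xyz coords
    logits:      (B,) confidence scores (learned in the real model;
                 hypothetical inputs here)
    """
    w = softmax(np.asarray(logits, dtype=float))
    # weighted sum over the branch axis -> (J, 3)
    return np.tensordot(w, np.asarray(predictions, dtype=float), axes=1)
```

With equal logits this reduces to a plain average; a branch with a higher logit dominates the fused estimate.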
Hierarchical Recurrent Filtering for Fully Convolutional DenseNets
Generating a robust representation of the environment is a crucial ability of
learning agents. Deep learning based methods have greatly improved perception
systems but still fail in challenging situations. These failures are often not
solvable on the basis of a single image. In this work, we present a
parameter-efficient temporal filtering concept which extends an existing
single-frame segmentation model to work with multiple frames. The resulting
recurrent architecture temporally filters representations on all abstraction
levels in a hierarchical manner, while decoupling temporal dependencies from
scene representation. Using a synthetic dataset, we show the ability of our
model to cope with data perturbations and highlight the importance of recurrent
and hierarchical filtering.
Comment: In Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 201
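As a rough stand-in for the learned recurrent filters (not the paper's actual architecture), per-level exponential smoothing shows how temporal filtering can be applied at every abstraction level of a segmentation encoder while keeping the temporal state separate from the scene representation:

```python
import numpy as np

class HierarchicalTemporalFilter:
    """Per-level exponential smoothing of feature maps -- a minimal
    stand-in for learned recurrent filters at every abstraction level."""

    def __init__(self, alphas):
        self.alphas = alphas          # one smoothing factor per level
        self.state = None             # temporal state, kept per level

    def step(self, features):
        """features: list of per-level feature maps for the current frame."""
        if self.state is None:
            self.state = [f.copy() for f in features]
        else:
            self.state = [a * f + (1.0 - a) * s
                          for a, f, s in zip(self.alphas, features, self.state)]
        return self.state
```

A constant input leaves the filtered state unchanged, while a transient perturbation in one frame is damped according to that level's smoothing factor.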
Interpreting Layered Neural Networks via Hierarchical Modular Representation
Interpreting the prediction mechanism of complex models is currently one of
the most important tasks in the machine learning field, especially with layered
neural networks, which have achieved high predictive performance with various
practical data sets. To reveal the global structure of a trained neural network
in an interpretable way, a series of clustering methods have been proposed,
which decompose the units into clusters according to the similarity of their
inference roles. The main problems in these studies were that (1) we have no
prior knowledge about the optimal resolution for the decomposition, i.e., the
appropriate number of clusters, and (2) there was no method for determining
whether the outputs of each cluster have a positive or negative correlation
with the input and output dimension values. In this paper, to solve these
problems, we propose a method for obtaining a hierarchical modular
representation of a layered neural network. The application of a hierarchical
clustering method to a trained network reveals a tree-structured relationship
among hidden-layer units, based on their feature vectors defined by their
correlations with the input and output dimension values.
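A minimal sketch of this pipeline, with hypothetical activation matrices (the paper's own clustering details may differ): each hidden unit's feature vector is its correlation with every input and output dimension, and single-linkage agglomerative clustering over these vectors yields the tree.

```python
import numpy as np

def unit_feature_vectors(acts, X, Y):
    """Correlation of each hidden unit's activation (acts: N x U) with
    every input (X: N x Dx) and output (Y: N x Dy) dimension."""
    def corr(A, B):
        A = (A - A.mean(0)) / (A.std(0) + 1e-12)
        B = (B - B.mean(0)) / (B.std(0) + 1e-12)
        return A.T @ B / len(A)
    return np.hstack([corr(acts, X), corr(acts, Y)])   # U x (Dx + Dy)

def agglomerate(F):
    """Single-linkage agglomerative clustering over unit feature vectors F;
    returns the merge history, i.e. a tree over the units."""
    clusters = {i: [i] for i in range(len(F))}
    d = np.linalg.norm(F[:, None] - F[None, :], axis=-1)
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = np.inf, None
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                dist = min(d[u, v] for u in clusters[a] for v in clusters[b])
                if dist < best:
                    best, pair = dist, (a, b)
        a, b = pair
        merges.append((clusters[a], clusters[b], best))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges
```

Units with similar correlation profiles merge early, giving the fine-grained modules; later merges give coarser ones, so no cluster count needs to be fixed in advance.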
Discriminatively Learned Hierarchical Rank Pooling Networks
In this work, we present novel temporal encoding methods for action and
activity classification by extending the unsupervised rank pooling temporal
encoding method in two ways. First, we present "discriminative rank pooling" in
which the shared weights of our video representation and the parameters of the
action classifiers are estimated jointly for a given training dataset of
labelled vector sequences using a bilevel optimization formulation of the
learning problem. When the frame-level feature vectors are obtained from a
convolutional neural network (CNN), we rank-pool the network activations and
jointly estimate all parameters of the model, including CNN filters and
fully-connected weights, in an end-to-end manner, which we call "end-to-end
trainable rank pooled CNN". Importantly, this model can make use of any
existing convolutional neural network architecture (e.g., AlexNet or VGG)
without modification or introduction of additional parameters. Then, we extend
rank pooling to a high-capacity video representation, called "hierarchical rank
pooling". Hierarchical rank pooling consists of a network of rank pooling
functions, which encode temporal semantics over arbitrarily long video clips
based on rich frame level features. By stacking non-linear feature functions
and temporal sub-sequence encoders one on top of the other, we build a high
capacity encoding network of the dynamic behaviour of the video. The resulting
video representation is a fixed-length feature vector describing the entire
video clip that can be used as input to standard machine learning classifiers.
We demonstrate our approach on the task of action and activity recognition.
The results obtained are comparable to state-of-the-art methods on three
important activity recognition benchmarks, with classification performance of
76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
Comment: International Journal of Computer Vision
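The underlying rank pooling operation is commonly formulated as a learning-to-rank problem over the frames; a least-squares stand-in for that ranking objective can be sketched as follows (not the paper's exact formulation): fit a linear function whose response grows with time over the smoothed frame features, and use the fitted parameter vector as the video descriptor.

```python
import numpy as np

def rank_pool(frames):
    """Approximate rank pooling for a (T, D) array of frame features.

    The time-varying mean smooths the signal, as in rank pooling; a
    least-squares fit of the frame index from these means stands in for
    the pairwise ranking objective. The fitted vector u summarises the
    temporal dynamics of the clip.
    """
    T = len(frames)
    M = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    u, *_ = np.linalg.lstsq(M, t, rcond=None)
    return u
```

The resulting fixed-length vector can be fed to a standard classifier; hierarchical rank pooling stacks this operation over temporal sub-sequences.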
A Distributed Deep Representation Learning Model for Big Image Data Classification
This paper describes an effective and efficient image classification
framework named the distributed deep representation learning model (DDRL). The
aim is to strike a balance between computationally intensive deep learning
approaches (tuned parameters), which are suited to distributed computing, and
approaches that focus on designed parameters but are often limited by
sequential computing and cannot scale up. In our evaluation, we show that DDRL
achieves state-of-the-art classification accuracy efficiently on both medium
and large datasets. The results imply that our approach is more efficient than
conventional deep learning approaches and can be applied to big data that is
too complex for approaches focused on parameter design. More specifically,
DDRL contains two main components, i.e., feature extraction and selection. A
hierarchical distributed deep representation learning algorithm is designed to
extract image statistics, and a nonlinear mapping algorithm is used to map the
inherent statistics into abstract features. Both algorithms are carefully
designed to avoid tuning millions of parameters. This leads to a more compact
solution for image classification of big data. We note that the proposed
approach is designed to be amenable to parallel computing. It is generic and
easy to deploy to different distributed computing resources. In the
experiments, large-scale image datasets are classified with a DDRL
implementation on Hadoop MapReduce, which shows high scalability and
resilience.
3D human action analysis and recognition through GLAC descriptor on 2D motion and static posture images
In this paper, we present an approach for identification of actions within
depth action videos. First, we process the video to get motion history images
(MHIs) and static history images (SHIs) corresponding to an action video based
on the use of 3D Motion Trail Model (3DMTM). We then characterize the action
video by extracting the Gradient Local Auto-Correlations (GLAC) features from
the SHIs and the MHIs. The two sets of features, i.e., GLAC features from MHIs
and GLAC features from SHIs, are concatenated to obtain a representation
vector for the action. Finally, we perform classification on all the action
samples using the l2-regularized Collaborative Representation Classifier
(l2-CRC) to recognize different human actions effectively. We evaluate the
proposed method on three action datasets: MSR-Action3D, DHA, and UTD-MHAD.
Through experimental results, we observe that the proposed method performs
better than other approaches.
Comment: Multimed Tools Appl (2019)
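The motion history image component can be sketched in its classic form (the 3DMTM used in the paper additionally derives static history images, which are omitted here): pixels hit by recent motion are stamped at a maximum value, and older motion decays frame by frame.

```python
import numpy as np

def motion_history_image(frames, tau=10, thresh=0.1):
    """Classic motion history image for a (T, H, W) grayscale video with
    values in [0, 1]: recent motion is set to tau, older motion decays
    linearly by one per frame.
    """
    mhi = np.zeros(frames.shape[1:])
    for prev, cur in zip(frames[:-1], frames[1:]):
        motion = np.abs(cur - prev) > thresh
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau          # normalise to [0, 1]
```

Descriptors such as GLAC can then be extracted from the resulting single-image summary of the action's motion.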
Human Action Recognition and Prediction: A Survey
Derived from rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are two such tasks, where action recognition infers human actions (the present
state) from complete action executions, and action prediction predicts human
actions (the future state) from incomplete action executions. These two tasks
have become particularly prevalent topics recently because of their rapidly
emerging real-world applications, such as visual surveillance, autonomous
driving, entertainment, and video retrieval. Much effort has been devoted over
the last few decades to building a robust and effective framework for action
recognition and prediction. In this paper, we survey the state-of-the-art
techniques in action recognition and prediction. Existing models, popular
algorithms, technical difficulties, popular action databases, evaluation
protocols, and promising future directions are also discussed systematically.
Deep Stacked Hierarchical Multi-patch Network for Image Deblurring
Although deep end-to-end learning methods have shown their superiority in
removing non-uniform motion blur, major challenges remain with the
current multi-scale and scale-recurrent models: 1) Deconvolution/upsampling
operations in the coarse-to-fine scheme result in expensive runtime; 2) Simply
increasing the model depth with finer-scale levels cannot improve the quality
of deblurring. To tackle the above problems, we present a deep hierarchical
multi-patch network inspired by Spatial Pyramid Matching to deal with blurry
images via a fine-to-coarse hierarchical representation. To deal with the
performance saturation w.r.t. depth, we propose a stacked version of our
multi-patch model. Our proposed basic multi-patch model achieves the
state-of-the-art performance on the GoPro dataset while enjoying a 40x faster
runtime compared to current multi-scale methods. Requiring only 30 ms to
process an image at 1280x720 resolution, it is the first real-time deep motion
deblurring model for 720p images at 30 fps. For stacked networks, significant
improvements (over 1.2 dB) are achieved on the GoPro dataset by increasing the
network depth. Moreover, by varying the depth of the stacked model, one can
adapt the performance and runtime of the same network for different
application scenarios.
Comment: IEEE Conference on Computer Vision and Pattern Recognition 201
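The fine-to-coarse multi-patch decomposition can be sketched schematically: the finest level splits the image into a 2x2 grid, each subsequent level uses fewer patches, and each level's merged output feeds the next. In this sketch an identity function stands in for the per-level deblurring sub-network, and the feature/residual connections of the actual model are omitted.

```python
import numpy as np

def split(img, rows, cols):
    """Split an (H, W[, C]) image into a rows x cols grid of patches."""
    return [np.hsplit(r, cols) for r in np.vsplit(img, rows)]

def merge(grid):
    """Reassemble a grid of patches into a single image."""
    return np.vstack([np.hstack(r) for r in grid])

def multi_patch_forward(img, levels=((2, 2), (2, 1), (1, 1)),
                        net=lambda x: x):
    """Fine-to-coarse pass: process the finest patches first, merge them,
    and feed the result to the next (coarser) level. `net` stands in for
    the per-level sub-network (identity here)."""
    feat = img
    for rows, cols in levels:
        grid = split(feat, rows, cols)
        grid = [[net(p) for p in row] for row in grid]
        feat = merge(grid)
    return feat
```

Because every level works on full-resolution patches rather than downsampled images, no deconvolution/upsampling is needed, which is the source of the runtime advantage the abstract describes.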