136,488 research outputs found

    Space-Time Representation of People Based on 3D Skeletal Data: A Review

    Full text link
    Spatiotemporal human representation based on 3D visual perception data is a rapidly growing research area. Based on the information sources, these representations can be broadly categorized into two groups: those based on RGB-D information and those based on 3D skeleton data. Recently, skeleton-based human representations have been intensively studied and continue to attract increasing attention, owing to their robustness to variations in viewpoint, human body scale, and motion speed, as well as their real-time, online performance. This paper presents a comprehensive survey of existing space-time representations of people based on 3D skeletal data, and provides an informative categorization and analysis of these methods from several perspectives, including information modality, representation encoding, structure and transition, and feature engineering. We also provide a brief overview of skeleton acquisition devices and construction methods, list a number of public benchmark datasets with skeleton data, and discuss potential future research directions. Comment: Our paper has been accepted by the journal Computer Vision and Image Understanding, see http://www.sciencedirect.com/science/article/pii/S1077314217300279, Computer Vision and Image Understanding, 2017
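
    As a hedged illustration of the kind of representation such a survey covers, the sketch below builds one of the simplest skeleton-based descriptors: per-frame pairwise joint distances pooled over time. It is a minimal example under assumed inputs (an arbitrary 20-joint skeleton and sequence length), not any specific method from the survey.

    # Minimal sketch of a simple skeleton-based spatiotemporal descriptor:
    # per-frame pairwise joint distances, summarized over time by mean and
    # standard deviation. Illustrative only; not any specific surveyed method.
    import numpy as np

    def pairwise_joint_distances(frame):
        """frame: (J, 3) array of 3D joint positions -> (J*(J-1)/2,) distances."""
        diff = frame[:, None, :] - frame[None, :, :]   # (J, J, 3)
        dists = np.linalg.norm(diff, axis=-1)          # (J, J)
        iu = np.triu_indices(len(frame), k=1)
        return dists[iu]

    def skeleton_descriptor(sequence):
        """sequence: (T, J, 3) skeleton sequence -> fixed-length feature vector."""
        per_frame = np.stack([pairwise_joint_distances(f) for f in sequence])
        # Temporal pooling keeps the descriptor length independent of T.
        return np.concatenate([per_frame.mean(axis=0), per_frame.std(axis=0)])

    # Example: 40 frames of a hypothetical 20-joint skeleton.
    seq = np.random.rand(40, 20, 3)
    print(skeleton_descriptor(seq).shape)  # (380,) = 2 * C(20, 2)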

    HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images

    Full text link
    Co-saliency detection aims to discover common and salient objects in an image group containing two or more relevant images. Moreover, depth information has been demonstrated to be effective for many computer vision tasks. In this paper, we propose a novel co-saliency detection method for RGBD images based on hierarchical sparsity reconstruction and energy function refinement. With the assistance of the intra-saliency map, the inter-image correspondence is formulated as a hierarchical sparsity reconstruction framework. The global sparsity reconstruction model with a ranking scheme focuses on capturing the global characteristics of the whole image group through a common foreground dictionary. The pairwise sparsity reconstruction model aims to explore the corresponding relationships between pairwise images through a set of pairwise dictionaries. To improve intra-image smoothness and inter-image consistency, an energy function refinement model is proposed, which includes a unary data term, a spatial smoothness term, and a holistic consistency term. Experiments on two RGBD co-saliency detection benchmarks demonstrate that the proposed method outperforms state-of-the-art algorithms both qualitatively and quantitatively. Comment: 11 pages, 5 figures, Accepted by IEEE Transactions on Multimedia, https://rmcong.github.io
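
    The core intuition behind sparsity-based saliency can be sketched as follows: regions whose features are well reconstructed by a foreground dictionary score as more salient. The dictionary and region features below are random stand-ins, and a plain orthogonal matching pursuit replaces the paper's hierarchical and pairwise dictionaries, ranking scheme, and refinement model.

    # Hedged sketch: saliency from sparse reconstruction error against a
    # (hypothetical) foreground dictionary. Not the paper's full pipeline.
    import numpy as np

    def omp(D, y, k):
        """Plain orthogonal matching pursuit: residual of approximating y
        with k atoms of dictionary D (columns assumed unit-norm)."""
        residual, idx = y.copy(), []
        for _ in range(k):
            idx.append(int(np.argmax(np.abs(D.T @ residual))))
            coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
            residual = y - D[:, idx] @ coef
        return residual

    rng = np.random.default_rng(0)
    D = rng.normal(size=(64, 32))            # stand-in foreground dictionary
    D /= np.linalg.norm(D, axis=0)
    regions = rng.normal(size=(10, 64))      # features of 10 image regions
    # Lower reconstruction error against the foreground dictionary -> more salient.
    errors = np.array([np.linalg.norm(omp(D, r, k=5)) for r in regions])
    saliency = 1.0 - (errors - errors.min()) / (np.ptp(errors) + 1e-12)
    print(saliency.round(2))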

    Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images

    Full text link
    As a fundamental and challenging problem in computer vision, hand pose estimation aims to estimate the hand joint locations from depth images. Typically, the problem is modeled as learning a mapping function from images to hand joint coordinates in a data-driven manner. In this paper, we propose the Context-Aware Deep Spatio-Temporal Network (CADSTN), a novel method to jointly model the spatio-temporal properties for hand pose estimation. Our proposed network is able to learn representations of the spatial information and the temporal structure from image sequences. Moreover, by adopting an adaptive fusion method, the model is capable of dynamically weighting different predictions to emphasize sufficient context. Our method is evaluated on two common benchmarks; the experimental results demonstrate that our approach achieves the best or second-best performance among state-of-the-art methods and runs at 60 fps. Comment: IEEE Transactions on Cybernetics
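
    A minimal sketch of the adaptive-fusion idea: per-branch predictions are combined with input-dependent weights rather than a fixed average. The gating scores here are assumed given, standing in for whatever context features the actual network derives; the branch predictions are random placeholders.

    # Hedged sketch of adaptive prediction fusion in the spirit of CADSTN.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def adaptive_fusion(branch_preds, gate_logits):
        """branch_preds: (B, J, 3) joint predictions from B branches.
        gate_logits: (B,) scores from a gating sub-network (assumed given)."""
        w = softmax(gate_logits)                      # per-branch weights, sum to 1
        return np.tensordot(w, branch_preds, axes=1)  # (J, 3) fused joints

    spatial = np.random.rand(14, 3)   # e.g., 14 hand joints, spatial branch
    temporal = np.random.rand(14, 3)  # prediction from a temporal branch
    fused = adaptive_fusion(np.stack([spatial, temporal]), np.array([0.2, 1.1]))
    print(fused.shape)  # (14, 3)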

    Hierarchical Recurrent Filtering for Fully Convolutional DenseNets

    Full text link
    Generating a robust representation of the environment is a crucial ability of learning agents. Deep learning based methods have greatly improved perception systems but still fail in challenging situations. These failures are often not solvable on the basis of a single image. In this work, we present a parameter-efficient temporal filtering concept which extends an existing single-frame segmentation model to work with multiple frames. The resulting recurrent architecture temporally filters representations on all abstraction levels in a hierarchical manner, while decoupling temporal dependencies from the scene representation. Using a synthetic dataset, we show the ability of our model to cope with data perturbations and highlight the importance of recurrent and hierarchical filtering. Comment: In Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 2018
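
    To make the structure concrete, here is a hedged sketch of per-level temporal filtering: each abstraction level keeps its own state that is blended with the current frame's features. A learned ConvGRU-style cell is replaced by a per-level exponential moving average purely for illustration; the blending factors and feature-map sizes are assumptions.

    # Illustrative stand-in for hierarchical recurrent filtering.
    import numpy as np

    class HierarchicalTemporalFilter:
        def __init__(self, alphas):
            self.alphas = alphas          # one blending factor per level (assumed)
            self.states = [None] * len(alphas)

        def step(self, level_features):
            """level_features: list of per-level feature maps for one frame."""
            out = []
            for i, (a, f) in enumerate(zip(self.alphas, level_features)):
                s = f if self.states[i] is None else a * self.states[i] + (1 - a) * f
                self.states[i] = s        # carry temporal state to the next frame
                out.append(s)
            return out

    filt = HierarchicalTemporalFilter(alphas=[0.5, 0.7, 0.9])  # deeper = smoother
    for _ in range(5):                    # five frames of hypothetical features
        frame = [np.random.rand(8, 8), np.random.rand(4, 4), np.random.rand(2, 2)]
        filtered = filt.step(frame)
    print([x.shape for x in filtered])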

    Interpreting Layered Neural Networks via Hierarchical Modular Representation

    Full text link
    Interpreting the prediction mechanism of complex models is currently one of the most important tasks in the machine learning field, especially for layered neural networks, which have achieved high predictive performance on various practical data sets. To reveal the global structure of a trained neural network in an interpretable way, a series of clustering methods have been proposed that decompose the units into clusters according to the similarity of their inference roles. The main problems in these studies were that (1) there was no prior knowledge about the optimal resolution of the decomposition, i.e., the appropriate number of clusters, and (2) there was no method for determining whether the outputs of each cluster have a positive or negative correlation with the input and output dimension values. In this paper, to solve these problems, we propose a method for obtaining a hierarchical modular representation of a layered neural network. Applying a hierarchical clustering method to a trained network reveals a tree-structured relationship among hidden-layer units, based on feature vectors defined by their correlations with the input and output dimension values.
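
    The shape of that pipeline can be sketched in a few lines: represent each hidden unit by its correlations with the input and output dimensions, then apply agglomerative clustering to obtain a tree over units. The network activations below are random stand-ins for a trained model, and the clustering parameters are assumptions.

    # Hedged sketch: unit correlation features + hierarchical clustering.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))               # 500 samples, 10 input dims
    H = np.tanh(X @ rng.normal(size=(10, 16)))   # 16 hidden-unit activations
    Y = H @ rng.normal(size=(16, 3))             # 3 output dims

    def unit_features(H, X, Y):
        """Feature vector of a unit: its correlation with every input/output dim."""
        feats = []
        for j in range(H.shape[1]):
            cin = [np.corrcoef(H[:, j], X[:, d])[0, 1] for d in range(X.shape[1])]
            cout = [np.corrcoef(H[:, j], Y[:, d])[0, 1] for d in range(Y.shape[1])]
            feats.append(cin + cout)
        return np.array(feats)

    Z = linkage(unit_features(H, X, Y), method="ward")  # tree over hidden units
    print(fcluster(Z, t=4, criterion="maxclust"))       # cut the tree: 4 clusters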

    Discriminatively Learned Hierarchical Rank Pooling Networks

    Full text link
    In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling temporal encoding method in two ways. First, we present "discriminative rank pooling", in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem. When the frame-level feature vectors are obtained from a convolutional neural network (CNN), we rank pool the network activations and jointly estimate all parameters of the model, including the CNN filters and fully-connected weights, in an end-to-end manner, which we coin "end-to-end trainable rank pooled CNN". Importantly, this model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or the introduction of additional parameters. Second, we extend rank pooling to a high-capacity video representation called "hierarchical rank pooling". Hierarchical rank pooling consists of a network of rank pooling functions which encode temporal semantics over arbitrarily long video clips based on rich frame-level features. By stacking non-linear feature functions and temporal sub-sequence encoders one on top of the other, we build a high-capacity encoding network of the dynamic behaviour of the video. The resulting video representation is a fixed-length feature vector describing the entire video clip that can be used as input to standard machine learning classifiers. We demonstrate our approach on the task of action and activity recognition. The obtained results are comparable to state-of-the-art methods on three important activity recognition benchmarks, with classification performance of 76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101. Comment: International Journal of Computer Vision
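
    A hedged sketch of the unsupervised building block being extended: basic rank pooling fits a linear function of the running-average frame features that increases with time, and the fitted weights serve as the video descriptor. A closed-form ridge least-squares fit stands in here for the ranking machine of the original formulation; feature dimensions are arbitrary.

    # Illustrative rank pooling: descriptor = weights of a temporal-order fit.
    import numpy as np

    def rank_pool(frames, lam=1.0):
        """frames: (T, D) frame features -> (D,) video descriptor."""
        # Running averages smooth the per-frame features before fitting.
        V = np.cumsum(frames, axis=0) / np.arange(1, len(frames) + 1)[:, None]
        t = np.arange(1, len(frames) + 1, dtype=float)  # targets: frame order
        D = V.shape[1]
        # Ridge regression: w = (V^T V + lam I)^-1 V^T t
        return np.linalg.solve(V.T @ V + lam * np.eye(D), V.T @ t)

    video = np.random.rand(120, 256)   # 120 frames of hypothetical CNN features
    desc = rank_pool(video)
    print(desc.shape)                  # (256,) fixed-length representation
    # Hierarchical rank pooling would apply this over sub-sequences, then pool
    # the resulting descriptors again at the next level.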

    A Distributed Deep Representation Learning Model for Big Image Data Classification

    Full text link
    This paper describes an effective and efficient image classification framework named the distributed deep representation learning model (DDRL). The aim is to strike a balance between computationally intensive deep learning approaches (with tuned parameters), which are suited to distributed computing, and approaches that focus on designed parameters but are often limited by sequential computing and cannot scale up. In our evaluation, we show that DDRL achieves state-of-the-art classification accuracy efficiently on both medium and large datasets. The results imply that our approach is more efficient than conventional deep learning approaches and can be applied to big data that is too complex for approaches focused on parameter design. More specifically, DDRL contains two main components, i.e., feature extraction and selection. A hierarchical distributed deep representation learning algorithm is designed to extract image statistics, and a nonlinear mapping algorithm is used to map the inherent statistics into abstract features. Both algorithms are carefully designed to avoid tuning millions of parameters. This leads to a more compact solution for image classification on big data. We note that the proposed approach is designed to be friendly to parallel computing. It is generic and easy to deploy on different distributed computing resources. In the experiments, large-scale image datasets are classified with a DDRL implementation on Hadoop MapReduce, which shows high scalability and resilience.
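
    The parallel-friendly structure described above follows a familiar map/reduce shape, sketched below under loose assumptions: a "map" step extracts per-image statistics independently, and a "reduce" step aggregates them. Python multiprocessing stands in for Hadoop MapReduce, and the histogram feature is a hypothetical placeholder for DDRL's learned features.

    # Illustrative map/reduce feature extraction; not the DDRL algorithm itself.
    import numpy as np
    from multiprocessing import Pool

    def extract(image):
        """Map: per-image feature extraction, no shared state."""
        return np.histogram(image, bins=16, range=(0.0, 1.0))[0].astype(float)

    def reduce_features(feature_list):
        """Reduce: aggregate per-image features into a dataset-level summary."""
        return np.mean(feature_list, axis=0)

    if __name__ == "__main__":
        images = [np.random.rand(32, 32) for _ in range(100)]
        with Pool(processes=4) as pool:
            feats = pool.map(extract, images)   # embarrassingly parallel
        print(reduce_features(feats).shape)     # (16,)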

    3D human action analysis and recognition through GLAC descriptor on 2D motion and static posture images

    Full text link
    In this paper, we present an approach for the identification of actions within depth action videos. First, we process the video to obtain motion history images (MHIs) and static history images (SHIs) corresponding to an action video, based on the 3D Motion Trail Model (3DMTM). We then characterize the action video by extracting Gradient Local Auto-Correlations (GLAC) features from the SHIs and the MHIs. The two sets of features, i.e., GLAC features from MHIs and GLAC features from SHIs, are concatenated to obtain a representation vector for the action. Finally, we perform classification on all the action samples using the l2-regularized Collaborative Representation Classifier (l2-CRC) to recognize different human actions in an effective way. We evaluate the proposed method on three action datasets: MSR-Action3D, DHA and UTD-MHAD. Through experimental results, we observe that the proposed method performs better than other approaches. Comment: Multimed Tools Appl (2019)
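
    The final classification step has a compact closed form worth sketching: an l2-regularized collaborative representation codes a test feature over all training samples and assigns the class whose samples reconstruct it with the smallest residual. The data below is synthetic and the regularization weight is an assumption.

    # Hedged sketch of the l2-CRC decision rule on stand-in features.
    import numpy as np

    def l2_crc(X, labels, y, lam=0.01):
        """X: (D, N) training features as columns; y: (D,) test feature."""
        # Closed form: alpha = (X^T X + lam I)^-1 X^T y
        alpha = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        residuals = {}
        for c in np.unique(labels):
            m = labels == c
            # Residual using only the coefficients of class c's samples.
            residuals[c] = np.linalg.norm(y - X[:, m] @ alpha[m])
        return min(residuals, key=residuals.get)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 30))             # 30 training samples
    labels = np.repeat(np.arange(3), 10)      # 3 action classes
    y = X[:, 4] + 0.05 * rng.normal(size=64)  # noisy copy of a class-0 sample
    print(l2_crc(X, labels, y))               # expected: 0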

    Human Action Recognition and Prediction: A Survey

    Full text link
    Driven by rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition infers human actions (present state) from complete action executions, and action prediction predicts human actions (future state) from incomplete action executions. These two tasks have recently become particularly prevalent because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Much effort has been devoted over the last few decades to building robust and effective frameworks for action recognition and prediction. In this paper, we survey the state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also systematically discussed.

    Deep Stacked Hierarchical Multi-patch Network for Image Deblurring

    Full text link
    Although deep end-to-end learning methods have shown their superiority in removing non-uniform motion blur, major challenges remain with current multi-scale and scale-recurrent models: 1) deconvolution/upsampling operations in the coarse-to-fine scheme result in expensive runtime; 2) simply increasing the model depth with finer-scale levels does not improve the quality of deblurring. To tackle these problems, we present a deep hierarchical multi-patch network, inspired by Spatial Pyramid Matching, that deals with blurry images via a fine-to-coarse hierarchical representation. To address the performance saturation w.r.t. depth, we propose a stacked version of our multi-patch model. Our basic multi-patch model achieves state-of-the-art performance on the GoPro dataset while enjoying a 40x faster runtime compared to current multi-scale methods. Taking 30 ms to process an image at 1280x720 resolution, it is the first real-time deep motion deblurring model for 720p images at 30 fps. For stacked networks, significant improvements (over 1.2 dB) are achieved on the GoPro dataset by increasing the network depth. Moreover, by varying the depth of the stacked model, one can adapt the performance and runtime of the same network to different application scenarios. Comment: IEEE Conference on Computer Vision and Pattern Recognition 2019
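
    The fine-to-coarse multi-patch decomposition can be sketched as follows: the finest level processes small patches, and each coarser level fuses the previous level's output with larger patches. The `process` function below is a placeholder for the per-level deblurring sub-network, and the residual-style feature pass is a simplifying assumption.

    # Minimal sketch of a fine-to-coarse multi-patch pass over an image.
    import numpy as np

    def split(img, rows, cols):
        return [np.hsplit(r, cols) for r in np.vsplit(img, rows)]

    def merge(grid):
        return np.vstack([np.hstack(row) for row in grid])

    def process(patch):              # stand-in for a deblurring encoder-decoder
        return patch

    def multi_patch(img, levels=((2, 2), (1, 2), (1, 1))):
        feat = np.zeros_like(img)
        for rows, cols in levels:    # finest (2x2 patches) to coarsest (whole)
            grid = split(img + feat, rows, cols)   # pass features up the levels
            feat = merge([[process(p) for p in row] for row in grid])
        return feat

    out = multi_patch(np.random.rand(720, 1280))
    print(out.shape)                 # (720, 1280)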