
    View-Invariant Action Recognition from RGB Data via 3D Pose Estimation

    In this paper, we propose a novel view-invariant action recognition method using a single monocular RGB camera. View-invariance remains a very challenging topic in 2D action recognition due to the lack of 3D information in RGB images. Most successful approaches rely on knowledge transfer, projecting 3D synthetic data to multiple viewpoints. Instead of relying on knowledge transfer, we propose to augment the RGB data with a third dimension by estimating 3D skeletons from 2D images with a CNN-based pose estimator. To ensure view-invariance, a pre-processing alignment step is applied, followed by data expansion as a denoising strategy. Finally, a Long Short-Term Memory (LSTM) architecture is used to model the temporal dependency between skeletons. The proposed network is trained to recognize actions directly from aligned 3D skeletons. Experiments on the challenging Northwestern-UCLA dataset show the superiority of our approach compared to state-of-the-art methods.
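The pipeline described above lends itself to a compact recurrent classifier. Below is a minimal PyTorch sketch of that kind of model: per-frame aligned 3D joint coordinates are flattened and fed to an LSTM, and the final hidden state is classified by a linear layer. The joint count, layer sizes, and number of classes are illustrative assumptions, not the configuration reported in the paper.

import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """LSTM classifier over sequences of aligned 3D skeletons.

    Joint count, hidden size, depth, and class count are illustrative
    guesses, not the configuration used in the paper.
    """
    def __init__(self, num_joints=25, hidden_size=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 3,
                            hidden_size=hidden_size,
                            num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, skeletons):
        # skeletons: (batch, time, num_joints, 3) aligned 3D joint positions
        b, t, j, c = skeletons.shape
        out, _ = self.lstm(skeletons.reshape(b, t, j * c))
        return self.fc(out[:, -1])  # classify from the last time step

# Toy usage: a batch of 4 sequences, 60 frames each, 25 joints per frame
logits = SkeletonLSTM()(torch.randn(4, 60, 25, 3))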

    Fast, invariant representation for human action in the visual system

    Humans can effortlessly recognize others' actions in the presence of complex transformations, such as changes in viewpoint. Several studies have located the brain regions involved in invariant action recognition; however, the underlying neural computations remain poorly understood. We use magnetoencephalography (MEG) decoding and a dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by different actors at different viewpoints to study the computational steps used to recognize actions across complex transformations. In particular, we ask when the brain discounts changes in 3D viewpoint relative to when it initially discriminates between actions. We measure the latency difference between invariant and non-invariant action decoding when subjects view full videos as well as form-depleted and motion-depleted stimuli. Our results show no difference in decoding latency or temporal profile between invariant and non-invariant action recognition in full videos. However, when either form or motion information is removed from the stimulus set, we observe a decrease and delay in invariant action decoding. Our results suggest that the brain recognizes actions and builds invariance to complex transformations at the same time, and that both form and motion information are crucial for fast, invariant action recognition.
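The core analysis here is time-resolved decoding: a classifier is trained at each time point of the neural response, and decoding-onset latencies are compared across conditions. The sketch below illustrates that general procedure on simulated data with scikit-learn; the array shapes, classifier choice, and onset criterion are assumptions for illustration, not the authors' analysis pipeline.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def decoding_timecourse(X, y, cv=5):
    """Cross-validated decoding accuracy at every time point.

    X: (trials, sensors, timepoints) array standing in for MEG epochs;
    y: (trials,) condition labels. Returns one accuracy per time point,
    from which a decoding-onset latency can be read off.
    """
    n_trials, n_sensors, n_times = X.shape
    acc = np.empty(n_times)
    for t in range(n_times):
        clf = LogisticRegression(max_iter=1000)
        acc[t] = cross_val_score(clf, X[:, :, t], y, cv=cv).mean()
    return acc

# Simulated stand-in data: 100 trials, 306 sensors, 120 time points, 5 actions
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 306, 120))
y = rng.integers(0, 5, size=100)
acc = decoding_timecourse(X, y)
onset = int(np.argmax(acc > 1 / 5 + 0.05))  # crude supra-chance onset index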

    Histogram of Oriented Principal Components for Cross-View Action Recognition

    Existing techniques for 3D action recognition are sensitive to viewpoint variations because they extract features from depth images, which are viewpoint dependent. In contrast, we directly process pointclouds for cross-view action recognition from unknown and unseen views. We propose the Histogram of Oriented Principal Components (HOPC) descriptor, which is robust to noise, viewpoint, scale, and action-speed variations. At a 3D point, HOPC is computed by projecting the three scaled eigenvectors of the pointcloud within its local spatio-temporal support volume onto the vertices of a regular dodecahedron. HOPC is also used to detect Spatio-Temporal Keypoints (STKs) in 3D pointcloud sequences, so that only view-invariant STK descriptors (or Local HOPC descriptors) at these key locations are used for action recognition. We also propose a global descriptor computed from the normalized spatio-temporal distribution of STKs in 4-D, which we refer to as STK-D. We have evaluated the performance of our proposed descriptors against nine existing techniques on two cross-view and three single-view human action recognition datasets. Experimental results show that our techniques provide significant improvement over state-of-the-art methods.
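The descriptor construction can be pictured concretely: eigen-decompose the scatter matrix of a local spatio-temporal neighbourhood, scale each eigenvector by its eigenvalue, and project the scaled eigenvectors onto the 20 vertex directions of a regular dodecahedron. The sketch below follows that outline in NumPy; binning order, sign disambiguation, quantization, and normalization details are simplified and differ from the published HOPC implementation.

import numpy as np

PHI = (1 + np.sqrt(5)) / 2  # golden ratio

def dodecahedron_vertices():
    """Unit vectors pointing to the 20 vertices of a regular dodecahedron."""
    v = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    a, b = 1 / PHI, PHI
    v += [(0, s * a, t * b) for s in (-1, 1) for t in (-1, 1)]
    v += [(s * a, t * b, 0) for s in (-1, 1) for t in (-1, 1)]
    v += [(s * b, 0, t * a) for s in (-1, 1) for t in (-1, 1)]
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def hopc_like_descriptor(points):
    """Toy HOPC-style descriptor for one local pointcloud neighbourhood.

    Scales each eigenvector of the neighbourhood's scatter matrix by its
    eigenvalue and projects it onto the dodecahedron vertex directions,
    keeping only non-negative projections.
    """
    centred = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(centred.T @ centred / len(points))
    verts = dodecahedron_vertices()  # (20, 3)
    parts = []
    for val, vec in zip(eigvals[::-1], eigvecs[:, ::-1].T):  # descending order
        parts.append(np.clip(verts @ (val * vec), 0.0, None))
    desc = np.concatenate(parts)  # 3 eigenvectors x 20 vertices = 60 bins
    return desc / (np.linalg.norm(desc) + 1e-8)

desc = hopc_like_descriptor(np.random.rand(200, 3))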

    A neural model for the visual tuning properties of action-selective neurons

SUMMARY: The recognition of actions of conspecifics is crucial for survival and social interaction. Most current models of the recognition of transitive (goal-directed) actions rely on the hypothesized role of internal motor simulations for action recognition. However, these models do not specify how visual information can be processed by cortical mechanisms in order to be compared with such motor representations. This raises the question of how such visual processing might be accomplished, and to what extent motor processing is needed to account for the visual properties of action-selective neurons.
We present a neural model for the visual processing of transitive actions that is consistent with physiological data and that accomplishes recognition of grasping actions from real video stimuli. Shape recognition is accomplished by a view-dependent hierarchical neural architecture that retains some coarse position information at the highest level, which can be exploited by subsequent stages. Additionally, simple recurrent neural circuits integrate effector information over time and realize selectivity for temporal sequences. A novel mechanism combines information about the shape and position of object and effector in an object-centered frame of reference. Action-selective model neurons defined in such a relative reference frame are tuned to learned associations between object and effector shapes, as well as their relative position and motion.
We demonstrate that this model reproduces a variety of electrophysiological findings on the visual properties of action-selective neurons in the superior temporal sulcus and of mirror neurons in area F5. Specifically, the model accounts for the fact that a majority of mirror neurons in area F5 show view dependence. The model predicts a number of electrophysiological results, some of which could be confirmed in recent experiments.
We conclude that the tuning of action-selective neurons given visual stimuli can be accounted for by well-established, predominantly visual neural processes rather than internal motor simulations.

METHODS: Shape recognition relies on a hierarchy of feature detectors of increasing complexity and invariance [1]. The mid-level features are learned from sequences of gray-level images depicting segmented views of hand and object shapes. The highest hierarchy level consists of detector populations for complete shapes with a coarse spatial resolution of approximately 3.7°. Additionally, effector shapes are integrated over time by asymmetric lateral connections between shape detectors using a neural field approach [2]. These model neurons thus encode actions such as hand opening or closing for particular grip types.
We exploit a gain-field mechanism to implement the central coordinate transformation of the shape representations into an object-centered reference frame [3]. Typical effector-object interactions correspond to activity regions in such a relative reference frame and are learned from training examples. Similarly, simple motion-energy detectors are applied in the object-centered reference frame and encode relative motion. The properties of transitive action neurons are modeled as a multiplicative combination of relative shape and motion detectors.
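A gain-field coordinate transformation of the kind cited in [3] can be illustrated with a small population-code example: a basis-function layer multiplies a retinal code for effector position by a gain set by the object-position code, and summing units with equal offset yields an object-centered (relative) position map. The one-dimensional NumPy sketch below is only meant to convey that mechanism; population sizes, tuning widths, and dimensionality are illustrative, not the model's parameters.

import numpy as np

def population_code(position, size, sigma=1.5):
    """1-D population code: Gaussian bump of activity centred on `position`."""
    x = np.arange(size)
    return np.exp(-(x - position) ** 2 / (2 * sigma ** 2))

def object_centered_map(effector_pos, object_pos, size=40):
    """Toy 1-D gain-field remapping into an object-centered frame.

    Basis-function units multiply the object-position code with the
    effector-position code; summing all units with the same offset yields
    a population code for effector position relative to the object.
    """
    eff = population_code(effector_pos, size)   # retinal frame
    obj = population_code(object_pos, size)     # retinal frame
    basis = np.outer(obj, eff)                  # gain-modulated basis layer
    offsets = np.arange(-size + 1, size)        # effector minus object
    rel = np.array([np.trace(basis, offset=k) for k in offsets])
    return offsets, rel / rel.max()

offsets, rel = object_centered_map(effector_pos=25, object_pos=18)
print(offsets[np.argmax(rel)])  # -> 7, the effector's position relative to the object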

RESULTS: Model performance was tested on a set of 160 unsegmented sequences of hand grasping or placing actions performed on objects of different sizes, using different grip types and views. Hand actions and objects could be reliably recognized despite their mutual occlusions. Detectors at the highest level showed correct action tuning in more than 95% of the examples and generalized to untrained views.
Furthermore, the model replicates a number of electrophysiological and imaging experiments on action-selective neurons, such as their particular selectivity for transitive actions compared to mimicked actions, their invariance to stimulus position, and their view dependence. In particular, using the same stimulus set, the model closely fits neural data from a recent electrophysiological experiment that confirmed sequence selectivity in mirror neurons in area F5, as the model had previously predicted.

References
[1] Serre, T. et al. (2007). IEEE Trans. Pattern Anal. Mach. Intell. 29, 411-426.
[2] Giese, M.A. and Poggio, T. (2003). Nat. Rev. Neurosci. 4, 179-192.
[3] Deneve, S. and Pouget, A. (2003). Neuron 37, 347-359.

    Statistical Analysis of Dynamic Actions

    Real-world action recognition applications require systems that are fast, can handle a large variety of actions without a priori knowledge of the action types, need a minimal number of parameters, and require as short a learning stage as possible. In this paper, we suggest such an approach. We regard dynamic activities as long-term temporal objects, which are characterized by spatio-temporal features at multiple temporal scales. Based on this, we design a simple statistical distance measure between video sequences that captures the similarities in their behavioral content. This measure is nonparametric and can thus handle a wide range of complex dynamic actions. Having a behavior-based distance measure between sequences, we use it for a variety of tasks, including video indexing, temporal segmentation, and action-based video clustering. These tasks are performed without prior knowledge of the types of actions, their models, or their temporal extents.
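One way to picture such a nonparametric, behavior-based measure is to histogram space-time gradient magnitudes at several temporal scales and compare the resulting empirical distributions with a chi-square-style distance. The NumPy sketch below follows that idea; the gradient features, scale construction, bin counts, and distance normalization are simplifying assumptions rather than the paper's exact formulation.

import numpy as np

def multiscale_gradient_histograms(video, n_scales=3, n_bins=32):
    """Empirical distributions of space-time gradient magnitudes.

    `video` is a (frames, height, width) grayscale array. Temporal
    down-sampling by powers of two stands in for a temporal pyramid;
    scale and bin counts are arbitrary choices.
    """
    hists = []
    for s in range(n_scales):
        v = video[:: 2 ** s].astype(float)
        gt, gy, gx = np.gradient(v)
        mag = np.sqrt(gt ** 2 + gy ** 2 + gx ** 2).ravel()
        h, _ = np.histogram(mag, bins=n_bins, range=(0.0, mag.max() + 1e-8))
        hists.append(h / h.sum())
    return np.concatenate(hists)

def chi_square_distance(h1, h2):
    """Symmetric chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-8))

a = multiscale_gradient_histograms(np.random.rand(64, 60, 80))
b = multiscale_gradient_histograms(np.random.rand(64, 60, 80))
d = chi_square_distance(a, b)  # small for behaviourally similar sequences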

    Learning viewpoint invariant perceptual representations from cluttered images

    In order to perform object recognition, it is necessary to form perceptual representations that are sufficiently specific to distinguish between objects, but that are also sufficiently flexible to generalize across changes in location, rotation, and scale. A standard method for learning perceptual representations that are invariant to viewpoint is to form temporal associations across image sequences showing object transformations. However, this method requires that individual stimuli be presented in isolation and is therefore unlikely to succeed in real-world applications where multiple objects can co-occur in the visual input. This paper proposes a simple modification to the learning method that can overcome this limitation and results in more robust learning of invariant representations.
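The temporal-association method that this paper modifies is commonly implemented as a trace learning rule: the Hebbian update uses a temporally low-pass-filtered output activity, so stimuli that follow each other in time strengthen the same unit. The sketch below shows only that classic rule, not the clutter-handling modification proposed in the paper; the layer sizes, learning rate, and trace constant are arbitrary.

import numpy as np

def trace_rule_step(w, x, trace, lr=0.01, eta=0.8):
    """One update of a Foldiak-style trace learning rule.

    The post-synaptic trace is a running average of recent output activity,
    so inputs that follow each other in time (e.g. successive views of the
    same object) come to drive the same output unit.
    """
    y = w @ x                                        # linear output units
    trace = eta * trace + (1 - eta) * y              # temporal trace
    w = w + lr * np.outer(trace, x)                  # Hebbian update with trace
    w /= np.linalg.norm(w, axis=1, keepdims=True)    # keep weights bounded
    return w, trace

rng = np.random.default_rng(1)
w = rng.standard_normal((10, 100))
w /= np.linalg.norm(w, axis=1, keepdims=True)
trace = np.zeros(10)
for frame in rng.standard_normal((50, 100)):         # stand-in image sequence
    w, trace = trace_rule_step(w, frame, trace)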

    Slow and steady feature analysis: higher order temporal coherence in video

    How can unlabeled video augment visual learning? Existing methods perform "slow" feature analysis, encouraging the representations of temporally close frames to exhibit only small differences. While this standard approach captures the fact that high-level visual signals change slowly over time, it fails to capture *how* the visual content changes. We propose to generalize slow feature analysis to "steady" feature analysis. The key idea is to impose a prior that higher order derivatives in the learned feature space must be small. To this end, we train a convolutional neural network with a regularizer on tuples of sequential frames from unlabeled video. It encourages feature changes over time to be smooth, i.e., similar to the most recent changes. Using five diverse datasets, including unlabeled YouTube and KITTI videos, we demonstrate our method's impact on object, scene, and action recognition tasks. We further show that our features learned from unlabeled video can even surpass a standard heavily supervised pretraining approach. Comment: in Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas, NV, June 2016.
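The "steady" idea reduces to penalizing second differences of the embedding across frame triples, in addition to the first differences penalized by slow feature analysis. The PyTorch sketch below implements that penalty on a toy encoder; the contrastive margin over negative tuples used in the published method is omitted, and the encoder and tensor shapes are placeholders.

import torch
import torch.nn as nn

def steady_feature_penalty(z0, z1, z2, alpha=1.0):
    """Temporal coherence penalty on embeddings of three sequential frames.

    The first-difference term encourages slowness; the second-difference
    term encourages steadiness (changes that resemble the most recent
    change). The published method adds a contrastive margin over negative
    tuples, omitted here.
    """
    slow = (z1 - z0).pow(2).sum(dim=1).mean()
    steady = ((z2 - z1) - (z1 - z0)).pow(2).sum(dim=1).mean()
    return slow + alpha * steady

# Placeholder encoder and data: a batch of 8 frame triples of 3x32x32 images
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
frames = torch.randn(3, 8, 3, 32, 32)
z0, z1, z2 = (encoder(f) for f in frames)
steady_feature_penalty(z0, z1, z2).backward()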