Neural correlates of the processing of co-speech gestures
In communicative situations, speech is often accompanied by gestures. For example, speakers tend to illustrate certain contents of speech by means of iconic gestures, which are hand movements that bear a formal relationship to the contents of speech. The meaning of an iconic gesture is determined by both its form and the speech context in which it is performed. Thus, gesture and speech interact in comprehension. Using fMRI, the present study investigated which brain areas are involved in this interaction process. Participants watched videos in which sentences containing an ambiguous word (e.g. She touched the mouse) were accompanied by either a meaningless grooming movement, a gesture supporting the more frequent dominant meaning (e.g. animal) or a gesture supporting the less frequent subordinate meaning (e.g. computer device). We hypothesized that brain areas involved in the interaction of gesture and speech would show greater activation to gesture-supported sentences than to sentences accompanied by a meaningless grooming movement. The main results are that, when contrasted with grooming, both types of gestures (dominant and subordinate) activated an array of brain regions consisting of the left posterior superior temporal sulcus (STS), the inferior parietal lobule bilaterally and the ventral precentral sulcus bilaterally. Given the crucial role of the STS in audiovisual integration processes, this activation might reflect the interaction between the meaning of gesture and the ambiguous sentence. The activations in inferior frontal and inferior parietal regions may reflect a mechanism of determining the goal of co-speech hand movements through an observation-execution matching process.
A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are attributed to camera motion and are therefore removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in significant improvement on motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over state-of-the-art results.
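The trajectory-pruning step in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a homography has already been estimated (for instance with RANSAC), reduces each trajectory to a single start and end point, and uses a 1-pixel residual threshold. The function names, the nested-list matrix layout, and the threshold are all hypothetical.

```python
import math

def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def remove_camera_motion(trajectories, H, threshold=1.0):
    """Keep only (start, end) pairs that the homography does not explain."""
    kept = []
    for start, end in trajectories:
        predicted = apply_homography(H, start)
        residual = math.dist(predicted, end)  # distance in pixels
        if residual > threshold:              # assumed threshold, not from the paper
            kept.append((start, end))
    return kept

# Hypothetical camera pan: every background point shifts 5 px to the right.
H_pan = [[1.0, 0.0, 5.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
tracks = [
    ((10.0, 10.0), (15.0, 10.0)),  # moves exactly with the camera
    ((40.0, 20.0), (45.2, 20.1)),  # moves with the camera, plus small noise
    ((30.0, 30.0), (38.0, 36.0)),  # independent motion
]
foreground = remove_camera_motion(tracks, H_pan)
```

Only the third track survives in this toy input, because its displacement is not explained by the pan. A real pipeline would compare full multi-frame trajectories rather than two endpoints.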
Towards social pattern characterization in egocentric photo-streams
Following the increasingly popular trend of social interaction analysis in egocentric vision, this article presents a comprehensive pipeline for automatic social pattern characterization of a wearable photo-camera user. The proposed framework relies solely on the visual analysis of egocentric photo-streams and consists of three major steps. The first step is to detect social interactions of the user, where the impact of several social signals on the task is explored. In the second step, the detected social events are categorized into different types of social meetings. These two steps act at event level: each potential social event is modeled as a multi-dimensional time-series whose dimensions correspond to a set of relevant features for each task, and an LSTM is then employed to classify the time-series. The last step of the framework is to characterize the social patterns of the user. Our goal is to quantify the duration, the diversity and the frequency of the user's social relations in various social situations. This goal is achieved by discovering recurrences of the same people across the whole set of social events related to the user. Experimental evaluation on EgoSocialStyle, the dataset proposed in this work, and on EGO-GROUP demonstrates promising results on the task of social pattern characterization from egocentric photo-streams.
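To make the final characterization step concrete, here is a minimal Python sketch of how duration, frequency and diversity might be aggregated once people have been re-identified across events. The abstract does not define these quantities precisely, so the definitions used here (event count, summed minutes, number of distinct meeting categories) are assumptions, as are the record fields and the "formal"/"informal" category labels.

```python
from collections import defaultdict

def characterize_social_patterns(events):
    """Aggregate per-person frequency, duration and diversity over social events."""
    stats = defaultdict(lambda: {"frequency": 0, "minutes": 0, "categories": set()})
    for event in events:
        for person in event["people"]:
            entry = stats[person]
            entry["frequency"] += 1                     # events shared with the user
            entry["minutes"] += event["minutes"]        # total time spent together
            entry["categories"].add(event["category"])  # kinds of meetings seen
    return {
        person: {
            "frequency": entry["frequency"],
            "minutes": entry["minutes"],
            "diversity": len(entry["categories"]),
        }
        for person, entry in stats.items()
    }

# Hypothetical detected events; identities "A" and "B" stand in for the
# output of a person re-identification step that is not shown here.
events = [
    {"people": ["A", "B"], "minutes": 30, "category": "formal"},
    {"people": ["A"], "minutes": 45, "category": "informal"},
    {"people": ["B"], "minutes": 20, "category": "formal"},
]
patterns = characterize_social_patterns(events)
```

In this toy input both people appear in two events, but "A" is seen in two kinds of meeting and "B" in only one, which is the sort of distinction the characterization step aims to surface.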
MoWLD: a robust motion image descriptor for violence detection
© 2015, Springer Science+Business Media New York. Automatic violence detection from video is a hot topic for many video surveillance applications. However, there has been little success in designing an algorithm that can detect violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model to local spatiotemporal descriptors. However, traditional spatiotemporal features are not discriminative enough, and the BoW model roughly assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, in this paper we propose a novel Motion Weber Local Descriptor (MoWLD) in the spirit of the well-known WLD and make it a powerful and robust descriptor for motion images. We extend the WLD spatial descriptions by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, non-parametric Kernel Density Estimation (KDE) is employed on the MoWLD descriptor. In order to obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over the state of the art.
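The contrast drawn above between BoW hard assignment and max pooling over sparse codes can be illustrated with a short Python sketch. It assumes the sparse codes have already been computed (no solver is shown) and pools absolute coefficient values, which is a common convention but is not confirmed by the abstract. The function names and the toy numbers are hypothetical.

```python
def hard_assignment_histogram(assignments, num_words):
    """Bag-of-words: each descriptor votes for exactly one visual word."""
    histogram = [0] * num_words
    for word in assignments:
        histogram[word] += 1
    return histogram

def max_pool(codes):
    """Max pooling: keep, per dictionary atom, the largest absolute coefficient."""
    num_atoms = len(codes[0])
    return [max(abs(code[k]) for code in codes) for k in range(num_atoms)]

# Three descriptors coded over a hypothetical 4-atom dictionary.
sparse_codes = [
    [0.0, 0.9, 0.0, 0.2],
    [0.4, 0.0, 0.0, -0.7],
    [0.0, 0.3, 0.0, 0.0],
]
pooled = max_pool(sparse_codes)

# The same descriptors under hard assignment: each maps to one word index.
bow = hard_assignment_histogram([1, 3, 1], num_words=4)
```

Under hard assignment the first descriptor contributes only to word 1, whereas its sparse code also carries a smaller response on atom 3 that remains available to the pooling step.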
Representation and recognition of human actions in video
PhD thesis. Automated human action recognition plays a critical role in the development of human-machine
communication, by aiming for a more natural interaction between artificial intelligence and the
human society. Recent developments in technology have permitted a shift from a traditional
human action recognition performed in a well-constrained laboratory environment to realistic
unconstrained scenarios. This advancement has given rise to new problems and challenges still
not addressed by the available methods. Thus, the aim of this thesis is to study innovative approaches
that address the challenging problems of human action recognition from video captured
in unconstrained scenarios. To this end, novel action representations, feature selection methods,
fusion strategies and classification approaches are formulated.
More specifically, a novel interest point-based action representation is first introduced; this
representation seeks to describe actions as clouds of interest points accumulated at different temporal
scales. The idea behind this method consists of extracting holistic features from the point
clouds and explicitly and globally describing the spatial and temporal dynamics of the action. Since
the proposed clouds of points representation exploits alternative and complementary information
compared to the conventional interest points-based methods, a more solid representation is then
obtained by fusing the two representations, adopting a Multiple Kernel Learning strategy. The
validity of the proposed approach in recognising action from a well-known benchmark dataset is
demonstrated as well as the superior performance achieved by fusing representations.
Since the proposed method appears limited by the presence of a dynamic background and fast
camera movements, a novel trajectory-based representation is formulated. Different from interest
points, trajectories can simultaneously retain motion and appearance information even in noisy
and crowded scenarios. Additionally, they can handle drastic camera movements and support robust
region-of-interest estimation. An equally important contribution is the proposed collaborative
feature selection performed to remove redundant and noisy components. In particular, a novel
feature selection method based on Multi-Class Delta Latent Dirichlet Allocation (MC-DLDA)
is introduced. Crucially, to enrich the final action representation, the trajectory representation is
adaptively fused with a conventional interest point representation. The proposed approach is
extensively validated on different datasets, and the reported performances are comparable with
the best state-of-the-art results. The obtained results also confirm the fundamental contribution of both
collaborative feature selection and adaptive fusion.
Finally, the problem of realistic human action classification in very ambiguous scenarios is
taken into account. In these circumstances, standard feature selection methods and multi-class
classifiers appear inadequate due to sparse training sets, high intra-class variation and inter-class
similarity. Thus, both the feature selection and classification problems need to be redesigned.
The proposed idea is to iteratively decompose the classification task into subtasks and select the
optimal feature set and classifier in accordance with the subtask context. To this end, a cascaded
feature selection and action classification approach is introduced. The proposed cascade aims to
classify actions by exploiting as much information as possible, and at the same time trying to
simplify the multi-class classification in a cascade of binary separations. Specifically, instead of
separating multiple action classes simultaneously, the overall task is automatically divided into
easier binary sub-tasks. Experiments have been carried out using challenging public datasets;
the obtained results demonstrate that with identical action representation, the cascaded classifier
significantly outperforms standard multi-class classifiers.
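The idea of replacing one multi-class decision with a cascade of binary separations, each using its own feature subset, can be sketched as follows. This is only an illustration under stated assumptions: in the thesis the decomposition and the per-stage feature and classifier selection are automatic, whereas here the stages, feature indices, thresholds and action labels are hand-written and hypothetical.

```python
def cascade_classify(sample, stages, fallback):
    """Run binary stages in order; the first stage that fires decides the label."""
    for label, feature_indices, is_positive in stages:
        selected = [sample[i] for i in feature_indices]  # stage-specific features
        if is_positive(selected):
            return label
    return fallback  # no stage separated the sample from the remaining classes

# Hypothetical sample layout: [speed, left-hand height, right-hand height].
# Each stage is (label, feature subset for this stage, binary decision rule).
stages = [
    ("run", [0], lambda f: f[0] > 5.0),
    ("wave", [1, 2], lambda f: f[1] - f[0] > 1.0),
]

first = cascade_classify([6.0, 0.0, 0.0], stages, fallback="walk")
second = cascade_classify([1.0, 0.5, 2.0], stages, fallback="walk")
third = cascade_classify([1.0, 1.0, 1.0], stages, fallback="walk")
```

Each stage only has to solve a one-versus-rest problem on the features that suit it, which is the simplification the abstract attributes to the cascade.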