8,594 research outputs found
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels, so that it can produce
meaningful predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.
Comment: 14 pages, 7 figure
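The modality-dropping idea described in this abstract can be illustrated with a minimal sketch. It assumes each modality arrives as a (batch, features) array and that whole modalities are zeroed independently per sample with a fixed probability. The function name `moddrop`, the value of `drop_prob`, and fusion by concatenation are illustrative assumptions; the abstract does not give the authors' actual schedule or architecture.

```python
import numpy as np

def moddrop(modalities, drop_prob, rng):
    """Zero out whole modality inputs at random, independently per sample.

    modalities: dict mapping a modality name to an array of shape (batch, features)
    drop_prob:  probability of dropping each modality for each sample (assumed value)
    """
    dropped = {}
    for name, x in modalities.items():
        # One keep/drop decision per sample, broadcast across that modality's features.
        keep = rng.random((x.shape[0], 1)) >= drop_prob
        dropped[name] = x * keep
    return dropped

rng = np.random.default_rng(0)
batch = {
    "video": np.ones((4, 3)),
    "depth": np.ones((4, 2)),
    "audio": np.ones((4, 5)),
}
out = moddrop(batch, drop_prob=0.5, rng=rng)
# Hypothetical fusion step: concatenate the (possibly zeroed) modalities.
fused = np.concatenate([out[k] for k in sorted(out)], axis=1)
```

Because a dropped modality is zeroed as a whole, a downstream fusion layer trained this way sees examples in which any subset of channels is missing, which is the robustness property the abstract claims.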
Fast Fight Detection
Action recognition has become a hot topic within computer vision. However, the action recognition community has focused mainly on relatively simple actions such as clapping, walking, and jogging. The detection of specific events with direct practical use, such as fights or aggressive behavior in general, has been comparatively less studied. Such a capability may be extremely useful in video surveillance scenarios such as prisons and psychiatric centers, or even embedded in camera phones. As a consequence, there is growing interest in developing violence detection algorithms. Recent work considered the well-known Bag-of-Words framework for the specific problem of fight detection. Under this framework, spatio-temporal features are extracted from the video sequences and used for classification. Despite encouraging results in which high accuracy rates were achieved, the computational cost of extracting such features is prohibitive for practical applications. This work proposes a novel method to detect violent sequences. Features extracted from motion blobs are used to discriminate between fight and non-fight sequences. Although the method is outperformed in accuracy by the state of the art, it is significantly faster to compute, making it amenable to real-time applications.
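As a rough illustration of what "motion blobs" could mean in practice, the sketch below thresholds the absolute difference between two consecutive grayscale frames and measures the area of each connected region. The threshold value, the 4-connectivity, and the choice of area as the feature are assumptions made for illustration; the abstract does not specify which blob features the authors use.

```python
import numpy as np
from collections import deque

def motion_blobs(prev_frame, frame, threshold=30):
    """Return the pixel areas of 4-connected blobs in the thresholded frame difference."""
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    mask = diff > threshold
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    areas = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill over one connected moving region.
            queue = deque([(r, c)])
            seen[r, c] = True
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return sorted(areas, reverse=True)

prev_frame = np.zeros((6, 8), dtype=np.uint8)
frame = prev_frame.copy()
frame[1:3, 1:4] = 200  # a 2x3 moving region
frame[4:6, 6:8] = 200  # a 2x2 moving region
blob_areas = motion_blobs(prev_frame, frame)
```

Features of this kind need only a frame difference and one pass over the pixels, which is consistent with the speed advantage the abstract reports over spatio-temporal feature extraction.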
A 3D descriptor to detect task-oriented grasping points in clothing
© 2016. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
Manipulating textile objects with a robot is a challenging task, especially because garment perception is difficult due to the endless configurations a garment can adopt, coupled with a large variety of colors and designs. Most current approaches follow a multiple re-grasp strategy, in which clothes are sequentially grasped from different points until one of them yields a recognizable configuration. In this work we propose a method that combines 3D and appearance information to directly select a suitable grasping point for the task at hand, which in our case consists of hanging a shirt or a polo shirt from a hook. Our method follows a coarse-to-fine approach in which, first, the collar of the garment is detected and, next, a grasping point on the lapel is chosen using a novel 3D descriptor.
In contrast to current 3D descriptors, ours can run in real time, even when it needs to be densely computed over the input image. Our central idea is to take advantage of the structured nature of the range images that most depth sensors provide and, by exploiting integral imaging, achieve speed-ups of two orders of magnitude with respect to competing approaches, while maintaining performance. This makes it especially suitable for robotic applications, as we thoroughly demonstrate in the experimental section.
Peer Reviewed
Postprint (author's final draft
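The speed-up attributed to integral imaging comes from a standard construction: once a summed-area table has been built over the range image, the sum over any axis-aligned rectangle costs four lookups regardless of its size. The sketch below shows only that generic mechanism on a toy array; the helper names are illustrative and the descriptor itself is not reproduced here.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with one row and one column of zero padding at the top and left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] computed from four table lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# Toy stand-in for a range image.
depth = np.arange(20, dtype=np.float64).reshape(4, 5)
ii = integral_image(depth)
patch_sum = box_sum(ii, 1, 1, 3, 4)
```

Dense computation benefits most: the table is built once per image, after which every per-pixel neighbourhood statistic based on box sums is constant-time.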
Video Data Visualization System: Semantic Classification And Personalization
We present in this paper an intelligent video data visualization tool, based
on semantic classification, for retrieving and exploring a large scale corpus
of videos. Our work is based on semantic classification resulting from semantic
analysis of video. The obtained classes will be projected in the visualization
space. The resulting graph consists of nodes and edges: the nodes are the keyframes
of video documents, and the edges are the relations between documents and the
classes of documents. Finally, we construct the user's profile, based on the
interaction with the system, to make the system better suited to the user's
preferences.
Comment: graphic
Action Recognition Using Particle Flow Fields
In recent years, research in human action recognition has advanced on multiple fronts to address various types of actions including simple, isolated actions in staged data (e.g., KTH dataset), complex actions (e.g., Hollywood dataset), and naturally occurring actions in surveillance videos (e.g., VIRAT dataset). Several techniques, including those based on gradient, flow, and interest points, have been developed for their recognition. Most perform very well on standard action recognition datasets, but fail to produce similar results on more complex, large-scale datasets. Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (six actions), IXMAS (thirteen actions), and Weizmann (ten actions). Challenges such as camera motion, different viewpoints, huge interclass variations, cluttered background, occlusions, bad illumination conditions, and poor quality of web videos cause the majority of the state-of-the-art action recognition approaches to fail. An increasing number of categories and the inclusion of actions with high confusion also increase the difficulty of the problem. The approach taken to solve this action recognition problem depends primarily on the dataset and the possibility of detecting and tracking the object of interest. In this dissertation, a new method for video representation is proposed and three new approaches to perform action recognition in different scenarios using varying prerequisites are presented. The prerequisites have decreasing levels of difficulty to obtain: 1) Scenario requires human detection and tracking to perform action recognition; 2) Scenario requires background and foreground separation to perform action recognition; and 3) No pre-processing is required for action recognition. First, we propose a new video representation using optical flow and particle advection.
The proposed “Particle Flow Field” (PFF) representation has been used to generate motion descriptors and tested in a Bag of Video Words (BoVW) framework on the KTH dataset. We show that particle flow fields have better performance than other low-level video representations, such as 2D-Gradients, 3D-Gradients and optical flow. Second, we analyze the performance of the state-of-the-art technique based on the histogram of oriented 3D-Gradients in spatio-temporal volumes, where human detection and tracking are required. We use the proposed particle flow field and show superior results compared to the histogram of oriented 3D-Gradients in spatio-temporal volumes. The proposed method, when used for human action recognition, just needs human detection and does not necessarily require human tracking and figure-centric bounding boxes. It has been tested on KTH (six actions), Weizmann (ten actions), and IXMAS (thirteen actions, 4 different views) action recognition datasets. Third, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion descriptors obtained using the Bag of Words framework, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the huge number of categories. We demonstrate that scene context is a very important feature for performing action recognition on huge datasets. The proposed method needs separation of moving and stationary pixels, and does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach obtains good performance on a huge number of action categories. It has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) Dataset containing 11 action categories. We also tested our approach on the KTH and HMDB51 datasets for comparison.
Finally, we focus on solving practical problems in representing actions by a bag of spatio-temporal features (i.e. cuboids), which has proven valuable for action recognition in recent literature. We observed that the visual vocabulary based (bag of video words) method suffers from many drawbacks in practice: (i) it requires an intensive training stage to obtain good performance; (ii) it is sensitive to the vocabulary size; (iii) it is unable to cope with incremental recognition problems; (iv) it is unable to recognize simultaneous multiple actions; (v) it is unable to perform recognition frame by frame. In order to overcome these drawbacks, we propose a framework to index large-scale motion features using a Sphere/Rectangle-tree (SR-tree) for incremental action detection and recognition. The recognition comprises the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. It can also provide localization of the action. Since it does not require feature quantization, it can efficiently grow the feature-tree by adding features from new training actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large-scale datasets because the SR-tree is a disk-based data structure. We tested our approach on two publicly available datasets, the KTH dataset and the IXMAS multi-view dataset, and achieved promising results.
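The two-step recognition described at the end of this abstract (nearest-neighbour matching of local features, then voting) can be sketched as follows. This is a minimal illustration under stated assumptions: a brute-force linear search stands in for the SR-tree index, the features are toy 2-D vectors, and both function names are hypothetical.

```python
import numpy as np
from collections import Counter

def classify_by_nn_voting(train_feats, train_labels, query_feats):
    """Label a clip by majority vote over the nearest training feature of each query feature."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def add_training_features(train_feats, train_labels, new_feats, new_labels):
    """Grow the feature store without retraining; no vocabulary or quantization is involved."""
    return np.vstack([train_feats, new_feats]), train_labels + list(new_labels)

train_feats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = ["walk", "walk", "wave", "wave"]
query = np.array([[0.1, 0.2], [5.2, 5.1], [5.9, 4.8]])
label = classify_by_nn_voting(train_feats, train_labels, query)

# Incremental step: add a new category and classify a clip from it.
train_feats, train_labels = add_training_features(
    train_feats, train_labels, np.array([[10.0, 0.0]]), ["jump"]
)
new_label = classify_by_nn_voting(train_feats, train_labels, np.array([[9.8, 0.1]]))
```

Since each local feature votes individually, adding a category only appends rows to the store, which mirrors the incremental property the abstract attributes to skipping feature quantization.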
Review of Person Re-identification Techniques
Person re-identification across different surveillance cameras with disjoint
fields of view has become one of the most interesting and challenging subjects
in the area of intelligent video surveillance. Although several methods have
been developed and proposed, certain limitations and unresolved issues remain.
In all of the existing re-identification approaches, feature vectors are
extracted from segmented still images or video frames. Different similarity or
dissimilarity measures have been applied to these vectors. Some methods have
used simple constant metrics, whereas others have utilised models to obtain
optimised metrics. Some have created models based on local colour or texture
information, and others have built models based on the gait of people. In
general, the main objective of all these approaches is to achieve a
higher accuracy rate and lower computational costs. This study summarises
several developments in recent literature and discusses the various available
methods used in person re-identification. Specifically, their advantages and
disadvantages are mentioned and compared.
Comment: Published 201
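The distinction drawn above between simple constant metrics and optimised metrics can be made concrete with a small ranking sketch. It assumes each person image has already been reduced to a feature vector. The function name `rank_gallery` and the hand-picked matrix standing in for a learned metric are illustrative assumptions, not taken from any surveyed method.

```python
import numpy as np

def rank_gallery(probe, gallery, metric=None):
    """Return gallery indices ordered from most to least similar to the probe.

    metric: optional positive semi-definite matrix M defining the distance
            (x - y)^T M (x - y); plain squared Euclidean distance if None.
    """
    diff = gallery - probe
    if metric is None:
        d = (diff ** 2).sum(axis=1)
    else:
        d = np.einsum("ij,jk,ik->i", diff, metric, diff)
    return np.argsort(d, kind="stable")

probe = np.array([1.0, 0.0])
gallery = np.array([[0.9, 3.0], [2.0, 0.0], [1.0, 0.1]])

euclidean_rank = rank_gallery(probe, gallery)
# Hypothetical "optimised" metric that ignores the second, unreliable feature.
weighted_rank = rank_gallery(probe, gallery, metric=np.diag([1.0, 0.0]))
```

Changing the metric changes the ranking without touching the features, which is why metric choice is treated as a separate design axis from feature extraction.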