A random forest approach to segmenting and classifying gestures
This thesis investigates a gesture segmentation and recognition scheme that employs a random forest classification model. A complete gesture recognition system should localize and classify each gesture from a given gesture vocabulary within a continuous video stream. Thus, the system must determine the start and end points of each gesture in time, as well as accurately recognize its class label. We propose a unified approach that performs temporal segmentation and classification simultaneously. Our method trains a random forest classification model to recognize gestures from a given vocabulary, as presented in a training dataset of video plus 3D body joint locations, as well as out-of-vocabulary (non-gesture) instances. Given an input video stream, the trained model is applied to candidate gestures using sliding windows at multiple temporal scales. The class label with the highest classifier confidence is selected, and its corresponding scale determines the segmentation boundaries in time. We evaluate our formulation on two benchmark datasets: the NATOPS dataset of 9,600 gesture instances from a vocabulary of 24 aircraft handling signals, and the CHALEARN dataset of 7,754 gesture instances from a vocabulary of 20 Italian communication gestures. The performance of our method compares favorably with state-of-the-art methods that employ Hidden Markov Models or Hidden Conditional Random Fields on the NATOPS dataset. We conclude with a discussion of the advantages of using our model.
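For illustration, below is a minimal sketch (not the thesis code) of the core idea: a random forest scores candidate windows at multiple temporal scales, and the most confident in-vocabulary prediction fixes both the class label and the segmentation boundaries. The window features, label encoding, confidence threshold, and the simple overlap suppression step are illustrative assumptions.

```python
# Minimal sketch of multi-scale sliding-window gesture spotting with a random
# forest. Feature extraction is a stand-in: each window of per-frame joint
# vectors is summarized by its mean and standard deviation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

NON_GESTURE = 0  # assumed label for out-of-vocabulary (non-gesture) windows

def window_features(frames):
    """Summarize a (T, D) block of per-frame features as a fixed-length vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def spot_gestures(stream, forest, scales=(20, 30, 40), stride=5, min_conf=0.6):
    """Scan an (N, D) stream with windows at several temporal scales and return
    (start, end, label, confidence) for windows the forest is confident about."""
    detections = []
    for size in scales:
        for start in range(0, len(stream) - size + 1, stride):
            x = window_features(stream[start:start + size]).reshape(1, -1)
            proba = forest.predict_proba(x)[0]
            label = forest.classes_[proba.argmax()]
            if label != NON_GESTURE and proba.max() >= min_conf:
                detections.append((start, start + size, int(label), float(proba.max())))
    # Keep only the highest-confidence detection among overlapping candidates.
    detections.sort(key=lambda d: -d[3])
    kept = []
    for d in detections:
        if all(d[1] <= k[0] or d[0] >= k[1] for k in kept):
            kept.append(d)
    return sorted(kept)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy training data: 30-frame windows of 60-D "joint" features, 3 gesture classes + non-gesture.
    X = rng.normal(size=(400, 30, 60))
    y = rng.integers(0, 4, size=400)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(np.array([window_features(w) for w in X]), y)
    print(spot_gestures(rng.normal(size=(300, 60)), forest))
```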
A Survey of Applications and Human Motion Recognition with Microsoft Kinect
Microsoft Kinect, a low-cost motion sensing device, enables users to interact with computers or game consoles naturally through gestures and spoken commands, without any other peripheral equipment. As such, it has attracted intense interest in research and development around the Kinect technology. In this paper, we present a comprehensive survey of Kinect applications and of the latest research and development on motion recognition using data captured by the Kinect sensor. On the applications front, we review the use of the Kinect technology in a variety of areas, including healthcare, education and performing arts, robotics, sign language recognition, retail services, workplace safety training, and 3D reconstruction. On the technology front, we provide an overview of the main features of both versions of the Kinect sensor together with the depth sensing technologies they use, and review the literature on human motion recognition techniques used in Kinect applications. We provide a classification of motion recognition techniques to highlight the different approaches used in human motion recognition. Furthermore, we compile a list of publicly available Kinect datasets; these datasets are valuable resources for researchers investigating better methods for human motion recognition and lower-level computer vision tasks such as segmentation, object detection, and human pose estimation.
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique ensures robustness of the classifier to missing signals in one or several channels, so that it produces meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature through experiments on the same dataset augmented with audio.
Comment: 14 pages, 7 figures
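A minimal sketch of the ModDrop idea under simplifying assumptions: per-modality encoders are fused by concatenation, and during training entire modality channels are randomly zeroed so the fused classifier learns cross-modality correlations while tolerating missing inputs. The layer sizes, drop probability, and three-modality setup below are illustrative, not the authors' architecture.

```python
# Sketch of ModDrop-style fusion: randomly drop whole modality channels in training.
import torch
import torch.nn as nn

class ModDropFusion(nn.Module):
    def __init__(self, modality_dims, hidden=128, n_classes=20, p_drop=0.2):
        super().__init__()
        # One small encoder per modality (e.g. skeleton, hand video, audio).
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in modality_dims
        )
        self.p_drop = p_drop
        self.classifier = nn.Linear(hidden * len(modality_dims), n_classes)

    def forward(self, inputs):
        feats = []
        for enc, x in zip(self.encoders, inputs):
            h = enc(x)
            if self.training and torch.rand(1).item() < self.p_drop:
                h = torch.zeros_like(h)  # drop this entire modality channel
            feats.append(h)
        return self.classifier(torch.cat(feats, dim=-1))

# Toy usage: three modalities of different feature widths, batch of 8 samples.
model = ModDropFusion(modality_dims=[60, 256, 40])
batch = [torch.randn(8, 60), torch.randn(8, 256), torch.randn(8, 40)]
logits = model(batch)  # shape: (8, 20)
```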
To Draw or Not to Draw: Recognizing Stroke-Hover Intent in Gesture-Free Bare-Hand Mid-Air Drawing Tasks
Over the past several decades, technological advancements have introduced new modes of communication with computers, marking a shift away from traditional mouse and keyboard interfaces. While touch-based interactions are widely used today, recent developments in computer vision, body-tracking stereo cameras, and augmented and virtual reality now enable communication with computers through spatial input in physical 3D space. These techniques are being integrated into design-critical tasks such as sketching and modeling through sophisticated methodologies and the use of specialized instrumented devices. One of the prime challenges in design research is to make this spatial interaction with the computer as intuitive as possible for the user.
Drawing curves in mid-air with the fingers is a fundamental task with applications to 3D sketching, geometric modeling, handwriting recognition, and authentication. Sketching in general is a crucial mode for effective idea communication between designers. Mid-air curve input is typically accomplished through instrumented controllers, specific hand postures, or pre-defined hand gestures in front of depth and motion sensing cameras, and the user may use any of these modalities to express the intention to start or stop sketching. However, apart from lacking robustness, such gestures, specific postures, and instrumented controllers impose an additional cognitive load on the user during design tasks.
To address the problems associated with these mid-air curve input modalities, the presented research discusses the design, development, and evaluation of data-driven models for intent recognition in non-instrumented, gesture-free, bare-hand mid-air drawing tasks.
The research is motivated by a behavioral study demonstrating the need for such an approach, given the lack of robustness and intuitiveness of hand postures and instrumented devices. The main objective is to study how users move during mid-air sketching, develop qualitative insights into these movements, and implement a computational approach that determines when the user intends to draw in mid-air without an explicit mechanism (such as an instrumented controller or a specified hand posture). The user's hand trajectory is recorded, and each recorded point is classified as either hover or stroke; the resulting model thus labels every point along the user's spatial trajectory.
Drawing inspiration from the way users sketch in mid-air, this research first establishes the need for an alternative approach that processes bare-hand mid-air curves in a continuous fashion. It then presents a novel drawing-intent recognition workflow that labels every recorded drawing point, using three different approaches. We begin by recording mid-air drawing data and developing a classification model based on geometric properties extracted from the recorded data; the goal of this model is to identify drawing intent from critical geometric and temporal features. In the second approach, we explore how the prediction quality of the model varies when the dimensionality of the mid-air curve input is increased. In the third approach, we seek to understand drawing intention from mid-air curves using dimensionality-reduction neural networks such as autoencoders. Finally, the broader implications of this research are discussed, along with potential areas for further design and research of mid-air interactions.
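A minimal sketch of the first, geometric-feature approach under stated assumptions: each recorded trajectory point is described by simple speed and shape statistics of its local neighborhood and classified as hover or stroke. The specific features, window size, and classifier below are illustrative stand-ins, not the models presented in the thesis.

```python
# Sketch: per-point stroke/hover classification from local geometric features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def point_features(traj, i, w=5):
    """traj: (N, 3) array of 3D fingertip positions sampled at a fixed rate."""
    lo, hi = max(0, i - w), min(len(traj), i + w + 1)
    seg = traj[lo:hi]
    vel = np.diff(seg, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    # Straightness: chord length over path length (1.0 = straight, lower = curved).
    path = speed.sum() + 1e-8
    chord = np.linalg.norm(seg[-1] - seg[0])
    return np.array([speed.mean(), speed.std(), chord / path, seg.std(axis=0).mean()])

def featurize(traj):
    return np.stack([point_features(traj, i) for i in range(len(traj))])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    traj = np.cumsum(rng.normal(scale=0.01, size=(500, 3)), axis=0)  # toy trajectory
    labels = rng.integers(0, 2, size=500)                            # toy hover/stroke labels
    clf = GradientBoostingClassifier().fit(featurize(traj), labels)
    print(clf.predict(featurize(traj))[:20])
```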
Personalized face and gesture analysis using hierarchical neural networks
The video-based computational analyses of human face and gesture signals encompass a myriad of challenging research problems involving computer vision, machine learning, and human-computer interaction. In this thesis, we focus on the following challenges: a) the classification of hand and body gestures along with the temporal localization of their occurrence in a continuous stream, b) the recognition of facial expressivity levels in people with Parkinson's disease using multimodal feature representations, c) the prediction of student learning outcomes in intelligent tutoring systems using affect signals, and d) the personalization of machine learning models, which can adapt to subject- and group-specific nuances in facial and gestural behavior. Specifically, we first conduct a quantitative comparison of two approaches to the problem of segmenting and classifying gestures on two benchmark gesture datasets: a method that simultaneously segments and classifies gestures versus a cascaded method that performs the tasks sequentially. Second, we introduce a framework that computationally predicts an accurate score for facial expressivity and validate it on a dataset of interview videos of people with Parkinson's disease. Third, based on a unique dataset of videos of students interacting with MathSpring, an intelligent tutoring system, collected by our collaborative research team, we build models to predict learning outcomes from their facial affect signals. Finally, we propose a novel solution to a relatively unexplored area in automatic face and gesture analysis research: personalization of models to individuals and groups. We develop hierarchical Bayesian neural networks to overcome the challenges posed by group- or subject-specific variations in face and gesture signals. We successfully validate our formulation on the problems of personalized subject-specific gesture classification, context-specific facial expressivity recognition, and student-specific learning outcome prediction. We demonstrate the flexibility of our hierarchical framework by validating the utility of both fully connected and recurrent neural architectures.
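As a rough illustration of the personalization idea, the sketch below splits parameters into a shared backbone and small per-subject heads; this is a simplified, non-Bayesian stand-in for the hierarchical Bayesian neural networks described above (which instead place a group-level prior over subject-specific weights), and all sizes and names are assumptions.

```python
# Sketch: shared backbone with per-subject output heads for personalized prediction.
import torch
import torch.nn as nn

class PersonalizedClassifier(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=24, n_subjects=20):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared across subjects
        self.heads = nn.ModuleList(nn.Linear(hidden, n_classes) for _ in range(n_subjects))

    def forward(self, x, subject_id):
        return self.heads[subject_id](self.backbone(x))

model = PersonalizedClassifier()
x = torch.randn(8, 64)
logits = model(x, subject_id=3)  # predictions adapted to subject 3
```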