The Impact of Quantity of Training Data on Recognition of Eating Gestures
This paper considers the problem of recognizing eating gestures by tracking
wrist motion. Eating gestures can have large variability in motion depending on
the subject, utensil, and type of food or beverage being consumed. Previous
works have shown viable proofs-of-concept of recognizing eating gestures in
laboratory settings with small numbers of subjects and food types, but it is
unclear how well these methods would work if tested on a larger population in
natural settings. As more subjects, locations and foods are tested, a larger
amount of motion variability could cause a decrease in recognition accuracy. To
explore this issue, this paper describes the collection and annotation of
51,614 eating gestures taken by 269 subjects eating a meal in a cafeteria.
Experiments are described that explore the complexity of hidden Markov models
(HMMs) and the amount of training data needed to adequately capture the motion
variability across this large data set. Results found that HMMs needed a
complexity of 13 states and 5 Gaussians to reach a plateau in accuracy,
signifying that a minimum of 65 samples per gesture type are needed. Results
also found that 500 training samples per gesture type were needed to identify
the point of diminishing returns in recognition accuracy. Overall, the findings
provide evidence that the size of a data set typically used to demonstrate a
laboratory proof-of-concept may not be large enough to capture all the motion
variability that could be expected in transitioning to deployment with a larger
population. Our data set, which is 1-2 orders of magnitude larger than all data
sets tested in previous works, is being made publicly available.
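As a hedged illustration of the modeling setup above, the sketch below fits one gesture type's HMM with 13 states and 5 Gaussians per state; the hmmlearn library and the wrist-motion feature layout are assumptions, not details stated by the paper.

    import numpy as np
    from hmmlearn.hmm import GMMHMM  # pip install hmmlearn

    def fit_gesture_hmm(sequences, n_states=13, n_mix=5):
        """Fit one HMM per gesture type.

        sequences: list of (time_steps, feature_dims) arrays of wrist motion.
        """
        X = np.concatenate(sequences)          # stack all observations
        lengths = [len(s) for s in sequences]  # sequence boundaries for hmmlearn
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

Recognition would then score an unseen gesture under each type's model and pick the highest log-likelihood, which is where the per-type training-sample counts discussed above come into play.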
Visual Rendering of Shapes on 2D Display Devices Guided by Hand Gestures
The design of touchless user interfaces is gaining popularity in various
contexts. Using such interfaces, users can interact with electronic devices
even when their hands are dirty or non-conductive. Users with partial physical
disabilities can also interact with electronic devices using such systems.
Research in this direction has received a major boost because of the emergence of
low-cost sensors such as Leap Motion, Kinect or RealSense devices. In this
paper, we propose a Leap Motion controller-based methodology to facilitate
rendering of 2D and 3D shapes on display devices. The proposed method tracks
finger movements while users perform natural gestures within the field of view
of the sensor. In the next phase, trajectories are analyzed to extract extended
Npen++ features in 3D. These features represent finger movements during the
gestures, and they are fed to a unidirectional left-to-right Hidden Markov Model
(HMM) for training. A one-to-one mapping between gestures and shapes is
proposed. Finally, shapes corresponding to these gestures are rendered over the
display using MuPad interface. We have created a dataset of 5400 samples
recorded by 10 volunteers. Our dataset contains 18 geometric and 18
non-geometric shapes such as "circle", "rectangle", "flower", "cone", "sphere",
etc. The proposed methodology achieves an accuracy of 92.87% when evaluated
using 5-fold cross-validation. Our experiments reveal that the extended
3D features perform better than existing 3D features in the context of shape
representation and classification. The method can be used for developing useful
HCI applications for smart display devices.
Comment: Submitted to Elsevier Displays Journal, 32 pages, 18 figures, 7 tables
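As a minimal sketch of the unidirectional left-to-right topology named above, the snippet below builds such a model with hmmlearn (an assumed implementation; the paper does not name a library). State i may only self-loop or advance to state i+1, which suits the temporal progression of a drawn gesture.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def left_to_right_hmm(n_states=8):
        # Only means/covariances are initialized by fit(); the topology
        # below is preserved because zero transitions stay zero under EM.
        model = GaussianHMM(n_components=n_states, covariance_type="diag",
                            init_params="mc", n_iter=30)
        model.startprob_ = np.zeros(n_states)
        model.startprob_[0] = 1.0                 # always start in state 0
        trans = np.zeros((n_states, n_states))
        for i in range(n_states - 1):
            trans[i, i] = trans[i, i + 1] = 0.5   # stay or move right
        trans[-1, -1] = 1.0                       # absorbing final state
        model.transmat_ = trans
        return model

One such model per shape class, scored against an extracted feature sequence, would realize the one-to-one gesture-to-shape mapping the abstract describes.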
A Probabilistic Modeling Approach to One-Shot Gesture Recognition
Gesture recognition enables a natural extension of the way we currently
interact with devices. Commercially available gesture recognition systems are
usually pre-trained and offer no option for customization by the user. In order
to improve the user experience, it is desirable to allow end users to define
their own gestures. This scenario requires learning from just a few training
examples if we want to impose only a light training load on the user. To this
end, we propose a gesture classifier based on a hierarchical probabilistic
modeling approach. In this framework, high-level features that are shared among
different gestures can be extracted from a large labeled data set, yielding a
prior distribution for gestures. When learning new types of gestures, the
learned shared prior reduces the number of required training examples for
individual gestures. We implemented the proposed gesture classifier for a Myo
sensor bracelet and show favorable results for the tested system on a database
of 17 different gesture types. Furthermore, we propose and implement two
methods to incorporate the gesture classifier in a real-time gesture
recognition system.
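The benefit of the learned shared prior can be illustrated with a simplified stand-in for the paper's hierarchical model: a conjugate-Gaussian MAP estimate that pulls a new gesture's feature mean toward the population mean when only a few examples are available. The estimator below is a sketch, not the authors' exact formulation.

    import numpy as np

    def map_gesture_mean(few_shots, prior_mean, prior_var, noise_var):
        """MAP estimate of a new gesture's feature mean.

        prior_mean/prior_var: learned from many gestures in the large set.
        few_shots: (n, d) array of examples for the new gesture type.
        """
        n = len(few_shots)
        sample_mean = few_shots.mean(axis=0)
        # Precision-weighted blend: with few examples, w is small and the
        # estimate leans on the shared prior, reducing the training load.
        w = (n / noise_var) / (n / noise_var + 1.0 / prior_var)
        return w * sample_mean + (1.0 - w) * prior_mean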
Driver distraction detection and recognition using RGB-D sensor
Driver inattention assessment has become a very active field in intelligent
transportation systems. Using the Kinect active sensor and computer vision
tools, we have built an efficient module for detecting driver distraction and
recognizing the type of distraction. Based on color and depth map data from the
Kinect, our system is composed of four sub-modules. We call them eye behavior
(detecting gaze and blinking), arm position (is the right arm up, down, right,
or forward?), head orientation, and facial expressions. Each module produces
relevant information for assessing driver inattention. They are merged together
later on using two different classification strategies: AdaBoost classifier and
Hidden Markov Model. Evaluation is done using a driving simulator and 8 drivers
of different gender, age and nationality for a total of more than 8 hours of
recording. Qualitative and quantitative results show strong and accurate
detection and recognition capacity (85% accuracy for the type of distraction
and 90% for distraction detection). Moreover, each module is obtained
independently and could be used for other types of inference, such as fatigue
detection, and could be implemented in real car systems.
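The late-fusion stage can be sketched as follows, with scikit-learn standing in for whatever AdaBoost implementation the authors used; the per-frame feature layout and the toy labeling rule are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Stand-in for fused sub-module outputs, one row per frame:
    # [gaze_x, gaze_y, blink, arm_position, head_yaw, head_pitch, expression]
    X = rng.normal(size=(2000, 7))
    # Toy rule: arm raised AND head turned away -> distracted (label 1).
    y = ((X[:, 3] > 0) & (X[:, 4] > 0)).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0)
    clf = AdaBoostClassifier(n_estimators=100).fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))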
Real-time on-device nod and shake recognition
We discuss methods for teaching systems to identify gestures such as head nods
and shakes. We use the iPhone X depth camera to gather data and later use similar
data as input for a working app. These methods have proved robust for training
with limited datasets, and thus we argue that similar methods could be adapted
to learn other human-to-human non-verbal gestures. We showcase how
to augment Euler angle gesture sequences to train models with a relatively
large number of parameters such as LSTM and GRU and gain better performance
than reported for smaller models such as HMMs. In the examples here, we
demonstrate how to train such models with Keras and run the resulting models
in real time on-device with CoreML.
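A minimal sketch of the setup described above, using Keras as the abstract does; the sequence length, the three-class label set, and the noise-jitter augmentation are plausible assumptions rather than the app's exact configuration.

    import numpy as np
    import tensorflow as tf

    SEQ_LEN, N_CLASSES = 30, 3   # assumed: nod / shake / none

    def augment(seq, rng, sigma=0.02):
        """Jitter a (SEQ_LEN, 3) Euler-angle sequence (pitch, yaw, roll)
        with Gaussian noise -- one way to stretch a limited dataset."""
        return seq + rng.normal(scale=sigma, size=seq.shape)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, 3)),
        tf.keras.layers.LSTM(64),                 # or tf.keras.layers.GRU(64)
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

A model in this form can then be converted with coremltools for the on-device inference path the abstract mentions.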
Understanding Human Motion and Gestures for Underwater Human-Robot Collaboration
In this paper, we present a number of robust methodologies for an underwater
robot to visually detect, follow, and interact with a diver for collaborative
task execution. We design and develop two autonomous diver-following
algorithms, the first of which utilizes both spatial- and frequency-domain
features pertaining to human swimming patterns in order to visually track a
diver. The second algorithm uses a convolutional neural network-based model for
robust tracking-by-detection. In addition, we propose a hand gesture-based
human-robot communication framework that is syntactically simpler and
computationally more efficient than the existing grammar-based frameworks. In
the proposed interaction framework, deep visual detectors are used to provide
accurate hand gesture recognition; subsequently, a finite-state machine
performs robust and efficient gesture-to-instruction mapping. The
distinguishing feature of this framework is that it can be easily adopted by
divers for communicating with underwater robots without using artificial
markers or requiring memorization of complex language rules. Furthermore, we
validate the performance and effectiveness of the proposed methodologies
through extensive field experiments in closed- and open-water environments.
Finally, we perform a user interaction study to demonstrate the usability
benefits of our proposed interaction framework compared to existing methods.
Comment: arXiv admin note: text overlap with arXiv:1709.0877
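The gesture-to-instruction stage can be pictured as a small finite-state machine, as in the toy sketch below; the gesture tokens and instruction names are invented for illustration and are not the paper's actual vocabulary.

    # (state, gesture) -> (next_state, instruction or None)
    TRANSITIONS = {
        ("idle", "attention"): ("armed", None),
        ("armed", "go_left"):  ("idle", "TURN_LEFT"),
        ("armed", "go_right"): ("idle", "TURN_RIGHT"),
        ("armed", "stop"):     ("idle", "HOVER"),
    }

    def step(state, gesture):
        # Unknown gestures leave the state unchanged, which keeps the
        # mapping robust to spurious detections from the visual detector.
        return TRANSITIONS.get((state, gesture), (state, None))

    state = "idle"
    for g in ["attention", "go_left", "wave", "attention", "stop"]:
        state, cmd = step(state, g)
        if cmd:
            print("execute:", cmd)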
MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language
Sign language recognition is a challenging and often underestimated problem
comprising multi-modal articulators (handshape, orientation, movement, upper
body and face) that integrate asynchronously on multiple streams. Learning
powerful statistical models in such a scenario requires much data, particularly
to apply recent advances of the field. However, labeled data is a scarce
resource for sign language due to the enormous cost of transcribing these
unwritten languages.
We propose the first real-life large-scale sign language data set comprising
over 25,000 annotated videos, which we thoroughly evaluate with
state-of-the-art methods from sign and related action recognition. Unlike the
current state-of-the-art, the data set allows investigation of generalization
to unseen individuals (signer-independent test) in a realistic setting with
over 200 signers. Previous work mostly deals with limited vocabulary tasks,
while here, we cover a large class count of 1000 signs in challenging and
unconstrained real-life recording conditions. We further propose I3D, known
from video classification, as a powerful and suitable architecture for sign
language recognition, outperforming the current state-of-the-art by a large
margin. The data set is publicly available to the community.
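The signer-independent protocol the data set enables reduces to a group-disjoint split: no signer may appear in both train and test. A sketch using scikit-learn's GroupShuffleSplit (a tooling assumption) follows.

    from sklearn.model_selection import GroupShuffleSplit

    def signer_independent_split(samples, signer_ids, test_size=0.2, seed=0):
        """Split so that no signer appears in both train and test."""
        gss = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                random_state=seed)
        train_idx, test_idx = next(gss.split(samples, groups=signer_ids))
        return train_idx, test_idx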
SignsWorld; Deeping Into the Silence World and Hearing Its Signs (State of the Art)
Automatic speech processing systems are employed more and more often in real
environments. Although the underlying speech technology is mostly language
independent, differences between languages with respect to their structure and
grammar have a substantial effect on the performance of recognition systems. In
this paper, we present a review of the latest developments in sign language
recognition research in general and in Arabic sign language (ArSL) recognition
in particular. This paper also presents SignsWorld, a general framework for
improving the deaf community's communication with hearing people. The
overall goal of the SignsWorld project is to develop a vision-based technology
for recognizing and translating continuous Arabic sign language (ArSL).
Comment: 20 pages; a state-of-the-art paper, so it contains many references
Implicit segmentation of Kannada characters in offline handwriting recognition using hidden Markov models
We describe a method for classification of handwritten Kannada characters
using Hidden Markov Models (HMMs). Kannada script is agglutinative, where
simple shapes are concatenated horizontally to form a character. This results
in a large number of characters making the task of classification difficult.
Character segmentation plays a significant role in reducing the number of
classes. Explicit segmentation techniques suffer when overlapping shapes are
present, which is common in the case of handwritten text. We use HMMs to take
advantage of the agglutinative nature of Kannada script, which allows us to
perform implicit segmentation of characters along with recognition. All the
experiments are performed on the Chars74k dataset that consists of 657
handwritten characters collected across multiple users. Gradient-based features
are extracted from individual characters and are used to train character HMMs.
The use of the implicit segmentation technique at the character level resulted in
an improvement of around 10%. This system also outperformed an existing system
tested on the same dataset by around 16%. Analysis based on learning curves
showed that increasing the training data could result in better accuracy.
Accordingly, we collected additional data and obtained an improvement of 4%
with 6 additional samples.
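One plausible way to produce the frame sequences such character HMMs consume is sketched below: a narrow window slides left to right across the character image, and a magnitude-weighted gradient-orientation histogram is computed per window. The exact feature definition is an assumption, not necessarily the paper's.

    import numpy as np
    import cv2

    def gradient_frames(img, win=4, n_bins=8):
        """img: grayscale character image (H x W). Returns (frames, n_bins)."""
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
        mag, ang = cv2.cartToPolar(gx, gy)      # per-pixel magnitude, angle
        frames = []
        for x in range(0, img.shape[1] - win + 1, win):
            m = mag[:, x:x + win].ravel()
            a = ang[:, x:x + win].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 2 * np.pi),
                                   weights=m)
            frames.append(hist / (hist.sum() + 1e-8))  # normalize per frame
        return np.asarray(frames)

Feeding frame sequences of this kind to per-character left-to-right HMMs is what allows segmentation to happen implicitly during decoding, as the abstract describes.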
The speaker-independent lipreading play-off; a survey of lipreading machines
Lipreading is a difficult gesture classification task. One problem in
computer lipreading is speaker independence: achieving the same accuracy on
test speakers not included in the training set as on speakers within it.
Current literature on speaker-independent lipreading is limited; the few
independent test-speaker accuracy scores are usually aggregated with dependent
test-speaker accuracies into an averaged performance, which obscures the
independent results. Here we
undertake a systematic survey of experiments with the TCD-TIMIT dataset using
both conventional approaches and deep learning methods to provide a series of
wholly speaker-independent benchmarks and show that the best
speaker-independent machine scores 69.58% accuracy with CNN features and an SVM
classifier. This is below the accuracy of state-of-the-art speaker-dependent
lipreading machines, but above previously reported speaker-independence results.
Comment: To appear at the third IEEE International Conference on Image
Processing, Applications and Systems 201
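The best-scoring pipeline above, CNN features feeding an SVM, can be sketched as follows; the ResNet50 backbone, input size, and lip-region preprocessing are assumptions, since the survey's exact CNN is not specified here.

    import numpy as np
    import tensorflow as tf
    from sklearn.svm import SVC

    backbone = tf.keras.applications.ResNet50(include_top=False,
                                              pooling="avg",
                                              weights="imagenet")

    def cnn_features(mouth_rois):
        """mouth_rois: (n, 224, 224, 3) float array of lip-region crops."""
        x = tf.keras.applications.resnet50.preprocess_input(mouth_rois.copy())
        return backbone.predict(x, verbose=0)   # (n, 2048) feature vectors

    # svm = SVC(kernel="linear").fit(cnn_features(train_rois), train_labels)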