Fast and Robust Dynamic Hand Gesture Recognition via Key Frames Extraction and Feature Fusion
Gesture recognition is a hot topic in computer vision and pattern
recognition, and it plays a vital role in natural human-computer
interfaces. Although great progress has been made recently, fast and robust
hand gesture recognition remains an open problem, since existing methods
have not balanced accuracy and efficiency well. To bridge this gap, this
work combines image entropy and density clustering to extract key frames
from hand gesture videos for further feature extraction, which improves the
efficiency of recognition. Moreover, a feature fusion strategy is also
proposed to further improve the feature representation, which elevates the
performance of recognition. To validate our approach in a "wild"
environment, we also introduce two new datasets called HandGesture and
Action3D. Experiments consistently demonstrate that our strategy achieves
competitive results on the Northwestern University, Cambridge, HandGesture
and Action3D hand gesture datasets. Our code and datasets will be released
at https://github.com/Ha0Tang/HandGestureRecognition.
Comment: 11 pages, 3 figures, accepted to Neurocomputing.
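The abstract describes selecting key frames by combining image entropy with
density clustering. Below is a minimal sketch of that idea, assuming
grayscale histogram entropy per frame and DBSCAN over the entropy values;
the paper's exact features, clustering formulation, and parameters (eps,
min_samples) are not specified here and are illustrative assumptions.

```python
# Sketch: entropy-based key-frame selection with density clustering.
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def frame_entropy(frame):
    """Shannon entropy of the grayscale intensity histogram of one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_key_frames(video_path, eps=0.15, min_samples=3):
    """Cluster frames by entropy; keep the highest-entropy frame per cluster."""
    cap = cv2.VideoCapture(video_path)
    frames, entropies = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        entropies.append(frame_entropy(frame))
    cap.release()
    e = np.array(entropies).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(e)
    key_frames = []
    for label in set(labels) - {-1}:  # -1 marks DBSCAN noise points
        idx = np.where(labels == label)[0]
        key_frames.append(frames[idx[np.argmax(e[idx])]])
    return key_frames
```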
Robust and customized methods for real-time hand gesture recognition under object-occlusion
Dynamic hand tracking and gesture recognition is a hard task, since the
fingers have many joints and each joint has several degrees of freedom.
In addition, object occlusion is a thorny issue in finger tracking and
posture recognition. We therefore propose a robust and customized system
for real-time hand tracking and gesture recognition in occluded
environments. First, we model the angles between hand keypoints and encode
their relative coordinate vectors, then introduce a GAN to generate a raw
discrete sequence dataset. Second, we propose a time-series forecasting
method to predict the locations of the defined hand keypoints. Finally, we
define a sliding-window matching method to complete gesture recognition.
We analyze 11 kinds of typical gestures and show how to perform gesture
recognition with the proposed method. Our work reaches state-of-the-art
results and contributes a framework for implementing customized gesture
recognition tasks.
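Two of the steps above, angle encoding of hand keypoints and sliding-window
matching, can be sketched briefly. The snippet below is a plausible
illustration only: the template set, window length, and distance threshold
are assumptions, not the paper's settings.

```python
# Sketch: joint-angle encoding and sliding-window template matching.
import numpy as np

def joint_angles(keypoints):
    """Angles (radians) at interior points of an (N, 2) keypoint chain."""
    v1 = keypoints[:-2] - keypoints[1:-1]
    v2 = keypoints[2:] - keypoints[1:-1]
    cos = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def sliding_window_match(angle_stream, templates, window, threshold=1.0):
    """Return (start_index, gesture_name) pairs where a template matches.

    angle_stream: list of per-frame angle vectors; templates: dict mapping
    gesture name -> array of shape (window, num_angles)."""
    matches = []
    for start in range(len(angle_stream) - window + 1):
        segment = np.asarray(angle_stream[start:start + window])
        for name, template in templates.items():
            if np.linalg.norm(segment - template) / window < threshold:
                matches.append((start, name))
    return matches
```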
Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation
Hand pose estimation from a single depth image is an essential topic in
computer vision and human-computer interaction. Despite recent advances in
this area driven by convolutional neural networks, accurate hand pose
estimation is still a challenging problem. In this paper we propose a Pose
guided structured Region Ensemble Network (Pose-REN) to boost the
performance of hand pose estimation. The proposed method extracts regions
from the feature maps of a convolutional neural network under the guidance
of an initially estimated pose, generating more representative features for
hand pose estimation. The extracted feature regions are then integrated
hierarchically according to the topology of the hand joints by employing
tree-structured fully connected layers. A refined estimate of the hand pose
is directly regressed by the proposed network, and the final hand pose is
obtained with an iterative cascaded method. Comprehensive experiments on
public hand pose datasets demonstrate that our proposed method outperforms
state-of-the-art algorithms.
Comment: Accepted by Neurocomputing.
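The iterative, pose-guided cascade can be illustrated with a small PyTorch
sketch, shown below under simplifying assumptions: region extraction is a
plain crop around each projected joint, and the tree-structured fusion is
reduced to per-joint branches concatenated before a shared regressor.
Layer sizes, joint count, and iteration count are arbitrary placeholders,
not the Pose-REN configuration.

```python
# Sketch: pose-guided region extraction with iterative cascaded refinement.
import torch
import torch.nn as nn

class PoseGuidedCascade(nn.Module):
    def __init__(self, num_joints=21, feat_channels=32, crop=8):
        super().__init__()
        self.crop = crop
        self.backbone = nn.Sequential(
            nn.Conv2d(1, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU())
        # One small branch per joint region, fused by a shared regressor.
        self.branches = nn.ModuleList(
            [nn.Linear(feat_channels * crop * crop, 64)
             for _ in range(num_joints)])
        self.regressor = nn.Linear(64 * num_joints, num_joints * 3)

    def extract_region(self, feat, uv):
        """Crop a fixed-size window of the feature map around joint pixel uv."""
        _, _, h, w = feat.shape
        half = self.crop // 2
        u = int(uv[0].clamp(half, w - half - 1))
        v = int(uv[1].clamp(half, h - half - 1))
        return feat[:, :, v - half:v + half, u - half:u + half]

    def forward(self, depth, init_pose, iters=3):
        """depth: (1, 1, H, W); init_pose: (num_joints, 3) with (u, v, d)."""
        feat = self.backbone(depth)
        pose = init_pose
        for _ in range(iters):  # cascade: re-extract regions with the new pose
            regions = [branch(self.extract_region(feat, joint[:2]).flatten(1))
                       for branch, joint in zip(self.branches, pose)]
            pose = self.regressor(torch.cat(regions, dim=1)).view(-1, 3)
        return pose
```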
Fingertip Detection and Tracking for Recognition of Air-Writing in Videos
Air-writing is the process of writing characters or words in free space
using finger or hand movements, without the aid of any hand-held device. In
this work, we address the problem of mid-air finger writing using webcam
video as input. In spite of recent advances in object detection and
tracking, accurate and robust detection and tracking of the fingertip
remains challenging, primarily because of the fingertip's small size.
Moreover, the initialization and termination of mid-air finger writing are
also challenging due to the absence of any standard delimiting criterion.
To solve these problems, we propose a new writing-hand pose detection
algorithm for the initialization of air-writing, which uses the Faster
R-CNN framework for accurate hand detection, followed by hand segmentation
and, finally, counting the number of raised fingers based on geometrical
properties of the hand. Further, we propose a robust fingertip detection
and tracking approach using a new signature function called
distance-weighted curvature entropy. Finally, a fingertip velocity-based
termination criterion is used as a delimiter to mark the completion of the
air-writing gesture. Experiments show the superiority of the proposed
fingertip detection and tracking algorithm over state-of-the-art
approaches, giving a mean precision of 73.1% while achieving real-time
performance at 18.5 fps, a condition which is of vital importance to
air-writing. Character recognition experiments give a mean accuracy of
96.11% using the proposed air-writing system, a result comparable to that
of existing handwritten character recognition systems.
Comment: 32 pages, 10 figures, 2 tables. Submitted to Expert Systems with
Applications.
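The velocity-based termination criterion mentioned above lends itself to a
very short sketch: the air-writing stroke is considered finished once the
fingertip speed stays below a threshold for several consecutive frames. The
threshold and the frame-count window below are assumptions, not the paper's
tuned parameters.

```python
# Sketch: fingertip-velocity delimiter for air-writing termination.
import numpy as np

def stroke_complete(trajectory, fps=18.5, speed_thresh=40.0, hold_frames=10):
    """trajectory: list of (x, y) fingertip positions in pixels, one per frame.

    Returns True if the last `hold_frames` speeds (pixels/second) are all
    below `speed_thresh`, i.e. the fingertip has effectively stopped moving."""
    if len(trajectory) < hold_frames + 1:
        return False
    pts = np.asarray(trajectory[-(hold_frames + 1):], dtype=float)
    speeds = np.linalg.norm(np.diff(pts, axis=0), axis=1) * fps
    return bool(np.all(speeds < speed_thresh))
```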
Deep Facial Expression Recognition: A Survey
With the transition of facial expression recognition (FER) from
laboratory-controlled to challenging in-the-wild conditions and the recent
success of deep learning techniques in various fields, deep neural networks
have increasingly been leveraged to learn discriminative representations for
automatic FER. Recent deep FER systems generally focus on two important issues:
overfitting caused by a lack of sufficient training data and
expression-unrelated variations, such as illumination, head pose and identity
bias. In this paper, we provide a comprehensive survey on deep FER, including
datasets and algorithms that provide insights into these intrinsic problems.
First, we describe the standard pipeline of a deep FER system with the related
background knowledge and suggestions of applicable implementations for each
stage. We then introduce the available datasets that are widely used in the
literature and provide accepted data selection and evaluation principles for
these datasets. For the state of the art in deep FER, we review existing novel
deep neural networks and related training strategies that are designed for FER
based on both static images and dynamic image sequences, and discuss their
advantages and limitations. Competitive performances on widely used benchmarks
are also summarized in this section. We then extend our survey to additional
related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future
directions for the design of robust deep FER systems.
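The survey's standard static-image pipeline (a detected and aligned face
crop fed to a CNN that predicts an expression class) can be sketched as
follows. The architecture and the seven-class label set are illustrative
choices, not any specific surveyed model.

```python
# Sketch: minimal static-image deep FER classifier over aligned face crops.
import torch
import torch.nn as nn

class SimpleFERNet(nn.Module):
    def __init__(self, num_classes=7):  # e.g. the common 7 basic expressions
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, face):  # face: (B, 1, H, W) aligned grayscale crop
        return self.classifier(self.features(face).flatten(1))

# Usage: logits = SimpleFERNet()(torch.randn(4, 1, 48, 48))
```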
Online Action Recognition based on Incremental Learning of Weighted Covariance Descriptors
Different from traditional action recognition based on video segments,
online action recognition aims to recognize actions from unsegmented data
streams in a continuous manner. One way to perform online recognition is to
accumulate evidence over time and make predictions from streaming video.
This paper presents a fast yet effective method to recognize actions from a
stream of noisy skeleton data, in which a novel weighted covariance
descriptor is adopted to accumulate evidence. In particular, a fast
incremental updating method for the weighted covariance descriptor is
developed for the accumulation of temporal information and online
prediction. The weighted covariance descriptor takes the following
principles into consideration: past frames contribute less to recognition,
while recent and informative frames, such as key frames, contribute more.
Online recognition is achieved using a simple nearest neighbor search
against a set of offline-trained action models. Experimental results on the
MSRC-12 Kinect Gesture dataset and our newly constructed online action
recognition dataset demonstrate the efficacy of the proposed method.
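An incrementally maintained weighted covariance descriptor can be sketched
with the standard weighted update (West's algorithm), as below. The recency
weighting and any special treatment of key frames used in the paper are not
reproduced; the weight passed per frame is left to the caller.

```python
# Sketch: incremental weighted covariance over streaming frame features.
import numpy as np

class WeightedCovarianceDescriptor:
    def __init__(self, dim):
        self.total_weight = 0.0
        self.mean = np.zeros(dim)
        self.scatter = np.zeros((dim, dim))  # weighted sum of outer products

    def update(self, x, weight):
        """Fold one frame feature vector x (shape (dim,)) in with `weight`."""
        x = np.asarray(x, dtype=float)
        self.total_weight += weight
        delta = x - self.mean
        self.mean += (weight / self.total_weight) * delta
        self.scatter += weight * np.outer(delta, x - self.mean)

    def covariance(self):
        return self.scatter / max(self.total_weight, 1e-12)

# Usage: weight recent frames more (e.g. an increasing w_t per frame), then
# classify by nearest neighbor against offline per-action descriptors.
```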
Evaluation of the Spatio-Temporal features and GAN for Micro-expression Recognition System
Owing to the development and advancement of artificial intelligence,
numerous works have been established on human facial expression recognition
systems. Meanwhile, the detection and classification of micro-expressions
have been attracting attention from various research communities in recent
years. In this paper, we first review the processes of a conventional
optical-flow-based recognition system, which comprises facial landmark
annotation, computation of optical-flow-guided images, feature extraction
and emotion classification. Secondly, a few approaches are proposed to
improve the feature extraction part, such as exploiting a GAN to generate
more image samples. In particular, several variants of optical flow are
computed in order to generate optimal images that lead to high recognition
accuracy. Next, a GAN, a combination of a generator and a discriminator, is
utilized to generate new "fake" images to increase the sample size.
Thirdly, a modified state-of-the-art convolutional neural network is
proposed. To verify the effectiveness of the proposed method, the results
are evaluated on spontaneous micro-expression databases, namely SMIC,
CASME II and SAMM. Both the F1-score and accuracy performance metrics are
reported in this paper.
Comment: 15 pages, 16 figures, 6 tables.
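The optical-flow-guided image computation step can be sketched as follows:
dense flow between an onset and an apex frame is converted into channels a
CNN can consume. Farneback flow with default-style parameters is used here
purely as a stand-in; the paper evaluates several optical-flow variants
that are not shown.

```python
# Sketch: build a flow-guided image from onset and apex micro-expression frames.
import cv2
import numpy as np

def flow_guided_image(onset_path, apex_path):
    onset = cv2.imread(onset_path, cv2.IMREAD_GRAYSCALE)
    apex = cv2.imread(apex_path, cv2.IMREAD_GRAYSCALE)
    # Farneback dense flow: prev, next, flow, pyr_scale, levels, winsize,
    # iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(onset, apex, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Stack horizontal flow, vertical flow, and magnitude as a 3-channel input.
    return np.dstack([flow[..., 0], flow[..., 1], magnitude]).astype(np.float32)
```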
American Sign Language fingerspelling recognition in the wild
We address the problem of American Sign Language fingerspelling recognition
in the wild, using videos collected from websites. We introduce the largest
data set available so far for the problem of fingerspelling recognition, and
the first using naturally occurring video data. Using this data set, we present
the first attempt to recognize fingerspelling sequences in this challenging
setting. Unlike prior work, our video data is extremely challenging due to low
frame rates and visual variability. To tackle the visual challenges, we train a
special-purpose signing hand detector using a small subset of our data. Given
the hand detector output, a sequence model decodes the hypothesized
fingerspelled letter sequence. For the sequence model, we explore
attention-based recurrent encoder-decoders and CTC-based approaches. As the
first attempt at fingerspelling recognition in the wild, this work is intended
to serve as a baseline for future work on sign language recognition in
realistic conditions. We find that, as expected, letter error rates are much
higher than in previous work on more controlled data, and we analyze the
sources of error and effects of model variants.
Comment: Accepted at SLT 2018.
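The CTC-based branch of the sequence model can be sketched briefly:
per-frame features from the signing-hand detector feed a recurrent encoder
whose per-frame letter posteriors are trained with CTC. The feature size,
hidden size, and 26-letter alphabet plus blank below are assumptions, not
the paper's configuration.

```python
# Sketch: CTC-trained recurrent model over per-frame hand-crop features.
import torch
import torch.nn as nn

class CTCFingerspellingModel(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_letters=26):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_letters + 1)  # blank = last index

    def forward(self, frames):  # frames: (B, T, feat_dim) hand-crop features
        out, _ = self.encoder(frames)
        return self.head(out).log_softmax(dim=-1)  # (B, T, num_letters + 1)

# Training sketch (blank assigned index 26, matching the head above):
# log_probs = model(frames).transpose(0, 1)        # (T, B, C) for nn.CTCLoss
# loss = nn.CTCLoss(blank=26)(log_probs, targets, input_lens, target_lens)
```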
UAV-GESTURE: A Dataset for UAV Control and Gesture Recognition
Current UAV-recorded datasets are mostly limited to action recognition and
object tracking, whereas existing gesture-signal datasets were mostly
recorded in indoor spaces. There is currently no publicly available video
dataset of UAV commanding signals recorded outdoors. Gesture signals can be
used effectively with UAVs by leveraging a UAV's visual sensors and
operational simplicity. To fill this gap and enable research in wider
application areas, we present a UAV gesture signals dataset recorded in an
outdoor setting. We selected 13 gestures suitable for basic UAV navigation
and command from general aircraft handling and helicopter handling signals.
We provide 119 high-definition video clips consisting of 37,151 frames. The
overall baseline gesture recognition performance, computed using a
Pose-based Convolutional Neural Network (P-CNN), is 91.9%. All frames are
annotated with body joints and gesture classes in order to extend the
dataset's applicability to a wider research area, including gesture
recognition, action recognition, human pose recognition and situation
awareness.
Comment: 12 pages, 4 figures, UAVision workshop, ECCV 2018.
Audio to Body Dynamics
We present a method that takes as input audio of violin or piano playing
and outputs a video of skeleton predictions, which are further used to
animate an avatar. The key idea is to create an animation of an avatar that
moves its hands the way a pianist or violinist would, just from audio.
Fully detailed and correct arm and finger motion is the goal; however, it
is not clear whether body movement can be predicted from music at all. In
this paper, we present the first result showing that natural body dynamics
can indeed be predicted. We built an LSTM network trained on violin and
piano recital videos uploaded to the Internet. The predicted points are
applied to a rigged avatar to create the animation.
Comment: Link with videos: https://arviolin.github.io/AudioBodyDynamics
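The core mapping, a recurrent network from audio features to per-frame body
keypoints, can be sketched as follows. The audio feature dimension (e.g.
MFCC frames), hidden size, and keypoint count are illustrative assumptions,
not the paper's configuration.

```python
# Sketch: LSTM mapping per-frame audio features to 2D body keypoints.
import torch
import torch.nn as nn

class AudioToPose(nn.Module):
    def __init__(self, audio_dim=13, hidden=200, num_keypoints=50):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden, num_layers=1, batch_first=True)
        self.to_pose = nn.Linear(hidden, num_keypoints * 2)  # (x, y) per joint

    def forward(self, audio_feats):  # audio_feats: (B, T, audio_dim)
        out, _ = self.lstm(audio_feats)
        pose = self.to_pose(out)             # (B, T, num_keypoints * 2)
        return pose.view(*pose.shape[:-1], -1, 2)

# Usage: the predicted keypoints can drive a rigged avatar frame by frame.
```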