39,328 research outputs found
Transportation mode recognition fusing wearable motion, sound and vision sensors
We present the first work that investigates the potential of improving the performance of transportation mode recognition through fusing multimodal data from wearable sensors: motion, sound and vision. We first train three independent deep neural network (DNN) classifiers, which work with the three types of sensors, respectively. We then propose two schemes that fuse the classification results from the three mono-modal classifiers. The first scheme makes an ensemble decision with fixed rules including Sum, Product, Majority Voting, and Borda Count. The second scheme is an adaptive fuser built as another classifier (including Naive Bayes, Decision Tree, Random Forest and Neural Network) that learns enhanced predictions by combining the outputs from the three mono-modal classifiers. We verify the advantage of the proposed method with the state-of-the-art Sussex-Huawei Locomotion and Transportation (SHL) dataset recognizing the eight transportation activities: Still, Walk, Run, Bike, Bus, Car, Train and Subway. We achieve F1 scores of 79.4%, 82.1% and 72.8% with the mono-modal motion, sound and vision classifiers, respectively. The F1 score is remarkably improved to 94.5% and 95.5% by the two data fusion schemes, respectively. The recognition performance can be further improved with a post-processing scheme that exploits the temporal continuity of transportation. When assessing generalization of the model to unseen data, we show that while performance is reduced - as expected - for each individual classifier, the benefits of fusion are retained with performance improved by 15 percentage points. Besides the actual performance increase, this work, most importantly, opens up the possibility for dynamically fusing modalities to achieve distinct power-performance trade-off at run time
Combining inertial and visual sensing for human action recognition in tennis
In this paper, we present a framework for both the automatic extraction of the temporal location of tennis strokes within a match and the subsequent classification of these as being either a serve, forehand or backhand. We employ the use of low-cost visual sensing and low-cost inertial sensing to achieve these aims, whereby a single modality can be used or a fusion of both classification strategies can be adopted if both modalities are available within a given capture scenario. This flexibility allows the framework to be applicable to a variety of user scenarios and hardware infrastructures. Our proposed approach is quantitatively evaluated using data captured from elite tennis players. Results point to the extremely accurate performance of the proposed approach irrespective of input modality configuration
Multi-sensor classification of tennis strokes
In this work, we investigate tennis stroke recognition
using a single inertial measuring unit attached to a player’s forearm during a competitive match. This paper evaluates the best approach for stroke detection using either accelerometers, gyroscopes or magnetometers, which are embedded into the inertial measuring unit. This work concludes what is the optimal training data set for stroke classification and proves that classifiers can perform well when tested on players who were not used to train the classifier. This work provides a significant step forward for our overall goal, which is to develop next generation sports coaching tools using both inertial and visual sensors in an instrumented indoor sporting environment
Improving Foot-Mounted Inertial Navigation Through Real-Time Motion Classification
We present a method to improve the accuracy of a foot-mounted,
zero-velocity-aided inertial navigation system (INS) by varying estimator
parameters based on a real-time classification of motion type. We train a
support vector machine (SVM) classifier using inertial data recorded by a
single foot-mounted sensor to differentiate between six motion types (walking,
jogging, running, sprinting, crouch-walking, and ladder-climbing) and report
mean test classification accuracy of over 90% on a dataset with five different
subjects. From these motion types, we select two of the most common (walking
and running), and describe a method to compute optimal zero-velocity detection
parameters tailored to both a specific user and motion type by maximizing the
detector F-score. By combining the motion classifier with a set of optimal
detection parameters, we show how we can reduce INS position error during mixed
walking and running motion. We evaluate our adaptive system on a total of 5.9
km of indoor pedestrian navigation performed by five different subjects moving
along a 130 m path with surveyed ground truth markers.Comment: In Proceedings of the International Conference on Indoor Positioning
and Indoor Navigation (IPIN'17), Sapporo, Japan, Sep. 18-21, 201
CaloriNet: From silhouettes to calorie estimation in private environments
We propose a novel deep fusion architecture, CaloriNet, for the online
estimation of energy expenditure for free living monitoring in private
environments, where RGB data is discarded and replaced by silhouettes. Our
fused convolutional neural network architecture is trainable end-to-end, to
estimate calorie expenditure, using temporal foreground silhouettes alongside
accelerometer data. The network is trained and cross-validated on a publicly
available dataset, SPHERE_RGBD + Inertial_calorie. Results show
state-of-the-art minimum error on the estimation of energy expenditure
(calories per minute), outperforming alternative, standard and single-modal
techniques.Comment: 11 pages, 7 figure
Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction
This paper introduces a novel neural network-based reinforcement learning
approach for robot gaze control. Our approach enables a robot to learn and to
adapt its gaze control strategy for human-robot interaction neither with the
use of external sensors nor with human supervision. The robot learns to focus
its attention onto groups of people from its own audio-visual experiences,
independently of the number of people, of their positions and of their physical
appearances. In particular, we use a recurrent neural network architecture in
combination with Q-learning to find an optimal action-selection policy; we
pre-train the network using a simulated environment that mimics realistic
scenarios that involve speaking/silent participants, thus avoiding the need of
tedious sessions of a robot interacting with people. Our experimental
evaluation suggests that the proposed method is robust against parameter
estimation, i.e. the parameter values yielded by the method do not have a
decisive impact on the performance. The best results are obtained when both
audio and visual information is jointly used. Experiments with the Nao robot
indicate that our framework is a step forward towards the autonomous learning
of socially acceptable gaze behavior.Comment: Paper submitted to Pattern Recognition Letter
- …