From Unimodal to Multimodal: improving the sEMG-Based Pattern Recognition via deep generative models
Multimodal hand gesture recognition (HGR) systems can achieve higher
recognition accuracy than unimodal ones. However, acquiring multimodal gesture
recognition data typically requires users to wear additional sensors, which
increases hardware costs. This paper proposes a novel generative approach to
improve Surface Electromyography (sEMG)-based HGR accuracy via virtual Inertial
Measurement Unit (IMU) signals. Specifically, we first trained a deep generative
model, based on the intrinsic correlation between forearm sEMG signals and
forearm IMU signals, to generate virtual forearm IMU signals from the input
forearm sEMG signals. Subsequently, the sEMG signals and virtual IMU
signals were fed into a multimodal Convolutional Neural Network (CNN) model for
gesture recognition. To evaluate the performance of the proposed approach, we
conducted experiments on 6 databases, including 5 publicly available databases
and our collected database comprising 28 subjects performing 38 gestures,
containing both sEMG and IMU data. The results show that our proposed approach
outperforms the sEMG-based unimodal HGR method, with accuracy increases of
2.15%-13.10%. This demonstrates that incorporating virtual IMU signals
generated by deep generative models can significantly enhance the accuracy of
sEMG-based HGR. The proposed approach represents a successful attempt to
transition from unimodal HGR to multimodal HGR without additional sensor
hardware.
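The following is a minimal sketch of the data flow described above. It assumes that paired sEMG/IMU feature windows are available for training. A closed-form ridge regression stands in for the deep generative model and is not the authors' model. The generated virtual IMU features are simply concatenated with the sEMG features rather than fed to a multimodal CNN. All array sizes, the noise level, and the helper names (`fit_generator`, `virtual_imu`, `fuse`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired training data: 200 windows, 8 sEMG features, 6 IMU channels.
n_train, d_emg, d_imu = 200, 8, 6
emg_train = rng.normal(size=(n_train, d_emg))
true_map = rng.normal(size=(d_emg, d_imu))  # unknown sEMG-to-IMU relationship
imu_train = emg_train @ true_map + 0.1 * rng.normal(size=(n_train, d_imu))


def fit_generator(emg, imu, ridge=1e-3):
    """Ridge-regression stand-in for the generative model: solve for W in imu ~= emg @ W."""
    d = emg.shape[1]
    return np.linalg.solve(emg.T @ emg + ridge * np.eye(d), emg.T @ imu)


def virtual_imu(emg, W):
    """Synthesize virtual IMU features from sEMG alone; no IMU sensor is needed here."""
    return emg @ W


def fuse(emg, W):
    """Concatenate real sEMG features with generated IMU features (simple early fusion)."""
    return np.concatenate([emg, virtual_imu(emg, W)], axis=1)


W = fit_generator(emg_train, imu_train)
emg_test = rng.normal(size=(5, d_emg))  # test time: sEMG only
fused = fuse(emg_test, W)
print(fused.shape)  # (5, 14): 8 sEMG + 6 virtual IMU features per window
```

At test time only sEMG is required, because the IMU input is synthesized from it. This is the property that lets the approach avoid extra sensor hardware.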
MultiIoT: Towards Large-scale Multisensory Learning for the Internet of Things
The Internet of Things (IoT), the network integrating billions of smart
physical devices embedded with sensors, software, and communication
technologies for the purpose of connecting and exchanging data with other
devices and systems, is a critical and rapidly expanding component of our
modern world. The IoT ecosystem provides a rich source of real-world modalities
such as motion, thermal, geolocation, imaging, depth, sensors, video, and audio
for prediction tasks involving the pose, gaze, activities, and gestures of
humans, as well as the touch, contact, pose, and 3D structure of physical
objects. Machine learning presents a rich opportunity to automatically process
IoT data at scale, enabling efficient inference with impact on understanding
human wellbeing, controlling physical devices, and interconnecting smart
cities. To
develop machine learning technologies for IoT, this paper proposes MultiIoT,
the most expansive IoT benchmark to date, encompassing over 1.15 million
samples from 12 modalities and 8 tasks. MultiIoT introduces unique challenges
involving (1) learning from many sensory modalities, (2) fine-grained
interactions across long temporal ranges, and (3) extreme heterogeneity due to
unique structure and noise topologies in real-world sensors. We also release a
set of strong modeling baselines, spanning modality- and task-specific methods
to multisensory and multitask models, to encourage future research in
multisensory representation learning for IoT.
Deep combination of radar with optical data for gesture recognition: role of attention in fusion architectures
Multimodal time series classification is an important aspect of human gesture recognition, in which the limitations of individual sensors can be overcome by combining data from multiple modalities. In a deep learning pipeline, the attention mechanism further allows for a selective, contextual concentration on relevant features. However, while the standard attention mechanism is an effective tool in Natural Language Processing (NLP), it is not ideal for temporally or spatially sparse multimodal data. In this paper, we present a novel attention mechanism, Multi-Modal Attention Preconditioning (MMAP). We first demonstrate that MMAP outperforms regular attention when classifying modalities with temporal and spatial sparsity. Second, we investigate the impact of attention on the fusion of radar and optical data for gesture recognition via three specific modalities: dense spatiotemporal optical data, spatially sparse/temporally dense kinematic data, and sparse spatiotemporal radar data. We explore the effect of attention on early, intermediate, and late fusion architectures and compare eight different pipelines in terms of accuracy and of their ability to preserve detection accuracy when modalities are missing. Results highlight fundamental differences between late and intermediate attention mechanisms with respect to the fusion of radar and optical data.
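A toy sketch can illustrate the early- versus late-fusion comparison and the missing-modality test. This is not MMAP, since the abstract does not give its equations. Random linear heads stand in for trained networks, and intermediate fusion is omitted. The modality feature sizes, class count, and function names are hypothetical.

```python
import numpy as np

ORDER = ("optical", "kinematic", "radar")
N_CLASSES = 4


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(1)
# Hypothetical per-modality feature vectors for one gesture sample.
feats = {"optical": rng.normal(size=16), "kinematic": rng.normal(size=8), "radar": rng.normal(size=12)}
# Random linear heads stand in for trained networks.
heads = {m: rng.normal(size=(feats[m].size, N_CLASSES)) for m in ORDER}
early_head = rng.normal(size=(sum(feats[m].size for m in ORDER), N_CLASSES))


def early_fusion(feats, available=ORDER):
    """Concatenate raw features, then classify once; missing modalities are zero-filled."""
    parts = [feats[m] if m in available else np.zeros_like(feats[m]) for m in ORDER]
    return softmax(np.concatenate(parts) @ early_head)


def late_fusion(feats, available=ORDER):
    """Classify each available modality separately, then average the class probabilities."""
    probs = [softmax(feats[m] @ heads[m]) for m in ORDER if m in available]
    return np.mean(probs, axis=0)


p_early_all = early_fusion(feats)
p_late_all = late_fusion(feats)
p_early_no_radar = early_fusion(feats, available=("optical", "kinematic"))
p_late_no_radar = late_fusion(feats, available=("optical", "kinematic"))
print(p_late_all.argmax(), p_late_no_radar.argmax())
```

Late fusion degrades gracefully by averaging only over the modalities that are present. Early fusion, by contrast, needs a placeholder (here zeros) for a missing input. This is one reason the two can behave differently when a sensor drops out.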
FAF: A novel multimodal emotion recognition approach integrating face, body and text
Multimodal emotion analysis performs better at emotion recognition because it
draws on more comprehensive emotional cues and on multimodal emotion datasets.
In this paper, we developed a large multimodal emotion dataset, named the "HED"
dataset, to facilitate the emotion recognition task, and accordingly propose a
multimodal emotion recognition method. To improve recognition accuracy, a
"Feature After Feature" framework is used to extract crucial emotional
information from the aligned face, body and text samples. We employ various
benchmarks to evaluate the "HED" dataset and compare their performance with that
of our method. The results show that the five-class classification accuracy of
the proposed multimodal fusion method is about 83.75%, an improvement of 1.83%,
9.38%, and 21.62%, respectively, over the individual modalities. The
complementarity between the channels is thus used effectively to improve emotion
recognition performance. We have also established a multimodal online emotion
prediction platform that aims to provide free emotion prediction to more users.
Analysis of the Efficacy of Real-Time Hand Gesture Detection with HOG and Haar-Like Features Using SVM Classification
The field of hand gesture recognition has recently reached new heights thanks to its widespread use in domains such as remote sensing, robotic control, and smart home appliances. Despite this, identifying gestures is difficult because of the intransigent features of the human hand, which make the codes used to decode them illegible and impossible to compare. Differentiating regional patterns is the job of pattern recognition, and pattern recognition is at the heart of sign language. People who are deaf or mute may understand the spoken language of the rest of the world by learning sign language. Any part of the body may be used to create signs in sign language. The proposed system employs a gesture recognition system trained on Indian Sign Language. This work discusses the preprocessing, hand segmentation, feature extraction, gesture identification, and classification of hand gestures as they pertain to hand gesture sign language. A hybrid approach is used to extract features, combining Haar-like features with the Histogram of Oriented Gradients (HOG). The features extracted from the images are then fed to a Support Vector Machine (SVM) classifier for classification. A false rejection error rate of 8% is achieved, while the accuracy of hand gesture detection is improved by 93.5%.
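The hybrid Haar-like plus HOG feature extraction can be sketched in NumPy. This is a simplified illustration, not the paper's pipeline. The Haar-like part uses only two-rectangle features on a fixed grid, computed from an integral image. The HOG part uses per-cell orientation histograms without bin interpolation or block normalization. The window and cell sizes and all function names are assumptions.

```python
import numpy as np


def integral_image(img):
    """Zero-padded summed-area table so any rectangle sum takes four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, r, c, h, w):
    """Sum of img[r:r+h, c:c+w] via the integral image."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]


def haar_features(img, win=8):
    """Two-rectangle Haar-like features on a non-overlapping grid of win x win windows."""
    ii = integral_image(img.astype(float))
    half = win // 2
    out = []
    for r in range(0, img.shape[0] - win + 1, win):
        for c in range(0, img.shape[1] - win + 1, win):
            # Left half minus right half: responds to vertical edges.
            out.append(rect_sum(ii, r, c, win, half) - rect_sum(ii, r, c + half, win, half))
            # Top half minus bottom half: responds to horizontal edges.
            out.append(rect_sum(ii, r, c, half, win) - rect_sum(ii, r + half, c, half, win))
    return np.array(out)


def hog_features(img, cell=8, bins=9):
    """Simplified HOG: magnitude-weighted unsigned orientation histogram per cell, L2-normalized."""
    gy, gx = np.gradient(img.astype(float))  # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    bin_idx = np.minimum((ang // (180.0 / bins)).astype(int), bins - 1)
    hists = []
    for r in range(0, img.shape[0] - cell + 1, cell):
        for c in range(0, img.shape[1] - cell + 1, cell):
            hists.append(np.bincount(bin_idx[r:r + cell, c:c + cell].ravel(),
                                     weights=mag[r:r + cell, c:c + cell].ravel(),
                                     minlength=bins))
    v = np.concatenate(hists)
    return v / (np.linalg.norm(v) + 1e-12)


def hybrid_features(img):
    """Concatenate Haar-like and HOG descriptors into one feature vector for the classifier."""
    return np.concatenate([haar_features(img), hog_features(img)])


# Toy 32x32 "image" with a vertical edge between columns 11 and 12.
img = np.zeros((32, 32))
img[:, :12] = 1.0
x = hybrid_features(img)
print(x.shape)  # (176,): 16 windows x 2 Haar features + 16 cells x 9 HOG bins
```

In a full system, the resulting vector would be passed to an SVM trained on labeled gesture images, for example scikit-learn's `SVC`.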
Robust Latent Representations via Cross-Modal Translation and Alignment
Multi-modal learning relates information across observation modalities of the
same physical phenomenon to leverage complementary information. Most
multi-modal machine learning methods require that all the modalities used for
training are also available for testing. This is a limitation when the signals
from some modalities are unavailable or are severely degraded by noise. To
address this limitation, we aim to improve the test-time performance of uni-modal
systems by using multiple modalities during training only. The proposed
multi-modal training framework uses cross-modal translation and
correlation-based latent space alignment to improve the representations of the
weaker modalities. The translation from the weaker to the stronger modality
generates a multi-modal intermediate encoding that is representative of both
modalities. This encoding is then correlated with the stronger modality
representations in a shared latent space. We validate the proposed approach on
the AVEC 2016 dataset for continuous emotion recognition and show that it is
effective, achieving state-of-the-art (uni-modal) performance for the weaker
modalities.
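One plausible form of the correlation-based alignment is a loss that rewards per-dimension Pearson correlation between the intermediate encoding and the stronger modality's representation. The abstract does not give the exact objective, which may instead be CCA-based. The `correlation_alignment_loss` function below is therefore a hypothetical sketch over batches of embeddings, and the embedding sizes are assumptions.

```python
import numpy as np


def correlation_alignment_loss(z_mid, z_strong, eps=1e-8):
    """1 minus the mean per-dimension Pearson correlation between two (batch, dim) embeddings.

    0 means perfectly correlated, about 1 means uncorrelated, 2 means perfectly anti-correlated.
    """
    a = z_mid - z_mid.mean(axis=0)  # center each dimension over the batch
    b = z_strong - z_strong.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + eps
    return 1.0 - float(np.mean(num / den))


rng = np.random.default_rng(2)
z_strong = rng.normal(size=(64, 16))  # hypothetical stronger-modality embeddings
z_mid_good = z_strong + 0.1 * rng.normal(size=(64, 16))  # well-aligned intermediate encoding
z_mid_bad = rng.normal(size=(64, 16))  # unrelated encoding
print(correlation_alignment_loss(z_mid_good, z_strong), correlation_alignment_loss(z_mid_bad, z_strong))
```

During training, a term like this would be minimized together with the cross-modal translation objective. At test time, only the weaker-modality pathway is required.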