Audio-Visual Speaker Verification via Joint Cross-Attention
Speaker verification has been widely explored using speech signals, and has
improved significantly with deep models. Recently, there has been a
surge in exploring faces and voices as they can offer more complementary and
comprehensive information than relying only on a single modality of speech
signals. Though current methods in the literature on the fusion of faces and
voices have shown improvement over that of individual face or voice modalities,
the potential of audio-visual fusion is not fully explored for speaker
verification. Most of the existing methods based on audio-visual fusion either
rely on score-level fusion or simple feature concatenation. In this work, we
have explored cross-modal joint attention to fully leverage the inter-modal
complementary information and the intra-modal information for speaker
verification. Specifically, we estimate the cross-attention weights based on
the correlation between the joint feature representation and the
individual feature representations in order to effectively capture both
intra-modal as well as inter-modal relationships among the faces and voices. We
have shown that efficiently leveraging the intra- and inter-modal relationships
significantly improves the performance of audio-visual fusion for speaker
verification. The performance of the proposed approach has been evaluated on
the VoxCeleb1 dataset. Results show that the proposed approach can
significantly outperform the state-of-the-art methods of audio-visual fusion
for speaker verification.
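The attention step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the projection matrices `w_a` and `w_v`, the row-wise softmax, the residual connection, and the random toy features are all assumptions made for illustration, and pure-Python lists stand in for tensors. Each modality is correlated with the joint (concatenated) representation, the correlations are turned into attention weights over frames, and the attended audio and visual features are concatenated.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def joint_cross_attend(x, joint, w):
    """Re-weight the frames of one modality by its correlation with the joint features.

    x: L x d_m features, joint: L x d features,
    w: d_m x d projection (hypothetical learnable parameter).
    """
    d = len(joint[0])
    scores = matmul(matmul(x, w), transpose(joint))  # L x L correlation
    corr = [[math.tanh(s / math.sqrt(d)) for s in row] for row in scores]
    weights = softmax_rows(corr)  # assumed normalisation over frames
    attended = matmul(weights, x)  # L x d_m
    # Residual connection (an assumption) keeps the original features.
    return [[a + b for a, b in zip(ra, rx)] for ra, rx in zip(attended, x)]


def fuse(audio, visual, w_a, w_v):
    """Concatenate the attended audio and visual features frame by frame."""
    joint = [a + v for a, v in zip(audio, visual)]  # list concatenation per frame
    att_a = joint_cross_attend(audio, joint, w_a)
    att_v = joint_cross_attend(visual, joint, w_v)
    return [a + v for a, v in zip(att_a, att_v)]


# Toy example: 4 frames, 3-dim audio and 2-dim visual features (random values).
rng = random.Random(0)
L, d_a, d_v = 4, 3, 2


def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


audio, visual = rand(L, d_a), rand(L, d_v)
w_a, w_v = rand(d_a, d_a + d_v), rand(d_v, d_a + d_v)
fused = fuse(audio, visual, w_a, w_v)
```

Here `fused` holds one vector of length `d_a + d_v` per frame. In a trained system the projections would be learned and the fused sequence would be pooled into an utterance-level embedding before verification scoring.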
Optimal set of EEG features for emotional state classification and trajectory visualization in Parkinson's disease
In addition to classic motor signs and symptoms, individuals with Parkinson's disease (PD) are characterized by emotional deficits. Ongoing brain activity can be recorded by electroencephalography (EEG) to discover the links between emotional states and brain activity. This study used machine-learning algorithms to categorize emotional states in PD patients compared with healthy controls (HC) using EEG. Twenty non-demented PD patients and 20 healthy controls matched for age, gender, and education level viewed happiness, sadness, fear, anger, surprise, and disgust emotional stimuli while fourteen-channel EEG was recorded. Multimodal stimuli (combined audio and visual) were used to evoke the emotions. To classify the EEG-based emotional states and visualize the changes of emotional states over time, this paper compares four kinds of EEG features for emotional state classification and proposes an approach to track the trajectory of emotion changes with manifold learning. From the experimental results using our EEG data set, we found that (a) the bispectrum feature is superior to the other three kinds of features, namely power spectrum, wavelet packet, and nonlinear dynamical analysis; (b) higher frequency bands (alpha, beta, and gamma) play a more important role in emotion activities than lower frequency bands (delta and theta) in both groups; and (c) the trajectory of emotion changes can be visualized by reducing subject-independent features with manifold learning. This provides a promising way of visualizing a patient's emotional state in real time and leads to a practical system for noninvasive assessment of the emotional impairments associated with neurological disorders.
Emotion Recognition by Video: A review
Video emotion recognition is an important branch of affective computing, and
its solutions can be applied in different fields such as human-computer
interaction (HCI) and intelligent medical treatment. Although the number of
papers published in the field of emotion recognition is increasing, there are
few comprehensive literature reviews covering related research on video emotion
recognition. Therefore, this paper selects articles published from 2015 to 2023
to systematize the existing trends in video emotion recognition in related
studies. In this paper, we first discuss two typical emotion models, then
the databases frequently used for video emotion
recognition, including unimodal and multimodal databases. Next, we
examine and classify the specific structure and performance of modern unimodal
and multimodal video emotion recognition methods, discuss the benefits and
drawbacks of each, and compare them in detail in tables. Further,
we summarize the primary difficulties currently faced by video emotion
recognition tasks and point out some of the most promising future
directions, such as establishing an open benchmark database and better multimodal
fusion strategies. The essential objective of this paper is to help academic
and industrial researchers keep up to date with the most recent advances and
new developments in this fast-moving, high-impact field of video emotion
recognition.