EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks
In the era of advanced artificial intelligence and human-computer
interaction, identifying emotions in spoken language is paramount. This
research explores the integration of deep learning techniques in speech emotion
recognition, offering a comprehensive solution to the challenges associated
with speaker diarization and emotion identification. It introduces a framework
that combines a pre-existing speaker diarization pipeline and an emotion
identification model built on a Convolutional Neural Network (CNN) to achieve
higher precision. The proposed model was trained on data from five speech
emotion datasets, namely RAVDESS, CREMA-D, SAVEE, TESS, and Movie Clips, the
last of which is a speech emotion dataset created specifically for this
research. The features extracted from each sample include Mel Frequency
Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), and Root Mean Square
(RMS) energy, and the training data is expanded with augmentation techniques
such as pitch shifting, noise injection, time stretching, and time shifting.
This feature extraction approach aims to enhance prediction accuracy while
reducing computational complexity. The proposed model yields an unweighted
accuracy of 63%, demonstrating remarkable efficiency in accurately identifying
emotional states within speech signals.
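As an illustration of the feature-extraction and augmentation steps named in this abstract, the following minimal Python sketch uses librosa to compute MFCC, ZCR and RMS features and to apply pitch, noise, stretch and shift augmentations; the file name and parameter values are assumptions, not taken from the paper.

```python
# Minimal sketch of MFCC/ZCR/RMS extraction plus the four augmentations
# mentioned above; parameters and the input file are illustrative only.
import numpy as np
import librosa


def extract_features(y: np.ndarray, sr: int, n_mfcc: int = 40) -> np.ndarray:
    """Concatenate frame-averaged MFCC, ZCR and RMS features for one clip."""
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    zcr = np.mean(librosa.feature.zero_crossing_rate(y))
    rms = np.mean(librosa.feature.rms(y=y))
    return np.hstack([mfcc, zcr, rms])


def augment(y: np.ndarray, sr: int) -> list:
    """Noise, stretch, pitch and shift augmentations (parameter values are guesses)."""
    noisy = y + 0.005 * np.random.randn(len(y))
    stretched = librosa.effects.time_stretch(y, rate=0.9)
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    shifted = np.roll(y, int(0.1 * sr))
    return [noisy, stretched, pitched, shifted]


y, sr = librosa.load("clip.wav", sr=16000)   # hypothetical input file
features = [extract_features(v, sr) for v in [y, *augment(y, sr)]]
```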
Sentiment analysis in non-fixed length audios using a Fully Convolutional Neural Network
In this work, a sentiment analysis method that can accept audio of any length, without the length being fixed a priori, is proposed. Mel spectrograms and Mel Frequency Cepstral Coefficients are used as audio description methods, and a Fully Convolutional Neural Network architecture is proposed as the classifier. The results have been validated using three well-known datasets: EMODB, RAVDESS and TESS. The results obtained were promising, outperforming state-of-the-art methods. Also, because the proposed method admits audios of any size, it allows sentiment analysis to be performed in near real time, which is of interest to a wide range of fields such as call centers, medical consultations or financial brokers.
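A minimal PyTorch sketch of how a fully convolutional classifier can accept mel-spectrogram inputs of arbitrary time length by ending in a global pooling layer; the layer sizes and class count are illustrative, not the architecture from the paper.

```python
# Fully convolutional classifier: no dense layer tied to a fixed input size,
# so spectrograms of any duration can be processed.
import torch
import torch.nn as nn


class FCNSentiment(nn.Module):
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, n_classes, kernel_size=1),   # 1x1 conv replaces a dense head
        )
        self.pool = nn.AdaptiveAvgPool2d(1)            # collapses any time length

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, time) with a variable time dimension
        return self.pool(self.features(x)).flatten(1)


model = FCNSentiment()
logits = model(torch.randn(2, 1, 64, 311))             # any time length works
```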
LSTM Deep Neural Networks Postfiltering for Improving the Quality of Synthetic Voices
Recent developments in speech synthesis have produced systems capable of
generating intelligible speech, but now researchers strive to create models that
more accurately mimic human voices. One such development is the incorporation
of multiple linguistic styles in various languages and accents.
HMM-based Speech Synthesis is of great interest to many researchers, due to
its ability to produce sophisticated features with a small footprint. Despite
such progress, its quality has not yet reached the level of the predominant
unit-selection approaches that choose and concatenate recordings of real
speech. Recent efforts have been made in the direction of improving these
systems.
In this paper we present the application of Long Short-Term Memory Deep
Neural Networks as a postfiltering step of HMM-based speech synthesis, in order
to obtain spectral characteristics closer to those of natural speech. The
results show how HMM-based voices can be improved using this approach.
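A minimal PyTorch sketch of an LSTM postfilter that maps frame-level spectral features produced by an HMM synthesizer toward natural-speech targets; the feature dimension, depth and loss are assumptions, not the paper's configuration.

```python
# LSTM postfilter sketch: regress synthetic spectral frames toward natural ones.
import torch
import torch.nn as nn


class LSTMPostfilter(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, synthetic: torch.Tensor) -> torch.Tensor:
        # synthetic: (batch, frames, feat_dim) spectral features from HMM synthesis
        h, _ = self.lstm(synthetic)
        return self.out(h)                        # enhanced features, same shape


postfilter = LSTMPostfilter()
enhanced = postfilter(torch.randn(4, 200, 40))
loss = nn.MSELoss()(enhanced, torch.randn(4, 200, 40))  # placeholder natural targets
```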
Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition
Speech Emotion Recognition (SER) plays a pivotal role in enhancing
human-computer interaction by enabling a deeper understanding of emotional
states across a wide range of applications, contributing to more empathetic and
effective communication. This study proposes an innovative approach that
integrates self-supervised feature extraction with supervised classification
for emotion recognition from small audio segments. In the preprocessing step,
to eliminate the need for hand-crafted audio features, we employ a
self-supervised feature extractor, based on the Wav2Vec model, to capture
acoustic features from audio data. Then, the output feature maps of the
preprocessing step are fed to a custom-designed Convolutional Neural Network
(CNN)-based model to perform
emotion classification. Utilizing the ShEMO dataset as our testing ground, the
proposed method surpasses two baseline methods, namely a support vector machine
classifier and transfer learning of a pretrained CNN. Comparing the proposed
method to state-of-the-art methods on the SER task further indicates its
superiority. Our findings underscore the pivotal role of deep unsupervised
feature learning in elevating the landscape of SER, offering enhanced emotional
comprehension in the realm of human-computer interactions.
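A minimal sketch of the two-stage pipeline described in the abstract, assuming the Hugging Face transformers implementation of Wav2Vec 2.0 as a frozen feature extractor followed by a small CNN classifier; the checkpoint name, class count and CNN layout are assumptions, not the paper's settings.

```python
# Stage 1: self-supervised Wav2Vec 2.0 features; Stage 2: small CNN classifier.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()


class EmotionCNN(nn.Module):
    def __init__(self, in_dim: int = 768, n_classes: int = 6):  # class count assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, 768); Conv1d expects (batch, channels, frames)
        return self.net(feats.transpose(1, 2))


waveform = torch.randn(16000)                          # dummy 1 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = wav2vec(**inputs).last_hidden_state        # self-supervised feature maps
logits = EmotionCNN()(feats)
```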
Speech emotion recognition using 2D-convolutional neural network
This research proposes a speech emotion recognition model to predict human emotions using a convolutional neural network (CNN) by learning segmented audio of specific emotions. Speech emotion recognition utilizes the extracted features of audio waves to learn speech emotion characteristics; one of them is the mel frequency cepstral coefficient (MFCC). The dataset plays a vital role in obtaining valuable results in model learning; hence, this research leverages a combination of datasets. The model learns the combined dataset with audio segmentation and zero padding using a 2D-CNN. Audio segmentation and zero padding equalize the extracted audio features so that their characteristics can be learned. The model achieves 83.69% accuracy in predicting seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, from the combined dataset with segmentation of the audio files.
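A minimal sketch of the audio segmentation and zero-padding step followed by a small 2D-CNN, assuming MFCC inputs; the segment length, layer sizes and file name are illustrative, not the paper's values.

```python
# Split a clip's MFCC matrix into fixed-size segments (zero-padding the last
# one) so every segment is a same-sized "image" for a 2D-CNN over 7 emotions.
import numpy as np
import librosa
import torch
import torch.nn as nn


def segment_mfcc(path: str, seg_frames: int = 128, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, frames)
    pad = (-mfcc.shape[1]) % seg_frames
    mfcc = np.pad(mfcc, ((0, 0), (0, pad)))                          # zero padding
    return mfcc.reshape(n_mfcc, -1, seg_frames).transpose(1, 0, 2)   # (segments, n_mfcc, seg_frames)


cnn = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(64, 7),        # neutral, happy, sad, angry, fear, disgust, surprise
)

segments = torch.tensor(segment_mfcc("clip.wav"), dtype=torch.float32).unsqueeze(1)
logits = cnn(segments)                     # one prediction per segment
```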
Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Human speech can be characterized by different components, including semantic
content, speaker identity and prosodic information. Significant progress has
been made in disentangling representations for semantic content and speaker
identity in Automatic Speech Recognition (ASR) and speaker verification tasks
respectively. However, extracting prosodic information remains an open and
challenging research question because of the intrinsic association of different
attributes, such as timbre and rhythm, and because of the need for supervised
training schemes to achieve robust large-scale and speaker-independent ASR. The
aim of this paper is to address the disentanglement of emotional prosody from
speech based on unsupervised reconstruction. Specifically, we identify, design,
implement and integrate three crucial components in our proposed speech
reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech
signals into discrete units for semantic content, (2) a pretrained speaker
verification model to generate speaker identity embeddings, and (3) a trainable
prosody encoder to learn prosody representations. We first pretrain the
Prosody2Vec representations on unlabelled emotional speech corpora, then
fine-tune the model on specific datasets to perform Speech Emotion Recognition
(SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and
unweighted accuracies) and subjective (mean opinion score) evaluations on the
EVC task suggest that Prosody2Vec effectively captures general prosodic
features that can be smoothly transferred to other emotional speech. In
addition, our SER experiments on the IEMOCAP dataset reveal that the prosody
features learned by Prosody2Vec are complementary and beneficial for the
performance of widely used speech pretraining models and surpass the
state-of-the-art methods when combining Prosody2Vec with HuBERT
representations.
Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
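A schematic PyTorch sketch of the three-component layout named in the abstract (a unit encoder for discrete content, a frozen speaker-verification embedding, and a trainable prosody encoder feeding a reconstruction decoder); all modules, dimensions and the loss are assumptions, and this is not the authors' Prosody2Vec implementation.

```python
# Schematic reconstruction model: content units + speaker embedding + learned
# prosody representation are combined to reconstruct mel features.
import torch
import torch.nn as nn


class ReconstructionModel(nn.Module):
    def __init__(self, n_units: int = 100, unit_dim: int = 128,
                 spk_dim: int = 256, pros_dim: int = 128, mel_dim: int = 80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)                 # (1) discrete content units
        self.prosody_enc = nn.GRU(mel_dim, pros_dim, batch_first=True)  # (3) trainable prosody encoder
        self.decoder = nn.GRU(unit_dim + spk_dim + pros_dim, 512, batch_first=True)
        self.to_mel = nn.Linear(512, mel_dim)

    def forward(self, units, spk_emb, mel):
        # units: (B, T) unit ids; spk_emb: (B, spk_dim) from a frozen speaker-
        # verification model (2); mel: (B, T, mel_dim) reconstruction target
        content = self.unit_emb(units)
        prosody, _ = self.prosody_enc(mel)
        spk = spk_emb.unsqueeze(1).expand(-1, units.size(1), -1)
        h, _ = self.decoder(torch.cat([content, spk, prosody], dim=-1))
        return self.to_mel(h)


model = ReconstructionModel()
mel = torch.randn(2, 150, 80)
out = model(torch.randint(0, 100, (2, 150)), torch.randn(2, 256), mel)
loss = nn.L1Loss()(out, mel)               # unsupervised reconstruction objective
```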