2,272 research outputs found
Jitter and Shimmer measurements for speaker diarization
Jitter and shimmer voice quality features have been successfully
used to characterize speaker voice traits and detect voice pathologies.
Jitter and shimmer measure variations in the fundamental frequency
and amplitude of speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of Augmented Multiparty Interaction (AMI) corpus, a multi-party and spontaneous speech set of recordings. Both sets of features are independently modeled using mixture of Gaussians and fused together at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.Peer ReviewedPostprint (published version
Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition
Recent advancements in transformer-based speech representation models have
greatly transformed speech processing. However, there has been limited research
conducted on evaluating these models for speech emotion recognition (SER)
across multiple languages and examining their internal representations. This
article addresses these gaps by presenting a comprehensive benchmark for SER
with eight speech representation models and six different languages. We
conducted probing experiments to gain insights into inner workings of these
models for SER. We find that using features from a single optimal layer of a
speech model reduces the error rate by 32\% on average across seven datasets
when compared to systems where features from all layers of speech models are
used. We also achieve state-of-the-art results for German and Persian
languages. Our probing results indicate that the middle layers of speech models
capture the most important emotional information for speech emotion
recognition
Emotion recognition based on the energy distribution of plosive syllables
We usually encounter two problems during speech emotion recognition (SER): expression and perception problems, which vary considerably between speakers, languages, and sentence pronunciation. In fact, finding an optimal system that characterizes the emotions overcoming all these differences is a promising prospect. In this perspective, we considered two emotional databases: Moroccan Arabic dialect emotional database (MADED), and Ryerson audio-visual database on emotional speech and song (RAVDESS) which present notable differences in terms of type (natural/acted), and language (Arabic/English). We proposed a detection process based on 27 acoustic features extracted from consonant-vowel (CV) syllabic units: \ba, \du, \ki, \ta common to both databases. We tested two classification strategies: multiclass (all emotions combined: joy, sadness, neutral, anger) and binary (neutral vs. others, positive emotions (joy) vs. negative emotions (sadness, anger), sadness vs. anger). These strategies were tested three times: i) on MADED, ii) on RAVDESS, iii) on MADED and RAVDESS. The proposed method gave better recognition accuracy in the case of binary classification. The rates reach an average of 78% for the multi-class classification, 100% for neutral vs. other cases, 100% for the negative emotions (i.e. anger vs. sadness), and 96% for the positive vs. negative emotions
- …