How to improve TTS systems for emotional expressivity
Several experiments have revealed weaknesses in the emotional expressivity of current Text-To-Speech (TTS) systems. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications have considered using intelligent text processing as a pre-processing stage to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for automatic prosodic parameterization based on affective clues. The technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing helps the TTS system generate synthesized speech that contains emotional clues. The experimental results are encouraging and suggest that suitable emotional expressivity in speech synthesis is possible.
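The pre-processing idea in this abstract (detect affect in the text, then attach prosodic parameters as XML tags) can be illustrated with a minimal sketch. The abstract does not give the paper's detection method or tag set, so everything here is an assumption: the tiny `AFFECT_LEXICON`, the `PROSODY` values, and the helper names `detect_affect` and `tag_sentences` are hypothetical, and the output uses the W3C SSML `<prosody>` element as one plausible XML target.

```python
import xml.etree.ElementTree as ET

# Hypothetical affect lexicon and prosody settings (illustrative assumptions,
# not taken from the paper).
AFFECT_LEXICON = {
    "happy": "joy", "wonderful": "joy", "great": "joy",
    "sad": "sadness", "lost": "sadness", "alone": "sadness",
    "furious": "anger", "hate": "anger",
}
PROSODY = {
    "joy": {"pitch": "+15%", "rate": "fast"},
    "sadness": {"pitch": "-10%", "rate": "slow"},
    "anger": {"pitch": "+5%", "rate": "fast", "volume": "loud"},
}


def detect_affect(sentence):
    """Return the most frequent affect label among lexicon hits, or None."""
    counts = {}
    for word in sentence.lower().split():
        label = AFFECT_LEXICON.get(word.strip(".,!?;:\"'"))
        if label:
            counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get) if counts else None


def tag_sentences(sentences):
    """Wrap each sentence in <s>, adding a <prosody> element when affect is found."""
    speak = ET.Element("speak")
    for sentence in sentences:
        s = ET.SubElement(speak, "s")
        affect = detect_affect(sentence)
        if affect is None:
            s.text = sentence  # neutral sentence: no prosody markup
        else:
            prosody = ET.SubElement(s, "prosody", PROSODY[affect])
            prosody.text = sentence
    return ET.tostring(speak, encoding="unicode")
```

Calling `tag_sentences(["What a wonderful day!", "The train leaves at nine."])` wraps the first sentence in a `<prosody>` element and leaves the second untagged. A keyword lexicon is only a stand-in for the affect detector; a real system would need a stronger text classifier.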
Analysis of prosodic correlates of emotional speech data
The study of expressive speech styles remains an important topic in speech processing, particularly for detecting or predicting their parameters. In this paper, we analyze prosodic correlates for six emotion styles (anger, disgust, joy, fear, surprise and sadness), using data uttered by two speakers. The analysis focuses on how pronunciations and prosodic parameters are modified in emotional speech compared to neutral style. It covers pronunciation modifications, the presence of pauses in sentences, and local prosodic behavior, with an emphasis on the analysis of prosody over prosodic groups and breathing groups.
Going ba-na-nas: Prosodic analysis of spoken Japanese attitudes
The aim of this paper is to examine cues for the prosodic characterization of attitudes in Japanese. This work is based on previous studies in which 16 communicative social affects were defined. The audio signal parameters (fundamental frequency, amplitude and duration) of previously recorded Japanese attitudes are statistically analyzed. Interesting interactions among the parameters, the gender and the expression of specific attitudes (e.g. politeness) were found, and we report on which parameters most significantly characterize each attitude. Index Terms: speech, prosody, attitude, social affect, emotional speech, Japanese language
Synthesizing prosody: a prominence-based approach
A preliminary test exploring 4 emotions showed that conveying emotions by time domain synthesis may be possible. Therefore, a more sophisticated test was carried out in order to determine the influence of the prosodic parameters in the perception of a speaker's emotional state. Six different emotional states were investigated. The stimuli of the second test were used in three different testing procedures: as natural speech, resynthesized and reduced to a sawtooth signal. The recognition rates were lower than in the preliminary test, although the differences between the recognition rates of natural and synthetic speech were comparable for both tests. The outcome of the sawtooth test showed that the amount of information about a speaker's emotional state transported by F0, energy and overall duration is rather small. However, we could determine relations between the acoustic prosodic parameters and the emotional content of speech.
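To make the "reduced to a sawtooth signal" condition concrete, here is a minimal sketch of one way such a stimulus could be built: a sawtooth wave whose pitch and amplitude follow per-frame F0 and energy contours, so that only F0, energy and duration survive. The abstract does not describe the actual procedure; the function name `sawtooth_from_f0`, the 10 ms frame length and the 16 kHz sample rate are assumptions.

```python
def sawtooth_from_f0(f0_contour, energy_contour, frame_s=0.01, sample_rate=16000):
    """Synthesize a sawtooth whose pitch and amplitude follow per-frame contours.

    f0_contour: F0 in Hz per frame (0 marks an unvoiced frame).
    energy_contour: linear amplitude per frame.
    frame_s and sample_rate are assumed values, not taken from the paper.
    """
    samples_per_frame = int(round(frame_s * sample_rate))
    phase = 0.0  # normalized phase in [0, 1), carried across frames
    signal = []
    for f0, energy in zip(f0_contour, energy_contour):
        for _ in range(samples_per_frame):
            if f0 > 0:  # voiced frame: advance phase by one sample
                phase = (phase + f0 / sample_rate) % 1.0
                signal.append(energy * (2.0 * phase - 1.0))
            else:  # unvoiced frame: silence
                signal.append(0.0)
    return signal
```

Carrying the phase across frame boundaries avoids clicks when F0 changes from one frame to the next.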
Generating expressive speech for storytelling applications
Work on expressive speech synthesis has long focused on the expression of basic emotions. In recent years, however, interest in other expressive styles has been increasing. The research presented in this paper aims at generating a storytelling speaking style that is suitable for storytelling applications and, more generally, for applications aimed at children. Based on an analysis of human storytellers' speech, we designed and implemented a set of prosodic rules for converting "neutral" speech, as produced by a text-to-speech system, into storytelling speech. An evaluation of our storytelling speech generation system showed encouraging results.
Multimodal Speech Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Comment: 7 pages, accepted as a conference paper at IEEE SLT 201
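The dual-encoder idea in this abstract (one recurrent encoder for audio, one for text, with the two encodings combined before classification) can be sketched in a few lines. This is a toy illustration with untrained random weights, not the paper's model: the plain tanh cell, the fusion by concatenation, the dimensions, and the names `TinyRNN`, `predict` and `w_out` are all assumptions. Only the four emotion classes come from the abstract.

```python
import math
import random

EMOTIONS = ["angry", "happy", "sad", "neutral"]  # classes named in the abstract


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyRNN:
    """Plain tanh recurrent encoder that returns its last hidden state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.w_in = make_matrix(hidden_dim, input_dim, rng)
        self.w_h = make_matrix(hidden_dim, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def encode(self, sequence):
        h = [0.0] * self.hidden_dim
        for x in sequence:
            a = matvec(self.w_in, x)
            b = matvec(self.w_h, h)
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        return h


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(audio_frames, token_vectors, audio_enc, text_enc, w_out):
    """Encode each modality separately, concatenate, and classify."""
    fused = audio_enc.encode(audio_frames) + text_enc.encode(token_vectors)
    return softmax(matvec(w_out, fused))
```

In use, `audio_frames` would be a sequence of acoustic feature vectors and `token_vectors` a sequence of word embeddings for the transcript; `w_out` needs one row per emotion and as many columns as the two hidden sizes combined.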
Emotion Recognition from Acted and Spontaneous Speech
This doctoral thesis deals with emotion recognition from speech signals. The thesis is divided into two main parts. The first part describes the proposed approaches for emotion recognition using two different multilingual databases of acted emotional speech. The main contributions of this part are a detailed analysis of a large set of acoustic features, new classification schemes for vocal emotion recognition such as "emotion coupling", and a new method for mapping discrete emotions into a two-dimensional space.
The second part of the thesis is devoted to emotion recognition using multilingual databases of spontaneous emotional speech, based on telephone recordings obtained from real call centers. The knowledge gained from experiments with emotion recognition from acted speech was exploited to design a new approach for classifying seven emotional states. The core of the proposed approach is a complex classification architecture based on the fusion of different systems. The thesis also examines the influence of the speaker's emotional state on gender recognition performance and proposes a system for the automatic identification of successful phone calls in call centers by means of dialogue features.