Cultural dialects of real and synthetic emotional facial expressions
In this article we discuss aspects of designing facial expressions for virtual humans (VHs) with a specific culture. First we explore the notion of culture and its relevance for applications with a VH. Then we give a general scheme for designing emotional facial expressions and identify the stages where a human is involved, either as a real person in some specific role or as a VH displaying facial expressions. We discuss how the display and the emotional meaning of facial expressions may be measured in objective ways, and how the culture of the displayers and the judges may influence the process of analyzing human facial expressions and evaluating synthesized ones. We review psychological experiments on cross-cultural perception of emotional facial expressions. By identifying the culturally critical issues of data collection and interpretation with both real humans and VHs, we aim to provide a methodological reference and inspiration for further research.
Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema
In this paper, a psychologically inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions become distinguishable from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a k-nearest neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with a linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is carried out first with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
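The cascade idea above can be sketched in a few lines: a first binary decision separates broad emotion groups, and a second stage distinguishes the commonly confused pair within the winning group. This is a minimal illustration only; the two-dimensional features, the specific emotions, and the arousal grouping below are invented for the sketch and are not the paper's actual feature set or cascade structure.

```python
import math

# Hypothetical 2-D features: (mean pitch in Hz, normalized energy).
# The arousal grouping is an assumed, illustrative split.
AROUSAL = {"anger": "high", "joy": "high", "sadness": "low", "neutral": "low"}
TRAIN = [
    ((260.0, 0.9), "anger"),
    ((250.0, 0.8), "joy"),
    ((150.0, 0.2), "sadness"),
    ((180.0, 0.4), "neutral"),
]

def one_nn(sample, labelled):
    """Return the label of the nearest training point (1-NN)."""
    return min(labelled, key=lambda fl: math.dist(sample, fl[0]))[1]

def cascade_classify(sample):
    # Stage 1: binary arousal decision over all training points.
    group = one_nn(sample, [(f, AROUSAL[e]) for f, e in TRAIN])
    # Stage 2: separate the confusable pair inside that group only.
    within = [(f, e) for f, e in TRAIN if AROUSAL[e] == group]
    return one_nn(sample, within)

print(cascade_classify((155.0, 0.25)))  # low-arousal sample -> "sadness"
```

The benefit of the cascade is that each binary stage only has to solve an easier sub-problem, which is why confusable pairs such as anger/joy become separable.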
Evaluation of a transplantation algorithm for expressive speech synthesis
When designing human-machine interfaces it is important to consider not only the bare-bones functionality but also the ease of use and accessibility they provide. For voice-based interfaces, it has been shown that imbuing synthetic voices with expressiveness significantly increases their perceived naturalness, which in turn is very helpful when building user-friendly interfaces. This paper proposes an adaptation-based expressiveness transplantation system capable of copying the emotions of a source speaker into any desired target speaker with just a few minutes of read speech and without requiring the recording of additional expressive data. The system was evaluated through a perceptual test for 3 speakers, showing up to an average of 52% emotion recognition rates relative to the natural voice recognition rates, while at the same time keeping good scores in similarity and naturalness.
Expression of basic emotions in Estonian parametric text-to-speech synthesis
The goal of this study was to conduct modelling experiments, the purpose of which was the expression of three basic emotions (joy, sadness and anger) in Estonian parametric text-to-speech synthesis on the basis of both a male and a female voice. For each emotion, three different test models were constructed and presented for evaluation to subjects in perception tests. The test models were based on the basic emotions' characteristic parameter values, which had been determined on the basis of human speech. In synthetic speech, the test subjects most accurately recognized the emotion of sadness, and least accurately the emotion of joy. The results of the test showed that, in the case of the synthesized male voice, the model with enhanced parameter values performed best for all three emotions, whereas in the case of the synthetic female voice, different emotions called for different models: the model with decreased values was the most suitable one for the expression of joy, and the model with enhanced values was the most suitable for the expression of sadness and anger. Logistic regression was applied to the results of the perception tests in order to determine the significance and contribution of each acoustic parameter in the emotion models, and the possible need to adjust the values of the parameters.
Keywords: Estonian language, emotions, speech synthesis, acoustic model, speech rate, intensity, fundamental frequency
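The final analysis step described above, fitting a logistic regression to perception-test outcomes to gauge an acoustic parameter's contribution, can be sketched with plain gradient descent. The data here are invented for illustration (a hypothetical speech-rate scale factor and binary "emotion recognized" judgments), not the study's Estonian perception-test results.

```python
import math

# Invented toy data: x = speech-rate scaling applied to the synthesis,
# y = 1 if listeners recognized the intended emotion, else 0.
X = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
Y = [0, 0, 1, 1, 1, 1]

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

w, b = fit_logistic(X, Y)
# A positive w means the parameter raised recognition odds in this toy data;
# the sign and magnitude of w are what such an analysis reads off.
```

In the study, one such coefficient per acoustic parameter (speech rate, intensity, fundamental frequency, ...) indicates its significance in each emotion model.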
A virtual diary companion
Chatbots and embodied conversational agents show turn-based conversation behaviour. In current research we almost always assume that each utterance of a human conversational partner should be followed by an intelligent and/or empathetic reaction of the chatbot or embodied agent. They are assumed to be alert, trying to please the user. There are other applications that have not yet received much attention and that require a more patient or relaxed attitude, waiting for the right moment to provide feedback to the human partner. Being able and willing to listen is one of the conditions for being successful. In this paper we present some observations on listening-behaviour research and introduce one of our applications, the virtual diary companion.
Speaker and Expression Factorization for Audiobook Data: Expressiveness and Transplantation
Expressive synthesis from text is a challenging problem, for two reasons. First, read text is often highly expressive, to convey the emotion and scenario in the text. Second, since expressive training speech is not always available for different speakers, it is necessary to develop methods to share expressive information across speakers. This paper investigates the approach of using very expressive, highly diverse audiobook data from multiple speakers to build an expressive speech synthesis system. Both problems are addressed by considering a factorized framework where speaker and emotion are modelled in separate sub-spaces of a cluster adaptive training (CAT) parametric speech synthesis system. The sub-spaces for the expressive state of a speaker and the characteristics of the speaker are jointly trained using a set of audiobooks. In this work, the expressive speech synthesis system works in two distinct modes. In the first mode, the expressive information is given by audio data, and an adaptation method is used to extract the expressive information from the audio data. In the second mode, the input to the synthesis system is plain text, and a full expressive synthesis system is examined where the expressive state is predicted from the text. In both modes, the expressive information is shared and transplanted across different speakers. Experimental results show that in both modes, the expressive speech synthesis method proposed in this work significantly improves the expressiveness of the synthetic speech for different speakers. Finally, this paper also examines whether it is possible to predict the expressive states from text for multiple speakers using a single model, or whether the prediction process needs to be speaker specific.
This is the accepted manuscript. The final version is available from IEEE at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6995936&filter%3DAND%28p_IS_Number%3A7055953%29
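The factorization idea behind the CAT framework, and the "transplantation" it enables, can be illustrated with a toy linear combination: a model mean is a base component plus weighted contributions from separate speaker and expression cluster bases, so keeping the expression weights while swapping the speaker weights moves an emotion to a new voice. The three-dimensional vectors and weights below are invented for the sketch; a real CAT system operates on HMM state distributions, not tiny lists.

```python
# Toy bases: the speaker sub-space perturbs the first two dimensions,
# the expression sub-space perturbs the others. All values invented.
BASE = [1.0, 2.0, 3.0]
SPEAKER_CLUSTERS = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
EMOTION_CLUSTERS = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

def cat_mean(speaker_w, emotion_w):
    """Mean = base + weighted speaker bases + weighted emotion bases."""
    out = list(BASE)
    for w, basis in zip(speaker_w, SPEAKER_CLUSTERS):
        out = [o + w * b for o, b in zip(out, basis)]
    for w, basis in zip(emotion_w, EMOTION_CLUSTERS):
        out = [o + w * b for o, b in zip(out, basis)]
    return out

# Transplantation: keep the emotion weights, change only the speaker weights,
# so the same expressive state is rendered in a different voice.
sad_speaker_a = cat_mean([1.0, 0.0], emotion_w=[1.0, 0.0])
sad_speaker_b = cat_mean([0.0, 1.0], emotion_w=[1.0, 0.0])
```

Because the two sub-spaces are jointly trained but separately weighted, the two vectors above differ only in the speaker-controlled dimensions while the emotion contribution is shared, which is the property the paper exploits to transplant expressiveness across speakers.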