Influence of expressive speech on ASR performances: application to elderly assistance in smart home
Smart homes are discussed as a win-win solution for keeping elderly people at home, as a better alternative to care homes for dependent elderly people. Such smart homes are characterized by rich domestic commands devoted to elderly safety and comfort. The vocal command has been identified as an efficient, well-accepted mode of interaction; it can be addressed directly to the "habitat" or delivered through a robotic interface. In daily use, the challenges for vocal command recognition are not only the noisy environment but, above all, reformulation and expressive variation of the strictly authorized commands. This paper (1) shows, on the basis of an elicited corpus, that expressive speech, in particular distress speech, strongly degrades generic state-of-the-art ASR systems (by 20 to 30%), and (2) shows how ASR adaptation can reduce this degradation (by 15%). We conclude that ASR systems designed for personal assistance must be adapted to expressive speech.
“THIS IS A STUNNING, STUNNING NIGHT”: NEWS MEDIA CONSTRUCTIONS OF EMOTIONAL REALITY
The news media is one of the main influences on people's perceptions, especially during elections: it can shape a voter's perception both during an election and after its results. One issue that arises is that news reports are biased, which affects viewers' perceptions and interpretations of the information reported. This paper presents a Critical Discourse Analysis (CDA) of the news media genre and further shows that prosodic features form another layer of analysis within CDA. I look specifically at how news broadcasts constructed information after the 2016 election results. News media, whether liberal or conservative (Right-wing), use strategies to manipulate information even when viewers want to think of the news media as neutral. In this paper, nine news reports representing either liberal or Right-wing perspectives were analyzed from four leading news broadcasts: Fox News, ABC News, NBC News, and CNN Breaking News.
A transcription key organized the notes on the micro-element patterns seen in each report, and the patterns were color-coded. Observations were made of differences in the visual content shown, word choices, face-threatening acts, phonetic features, and prosodic features. Vowels and intonation were analyzed with Praat, a free software package for the scientific analysis of speech in phonetics, and spectrograms are included to show the patterns that emerged. The patterns revealed that news media present information based on the side they represent and leave out information that contradicts their representation of reality. The phonetic features also show that there is a construction of emotional speech, which in turn affects how the audience perceives the information.
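The intonation analysis described above rests on pitch (F0) tracking of the kind Praat performs. A minimal autocorrelation-based sketch of that idea is below; Praat's own algorithm is considerably more refined, and the signal here is a synthetic tone, not broadcast speech.

```python
import numpy as np

def estimate_f0(signal, sr, fmin=75.0, fmax=500.0):
    """Estimate the fundamental frequency of a voiced frame by
    finding the autocorrelation peak within a plausible pitch range."""
    sig = signal - np.mean(signal)
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lag_min = int(sr / fmax)   # shortest period considered
    lag_max = int(sr / fmin)   # longest period considered
    peak = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / peak

# Synthetic "vowel": 200 Hz fundamental plus one harmonic, 0.1 s long.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
tone = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
f0 = estimate_f0(tone, sr)  # close to 200 Hz
```

The autocorrelation peaks at the lag of one full period (80 samples at 16 kHz for 200 Hz), which is why the search is bounded to the 75–500 Hz range typical of speech.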
The Perception of Emotion from Acoustic Cues in Natural Speech
Knowledge of human perception of emotional speech is imperative for developing systems for emotion recognition in speech and for emotional speech synthesis. Because research is increasingly turning towards spontaneous, real-life data, the aim of the present thesis is to examine human perception of emotion in naturalistic speech. Although there are many available emotional speech corpora, most contain simulated expressions. Therefore, there remains a compelling need to obtain naturalistic speech corpora that are appropriate and freely available for research. In that regard, our initial aim was to acquire suitable naturalistic material and examine its emotional content based on listener perceptions. A web-based listening tool was developed to collect ratings from large-scale listening groups. The emotional content present in the speech material was demonstrated by performing perception tests on conveyed levels of Activation and Evaluation. As a result, labels were determined that signified the emotional content and thus contribute to the construction of a naturalistic emotional speech corpus. In line with the literature, the ratings obtained from the perception tests suggested that Evaluation (or hedonic valence) is not identified as reliably as Activation is. Emotional valence can be conveyed through both semantic and prosodic information, where the meaning of one may facilitate, modify, or conflict with the meaning of the other, particularly in naturalistic speech. The subsequent experiments aimed to investigate this concept by comparing ratings from perception tests of non-verbal speech with those of verbal speech. The method used to render speech non-verbal was low-pass filtering, and suitable filtering conditions for this were determined by carrying out preliminary perception tests. The results suggested that non-verbal naturalistic speech provides sufficiently discernible levels of Activation and Evaluation.
It appears that the perception of Activation and Evaluation is affected by low-pass filtering, but that the effect is relatively small. Moreover, the results suggest a similar trend in agreement levels between verbal and non-verbal speech. To date, it remains difficult to determine unique acoustic patterns for the hedonic valence of emotion, which may be due to inadequate labels or the incorrect selection of acoustic parameters. This study has implications for the labelling of emotional speech data and the determination of salient acoustic correlates of emotion.
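The low-pass filtering used above to render speech non-verbal (removing segmental intelligibility while preserving prosody) can be sketched as a simple brick-wall FFT filter. The 400 Hz cutoff and the synthetic two-tone signal below are illustrative assumptions, not the filtering conditions the thesis actually selected through its perception tests.

```python
import numpy as np

def lowpass(signal, sr, cutoff_hz):
    """Brick-wall low-pass filter: zero out all spectral components
    above cutoff_hz and reconstruct the time-domain signal."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=len(signal))

sr = 16000
t = np.arange(sr) / sr  # 1 second of audio
# 150 Hz "prosodic" component plus 3000 Hz "segmental" detail.
mix = np.sin(2 * np.pi * 150 * t) + np.sin(2 * np.pi * 3000 * t)
filtered = lowpass(mix, sr, cutoff_hz=400.0)
# Only the 150 Hz component survives filtering.
```

In practice a filter with a gentler roll-off (e.g. Butterworth) would be used on real speech to avoid ringing artifacts; the brick-wall version is chosen here only for brevity.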
Models and Analysis of Vocal Emissions for Biomedical Applications
The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from a keenly felt need to share know-how, objectives, and results among areas that until then had seemed quite distinct, such as bioengineering, medicine, and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and elderly. Over the years the initial themes have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy. This edition celebrates twenty years of uninterrupted and successful research in the field of voice analysis.
Exposing the hidden vocal channel: Analysis of vocal expression
This dissertation explored perception and modeling of human vocal expression, and began by asking what people heard in expressive speech. To address this fundamental question, clips from Shakespearean soliloquy and from the Library of Congress Veterans Oral History Collection were presented to Mechanical Turk workers (10 per clip), and the workers were asked to provide 1-3 keywords describing the vocal expression in the voice. The resulting keywords described prosody, voice quality, nonverbal quality, and emotion in the voice, along with the conversational style and personal qualities attributed to the speaker. More than half of the keywords described emotion, and these were wide-ranging and nuanced. In contrast, keywords describing prosody and voice quality reduced to a short list of frequently-repeating vocal elements. Given this description of perceived vocal expression, a 3-step process was used to model the vocal qualities which listeners most frequently perceived. This process included 1) an interactive analysis across each condition to discover its distinguishing characteristics, 2) feature selection and evaluation via unequal variance sensitivity measurements and examination of means and 2-sigma variances across conditions, and 3) iterative, incremental classifier training and validation. The resulting models performed at 2-3.5 times chance. More importantly, the analysis revealed a continuum relationship across whispering, breathiness, modal speech, and resonance, and revealed multiple spectral sub-types of breathiness, modal speech, resonance, and creaky voice. Finally, latent semantic analysis (LSA) applied to the crowdsourced keyword descriptors enabled organic discovery of the expressive dimensions present in each corpus, and revealed relationships among perceived voice qualities and emotions within each dimension and across the corpora.
The resulting dimensional classifiers performed at up to 3 times chance, and a second study presented a dimensional analysis of laughter. This research produced a new way of exploring emotion in the voice and of examining relationships among emotion, prosody, voice quality, conversation quality, personal quality, and other expressive vocal elements. As future work, this perception-grounded fusion of crowdsourcing and LSA can be applied to anything humans can describe, in any research domain.
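The LSA step described above can be sketched on a toy keyword-by-clip count matrix: truncated SVD maps each keyword into a low-dimensional "expressive" space in which co-occurring keywords land close together. The keywords and counts below are invented for illustration, not taken from the dissertation's corpora.

```python
import numpy as np

# Toy keyword-by-clip count matrix: rows are crowdsourced keywords,
# columns are speech clips (counts are invented for illustration).
keywords = ["sad", "breathy", "angry", "resonant", "calm"]
counts = np.array([
    [3, 0, 0, 2],   # "sad"
    [2, 1, 0, 2],   # "breathy"
    [0, 4, 3, 0],   # "angry"
    [0, 3, 4, 0],   # "resonant"
    [1, 0, 0, 3],   # "calm"
], dtype=float)

# LSA: truncated SVD of the count matrix.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2                                  # keep two latent dimensions
keyword_coords = U[:, :k] * s[:k]      # keyword positions in LSA space

def similarity(i, j):
    """Cosine similarity between two keywords in the latent space."""
    a, b = keyword_coords[i], keyword_coords[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Because "sad" and "calm" co-occur in the same clips while "angry" appears in different ones, the latent space places "sad" nearer to "calm" than to "angry"; this is the mechanism by which LSA surfaces expressive dimensions from keyword co-occurrence.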
Evolutionary and Cognitive Approaches to Voice Perception in Humans: Acoustic Properties, Personality and Aesthetics
Voices are used as a vehicle for language, and variation in the acoustic properties of voices also contains information about the speaker. Listeners use measurable qualities, such as pitch and formant traits, as cues to a speaker’s physical stature and attractiveness. Emotional states and personality characteristics are also judged from vocal stimuli. The research contained in this thesis examines vocal masculinity, aesthetics and personality, with an emphasis on the perception of prosocial traits including trustworthiness and cooperativeness. I will also explore themes which are more cognitive in nature, testing aspects of vocal stimuli which may affect trait attribution, memory and the ascription of identity.
Chapters 2 and 3 explore systematic differences across vocal utterances, both in types of utterance using different classes of stimuli and across the time course of perception of the auditory signal. These chapters examine variation in acoustic measurements in addition to variation in listener attributions of commonly-judged speaker traits. The most important result from this work was that evaluations of attractiveness made using spontaneous speech correlated with those made using scripted speech recordings, but did not correlate with those made of the same persons using vowel stimuli. This calls into question the use of sustained vowel sounds for obtaining ratings of subjective characteristics. Vowel and single-word stimuli are also quite short – while I found that attributions of masculinity were reliable at very short exposure times, more subjective traits like attractiveness and trustworthiness require a longer exposure time to elicit reliable attributions. I conclude by recommending an exposure time of at least 5 seconds for such traits to be reliably assessed.
Chapter 4 examines what vocal traits affect perceptions of pro-social qualities using both natural and manipulated variation in voices. While feminine pitch traits (F0 and F0-SD) were linked to cooperativeness ratings, masculine formant traits (Df and Pf) were also associated with cooperativeness. The relative importance of these traits as social signals is discussed.
Chapter 5 questions what makes a voice memorable, and helps to differentiate between memory for individual voice identities and for the content which was spoken by administering recognition tests both within and across sensory modalities. While the data suggest that experimental manipulation of voice pitch did not influence memory for vocalised stimuli, attractive male voices were better remembered than unattractive voices, independent of pitch manipulation. Memory for cross-modal (textual) content was enhanced by raising the voice pitch of both male and female speakers. I link this pattern of results to the perceived dominance of voices which have been raised and lowered in pitch, and how this might impact how memories are formed and retained.
Chapter 6 examines masculinity across visual and auditory sensory modalities using a cross-modal matching task. While participants were able to match voices to muted videos of both male and female speakers at rates above chance, and to static face images of men (but not women), differences in masculinity did not influence observers in their judgements, and voice and face masculinity were not correlated. These results are discussed in terms of the generally-accepted theory that masculinity and femininity in faces and voices communicate the same underlying genetic quality. The biological mechanisms by which vocal and facial masculinity could develop independently are speculated upon.
Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy
The proceedings of the MAVEBA Workshop, held every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are: the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images as a support to clinical diagnosis and the classification of vocal pathologies. The Workshop has the sponsorship of: Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Soc. Special issues of international journals have been, and will be, published, collecting selected papers from the conference.