79 research outputs found

    Influence of expressive speech on ASR performances: application to elderly assistance in smart home

    Smart homes are discussed as a win-win solution for keeping elderly people at home, a better alternative to care homes for dependent elderly people. Such smart homes are characterized by a rich set of domestic commands devoted to elderly safety and comfort. Voice command has been identified as an efficient, well-accepted mode of interaction; it can be addressed directly to the "habitat" or through a robotic interface. In daily use, the challenges of voice command recognition are the noisy environment and, even more, the reformulation and expressive variation of the strictly authorized commands. This paper aims (1) to show, on the basis of an elicited corpus, that expressive speech, in particular distress speech, strongly affects generic state-of-the-art ASR systems (20 to 30%), and (2) to show how ASR adaptation can mitigate this degradation (15%). We conclude that ASR systems must be adapted to expressive speech when they are designed for personal assistance.

    The Perception of Emotion from Acoustic Cues in Natural Speech

    Knowledge of human perception of emotional speech is imperative for the development of emotion in speech recognition systems and emotional speech synthesis. Owing to the fact that there is a growing trend towards research on spontaneous, real-life data, the aim of the present thesis is to examine human perception of emotion in naturalistic speech. Although there are many available emotional speech corpora, most contain simulated expressions. Therefore, there remains a compelling need to obtain naturalistic speech corpora that are appropriate and freely available for research. In that regard, our initial aim was to acquire suitable naturalistic material and examine its emotional content based on listener perceptions. A web-based listening tool was developed to accumulate ratings based on large-scale listening groups. The emotional content present in the speech material was demonstrated by performing perception tests on conveyed levels of Activation and Evaluation. As a result, labels were determined that signified the emotional content, and thus contribute to the construction of a naturalistic emotional speech corpus. In line with the literature, the ratings obtained from the perception tests suggested that Evaluation (or hedonic valence) is not identified as reliably as Activation is. Emotional valence can be conveyed through both semantic and prosodic information, for which the meaning of one may serve to facilitate, modify, or conflict with the meaning of the other—particularly with naturalistic speech. The subsequent experiments aimed to investigate this concept by comparing ratings from perception tests of non-verbal speech with verbal speech. The method used to render non-verbal speech was low-pass filtering, and for this, suitable filtering conditions were determined by carrying out preliminary perception tests. The results suggested that nonverbal naturalistic speech provides sufficiently discernible levels of Activation and Evaluation. 
    It appears that the perception of Activation and Evaluation is affected by low-pass filtering, but that the effect is relatively small. Moreover, the results suggest that there is a similar trend in agreement levels between verbal and non-verbal speech. To date it still remains difficult to determine unique acoustical patterns for hedonic valence of emotion, which may be due to inadequate labels or the incorrect selection of acoustic parameters. This study has implications for the labelling of emotional speech data and the determination of salient acoustic correlates of emotion.

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the strongly felt need to share know-how, objectives and results between areas that until then seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and elderly. Over the years the initial issues have grown and spread to other areas of research such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years, always in Firenze, Italy. This edition celebrates twenty years of uninterrupted and successful research in the field of voice analysis.

    Exposing the hidden vocal channel: Analysis of vocal expression

    This dissertation explored perception and modeling of human vocal expression, and began by asking what people heard in expressive speech. To address this fundamental question, clips from Shakespearian soliloquy and from the Library of Congress Veterans Oral History Collection were presented to Mechanical Turk workers (10 per clip); and the workers were asked to provide 1-3 keywords describing the vocal expression in the voice. The resulting keywords described prosody, voice quality, nonverbal quality, and emotion in the voice, along with the conversational style, and personal qualities attributed to the speaker. More than half of the keywords described emotion, and were wide-ranging and nuanced. In contrast, keywords describing prosody and voice quality reduced to a short list of frequently-repeating vocal elements. Given this description of perceived vocal expression, a 3-step process was used to model vocal qualities which listeners most frequently perceived. This process included 1) an interactive analysis across each condition to discover its distinguishing characteristics, 2) feature selection and evaluation via unequal variance sensitivity measurements and examination of means and 2-sigma variances across conditions, and 3) iterative, incremental classifier training and validation. The resulting models performed at 2-3.5 times chance. More importantly, the analysis revealed a continuum relationship across whispering, breathiness, modal speech, and resonance, and revealed multiple spectral sub-types of breathiness, modal speech, resonance, and creaky voice. Finally, latent semantic analysis (LSA) applied to the crowdsourced keyword descriptors enabled organic discovery of expressive dimensions present in each corpus, and revealed relationships among perceived voice qualities and emotions within each dimension and across the corpora. 
    The resulting dimensional classifiers performed at up to 3 times chance, and a second study presented a dimensional analysis of laughter. This research produced a new way of exploring emotion in the voice, and of examining relationships among emotion, prosody, voice quality, conversation quality, personal quality, and other expressive vocal elements. For future work, this perception-grounded fusion of crowdsourcing and LSA technique can be applied to anything humans can describe, in any research domain.

    Evolutionary and Cognitive Approaches to Voice Perception in Humans: Acoustic Properties, Personality and Aesthetics

    Voices are used as a vehicle for language, and variation in the acoustic properties of voices also contains information about the speaker. Listeners use measurable qualities, such as pitch and formant traits, as cues to a speaker’s physical stature and attractiveness. Emotional states and personality characteristics are also judged from vocal stimuli. The research contained in this thesis examines vocal masculinity, aesthetics and personality, with an emphasis on the perception of prosocial traits including trustworthiness and cooperativeness. I will also explore themes which are more cognitive in nature, testing aspects of vocal stimuli which may affect trait attribution, memory and the ascription of identity. Chapters 2 and 3 explore systematic differences across vocal utterances, both in types of utterance using different classes of stimuli and across the time course of perception of the auditory signal. These chapters examine variation in acoustic measurements in addition to variation in listener attributions of commonly-judged speaker traits. The most important result from this work was that evaluations of attractiveness made using spontaneous speech correlated with those made using scripted speech recordings, but did not correlate with those made of the same persons using vowel stimuli. This calls into question the use of sustained vowel sounds for obtaining ratings of subjective characteristics. Vowel and single-word stimuli are also quite short. While I found that attributions of masculinity were reliable at very short exposure times, more subjective traits like attractiveness and trustworthiness require a longer exposure time to elicit reliable attributions. I conclude by recommending an exposure time of at least 5 seconds for such traits to be reliably assessed. Chapter 4 examines what vocal traits affect perceptions of pro-social qualities using both natural and manipulated variation in voices. 
    While feminine pitch traits (F0 and F0-SD) were linked to cooperativeness ratings, masculine formant traits (Df and Pf) were also associated with cooperativeness. The relative importance of these traits as social signals is discussed. Chapter 5 questions what makes a voice memorable, and helps to differentiate between memory for individual voice identities and for the content which was spoken by administering recognition tests both within and across sensory modalities. While the data suggest that experimental manipulation of voice pitch did not influence memory for vocalised stimuli, attractive male voices were better remembered than unattractive voices, independent of pitch manipulation. Memory for cross-modal (textual) content was enhanced by raising the voice pitch of both male and female speakers. I link this pattern of results to the perceived dominance of voices which have been raised and lowered in pitch, and how this might impact how memories are formed and retained. Chapter 6 examines masculinity across visual and auditory sensory modalities using a cross-modal matching task. While participants were able to match voices to muted videos of both male and female speakers at rates above chance, and to static face images of men (but not women), differences in masculinity did not influence observers in their judgements, and voice and face masculinity were not correlated. These results are discussed in terms of the generally-accepted theory that masculinity and femininity in faces and voices communicate the same underlying genetic quality. The biological mechanisms by which vocal and facial masculinity could develop independently are speculated upon.

    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The proceedings of the MAVEBA Workshop, held on a biennial basis, collect the scientific papers presented as both oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images as a support to clinical diagnosis and classification of vocal pathologies. The Workshop is sponsored by: Ente Cassa Risparmio di Firenze, COST Action 2103, Biomedical Signal Processing and Control Journal (Elsevier Eds.), IEEE Biomedical Engineering Soc. Special Issues of International Journals have been, and will be, published, collecting selected papers from the conference.