22 research outputs found

    A generative framework for conversational laughter: Its 'language model' and laughter sound synthesis

    As the phonetic and acoustic manifestations of laughter in conversation are highly diverse, laughter synthesis should be capable of accommodating such diversity while maintaining high controllability. This paper proposes a generative model of laughter in conversation that can produce a wide variety of laughter by utilizing the emotion dimension as a conversational context. The model comprises two parts: the laughter "phones generator," which generates various, but realistic, combinations of laughter components for a given speaker ID and emotional state, and the laughter "sound synthesizer," which receives the laughter phone sequence and produces acoustic features that reflect the speaker's individuality and emotional state. The results of a listening experiment indicated that conditioning both the phones generator and the sound synthesizer on emotion dimensions resulted in the most effective control of the perceived emotion in synthesized laughter. Comment: Submitted to INTERSPEECH.
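
    Read as a pipeline, the two components can be sketched roughly as below (PyTorch). This is a minimal illustration of the described architecture only: the class name, the GRU backbone, the phone-inventory size, and the two-dimensional (e.g. valence/arousal) emotion vector are assumptions, not the authors' published code.

        import torch
        import torch.nn as nn

        class ConditionedSeqModel(nn.Module):
            """Sequence model conditioned on a speaker ID and emotion dimensions;
            reused here for both stages of the hypothetical pipeline."""
            def __init__(self, n_phones=10, n_speakers=4, emo_dim=2, hidden=128, out_dim=10):
                super().__init__()
                self.phone_emb = nn.Embedding(n_phones, hidden)
                self.spk_emb = nn.Embedding(n_speakers, hidden)
                self.emo_proj = nn.Linear(emo_dim, hidden)
                self.rnn = nn.GRU(3 * hidden, hidden, batch_first=True)
                self.out = nn.Linear(hidden, out_dim)

            def forward(self, phones, speaker, emotion):
                # phones: (B, T) laughter-phone IDs; speaker: (B,); emotion: (B, emo_dim)
                cond = torch.cat([self.spk_emb(speaker), self.emo_proj(emotion)], dim=-1)
                cond = cond.unsqueeze(1).expand(-1, phones.size(1), -1)
                h, _ = self.rnn(torch.cat([self.phone_emb(phones), cond], dim=-1))
                return self.out(h)

        # Stage 1 ("phones generator"): logits over the next laughter phone at each step.
        phones_generator = ConditionedSeqModel(out_dim=10)
        # Stage 2 ("sound synthesizer"): per-frame acoustic features for the phone sequence.
        sound_synthesizer = ConditionedSeqModel(out_dim=80)

        phones = torch.zeros(1, 6, dtype=torch.long)                      # dummy phone sequence
        speaker, emotion = torch.tensor([0]), torch.tensor([[0.3, 0.9]])  # e.g. pleasant, aroused
        print(phones_generator(phones, speaker, emotion).shape)           # torch.Size([1, 6, 10])
        print(sound_synthesizer(phones, speaker, emotion).shape)          # torch.Size([1, 6, 80])

    Conditioning both stages on the same emotion vector mirrors the paper's finding that joint conditioning gives the most effective control over perceived emotion.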

    Can a robot laugh with you?: Shared laughter generation for empathetic spoken dialogue

    äșșăšäž€ç·’ă«çŹ‘ă†äŒšè©±ăƒ­ăƒœăƒƒăƒˆă‚’é–‹ç™ș --äșșă«ć…±æ„Ÿă—ă€äșșăšć…±ç”Ÿă™ă‚‹äŒšè©±AIăźćźŸçŸă«ć‘ă‘ăŠ--. äșŹéƒœć€§ć­Šăƒ—ăƒŹă‚čăƒȘăƒȘăƒŒă‚č. 2022-09-29.Spoken dialogue systems must be able to express empathy to achieve natural interaction with human users. However, laughter generation requires a high level of dialogue understanding. Thus, implementing laughter in existing systems, such as in conversational robots, has been challenging. As a first step toward solving this problem, rather than generating laughter from user dialogue, we focus on “shared laughter, ” where a user laughs using either solo or speech laughs (initial laugh), and the system laughs in turn (response laugh). The proposed system consists of three models: 1) initial laugh detection, 2) shared laughter prediction, and 3) laugh type selection. We trained each model using a human-robot speed dating dialogue corpus. For the first model, a recurrent neural network was applied, and the detection performance achieved an F1 score of 82.6%. The second model used the acoustic and prosodic features of the initial laugh and achieved a prediction accuracy above that of the random prediction. The third model selects the type of system’s response laugh as social or mirthful laugh based on the same features of the initial laugh. We then implemented the full shared laughter generation system in an attentive listening dialogue system and conducted a dialogue listening experiment. The proposed system improved the impression of the dialogue system such as empathy perception compared to a naive baseline without laughter and a reactive system that always responded with only social laughs. We propose that our system can be used for situated robot interaction and also emphasize the need for integrating proper empathetic laughs into conversational robots and agents

    Building and Designing Expressive Speech Synthesis

    We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact with, support, and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future “spoken language will provide a natural conversational interface between human beings and so-called intelligent systems” [Moore 2017, p. 283]. A considerable amount of previous research work has tested this assumption with mixed results. However, as has been pointed out, “voice interfaces have become notorious for fostering frustration and failure” [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken, from when and to whom they are spoken, can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must not only consider how SIAs should speak but how much, and whether they should even speak at all. These considerations cannot be ignored. Any speech synthesis that is used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters) as well as vocal style, emotion and intonation during an interaction (within-speaker parameters) need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and are fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. 
It is not our aim to synthesise these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
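
    One way to make the cross-speaker versus within-speaker distinction concrete is as a small configuration structure; the field names below are illustrative, not a standard schema.

        from dataclasses import dataclass, field

        @dataclass
        class VoiceDesign:            # cross-speaker parameters: fixed when the voice is chosen
            accent: str = "general"
            persona: str = "neutral"

        @dataclass
        class UtteranceStyle:         # within-speaker parameters: vary during the interaction
            vocal_style: str = "conversational"
            emotion: str = "neutral"
            intonation: str = "default"

        @dataclass
        class SynthesisRequest:
            text: str
            voice: VoiceDesign = field(default_factory=VoiceDesign)
            style: UtteranceStyle = field(default_factory=UtteranceStyle)

        # Even a "default" voice commits to concrete values for every one of these fields.
        request = SynthesisRequest(text="Of course, I can help with that.",
                                   style=UtteranceStyle(emotion="reassuring"))
        print(request.voice.accent, request.style.emotion)   # general reassuring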

    An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

    Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions -- aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, as the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive picture of this exciting field. Comment: Submitted to the Proceedings of the IEEE.

    Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

    The spontaneous behavior that often occurs in conversations makes speech more human-like than reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between the sentences in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text. Comment: Accepted by INTERSPEECH 2023.
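
    The semi-supervised step can be pictured as a pseudo-labeling loop: a detector trained on the small hand-labeled set tags spontaneous behaviors (e.g. filled pauses, laughter) in a larger unlabeled corpus, and confident utterances are added to the pre-training data. The detector interface and confidence threshold below are assumptions for illustration.

        def pseudo_label_corpus(detector, unlabeled_utterances, threshold=0.9):
            """Keep only utterances whose predicted behavior labels are confident."""
            augmented = []
            for utt in unlabeled_utterances:
                # Hypothetical detector that, like the paper's, looks at both the
                # transcript and the speech signal of an utterance.
                labels, confidence = detector.predict(text=utt["text"], audio=utt["audio"])
                if confidence >= threshold:
                    augmented.append({**utt, "behavior_labels": labels})
            return augmented

    The augmented corpus can then be mixed with the hand-labeled data when pre-training the conversational TTS model.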

    Hesitations in Spoken Dialogue Systems

    Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: UniversitÀt Bielefeld; 2020.

    The Asymmetric Lombard Effect – Simultaneous Conversation in Noisy and Quiet Environments

    Humans increase their vocal efforts in a noisy environment in a reflex-like manner. This phenomenon is called the Lombard effect. The effect causes the speaker to produce Lombard speech, which has been researched for over a century from different standpoints. Lombard speech is characterized by an increased mean energy intensity level, increased fundamental frequency, and changes in the formant frequencies and other spectral qualities of the voice. In addition, vowel durations tend to increase, and in extreme noise conditions a speaker might hyperarticulate. The communicative aspect of a speech situation is essential to the emergence of the phenomenon. The goal of this thesis was to examine speech production in a conversational situation where one of the interlocutors is subjected to noise and is thus producing Lombard speech, while the other interlocutor is communicating in silence without the direct effects of background noise, and to determine whether there are differences in the acoustics or the intelligibility of speech in such an asymmetrical speech situation compared to a symmetrical situation where the noise environment of the interlocutors is the same. Two pairs of Finnish speakers (4 participants altogether, all female) were recorded doing sudoku-based tasks in three different background noise conditions: (1) in quiet, (2) with both interlocutors in noise (symmetrical), and (3) with only one of the interlocutors subjected to noise (asymmetrical). The background noise, played at a sound pressure level of 75 dB, was cocktail noise, which includes unintelligible speech from simultaneous speakers. Altogether 453 target syllables were collected, and the mean energy intensity level was extracted from each syllable. Mean fundamental frequency (f0) data was extracted from 417 target syllables. The values of f0 and intensity were normalized, and statistical tests comparing means and variances were carried out on the data. As expected, all participants increased their intensity level and f0 from the quiet to the symmetrical condition, where both interlocutors produced Lombard speech. The participants who, during the asymmetrical condition, were in silence and communicated with the interlocutor who was in noise increased both their intensity and f0 in the asymmetrical condition compared to the quiet condition. In addition, one of these participants increased both measures to nearly the levels that were measured from her Lombard speech in the symmetrical condition. The participants who were subjected to noise during the asymmetrical condition on average used lower intensity levels in the asymmetrical condition than in the symmetrical condition, even though they produced Lombard speech during both. No target syllables were misheard during the asymmetrical condition; rather, the participants who were in silence during said condition managed to increase their vocal efforts to a level that ensured the communication of crucial information to the person in noise. 
This experiment demonstrated that when the sound environments of two interlocutors differ, neither interlocutor produces speech that is fully suited to their own sound environment; each is also indirectly affected by the sound environment of their conversational partner. In addition, it was shown that while the communicativeness of a speech situation can strengthen the effects of the Lombard effect, it can also attenuate them. Further research on the topic should gather more data and carry out wider analyses.
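
    As a minimal sketch of the kind of analysis described above, per-syllable intensity values from two conditions can be normalized and compared; the specific choices here (z-scoring, Welch's t-test, Levene's test) are illustrative assumptions, not necessarily the thesis's exact procedure.

        import numpy as np
        from scipy import stats

        def zscore(x):
            x = np.asarray(x, dtype=float)
            return (x - x.mean()) / x.std(ddof=1)

        # Dummy per-syllable mean intensity values (dB) for one speaker.
        quiet     = np.array([58.1, 59.3, 57.8, 60.2, 58.9])
        symmetric = np.array([66.4, 67.9, 65.2, 68.8, 66.0])

        combined = zscore(np.concatenate([quiet, symmetric]))   # normalize within speaker
        q, s = combined[:len(quiet)], combined[len(quiet):]

        mean_test = stats.ttest_ind(q, s, equal_var=False)      # do the condition means differ?
        var_test  = stats.levene(quiet, symmetric)               # do the condition variances differ?
        print(mean_test.pvalue, var_test.pvalue)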

    Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings

    Pause-internal phonetic particles (PINTs) comprise a variety of phenomena including: phonetic-acoustic silence, inhalation and exhalation breath noises, filler particles “uh” and “um” in English, tongue clicks, and many others. These particles are omnipresent in spontaneous speech; however, they are under-researched in both natural speech and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs on recall has been documented in natural speech in small-context laboratory settings; however, this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate whether PINTs have the same recall benefit for synthetic material that is found with natural material. In the first experiment, we evaluated the recollection of consecutive missing digits for a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for the digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those with no inhalation. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech. The speech synthesis systems used in the small-context recall experiments did not provide much freedom in terms of controlling PINT type or location. Therefore, we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. The first system allowed fine-tuned control for inserting PINTs material into the rendered material. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks. Equipped with greater control of synthesized PINTs, we returned to evaluating the recall benefit of PINTs. This time we evaluated the influence of PINTs on the recollection of key information in lectures, an ecologically valid task that focused on larger material lengths. Results indicated that key information that followed PINTs material was less likely to be recalled. We were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work showcases that PINTs improve recall for TTS in small-context environments, just as previous work had indicated for natural speech. Additionally, we’ve provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we’ve shown the importance of using material rendered by speech synthesis systems in perceptual studies. This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the project “Pause-internal phonetic particles in speech communication” (project number: 418659027; project IDs: MO 597/10-1 and TR 468/3-1). Associate member of SFB1102 “Information Density and Linguistic Encoding” (project number: 232722074).
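
    A hypothetical sketch of the kind of control the first TTS system allows: PINTs represented as explicit tokens (the tag inventory here is invented for illustration) that can be spliced into the input sequence at chosen positions before synthesis.

        PINT_TOKENS = {"sil": "<sil>", "inhale": "<br.in>", "exhale": "<br.out>",
                       "uh": "<uh>", "um": "<um>", "click": "<click>"}

        def insert_pints(words, insertions):
            """insertions: list of (word_index, pint_name); each token is placed
            immediately before the word at that index."""
            out = []
            for i, word in enumerate(words):
                out.extend(PINT_TOKENS[name] for idx, name in insertions if idx == i)
                out.append(word)
            return out

        words = "the lecture covers three main results".split()
        print(" ".join(insert_pints(words, [(0, "inhale"), (3, "sil")])))
        # -> <br.in> the lecture covers <sil> three main results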

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCCs and HMMs, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
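
    As a pointer for readers new to the area, the classical front end the review starts from (MFCC features) takes only a few lines to compute; the file path below is a placeholder and librosa is assumed to be installed.

        import librosa

        # Load a mono waveform at 16 kHz and compute 13 MFCCs per frame.
        y, sr = librosa.load("example.wav", sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        print(mfcc.shape)   # (13, n_frames)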