22 research outputs found
A generative framework for conversational laughter: Its 'language model' and laughter sound synthesis
As the phonetic and acoustic manifestations of laughter in conversation are
highly diverse, laughter synthesis should be capable of accommodating such
diversity while maintaining high controllability. This paper proposes a
generative model of laughter in conversation that can produce a wide variety of
laughter by utilizing the emotion dimension as a conversational context. The
model comprises two parts: the laughter "phones generator," which generates
various, but realistic, combinations of laughter components for a given speaker
ID and emotional state, and the laughter "sound synthesizer," which receives
the laughter phone sequence and produces acoustic features that reflect the
speaker's individuality and emotional state. The results of a listening
experiment indicated that conditioning both the phones generator and the sound
synthesizer on emotion dimensions resulted in the most effective control of the
perceived emotion in synthesized laughter.
Comment: Submitted to INTERSPEECH.
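Read as an architecture, the abstract describes a two-stage pipeline: a phones generator conditioned on speaker ID and emotion dimensions, feeding a sound synthesizer conditioned on the same factors. The following is a minimal sketch of that control flow only, assuming a hypothetical phone inventory, module names, and toy acoustic features; the paper's actual models are learned and are not specified in this listing.

```python
import random
from dataclasses import dataclass

# Hypothetical laughter phone inventory; the paper's inventory is not given here.
CALL_PHONES = ["ha", "he", "hi"]

@dataclass
class Context:
    speaker_id: int
    valence: float  # emotion dimensions supplied as conversational context
    arousal: float

def phones_generator(ctx: Context, max_len: int = 8) -> list[str]:
    """Stage 1: sample a plausible laughter phone sequence for a given
    speaker ID and emotional state (stand-in for the learned model)."""
    rng = random.Random(ctx.speaker_id)                     # speaker-dependent variation
    n = max(2, int(max_len * min(1.0, 0.3 + ctx.arousal)))  # higher arousal, longer laugh
    return ["inhale"] + rng.choices(CALL_PHONES, k=n) + ["exhale"]

def sound_synthesizer(phones: list[str], ctx: Context) -> list[dict]:
    """Stage 2: map the phone sequence to acoustic features that reflect
    speaker individuality and emotion (toy numbers, not real acoustics)."""
    base_f0 = 180.0 + 40.0 * ctx.arousal                    # higher arousal, higher pitch
    return [{"phone": p, "f0": base_f0, "energy": 0.5 + 0.4 * ctx.valence}
            for p in phones]

ctx = Context(speaker_id=3, valence=0.8, arousal=0.6)
print(sound_synthesizer(phones_generator(ctx), ctx))
```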
Can a robot laugh with you?: Shared laughter generation for empathetic spoken dialogue
Press release (Kyoto University, 2022-09-29): "Conversational robot that laughs together with humans developed: toward conversational AI that empathizes with and lives alongside people."
Spoken dialogue systems must be able to express empathy to achieve natural interaction with human users. However, laughter generation requires a high level of dialogue understanding. Thus, implementing laughter in existing systems, such as conversational robots, has been challenging. As a first step toward solving this problem, rather than generating laughter from user dialogue, we focus on "shared laughter," where a user laughs using either solo or speech laughs (the initial laugh) and the system laughs in turn (the response laugh). The proposed system consists of three models: 1) initial laugh detection, 2) shared laughter prediction, and 3) laugh type selection. We trained each model on a human-robot speed-dating dialogue corpus. For the first model, a recurrent neural network was applied, and detection performance reached an F1 score of 82.6%. The second model used the acoustic and prosodic features of the initial laugh and achieved prediction accuracy above that of random prediction. The third model selects the type of the system's response laugh, social or mirthful, based on the same features of the initial laugh. We then implemented the full shared laughter generation system in an attentive listening dialogue system and conducted a dialogue listening experiment. The proposed system improved impressions of the dialogue system, such as perceived empathy, compared with a naive baseline without laughter and a reactive system that always responded with social laughs only. We propose that our system can be used for situated robot interaction, and we emphasize the need to integrate proper empathetic laughs into conversational robots and agents.
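The abstract specifies a three-model cascade: detect the user's initial laugh, predict whether to share it, then select a social or mirthful response laugh. Below is a minimal sketch of that decision flow, assuming hypothetical feature names and hand-set thresholds in place of the paper's trained RNN and classifiers.

```python
from dataclasses import dataclass

@dataclass
class LaughFeatures:
    """Acoustic-prosodic features of a detected user laugh (hypothetical set)."""
    duration_s: float
    mean_f0_hz: float
    energy: float

def detect_initial_laugh(frame_probs: list[float], threshold: float = 0.5) -> bool:
    """Stage 1 stand-in: the paper uses an RNN over audio; here we just
    threshold per-frame laugh probabilities."""
    return max(frame_probs, default=0.0) >= threshold

def predict_shared_laugh(f: LaughFeatures) -> bool:
    """Stage 2 stand-in: decide whether the system should laugh back at all."""
    return f.duration_s > 0.4 and f.energy > 0.3

def select_laugh_type(f: LaughFeatures) -> str:
    """Stage 3 stand-in: choose a social vs. mirthful response laugh."""
    return "mirthful" if f.mean_f0_hz > 250 and f.energy > 0.6 else "social"

def respond(frame_probs: list[float], f: LaughFeatures) -> str | None:
    if detect_initial_laugh(frame_probs) and predict_shared_laugh(f):
        return select_laugh_type(f)
    return None  # stay silent: not every user laugh should be mirrored

print(respond([0.1, 0.7, 0.9], LaughFeatures(0.8, 290.0, 0.7)))  # -> "mirthful"
```

The `None` branch reflects a key point of the paper: the system responds only when shared laughter is predicted to be appropriate, rather than mirroring every detected laugh.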
Building and Designing Expressive Speech Synthesis
We know there is something special about speech. Our voices are not just a means of communicating; they also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain, they are required to interact with us and to support and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results; as has been pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6].

It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and in the underlying intelligence of the system, something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. The challenge is even greater in evaluation and in characterising full systems that have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but how much, and whether they should speak at all.

These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived and its assumed ability to perform a task or function. To ignore them is to blindly accept a set of design decisions that ignores the complex effect speech has on the user's successful interaction with a system. Thus, expressive speech synthesis is a key design component in SIAs.

This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few.
It is not our aim to synthesise these areas, but to give the reader a scaffold and a starting point by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
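The chapter's distinction between cross-speaker and within-speaker parameters maps naturally onto a configuration object. The sketch below is illustrative only, with hypothetical field names; it simply records the design decisions the chapter argues every synthetic voice embodies, defaults included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeakerDesign:
    """Cross-speaker parameters: fixed properties of the chosen voice."""
    accent: str
    personality: str

@dataclass
class UtteranceDesign:
    """Within-speaker parameters: vary per utterance during an interaction."""
    vocal_style: str = "neutral"
    emotion: str = "neutral"
    intonation: str = "default"

# Even a "default" or "neutral" voice is a bundle of concrete design decisions:
voice = SpeakerDesign(accent="Irish English", personality="warm")
turn = UtteranceDesign(vocal_style="conversational", emotion="calm")
print(voice, turn)
```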
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.
Comment: Submitted to the Proceedings of the IEEE.
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversation makes speech more
human-like than reading-style speech. However, synthesizing spontaneous-style
speech is challenging due to the lack of high-quality spontaneous datasets and
the high cost of labeling spontaneous behavior. In this paper, we propose a
semi-supervised pre-training method to increase the amount of spontaneous-style
speech and spontaneous behavioral labels. In the process of semi-supervised
learning, both text and speech information are considered for detecting
spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is
used to model the relationships between the sentences in a conversation.
Experimental results indicate that our proposed method achieves superior
expressive speech synthesis performance with the ability to model spontaneous
behavior in spontaneous-style speech and predict reasonable spontaneous
behavior from text.
Comment: Accepted by INTERSPEECH 2023.
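As a rough illustration of the semi-supervised step described above, the sketch below pseudo-labels unlabeled conversational data and keeps only confident predictions for retraining. The detector interface, label set, and confidence threshold are all hypothetical; the paper's detector is a trained model that consumes both text and speech.

```python
# Hypothetical detector interface; the paper's model and label set differ.
def detector(text: str, audio: list[float]) -> tuple[list[str], float]:
    """Toy stand-in that 'detects' a filled pause from the transcript alone."""
    labels = ["filled_pause"] if " uh " in f" {text} " else []
    return labels, 0.95 if labels else 0.40

def pseudo_label(unlabeled, confidence: float = 0.9):
    """One round of semi-supervised expansion: keep only confidently labeled
    utterances, then add them to the spontaneous-style training set."""
    kept = []
    for text, audio in unlabeled:
        labels, conf = detector(text, audio)  # real model uses text AND speech
        if conf >= confidence and labels:
            kept.append((text, audio, labels))
    return kept

data = [("so uh I think", [0.0] * 16000), ("hello there", [0.0] * 16000)]
print(pseudo_label(data))  # keeps only the utterance with a confident label
```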
Hesitations in Spoken Dialogue Systems
Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: Universität Bielefeld; 2020.
Asymmetrinen Lombard-efekti – Yhtäaikainen keskustelu meluisassa ja hiljaisessa ympäristössä (Asymmetric Lombard effect – Simultaneous conversation in noisy and quiet environments)
Humans increase their vocal efforts in a noisy environment in a reflex-like manner. This phenomenon is called the Lombard effect. The effect causes the speaker to produce Lombard speech, which has been studied for over a century from different standpoints. Lombard speech is characterized by an increased mean intensity level, an increased fundamental frequency, and changes in the formant frequencies and other spectral qualities of the voice. In addition, vowel durations tend to increase, and in extreme noise conditions a speaker might hyperarticulate. The communicative aspect of a speech situation is essential to the emergence of the phenomenon.
The goal of this thesis was to examine speech production in a conversational situation where one interlocutor is subjected to noise, and is thus producing Lombard speech, while the other communicates in silence without the direct effects of background noise, and to determine whether the acoustics or intelligibility of speech differ in such an asymmetrical situation compared with a symmetrical situation in which both interlocutors share the same noise environment. Two pairs of Finnish speakers (four participants altogether, all female) were recorded doing sudoku-based tasks in three background noise conditions: (1) in quiet, (2) with both interlocutors in noise (symmetrical), and (3) with only one of the interlocutors subjected to noise (asymmetrical). The background noise, played at a sound pressure level of 75 dB, was cocktail-party noise containing unintelligible speech from several simultaneous speakers. Altogether 453 target syllables were collected, and the mean intensity level was extracted from each; mean fundamental frequency (f0) was extracted from 417 of the target syllables. The f0 and intensity values were normalized, and statistical tests comparing means and variances were carried out on the data.
As expected, all participants increased their intensity level and f0 from the quiet condition to the symmetrical condition, in which both interlocutors produced Lombard speech. The participants who were in silence during the asymmetrical condition, communicating with an interlocutor in noise, increased both their intensity and their f0 relative to the quiet condition. In addition, one of these participants raised both measures to nearly the levels measured from her Lombard speech in the symmetrical condition. The participants who were subjected to noise during the asymmetrical condition used, on average, lower intensity levels than in the symmetrical condition, even though they produced Lombard speech in both. No target syllables were misheard during the asymmetrical condition; rather, the participants in silence managed to increase their vocal efforts to a level that ensured the crucial information was communicated to the person in noise.
This experiment demonstrated that when the sound environments of two interlocutors differ, neither produces speech that is fully suited to their own environment; speech production is also indirectly affected by the conversational partner's sound environment. In addition, it was shown that while the communicativeness of a speech situation can amplify the effects of the Lombard effect, it can also attenuate them. Further research on the topic should gather more data and carry out broader analyses.
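For readers who want to reproduce this kind of analysis, the sketch below shows one plausible reading of the normalization and mean-comparison step: within-speaker z-scoring followed by a Welch t statistic. The thesis does not specify its normalization scheme or tests in this listing, so both choices here are assumptions, and the f0 values are purely illustrative.

```python
import statistics
from math import sqrt

def z_normalize(values: list[float]) -> list[float]:
    """Z-score normalization within one speaker (an assumed scheme)."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for comparing condition means without
    assuming equal variances (an assumed choice of test)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))

quiet = [178.0, 175.0, 182.0, 180.0]    # illustrative per-syllable f0 (Hz)
lombard = [205.0, 210.0, 198.0, 214.0]  # same speaker, symmetrical noise
print(welch_t(lombard, quiet))          # positive: f0 rises under noise
print(z_normalize(quiet + lombard))     # pooled within-speaker normalization
```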
Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings
Pause-internal phonetic particles (PINTs) comprise a variety of phenomena, including phonetic-acoustic silence, inhalation and exhalation breath noises, the filler particles "uh" and "um" in English, tongue clicks, and many others. These particles are omnipresent in spontaneous speech; however, they are under-researched in both natural and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs for recall has been documented for natural speech in small-context laboratory settings, but this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate whether PINTs have the same recall benefit for synthetic material as has been found for natural material. In the first experiment, we evaluated the recollection of consecutive missing digits in a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for the digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those without. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech.
The speech synthesis systems used in the small-context recall experiments did not provide much freedom in controlling PINT type or location, so we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. The first system allowed fine-tuned control for inserting PINTs material into the rendered speech. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks.
Equipped with greater control of synthesized PINTs, we returned to evaluating their recall benefit, this time examining the influence of PINTs on the recollection of key information in lectures, an ecologically valid task focused on larger material lengths. Results indicated that key information following PINTs material was less likely to be recalled; we were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work shows that PINTs improve recall for TTS in small-context environments, as previous work had indicated for natural speech. Additionally, we have provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we have shown the importance of using material rendered by speech synthesis systems in perceptual studies.
This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the project "Pause-internal phonetic particles in speech communication" (project number: 418659027; project IDs: MO 597/10-1 and TR 468/3-1).
Associate member of SFB1102 "Information Density and Linguistic Encoding" (project number: 232722074).
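One way to picture the PINTs-labeled training data described above is as PINTs tokens interleaved with the word (or phone) sequence the TTS model consumes. The token inventory and annotation format below are hypothetical; the thesis' actual label set is not reproduced in this listing.

```python
# Hypothetical PINTs token inventory; the thesis' actual labels may differ.
PINTS = {"[SIL]", "[INHALE]", "[EXHALE]", "[UH]", "[UM]", "[CLICK]"}

def interleave_pints(words: list[str], annotations: dict[int, str]) -> list[str]:
    """Insert PINTs tokens before the word indices where they were annotated,
    yielding the symbol sequence a PINTs-aware TTS model would train on."""
    out = []
    for i, word in enumerate(words):
        if i in annotations:
            out.append(annotations[i])
        out.append(word)
    return out

words = "the results were surprising".split()
print(interleave_pints(words, {0: "[INHALE]", 2: "[UM]"}))
# ['[INHALE]', 'the', 'results', '[UM]', 'were', 'surprising']
```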
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advances in automatic
speech recognition, text-to-speech synthesis, and emotion recognition,
propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches based on MFCC features and
hidden Markov models (HMMs) to more recent deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
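To make the starting point of that evolution concrete, the sketch below extracts the classical MFCC front end that the review contrasts with learned representations. It assumes the librosa library is installed (and network access for its bundled example clip); modern models replace these hand-crafted features with representations learned end-to-end.

```python
# Classical front end: MFCCs, the hand-crafted features that deep models
# have largely replaced with learned representations.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))         # bundled example clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
print(mfcc.shape)
```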