22 research outputs found
A generative framework for conversational laughter: Its 'language model' and laughter sound synthesis
As the phonetic and acoustic manifestations of laughter in conversation are
highly diverse, laughter synthesis should be capable of accommodating such
diversity while maintaining high controllability. This paper proposes a
generative model of laughter in conversation that can produce a wide variety of
laughter by utilizing the emotion dimension as a conversational context. The
model comprises two parts: the laughter "phones generator," which generates
various, but realistic, combinations of laughter components for a given speaker
ID and emotional state, and the laughter "sound synthesizer," which receives
the laughter phone sequence and produces acoustic features that reflect the
speaker's individuality and emotional state. The results of a listening
experiment indicated that conditioning both the phones generator and the sound
synthesizer on emotion dimensions resulted in the most effective control of the
perceived emotion in synthesized laughter.
Comment: Submitted to INTERSPEECH.
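Read as an architecture, the abstract describes a two-stage pipeline: a phones generator conditioned on speaker ID and emotion dimensions, feeding a sound synthesizer conditioned on the same factors. The following is a minimal sketch of that control flow only, assuming a hypothetical phone inventory, module names, and toy acoustic features; the paper's actual models are learned and are not specified in this listing.

```python
import random
from dataclasses import dataclass

# Hypothetical laughter phone inventory; the paper's inventory is not given here.
CALL_PHONES = ["ha", "he", "hi"]

@dataclass
class Context:
    speaker_id: int
    valence: float  # emotion dimensions supplied as conversational context
    arousal: float

def phones_generator(ctx: Context, max_len: int = 8) -> list[str]:
    """Stage 1: sample a plausible laughter phone sequence for a given
    speaker ID and emotional state (stand-in for the learned model)."""
    rng = random.Random(ctx.speaker_id)                     # speaker-dependent variation
    n = max(2, int(max_len * min(1.0, 0.3 + ctx.arousal)))  # higher arousal, longer laugh
    return ["inhale"] + rng.choices(CALL_PHONES, k=n) + ["exhale"]

def sound_synthesizer(phones: list[str], ctx: Context) -> list[dict]:
    """Stage 2: map the phone sequence to acoustic features that reflect
    speaker individuality and emotion (toy numbers, not real acoustics)."""
    base_f0 = 180.0 + 40.0 * ctx.arousal                    # higher arousal, higher pitch
    return [{"phone": p, "f0": base_f0, "energy": 0.5 + 0.4 * ctx.valence}
            for p in phones]

ctx = Context(speaker_id=3, valence=0.8, arousal=0.6)
print(sound_synthesizer(phones_generator(ctx), ctx))
```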
Can a robot laugh with you?: Shared laughter generation for empathetic spoken dialogue
Press release (Kyoto University, 2022-09-29): "Conversational robot that laughs together with humans developed: toward conversational AI that empathizes with and lives alongside people."
Spoken dialogue systems must be able to express empathy to achieve natural interaction with human users. However, laughter generation requires a high level of dialogue understanding. Thus, implementing laughter in existing systems, such as conversational robots, has been challenging. As a first step toward solving this problem, rather than generating laughter from user dialogue, we focus on "shared laughter," where a user laughs using either solo or speech laughs (the initial laugh) and the system laughs in turn (the response laugh). The proposed system consists of three models: 1) initial laugh detection, 2) shared laughter prediction, and 3) laugh type selection. We trained each model on a human-robot speed-dating dialogue corpus. For the first model, a recurrent neural network was applied, and detection performance reached an F1 score of 82.6%. The second model used the acoustic and prosodic features of the initial laugh and achieved prediction accuracy above that of random prediction. The third model selects the type of the system's response laugh, social or mirthful, based on the same features of the initial laugh. We then implemented the full shared laughter generation system in an attentive listening dialogue system and conducted a dialogue listening experiment. The proposed system improved impressions of the dialogue system, such as perceived empathy, compared with a naive baseline without laughter and a reactive system that always responded with social laughs only. We propose that our system can be used for situated robot interaction, and we emphasize the need to integrate proper empathetic laughs into conversational robots and agents.
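The abstract specifies a three-model cascade: detect the user's initial laugh, predict whether to share it, then select a social or mirthful response laugh. Below is a minimal sketch of that decision flow, assuming hypothetical feature names and hand-set thresholds in place of the paper's trained RNN and classifiers.

```python
from dataclasses import dataclass

@dataclass
class LaughFeatures:
    """Acoustic-prosodic features of a detected user laugh (hypothetical set)."""
    duration_s: float
    mean_f0_hz: float
    energy: float

def detect_initial_laugh(frame_probs: list[float], threshold: float = 0.5) -> bool:
    """Stage 1 stand-in: the paper uses an RNN over audio; here we just
    threshold per-frame laugh probabilities."""
    return max(frame_probs, default=0.0) >= threshold

def predict_shared_laugh(f: LaughFeatures) -> bool:
    """Stage 2 stand-in: decide whether the system should laugh back at all."""
    return f.duration_s > 0.4 and f.energy > 0.3

def select_laugh_type(f: LaughFeatures) -> str:
    """Stage 3 stand-in: choose a social vs. mirthful response laugh."""
    return "mirthful" if f.mean_f0_hz > 250 and f.energy > 0.6 else "social"

def respond(frame_probs: list[float], f: LaughFeatures) -> str | None:
    if detect_initial_laugh(frame_probs) and predict_shared_laugh(f):
        return select_laugh_type(f)
    return None  # stay silent: not every user laugh should be mirrored

print(respond([0.1, 0.7, 0.9], LaughFeatures(0.8, 290.0, 0.7)))  # -> "mirthful"
```

The `None` branch reflects a key point of the paper: the system responds only when shared laughter is predicted to be appropriate, rather than mirroring every detected laugh.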
Building and Designing Expressive Speech Synthesis
We know there is something special about speech. Our voices are not just a means of communicating; they also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain, they are required to interact with us and to support and mediate our social relationships with 1) each other, 2) digital information, and, increasingly, 3) AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results; as has been pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6].

It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls. Uncomfortably, because it is often overshadowed by issues in interactivity and in the underlying intelligence of the system, something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. The challenge is even greater in evaluation and in characterising full systems that have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must consider not only how SIAs should speak but how much, and whether they should speak at all.

These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived and its assumed ability to perform a task or function. To ignore them is to blindly accept a set of design decisions that ignores the complex effect speech has on the user's successful interaction with a system. Thus, expressive speech synthesis is a key design component in SIAs.

This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech. The debates and literature within this topic are vast and fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few.
It is not our aim to synthesise these areas, but to give the reader a scaffold and a starting point by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
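The chapter's distinction between cross-speaker and within-speaker parameters maps naturally onto a configuration object. The sketch below is illustrative only, with hypothetical field names; it simply records the design decisions the chapter argues every synthetic voice embodies, defaults included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeakerDesign:
    """Cross-speaker parameters: fixed properties of the chosen voice."""
    accent: str
    personality: str

@dataclass
class UtteranceDesign:
    """Within-speaker parameters: vary per utterance during an interaction."""
    vocal_style: str = "neutral"
    emotion: str = "neutral"
    intonation: str = "default"

# Even a "default" or "neutral" voice is a bundle of concrete design decisions:
voice = SpeakerDesign(accent="Irish English", personality="warm")
turn = UtteranceDesign(vocal_style="conversational", emotion="calm")
print(voice, turn)
```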
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.
Comment: Submitted to the Proceedings of the IEEE.
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversation makes speech more
human-like than reading-style speech. However, synthesizing spontaneous-style
speech is challenging due to the lack of high-quality spontaneous datasets and
the high cost of labeling spontaneous behavior. In this paper, we propose a
semi-supervised pre-training method to increase the amount of spontaneous-style
speech and spontaneous behavioral labels. In the process of semi-supervised
learning, both text and speech information are considered for detecting
spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is
used to model the relationships between the sentences in a conversation.
Experimental results indicate that our proposed method achieves superior
expressive speech synthesis performance with the ability to model spontaneous
behavior in spontaneous-style speech and predict reasonable spontaneous
behavior from text.
Comment: Accepted by INTERSPEECH 2023.
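As a rough illustration of the semi-supervised step described above, the sketch below pseudo-labels unlabeled conversational data and keeps only confident predictions for retraining. The detector interface, label set, and confidence threshold are all hypothetical; the paper's detector is a trained model that consumes both text and speech.

```python
# Hypothetical detector interface; the paper's model and label set differ.
def detector(text: str, audio: list[float]) -> tuple[list[str], float]:
    """Toy stand-in that 'detects' a filled pause from the transcript alone."""
    labels = ["filled_pause"] if " uh " in f" {text} " else []
    return labels, 0.95 if labels else 0.40

def pseudo_label(unlabeled, confidence: float = 0.9):
    """One round of semi-supervised expansion: keep only confidently labeled
    utterances, then add them to the spontaneous-style training set."""
    kept = []
    for text, audio in unlabeled:
        labels, conf = detector(text, audio)  # real model uses text AND speech
        if conf >= confidence and labels:
            kept.append((text, audio, labels))
    return kept

data = [("so uh I think", [0.0] * 16000), ("hello there", [0.0] * 16000)]
print(pseudo_label(data))  # keeps only the utterance with a confident label
```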
Hesitations in Spoken Dialogue Systems
Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: Universität Bielefeld; 2020.
Asymmetrinen Lombard-efekti – Yhtäaikainen keskustelu meluisassa ja hiljaisessa ympäristössä (Asymmetric Lombard effect – Simultaneous conversation in noisy and quiet environments)
Humans increase their vocal efforts in a noisy environment in a reflex-like manner. This phenomenon is called the Lombard effect. The effect causes the speaker to produce Lombard speech, which has been studied for over a century from different standpoints. Lombard speech is characterized by an increased mean intensity level, an increased fundamental frequency, and changes in the formant frequencies and other spectral qualities of the voice. In addition, vowel durations tend to increase, and in extreme noise conditions a speaker might hyperarticulate. The communicative aspect of a speech situation is essential to the emergence of the phenomenon.
The goal of this thesis was to examine speech production in a conversational situation where one interlocutor is subjected to noise, and is thus producing Lombard speech, while the other communicates in silence without the direct effects of background noise, and to determine whether the acoustics or intelligibility of speech differ in such an asymmetrical situation compared with a symmetrical situation in which both interlocutors share the same noise environment. Two pairs of Finnish speakers (four participants altogether, all female) were recorded doing sudoku-based tasks in three background noise conditions: (1) in quiet, (2) with both interlocutors in noise (symmetrical), and (3) with only one of the interlocutors subjected to noise (asymmetrical). The background noise, played at a sound pressure level of 75 dB, was cocktail-party noise containing unintelligible speech from several simultaneous speakers. Altogether 453 target syllables were collected, and the mean intensity level was extracted from each; mean fundamental frequency (f0) was extracted from 417 of the target syllables. The f0 and intensity values were normalized, and statistical tests comparing means and variances were carried out on the data.
As expected, all participants increased their intensity level and f0 from the quiet condition to the symmetrical condition, in which both interlocutors produced Lombard speech. The participants who were in silence during the asymmetrical condition, communicating with an interlocutor in noise, increased both their intensity and their f0 relative to the quiet condition. In addition, one of these participants raised both measures to nearly the levels measured from her Lombard speech in the symmetrical condition. The participants who were subjected to noise during the asymmetrical condition used, on average, lower intensity levels than in the symmetrical condition, even though they produced Lombard speech in both. No target syllables were misheard during the asymmetrical condition; rather, the participants in silence managed to increase their vocal efforts to a level that ensured the crucial information was communicated to the person in noise.
This experiment demonstrated that when the sound environments of two interlocutors differ, neither produces speech that is fully suited to their own environment; speech production is also indirectly affected by the conversational partner's sound environment. In addition, it was shown that while the communicativeness of a speech situation can amplify the effects of the Lombard effect, it can also attenuate them. Further research on the topic should gather more data and carry out broader analyses.
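For readers who want to reproduce this kind of analysis, the sketch below shows one plausible reading of the normalization and mean-comparison step: within-speaker z-scoring followed by a Welch t statistic. The thesis does not specify its normalization scheme or tests in this listing, so both choices here are assumptions, and the f0 values are purely illustrative.

```python
import statistics
from math import sqrt

def z_normalize(values: list[float]) -> list[float]:
    """Z-score normalization within one speaker (an assumed scheme)."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for comparing condition means without
    assuming equal variances (an assumed choice of test)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))

quiet = [178.0, 175.0, 182.0, 180.0]    # illustrative per-syllable f0 (Hz)
lombard = [205.0, 210.0, 198.0, 214.0]  # same speaker, symmetrical noise
print(welch_t(lombard, quiet))          # positive: f0 rises under noise
print(z_normalize(quiet + lombard))     # pooled within-speaker normalization
```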
Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings
Pause-internal phonetic particles (PINTs) comprise a variety of phenomena, including phonetic-acoustic silence, inhalation and exhalation breath noises, the filler particles "uh" and "um" in English, tongue clicks, and many others. These particles are omnipresent in spontaneous speech; however, they are under-researched in both natural and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs for recall has been documented for natural speech in small-context laboratory settings, but this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate whether PINTs have the same recall benefit for synthetic material as has been found for natural material. In the first experiment, we evaluated the recollection of consecutive missing digits in a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for the digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those without. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech.
The speech synthesis systems used in the small-context recall experiments did not provide much freedom in controlling PINT type or location, so we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. The first system allowed fine-tuned control for inserting PINTs material into the rendered speech. The second system produced PINTs probabilistically. To the best of our knowledge, these are the first TTS systems to render tongue clicks.
Equipped with greater control of synthesized PINTs, we returned to evaluating their recall benefit, this time examining the influence of PINTs on the recollection of key information in lectures, an ecologically valid task focused on larger material lengths. Results indicated that key information following PINTs material was less likely to be recalled; we were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work shows that PINTs improve recall for TTS in small-context environments, as previous work had indicated for natural speech. Additionally, we have provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we have shown the importance of using material rendered by speech synthesis systems in perceptual studies.
This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the project "Pause-internal phonetic particles in speech communication" (project number: 418659027; project IDs: MO 597/10-1 and TR 468/3-1).
Associate member of SFB1102 "Information Density and Linguistic Encoding" (project number: 232722074).
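One way to picture the PINTs-labeled training data described above is as PINTs tokens interleaved with the word (or phone) sequence the TTS model consumes. The token inventory and annotation format below are hypothetical; the thesis' actual label set is not reproduced in this listing.

```python
# Hypothetical PINTs token inventory; the thesis' actual labels may differ.
PINTS = {"[SIL]", "[INHALE]", "[EXHALE]", "[UH]", "[UM]", "[CLICK]"}

def interleave_pints(words: list[str], annotations: dict[int, str]) -> list[str]:
    """Insert PINTs tokens before the word indices where they were annotated,
    yielding the symbol sequence a PINTs-aware TTS model would train on."""
    out = []
    for i, word in enumerate(words):
        if i in annotations:
            out.append(annotations[i])
        out.append(word)
    return out

words = "the results were surprising".split()
print(interleave_pints(words, {0: "[INHALE]", 2: "[UM]"}))
# ['[INHALE]', 'the', 'results', '[UM]', 'were', 'surprising']
```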
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advances in automatic
speech recognition, text-to-speech synthesis, and emotion recognition,
propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches based on MFCC features and
hidden Markov models (HMMs) to more recent deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
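To make the starting point of that evolution concrete, the sketch below extracts the classical MFCC front end that the review contrasts with learned representations. It assumes the librosa library is installed (and network access for its bundled example clip); modern models replace these hand-crafted features with representations learned end-to-end.

```python
# Classical front end: MFCCs, the hand-crafted features that deep models
# have largely replaced with learned representations.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))         # bundled example clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
print(mfcc.shape)
```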