Mimicking the interlocutor’s speech: models of speaker entrainment for improving the naturalness of spoken dialogue systems
As speech processing technologies continue to improve, we gradually approach the old dream of creating a machine that talks. Current interactive spoken dialogue systems allow users to perform simple tasks, such as banking transactions and hotel reservations, through verbal interaction. Despite being relatively successful, these human-computer conversations still have a long way to go in terms of naturalness: users tend to describe these systems as "odd" or even "intimidating". Among the main reasons for this lack of naturalness is the imperfect modeling of prosodic variation, or how certain properties of speech (such as intonation, intensity and rhythm) change across utterances. Current systems are still unable to handle these features correctly, both when understanding the user's speech and when producing synthesized responses. Prosodic variation is extremely complex in spontaneous speech, and it is known to be affected by several levels of linguistic representation (lexical, syntactic, semantic and pragmatic). In this article, we focus on a particular dimension of prosodic variation, known as entrainment or "mimicry between interlocutors", which consists of the automatic alignment of speech features between the participants of a dialogue. After a general overview of the literature on these topics, we describe an ongoing research project that seeks to model prosodic entrainment in dialogue.
Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue
As interactive voice response systems spread at a rapid pace, providing increasingly complex functionality, it is becoming clear that the challenges of such systems are not solely associated with their synthesis and recognition capabilities. Rather, issues such as the coordination of turn exchanges between system and user, or the correct generation and understanding of words that may convey multiple meanings, appear to play an important role in system usability. This thesis explores those two issues in the Columbia Games Corpus, a collection of spontaneous task-oriented dialogues in Standard American English. We provide evidence of the existence of seven turn-yielding cues -- prosodic, acoustic and syntactic events strongly associated with conversational turn endings -- and show that the likelihood of a turn-taking attempt by the interlocutor increases linearly with the number of cues conjointly displayed by the speaker. We present similar results for six backchannel-inviting cues -- events that invite the interlocutor to produce a short utterance conveying continued attention. Additionally, we describe a series of studies of affirmative cue words -- a family of cue words such as 'okay' or 'alright' that speakers use frequently in conversation for several purposes: acknowledging what the interlocutor has said, or cueing the start of a new topic, among others. We find differences in the acoustic/prosodic realization of these functions, but observe that contextual information figures prominently in human disambiguation of these words. We also conduct machine learning experiments to explore the automatic classification of affirmative cue words. Finally, we examine a novel measure of speaker entrainment related to the usage of these words, showing its association with task success and dialogue coordination.
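As an aside, the linear cue-count effect reported above lends itself to a simple descriptive check. The sketch below is purely illustrative (the data and layout are invented, not the thesis's actual pipeline): it computes the empirical probability of a turn-taking attempt for each number of conjointly displayed cues and fits a line to the relationship.

```python
import numpy as np

# Hypothetical data: for each utterance end, the number of
# turn-yielding cues displayed (0-7) and whether the interlocutor
# attempted to take the turn afterwards (1) or not (0).
cue_counts = np.array([0, 1, 1, 2, 3, 3, 4, 5, 6, 7, 2, 0, 5, 6])
turn_taken = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1])

# Empirical probability of a turn-taking attempt per cue count.
for k in range(8):
    mask = cue_counts == k
    if mask.any():
        print(f"{k} cues: P(attempt) = {turn_taken[mask].mean():.2f}")

# A least-squares line quantifies how close the trend is to linear.
slope, intercept = np.polyfit(cue_counts, turn_taken, 1)
print(f"fitted slope {slope:.3f}, intercept {intercept:.3f}")
```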
Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
This paper addresses a critical problem in deploying a spoken dialog system (SDS). One of the main bottlenecks of SDS deployment for a new domain is data sparseness in building a statistical language model. Our goal is to devise a method to efficiently build a reliable language model for a new SDS. We consider the worst yet quite common scenario, where only a small amount (∼1.7K utterances) of domain-specific data is available for the target domain. We present a new method that exploits external static text resources collected for other speech recognition tasks, as well as dynamic text resources acquired from the World Wide Web (WWW). We show that language models built using external resources can be used jointly with a limited in-domain (baseline) language model to obtain significant improvements in speech recognition accuracy. Combining language models built from external resources with the in-domain language model provides over 20% reduction in WER over the baseline in-domain language model; equivalently, we achieve almost the same level of performance as having ten times as much in-domain data (17K utterances).
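The joint use of external and in-domain models described above is commonly realized as linear interpolation of n-gram probabilities. The sketch below illustrates that general technique only; the mixture weight and the toy probabilities are assumptions, not the paper's actual configuration.

```python
# Minimal sketch of linear language-model interpolation:
# P(w | h) = lam * P_in(w | h) + (1 - lam) * P_ext(w | h)

def interpolate(p_in: float, p_ext: float, lam: float = 0.7) -> float:
    """Mix an in-domain and an external n-gram probability estimate."""
    return lam * p_in + (1.0 - lam) * p_ext

# Toy conditional probabilities for the word "reservation" given some
# history, from a sparse in-domain LM and a broad external LM.
p_in_domain = 0.012
p_external = 0.004

print(interpolate(p_in_domain, p_external))  # 0.0096
```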
Restoring Punctuation and Capitalization in Transcribed Speech
Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-based n-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to 55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.
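To make the single-pass idea concrete, one can picture the system scoring, at each word boundary, the option of inserting a punctuation token against the option of proceeding directly to the next word. The toy sketch below illustrates this with an invented trigram table and a greedy pass; the paper's actual model is trained on far more data and need not be greedy.

```python
import math

# Toy trigram log-probabilities over words and punctuation tokens;
# a real system would use an LM trained on billions of tokens and
# dynamic programming rather than this greedy pass.
LOGP = {
    ("larger", "sets", "."): math.log(0.20),
    ("larger", "sets", "consistently"): math.log(0.01),
}

def score(trigram):
    return LOGP.get(trigram, math.log(1e-6))  # floor for unseen trigrams

def punctuate(words):
    """Greedily insert ',' or '.' wherever the LM prefers it to the next word."""
    out = list(words[:2])
    for w in words[2:]:
        h1, h2 = out[-2], out[-1]
        # Ties favour no punctuation (None comes first in the candidate list).
        best = max([None, ",", "."],
                   key=lambda p: score((h1, h2, p if p else w)))
        if best:
            out.append(best)
        out.append(w)
    return " ".join(out)

print(punctuate(["larger", "sets", "consistently", "improve", "performance"]))
# -> "larger sets . consistently improve performance"
```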
High Frequency Word Entrainment in Spoken Dialogue
Cognitive theories of dialogue hold that entrainment, the automatic alignment between dialogue partners at many levels of linguistic representation, is key to facilitating both production and comprehension in dialogue. In this paper we examine novel types of entrainment in two corpora: Switchboard and the Columbia Games Corpus. We examine entrainment in the use of high-frequency words (the most common words in the corpus), and its association with dialogue naturalness and flow, as well as with task success. Our results show that such entrainment is predictive of the perceived naturalness of dialogues and is significantly correlated with task success; in overall interaction flow, higher degrees of entrainment are associated with more overlaps and fewer interruptions.
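One plausible formalization of this kind of high-frequency-word entrainment (the paper's exact measure may differ) compares the relative frequency with which each partner uses the most common words:

```python
from collections import Counter

def entrainment(words_a, words_b, top_n=25):
    """Negative normalized difference in relative frequency of the top-N
    most common words across two speakers; values closer to 0 indicate
    more similar usage. A rough sketch, not the paper's exact metric."""
    ca, cb = Counter(words_a), Counter(words_b)
    na, nb = len(words_a), len(words_b)
    common = [w for w, _ in (ca + cb).most_common(top_n)]
    score = 0.0
    for w in common:
        fa, fb = ca[w] / na, cb[w] / nb
        score -= abs(fa - fb) / (fa + fb)  # fa + fb > 0 for shared vocabulary
    return score / len(common)

a = "yeah okay so we move the lion yeah okay".split()
b = "okay yeah the lion goes left okay yeah yeah".split()
print(entrainment(a, b, top_n=5))
```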
The Prosody of Backchannels in American English
We examine prosodic and contextual factors characterizing the backchannel function of single affirmative words. Data is drawn from collaborative task-oriented dialogues between speakers of Standard American English. Despite high lexical variability, backchannels are prosodically well defined: they have higher pitch and intensity and greater pitch slope than affirmative words expressing other pragmatic functions. Additionally, we identify phrase-final rising pitch as a salient trigger for backchanneling.
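Features of the kind reported here (pitch, intensity, pitch slope) can be extracted with standard tooling. The sketch below uses librosa as one possible toolkit; the file name and parameter choices are assumptions, not those of the study.

```python
import numpy as np
import librosa

# Load a short utterance (hypothetical file) and extract the
# acoustic/prosodic features discussed above.
y, sr = librosa.load("okay_token.wav", sr=16000)

# Fundamental frequency via pYIN; unvoiced frames come back as NaN.
f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
f0_voiced = f0[~np.isnan(f0)]

mean_pitch = f0_voiced.mean()

# Pitch slope: least-squares line over the voiced pitch contour.
t = np.arange(len(f0_voiced))
slope = np.polyfit(t, f0_voiced, 1)[0]

# Intensity proxy: mean RMS energy in dB.
rms = librosa.feature.rms(y=y)[0]
mean_intensity_db = librosa.amplitude_to_db(rms).mean()

print(f"mean f0 {mean_pitch:.1f} Hz, slope {slope:.3f} Hz/frame, "
      f"intensity {mean_intensity_db:.1f} dB")
```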
Fusion approaches for emotion recognition from speech using acoustic and text-based features
In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions, and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimation of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.
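One of the simplest strategies in this family is late fusion of per-class posteriors. The sketch below shows that generic technique with invented numbers; it is not the paper's specific fusion model.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]

# Hypothetical per-class posteriors from an acoustic model and a
# BERT-based text model for the same utterance.
p_audio = np.array([0.10, 0.55, 0.25, 0.10])
p_text = np.array([0.05, 0.30, 0.60, 0.05])

# Weighted late fusion; the weight w would be tuned on held-out data.
w = 0.6
p_fused = w * p_audio + (1 - w) * p_text
p_fused /= p_fused.sum()  # renormalize (already sums to 1 here)

print(EMOTIONS[int(p_fused.argmax())], p_fused.round(3))
```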
A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool
In dialogue, speakers produce and perceive acoustic/prosodic turn-taking cues, which are fundamental for negotiating turn exchanges with their interlocutors. However, little is known about the temporal dynamics and cross-linguistic validity of these cues. In this work, we explore a set of acoustic/prosodic cues preceding three turn-transition types (hold, switch and backchannel) in three different languages (Slovak, American English and Argentine Spanish). For this, we use and refine a set of machine learning techniques that enable a finer-grained temporal analysis of such cues, as well as a comparison of their relative explanatory power. Our results suggest that the three languages, despite belonging to distinct linguistic families, share the general usage of a handful of acoustic/prosodic features to signal turn transitions. We conclude that exploiting features such as speech rate, final-word lengthening, the pitch track over the final 200 ms, the intensity track over the final 1000 ms, and noise-to-harmonics ratio (a voice-quality feature) might prove useful for further improving the accuracy of the turn-taking modules found in modern spoken dialogue systems.
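The idea of using machine learning as a descriptive tool can be pictured as training a separate classifier on features from each time window preceding the transition and comparing their cross-validated performance. The sketch below is hypothetical throughout (random data, assumed window layout); it illustrates the general method, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical dataset: 500 turn transitions, 5 windows of 200 ms each,
# 4 acoustic/prosodic features per window; labels are the transition
# type (0 = hold, 1 = switch, 2 = backchannel).
X = rng.normal(size=(500, 5, 4))
y = rng.integers(0, 3, size=500)

# Train one classifier per window; its cross-validated accuracy serves
# as a descriptive measure of how informative that window's cues are.
for w in range(5):
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X[:, w, :], y, cv=5).mean()
    print(f"window {w} (starting {(5 - w) * 200} ms before the turn end): "
          f"accuracy = {acc:.2f}")
```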
On the Role of Context and Prosody in the Interpretation of ‘Okay’
We examine the effect of contextual and acoustic cues in the disambiguation of three discourse-pragmatic functions of the word 'okay'. Results of a perception study show that contextual cues are stronger predictors of discourse function than acoustic cues. However, acoustic features capturing the pitch excursion at the right edge of 'okay' feature prominently in disambiguation, whether other contextual cues are present or not.