37 research outputs found

    Mimicking the interlocutor’s speech: models of speaker entrainment for improving the naturalness of spoken dialogue systems

    As speech processing technologies continue to improve, the old dream of creating a machine that talks gradually comes closer. Current interactive spoken dialogue systems allow users to perform simple tasks, such as banking transactions and hotel reservations, through verbal interaction. Despite being relatively successful, these human-computer conversations still have a long way to go in terms of naturalness: users tend to describe these systems as "odd" or even "intimidating". Among the main reasons for this lack of naturalness is the imperfect modeling of prosodic variation, that is, how certain properties of speech (such as intonation, intensity, and rhythm) change across utterances. Current systems are still unable to handle these features correctly, both when understanding the user's speech and when producing synthesized responses. Prosodic variation is extremely complex in spontaneous speech and is known to be affected by several levels of linguistic representation (lexical, syntactic, semantic, and pragmatic). In this article, we focus on a particular dimension of prosodic variation, known as speaker entrainment, which consists of the automatic alignment of speech features between the participants in a dialogue. After a general review of the literature on these topics, we describe an ongoing research project that seeks to model prosodic entrainment in dialogue.
    Author: Gravano, Agustin (Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Computación; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina)

    Restoring Punctuation and Capitalization in Transcribed Speech

    Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-based n-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to 55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.
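    The single-pass approach described above can be sketched as follows: each raw ASR token is expanded into capitalization and punctuation variants, and a beam search keeps the variant sequence with the highest n-gram language-model score. This is an illustration, not the paper's implementation; the trigram probabilities below are toy values, whereas the paper's models are trained on up to 55 billion tokens.

        # Hypothetical trigram log-probabilities; a real system would estimate
        # these from a large corpus, with punctuation treated as ordinary tokens.
        TRIGRAM_LOGPROB = {
            ("<s>", "<s>", "Hello"): -0.5,
            ("<s>", "Hello", ","): -0.7,
            ("Hello", ",", "world"): -0.9,
            (",", "world", "."): -0.6,
            ("world", ".", "</s>"): -0.1,
        }
        UNSEEN_LOGPROB = -10.0  # crude stand-in for back-off smoothing

        def variants(token):
            """Expand a raw ASR token into cased variants, optionally followed by punctuation."""
            cased = [token, token.capitalize()]
            return [(w,) for w in cased] + [(w, p) for w in cased for p in (",", ".")]

        def restore(tokens, beam_size=5):
            """Single pass over the token stream, scoring variants with the trigram LM."""
            beams = [(0.0, ["<s>", "<s>"])]  # (log-probability, output history)
            for token in tokens + ["</s>"]:
                candidates = []
                for score, hist in beams:
                    for opt in (variants(token) if token != "</s>" else [("</s>",)]):
                        s, h = score, list(hist)
                        for w in opt:
                            s += TRIGRAM_LOGPROB.get((h[-2], h[-1], w), UNSEEN_LOGPROB)
                            h.append(w)
                        candidates.append((s, h))
                beams = sorted(candidates, key=lambda b: -b[0])[:beam_size]
            return " ".join(beams[0][1][2:-1])  # strip <s> <s> ... </s>

        print(restore(["hello", "world"]))  # -> "Hello , world ." under the toy LM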

    High Frequency Word Entrainment in Spoken Dialogue

    Cognitive theories of dialogue hold that entrainment, the automatic alignment between dialogue partners at many levels of linguistic representation, is key to facilitating both production and comprehension in dialogue. In this paper we examine novel types of entrainment in two corpora: Switchboard and the Columbia Games corpus. We examine entrainment in the use of high-frequency words (the most common words in the corpus) and its association with dialogue naturalness and flow, as well as with task success. Our results show that such entrainment is predictive of the perceived naturalness of dialogues and is significantly correlated with task success; in overall interaction flow, higher degrees of entrainment are associated with more overlaps and fewer interruptions.
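    One simple way to quantify this kind of entrainment, in the spirit of the paper but not necessarily the authors' exact measure, is to compare how often each speaker uses the dialogue's most frequent words, relative to their own total word count; a per-word score near 0 means the two speakers use that word at similar rates, and -1 means only one of them uses it.

        from collections import Counter

        def entrainment_score(words_a, words_b, top_k=25):
            """Average usage-rate similarity for the dialogue's top-k words."""
            counts_a, counts_b = Counter(words_a), Counter(words_b)
            top_words = [w for w, _ in (counts_a + counts_b).most_common(top_k)]
            scores = []
            for w in top_words:
                fa = counts_a[w] / len(words_a)  # speaker A's usage rate of w
                fb = counts_b[w] / len(words_b)  # speaker B's usage rate of w
                scores.append(-abs(fa - fb) / (fa + fb))
            return sum(scores) / len(scores)

        # Toy two-speaker exchange; real inputs would be each speaker's full side
        # of a dialogue from a corpus such as Switchboard.
        a = "yeah i think the red one goes on top of the blue one yeah".split()
        b = "okay so the red one on top yeah i see it".split()
        print(entrainment_score(a, b, top_k=10))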

    Fusion approaches for emotion recognition from speech using acoustic and text-based features

    In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions, and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimation of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.
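    The two broad fusion strategies the paper compares can be sketched as follows, with random placeholder vectors standing in for real acoustic features and BERT embeddings; the model choices and dimensions here are illustrative, not the paper's actual configuration.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n, d_audio, d_text, n_classes = 200, 32, 48, 4
        X_audio = rng.normal(size=(n, d_audio))  # stand-in for acoustic features
        X_text = rng.normal(size=(n, d_text))    # stand-in for BERT embeddings
        y = rng.integers(0, n_classes, size=n)   # emotion labels

        # Feature-level (early) fusion: concatenate modalities, train one classifier.
        early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_text]), y)

        # Score-level (late) fusion: one classifier per modality, average posteriors.
        clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
        clf_t = LogisticRegression(max_iter=1000).fit(X_text, y)
        fused = 0.5 * clf_a.predict_proba(X_audio) + 0.5 * clf_t.predict_proba(X_text)
        print(fused.argmax(axis=1)[:10])  # fused emotion predictions

    The paper's observation about cross-validation criteria also applies to any sketch like this one: keeping all utterances from a given speaker or session within a single fold avoids the optimistic performance estimates mentioned above.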

    A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool

    In dialogue, speakers produce and perceive acoustic/prosodic turn-taking cues, which are fundamental for negotiating turn exchanges with their interlocutors. However, little is known about the temporal dynamics and cross-linguistic validity of these cues. In this work, we explore a set of acoustic/prosodic cues preceding three turn-transition types (hold, switch and backchannel) in three different languages (Slovak, American English and Argentine Spanish). For this, we use and refine a set of machine learning techniques that enable a finer-grained temporal analysis of such cues, as well as a comparison of their relative explanatory power. Our results suggest that the three languages, despite belonging to distinct linguistic families, share the general usage of a handful of acoustic/prosodic features to signal turn transitions. We conclude that exploiting features such as speech rate, final-word lengthening, the pitch track over the final 200 ms, the intensity track over the final 1000 ms, and noise-to-harmonics ratio (a voice-quality feature) might prove useful for further improving the accuracy of the turn-taking modules found in modern spoken dialogue systems.
    Authors: Brusco, Pablo; Vidal, Jazmín (Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Computación; Instituto de Investigación en Ciencias de la Computación, CONICET, Argentina); Beňuš, Štefan (University in Nitra; Slovak Academy of Sciences, Slovakia); Gravano, Agustin (Instituto de Investigación en Ciencias de la Computación, CONICET; Universidad de Buenos Aires, Argentina)
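    As a rough illustration of how such turn-final features can be extracted (not the authors' feature-extraction code), the Praat wrapper parselmouth can compute pitch, intensity, and harmonicity tracks over the final portion of a pre-segmented turn; the window sizes follow the feature list above, and the file name is hypothetical.

        import numpy as np
        import parselmouth  # pip install praat-parselmouth

        def turn_final_features(wav_path):
            snd = parselmouth.Sound(wav_path)
            end = snd.duration

            # Pitch track over the final 200 ms of the turn.
            tail_200ms = snd.extract_part(from_time=max(0.0, end - 0.2), to_time=end)
            f0 = tail_200ms.to_pitch().selected_array["frequency"]
            f0 = f0[f0 > 0]  # drop unvoiced frames

            # Intensity track over the final 1000 ms.
            tail_1s = snd.extract_part(from_time=max(0.0, end - 1.0), to_time=end)
            intensity = tail_1s.to_intensity().values.flatten()

            # Harmonicity (HNR) as a voice-quality cue, related to the
            # noise-to-harmonics ratio mentioned above.
            hnr = snd.to_harmonicity().values.flatten()
            hnr = hnr[hnr > -100]  # Praat marks silent frames with large negative values

            return {
                "f0_final_mean": float(np.mean(f0)) if f0.size else 0.0,
                "f0_final_slope": float(f0[-1] - f0[0]) if f0.size > 1 else 0.0,
                "intensity_final_mean": float(np.mean(intensity)),
                "hnr_mean": float(np.mean(hnr)) if hnr.size else 0.0,
            }

        # print(turn_final_features("turn.wav"))  # hypothetical single-turn audio file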

    On the Role of Context and Prosody in the Interpretation of ‘Okay’

    We examine the effect of contextual and acoustic cues in the disambiguation of three discourse-pragmatic functions of the word okay. Results of a perception study show that contextual cues are stronger predictors of discourse function than acoustic cues. However, acoustic features capturing the pitch excursion at the right edge of okay feature prominently in disambiguation, whether or not other contextual cues are present.