3,754 research outputs found

    A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool

In dialogue, speakers produce and perceive acoustic/prosodic turn-taking cues, which are fundamental for negotiating turn exchanges with their interlocutors. However, little is known about the temporal dynamics and cross-linguistic validity of these cues. In this work, we explore a set of acoustic/prosodic cues preceding three turn-transition types (hold, switch and backchannel) in three different languages (Slovak, American English and Argentine Spanish). For this, we use and refine a set of machine learning techniques that enable a finer-grained temporal analysis of such cues, as well as a comparison of their relative explanatory power. Our results suggest that the three languages, despite belonging to distinct linguistic families, share the general usage of a handful of acoustic/prosodic features to signal turn transitions. We conclude that exploiting features such as speech rate, final-word lengthening, the pitch track over the final 200 ms, the intensity track over the final 1000 ms, and noise-to-harmonics ratio (a voice-quality feature) might prove useful for further improving the accuracy of the turn-taking modules found in modern spoken dialogue systems.

    Authors: Brusco, Pablo, and Vidal, Jazmín (Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Computación, and CONICET-UBA Instituto de Investigación en Ciencias de la Computación; Argentina); Beňuš, Štefan (University in Nitra and Slovak Academy of Sciences; Slovakia); Gravano, Agustin (CONICET-UBA Instituto de Investigación en Ciencias de la Computación and Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Computación; Argentina)
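
    As a rough illustration of the features the abstract singles out, the sketch below pulls the final pitch and intensity tracks from a turn recording with librosa. It is a minimal sketch, not the authors' pipeline: the file name is hypothetical, and the noise-to-harmonics proxy is a stand-in for the Praat-style voice-quality measure named in the paper.

```python
# Minimal sketch of two of the cited turn-taking cue features:
# pitch track over the final 200 ms, intensity track over the final 1000 ms.
# "turn_final.wav" is a hypothetical mono recording of one turn.
import librosa
import numpy as np

y, sr = librosa.load("turn_final.wav", sr=16000, mono=True)

def tail(signal, rate, ms):
    """Return the last `ms` milliseconds of the signal."""
    return signal[-int(rate * ms / 1000):]

# Pitch track over the final 200 ms (pYIN; unvoiced frames come back as NaN).
f0, _, _ = librosa.pyin(tail(y, sr, 200),
                        fmin=librosa.note_to_hz("C2"),
                        fmax=librosa.note_to_hz("C6"),
                        sr=sr)

# Intensity (RMS energy) track over the final 1000 ms.
rms = librosa.feature.rms(y=tail(y, sr, 1000))[0]

# Rough noise-to-harmonics proxy via harmonic/percussive separation;
# NOT the paper's Praat-style NHR, only an illustrative stand-in.
harm, perc = librosa.effects.hpss(y)
nhr_proxy = float(np.sum(perc ** 2) / (np.sum(harm ** 2) + 1e-9))

voiced = ~np.isnan(f0)
print({
    "f0_tail_mean_hz": float(np.nanmean(f0)) if voiced.any() else None,
    "f0_tail_slope": (float(np.polyfit(np.flatnonzero(voiced), f0[voiced], 1)[0])
                      if voiced.sum() > 1 else None),
    "rms_tail_mean": float(rms.mean()),
    "nhr_proxy": nhr_proxy,
})
```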

    Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis

Turn-taking is a fundamental aspect of human communication, where speakers convey their intention to either hold or yield their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial and two open-source Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli, or controlling the prosody, we analyze the models' performance. We show that while commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS systems trained on read or spontaneous speech produce strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.

    Comment: Accepted at INTERSPEECH 2023; 5 pages, 2 figures, 4 tables
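
    The evaluation idea can be sketched in a few lines: run a VAP-style model over a synthesized turn plus the following silence, and read off whether the frame-level "current speaker continues" probability signals hold or yield. The sketch below is an assumption-laden illustration: the `vap_model` call, frame rate, and 0.4/0.6 thresholds are placeholders, not the paper's actual setup or the real Voice Activity Projection API.

```python
# Hedged sketch of classifying a synthesized turn end from VAP-style scores.
import numpy as np

def classify_turn_end(p_a_continues: np.ndarray,
                      frame_rate: int = 50,
                      window_s: float = 0.5) -> str:
    """Average the model's hold probability over the window after the
    synthesized turn ends and map it to a turn-taking label."""
    n = int(frame_rate * window_s)
    p = float(np.mean(p_a_continues[-n:]))
    if p > 0.6:
        return "hold"
    if p < 0.4:
        return "yield"
    return "ambiguous"  # weak or contradictory cues, as reported for some TTS

# Hypothetical usage: p_a = vap_model(tts_audio_a, trailing_silence)
p_a = np.clip(np.random.normal(0.3, 0.05, 250), 0.0, 1.0)  # fake scores
print(classify_turn_end(p_a))  # -> likely "yield"
```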

A study of turn-yielding cues in human-computer dialogue

Previous research has made significant advances in understanding how humans manage to engage in smooth, well-coordinated conversation, and has unveiled the existence of several turn-yielding cues: lexico-syntactic, prosodic and acoustic events that may serve as predictors of conversational turn finality. These results have subsequently aided the refinement of the turn-taking proficiency of spoken dialogue systems. In this study, we find empirical evidence in a corpus of human-computer dialogues that human users produce the same kinds of turn-yielding cues that have been observed in human-human interactions. We also show that a linear relation holds between the number of individual cues conjointly displayed and the likelihood of a turn switch.

    Sociedad Argentina de Informática e Investigación Operativa (SADIO)
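
    The reported linear relation is easy to picture with a toy fit: regress the observed fraction of turn switches on the number of cues conjointly displayed. The numbers below are invented for illustration only, not the study's data.

```python
# Toy linear fit of turn-switch likelihood vs. number of conjoint cues.
import numpy as np

cue_counts = np.array([0, 1, 2, 3, 4, 5])
switch_frac = np.array([0.10, 0.25, 0.42, 0.55, 0.71, 0.86])  # hypothetical

slope, intercept = np.polyfit(cue_counts, switch_frac, 1)
r = np.corrcoef(cue_counts, switch_frac)[0, 1]
print(f"P(switch) ~ {intercept:.2f} + {slope:.2f} * n_cues  (r = {r:.3f})")
```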


    Dissociating task difficulty from incongruence in face-voice emotion integration

In the everyday environment, affective information is conveyed by both the face and the voice. Studies have demonstrated that a concurrently presented voice can alter the way that an emotional face expression is perceived, and vice versa, leading to emotional conflict if the information in the two modalities is mismatched. Additionally, evidence suggests that incongruence of emotional valence activates cerebral networks involved in conflict monitoring and resolution. However, it is currently unclear whether this is due to task difficulty—that incongruent stimuli are harder to categorize—or simply to the detection of mismatching information in the two modalities. The aim of the present fMRI study was to examine the neurophysiological correlates of processing incongruent emotional information, independent of task difficulty. Subjects were scanned while judging the emotion of face-voice affective stimuli. Both the face and voice were parametrically morphed between anger and happiness and then paired in all audiovisual combinations, resulting in stimuli each defined by two separate values: the degree of incongruence between the face and voice, and the degree of clarity of the combined face-voice information. Due to the specific morphing procedure utilized, we hypothesized that the clarity value, rather than the incongruence value, would better reflect task difficulty. Behavioral data revealed that participants integrated face and voice affective information, and that the clarity value, as opposed to the incongruence value, correlated with categorization difficulty. Cerebrally, incongruence was associated with activity in the superior temporal region, an effect that emerged after task difficulty had been accounted for. Overall, our results suggest that activation in the superior temporal region in response to incongruent information cannot be explained simply by task difficulty, and may rather be due to detection of mismatching information between the two modalities.
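
    To make the two stimulus dimensions concrete, here is one plausible way to derive them from the morph levels. The exact formulas are an assumption for illustration, not taken from the paper.

```python
# Hypothetical derivation of the two stimulus values from morph levels,
# each running from 0.0 (anger) to 1.0 (happiness).
def stimulus_values(face_morph: float, voice_morph: float):
    incongruence = abs(face_morph - voice_morph)  # mismatch between modalities
    # Clarity: how far the combined evidence sits from the ambiguous
    # midpoint (0.5); higher = easier to categorize.
    clarity = abs((face_morph + voice_morph) / 2 - 0.5) * 2
    return incongruence, clarity

# E.g. a mostly angry face (0.1) with a mostly happy voice (0.9): highly
# incongruent, and the averaged evidence lands at the ambiguous midpoint.
print(stimulus_values(0.1, 0.9))  # -> (0.8, 0.0)
```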

    Next speakers plan their turn early and speak after turn-final ‘go-signals’

In conversation, turn-taking is usually fluid, with next speakers taking their turn right after the end of the previous turn. Most, but not all, previous studies show that next speakers start to plan their turn early, if possible already during the incoming turn. The present study makes use of the list-completion paradigm (Barthel et al., 2016), analyzing speech onset latencies and eye movements of participants in a task-oriented dialogue with a confederate. The measures are used to disentangle two contributions to the timing of turn-taking: early planning of content on the one hand, and initiation of articulation as a reaction to the upcoming turn-end on the other. Participants named objects visible on their computer screen in response to utterances that did, or did not, contain lexical and prosodic cues to the end of the incoming turn. In the presence of an early lexical cue, participants showed earlier gaze shifts toward the target objects and responded faster than in its absence, whereas the presence of a late intonational cue only led to faster response times and did not affect the timing of participants' eye movements. The results show that with a combination of eye-movement and turn-transition time measures it is possible to tease apart the effects of early planning and response initiation on turn timing. They are consistent with models of turn-taking that assume that next speakers (a) start planning their response as soon as the incoming turn's message can be understood, and (b) monitor the incoming turn for cues to turn-completion so as to initiate their response when turn-transition becomes relevant.
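
    The logic of the dissociation can be sketched as a toy analysis: gaze-shift latency indexes early planning, while speech-onset latency indexes response launch, so an early lexical cue should shift both measures and a late intonational cue only the latter. The data frame and all numbers below are invented for illustration, not the study's results.

```python
# Toy condition-mean table mirroring the reported pattern of effects.
import pandas as pd

trials = pd.DataFrame({
    "lexical_cue":     [True, True, False, False],
    "prosodic_cue":    [True, False, True, False],
    "gaze_shift_ms":   [420, 420, 630, 630],   # hypothetical means
    "speech_onset_ms": [180, 260, 210, 330],
})

# Lexical cue: earlier gaze shifts AND faster speech onsets.
print(trials.groupby("lexical_cue")[["gaze_shift_ms", "speech_onset_ms"]].mean())
# Prosodic (late intonational) cue: faster onsets only, gaze unchanged.
print(trials.groupby("prosodic_cue")[["gaze_shift_ms", "speech_onset_ms"]].mean())
```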