
    On the acoustics of overlapping laughter in conversational speech

    The social nature of laughter invites people to laugh together. This joint vocal action often results in overlapping laughter. In this paper, we show that the acoustics of overlapping laughs differ from those of non-overlapping laughs. We found that overlapping laughs are more strongly marked prosodically than non-overlapping ones, showing higher values for duration, mean F0, mean and maximum intensity, and amount of voicing. This effect is intensified by the number of people joining in the laughter event, which suggests that entrainment is at work. We also found that group size affects the number of overlapping laughs, which illustrates the contagious nature of laughter. Finally, people appear to join an ongoing laugh at a delay of approximately 500 ms; a delay that must be considered when developing spoken dialogue systems that are able to respond to users’ laughs.
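
    As a rough illustration of how such prosodic measures could be obtained for a single laughter segment, the sketch below uses the Praat-based parselmouth library; the file name laugh.wav and the exact feature set are illustrative assumptions, not the paper's actual extraction pipeline.

        import parselmouth  # Python bindings to Praat (assumed tooling, not the paper's)

        snd = parselmouth.Sound("laugh.wav")       # hypothetical laughter segment
        pitch = snd.to_pitch()                     # frame-wise F0 estimates
        intensity = snd.to_intensity()             # frame-wise intensity contour (dB)

        f0 = pitch.selected_array["frequency"]     # 0.0 marks unvoiced frames
        voiced = f0 > 0

        features = {
            "duration_s": snd.get_total_duration(),
            "mean_f0_hz": float(f0[voiced].mean()) if voiced.any() else 0.0,
            "mean_intensity_db": float(intensity.values.mean()),
            "max_intensity_db": float(intensity.values.max()),
            "voiced_fraction": float(voiced.mean()),  # proxy for the amount of voicing
        }
        print(features)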

    Comparing non-verbal vocalisations in conversational speech corpora

    Conversations consist not only of spoken words but also of non-verbal vocalisations. Since there is no standard for defining and classifying (possible) non-speech sounds, the annotations of these vocalisations differ considerably across corpora of conversational speech. The six inspected corpora do seem to agree that hesitation sounds and feedback vocalisations are treated as words (without a standard orthography). The most frequent non-verbal vocalisations are laughter on the one hand and, if counted as a vocal sound, breathing noises on the other.

    Acoustic, Morphological, and Functional Aspects of `yeah/ja' in Dutch, English and German

    We explore different forms and functions of one of the most common feedback expressions in Dutch, English, and German, namely `yeah/ja', which is known for its multi-functionality and ambiguous usage in dialog. For example, it can be used as a yes-answer, as a pure continuer, or as a way to show agreement. In addition, `yeah/ja' can occur on its own, but it can also combine with other particles to form multi-word expressions, especially in Dutch and German. We found substantial differences at the morpho-lexical level between the three related languages, which add to the ambiguous character of `yeah/ja'. An exploratory analysis of the prosodic features of `yeah/ja' showed that, across the inspected languages, mainly higher intensity is used to signal speaker incipiency.

    Classification of cooperative and competitive overlaps in speech using cues from the context, overlapper, and overlappee

    One of the major properties of overlapping speech is that it can be perceived as competitive or cooperative. For the development of real-time spoken dialog systems and the analysis of affective and social human behavior in conversations, it is important to (automatically) distinguish between these two types of overlap. We investigate acoustic characteristics of cooperative and competitive overlaps with the aim of developing automatic classifiers for these overlaps. In addition to acoustic features, we also use information from gaze and head-movement annotations. The contexts preceding and during the overlap are taken into account, as well as the behavior of both the overlapper and the overlappee. We compare various feature sets in classification experiments performed on the AMI corpus. The best performances lie around 27%–30% equal error rate (EER).
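
    The reported numbers are equal error rates; a minimal sketch of how an EER could be computed from a classifier's scores is shown below, assuming scikit-learn and NumPy. The labels and scores are toy values, not data from the AMI experiments.

        import numpy as np
        from sklearn.metrics import roc_curve

        def equal_error_rate(labels, scores):
            """EER: the operating point where false positive and false negative rates are equal."""
            fpr, tpr, _ = roc_curve(labels, scores)
            fnr = 1.0 - tpr
            idx = np.nanargmin(np.abs(fpr - fnr))
            return (fpr[idx] + fnr[idx]) / 2.0

        # toy scores for competitive (1) vs. cooperative (0) overlaps
        y = np.array([0, 0, 1, 1, 0, 1])
        s = np.array([0.20, 0.40, 0.70, 0.30, 0.10, 0.90])
        print(f"EER = {equal_error_rate(y, s):.2%}")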

    Backchannels: Quantity, Type and Timing Matters

    In a perception experiment, we systematically varied the quantity, type, and timing of backchannels. Participants viewed stimuli of a real speaker side-by-side with an animated listener and rated how human-like they perceived the latter's backchannel behavior to be. In addition, we obtained measures of appropriateness and optionality for each backchannel from keystrokes. This approach allowed us to analyze the influence of each factor on entire fragments and on individual backchannels. The originally performed type and timing of a backchannel appeared more human-like than a switched type or random timing. In addition, we found that nods are more often appropriate than vocalizations. For quantity, too few or too many backchannels per minute appeared to reduce the quality of the behavior. These findings are important for the design of algorithms for the automatic generation of backchannel behavior for artificial listeners.

    Towards Speech Emotion Recognition "in the wild" using Aggregated Corpora and Deep Multi-Task Learning

    One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". Compared to state-of-the-art Single-Task Learning (STL) methods, our proposed MTL method improved performance significantly. In particular, models using both gender and naturalness achieved larger gains than those using either gender or naturalness alone. This benefit was also visible in the high-level representations of the feature space obtained with our method, where discriminative emotional clusters could be observed. (Published in the proceedings of INTERSPEECH, Stockholm, September 2017.)
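
    A minimal sketch of the multi-task idea, assuming Keras/TensorFlow: one shared network with an emotion output plus gender and naturalness outputs as auxiliary tasks. The input dimensionality, number of emotion classes, layer sizes, and loss weights are illustrative assumptions, not the configuration used in the paper.

        import tensorflow as tf
        from tensorflow.keras import layers

        inputs = tf.keras.Input(shape=(88,))                 # e.g. an utterance-level acoustic feature vector
        x = layers.Dense(256, activation="relu")(inputs)
        x = layers.Dense(256, activation="relu")(x)          # shared representation

        emotion = layers.Dense(4, activation="softmax", name="emotion")(x)          # main task
        gender = layers.Dense(1, activation="sigmoid", name="gender")(x)            # auxiliary task
        naturalness = layers.Dense(1, activation="sigmoid", name="naturalness")(x)  # auxiliary task

        model = tf.keras.Model(inputs, [emotion, gender, naturalness])
        model.compile(
            optimizer="adam",
            loss={"emotion": "sparse_categorical_crossentropy",
                  "gender": "binary_crossentropy",
                  "naturalness": "binary_crossentropy"},
            loss_weights={"emotion": 1.0, "gender": 0.3, "naturalness": 0.3},
        )
        model.summary()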

    Exploring sequences of speech and laughter activity using visualisations of conversations

    In this study, we analysed laughter in dyadic conversational interaction. We attempted to categorise patterns of speaking and laughing activity in conversation in order to gain more insight into how speaking and laughing are timed and related to each other. Special attention was paid to a particular sequencing of speech and laughter activity that is intended to invite an interlocutor to laugh (i.e. the ‘invitation-acceptance’ scheme): the speaker invites the listener to laugh by producing a laugh after his/her own utterance, indicating that it is appropriate to laugh. We explored these kinds of sequences through visualisations of speech and laughter activity generated from manual transcriptions of the HCRC Map Task corpus. Using these visualisations, we found that people indeed show a tendency to adhere to the ‘invitation-acceptance’ scheme and that they tend to ‘wait’ to be invited to a shared laughter event rather than to ‘anticipate’ it. These speech-and-laughter-activity plots have proven helpful in analysing the interplay between laughing and speaking in conversation and can be used as a tool to sharpen the researcher’s intuition in under-researched areas.
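
    This kind of visualisation can be sketched with a few lines of matplotlib; the intervals below are invented for illustration and are not drawn from the HCRC Map Task corpus.

        import matplotlib.pyplot as plt

        # hypothetical (start, duration) intervals in seconds for one dyad
        speech = {"A": [(0.0, 2.1), (3.4, 1.5)], "B": [(2.2, 0.9), (5.1, 1.2)]}
        laughter = {"A": [(2.1, 0.6)], "B": [(2.6, 0.7)]}   # B joins A's laugh -> shared laughter

        fig, ax = plt.subplots(figsize=(8, 2))
        for row, spk in enumerate(["A", "B"]):
            ax.broken_barh(speech[spk], (row + 0.1, 0.35), facecolors="steelblue",
                           label="speech" if row == 0 else None)
            ax.broken_barh(laughter[spk], (row + 0.55, 0.35), facecolors="orange",
                           label="laughter" if row == 0 else None)
        ax.set_yticks([0.5, 1.5])
        ax.set_yticklabels(["speaker A", "speaker B"])
        ax.set_xlabel("time (s)")
        ax.legend(loc="upper right")
        plt.tight_layout()
        plt.show()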

    Detection of nonverbal vocalizations using Gaussian Mixture Models: looking for fillers and laughter in conversational speech

    In this paper, we analyze acoustic profiles of fillers (i.e. filled pauses, FPs) and laughter with the aim of automatically localizing these nonverbal vocalizations in a stream of audio. Among other features, we use voice quality features to capture the distinctive production modes of laughter, and spectral similarity measures to capture the stability of the oral tract that is characteristic of FPs. Classification experiments with Gaussian Mixture Models and various sets of features are performed. We find that Mel-Frequency Cepstrum Coefficients perform relatively well compared to other features, for both FPs and laughter. To address the large variation in the frame-wise decision scores (e.g., log-likelihood ratios) observed over sequences of frames, we apply a median filter to these scores, which yields large performance improvements. Our analyses and results are presented within the framework of this year’s Interspeech Computational Paralinguistics sub-Challenge on Social Signals.
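
    A minimal sketch of this kind of detector, assuming librosa, scikit-learn, and SciPy: one GMM for the target vocalisation and one for background speech, frame-wise log-likelihood ratios, and a median filter over those scores. File names, model sizes, and the decision threshold are illustrative assumptions, not the paper's setup.

        import librosa
        from scipy.signal import medfilt
        from sklearn.mixture import GaussianMixture

        def mfcc_frames(path):
            """Return a (frames, 13) matrix of MFCCs for one audio file."""
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

        # hypothetical training material: frames from laughter and from all other speech
        laughter_gmm = GaussianMixture(n_components=16).fit(mfcc_frames("laughter_train.wav"))
        background_gmm = GaussianMixture(n_components=16).fit(mfcc_frames("background_train.wav"))

        # frame-wise log-likelihood ratios for a test recording, smoothed with a median
        # filter to suppress spurious frame-level decisions
        test = mfcc_frames("test.wav")
        llr = laughter_gmm.score_samples(test) - background_gmm.score_samples(test)
        smoothed = medfilt(llr, kernel_size=11)
        detected = smoothed > 0.0   # frames flagged as laughter (threshold is illustrative)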

    A Multimodal Analysis of Vocal and Visual Backchannels in Spontaneous Dialogs

    Backchannels (BCs) are short vocal and visual listener responses that signal attention, interest, and understanding to the speaker. Previous studies have investigated BC prediction from prosodic cues in telephone-style dialogs. In contrast, we consider spontaneous face-to-face dialogs. The additional visual modality allows speaker and listener to monitor each other's attention continuously, and we hypothesize that this affects the BC-inviting cues. In this study, we investigate how gaze, in addition to prosody, can cue BCs. Moreover, we focus on the type of BC performed, with the aim of finding out whether vocal and visual BCs are invited by similar cues. In contrast to telephone-style dialogs, we do not find rising/falling pitch to be a BC-inviting cue; in a face-to-face setting, however, gaze appears to cue BCs. In addition, we find that mutual gaze occurs significantly more often during visual BCs. Moreover, vocal BCs are more likely to be timed during pauses in the speaker's speech.