202 research outputs found

    Hesitations in Spoken Dialogue Systems

    Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: Universität Bielefeld; 2020

    Prediction and Realisation of Conversational Characteristics by Utilising Spontaneous Speech for Unit Selection

    Unit selection speech synthesis has reached high levels of naturalness and intelligibility for neutral read-aloud speech. However, synthetic speech generated from neutral read-aloud data lacks the attitude, intention and spontaneity associated with everyday conversations. Unit selection is heavily data-dependent, so in order to simulate human conversational speech, or to create synthetic voices for believable virtual characters, we need to utilise speech data with examples of how people talk rather than how people read. In this paper we included carefully selected utterances from spontaneous conversational speech in a unit selection voice. Using this voice, and by automatically predicting the type and placement of lexical fillers and filled pauses, we can synthesise utterances with conversational characteristics. A perceptual listening test showed that it is possible to make synthetic speech sound more conversational without degrading naturalness.

    Fillers in Spoken Language Understanding: Computational and Psycholinguistic Perspectives

    Disfluencies (i.e. interruptions in the regular flow of speech) are ubiquitous in spoken discourse. Fillers ("uh", "um") are the most frequent kind of disfluency. Yet, to the best of our knowledge, there isn't a resource that brings together the research perspectives influencing Spoken Language Understanding (SLU) on these speech events. The aim of this article is to synthesise a breadth of perspectives in a holistic way: from the underlying (psycho)linguistic theory, to their annotation and consideration in Automatic Speech Recognition (ASR) and SLU systems, and lastly to their study from a generation standpoint. This article aims to present the perspectives in a way that is approachable for the SLU and Conversational AI community, and to discuss what we believe are the trends and challenges in each area going forward. Comment: To appear in TAL Journal

    Revisiting the Status of Speech Rhythm

    Text-to-Speech synthesis offers an interesting way of synthesising various knowledge components related to speech production. To a certain extent, it provides a new way of testing the coherence of our understanding of speech production in a highly systematic manner. For example, speech rhythm and the temporal organisation of speech have to be well captured in order to mimic a speaker correctly. The simulation approach used in our laboratory for two languages supports our original hypothesis of multidimensionality and non-linearity in the production of speech rhythm. This paper presents an overview of our approach to this issue, as it has been developed over recent years. We conceive the production of speech rhythm as a multidimensional task, and the temporal organisation of speech as a key component of this task (i.e., the establishment of temporal boundaries and durations). As a result of this multidimensionality, text-to-speech systems have to accommodate a number of systematic transformations and computations at various levels. Our model of the temporal organisation of read speech in French and German emerges from a combination of quantitative and qualitative parameters, organised according to psycholinguistic and linguistic structures. (An ideal speech synthesiser would also take into account subphonemic as well as pragmatic parameters. However, such systems are not yet available.)

    Disfluent Lengthening in Spontaneous Speech

    Betz S, Wagner P. Disfluent Lengthening in Spontaneous Speech. In: Jokisch O, ed. Elektronische Sprachsignalverarbeitung (ESSV) 2016. Studientexte zur Sprachkommunikation. Dresden: TUD Press; 2016

    Increasing Recall of Lengthening Detection via Semi-Automatic Classification

    Betz S, Voße J, Zarrieß S, Wagner P. Increasing Recall of Lengthening Detection via Semi-Automatic Classification. In: Proceedings of Interspeech. 2017: 1084-1088

    Modular Synthesis of Disfluencies for Conversational Speech Systems

    Betz S, Wagner P, Schlangen D. Modular Synthesis of Disfluencies for Conversational Speech Systems. Presented at the ESSV 2015, Eichstätt. It has been shown that dialogue systems benefit from incremental architectures to produce fast responses and to interact with the interlocutor in a more human-like way. The advantage of quick responses comes with the disadvantage of running out of things to say for a while. On such occasions, humans tend to produce disfluencies as a listener-oriented strategy to signal the ongoing production process and to buy time for finalizing the turn. Introducing disfluency capabilities into the speech synthesis module of a dialogue system may therefore be a straightforward strategy towards conversational speech systems. Disfluencies are a very complex matter: they can take various chained and nested forms in human communication. We do not attempt to equip our system with the full range of possible disfluent time-buying strategies found in human interaction. For a first perceptual evaluation of the most suitable synthetic disfluency strategy to be integrated into the dialogue system, we focus on three structural factors that are able to cover a wide range of attested disfluency patterns: lengthening, word cutoffs and pauses. This leads to several different configurations a disfluent sentence can take. Sentences from a spontaneous speech corpus were resynthesized in all possible configurations using Mary TTS. In order to identify euphonious configurations, these stimuli were then presented to test subjects in a perception test.

    Micro-Structure of Disfluencies: Basics for Conversational Speech Synthesis

    Betz S, Wagner P, Schlangen D. Micro-Structure of Disfluencies: Basics for Conversational Speech Synthesis. In: Interspeech 2015. 2015: 2222-2226

    The re-adjustment of word-fragments in spontaneous spoken French

    A study of word-fragments in spoken French has been under way for several years, based on corpora of non-directive talks recorded and transcribed according to the conventions of GARS (currently DELIC). These disfluencies are often analyzed within the framework of disfluent repetitions. The observations made on these two types of disfluencies led us to distinguish them. The aim of our study is, on the one hand, to describe the insertions that occur in relation to word interruptions and their re-adjustment and, on the other hand, to specify the types and locations of the retracing that follows these interruptions. Two kinds of incidental clauses were observed at the time of the readjustments that follow these disturbances. Some (the more numerous) are syntactically linked to the fragment or to its retracing; others are not. Moreover, the word-fragments that will be modified are the only ones that depend on the type of localization. For the others, this localization does not make it possible to predict the category of interruption (complemented or unfinished). Our results on word-fragments confirm, however, that in contemporary French, retracing to the head of the nominal or verbal group that contains the disfluency remains the simplest and, at the same time, the most frequent case [5]. Nevertheless, a third of the retracings either do not go back to the beginning of the group or go beyond it.

    Artificial Personality and Disfluency

    The focus of this paper is artificial voices with different personalities. Previous studies have shown links between an individual’s use of disfluencies in their speech and their perceived personality. Here, filled pauses (uh and um) and discourse markers (like, you know, I mean) have been included in synthetic speech as a way of creating an artificial voice with different personalities. We discuss the automatic insertion of filled pauses and discourse markers (i.e., fillers) into otherwise fluent texts. The automatic system is compared to a ground truth of human “acted” filler insertion. Perceived personality (as defined by the big five personality dimensions) of the synthetic speech is assessed by means of a standardised questionnaire. Synthesis without fillers is compared to synthesis with either spontaneous or synthetic fillers. Our findings explore how the inclusion of disfluencies influences the way in which subjects rate the perceived personality of an artificial voice. Index Terms: artificial personality, TTS, disfluency