3 research outputs found

    Strategies for developing a conversational speech dataset for Text-To-Speech Synthesis

    Funding Information: The first author has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska Curie grant agreement No 859588. The authors are thankful to Maaike Groenewege, Johannah O'Mahony and ReadSpeaker's R&D team whose suggestions and discussions have been instrumental in shaping the direction of this paper.
    Publisher Copyright: Copyright © 2022 ISCA.
    There have been many efforts to improve the quality of speech synthesis systems in conversational AI. Although state-of-the-art systems are capable of producing natural-sounding speech, the generated speech often lacks prosodic variation and is not always suited to the task. In this paper, we examine methods for collecting dialogue data to use as training data for our acoustic models. We collect speech using three different setups: (1) random read-aloud sentences; (2) performed dialogues; (3) semi-spontaneous dialogues. We analyze prosodic and textual properties of the data collected in these setups and make some recommendations for collecting data for speech synthesis in conversational AI settings.
    Peer reviewed.

    A Hierarchical Predictive Processing Approach to Modelling Prosody

    Prosodic patterns, and linguistic structures in general, are hierarchical in nature, providing an efficient means of encoding information in temporally constrained situations where communicative events occur. However, there are no theoretical frameworks capable of representing the full extent of linguistic behaviour in a cohesive way that could capture the paradigmatic and syntagmatic links between the organizational levels present in everyday speech. Here we propose a novel theoretical and modelling account of the perception and production of prosodic patterns in speech communication. It is derived from the influential Predictive Processing theory of the neural implementation of perception and action, which is based on a hierarchical system of generative models producing progressively more detailed probabilistic predictions of future events. The framework provides a conceptualization of the hierarchical organization of speech prosody as well as a principled way of unifying speech perception and production by postulating a single processing hierarchy shared by both modalities. We discuss the possible implications of the theory for prosodic analysis of speech communication, including conversational settings. In addition, we outline a viable computational implementation in the form of a machine learning architecture that can be used as a testbed for generating and evaluating predictions brought forth by the theory.
    Peer reviewed.

    Annotation of Communicative Functions of Short Feedback Tokens in Switchboard

    There has been a lot of work on predicting the timing of feedback in conversational systems. However, there has been less focus on predicting the prosody and lexical form of feedback given their communicative function. Therefore, in this paper we present our preliminary annotations of the communicative functions of 1627 short feedback tokens from the Switchboard corpus and an analysis of their lexical realizations and prosodic characteristics. Since there is no standard scheme for annotating the communicative function of feedback, we propose our own annotation scheme. Although our work is ongoing, our preliminary analysis revealed that lexical tokens such as "yeah" are ambiguous, and therefore lexical forms alone are not indicative of the function. Both the lexical form and prosodic characteristics need to be taken into account in order to predict the communicative function. We also found that feedback functions have distinguishable prosodic characteristics in terms of duration, mean pitch, pitch slope, and pitch range.