618 research outputs found

    Tacotron: Towards End-to-End Speech Synthesis

    A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods. Comment: Submitted to Interspeech 2017. v2 changed paper title to be consistent with our conference submission (no content change other than typo fixes).

    Current trends in multilingual speech processing

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

    Development of the Slovak HMM-Based TTS System and Evaluation of Voices in Respect to the Used Vocoding Techniques

    This paper describes the development of a Slovak text-to-speech system which applies a technique wherein speech is directly synthesized from hidden Markov models. Statistical models for Slovak speech units are trained by using the newly created female and male phonetically balanced speech corpora. In addition, contextual information about phonemes, syllables, words, phrases, and utterances was determined, as well as questions for decision tree-based context clustering algorithms. In this paper, recent statistical parametric speech synthesis methods including the conventional, STRAIGHT and AHOcoder speech synthesis systems are implemented and evaluated. Objective evaluation methods (mel-cepstral distortion and fundamental frequency comparison) and subjective ones (mean opinion score and semantically unpredictable sentences test) are carried out to compare these systems with each other and to evaluate their overall quality. The result of this work is a set of text-to-speech systems for the Slovak language which are characterized by very good intelligibility and quite good naturalness of the output utterances. In the subjective tests of intelligibility, the STRAIGHT-based female voice and the AHOcoder-based male voice reached the highest scores.