    Neural Speech Synthesis with Transformer Network

    Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by the self-attention mechanism, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder that outputs the final audio. Experiments were conducted to test the efficiency and performance of the new network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that the proposed model achieves state-of-the-art quality (outperforming Tacotron2 by a gap of 0.048) and comes very close to human quality (4.39 vs. 4.44 in MOS).
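
    A minimal PyTorch sketch of the core idea the abstract describes: replacing recurrent layers with multi-head self-attention so all encoder hidden states are computed in parallel. The dimensions, layer counts, and class name here are illustrative assumptions, not the paper's configuration, and positional encodings are omitted for brevity (a real Transformer TTS encoder needs them).

```python
# Sketch: multi-head self-attention encoder over a phoneme sequence,
# computing all hidden states in parallel (no recurrence). Sizes are
# illustrative, not the paper's config; positional encodings omitted.
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phoneme_ids):
        # (batch, seq) -> (batch, seq, d_model); every position attends
        # to every other position directly, so distant phonemes are
        # linked in one step rather than through many RNN steps.
        return self.encoder(self.embed(phoneme_ids))

enc = PhonemeEncoder()
hidden = enc(torch.randint(0, 100, (2, 50)))  # two 50-phoneme inputs
print(hidden.shape)  # torch.Size([2, 50, 256])
```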

    The interaction effect of pronunciation and lexicogrammar on comprehensibility: a case of Mandarin-accented English

    Scholars have argued that comprehensibility (i.e., ease of understanding), not nativelike performance, should be prioritized in second language learning, which has inspired numerous studies exploring factors that affect comprehensibility. However, most of these studies did not consider potential interaction effects among these factors, resulting in a limited understanding of comprehensibility and less precise implications. This study investigates how pronunciation and lexicogrammar influence the comprehensibility of Mandarin-accented English. A total of 687 listeners were randomly allocated to six groups and rated (a) one baseline recording and (b) one of six experimental recordings for comprehensibility on a 9-point scale. The baseline recording, a 60-second spontaneous speech sample by an L1 English speaker with an American accent, was the same across groups. The six 75-second experimental recordings were identical in content but differed in (a) the speaker's degree of foreign accent (American, moderate Mandarin, and heavy Mandarin) and (b) lexicogrammar (with errors vs. without errors). The study found that pronunciation and lexicogrammar interacted to influence comprehensibility: whether pronunciation affected comprehensibility depended on the speaker's lexicogrammar, and vice versa. The results have implications for theory-building to refine comprehensibility, as well as for pedagogy and testing priorities.
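
    The interaction effect the study reports can be illustrated with a small Python sketch that fits a two-way ANOVA with an accent-by-lexicogrammar interaction term. The data, group means, and variable names below are simulated assumptions for illustration only, not the study's data or analysis.

```python
# Hypothetical sketch: testing a pronunciation x lexicogrammar
# interaction on 9-point comprehensibility ratings with a two-way
# ANOVA. All data here are simulated, not the study's.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for accent, base in [("american", 8.0), ("moderate", 6.5), ("heavy", 5.0)]:
    for errors in ["with", "without"]:
        # Larger penalty when errors co-occur with a heavy accent,
        # i.e., a built-in interaction for the model to detect.
        penalty = 1.5 if (errors == "with" and accent == "heavy") else 0.5
        ratings = rng.normal(base - penalty, 1.0, size=100).clip(1, 9)
        rows += [{"accent": accent, "errors": errors, "rating": r}
                 for r in ratings]
df = pd.DataFrame(rows)

model = smf.ols("rating ~ C(accent) * C(errors)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # see row C(accent):C(errors)
```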

    DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin

    While the performance of cross-lingual TTS based on monolingual corpora has improved significantly in recent years, generated cross-lingual speech still suffers from the foreign accent problem, limiting its naturalness. Moreover, current cross-lingual methods do not model emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder conditioned on the emotion embedding. To address the weakened emotional expressiveness caused by speaker disentanglement in the emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn a speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling of speaker and emotion in the reverse diffusion process, further improving emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the effectiveness of OP-EDM in learning a speaker-irrelevant but emotion-discriminative embedding. (Accepted by TASL.)
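
    The disentangling idea behind OP-EDM, removing the speaker component from an emotion embedding while keeping the part orthogonal to speaker identity, can be illustrated with a plain orthogonal projection. This is a generic sketch of that operation under assumed embedding shapes, not the paper's actual module.

```python
# Generic sketch of orthogonal-projection disentangling: subtract the
# component of an emotion embedding that lies along a speaker
# embedding, keeping the residual orthogonal to speaker identity.
# Illustrates the projection idea only, not the paper's OP-EDM.
import torch

def project_out(emotion, speaker, eps=1e-8):
    # Projection of `emotion` onto `speaker`: (e . s / ||s||^2) * s
    coeff = (emotion * speaker).sum(-1, keepdim=True)
    coeff = coeff / (speaker.pow(2).sum(-1, keepdim=True) + eps)
    return emotion - coeff * speaker  # speaker-orthogonal residual

e = torch.randn(4, 128)   # batch of emotion embeddings (assumed size)
s = torch.randn(4, 128)   # matching speaker embeddings
e_orth = project_out(e, s)
print((e_orth * s).sum(-1))  # ~0: no speaker component remains
```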

    Language-specific Acoustic Boundary Learning for Mandarin-English Code-switching Speech Recognition

    Code-switching speech recognition (CSSR) transcribes speech that switches between multiple languages or dialects within a single sentence. The main challenge in this task is that different languages often have similar pronunciations, making it difficult for models to distinguish between them. In this paper, we propose a method that approaches the CSSR task from the perspective of language-specific acoustic boundary learning. We introduce language-specific weight estimators (LSWE) to model acoustic boundary learning for each language separately. Additionally, a non-autoregressive (NAR) decoder and a language change detection (LCD) module are employed to assist training. Evaluated on the SEAME corpus, our method achieves state-of-the-art mixed error rates (MER) of 16.29% and 22.81% on the test_man and test_sge sets, respectively. We also demonstrate the effectiveness of our method on a 9000-hour in-house meeting code-switching dataset, where it achieves a relative MER reduction of 7.9%.
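
    One generic way to realize "a separate estimator per language" over shared acoustic features is to run parallel per-language heads and combine their outputs with a frame-level language posterior. The sketch below shows that pattern only; the module layout, sizes, and names are hypothetical and should not be read as the paper's LSWE design.

```python
# Hypothetical sketch of per-language weight estimation over shared
# acoustic frames: one small head per language produces frame weights,
# mixed by a frame-level language posterior. A generic illustration,
# not the paper's LSWE.
import torch
import torch.nn as nn

class PerLanguageWeights(nn.Module):
    def __init__(self, d_model=256, n_langs=2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, 1) for _ in range(n_langs)])
        self.lang_clf = nn.Linear(d_model, n_langs)

    def forward(self, frames):  # (batch, time, d_model)
        lang_post = self.lang_clf(frames).softmax(-1)       # (B, T, L)
        w = torch.cat([h(frames) for h in self.heads], -1)  # (B, T, L)
        # Convex combination of per-language sigmoid weights.
        return (lang_post * w.sigmoid()).sum(-1)            # (B, T)

m = PerLanguageWeights()
weights = m(torch.randn(2, 100, 256))  # per-frame weights in [0, 1]
print(weights.shape)  # torch.Size([2, 100])
```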