Neural Speech Synthesis with Transformer Network
Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have
been proposed and achieve state-of-the-art performance, they still suffer from
two problems: 1) low efficiency during training and inference; and 2) difficulty
modeling long-range dependencies with current recurrent neural networks (RNNs).
Inspired by the
success of Transformer network in neural machine translation (NMT), in this
paper, we introduce and adapt the multi-head attention mechanism to replace the
RNN structures and also the original attention mechanism in Tacotron2. With the
help of multi-head self-attention, the hidden states in the encoder and decoder
are constructed in parallel, which improves the training efficiency. Meanwhile,
any two inputs at different times are connected directly by self-attention
mechanism, which solves the long range dependency problem effectively. Using
phoneme sequences as input, our Transformer TTS network generates mel
spectrograms, followed by a WaveNet vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new
network. In terms of efficiency, our Transformer TTS network speeds up training
by about 4.25 times compared with Tacotron2. In terms of performance, rigorous
human tests show that our proposed model achieves state-of-the-art performance
(outperforming Tacotron2 by a gap of 0.048) and is very close to human quality
(4.39 vs. 4.44 in MOS).
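The parallelism claim above can be sketched with a minimal NumPy multi-head self-attention (illustrative only; the weights below are random stand-ins for learned parameters, not the paper's code): all positions' hidden states come out of a single matrix product rather than a step-by-step recurrence, and the score matrix connects any two positions directly.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, num_heads):
    """x: (seq_len, d_model). Every position is processed in one matmul,
    so hidden states are built in parallel, and position i attends to
    position j directly through the (seq_len, seq_len) score matrix."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project and split into heads: (num_heads, seq_len, d_head).
    q = (x @ wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # all pairs at once
    out = softmax(scores) @ v
    # Concatenate heads back to (seq_len, d_model).
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.standard_normal((seq_len, d_model))
wq, wk, wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))
y = multi_head_self_attention(x, wq, wk, wv, num_heads)
print(y.shape)  # (5, 8)
```

The contrast with an RNN is in that single `scores` product: an RNN would need `seq_len` sequential steps, and information between distant positions would have to pass through every intermediate state.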
The interaction effect of pronunciation and lexicogrammar on comprehensibility: a case of Mandarin-accented English
Scholars have argued that comprehensibility (i.e., ease of understanding), not nativelike performance, should be prioritized in second language learning, which has inspired numerous studies exploring factors affecting comprehensibility. However, most of these studies did not consider potential interaction effects among these factors, resulting in a limited understanding of comprehensibility and less precise implications. This study investigates how pronunciation and lexicogrammar influence the comprehensibility of Mandarin-accented English. A total of 687 listeners were randomly allocated into six groups and rated (a) one baseline and (b) one of six experimental recordings for comprehensibility on a 9-point scale. The baseline recording, 60 seconds of spontaneous speech by an L1 English speaker with an American accent, was the same across groups. The six 75-second experimental recordings were identical in content but differed in (a) the speaker's degree of foreign accent (American, moderate Mandarin, and heavy Mandarin) and (b) lexicogrammar (with errors vs. without errors). The study found that pronunciation and lexicogrammar interacted to influence comprehensibility. That is, whether pronunciation affected comprehensibility depended on speakers' lexicogrammar, and vice versa. The results have implications for theory-building to refine comprehensibility, as well as for pedagogy and testing priorities.
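An interaction effect of the kind this study reports can be illustrated with a toy difference-in-differences check on cell means. The numbers below are invented for illustration (not the study's data): the size of the accent effect on the 9-point rating changes depending on lexicogrammar.

```python
import numpy as np

# Hypothetical cell means on the 9-point comprehensibility scale.
# Rows: accent (American, moderate Mandarin, heavy Mandarin)
# Cols: lexicogrammar (no errors, errors)
means = np.array([[7.8, 7.5],
                  [7.2, 5.9],
                  [6.0, 4.6]])

# Accent effect (American minus heavy Mandarin) within each lexicogrammar level:
accent_effect_no_err = means[0, 0] - means[2, 0]
accent_effect_err = means[0, 1] - means[2, 1]

# A nonzero difference-in-differences is the signature of an interaction:
# the effect of one factor depends on the level of the other.
interaction = accent_effect_err - accent_effect_no_err
print(round(interaction, 2))  # 1.1
```

With no interaction, the two accent effects would be equal and the difference would be zero; in the study's data the corresponding test would be a two-way ANOVA interaction term.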
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin
While the performance of cross-lingual TTS based on monolingual corpora has
been significantly improved recently, generating cross-lingual speech still
suffers from the foreign accent problem, leading to limited naturalness.
Moreover, current cross-lingual methods ignore emotion modeling, which is
indispensable paralinguistic information in speech delivery. In this paper, we
propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer
method that can transfer emotion from a source speaker to the intra- and
cross-lingual target speakers. Specifically, to relieve the foreign accent
problem while improving the emotion expressiveness, the terminal distribution
of the forward diffusion process is parameterized into a speaker-irrelevant but
emotion-related linguistic prior by a prior text encoder with the emotion
embedding as a condition. To address the weaker emotional expressiveness
problem caused by speaker disentanglement in emotion embedding, a novel
orthogonal projection based emotion disentangling module (OP-EDM) is proposed
to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover,
a condition-enhanced DPM decoder is introduced to strengthen the modeling
ability of the speaker and the emotion in the reverse diffusion process to
further improve emotion expressiveness in speech delivery. Cross-lingual
emotion transfer experiments show the superiority of DiCLET-TTS over various
competitive models and the good design of OP-EDM in learning speaker-irrelevant
but emotion-discriminative embedding.
Comment: accepted by TASL
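The core idea behind an orthogonal-projection disentangling step can be sketched in a few lines. This is a generic illustration, not the paper's OP-EDM (which is learned); it shows only the projection itself: removing the component of an embedding that lies along a speaker direction, leaving the speaker-irrelevant part.

```python
import numpy as np

def project_out(e, s):
    """Remove from embedding e its component along speaker embedding s,
    keeping the part orthogonal to s (the speaker-irrelevant residue)."""
    s_unit = s / np.linalg.norm(s)
    return e - (e @ s_unit) * s_unit

# Toy 2-D example: e has a speaker-aligned component of 3 along s.
e = np.array([3.0, 4.0])
s = np.array([1.0, 0.0])
e_emotion = project_out(e, s)
print(e_emotion)  # [0. 4.] -- the component along s is gone
```

By construction the result is orthogonal to the speaker direction, which is the sense in which the retained embedding is "speaker-irrelevant"; making it additionally emotion-discriminative is what the learned module has to achieve.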
Language-specific Acoustic Boundary Learning for Mandarin-English Code-switching Speech Recognition
Code-switching speech recognition (CSSR) transcribes speech that switches
between multiple languages or dialects within a single sentence. The main
challenge in this task is that different languages often have similar
pronunciations, making it difficult for models to distinguish between them. In
this paper, we propose a method for solving the CSSR task from the perspective
of language-specific acoustic boundary learning. We introduce language-specific
weight estimators (LSWE) to model acoustic boundary learning in different
languages separately. Additionally, a non-autoregressive (NAR) decoder and a
language change detection (LCD) module are employed to assist in training.
Evaluated on the SEAME corpus, our method achieves a state-of-the-art mixed
error rate (MER) of 16.29% and 22.81% on the test_man and test_sge sets,
respectively. We also demonstrate the effectiveness of our method on a
9,000-hour in-house meeting code-switching dataset, where it achieves a
relative 7.9% MER reduction.
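The mixed error rate metric itself is straightforward: edit distance over a token stream that treats Mandarin as characters and English as space-delimited words, normalized by reference length. A minimal sketch (this tokenization convention is the common one for MER, assumed here rather than taken from the paper):

```python
def edit_distance(ref, hyp):
    """Standard Levenshtein distance over token lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[m][n]

def tokenize_mixed(text):
    """Mandarin as single characters, English as whole words."""
    tokens, word = [], ''
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':   # CJK Unified Ideographs
            if word:
                tokens.append(word)
                word = ''
            tokens.append(ch)
        elif ch == ' ':
            if word:
                tokens.append(word)
                word = ''
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

ref = tokenize_mixed('我 想 eat lunch 了')
hyp = tokenize_mixed('我 想 eat launch 了')
mer = edit_distance(ref, hyp) / len(ref)
print(f'{mer:.2%}')  # one substitution over 5 tokens -> 20.00%
```

Counting Mandarin by character and English by word keeps the metric comparable across the two scripts, which is why code-switching work reports MER rather than a plain WER or CER.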