LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experiments show that the ground-truth samples of LibriTTS-R have significantly better sound quality than those of LibriTTS. In addition, a neural end-to-end TTS model trained on LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.
Comment: Accepted to Interspeech 2023.
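Since LibriTTS-R keeps the LibriTTS file layout (per-utterance .wav files with matching .normalized.txt transcripts, grouped by subset/speaker/chapter), a small loader can pair audio with text. The sketch below is an illustrative assumption about the extracted directory structure, not something specified in the abstract:

```python
# Minimal sketch: iterate a LibriTTS-R subset and pair each 24 kHz wav
# with its normalized transcript. Assumes the standard LibriTTS layout
# ({subset}/{speaker}/{chapter}/{utt}.wav + {utt}.normalized.txt).
from pathlib import Path
import soundfile as sf  # pip install soundfile

def iter_utterances(root: str, subset: str = "train-clean-100"):
    for wav_path in sorted(Path(root, subset).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".normalized.txt")
        if not txt_path.exists():
            continue  # skip utterances without a normalized transcript
        audio, sr = sf.read(wav_path)
        assert sr == 24000, "LibriTTS-R is distributed at 24 kHz"
        yield audio, sr, txt_path.read_text().strip()

for audio, sr, text in iter_utterances("LibriTTS_R"):
    print(len(audio) / sr, "sec:", text[:60])
    break
```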
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations
Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
Comment: Accepted to WASPAA 2023.
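As a rough illustration of the two-stream conditioning idea (not Miipher's actual architecture), the sketch below fuses a degraded-speech feature sequence, as w2v-BERT would provide, with a text feature sequence, as PnG-BERT would provide, via cross-attention. All module choices and dimensions are assumptions:

```python
# Illustrative sketch of conditioning a restoration network on both
# speech features (e.g., from w2v-BERT) and text features (e.g., from
# PnG-BERT). Shapes and modules are assumptions, not Miipher's design.
import torch
import torch.nn as nn

class RestorationBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                 nn.Linear(2048, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, speech_feats, text_feats):
        # speech_feats: (B, T_speech, D) degraded-speech features
        # text_feats:   (B, T_text, D) linguistic conditioning features
        x = self.norm1(speech_feats +
                       self.self_attn(speech_feats, speech_feats, speech_feats)[0])
        x = self.norm2(x + self.cross_attn(x, text_feats, text_feats)[0])
        return self.norm3(x + self.ffn(x))  # cleaned speech-feature estimate

block = RestorationBlock()
clean_feats = block(torch.randn(2, 200, 512), torch.randn(2, 40, 512))
print(clean_feats.shape)  # torch.Size([2, 200, 512])
```

The point of the cross-attention step is that each degraded speech frame can consult the transcript, which is what makes the restoration robust when the acoustic evidence alone is badly corrupted.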
A Comparative Study on Transformer vs RNN in Speech Applications
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called the Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed the Transformer and conventional recurrent neural networks (RNNs) on a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance gains obtained with the Transformer on each task, including its surprising superiority over RNNs in 13 of the 15 ASR benchmarks. We are preparing to release Kaldi-style reproducible recipes using open-source, publicly available datasets for all the ASR, ST, and TTS tasks so that the community can reproduce and build on our results.
Comment: Accepted at ASRU 2019.
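To make the two encoder families being compared concrete, here is a minimal, purely illustrative PyTorch snippet running a Transformer encoder and a bidirectional LSTM over the same dummy batch of log-mel frames; the dimensions are placeholders, not the paper's configurations:

```python
# Toy comparison of the two encoder families studied in the paper:
# a Transformer encoder vs. a bidirectional LSTM, applied to a dummy
# batch of 80-dim log-mel frames. All sizes are illustrative only.
import torch
import torch.nn as nn

feats = torch.randn(4, 300, 80)  # (batch, frames, mel bins)

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True),
    num_layers=2,
)
rnn = nn.LSTM(input_size=80, hidden_size=40, num_layers=2,
              bidirectional=True, batch_first=True)

trf_out = transformer(feats)  # (4, 300, 80): global self-attention context
rnn_out, _ = rnn(feats)       # (4, 300, 80): 2 * hidden_size, sequential context
print(trf_out.shape, rnn_out.shape)
```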
Recent Advances in End-to-End Speech Recognition
"講演者所属: NTT Communication Science Laboratories 講演日: 2019å¹´11月13æ—¥ è¬›æ¼”å ´æ‰€: æƒ…å ±ç§‘å¦æ£Ÿå¤§è¬›ç¾©å®¤L1This talk explains recent advances in end-to-end automatic speech recognition (ASR) at NTT. First, I will give an overview of NTT speech technologies and open-source toolkit ESPnet. Then, I will introduce our proposed semi-supervised end-to-end ASR method (ICASSP19 https://ieeexplore.ieee.org/abstract/document/8682890). In this paper, we introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech only and text only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech only and text only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, they aim to encode features to be compatible with ASR and TTS models by a multi-task loss. Additionally, we anticipate that TTS joint training can also improve ASR performance because both ASR and TTS models learn transformations between speech and text. The experimental result we obtained with our semi-supervised end-to-end ASR/TTS training revealed reductions from a model initially trained with a small paired subset of the LibriSpeech corpus in the character error rate from 10.4% to 8.4% and word error rate from 20.6% to 18.0% by retraining the model with a large unpaired subset of the corpus