LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experiments show that the ground-truth samples of LibriTTS-R have significantly better sound quality than those of LibriTTS. In addition, a neural end-to-end TTS model trained on LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.
Comment: Accepted to Interspeech 2023.
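Since LibriTTS-R keeps the LibriTTS file layout (per-utterance .wav files with matching .normalized.txt transcripts, grouped by subset/speaker/chapter), a small loader can pair audio with text. The sketch below is an illustrative assumption about the extracted directory structure, not something specified in the abstract:

```python
# Minimal sketch: iterate a LibriTTS-R subset and pair each 24 kHz wav
# with its normalized transcript. Assumes the standard LibriTTS layout
# ({subset}/{speaker}/{chapter}/{utt}.wav + {utt}.normalized.txt).
from pathlib import Path
import soundfile as sf  # pip install soundfile

def iter_utterances(root: str, subset: str = "train-clean-100"):
    for wav_path in sorted(Path(root, subset).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".normalized.txt")
        if not txt_path.exists():
            continue  # skip utterances without a normalized transcript
        audio, sr = sf.read(wav_path)
        assert sr == 24000, "LibriTTS-R is distributed at 24 kHz"
        yield audio, sr, txt_path.read_text().strip()

for audio, sr, text in iter_utterances("LibriTTS_R"):
    print(len(audio) / sr, "sec:", text[:60])
    break
```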
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations
Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
Comment: Accepted to WASPAA 2023.
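As a rough illustration of the two-stream conditioning idea (not Miipher's actual architecture), the sketch below fuses a degraded-speech feature sequence, as w2v-BERT would provide, with a text feature sequence, as PnG-BERT would provide, via cross-attention. All module choices and dimensions are assumptions:

```python
# Illustrative sketch of conditioning a restoration network on both
# speech features (e.g., from w2v-BERT) and text features (e.g., from
# PnG-BERT). Shapes and modules are assumptions, not Miipher's design.
import torch
import torch.nn as nn

class RestorationBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                 nn.Linear(2048, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, speech_feats, text_feats):
        # speech_feats: (B, T_speech, D) degraded-speech features
        # text_feats:   (B, T_text, D) linguistic conditioning features
        x = self.norm1(speech_feats +
                       self.self_attn(speech_feats, speech_feats, speech_feats)[0])
        x = self.norm2(x + self.cross_attn(x, text_feats, text_feats)[0])
        return self.norm3(x + self.ffn(x))  # cleaned speech-feature estimate

block = RestorationBlock()
clean_feats = block(torch.randn(2, 200, 512), torch.randn(2, 40, 512))
print(clean_feats.shape)  # torch.Size([2, 200, 512])
```

The point of the cross-attention step is that each degraded speech frame can consult the transcript, which is what makes the restoration robust when the acoustic evidence alone is badly corrupted.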
A Comparative Study on Transformer vs RNN in Speech Applications
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called the Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed the Transformer and conventional recurrent neural networks (RNNs) on a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance gains obtained with the Transformer on each task, including its surprising superiority over RNNs in 13 of the 15 ASR benchmarks. We are preparing to release Kaldi-style reproducible recipes using open-source, publicly available datasets for all the ASR, ST, and TTS tasks so that the community can reproduce and build on our results.
Comment: Accepted at ASRU 2019.
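To make the two encoder families being compared concrete, here is a minimal, purely illustrative PyTorch snippet running a Transformer encoder and a bidirectional LSTM over the same dummy batch of log-mel frames; the dimensions are placeholders, not the paper's configurations:

```python
# Toy comparison of the two encoder families studied in the paper:
# a Transformer encoder vs. a bidirectional LSTM, applied to a dummy
# batch of 80-dim log-mel frames. All sizes are illustrative only.
import torch
import torch.nn as nn

feats = torch.randn(4, 300, 80)  # (batch, frames, mel bins)

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True),
    num_layers=2,
)
rnn = nn.LSTM(input_size=80, hidden_size=40, num_layers=2,
              bidirectional=True, batch_first=True)

trf_out = transformer(feats)  # (4, 300, 80): global self-attention context
rnn_out, _ = rnn(feats)       # (4, 300, 80): 2 * hidden_size, sequential context
print(trf_out.shape, rnn_out.shape)
```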
Recent Advances in End-to-End Speech Recognition
"講演者所属: NTT Communication Science Laboratories 講演日: 2019å¹´11月13æ—¥ è¬›æ¼”å ´æ‰€: æƒ…å ±ç§‘å¦æ£Ÿå¤§è¬›ç¾©å®¤L1This talk explains recent advances in end-to-end automatic speech recognition (ASR) at NTT. First, I will give an overview of NTT speech technologies and open-source toolkit ESPnet. Then, I will introduce our proposed semi-supervised end-to-end ASR method (ICASSP19 https://ieeexplore.ieee.org/abstract/document/8682890). In this paper, we introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech only and text only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech only and text only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, they aim to encode features to be compatible with ASR and TTS models by a multi-task loss. Additionally, we anticipate that TTS joint training can also improve ASR performance because both ASR and TTS models learn transformations between speech and text. The experimental result we obtained with our semi-supervised end-to-end ASR/TTS training revealed reductions from a model initially trained with a small paired subset of the LibriSpeech corpus in the character error rate from 10.4% to 8.4% and word error rate from 20.6% to 18.0% by retraining the model with a large unpaired subset of the corpus