Towards Selection of Text-to-speech Data to Augment ASR Training

Kalinli, Ozlem; Keren, Gil; Liu, Shuo; Mahadeokar, Jay; Sarı, Leda; Shangguan, Yuan; Wu, Chunyang

Authors: Ozlem Kalinli, Gil Keren, Shuo Liu, Jay Mahadeokar, Leda Sarı, Yuan Shangguan, Chunyang Wu
Publication date: 30 May 2023

Abstract

This paper presents a method for selecting appropriate synthetic speech samples from a large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or ArcFace loss, to measure the similarity of synthetic data to real speech. We found that incorporating synthetic samples with considerable dissimilarity to real speech, owing in part to lexical differences, into ASR training is crucial for boosting recognition performance. Experimental results on LibriSpeech test sets indicate that, to maintain the same speech recognition accuracy as when using all TTS data, our proposed solution can reduce the TTS data to below 30% of its original size, which is superior to several baseline methods.
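The abstract does not state the exact selection rule, so the following is only a hypothetical illustration of score-based selection, not the authors' procedure. It assumes each synthetic utterance already has a similarity score to real speech (for example, a classifier's real-speech probability; higher means more real-like), and that a fixed share of the selection budget is reserved for the most dissimilar samples, in the spirit of the finding above. The function name `select_tts_subset` and the parameters `keep_fraction` and `dissimilar_share` are assumptions made for this sketch.

```python
def select_tts_subset(scores, keep_fraction=0.3, dissimilar_share=0.5):
    """Return sorted indices of the synthetic utterances to keep.

    scores[i] is a hypothetical similarity of utterance i to real speech
    (higher = more real-like). keep_fraction is the share of the TTS set
    to retain; dissimilar_share is the part of that budget taken from the
    least real-like utterances. Both budgets are rounded down.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not 0.0 <= dissimilar_share <= 1.0:
        raise ValueError("dissimilar_share must be in [0, 1]")

    n = len(scores)
    budget = int(n * keep_fraction)
    n_dissimilar = int(budget * dissimilar_share)
    n_similar = budget - n_dissimilar

    # Indices ordered from least to most similar to real speech.
    order = sorted(range(n), key=lambda i: scores[i])

    # Take the most dissimilar tail and the most similar tail; the two
    # slices cannot overlap because their sizes sum to budget <= n.
    chosen = order[:n_dissimilar] + order[n - n_similar:]
    return sorted(chosen)
```

With eight scores `[0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]`, `keep_fraction=0.5` and `dissimilar_share=0.5`, the sketch keeps the two least similar utterances (indices 1 and 5) and the two most similar ones (indices 0 and 6).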

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.00998

Last time updated on 06/06/2023