Efficient Implementation of the Room Simulator for Training Deep Neural Network Acoustic Models
In this paper, we describe how to efficiently implement an acoustic room
simulator to generate large-scale simulated data for training deep neural
networks. Although the Google Room Simulator in [1] was shown to be quite effective in reducing Word Error Rates (WERs) for far-field applications by generating simulated far-field training sets, it requires a very large number of large-size Fast Fourier Transforms (FFTs). The Room Simulator in [1] accounted for approximately 80 percent of the Central Processing Unit (CPU) usage in our CPU + Graphics Processing Unit (GPU) training architecture [2]. In this work, we implement efficient OverLap Addition (OLA) based filtering using the open-source FFTW3 library. Further, we investigate the effect of Room Impulse Response (RIR) length. Experimentally, we conclude that we can cut the tail portions of RIRs where the power is more than 20 dB below the peak power without sacrificing speech recognition accuracy. However, cutting the RIR tail beyond this threshold harms speech recognition accuracy on rerecorded test sets. Using these approaches, we reduced the room simulator's share of CPU usage to 9.69 percent in the CPU/GPU training architecture. Profiling results show a speed-up of 22.4 times on a single machine and 37.3 times on Google's distributed training infrastructure.
Comment: Published at INTERSPEECH 2018 (https://www.isca-speech.org/archive/Interspeech_2018/abstracts/2566.html).
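As a rough illustration of the two ideas above, the Python sketch below truncates an RIR where its power drops 20 dB below the peak and then applies FFT-based overlap-add filtering via SciPy's oaconvolve. It is a minimal stand-in for the paper's FFTW3-based implementation, not the authors' code; the function names, signal lengths, and the synthetic RIR are illustrative assumptions.

```python
import numpy as np
from scipy.signal import oaconvolve  # FFT-based overlap-add convolution

def truncate_rir(rir, drop_db=20.0):
    """Drop the RIR tail where the power falls more than `drop_db` below the peak.
    Threshold choice follows the abstract; this is not the paper's implementation."""
    power = rir ** 2
    threshold = power.max() * 10.0 ** (-drop_db / 10.0)
    above = np.nonzero(power >= threshold)[0]
    return rir[: above[-1] + 1] if above.size else rir

def simulate_far_field(clean, rir, drop_db=20.0):
    """Convolve clean speech with a truncated RIR using overlap-add filtering."""
    short_rir = truncate_rir(rir, drop_db)
    # oaconvolve does FFT-based overlap-add filtering, analogous in spirit to the
    # OLA filtering the paper implements directly on top of FFTW3.
    return oaconvolve(clean, short_rir, mode="full")[: len(clean)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)                                   # 1 s of placeholder "speech" at 16 kHz
    rir = np.exp(-np.arange(8000) / 800.0) * rng.standard_normal(8000)   # synthetic decaying RIR
    print(simulate_far_field(clean, rir).shape)                          # (16000,)
```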
Restoring Punctuation and Capitalization in Transcribed Speech
Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-based n-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to 55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.
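The toy sketch below shows one way to picture the single-pass idea: a small NLTK n-gram model trained on punctuated, capitalized text scores candidate capitalization and punctuation choices in a left-to-right beam search. This is only a shape-of-the-idea demo under assumed names (variants, restore) and a four-sentence corpus; the paper's system uses a far larger LM and a production decoder.

```python
from itertools import product
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

# Tiny "written-text" corpus with punctuation and capitalization kept as tokens.
corpus = [
    "Hello , how are you ?".split(),
    "I am fine .".split(),
    "How are you today ?".split(),
    "Hello there .".split(),
]
ORDER = 3  # the paper studies n = 3 to n = 6
train, vocab = padded_everygram_pipeline(ORDER, corpus)
lm = Laplace(ORDER)
lm.fit(train, vocab)

PUNCT = ["", ",", ".", "?"]  # "" means no punctuation after the word

def variants(word):
    """Candidate surface forms: lowercase or capitalized, optionally followed by punctuation."""
    for form, p in product({word, word.capitalize()}, PUNCT):
        yield [form] + ([p] if p else [])

def restore(words, beam=8):
    """Single pass over the lowercase transcript: beam search over the
    capitalization/punctuation choices, scored left to right by the n-gram LM."""
    beams = [([], 0.0)]
    for w in words:
        scored = []
        for hyp, score in beams:
            for v in variants(w):
                toks, s = list(hyp), score
                for tok in v:
                    s += lm.logscore(tok, tuple(toks[-(ORDER - 1):]))
                    toks.append(tok)
                scored.append((toks, s))
        beams = sorted(scored, key=lambda x: x[1], reverse=True)[:beam]
    return " ".join(beams[0][0])

print(restore("hello how are you".split()))
```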
Improved Name-Recognition with Meta-data Dependent Name Networks
A transcription system that requires accurate general name transcription is faced with the problem of covering the large number of names it may encounter. Without any prior knowledge, this requires a large increase in the size and complexity of the system due to the expansion of the lexicon. Furthermore, this increase adversely affects system performance due to the increased confusability. Here we propose a method that uses meta-data available at runtime to ensure better name coverage without significantly increasing system complexity. We tested this approach on a voicemail transcription task and assumed the meta-data to be available in the form of a caller ID string (as it would show up on a caller-ID-enabled phone) and the name of the mailbox owner. Networks representing possible spoken realizations of those names are generated at runtime and included in the decoder network. The decoder network is built at training time using a class-dependent language model, with caller and mailbox name instances modeled as class tokens. The class tokens are replaced at test time with the name networks built from the meta-data. The proposed algorithm reduced the error rate on name tokens by 22.1%.
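A hypothetical Python sketch of the metadata expansion step follows: it derives possible spoken realizations from a caller ID string and substitutes them for class tokens in a class-based LM template. The class-token names ($CALLER_NAME, $MAILBOX_NAME), the particular variant set, and the use of string substitution as a stand-in for the paper's runtime network splicing are all assumptions made for illustration.

```python
def name_variants(caller_id):
    """Possible spoken realizations of a caller-ID string such as 'SMITH, JOHN'.
    The variant set is an illustrative guess; the paper compiles realizations
    into networks that are spliced into the decoder."""
    parts = [p.lower() for p in caller_id.replace(",", " ").split() if p.isalpha()]
    if not parts:
        return []
    if len(parts) == 1:
        return [parts[0]]
    last, first = parts[0], parts[1]          # caller ID is often "LAST, FIRST"
    return [f"{first} {last}", first, last, f"mister {last}", f"miss {last}"]

# Class tokens in the training-time class-dependent LM; at test time each token is
# replaced by the network built from the runtime metadata. Token names are hypothetical.
CLASS_EXPANSIONS = {
    "$CALLER_NAME": name_variants("SMITH, JOHN"),
    "$MAILBOX_NAME": name_variants("DOE, JANE"),
}

def expand(template):
    """Substitute every class token with its metadata-derived realizations.
    (String substitution here stands in for splicing name networks into the decoder graph.)"""
    out = [template]
    for token, variants in CLASS_EXPANSIONS.items():
        if variants:
            out = [s.replace(token, v) for s in out for v in variants]
    return out

for sentence in expand("hi this is $CALLER_NAME calling for $MAILBOX_NAME")[:5]:
    print(sentence)
```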
Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model
Sequence-to-sequence models provide a simple and elegant solution for
building speech recognition systems by folding separate components of a typical
system, namely acoustic (AM), pronunciation (PM) and language (LM) models into
a single neural network. In this work, we look at one such sequence-to-sequence
model, namely listen, attend and spell (LAS), and explore the possibility of
training a single model to serve different English dialects, which simplifies
the process of training multi-dialect systems without the need for separate AM,
PM and LMs for each dialect. We show that simply pooling the data from all
dialects into one LAS model falls behind the performance of a model fine-tuned
on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets, inserting the dialect symbol at the end of the original grapheme sequence, and by feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1% to 16.5% relative.
Comment: Submitted to ICASSP 2018.
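A minimal NumPy sketch of the two conditioning mechanisms described above is given below: appending a dialect symbol to the grapheme targets and concatenating a 1-hot dialect vector onto a layer's input. The dialect labels, symbol format, and the concatenation point are illustrative assumptions; the paper's LAS model learns this conditioning end to end.

```python
import numpy as np

DIALECTS = ["en-us", "en-gb", "en-in", "en-au", "en-za", "en-ie", "en-ca"]  # illustrative labels

def augment_targets(graphemes, dialect):
    """Append a dialect symbol to the grapheme target sequence."""
    return list(graphemes) + [f"<{dialect}>"]

def dialect_one_hot(dialect):
    """1-hot encoding of the dialect identity."""
    vec = np.zeros(len(DIALECTS), dtype=np.float32)
    vec[DIALECTS.index(dialect)] = 1.0
    return vec

def condition_layer_input(hidden, dialect):
    """Concatenate the 1-hot dialect vector onto a layer's input features;
    one simple way of feeding side information into every layer."""
    one_hot = np.broadcast_to(dialect_one_hot(dialect), (hidden.shape[0], len(DIALECTS)))
    return np.concatenate([hidden, one_hot], axis=-1)

print(augment_targets("hello", "en-gb"))                  # ['h', 'e', 'l', 'l', 'o', '<en-gb>']
frames = np.random.randn(100, 80).astype(np.float32)      # placeholder 80-dim acoustic features
print(condition_layer_input(frames, "en-gb").shape)       # (100, 87)
```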
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Attention-based encoder-decoder architectures such as Listen, Attend and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such
architectures would be practical for more challenging tasks such as voice
search. In this work, we explore a variety of structural and optimization
improvements to our LAS model which significantly improve performance. On the
structural side, we show that word piece models can be used instead of
graphemes. We also introduce a multi-head attention architecture, which offers
improvements over the commonly-used single-head attention. On the optimization
side, we explore synchronous training, scheduled sampling, label smoothing, and
minimum word error rate optimization, which are all shown to improve accuracy.
We present results with a unidirectional LSTM encoder for streaming
recognition. On a 12,500 hour voice search task, we find that the proposed
changes improve the WER from 9.2% to 5.6%, while the best conventional system
achieves 6.7%; on a dictation task, our model achieves a WER of 4.1% compared to 5% for the conventional system.
Comment: ICASSP camera-ready version.
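Of the structural changes listed, multi-head attention is the most self-contained to sketch. The NumPy toy below splits the model dimension into several heads, attends independently in each, and concatenates the per-head contexts; the random (untrained) projections and dimensions are illustrative assumptions, not the paper's trained LAS attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(queries, keys, values, num_heads=4, rng=None):
    """Split the model dimension into heads, attend independently in each head,
    then concatenate the per-head context vectors."""
    rng = rng or np.random.default_rng(0)
    d_model = queries.shape[-1]
    d_head = d_model // num_heads
    contexts = []
    for _ in range(num_heads):
        # Random, untrained projections purely to make the shapes concrete.
        wq, wk, wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model) for _ in range(3))
        q, k, v = queries @ wq, keys @ wk, values @ wv
        weights = softmax(q @ k.T / np.sqrt(d_head))   # (T_q, T_k) attention weights per head
        contexts.append(weights @ v)                   # (T_q, d_head) context per head
    return np.concatenate(contexts, axis=-1)           # (T_q, d_model)

enc = np.random.randn(50, 256)   # placeholder encoder states: T_k = 50, d_model = 256
dec = np.random.randn(10, 256)   # placeholder decoder queries: T_q = 10
print(multi_head_attention(dec, enc, enc).shape)       # (10, 256)
```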
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the
LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling
rate from 2,456 speakers and the corresponding texts. The constituent samples
of LibriTTS-R are identical to those of LibriTTS, with only the sound quality
improved. Experimental results show that the ground-truth samples of LibriTTS-R have significantly improved sound quality compared to those of LibriTTS. In
addition, neural end-to-end TTS trained with LibriTTS-R achieved speech
naturalness on par with that of the ground-truth samples. The corpus is freely
available for download from http://www.openslr.org/141/.
Comment: Accepted to Interspeech 2023.