Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural models have become ubiquitous in automatic speech recognition systems.
While neural networks are typically used as acoustic models in more complex
systems, recent studies have explored end-to-end speech recognition systems
based on neural networks, which can be trained to directly predict text from
input acoustic features. Although such systems are conceptually elegant and
simpler than traditional systems, it is less obvious how to interpret the
trained models. In this work, we analyze the speech representations learned by
a deep end-to-end model that is based on convolutional and recurrent layers,
and trained with a connectionist temporal classification (CTC) loss. We use a
pre-trained model to generate frame-level features which are given to a
classifier that is trained on frame classification into phones. We evaluate
representations from different layers of the deep model and compare their
quality for predicting phone labels. Our experiments shed light on important
aspects of the end-to-end model such as layer depth, model complexity, and
other design choices.
Comment: NIPS 201
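The layer-wise probing setup described above can be sketched as follows: a simple classifier is fit on frame-level features from each layer, and layers are compared by phone classification accuracy. The nearest-centroid probe and the toy per-layer features below are illustrative stand-ins for the paper's trained classifier and the pre-trained model's representations:

```python
# Minimal sketch of layer-wise probing: fit a simple classifier on
# frame-level features from each layer, then compare phone accuracy.
# A nearest-centroid classifier stands in for the probe; the feature
# vectors and phone labels here are toy values, not real model outputs.

def fit_centroids(features, labels):
    """Average the feature vectors belonging to each phone label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Assign the phone whose centroid is nearest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))

def probe_accuracy(features, labels):
    """Frame classification accuracy of the probe on its training frames."""
    centroids = fit_centroids(features, labels)
    hits = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)

# Hypothetical per-layer frame features: layer name -> (features, labels).
layers = {
    "conv1": ([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]], ["aa", "iy", "aa"]),
    "rnn3":  ([[0.0, 1.0], [1.0, 0.0], [0.1, 0.9]], ["aa", "iy", "aa"]),
}
for name, (feats, labs) in layers.items():
    print(name, probe_accuracy(feats, labs))  # both layers score 1.0 on this toy data
```

Ranking layers by this accuracy is what lets the analysis attribute phonetic information to particular depths of the end-to-end model.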
Investigating gated recurrent neural networks for speech synthesis
Recently, recurrent neural networks (RNNs) as powerful sequence models have
re-emerged as a potential acoustic model for statistical parametric speech
synthesis (SPSS). The long short-term memory (LSTM) architecture is
particularly attractive because it addresses the vanishing gradient problem in
standard RNNs, making them easier to train. Although recent studies have
demonstrated that LSTMs can achieve significantly better performance on SPSS
than deep feed-forward neural networks, little is known about why this is so.
Here we attempt to answer two questions: a) why do LSTMs work well as a
sequence model for SPSS; and b) which component (e.g., the input, output, or
forget gate) is most important. We present a visual analysis alongside a
series of experiments,
resulting in a proposal for a simplified architecture. The simplified
architecture has significantly fewer parameters than an LSTM, thus reducing
generation complexity considerably without degrading quality.
Comment: Accepted by ICASSP 201
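For reference, the gate structure that the second question refers to can be written out explicitly. The scalar-state cell and placeholder weights below are purely illustrative; they show the standard LSTM equations, not the paper's simplified architecture:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One standard LSTM step, with scalar states for readability.

    w maps each gate name to (input weight, recurrent weight, bias);
    the values used below are placeholders, not trained parameters.
    """
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate update
    c = f * c_prev + i * g  # additive cell-state path mitigates vanishing gradients
    h = o * math.tanh(c)    # gated hidden output
    return h, c

w = {gate: (0.5, 0.1, 0.0) for gate in ("i", "f", "o", "g")}
h, c = lstm_step(1.0, 0.0, 0.0, w)
```

Gate ablations of the kind the abstract alludes to can be simulated by pinning a gate to a constant, e.g. replacing f with 1.0 removes the forget gate's contribution.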
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
We propose a self-supervised representation learning model for the task of
unsupervised phoneme boundary detection. The model is a convolutional neural
network that operates directly on the raw waveform. It is optimized to identify
spectral changes in the signal using the Noise-Contrastive Estimation
principle. At test time, a peak detection algorithm is applied over the model
outputs to produce the final boundaries. As such, the proposed model is trained
in a fully unsupervised manner, with no manual annotations in the form of
target boundaries or phonetic transcriptions. We compare the proposed approach to
several unsupervised baselines using both TIMIT and Buckeye corpora. Results
suggest that our approach surpasses the baseline models and reaches
state-of-the-art performance on both data sets. Furthermore, we experimented
with expanding the training set with additional examples from the Librispeech
corpus. We evaluated the resulting model on distributions and languages that
were not seen during the training phase (English, Hebrew and German) and showed
that utilizing additional untranscribed data is beneficial for model
performance.
Comment: Interspeech 2020 paper
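The test-time pipeline described above can be sketched as follows, with plain cosine dissimilarity between adjacent frame vectors standing in for the learned contrastive score; the frame vectors and threshold are toy values, not the model's actual representations:

```python
# Sketch of the test-time pipeline: score the spectral change between
# adjacent frame representations, then pick peaks as phoneme boundaries.
# Cosine dissimilarity is an illustrative stand-in for the score produced
# by the trained contrastive model.

def cosine_dissimilarity(a, b):
    """1 - cosine similarity: larger means a bigger change between frames."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return 1.0 - dot / (na * nb)

def boundary_peaks(frames, threshold=0.5):
    """Return frame indices where the change score is a local maximum above threshold."""
    scores = [cosine_dissimilarity(frames[t], frames[t + 1])
              for t in range(len(frames) - 1)]
    return [t + 1 for t in range(len(scores))
            if scores[t] > threshold
            and (t == 0 or scores[t] > scores[t - 1])
            and (t == len(scores) - 1 or scores[t] > scores[t + 1])]

# Toy frame representations: a sharp change between frames 1 and 2.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(boundary_peaks(frames))  # -> [2]
```

Because both the scoring and the peak picking need no labels, the whole pipeline stays unsupervised, matching the training setup the abstract describes.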