Character-Level Incremental Speech Recognition with Recurrent Neural Networks
In real-time speech recognition applications, latency is an important
issue. We have developed a character-level incremental speech recognition (ISR)
system that responds quickly even during the speech, where the hypotheses are
gradually improved while the speaking proceeds. The algorithm employs a
speech-to-character unidirectional recurrent neural network (RNN), which is
end-to-end trained with connectionist temporal classification (CTC), and an
RNN-based character-level language model (LM). The output values of the
CTC-trained RNN are character-level probabilities, which are processed by beam
search decoding. The RNN LM augments the decoding by providing long-term
dependency information. We propose tree-based online beam search with
additional depth-pruning, which enables the system to process infinitely long
input speech with low latency. This system not only responds quickly to speech
but can also dictate out-of-vocabulary (OOV) words according to pronunciation.
The proposed model achieves a word error rate (WER) of 8.90% on the Wall
Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284
training set.
Comment: To appear in ICASSP 201
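The decoding step described in this abstract (beam search over the per-frame character probabilities of a CTC-trained network) can be illustrated with a generic CTC prefix beam search. This is a minimal sketch, not the paper's tree-based online search: it has no depth-pruning and no RNN LM, and the function name, the `alphabet` argument, and the convention that index 0 is the blank symbol are assumptions made for the example.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_prefix_beam_search(log_probs, alphabet, blank=0, beam_width=4):
    """Decode per-frame log-probabilities into a string.

    log_probs: list of frames, each a list of log-probabilities per symbol.
    alphabet:  indexable symbols; alphabet[blank] is the CTC blank (assumed).
    """
    # prefix -> (log prob of paths ending in blank, ... ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s, lp in enumerate(frame):
                if s == blank:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                new_prefix = prefix + (s,)
                b, nb = nxt[new_prefix]
                if prefix and prefix[-1] == s:
                    # A repeated symbol extends the prefix only after a blank...
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp))
                    # ...otherwise it collapses into the same prefix.
                    b2, nb2 = nxt[prefix]
                    nxt[prefix] = (b2, logsumexp(nb2, p_nb + lp))
                else:
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # Keep only the most probable prefixes (breadth pruning).
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(alphabet[i] for i in best)
```

Because the search consumes one frame at a time, the current best prefix can be read out after every frame, which is what makes incremental output possible. An LM score would typically be added at the point where a prefix is extended by a new character.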
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of `model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research in this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 201
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling of large vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM that does not build on any phone clustering algorithm for the
determination of the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. The recognition accuracy
of our proposed models on the Switchboard task is comparable to, and slightly
better than, that of the hybrid model using the standard state-tying decision trees.
Comment: Submitted to Interspeech 202
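The "different decompositions of the joint probability" mentioned in this abstract follow from the chain rule. As an illustration only, with assumed notation ($x$ for the acoustic input, $\sigma$ for the center phoneme state, $\phi_\ell$ and $\phi_r$ for the left and right phoneme contexts), two of the possible orderings are shown below. Which orderings the authors actually use is not stated in the abstract.

```latex
\begin{align}
p(\phi_\ell, \sigma, \phi_r \mid x)
  &= p(\phi_r \mid \sigma, \phi_\ell, x)\; p(\sigma \mid \phi_\ell, x)\; p(\phi_\ell \mid x) \\
  &= p(\phi_\ell \mid \sigma, \phi_r, x)\; p(\phi_r \mid \sigma, x)\; p(\sigma \mid x)
\end{align}
```

Each factor can be modeled by a separate network component, which is consistent with the abstract's description of a factorized network whose components are trained jointly.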
Reinforcement Learning of Speech Recognition System Based on Policy Gradient and Hypothesis Selection
Speech recognition systems have achieved high recognition performance for
several tasks. However, the performance of such systems is dependent on the
tremendously costly development work of preparing vast amounts of task-matched
transcribed speech data for supervised training. The key problem here is the
cost of transcribing speech data, which must be paid again for each
new language and new task. Assuming broad network services for transcribing
speech data for many users, a system would become more self-sufficient and more
useful if it possessed the ability to learn from very light feedback from the
users without annoying them. In this paper, we propose a general reinforcement
learning framework for speech recognition systems based on the policy gradient
method. As a particular instance of the framework, we also propose a hypothesis
selection-based reinforcement learning method. The proposed framework provides
a new view for several existing training and adaptation methods. The
experimental results show that the proposed method improves the recognition
performance compared to unsupervised adaptation.
Comment: 5 pages, 6 figure
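The idea of combining a policy gradient with hypothesis selection can be sketched with a generic REINFORCE update over a softmax policy on N candidate hypotheses. This is not the paper's algorithm: the function names, the scalar `baseline`, and `reward_fn` (a hypothetical stand-in for light user feedback on the selected hypothesis) are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Convert hypothesis scores into selection probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(scores, reward_fn, lr=0.5, baseline=0.0, rng=random):
    """One policy-gradient update on the hypothesis scores.

    A hypothesis index k is sampled from the softmax policy, the feedback
    reward_fn(k) is observed, and the scores move along the gradient of
    log pi(k), which for a softmax is 1[j == k] - probs[j].
    """
    probs = softmax(scores)
    k = rng.choices(range(len(scores)), weights=probs)[0]
    advantage = reward_fn(k) - baseline
    return [
        s + lr * advantage * ((1.0 if j == k else 0.0) - p)
        for j, (s, p) in enumerate(zip(scores, probs))
    ]
```

In a real recognizer the scores would be produced by model parameters and the gradient would be propagated into them. The sketch updates the scores directly to keep the mechanics visible.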
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is studied as a way to mitigate the overfitting
problem.
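LHUC, as named in the paragraph above, rescales each hidden unit with a learned amplitude while the rest of the network stays shared. The sketch below uses the commonly cited parameterization a = 2 * sigmoid(r), so that r = 0 leaves the layer unchanged. The function name and the per-language parameter table are assumptions made for the example, not details taken from this abstract.

```python
import math


def lhuc_scale(hidden, r):
    """Scale each hidden activation by an amplitude in (0, 2).

    hidden: activations of one layer for one frame.
    r:      one learnable LHUC parameter per hidden unit (assumed layout).
    """
    return [2.0 / (1.0 + math.exp(-ri)) * h for h, ri in zip(hidden, r)]


# Hypothetical per-language parameters: zeros mean "no adaptation yet".
lhuc_params = {"lang_a": [0.0, 0.0, 0.0], "lang_b": [1.5, -1.5, 0.0]}
adapted = lhuc_scale([0.2, 0.4, -0.6], lhuc_params["lang_b"])
```

During language adaptive training only the `r` vectors would be language-specific, which keeps the number of adapted parameters small compared with updating every weight.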
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and that it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve competitive performance with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data
- …