Direct Acoustics-to-Word Models for English Conversational Speech Recognition
Recent work on end-to-end automatic speech recognition (ASR) has shown that
the connectionist temporal classification (CTC) loss can be used to convert
acoustics to phone or character sequences. Such systems are used with a
dictionary and a separately trained language model (LM) to produce word
sequences. However, they are not truly end-to-end in the sense of mapping
acoustics directly to words without an intermediate phone representation. In
this paper, we present the first results employing direct acoustics-to-word CTC
models on two well-known public benchmark tasks: Switchboard and CallHome.
These models do not require an LM or even a decoder at run-time and hence
recognize speech with minimal complexity. However, due to the large number of
word output units, CTC word models require orders of magnitude more data to
train reliably compared to traditional systems. We present some techniques to
mitigate this issue. Our CTC word model achieves a word error rate of
13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or
decoder compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also
present rescoring results on CTC word model lattices to quantify the
performance benefits of an LM, and contrast the performance of word and phone
CTC models.

Comment: Submitted to Interspeech-201
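As a rough sketch of the kind of model described above, the following PyTorch-style snippet builds a word-level CTC model with a decoder-free greedy readout. The vocabulary size, layer sizes, and names are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch of a direct acoustics-to-word CTC model.
# All dimensions below are assumptions for illustration.
import torch
import torch.nn as nn

class WordCTCModel(nn.Module):
    def __init__(self, n_feats=40, n_words=10000, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # Scores every vocabulary word plus the CTC blank symbol (index 0).
        self.proj = nn.Linear(2 * hidden, n_words + 1)

    def forward(self, feats):                 # feats: (batch, time, n_feats)
        hidden_states, _ = self.encoder(feats)
        return self.proj(hidden_states)       # (batch, time, n_words + 1)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # training criterion

def greedy_decode(frame_log_probs, blank=0):
    """Decoder-free recognition: pick the best word per frame,
    collapse repeats, and drop blanks."""
    best = frame_log_probs.argmax(dim=-1).tolist()   # (time,) word indices
    words, prev = [], blank
    for w in best:
        if w != prev and w != blank:
            words.append(w)
        prev = w
    return words
```

The greedy loop above is the entire run-time "decoder", which is what makes the LM-free, decoder-free operation described in the abstract possible.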
FPGA-Based Low-Power Speech Recognition with Recurrent Neural Networks
In this paper, a neural network based real-time speech recognition (SR)
system is developed using an FPGA for very low-power operation. The implemented
system employs two recurrent neural networks (RNNs); one is a
speech-to-character RNN for acoustic modeling (AM) and the other is for
character-level language modeling (LM). The system also employs a statistical
word-level LM to improve the recognition accuracy. The results of the AM, the
character-level LM, and the word-level LM are combined using a fairly simple
N-best search algorithm instead of a hidden Markov model (HMM) based network.
The RNNs are implemented using massively parallel processing elements (PEs) for
low latency and high throughput. The weights are quantized to 6 bits to store
all of them in the on-chip memory of an FPGA. The proposed algorithm is
implemented on a Xilinx XC7Z045, and the system can operate much faster than
real-time.

Comment: Accepted to SiPS 201
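The 6-bit weight quantization mentioned above can be illustrated with a simple symmetric uniform quantizer; the rounding and clipping scheme below is an assumption for illustration, not necessarily the one implemented on the FPGA.

```python
# Sketch of symmetric uniform quantization of RNN weights to 6 bits,
# so all weights fit in on-chip FPGA memory. The exact scheme used in
# the paper may differ; this is an assumed, illustrative version.
import numpy as np

def quantize_6bit(weights):
    levels = 2 ** 6                              # 64 quantization levels
    max_code = levels // 2 - 1                   # symmetric range: -31..31
    step = np.abs(weights).max() / max_code      # one scale per tensor
    codes = np.clip(np.round(weights / step), -max_code, max_code)
    return codes.astype(np.int8), step           # 6-bit codes plus one scale

def dequantize(codes, step):
    return codes.astype(np.float32) * step

# Example: quantize a random weight matrix and check the error bound.
w = np.random.randn(256, 256).astype(np.float32)
codes, step = quantize_6bit(w)
print(np.abs(w - dequantize(codes, step)).max())  # bounded by step / 2
```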
Leveraging native language information for improved accented speech recognition
Recognition of accented speech is a long-standing challenge for automatic
speech recognition (ASR) systems, given the increasing worldwide population of
bilingual speakers with English as their second language. If we consider
foreign-accented speech as an interpolation of the native language (L1) and
English (L2), using a model that can simultaneously address both languages
would perform better at the acoustic level for accented speech. In this study,
we explore how an end-to-end recurrent neural network (RNN) system trained on
English and native languages (Spanish and Indian languages) can leverage
native-language data to improve performance on accented English speech. To this
end, we examine pre-training with native languages, as well as multi-task
learning (MTL) in which the main task is trained with native English and the
secondary task is trained with Spanish or Indian languages. We show that the
proposed MTL model performs better than the pre-training approach and
outperforms a baseline model trained simply with English data. We suggest a new
setting for MTL in which the secondary task is trained with both English and
the native language, using the same output set. This proposed setting yields
better performance, with character error rate gains of 11.95% and 17.55% over
the baseline for Hispanic and Indian accents, respectively.

Comment: Accepted at Interspeech 201
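A minimal sketch of the multi-task setup, assuming a shared encoder with one CTC head per task; the loss weighting and all dimensions are assumptions, not the authors' configuration.

```python
# Hypothetical sketch of multi-task learning for accented ASR: a shared
# encoder feeds a primary (English) head and a secondary (native-language)
# head; the secondary loss regularizes the shared representation.
import torch
import torch.nn as nn

class MTLAccentedASR(nn.Module):
    def __init__(self, n_feats=40, n_chars=50, hidden=320):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.head_en = nn.Linear(2 * hidden, n_chars + 1)  # primary task
        self.head_l1 = nn.Linear(2 * hidden, n_chars + 1)  # secondary task

    def forward(self, feats, task="en"):
        h, _ = self.encoder(feats)
        return self.head_en(h) if task == "en" else self.head_l1(h)

# Assumed joint objective (0.3 is an illustrative weight):
#   loss = ctc(model(x_en, "en"), y_en) + 0.3 * ctc(model(x_l1, "l1"), y_l1)
```

In the proposed variant, the secondary head would be trained on both English and native-language utterances over the same output set, so the two tasks share one character inventory.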
Phonetic Temporal Neural Model for Language Identification
Deep neural models, particularly the LSTM-RNN model, have shown great
potential for language identification (LID). However, the use of phonetic
information has been largely overlooked by most existing neural LID methods,
although this information has been used very successfully in conventional
phonetic LID systems. We present a phonetic temporal neural model for LID,
which is an LSTM-RNN LID system that accepts phonetic features produced by a
phone-discriminative DNN as the input, rather than raw acoustic features. This
new model is similar to traditional phonetic LID methods, but the phonetic
knowledge here is much richer: it is at the frame level and encodes compact
information about all phones. Our experiments on the Babel database and
the AP16-OLR database demonstrate that the temporal phonetic neural approach is
very effective, and significantly outperforms existing acoustic neural models.
It also outperforms the conventional i-vector approach on short utterances and
in noisy conditions.

Comment: Submitted to TASL
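The model described above can be sketched as a small pipeline in which frame-level phonetic features from a phone-discriminative DNN feed an LSTM classifier; the dimensions and the mean-pooling readout below are illustrative assumptions.

```python
# Hypothetical sketch of a phonetic temporal LID model: phonetic features
# (e.g., DNN phone posteriors or bottleneck activations) replace raw
# acoustic features as the LSTM input.
import torch
import torch.nn as nn

class PhoneticTemporalLID(nn.Module):
    def __init__(self, n_phonetic=400, hidden=256, n_langs=10):
        super().__init__()
        self.lstm = nn.LSTM(n_phonetic, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_langs)

    def forward(self, phonetic_feats):     # (batch, time, n_phonetic)
        h, _ = self.lstm(phonetic_feats)
        return self.cls(h.mean(dim=1))     # pool over frames, score languages
```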