22,988 research outputs found
A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition
We study large-scale kernel methods for acoustic modeling and compare to DNNs
on performance metrics related to both acoustic modeling and recognition.
Measuring perplexity and frame-level classification accuracy, kernel-based
acoustic models are as effective as their DNN counterparts. However, on
token-error-rates DNN models can be significantly better. We have discovered
that this might be attributed to DNN's unique strength in reducing both the
perplexity and the entropy of the predicted posterior probabilities. Motivated
by our findings, we propose a new technique, entropy regularized perplexity,
for model selection. This technique can noticeably improve the recognition
performance of both types of models, and reduces the gap between them. While
effective on Broadcast News, this technique could be also applicable to other
tasks.Comment: arXiv admin note: text overlap with arXiv:1411.400
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling of large vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address a direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM, that does not build on any phone clustering algorithm for the
determination of the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. The recognition accuracy
of our proposed models on the Switchboard task is comparable and outperforms
slightly the hybrid model using the standard state-tying decision trees.Comment: Submitted to Interspeech 202
- …