8,101 research outputs found
Embedding-Based Speaker Adaptive Training of Deep Neural Networks
An embedding-based speaker adaptive training (SAT) approach is proposed and
investigated in this paper for deep neural network acoustic modeling. In this
approach, speaker embedding vectors, which are a constant given a particular
speaker, are mapped through a control network to layer-dependent element-wise
affine transformations to canonicalize the internal feature representations at
the output of hidden layers of a main network. The control network for
generating the speaker-dependent mappings is jointly estimated with the main
network for the overall speaker adaptive acoustic modeling. Experiments on
large vocabulary continuous speech recognition (LVCSR) tasks show that the
proposed SAT scheme can yield superior performance over the widely-used
speaker-aware training using i-vectors with speaker-adapted input features
A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition
We study large-scale kernel methods for acoustic modeling and compare to DNNs
on performance metrics related to both acoustic modeling and recognition.
Measuring perplexity and frame-level classification accuracy, kernel-based
acoustic models are as effective as their DNN counterparts. However, on
token-error-rates DNN models can be significantly better. We have discovered
that this might be attributed to DNN's unique strength in reducing both the
perplexity and the entropy of the predicted posterior probabilities. Motivated
by our findings, we propose a new technique, entropy regularized perplexity,
for model selection. This technique can noticeably improve the recognition
performance of both types of models, and reduces the gap between them. While
effective on Broadcast News, this technique could be also applicable to other
tasks.Comment: arXiv admin note: text overlap with arXiv:1411.400
Acoustic modeling using the digital waveguide mesh
The digital waveguide mesh has been an active area of music acoustics research for over ten years. Although founded in 1-D digital waveguide modeling, the principles on which it is based are not new to researchers grounded in numerical simulation, FDTD methods, electromagnetic simulation, etc. This article has attempted to provide a considerable review of how the DWM has been applied to acoustic modeling and sound synthesis problems, including new 2-D object synthesis and an overview of recent research activities in articulatory vocal tract modeling, RIR synthesis, and reverberation simulation. The extensive, although not by any means exhaustive, list of references indicates that though the DWM may have parallels in other disciplines, it still offers something new in the field of acoustic simulation and sound synth
Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Conventional deep neural networks (DNN) for speech acoustic modeling rely on
Gaussian mixture models (GMM) and hidden Markov model (HMM) to obtain binary
class labels as the targets for DNN training. Subword classes in speech
recognition systems correspond to context-dependent tied states or senones. The
present work addresses some limitations of GMM-HMM senone alignments for DNN
training. We hypothesize that the senone probabilities obtained from a DNN
trained with binary labels can provide more accurate targets to learn better
acoustic models. However, DNN outputs bear inaccuracies which are exhibited as
high dimensional unstructured noise, whereas the informative components are
structured and low-dimensional. We exploit principle component analysis (PCA)
and sparse coding to characterize the senone subspaces. Enhanced probabilities
obtained from low-rank and sparse reconstructions are used as soft-targets for
DNN acoustic modeling, that also enables training with untranscribed data.
Experiments conducted on AMI corpus shows 4.6% relative reduction in word error
rate
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling of large vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address a direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM, that does not build on any phone clustering algorithm for the
determination of the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. The recognition accuracy
of our proposed models on the Switchboard task is comparable and outperforms
slightly the hybrid model using the standard state-tying decision trees.Comment: Submitted to Interspeech 202
- …