Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling for large-vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM that does not build on any phone clustering algorithm to
determine the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. On the Switchboard task,
the recognition accuracy of our proposed models is comparable to, and slightly
better than, that of the hybrid model using standard state-tying decision trees.
Comment: Submitted to Interspeech 202
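As an illustration of the kind of decomposition the abstract describes, the chain rule yields several equivalent factorizations of the joint posterior. The symbols here are assumptions introduced for this sketch, not notation taken from the paper: $x$ is the acoustic input, $\sigma$ the center phoneme state, and $\phi_\ell$, $\phi_r$ the left and right phoneme contexts. The abstract does not say which orderings the authors evaluate, so the two below are examples only.

```latex
\begin{align}
p(\phi_\ell, \sigma, \phi_r \mid x)
  &= p(\phi_\ell \mid x)\, p(\sigma \mid \phi_\ell, x)\, p(\phi_r \mid \sigma, \phi_\ell, x) \\
  &= p(\sigma \mid x)\, p(\phi_r \mid \sigma, x)\, p(\phi_\ell \mid \sigma, \phi_r, x)
\end{align}
```

Each factor is a conditional distribution over a single phoneme or state, so it could in principle be modeled by a separate network component that takes the conditioning symbols as embeddings.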
Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept
With the advent of direct models in automatic speech recognition (ASR), the
formerly prevalent frame-wise acoustic modeling based on hidden Markov models
(HMM) diversified into a number of modeling architectures like encoder-decoder
attention models, transducer models and segmental models (direct HMM). While
transducer models stay with a frame-level model definition, segmental models
are defined on the level of label segments directly. While
(soft-)attention-based models avoid explicit alignment, transducer and
segmental approaches do model alignment internally, either by segment hypotheses
or, more implicitly, by emitting so-called blank symbols. In this work, we
prove that the widely used class of RNN-Transducer models and segmental models
(direct HMM) are equivalent and therefore show equal modeling power. It is
shown that blank probabilities translate into segment length probabilities and
vice versa. In addition, we provide initial experiments investigating decoding
and beam-pruning, comparing time-synchronous and label-/segment-synchronous
search strategies and their properties using the same underlying model.
Comment: Accepted at Interspeech 202
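The correspondence between blank probabilities and segment length probabilities can be sketched numerically. The following is a simplified illustration, not the proof from the paper: it assumes a segment ends with exactly one label emission, that the blank probability depends only on the position within the segment, and that the last blank probability is 0 so the segment must end by the maximum length. The function names are hypothetical.

```python
from typing import List


def blank_to_length_probs(blank: List[float]) -> List[float]:
    """Map per-position blank probabilities to segment length probabilities.

    blank[i] is the (assumed) probability of emitting blank at position i+1
    of the segment. A segment of length l means l-1 blanks followed by one
    label emission, so P(length = l) = prod(blank[:l-1]) * (1 - blank[l-1]).
    """
    length_probs = []
    survive = 1.0  # probability that the segment has not ended yet
    for b in blank:
        length_probs.append(survive * (1.0 - b))
        survive *= b
    return length_probs


def length_to_blank_probs(length_probs: List[float]) -> List[float]:
    """Inverse mapping: recover blank probabilities from length probabilities.

    blank at position l equals P(length > l) / P(length >= l).
    """
    blank = []
    remaining = 1.0  # P(length >= l)
    for p in length_probs:
        # If no probability mass remains, the value is undefined; use 0.0.
        blank.append(1.0 - p / remaining if remaining > 0.0 else 0.0)
        remaining -= p
    return blank


if __name__ == "__main__":
    blanks = [0.5, 0.25, 0.0]
    lengths = blank_to_length_probs(blanks)
    print(lengths)
    print(length_to_blank_probs(lengths))
```

Under these assumptions the two mappings are inverses of each other, which mirrors the statement that blank probabilities translate into segment length probabilities and vice versa.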