Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling of large vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address a direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM, that does not build on any phone clustering algorithm for the
determination of the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. The recognition accuracy
of our proposed models on the Switchboard task is comparable to, and slightly better
than, that of the hybrid model using standard state-tying decision trees.
Comment: Submitted to Interspeech 202
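A minimal sketch of how one such factorization might be realized, assuming PyTorch and one illustrative decomposition, p(center, left, right | x) = p(center | x) p(left | center, x) p(right | center, left, x); the module names, layer sizes, and the particular factor order are assumptions for illustration, not the paper's actual architecture.

```python
# Sketch (PyTorch): factorized context-dependent posterior with phoneme embeddings.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FactorizedContextModel(nn.Module):
    def __init__(self, num_phonemes, feat_dim, hidden_dim=512, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.phoneme_emb = nn.Embedding(num_phonemes, emb_dim)
        # One softmax head per factor of the decomposition.
        self.center_head = nn.Linear(hidden_dim, num_phonemes)
        self.left_head = nn.Linear(hidden_dim + emb_dim, num_phonemes)
        self.right_head = nn.Linear(hidden_dim + 2 * emb_dim, num_phonemes)

    def forward(self, x, center_ids, left_ids):
        h = self.encoder(x)                    # acoustic encoding of the frame
        c_emb = self.phoneme_emb(center_ids)   # embedding of the center phoneme
        l_emb = self.phoneme_emb(left_ids)     # embedding of the left context
        center_logits = self.center_head(h)                                   # p(center | x)
        left_logits = self.left_head(torch.cat([h, c_emb], dim=-1))           # p(left | center, x)
        right_logits = self.right_head(torch.cat([h, c_emb, l_emb], dim=-1))  # p(right | center, left, x)
        return center_logits, left_logits, right_logits
```

In such a setup the three factor cross-entropies would be summed and minimized jointly, with the context heads fed ground-truth center/left labels during training; at recognition time the factors can be combined to score context-dependent states without any clustered state inventory.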
Robust Lexically Mediated Compensation for Coarticulation: Christmash Time Is Here Again
First published: 20 April 2021
A long-standing question in cognitive science is how high-level knowledge is integrated with sensory
input. For example, listeners can leverage lexical knowledge to interpret an ambiguous speech
sound, but do such effects reflect direct top-down influences on perception or merely postperceptual
biases? A critical test case in the domain of spoken word recognition is lexically mediated compensation
for coarticulation (LCfC). Previous LCfC studies have shown that a lexically restored context
phoneme (e.g., /s/ in Christma#) can alter the perceived place of articulation of a subsequent target
phoneme (e.g., the initial phoneme of a stimulus from a tapes-capes continuum), consistent with the
influence of an unambiguous context phoneme in the same position. Because this phoneme-to-phoneme
compensation for coarticulation is considered sublexical, scientists agree that evidence for LCfC would
constitute strong support for top–down interaction. However, results from previous LCfC studies have
been inconsistent, and positive effects have often been small. Here, we conducted extensive piloting of
stimuli prior to testing for LCfC. Specifically, we ensured that context items elicited robust phoneme
restoration (e.g., that the final phoneme of Christma# was reliably identified as /s/) and that unambiguous
context-final segments (e.g., a clear /s/ at the end of Christmas) drove reliable compensation for
coarticulation for a subsequent target phoneme. We observed robust LCfC in a well-powered, preregistered
experiment with these pretested items (N = 40) as well as in a direct replication study (N = 40).
These results provide strong evidence in favor of computational models of spoken word recognition
that include top–down feedback.
Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition
End-to-end training of deep learning-based models allows for implicit
learning of intermediate representations based on the final task loss. However,
the end-to-end approach ignores the useful domain knowledge encoded in explicit
intermediate-level supervision. We hypothesize that using intermediate
representations as auxiliary supervision at lower levels of deep networks may
be a good way of combining the advantages of end-to-end training and more
traditional pipeline approaches. We present experiments on conversational
speech recognition where we use lower-level tasks, such as phoneme recognition,
in a multitask training approach with an encoder-decoder model for direct
character transcription. We compare multiple types of lower-level tasks and
analyze the effects of the auxiliary tasks. Our results on the Switchboard
corpus show that this approach improves recognition accuracy over a standard
encoder-decoder model on the Eval2000 test set.
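A minimal sketch of the general idea, assuming PyTorch: an auxiliary phoneme classifier attached to an intermediate encoder layer contributes a weighted loss alongside the main character-level objective. The layer split, loss weight, and module names are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch (PyTorch): low-level auxiliary phoneme supervision in a multitask setup.
# Names, layer counts, and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderWithAuxiliary(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_phonemes):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.upper = nn.LSTM(hidden_dim, hidden_dim, num_layers=2, batch_first=True)
        self.phoneme_head = nn.Linear(hidden_dim, num_phonemes)  # auxiliary task head

    def forward(self, feats):
        low, _ = self.lower(feats)   # lower layers: auxiliary phoneme loss applied here
        high, _ = self.upper(low)    # upper layers: feed the attention decoder
        return high, self.phoneme_head(low)

def multitask_loss(char_loss, phoneme_logits, phoneme_targets, weight=0.3):
    """Main character loss plus a weighted frame-level phoneme cross-entropy."""
    aux = nn.functional.cross_entropy(
        phoneme_logits.transpose(1, 2), phoneme_targets)  # (B, C, T) vs (B, T)
    return char_loss + weight * aux
```

The design choice here is that the auxiliary gradient only reaches the lower encoder layers, nudging them toward phonetically meaningful representations while the final objective remains direct character transcription.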
Nonparametric Bayesian Double Articulation Analyzer for Direct Language Acquisition from Continuous Speech Signals
Human infants can discover words directly from unsegmented speech signals
without any explicitly labeled data. In this paper, we develop a novel machine
learning method called nonparametric Bayesian double articulation analyzer
(NPB-DAA) that can directly acquire language and acoustic models from observed
continuous speech signals. For this purpose, we propose an integrative
generative model that combines a language model and an acoustic model into a
single generative model called the "hierarchical Dirichlet process hidden
language model" (HDP-HLM). The HDP-HLM is obtained by extending the
hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by
Johnson et al. An inference procedure for the HDP-HLM is derived using the
blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure
enables the simultaneous and direct inference of language and acoustic models
from continuous speech signals. Based on the HDP-HLM and its inference
procedure, we developed a novel double articulation analyzer. By assuming the
HDP-HLM as a generative model of the observed time-series data and inferring
the latent variables of the model, the method can analyze the latent double
articulation structure of the data, i.e., hierarchically organized latent words
and phonemes, in an unsupervised manner. The novel unsupervised double
articulation analyzer is called NPB-DAA.
The NPB-DAA can automatically estimate double articulation structure embedded
in speech signals. We also carried out two evaluation experiments using
synthetic data and actual human continuous speech signals representing Japanese
vowel sequences. In the word acquisition and phoneme categorization tasks, the
NPB-DAA outperformed a conventional double articulation analyzer (DAA) and
baseline automatic speech recognition system whose acoustic model was trained
in a supervised manner.
Comment: 15 pages, 7 figures, Draft submitted to IEEE Transactions on Autonomous Mental Development (TAMD)
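To make the latent structure being inferred concrete, here is a toy generative sketch of double articulation (words composed of phonemes, phonemes emitting continuous observations) with a fixed, hand-specified inventory. The actual HDP-HLM replaces these finite inventories with nonparametric HDP priors and infers both levels with a blocked Gibbs sampler, which is not reproduced here; all names and parameters below are illustrative assumptions.

```python
# Toy sketch of the double articulation generative structure:
# latent words -> latent phoneme sequences -> continuous observations.
import numpy as np

rng = np.random.default_rng(0)

num_phonemes = 5
word_inventory = [[0, 1], [2, 3, 4], [1, 4]]        # words as phoneme sequences
phoneme_means = rng.normal(size=(num_phonemes, 2))  # Gaussian emission means (2-D features)

def sample_utterance(num_words=4, frames_per_phoneme=3):
    """Sample latent words, expand them into phonemes, and emit observations."""
    observations, words = [], []
    for _ in range(num_words):
        word_id = int(rng.integers(len(word_inventory)))  # unigram "language model"
        words.append(word_id)
        for ph in word_inventory[word_id]:
            # Each phoneme occupies several frames (semi-Markov-style durations).
            obs = rng.normal(loc=phoneme_means[ph], scale=0.1,
                             size=(frames_per_phoneme, 2))
            observations.append(obs)
    return words, np.concatenate(observations, axis=0)

words, signal = sample_utterance()
print(words, signal.shape)  # latent word sequence and the unsegmented signal
```

An unsupervised analyzer in the spirit of NPB-DAA would receive only the concatenated signal and attempt to recover both the phoneme-like segments and the word-like groupings jointly.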