KL-Divergence Guided Two-Beam Viterbi Algorithm on Factorial HMMs
This thesis addresses the high computational complexity that arises when decoding hidden Markov models (HMMs) with a large number of states. A novel approach, the two-beam Viterbi algorithm, which adds an extra forward beam, is implemented on a system that uses a factorial HMM to simultaneously recognize a pair of isolated digits on one audio channel. The two-beam Viterbi algorithm uses KL-divergence and hierarchical clustering to reduce the overall decoding complexity. This approach achieves 60% less computation than the baseline Viterbi beam search while maintaining 82.5% recognition accuracy.
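As a rough illustration of the KL-divergence-and-clustering ingredient described above, here is a minimal Python sketch that groups Gaussian HMM states by symmetrised KL divergence via hierarchical clustering, so that a decoder could prune whole clusters of states at once. The diagonal-covariance assumption and all names and sizes are illustrative; the thesis's two-beam decoder itself is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL divergence between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def cluster_states(means, variances, n_clusters):
    """Group HMM states by symmetrised KL so a beam can prune per cluster."""
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = (kl_diag_gauss(means[i], variances[i], means[j], variances[j])
                 + kl_diag_gauss(means[j], variances[j], means[i], variances[i]))
            dist[i, j] = dist[j, i] = d
    tree = linkage(squareform(dist), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy example: 8 one-dimensional Gaussian states grouped into 3 clusters.
rng = np.random.default_rng(0)
means = rng.normal(size=(8, 1))
variances = np.full((8, 1), 0.5)
print(cluster_states(means, variances, n_clusters=3))
```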
Using multiple visual tandem streams in audio-visual speech recognition
The so-called "tandem approach" in speech recognition has been shown to improve performance by using classifier posterior probabilities as observations in a hidden Markov model. We study the effect of using visual tandem features in audio-visual speech recognition with a novel setup that uses multiple classifiers to obtain multiple visual tandem features. We adopt the multi-stream hidden Markov model approach, in which visual tandem features from two different classifiers are treated as additional streams in the model. Our experiments show that using multiple visual tandem features improves recognition accuracy under various noise conditions. In addition, to handle asynchrony between the audio and visual observations, we employ coupled hidden Markov models and obtain improved performance compared to the synchronous model.
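As a minimal sketch of the stream combination such multi-stream HMMs typically use (not code from the paper), the per-state observation score can be formed as a weighted sum of per-stream log-likelihoods; the stream names and weights below are illustrative.

```python
import numpy as np

def multistream_log_likelihood(stream_loglikes, stream_weights):
    """Combine per-stream observation log-likelihoods for one HMM state.

    stream_loglikes: log p(o_s | state) for each stream s
    stream_weights:  per-stream exponents (confidences), typically summing to 1
    """
    return float(np.dot(stream_weights, stream_loglikes))

# Audio stream plus two visual tandem streams, scored for a single state.
loglikes = np.array([-12.3, -8.7, -9.1])  # audio, visual tandem 1, visual tandem 2
weights = np.array([0.6, 0.2, 0.2])
print(multistream_log_likelihood(loglikes, weights))
```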
Multimodal Speech Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Comment: 7 pages, accepted as a conference paper at IEEE SLT 2018
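The paper's code is not shown here; the following PyTorch sketch illustrates the dual recurrent encoder idea: separate RNNs over audio frames and text tokens whose final hidden states are concatenated and classified into the four emotion categories. All dimensions, the vocabulary size, and the concatenation fusion are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class DualRecurrentEncoder(nn.Module):
    """Encode audio features and text tokens with separate RNNs, then fuse."""

    def __init__(self, audio_dim=40, vocab_size=10000, embed_dim=128,
                 hidden=128, n_classes=4):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, tokens):
        _, h_audio = self.audio_rnn(audio)            # final state: (1, B, hidden)
        _, h_text = self.text_rnn(self.embed(tokens))
        fused = torch.cat([h_audio[-1], h_text[-1]], dim=-1)
        return self.classifier(fused)                 # logits over 4 emotions

# Toy batch: 2 utterances, 50 audio frames of 40-dim features, 12 text tokens.
model = DualRecurrentEncoder()
logits = model(torch.randn(2, 50, 40), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 4])
```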
Nonparametric Bayesian Double Articulation Analyzer for Direct Language Acquisition from Continuous Speech Signals
Human infants can discover words directly from unsegmented speech signals
without any explicitly labeled data. In this paper, we develop a novel machine
learning method called nonparametric Bayesian double articulation analyzer
(NPB-DAA) that can directly acquire language and acoustic models from observed
continuous speech signals. For this purpose, we propose an integrative
generative model that combines a language model and an acoustic model into a
single generative model called the "hierarchical Dirichlet process hidden
language model" (HDP-HLM). The HDP-HLM is obtained by extending the
hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by
Johnson et al. An inference procedure for the HDP-HLM is derived using the
blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure
enables the simultaneous and direct inference of language and acoustic models
from continuous speech signals. Based on the HDP-HLM and its inference
procedure, we developed a novel double articulation analyzer. By assuming the
HDP-HLM as a generative model of the observed time-series data and inferring
the latent variables of the model, the method can analyze the latent double
articulation structure of the data, i.e., hierarchically organized latent
words and phonemes, in an unsupervised manner. This novel unsupervised double
articulation analyzer is called the NPB-DAA.
The NPB-DAA can automatically estimate double articulation structure embedded
in speech signals. We also carried out two evaluation experiments using
synthetic data and actual human continuous speech signals representing Japanese
vowel sequences. In the word acquisition and phoneme categorization tasks, the
NPB-DAA outperformed a conventional double articulation analyzer (DAA) and a
baseline automatic speech recognition system whose acoustic model was trained
in a supervised manner.
Comment: 15 pages, 7 figures, draft submitted to IEEE Transactions on Autonomous Mental Development (TAMD)
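To make the layered generative story concrete, here is a minimal forward-sampling sketch of the word-to-phoneme-to-signal hierarchy with finite stand-ins for the HDP priors. The lexicon, unigram word probabilities, Gaussian acoustic model, and Poisson durations are all illustrative assumptions; the actual HDP-HLM places nonparametric priors over these components, and the paper's blocked Gibbs sampler runs inference in the reverse direction, from signal to latent words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite stand-ins for the nonparametric (HDP) priors in the HDP-HLM.
lexicon = {0: [0, 1], 1: [2], 2: [1, 2, 0]}   # word -> phoneme sequence
word_probs = np.array([0.5, 0.3, 0.2])        # unigram language model
phone_means = np.array([-2.0, 0.0, 2.0])      # Gaussian acoustic model per phoneme

def generate_utterance(n_words):
    """Sample the two-level (word -> phoneme -> signal) generative story."""
    signal, latent = [], []
    for _ in range(n_words):
        w = rng.choice(len(word_probs), p=word_probs)  # draw a latent word
        for p in lexicon[w]:
            dur = rng.poisson(5) + 1                   # semi-Markov duration
            signal.extend(rng.normal(phone_means[p], 0.3, size=dur))
            latent.append((w, p, dur))
    return np.array(signal), latent

signal, latent = generate_utterance(n_words=3)
print(len(signal), latent)
```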