3,458 research outputs found
Towards Structured Deep Neural Network for Automatic Speech Recognition
In this paper we propose the Structured Deep Neural Network (Structured DNN)
as a structured and deep learning algorithm, learning to find the best
structured object (such as a label sequence) given a structured input (such as
a vector sequence) by globally considering the mapping relationships between
the structure rather than item by item.
When automatic speech recognition is viewed as a special case of such a
structured learning problem, where we have the acoustic vector sequence as the
input and the phoneme label sequence as the output, it becomes possible to
comprehensively learned utterance by utterance as a whole, rather than frame by
frame.
Structured Support Vector Machine (structured SVM) was proposed to perform
ASR with structured learning previously, but limited by the linear nature of
SVM. Here we propose structured DNN to use nonlinear transformations in
multi-layers as a structured and deep learning algorithm. It was shown to beat
structured SVM in preliminary experiments on TIMIT
Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, although it is not difficult to collect a large set of audio data for
each user, it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed over a personalized audio data set recorded from
Facebook posts demonstrated that very good improvements can be achieved in both
frame accuracy and word accuracy over popularly-considered baselines such as
fDLR, speaker code and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
- …