Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition
In this paper, a modification to the training process of the popular SPLICE
algorithm has been proposed for noise robust speech recognition. The
modification is based on feature correlations, and enables this stereo-based
algorithm to improve the performance in all noise conditions, especially in
unseen cases. Further, the modified framework is extended to work for
non-stereo datasets where clean and noisy training utterances, but not stereo
counterparts, are required. Finally, an MLLR-based computationally efficient
run-time noise adaptation method in the SPLICE framework has been proposed. The
modified SPLICE shows an 8.6% absolute improvement over SPLICE on Test C of the
Aurora-2 database, and 2.93% overall. The non-stereo method shows 10.37% and 6.93%
absolute improvements over the Aurora-2 and Aurora-4 baseline models, respectively.
Run-time adaptation shows a 9.89% absolute improvement in the modified framework as
compared to SPLICE for Test C, and 4.96% overall w.r.t. standard MLLR
adaptation on HMMs.
Comment: Submitted to Automatic Speech Recognition and Understanding (ASRU) 2013 Workshop
Learning An Invariant Speech Representation
Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of the signal that are maximally
invariant to intraclass transformations and deformations. We propose an
extension of a theory for unsupervised learning of invariant visual
representations to the auditory domain and empirically evaluate its validity
for voiced speech sound classification. Our version of the theory requires the
memory-based, unsupervised storage of acoustic templates -- such as specific
phones or words -- together with all the transformations of each that normally
occur. A quasi-invariant representation for a speech segment can be obtained by
projecting it to each template orbit, i.e., the set of transformed signals, and
computing the associated one-dimensional empirical probability distributions.
The computations can be performed by modules of filtering and pooling, and
extended to hierarchical architectures. In this paper, we apply a single-layer,
multicomponent representation for phonemes and demonstrate improved accuracy
and decreased sample complexity for vowel classification compared to standard
spectral, cepstral and perceptual features.
Comment: CBMM Memo No. 022, 5 pages, 2 figures
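
The projection-and-pooling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: circular shifts stand in for the stored transformations of each template, a fixed-width histogram stands in for the empirical probability distribution, and the names `shift_orbit`, `orbit_histogram` and `signature` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def shift_orbit(template):
    """Assumed transformation set: all circular shifts of one template."""
    n = len(template)
    return [template[k:] + template[:k] for k in range(n)]

def orbit_histogram(signal, orbit, n_bins=8):
    """Histogram of normalized dot products between the signal and every orbit element."""
    s = normalize(signal)
    counts = [0] * n_bins
    for t in orbit:
        p = sum(a * b for a, b in zip(s, normalize(t)))  # projection, in [-1, 1]
        idx = min(int((p + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(orbit) for c in counts]

def signature(signal, templates, n_bins=8):
    """Concatenate one histogram per template orbit into a single feature vector."""
    sig = []
    for t in templates:
        sig.extend(orbit_histogram(signal, shift_orbit(t), n_bins))
    return sig

templates = [[1, 0, 0, 0], [1, 1, 0, 0]]
x = [1, 2, 0, 0]
x_shifted = [0, 1, 2, 0]  # the same signal under one of the assumed transformations
print(signature(x, templates) == signature(x_shifted, templates))
```

Because shifting the input only permutes the set of projections onto a shift orbit, the pooled histogram is unchanged in this toy setting, which is the sense of invariance the sketch is meant to show.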
Porting concepts from DNNs back to GMMs
Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMMs) on a variety of speech recognition benchmarks. In this paper, we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Regardless of their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
A review of domain adaptation without target labels
Domain adaptation has become a prominent problem setting in machine learning
and related fields. This review asks the question: how can a classifier learn
from a source domain and generalize to a target domain? We present a
categorization of approaches, divided into what we refer to as sample-based,
feature-based and inference-based methods. Sample-based methods focus on
weighting individual observations during training based on their importance to
the target domain. Feature-based methods revolve around mapping, projecting
and representing features such that a source classifier performs well on the
target domain, while inference-based methods incorporate adaptation into the
parameter estimation procedure, for instance through constraints on the
optimization procedure. Additionally, we review a number of conditions that
allow for formulating bounds on the cross-domain generalization error. Our
categorization highlights recurring ideas and raises questions important to
further research.
Comment: 20 pages, 5 figures
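
The sample-based idea mentioned in this abstract, weighting source observations by their importance to the target domain, can be sketched as follows. This is only one possible estimator, chosen for brevity: it assumes one-dimensional inputs and fits a single Gaussian to each domain to approximate the density ratio. The function names are hypothetical and are not taken from the review.

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(var, 1e-12)  # floor the variance to avoid division by zero

def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def importance_weights(source_x, target_x):
    """Weight each source point by the estimated ratio p_target(x) / p_source(x)."""
    mu_s, var_s = fit_gaussian(source_x)
    mu_t, var_t = fit_gaussian(target_x)
    w = [gaussian_pdf(x, mu_t, var_t) / gaussian_pdf(x, mu_s, var_s)
         for x in source_x]
    total = sum(w)
    return [wi * len(w) / total for wi in w]  # rescale so the weights average to 1

def weighted_risk(losses, weights):
    """Importance-weighted average of per-sample source losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

source_x = [0.0, 1.0, 2.0, 3.0, 4.0]
target_x = [3.0, 4.0, 5.0]
weights = importance_weights(source_x, target_x)
print(weights)
```

Source points that lie where the target data concentrate receive larger weights, so a loss averaged with `weighted_risk` emphasizes the observations most relevant to the target domain.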