Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition
In this paper, a modification to the training process of the popular SPLICE
algorithm is proposed for noise-robust speech recognition. The modification is
based on feature correlations and enables this stereo-based algorithm to
improve performance in all noise conditions, especially in unseen ones.
Further, the modified framework is extended to non-stereo datasets, where
clean and noisy training utterances are required but not their stereo
counterparts. Finally, an MLLR-based, computationally efficient run-time noise
adaptation method in the SPLICE framework is proposed. The modified SPLICE
shows an 8.6% absolute improvement over SPLICE on Test C of the Aurora-2
database, and 2.93% overall. The non-stereo method shows 10.37% and 6.93%
absolute improvements over the Aurora-2 and Aurora-4 baseline models,
respectively. Run-time adaptation shows a 9.89% absolute improvement in the
modified framework compared to SPLICE on Test C, and 4.96% overall with
respect to standard MLLR adaptation on HMMs.
Comment: Submitted to the Automatic Speech Recognition and Understanding
(ASRU) 2013 Workshop
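The core SPLICE compensation step underlying the abstract can be sketched as follows: a GMM over noisy features yields component posteriors, and the clean-feature estimate is the noisy feature plus a posterior-weighted sum of per-component bias vectors. All parameter values below are toy assumptions, not the paper's trained models.

```python
import numpy as np

def gmm_posteriors(y, means, variances, weights):
    """Posterior p(k | y) of each diagonal-covariance GMM component for one frame."""
    log_lik = -0.5 * np.sum((y - means) ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=1)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()              # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def splice_enhance(y, means, variances, weights, biases):
    """Classic SPLICE form: x_hat = y + sum_k p(k | y) * b_k."""
    post = gmm_posteriors(y, means, variances, weights)
    return y + post @ biases

# Toy model: 2 components over 3-dimensional noisy features.
means = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
variances = np.ones((2, 3))
weights = np.array([0.5, 0.5])
biases = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])   # hypothetical b_k

y = np.array([5.0, 5.0, 5.0])       # frame clearly in component 1's region
x_hat = splice_enhance(y, means, variances, weights, biases)
```

In standard SPLICE the biases b_k are trained from stereo (paired clean/noisy) frames; the paper's contribution modifies that training step and relaxes the stereo requirement, which this sketch does not reproduce.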
Porting concepts from DNNs back to GMMs
Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood-trained Gaussians as their first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
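The "wide" direction mentioned above (multiple parallel sub-models) can be illustrated with a minimal sketch: several GMM sub-models score the same frame, and their log-likelihoods are combined with mixing weights. The sub-model parameters and combination weights below are illustrative assumptions, not the paper's trained system, and the "deep" stacking is not reproduced here.

```python
import numpy as np

def gmm_loglik(x, means, variances, weights):
    """Log-likelihood of one frame under a diagonal-covariance GMM (log-sum-exp)."""
    ll = -0.5 * np.sum((x - means) ** 2 / variances
                       + np.log(2 * np.pi * variances), axis=1)
    m = ll.max()
    return m + np.log(np.sum(weights * np.exp(ll - m)))

def wide_score(x, sub_models, combo_weights):
    """One flavour of 'wide': weighted sum of per-sub-model log-likelihoods."""
    return sum(w * gmm_loglik(x, *m)
               for w, m in zip(combo_weights, sub_models))

# Two toy single-Gaussian sub-models over 2-dimensional features.
m1 = (np.zeros((1, 2)), np.ones((1, 2)), np.array([1.0]))
m2 = (np.ones((1, 2)), np.ones((1, 2)), np.array([1.0]))
x = np.array([0.5, 0.5])
score = wide_score(x, [m1, m2], [0.5, 0.5])
```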
Joint model-based recognition and localization of overlapped acoustic events using a set of distributed small microphone arrays
In the analysis of acoustic scenes, the occurring sounds often have to be
detected in time, recognized, and localized in space. Usually, each of these
tasks is done separately. In this paper, a model-based approach that carries
them out jointly for the case of multiple simultaneous sources is presented
and tested. The recognized event classes and their respective room positions
are obtained with a single system that maximizes the combination of a large
set of scores, each one resulting from a different acoustic event model and a
different beamformer output signal, which comes from one of several
arbitrarily located small microphone arrays. Using a two-step method,
experimental work is reported for a specific scenario consisting of
meeting-room acoustic events, either isolated or overlapped with speech.
Tests carried out with two datasets show the advantage of the proposed
approach over some usual techniques, and that the inclusion of estimated
priors brings a further performance improvement.
Comment: Computational acoustic scene analysis, microphone array signal
processing, acoustic event detection
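The joint decision rule described above can be sketched as a search over (event class, room position) pairs: each acoustic-event model is scored against each beamformer output (one beamformer per candidate position), optional position priors are added, and the pair with the highest combined score is selected. The score values here are random toy numbers standing in for real model-vs-signal log-scores.

```python
import numpy as np

# One log-score per (event model, beamformer) pair; each beamformer is
# steered to one candidate room position. Random toy values stand in for
# the real acoustic scores.
rng = np.random.default_rng(0)
n_classes, n_positions = 4, 6
scores = rng.normal(size=(n_classes, n_positions))

log_priors = np.zeros(n_positions)   # flat priors; estimated priors would go here

joint = scores + log_priors          # combine acoustic scores and position priors
c_hat, p_hat = np.unravel_index(np.argmax(joint), joint.shape)
# (c_hat, p_hat) is the jointly recognized event class and localized position
```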
Reinforcement Learning of Speech Recognition System Based on Policy Gradient and Hypothesis Selection
Speech recognition systems have achieved high recognition performance for
several tasks. However, the performance of such systems is dependent on the
tremendously costly development work of preparing vast amounts of task-matched
transcribed speech data for supervised training. The key problem here is the
cost of transcribing speech data, which is incurred repeatedly to support new
languages and new tasks. Assuming broad network services that transcribe
speech data for many users, a system would become more self-sufficient and
more useful if it could learn from very light user feedback without annoying
the users. In this paper, we propose a general reinforcement
learning framework for speech recognition systems based on the policy gradient
method. As a particular instance of the framework, we also propose a hypothesis
selection-based reinforcement learning method. The proposed framework provides
a new view for several existing training and adaptation methods. The
experimental results show that the proposed method improves the recognition
performance compared to unsupervised adaptation.
Comment: 5 pages, 6 figures
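A minimal sketch of the policy-gradient (REINFORCE) idea for hypothesis selection: a softmax policy over an N-best list, parameterized by a linear weight on per-hypothesis features, is updated in the direction of the log-policy gradient scaled by a reward that stands in for light user feedback. The features, rewards, and learning rate below are illustrative assumptions, not the paper's system.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, features, reward_fn, lr=0.1, rng=None):
    """One REINFORCE update of a softmax policy over N-best hypotheses."""
    rng = rng or np.random.default_rng(0)   # fresh seed makes the sketch deterministic
    probs = softmax(features @ theta)       # policy pi(k | theta)
    k = rng.choice(len(probs), p=probs)     # sample a hypothesis to show the user
    r = reward_fn(k)                        # light user feedback as reward
    grad = features[k] - probs @ features   # gradient of log pi(k | theta)
    return theta + lr * r * grad

# Toy N-best list of 3 hypotheses with 2 hypothetical features each;
# feedback rewards hypothesis 0 and penalizes the others.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
theta = np.zeros(2)
for _ in range(200):
    theta = reinforce_step(theta, features,
                           reward_fn=lambda k: 1.0 if k == 0 else -1.0)
```

After training, the policy concentrates its probability mass on the rewarded hypothesis.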
Speech Recognition in noisy environment using Deep Learning Neural Network
Recent research in the field of automatic speaker recognition has shown that
methods based on deep learning neural networks provide better performance than
other statistical classifiers. On the other hand, these methods usually
require the adjustment of a significant number of parameters. The goal of this
thesis is to show that selecting appropriate parameter values can
significantly improve the speaker recognition performance of methods based on
deep learning neural networks. The reported study introduces an approach to
automatic speaker recognition based on deep neural networks and the stochastic
gradient descent algorithm. It particularly focuses on three parameters of the
stochastic gradient descent algorithm: the learning rate, and the hidden and
input layer dropout rates. Additional attention was devoted to the research
question of speaker recognition under noisy conditions.

Thus, two experiments were conducted in the scope of this thesis. The first
experiment was intended to demonstrate that optimizing the observed parameters
of the stochastic gradient descent algorithm can improve speaker recognition
performance in the absence of noise. This experiment was conducted in two
phases. In the first phase, the recognition rate was observed while the hidden
layer dropout rate and the learning rate were varied and the input layer
dropout rate was held constant. In the second phase, the recognition rate was
observed while the input layer dropout rate and the learning rate were varied
and the hidden layer dropout rate was held constant. The second experiment was
intended to show that optimizing the observed parameters of the stochastic
gradient descent algorithm can improve speaker recognition performance even
under noisy conditions. Thus, different noise levels were artificially applied
to the original speech signal.
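The phase-style grid over learning rate and dropout rate described above can be sketched on a toy logistic-regression "network" standing in for the thesis's deep network: the learning rate and input-dropout rate are varied while the other settings stay fixed (there is no hidden layer in this simplified sketch, so only input dropout appears). The data, grid values, and model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))            # toy "speaker" feature vectors
w_true = rng.normal(size=10)
y = (X @ w_true > 0).astype(float)        # toy binary labels

def train_eval(lr, input_dropout, epochs=50):
    """Train a logistic regressor with gradient descent and inverted input dropout."""
    w = np.zeros(10)
    for _ in range(epochs):
        keep = 1.0 - input_dropout
        mask = (rng.random(X.shape) < keep) / keep   # inverted dropout scaling
        Xd = X * mask
        p = 1.0 / (1.0 + np.exp(-(Xd @ w)))
        w -= lr * Xd.T @ (p - y) / len(y)            # full-batch step for brevity
    preds = (1.0 / (1.0 + np.exp(-(X @ w)))) > 0.5
    return np.mean(preds == y)                       # training accuracy

# Grid over the two observed parameters (toy values).
grid = [(lr, d) for lr in (0.01, 0.1, 1.0) for d in (0.0, 0.2, 0.5)]
best_lr, best_drop = max(grid, key=lambda hp: train_eval(*hp))
```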
An end-to-end machine learning system for harmonic analysis of music
We present a new system for the simultaneous estimation of keys, chords, and
bass notes from music audio. It makes use of a novel chromagram representation
of audio that takes the perception of loudness into account. Furthermore, it
is fully based on machine learning (instead of expert knowledge), so that it
is potentially applicable to a wider range of genres as long as training data
is available. Compared to other models, the proposed system is fast and memory
efficient, while achieving state-of-the-art performance.
Comment: MIREX report and preparation of a journal submission
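A minimal sketch of a chromagram, assuming an FFT-based mapping of frequency bins onto 12 pitch classes; a crude log compression of bin magnitudes stands in for the perceptual loudness weighting the abstract describes (a simplification, not the paper's model).

```python
import numpy as np

def chromagram(frame, sr):
    """Map FFT magnitudes of one frame onto 12 pitch classes (A = class 0)."""
    windowed = frame * np.hanning(len(frame))   # Hann window limits leakage
    spec = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], spec[1:]):       # skip the DC bin
        pitch_class = int(round(12 * np.log2(f / 440.0))) % 12
        chroma[pitch_class] += np.log1p(m)      # crude loudness-style compression
    return chroma / max(chroma.sum(), 1e-9)     # normalize to a distribution

sr = 8000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440.0 * t)           # pure A4 tone
c = chromagram(frame, sr)
```

For the pure 440 Hz tone, the mass concentrates on pitch class 0 (A), as expected.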