Phoneme Based Speaker Verification System Based on Two Stage Self-Organizing Map Design
Speaker verification is a pattern recognition task that authenticates a person by his or her voice. This thesis deals with a relatively new classification technique, the self-organizing map (SOM). As an unsupervised artificial neural network, the self-organizing map is rarely used as the final classification step in pattern recognition tasks because of its relatively low accuracy. A two-stage self-organizing map design has been implemented in this thesis and shows improved results over the conventional single-stage design. For speech feature extraction, this thesis does not introduce any new technique; a well-studied method, linear predictive analysis (LPA), is used. Coefficients derived from linear predictive analysis are extracted from the segmented raw speech signal to train and test the front-stage self-organizing map. Unlike other multistage or hierarchical self-organizing map designs, this thesis uses residual vectors generated by the front-stage self-organizing map to train and test the second-stage self-organizing map. The results show that breaking the classification task into two levels of more detailed resolution yields an improvement of more than 5%. Moreover, the computation time is greatly reduced.
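The two-stage idea described above can be illustrated with a minimal sketch. It assumes feature vectors (for example, LPA coefficients) have already been extracted, and it assumes the residual is the input vector minus the weight vector of its best-matching front-stage unit; the abstract does not define the residual, the map topology, or the training schedule, so the function names, the 1-D map layout, and all parameter values here are illustrative only.

```python
import math
import random


def bmu_index(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def train_som(data, n_units, epochs=30, lr0=0.5, seed=0):
    """Train a 1-D SOM (units on a line) with a shrinking Gaussian neighbourhood."""
    rng = random.Random(seed)
    # Initialise each unit with a copy of a randomly chosen training vector.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    sigma0 = max(n_units / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighbourhood width decays too
        for x in rng.sample(data, len(data)):      # shuffled pass over the data
            b = bmu_index(weights, x)
            for i, w in enumerate(weights):
                h = math.exp(-((i - b) ** 2) / (2.0 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


def residual(weights, x):
    """Assumed residual: input minus the weight vector of its best-matching unit."""
    b = bmu_index(weights, x)
    return [v - w for v, w in zip(x, weights[b])]


def train_two_stage(data, n_front, n_second):
    """Train the front map on the features, then the second map on the residuals."""
    front = train_som(data, n_front, seed=0)
    residuals = [residual(front, x) for x in data]
    second = train_som(residuals, n_second, seed=1)
    return front, second


def encode(front, second, x):
    """Return (front-stage unit, second-stage unit) for one feature vector."""
    return bmu_index(front, x), bmu_index(second, residual(front, x))
```

The pair of indices returned by `encode` gives a coarse label from the front map and a finer label from the residual map. How the thesis turns such labels into an accept or reject decision is not stated in the abstract, so that step is left out.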
Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities
Voice conversion (VC) using sequence-to-sequence learning of context
posterior probabilities is proposed. Conventional VC using shared context
posterior probabilities predicts target speech parameters from the context
posterior probabilities estimated from the source speech parameters. Although
conventional VC can be built from non-parallel data, it is difficult to convert
speaker individuality such as phonetic property and speaking rate contained in
the posterior probabilities because the source posterior probabilities are
directly used for predicting target speech parameters. In this work, we assume
that the training data partly include parallel speech data and propose
sequence-to-sequence learning between the source and target posterior
probabilities. The conversion models perform non-linear and variable-length
transformation from the source probability sequence to the target one. Further,
we propose a joint training algorithm for the modules. In contrast to
conventional VC, which separately trains the speech recognition that estimates
posterior probabilities and the speech synthesis that predicts target speech
parameters, our proposed method jointly trains these modules along with the
proposed probability conversion modules. Experimental results demonstrate that
our approach outperforms the conventional VC.
Comment: Accepted to INTERSPEECH 201
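The difference between the two pipelines can be written compactly. The notation below is assumed for illustration and is not taken from the paper: R is the speech recognition module, S the speech synthesis module, and C the proposed probability conversion module; the abstract does not give the form of the joint training objective, so none is shown.

```latex
% Assumed notation (not from the paper):
%   x_{1:T}   source speech parameters (T frames)
%   p_{1:T}   posterior probabilities estimated from the source
%   q_{1:T'}  converted posterior probabilities (T' may differ from T)
%   \hat{y}   predicted target speech parameters
\begin{align*}
\text{Conventional VC:}\quad
  & p_{1:T} = R(x_{1:T}), \qquad \hat{y}_{1:T} = S(p_{1:T}) \\
\text{Proposed VC:}\quad
  & p_{1:T} = R(x_{1:T}), \qquad q_{1:T'} = C(p_{1:T}), \qquad \hat{y}_{1:T'} = S(q_{1:T'})
\end{align*}
```

In the conventional case the source posteriors feed the synthesizer directly, so the output keeps the source's length and timing. In the proposed case C maps the source sequence to a target sequence of possibly different length, and R, C, and S are trained jointly rather than separately.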
The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we
combine recent developments in neural-network-based acoustic and language
modeling to advance the state of the art on the Switchboard recognition task.
Inspired by machine learning ensemble techniques, the system uses a range of
convolutional and recurrent neural networks. I-vector modeling and lattice-free
MMI training provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward running RNNLMs, and
word posterior-based system combination provide a 20% boost. The best single
system uses a ResNet architecture acoustic model with RNNLM rescoring, and
achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The
combined system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
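For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the NIST scoring tool used for the benchmark; the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction when moving from `baseline` to `improved`."""
    return (baseline - improved) / baseline
```

Applied to the figures quoted above, `relative_reduction(6.9, 6.2)` is about 0.10, so the combined system's error rate is roughly 10% lower in relative terms than that of the best single system.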