140 research outputs found
Augmentation of adaptation data
Linear regression based speaker adaptation approaches can improve Automatic Speech Recognition (ASR) accuracy significantly for a target speaker. However, when the available adaptation data is limited to a few seconds, the accuracy of the speaker adapted models is often worse compared with speaker independent models. In this paper, we propose an approach to select a set of reference speakers acoustically close to the target speaker whose data can be used to augment the adaptation data. To determine the acoustic similarity of two speakers, we propose a distance metric based on transforming sample points in the acoustic space with the regression matrices of the two speakers. We show the validity of this approach through a speaker identification task. ASR results on SCOTUS and AMI corpora with limited adaptation data of 10 to 15 seconds augmented by data from selected reference speakers show a significant improvement in Word Error Rate over speaker independent and speaker adapted models
Using contextual information in Joint Factor Eigenspace MLLR for speech recognition in diverse scenarios
This paper presents a new approach for rapid adaptation in the presence of highly diverse scenarios that takes advantage of information describing the input signals. We introduce a new method for joint factorisation of the background and the speaker in an eigenspace MLLR framework: Joint Factor Eigenspace MLLR (JFEMLLR). We further propose to use contextual information describing the speaker and background, such as tags or more complex metadata, to provide an immediate estimation of the best MLLR transformation for the utterance. This provides instant adaptation, since it does not require any transcription from a previous decoding stage. Evaluation in a highly diverse Automatic Speech Recognition (ASR) task, a modified version of WSJCAM0, yields an improvement of 26.9% over the baseline, which is an extra 1.2% reduction over two-pass MLLR adaptation
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27
figure
Extrapolating single view face models for multi-view recognition
Copyright © 2004 IEEEPerformance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when there is only one training image available. We address this problem by extending each statistical frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of maximum likelihood linear regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors.Conrad Sanderson and Samy Bengi
Confidence Scoring and Speaker Adaptation in Mobile Automatic Speech Recognition Applications
Generally, the user group of a language is remarkably diverse in terms of speaker-specific characteristics such as dialect and speaking style. Hence, quality of spoken content varies notably from one individual to another. This diversity causes problems for Automatic Speech Recognition systems. An Automatic Speech Recognition system should be able to assess the hypothesised results. This can be done by evaluating a confidence measure on the recognition results and comparing the resulting measure to a specified threshold. This threshold value, referred to as confidence score, informs how reliable a particular recognition result is for the given speech.
A system should perform optimally irrespective of input speaker characteristics. However, most systems are inflexible and non-adaptive and thus, speaker adaptability can be improved. For achieving these purposes, a solid criterion is required to evaluate the quality of spoken content and the system should be made robust and adaptive towards new speakers as well.
This thesis implements a confidence score using posterior probabilities to examine the quality of the output, based on the speech data and corpora provided by Devoca Oy. Furthermore, speaker adaptation algorithms: Maximum Likelihood Linear Regression and Maximum a Posteriori are applied on a GMM-HMM system and their results are compared. Experiments show that Maximum a Posteriori adaptation brings 2% to 25% improvement in word error rates of semi-continuous model and is recommended for use in the commercial product. The results of other methods are also reported. In addition, word graph is suggested as the method for obtaining posterior probabilities. Since it guarantees no such improvement in the results, the confidence score is proposed as an optional feature for the system
Automatic Speech Recognition for ageing voices
With ageing, human voices undergo several changes which are typically characterised
by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking
rate. The focus of this thesis is to understand the impact of ageing on Automatic
Speech Recognition (ASR) performance and improve the ASR accuracies for older
voices.
Baseline results on three corpora indicate that the word error rates (WER) for older
adults are significantly higher than those of younger adults and the decrease in accuracies
is higher for males speakers as compared to females.
Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies
were found to be significantly higher for older adults. However, the hypothesis
that these changes explain the differences in WER for the two age groups is proven incorrect.
Experiments with artificial introduction of glottal source disfluencies in speech
from younger adults do not display a significant impact on WERs. Changes in fundamental
frequency observed quite often in older voices has a marginal impact on ASR
accuracies.
Analysis of phoneme errors between younger and older speakers shows a pattern
of certain phonemes especially lower vowels getting more affected with ageing. These
changes however are seen to vary across speakers. Another factor that is strongly associated
with ageing voices is a decrease in the rate of speech. Experiments to analyse
the impact of slower speaking rate on ASR accuracies indicate that the insertion errors
increase while decoding slower speech with models trained on relatively faster speech.
We then propose a way to characterise speakers in acoustic space based on speaker
adaptation transforms and observe that speakers (especially males) can be segregated
with reasonable accuracies based on age. Inspired by this, we look at supervised hierarchical
acoustic models based on gender and age. Significant improvements in word
accuracies are achieved over the baseline results with such models. The idea is then extended
to construct unsupervised hierarchical models which also outperform the baseline
models by a good margin.
Finally, we hypothesize that the ASR accuracies can be improved by augmenting
the adaptation data with speech from acoustically closest speakers. A strategy to select
the augmentation speakers is proposed. Experimental results on two corpora indicate
that the hypothesis holds true only when the amount of available adaptation is limited
to a few seconds. The efficacy of such a speaker selection strategy is analysed for both
younger and older adults
- …