Search CORE

19 research outputs found

Automatic Speech Recognition for ageing voices

Author: Vipperla Ravichander
Publication venue: The University of Edinburgh
Publication date: 24/11/2011
Field of study

With ageing, human voices undergo several changes which are typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and improve the ASR accuracies for older voices. Baseline results on three corpora indicate that the word error rates (WER) for older adults are significantly higher than those of younger adults and the decrease in accuracies is higher for males speakers as compared to females. Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER for the two age groups is proven incorrect. Experiments with artificial introduction of glottal source disfluencies in speech from younger adults do not display a significant impact on WERs. Changes in fundamental frequency observed quite often in older voices has a marginal impact on ASR accuracies. Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes especially lower vowels getting more affected with ageing. These changes however are seen to vary across speakers. Another factor that is strongly associated with ageing voices is a decrease in the rate of speech. Experiments to analyse the impact of slower speaking rate on ASR accuracies indicate that the insertion errors increase while decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated with reasonable accuracies based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracies are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models which also outperform the baseline models by a good margin. Finally, we hypothesize that the ASR accuracies can be improved by augmenting the adaptation data with speech from acoustically closest speakers. A strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults

Edinburgh Research Archive

Online Non-Negative Convolutive Pattern Learning for Speech Signals

Author: Dong Wang
Nicholas Evans
Ravichander Vipperla
Thomas Fang Zheng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Iterative Compression of End-to-End ASR Model using AutoML

Author: Abdelfattah Mohamed S.
Bhattacharya Sourav
Dudziak Łukasz
Ishtiaq Samin
Kim Daehyun
Lane Nicholas D.
Lee SangJeong
Lee Young-yoon
Mehrotra Abhinav
Ramos Alberto Gil C. P.
Vipperla Ravichander
Yeo Jinsu
Publication venue
Publication date: 06/08/2020
Field of study

Increasing demand for on-device Automatic Speech Recognition (ASR) systems has resulted in renewed interests in developing automatic model compression techniques. Past research have shown that AutoML-based Low Rank Factorization (LRF) technique, when applied to an end-to-end Encoder-Attention-Decoder style ASR model, can achieve a speedup of up to 3.7x, outperforming laborious manual rank-selection approaches. However, we show that current AutoML-based search techniques only work up to a certain compression level, beyond which they fail to produce compressed models with acceptable word error rates (WER). In this work, we propose an iterative AutoML-based LRF approach that achieves over 5x compression without degrading the WER, thereby advancing the state-of-the-art in ASR compression

arXiv.org e-Print Archive

Crossref

Evolutionary discriminative confidence estimation for spoken term detection

Author: A Sierra
A Sierra
Alejandro Echeverría
D Wang
D Watson
Dong Wang
HG Beyer
I Szöke
I Szöke
I Szöke
Javier Tejedor
K Thambiratmann
M Bisani
M Mitchell
M Rocha
M Rocha
O Cordón
R Damper
RA Fisher
Ravichander Vipperla
SH Chen
T Hain
T Mantere
W Daelemans
Z Michalewicz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

The final publication is available at Springer via http://dx.doi.org/10.1007/s11042-011-0913-zSpoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. In order to optimize the decision in terms of the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) exhibit good performance in producing discriminative confidence; however they are severely limited by the continuous objective functions, and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction when measuring detection of out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a state-of-the-art phoneme-based STD system on an English meeting domain corpus, which employs a phoneme speech recognition system to produce lattices within which the phoneme sequences corresponding to the enquiry terms are searched. The test corpora comprise 11 hours of speech data recorded with individual head-mounted microphones from 30 meetings carried out at several institutes including ICSI; NIST; ISL; LDC; the Virginia Polytechnic Institute and State University; and the University of Edinburgh. The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and the advantage is found to be more significant on OOV terms than on in-vocabulary (INV) terms. In terms of classification performance, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs respectively; for INV terms, an EER of 15% was obtained with EDA compared to 17% obtained with MLPs and SVMs. In terms of STD performance for OOV terms, EDA presented a significant relative improvement of 1.4% and 2.5% in terms of average term-weighted value (ATWV) over MLPs and SVMs respectively.This work was partially supported by the French Ministry of Industry (Innovative Web call) under contract 09.2.93.0966, ‘Collaborative Annotation for Video Accessibility’ (ACAV) and by ‘The Adaptable Ambient Living Assistant’ (ALIAS) project funded through the joint national Ambient Assisted Living (AAL) programme

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Biblos-e Archivo

Robust speech recognition in multi-source noise environments using convolutive non-negative matrix factorization

Author: Vipperla Ravichander
Publication venue: 'The International Fiscal Association of Korea'
Publication date: 01/09/2011
Field of study

EURECOM Repository

Speech overlap detection and attribution using convolutive non-negative sparse coding

Author: Vipperla Ravichander
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/03/2012
Field of study

EURECOM Repository

Speech overlap detection using convolutive non-negative sparse coding

Author: Vipperla Ravichander
Publication venue: EURECOM
Publication date: 20/06/2011
Field of study

EURECOM Repository