1,698 research outputs found

    A Subband-Based SVM Front-End for Robust ASR

    Full text link
    This work proposes a novel support vector machine (SVM) based robust automatic speech recognition (ASR) front-end that operates on an ensemble of the subband components of high-dimensional acoustic waveforms. The key issues of selecting appropriate SVM kernels for classification in frequency subbands and of combining the individual subband classifiers using ensemble methods are addressed. The proposed front-end is compared with state-of-the-art ASR front-ends in terms of robustness to additive noise and linear filtering. Experiments performed on the TIMIT phoneme classification task demonstrate the benefits of the proposed subband-based SVM front-end: it outperforms the standard cepstral front-end in the presence of noise and linear filtering at signal-to-noise ratios (SNR) below 12 dB. Combining the proposed front-end with a conventional front-end such as MFCC yields further improvements over the individual front-ends across the full range of noise levels.
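As a rough illustration of the two ingredients this abstract names, the sketch below splits a waveform into frequency subbands and combines per-subband decisions by majority vote. It is a minimal stand-in under stated assumptions, not the paper's method: the uniform FFT-mask band edges and the plain voting rule are illustrative choices, whereas the actual front-end uses SVM classifiers with kernels selected per subband.

```python
import numpy as np

def subband_split(waveform, n_bands):
    """Split a waveform into time-domain subband components by masking
    disjoint ranges of FFT bins (illustrative decomposition only)."""
    spec = np.fft.rfft(waveform)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spec)
        masked[lo:hi] = spec[lo:hi]  # keep only this band's bins
        bands.append(np.fft.irfft(masked, n=len(waveform)))
    return bands  # the bands sum back to the original waveform

def majority_vote(labels):
    """Combine per-subband classifier decisions by simple voting."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]
```

Because the masks partition the FFT bins, the subband signals sum exactly back to the input, so no information is lost by the split itself.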

    Fast speaker independent large vocabulary continuous speech recognition [online]

    Get PDF

    Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN

    Full text link
    Call centers have huge amounts of audio data that can be used to obtain valuable business insights, and manual transcription of phone calls is a tedious task. An effective automatic speech recognition system can accurately transcribe these calls, making it easy to search call history for specific context and content, and allowing automatic call monitoring and improved QoS through keyword search and sentiment analysis. ASR for call centers requires greater robustness, as telephonic environments are generally noisy. Moreover, many low-resourced languages on the verge of extinction can be preserved with the help of automatic speech recognition technology. Urdu is the 10th most widely spoken language in the world, with 231,295,440 speakers worldwide, yet it remains a resource-constrained language in ASR. Regional call-center conversations operate in the local language, with a mix of English numbers and technical terms, generally causing a "code-switching" problem. Hence, this paper describes an implementation framework for a resource-efficient automatic speech recognition (speech-to-text) system in a noisy call-center environment using a chain hybrid HMM and CNN-TDNN for code-switched Urdu. Using the hybrid HMM-DNN approach allowed us to exploit the advantages of neural networks with less labelled data. Adding a CNN to the TDNN has been shown to work better in noisy environments, as the CNN's additional frequency dimension captures extra information from noisy speech, thus improving accuracy. We collected data from various open sources and labelled some of the unlabelled data after analysing its general context and content, drawing from Urdu as well as commonly used words from other languages, primarily English. We achieved a WER of 5.2% in both noisy and clean environments, for isolated words and numbers as well as continuous spontaneous speech.
    Comment: 32 pages, 19 figures, 2 tables, preprint
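The reported figure of 5.2% is a word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis transcript into the reference, divided by the reference length. A minimal sketch of the standard computation:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein edit distance between the word
    sequences, normalised by the reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution (`b` → `x`) plus one deletion (`d`) over four reference words.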

    Speech and neural network dynamics

    Get PDF

    Fractal based speech recognition and synthesis

    Get PDF
    Transmitting a linguistic message is most often the primary purpose of speech communication, and it is the recognition of this message by machine that would be most useful. This research consists of two major parts. The first part presents a novel and promising approach for estimating the degree of recognition of speech phonemes and makes use of a new set of features based on fractals. The main methods of computing the fractal dimension of speech signals are reviewed, and a new speaker-independent speech recognition system developed at De Montfort University is described in detail. Finally, a least-squares method as well as a novel neural network algorithm is employed to derive the recognition performance on the speech data. The second part of this work studies the synthesis of speech words, which is based mainly on the fractal dimension to create natural-sounding speech. The work shows that by careful use of the fractal dimension, together with the phase of the speech signal to ensure consistent intonation contours, natural-sounding speech synthesis is achievable at the word level. In order to extend the flexibility of this framework, we focused on the filtering and compression of the phase to maintain and produce natural-sounding speech. A ‘naturalness level’ is achieved as a result of the fractal characteristic used in the synthesis process. Finally, a novel fractal-based speech synthesis system developed at De Montfort University is discussed. Throughout our research, simulation experiments were performed on continuous speech data available from the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database, which is designed to provide the speech research community with a standardised corpus for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.
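One widely used method for estimating the fractal dimension of a 1-D signal, of the kind this thesis reviews, is Higuchi's algorithm: reconstruct the signal at several coarseness levels k, measure the resulting curve lengths L(k), and read the dimension off the slope of log L(k) versus log(1/k). The sketch below is illustrative only and may differ from the exact feature extraction used in the thesis.

```python
import numpy as np

def higuchi_fd(x, k_max=8):
    """Higuchi fractal dimension of a 1-D signal: the slope of
    log(mean curve length) against log(1/k) over scales k = 1..k_max."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    ks, Ls = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):  # k decimated sub-series with offsets m
            idx = np.arange(m, N, k)
            if len(idx) < 2:
                continue
            # normalised length of the sub-series at coarseness k
            L = np.sum(np.abs(np.diff(x[idx]))) * (N - 1) / (len(idx) - 1) / k
            lengths.append(L / k)
        ks.append(np.log(1.0 / k))
        Ls.append(np.log(np.mean(lengths)))
    slope, _ = np.polyfit(ks, Ls, 1)
    return slope
```

A smooth ramp gives a dimension near 1, while white noise gives a value near 2, which is what makes the measure useful as a signal feature.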

    Dirichlet process probit misclassification mixture model for misclassified binary data

    Get PDF
    Mislabelling or misclassification in binary data refers to incorrectly labelled responses and could arise due to problems in the labelling process or imperfect evidence for labelling. The latent misclassification process could take a variety of forms depending on how it relates to the true labels as well as the associated covariates of each response. Modelling under misclassification is challenging because of the inherent identifiability issues, and ignoring misclassification could lead to inaccurate inferences. Statistical methods addressing misclassification have appeared in the literature in a variety of contexts, sometimes using different terminology, and often focusing on a particular application. In this thesis, we first cast existing statistical methods under a unified framework and later propose a new flexible Bayesian mixture model for modelling misclassified binary data: the Dirichlet process probit misclassification mixture model. The main idea is to assume a Dirichlet process mixture model over the covariate space and misclassification probabilities. This naturally partitions observations into clusters, where different clusters can possess different misclassification probabilities. The clustering uses both covariates and observed responses; covariates are approximated using a Dirichlet mixture of multivariate Gaussians. The incorporation of cluster-specific misclassification probabilities takes into consideration the misclassification in the observed responses. An efficient Gibbs-like algorithm is available based on the truncated approximation of the Dirichlet process and the stick-breaking construction. This thesis is motivated by the pervasiveness of label noise in a wide variety of applications, coupled with the lack of a unified statistical exposition and comparison of all available methods. The structure of the thesis is as follows.
    Chapter 1 introduces the problem of label misclassification and reviews existing methods for modelling misclassification in binary data. Chapter 2 discusses the basics of Bayesian nonparametrics, the Dirichlet process, Dirichlet process mixture models, and posterior inference procedures for Dirichlet process mixture models, which are essential components of the Dirichlet process probit misclassification mixtures that we propose later. Chapter 3 describes our proposed model for modelling mislabelled binary data. Chapter 4 presents experimental studies on our proposed model using a real dataset. Chapter 5 wraps up the discussion on the topic and includes final remarks, such as possible model extensions.
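The truncated stick-breaking construction mentioned in the abstract draws mixture weights as w_k = v_k ∏_{j<k}(1 − v_j) with v_k ~ Beta(1, α), setting the last stick to absorb the remainder so the weights sum to one. A minimal sketch (the truncation level K and concentration α below are illustrative choices, not values from the thesis):

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking weights for a Dirichlet process:
    v_k ~ Beta(1, alpha), w_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining  # nonnegative weights summing to exactly 1
```

Smaller α concentrates mass on the first few sticks (few clusters); larger α spreads it over many, which is how the model lets the data choose the number of misclassification clusters.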

    Hidden Markov models and neural networks for speech recognition

    Get PDF
    The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and …
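The two assumptions the abstract points to, a first-order state process and state-conditional independence of observations, are exactly what the HMM forward recursion exploits: the sequence likelihood factorises so each step needs only the previous step's state probabilities. A minimal sketch (generic textbook HMM, not the hybrid model developed in this work):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(obs) under an HMM with initial distribution
    pi, transition matrix A[i, j], and emission matrix B[state, symbol].
    alpha[j] = P(o_1..o_t, state_t = j); each update uses only the
    previous alpha -- the first-order Markov assumption in action."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()
```

Summing the result over all possible observation sequences of a fixed length gives exactly 1, a quick sanity check that the recursion computes a proper probability.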

    Adapting Prosody in a Text-to-Speech System

    Get PDF