
    The subspace Gaussian mixture model—A structured model for speech recognition

    We describe a new approach to speech recognition in which all Hidden Markov Model (HMM) states share the same Gaussian Mixture Model (GMM) structure, with the same number of Gaussians in each state. The model is defined by a vector associated with each state, of dimension around 50, together with a global mapping from this vector space to the space of GMM parameters. This model appears to give better results than a conventional model, and the extra structure offers many new opportunities for modeling innovations while maintaining compatibility with most standard techniques.
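    For concreteness, a sketch of the standard SGMM parameterisation this abstract describes (following Povey et al.'s formulation; the notation is assumed here, not quoted from the paper): each state j carries only a low-dimensional vector v_j, and every Gaussian parameter is derived from it through globally shared quantities.

        % Gaussian i of state j, derived from the state vector v_j:
        \begin{align}
          \boldsymbol{\mu}_{ji} &= \mathbf{M}_i \mathbf{v}_j
            && \text{means via the global mapping } \mathbf{M}_i \\
          w_{ji} &= \frac{\exp(\mathbf{w}_i^{\top} \mathbf{v}_j)}
                         {\sum_{i'=1}^{I} \exp(\mathbf{w}_{i'}^{\top} \mathbf{v}_j)}
            && \text{log-linear mixture weights} \\
          \boldsymbol{\Sigma}_{ji} &= \boldsymbol{\Sigma}_i
            && \text{covariances shared across all states}
        \end{align}

    Only the vectors v_j are state-specific; the M_i, w_i and Sigma_i together form the global mapping from the vector space to the space of GMM parameters.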

    Towards natural speech acquisition: incremental word learning with limited data

    Ayllon Clemente I. Towards natural speech acquisition: incremental word learning with limited data. Bielefeld: Bielefeld University; 2013.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in applications that operate in real-world environments, such as mobile communication services and smart homes.

    Subspace Gaussian mixture models for automatic speech recognition

    In most state-of-the-art speech recognition systems, Gaussian mixture models (GMMs) are used to model the density of the emitting states in the hidden Markov models (HMMs). In a conventional system, the model parameters of each GMM are estimated directly and independently given the alignment. This results in a large number of model parameters to be estimated and, consequently, a large amount of training data is required to fit the model. In addition, the different sources of acoustic variability that affect the accuracy of a recogniser, such as pronunciation variation, accent, speaker characteristics and environmental noise, are only weakly modelled and factorized by adaptation techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori (MAP) adaptation and vocal tract length normalisation (VTLN). In this thesis, we discuss an alternative acoustic modelling approach, the subspace Gaussian mixture model (SGMM), which is expected to deal with these two issues better. In an SGMM, the model parameters are derived from low-dimensional model and speaker subspaces that can capture phonetic and speaker correlations. Given these subspaces, only a small number of state-dependent parameters are required to derive the corresponding GMMs. Hence, the total number of model parameters can be reduced, which allows acoustic modelling with a limited amount of training data. In addition, the SGMM-based acoustic model factorizes the phonetic and speaker factors, and within this framework other sources of acoustic variability may also be explored. In this thesis, we propose a regularised model estimation for SGMMs, which avoids overtraining when the training data is sparse. We also take advantage of the structure of SGMMs to explore cross-lingual acoustic modelling for low-resource speech recognition. Here, the model subspace is estimated from out-of-domain data and ported to the target-language system; only the state-dependent parameters need to be estimated, which relaxes the requirement on the amount of training data. To improve the robustness of SGMMs against environmental noise, we propose to apply the joint uncertainty decoding (JUD) technique, which is shown to be efficient and effective. We report experimental results on the Wall Street Journal (WSJ) database and the GlobalPhone corpus to evaluate the regularisation and cross-lingual modelling of SGMMs. Noise compensation using JUD for SGMM acoustic models is evaluated on the Aurora 4 database.
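    To make the factorisation concrete, a minimal sketch of the speaker-subspace extension described above (standard SGMM notation, assumed rather than quoted from the thesis): the mean of Gaussian i in state j for speaker s adds a speaker-dependent offset,

        % phonetic subspace M_i and speaker subspace N_i factorise the
        % two sources of variability the abstract describes:
        \begin{equation}
          \boldsymbol{\mu}_{ji}^{(s)} = \mathbf{M}_i \mathbf{v}_j
                                      + \mathbf{N}_i \mathbf{v}^{(s)}
        \end{equation}

    Under this parameterisation, cross-lingual porting amounts to keeping the globally shared subspaces (the M_i, N_i and shared covariances) from the out-of-domain system and re-estimating only the state vectors v_j on the target language.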

    Robust Phase-based Speech Signal Processing From Source-Filter Separation to Model-Based Robust ASR

    Fourier analysis plays a key role in speech signal processing. The Fourier transform is complex-valued and can be expressed in polar form through the magnitude and phase spectra. The magnitude spectrum is widely used in almost every corner of speech processing. The phase spectrum, however, is not an obviously appealing starting point for processing the speech signal. In contrast to the magnitude spectrum, whose fine and coarse structures have a clear relation to speech perception, the phase spectrum is difficult to interpret and manipulate; it exhibits no meaningful trends or extrema that might facilitate modelling. Nonetheless, the speech phase spectrum has recently gained renewed attention, and an expanding body of work shows that it can be usefully employed in a multitude of speech processing applications. Now that the potential of phase-based speech processing has been established, there is a need for a fundamental model to help understand the way in which phase encodes speech information. In this thesis a novel phase-domain source-filter model is proposed that allows for deconvolution of the speech vocal tract (filter) and excitation (source) components through phase processing. This model utilises the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain, and provides a framework for efficiently segregating the source and filter components through phase manipulation. To investigate the efficacy of the suggested approach, a set of features is extracted from the filter part of the phase for automatic speech recognition (ASR), and the source part of the phase is used for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed. In addition, the proposed approach is improved by replacing the log with the generalised logarithmic function in the Hilbert transform and by computing the group delay via a regression filter. Furthermore, the statistical distribution of the phase spectrum and of its representations along the feature extraction pipeline is studied; the phase spectrum is shown to have a bell-shaped distribution. Statistical normalisation methods such as mean-variance normalisation, Laplacianisation, Gaussianisation and histogram equalisation are successfully applied to the phase-based features and lead to a significant robustness improvement. The robustness gain achieved through statistical normalisation and the generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as Vector Taylor Series (VTS). VTS, in its original formulation, assumes the log function is used for compression. To take advantage of both VTS and the generalised logarithmic function simultaneously, a new formulation is first developed that merges the two into a unified framework called generalised VTS (gVTS). To leverage the gVTS framework, a novel channel noise estimation method is also developed. Extensions of the gVTS framework and the proposed channel estimation to the group delay domain are then explored: the problems they present are analysed and discussed, solutions are proposed, and the corresponding formulae are derived. Moreover, the effects of additive noise and channel distortion in the phase and group delay domains are scrutinised, and the results are used in deriving the gVTS equations.
    Experimental results on the Aurora-4 ASR task, in an HMM/GMM set-up as well as a DNN-based bottleneck system, under both clean and multi-style training, confirm the efficacy of the proposed approach in dealing with both additive and channel noise.
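    A minimal sketch of the machinery the abstract builds on (standard signal-processing identities, not the thesis's exact derivation): convolution of source and filter in time becomes addition of their unwrapped phases in frequency, and for a minimum-phase component the phase is recoverable from the log-magnitude via the Hilbert transform,

        \begin{align}
          % convolution in time = phase addition in frequency:
          x[n] = v[n] * e[n] &\;\Rightarrow\;
            \arg X(\omega) = \arg V(\omega) + \arg E(\omega) \\
          % Hilbert-transform relation for a minimum-phase vocal tract,
          % which is what makes phase-based deconvolution possible:
          \arg V_{\min}(\omega) &= -\mathcal{H}\{\log |V(\omega)|\} \\
          % group delay, the representation the gVTS framework is
          % extended to:
          \tau(\omega) &= -\frac{d}{d\omega}\, \arg X(\omega)
        \end{align}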

    Variance Flooring, Scaling and Tying for Text-Dependent Speaker Verification

    The problem of how to estimate variance parameters in client models from scarce data is addressed in the context of text-dependent, HMM-based, automatic speaker verification. Variance flooring and variance scaling are investigated as two alternative estimation techniques and are used with or without variance tying at the state level to reduce the number of parameters to estimate. The best results are achieved with no tying and a variance flooring method in which the floor for a variance vector in a client model is proportional to the corresponding variance vector in a gender-dependent, multi-speaker, non-client model. Further, variance tying reduces storage requirements considerably without much loss in recognition accuracy. It is also confirmed, as in a previous study, that re-using non-client variances gives comparable performance to variance flooring and is much simpler. Comparisons are made on three large telephone-quality speech corpora.
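    A minimal sketch of the best-performing scheme above, assuming diagonal-covariance models; the function name and the flooring proportion alpha are illustrative, not taken from the paper:

        import numpy as np

        def floor_client_variances(client_var, background_var, alpha=0.5):
            """Floor each client-model variance at a value proportional to the
            corresponding variance in a gender-dependent, multi-speaker,
            non-client (background) model.

            client_var, background_var -- arrays of shape (n_states, n_dims)
            alpha -- flooring proportion (illustrative value, not from the paper)
            """
            floor = alpha * background_var        # per-dimension proportional floor
            return np.maximum(client_var, floor)  # elementwise flooring

    Re-using non-client variances, the simpler alternative the paper finds comparable, would amount to returning background_var directly instead of flooring.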

    Proceedings of the 7th Sound and Music Computing Conference

    Proceedings of SMC2010, the 7th Sound and Music Computing Conference, July 21-24, 2010.

    Wireless tools for neuromodulation

    Epilepsy is a spectrum of diseases characterized by recurrent seizures. An estimated 50 million individuals worldwide are affected, and 30% of cases are medically refractory (drug-resistant). Vagus nerve stimulation (VNS) and deep brain stimulation (DBS) are the only FDA-approved device-based therapies, and neither offers complete seizure freedom for the majority of users. Novel methodologies are needed to better understand the mechanisms and chronic nature of epilepsy. Most tools for neuromodulation in rodents are tethered, and the few wireless devices use batteries or are inductively powered. A tether restricts movement, limits behavioral tests, and increases the risk of infection. Batteries are large and heavy and have a limited lifetime. Inductive powering suffers from rapid efficiency drops due to alignment mismatches and increased distances. Miniature wireless tools that offer behavioral freedom, data acquisition, and stimulation are needed. This dissertation presents a platform of electrical, optical, and radiofrequency (RF) technologies for device-based neuromodulation. The platform can be configured with features including two-channel differential recording, one-channel electrical stimulation, and one-channel optical stimulation. Typical device operation consumes less than 4 mW. The analog front end has a bandwidth of 0.7 Hz to 1 kHz and a gain of 60 dB, and the constant-current driver provides biphasic electrical stimulation. For use with optogenetics, the deep brain optical stimulation module provides 27 mW/mm² of blue light (473 nm) at a drive current of 21.01 mA. Pairing the stimulating and recording technologies allows closed-loop operation. A wireless powering cage is designed using the resonantly coupled filter energy transfer (RCFET) methodology, in which RF energy is coupled through magnetic resonance. The cage has a power transfer efficiency (PTE) ranging from 1.8% to 6.28% over an 11 × 11 × 11 in³ volume, which is sufficient to chronically house subjects. The technologies are validated through various in vivo preparations. The tools are designed to study epilepsy, SUDEP, and urinary incontinence, but can be configured for other studies. The broad application of these technologies can enable the scientific community to better study chronic diseases and closed-loop therapies.
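    As a point of reference for the RCFET powering scheme (the standard resonance condition for magnetically coupled coils, not the dissertation's specific circuit): the transmit and receive LC tanks are tuned to a common resonant frequency so that efficiency holds up at larger coil separations,

        % both tanks tuned to the same resonant frequency f_0:
        \begin{equation}
          f_0 = \frac{1}{2\pi\sqrt{L_{\mathrm{tx}} C_{\mathrm{tx}}}}
              = \frac{1}{2\pi\sqrt{L_{\mathrm{rx}} C_{\mathrm{rx}}}}
        \end{equation}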