425 research outputs found

    Speech recognition in noisy car environment based on OSALPC representation and robust similarity measuring techniques

    Get PDF
    The performance of the existing speech recognition systems degrades rapidly in the presence of background noise. The OSALPC (one-sided autocorrelation linear predictive coding) representation of the speech signal has shown to be attractive for speech recognition because of its simplicity and its high recognition performance with respect to the standard LPC in severe conditions of additive white noise. The aim of this paper is twofold: (1) to show that OSALPC also achieves good performance in a case of real noisy speech (in a car environment), and (2) to explore its combination with several robust similarity measuring techniques, showing that its performance improves by using cepstral liftering, dynamic features and multilabeling.Peer ReviewedPostprint (published version

    AR modeling of the speech autocorrelation to improve noisy speech recognition

    Get PDF
    Speech recognition in noisy environments remains an unsolved problem even in the case of isolated word recognition with small vocabularies. Recently, several techniques have been proposed to alleviate this problem. Concretely, two closely related parameterization techniques based on an AR modelling in the autocorrelation domain called SMC [1] and OSALPC [2] have shown good results using speech contaminated by additive white noise. The aim of this paper is twofold: to compare several techniques based on an AR modelling in the autocorrelation domain, including SMC and OSALPC, and to find the optimum model order and cepstral liftering for noisy conditions.Peer ReviewedPostprint (published version

    Investigation of the impact of high frequency transmitted speech on speaker recognition

    Get PDF
    Thesis (MScEng)--Stellenbosch University, 2002.Some digitised pages may appear illegible due to the condition of the original hard copy.ENGLISH ABSTRACT: Speaker recognition systems have evolved to a point where near perfect performance can be obtained under ideal conditions, even if the system must distinguish between a large number of speakers. Under adverse conditions, such as when high noise levels are present or when the transmission channel deforms the speech, the performance is often less than satisfying. This project investigated the performance of a popular speaker recognition system, that use Gaussian mixture models, on speech transmitted over a high frequency channel. Initial experiments demonstrated very unsatisfactory results for the base line system. We investigated a number of robust techniques. We implemented and applied some of them in an attempt to improve the performance of the speaker recognition systems. The techniques we tested showed only slight improvements. We also investigates the effects of a high frequency channel and single sideband modulation on the speech features of speech processing systems. The effects that can deform the features, and therefore reduce the performance of speech systems, were identified. One of the effects that can greatly affect the performance of a speech processing system is noise. We investigated some speech enhancement techniques and as a result we developed a new statistical based speech enhancement technique that employs hidden Markov models to represent the clean speech process.AFRIKAANSE OPSOMMING: Sprekerherkenning-stelsels het 'n punt bereik waar nabyaan perfekte resultate verwag kan word onder ideale kondisies, selfs al moet die stelsel tussen 'n groot aantal sprekers onderskei. Wanneer nie-ideale kondisies, soos byvoorbeeld hoë ruisvlakke of 'n transmissie kanaal wat die spraak vervorm, teenwoordig is, is die resultate gewoonlik nie bevredigend nie. Die projek ondersoek die werksverrigting van 'n gewilde sprekerherkenning-stelsel, wat gebruik maak van Gaussiese mengselmodelle, op spraak wat oor 'n hoë frekwensie transmissie kanaal gestuur is. Aanvanklike eksperimente wat gebruik maak van 'n basiese stelsel het nie goeie resultate opgelewer nie. Ons het 'n aantal robuuste tegnieke ondersoek en 'n paar van hulle geïmplementeer en getoets in 'n poging om die resultate van die sprekerherkenning-stelsel te verbeter. Die tegnieke wat ons getoets het, het net geringe verbetering getoon. Die studie het ook die effekte wat die hoë-frekwensie kanaal en enkel-syband modulasie op spraak kenmerkvektore, ondersoek. Die effekte wat die spraak kenmerkvektore kan vervorm en dus die werkverrigting van spraak stelsels kan verlaag, is geïdentifiseer. Een van die effekte wat 'n groot invloed op die werkverrigting van spraakstelsels het, is ruis. Ons het spraak verbeterings metodes ondersoek en dit het gelei tot die ontwikkeling van 'n statisties gebaseerde spraak verbeteringstegniek wat gebruik maak van verskuilde Markov modelle om die skoon spraakproses voor te stel

    Wavelet-based techniques for speech recognition

    Get PDF
    In this thesis, new wavelet-based techniques have been developed for the extraction of features from speech signals for the purpose of automatic speech recognition (ASR). One of the advantages of the wavelet transform over the short time Fourier transform (STFT) is its capability to process non-stationary signals. Since speech signals are not strictly stationary the wavelet transform is a better choice for time-frequency transformation of these signals. In addition it has compactly supported basis functions, thereby reducing the amount of computation as opposed to STFT where an overlapping window is needed. [Continues.

    Generalized Hidden Filter Markov Models Applied to Speaker Recognition

    Get PDF
    Classification of time series has wide Air Force, DoD and commercial interest, from automatic target recognition systems on munitions to recognition of speakers in diverse environments. The ability to effectively model the temporal information contained in a sequence is of paramount importance. Toward this goal, this research develops theoretical extensions to a class of stochastic models and demonstrates their effectiveness on the problem of text-independent (language constrained) speaker recognition. Specifically within the hidden Markov model architecture, additional constraints are implemented which better incorporate observation correlations and context, where standard approaches fail. Two methods of modeling correlations are developed, and their mathematical properties of convergence and reestimation are analyzed. These differ in modeling correlation present in the time samples and those present in the processed features, such as Mel frequency cepstral coefficients. The system models speaker dependent phonemes, making use of word dictionary grammars, and recognition is based on normalized log-likelihood Viterbi decoding. Both closed set identification and speaker verification using cohorts are performed on the YOHO database. YOHO is the only large scale, multiple-session, high-quality speech database for speaker authentication and contains over one hundred speakers stating combination locks. Equal error rates of 0.21% for males and 0.31% for females are demonstrated. A critical error analysis using a hypothesis test formulation provides the maximum number of errors observable while still meeting the goal error rates of 1% False Reject and 0.1% False Accept. Our system achieves this goal

    Speech Detection Using Gammatone Features And One-class Support Vector Machine

    Get PDF
    A network gateway is a mechanism which provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packets containing speech data cannot be verified and can provide a means of maliciously transporting code or sensitive data undetected. One solution to this problem is through Voice Activity Detection (VAD). Many VAD’s rely on time-domain features and simple thresholds for efficient speech detection however this doesn’t say much about the signal being passed. More sophisticated methods employ machine learning algorithms, but train on specific noises intended for a target environment. Validating speech under a variety of unknown conditions must be possible; as well as differentiating between speech and nonspeech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean speech model for detection. Through the use of Gammatone filter bank processing, the Cepstrum and several frequency domain features are used to train a One-Class Support Vector Machine which provides a clean-speech model irrespective of environmental noise. A Wiener filter is used to provide improved operation for harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech with approximately 70% accuracy for SNR as low as 5d
    • …
    corecore