425 research outputs found
Speech recognition in noisy car environment based on OSALPC representation and robust similarity measuring techniques
The performance of existing speech recognition systems degrades rapidly in the presence of background noise. The OSALPC (one-sided autocorrelation linear predictive coding) representation of the speech signal has been shown to be attractive for speech recognition because of its simplicity and its high recognition performance relative to standard LPC in severe conditions of additive white noise. The aim of this paper is twofold: (1) to show that OSALPC also achieves good performance in a case of real noisy speech (in a car environment), and (2) to explore its combination with several robust similarity measuring techniques, showing that its performance improves by using cepstral liftering, dynamic features and multilabeling. Peer reviewed. Postprint (published version).
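As a rough illustration of the OSALPC idea named in this abstract, the sketch below applies ordinary LPC analysis (the Levinson-Durbin recursion) to the one-sided autocorrelation sequence of a frame instead of to the frame itself. The function names, the biased autocorrelation estimator, and the parameters `order` and `osa_len` are assumptions made for illustration; the paper's exact estimator and settings are not stated in the abstract.

```python
def autocorr(x, max_lag):
    """Biased autocorrelation estimate R[0..max_lag] of a sequence."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for A(z) = 1 + a1 z^-1 + ... + ap z^-p.

    Returns the coefficient list [1, a1, ..., ap] and the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def osalpc(frame, order, osa_len):
    """Sketch of OSALPC: LPC computed on the one-sided autocorrelation.

    Assumption: the causal part R[0..osa_len-1] of the frame autocorrelation
    is treated as a signal and modelled with standard autocorrelation-method LPC.
    """
    osa = autocorr(frame, osa_len - 1)      # one-sided autocorrelation sequence
    r_of_osa = autocorr(osa, order)         # autocorrelation of that sequence
    return levinson_durbin(r_of_osa, order)
```

For a decaying exponential such as `[0.9 ** n for n in range(200)]`, the one-sided autocorrelation is itself close to a decaying exponential, so a first-order call `osalpc(frame, 1, 64)` recovers a coefficient near -0.9.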
AR modeling of the speech autocorrelation to improve noisy speech recognition
Speech recognition in noisy environments remains an unsolved problem even in the case of isolated word recognition with small vocabularies. Recently, several techniques have been proposed to alleviate this problem. In particular, two closely related parameterization techniques based on AR modelling in the autocorrelation domain, called SMC [1] and OSALPC [2], have shown good results on speech contaminated by additive white noise. The aim of this paper is twofold: to compare several techniques based on AR modelling in the autocorrelation domain, including SMC and OSALPC, and to find the optimum model order and cepstral liftering for noisy conditions. Peer reviewed. Postprint (published version).
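To make the two tuning knobs in this abstract concrete (model order and cepstral liftering), the following sketch converts AR coefficients to cepstral coefficients with the standard recursion and then applies a lifter. The raised-sine lifter and the names `lpc_to_cepstrum` and `raised_sine_lifter` are assumptions chosen for illustration; the abstract does not say which lifter shapes the paper compares.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z).

    `a` is [1, a1, ..., ap] with A(z) = 1 + a1 z^-1 + ... + ap z^-p,
    so the model order is len(a) - 1.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        a_n = a[n] if n <= p else 0.0
        acc = sum(k * c[k] * a[n - k] for k in range(1, n) if n - k <= p)
        c[n] = -a_n - acc / n
    return c[1:]


def raised_sine_lifter(ceps, lifter_len):
    """Weight c[n] by 1 + (L/2) sin(pi n / L); list index 0 holds n = 1."""
    return [c * (1.0 + 0.5 * lifter_len * math.sin(math.pi * n / lifter_len))
            for n, c in enumerate(ceps, start=1)]
```

For a first-order model with `a = [1.0, -0.5]`, the recursion gives `c[n] = 0.5 ** n / n`, which is a convenient sanity check before varying the order or the lifter length.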
Investigation of the impact of high frequency transmitted speech on speaker recognition
Thesis (MScEng)--Stellenbosch University, 2002.
Some digitised pages may appear illegible due to the condition of the original hard copy.
ENGLISH ABSTRACT: Speaker recognition systems have evolved to a point where near-perfect performance can be
obtained under ideal conditions, even if the system must distinguish between a large number
of speakers. Under adverse conditions, such as when high noise levels are present or when the
transmission channel deforms the speech, the performance is often less than satisfying.
This project investigated the performance of a popular speaker recognition system that uses
Gaussian mixture models, on speech transmitted over a high frequency channel. Initial experiments
demonstrated very unsatisfactory results for the baseline system.
We investigated a number of robust techniques. We implemented and applied some of them in
an attempt to improve the performance of the speaker recognition systems. The techniques we
tested showed only slight improvements.
We also investigated the effects of a high frequency channel and single sideband modulation on
the speech features of speech processing systems. The effects that can deform the features, and
therefore reduce the performance of speech systems, were identified.
One of the effects that can greatly affect the performance of a speech processing system is
noise. We investigated some speech enhancement techniques and as a result we developed a
new statistical based speech enhancement technique that employs hidden Markov models to
represent the clean speech process.
AFRIKAANS SUMMARY (English translation): Speaker recognition systems have reached a point where near-perfect results can be expected
under ideal conditions, even if the system must distinguish between a large number of speakers. When
non-ideal conditions are present, such as high noise levels or a transmission channel that distorts the
speech, the results are usually unsatisfactory.
The project investigates the performance of a popular speaker recognition system, which uses
Gaussian mixture models, on speech sent over a high frequency transmission
channel. Initial experiments using a baseline system did not yield
good results.
We investigated a number of robust techniques and implemented and tested a few of them
in an attempt to improve the results of the speaker recognition system. The techniques
we tested showed only slight improvement.
The study also investigated the effects of the high frequency channel and single sideband modulation on
speech feature vectors. The effects that can distort the speech feature vectors, and
thus lower the performance of speech systems, were identified.
One of the effects that has a large influence on the performance of speech systems is noise.
We investigated speech enhancement methods, and this led to the development of a
statistically based speech enhancement technique that uses hidden Markov models
to represent the clean speech process.
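The thesis abstract above names Gaussian mixture models as the basis of its speaker recognition system. A minimal sketch of how such a system scores an utterance is given here, assuming diagonal covariances and a hypothetical model layout of `(weight, mean, variance)` tuples; the thesis's actual features, model sizes and training procedure are not described in the abstract.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under one speaker's GMM.

    `gmm` is a list of (weight, mean, variance) components (assumed layout).
    """
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)                   # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)


def identify(frames, models):
    """Closed-set identification: pick the speaker whose GMM scores highest."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, models[name]))
```

Channel distortion and noise shift the feature vectors away from the regions where the trained components sit, which lowers every model's likelihood and blurs the differences between speakers.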
Wavelet-based techniques for speech recognition
In this thesis, new wavelet-based techniques have been developed for the
extraction of features from speech signals for the purpose of automatic speech
recognition (ASR). One of the advantages of the wavelet transform over the short-time
Fourier transform (STFT) is its capability to process non-stationary signals.
Since speech signals are not strictly stationary, the wavelet transform is a better
choice for time-frequency transformation of these signals. In addition, it has
compactly supported basis functions, thereby reducing the amount of
computation compared with the STFT, where an overlapping window is needed. [Continues.]
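To illustrate what compactly supported basis functions mean in practice, here is a small sketch of a multi-level discrete wavelet transform using the Haar wavelet, followed by log sub-band energies as a simple feature vector. The Haar wavelet and the energy features are assumptions chosen because they are the simplest case; they are not claimed to be the techniques developed in the thesis.

```python
import math


def haar_step(signal):
    """One Haar analysis level: each output depends on only two input samples."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level Haar DWT; len(signal) should be divisible by 2 ** levels."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def subband_log_energies(signal, levels):
    """Log energy per sub-band, ordered from finest detail to final approximation."""
    approx, details = haar_dwt(signal, levels)
    return [math.log(sum(c * c for c in band) + 1e-12)
            for band in details + [approx]]
```

Because each level halves the number of samples and needs no overlapping windows, the total work is linear in the frame length.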
Generalized Hidden Filter Markov Models Applied to Speaker Recognition
Classification of time series has wide Air Force, DoD and commercial interest, from automatic target recognition systems on munitions to recognition of speakers in diverse environments. The ability to effectively model the temporal information contained in a sequence is of paramount importance. Toward this goal, this research develops theoretical extensions to a class of stochastic models and demonstrates their effectiveness on the problem of text-independent (language constrained) speaker recognition. Specifically within the hidden Markov model architecture, additional constraints are implemented which better incorporate observation correlations and context, where standard approaches fail. Two methods of modeling correlations are developed, and their mathematical properties of convergence and reestimation are analyzed. These differ in modeling correlation present in the time samples and those present in the processed features, such as Mel frequency cepstral coefficients. The system models speaker dependent phonemes, making use of word dictionary grammars, and recognition is based on normalized log-likelihood Viterbi decoding. Both closed set identification and speaker verification using cohorts are performed on the YOHO database. YOHO is the only large scale, multiple-session, high-quality speech database for speaker authentication and contains over one hundred speakers stating combination locks. Equal error rates of 0.21% for males and 0.31% for females are demonstrated. A critical error analysis using a hypothesis test formulation provides the maximum number of errors observable while still meeting the goal error rates of 1% False Reject and 0.1% False Accept. Our system achieves this goal
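The equal error rate quoted in this abstract is the operating point at which the false accept and false reject rates coincide. A minimal sketch of how it could be estimated from verification scores follows; the function names and the convention that higher scores mean "accept" are assumptions, and the example scores in the usage note are made up for illustration, not taken from YOHO.

```python
def error_rates(genuine, impostor, threshold):
    """False accept rate and false reject rate at a given threshold."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Sweep observed scores as thresholds; return (EER estimate, threshold).

    The estimate is the mean of FAR and FRR at the threshold where they
    are closest, which is exact only when the two rates actually meet.
    """
    best = None
    for th in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, th)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, th)
    return best[1], best[2]
```

With genuine scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.6, 0.3, 0.2, 0.1]`, both rates equal 0.25 at a threshold of 0.6. With only a few hundred trials per speaker, very low error rates rest on a handful of observed errors, which is why the abstract's hypothesis-test analysis of the maximum allowable error count matters.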
Speech Detection Using Gammatone Features And One-class Support Vector Machine
A network gateway is a mechanism which provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packets containing speech data cannot be verified and can provide a means of maliciously transporting code or sensitive data undetected. One solution to this problem is Voice Activity Detection (VAD). Many VADs rely on time-domain features and simple thresholds for efficient speech detection; however, this says little about the signal being passed. More sophisticated methods employ machine learning algorithms, but train on specific noises intended for a target environment. It must be possible to validate speech under a variety of unknown conditions, as well as to differentiate between speech and nonspeech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean speech model for detection. Through the use of Gammatone filter bank processing, the cepstrum and several frequency domain features are used to train a One-Class Support Vector Machine, which provides a clean-speech model irrespective of environmental noise. A Wiener filter is used to provide improved operation in harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech, with approximately 70% accuracy for SNR as low as 5d
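For contrast with the proposed method, the kind of simple time-domain, threshold-based VAD that this abstract describes as common can be sketched as follows. The frame length, the energy and zero-crossing thresholds, and the function names are all assumptions for illustration; this is the baseline style of detector, not the Gammatone and One-Class SVM method the abstract proposes.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of a frame in dB (small floor avoids log of zero)."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def simple_vad(signal, frame_len=160, energy_db=-30.0, max_zcr=0.5):
    """Mark a frame as speech if it is loud enough and not noise-like.

    Thresholds are illustrative guesses and would need tuning per environment.
    """
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        decisions.append(frame_energy_db(frame) > energy_db
                         and zero_crossing_rate(frame) < max_zcr)
    return decisions
```

A detector like this accepts any sufficiently loud, low-frequency payload, which is the weakness the abstract points to: it says little about whether the frame actually contains speech.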