Statistical single channel source separation
PhD Thesis
Single channel source separation (SCSS) is one of the most challenging problems in signal processing and has many significant applications. Unlike conventional SCSS methods, which are based on a linear instantaneous model, this research investigates single channel separation for two types of mixture: the nonlinear instantaneous mixture and the linear convolutive mixture. For nonlinear SCSS of an instantaneous mixture, this research proposes a novel solution based on a two-stage process consisting of a Gaussianization transform, which efficiently compensates for the nonlinear distortion, followed by a maximum likelihood estimator that performs the source separation.
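As an illustration of the first stage, one common way to realize a marginal Gaussianization transform is to pass each sample through the empirical cumulative distribution function of the mixture and then through the inverse standard normal CDF. The sketch below shows this rank-based construction in Python; it is only an assumed, generic realization and not the exact transform or maximum likelihood separation stage developed in the thesis.

    # A minimal, illustrative rank-based Gaussianization: map each sample through the
    # empirical CDF of the observed mixture, then through the inverse standard normal
    # CDF.  This is one common way to realize such a transform and is only a sketch;
    # the thesis's exact transform and ML separation stage may differ.
    import numpy as np
    from scipy.stats import norm

    def gaussianize(x):
        """Map a 1-D signal to approximately standard-normal marginals."""
        n = len(x)
        ranks = np.argsort(np.argsort(x))      # rank of each sample, 0..n-1
        u = (ranks + 0.5) / n                  # empirical CDF values in (0, 1)
        return norm.ppf(u)                     # inverse Gaussian CDF

    # Example: undo a memoryless nonlinear distortion applied to a Gaussian source.
    rng = np.random.default_rng(0)
    s = rng.standard_normal(10_000)            # latent Gaussian source
    x = np.tanh(1.5 * s)                       # nonlinear instantaneous distortion
    y = gaussianize(x)                         # approximately Gaussian again
    print(np.mean(y), np.std(y))               # close to 0 and 1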
For linear SCSS of a convolutive mixture, this research proposes new methods based on nonnegative matrix factorization, which decomposes a mixture into two-dimensional convolution factor matrices representing the spectral basis and the temporal code. The proposed factorization accounts for the convolutive mixing in the decomposition by introducing frequency-constrained parameters into the model. The method aims to separate the mixture into its constituent spectral-temporal source components while alleviating the effect of the convolutive mixing. In addition, a family of Itakura-Saito divergences is developed as the cost function, which brings the beneficial property of scale invariance.
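For reference, the standard Itakura-Saito divergence between nonnegative quantities x and y is d_IS(x | y) = x/y - log(x/y) - 1, which satisfies d_IS(lambda*x | lambda*y) = d_IS(x | y) for any lambda > 0. The short sketch below illustrates this scale invariance numerically; it is a generic illustration rather than the exact divergence family developed in the thesis.

    # The standard Itakura-Saito divergence, included only to illustrate the
    # scale-invariance property mentioned above.
    import numpy as np

    def is_divergence(x, y, eps=1e-12):
        """Sum of element-wise d_IS(x | y) = x/y - log(x/y) - 1."""
        r = (x + eps) / (y + eps)
        return np.sum(r - np.log(r) - 1.0)

    v = np.array([1.0, 4.0, 0.5])
    w = np.array([1.2, 3.0, 0.7])
    print(is_divergence(v, w))
    print(is_divergence(10 * v, 10 * w))    # essentially identical: scale invariant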
Two new statistical techniques are proposed: an Expectation-Maximisation (EM) based algorithmic framework, which maximises the log-likelihood of the mixed signal, and a maximum a posteriori approach, which maximises the joint probability of the mixed signal using multiplicative update rules.
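To make the multiplicative-update idea concrete, the sketch below implements the standard (non-convolutive) multiplicative updates for NMF under the Itakura-Saito divergence; the thesis's two-dimensional convolutive factorization with frequency-constrained parameters and adaptive sparseness is more elaborate and is not reproduced here.

    # A minimal sketch of multiplicative updates for NMF under the Itakura-Saito
    # divergence (plain IS-NMF, V ~= W @ H), shown only to illustrate the idea.
    import numpy as np

    def is_nmf(V, rank, n_iter=200, eps=1e-12):
        F, T = V.shape
        rng = np.random.default_rng(0)
        W = rng.random((F, rank)) + eps
        H = rng.random((rank, T)) + eps
        for _ in range(n_iter):
            WH = W @ H + eps
            W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)
            WH = W @ H + eps
            H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)
        return W, H

    # Example: factor a toy power spectrogram into 2 spectral bases and temporal codes.
    V = np.random.default_rng(1).random((64, 100)) + 1e-3
    W, H = is_nmf(V, rank=2)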
To further improve this work, a novel method that incorporates adaptive sparseness into the solution is proposed to resolve the ambiguity and hence improve the algorithm's performance. The theoretical foundation of the proposed solutions has been rigorously developed and discussed in detail. Results concretely demonstrate the effectiveness of all the algorithms presented in this thesis in separating mixed signals in a single channel, outperforming other available methods.
Universiti Teknikal Malaysia Melaka (UTeM), Ministry of Higher Education of Malaysia
Autoregressive models for text independent speaker identification in noisy environments
The closed-set speaker identification problem is defined as the search within a set of persons for the speaker of a given utterance. It is reported that the Gaussian mixture model (GMM) classifier achieves very high classification accuracies (in the range of 95%-100%) when both the training and testing utterances are recorded in a soundproof studio, i.e., when there is neither additive noise nor spectral distortion of the speech signals.
However, in real-life applications, speech is usually corrupted by noise and band limitation. Moreover, there is a mismatch between the recording conditions of the training and testing environments. As a result, the classification accuracy of GMM-based systems deteriorates significantly. In this thesis, we propose a two-step procedure for improving speaker identification performance in noisy environments. In the first step, we introduce a new classifier: the vector autoregressive Gaussian mixture (VARGM) model. Unlike the GMM, the new classifier models the correlations between successive feature vectors. We also integrate the proposed method into the framework of the universal background model (UBM). In addition, we develop the learning procedure according to the maximum likelihood (ML) criterion. Based on a thorough experimental evaluation, the proposed method achieves an improvement of 3 to 5% in identification accuracy.
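One plausible form of such a vector autoregressive Gaussian mixture density, in which each mixture component predicts the current feature vector from the previous p vectors through component-specific regression matrices, is sketched below for illustration; the exact parameterization and UBM integration used in the thesis may differ.

    # A sketch of one plausible VARGM likelihood: each component predicts the current
    # feature vector from the previous p vectors.  Illustration only, not the thesis's
    # exact parameterization.
    import numpy as np
    from scipy.stats import multivariate_normal

    def vargm_loglik(X, weights, A, mu, Sigma):
        """X: (T, d) feature sequence; weights: (K,); A: (K, p, d, d); mu: (K, d); Sigma: (K, d, d)."""
        K, p = A.shape[0], A.shape[1]
        total = 0.0
        for t in range(p, len(X)):
            comp = 0.0
            for k in range(K):
                pred = mu[k] + sum(A[k, i] @ X[t - 1 - i] for i in range(p))
                comp += weights[k] * multivariate_normal.pdf(X[t], mean=pred, cov=Sigma[k])
            total += np.log(comp + 1e-300)
        return total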
In the second step, we propose a new compensation technique based on the generalized maximum likelihood (GML) decision rule. In particular, we assume a general form for the distribution of the noise-corrupted utterances, which contains two types of parameters: clean-speech-related parameters and noise-related parameters. While the clean-speech-related parameters are estimated during the training phase, the noise-related parameters are estimated from the corrupted speech in the testing phase. We applied the proposed method to utterances of 50 speakers selected from the TIMIT database, artificially corrupted by convolutive and additive noise, with the signal-to-noise ratio (SNR) varying from 0 to 20 dB. Simulation results reveal that the proposed method achieves good robustness against variations in the SNR. For utterances corrupted by convolutive noise, the improvement in classification accuracy over the standard ML decision rule ranges from 70% at SNR = 0 dB to around 4% at SNR = 10 dB. For utterances corrupted by additive noise, the improvement in classification accuracy ranges from 1% to 10% for SNRs from 0 to 20 dB.
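Schematically, a generalized maximum likelihood decision of this kind re-estimates the noise-related parameters from the test utterance for each enrolled speaker and then selects the speaker with the largest resulting likelihood, as in the sketch below; fit_noise_params and loglik are hypothetical placeholders, since the abstract does not specify the noise model.

    # Schematic GML decision rule: noise parameters are re-estimated per speaker from
    # the corrupted test utterance, clean-speech parameters stay fixed at trained values.
    # fit_noise_params and loglik are hypothetical placeholders.
    def gml_identify(test_features, speaker_models, fit_noise_params, loglik):
        scores = {}
        for spk, clean_params in speaker_models.items():
            noise_params = fit_noise_params(test_features, clean_params)
            scores[spk] = loglik(test_features, clean_params, noise_params)
        return max(scores, key=scores.get)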
The proposed VARGM classifier is also applied to the speech emotion classification problem. In particular, we use the Berlin emotional speech database to validate the classification performance of the proposed VARGM classifier. The proposed technique provides a classification accuracy of 76%, versus 71% for the hidden Markov model, 67% for k-nearest neighbors, and 55% for feed-forward neural networks. The model also gives better discrimination than the HMM between high-arousal emotions (joy, anger, fear), low-arousal emotions (sadness, boredom), and neutral emotions.
Another interesting application of the VARGM model is the blind equalization of multiple-input multiple-output (MIMO) communication channels. Based on VARGM modelling of MIMO channels, we propose a four-step equalization procedure. First, the received data vectors are fitted to a VARGM model using the expectation maximization (EM) algorithm. The constructed VARGM model is then used to filter the received data. A Bayesian decision rule is then applied to identify the transmitted symbols up to permutation and phase ambiguities, which are finally resolved using a small training sequence. Moreover, we propose a fast and easily implementable model order selection technique. The new equalization algorithm is compared to the whitening method and found to provide a lower symbol error probability. The proposed technique is also applied to frequency-flat slow-fading channels and found to provide a more accurate estimate of the channel response than that provided by the blind de-convolution exploiting channel encoding (BDCC) method, and at a higher information rate.
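The abstract does not detail the proposed model order selection technique; as a standard point of reference, the order of an autoregressive model is often chosen by minimizing the Bayesian information criterion (BIC), as sketched below. This is a generic baseline rather than the thesis's method, and fit_model is a hypothetical placeholder.

    # Generic BIC-based model order selection, shown only as a common baseline.
    # fit_model(X, p) is assumed to return (log_likelihood, n_free_parameters).
    import numpy as np

    def select_order_bic(X, fit_model, max_order):
        T = len(X)
        best_p, best_bic = None, np.inf
        for p in range(1, max_order + 1):
            loglik, n_params = fit_model(X, p)
            bic = -2.0 * loglik + n_params * np.log(T)
            if bic < best_bic:
                best_p, best_bic = p, bic
        return best_p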
Automatic speaker recognition: modelling, feature extraction and effects of clinical environment
Speaker recognition is the task of establishing the identity of an individual based on his or her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware. The speaker recognition task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech, and the features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the Expectation Maximization (EM) algorithm to build the speaker models, and the most frequently used features are the Mel Frequency Cepstral Coefficients (MFCC). This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the features' sensitivity to changes due to speaker aging, use of alcohol and drugs, changing health conditions, and mental state. The thesis proposed a new method of deriving the Gaussian mixture model (GMM) parameters, called the EM-ITVQ algorithm. The EM-ITVQ showed a significant improvement in equal error rates and higher convergence rates when compared to the classical GMM based on the expectation maximization (EM) method. It was demonstrated that features based on a nonlinear model of speech production (TEO-based features) provided better performance compared to the conventional MFCC features. For the first time, the effect of clinical depression on speaker verification rates was tested. It was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed; the deterioration was demonstrated using conventional (MFCC) features. The thesis also showed that replacing the MFCC features with features based on the nonlinear model of speech production (TEO-based features) can reduce the detrimental effect of clinical depression on speaker verification rates.
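For context, the conventional MFCC + GMM baseline described above can be sketched as follows, using librosa and scikit-learn purely for illustration; the thesis's EM-ITVQ training procedure and TEO-based features are not reproduced here.

    # A minimal sketch of the conventional MFCC + GMM speaker-modelling baseline
    # that the thesis takes as its starting point.
    import librosa
    from sklearn.mixture import GaussianMixture

    def train_speaker_model(wav_path, n_components=32):
        y, sr = librosa.load(wav_path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        return gmm.fit(mfcc)                                   # EM training

    def score_utterance(wav_path, gmm):
        y, sr = librosa.load(wav_path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
        return gmm.score(mfcc)                                 # average log-likelihood per frame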