
    Phoneme duration modelling for speaker verification

    Higher-level features are considered a potential remedy against transmission-line and cross-channel degradations, currently some of the biggest problems in speaker verification. Phoneme durations in particular are not altered by these factors; a robust duration model would therefore be a particularly useful addition to traditional cepstral-based speaker verification systems. In this dissertation we investigate the feasibility of phoneme durations as a feature for speaker verification. Simple speaker-specific triphone duration models are created to statistically represent the phoneme durations. Durations are obtained from a hidden Markov model (HMM) based automatic speech recognition system and are modelled using single-mixture Gaussian distributions. These models are applied in a speaker verification system (trained and tested on the YOHO corpus) and found to be a useful feature, even when used in isolation. When fused with acoustic features, verification performance increases significantly. A novel speech rate normalization technique is developed to remove some of the inherent intra-speaker variability (due to differing speech rates), which has a negative impact on both speaker verification and automatic speech recognition. Although the duration modelling itself benefits only slightly from this procedure, the improvement in fused system performance is substantial. Other factors known to influence the duration of phonemes are also incorporated into the duration model. Utterance-final lengthening is known to be a consistent effect, so “position in sentence” is modelled. “Position in word” is modelled as well, since triphones do not provide enough contextual information; this improves performance because some vowels’ durations are particularly sensitive to their position in the word. Data scarcity becomes a problem when building speaker-specific duration models.
    By using information from the available data, unknown durations can be predicted in an attempt to overcome this data scarcity problem. To this end we develop a novel approach, based on the maximum likelihood criterion, that predicts unknown phoneme durations from the values of known phoneme durations for a particular speaker. The approach rests on the observation that phonemes from the same broad phonetic class tend to co-vary strongly, but that there are also significant cross-class correlations. It is tested on the TIMIT corpus and found to be more accurate than back-off techniques.
    Dissertation (MEng), University of Pretoria, 2009. Electrical, Electronic and Computer Engineering.
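The maximum-likelihood prediction described above can be thought of as a conditional-Gaussian estimate: if a speaker's phoneme durations are modelled jointly as a Gaussian whose covariance captures the within-class and cross-class correlations, the ML estimate of the unseen durations given the observed ones is the conditional mean. A minimal sketch of that idea (the function name and the toy numbers are illustrative assumptions, not taken from the dissertation):

```python
import numpy as np

def predict_missing_durations(mu, cov, known_idx, known_vals):
    """ML (conditional-mean) prediction of unseen phoneme durations under a
    joint Gaussian model: mu_u + C_uk C_kk^{-1} (x_k - mu_k)."""
    all_idx = np.arange(len(mu))
    unknown_idx = np.setdiff1d(all_idx, known_idx)
    C_kk = cov[np.ix_(known_idx, known_idx)]     # covariance among observed phonemes
    C_uk = cov[np.ix_(unknown_idx, known_idx)]   # cross-covariance unseen/observed
    delta = known_vals - mu[known_idx]           # deviation of observations from the mean
    pred = mu[unknown_idx] + C_uk @ np.linalg.solve(C_kk, delta)
    return unknown_idx, pred

# Toy example: two strongly co-varying phoneme durations (ms).
mu = np.array([100.0, 120.0])
cov = np.array([[25.0, 20.0],
                [20.0, 25.0]])
idx, pred = predict_missing_durations(mu, cov, np.array([0]), np.array([105.0]))
# A longer-than-average observed phoneme pulls the prediction above its mean.
```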

    Development of an Isolated Digit Speech Recognition Based on Multilayer Perceptron Model

    The automatic speech recognition (ASR) field has become one of the leading speech technology areas. Research in ASR has always emphasized man-machine communication, promising greater ease of use than the traditional keyboard and mouse. The speech recognition task is simple for a human but a very complex process for a machine. Various methods have been introduced to develop efficient ASR systems; the neural network (NN) approach is one of the best known and most widely used, and the multilayer perceptron (MLP) is a popular NN model in this field. In this study, an MLP with the back-propagation learning algorithm is implemented to perform isolated digit speech recognition for the Malay language. However, one of the current problems faced by the MLP and most NN models in ASR is the long learning time; producing a high recognition rate with an MLP is also not trivial. This study therefore focuses on improving the learning time and recognition rate of the MLP for a Malay isolated digit speech recognition system, and proposes three new methods, with improvements in both the preprocessing and recognition phases. In the preprocessing phase, a new endpoint detection method, the variance method, is proposed to overcome the disadvantages of the conventional method, which is unstable and whose threshold is difficult to set during silence detection, leading to poor recognition rates. A further contribution in the preprocessing phase concerns normalization: three new methods (exponent, hybrid I and hybrid II) are introduced to normalize the speech data before it is propagated to the NN.
    These are compared with four widely used conventional normalization methods: range I, range II, simple and variance. The conventional methods have two limitations: some (such as the variance and range I methods) are very slow in the learning phase but produce good recognition rates, while others (such as the simple and range II methods) are very fast in the learning phase but produce low recognition rates. The new normalization methods are therefore proposed to accelerate learning while producing high recognition rates. In the recognition phase, a simple novel approach, an adaptive sigmoid function, is introduced to increase the recognition rate: a fixed sigmoid function is used in the learning phase, while in the recognition phase the slope of the activation function is adjusted to obtain the highest recognition rate. The study covers the 10 Malay words “sifar” to “sembilan” (“0” to “9”). All utterances were recorded from a single male speaker and each utterance was repeated 100 times, so the data set consists of 1000 utterances; 400 were used in the learning phase and the remaining 600 in the recognition phase. The standard TI46 data set was used to evaluate all the proposed methods, using the 10 English words “zero” to “nine” (“0” to “9”). Eight male and female speakers uttered each word 8 times, giving a total data set of 1600 utterances across both speaker groups. The male and female data sets were trained separately: 400 male utterances were used in the learning phase and 400 kept as test data, and the same split was applied to the female data sets.
    Linear Predictive Coding (LPC) is implemented as the feature extraction method to represent the speech data. The experimental results show that the proposed endpoint detection (variance method) produces promising results in terms of learning time and recognition rate, the proposed normalization methods show excellent results across all experiments, and the adaptive sigmoid function successfully increases the recognition rate in most of the experiments. Overall, the highest recognition rate for the Malay data set is 99.83% with an 82 s convergence time; for the TI46 female and male data sets, the convergence times are 55 s and 111 s with recognition rates of 96.75% and 94.75%, respectively.
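Two of the ideas above, variance-based endpoint detection and an activation whose slope can be varied at recognition time, can be sketched in a few lines. The frame length, noise-floor factor and function names below are illustrative assumptions, not the study's actual parameters:

```python
import numpy as np

def sigmoid(x, slope=1.0):
    """Logistic activation with an adjustable slope: slope=1 is the fixed
    sigmoid used in training; varying it mimics the adaptive variant used
    at recognition time."""
    return 1.0 / (1.0 + np.exp(-slope * x))

def endpoint_detect_variance(signal, frame_len=256, k=2.0):
    """Variance-based endpoint detection sketch: frames whose sample
    variance exceeds k times the noise-floor variance (estimated from the
    leading, assumed-silent frames) are treated as speech. Returns the
    (start, end) sample indices of the detected speech region, or None."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    var = frames.var(axis=1)
    noise_var = var[:3].mean()          # noise floor from the first frames
    speech = np.where(var > k * noise_var)[0]
    if speech.size == 0:
        return None
    return speech[0] * frame_len, (speech[-1] + 1) * frame_len

# Toy signal: low-level noise, then a tone, then noise again.
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(1024),
                      np.sin(np.linspace(0, 40 * np.pi, 1024)),
                      0.01 * rng.standard_normal(1024)])
bounds = endpoint_detect_variance(sig)  # isolates the middle (tone) segment
```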

    Robust Speech Detection for Noisy Environments

    This paper presents a robust voice activity detector (VAD) based on hidden Markov models (HMMs) to improve speech recognition systems in stationary and non-stationary noise environments: inside motor vehicles (such as cars or planes) or inside buildings close to high-traffic areas (such as a control tower for air traffic control (ATC)). In these environments there is a high stationary noise level caused by vehicle motors and, additionally, people may be speaking at some distance from the main speaker, producing non-stationary noise. The VAD presented in this paper is characterized by a new front-end and a noise level adaptation process that significantly increases its robustness across different signal-to-noise ratios (SNRs). The feature vector used by the VAD includes the most relevant Mel Frequency Cepstral Coefficients (MFCCs), normalized log energy and delta log energy. The proposed VAD has been evaluated and compared to other well-known VADs using three databases containing different noise conditions: speech in clean environments (SNRs greater than 20 dB), speech recorded in stationary noise environments (inside or close to motor vehicles), and speech in non-stationary environments (including noise from bars, television and far-field speakers). In all three cases, the detection error obtained with the proposed VAD is the lowest for all SNRs, compared to Acero's VAD (the reference for this work) and other well-known VADs such as AMR, AURORA and G.729 Annex B.
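The noise level adaptation idea can be illustrated with a much simpler sketch than the paper's HMM-based detector: track a running noise floor on the log-energy feature and flag frames that rise above it by a margin, updating the floor only on non-speech frames. The margin and smoothing constant below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def vad_adaptive(log_energy, margin=3.0, alpha=0.98):
    """Frame-level VAD sketch with noise-level adaptation: a frame is speech
    when its log energy exceeds the adapted noise floor by `margin` dB; the
    floor is updated only on non-speech frames, so it tracks slowly varying
    background noise without absorbing the speech itself."""
    noise = log_energy[0]               # initialize floor from the first frame
    decisions = []
    for e in log_energy:
        is_speech = e > noise + margin
        if not is_speech:
            noise = alpha * noise + (1 - alpha) * e  # slow noise-floor update
        decisions.append(is_speech)
    return np.array(decisions)

# Toy trace: quiet frames, a burst of speech, quiet again (log energies in dB).
trace = np.array([10.0] * 5 + [20.0] * 5 + [10.0] * 5)
flags = vad_adaptive(trace)             # True only on the middle, louder frames
```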