144 research outputs found

    Arabic Isolated Word Speaker Dependent Recognition System

    In this thesis we design a new speaker-dependent recognition system for isolated Arabic words, based on a combination of several feature extraction and classification techniques whose outputs are combined using a voting rule. The system is implemented in Matlab with a graphical user interface and runs on a G62 laptop with a 2.26 GHz Core i3 processor. The dataset consists of 40 Arabic words recorded in a quiet environment by 5 different speakers using the laptop microphone. Each speaker read each word 8 times; 5 repetitions are used for training and the remaining 3 for testing. In the preprocessing step, we first use an endpoint detection technique based on energy and zero-crossing rate to identify the start and end of each word and remove silences, and then apply a discrete wavelet transform to remove noise from the signal. To reduce execution time, the system first recognizes the speaker and loads only that user's reference model. We compared 5 methods: pairwise Euclidean distance with Mel-frequency cepstral coefficients (MFCC), dynamic time warping (DTW) with formant features, Gaussian mixture model (GMM) with MFCC, MFCC with DTW, and Itakura distance with linear predictive coding (LPC) features, which achieved recognition rates of 85.23%, 57%, 87%, 90% and 83% respectively. To improve accuracy, we tested several combinations of these 5 methods. The best combination is MFCC | Euclidean + Formant | DTW + MFCC | DTW + LPC | Itakura, with an accuracy of 94.39% but a long computation time of 2.9 seconds. To reduce the computation time of this hybrid, we compared several of its sub-combinations and found that the best trade-off between accuracy and computation time is obtained by first combining MFCC | Euclidean + LPC | Itakura and adding the Formant | DTW + MFCC | DTW methods only when the first two disagree. This reduces the average computation time by about half, to 1.56 seconds, and raises the accuracy to 94.56%. Overall, the proposed system is competitive with previously published work.
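
The two-stage combination described in this abstract can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation (which is in Matlab): the function names `majority_vote` and `cascade_recognize` are hypothetical, each recognizer is assumed to be a callable that returns a word label, and the voting rule is assumed to be a plurality vote with ties broken in favour of the earliest vote, since the abstract does not specify the rule.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

# Hypothetical type: a recognizer maps an input sample to a word label.
Recognizer = Callable[[object], Hashable]


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the earliest vote (an assumption)."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label
    raise ValueError("no labels to vote on")


def cascade_recognize(
    sample: object,
    fast_pair: Tuple[Recognizer, Recognizer],
    slow_pair: Tuple[Recognizer, Recognizer],
) -> Tuple[Hashable, int]:
    """Run two recognizers first; run the other two only if the first two disagree.

    Returns the chosen label and the number of recognizers that were executed.
    """
    first, second = fast_pair[0](sample), fast_pair[1](sample)
    if first == second:
        # Agreement: skip the remaining methods to save computation time.
        return first, 2
    votes: List[Hashable] = [first, second]
    votes.extend(recognizer(sample) for recognizer in slow_pair)
    return majority_vote(votes), 4
```

In the abstract's terms, `fast_pair` would stand for MFCC | Euclidean and LPC | Itakura, and `slow_pair` for Formant | DTW and MFCC | DTW. The saving comes from skipping the second pair whenever the first pair already agrees.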

    Cue estimation for vowel perception prediction in low signal-to-noise ratios

    This study investigates the signal processing required to evaluate hearing perception prediction models at low signal-to-noise ratios (SNR). It focusses on speech enhancement and on the estimation of the cues from which speech may be recognized, specifically where these cues are estimated from severely degraded speech (SNR ranging from -10 dB to -3 dB). This research has application in the field of cochlear implants (CI), where a listener hears degraded speech because of several distortions introduced by the biophysical interface (e.g. frequency and amplitude discretization). These difficulties can also be interpreted as a loss in signal quality due to a specific type of noise. The ability to investigate perception in low SNR conditions may have application in the development of CI signal processing algorithms that counter the effects of noise. In the military domain, a speech signal may be degraded intentionally by enemy forces or unintentionally, for example by engine noise. The ability to analyse and predict perception can be used to develop algorithms that counter intentional or unintentional interference, or to predict perception degradation where low SNR conditions cannot be avoided. A previously documented perception model (Svirsky, 2000) is used to illustrate that the proposed signal processing steps can successfully estimate the various cues used by the perception model at SNRs as low as -10 dB. Dissertation (MEng)--University of Pretoria, 2009. Electrical, Electronic and Computer Engineering. Unrestricted.

    A new Automatic Formant Tracking approach based on scalogram maxima detection using complex wavelets

    In this paper we present a new formant tracking algorithm in which formant frequencies are estimated by detecting local maxima of a time-frequency representation. This representation is a scalogram obtained from a complex wavelet transform. The formant frequency candidates are validated as local maxima of the scalogram, which correspond to wavelet ridges. The proposed algorithm also introduces the computation of the center of gravity as a tracking constraint. We tested the new algorithm on synthesized and natural voiced speech signals. The formant trajectories obtained by our algorithm were compared with the manually edited trajectories of our Arabic database, used as reference, and with those given by the Fourier transform method and by the LPC analysis used in Praat. Overall, the comparison showed that the first three formant trajectories obtained with the complex Morlet wavelet agree well with the manually edited formant tracks.
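
As a rough illustration of the peak-picking idea in this abstract, the sketch below finds local maxima in one time slice of a magnitude scalogram and computes the slice's center of gravity. It is a simplified sketch under stated assumptions: the slice is assumed to be given as parallel lists of frequencies and magnitudes, all function names are hypothetical, and the way the paper actually applies the center of gravity as a tracking constraint is not described in the abstract, so it is only computed here, not used.

```python
from typing import List, Sequence


def local_maxima(magnitudes: Sequence[float]) -> List[int]:
    """Indices whose magnitude is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i - 1] < magnitudes[i] > magnitudes[i + 1]
    ]


def center_of_gravity(freqs: Sequence[float], magnitudes: Sequence[float]) -> float:
    """Magnitude-weighted mean frequency of one scalogram slice."""
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("slice has zero total magnitude")
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total


def formant_candidates(
    freqs: Sequence[float], magnitudes: Sequence[float], n_formants: int = 3
) -> List[float]:
    """Frequencies of the strongest local maxima, in ascending order."""
    peaks = local_maxima(magnitudes)
    strongest = sorted(peaks, key=lambda i: magnitudes[i], reverse=True)[:n_formants]
    return sorted(freqs[i] for i in strongest)
```

A full tracker would apply this to every time slice and link candidates across slices into trajectories; that linking step is outside this sketch.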

    Speech Modeling and Robust Estimation for Diagnosis of Parkinson’s Disease

    Discriminative features for GMM and i-vector based speaker diarization

    Speaker diarization has received considerable research attention over the last decade. Among the different domains of speaker diarization, diarization in the meeting domain is the most challenging. Meeting recordings usually contain spontaneous speech and are, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect performance are the techniques employed to perform speaker segmentation and speaker clustering. In this thesis, we propose the use of jitter and shimmer long-term voice-quality features for both Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with state-of-the-art short-term cepstral and long-term speech features. The long-term features consist of prosody and Glottal-to-Noise Excitation ratio (GNE) descriptors. First, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with cepstral coefficients at the score likelihood level in both the proposed GMM and i-vector based speaker diarization systems. For the proposed GMM based speaker diarization system, independent HMM models are estimated from the short-term and long-term speech feature sets. In speaker segmentation, the short-term descriptors are fused with the long-term ones by linearly weighting the log-likelihood scores of Viterbi decoding. In speaker clustering, the short-term cepstral features are fused with the long-term ones by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets. 
For the proposed i-vector based speaker diarization system, speaker segmentation is carried out exactly as in the GMM based system described above. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two sets of i-vectors are extracted from the speaker segmentation hypothesis: the first from short-term cepstral features, the second from the voice-quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of the i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as the speaker clustering distance. We also propose the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that they capture the transitional characteristics of the speech signal, which contain speaker-specific information that is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and the long-term speech features (i.e., voice-quality, prosody and GNE) in both the GMM and i-vector based speaker diarization systems. The experiments were carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The experimental results show that the use of voice-quality, prosody, GNE and delta dynamic features improves the performance of both GMM and i-vector based speaker diarization systems. Postprint (published version)
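
The linear score fusion mentioned in this abstract can be illustrated with a short sketch. The assumptions here are not taken from the thesis: the fusion is written as a convex combination with a single weight in [0, 1], higher fused scores are taken to mean greater similarity, and the names `fuse_scores` and `best_merge_candidate` are hypothetical.

```python
from typing import Dict, Tuple


def fuse_scores(short_term: float, long_term: float, weight: float) -> float:
    """Linearly weight two scores: weight * short_term + (1 - weight) * long_term."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * short_term + (1.0 - weight) * long_term


def best_merge_candidate(
    pair_scores: Dict[Tuple[str, str], Tuple[float, float]], weight: float
) -> Tuple[str, str]:
    """Pick the cluster pair with the highest fused similarity.

    pair_scores maps a cluster pair to (short-term score, long-term score),
    e.g. scores computed from cepstral and from voice-quality/prosody/GNE features.
    """
    return max(
        pair_scores,
        key=lambda pair: fuse_scores(pair_scores[pair][0], pair_scores[pair][1], weight),
    )
```

In practice the weight would be tuned on development data; the abstract does not report the values used.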

    On the development of an automatic voice pleasantness classification and intensity estimation system

    In the last few years, the number of systems and devices that use voice-based interaction has grown significantly. For continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However, there are currently very few studies that try to evaluate how pleasant a voice is from a perceptual point of view when the final application is a speech-based interface. In this paper we present an objective definition of voice pleasantness based on the composition of a representative feature subset, and a new automatic voice pleasantness classification and intensity estimation system. Our study is based on a database composed of European Portuguese female voices, but the methodology can be extended to male voices or to other languages. In the objective performance evaluation, the system achieved a 9.1% error rate for voice pleasantness classification and a 15.7% error rate for voice pleasantness intensity estimation. Work partially supported by ERDF funds, the Spanish Government (TEC2009-14094-C04-04), and Xunta de Galicia (CN2011/019, 2009/062)

    Automatic speech recognition: from study to practice

    Today, automatic speech recognition (ASR) is widely used for purposes such as robotics, multimedia, and medical and industrial applications. Although much research has been carried out in this field over the past decades, there is still considerable room for further work. To start working in this area, a thorough knowledge of ASR systems, including their weak points and problems, is essential. In addition, practical experience reinforces the theoretical understanding. With this in mind, in this master's thesis we first review the principal structure of standard HMM-based ASR systems from a technical point of view, covering feature extraction, acoustic modeling, language modeling and decoding. We then discuss the most significant challenges in ASR systems, which concern the characteristics of internal components or external factors that affect ASR performance. Furthermore, we have implemented a Spanish-language recognizer using the HTK toolkit. Finally, two open research lines, drawn from a study of different sources in the ASR field, are suggested for future work.

    Formant and burst spectral measurements with quantitative error models for speech sound classification.

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996. Includes bibliographical references (p. 142-145). This electronic version was scanned from a copy of the thesis on file at the Speech Communication Group. The certified thesis is available in the Institute Archives and Special Collections. National Institute on Deafness and Other Communication Disorders. National Science Foundation. Ph. D.

    Model-Based Speech Enhancement

    A method of speech enhancement is developed that reconstructs clean speech from a set of acoustic features using a harmonic plus noise model of speech. This is a significant departure from traditional filtering-based methods of speech enhancement. A major challenge with this approach is to accurately estimate the acoustic features (voicing, fundamental frequency, spectral envelope and phase) from noisy speech. This is achieved using maximum a posteriori (MAP) estimation methods that operate on the noisy speech. In each case, a prior model of the relationship between the noisy speech features and the estimated acoustic feature is required. These models are approximated using speaker-independent GMMs of the clean speech features, which are adapted to speaker-dependent models using MAP adaptation and adapted for noise using the Unscented Transform. Objective results are presented to optimise the proposed system, and a set of subjective tests compares the approach with traditional enhancement methods. Three-way listening tests examining signal quality, background noise intrusiveness and overall quality show the proposed system to be highly robust to noise, performing significantly better than conventional enhancement methods in terms of background noise intrusiveness. However, the proposed method is shown to reduce signal quality, with overall quality measured to be roughly equivalent to that of the Wiener filter.
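
To make the harmonic plus noise idea concrete, the sketch below synthesizes one frame as a sum of harmonics of the fundamental frequency plus optional Gaussian noise. It is a generic textbook-style illustration, not the method of the work above: the function name `synthesize_frame` and all parameters are hypothetical, the harmonic amplitudes are assumed to have already been sampled from a spectral envelope, and the noise component is plain white noise rather than spectrally shaped noise.

```python
import math
import random
from typing import List, Optional, Sequence


def synthesize_frame(
    f0_hz: float,
    harmonic_amps: Sequence[float],
    phases: Sequence[float],
    sample_rate_hz: float,
    n_samples: int,
    noise_std: float = 0.0,
    seed: Optional[int] = None,
) -> List[float]:
    """Return n_samples of sum_k a_k * cos(2*pi*k*f0*t + phi_k) plus white noise."""
    rng = random.Random(seed)
    frame = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        # Harmonic part: the k-th harmonic sits at k * f0 (k starts at 1).
        harmonic = sum(
            amp * math.cos(2.0 * math.pi * (k + 1) * f0_hz * t + phase)
            for k, (amp, phase) in enumerate(zip(harmonic_amps, phases))
        )
        # Noise part: unshaped Gaussian noise, a simplifying assumption.
        noise = rng.gauss(0.0, noise_std) if noise_std > 0.0 else 0.0
        frame.append(harmonic + noise)
    return frame
```

For an unvoiced frame, one would pass empty harmonic lists and a nonzero `noise_std`, so that only the noise part contributes.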

    Recent Advances in Signal Processing

    Signal processing is critical to the majority of new technological inventions and challenges across a variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is aimed primarily at students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be grouped into five areas depending on the application at hand: image processing, speech processing, communication systems, time-series analysis, and educational packages, in that order. The chapters are independent and self-contained, so the interested reader can choose any chapter and skip to another without losing continuity.