Speaker Recognition Using Multiple Parametric Self-Organizing Maps
Speaker recognition is the process of automatically recognizing a person who is speaking on the basis of individual characteristics of his or her voice. This technology allows systems to automatically verify identity in applications such as banking by telephone or forensic science.
A speaker recognition system has two main modules: feature extraction and classification.
For feature extraction, the most commonly used techniques are Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). For classification and verification, techniques such as Vector Quantization (VQ), Hidden Markov Models (HMM) and Neural Networks have been used.
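To illustrate the feature-extraction side, a minimal MFCC computation for a single pre-windowed frame might look like the following. This is a generic numpy sketch, not the exact pipeline of any thesis listed here; the sample rate, filter count and coefficient count are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=8000, n_filters=20, n_ceps=12):
    """MFCCs for one pre-windowed frame (a minimal sketch)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    n_fft = len(frame)
    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fbank[i, k] = (k - lo) / max(c - lo, 1)     # rising edge
        for k in range(c, hi):
            fbank[i, k] = (hi - k) / max(hi - c, 1)     # falling edge
    log_energy = np.log(fbank @ spectrum + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(log_energy * np.cos(np.pi * q * (2 * n + 1)
                                                / (2 * n_filters)))
                     for q in range(n_ceps)])
```

In practice a library implementation (with pre-emphasis, liftering and delta features) would be used, but the sketch shows the filterbank-plus-DCT core that the abstracts refer to.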
The contribution of this thesis is a new methodology to achieve high-accuracy identification and impostor rejection. The newly proposed method, Multiple Parametric Self-Organizing Maps (M-PSOM), is a classification and verification technique. The new method was successfully implemented and tested using the CSLU Speaker Recognition Corpora of the Oregon School of Engineering, with excellent results.
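The abstract does not specify the M-PSOM algorithm itself, but the self-organizing map at its core can be sketched generically. The grid size, learning-rate schedule and Gaussian neighbourhood below are assumptions, and the parametric and multiple-map extensions of M-PSOM are not shown:

```python
import numpy as np

def train_som(data, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small self-organizing map (generic sketch, not M-PSOM)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.standard_normal((h, w, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5     # shrinking neighbourhood
        for x in data:
            # Best-matching unit: grid cell whose weight is closest to x
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Pull the BMU and its grid neighbours toward the sample
            dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            nb = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            weights += lr * nb * (x - weights)
    return weights
```

For speaker recognition, one such map would typically be trained per speaker on that speaker's feature vectors, and a test utterance scored by its quantization error against each map.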
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from his or her voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF NN), combined with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-Frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back-Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
Another novel approach, using vowel formant analysis, is implemented with Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored per speaker, making it both storage- and time-efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme requires no training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, whereas the performance of the proposed score-based methodology stays almost linear.
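Extracting formants from LPC, as described above, is commonly done by finding the roots of the prediction-error polynomial. A minimal sketch using least-squares LPC (the model order, sample rate and root-selection threshold are assumptions, not details from the thesis):

```python
import numpy as np

def lpc(frame, order=10):
    """Least-squares LPC: predict s[n] from the previous `order` samples."""
    n = len(frame)
    A = np.array([frame[i - order:i][::-1] for i in range(order, n)])
    b = frame[order:n]
    coefs, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Prediction-error filter A(z) = 1 - sum_k c_k z^{-k}
    return np.concatenate(([1.0], -coefs))

def formants(frame, sr=8000, order=10):
    """Formant frequency estimates from the angles of the LPC roots."""
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0.01]   # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    return np.sort(freqs)
```

Real formant trackers additionally check root bandwidths and restrict the frequency range; the sketch only shows the LPC-roots idea.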
Finally, a novel audio-visual fusion-based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, which is lost when they are combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice, owing to its low computational time and high recognition accuracy.
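Score-level fusion of two modalities can be sketched as a normalized weighted sum. This is a generic illustration; the thesis's actual normalization and weighting are not given in the abstract:

```python
def fuse_scores(voice_scores, face_scores, w=0.5):
    """Weighted-sum score-level fusion after min-max normalization.
    Scores are dicts mapping speaker id -> similarity; higher is better."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}
    v, f = norm(voice_scores), norm(face_scores)
    fused = {k: w * v[k] + (1 - w) * f[k] for k in v}
    return max(fused, key=fused.get)    # identity with the best fused score
```

Decision-level fusion with OR voting, by contrast, would accept an identity claimed by either classifier; feature-level fusion would concatenate MFCC and PCA features before classification.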
Suppression of acoustic noise in speech using spectral subtraction
Technical report. A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital waveform. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during non-speech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a preprocessor for narrowband voice communication systems, speech recognition systems or speaker authentication systems.
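The method described above can be sketched as follows: estimate the noise magnitude bias from a non-speech segment, subtract it frame by frame, and resynthesize using the noisy phase. The frame size, overlap and spectral floor below are assumptions, and the report's secondary residual-noise procedures are omitted:

```python
import numpy as np

def spectral_subtract(noisy, noise, n_fft=256, hop=128, floor=0.01):
    """Magnitude spectral subtraction (minimal sketch)."""
    win = np.hanning(n_fft)
    # Spectral noise bias: mean magnitude over non-speech frames
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise[i:i + n_fft] * win))
         for i in range(0, len(noise) - n_fft + 1, hop)], axis=0)
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - n_fft + 1, hop):
        spec = np.fft.rfft(noisy[i:i + n_fft] * win)
        mag = np.abs(spec) - noise_mag             # subtract the bias
        mag = np.maximum(mag, floor * noise_mag)   # floor limits musical noise
        # Resynthesize with the noisy phase, overlap-add the frames
        out[i:i + n_fft] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                                         n_fft)
    return out
```

Because the output is a waveform, the function can sit in front of any recognizer, exactly as the report's preprocessor framing suggests.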
Assessing the influence of phonetic variation on the perception of spoken threats
In spite of the belief that there is such a thing as a "threatening tone of voice" (Watt, Kelly and Llamas, 2013), there is currently little research which explores how listeners infer traits such as threat from speakers' voices. This thesis addresses the question of how listeners infer traits such as how threatening speakers sound, and whether phonetic aspects of speakers' voices can play a role in shaping these evaluations. Additionally, it is sometimes the case that a victim of a crime will never see the perpetrator's face but will hear the perpetrator's voice. In such cases, attempts can be made to get the witness or victim to describe the offender's voice. However, one problem with this is whether phonetically untrained listeners have the ability to accurately describe different aspects of speakers' voices. This issue is also addressed throughout this thesis.
Over five experiments, this thesis investigates the influence of a range of linguistic and phonetic variables on listeners' evaluations of how threatening speakers sounded when producing indirect threat utterances. It also examines how accurately phonetically untrained listeners can describe different aspects of speakers' voices alongside their evaluative judgements of traits such as threat and intent-to-harm. As well as showing that a range of linguistic and phonetic variables can influence listeners' threat evaluations, results support the view that caution should be adopted in over-reliance on the idea that people will "know a threat when they hear one" (Gingiss, 1986:153). This research begins to address the phonetic basis for the perceptual existence of a "threatening tone of voice", along with how listeners evaluate and describe voices in earwitness contexts. Suggestions are made at the end of the thesis for improvements in the elicitation and implementation of accurate, meaningful information about speakers' voices from linguistically untrained listeners in evaluative settings involving spoken threats.
A comparison of features for large population speaker identification
Bibliography: leaves 95-104. Speech recognition systems all have one criterion in common: they perform better in a controlled environment using clean speech. Though performance can be excellent, even exceeding human capabilities for clean speech, systems fail when presented with speech data from more realistic environments such as telephone channels. The difference between using a recognizer in clean and in noisy environments is extreme, and this is one of the major obstacles to producing commercial recognition systems for use in normal environments. It is this lack of performance of speaker recognition systems over telephone channels that this work addresses. The human auditory system is a speech recognizer with excellent performance, especially in noisy environments. Since humans are better than any machine at ignoring noise, auditory-based methods are promising approaches, as they attempt to model the workings of the human auditory system. These methods have been shown to outperform more conventional signal processing schemes for speech recognition, speech coding, word recognition and phone classification tasks. Since speaker identification has received a lot of attention in speech processing because of the real-world applications awaiting it, it is attractive to evaluate performance using auditory models as features. Firstly, this study aims at improving the results for speaker identification. The improvements were made through the use of parameterized feature sets together with the application of cepstral mean removal for channel equalization. The study is further extended to compare an auditory-based model, the Ensemble Interval Histogram (EIH), with mel-scale features, which were shown to perform almost error-free on clean speech. Previous studies showing the EIH to be more robust to noise were conducted on speaker-dependent, small-population, isolated-word tasks, and are now extended to speaker-independent, larger-population, continuous speech.
This study investigates whether the EIH representation is more resistant to telephone noise than the mel-cepstrum, as was shown in those previous studies, now applied for the first time to the speaker identification task using a state-of-the-art Gaussian mixture model system.
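Cepstral mean removal, used above for channel equalization, exploits the fact that a stationary convolutional channel appears as an additive constant in the cepstral domain, so subtracting the per-utterance mean removes it:

```python
import numpy as np

def cepstral_mean_removal(ceps):
    """Subtract the per-utterance mean from each cepstral dimension.
    `ceps` has shape (n_frames, n_coefficients). A stationary channel
    adds a constant offset in the cepstral domain, so removing the
    mean equalizes the channel (a minimal sketch)."""
    return ceps - ceps.mean(axis=0, keepdims=True)
```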
Robust Speaker Recognition Based on Latent Variable Models
Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions.
Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure for supervectors is presented by which qualitative insight about the information being captured is obtained. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. One subset of the dictionary entries is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy.
An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions and are therefore hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably with competing approaches. Also, our approach has closed-form solutions and scales gracefully to large datasets.
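The abstract does not name its two transformations, but one widely used non-linear transformation that makes i-vectors more amenable to linear-Gaussian modeling is length normalization, i.e. projection onto the unit sphere. As a hypothetical illustration of that family of techniques:

```python
import numpy as np

def length_normalize(ivectors):
    """Project each i-vector onto the unit sphere.
    `ivectors` has shape (n_utterances, dim). This is one common
    Gaussianizing transformation; the thesis's exact transformations
    may differ."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / np.maximum(norms, 1e-12)
```

After such a transformation, a simple linear-Gaussian backend (e.g. probabilistic LDA) can be trained in closed form, which matches the scalability argument made above.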
Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.
Automatic speaker and language recognition through acoustic characterization of linguistic units
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defense: 30-06-201
An automatic speaker recognition system.
By Yu Chun Kei. Thesis (M.Phil.), Chinese University of Hong Kong, 1989. Bibliography: leaves 86-88.