
    Physiologically-Motivated Feature Extraction Methods for Speaker Recognition

    Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, so the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and the complexity of the models required to cover the phonetic space, especially in tasks such as identification, where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics, intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent unique characteristics of speech production not captured by current state-of-the-art speaker identification systems. The proposed features are evaluated through three experimental paradigms: cross-lingual speaker identification, cross-song-type avian speaker identification, and monolingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically focused information present in traditional spectral features. Incorporating the proposed glottal source features offers a significant overall improvement to the robustness and accuracy of speaker identification tasks.
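    The exact RPCC, GLFCC and TPCC definitions are not reproduced in this abstract. As a rough, generic illustration of how source (glottal) information can be separated from the vocal-tract filter, one can inverse-filter the signal with linear-prediction coefficients and work on the residual; the sketch below is a minimal pure-Python version, and the AR(1) test signal and filter order are illustrative assumptions, not the dissertation's setup:

```python
import random

def lp_residual(signal, order=2):
    """Linear-prediction residual via autocorrelation and Levinson-Durbin.
    The residual approximates the excitation (source) component after the
    vocal-tract (filter) contribution has been removed."""
    n = len(signal)
    # Autocorrelation for lags 0..order.
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0]            # prediction-error filter coefficients
    err = r[0]
    for k in range(1, order + 1):
        acc = sum(a[j] * r[k - j] for j in range(k))
        refl = -acc / err                 # reflection coefficient
        new_a = a + [0.0]
        for j in range(1, k):
            new_a[j] = a[j] + refl * a[k - j]
        new_a[k] = refl
        a = new_a
        err *= 1.0 - refl * refl
    # Inverse-filter the signal to obtain the residual.
    residual = [sum(a[j] * signal[i - j] for j in range(order + 1))
                for i in range(order, n)]
    return a, residual

# Synthetic AR(1) "speech-like" signal: the LP residual should be close to
# the white driving noise, i.e. much lower power than the signal itself.
rng = random.Random(0)
x = [0.0]
for _ in range(2000):
    x.append(0.9 * x[-1] + rng.gauss(0.0, 1.0))
a, res = lp_residual(x, order=2)
```

Source-based cepstral features would then be computed from this residual rather than from the raw spectrum.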

    Automatic speaker recognition: modelling, feature extraction and effects of clinical environment

    Speaker recognition is the task of establishing the identity of an individual based on his/her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware. Speaker recognition is typically achieved in two signal-processing stages: training and testing. The training process calculates speaker-specific feature parameters from the speech, and these features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the Expectation Maximization (EM) algorithm to build the speaker models. The most frequently used features are the Mel Frequency Cepstral Coefficients (MFCC). This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the features' sensitivity to changes due to speaker aging, use of alcohol and drugs, and changing health conditions and mental state. The thesis proposed a new method of deriving the Gaussian mixture model (GMM) parameters, called the EM-ITVQ algorithm. The EM-ITVQ showed a significant improvement in equal error rates and higher convergence rates when compared to the classical GMM based on the expectation maximization (EM) method. It was demonstrated that features based on a nonlinear model of speech production (TEO-based features) provided better performance compared to the conventional MFCC features. For the first time, the effect of clinical depression on speaker verification rates was tested, and it was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed. The deterioration was demonstrated using conventional (MFCC) features. The thesis also showed that when the MFCC features are replaced with features based on the nonlinear model of speech production (TEO-based features), the detrimental effect of clinical depression on speaker verification rates can be reduced.
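    The classical EM baseline that EM-ITVQ is compared against can be sketched in a few lines. The following is a minimal, illustrative 1-D two-component version (the synthetic data, the min/max initialisation and the fixed iteration count are assumptions for the sketch, not the thesis's configuration):

```python
import math
import random

def em_gmm_1d(data, iters=50):
    """Classical EM for a two-component 1-D Gaussian mixture (sketch)."""
    means = [min(data), max(data)]        # crude but deterministic init
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w / math.sqrt(2 * math.pi * v) * math.exp(-(x - m) ** 2 / (2 * v))
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj, 1e-6)
    return weights, means, variances

# Two well-separated synthetic clusters: EM should recover means near 0 and 5.
rng = random.Random(1)
data = ([rng.gauss(0.0, 0.5) for _ in range(200)] +
        [rng.gauss(5.0, 0.5) for _ in range(200)])
weights, means, variances = em_gmm_1d(data)
```

In a real system each speaker model would be a multivariate mixture over MFCC (or TEO-based) feature vectors rather than 1-D points.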

    Voice Spoofing Countermeasures: Taxonomy, State-of-the-Art, Experimental Analysis of Generalizability, Open Challenges, and the Way Forward

    Malicious actors may seek to use voice-spoofing attacks to fool automated speaker verification (ASV) systems and even use them for spreading misinformation. Various countermeasures have been proposed to detect these spoofing attacks. Given the extensive work on spoofing detection in ASV systems over the last 6-7 years, there is a need to classify the research and perform qualitative and quantitative comparisons of state-of-the-art countermeasures. Additionally, no existing survey paper has reviewed integrated solutions to voice spoofing evaluation and speaker verification, adversarial/anti-forensics attacks on spoofing countermeasures and on ASV itself, or unified solutions that detect multiple attacks with a single model. Further, no work has provided an apples-to-apples comparison of published countermeasures in order to assess their generalizability by evaluating them across corpora. In this work, we review the literature on spoofing detection using hand-crafted features, deep learning, end-to-end, and universal spoofing countermeasure solutions to detect speech synthesis (SS), voice conversion (VC), and replay attacks. We also review integrated solutions to voice spoofing evaluation and speaker verification, as well as adversarial and anti-forensics attacks on voice countermeasures and on ASV. The limitations and challenges of the existing spoofing countermeasures are also presented. We report the performance of these countermeasures on several datasets and evaluate them across corpora. For the experiments, we employ the ASVspoof2019 and VSDC datasets along with GMM, SVM, CNN, and CNN-GRU classifiers. (For reproducibility of the results, the code of the test bed can be found in our GitHub repository.)
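    Spoofing countermeasures and ASV systems alike are commonly compared by their equal error rate (EER), the operating point where the false-acceptance and false-rejection rates meet. A minimal threshold-sweep sketch (the toy score lists are invented for illustration; real evaluations interpolate on far denser score sets):

```python
def equal_error_rate(genuine, impostor):
    """Sweep decision thresholds over the observed scores and return the
    (EER, threshold) pair where FAR and FRR are closest."""
    best_gap, best = float("inf"), (1.0, 0.0)
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # spoof accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), ((far + frr) / 2, t)
    return best

# Perfectly separated scores give an EER of 0 at the lowest genuine score.
eer, thr = equal_error_rate([0.8, 0.9, 0.95], [0.1, 0.2, 0.3])
```

Cross-corpus evaluation then amounts to computing this statistic with scores produced by a model trained on one dataset and tested on another.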

    Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions

    ABSTRACT: In recent years there has been great progress in automatic speech recognition. The challenge now is not only to recognize the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and personality of the speaker. This research work aims at developing a methodology for automatic emotion recognition from speech signals under non-controlled noise conditions. For that purpose, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for that purpose.
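    The specific feature sets of this work are not listed in the abstract; as a small illustration of the acoustic end of such a pipeline, two basic per-frame descriptors (short-time log-energy and zero-crossing rate) can be extracted as follows. The frame length, hop and 16 kHz test tone are assumptions for the sketch:

```python
import math

def frame_features(signal, frame_len=160, hop=80):
    """Per-frame log-energy and zero-crossing rate, two elementary acoustic
    descriptors often computed alongside richer feature sets."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        log_energy = math.log(sum(x * x for x in frame) + 1e-12)
        # A zero crossing is a sign change between consecutive samples.
        zcr = sum(frame[i - 1] * frame[i] < 0
                  for i in range(1, frame_len)) / frame_len
        feats.append((log_energy, zcr))
    return feats

# A 440 Hz tone sampled at 16 kHz crosses zero about 8.8 times per 10 ms
# frame, so its ZCR should sit near 8/160 = 0.05.
tone = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(1600)]
feats = frame_features(tone)
```

Emotion classifiers would consume vectors of many such descriptors (plus the non-linear and wavelet-based ones) per utterance.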

    Malay Articulation System for Early Screening Diagnostic Using Hidden Markov Model and Genetic Algorithm

    Speech recognition is an important technology that can serve as a great aid for individuals with sight or hearing disabilities. There has been extensive research interest and development in this area over the past decades. However, in Malaysia usage and exposure remain immature even though there is demand from the medical and healthcare sectors. The aim of this research is to assess the quality and impact of using computerized methods for early screening of speech articulation disorders among Malaysians, such as omission, substitution, addition and distortion in their speech. In this study, a statistical probabilistic approach using the Hidden Markov Model (HMM) was adopted with a newly designed Malay corpus for articulation-disorder cases following the SAMPA and IPA guidelines. Front-end processing was improved for feature-vector selection by applying a silence-region calibration algorithm for start- and end-point detection. The classifier was also modified significantly by incorporating Viterbi search with a Genetic Algorithm (GA) to obtain high recognition accuracy and for lexical-unit classification. The results were evaluated following National Institute of Standards and Technology (NIST) benchmarking. The tests show that recognition accuracy improved by 30% to 40% using the Genetic Algorithm technique compared with the conventional technique. A new corpus was built with verification and justification from medical experts. In conclusion, computerized methods for early screening can ease human effort in tackling speech disorders, and the proposed Genetic Algorithm technique has been proven to improve recognition performance in terms of search and classification tasks.
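    The GA modifications are not detailed in the abstract, but the Viterbi search at the core of HMM decoding can be sketched directly. The two-state silence/speech model and all its probabilities below are hypothetical, chosen only to make the example self-contained:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observation sequence (HMM)."""
    # trellis[t][s] = (best probability of a path ending in s, that path)
    trellis = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev = trellis[-1]
        trellis.append({
            s: max((prev[q][0] * trans_p[q][s] * emit_p[s][o],
                    prev[q][1] + [s])
                   for q in states)
            for s in states})
    prob, path = max(trellis[-1].values())
    return prob, path

# Hypothetical silence/speech HMM over quantised frame-energy observations.
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.6, "speech": 0.4},
           "speech": {"sil": 0.3, "speech": 0.7}}
emit_p = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}
prob, path = viterbi(["low", "low", "high", "high"],
                     states, start_p, trans_p, emit_p)
```

A GA could then, for example, evolve model or search parameters around this decoder; that coupling is system-specific and not reproduced here.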

    CONDITION MONITORING BASED ON A WIRELESS, DISTRIBUTED AND SCALABLE PLATFORM

    Ph.D. (Doctor of Philosophy)

    Cepstral Analysis and the Hilbert-Huang Transform for Automatic Detection of Parkinson's Disease

    Most patients with Parkinson's Disease (PD) develop speech deficits, including reduced sonority, altered articulation, and abnormal prosody. This article presents a methodology to automatically classify patients with PD and Healthy Control (HC) subjects. In this study, the Hilbert-Huang Transform (HHT) and Mel-Frequency Cepstral Coefficients (MFCCs) were used to model modulated phonations (changing the tone from low to high and vice versa) of the vowels /a/, /i/, and /u/. The HHT was used to extract the first two formants from the audio signals, with the aim of modeling the stability of the tongue while the speakers produced modulated vowels. Kruskal-Wallis statistical tests were used to eliminate redundant and non-relevant features in order to improve classification accuracy. PD patients and HC subjects were automatically classified using a Radial Basis Function Support Vector Machine (RBF-SVM). The results show that the proposed approach allows automatic discrimination between PD and HC subjects with accuracies of up to 75 % for women and 73 % for men.
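    The Kruskal-Wallis screening step ranks each feature by how well it separates the PD and HC groups. A minimal pure-Python sketch of the H statistic (in practice one would use `scipy.stats.kruskal`, which also handles ties and p-values; the toy feature values below are invented and deliberately tie-free):

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (assumes no tied values): large H means
    the groups occupy different rank ranges, i.e. the feature separates them."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # 1-based ranks
    n = len(pooled)
    between = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * between - 3.0 * (n + 1)

# A feature whose values barely overlap across groups scores far higher
# than one whose values interleave with the other group's.
h_separated = kruskal_h([1.0, 2.0, 3.0, 4.0], [10.0, 11.0, 12.0, 13.0])
h_interleaved = kruskal_h([1.0, 3.0, 5.0], [2.0, 4.0, 6.0])
```

Features with H below a significance threshold would be dropped before training the RBF-SVM.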