Search CORE

313 research outputs found

Using data-driven and phonetic units for speaker verification

Author: El Hannani Asmaa
Hennebert Jean
Montero-Asenjo Alberto
Petrovska-Delacrétaz Dijana
Toledano Doroteo T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006
Field of study

Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. A. E. Hannani, D. T. Toledano, D. Petrovska-Delacrétaz, A. Montero-Asenjo, J. Hennebert, "Using Data-driven and Phonetic Units for Speaker Verification" in Odyssey: The Speaker and Language Recognition Workshop, San Juan (Puerto Rico), 2006, pp.1 - 6Recognition of speaker identity based on modeling the streams produced by phonetic decoders (phonetic speaker recognition) has gained popularity during the past few years. Two of the major problems that arise when phone based systems are being developed are the possible mismatches between the development and evaluation data and the lack of transcribed databases. Data-driven segmentation techniques provide a potential solution to these problems because they do not use transcribed data and can easily be applied on development data minimizing the mismatches. In this paper we compare speaker recognition results using phonetic and data-driven decoders. To this end, we have compared the results obtained with a speaker recognition system based on data-driven acoustic units and phonetic speaker recognition systems trained on Spanish and English data. Results obtained on the NIST 2005 Speaker Recognition Evaluation data show that the data-driven approach outperforms the phonetic one and that further improvements can be achieved by combining both approache

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

Author: Dabouei Ali
Dawson Jeremy
Iranmanesh Seyed Mehdi
Kazemi Hadi
Nasrabadi Nasser M.
Soleymani Sobhan
Publication venue
Publication date: 31/07/2018
Field of study

In this paper a novel cross-device text-independent speaker verification architecture is proposed. Majority of the state-of-the-art deep architectures that are used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency of the adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they only represent some aspects of short-term acoustic level traits of the speaker's voice. However, the human voice consists of several linguistic levels such as acoustic, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherited shortcomings in spectro-temporal features, we propose to enhance the proposed Siamese convolutional neural network architecture by deploying a multilayer perceptron network to incorporate the prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously. This proposed architecture displays significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification.Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018

arXiv.org e-Print Archive

Crossref

On the use of high-level information in speaker and language recognition

Author: González Domínguez Javier
González-Rodríguez Joaquín
López Moreno Ignacio
Montero-Asenjo Alberto
Ramos Daniel
Toledano Doroteo T.
Publication venue
Publication date: 01/01/2006
Field of study

Actas de las IV Jornadas de Tecnología del Habla (JTH 2006)Automatic Speaker Recognition systems have been largely dominated by acoustic-spectral based systems, relying in proper modelling of the short-term vocal tract of speakers. However, there is scientific and intuitive evidence that speaker specific information is embedded in the speech signal in multiple short- and long-term characteristics. In this work, a multilevel speaker recognition system combining acoustic, phonotactic and prosodic subsystems is presented and assessed using NIST 2005 Speaker Recognition Evaluation data. For language recognition systems, the NIST 2005 Language Recognition Evaluation was selected to measure performance of a high-level language recognition systems

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Using data-driven and phonetic units for speaker verication

Author: Alberto Montero-Asenjo
Asmaa El Hannani
Dijana Petrovska-Delacrétaz
Doroteo T Toledano
Jean Hennebert
Publication venue
Publication date: 01/01/2006
Field of study

Abstract Recognition of speaker identity based on modeling the streams produced by phonetic decoders (phonetic speaker recognition) has gained popularity during the past few years. Two of the major problems that arise when phone based systems are being developed are the possible mismatches between the development and evaluation data and the lack of transcribed databases. Data-driven segmentation techniques provide a potential solution to these problems because they do not use transcribed data and can easily be applied on development data minimizing the mismatches. In this paper we compare speaker recognition results using phonetic and data-driven decoders. To this end, we have compared the results obtained with a speaker recognition system based on data-driven acoustic units and phonetic speaker recognition systems trained on Spanish and English data. Results obtained on the NIST 2005 Speaker Recognition Evaluation data show that the data-driven approach outperforms the phonetic one and that further improvements can be achieved by combining both approaches

CiteSeerX

Frame-level features conveying phonetic information for language and speaker recognition

Author: Díez Sánchez Mireia
Publication venue
Publication date: 04/09/2015
Field of study

150 p.This Thesis, developed in the Software Technologies Working Group of the Departmentof Electricity and Electronics of the University of the Basque Country, focuseson the research eld of spoken language and speaker recognition technologies.More specically, the research carried out studies the design of a set of featuresconveying spectral acoustic and phonotactic information, searches for the optimalfeature extraction parameters, and analyses the integration and usage of the featuresin language recognition systems, and the complementarity of these approacheswith regard to state-of-the-art systems. The study reveals that systems trained onthe proposed set of features, denoted as Phone Log-Likelihood Ratios (PLLRs), arehighly competitive, outperforming in several benchmarks other state-of-the-art systems.Moreover, PLLR-based systems also provide complementary information withregard to other phonotactic and acoustic approaches, which makes them suitable infusions to improve the overall performance of spoken language recognition systems.The usage of this features is also studied in speaker recognition tasks. In this context,the results attained by the approaches based on PLLR features are not as remarkableas the ones of systems based on standard acoustic features, but they still providecomplementary information that can be used to enhance the overall performance ofthe speaker recognition systems

Archivo Digital para la Docencia y la Investigación

Open-set Speaker Identification

Author: Karadaghi Rawande
Publication venue
Publication date: 20/12/2018
Field of study

This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent with the recent expansion of criminal and terrorist organisations. The main focus is to enhance open-set speaker identification process within the speaker identification systems, which are affected by noisy audio data obtained under uncontrolled environments such as in the street, in restaurants or other places of businesses. Consequently, two investigations are initially carried out including the effects of environmental noise on the accuracy of open-set speaker recognition, which thoroughly cover relevant conditions in the considered application areas, such as variable training data length, background noise and real world noise, and the effects of short and varied duration reference data in open-set speaker recognition. The investigations led to a novel method termed “vowel boosting” to enhance the reliability in speaker identification when operating with varied duration speech data under uncontrolled conditions. Vowels naturally contain more speaker specific information. Therefore, by emphasising this natural phenomenon in speech data, it enables better identification performance. The traditional state-of-the-art GMM-UBMs and i-vectors are used to evaluate “vowel boosting”. The proposed approach boosts the impact of the vowels on the speaker scores, which improves the recognition accuracy for the specific case of open-set identification with short and varied duration of speech material

University of Hertfordshire Research Archive

Temporal contours in linguistic units for automatic text-independent speaker recognition

Author: Franco-Pedroso Javier
Publication venue
Publication date: 01/01/2013
Field of study

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Automatic classification of speaker characteristics

Author: Huang Xu
Nguyen Phuoc
Sharma Dharmendra
Tran Dat
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

Crossref

Deakin Research Online

University of Canberra Research Repository

Spoofing Detection in Automatic Speaker Verification Systems Using DNN Classifiers and Dynamic Acoustic Features

Author: Guo Jun
Ma Zhanyu
Martin Rainer
Tan Zheng Hua
Yu Hong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/10/2018
Field of study

Crossref

VBN

비화자 요소에 강인한 화자 인식을 위한 딥러닝 기반 성문 추출

Author: 강우현
Publication venue: 서울대학교 대학원
Publication date: 01/02/2021
Field of study

학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·정보공학부, 2021. 2. 김남수.Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). Also, unlike the classical Gaussian mixture model (GMM)-based techniques (e.g., GMM supervector or i-vector), since the deep learning-based embedding systems are trained in a fully supervised manner, it is impossible for them to utilize unlabeled dataset when training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts the total variability embedding and a representation for the uncertainty within the input speech distribution. Unlike the conventional deep learning-based embedding techniques (e.g., d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the utilization of unlabeled datasets. Furthermore, in order to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both VAE- and ALI-based embedding techniques have shown great performance in terms of short duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling the non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings, and train them to have maximum information on their main-task while ensuring maximum uncertainty on their sub-task. Since the proposed method does not require any heuristic training strategy as in the conventional disentanglement techniques (e.g., adversarial learning, gradient reversal), optimizing the embedding network is relatively more stable. The proposed scheme have shown state-of-the-art performance in RSR2015 Part 3 dataset, and demonstrated its capability in efficiently disentangling the recording device and emotional information from the speaker embedding.최근 몇년간 다양한 딥러닝 기반 성문 추출 기법들이 제안되어 왔으며, 화자 인식에서 높은 성능을 보였다. 하지만 고전적인 성문 추출 기법에서와 마찬가지로, 딥러닝 기반 성문 추출 기법들은 서로 다른 환경 (e.g., 녹음 기기, 감정)에서 녹음된 음성들을 분석하는 과정에서 성능 저하를 겪는다. 또한 기존의 가우시안 혼합 모델 (Gaussian mixture model, GMM) 기반의 기법들 (e.g., GMM 슈퍼벡터, i-벡터)와 달리 딥러닝 기반 성문 추출 기법들은 교사 학습을 통하여 최적화되기에 라벨이 없는 데이터를 활용할 수 없다는 한계가 있다. 본 논문에서는 variational autoencoder (VAE) 기반의 성문 추출 기법을 제안하며, 해당 기법에서는 음성 분포 패턴을 요약하는 벡터와 음성 내의 불확실성을 표현하는 벡터를 추출한다. 기존의 딥러닝 기반 성문 추출 기법 (e.g., d-벡터, x-벡터)와는 달리, 제안하는 기법은 비교사 학습을 통하여 최적화 되기에 라벨이 없는 데이터를 활용할 수 있다. 더 나아가 VAE의 KL-divergence 제약 함수로 인한 정보 손실을 방지하기 위하여 adversarially learned inference (ALI) 기반의 성문 추출 기법을 추가적으로 제안한다. 제안한 VAE 및 ALI 기반의 성문 추출 기법은 짧은 음성에서의 화자 인증 실험에서 높은 성능을 보였으며, 기존의 i-벡터 기반의 기법보다 좋은 결과를 보였다. 또한 본 논문에서는 성문 벡터로부터 비 화자 요소 (e.g., 녹음 기기, 감정)에 대한 정보를 제거하는 학습법을 제안한다. 제안하는 기법은 화자 벡터와 비화자 벡터를 동시에 추출하며, 각 벡터는 자신의 주 목적에 대한 정보를 최대한 많이 유지하되, 부 목적에 대한 정보를 최소화하도록 학습된다. 기존의 비 화자 요소 정보 제거 기법들 (e.g., adversarial learning, gradient reversal)에 비하여 제안하는 기법은 휴리스틱한 학습 전략을 요하지 않기에, 보다 안정적인 학습이 가능하다. 제안하는 기법은 RSR2015 Part3 데이터셋에서 기존 기법들에 비하여 높은 성능을 보였으며, 성문 벡터 내의 녹음 기기 및 감정 정보를 억제하는데 효과적이었다.1. Introduction 1 2. Conventional embedding techniques for speaker recognition 7 2.1. i-vector framework 7 2.2. Deep learning-based speaker embedding 10 2.2.1. Deep embedding network 10 2.2.2. Conventional disentanglement methods 13 3. Unsupervised learning of total variability embedding for speaker verification with random digit strings 17 3.1. Introduction 17 3.2. Variational autoencoder 20 3.3. Variational inference model for non-linear total variability embedding 22 3.3.1. Maximum likelihood training 23 3.3.2. Non-linear feature extraction and speaker verification 25 3.4. Experiments 26 3.4.1. Databases 26 3.4.2. Experimental setup 27 3.4.3. Effect of the duration on the latent variable 28 3.4.4. Experiments with VAEs 30 3.4.5. Feature-level fusion of i-vector and latent variable 33 3.4.6. Score-level fusion of i-vector and latent variable 36 3.5. Summary 39 4. Adversarially learned total variability embedding for speaker recognition with random digit strings 41 4.1. Introduction 41 4.2. Adversarially learned inference 43 4.3. Adversarially learned feature extraction 45 4.3.1. Maximum likelihood criterion 47 4.3.2. Adversarially learned inference for non-linear i-vector extraction 49 4.3.3. Relationship to the VAE-based feature extractor 50 4.4. Experiments 51 4.4.1. Databases 51 4.4.2. Experimental setup 53 4.4.3. Effect of the duration on the latent variable 54 4.4.4. Speaker verification and identification with different utterance-level features 56 4.5. Summary 62 5. Disentangled speaker and nuisance attribute embedding for robust speaker verification 63 5.1. Introduction 63 5.2. Joint factor embedding 67 5.2.1. Joint factor embedding network architecture 67 5.2.2. Training for joint factor embedding 69 5.3. Experiments 71 5.3.1. Channel disentanglement experiments 71 5.3.2. Emotion disentanglement 82 5.3.3. Noise disentanglement 86 5.4. Summary 87 6. Conclusion 93 Bibliography 95 Abstract (Korean) 105Docto

SNU Open Repository and Archive