
    Dictionary Attacks on Speaker Verification

    In this paper, we propose dictionary attacks against speaker verification - a novel attack vector that aims to match a large fraction of a speaker population by chance. We introduce a generic formulation of the attack that can be used with various speech representations and threat models. The attacker uses adversarial optimization to maximize the raw similarity of speaker embeddings between a seed speech sample and a proxy population. The resulting master voice successfully matches a non-trivial fraction of people in an unknown population. Adversarial waveforms obtained with our approach can match on average 69% of females and 38% of males enrolled in the target system at a strict decision threshold calibrated to yield a false alarm rate of 1%. By using the attack with a black-box voice cloning system, we obtain master voices that are effective in the most challenging conditions and transferable between speaker encoders. We also show that, when combined with multiple attempts, this attack raises even more serious concerns about the security of these systems.
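    The attack reduces to gradient ascent on the average embedding similarity between a perturbed seed waveform and a proxy population. The sketch below illustrates the idea under assumptions not stated in the abstract: a differentiable speaker encoder `encoder`, a bounded additive perturbation, and cosine scoring. It is a minimal illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def optimize_master_voice(encoder, seed_wave, proxy_waves,
                          steps=1000, lr=1e-3, eps=0.05):
    """Gradient-ascent sketch of the dictionary attack: perturb a seed
    waveform so that its embedding's cosine similarity to a proxy
    population is maximized (illustrative, not the paper's code)."""
    delta = torch.zeros_like(seed_wave, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        proxy_emb = F.normalize(encoder(proxy_waves), dim=-1)   # (N, D)
    for _ in range(steps):
        adv = (seed_wave + delta).clamp(-1.0, 1.0)              # keep a valid waveform
        emb = F.normalize(encoder(adv.unsqueeze(0)), dim=-1)    # (1, D)
        sim = (emb @ proxy_emb.T).mean()                        # raw similarity to population
        loss = -sim                                             # ascend on similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                             # bounded perturbation
    return (seed_wave + delta).clamp(-1.0, 1.0).detach()
```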

    Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification

    Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications, including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method combines two approaches. The first is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor. The frame-level features are weighted by their corresponding speech posteriors estimated by the DNN-based VAD, and then aggregated to generate a speaker embedding. The second is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes, namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA). Experiments on a Korean speech database demonstrate that verification performance improves significantly in real-world environments when using self-adaptive soft VAD.
    Comment: Accepted at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019).
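    The soft-selection step can be written as a posterior-weighted average: with frame features h_t and speech posteriors p_t, the embedding is e = (sum_t p_t h_t) / (sum_t p_t), so non-speech frames are down-weighted rather than hard-dropped and the pipeline stays differentiable. A minimal sketch of this pooling, assuming frame features and posteriors are already computed (shapes are illustrative):

```python
import torch

def soft_vad_pooling(frame_feats: torch.Tensor,
                     speech_post: torch.Tensor) -> torch.Tensor:
    """Posterior-weighted average pooling, as in soft VAD (sketch).

    frame_feats: (T, D) frame-level features from the speaker extractor.
    speech_post: (T,) speech posteriors in [0, 1] from the DNN-based VAD.
    Returns a (D,) speaker embedding.
    """
    w = speech_post / (speech_post.sum() + 1e-8)    # normalize weights over time
    return (w.unsqueeze(-1) * frame_feats).sum(dim=0)
```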

    Speaker recognition for door opening systems

    Dual-degree Master's thesis with UTFPR - Universidade Tecnológica Federal do Paraná.
    Besides being an important communication tool, the voice can also serve for identification purposes, since it carries an individual signature for each person. Speaker recognition technologies can use this signature as an authentication method to control access to environments. This work explores the development and testing of machine learning and deep learning models, specifically GMM, VGG-M, and ResNet50, for speaker-recognition access control, with the goal of building a system that grants access to CeDRI's laboratory. The models were evaluated on their performance in recognizing speakers from audio samples, with the Equal Error Rate (EER) as the primary effectiveness metric. They were first trained and tested on public datasets with 1,251 to 6,112 speakers and then fine-tuned on private datasets with 32 speakers from CeDRI's laboratory. After conducting experiments on our private datasets, we found that the ResNet50 model outperformed the others, achieving the lowest EER of 0.7% on the Framed Silence Removed dataset. On the same dataset, the VGG-M model achieved an EER of 5% and the GMM model an EER of 2.13%. Our best model did not reach the current state of the art of 2.87% EER on the VoxCeleb1 verification dataset; however, our best ResNet50 implementation achieved an EER of 5.96% while being trained on only a small fraction of the data typically used. This result indicates that our model is robust and efficient, with a significant margin for improvement. This thesis provides insights into the capabilities of these models in a real-world application, aiming to deploy the system on a platform for practical use in laboratory access authorization. The results of this study contribute to the field of biometric security by demonstrating the potential of speaker recognition systems in controlled environments.
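    The Equal Error Rate reported throughout these experiments is the operating point where the false acceptance rate equals the false rejection rate. A standard way to compute it from verification trial scores, shown here as a generic sketch rather than the thesis code:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point where false acceptance rate (FAR) equals false
    rejection rate (FRR). labels: 1 = same speaker, 0 = impostor;
    scores: higher means more likely the same speaker."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # closest FAR/FRR crossing
    return (fpr[idx] + fnr[idx]) / 2
```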

    Text-independent bilingual speaker verification system.

    Ma Bin. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 96-102). Abstracts in English and Chinese.
    Contents:
    1. Introduction -- Biometrics; Speaker Verification; Overview of Speaker Verification Systems; Text Dependency (Text-Dependent Speaker Verification; GMM-based Speaker Verification); Language Dependency; Normalization Techniques; Objectives of the Thesis; Thesis Organization.
    2. Background -- Background Information (Speech Signal Acquisition; Speech Processing; Engineering Model of Speech Signal; Speaker Information in the Speech Signal; Feature Parameters: Mel-Frequency Cepstral Coefficients, Linear Predictive Coding Derived Cepstral Coefficients, Energy Measures, Derivatives of Cepstral Coefficients; Evaluating Speaker Verification Systems); Common Techniques (Template Model Matching Methods; Statistical Model Methods: HMM Modeling Technique, GMM Modeling Techniques, Gaussian Mixture Model, The Advantages of GMM; Likelihood Scoring; General Approach to Decision Making; Cohort Normalization: Probability Score Normalization, Cohort Selection); Chapter Summary.
    3. Experimental Corpora -- The YOHO Corpus (Design; Data Collection Process; Experimentation); CUHK Bilingual Speaker Verification (CUBS) Corpus (Design; Data Collection Process); Chapter Summary.
    4. Text-Dependent Speaker Verification -- Front-End Processing on the YOHO Corpus; Cohort Normalization Setup; HMM-based Speaker Verification Experiments (Subword HMM Models; Experimental Results: Comparison of Feature Representations, Effect of Cohort Normalization); Experiments on GMM-based Speaker Verification (Experimental Setup; Number of Gaussian Mixture Components; Effect of Cohort Normalization; Comparison of HMM and GMM); Comparison with Previous Systems; Chapter Summary.
    5. Language- and Text-Independent Speaker Verification -- Front-End Processing of the CUBS Corpus; Language- and Text-Independent Speaker Modeling; Cohort Normalization; Experimental Results and Analysis (Number of Gaussian Mixture Components; Cohort Normalization Effect; Language Dependency; Language Independence); Chapter Summary.
    6. Conclusions and Future Work -- Summary (Feature Comparison; HMM Modeling; GMM Modeling; Cohort Normalization; Language Dependency); Future Work (Feature Parameters; Model Quality: Variance Flooring, Silence Detection; Conversational Speaker Verification).
    Bibliography.
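    The cohort normalization studied in Chapters 4 and 5 subtracts the average log-likelihood of a cohort of impostor models from the claimant model's log-likelihood, turning the score into a likelihood ratio rather than an absolute model fit. A generic GMM-based sketch (using scikit-learn's GaussianMixture as a stand-in for the thesis's models):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cohort_normalized_score(feats, claimant_gmm, cohort_gmms):
    """Cohort-normalized GMM verification score (illustrative sketch).

    feats: (T, D) cepstral feature vectors of the test utterance.
    Score = log p(X | claimant) - mean_k log p(X | cohort_k);
    the claim is accepted when the score exceeds a decision threshold.
    """
    claimant_ll = claimant_gmm.score(feats)   # mean log-likelihood per frame
    cohort_ll = np.mean([g.score(feats) for g in cohort_gmms])
    return claimant_ll - cohort_ll
```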

    Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation

    This paper explores the use of ASR-pretrained Conformers for speaker verification, leveraging their strengths in modeling speech signals. We introduce three strategies: (1) transfer learning to initialize the speaker embedding network, improving generalization and reducing overfitting; (2) knowledge distillation to train a more flexible speaker verification model, incorporating frame-level ASR loss as an auxiliary task; and (3) a lightweight speaker adaptor for efficient feature conversion without altering the original ASR Conformer, allowing ASR and speaker verification to run in parallel. Experiments on VoxCeleb show significant improvements: transfer learning yields a 0.48% EER, knowledge distillation results in a 0.43% EER, and the speaker adaptor approach, adding just 4.92M parameters to a 130.94M-parameter model, achieves a 0.57% EER. Overall, our methods effectively transfer ASR capabilities to speaker verification tasks.
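    The distillation strategy pairs the usual speaker classification objective with a frame-level auxiliary term tied to the frozen ASR Conformer. The sketch below is one plausible form of such a multi-task loss; the weight `alpha` and the MSE surrogate for the frame-level term are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(spk_logits, spk_labels,
                      student_frames, teacher_frames, alpha=0.1):
    """Sketch of a multi-task objective: speaker classification loss plus
    a frame-level auxiliary term that distills the frozen ASR Conformer's
    frame representations into the student network."""
    spk_loss = F.cross_entropy(spk_logits, spk_labels)
    # Frame-level distillation: match student frames (B, T, D) to the
    # teacher's frames; an MSE surrogate stands in for the ASR loss here.
    kd_loss = F.mse_loss(student_frames, teacher_frames.detach())
    return spk_loss + alpha * kd_loss
```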

    One-shot lip-based biometric authentication: extending behavioral features with authentication phrase information

    Lip-based biometric authentication (LBBA) is an authentication method based on a person's lip movements during speech, in the form of video data captured by a camera sensor. LBBA can utilize both physical and behavioral characteristics of lip movements without requiring any sensory equipment beyond an RGB camera. State-of-the-art (SOTA) approaches use one-shot learning to train deep siamese neural networks which produce an embedding vector from these features. Embeddings are then used to compute the similarity between an enrolled user and a user being authenticated. A flaw of these approaches is that they model behavioral features as style-of-speech without relation to what is being said. This makes the system vulnerable to video replay attacks in which the client speaks any phrase. To solve this problem, we propose a one-shot approach which models behavioral features to discriminate based on what is being said, in addition to style-of-speech. We achieve this by customizing the GRID dataset to obtain the required triplets and training a siamese neural network based on 3D convolutions and recurrent neural network layers. A custom triplet loss for batch-wise hard-negative mining is proposed. Results obtained using an open-set protocol are 3.2% FAR and 3.8% FRR on the test set of the customized GRID dataset. Additional analysis quantifies the influence and discriminatory power of behavioral and physical features for LBBA.
    Comment: 28 pages, 10 figures, 7 tables.
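    Batch-wise hard-negative mining selects, for each anchor in a batch, the hardest positive and hardest negative before applying the triplet margin. A generic sketch of this scheme follows (the paper proposes a custom variant; this shows only the standard batch-hard form, assuming each identity appears at least twice per batch):

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """Batch-hard triplet loss (illustrative sketch, not the paper's
    custom formulation). emb: (B, D) embeddings; labels: (B,) identities."""
    dist = torch.cdist(emb, emb)                      # pairwise distances (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    # Hardest positive: farthest embedding sharing the anchor's identity.
    pos = (dist * (same & ~eye).float()).max(dim=1).values
    # Hardest negative: nearest embedding with a different identity.
    neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(pos - neg + margin).mean()
```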