Towards End-to-End Private Automatic Speaker Recognition
The development of privacy-preserving automatic speaker verification systems
has been the focus of a number of studies with the intent of allowing users to
authenticate themselves without risking the privacy of their voice. However,
current privacy-preserving methods assume that the template voice
representations (or speaker embeddings) used for authentication are extracted
locally by the user. This poses two important issues: first, knowledge of the
speaker embedding extraction model may create security and robustness
liabilities for the authentication system, as this knowledge might help
attackers in crafting adversarial examples able to mislead the system; second,
from the point of view of a service provider the speaker embedding extraction
model is arguably one of the most valuable components in the system and, as
such, disclosing it would be highly undesirable. In this work, we show how
speaker embeddings can be extracted while keeping both the speaker's voice and
the service provider's model private, using Secure Multiparty Computation.
Further, we show that it is possible to obtain reasonable trade-offs between
security and computational cost. This work is complementary to those showing
how authentication may be performed privately, and thus can be considered as
another step towards fully private automatic speaker recognition.
Comment: Accepted for publication at Interspeech 202
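The core primitive behind Secure Multiparty Computation can be illustrated with additive secret sharing, in which a value is split into random shares so that no single party learns it, yet arithmetic can be performed share-wise. The sketch below is a toy two-party example and an assumption on our part, not the specific protocol used in the paper:

```python
import random

Q = 2**31 - 1  # modulus of the arithmetic ring (illustrative choice)

def share(x, q=Q):
    """Split a secret integer into two additive shares mod q."""
    r = random.randrange(q)
    return r, (x - r) % q

def reconstruct(a, b, q=Q):
    """Recombine two shares into the original secret."""
    return (a + b) % q

# Each party holds one share; neither learns the secret alone.
s1, s2 = share(42)
assert reconstruct(s1, s2) == 42

# Addition of two secrets works share-wise, without ever reconstructing:
t1, t2 = share(100)
u1, u2 = (s1 + t1) % Q, (s2 + t2) % Q
assert reconstruct(u1, u2) == 142
```

Multiplication of shared values requires additional machinery (e.g. Beaver triples), which is where most of the computational cost trade-offs discussed in such systems arise.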
Privacy-oriented manipulation of speaker representations
Speaker embeddings are ubiquitous, with applications ranging from speaker
recognition and diarization to speech synthesis and voice anonymisation. The
amount of information held by these embeddings lends them versatility, but also
raises privacy concerns. Speaker embeddings have been shown to contain
information on age, sex, health and more, which speakers may want to keep
private, especially when this information is not required for the target task.
In this work, we propose a method for removing and manipulating private
attributes from speaker embeddings that leverages a Vector-Quantized
Variational Autoencoder architecture, combined with an adversarial classifier
and a novel mutual information loss. We validate our model on two attributes,
sex and age, and perform experiments with ignorant and fully-informed
attackers, and with in-domain and out-of-domain data.
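The paper's method combines a VQ-VAE with an adversarial classifier and a mutual information loss. As a much simpler illustration of the underlying idea of removing a private attribute from embeddings, one can project embeddings onto the orthogonal complement of a direction along which the attribute is linearly readable; this linear sketch is our own simplification, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy speaker embeddings (rows); a private attribute is assumed to be
# linearly readable along a single direction w (e.g. sex).
X = rng.normal(size=(100, 16))
w = rng.normal(size=16)
w /= np.linalg.norm(w)

# Remove the attribute direction by projecting onto its orthogonal complement.
P = np.eye(16) - np.outer(w, w)
X_clean = X @ P

# The attribute direction now carries no signal at all.
assert np.allclose(X_clean @ w, 0.0, atol=1e-10)
```

Adversarial training generalizes this idea to non-linear leakage: instead of a fixed direction, a classifier repeatedly re-estimates how the attribute can be read out, and the encoder is trained to defeat it.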
Privacy-Preserving iVector-Based Speaker Verification
This paper introduces an efficient algorithm for privacy-preserving voice verification based on iVector and linear discriminant analysis techniques. The research considers a scenario in which users enrol their voice biometric to access different services (e.g., banking). Once enrolment is completed, users can verify themselves using their voice print instead of alphanumeric passwords. Since a voice print is unique to each person, storing it with a third-party server raises several privacy concerns. To address this challenge, this paper proposes a novel technique based on randomization to carry out voice authentication, allowing the user to enrol and verify their voice in the randomized domain. To achieve this, the iVector-based voice verification technique has been redesigned to work in the randomized domain. The proposed algorithm is validated on a well-known speech dataset, and it neither compromises authentication accuracy nor adds significant complexity from the randomization operations.
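The abstract does not specify the randomization scheme, but a common construction in cancelable biometrics is to apply a user-specific random orthogonal transform to the template: scores computed in the randomized domain then equal scores in the clear, while the stored template reveals nothing without the transform. A minimal sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_rotation(dim, rng):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    A = rng.normal(size=(dim, dim))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))  # fix column signs for a proper rotation

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

R = random_rotation(8, rng)
enroll = rng.normal(size=8)  # stand-in for an enrolment i-vector
probe = rng.normal(size=8)   # stand-in for a verification i-vector

# Scores computed on randomized templates equal scores in the clear,
# so randomization need not cost any verification accuracy.
assert np.isclose(cosine(R @ enroll, R @ probe), cosine(enroll, probe))
```

This illustrates why such schemes can claim no loss of accuracy: any score based on inner products or distances is invariant under an orthogonal transform.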
SLMIA-SR: Speaker-Level Membership Inference Attacks against Speaker Recognition Systems
Membership inference attacks allow adversaries to determine whether a
particular example was contained in the model's training dataset. While
previous works have confirmed the feasibility of such attacks in various
applications, none has focused on speaker recognition (SR), a promising
voice-based biometric recognition technique. In this work, we propose SLMIA-SR,
the first membership inference attack tailored to SR. In contrast to
conventional example-level attack, our attack features speaker-level membership
inference, i.e., determining if any voices of a given speaker, either the same
as or different from the given inference voices, have been involved in the
training of a model. It is particularly useful and practical since the training
and inference voices are usually distinct, and it is also meaningful
considering the open-set nature of SR, namely, the recognition speakers were
often not present in the training data. We utilize intra-similarity and
inter-dissimilarity, two training objectives of SR, to characterize the
differences between training and non-training speakers and quantify them with
two groups of features driven by carefully-established feature engineering to
mount the attack. To improve the generalizability of our attack, we propose a
novel mixing ratio training strategy to train attack models. To enhance the
attack performance, we introduce voice chunk splitting to cope with the limited
number of inference voices and propose to train attack models dependent on the
number of inference voices. Our attack is versatile and can work in both
white-box and black-box scenarios. Additionally, we propose two novel
techniques to reduce the number of black-box queries while maintaining the
attack performance. Extensive experiments demonstrate the effectiveness of
SLMIA-SR.
Comment: In Proceedings of the 31st Network and Distributed System Security (NDSS) Symposium, 202
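The intra-similarity intuition behind the attack can be sketched with a toy example: voices of a speaker seen during training tend to cluster more tightly in embedding space than voices of an unseen speaker, so summary statistics of pairwise similarities can serve as membership features. The feature set and names below are our own simplified illustration, not the paper's full feature engineering:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def speaker_level_features(embeddings):
    """Summarize pairwise similarities among one speaker's inference voices.
    Training speakers tend to show higher intra-similarity."""
    n = len(embeddings)
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return {"mean": float(np.mean(sims)), "std": float(np.std(sims))}

rng = np.random.default_rng(1)
center = rng.normal(size=32)
# Voices of a "training" speaker: tightly clustered around a center.
member = [center + 0.1 * rng.normal(size=32) for _ in range(5)]
# Voices of a "non-training" speaker: loosely scattered.
non_member = [center + 2.0 * rng.normal(size=32) for _ in range(5)]

assert speaker_level_features(member)["mean"] > speaker_level_features(non_member)["mean"]
```

An attack model is then trained on such features computed for speakers with known membership status, which is where the paper's mixing-ratio training strategy comes in.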
Speaker recognition for door opening systems
Double-degree Master's programme with UTFPR - Universidade Tecnológica Federal do Paraná.
Besides being an important communication tool, the voice can also serve identification purposes, since it carries an individual signature for each person. Speaker recognition technologies can use this signature as an authentication method for access to environments. This work explores the development and testing of machine learning and deep learning models, specifically the GMM, VGG-M, and ResNet50 models, for speaker-recognition-based access control, with the goal of building a system to grant access to CeDRI's laboratory. The models were evaluated on their performance in recognizing speakers from audio samples, with emphasis on the Equal Error Rate (EER) metric to determine their effectiveness. The models were first trained and tested on public datasets with 1251 to 6112 speakers and then fine-tuned on private datasets with 32 speakers from CeDRI's laboratory. In this study, we compared the performance of the ResNet50, VGG-M, and GMM models for speaker verification. After conducting experiments on our private datasets, we found that the ResNet50 model outperformed the others, achieving the lowest EER of 0.7% on the Framed Silence Removed dataset. On the same dataset, the VGG-M model achieved an EER of 5% and the GMM model an EER of 2.13%. Our best model was unable to match the current state of the art of 2.87% EER on the VoxCeleb 1 verification dataset; however, our best ResNet50 implementation achieved an EER of 5.96% while being trained on only a small fraction of the data typically used. This result indicates that our model is robust and efficient, with a significant margin for improvement. This thesis provides insights into the capabilities of these models in a real-world application, with the aim of deploying the system on a platform for practical use in laboratory access authorization. The results of this study contribute to the field of biometric security by demonstrating the potential of speaker recognition systems in controlled environments.
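The Equal Error Rate used throughout this evaluation is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of its computation, with toy score arrays standing in for real verification scores:

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: the threshold where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are closest."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

genuine = np.array([0.9, 0.8, 0.85, 0.7, 0.95])   # same-speaker scores
impostor = np.array([0.1, 0.3, 0.2, 0.75, 0.15])  # different-speaker scores
rate = eer(genuine, impostor)  # 0.2 for these toy scores
assert np.isclose(rate, 0.2)
```

A lower EER means the genuine and impostor score distributions overlap less, which is why it is the headline metric for comparing the GMM, VGG-M, and ResNet50 models above.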
Dictionary Attacks on Speaker Verification
In this paper, we propose dictionary attacks against speaker verification - a novel attack vector that aims to match a large fraction of the speaker population by chance. We introduce a generic formulation of the attack that can be used with various speech representations and threat models. The attacker uses adversarial optimization to maximize the raw similarity of speaker embeddings between a seed speech sample and a proxy population. The resulting master voice successfully matches a non-trivial fraction of people in an unknown population. Adversarial waveforms obtained with our approach can match on average 69% of females and 38% of males enrolled in the target system at a strict decision threshold calibrated to yield a false alarm rate of 1%. By using the attack with a black-box voice cloning system, we obtain master voices that are effective in the most challenging conditions and transferable between speaker encoders. We also show that, combined with multiple attempts, this attack exposes even more serious security issues in these systems.
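The optimization at the heart of such an attack can be illustrated in embedding space: start from a seed vector and ascend the gradient of its average cosine similarity to a proxy population. This toy analogue is our own simplification; the paper optimizes waveforms through a speaker encoder, not embeddings directly:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy proxy population: unit-norm vectors standing in for speaker embeddings.
pop = rng.normal(size=(50, 16))
pop /= np.linalg.norm(pop, axis=1, keepdims=True)
mean_dir = pop.mean(axis=0)

def avg_sim(v):
    """Average cosine similarity between v and the proxy population."""
    return float((pop @ (v / np.linalg.norm(v))).mean())

v = rng.normal(size=16)  # seed "voice" in embedding space
before = avg_sim(v)

# Gradient ascent on the average cosine similarity: the gradient of
# mean_i cos(v, p_i) w.r.t. v is the component of mean_dir orthogonal
# to v, scaled by 1/||v||.
for _ in range(300):
    u = v / np.linalg.norm(v)
    grad = (mean_dir - (mean_dir @ u) * u) / np.linalg.norm(v)
    v = v + 0.5 * grad

after = avg_sim(v)
assert after > before  # the optimized vector matches more of the population
```

The attack then thresholds these similarities against the system's decision threshold; constraining the optimized signal to remain valid, natural-sounding speech is what makes the waveform-level version of the problem hard.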