On the potential of jointly-optimised solutions to spoofing attack detection and automatic speaker verification
The spoofing-aware speaker verification (SASV) challenge was designed to
promote the study of jointly-optimised solutions to accomplish the
traditionally separately-optimised tasks of spoofing detection and speaker
verification. Jointly-optimised systems have the potential to operate in
synergy as a better performing solution to the single task of reliable speaker
verification. However, none of the 23 submissions to SASV 2022 are jointly
optimised. We have hence sought to determine why separately-optimised
sub-systems perform best or why joint optimisation was not successful.
Experiments reported in this paper show that joint optimisation is successful
in improving robustness to spoofing but that it degrades speaker verification
performance. The findings suggest that spoofing detection and speaker
verification sub-systems should be optimised jointly in a manner which reflects
the differences in how information provided by each sub-system is complementary
to that provided by the other. Progress will also likely depend upon the
collection of data from a larger number of speakers.
Comment: Accepted to IberSPEECH 2022 Conference
Can spoofing countermeasure and speaker verification systems be jointly optimised?
Spoofing countermeasure (CM) and automatic speaker verification (ASV)
sub-systems can be used in tandem with a backend classifier as a solution to
the spoofing aware speaker verification (SASV) task. The two sub-systems are
typically trained independently to solve different tasks. While our previous
work demonstrated the potential of joint optimisation, it also showed a
tendency to over-fit to speakers and a lack of sub-system complementarity.
Using only a modest quantity of auxiliary data collected from new speakers, we
show that joint optimisation degrades the performance of separate CM and ASV
sub-systems, but that it nonetheless improves complementarity, thereby
delivering superior SASV performance. Using standard SASV evaluation data and
protocols, joint optimisation reduces the equal error rate by 27% relative to
performance obtained using fixed, independently-optimised sub-systems under
like-for-like training conditions.
Comment: Accepted to ICASSP 2023. Code will be available soon
Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems
We present Malafide, a universal adversarial attack against automatic speaker
verification (ASV) spoofing countermeasures (CMs). By introducing convolutional
noise using an optimised linear time-invariant filter, Malafide attacks can be
used to compromise CM reliability while preserving other speech attributes such
as quality and the speaker's voice. In contrast to other adversarial attacks
proposed recently, Malafide filters are optimised independently of the input
utterance and duration, are tuned instead to the underlying spoofing attack,
and require the optimisation of only a small number of filter coefficients.
Even so, they degrade CM performance estimates by an order of magnitude, even
in black-box settings, and can also be configured to overcome integrated CM and
ASV subsystems. Integrated solutions that use self-supervised learning CMs,
however, are more robust, under both black-box and white-box settings.
Comment: Accepted at INTERSPEECH 202
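As a rough illustration of what "convolutive noise" means in the abstract above, the sketch below applies a fixed FIR (linear time-invariant) filter to a waveform with NumPy. It is not the authors' implementation: the function name `apply_lti_filter`, the peak rescaling and the requirement that the filter be shorter than the waveform are assumptions made for this example, and the taps passed in would be arbitrary rather than optimised against a CM as Malafide filters are.

```python
import numpy as np


def apply_lti_filter(waveform, filter_taps):
    """Convolve a waveform with FIR filter taps, keeping the input length.

    Hypothetical helper for illustration only. Assumes the filter is
    shorter than the waveform, so mode="same" returns len(waveform) samples.
    """
    waveform = np.asarray(waveform, dtype=float)
    filter_taps = np.asarray(filter_taps, dtype=float)

    # Linear convolution; the same taps are used whatever the utterance
    # content or duration, which is what makes the filter "universal".
    filtered = np.convolve(waveform, filter_taps, mode="same")

    # Rescale so the peak amplitude matches the input (an assumption made
    # here to keep the output level comparable to the original).
    peak = np.max(np.abs(filtered))
    if peak > 0:
        filtered = filtered * (np.max(np.abs(waveform)) / peak)
    return filtered
```

Because the taps do not depend on the input, one short coefficient vector can be reused across utterances of any length; in an attack of the kind described, only those coefficients would be optimised.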
Towards single integrated spoofing-aware speaker verification embeddings
This study aims to develop single integrated spoofing-aware speaker
verification (SASV) embeddings that satisfy two requirements. First, they should
enable the rejection of both non-target speakers' inputs and target speakers'
spoofed inputs. Second, they should deliver performance competitive with
the fusion of automatic speaker verification (ASV) and countermeasure (CM)
embeddings, which outperformed single embedding solutions by a large margin in
the SASV2022 challenge. Our analysis attributes the inferior performance of
single SASV embeddings to an insufficient amount of training data and the
distinct nature of the ASV and CM tasks. To address both issues, we propose a
novel framework that includes
multi-stage training and a combination of loss functions. Copy synthesis,
combined with several vocoders, is also exploited to address the lack of
spoofed data. Experimental results show dramatic improvements, achieving a
SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
Comment: Accepted by INTERSPEECH 2023. Code and models are available at
https://github.com/sasv-challenge/ASVSpoof5-SASVBaselin
Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion
Deep learning has brought impressive progress in the study of both automatic
speaker verification (ASV) and spoofing countermeasures (CM). Although
solutions are mutually dependent, they have typically evolved as standalone
sub-systems whereby CM solutions are usually designed for a fixed ASV system.
The work reported in this paper aims to gauge the improvements in reliability
that can be gained from their closer integration. Results derived using the
popular ASVspoof2019 dataset indicate that the equal error rate (EER) of a
state-of-the-art ASV system degrades from 1.63% to 23.83% when the evaluation
protocol is extended with spoofed trials.
However, even the straightforward integration of ASV and CM systems in the form
of score-sum and deep neural network-based fusion strategies reduces the EER to
1.71% and 6.37%, respectively. The new Spoofing-Aware Speaker Verification
(SASV) challenge has been formed to encourage greater attention to the
integration of ASV and CM systems as well as to provide a means to benchmark
different solutions.
Comment: 8 pages, accepted by Odyssey 202
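The score-sum fusion and EER figures mentioned in the abstract above can be made concrete with a small sketch. It assumes each trial has one ASV score and one CM score on comparable scales, and it estimates the EER by a simple threshold sweep; the function names `compute_eer` and `score_sum_fusion` and the toy scores in the usage note are hypothetical and are not taken from the challenge baseline code.

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate (a fraction in [0, 1]) by threshold sweep."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # False rejection rate: targets scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: non-targets scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0


def score_sum_fusion(asv_scores, cm_scores):
    """Fuse per-trial ASV and CM scores by simple addition."""
    return np.asarray(asv_scores, dtype=float) + np.asarray(cm_scores, dtype=float)
```

In a toy case with two target trials (ASV scores 0.9 and 0.8, CM scores 0.9 and 0.95), one zero-effort impostor (ASV 0.1, CM 0.9) and one spoofed trial (ASV 0.85, CM 0.05), the ASV scores alone cannot separate the spoofed trial from the targets, whereas the summed scores can, because the low CM score pulls the spoofed trial below both targets.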
Optimisation of front-end features for anti-spoofing
Voice biometric systems are being used in various applications for secure user authentication using automatic speaker verification technology. However, these systems are vulnerable to spoofing attacks, which have become even more challenging with recent advances in artificial intelligence algorithms. There is hence a need for more robust and efficient detection techniques. This thesis proposes novel detection algorithms which are designed to perform reliably in the face of the highest-quality attacks. The first contribution is a non-linear ensemble of sub-band classifiers, each of which uses a Gaussian mixture model. Competitive results show that models which learn sub-band-specific discriminative information can substantially outperform models trained on full-band signals. Given that deep neural networks (DNNs) are more powerful and can perform both feature extraction and classification, the second contribution is a RawNet2 model. It is an end-to-end (E2E) model which learns features directly from the raw waveform. The third contribution includes the first use of graph neural networks (GNNs) with an attention mechanism to model the complex relationship between spoofing cues present in the spectral and temporal domains. We propose an E2E spectro-temporal graph attention network called RawGAT-ST. The RawGAT-ST model is further extended to an integrated spectro-temporal graph attention network, named AASIST, which exploits the relationship between heterogeneous spectral and temporal graphs. Finally, this thesis proposes a novel data augmentation technique called RawBoost and uses a self-supervised, pre-trained speech model as a front-end to improve generalisation under in-the-wild conditions.