Subband modeling for spoofing detection in automatic speaker verification
Spectrograms, time-frequency representations of audio signals, have found widespread use in neural network-based spoofing detection. While deep models are typically trained on the full-band spectrum of the signal, we argue that not all frequency bands are equally useful for this task. In this paper, we systematically investigate the impact and importance of different subbands for replay spoofing detection on two benchmark datasets: ASVspoof 2017 v2.0 and ASVspoof 2019 PA. We propose a joint subband modelling framework that employs n different sub-networks to learn subband-specific features; their outputs are combined and passed to a classifier, and the weights of the whole network are updated during training. Our findings on the ASVspoof 2017 dataset suggest that the most discriminative information lies in the first and the last 1 kHz frequency bands, and the joint model trained on these two subbands shows the best performance, outperforming the baselines by a large margin. However, these findings do not generalise to the ASVspoof 2019 PA dataset. This suggests that the datasets available for training these models do not reflect real-world replay conditions, pointing to a need for careful design of datasets for training replay spoofing countermeasures.
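The subband splitting at the heart of the joint modelling framework can be sketched as follows. This is an illustrative NumPy helper, not the authors' code: each returned band would feed one of the n sub-networks before their outputs are combined.

```python
import numpy as np

def split_subbands(spec, n_bands):
    """Split a (freq_bins, frames) spectrogram into n roughly equal
    frequency subbands along the frequency axis. Illustrative helper:
    in joint subband modelling, each band feeds its own sub-network."""
    return np.array_split(spec, n_bands, axis=0)

# Example: a 257-bin spectrogram over 100 frames split into 4 subbands.
spec = np.random.rand(257, 100)
bands = split_subbands(spec, 4)
```

Note that `np.array_split` tolerates a bin count that does not divide evenly, so the lowest band here simply gets one extra bin.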
Audio Deepfake Detection: A Survey
Audio deepfake detection is an emerging and active topic. A growing body of
literature has studied deepfake detection algorithms and achieved effective
performance, yet the problem is far from solved. Although some review articles
exist, there has been no comprehensive survey that provides researchers with a
systematic overview of these developments under a unified evaluation.
Accordingly, in this survey paper, we first highlight the key differences
across various types of deepfake audio, then outline and analyse the
competitions, datasets, features, classifiers, and evaluation protocols of
state-of-the-art approaches. For each aspect, the basic techniques, advanced
developments, and major challenges are discussed. In addition, we perform a
unified comparison of representative features and classifiers on the ASVspoof
2021, ADD 2023, and In-the-Wild datasets for audio deepfake detection. The
survey shows that future research should address the lack of large-scale
in-the-wild datasets, the poor generalisation of existing detection methods to
unknown fake attacks, and the interpretability of detection results.
Voice biometric system security: Design and analysis of countermeasures for replay attacks.
PhD Thesis
Voice biometric systems use automatic speaker verification (ASV) technology for
user authentication. Even though it is among the most convenient means of
biometric authentication, the robustness and security of ASV in the face of
spoofing attacks (or presentation attacks) is of growing concern and is now
well acknowledged by the research community. A spoofing attack involves
illegitimate access to the personal data of a targeted user. Replay is among
the simplest attacks to mount, yet it is difficult to detect reliably, and it
is the focus of this thesis.
This research focuses on the analysis and design of existing and novel
countermeasures for replay attack detection in ASV, organised in two major parts.
The first part of the thesis investigates existing methods for spoofing
detection from several perspectives. I first study the generalisability of
hand-crafted features for replay detection that show promising results on
synthetic speech detection. I find, however, that it is difficult to achieve
similar levels of performance due to the acoustically different nature of the
problem under investigation. In addition, I show how class-dependent cues in a
benchmark dataset (ASVspoof 2017) can lead to the manipulation of class
predictions. I then analyse the performance of several countermeasure models
under varied replay attack conditions and find that it is difficult to account
for the effects of the various factors in a replay attack: the acoustic
environment, the playback device, the recording device, and their interactions.
Subsequently, I develop and study a convolutional neural network (CNN) model
that demonstrates performance comparable to the one that ranked first in the
ASVspoof 2017 challenge. Here, I analyse what the CNN has learned for replay
detection using a method from interpretable machine learning. The findings
suggest that the model attends strongly to the first few milliseconds of test
recordings in order to make its predictions. I then perform an in-depth
analysis of a benchmark dataset (ASVspoof 2017) for spoofing detection and
demonstrate that any machine learning countermeasure model can still exploit
the artefacts I identified in this dataset.
The second part of the thesis studies the design of countermeasures for ASV,
focusing on model robustness and on avoiding dataset biases. First, I propose
an ensemble model combining shallow and deep machine learning methods for
spoofing detection and demonstrate its effectiveness on the latest benchmark
datasets (ASVspoof 2019). Next, I propose the use of speech endpoint detection
for reliable and robust model predictions on the ASVspoof 2017 dataset. For
this, I create a publicly available collection of hand-annotated speech
endpoints for that dataset, and I also establish new benchmark results for
both frame-based and utterance-based countermeasures.
I then propose spectral subband modelling using CNNs for replay detection.
My results indicate that models that learn subband-specific information
substantially outperform models trained on complete spectrograms. Finally, I
propose the use of variational autoencoders (deep unsupervised generative
models) as an alternative backend for spoofing detection and demonstrate
encouraging results compared with the traditional Gaussian mixture model.
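The speech endpointing step mentioned above can be illustrated with a toy energy-based detector in NumPy. This is a simplifying sketch that only conveys the idea of trimming non-speech regions before scoring; the thesis itself relies on hand-annotated endpoints, and the frame, hop, and threshold values here are illustrative assumptions.

```python
import numpy as np

def energy_endpoints(x, frame=400, hop=160, thresh_db=-30.0):
    """Toy energy-based endpoint detector: returns (start, end) sample
    indices of the region whose frame energy is within thresh_db of the
    loudest frame. A simplified sketch, not the thesis's annotations."""
    n = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] for i in range(n)])
    db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = np.flatnonzero(db > db.max() + thresh_db)
    return active[0] * hop, active[-1] * hop + frame

# Example: 0.1 s silence, 0.2 s tone, 0.1 s silence at 16 kHz.
sig = np.concatenate([
    np.zeros(1600),
    0.5 * np.sin(2 * np.pi * 440 * np.arange(3200) / 16000),
    np.zeros(1600),
])
start, end = energy_endpoints(sig)  # brackets the tone region
```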
Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection
Due to the successful application of deep learning, audio spoofing detection
has made significant progress. Spoofed audio produced by speech synthesis or
voice conversion can be detected well by many countermeasures. However, an
automatic speaker verification system is still vulnerable to spoofing attacks
such as replay or Deep-Fake audio, where Deep-Fake audio refers to spoofed
utterances generated using text-to-speech (TTS) and voice conversion (VC)
algorithms. Here, we propose a novel framework based on hybrid features with a
self-attention mechanism, expecting the hybrid features to provide greater
discrimination capacity. Firstly, instead of a single conventional feature
type, deep learning features and Mel-spectrogram features are extracted by two
parallel paths: convolutional neural networks, and a short-time Fourier
transform (STFT) followed by Mel-frequency filterbanks. Secondly, the features
are combined by a max-pooling layer. Thirdly, a self-attention mechanism
focuses on the essential elements. Finally, a ResNet and a linear layer
produce the results. Experimental results reveal that the hybrid features,
compared with conventional features, capture more details of an utterance. We
achieve a best Equal Error Rate (EER) of 9.67% in the physical access (PA)
scenario and 8.94% in the Deep-Fake task on the ASVspoof 2021 dataset.
Compared with the best baseline system, the proposed approach improves by
74.60% and 60.05%, respectively.
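The fusion-plus-attention stage of the pipeline above can be sketched with a minimal scaled dot-product self-attention in NumPy. The feature dimensions and the identity query/key/value projections are simplifying assumptions for brevity, not the paper's configuration.

```python
import numpy as np

def self_attention(x):
    """Minimal scaled dot-product self-attention over time frames.
    x: (frames, dim). Identity projections stand in for learned
    query/key/value weights (a simplification for illustration)."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                # frame-pair similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over keys
    return attn @ x                              # attended features

cnn_feats = np.random.rand(50, 32)  # hypothetical CNN-path output
mel_feats = np.random.rand(50, 32)  # hypothetical Mel-spectrogram path
fused = np.concatenate([cnn_feats, mel_feats], axis=1)  # (50, 64)
out = self_attention(fused)
```

Because each attention row is a convex combination of the fused frames, the attended output stays within the range of the input features.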
Audio compression-assisted feature extraction for voice replay attack detection
Replay is one of the most effective and simplest voice spoofing attacks.
Detecting replay attacks is challenging, according to the Automatic Speaker
Verification Spoofing and Countermeasures Challenge 2021 (ASVspoof 2021),
because they involve a loudspeaker, a microphone, and acoustic conditions
(e.g., background noise). One obstacle to detecting replay attacks is finding
robust feature representations that reflect the channel-noise information
added to the replayed speech. This study proposes a feature extraction
approach that uses audio compression for assistance. Audio compression
preserves content and speaker information for transmission, so the information
lost after decompression is expected to contain content- and
speaker-independent information (e.g., the channel noise added during the
replay process). We conducted a comprehensive experiment with several data
augmentation techniques and three classifiers on the ASVspoof 2021 physical
access (PA) set and confirmed the effectiveness of the proposed feature
extraction approach. To the best of our knowledge, the proposed approach
achieves the lowest EER of 22.71% on the ASVspoof 2021 PA evaluation set.
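The idea of treating what a codec throws away as a feature can be sketched as follows. A uniform quantiser stands in for a real audio codec here, which is a deliberate simplification of the compress/decompress round-trip; the residual would then be fed to a classifier.

```python
import numpy as np

def compression_residual(audio, n_bits=8):
    """Sketch of compression-assisted feature extraction: the signal
    lost in a lossy round-trip is kept as a channel-noise cue. A uniform
    quantiser stands in for a real codec (assumption for brevity)."""
    scale = 2 ** (n_bits - 1)
    decompressed = np.round(audio * scale) / scale  # lossy round-trip
    return audio - decompressed                     # "missed" information

audio = np.random.uniform(-1, 1, 16000)  # 1 s of toy 16 kHz audio
residual = compression_residual(audio)   # bounded by the quantiser step
```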