44 research outputs found
GANBA: Generative Adversarial Network for Biometric Anti-Spoofing
Acknowledgments: Alejandro Gomez-Alanis holds a FPU fellowship (FPU16/05490) from the
Spanish Ministry of Education and Vocational Training. Jose A. Gonzalez-Lopez also holds a Juan
de la Cierva-Incorporación fellowship (IJCI-2017-32926) from the Spanish Ministry of Science and
Innovation. Furthermore, we acknowledge the support of Nvidia with the donation of a Titan X GPU.
Data Availability Statement: The ASVspoof 2019 datasets were used in this study. They are publicly
available at https://datashare.ed.ac.uk/handle/10283/3336 (accessed on 5 December 2021).
Automatic speaker verification (ASV) is a voice biometric technology whose security
might be compromised by spoofing attacks. To increase the robustness against spoofing attacks,
presentation attack detection (PAD) or anti-spoofing systems for detecting replay, text-to-speech and
voice conversion-based spoofing attacks are being developed. However, it was recently shown that
adversarial spoofing attacks may seriously fool anti-spoofing systems. Moreover, the robustness of the
whole biometric system (ASV + PAD) against this new type of attack is completely unexplored. In
this work, a new generative adversarial network for biometric anti-spoofing (GANBA) is proposed.
GANBA has a twofold basis: (1) it jointly employs the anti-spoofing and ASV losses to yield very
damaging adversarial spoofing attacks, and (2) it trains the PAD system as a discriminator in order to make
it more robust against these types of adversarial attacks. The proposed system is able to generate
adversarial spoofing attacks which can fool the complete voice biometric system. Then, the resulting
PAD discriminators of the proposed GANBA can be used as a defense technique for detecting both
original and adversarial spoofing attacks. The physical access (PA) and logical access (LA) scenarios of
the ASVspoof 2019 database were employed to carry out the experiments. The experimental results
show that the GANBA attacks are quite effective, outperforming other adversarial techniques when
applied in white-box and black-box attack setups. In addition, the resulting PAD discriminators are
more robust against both original and adversarial spoofing attacks.
FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades, Project PY20_00902; PID2019-104206GB-I00 funded by MCIN/ AEI /10.13039/50110001103
A Statistical Perspective of the Empirical Mode Decomposition
This research focuses on non-stationary basis decomposition methods in time-frequency analysis. Classical methodologies in this field, such as Fourier analysis and wavelet transforms, rely on strong assumptions about the underlying moment generating process, which may not be valid in real data scenarios or modern applications of machine learning. The literature on non-stationary methods is still in its infancy, and the research contained in this thesis aims to address challenges arising in this area. Among several alternatives, this work is based on the method known as the Empirical Mode Decomposition (EMD). The EMD is a non-parametric time-series decomposition technique that produces a set of time-series functions denoted as Intrinsic Mode Functions (IMFs), which carry specific statistical properties. The main focus is providing a general and flexible family of basis extraction methods with minimal requirements compared to those within the Fourier or wavelet techniques. This is highly important for two main reasons: first, more universal applications can be taken into account; second, the EMD requires very little a priori knowledge of the process, and as such, it can have greater generalisation properties in statistical applications across a wide array of data types. The contributions of this work deal with several aspects of the decomposition. The first set regards the construction of an IMF from several perspectives: (1) achieving a semi-parametric representation of each basis; (2) extracting such semi-parametric functional forms in a computationally efficient and statistically robust framework. The EMD belongs to the class of path-based decompositions and, therefore, is often not treated as a stochastic representation. (3) A major contribution involves the embedding of the deterministic pathwise decomposition framework into a formal stochastic process setting.
One of the assumptions inherent in the EMD construction is that the decomposition is applied to a continuous function. In general, this may not be the case in many applications. (4) Various multi-kernel Gaussian Process formulations of the EMD will be proposed through the introduced stochastic embedding. In particular, two different models will be proposed: one modelling the temporal mode of oscillations of the EMD and the other capturing the location of instantaneous frequencies in specific frequency regions or bandwidths. (5) The construction of the second stochastic embedding will be achieved with an optimisation method called the cross-entropy method. Two formulations will be provided and explored in this regard. Applications to speech time series are explored to study these methodological extensions, given that such signals are non-stationary.
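To make the idea of extracting an IMF candidate more concrete, the following is a minimal sketch of a single sifting pass, the basic operation behind the EMD. It is not the thesis's method: the function name `sift_once` is hypothetical, and the envelopes are built with piecewise-linear interpolation (`numpy.interp`) as a simplifying assumption, whereas EMD implementations typically use cubic splines and repeat the pass until a stopping criterion is met.

```python
import numpy as np

def sift_once(signal):
    """One simplified sifting pass: subtract the mean of the upper and lower envelopes.

    Assumption: envelopes are piecewise-linear through the local extrema
    (standard EMD typically uses cubic splines instead).
    Returns (imf_candidate, local_mean).
    """
    x = np.asarray(signal, dtype=float)
    n = np.arange(len(x))
    interior = n[1:-1]
    # Strict local maxima and minima among the interior samples.
    maxima = interior[(x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])]
    minima = interior[(x[1:-1] < x[:-2]) & (x[1:-1] < x[2:])]
    if len(maxima) < 2 or len(minima) < 2:
        # Too few extrema to build envelopes: treat the input as a residual.
        return x.copy(), np.zeros_like(x)
    upper = np.interp(n, maxima, x[maxima])  # upper envelope
    lower = np.interp(n, minima, x[minima])  # lower envelope
    local_mean = (upper + lower) / 2.0
    return x - local_mean, local_mean
```

For a fast oscillation riding on a slow trend, one pass moves most of the trend into `local_mean` and leaves an oscillation roughly centred on zero as the candidate.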
Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features
Recently, pioneering research works have proposed a large number of acoustic
features (log power spectrogram, linear frequency cepstral coefficients,
constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining
good performance, and showing that different subbands have different
contributions to audio deepfake detection. However, such work lacks an explanation
of the specific information carried by each subband, and these features also lose
information such as phase. In speech synthesis, fundamental frequency (F0)
information is used to improve the quality of synthetic speech, yet the F0 of
synthetic speech is still too average, which
differs significantly from that of real speech. It is therefore expected that F0 can be
used as important information to discriminate between bonafide and fake speech,
although this information cannot be used directly due to the irregular
distribution of F0. Instead, the frequency band containing most of F0 is
selected as the input feature. Meanwhile, to make full use of the phase and
full-band information, we also propose to use real and imaginary spectrogram
features as complementary input features and model the disjoint subbands
separately. Finally, the results of F0, real and imaginary spectrogram features
are fused. Experimental results on the ASVspoof 2019 LA dataset show that our
proposed system is very effective for the audio deepfake detection task,
achieving an equal error rate (EER) of 0.43%, which surpasses almost all
systems.
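As a reminder of what the reported figure means, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a generic threshold sweep, not the official ASVspoof scoring tool. The function name `equal_error_rate` and the convention that higher scores indicate bona fide speech are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) from a simple threshold sweep.

    Assumption: a higher score means the trial is more likely bona fide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.unique(np.concatenate([bonafide, spoof]))  # sorted, unique
    # False rejection rate: bona fide trials scoring below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scoring at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]
```

With finite score sets the two rates rarely cross exactly, so the sketch averages them at the closest point. Evaluation toolkits may interpolate differently.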
Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends
Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Even though they are gaining popularity, automatic speaker verification systems are vulnerable to malicious attacks, known as spoofing attacks. Among the various types of spoofing attacks, the replay attack poses the biggest threat due to its simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection.
This thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is also discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal. The temporal envelope carries amplitude modulation (AM) information of speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal. These features are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of AM and FM-based features has shown that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is studied as well to further advance the understanding of front-ends for replay attack detection.
Mechanisms of the human auditory system that make the human ear an excellent spectrum analyser have been investigated and integrated into front-ends. Spatial differentiation, a mechanism that provides additional sharpening to auditory filters, is one of them and is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one where the selectivity of the filters is varied based on the input signal energy. Experimental results show that this leads to improved spoofing detection performance.
Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller that takes sub-band FM components as input is developed to adaptively control the selectivity and sensitivity of a parallel filter bank to enhance the artifacts that differentiate a replayed signal from a genuine signal. This work illustrates gradient-based optimization of a DNN-based controller using the feedback from a spoofing detection back-end classifier, thus training it to reduce spoofing detection error. The proposed framework has displayed a superior ability in identifying high-quality replayed signals compared to conventional non-adaptive frameworks.
All techniques proposed in this thesis have been evaluated on well-established databases on replay attack detection and compared with state-of-the-art baseline systems.
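The abstract mentions sub-band features built on the spectral centroid. As a rough illustration only, a power-weighted centroid per sub-band for one frame might be computed as below. This is a generic sketch, not the thesis's FM-extraction method: the function name `subband_spectral_centroids`, the Hann window, and the rectangular band edges are all assumptions.

```python
import numpy as np

def subband_spectral_centroids(frame, sample_rate, band_edges_hz):
    """Power-weighted mean frequency (Hz) of one frame in each sub-band.

    Assumptions: Hann-windowed FFT and rectangular bands [lo, hi) given by
    consecutive entries of band_edges_hz.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroids = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        weight = power[in_band].sum()
        if weight > 0:
            centroids.append((freqs[in_band] * power[in_band]).sum() / weight)
        else:
            # No energy in the band: fall back to the band midpoint.
            centroids.append((lo + hi) / 2.0)
    return np.array(centroids)
```

Tracking such a centroid frame by frame gives a coarse estimate of how the dominant frequency in each band moves over time.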
Voice biometric system security: Design and analysis of countermeasures for replay attacks.
PhD Thesis
Voice biometric systems use automatic speaker verification (ASV) technology for
user authentication. Even if it is among the most convenient means of biometric
authentication, the robustness and security of ASV in the face of spoofing attacks
(or presentation attacks) is of growing concern and is now well acknowledged
by the research community. A spoofing attack involves illegitimate access to
personal data of a targeted user. Replay is among the simplest attacks to
mount, yet difficult to detect reliably, and is the focus of this thesis.
This research focuses on the analysis and design of existing and novel countermeasures
for replay attack detection in ASV, organised in two major parts.
The first part of the thesis investigates existing methods for spoofing detection
from several perspectives. I first study the generalisability of hand-crafted features
for replay detection that show promising results on synthetic speech detection.
I find, however, that it is difficult to achieve similar levels of performance
due to the acoustically different problem under investigation. In addition, I show
how class-dependent cues in a benchmark dataset (ASVspoof 2017) can lead to
the manipulation of class predictions. I then analyse the performance of several
countermeasure models under varied replay attack conditions. I find that it is
difficult to account for the effects of various factors in a replay attack: acoustic
environment, playback device and recording device, and their interactions.
Subsequently, I developed and studied a convolutional neural network (CNN)
model that demonstrates comparable performance to the one that ranked first
in the ASVspoof 2017 challenge. Here, the experiment analyses what the CNN
has learned for replay detection using a method from interpretable machine
learning. The findings suggest that the model attends strongly to the first few
milliseconds of test recordings in order to make predictions. Then, I perform
an in-depth analysis of a benchmark dataset (ASVspoof 2017) for spoofing detection
and demonstrate that any machine learning countermeasure model can
still exploit the artefacts I identified in this dataset.
The second part of the thesis studies the design of countermeasures for ASV,
focusing on model robustness and avoiding dataset biases. First, I proposed
an ensemble model combining shallow and deep machine learning methods for
spoofing detection, and then demonstrate its effectiveness on the latest benchmark
datasets (ASVspoof 2019). Next, I proposed the use of speech endpoint detection
for reliable and robust model predictions on the ASVspoof 2017 dataset.
For this, I created a publicly available collection of hand-annotations of speech
endpoints for the same dataset, and new benchmark results for both frame-based
and utterance-based countermeasures are also developed.
I then proposed spectral subband modelling using CNNs for replay detection.
My results indicate that models that learn subband-specific information
substantially outperform models trained on complete spectrograms. Finally, I
proposed to use variational autoencoders (deep unsupervised generative models)
as an alternative backend for spoofing detection and demonstrate encouraging
results when compared with the traditional Gaussian mixture model.
Voice conversion versus speaker verification: an overview
A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Recently, major progress has been made in speaker verification, leading to mass-market adoption, such as in smartphones and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies have provided good insight into speaker characterization, which has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks and therefore presents a threat to speaker verification systems. In this paper, we briefly introduce the fundamentals of voice conversion and speaker verification technologies. We then give an overview of recent spoofing attack studies under different conditions, with a focus on voice conversion spoofing attacks. We also discuss anti-spoofing measures for speaker verification.
ATMS: Algorithmic Trading-Guided Market Simulation
The effective construction of an Algorithmic Trading (AT) strategy often
relies on market simulators, which remains challenging due to existing methods'
inability to adapt to the sequential and dynamic nature of trading activities.
This work fills this gap by proposing a metric to quantify market discrepancy.
This metric measures the difference in causal effect arising from the underlying
market's unique characteristics, and it is evaluated through the interaction
between the AT agent and the market. Most importantly, we introduce Algorithmic
Trading-guided Market Simulation (ATMS) by optimizing our proposed metric.
Inspired by SeqGAN, ATMS formulates the simulator as a stochastic policy in
reinforcement learning (RL) to account for the sequential nature of trading.
Moreover, ATMS utilizes the policy gradient update to bypass differentiating
the proposed metric, which involves non-differentiable operations such as order
deletion from the market. Through extensive experiments on semi-real market
data, we demonstrate the effectiveness of our metric and show that ATMS
generates market data with improved similarity to reality compared to the
state-of-the-art conditional Wasserstein Generative Adversarial Network (cWGAN)
approach. Furthermore, ATMS produces market data with more balanced BUY and
SELL volumes, mitigating the bias of the cWGAN baseline approach, where a
simple strategy can exploit the BUY/SELL imbalance for profit.
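The reason a policy gradient update can bypass a non-differentiable metric is that the gradient is taken through the log-probability of the sampled actions, not through the metric itself. The sketch below illustrates this with a generic REINFORCE-style update on a softmax policy over a few discrete actions. It is not the ATMS implementation: `reinforce_step`, `train`, the toy reward, and all hyperparameters are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(logits, reward_fn, rng, n_samples=256, lr=0.5):
    """One score-function (REINFORCE) update; reward_fn is a black box."""
    probs = softmax(logits)
    actions = rng.choice(len(logits), size=n_samples, p=probs)
    rewards = np.array([reward_fn(a) for a in actions], dtype=float)
    advantages = rewards - rewards.mean()  # mean baseline reduces variance
    # Gradient of log softmax w.r.t. the logits for action a: one_hot(a) - probs.
    one_hot = np.eye(len(logits))[actions]
    grad = ((one_hot - probs) * advantages[:, None]).mean(axis=0)
    return logits + lr * grad  # gradient ascent on expected reward

def train(reward_fn, n_actions=3, n_steps=200, seed=0):
    """Run repeated updates from a uniform policy and return the final probabilities."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(n_actions)
    for _ in range(n_steps):
        logits = reinforce_step(logits, reward_fn, rng)
    return softmax(logits)
```

For example, with a step-like reward such as `lambda a: 1.0 if a == 2 else 0.0`, which has no useful derivative, the policy still shifts its probability mass toward action 2, because only sampled reward values are needed.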
INCREASING ROBUSTNESS OF I-VECTORS VIA MASKING: A CASE STUDY IN SYNTHETIC SPEECH DETECTION
Ensuring security in speaker recognition systems is crucial. In recent years, it has been demonstrated that spoofing attacks can fool these systems. To deal with this issue, spoofed speech detection systems have been developed. While these systems have performed well, their effectiveness tends to degrade under noise. Traditional speech enhancement methods are not effective at improving performance and can even make it worse. In this paper, the performance of a noise mask obtained via a convolutional neural network structure for reducing the effects of noise was investigated. The mask is used to suppress noisy regions of spectrograms in order to extract robust i-vectors. The proposed system was tested on the ASVspoof 2015 database with three different noise types and achieved superior performance compared to traditional systems. However, performance drops for noise types that were not encountered during the training phase.