74 research outputs found

    Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends

    Full text link
    Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Even though they are gaining popularity, automatic speaker verification systems are vulnerable to malicious attacks, known as spoofing attacks. Among various types of spoofing attacks, replay attack poses the biggest threat due to its simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection. This thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is also discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal. The temporal envelope carries amplitude modulation (AM) information of speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal. These features are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of AM and FM-based features has shown that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is studied as well to further advance the understanding of front-ends for replay attack detection. Mechanisms inspired by the human auditory system that makes the human ear an excellent spectrum analyser have been investigated and integrated into front-ends. Spatial differentiation, a mechanism that provides additional sharpening to auditory filters is one of them that is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one where the selectivity of the filters is varied based on the input signal energy. Experimental results show that this leads to improved spoofing detection performance. Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller that takes sub-band FM components as input, is developed to adaptively control the selectivity and sensitivity of a parallel filter bank to enhance the artifacts that differentiate a replayed signal from a genuine signal. This work illustrates gradient-based optimization of a DNN-based controller using the feedback from a spoofing detection back-end classifier, thus training it to reduce spoofing detection error. The proposed framework has displayed a superior ability in identifying high-quality replayed signals compared to conventional non-adaptive frameworks. All techniques proposed in this thesis have been evaluated on well-established databases on replay attack detection and compared with state-of-the-art baseline systems

    Deep Generative Variational Autoencoding for Replay Spoof Detection in Automatic Speaker Verification

    Get PDF
    Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount - yet difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages from powerful unsupervised manifold learning in classification. The potential benefits include the possibility to sample new data, and to obtain insights to the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train independently two VAEs - one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector to the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training two separate VAEs for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals --- the absolute difference of the original input and the reconstruction as features for spoofing detection. The proposed frontend approach augmented with a convolutional neural network classifier demonstrated substantial improvement over the VAE backend use case

    Voice biometric system security: Design and analysis of countermeasures for replay attacks.

    Get PDF
    PhD ThesisVoice biometric systems use automatic speaker veri cation (ASV) technology for user authentication. Even if it is among the most convenient means of biometric authentication, the robustness and security of ASV in the face of spoo ng attacks (or presentation attacks) is of growing concern and is now well acknowledged by the research community. A spoo ng attack involves illegitimate access to personal data of a targeted user. Replay is among the simplest attacks to mount | yet di cult to detect reliably and is the focus of this thesis. This research focuses on the analysis and design of existing and novel countermeasures for replay attack detection in ASV, organised in two major parts. The rst part of the thesis investigates existing methods for spoo ng detection from several perspectives. I rst study the generalisability of hand-crafted features for replay detection that show promising results on synthetic speech detection. I nd, however, that it is di cult to achieve similar levels of performance due to the acoustically di erent problem under investigation. In addition, I show how class-dependent cues in a benchmark dataset (ASVspoof 2017) can lead to the manipulation of class predictions. I then analyse the performance of several countermeasure models under varied replay attack conditions. I nd that it is di cult to account for the e ects of various factors in a replay attack: acoustic environment, playback device and recording device, and their interactions. Subsequently, I developed and studied a convolutional neural network (CNN) model that demonstrates comparable performance to the one that ranked rst in the ASVspoof 2017 challenge. Here, the experiment analyses what the CNN has learned for replay detection using a method from interpretable machine learning. The ndings suggest that the model highly attends at the rst few milliseconds of test recordings in order to make predictions. Then, I perform an in-depth analysis of a benchmark dataset (ASVspoof 2017) for spoo ng detection and demonstrate that any machine learning countermeasure model can still exploit the artefacts I identi ed in this dataset. The second part of the thesis studies the design of countermeasures for ASV, focusing on model robustness and avoiding dataset biases. First, I proposed an ensemble model combining shallow and deep machine learning methods for spoo ng detection, and then demonstrate its e ectiveness on the latest benchmark datasets (ASVspoof 2019). Next, I proposed the use of speech endpoint detection for reliable and robust model predictions on the ASVspoof 2017 dataset. For this, I created a publicly available collection of hand-annotations of speech endpoints for the same dataset, and new benchmark results for both frame-based and utterance-based countermeasures are also developed. I then proposed spectral subband modelling using CNNs for replay detection. My results indicate that models that learn subband-speci c information substantially outperform models trained on complete spectrograms. Finally, I proposed to use variational autoencoders | deep unsupervised generative models | as an alternative backend for spoo ng detection and demonstrate encouraging results when compared with the traditional Gaussian mixture mode

    Impact of score fusion on voice biometrics and presentation attack detection in cross-database evaluations

    Get PDF
    Research in the area of automatic speaker verification (ASV) has been advanced enough for the industry to start using ASV systems in practical applications. However, these systems are highly vulnerable to spoofing or presentation attacks, limiting their wide deployment. Therefore, it is important to develop mechanisms that can detect such attacks, and it is equally important for these mechanisms to be seamlessly integrated into existing ASV systems for practical and attack-resistant solutions. To be practical, however, an attack detection should (i) have high accuracy, (ii) be well-generalized for different attacks, and (iii) be simple and efficient. Several audio-based presentation attack detection (PAD) methods have been proposed recently but their evaluation was usually done on a single, often obscure, database with limited number of attacks. Therefore, in this paper, we conduct an extensive study of eight state-of-the-art PAD methods and evaluate their ability to detect known and unknown attacks (e.g., in a cross-database scenario) using two major publicly available speaker databases with spoofing attacks: AVspoof and ASVspoof. We investigate whether combining several PAD systems via score fusion can improve attack detection accuracy. We also study the impact of fusing PAD systems (via parallel and cascading schemes) with two i-vector and inter-session variability based ASV systems on the overall performance in both bona fide (no attacks) and spoof scenarios. The evaluation results question the efficiency and practicality of the existing PAD systems, especially when comparing results for individual databases and cross-database data. Fusing several PAD systems can lead to a slightly improved performance; however, how to select which systems to fuse remains an open question. Joint ASV-PAD systems show a significantly increased resistance to the attacks at the expense of slightly degraded performance for bona fide scenarios

    Joint Operation of Voice Biometrics and Presentation Attack Detection

    Get PDF
    Research in the area of automatic speaker verification (ASV) has advanced enough for the industry to start using ASV systems in practical applications. However, as it was also shown for fingerprints, face, and other verification systems, ASV systems are highly vulnerable to spoofing or presentation attacks, limiting their wide practical deployment. Therefore, to protect against such attacks, effective anti-spoofing detection techniques, more formally known as presentation attack detection (PAD) systems, need to be developed. These techniques should be then seamlessly integrated into existing ASV systems for practical all-in-one solutions. In this paper, we focus on the integration of PAD and ASV systems. We consider the state of the art i-vector and ISV-based ASV systems and demonstrate the effect of score-based integration with a PAD system on the verification and attack detection accuracies. In our experiments, we rely on AVspoof database that contains realistic presentation attacks, which are considered by the industry to be the threat to practical ASV systems. Experimental results show a significantly increased resistance of the joint ASV-PAD system to the attacks at the expense of slightly degraded performance for scenarios without spoofing attacks. Also, an important contribution of the paper is an open source and an online-based implementations of the separate and joint ASV-PAD systems

    When the Differences in Frequency Domain are Compensated: Understanding and Defeating Modulated Replay Attacks on Automatic Speech Recognition

    Full text link
    Automatic speech recognition (ASR) systems have been widely deployed in modern smart devices to provide convenient and diverse voice-controlled services. Since ASR systems are vulnerable to audio replay attacks that can spoof and mislead ASR systems, a number of defense systems have been proposed to identify replayed audio signals based on the speakers' unique acoustic features in the frequency domain. In this paper, we uncover a new type of replay attack called modulated replay attack, which can bypass the existing frequency domain based defense systems. The basic idea is to compensate for the frequency distortion of a given electronic speaker using an inverse filter that is customized to the speaker's transform characteristics. Our experiments on real smart devices confirm the modulated replay attacks can successfully escape the existing detection mechanisms that rely on identifying suspicious features in the frequency domain. To defeat modulated replay attacks, we design and implement a countermeasure named DualGuard. We discover and formally prove that no matter how the replay audio signals could be modulated, the replay attacks will either leave ringing artifacts in the time domain or cause spectrum distortion in the frequency domain. Therefore, by jointly checking suspicious features in both frequency and time domains, DualGuard can successfully detect various replay attacks including the modulated replay attacks. We implement a prototype of DualGuard on a popular voice interactive platform, ReSpeaker Core v2. The experimental results show DualGuard can achieve 98% accuracy on detecting modulated replay attacks.Comment: 17 pages, 24 figures, In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS' 20

    A Statistical Perspective of the Empirical Mode Decomposition

    Get PDF
    This research focuses on non-stationary basis decompositions methods in time-frequency analysis. Classical methodologies in this field such as Fourier Analysis and Wavelet Transforms rely on strong assumptions of the underlying moment generating process, which, may not be valid in real data scenarios or modern applications of machine learning. The literature on non-stationary methods is still in its infancy, and the research contained in this thesis aims to address challenges arising in this area. Among several alternatives, this work is based on the method known as the Empirical Mode Decomposition (EMD). The EMD is a non-parametric time-series decomposition technique that produces a set of time-series functions denoted as Intrinsic Mode Functions (IMFs), which carry specific statistical properties. The main focus is providing a general and flexible family of basis extraction methods with minimal requirements compared to those within the Fourier or Wavelet techniques. This is highly important for two main reasons: first, more universal applications can be taken into account; secondly, the EMD has very little a priori knowledge of the process required to apply it, and as such, it can have greater generalisation properties in statistical applications across a wide array of applications and data types. The contributions of this work deal with several aspects of the decomposition. The first set regards the construction of an IMF from several perspectives: (1) achieving a semi-parametric representation of each basis; (2) extracting such semi-parametric functional forms in a computationally efficient and statistically robust framework. The EMD belongs to the class of path-based decompositions and, therefore, they are often not treated as a stochastic representation. (3) A major contribution involves the embedding of the deterministic pathwise decomposition framework into a formal stochastic process setting. One of the assumptions proper of the EMD construction is the requirement for a continuous function to apply the decomposition. In general, this may not be the case within many applications. (4) Various multi-kernel Gaussian Process formulations of the EMD will be proposed through the introduced stochastic embedding. Particularly, two different models will be proposed: one modelling the temporal mode of oscillations of the EMD and the other one capturing instantaneous frequencies location in specific frequency regions or bandwidths. (5) The construction of the second stochastic embedding will be achieved with an optimisation method called the cross-entropy method. Two formulations will be provided and explored in this regard. Application on speech time-series are explored to study such methodological extensions given that they are non-stationary
    • …
    corecore