An improved normalized gain-based score normalization technique for spoof detection algorithm
A spoof detection algorithm supports a speaker verification system by examining false claims by an imposter through careful analysis of the input test speech. Scores are used to separate genuine from spoofed samples. Under mismatched conditions, the false acceptance rate increases; appropriate score normalization techniques can reduce it. In this article, we use the normalized Discounted Cumulative Gain (nDCG) norm, derived from ranking the speaker’s log-likelihood scores. The proposed scoring technique smooths the decay caused by the logarithm, with an added advantage from the ranking. The baseline spoof detection system uses Constant Q Cepstral Coefficients (CQCC) as the base features with a Gaussian Mixture Model (GMM) based classifier. Scores are computed on the ASVspoof 2019 dataset with and without normalization. The baseline techniques Zero normalization (Z-norm) and Test normalization (T-norm) are also considered. On the development data, the proposed technique improves the Equal Error Rate (EER) for synthetic attacks to 0.35, against 0.43 for the baseline system (no normalization). Similarly, for replay attacks the EER is 7.83 for nDCG-norm and 9.87 with no normalization (no-norm). Furthermore, the tandem Detection Cost Function (t-DCF) scores for synthetic attacks are 0.015 for no-norm and 0.010 for the proposed normalization; for replay attacks, the t-DCF scores are 0.195 for no-norm and 0.17 for the proposed normalization. Performance is also satisfactory on the evaluation data, with an EER of 8.96 for nDCG-norm against 9.57 with no-norm for synthetic attacks, and 9.79 for nDCG-norm against 11.04 with no-norm for replay attacks. Supporting the EER, the t-DCF for synthetic attacks is 0.1989 for nDCG-norm and 0.2636 for no-norm, while for replay attacks it is 0.2284 for nDCG-norm and 0.2454 for no-norm. The proposed scoring technique is found to increase spoof detection accuracy and the overall accuracy of the speaker verification system.
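To make the evaluation terms in this abstract concrete, here is a minimal Python sketch of cohort-based score normalization in the style of Z-norm and T-norm, together with an EER estimate. It is an illustration under stated assumptions, not the authors' implementation, and it does not reproduce the proposed nDCG-norm, whose details the abstract does not give. The function names `cohort_normalize` and `equal_error_rate` are hypothetical, and the EER is returned as a fraction rather than a percentage.

```python
from statistics import mean, pstdev


def cohort_normalize(score, cohort_scores):
    """Shift and scale a raw score by cohort statistics.

    Z-norm and T-norm share this form; they differ in where the cohort
    scores come from (assumed here to be supplied by the caller).
    """
    mu = mean(cohort_scores)
    sigma = pstdev(cohort_scores)
    if sigma == 0:
        raise ValueError("cohort scores have zero variance")
    return (score - mu) / sigma


def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= threshold.
    FAR: fraction of spoofed trials accepted.
    FRR: fraction of genuine trials rejected.
    Returns the mean of FAR and FRR where they are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(spoof_scores)):
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With toy scores such as `equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])`, one spoofed trial out of four overlaps the genuine range, so the sweep settles where both error rates are one in four.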
Non-linear frequency warping using constant-Q transformation for speech emotion recognition
In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of the speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on STFT and CQT for SER with a deep neural network (DNN) as the back-end classifier. We optimize different parameters for both features. The CQT-based features outperform the STFT-based spectral features in SER experiments. Further experiments with cross-corpora evaluation demonstrate that the CQT-based systems provide better generalization with out-of-domain training data.
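The claim about finer frequency resolution at low frequencies follows from the standard CQT definition, in which bin centre frequencies are geometrically spaced and every bin has the same ratio Q of centre frequency to bandwidth. The Python sketch below illustrates this; the function names are hypothetical, and the values of `f_min` and `bins_per_octave` in the usage note are illustrative assumptions, not parameters from the paper.

```python
def cqt_quality_factor(bins_per_octave):
    """Constant ratio Q = f_k / bandwidth_k shared by all CQT bins."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced centre frequencies: f_k = f_min * 2^(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def cqt_bandwidths(f_min, bins_per_octave, n_bins):
    """Per-bin bandwidth f_k / Q; it grows in proportion to frequency."""
    q = cqt_quality_factor(bins_per_octave)
    freqs = cqt_center_frequencies(f_min, bins_per_octave, n_bins)
    return [f / q for f in freqs]
```

For example, with an assumed `f_min` of 32.7 Hz and 12 bins per octave, the bandwidth doubles every 12 bins, so the lowest bins are the narrowest. An STFT, by contrast, uses the same bandwidth for every bin.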
Voice biometric system security: Design and analysis of countermeasures for replay attacks.
PhD Thesis. Voice biometric systems use automatic speaker verification (ASV) technology for user authentication. Although ASV is among the most convenient means of biometric authentication, its robustness and security in the face of spoofing attacks (or presentation attacks) is of growing concern and is now well acknowledged by the research community. A spoofing attack involves illegitimate access to personal data of a targeted user. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably, and it is the focus of this thesis.
This research focuses on the analysis and design of existing and novel countermeasures for replay attack detection in ASV, organised in two major parts.
The first part of the thesis investigates existing methods for spoofing detection from several perspectives. I first study the generalisability of hand-crafted features for replay detection that show promising results on synthetic speech detection. I find, however, that it is difficult to achieve similar levels of performance because the problem under investigation is acoustically different. In addition, I show how class-dependent cues in a benchmark dataset (ASVspoof 2017) can lead to the manipulation of class predictions. I then analyse the performance of several countermeasure models under varied replay attack conditions. I find that it is difficult to account for the effects of various factors in a replay attack: acoustic environment, playback device and recording device, and their interactions. Subsequently, I developed and studied a convolutional neural network (CNN) model that demonstrates comparable performance to the one that ranked first in the ASVspoof 2017 challenge. Here, the experiment analyses what the CNN has learned for replay detection using a method from interpretable machine learning. The findings suggest that the model attends strongly to the first few milliseconds of test recordings in order to make predictions. Then, I perform an in-depth analysis of a benchmark dataset (ASVspoof 2017) for spoofing detection and demonstrate that any machine learning countermeasure model can still exploit the artefacts I identified in this dataset.
The second part of the thesis studies the design of countermeasures for ASV, focusing on model robustness and avoiding dataset biases. First, I proposed an ensemble model combining shallow and deep machine learning methods for spoofing detection, and then demonstrate its effectiveness on the latest benchmark datasets (ASVspoof 2019). Next, I proposed the use of speech endpoint detection for reliable and robust model predictions on the ASVspoof 2017 dataset. For this, I created a publicly available collection of hand-annotations of speech endpoints for the same dataset, and new benchmark results for both frame-based and utterance-based countermeasures are also developed.
I then proposed spectral subband modelling using CNNs for replay detection. My results indicate that models that learn subband-specific information substantially outperform models trained on complete spectrograms. Finally, I proposed to use variational autoencoders (deep unsupervised generative models) as an alternative backend for spoofing detection and demonstrate encouraging results when compared with the traditional Gaussian mixture model
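As a rough illustration of what spectral subband modelling takes as input, the Python sketch below splits a spectrogram into contiguous frequency subbands, each of which could then be passed to its own model. It assumes equal-width, non-overlapping subbands and a spectrogram stored as one row per frequency bin; the thesis may define its subbands differently, and the function name `split_subbands` is hypothetical.

```python
def split_subbands(spectrogram, n_subbands):
    """Split a spectrogram into contiguous frequency subbands.

    spectrogram: list of rows, one per frequency bin (lowest first),
                 each row a list of magnitudes over time frames.
    Returns a list of n_subbands sub-spectrograms whose row counts
    differ by at most one (assumption: equal-width, non-overlapping).
    """
    n_bins = len(spectrogram)
    if not 1 <= n_subbands <= n_bins:
        raise ValueError("n_subbands must be between 1 and the number of bins")
    subbands = []
    for i in range(n_subbands):
        start = i * n_bins // n_subbands
        end = (i + 1) * n_bins // n_subbands
        subbands.append(spectrogram[start:end])
    return subbands
```

Each returned subband keeps the full time axis, so a model trained on it sees only that frequency range across the whole utterance.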