A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification
For practical automatic speaker verification (ASV) systems, replay attacks
pose a real risk. ASV systems tend to be easily fooled when an attacker replays
a pre-recorded speech signal of the genuine speaker. An effective replay detection
method is therefore highly desirable. In this study, we investigate a major
difficulty in replay detection: the over-fitting problem caused by variability
factors in the speech signal. An F-ratio probing tool is proposed and three
variability factors are investigated using this tool: speaker identity, speech
content and playback & recording device. The analysis shows that device is the
most influential factor that contributes the highest over-fitting risk. A
frequency warping approach is studied to alleviate the over-fitting problem, as
verified on the ASVspoof 2017 database.
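The abstract above does not define its F-ratio probing tool, so the following is only a minimal sketch under the textbook definition of an F-ratio: the variance of the per-group means divided by the mean within-group variance, computed for one scalar feature. The function name `f_ratio`, the device labels, and the feature values are hypothetical illustrations, not taken from the paper.

```python
from statistics import mean, pvariance


def f_ratio(groups):
    """Between-group variance divided by mean within-group variance.

    `groups` maps a factor value (for example a device id) to a list of
    scalar feature values observed under that factor value.
    Assumption: this is the textbook F-ratio, not necessarily the exact
    formulation used by the authors.
    """
    group_means = [mean(values) for values in groups.values()]
    between = pvariance(group_means)
    within = mean([pvariance(values) for values in groups.values()])
    if within == 0:
        raise ValueError("within-group variance is zero; F-ratio undefined")
    return between / within


# Hypothetical feature values grouped by playback device.
by_device = {"dev_a": [1.0, 1.2, 0.8], "dev_b": [3.0, 3.2, 2.8]}
device_score = f_ratio(by_device)
```

A larger value means the feature separates the groups of that factor more strongly, which is the sense in which a factor can be called influential.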
Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification
Frame alignments can be computed by different methods in GMM-based speaker
verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are
able to compare the performance using alignments extracted from the deep neural
networks (DNN) and the conventional hidden Markov model (HMM) in digit-prompted
speaker verification. Based on the different characteristics of these two
alignments, we present a novel content verification method to improve the
system security without much computational overhead. Our experiments on the
RSR2015 Part-3 digit-prompted task show that the DNN-based alignment performs
on par with the HMM alignment. The results also demonstrate the effectiveness
of the proposed Kullback-Leibler (KL) divergence based scoring to reject speech
with incorrect pass-phrases.
Comment: accepted by APSIPA ASC 201
CFAD: A Chinese Dataset for Fake Audio Detection
Fake audio detection is a growing concern and some relevant datasets have
been designed for research. However, there is no standard public Chinese
dataset under complex conditions. In this paper, we aim to fill this gap and
design a Chinese fake audio detection dataset (CFAD) for studying more
generalized detection methods. Twelve mainstream speech-generation techniques
are used to generate fake audio. To simulate real-life scenarios, three
noise datasets are selected for noise adding at five different signal-to-noise
ratios, and six codecs are considered for audio transcoding (format
conversion). The CFAD dataset can be used not only for fake audio detection but
also for detecting the algorithms of fake utterances for audio forensics.
Baseline results are presented with analysis. The results show that fake audio
detection methods that generalize well remain challenging. The CFAD dataset is
publicly available at: https://zenodo.org/record/8122764.
Comment: FAD renamed as CFA
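The noise-adding step mentioned in the CFAD abstract (mixing noise into speech at a chosen signal-to-noise ratio) can be sketched as follows. This is a generic illustration, not the dataset's actual pipeline: the helper names `power` and `mix_at_snr`, the toy sample values, and the 10 dB target are all assumptions. The noise gain follows from requiring 10·log10(P_speech / P_scaled_noise) to equal the target SNR.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the given SNR in dB.

    Assumption: `noise` is at least as long as `speech` and is truncated
    to the same length before mixing.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # Solve p_speech / (gain**2 * p_noise) = 10 ** (snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals; real use would operate on audio sample arrays.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Repeating the call with several `snr_db` values would produce one noisy copy per SNR condition.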
Probing the Information Encoded in X-vectors
Deep neural network based speaker embeddings, such as x-vectors, have been
shown to perform well in text-independent speaker recognition/verification
tasks. In this paper, we use simple classifiers to investigate the contents
encoded by x-vector embeddings. We probe these embeddings for information
related to the speaker, channel, transcription (sentence, words, phones), and
meta information about the utterance (duration and augmentation type), and
compare these with the information encoded by i-vectors across a varying number
of dimensions. We also study the effect of data augmentation during extractor
training on the information captured by x-vectors. Experiments on the RedDots
data set show that x-vectors capture spoken content and channel-related
information, while performing well on speaker verification tasks.
Comment: Accepted at IEEE Workshop on Automatic Speech Recognition and
Understanding (ASRU) 201
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies about extraction of bottleneck (BN) features
from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases
and triphone states for improving the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have a similar non-stationarity property, and TCL further has the
advantage of having no need for labeled data. We therefore present a TCL based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR derived BN features. Moreover,....
Comment: Copyright (c) 2019 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other work
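The uniform labeling step described in the TCL abstract above (each utterance is split into a predefined number of multi-frame segments, each segment is one class, and class labels are shared across utterances) can be sketched as below. The function name `tcl_frame_labels` is hypothetical, and the handling of frame counts that do not divide evenly is an assumption, since the abstract does not specify it.

```python
def tcl_frame_labels(num_frames, num_segments):
    """Assign each frame of one utterance a segment index as its class label.

    Frames are split into `num_segments` contiguous, near-equal segments.
    Segment k always gets label k, so labels are shared across utterances
    regardless of their length.
    Assumption: uneven remainders are spread by integer arithmetic.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    return [i * num_segments // num_frames for i in range(num_frames)]


# Two utterances of different lengths share the same label set {0, ..., 4}.
short_utt = tcl_frame_labels(10, 5)
long_utt = tcl_frame_labels(23, 5)
```

These frame-level labels would then serve as training targets for a frame classifier, without requiring any manual annotation.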