2,977 research outputs found
Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech
In this paper, we evaluate the vulnerability of a speaker verification
(SV) system to synthetic speech. Although this problem
was first examined over a decade ago, dramatic improvements
in both SV and speech synthesis have renewed interest in
it. We use an HMM-based speech synthesizer, which
creates synthetic speech for a targeted speaker through adaptation
of a background model, and a GMM-UBM-based SV system.
Using 283 speakers from the Wall Street Journal (WSJ)
corpus, our SV system achieves an EER of 0.4%. When the system
is tested with synthetic speech generated from speaker models
derived from the WSJ corpus, 90% of the matched
claims are accepted. This result suggests a possible vulnerability
of SV systems to synthetic speech. In order to detect
synthetic speech prior to recognition, we investigate the
use of an automatic speech recognizer (ASR), the dynamic time warping
(DTW) distance of mel-frequency cepstral coefficients
(MFCCs), and the previously proposed average inter-frame difference
of log-likelihood (IFDLL). Overall, while SV systems
have impressive accuracy, even with the proposed detector,
high-quality synthetic speech can lead to an unacceptably high
acceptance rate of synthetic speakers.
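The abstract lists the DTW distance of MFCCs as one detection cue but does not give the detector's implementation. As a rough illustration only, the sketch below computes a DTW distance between two MFCC-like feature matrices; the `dtw_distance` helper and the toy frame values are assumptions, not the authors' code.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences.

    x, y: arrays of shape (n_frames, n_coeffs), e.g. MFCC matrices.
    Returns the accumulated cost of the optimal alignment path.
    """
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between all frame pairs.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated-cost matrix with the usual three-step recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]

# Toy check: identical sequences align with zero cost.
a = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
print(dtw_distance(a, a))  # 0.0
```

In a detector of this kind, an unusually small DTW distance between a test utterance and enrollment material could flag a copied or synthesized signal.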
Sampling-based speech parameter generation using moment-matching networks
This paper presents sampling-based speech parameter generation using
moment-matching networks for Deep Neural Network (DNN)-based speech synthesis.
Although people never produce exactly the same speech twice, even when trying
to express the same linguistic and para-linguistic information, typical
statistical speech synthesis produces completely identical speech, i.e., there
is no inter-utterance variation in synthetic speech. To give synthetic speech natural
inter-utterance variation, this paper builds DNN acoustic models that make it
possible to randomly sample speech parameters. The DNNs are trained so that
they make the moments of generated speech parameters close to those of natural
speech parameters. Since the variation of speech parameters is compressed into
a low-dimensional simple prior noise vector, our algorithm has lower
computation cost than direct sampling of speech parameters. As the first step
towards generating synthetic speech that has natural inter-utterance variation,
this paper investigates whether or not the proposed sampling-based generation
deteriorates synthetic speech quality. In evaluation, we compare speech quality
of conventional maximum likelihood-based generation and proposed sampling-based
generation. The result demonstrates that the proposed generation causes no
degradation in speech quality.

Comment: Submitted to INTERSPEECH 201
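The training criterion sketched in the abstract makes the moments of generated parameters close to those of natural ones. A minimal scalar-moment version, comparing empirical means and covariances, is shown below; the function name and this particular choice of moments are illustrative assumptions, not the paper's objective.

```python
import numpy as np

def moment_matching_loss(generated, natural):
    """Sum of squared differences between the first moments (means) and
    second moments (covariances) of two parameter sets of shape (n, dim)."""
    mean_diff = generated.mean(axis=0) - natural.mean(axis=0)
    cov_diff = np.cov(generated, rowvar=False) - np.cov(natural, rowvar=False)
    return float(np.sum(mean_diff ** 2) + np.sum(cov_diff ** 2))

# Toy check: identical sets give zero loss; a shifted set does not.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
print(moment_matching_loss(x, x))            # 0.0
print(moment_matching_loss(x, x + 1.0) > 0)  # True
```

Minimizing such a loss pushes the distribution of sampled parameters toward that of natural speech parameters without forcing every sample to be identical, which is the point of the sampling-based generation described above.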
A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification
For practical automatic speaker verification (ASV) systems, replay attacks
pose a real threat: by replaying a pre-recorded speech signal of the genuine
speaker, an attacker can easily fool the system. An effective replay detection
method is therefore highly desirable. In this study, we investigate a major
difficulty in replay detection: the over-fitting problem caused by variability
factors in the speech signal. An F-ratio probing tool is proposed, and three
variability factors are investigated using this tool: speaker identity, speech
content, and playback & recording device. The analysis shows that the device is
the most influential factor, contributing the highest over-fitting risk. A
frequency-warping approach is studied to alleviate the over-fitting problem, as
verified on the ASVspoof 2017 database.
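The F-ratio probe described above contrasts between-group and within-group variance of a feature. A minimal scalar version might look like the following; the function name and the toy device grouping are illustrative assumptions, not the authors' tool.

```python
import numpy as np

def f_ratio(groups):
    """Between-group variance of the group means divided by the mean
    within-group variance. Large values indicate the feature varies
    mainly with the grouping factor (e.g. recording device)."""
    means = np.array([g.mean() for g in groups])
    between = means.var()
    within = np.mean([g.var() for g in groups])
    return between / within

# Feature values measured under two hypothetical playback devices.
device_a = np.array([0.10, 0.12, 0.11])
device_b = np.array([0.90, 0.88, 0.91])
print(f_ratio([device_a, device_b]) > 1.0)  # True: device dominates
```

A high F-ratio for the device factor, as the study reports, means a classifier can latch onto device characteristics rather than replay evidence itself, which is exactly the over-fitting risk the abstract describes.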
Effectiveness in the Realisation of Speaker Authentication
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

An important consideration for the deployment of speaker recognition in authentication applications is the approach to the formation of training and testing utterances. Whilst defining this for a specific scenario is influenced by the associated requirements and conditions, the process can be further guided by establishing the relative usefulness of alternative frameworks for composing the training and testing material. In this regard, the present paper provides an analysis of the effects, on speaker recognition accuracy, of various bases for the formation of the training and testing data. The experimental investigations are conducted using digit utterances taken from the XM2VTS database. The paper presents a detailed description of the individual approaches considered and discusses the experimental results obtained in the different cases.