6 research outputs found
Attacking Speaker Recognition With Deep Generative Models
In this paper we investigate the ability of generative adversarial networks
(GANs) to synthesize spoofing attacks on modern speaker recognition systems. We
first show that samples generated with SampleRNN and WaveNet are unable to fool
a CNN-based speaker recognition system. We propose a modification of the
Wasserstein GAN objective function to make use of data that is real but not
from the class being learned. Our semi-supervised learning method is able to
perform both targeted and untargeted attacks, raising questions related to
security in speaker authentication systems.
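To make the objective modification concrete, here is a minimal PyTorch-style sketch, assuming the real-but-out-of-class data enters the critic loss as an extra penalty with an illustrative weight alpha; this is a rough illustration of the idea, not the paper's exact formulation.

import torch

def critic_loss(d_real_target, d_fake, d_real_other, alpha=0.5):
    # Standard WGAN critic term: score target-class real data high and fakes low.
    wasserstein = d_fake.mean() - d_real_target.mean()
    # Hypothetical extra term: also push down scores for real samples from other
    # classes, so the critic models the target class rather than mere "realness".
    out_of_class = alpha * d_real_other.mean()
    return wasserstein + out_of_class

Here d_real_target, d_fake, and d_real_other are critic scores on batches of target-speaker audio, generated audio, and real audio from other speakers, respectively.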
TequilaGAN: How to easily identify GAN samples
In this paper we show strategies to easily identify fake samples generated
with the Generative Adversarial Network framework. One strategy is based on the
statistical analysis and comparison of raw pixel values and features extracted
from them. The other strategy learns formal specifications from the real data
and shows that fake samples violate the specifications of the real data. We
show that fake samples produced with GANs have a universal signature that can
be used to identify fake samples. We provide results on MNIST, CIFAR10, music
and speech data.
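As a rough sketch of the first strategy, one can compare the empirical distributions of raw values between real and generated batches; the two-sample Kolmogorov-Smirnov test used below is an illustrative choice of statistic and not necessarily the paper's exact measure.

import numpy as np
from scipy.stats import ks_2samp

def distribution_gap(real_batch, fake_batch):
    # Flatten pixel or sample values and compare their empirical distributions.
    real = np.asarray(real_batch, dtype=np.float64).ravel()
    fake = np.asarray(fake_batch, dtype=np.float64).ravel()
    stat, p_value = ks_2samp(real, fake)
    return stat, p_value  # a large statistic (tiny p-value) flags a GAN-like signature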
Spoofing Speaker Verification Systems with Deep Multi-speaker Text-to-speech Synthesis
This paper proposes a deep multi-speaker text-to-speech (TTS) model for
spoofing speaker verification (SV) systems. The proposed model employs one
network to synthesize time-downsampled mel-spectrograms from text input and
another network to convert them to linear-frequency spectrograms, which are
further converted to the time domain using the Griffin-Lim algorithm. Both
networks are trained separately under the generative adversarial networks (GAN)
framework. Spoofing experiments on two state-of-the-art SV systems (i-vectors
and Google's GE2E) show that the proposed system can successfully spoof these
systems with a high success rate. Spoofing experiments on anti-spoofing systems
(i.e., binary classifiers for discriminating real and synthetic speech) also
show a high spoof success rate when such anti-spoofing systems' structures are
exposed to the proposed TTS system.
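The final waveform-reconstruction step described above, inverting the predicted linear-frequency magnitude spectrogram with Griffin-Lim, can be sketched with librosa; the frame parameters below are placeholders and not the paper's settings.

import librosa

def spectrogram_to_waveform(linear_mag_spec, n_fft=1024, hop_length=256, n_iter=60):
    # linear_mag_spec: magnitude spectrogram of shape (1 + n_fft // 2, frames).
    return librosa.griffinlim(linear_mag_spec, n_iter=n_iter,
                              hop_length=hop_length, win_length=n_fft)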
Neural voice cloning with a few low-quality samples
In this paper, we explore the possibility of speech synthesis from low-quality
found data using only a limited number of samples from the target speaker.
Unlike previous works, which try to train the entire text-to-speech system on
found data, we extract only the speaker embedding from the found data of the
target speaker. In addition, the two speaker-mimicking approaches, adaptation
and speaker-encoder-based cloning, are applied to the newly released LibriTTS
dataset and the previously released VCTK corpus to examine the impact of
speaker variety on clarity and target-speaker similarity.
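A minimal sketch of the speaker-encoder-based route follows, using the off-the-shelf Resemblyzer encoder purely for illustration (it is not the encoder from this work): embed a few found utterances of the target speaker and reuse the averaged embedding to condition a multi-speaker TTS model.

from resemblyzer import VoiceEncoder, preprocess_wav

def embed_target_speaker(wav_paths):
    # Derive one fixed-dimensional speaker embedding from a handful of utterances.
    encoder = VoiceEncoder()
    wavs = [preprocess_wav(p) for p in wav_paths]
    return encoder.embed_speaker(wavs)  # averaged, L2-normalized embedding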
Adversarial Learning in the Cyber Security Domain
In recent years, machine learning algorithms, and more specifically deep
learning algorithms, have been widely used in many fields, including cyber
security. However, machine learning systems are vulnerable to adversarial
attacks, and this limits the application of machine learning, especially in
non-stationary, adversarial environments, such as the cyber security domain,
where actual adversaries (e.g., malware developers) exist. This paper
comprehensively summarizes the latest research on adversarial attacks against
security solutions that are based on machine learning techniques and presents
the risks they pose to cyber security solutions. First, we discuss the unique
challenges of implementing end-to-end adversarial attacks in the cyber security
domain. Following that, we define a unified taxonomy, where the adversarial
attack methods are characterized based on their stage of occurrence, and the
attacker's goals and capabilities. Then, we categorize the applications of
adversarial attack techniques in the cyber security domain. Finally, we use our
taxonomy to shed light on gaps in the cyber security domain that have already
been addressed in other adversarial learning domains and discuss their impact
on future adversarial learning trends in the cyber security domain.
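The taxonomy axes named in the abstract (stage of occurrence, attacker's goal, attacker's capability) can be summarized as a small illustrative data structure; the category labels below are common adversarial-learning terms and are not claimed to match the paper's exact wording.

from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    TRAINING = "training time (poisoning)"
    INFERENCE = "inference time (evasion)"

class Goal(Enum):
    TARGETED = "targeted"
    UNTARGETED = "untargeted"

class Capability(Enum):
    WHITE_BOX = "white-box"
    BLACK_BOX = "black-box"

@dataclass
class AdversarialAttack:
    name: str
    stage: Stage
    goal: Goal
    capability: Capability
    application: str  # e.g. "malware detection" or "network intrusion detection"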
Hear "No Evil", See "Kenansville": Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems
Automatic speech recognition and voice identification systems are being
deployed in a wide array of applications, from providing control mechanisms to
devices lacking traditional interfaces, to the automatic transcription of
conversations and authentication of users. Many of these applications have
significant security and privacy considerations. We develop attacks that force
mistranscription and misidentification in state-of-the-art systems, with
minimal impact on human comprehension. Processing pipelines for modern systems
comprise signal preprocessing and feature extraction steps, whose
output is fed to a machine-learned model. Prior work has focused on the models,
using white-box knowledge to tailor model-specific attacks. We focus on the
pipeline stages before the models, which (unlike the models) are quite similar
across systems. As such, our attacks are black-box and transferable, and
demonstrably achieve mistranscription and misidentification rates as high as
100% by modifying only a few frames of audio. We perform a study via Amazon
Mechanical Turk demonstrating that there is no statistically significant
difference between human perception of regular and perturbed audio. Our
findings suggest that models may learn aspects of speech that are generally not
perceived by human subjects, but that are crucial for model accuracy. We also
find that certain English language phonemes (in particular, vowels) are
significantly more susceptible to our attack. We show that the attacks are
effective when mounted over cellular networks, where signals are subject to
degradation due to transcoding, jitter, and packet loss.
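A rough numpy sketch of a perturbation in this spirit: for a handful of frames, discard the weakest frequency components and resynthesize; the thresholding rule and frame selection here are assumptions for illustration, not the paper's exact attack.

import numpy as np

def perturb_frames(audio, frame_indices, frame_len=512, keep_fraction=0.9):
    # audio: 1-D float numpy array; frame_indices: which frames to perturb.
    out = audio.copy()
    for i in frame_indices:
        frame = out[i * frame_len:(i + 1) * frame_len]
        if frame.size == 0:
            continue
        spec = np.fft.rfft(frame)
        mags = np.abs(spec)
        if mags.sum() == 0:
            continue
        # Keep only the strongest bins carrying keep_fraction of the total magnitude.
        order = np.argsort(mags)[::-1]
        cumulative = np.cumsum(mags[order]) / mags.sum()
        cutoff = np.searchsorted(cumulative, keep_fraction) + 1
        spec[order[cutoff:]] = 0.0
        out[i * frame_len:(i + 1) * frame_len] = np.fft.irfft(spec, n=frame.size)
    return out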