2,534 research outputs found
Towards Vulnerability Analysis of Voice-Driven Interfaces and Countermeasures for Replay
Fake audio detection is expected to become an important research area in the
field of smart speakers such as Google Home, Amazon Echo and chatbots developed
for these platforms. This paper demonstrates the replay attack vulnerability of
voice-driven interfaces and proposes a countermeasure to detect replay attacks
on these platforms. We present a novel framework to model replay
attack distortion, then use a non-learning-based method for replay attack
detection on smart speakers. The replay attack distortion is modeled as a
higher-order nonlinearity in the replayed audio. Higher-order spectral
analysis (HOSA) is used to capture characteristic distortions in the replay
audio. Effectiveness of the proposed countermeasure scheme is evaluated on
original speech as well as corresponding replayed recordings. The replay attack
recordings are successfully injected into the Google Home device via Amazon
Alexa using the drop-in conferencing feature.
Comment: 6 pages, IEEE 2nd International Conference on Multimedia Information
Processing and Retrieval (IEEE MIPR 2019), March 28-30, 2019, San Jose, CA,
US
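As a rough illustration of the idea behind HOSA-based detection (a minimal sketch under assumed parameters, not the authors' implementation), the snippet below estimates an averaged bispectrum and shows that a quadratic, loudspeaker-like nonlinearity inflates its peak. The `bispectrum_peak` function and the distortion model are illustrative assumptions:

```python
import numpy as np

def bispectrum_peak(x, nfft=256):
    """Peak magnitude of an averaged bispectrum estimate
    B(f1, f2) = E[X(f1) X(f2) conj(X(f1 + f2))] over low frequencies.
    Quadratic (loudspeaker-like) distortion couples harmonics and
    inflates this third-order statistic."""
    q = nfft // 4
    i1, i2 = np.meshgrid(np.arange(q), np.arange(q), indexing="ij")
    win = np.hanning(nfft)
    acc = np.zeros((q, q), dtype=complex)
    n_frames = 0
    for i in range(0, len(x) - nfft + 1, nfft // 2):
        X = np.fft.fft(x[i:i + nfft] * win)
        # outer(X[:q], X[:q])[f1, f2] = X[f1] * X[f2]
        acc += np.outer(X[:q], X[:q]) * np.conj(X[i1 + i2])
        n_frames += 1
    return float(np.abs(acc).max() / n_frames)

t = np.arange(4096) / 16000.0
clean = np.sin(2 * np.pi * 500.0 * t)          # "original" speech stand-in
replayed = clean + 0.3 * clean ** 2            # crude quadratic replay distortion
peak_clean = bispectrum_peak(clean)
peak_replayed = bispectrum_peak(replayed)
```

A pure tone yields an almost empty bispectrum, while the distorted copy produces a strong peak from harmonic coupling, which is the kind of characteristic a HOSA-based countermeasure can threshold.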
Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method
With the rapidly growing number of security-sensitive systems that use voice
as the primary input, it becomes increasingly important to address these
systems' potential vulnerability to replay attacks. Previous efforts to address
this concern have focused primarily on single-channel audio. In this paper, we
introduce a novel neural network-based replay attack detection model that
further leverages spatial information of multi-channel audio and is able to
significantly improve the replay attack detection performance.
Comment: Code of this work is available here:
https://github.com/YuanGongND/multichannel-antispoo
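The paper's detector is a neural network, but the spatial information a multi-channel model can exploit is easy to illustrate. The minimal GCC-PHAT sketch below (an assumed stand-in, not the released code) estimates the inter-channel time delay that distinguishes source positions between a live talker and a replay loudspeaker:

```python
import numpy as np

def gcc_phat_delay(a, b, max_lag=32):
    """Estimate the delay (in samples) of channel a relative to channel b
    via GCC-PHAT cross-correlation -- a basic spatial cue for a mic array."""
    n = len(a) + len(b)
    R = np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n))
    R /= np.abs(R) + 1e-12                      # phase transform (PHAT) weighting
    cc = np.fft.irfft(R, n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)                       # signal at microphone 2
delayed = np.concatenate((np.zeros(5), ref))[:1024]   # same source, 5-sample lag
delay = gcc_phat_delay(delayed, ref)                  # recovers the 5-sample lag
```

A learned model can consume such inter-channel relationships implicitly, which is what gives multi-channel audio its edge over the single-channel setting.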
Towards robust audio spoofing detection: a detailed comparison of traditional and learned features
Automatic speaker verification, like every other biometric system, is
vulnerable to spoofing attacks. Using only a few minutes of recorded voice of a
genuine client of a speaker verification system, attackers can develop a
variety of spoofing attacks that might trick such systems. Detecting these
attacks using the audio cues present in the recordings is an important
challenge. Most existing spoofing detection systems depend on knowing which
spoofing technique was used. With this research, we aim to overcome this limitation,
by examining robust audio features, both traditional and those learned through
an autoencoder, that are generalizable over different types of replay spoofing.
Furthermore, we provide a detailed account of all the steps necessary in
setting up state-of-the-art audio feature detection, pre-, and postprocessing,
such that the (non-audio expert) machine learning researcher can implement such
systems. Finally, we evaluate the performance of our robust replay speaker
detection system with a wide variety and different combinations of both
extracted and machine learned audio features on the `out in the wild' ASVspoof
2017 dataset. This dataset contains a variety of new spoofing configurations.
Since our focus is on examining which features will ensure robustness, we base
our system on a traditional Gaussian Mixture Model-Universal Background Model.
We then systematically investigate the relative contribution of each feature
set. The fused models, based on both the known audio features and the
machine-learned features respectively, have comparable performance with an Equal
Error Rate (EER) of 12%. The final best-performing model, which obtains an EER
of 10.8%, is a hybrid model that contains both known and machine-learned
features, thus revealing the importance of incorporating both types of features
when developing a robust spoofing prediction model.
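Since the system above is built on a traditional GMM-UBM, a hedged sketch of the scoring step may be useful. Here two diagonal-covariance GMMs (toy hand-set parameters, not trained models) score an utterance's feature frames, and the log-likelihood ratio decides genuine vs. spoofed:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Frame-averaged log-likelihood of features x (frames, dims) under a
    diagonal-covariance GMM given by weights, means, variances (comps, dims)."""
    diff = x[:, None, :] - means[None, :, :]                   # (frames, comps, dims)
    log_comp = (-0.5 * (diff ** 2 / variances
                        + np.log(2 * np.pi * variances))).sum(-1)
    log_comp += np.log(weights)
    m = log_comp.max(axis=1, keepdims=True)                    # stable logsumexp
    return float(np.mean(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))

# Toy parameters standing in for GMMs trained on genuine vs. replayed features:
w = np.array([0.5, 0.5])
genuine_gmm = (w, np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2)))
spoof_gmm = (w, np.array([[3.0, 3.0], [4.0, 4.0]]), np.ones((2, 2)))

rng = np.random.default_rng(1)
utt = rng.normal(0.5, 1.0, size=(200, 2))    # features resembling the genuine model
llr = gmm_loglik(utt, *genuine_gmm) - gmm_loglik(utt, *spoof_gmm)
```

A positive ratio accepts the utterance as genuine; in practice the two models would be adapted from a universal background model and fed MFCC-like features rather than this toy 2-D data.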
A Study on Smart Online Frame Forging Attacks against Video Surveillance System
Video Surveillance Systems (VSS) have become an essential infrastructural
element of smart cities by increasing public safety and countering criminal
activities. A VSS is normally deployed in a secure network to prevent access
from unauthorized personnel. Compared to traditional systems that continuously
record video regardless of the actions in the frame, a smart VSS has the
capability of capturing video data upon motion detection or object detection,
and then extracts essential information and sends it to users. This increased
design complexity of the surveillance system, however, also introduces new
security vulnerabilities. In this work, a smart, real-time frame duplication
attack is investigated. We show the feasibility of forging the video streams in
real-time as the camera's surroundings change. The generated frames are
compared constantly and instantly to identify changes in the pixel values that
could represent motion detection or changes in light intensities outdoors. An
attacker (intruder) can remotely trigger the replay of some previously
duplicated video streams manually or automatically, via a special quick
response (QR) code or when the face of an intruder appears in the camera field
of view. A detection technique is proposed by leveraging the real-time
electrical network frequency (ENF) reference database to match with the power
grid frequency.
Comment: To Appear in the 2019 SPIE Defense + Commercial Sensin
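The ENF-matching defense can be sketched in a few lines (a simplified toy with an assumed synthetic hum rather than a real reference database): track the dominant frequency near the 60 Hz nominal in each frame, then compare that track against the grid reference.

```python
import numpy as np

def enf_track(x, fs, frame_s=5.0, nominal=60.0, band=0.5):
    """Per-frame dominant frequency near the nominal grid frequency --
    the signature matched against an ENF reference database."""
    n = int(fs * frame_s)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    mask = (freqs > nominal - band) & (freqs < nominal + band)
    track = []
    for i in range(0, len(x) - n + 1, n):
        spec = np.abs(np.fft.rfft(x[i:i + n] * np.hanning(n)))
        track.append(freqs[mask][np.argmax(spec[mask])])
    return np.array(track)

# Synthetic hum whose frequency wanders like a real power grid:
fs = 500
reference = np.array([59.8, 60.2, 60.0, 60.4])   # per-frame grid ENF values
phase = 2 * np.pi * np.cumsum(np.repeat(reference, 5 * fs)) / fs
hum = 0.1 * np.sin(phase)
estimated = enf_track(hum, fs)                    # recovers the reference track
```

Replayed or duplicated frames carry the ENF trace of the original capture time, so a live track that disagrees with the real-time grid reference flags the forgery.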
Generalization of Spoofing Countermeasures: a Case Study with ASVspoof 2015 and BTAS 2016 Corpora
Voice-based biometric systems are highly prone to spoofing attacks. Recently,
various countermeasures have been developed for detecting different kinds of
attacks such as replay, speech synthesis (SS) and voice conversion (VC). Most
of the existing studies are conducted with a specific training set defined by
the evaluation protocol. However, for realistic scenarios, selecting
appropriate training data is an open challenge for the system administrator.
Motivated by this practical concern, this work investigates the generalization
capability of spoofing countermeasures in restricted training conditions where
speech from a broad range of attack types is left out of the training database. We
demonstrate that different spoofing types have considerably different
generalization capabilities. For this study, we analyze the performance using
two kinds of features: mel-frequency cepstral coefficients (MFCCs), which are
considered the baseline, and the recently proposed constant Q cepstral coefficients
(CQCCs). The experiments are conducted with a standard Gaussian mixture model -
maximum likelihood (GMM-ML) classifier on two recently released spoofing
corpora, ASVspoof 2015 and BTAS 2016, including a cross-corpora performance
analysis. Feature-level analysis suggests that both static and dynamic coefficients
of spectral features are important for detecting spoofing attacks in
real-life conditions.
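The finding that both static and dynamic coefficients matter invites a quick look at how dynamic (delta) features are formed. The sketch below uses the generic textbook regression formula, not necessarily the paper's exact configuration:

```python
import numpy as np

def add_deltas(feats, width=2):
    """Append delta (dynamic) coefficients to a (frames, dims) matrix of
    static features such as MFCCs or CQCCs, using the standard regression
    formula d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2)."""
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    t = len(feats)
    num = sum(k * (pad[width + k: t + width + k] - pad[width - k: t + width - k])
              for k in range(1, width + 1))
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    return np.hstack([feats, num / denom])

static = np.arange(6.0)[:, None]   # toy static features rising linearly
both = add_deltas(static)          # interior delta values equal the slope
```

Concatenating the delta (and delta-delta) columns onto the static ones is what gives the classifier access to the temporal dynamics the abstract highlights.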
Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge
In this study, we concentrate on replacing the process of extracting
hand-crafted acoustic features with an end-to-end DNN using complementary
high-resolution spectrograms. As a result of advances in audio devices, typical
characteristics of a replayed speech based on conventional knowledge alter or
diminish in unknown replay configurations. Thus, it has become increasingly
difficult to detect spoofed speech with a conventional knowledge-based
approach. To detect unrevealed characteristics that reside in a replayed
speech, we directly input spectrograms into an end-to-end DNN without
knowledge-based intervention. The explorations in this study that
differentiate it from existing spectrogram-based systems are twofold:
complementary information and high resolution. Spectrograms with different
information are explored, and it is shown that additional information such as
the phase information can be complementary. High-resolution spectrograms are
employed with the assumption that the difference between a bona-fide and a
replayed speech exists in the details. Additionally, to verify whether other
features are complementary to spectrograms, we also examine raw waveform and an
i-vector based system. Experiments conducted on the ASVspoof 2019 physical
access challenge show promising results, where t-DCF and equal error rates are
0.0570 and 2.45% for the evaluation set, respectively.
Comment: Accepted for oral presentation at Interspeech 2019, code available at
https://github.com/Jungjee/ASVspoof2019_P
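A minimal sketch of the complementary high-resolution input the paper describes (function name and parameters are illustrative assumptions, not the released configuration): stack the log-magnitude spectrogram with the complementary phase spectrogram as two channels for an end-to-end DNN.

```python
import numpy as np

def mag_phase_spectrograms(x, nfft=512, hop=128):
    """Two-channel DNN input: a high-resolution log-magnitude spectrogram
    plus the complementary phase spectrogram, shape (2, frames, bins)."""
    win = np.hanning(nfft)
    frames = np.stack([x[i:i + nfft] * win
                       for i in range(0, len(x) - nfft + 1, hop)])
    spec = np.fft.rfft(frames, axis=1)
    return np.stack([np.log1p(np.abs(spec)), np.angle(spec)])

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)        # stand-in for a speech waveform
inputs = mag_phase_spectrograms(x)   # (2, 28, 257) for these parameters
```

Feeding both channels lets the network discover fine-grained replay artifacts, including those carried by phase, without hand-crafted feature engineering.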
ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems
This paper introduces a new database of voice recordings with the goal of
supporting research on vulnerabilities and protection of voice-controlled
systems (VCSs). In contrast to prior efforts, the proposed database contains
both genuine voice commands and replayed recordings of such commands, collected
in realistic VCSs usage scenarios and using modern voice assistant development
kits. Specifically, the database contains recordings from four systems (each
with a different microphone array) in a variety of environmental conditions
with different forms of background noise and relative positions between speaker
and device. To the best of our knowledge, this is the first publicly available
database that has been specifically designed for the protection of
state-of-the-art voice-controlled systems against various replay attacks in
various conditions and environments.
Comment: To appear in Interspeech 2019. Data set available at
https://github.com/YuanGongND/ReMAS
Discriminate natural versus loudspeaker emitted speech
In this work, we address a novel, but potentially emerging, problem of
discriminating the natural human voices and those played back by any kind of
audio devices in the context of interactions with in-house voice user
interface. The tackled problem may find relevant applications in (1) the
far-field voice interactions of vocal interfaces such as Amazon Echo, Google
Home, Facebook Portal, etc., and (2) replay spoofing attack detection. The
detection of loudspeaker-emitted speech will help avoid false wake-ups or
unintended interactions with the devices in the first application, while
eliminating attacks that involve the replay of recordings collected from enrolled
speakers in the second. First, we collect a real-world dataset under
well-controlled conditions containing two classes: recorded speeches directly
spoken by numerous people (considered as the natural speech), and recorded
speeches played back from various loudspeakers (considered as the loudspeaker
emitted speech). Then, from this dataset, we build prediction models based on
Deep Neural Networks (DNNs) for which different combinations of audio features
have been considered. Experimental results confirm the feasibility of the task,
where the combination of audio embeddings extracted from the SoundNet and VGGish
networks yields a classification accuracy of up to about 90%.
Comment: 5 pages, 1 figur
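The embedding-fusion idea can be caricatured with random stand-in vectors (the arrays, dimensions, and labels below are hypothetical, not the real SoundNet/VGGish embeddings): concatenate the two embeddings per clip and fit a simple logistic-regression classifier on top.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-ins for SoundNet and VGGish embeddings of each clip:
n, d1, d2 = 200, 16, 8
soundnet = rng.standard_normal((n, d1))
vggish = rng.standard_normal((n, d2))
# Toy labels that depend on both embeddings (1 = natural, 0 = loudspeaker):
labels = (soundnet[:, 0] + vggish[:, 0] > 0).astype(float)

X = np.hstack([soundnet, vggish])        # early fusion by concatenation
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):                     # plain gradient-descent logistic regression
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - labels
    w -= 0.1 * (X.T @ g) / n
    b -= 0.1 * g.mean()
acc = float(((X @ w + b > 0) == (labels == 1)).mean())
```

This toy only shows the fusion mechanics; the ~90% figure in the abstract refers to the authors' real dataset and DNN models, not to anything reproduced here.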
Ensemble Models for Spoofing Detection in Automatic Speaker Verification
Detecting spoofing attempts of automatic speaker verification (ASV) systems
is challenging, especially when using only one modeling approach. For
robustness, we use both deep neural networks and traditional machine learning
models and combine them as ensemble models through logistic regression. They
are trained to detect logical access (LA) and physical access (PA) attacks on
the dataset released as part of the ASV Spoofing and Countermeasures Challenge
2019. We propose dataset partitions that ensure different attack types are
present during training and validation to improve system robustness. Our
ensemble model outperforms all our single models and the baselines from the
challenge for both attack types. We investigate why some models on the PA
dataset strongly outperform others and find that spoofed recordings in the
dataset tend to have longer silences at the end than genuine ones. By removing
them, the PA task becomes much more challenging, with the tandem detection cost
function (t-DCF) of our best single model rising from 0.1672 to 0.5018 and
equal error rate (EER) increasing from 5.98% to 19.8% on the development set.
Comment: Accepted at Interspeech 2019, Graz, Austri
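The silence artifact the authors discovered is simple enough to measure directly. This hedged sketch (threshold and durations are illustrative, not the paper's settings) computes trailing-silence length, the unintended cue that made the unmodified PA task easier:

```python
import numpy as np

def trailing_silence(x, fs, thresh=1e-3, frame_s=0.01):
    """Duration (seconds) of the low-energy tail of a recording."""
    n = max(1, int(fs * frame_s))
    silent = 0
    for i in range(len(x) - n, -1, -n):        # scan short frames from the end
        if np.mean(x[i:i + n] ** 2) > thresh:  # first frame with real energy
            break
        silent += 1
    return silent * frame_s

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220.0 * t)                  # stand-in for speech
genuine = np.concatenate((speech, np.zeros(int(0.05 * fs))))
spoofed = np.concatenate((speech, np.zeros(int(0.5 * fs))))
tail_genuine = trailing_silence(genuine, fs)            # about 0.05 s
tail_spoofed = trailing_silence(spoofed, fs)            # about 0.5 s
```

A classifier can exploit such a duration difference without learning anything about replay acoustics; trimming the tails, as the authors do, removes this shortcut and restores a realistic difficulty level.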
Audio-replay attack detection countermeasures
This paper presents the Speech Technology Center (STC) replay attack
detection systems proposed for Automatic Speaker Verification Spoofing and
Countermeasures Challenge 2017. In this study we focused on comparing
different spoofing detection approaches: GMM-based methods, high-level
feature extraction with a simple classifier, and deep learning frameworks.
Experiments performed on the development and evaluation parts of the challenge
dataset demonstrated the stable efficiency of deep learning approaches under
changing acoustic conditions. At the same time, an SVM classifier with high-level
features contributed substantially to the efficiency of the resulting STC
systems, according to the fusion results.
Comment: 11 pages, 3 figures, accepted for Specom 201