
    Towards Vulnerability Analysis of Voice-Driven Interfaces and Countermeasures for Replay

    Fake audio detection is expected to become an important research area for smart speakers such as Google Home and Amazon Echo, and for the chatbots developed for these platforms. This paper presents the replay attack vulnerability of voice-driven interfaces and proposes a countermeasure for detecting replay attacks on these platforms. It introduces a novel framework to model replay attack distortion and then uses a non-learning-based method for replay attack detection on smart speakers. The replay attack distortion is modeled as a higher-order nonlinearity in the replayed audio, and higher-order spectral analysis (HOSA) is used to capture the characteristic distortions in the replayed recordings. The effectiveness of the proposed countermeasure is evaluated on original speech as well as on the corresponding replayed recordings. The replay attack recordings are successfully injected into the Google Home device via Amazon Alexa using the drop-in conferencing feature.
    Comment: 6 pages, IEEE 2nd International Conference on Multimedia Information Processing and Retrieval (IEEE MIPR 2019), March 28-30, 2019, San Jose, CA, US
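    As a rough illustration of the HOSA idea above (a minimal sketch, not the authors' implementation; the frame length, hop size, and decision threshold are assumptions), the following Python snippet estimates an average bispectrum magnitude, a third-order statistic that tends to grow when a replay chain adds nonlinear distortion:

        import numpy as np

        def mean_bispectrum_magnitude(x, frame_len=256, hop=128):
            """Average bispectrum magnitude of a mono signal.

            The bispectrum B(f1, f2) = E[X(f1) X(f2) X*(f1 + f2)] is close to
            zero for linear, Gaussian-like signals, so excess bispectral energy
            serves here as a crude proxy for the higher-order nonlinearity that
            a loudspeaker-plus-re-recording chain can introduce.
            """
            n_freq = frame_len // 2
            idx = np.add.outer(np.arange(n_freq), np.arange(n_freq))  # f1 + f2
            window = np.hanning(frame_len)
            accum = np.zeros((n_freq, n_freq))
            n_frames = 0
            for start in range(0, len(x) - frame_len + 1, hop):
                X = np.fft.fft(x[start:start + frame_len] * window)
                B = np.outer(X[:n_freq], X[:n_freq]) * np.conj(X[idx])
                accum += np.abs(B)
                n_frames += 1
            return accum.mean() / max(n_frames, 1)

        # Illustrative decision rule: flag a recording as a suspected replay when
        # its bispectral energy exceeds a threshold tuned on development data.
        # THRESHOLD is an assumed, tuned value, not taken from the paper.
        # is_replay = mean_bispectrum_magnitude(audio) > THRESHOLD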

    Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method

    With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks. Previous efforts to address this concern have focused primarily on single-channel audio. In this paper, we introduce a novel neural network-based replay attack detection model that further leverages the spatial information in multi-channel audio and is able to significantly improve replay attack detection performance.
    Comment: Code of this work is available here: https://github.com/YuanGongND/multichannel-antispoo
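    To make the multi-channel idea concrete, here is a minimal PyTorch sketch (an assumed toy architecture, not the released model linked above) in which each microphone becomes an input channel of the first convolution, so the network can exploit spatial cues that a single-channel front end discards:

        import torch
        import torch.nn as nn

        class MultiChannelReplayDetector(nn.Module):
            """Toy CNN over (batch, n_mics, freq_bins, frames) spectrograms."""

            def __init__(self, n_mics=4):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(n_mics, 16, kernel_size=3, padding=1),  # mixes microphones
                    nn.ReLU(),
                    nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1),
                )
                self.classifier = nn.Linear(32, 2)  # genuine vs. replay

            def forward(self, x):
                return self.classifier(self.features(x).flatten(1))

        # Example: a batch of 8 four-microphone log-spectrograms (257 bins, 200 frames).
        logits = MultiChannelReplayDetector(n_mics=4)(torch.randn(8, 4, 257, 200))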

    Towards robust audio spoofing detection: a detailed comparison of traditional and learned features

    Automatic speaker verification, like every other biometric system, is vulnerable to spoofing attacks. Using only a few minutes of recorded voice of a genuine client of a speaker verification system, attackers can develop a variety of spoofing attacks that might trick such systems. Detecting these attacks using the audio cues present in the recordings is an important challenge. Most existing spoofing detection systems depend on knowledge of the spoofing technique used. With this research, we aim to overcome this limitation by examining robust audio features, both traditional and those learned through an autoencoder, that generalize over different types of replay spoofing. Furthermore, we provide a detailed account of all the steps necessary in setting up state-of-the-art audio feature detection, pre-, and post-processing, such that the (non-audio-expert) machine learning researcher can implement such systems. Finally, we evaluate the performance of our robust replay speaker detection system with a wide variety and different combinations of both extracted and machine-learned audio features on the 'out in the wild' ASVspoof 2017 dataset. This dataset contains a variety of new spoofing configurations. Since our focus is on examining which features will ensure robustness, we base our system on a traditional Gaussian Mixture Model-Universal Background Model. We then systematically investigate the relative contribution of each feature set. The fused models, based on the known audio features and on the machine-learned features respectively, have comparable performance, with an Equal Error Rate (EER) of 12%. The final best-performing model, which obtains an EER of 10.8%, is a hybrid model that contains both known and machine-learned features, thus revealing the importance of incorporating both types of features when developing a robust spoofing prediction model.
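    As a minimal sketch of the kind of GMM-based scoring such systems build on (the feature extraction, mixture counts, and the synthetic data below are illustrative assumptions, not the paper's setup), one can fit one mixture model per class on pooled feature frames and score test utterances with a log-likelihood ratio:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_gmms(genuine_frames, spoof_frames, n_components=64):
            """Fit one GMM per class on pooled feature frames (e.g. MFCC/CQCC rows)."""
            gmm_gen = GaussianMixture(n_components, covariance_type="diag").fit(genuine_frames)
            gmm_spf = GaussianMixture(n_components, covariance_type="diag").fit(spoof_frames)
            return gmm_gen, gmm_spf

        def llr_score(utterance_frames, gmm_gen, gmm_spf):
            """Average per-frame log-likelihood ratio; higher means 'more genuine'."""
            return gmm_gen.score(utterance_frames) - gmm_spf.score(utterance_frames)

        # Usage with made-up shapes: frames are (n_frames, n_feature_dims).
        rng = np.random.default_rng(0)
        gmm_gen, gmm_spf = train_gmms(rng.normal(size=(2000, 20)),
                                      rng.normal(1.0, 1.0, size=(2000, 20)),
                                      n_components=8)
        print(llr_score(rng.normal(size=(300, 20)), gmm_gen, gmm_spf))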

    A Study on Smart Online Frame Forging Attacks against Video Surveillance System

    Video Surveillance Systems (VSS) have become an essential infrastructural element of smart cities by increasing public safety and countering criminal activities. A VSS is normally deployed in a secure network to prevent access by unauthorized personnel. Compared to traditional systems that continuously record video regardless of the actions in the frame, a smart VSS can capture video data upon motion or object detection and then extract essential information and send it to users. This increased design complexity of the surveillance system, however, also introduces new security vulnerabilities. In this work, a smart, real-time frame duplication attack is investigated. We show the feasibility of forging the video streams in real time as the camera's surroundings change. The generated frames are compared constantly and instantly to identify changes in the pixel values that could represent motion detection or changes in outdoor light intensity. An attacker (intruder) can remotely trigger the replay of previously duplicated video streams, manually or automatically, via a special quick response (QR) code or when the face of an intruder appears in the camera's field of view. A detection technique is proposed that leverages a real-time electrical network frequency (ENF) reference database to match against the power grid frequency.
    Comment: To appear in the 2019 SPIE Defense + Commercial Sensing
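    As a hedged illustration of the ENF-matching idea (a minimal sketch with assumed parameters such as the nominal 60 Hz mains frequency, the search band, and the correlation threshold; it is not the authors' pipeline), one can track the dominant frequency near the mains frequency in a captured signal and correlate the resulting trace with a reference database for the same time span:

        import numpy as np
        from scipy.signal import stft

        def enf_trace(x, fs, nominal=60.0, band=1.0, win_sec=2.0):
            """Per-frame estimate of the dominant frequency near the mains frequency.

            A live feed should produce a trace matching the grid's reference ENF for
            the current time; a replayed (duplicated) segment will instead correlate
            with the reference of some earlier time span.
            """
            f, t, Z = stft(x, fs=fs, nperseg=int(win_sec * fs))
            mask = (f >= nominal - band) & (f <= nominal + band)
            sub_f = f[mask]
            return sub_f[np.argmax(np.abs(Z[mask, :]), axis=0)]

        def matches_reference(trace, reference, min_corr=0.9):
            """Compare the measured trace with the reference ENF for the same period."""
            n = min(len(trace), len(reference))
            return np.corrcoef(trace[:n], reference[:n])[0, 1] >= min_corr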

    Generalization of Spoofing Countermeasures: a Case Study with ASVspoof 2015 and BTAS 2016 Corpora

    Voice-based biometric systems are highly prone to spoofing attacks. Recently, various countermeasures have been developed for detecting different kinds of attacks such as replay, speech synthesis (SS) and voice conversion (VC). Most of the existing studies are conducted with a specific training set defined by the evaluation protocol. However, in realistic scenarios, selecting appropriate training data is an open challenge for the system administrator. Motivated by this practical concern, this work investigates the generalization capability of spoofing countermeasures in restricted training conditions where speech from broad attack types is left out of the training database. We demonstrate that different spoofing types have considerably different generalization capabilities. For this study, we analyze the performance using two kinds of features: mel-frequency cepstral coefficients (MFCCs), which serve as the baseline, and the recently proposed constant Q cepstral coefficients (CQCCs). The experiments are conducted with a standard Gaussian mixture model maximum likelihood (GMM-ML) classifier on two recently released spoofing corpora, ASVspoof 2015 and BTAS 2016, and include a cross-corpora performance analysis. Feature-level analysis suggests that both the static and the dynamic coefficients of spectral features are important for detecting spoofing attacks in real-life conditions.
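    To make the distinction between static and dynamic coefficients concrete, here is a minimal librosa-based sketch (the coefficient count and delta settings are assumptions, and the paper's CQCC extraction is not reproduced) that stacks MFCCs with their delta and delta-delta coefficients:

        import numpy as np
        import librosa

        def static_dynamic_mfcc(path, n_mfcc=20):
            """Return a (3 * n_mfcc, n_frames) matrix: MFCCs + deltas + delta-deltas.

            The plain MFCC rows are the 'static' coefficients; the delta (velocity)
            and delta-delta (acceleration) rows are the 'dynamic' ones.
            """
            y, sr = librosa.load(path, sr=None, mono=True)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            d1 = librosa.feature.delta(mfcc, order=1)
            d2 = librosa.feature.delta(mfcc, order=2)
            return np.vstack([mfcc, d1, d2])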

    Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge

    In this study, we concentrate on replacing hand-crafted acoustic feature extraction with an end-to-end DNN that uses complementary high-resolution spectrograms. As a result of advances in audio devices, the typical characteristics of replayed speech assumed by conventional knowledge alter or diminish in unknown replay configurations. Thus, it has become increasingly difficult to detect spoofed speech with a conventional knowledge-based approach. To detect unrevealed characteristics that reside in replayed speech, we directly input spectrograms into an end-to-end DNN without knowledge-based intervention. Two explorations in this study differentiate it from existing spectrogram-based systems: complementary information and high resolution. Spectrograms with different information are explored, and it is shown that additional information such as phase can be complementary. High-resolution spectrograms are employed under the assumption that the difference between bona fide and replayed speech lies in the details. Additionally, to verify whether other features are complementary to spectrograms, we also examine a raw-waveform system and an i-vector-based system. Experiments conducted on the ASVspoof 2019 physical access challenge show promising results, with a t-DCF of 0.0570 and an equal error rate of 2.45% on the evaluation set.
    Comment: Accepted for oral presentation at Interspeech 2019, code available at https://github.com/Jungjee/ASVspoof2019_P
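    A minimal sketch of one way magnitude and phase could be stacked as complementary, high-resolution DNN inputs (the FFT size, hop, and log compression are assumptions; the authors' exact front end is in the repository linked above):

        import numpy as np

        def complementary_spectrograms(x, n_fft=2048, hop=128):
            """Stack log-magnitude and phase spectrograms as two input 'channels'.

            A large n_fft keeps fine spectral detail (the high-resolution idea),
            and the phase channel carries information the magnitude alone discards.
            """
            window = np.hanning(n_fft)
            frames = [np.fft.rfft(x[s:s + n_fft] * window)
                      for s in range(0, len(x) - n_fft + 1, hop)]
            spec = np.stack(frames, axis=1)            # (n_fft // 2 + 1, n_frames)
            log_mag = np.log1p(np.abs(spec))
            phase = np.angle(spec)
            return np.stack([log_mag, phase], axis=0)  # (2, freq, time)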

    ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems

    This paper introduces a new database of voice recordings with the goal of supporting research on vulnerabilities and protection of voice-controlled systems (VCSs). In contrast to prior efforts, the proposed database contains both genuine voice commands and replayed recordings of such commands, collected in realistic VCS usage scenarios and using modern voice assistant development kits. Specifically, the database contains recordings from four systems (each with a different microphone array) in a variety of environmental conditions with different forms of background noise and different relative positions between speaker and device. To the best of our knowledge, this is the first publicly available database specifically designed for protecting state-of-the-art voice-controlled systems against various replay attacks in various conditions and environments.
    Comment: To appear in Interspeech 2019. Data set available at https://github.com/YuanGongND/ReMAS

    Discriminate natural versus loudspeaker emitted speech

    In this work, we address a novel but potentially emerging problem: discriminating between natural human voices and those played back by any kind of audio device, in the context of interactions with an in-house voice user interface. The problem has relevant applications in (1) far-field voice interaction with vocal interfaces such as Amazon Echo, Google Home, and Facebook Portal, and (2) replay spoofing attack detection. Detecting loudspeaker-emitted speech helps avoid false wake-ups or unintended interactions with the devices in the first application, and eliminates attacks that replay recordings collected from enrolled speakers in the second. We first collect a real-world dataset under well-controlled conditions containing two classes: recordings of speech spoken directly by numerous people (the natural speech) and recordings of speech played back from various loudspeakers (the loudspeaker-emitted speech). From this dataset, we then build prediction models based on deep neural networks (DNNs), for which different combinations of audio features are considered. Experimental results confirm the feasibility of the task: combining audio embeddings extracted from the SoundNet and VGGish networks yields a classification accuracy of up to about 90%.
    Comment: 5 pages, 1 figure
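    A minimal, hedged sketch of this kind of embedding fusion (the embedding dimensions, the random stand-in data, and the classifier are assumptions; extracting the actual SoundNet and VGGish embeddings is not shown): concatenate the two embeddings per clip and train a small classifier on the fused vectors.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier

        # Assumed inputs: one SoundNet and one VGGish embedding per clip, already
        # extracted offline (shapes are illustrative), plus 0/1 labels
        # (0 = natural speech, 1 = loudspeaker-emitted speech).
        rng = np.random.default_rng(0)
        soundnet = rng.normal(size=(500, 1024))
        vggish = rng.normal(size=(500, 128))
        labels = rng.integers(0, 2, size=500)

        features = np.hstack([soundnet, vggish])        # simple early fusion
        X_tr, X_te, y_tr, y_te = train_test_split(features, labels,
                                                  test_size=0.2, random_state=0)
        clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200).fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))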

    Ensemble Models for Spoofing Detection in Automatic Speaker Verification

    Detecting spoofing attempts against automatic speaker verification (ASV) systems is challenging, especially when using only one modeling approach. For robustness, we use both deep neural networks and traditional machine learning models and combine them into ensemble models through logistic regression. They are trained to detect logical access (LA) and physical access (PA) attacks on the dataset released as part of the ASV Spoofing and Countermeasures Challenge 2019. We propose dataset partitions that ensure different attack types are present during training and validation to improve system robustness. Our ensemble model outperforms all our single models and the challenge baselines for both attack types. We investigate why some models on the PA dataset strongly outperform others and find that spoofed recordings in the dataset tend to have longer silences at the end than genuine ones. When these silences are removed, the PA task becomes much more challenging, with the tandem detection cost function (t-DCF) of our best single model rising from 0.1672 to 0.5018 and its equal error rate (EER) increasing from 5.98% to 19.8% on the development set.
    Comment: Accepted at Interspeech 2019, Graz, Austria
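    A minimal sketch of score-level fusion via logistic regression (the number of sub-models, the synthetic scores, and the partitioning are assumptions, not the paper's configuration): fit the fusion weights on held-out development scores and apply them to evaluation scores.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Assumed inputs: per-utterance scores from several sub-models (columns),
        # e.g. a CNN and two traditional classifiers, with 0/1 labels (1 = spoofed).
        rng = np.random.default_rng(0)
        dev_scores = rng.normal(size=(1000, 3))
        dev_labels = rng.integers(0, 2, size=1000)

        fusion = LogisticRegression().fit(dev_scores, dev_labels)  # learn fusion weights
        eval_scores = rng.normal(size=(200, 3))
        fused = fusion.predict_proba(eval_scores)[:, 1]            # ensemble spoofing score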

    Audio-replay attack detection countermeasures

    This paper presents the Speech Technology Center (STC) replay attack detection systems proposed for the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017. In this study we focused on comparing different spoofing detection approaches: GMM-based methods, high-level feature extraction with a simple classifier, and deep learning frameworks. Experiments performed on the development and evaluation parts of the challenge dataset demonstrated the stable efficiency of deep learning approaches under changing acoustic conditions. At the same time, an SVM classifier with high-level features contributed substantially to the efficiency of the resulting STC systems, according to the fusion results.
    Comment: 11 pages, 3 figures, accepted for Specom 201