2,565 research outputs found
ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems
This paper introduces a new database of voice recordings with the goal of
supporting research on vulnerabilities and protection of voice-controlled
systems (VCSs). In contrast to prior efforts, the proposed database contains
both genuine voice commands and replayed recordings of such commands, collected
in realistic VCSs usage scenarios and using modern voice assistant development
kits. Specifically, the database contains recordings from four systems (each
with a different microphone array) in a variety of environmental conditions
with different forms of background noise and relative positions between speaker
and device. To the best of our knowledge, this is the first publicly available
database that has been specifically designed for the protection of
state-of-the-art voice-controlled systems against various replay attacks in
various conditions and environments.
Comment: To appear in Interspeech 2019. Data set available at
https://github.com/YuanGongND/ReMAS
Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method
With the rapidly growing number of security-sensitive systems that use voice
as the primary input, it becomes increasingly important to address these
systems' potential vulnerability to replay attacks. Previous efforts to address
this concern have focused primarily on single-channel audio. In this paper, we
introduce a novel neural network-based replay attack detection model that
further leverages spatial information of multi-channel audio and is able to
significantly improve the replay attack detection performance.
Comment: Code of this work is available here:
https://github.com/YuanGongND/multichannel-antispoo
End-to-End Multi-Look Keyword Spotting
The performance of keyword spotting (KWS), measured in false alarms and false
rejects, degrades significantly under the far field and noisy conditions. In
this paper, we propose a multi-look neural network model for speech
enhancement that simultaneously steers toward multiple sampled look
directions. The multi-look enhancement is then jointly trained with KWS to form
an end-to-end KWS model which integrates the enhanced signals from multiple
look directions and leverages an attention mechanism to dynamically tune the
model's attention to the reliable sources. We demonstrate, on our large noisy
and far-field evaluation sets, that the proposed approach significantly
improves KWS performance over both the baseline KWS system and a recent
beamformer-based multi-beam KWS system.
Comment: Submitted to Interspeech202
On the use of DNN Autoencoder for Robust Speaker Recognition
In this paper, we present an analysis of a DNN-based autoencoder for speech
enhancement, dereverberation and denoising. The target application is a robust
speaker recognition system. We first augmented the Fisher database with
artificially noised and reverberated data and trained the autoencoder to map
noisy and reverberated speech to its clean version. We then use the autoencoder as a
preprocessing step for a state-of-the-art text-independent speaker recognition
system. We compare results achieved with pure autoencoder enhancement,
multi-condition PLDA training and their simultaneous use. We present a detailed
analysis with various conditions of NIST SRE 2010, PRISM and artificially
corrupted NIST SRE 2010 telephone condition. We conclude that the proposed
preprocessing significantly outperforms the baseline and that this technique
can be used to build a robust speaker recognition system for reverberated and
noisy data.
Comment: 5 pages, 1 figur
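The enhancement idea above (a network trained on parallel corrupted/clean features, then used as a preprocessing step) can be sketched in miniature. This toy uses synthetic filterbank-like frames and a linear map trained by gradient descent rather than the authors' DNN autoencoder; all data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" log-filterbank frames and their noisy counterparts
# (stand-ins for the artificially corrupted parallel data described above).
n_frames, n_mel = 200, 24
clean = rng.standard_normal((n_frames, n_mel))
noisy = clean + 0.3 * rng.standard_normal((n_frames, n_mel))

# Linear enhancement map W: noisy frame -> estimate of clean frame,
# trained by gradient descent on the mean-squared error (the role the
# autoencoder plays in the paper).
W = np.eye(n_mel)
lr = 0.05
for _ in range(200):
    pred = noisy @ W.T
    grad = 2.0 / n_frames * (pred - clean).T @ noisy
    W -= lr * grad

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((noisy @ W.T - clean) ** 2)
# The learned map reduces the distortion; enhanced frames (noisy @ W.T)
# would then be fed to the speaker recognition front end.
```

In the paper this preprocessing is combined with multi-condition PLDA training; the sketch only shows the enhancement half of that comparison.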
Meeting Transcription Using Virtual Microphone Arrays
We describe a system that generates speaker-annotated transcripts of meetings
by using a virtual microphone array, a set of spatially distributed
asynchronous recording devices such as laptops and mobile phones. The system is
composed of continuous audio stream alignment, blind beamforming, speech
recognition, speaker diarization using prior speaker information, and system
combination. When utilizing seven input audio streams, our system achieves a
word error rate (WER) of 22.3% and comes within 3% of the close-talking
microphone WER on the non-overlapping speech segments. The speaker-attributed
WER (SAWER) is 26.7%. The relative gains in SAWER over the single-device system
are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones,
respectively. The presented system achieves a 13.6% diarization error rate when
10% of the speech duration contains more than one speaker. The contribution of
each component to the overall performance is also investigated, and we validate
the system with experiments on the NIST RT-07 conference meeting test set.
Noise Robust Speech Recognition Using Multi-Channel Based Channel Selection And Channel Weighting
In this paper, we study several microphone channel selection and weighting
methods for robust automatic speech recognition (ASR) in noisy conditions. For
channel selection, we investigate two methods based on the maximum likelihood
(ML) criterion and minimum autoencoder reconstruction criterion, respectively.
For channel weighting, we produce enhanced log Mel filterbank coefficients as a
weighted sum of the coefficients of all channels. The weights of the channels
are estimated by using the ML criterion with constraints. We evaluate the
proposed methods on the CHiME-3 noisy ASR task. Experiments show that channel
weighting significantly outperforms channel selection due to its higher
flexibility. Furthermore, on real test data in which different channels have
different gains of the target signal, the channel weighting method performs
equally well or better than the MVDR beamforming, despite the fact that the
channel weighting does not make use of the phase delay information which is
normally used in beamforming
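The channel-weighting step described above (enhanced log Mel coefficients as a constrained weighted sum over channels) is easy to sketch. The weights here are illustrative fixed values satisfying the non-negativity and sum-to-one constraints, not the ML estimates the paper computes; all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_channels, n_frames, n_mel = 6, 50, 40

# Toy log Mel filterbank coefficients, one array per microphone channel.
channel_feats = rng.standard_normal((n_channels, n_frames, n_mel))

# Channel weights: non-negative and summing to one, as in the constrained
# ML estimation described above (softmax of arbitrary scores, for illustration).
raw = rng.standard_normal(n_channels)
weights = np.exp(raw) / np.exp(raw).sum()

# Enhanced features: per-frame weighted sum over channels.
enhanced = np.tensordot(weights, channel_feats, axes=1)  # shape (n_frames, n_mel)
```

Note this combines magnitude-domain features only, which is why, as the abstract points out, no phase delay information is involved.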
Speaker Selective Beamformer with Keyword Mask Estimation
This paper addresses the problem of automatic speech recognition (ASR) of a
target speaker in background speech. The novelty of our approach is that we
focus on a wakeup keyword, which is usually used for activating ASR systems
like smart speakers. The proposed method firstly utilizes a DNN-based mask
estimator to separate the mixture signal into the keyword signal uttered by the
target speaker and the remaining background speech. Then the separated signals
are used for calculating a beamforming filter to enhance the subsequent
utterances from the target speaker. Experimental evaluations show that the
trained DNN-based mask can selectively separate the keyword and background
speech from the mixture signal. The effectiveness of the proposed method is
also verified with Japanese ASR experiments, and we confirm that the character
error rates are significantly improved by the proposed method for both
simulated and real recorded test sets.
Comment: Accepted by SLT201
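The mask-then-beamform pipeline above can be illustrated numerically. This is a toy single-frequency-bin example with synthetic data and an oracle mask standing in for the DNN mask estimator; the mask-weighted covariance estimation and MVDR-style filter are one common realization of "calculate a beamforming filter from the separated signals", assumed here for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_mics, n_frames = 4, 300

# Toy STFT observations at one frequency bin: a keyword source from one
# direction plus diffuse background noise.
d_true = np.exp(1j * np.pi * np.arange(n_mics) * 0.3)  # true steering vector
s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
noise = 0.5 * (rng.standard_normal((n_mics, n_frames))
               + 1j * rng.standard_normal((n_mics, n_frames)))
X = np.outer(d_true, s) + noise

# Oracle time-frequency mask: 1 where the keyword dominates (the paper
# estimates this with a DNN from the wakeup keyword).
mask = (np.abs(s) > 1.0).astype(float)

# Spatial covariances from mask-weighted frames.
R_s = (mask * X) @ X.conj().T / mask.sum()
R_n = ((1 - mask) * X) @ X.conj().T / (1 - mask).sum()

# Steering estimate: principal eigenvector of the keyword covariance.
_, eigvecs = np.linalg.eigh(R_s)
d_hat = eigvecs[:, -1]

# MVDR-style filter, then enhance subsequent frames from the same speaker.
w = np.linalg.solve(R_n, d_hat)
w /= d_hat.conj() @ w          # distortionless in the estimated direction
y = w.conj() @ X               # enhanced single-channel output
```

A full system repeats this per frequency bin and passes the enhanced signal to ASR.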
Single Channel Far Field Feature Enhancement For Speaker Verification In The Wild
We investigated an enhancement and a domain adaptation approach to make
speaker verification systems robust to perturbations of far-field speech. In
the enhancement approach, using paired (parallel) reverberant-clean speech, we
trained a supervised Generative Adversarial Network (GAN) along with a feature
mapping loss. For the domain adaptation approach, we trained a Cycle Consistent
Generative Adversarial Network (CycleGAN), which maps features from far-field
domain to the speaker embedding training domain. This was trained on unpaired
data in an unsupervised manner. Both networks, termed Supervised Enhancement
Network (SEN) and Domain Adaptation Network (DAN) respectively, were trained
with multi-task objectives in (filter-bank) feature domain. On a simulated test
setup, we first note the benefit of using feature mapping (FM) loss along with
adversarial loss in SEN. Then, we tested both supervised and unsupervised
approaches on several real noisy datasets. We observed relative improvements
ranging from 2% to 31% in terms of DCF. Using three training schemes, we also
establish the effectiveness of the novel DAN approach.
Comment: submitted to INTERSPEECH 202
Refining a Phase Vocoder for Vocal Modulation
Vocal harmonies are a highly sought-after effect in the music industry, as they allow singers to portray more emotion and meaning through their voices. The chords one hears when listening to nearly any modern song are constructed from common ratios of frequencies (e.g., the recipe for a major triad is 4:5:6). Currently, vocal harmonies are only readily obtainable through a few methods, including backup singers, looper-effects systems, and post-process overdubbing. The shortcoming these methods share is that no publicly available code allows solo artists to modulate input audio to whatever chord structure is desired while maintaining the same duration and timbre across the successive layers.
This thesis addresses this issue using the phase vocoder method. If the modulation technique is successful, it could change the way vocalists perform. Real-time self-harmonization would give artists access to emphasized lyrical phrases and vocals without needing to hire and train backup vocalists. The phase vocoder would also enable more vocal improvisation, since individuals would only need to know how to harmonize with themselves, rather than anticipate how backup vocalists plan to move the melody when creating more spontaneously.
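The 4:5:6 ratio mentioned above translates directly into the pitch-shift factors a phase vocoder would apply. A minimal sketch of that arithmetic (the vocoder itself, i.e. STFT analysis, phase accumulation, and resynthesis, is not shown; the function name is an illustrative assumption):

```python
def major_triad(root_hz):
    """Frequencies of a 4:5:6 major triad built on a sung root.

    A phase vocoder realizes the upper voices by shifting the input
    by factors 5/4 (major third) and 6/4 (perfect fifth) while
    preserving the original duration.
    """
    return (root_hz, root_hz * 5 / 4, root_hz * 6 / 4)

# A root sung at 220 Hz (A3) yields the triad (220.0, 275.0, 330.0).
triad = major_triad(220.0)
```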
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020)
addresses three different research problems under well-defined conditions:
far-field text-dependent speaker verification from single microphone array,
far-field text-independent speaker verification from single microphone array,
and far-field text-dependent speaker verification from distributed microphone
arrays. All three tasks pose a cross-channel challenge to the participants. To
simulate the real-life scenario, the enrollment utterances are recorded with a
close-talking cellphone, while the test utterances are recorded by the far-field
microphone arrays. In this paper, we describe the database, the challenge, and
the baseline system, which is based on a ResNet-based deep speaker network with
cosine similarity scoring. For a given utterance, the speaker embeddings of the
different channels are averaged with equal weights to form the final embedding. The baseline
system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and
7.18% for task 1, task 2, and task 3, respectively.
Comment: Submitted to INTERSPEECH 202
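The baseline's scoring backend described above (equal averaging of per-channel embeddings, then cosine similarity) can be sketched as follows. The embeddings here are random stand-ins for the ResNet speaker network's outputs; the dimension and channel count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_channels, emb_dim = 4, 256

# Toy per-channel speaker embeddings for an enrollment and a test utterance.
enroll_channels = rng.standard_normal((n_channels, emb_dim))
test_channels = rng.standard_normal((n_channels, emb_dim))

# Equal average over channels gives the final utterance embedding.
enroll = enroll_channels.mean(axis=0)
test = test_channels.mean(axis=0)

# Cosine similarity scoring: accept if the score exceeds a tuned threshold.
score = enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test))
```

The minDCF and EER figures quoted above are then obtained by sweeping the decision threshold over such scores.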