2,565 research outputs found
ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems
This paper introduces a new database of voice recordings with the goal of
supporting research on vulnerabilities and protection of voice-controlled
systems (VCSs). In contrast to prior efforts, the proposed database contains
both genuine voice commands and replayed recordings of such commands, collected
in realistic VCSs usage scenarios and using modern voice assistant development
kits. Specifically, the database contains recordings from four systems (each
with a different microphone array) in a variety of environmental conditions
with different forms of background noise and relative positions between speaker
and device. To the best of our knowledge, this is the first publicly available
database that has been specifically designed for the protection of
state-of-the-art voice-controlled systems against various replay attacks in
various conditions and environments.
Comment: To appear in Interspeech 2019. Data set available at
https://github.com/YuanGongND/ReMAS
Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method
With the rapidly growing number of security-sensitive systems that use voice
as the primary input, it becomes increasingly important to address these
systems' potential vulnerability to replay attacks. Previous efforts to address
this concern have focused primarily on single-channel audio. In this paper, we
introduce a novel neural network-based replay attack detection model that
further leverages spatial information of multi-channel audio and is able to
significantly improve the replay attack detection performance.
Comment: Code of this work is available here:
https://github.com/YuanGongND/multichannel-antispoo
End-to-End Multi-Look Keyword Spotting
The performance of keyword spotting (KWS), measured in false alarms and false
rejects, degrades significantly under the far field and noisy conditions. In
this paper, we propose a multi-look neural network model for speech
enhancement that simultaneously steers toward multiple sampled look
directions. The multi-look enhancement is then jointly trained with KWS to form
an end-to-end KWS model which integrates the enhanced signals from multiple
look directions and leverages an attention mechanism to dynamically tune the
model's attention to the reliable sources. We demonstrate, on our large noisy
and far-field evaluation sets, that the proposed approach significantly
improves KWS performance over both the baseline KWS system and a recent
beamformer-based multi-beam KWS system.
Comment: Submitted to Interspeech202
On the use of DNN Autoencoder for Robust Speaker Recognition
In this paper, we present an analysis of a DNN-based autoencoder for speech
enhancement, dereverberation and denoising. The target application is a robust
speaker recognition system. We first augmented the Fisher database with
artificially noised and reverberated data and trained the autoencoder to map
noisy and reverberated speech to its clean version. We then use the autoencoder as a
preprocessing step for a state-of-the-art text-independent speaker recognition
system. We compare results achieved with pure autoencoder enhancement,
multi-condition PLDA training and their simultaneous use. We present a detailed
analysis with various conditions of NIST SRE 2010, PRISM and artificially
corrupted NIST SRE 2010 telephone condition. We conclude that the proposed
preprocessing significantly outperforms the baseline and that this technique
can be used to build a robust speaker recognition system for reverberated and
noisy data.
Comment: 5 pages, 1 figur
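The enhancement idea above (a network trained on parallel corrupted/clean features, then used as a preprocessing step) can be sketched in miniature. This toy uses synthetic filterbank-like frames and a linear map trained by gradient descent rather than the authors' DNN autoencoder; all data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" log-filterbank frames and their noisy counterparts
# (stand-ins for the artificially corrupted parallel data described above).
n_frames, n_mel = 200, 24
clean = rng.standard_normal((n_frames, n_mel))
noisy = clean + 0.3 * rng.standard_normal((n_frames, n_mel))

# Linear enhancement map W: noisy frame -> estimate of clean frame,
# trained by gradient descent on the mean-squared error (the role the
# autoencoder plays in the paper).
W = np.eye(n_mel)
lr = 0.05
for _ in range(200):
    pred = noisy @ W.T
    grad = 2.0 / n_frames * (pred - clean).T @ noisy
    W -= lr * grad

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((noisy @ W.T - clean) ** 2)
# The learned map reduces the distortion; enhanced frames (noisy @ W.T)
# would then be fed to the speaker recognition front end.
```

In the paper this preprocessing is combined with multi-condition PLDA training; the sketch only shows the enhancement half of that comparison.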
Meeting Transcription Using Virtual Microphone Arrays
We describe a system that generates speaker-annotated transcripts of meetings
by using a virtual microphone array, a set of spatially distributed
asynchronous recording devices such as laptops and mobile phones. The system is
composed of continuous audio stream alignment, blind beamforming, speech
recognition, speaker diarization using prior speaker information, and system
combination. When utilizing seven input audio streams, our system achieves a
word error rate (WER) of 22.3% and comes within 3% of the close-talking
microphone WER on the non-overlapping speech segments. The speaker-attributed
WER (SAWER) is 26.7%. The relative gains in SAWER over the single-device system
are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones,
respectively. The presented system achieves a 13.6% diarization error rate when
10% of the speech duration contains more than one speaker. The contribution of
each component to the overall performance is also investigated, and we validate
the system with experiments on the NIST RT-07 conference meeting test set.
Noise Robust Speech Recognition Using Multi-Channel Based Channel Selection And Channel Weighting
In this paper, we study several microphone channel selection and weighting
methods for robust automatic speech recognition (ASR) in noisy conditions. For
channel selection, we investigate two methods based on the maximum likelihood
(ML) criterion and minimum autoencoder reconstruction criterion, respectively.
For channel weighting, we produce enhanced log Mel filterbank coefficients as a
weighted sum of the coefficients of all channels. The weights of the channels
are estimated by using the ML criterion with constraints. We evaluate the
proposed methods on the CHiME-3 noisy ASR task. Experiments show that channel
weighting significantly outperforms channel selection due to its higher
flexibility. Furthermore, on real test data in which different channels have
different gains of the target signal, the channel weighting method performs
equally well or better than the MVDR beamforming, despite the fact that the
channel weighting does not make use of the phase delay information which is
normally used in beamforming
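The channel-weighting step described above (enhanced log Mel coefficients as a constrained weighted sum over channels) is easy to sketch. The weights here are illustrative fixed values satisfying the non-negativity and sum-to-one constraints, not the ML estimates the paper computes; all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_channels, n_frames, n_mel = 6, 50, 40

# Toy log Mel filterbank coefficients, one array per microphone channel.
channel_feats = rng.standard_normal((n_channels, n_frames, n_mel))

# Channel weights: non-negative and summing to one, as in the constrained
# ML estimation described above (softmax of arbitrary scores, for illustration).
raw = rng.standard_normal(n_channels)
weights = np.exp(raw) / np.exp(raw).sum()

# Enhanced features: per-frame weighted sum over channels.
enhanced = np.tensordot(weights, channel_feats, axes=1)  # shape (n_frames, n_mel)
```

Note this combines magnitude-domain features only, which is why, as the abstract points out, no phase delay information is involved.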
Speaker Selective Beamformer with Keyword Mask Estimation
This paper addresses the problem of automatic speech recognition (ASR) of a
target speaker in background speech. The novelty of our approach is that we
focus on a wakeup keyword, which is usually used for activating ASR systems
like smart speakers. The proposed method firstly utilizes a DNN-based mask
estimator to separate the mixture signal into the keyword signal uttered by the
target speaker and the remaining background speech. Then the separated signals
are used for calculating a beamforming filter to enhance the subsequent
utterances from the target speaker. Experimental evaluations show that the
trained DNN-based mask can selectively separate the keyword and background
speech from the mixture signal. The effectiveness of the proposed method is
also verified with Japanese ASR experiments, and we confirm that the character
error rates are significantly improved by the proposed method for both
simulated and real recorded test sets.
Comment: Accepted by SLT201
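The mask-then-beamform pipeline above can be illustrated numerically. This is a toy single-frequency-bin example with synthetic data and an oracle mask standing in for the DNN mask estimator; the mask-weighted covariance estimation and MVDR-style filter are one common realization of "calculate a beamforming filter from the separated signals", assumed here for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_mics, n_frames = 4, 300

# Toy STFT observations at one frequency bin: a keyword source from one
# direction plus diffuse background noise.
d_true = np.exp(1j * np.pi * np.arange(n_mics) * 0.3)  # true steering vector
s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
noise = 0.5 * (rng.standard_normal((n_mics, n_frames))
               + 1j * rng.standard_normal((n_mics, n_frames)))
X = np.outer(d_true, s) + noise

# Oracle time-frequency mask: 1 where the keyword dominates (the paper
# estimates this with a DNN from the wakeup keyword).
mask = (np.abs(s) > 1.0).astype(float)

# Spatial covariances from mask-weighted frames.
R_s = (mask * X) @ X.conj().T / mask.sum()
R_n = ((1 - mask) * X) @ X.conj().T / (1 - mask).sum()

# Steering estimate: principal eigenvector of the keyword covariance.
_, eigvecs = np.linalg.eigh(R_s)
d_hat = eigvecs[:, -1]

# MVDR-style filter, then enhance subsequent frames from the same speaker.
w = np.linalg.solve(R_n, d_hat)
w /= d_hat.conj() @ w          # distortionless in the estimated direction
y = w.conj() @ X               # enhanced single-channel output
```

A full system repeats this per frequency bin and passes the enhanced signal to ASR.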
Single Channel Far Field Feature Enhancement For Speaker Verification In The Wild
We investigated an enhancement and a domain adaptation approach to make
speaker verification systems robust to perturbations of far-field speech. In
the enhancement approach, using paired (parallel) reverberant-clean speech, we
trained a supervised Generative Adversarial Network (GAN) along with a feature
mapping loss. For the domain adaptation approach, we trained a Cycle Consistent
Generative Adversarial Network (CycleGAN), which maps features from far-field
domain to the speaker embedding training domain. This was trained on unpaired
data in an unsupervised manner. Both networks, termed Supervised Enhancement
Network (SEN) and Domain Adaptation Network (DAN) respectively, were trained
with multi-task objectives in (filter-bank) feature domain. On a simulated test
setup, we first note the benefit of using feature mapping (FM) loss along with
adversarial loss in SEN. Then, we tested both supervised and unsupervised
approaches on several real noisy datasets. We observed relative improvements
ranging from 2% to 31% in terms of DCF. Using three training schemes, we also
establish the effectiveness of the novel DAN approach.
Comment: submitted to INTERSPEECH 202
Refining a Phase Vocoder for Vocal Modulation
Vocal harmonies are a highly sought-after effect in the music industry, as they allow singers to portray more emotion and meaning through their voices. The chords one hears when listening to nearly any modern song are constructed from common ratios of frequencies (e.g., the recipe for a major triad is 4:5:6). Currently, vocal harmonies are only readily obtainable through a few methods, including backup singers, looper-effects systems, and post-process overdubbing. The shortcoming these methods share is that no publicly available code allows solo artists to modulate input audio to whatever chord structure is desired while maintaining the same duration and timbre across the successive layers.
This thesis addresses this issue using the phase vocoder method. If the modulation technique is successful, it could change the way vocalists perform. Real-time self-harmonization would give artists access to emphasized lyrical phrases and vocals without needing to hire and train backup vocalists. The phase vocoder would also enable more vocal improvisation, since individuals would only need to know how to harmonize with themselves, rather than anticipate how backup vocalists plan to move the melody when creating more spontaneously.
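The 4:5:6 ratio mentioned above translates directly into the pitch-shift factors a phase vocoder would apply. A minimal sketch of that arithmetic (the vocoder itself, i.e. STFT analysis, phase accumulation, and resynthesis, is not shown; the function name is an illustrative assumption):

```python
def major_triad(root_hz):
    """Frequencies of a 4:5:6 major triad built on a sung root.

    A phase vocoder realizes the upper voices by shifting the input
    by factors 5/4 (major third) and 6/4 (perfect fifth) while
    preserving the original duration.
    """
    return (root_hz, root_hz * 5 / 4, root_hz * 6 / 4)

# A root sung at 220 Hz (A3) yields the triad (220.0, 275.0, 330.0).
triad = major_triad(220.0)
```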
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020)
addresses three different research problems under well-defined conditions:
far-field text-dependent speaker verification from single microphone array,
far-field text-independent speaker verification from single microphone array,
and far-field text-dependent speaker verification from distributed microphone
arrays. All three tasks pose a cross-channel challenge to the participants. To
simulate the real-life scenario, the enrollment utterances are recorded with a
close-talking cellphone, while the test utterances are recorded by the far-field
microphone arrays. In this paper, we describe the database, the challenge, and
the baseline system, which is based on a ResNet-based deep speaker network with
cosine similarity scoring. For a given utterance, the speaker embeddings of the
different channels are averaged with equal weights to form the final embedding. The baseline
system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and
7.18% for task 1, task 2, and task 3, respectively.
Comment: Submitted to INTERSPEECH 202
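The baseline's scoring backend described above (equal averaging of per-channel embeddings, then cosine similarity) can be sketched as follows. The embeddings here are random stand-ins for the ResNet speaker network's outputs; the dimension and channel count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_channels, emb_dim = 4, 256

# Toy per-channel speaker embeddings for an enrollment and a test utterance.
enroll_channels = rng.standard_normal((n_channels, emb_dim))
test_channels = rng.standard_normal((n_channels, emb_dim))

# Equal average over channels gives the final utterance embedding.
enroll = enroll_channels.mean(axis=0)
test = test_channels.mean(axis=0)

# Cosine similarity scoring: accept if the score exceeds a tuned threshold.
score = enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test))
```

The minDCF and EER figures quoted above are then obtained by sweeping the decision threshold over such scores.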