Within-sample variability-invariant loss for robust speaker recognition under noisy environments
Despite the significant improvements in speaker recognition enabled by deep
neural networks, unsatisfactory performance persists under noisy environments.
In this paper, we train the speaker embedding network to learn the "clean"
embedding of the noisy utterance. Specifically, the network is trained with the
original speaker identification loss together with an auxiliary within-sample
variability-invariant loss. This auxiliary loss encourages the network to
produce the same embedding for a clean utterance and its noisy copies and
prevents the network from encoding undesired noise or variability into the
speaker representation. Furthermore, we investigate the data preparation
strategy for generating clean and noisy utterance pairs on-the-fly. The
strategy generates different noisy copies for the same clean utterance at each
training step, helping the speaker embedding network generalize better under
noisy environments. Experiments on VoxCeleb1 indicate that the proposed
training framework improves the performance of the speaker verification system
in both clean and noisy conditions.
Comment: Accepted at ICASSP 2020
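The combined objective is straightforward to express. Below is a minimal PyTorch sketch, assuming a model exposing `embed` and `classify` methods and a weighting hyperparameter `lambda_vi`; these names are illustrative assumptions, not the paper's published code.

```python
import torch.nn.functional as F

def combined_loss(model, clean_batch, noisy_batch, labels, lambda_vi=1.0):
    """Speaker-ID loss plus a within-sample variability-invariant term.

    clean_batch: features of the clean utterances
    noisy_batch: the same utterances with noise added on-the-fly
    labels:      integer speaker identities
    lambda_vi:   weight of the auxiliary loss (assumed hyperparameter)
    """
    emb_clean = model.embed(clean_batch)   # (B, D) speaker embeddings
    emb_noisy = model.embed(noisy_batch)   # (B, D)

    # Primary objective: identify the speaker from the noisy embedding.
    logits = model.classify(emb_noisy)
    id_loss = F.cross_entropy(logits, labels)

    # Auxiliary objective: pull the noisy embedding toward its clean
    # counterpart, so noise is not encoded in the speaker representation.
    vi_loss = F.mse_loss(emb_noisy, emb_clean.detach())

    return id_loss + lambda_vi * vi_loss
```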
Domain Aware Training for Far-field Small-footprint Keyword Spotting
In this paper, we focus on the task of small-footprint keyword spotting under
the far-field scenario. Far-field environments are commonly encountered in
real-life speech applications, causing severe performance degradation due to
room reverberation and various kinds of noise. Our baseline system is built on
a convolutional neural network trained with pooled data of both far-field and
close-talking speech. To cope with the distortions, we develop three domain
aware training systems, including the domain embedding system, the deep CORAL
system, and the multi-task learning system. These methods incorporate domain
knowledge into network training and improve the performance of the keyword
classifier under far-field conditions. Experimental results show that our
proposed methods maintain performance on close-talking speech while achieving
significant improvements on the far-field test set.
Comment: Submitted to INTERSPEECH 2020
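Of the three systems, deep CORAL has the most compact formulation: it aligns the second-order statistics of hidden activations across the two domains. A minimal PyTorch sketch of the standard CORAL loss of Sun & Saenko (2016), with batch shapes assumed for illustration:

```python
import torch

def coral_loss(source_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Deep CORAL loss: match second-order statistics of two domains.

    source_feats: (Ns, d) activations from close-talking utterances
    target_feats: (Nt, d) activations from far-field utterances
    """
    d = source_feats.size(1)

    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)

    cs, ct = covariance(source_feats), covariance(target_feats)
    # Squared Frobenius distance between the domain covariances,
    # with the usual 1/(4 d^2) normalization.
    return ((cs - ct) ** 2).sum() / (4.0 * d * d)
```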
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization and speech recognition. This paper presents a novel
approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis, which assumes that two temporally close
short speech segments belong to the same speaker; a common representation that
encodes the commonalities of the two segments should therefore capture the
vocal characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate `same' vs
`different' speaker pairs generated from unlabeled audio streams. Two sets of
experiments in different scenarios evaluate the strength of NPC embeddings and
compare them with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings perform best in the
short-duration experiment and provide complementary information to i-vectors in
the full-utterance experiment. Second, a large-scale speaker verification task
with a wide range of within-speaker channel
variability is adopted as an upper-bound experiment where comparisons are drawn
with in-domain supervised methods.
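The same/different pair generation under the stationarity hypothesis can be sketched directly. The segment length and maximum gap below are illustrative placeholders, not values from the paper:

```python
import random
import torch

def sample_pair(streams, same: bool, seg_len=200, max_gap=50):
    """Draw one training pair under short-term active-speaker stationarity.

    streams:  list of 1-D feature tensors (unlabeled audio streams), each
              assumed longer than 2 * seg_len + max_gap frames
    same:     if True, take two temporally close segments from one stream
              (assumed same speaker); else take segments from two streams
    seg_len:  segment length in frames (assumed value)
    max_gap:  maximum frame gap between 'same' segments (assumed value)
    """
    if same:
        s = random.choice(streams)
        start = random.randrange(len(s) - 2 * seg_len - max_gap)
        gap = random.randrange(max_gap)
        a = s[start : start + seg_len]
        b = s[start + seg_len + gap : start + 2 * seg_len + gap]
        return a, b, torch.tensor(1.0)
    s1, s2 = random.sample(streams, 2)
    i = random.randrange(len(s1) - seg_len)
    j = random.randrange(len(s2) - seg_len)
    return s1[i : i + seg_len], s2[j : j + seg_len], torch.tensor(0.0)
```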
Length- and Noise-aware Training Techniques for Short-utterance Speaker Recognition
Speaker recognition performance has been greatly improved with the emergence
of deep learning. Deep neural networks show the capacity to effectively deal
with impacts of noise and reverberation, making them attractive to far-field
speaker recognition systems. The x-vector framework is a popular choice for
generating speaker embeddings in recent literature due to its robust training
mechanism and excellent performance in various test sets. In this paper, we
start from early work that incorporates invariant representation learning (IRL)
into the loss function, and extend the approach with centroid alignment (CA) and
length variability cost (LVC) techniques to further improve robustness in
noisy, far-field applications. This work mainly focuses on improvements for
short-duration test utterances (1-8s). We also present improved results on
long-duration tasks. In addition, this work discusses a novel self-attention
mechanism. On the VOiCES far-field corpus, the combination of the proposed
techniques achieves relative improvements of 7.0% for extremely short and 8.2%
for full-duration test utterances in equal error rate (EER) over our baseline
system.
Comment: To be published in the proceedings of Interspeech 2020
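The abstract does not spell out the centroid alignment term; one plausible reading, offered purely as an assumption rather than the paper's published formulation, is a penalty pulling each noisy embedding toward the centroid of same-speaker clean embeddings in the batch:

```python
import torch

def centroid_alignment_loss(noisy_emb: torch.Tensor,
                            clean_emb: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """One plausible form of centroid alignment (an assumption): pull each
    noisy embedding toward the centroid of the clean embeddings that share
    its speaker label.

    noisy_emb, clean_emb: (B, D) embedding batches; labels: (B,) speaker ids
    """
    speakers = labels.unique()
    loss = noisy_emb.new_zeros(())
    for spk in speakers:
        mask = labels == spk
        centroid = clean_emb[mask].mean(dim=0).detach()
        loss = loss + ((noisy_emb[mask] - centroid) ** 2).sum(dim=1).mean()
    return loss / len(speakers)
```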
Speaker Recognition Based on Deep Learning: An Overview
Speaker recognition is the task of identifying people from their voices.
Recently, deep learning has dramatically revolutionized speaker recognition,
but comprehensive reviews of this exciting progress are still lacking.
In this paper, we review several major subtasks of speaker recognition,
including speaker verification, identification, diarization, and robust speaker
recognition, with a focus on deep-learning-based methods. Because the major
advantage of deep learning over conventional methods is its representation
ability, which can produce highly abstract embedding features from utterances,
we first pay close attention to deep-learning-based speaker feature extraction,
including the inputs, network structures, temporal pooling strategies, and
objective functions, which are the fundamental components of many speaker
recognition subtasks. Then, we give an overview of speaker diarization, with an
emphasis on recent supervised, end-to-end, and
online diarization. Finally, we survey robust speaker recognition from the
perspectives of domain adaptation and speech enhancement, two major approaches
to dealing with domain mismatch and noise. Popular and recently released
corpora are listed at the end of the paper.
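Temporal pooling, one of the fundamental components the survey covers, is easy to illustrate concretely. A sketch of statistics pooling as commonly used in x-vector-style systems:

```python
import torch

def statistics_pooling(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Temporal statistics pooling: map frame-level features (B, T, D) to one
    utterance-level vector (B, 2D) by concatenating the per-dimension mean
    and standard deviation across time.
    """
    mean = frames.mean(dim=1)
    var = frames.var(dim=1, unbiased=False)
    std = (var + eps).sqrt()      # eps keeps gradients finite for constant input
    return torch.cat([mean, std], dim=1)
```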
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a wide variety of speech,
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
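The abstract does not detail Batch Dispatch; the general idea of dynamic request batching for GPU inference might look like the following generic sketch, in which every name is hypothetical and nothing is taken from Baidu's implementation:

```python
import queue

def batch_dispatch(requests: queue.Queue, run_model, max_batch=8, timeout=0.01):
    """Generic dynamic-batching loop (hypothetical sketch): gather concurrent
    requests so a single GPU forward pass serves many users, trading a short
    wait for much higher throughput.

    requests holds (input, callback) tuples; run_model maps a list of inputs
    to a list of outputs.
    """
    while True:
        batch = [requests.get()]              # block until the first request
        try:
            while len(batch) < max_batch:     # top up for a short window
                batch.append(requests.get(timeout=timeout))
        except queue.Empty:
            pass                              # dispatch a partial batch
        inputs, callbacks = zip(*batch)
        for cb, out in zip(callbacks, run_model(list(inputs))):
            cb(out)                           # hand each result back
```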
Advanced Biometrics with Deep Learning
Biometrics, such as fingerprint, iris, face, handprint, hand-vein, speech, and gait recognition, have become commonplace as a means of identity management in various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into four categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voiceprint, and others.
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020)
addresses three different research problems under well-defined conditions:
far-field text-dependent speaker verification from a single microphone array,
far-field text-independent speaker verification from a single microphone array,
and far-field text-dependent speaker verification from distributed microphone
arrays. All three tasks pose a cross-channel challenge to the participants. To
simulate a real-life scenario, the enrollment utterances are recorded with a
close-talking cellphone, while the test utterances are recorded by the far-field
microphone arrays. In this paper, we describe the database, the challenge, and
the baseline system, which is based on a ResNet-based deep speaker network with
cosine similarity scoring. For a given utterance, the speaker embeddings from
different channels are averaged with equal weights to form the final embedding.
The baseline
system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and
7.18% for task 1, task 2, and task 3, respectively.
Comment: Submitted to INTERSPEECH 2020
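The baseline's scoring recipe is simple enough to state as code. A minimal sketch of equal-weight channel averaging followed by cosine scoring, with tensor shapes assumed for illustration:

```python
import torch
import torch.nn.functional as F

def pool_channels(channel_embs: torch.Tensor) -> torch.Tensor:
    """Equal-weight average of per-channel speaker embeddings, (C, D) -> (D,)."""
    return channel_embs.mean(dim=0)

def cosine_score(enroll_emb: torch.Tensor, test_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity score between enrollment and test embeddings."""
    return F.cosine_similarity(enroll_emb, test_emb, dim=0)
```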
Machine learning in acoustics: theory and applications
Acoustic data provide scientific and engineering insights in fields ranging
from biology and communications to ocean and Earth science. We survey the
recent advances and transformative potential of machine learning (ML),
including deep learning, in the field of acoustics. ML is a broad family of
techniques, often grounded in statistics, for automatically detecting
and utilizing patterns in data. Relative to conventional acoustics and signal
processing, ML is data-driven. Given sufficient training data, ML can discover
complex relationships between features and desired labels or actions, or
between features themselves. With large volumes of training data, ML can
discover models describing complex acoustic phenomena such as human speech and
reverberation. ML in acoustics is rapidly developing with compelling results
and significant future promise. We first introduce ML, then highlight ML
developments in four acoustics research areas: source localization in speech
processing, source localization in ocean acoustics, bioacoustics, and
environmental sounds in everyday scenes.
Comment: Published with free access in the Journal of the Acoustical Society of
America, 27 Nov. 2019
Mic2Mic: Using Cycle-Consistent Generative Adversarial Networks to Overcome Microphone Variability in Speech Systems
Mobile and embedded devices are increasingly using microphones and
audio-based computational models to infer user context. A major challenge in
building systems that combine audio models with commodity microphones is to
guarantee their accuracy and robustness in the real world. Besides many
environmental dynamics, a primary factor that impacts the robustness of audio
models is microphone variability. In this work, we propose Mic2Mic -- a
machine-learned system component -- which resides in the inference pipeline of
audio models and reduces, in real time, the variability in audio data caused by
microphone-specific factors. Two key considerations for the design of Mic2Mic
were: a) to decouple the problem of microphone variability from the audio task,
and b) to put a minimal burden on end-users to provide training data. With these
in mind, we apply the principles of cycle-consistent generative adversarial
networks (CycleGANs) to learn Mic2Mic using unlabeled and unpaired data
collected from different microphones. Our experiments show that Mic2Mic can
recover between 66% and 89% of the accuracy lost due to microphone variability
for two common audio tasks.
Comment: Published at ACM IPSN 2019
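The heart of CycleGAN-style training on unpaired data is the cycle-consistency term. A minimal PyTorch sketch, with the generator interfaces `g_ab` and `g_ba` assumed for illustration:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(x_a: torch.Tensor, x_b: torch.Tensor, g_ab, g_ba):
    """Cycle-consistency term used in Mic2Mic-style training (minimal sketch):
    audio mapped from microphone A to B and back should reconstruct itself,
    which lets training use unpaired recordings from the two microphones.

    x_a, x_b: unpaired feature batches from microphones A and B
    g_ab:     generator mapping domain A -> B;  g_ba: B -> A
    """
    loss_a = F.l1_loss(g_ba(g_ab(x_a)), x_a)   # A -> B -> A reconstruction
    loss_b = F.l1_loss(g_ab(g_ba(x_b)), x_b)   # B -> A -> B reconstruction
    return loss_a + loss_b
```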