End-to-End Text-Dependent Speaker Verification
In this paper we present a data-driven, integrated approach to speaker
verification, which maps a test utterance and a few reference utterances
directly to a single score for verification and jointly optimizes the system's
components using the same evaluation protocol and metric as at test time. Such
an approach will result in simple and efficient systems, requiring little
domain-specific knowledge and making few model assumptions. We implement the
idea by formulating the problem as a single neural network architecture,
including the estimation of a speaker model on only a few utterances, and
evaluate it on our internal "Ok Google" benchmark for text-dependent speaker
verification. The proposed approach appears to be very effective for big data
applications like ours that require highly accurate, easy-to-maintain systems
with a small footprint.
Comment: submitted to ICASSP 201
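A minimal PyTorch sketch of the general idea (not the authors' exact architecture): embed each utterance, average a few enrollment embeddings into a speaker model, score the test utterance against it with a scaled cosine similarity, and backpropagate a verification loss through all components. The encoder, dimensions, and logistic-loss setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EndToEndVerifier(nn.Module):
    """Toy end-to-end verifier: embed utterances, average the enrollment
    embeddings into a speaker model, and score the test utterance against it."""
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, emb_dim, batch_first=True)  # stand-in encoder
        self.scale = nn.Parameter(torch.tensor(10.0))  # learned score scaling
        self.bias = nn.Parameter(torch.tensor(-5.0))

    def embed(self, x):                        # x: (batch, frames, feat_dim)
        _, (h, _) = self.encoder(x)
        return nn.functional.normalize(h[-1], dim=-1)

    def forward(self, enroll, test):
        # enroll: (n_enroll, frames, feat_dim), test: (1, frames, feat_dim)
        speaker_model = self.embed(enroll).mean(dim=0, keepdim=True)
        score = nn.functional.cosine_similarity(speaker_model, self.embed(test))
        return self.scale * score + self.bias  # logit used at train and test time

model = EndToEndVerifier()
enroll, test = torch.randn(3, 80, 40), torch.randn(1, 80, 40)
logit = model(enroll, test)
# The verification objective itself drives joint training of all components.
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(1))
loss.backward()
```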
A comprehensive study of batch construction strategies for recurrent neural networks in MXNet
In this work we compare different batch construction methods for mini-batch
training of recurrent neural networks. While popular implementations like
TensorFlow and MXNet suggest a bucketing approach to improve the
parallelization capabilities of the recurrent training process, we propose a
simple ordering strategy that arranges the training sequences in a
stochastically alternating sorted order. We compare our method to sequence
bucketing as well as various other batch construction strategies on the CHiME-4
noisy speech recognition corpus. The experiments show that our alternating
sorting approach is competitive in both training time and recognition
performance while being conceptually simpler to implement.
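One plausible reading of such an ordering strategy is sketched below in Python; the chunk size and the exact alternation scheme are assumptions, not the authors' specification.

```python
import random

def alternating_sorted_batches(sequences, batch_size, chunk_batches=16):
    """Group sequences into mini-batches of similar length by sorting random
    chunks alternately in ascending/descending length order (one reading of a
    'stochastically alternating sorted' ordering)."""
    order = list(range(len(sequences)))
    random.shuffle(order)                           # stochastic component
    chunk = chunk_batches * batch_size
    batches = []
    for i, start in enumerate(range(0, len(order), chunk)):
        part = order[start:start + chunk]
        # Alternate the sort direction so consecutive batches do not simply
        # run short-to-long across the whole epoch.
        part.sort(key=lambda j: len(sequences[j]), reverse=(i % 2 == 1))
        batches.extend(part[k:k + batch_size]
                       for k in range(0, len(part), batch_size))
    return batches   # list of index lists; padding within each batch stays small
```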
Self-supervised speaker embeddings
Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of
leveraging unlabelled utterances, due to the classification loss over training
speakers. In this paper, we explore an alternative training strategy to enable
the use of unlabelled utterances in training. We propose to train speaker
embedding extractors via reconstructing the frames of a target speech segment,
given the inferred embedding of another speech segment of the same utterance.
We do this by attaching to the standard speaker embedding extractor a decoder
network, which we feed not merely with the speaker embedding, but also with the
estimated phone sequence of the target frame sequence. The reconstruction loss
can be used either as a single objective, or be combined with the standard
speaker classification loss. In the latter case, it acts as a regularizer,
encouraging generalizability to speakers unseen during training. In all cases,
the proposed architectures are trained from scratch and in an end-to-end
fashion. We demonstrate the benefits of the proposed approach on VoxCeleb and
Speakers in the Wild, and we report notable improvements over the baseline.
Comment: Preprint. Submitted to Interspeech 2019. Updated results compared to
the first version and minor corrections.
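A rough PyTorch sketch of the described objective: a decoder receives the speaker embedding of one segment together with phone features of another segment of the same utterance and reconstructs that segment's frames, optionally combined with a speaker classification loss. All module sizes, the phone-feature dimension, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical components and sizes; the actual topologies in the paper differ.
embed_net   = nn.GRU(40, 256, batch_first=True)        # speaker embedding extractor
decoder_rnn = nn.GRU(256 + 64, 256, batch_first=True)  # embedding + phone features
decoder_out = nn.Linear(256, 40)                        # reconstructed frames
classifier  = nn.Linear(256, 1000)                      # optional speaker softmax

def training_loss(seg_a, seg_b_frames, seg_b_phones, speaker_id=None, alpha=1.0):
    """seg_a and seg_b come from the same utterance: the embedding of seg_a and
    the phone features of seg_b drive the decoder to reconstruct seg_b's frames."""
    _, h = embed_net(seg_a)
    emb = h[-1]                                          # (batch, 256)
    T = seg_b_frames.size(1)
    dec_in = torch.cat([emb.unsqueeze(1).expand(-1, T, -1), seg_b_phones], dim=-1)
    recon = decoder_out(decoder_rnn(dec_in)[0])
    loss = nn.functional.mse_loss(recon, seg_b_frames)   # reconstruction objective
    if speaker_id is not None:                           # optional regularizing term
        loss = loss + alpha * nn.functional.cross_entropy(classifier(emb), speaker_id)
    return loss
```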
I-vector Transformation Using Conditional Generative Adversarial Networks for Short Utterance Speaker Verification
I-vector based text-independent speaker verification (SV) systems often have
poor performance with short utterances, as the biased phonetic distribution in
a short utterance makes the extracted i-vector unreliable. This paper proposes
an i-vector compensation method using a generative adversarial network (GAN),
where its generator network is trained to generate a compensated i-vector from
a short-utterance i-vector and its discriminator network is trained to
determine whether an i-vector was generated by the generator or extracted from
a long utterance. Additionally, we assign two other learning tasks to the GAN
to stabilize its training and to make the generated i-vector more
speaker-specific. Speaker verification experiments on the NIST SRE 2008
"10sec-10sec" condition show that our method reduced the equal error rate by
11.3% relative to the conventional i-vector and PLDA system.
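A hedged PyTorch sketch of the compensation setup: a generator maps short-utterance i-vectors to compensated ones, a discriminator separates them from long-utterance i-vectors, and an auxiliary term stands in for the paper's additional learning tasks. Network sizes and the cosine auxiliary loss are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn

dim = 400                                       # assumed i-vector dimensionality
G = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))
D = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(short_iv, long_iv):
    """D learns to separate long-utterance i-vectors from compensated ones."""
    fake = G(short_iv).detach()
    return (bce(D(long_iv), torch.ones(long_iv.size(0), 1)) +
            bce(D(fake), torch.zeros(fake.size(0), 1)))

def generator_loss(short_iv, long_iv, lam=1.0):
    """G tries to fool D; the cosine term is an illustrative auxiliary task
    that pulls the compensated i-vector toward its long-utterance counterpart."""
    fake = G(short_iv)
    adv = bce(D(fake), torch.ones(fake.size(0), 1))
    aux = 1.0 - nn.functional.cosine_similarity(fake, long_iv).mean()
    return adv + lam * aux
```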
Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting
Speech recognition is a sequence prediction problem. Besides employing various
deep learning approaches for frame-level classification, sequence-level
discriminative training has proved indispensable for achieving state-of-the-art
performance in large vocabulary continuous speech recognition (LVCSR). However,
keyword spotting (KWS), one of the most common speech recognition tasks,
benefits almost exclusively from frame-level deep learning because competing
sequence hypotheses are difficult to obtain. The few studies on sequence
discriminative training for KWS are limited to fixed-vocabulary or LVCSR-based
methods and have not been compared to state-of-the-art deep learning based KWS
approaches. In this paper, a sequence discriminative training framework is
proposed for both fixed-vocabulary and unrestricted acoustic KWS. Sequence
discriminative training for both sequence-level generative and discriminative
models is systematically investigated. By introducing word-independent phone
lattices or non-keyword blank symbols to construct competing hypotheses,
feasible and efficient sequence discriminative training approaches are proposed
for acoustic KWS. Experiments showed that the proposed approaches obtained
consistent and significant improvements in both fixed-vocabulary and
unrestricted KWS tasks, compared to previous frame-level deep learning based
acoustic KWS methods.
Comment: accepted by Speech Communication, 08/02/201
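To illustrate the shape of such a criterion, the sketch below shows a generic MMI-style sequence discriminative loss in which the keyword path competes against hypotheses that could come from a word-independent phone lattice or blank/filler paths; it is a toy formulation, not the paper's exact objective.

```python
import torch

def mmi_style_loss(keyword_logprob, competitor_logprobs):
    """Maximize the keyword path's share of the probability mass against
    competing hypotheses; inputs are per-utterance log path scores."""
    scores = torch.cat([keyword_logprob.unsqueeze(0), competitor_logprobs])
    return -(keyword_logprob - torch.logsumexp(scores, dim=0))

# Toy usage with made-up path scores for one utterance.
loss = mmi_style_loss(torch.tensor(-12.3),                  # keyword hypothesis
                      torch.tensor([-14.1, -13.0, -15.7]))  # competing paths
```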
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
This paper describes the NTNU ASR system participating in the Interspeech
2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD
group of ISCA. This ASR shared task is made much more challenging by the
combined diversity of non-native and children's speaking characteristics. In
the setting of closed-track evaluation, all participants were restricted to
developing their systems using only the speech and text corpora provided by the
organizer. To work around this resource limitation, we built our ASR system on
top of CNN-TDNNF-based acoustic models, while harnessing the synergistic power
of various data augmentation strategies, including both utterance- and
word-level speed perturbation and spectrogram augmentation, alongside a simple
yet effective data-cleansing approach. All variants of our ASR system employed
an RNN-based language model to rescore the first-pass recognition hypotheses,
which was trained solely on the text dataset released by the organizer. Our
system with the best configuration came in second place with a word error rate
(WER) of 17.59%, while the top-performing, second runner-up, and official
baseline systems achieved 15.67%, 18.71%, and 35.09%, respectively.
Comment: Submitted to Interspeech 2020 Special Session: Shared Task on
Automatic Speech Recognition for Non-Native Children's Speech
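As an illustration of the spectrogram augmentation component, a minimal SpecAugment-style time/frequency masking routine is sketched below; the mask counts and widths are assumptions, not the team's configuration.

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, freq_width=15, n_time_masks=2, time_width=50):
    """Mask random frequency bands and time spans of a (frames, mel_bins)
    spectrogram; counts and widths here are illustrative defaults."""
    spec = spec.copy()
    frames, bins = spec.shape
    fill = spec.mean()
    for _ in range(n_freq_masks):
        w = np.random.randint(0, freq_width + 1)
        f0 = np.random.randint(0, max(1, bins - w))
        spec[:, f0:f0 + w] = fill
    for _ in range(n_time_masks):
        w = np.random.randint(0, time_width + 1)
        t0 = np.random.randint(0, max(1, frames - w))
        spec[t0:t0 + w, :] = fill
    return spec
```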
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
In practical settings, a speaker recognition system needs to identify a
speaker given a short utterance, while the enrollment utterance may be
relatively long. However, existing speaker recognition models perform poorly
with such short utterances. To solve this problem, we introduce a meta-learning
framework for imbalanced length pairs. Specifically, we use a Prototypical
Network and train it with a support set of long utterances and a query set of
short utterances of varying lengths. Further, since optimizing only for the
classes in the given episode may be insufficient for learning discriminative
embeddings for unseen classes, we additionally require the model to classify
both the support and the query set against the entire set of classes in the
training set. By combining these two learning schemes, our model outperforms
existing state-of-the-art speaker verification models learned with a standard
supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb
datasets. We also validate our proposed model for unseen speaker
identification, on which it also achieves significant performance gains over
the existing approaches. The code is available at
https://github.com/seongmin-kye/meta-SR.
Comment: Accepted to Interspeech 2020. The code is available at
https://github.com/seongmin-kye/meta-SR.
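A simplified PyTorch sketch of the two combined objectives: an episodic prototypical loss over long-utterance support and short-utterance query embeddings, plus a global softmax over all training speakers. The helper names, shapes, and the weighting factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

def combined_loss(support_emb, support_lbl, query_emb, query_lbl,
                  global_head, support_spk, query_spk, alpha=1.0):
    """Prototypical episode loss (support: long utterances, query: short ones)
    plus a global softmax over the full training-speaker inventory."""
    classes = support_lbl.unique()
    # One prototype per episode class = mean of its support embeddings.
    protos = torch.stack([support_emb[support_lbl == c].mean(0) for c in classes])
    logits = -torch.cdist(query_emb, protos) ** 2    # negative squared distance
    targets = torch.stack([(classes == c).nonzero().squeeze() for c in query_lbl])
    proto_loss = nn.functional.cross_entropy(logits, targets)
    # Classify every embedding against all training speakers, not just the episode's.
    all_emb = torch.cat([support_emb, query_emb])
    all_spk = torch.cat([support_spk, query_spk])
    global_loss = nn.functional.cross_entropy(global_head(all_emb), all_spk)
    return proto_loss + alpha * global_loss
```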
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition
(ASR), significant performance degradation is still observed when recognizing
multi-talker mixed speech. In this paper, we propose and evaluate several
architectures to address this problem under the assumption that only a single
channel of mixed signal is available. Our technique extends permutation
invariant training (PIT) by introducing the front-end feature separation module
with the minimum mean square error (MSE) criterion and the back-end recognition
module with the minimum cross entropy (CE) criterion. More specifically, during
training we compute the average MSE or CE over the whole utterance for each
possible utterance-level output-target assignment, pick the one with the
minimum MSE or CE, and optimize for that assignment. This strategy elegantly
solves the label permutation problem observed in the deep learning based
multi-talker mixed speech separation and recognition systems. The proposed
architectures are evaluated and compared on an artificially mixed AMI dataset
with both two- and three-talker mixed speech. The experimental results indicate
that our proposed architectures can reduce the word error rate (WER) by 45.0%
and 25.0% relative to the state-of-the-art single-talker speech recognition
system across all speakers when their energies are comparable, for two- and
three-talker mixed speech, respectively. To our knowledge, this is the first
work on multi-talker mixed speech recognition for the challenging
speaker-independent spontaneous large vocabulary continuous speech task.
Comment: 11 pages, 6 figures, Submitted to IEEE/ACM Transactions on Audio,
Speech and Language Processing. arXiv admin note: text overlap with
arXiv:1704.0198
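The utterance-level assignment step can be sketched as follows (a toy PyTorch version, not the exact system): evaluate the loss for every output-target permutation and optimize the minimum.

```python
import itertools
import torch

def pit_loss(outputs, targets, criterion=torch.nn.functional.cross_entropy):
    """outputs: list of per-speaker output streams, each (frames, classes);
    targets: list of per-speaker label sequences, each (frames,).
    The utterance-level loss is computed for every output-target assignment
    and the minimum over permutations is returned for optimization."""
    n = len(outputs)
    best = None
    for perm in itertools.permutations(range(n)):
        loss = sum(criterion(outputs[i], targets[p]) for i, p in enumerate(perm)) / n
        if best is None or loss < best:
            best = loss
    return best
```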
Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings
The short duration of an input utterance is one of the most critical threats
that degrade the performance of speaker verification systems. This study aimed
to develop an integrated text-independent speaker verification system that
accepts input utterances with a short duration of 2 seconds or less. We propose
an approach using a teacher-student learning framework for this goal, applied,
to the best of our knowledge, to short utterance compensation for the first
time. The core concept of the proposed system is to conduct the compensation
throughout the network that extracts the speaker embedding, mainly at the
phonetic level, rather than compensating via a separate system after extracting
the speaker embedding. In the proposed architecture, phonetic-level features,
each representing a 130 ms segment, are extracted using convolutional layers. A
layer of gated recurrent units extracts an utterance-level feature from the
phonetic-level features. The proposed approach also adopts a new objective
function for teacher-student learning that considers both the Kullback-Leibler
divergence of the output layers and the cosine distance of the speaker
embedding layers. Experiments were conducted using deep neural networks that
take raw waveforms as input and output speaker embeddings, on the VoxCeleb1
dataset. The proposed model could compensate for approximately 65% of the
performance degradation due to the shortened duration.
Comment: 5 pages, 2 figures, submitted to Interspeech 2019 as a conference
paper
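A minimal PyTorch sketch of such a combined teacher-student objective, with the teacher fed the full-length utterance and the student the truncated one; the weighting factor is an illustrative assumption.

```python
import torch
import torch.nn as nn

def ts_compensation_loss(student_logits, teacher_logits,
                         student_emb, teacher_emb, beta=1.0):
    """Student sees the truncated utterance, teacher the full-length one:
    KL divergence between output posteriors plus cosine distance between
    the two speaker embeddings."""
    kl = nn.functional.kl_div(
        nn.functional.log_softmax(student_logits, dim=-1),
        nn.functional.softmax(teacher_logits, dim=-1).detach(),
        reduction="batchmean")
    cos = 1.0 - nn.functional.cosine_similarity(
        student_emb, teacher_emb.detach()).mean()
    return kl + beta * cos
```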
Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling
Automatic speech recognition (ASR) systems often make unrecoverable errors
due to subsystem pruning (acoustic, language and pronunciation models); for
example, pruning words based on acoustics using short-term context prior to
rescoring with long-term linguistic context. In this work we model
ASR as a phrase-based noisy transformation channel and propose an error
correction system that can learn from the aggregate errors of all the
independent modules constituting the ASR and attempt to invert them. The
proposed system can exploit long-term context using a neural network language
model and can better choose between existing ASR output possibilities as well
as re-introduce previously pruned or unseen (out-of-vocabulary) phrases. It
provides corrections under poorly performing ASR conditions without degrading
accurate transcriptions; such corrections are larger for out-of-domain and
mismatched-data ASR. Our system consistently provides improvements over the
baseline ASR, even when the baseline is further optimized
through recurrent neural network language model rescoring. This demonstrates
that any ASR improvements can be exploited independently and that our proposed
system can potentially still provide benefits on highly optimized ASR. Finally,
we present an extensive analysis of the types of errors corrected by our system
- …