Adversarial Training for Multi-domain Speaker Recognition
In real-life applications, the performance of speaker recognition systems
always degrades when there is a mismatch between training and evaluation data.
Many domain adaptation methods have been successfully applied to reduce such
domain mismatch in speaker recognition. However, both the training and the
evaluation data are themselves often composed of several subsets, and the
internal variation of each dataset can likewise be treated as a set of
domains. Differently distributed subsets within the source or target data
therefore create multi-domain mismatches, which also harm speaker recognition
performance. In this study, we propose adversarial training for multi-domain
speaker recognition to address both the domain mismatch and the dataset
variance problems. With the proposed method, we obtain speech representations
that are both multi-domain-invariant and speaker-discriminative. Experimental
results on the DAC13 dataset indicate that the proposed method is not only
effective in solving the multi-domain mismatch problem, but also outperforms
the compared unsupervised domain adaptation methods.
Comment: 5 pages, 2 figures
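The abstract does not include code, but the standard mechanism behind this
kind of adversarial training is a gradient reversal layer (GRL) between the
embedding network and a domain classifier. A minimal PyTorch sketch of that
pattern; all dimensions and module names here are illustrative, not taken
from the paper:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the
    backward pass, so the encoder learns to *confuse* the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.lambd, None

class MultiDomainAdversarialNet(nn.Module):
    # Hypothetical sizes; num_domains counts all subsets of the source and
    # target data that are treated as separate domains.
    def __init__(self, feat_dim=512, emb_dim=256, num_speakers=1000, num_domains=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.speaker_head = nn.Linear(emb_dim, num_speakers)
        self.domain_head = nn.Linear(emb_dim, num_domains)

    def forward(self, x, lambd=1.0):
        emb = self.encoder(x)
        spk_logits = self.speaker_head(emb)  # speaker-discriminative branch
        dom_logits = self.domain_head(GradReverse.apply(emb, lambd))  # domain-invariant via GRL
        return spk_logits, dom_logits
```

Training minimizes the speaker loss plus the domain loss; the GRL flips the
domain gradient inside the encoder, which is what drives domain invariance.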
Speaker Recognition Based on Deep Learning: An Overview
Speaker recognition is the task of identifying persons from their voices.
Recently, deep learning has revolutionized speaker recognition. However,
comprehensive reviews of this progress are still lacking.
In this paper, we review several major subtasks of speaker recognition,
including speaker verification, identification, diarization, and robust speaker
recognition, with a focus on deep-learning-based methods. Because the major
advantage of deep learning over conventional methods is its representation
ability, which is able to produce highly abstract embedding features from
utterances, we first pay close attention to deep-learning-based speaker feature
extraction, including the inputs, network structures, temporal pooling
strategies, and objective functions, which are the fundamental
components of many speaker recognition subtasks. Then, we make an overview of
speaker diarization, with an emphasis on recent supervised, end-to-end, and
online diarization. Finally, we survey robust speaker recognition from the
perspectives of domain adaptation and speech enhancement, two major
approaches to dealing with domain mismatch and noise. Popular and recently
released corpora are listed at the end of the paper.
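As a concrete instance of the temporal pooling strategies such surveys cover,
statistics pooling (popularized by x-vector systems) maps variable-length
frame-level features to a fixed utterance-level vector. A minimal sketch,
assuming PyTorch and a (batch, time, dim) feature layout:

```python
import torch

def statistics_pooling(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Pool frame-level features (batch, time, dim) into a fixed-size
    utterance-level vector by concatenating the per-dimension mean and
    standard deviation over the time axis."""
    mean = frames.mean(dim=1)
    std = frames.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=-1)  # shape: (batch, 2 * dim)
```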
Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network
Using deep learning approaches, Speech Emotion Recognition (SER) on a single
domain has achieved many excellent results. However, cross-domain SER remains
a challenging task due to the distribution shift between source and target
domains. In this work, we propose a Domain Adversarial Neural Network (DANN)
based approach to mitigate this distribution shift problem for cross-lingual
SER. Specifically, we add a language classifier and a gradient reversal layer
after the feature extractor to force the learned representation to be both
language-independent and emotion-meaningful. Our method is unsupervised,
i.e., labels on the target language are not required, which makes it easy to
apply our method to other languages. Experimental results show that the
proposed method provides an average absolute improvement of 3.91% over the
baseline system on the arousal and valence classification tasks. Furthermore,
we find that batch normalization is beneficial to the performance gain of
DANN. We therefore also explore the effect of different ways of combining
data for batch normalization.
Comment: This paper has been accepted by ISCSLP202
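Building on the gradient reversal sketch given earlier, the cross-lingual
setup adds a language classifier after the feature extractor; emotion labels
are needed only for the source language. A hedged sketch of one training
step, assuming a model shaped like the earlier sketch but with an emotion
head and a GRL-fed language head (loss weighting and all names are
illustrative):

```python
import torch.nn.functional as F

def dann_ser_step(model, src_x, src_emotion, src_lang, tgt_x, tgt_lang, lambd=1.0):
    # Emotion loss uses labeled source data only; the language loss flows
    # through the gradient reversal layer for both languages, pushing the
    # shared features to be language-independent yet emotion-meaningful.
    src_emo_logits, src_lang_logits = model(src_x, lambd)
    _, tgt_lang_logits = model(tgt_x, lambd)  # no target emotion labels used
    emotion_loss = F.cross_entropy(src_emo_logits, src_emotion)
    language_loss = (F.cross_entropy(src_lang_logits, src_lang)
                     + F.cross_entropy(tgt_lang_logits, tgt_lang))
    return emotion_loss + language_loss
```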
NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge
This paper describes the NPU system submitted to Interspeech 2020 Far-Field
Speaker Verification Challenge (FFSVC). We particularly focus on far-field
text-dependent SV from single (task1) and multiple microphone arrays (task3).
The major challenges in such scenarios are short utterances and cross-channel
and distance mismatches between enrollment and test. Believing that better
speaker embeddings can alleviate the effects of short utterances, we introduce
a new speaker embedding architecture - ResNet-BAM, which integrates a
bottleneck attention module with ResNet as a simple and efficient way to
further improve the representation power of ResNet. This contribution brings
up to a 1% EER reduction. We further address the mismatch problem in three
directions. First, domain adversarial training, which aims to learn
domain-invariant features, can yield a 0.8% EER reduction. Second, front-end
signal processing, including WPE and beamforming, has no obvious contribution,
but together with data selection and domain adversarial training can
contribute a further 0.5% EER reduction. Finally, data augmentation, which
works with a specifically designed data selection strategy, can lead to a 2%
EER reduction.
Together with the above contributions, in the middle challenge results, our
single submission system (without multi-system fusion) achieves the first and
second place on task 1 and task 3, respectively.
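The system description above gives no implementation of ResNet-BAM, but the
bottleneck attention module it integrates (Park et al., 2018) refines a
feature map with parallel channel and spatial attention branches applied
residually. A rough sketch, with the reduction ratio and dilation guessed
rather than taken from the paper:

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """Bottleneck attention module: parallel channel and spatial attention
    maps are summed, squashed with a sigmoid, and applied residually."""
    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        # Channel branch: global average pool followed by a bottleneck MLP.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # Spatial branch: channel reduction, a dilated conv for context,
        # then projection to a single attention map.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.Conv2d(channels // reduction, channels // reduction,
                      kernel_size=3, padding=dilation, dilation=dilation),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, 1, kernel_size=1))

    def forward(self, x):
        ca = self.channel_att(x).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        sa = self.spatial_att(x)                              # (B, 1, H, W)
        return x * (1 + torch.sigmoid(ca + sa))               # residual gating
```

In ResNet-BAM style usage, a module like this would sit between residual
stages of the ResNet trunk.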
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends
Research on speech processing has traditionally considered the task of
designing hand-engineered acoustic features (feature engineering) as a separate
distinct problem from the task of designing efficient machine learning (ML)
models to make prediction and classification decisions. There are two main
drawbacks to this approach: first, manual feature engineering is cumbersome
and requires human knowledge; and second, the designed features might not be
best for the objective at hand. This has motivated a recent trend in the
speech community towards representation learning techniques, which can
automatically learn an intermediate representation of the input signal that
better suits the task at hand and hence leads to improved performance. The
significance of representation learning has increased
with advances in deep learning (DL), where the representations are more useful
and less dependent on human knowledge, making them well suited to tasks like
classification and prediction. The main contribution of this paper is to
present an up-to-date and comprehensive survey on different techniques of
speech representation learning by bringing together the scattered research
across three distinct research areas including Automatic Speech Recognition
(ASR), Speaker Recognition (SR), and Speech Emotion Recognition (SER). Recent
reviews in speech have been conducted for ASR, SR, and SER; however, none of
them has focused on representation learning from speech, a gap that our
survey aims to bridge.
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) has recently become an important challenge
for deep learning (DL). It requires large-scale training datasets and high
computational and storage resources. Moreover, DL techniques, and machine
learning (ML) approaches in general, assume that training and testing data
come from the same domain, with the same input feature space and data
distribution characteristics. This assumption, however, does not hold in some
real-world artificial intelligence (AI) applications. Moreover, there are
situations where gathering real data is challenging, expensive, or the data
occur only rarely, so the data requirements of DL models cannot be met. Deep
transfer learning (DTL) has been introduced to overcome these issues; it
helps develop high-performing models using real datasets that are small or
slightly different from, but related to, the training data. This paper
presents a comprehensive
survey of DTL-based ASR frameworks to shed light on the latest developments and
helps academics and professionals understand current challenges. Specifically,
after presenting the DTL background, a well-designed taxonomy is adopted to
organize the state of the art. A critical analysis is then conducted to
identify the limitations and advantages of each framework. Finally, a
comparative study highlights the current challenges before deriving
opportunities for future research.
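Of the DTL recipes such a survey covers, the most common one, fine-tuning a
pre-trained acoustic model on a small in-domain dataset with most layers
frozen, fits in a few lines. A generic PyTorch sketch in which the model and
layer names are placeholders:

```python
import torch

def prepare_for_finetuning(model: torch.nn.Module, trainable_prefixes=("classifier",)):
    """Freeze a pre-trained acoustic model except for parameters whose names
    start with one of `trainable_prefixes`, then return the trainable ones
    for fine-tuning on the small target-domain dataset."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return [p for p in model.parameters() if p.requires_grad]

# Hypothetical usage: optimizer over the unfrozen head only.
# optimizer = torch.optim.Adam(prepare_for_finetuning(model), lr=1e-4)
```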
Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification
In far-field speaker verification, the performance of speaker embeddings is
susceptible to degradation when there is a mismatch between the conditions of
enrollment and test speech. To solve this problem, we propose feature-level
and instance-level transfer learning in a teacher-student framework to learn
a domain-invariant embedding space. For feature-level knowledge transfer,
we develop a contrastive loss to transfer knowledge from the teacher model to
the student model, which not only decreases the intra-class distance but also
enlarges the inter-class distance. Moreover, we propose an instance-level
pairwise distance transfer method that forces the student model to preserve
pairwise distances between instances from the well-optimized embedding space
of the teacher model. On the FFSVC 2020 evaluation set, our EER on Full-eval
trials is
relatively reduced by 13.9% compared with the fusion system result on
Partial-eval trials of Task2. On Task1, compared with the winner's DenseNet
result on Partial-eval trials, our minDCF on Full-eval trials is relatively
reduced by 6.3%. On Task3, the EER and minDCF of our proposed method on
Full-eval trials are very close to the result of the fusion system on
Partial-eval trials. Our results also outperform other competitive domain
adaptation methods.
Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages
(Short version of Abstract) This thesis describes an investigation on
unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in
the zero-resource scenario, where only untranscribed speech data is assumed to
be available. UAM is not only important in addressing the general problem of
data scarcity in ASR technology development but also essential to many
non-mainstream applications, for example, language protection, language
acquisition, and pathological speech assessment. The present study is focused on
two research problems. The first problem concerns unsupervised discovery of
basic (subword level) speech units in a given language. Under the zero-resource
condition, the speech units could be inferred only from the acoustic signals,
without requiring or involving any linguistic direction and/or constraints. The
second problem is referred to as unsupervised subword modeling. In essence,
a frame-level feature representation needs to be learned from untranscribed
speech. The learned feature representation is the basis of subword unit
discovery; it should be linguistically discriminative and robust to
non-linguistic factors. In particular, extensive use of cross-lingual
knowledge in subword unit discovery and modeling is a focus of this research.
Comment: Ph.D. Thesis submitted in May 2020 in partial fulfilment of the
requirements for the Degree of Doctor of Philosophy in Electronic
Engineering, The Chinese University of Hong Kong (CUHK). 134 pages
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP
2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning
based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 201
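Of the end-to-end criteria surveyed, CTC has the most standardized interface
in modern toolkits. A hedged example of evaluating the CTC loss on a batch of
acoustic frames with PyTorch's built-in implementation (all shapes and sizes
are illustrative):

```python
import torch

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

T, B, V = 100, 4, 30              # frames, batch size, vocab size (incl. blank)
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)  # stand-in network output
targets = torch.randint(1, V, (B, 12))                # label sequences (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```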
Generating Multilingual Voices Using Speaker Space Translation Based on Bilingual Speaker Data
We present progress towards bilingual Text-to-Speech that is able to
transform a monolingual voice to speak a second language while preserving
speaker voice quality. We demonstrate that a bilingual speaker embedding space
contains a separate distribution for each language and that a simple transform
in speaker space generated by the speaker embedding can be used to control the
degree of accent of a synthetic voice in a language. The same transform can be
applied even to monolingual speakers.
In our experiments, speaker data from an English-Spanish (Mexican) bilingual
speaker was used, and the goal was to enable English speakers to speak Spanish
and Spanish speakers to speak English. We found that the simple transform was
sufficient to convert a voice from one language to the other with a high degree
of naturalness. In one case the transformed voice outperformed a native
language voice in listening tests. Experiments further indicated that the
transform preserved many of the characteristics of the original voice. The
degree of accent present can be controlled and naturalness is relatively
consistent across a range of accent values.
Comment: Accepted to IEEE ICASSP 202
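One way to read the "simple transform in speaker space" described above:
translate a speaker embedding along the vector between the per-language
embedding distributions, scaling the shift to control accent strength. A
sketch of that interpretation (the per-language means and the scaling are
assumptions, not the authors' exact transform):

```python
import numpy as np

def translate_speaker(embedding: np.ndarray,
                      english_mean: np.ndarray,
                      spanish_mean: np.ndarray,
                      accent: float = 1.0) -> np.ndarray:
    """Move a speaker embedding from the English region of speaker space
    toward the Spanish region; `accent` in [0, 1] scales the shift and hence
    the degree of accent in the synthesized voice."""
    shift = spanish_mean - english_mean
    return embedding + accent * shift

# Hypothetical usage, with per-language means estimated from the bilingual
# speaker's embeddings:
# es_embedding = translate_speaker(en_embedding, en_mean, es_mean, accent=0.8)
```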