Unsupervised speech representation learning using WaveNet autoencoders
We consider the task of unsupervised extraction of meaningful latent
representations of speech by applying autoencoding neural networks to speech
waveforms. The goal is to learn a representation able to capture high-level
semantic content from the signal, e.g., phoneme identities, while being
invariant to confounding low-level details in the signal such as the underlying
pitch contour or background noise. Since the learned representation is tuned to
contain only phonetic content, we resort to using a high-capacity WaveNet
decoder to infer information discarded by the encoder from previous samples.
Moreover, the behavior of autoencoder models depends on the kind of constraint
that is applied to the latent representation. We compare three variants: a
simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder
(VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of
learned representations in terms of speaker independence, the ability to
predict phonetic content, and the ability to accurately reconstruct individual
spectrogram frames. Moreover, for discrete encodings extracted using the
VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a
regularization scheme that forces the representations to focus on the phonetic
content of the utterance and report performance comparable with the top entries
in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
Comment: Accepted to IEEE TASLP, final version available at http://dx.doi.org/10.1109/TASLP.2019.293886
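To make the discrete constraint concrete, the following is a minimal sketch of
a VQ-VAE style quantization bottleneck with a straight-through gradient
estimator, in the spirit of the variant compared above. The codebook size,
latent dimension, and commitment cost are illustrative assumptions, not the
authors' settings.

```python
# Minimal sketch of a VQ-VAE quantization bottleneck (straight-through
# estimator). Dimensions and codebook size are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment cost

    def forward(self, z_e):  # z_e: (batch, time, dim) encoder outputs
        # Squared distance from each frame to every codebook entry.
        dist = (z_e.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = dist.argmin(-1)            # discrete latent ids, (batch, time)
        z_q = self.codebook(codes)         # quantized vectors
        # Codebook + commitment losses; straight-through gradient to encoder.
        loss = F.mse_loss(z_q, z_e.detach()) \
             + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes, loss
```

The dimensionality-reduction and Gaussian VAE variants compared in the paper
differ only in what replaces this quantization step.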
Massively Multilingual Adversarial Speech Recognition
We report on adaptation of multilingual end-to-end speech recognition models
trained on as many as 100 languages. Our findings shed light on the relative
importance of similarity between the target and pretraining languages along the
dimensions of phonetics, phonology, language family, geographical location, and
orthography. In this context, experiments demonstrate the effectiveness of two
additional pretraining objectives in encouraging language-independent encoder
representations: a context-independent phoneme objective paired with a
language-adversarial classification objective.
Comment: Accepted at NAACL-HLT 201
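A language-adversarial classification objective of this kind is commonly
realized with a gradient reversal layer; the sketch below shows one plausible
form. The pooling, dimensions, and reversal weight are assumptions for
illustration, not the paper's exact architecture.

```python
# Sketch of a language-adversarial objective via gradient reversal: the
# classifier learns to predict the language, while reversed gradients push
# the encoder toward language-independent representations.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # flip gradient sign toward encoder

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class AdversarialLanguageHead(nn.Module):
    def __init__(self, enc_dim=256, num_langs=100):
        super().__init__()
        self.clf = nn.Linear(enc_dim, num_langs)

    def forward(self, enc_states, lang_ids, lam=1.0):
        pooled = enc_states.mean(dim=1)               # (batch, enc_dim)
        logits = self.clf(grad_reverse(pooled, lam))  # reversed gradient path
        return nn.functional.cross_entropy(logits, lang_ids)
```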
Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages
(Short version of Abstract) This thesis describes an investigation on
unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in
the zero-resource scenario, where only untranscribed speech data is assumed to
be available. UAM is not only important in addressing the general problem of
data scarcity in ASR technology development but also essential to many
non-mainstream applications, for example, language protection, language
acquisition and pathological speech assessment. The present study is focused on
two research problems. The first problem concerns unsupervised discovery of
basic (subword level) speech units in a given language. Under the zero-resource
condition, the speech units could be inferred only from the acoustic signals,
without requiring or involving any linguistic guidance or constraints. The
second problem is referred to as unsupervised subword modeling. In essence,
a frame-level feature representation needs to be learned from untranscribed
speech. The learned feature representation is the basis of subword unit
discovery. It is desired to be linguistically discriminative and robust to
non-linguistic factors. In particular, extensive use of cross-lingual
knowledge in subword unit discovery and modeling is a focus of this research.
Comment: Ph.D. thesis submitted in May 2020 in partial fulfilment of the requirements for the Degree of Doctor of Philosophy in Electronic Engineering, The Chinese University of Hong Kong (CUHK). 134 pages
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," vol. 4, no. 3, IEEE/CAA Journal of Automatica Sinica, 201
Speaker Recognition Based on Deep Learning: An Overview
Speaker recognition is a task of identifying persons from their voices.
Recently, deep learning has dramatically revolutionized speaker recognition.
However, there is a lack of comprehensive reviews of this exciting progress.
In this paper, we review several major subtasks of speaker recognition,
including speaker verification, identification, diarization, and robust speaker
recognition, with a focus on deep-learning-based methods. Because the major
advantage of deep learning over conventional methods is its representation
ability, which is able to produce highly abstract embedding features from
utterances, we first pay close attention to deep-learning-based speaker feature
extraction, including the inputs, network structures, temporal pooling
strategies, and objective functions, which are the fundamental
components of many speaker recognition subtasks. Then, we make an overview of
speaker diarization, with an emphasis on recent supervised, end-to-end, and
online diarization. Finally, we survey robust speaker recognition from the
perspectives of domain adaptation and speech enhancement, which are two major
approaches to dealing with domain mismatch and noise problems. Popular and
recently released corpora are listed at the end of the paper.
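Among the temporal pooling strategies such a review covers, statistics
pooling is a common choice in x-vector style systems; a minimal sketch, with
assumed shapes, follows.

```python
# Statistics pooling: frame-level features are summarized by their mean and
# standard deviation to form a fixed-size utterance-level embedding.
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    def forward(self, frames):  # frames: (batch, time, feat_dim)
        mean = frames.mean(dim=1)
        std = frames.std(dim=1)
        return torch.cat([mean, std], dim=-1)  # (batch, 2 * feat_dim)

pool = StatsPooling()
utt_embedding = pool(torch.randn(8, 300, 512))  # illustrative shapes
print(utt_embedding.shape)                      # torch.Size([8, 1024])
```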
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
Single-channel, speaker-independent speech separation methods have recently
seen great progress. However, the accuracy, latency, and computational cost of
such methods remain insufficient. The majority of the previous methods have
formulated the separation problem through the time-frequency representation of
the mixed signal, which has several drawbacks, including the decoupling of the
phase and magnitude of the signal, the suboptimality of time-frequency
representation for speech separation, and the long latency in calculating the
spectrograms. To address these shortcomings, we propose a fully-convolutional
time-domain audio separation network (Conv-TasNet), a deep learning framework
for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder
to generate a representation of the speech waveform optimized for separating
individual speakers. Speaker separation is achieved by applying a set of
weighting functions (masks) to the encoder output. The modified encoder
representations are then inverted back to the waveforms using a linear decoder.
The masks are found using a temporal convolutional network (TCN) consisting of
stacked 1-D dilated convolutional blocks, which allows the network to model the
long-term dependencies of the speech signal while maintaining a small model
size. The proposed Conv-TasNet system significantly outperforms previous
time-frequency masking methods in separating two- and three-speaker mixtures.
Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude
masks in two-speaker speech separation as evaluated by both objective
distortion measures and subjective quality assessment by human listeners.
Finally, Conv-TasNet has a significantly smaller model size and a shorter
minimum latency, making it a suitable solution for both offline and real-time
speech separation applications.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. This version is the authors' version and may vary from the final publication in detail.
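The encoder-mask-decoder pipeline described above can be outlined as follows.
This is a structural sketch only: the mask network is a one-layer stand-in
for the paper's stacked dilated-convolution TCN, and all hyperparameters are
assumptions.

```python
# Skeleton of the Conv-TasNet pipeline: linear encoder, per-speaker masks,
# linear decoder back to the waveform domain.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TasNetSkeleton(nn.Module):
    def __init__(self, n_filters=512, kernel=16, n_speakers=2):
        super().__init__()
        stride = kernel // 2
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Stand-in for the TCN mask estimator in the paper.
        self.mask_net = nn.Conv1d(n_filters, n_filters * n_speakers, 1)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride,
                                          bias=False)
        self.n_speakers = n_speakers

    def forward(self, mix):                      # mix: (batch, 1, samples)
        w = F.relu(self.encoder(mix))            # (batch, filters, frames)
        masks = torch.sigmoid(self.mask_net(w))  # one mask per speaker
        masks = masks.chunk(self.n_speakers, dim=1)
        # Apply each mask and invert back to the time domain.
        return [self.decoder(w * m) for m in masks]
```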
Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages
State-of-the-art spoken language identification (LID) systems, which are
based on end-to-end deep neural networks, have shown remarkable success not
only in discriminating between distant languages but also between
closely-related languages or even different spoken varieties of the same
language. However, it is still unclear to what extent neural LID models
generalize to speech samples with different acoustic conditions due to domain
shift. In this paper, we present a set of experiments to investigate the impact
of domain mismatch on the performance of neural LID systems for a subset of six
Slavic languages across two domains (read speech and radio broadcast) and
examine two low-level signal descriptors (spectral and cepstral features) for
this task. Our experiments show that (1) out-of-domain speech samples severely
hinder the performance of neural LID models, and (2) while both spectral and
cepstral features show comparable performance within-domain, spectral features
show more robustness under domain mismatch. Moreover, we apply unsupervised
domain adaptation to minimize the discrepancy between the two domains in our
study. We achieve relative accuracy improvements that range from 9% to 77%
depending on the diversity of acoustic conditions in the source domain.
Comment: To appear in INTERSPEECH 202
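One standard way to minimize the discrepancy between two domains without
target labels is a maximum mean discrepancy (MMD) penalty on pooled features;
the sketch below illustrates that generic approach and is not a reproduction
of the paper's specific adaptation method.

```python
# Generic unsupervised domain-adaptation loss: squared MMD with an RBF kernel
# between source-domain and target-domain feature batches.
import torch

def rbf_kernel(x, y, sigma=1.0):
    d2 = torch.cdist(x, y).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(src_feats, tgt_feats, sigma=1.0):
    """Squared MMD between two batches of utterance-level features."""
    k_ss = rbf_kernel(src_feats, src_feats, sigma).mean()
    k_tt = rbf_kernel(tgt_feats, tgt_feats, sigma).mean()
    k_st = rbf_kernel(src_feats, tgt_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# Illustrative usage: features from read speech vs. radio broadcast.
loss = mmd_loss(torch.randn(32, 256), torch.randn(32, 256))
```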
Adversarial Transfer Learning for Punctuation Restoration
Previous studies demonstrate that word embeddings and part-of-speech (POS)
tags are helpful for punctuation restoration tasks. However, two drawbacks
remain. One is that word embeddings are pre-trained with unidirectional
language modeling objectives, so they contain only left-to-right context
information. The other is that POS tags are provided by an external POS
tagger, which increases computation cost, and incorrectly predicted tags may
hurt the restoration of punctuation marks during decoding. This paper
proposes adversarial transfer learning to address these
problems. A pre-trained bidirectional encoder representations from transformers
(BERT) model is used to initialize a punctuation model. Thus the transferred
model parameters carry both left-to-right and right-to-left representations.
Furthermore, adversarial multi-task learning is introduced to learn
task-invariant knowledge for punctuation prediction. We use an extra POS tagging
task to help train the punctuation prediction task. Adversarial training is
utilized to prevent the shared parameters from containing task-specific
information. Only the punctuation prediction task is used to restore marks
during the decoding stage, so no extra computation is needed and no incorrect
tags are introduced by the POS tagger. Experiments are conducted on the
IWSLT2011 dataset. The results demonstrate that the punctuation prediction
models obtain further performance improvements with task-invariant knowledge
from the POS tagging task. Our best model outperforms the previous
state-of-the-art model trained only with lexical features by up to 9.2%
absolute overall F_1-score on the test set.
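The shared-encoder multi-task setup can be sketched as below: a BERT encoder
initializes the punctuation model, a punctuation head is kept for decoding,
and an auxiliary POS head is used only during training. The label-set sizes
are assumptions, and the adversarial gradient reversal applied to the POS
head (as in the gradient-reversal sketch earlier in this list) is omitted for
brevity.

```python
# Shared BERT encoder with a punctuation head (used at decoding time) and an
# auxiliary POS head (training-time only in the described setup).
import torch.nn as nn
from transformers import BertModel

class PunctuationModel(nn.Module):
    def __init__(self, n_punct=4, n_pos=45):  # e.g. {none, comma, period, question}
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.punct_head = nn.Linear(hidden, n_punct)  # kept for decoding
        self.pos_head = nn.Linear(hidden, n_pos)      # auxiliary task

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return self.punct_head(h), self.pos_head(h)   # per-token logits
```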
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends
Research on speech processing has traditionally considered the task of
designing hand-engineered acoustic features (feature engineering) as a separate
distinct problem from the task of designing efficient machine learning (ML)
models to make prediction and classification decisions. There are two main
drawbacks to this approach: first, manual feature engineering is cumbersome
and requires human knowledge; and second, the designed features might not be
best for the objective at hand. This has motivated the adoption of a recent
trend in the speech community towards the utilisation of representation
learning techniques, which can automatically learn an intermediate
representation of the input signal that better suits the task at hand and
hence leads to
improved performance. The significance of representation learning has increased
with advances in deep learning (DL), where the representations are more useful
and less dependent on human knowledge, making it very conducive for tasks like
classification, prediction, etc. The main contribution of this paper is to
present an up-to-date and comprehensive survey on different techniques of
speech representation learning by bringing together the scattered research
across three distinct research areas including Automatic Speech Recognition
(ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent
reviews in speech have been conducted for ASR, SR, and SER; however, none of
these has focused on representation learning from speech, a gap that our
survey aims to bridge.
Speaker information modification in the VoicePrivacy 2020 toolchain
This paper presents a study of the baseline system of the VoicePrivacy 2020 challenge. This baseline relies on a voice conversion system that aims at separating speaker identity and linguistic content for a given speech utterance. To generate an anonymized speech waveform, the neural acoustic model and neural waveform model use the related linguistic content together with a selected pseudo-speaker identity. The linguistic content is estimated using bottleneck features extracted from a triphone classifier, while the speaker information is extracted and then modified to target a pseudo-speaker identity in the x-vector space. In this work, we first propose to replace the triphone-based bottleneck feature extractor, which requires supervised training, with an end-to-end Automatic Speech Recognition (ASR) system. In this framework, we explore the use of adversarial and semi-adversarial training to learn linguistic features while masking speaker information. Finally, we explore several anonymization schemes to introspect which module benefits the most from the generated pseudo-speaker identities.
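In the spirit of the baseline's pseudo-speaker selection, the following
sketch ranks a pool of external x-vectors by their distance from the source
speaker and averages a random subset of the farthest candidates; cosine
similarity and the pool sizes are simplifying assumptions (the challenge
recipe relies on PLDA-style scoring).

```python
# Hedged sketch of pseudo-speaker selection in the x-vector space.
import numpy as np

def pseudo_xvector(source, pool, n_far=200, n_avg=100, rng=None):
    rng = rng or np.random.default_rng()
    # Cosine similarity of the source x-vector to every pool x-vector.
    sims = pool @ source / (np.linalg.norm(pool, axis=1)
                            * np.linalg.norm(source))
    farthest = np.argsort(sims)[:n_far]  # least similar candidates
    chosen = rng.choice(farthest, size=n_avg, replace=False)
    return pool[chosen].mean(axis=0)     # averaged pseudo-identity

# Illustrative usage with random stand-ins for real x-vectors.
target = pseudo_xvector(np.random.randn(512), np.random.randn(1000, 512))
```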