Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss acoustic models that can effectively exploit variable-length contextual information, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their various combinations with other models. We then describe acoustic models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP 2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based Acoustic Models," vol. 4, no. 3, IEEE/CAA Journal of Automatica Sinica, 201
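As a brief illustration of one criterion highlighted in this survey, the following sketch applies the connectionist temporal classification (CTC) loss to dummy frame-level outputs using PyTorch's built-in CTCLoss; the shapes and sizes are illustrative and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal CTC sketch: an acoustic model emits per-frame log-probabilities over
# the label set plus a blank symbol; CTC sums over all alignments that collapse
# to the reference label sequence. Shapes follow torch.nn.CTCLoss conventions.
T, N, C = 100, 4, 30          # frames, batch size, labels (index 0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # stand-in for model output
targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # reference label ids (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in a real system, gradients flow back into the acoustic model
print(loss.item())
```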
Multi-Task Learning with High-Order Statistics for X-vector based Text-Independent Speaker Verification
The x-vector based deep neural network (DNN) embedding systems have demonstrated effectiveness for text-independent speaker verification. This paper presents a multi-task learning architecture for training the speaker embedding DNN with the primary task of classifying the target speakers and the auxiliary task of reconstructing the first- and higher-order statistics of the original input utterance. The proposed training strategy aggregates both supervised and unsupervised learning into one framework to make the speaker embeddings more discriminative and robust. Experiments are carried out using the NIST SRE16 evaluation dataset and the VOiCES dataset. The results demonstrate that our proposed method outperforms the original x-vector approach with very little additional complexity.
Comment: 5 pages, 2 figures, submitted to INTERSPEECH 201
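A minimal sketch of the multi-task idea, assuming mean-plus-standard-deviation pooling and an auxiliary head that reconstructs the input's first- and second-order statistics; the layer sizes, loss weight, and exact statistics are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskXvector(nn.Module):
    """Sketch: frame encoder + statistics pooling; one head classifies the
    speaker, an auxiliary head reconstructs the input's mean/std statistics."""
    def __init__(self, feat_dim=40, hid=512, emb=256, n_speakers=1000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid), nn.ReLU())
        self.embed = nn.Linear(2 * hid, emb)           # after mean+std pooling
        self.spk_head = nn.Linear(emb, n_speakers)     # primary task
        self.stat_head = nn.Linear(emb, 2 * feat_dim)  # auxiliary: input mean+std

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h = self.encoder(x)
        pooled = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        e = self.embed(pooled)
        return self.spk_head(e), self.stat_head(e)

model = MultiTaskXvector()
x = torch.randn(8, 200, 40)
spk_labels = torch.randint(0, 1000, (8,))
logits, stats_pred = model(x)
stats_true = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
loss = nn.functional.cross_entropy(logits, spk_labels) \
       + 0.1 * nn.functional.mse_loss(stats_pred, stats_true)  # weight is an assumption
loss.backward()
```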
An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-talker Single Channel Audio-Visual ASR
In this paper, we analyzed how audio-visual speech enhancement can help to perform the ASR task in a cocktail party scenario. To this end, we considered two simple end-to-end LSTM-based models that perform single-channel audio-visual speech enhancement and phone recognition, respectively. We then studied how the two models interact and how training them jointly affects the final result. We analyzed different training strategies that reveal some interesting and unexpected behaviors. The experiments show that during optimization of the ASR task the speech enhancement capability of the model significantly decreases, and vice versa. Nevertheless, the joint optimization of the two tasks yields a remarkable drop in Phone Error Rate (PER) compared to the audio-visual baseline models trained only to perform phone recognition. We analyzed the behavior of the proposed models using two limited-size datasets, in particular the mixed-speech versions of GRID and TCD-TIMIT.
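A hedged sketch of the joint training setup described above: an enhancement LSTM feeds a phone-recognition LSTM, and the two losses are interpolated. The visual stream is omitted here, and the feature dimensions, loss choices, and trade-off weight are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Sketch of joint training: an enhancement LSTM cleans the noisy features and a
# recognition LSTM consumes the enhanced features; the weight `alpha` trades
# off the two losses. Details differ from the paper's exact models.
class Enhancer(nn.Module):
    def __init__(self, dim=80):
        super().__init__()
        self.lstm = nn.LSTM(dim, 256, batch_first=True)
        self.out = nn.Linear(256, dim)
    def forward(self, noisy):
        h, _ = self.lstm(noisy)
        return self.out(h)                      # enhanced feature estimate

class PhoneRecognizer(nn.Module):
    def __init__(self, dim=80, n_phones=40):
        super().__init__()
        self.lstm = nn.LSTM(dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_phones)
    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.out(h)                      # per-frame phone logits

enh, rec = Enhancer(), PhoneRecognizer()
noisy = torch.randn(4, 120, 80)                 # mixed-speech features
clean = torch.randn(4, 120, 80)                 # target clean features
phones = torch.randint(0, 40, (4, 120))         # per-frame phone labels

enhanced = enh(noisy)
enh_loss = nn.functional.mse_loss(enhanced, clean)
asr_loss = nn.functional.cross_entropy(rec(enhanced).transpose(1, 2), phones)
alpha = 0.5                                     # illustrative trade-off weight
(alpha * enh_loss + (1 - alpha) * asr_loss).backward()
```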
Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning
Domain generalization remains a critical problem for speaker recognition, even with state-of-the-art architectures based on deep neural nets. For example, a model trained on read speech may largely fail when applied to singing or movie scenarios. In this paper, we propose a domain-invariant projection to improve the generalizability of speaker vectors. This projection is a simple neural net and is trained following the Model-Agnostic Meta-Learning (MAML) principle, for which the objective is to classify speakers in one domain after the projection has been updated with speech data from another domain. We tested the proposed method on CNCeleb, a new dataset consisting of single-speaker multi-condition (SSMC) data. The results demonstrate that the MAML-based domain-invariant projection can produce more generalizable speaker vectors and effectively improve performance in unseen domains.
Comment: submitted to INTERSPEECH 202
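A minimal MAML-style sketch of the projection training, assuming a single linear projection and a shared linear speaker classifier: an inner gradient step is taken on data from one domain, and the meta-loss is computed on another domain with the updated parameters. The dimensions and learning rates are illustrative.

```python
import torch
import torch.nn.functional as F

# MAML sketch for a linear speaker-vector projection: inner update on domain A,
# meta-loss on domain B with the updated parameters, then an outer update.
emb_dim, n_spk = 256, 100
W = (0.01 * torch.randn(emb_dim, emb_dim)).requires_grad_()   # projection
C = (0.01 * torch.randn(n_spk, emb_dim)).requires_grad_()     # speaker classifier

def spk_loss(W, C, x, y):
    logits = (x @ W.T) @ C.T          # project speaker vectors, then classify
    return F.cross_entropy(logits, y)

# one meta-step: domain A for the inner update, domain B for the meta-loss
xa, ya = torch.randn(32, emb_dim), torch.randint(0, n_spk, (32,))
xb, yb = torch.randn(32, emb_dim), torch.randint(0, n_spk, (32,))

inner_lr, outer_lr = 0.1, 0.01
gW, gC = torch.autograd.grad(spk_loss(W, C, xa, ya), (W, C), create_graph=True)
meta_loss = spk_loss(W - inner_lr * gW, C - inner_lr * gC, xb, yb)
meta_loss.backward()                  # second-order MAML gradients land in W.grad, C.grad
with torch.no_grad():
    W -= outer_lr * W.grad; C -= outer_lr * C.grad
    W.grad = None; C.grad = None
```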
A Purely End-to-end System for Multi-speaker Speech Recognition
Recently, there has been growing interest in multi-speaker speech
recognition, where the utterances of multiple speakers are recognized from
their mixture. Promising techniques have been proposed for this task, but
earlier works have required additional training data such as isolated source
signals or senone alignments for effective learning. In this paper, we propose
a new sequence-to-sequence framework to directly decode multiple label
sequences from a single speech sequence by unifying source separation and
speech recognition functions in an end-to-end manner. We further propose a new
objective function to improve the contrast between the hidden vectors to avoid
generating similar hypotheses. Experimental results show that the model is
directly able to learn a mapping from a speech mixture to multiple label
sequences, achieving 83.1% relative improvement compared to a model trained
without the proposed objective. Interestingly, the results are comparable to
those produced by previous end-to-end works featuring explicit separation and
recognition modules.
Comment: ACL 201
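A rough sketch of the two ingredients described above, namely permutation-free assignment of decoded streams to reference label sequences and a contrast penalty between per-stream hidden vectors; the cross-entropy/cosine formulation and the weight beta are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F
from itertools import permutations

def multispeaker_loss(stream_logits, refs, hiddens, beta=0.1):
    # stream_logits: list of S tensors (T, vocab); refs: list of S label tensors (T,)
    # hiddens: list of S utterance-level hidden vectors (D,)
    S = len(stream_logits)
    best = None
    for perm in permutations(range(S)):            # try every stream->reference pairing
        ce = sum(F.cross_entropy(stream_logits[s], refs[p]) for s, p in enumerate(perm))
        best = ce if best is None else torch.minimum(best, ce)
    # contrast term: penalize similar hidden vectors to avoid duplicate hypotheses
    contrast = sum(F.cosine_similarity(hiddens[i], hiddens[j], dim=0)
                   for i in range(S) for j in range(i + 1, S))
    return best + beta * contrast                  # beta is an assumed weight

T, V, D = 50, 30, 128
logits = [torch.randn(T, V, requires_grad=True) for _ in range(2)]
refs = [torch.randint(0, V, (T,)) for _ in range(2)]
hid = [torch.randn(D, requires_grad=True) for _ in range(2)]
multispeaker_loss(logits, refs, hid).backward()
```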
Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation
End-to-end Speech Translation (ST) models have many potential advantages when
compared to the cascade of Automatic Speech Recognition (ASR) and text Machine
Translation (MT) models, including lower inference latency and the avoidance
of error compounding. However, the quality of end-to-end ST is often limited by
a paucity of training data, since it is difficult to collect large parallel
corpora of speech and translated transcript pairs. Previous studies have
proposed the use of pre-trained components and multi-task learning in order to
benefit from weakly supervised training data, such as speech-to-transcript or
text-to-foreign-text pairs. In this paper, we demonstrate that using
pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly
supervised data into speech-to-translation pairs for ST training can be more
effective than multi-task learning. Furthermore, we demonstrate that a high
quality end-to-end ST model can be trained using only weakly supervised
datasets, and that synthetic data sourced from unlabeled monolingual text or
speech can be used to improve performance. Finally, we discuss methods for
avoiding overfitting to synthetic speech with a quantitative ablation study.
Comment: ICASSP 201
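A hedged sketch of the data conversion described above; translate_text and synthesize_speech stand in for pre-trained MT and TTS models and are hypothetical placeholders, not a real library's API.

```python
# Weakly supervised pairs are converted into (speech, translation) examples
# that can train an end-to-end ST model. The callables are assumed interfaces.
from typing import Callable, Iterable, List, Tuple

def build_st_pairs(
    asr_pairs: Iterable[Tuple["Audio", str]],        # (speech, transcript)
    mt_pairs: Iterable[Tuple[str, str]],             # (source text, foreign text)
    translate_text: Callable[[str], str],            # pre-trained MT model (assumed)
    synthesize_speech: Callable[[str], "Audio"],     # pre-trained TTS model (assumed)
) -> List[Tuple["Audio", str]]:
    st_pairs = []
    # ASR data: keep the real speech, machine-translate its transcript.
    for speech, transcript in asr_pairs:
        st_pairs.append((speech, translate_text(transcript)))
    # MT data: keep the real translation, synthesize speech for the source text.
    for source_text, foreign_text in mt_pairs:
        st_pairs.append((synthesize_speech(source_text), foreign_text))
    return st_pairs
```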
Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention
Keyword spotting (KWS) and speaker verification (SV) have been studied
independently although it is known that acoustic and speaker domains are
complementary. In this paper, we propose a multi-task network that performs KWS
and SV simultaneously to fully utilize the interrelated domain information. The
multi-task network tightly combines sub-networks aiming at performance
improvement in challenging conditions such as noisy environments,
open-vocabulary KWS, and short-duration SV, by introducing novel techniques of
connectionist temporal classification (CTC)-based soft voice activity detection
(VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically originated weights to form a word-level global representation, which is then used to aggregate feature vectors into discriminative embeddings. Our proposed approach shows 4.06% and 26.71% relative improvements in equal error rate (EER) compared to the baselines for both tasks. We also present a visualization example and results of ablation experiments.
Comment: Accepted to Interspeech 202
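A simplified sketch of the CTC-based soft VAD idea: per-frame posteriors from the KWS branch provide a speech weight (one minus the blank probability) used to pool frame-level speaker features. The real model also uses global query attention; reducing it to weighted pooling is an assumption for illustration.

```python
import torch

def soft_vad_pool(ctc_logits, spk_frames, blank=0):
    # ctc_logits: (T, vocab) frame-level outputs from the KWS sub-network
    # spk_frames: (T, D) frame-level speaker features
    speech_weight = 1.0 - ctc_logits.softmax(dim=-1)[:, blank]      # "speech-ness" per frame
    speech_weight = speech_weight / speech_weight.sum().clamp_min(1e-8)
    return (speech_weight.unsqueeze(-1) * spk_frames).sum(dim=0)    # (D,) utterance embedding

emb = soft_vad_pool(torch.randn(200, 30), torch.randn(200, 256))
print(emb.shape)   # torch.Size([256])
```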
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition
(ASR), significant performance degradation is still observed when recognizing
multi-talker mixed speech. In this paper, we propose and evaluate several
architectures to address this problem under the assumption that only a single
channel of mixed signal is available. Our technique extends permutation
invariant training (PIT) by introducing the front-end feature separation module
with the minimum mean square error (MSE) criterion and the back-end recognition
module with the minimum cross entropy (CE) criterion. More specifically, during
training we compute the average MSE or CE over the whole utterance for each
possible utterance-level output-target assignment, pick the one with the
minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep learning based multi-talker mixed speech separation and recognition systems. The proposed
architectures are evaluated and compared on an artificially mixed AMI dataset
with both two- and three-talker mixed speech. The experimental results indicate
that our proposed architectures can cut the word error rate (WER) by 45.0% and 25.0% relative to the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on multi-talker mixed speech recognition on a challenging speaker-independent spontaneous large-vocabulary continuous speech task.
Comment: 11 pages, 6 figures, Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:1704.0198
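A minimal sketch of utterance-level permutation invariant training as described above, shown with the cross-entropy criterion; the same pattern applies to the MSE front-end criterion. Sizes are illustrative.

```python
import torch
import torch.nn.functional as F
from itertools import permutations

def pit_ce_loss(outputs, targets):
    # outputs: (S, T, C) logits for S output streams; targets: (S, T) labels
    # For each output-target assignment, average the loss over the whole
    # utterance, then optimize only the minimum-loss assignment.
    S = outputs.shape[0]
    losses = []
    for perm in permutations(range(S)):
        ce = torch.stack([F.cross_entropy(outputs[s], targets[p])
                          for s, p in enumerate(perm)]).mean()
        losses.append(ce)
    return torch.stack(losses).min()               # pick the best assignment

out = torch.randn(2, 300, 500, requires_grad=True)  # two talkers, 500 senones (assumed)
tgt = torch.randint(0, 500, (2, 300))
pit_ce_loss(out, tgt).backward()
```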
Adversarial Speaker Verification
The use of deep networks to extract embeddings for speaker recognition has proven successful. However, such embeddings are susceptible to performance
degradation due to the mismatches among the training, enrollment, and test
conditions. In this work, we propose an adversarial speaker verification (ASV)
scheme to learn the condition-invariant deep embedding via adversarial
multi-task training. In ASV, a speaker classification network and a condition
identification network are jointly optimized to minimize the speaker
classification loss and simultaneously mini-maximize the condition loss. The
target labels of the condition network can be categorical (environment types) or continuous (SNR values). We further propose multi-factorial ASV to
simultaneously suppress multiple factors that constitute the condition
variability. Evaluated on a Microsoft Cortana text-dependent speaker
verification task, the ASV achieves 8.8% and 14.5% relative improvements in
equal error rates (EER) for known and unknown conditions, respectively.
Comment: 5 pages, 1 figure, ICASSP 201
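A hedged sketch of the adversarial multi-task idea, expressed here with a gradient reversal layer so that the condition classifier's gradient reaches the shared embedding with flipped sign; the paper describes a mini-max formulation, and the reversal-layer variant, layer sizes, and weights are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, pushing the shared embedding toward condition invariance."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

feat_dim, emb_dim, n_spk, n_cond = 40, 128, 500, 4   # illustrative sizes
encoder = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
spk_head = nn.Linear(emb_dim, n_spk)
cond_head = nn.Linear(emb_dim, n_cond)

x = torch.randn(16, feat_dim)
spk_y = torch.randint(0, n_spk, (16,))
cond_y = torch.randint(0, n_cond, (16,))

emb = encoder(x)
spk_loss = nn.functional.cross_entropy(spk_head(emb), spk_y)
cond_loss = nn.functional.cross_entropy(cond_head(GradReverse.apply(emb, 1.0)), cond_y)
(spk_loss + cond_loss).backward()      # condition gradient reaches emb reversed
```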
Machine Speech Chain with One-shot Speaker Adaptation
In previous work, we developed a closed-loop speech chain model based on deep
learning, in which the architecture enabled the automatic speech recognition
(ASR) and text-to-speech synthesis (TTS) components to mutually improve their
performance. This was accomplished by the two parts teaching each other using
both labeled and unlabeled data. This approach could significantly improve
model performance within a single-speaker speech dataset, but only a slight
increase could be gained in multi-speaker tasks. Furthermore, the model is
still unable to handle unseen speakers. In this paper, we present a new speech
chain mechanism by integrating a speaker recognition model inside the loop. We
also propose extending the capability of TTS to handle unseen speakers by
implementing one-shot speaker adaptation. This enables TTS to mimic voice
characteristics from one speaker to another with only a one-shot speaker
sample, even from a text without any speaker information. In the speech chain
loop mechanism, ASR also benefits from the ability to further learn an
arbitrary speaker's characteristics from the generated speech waveform,
resulting in a significant improvement in the recognition rate.
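A schematic sketch of one speech-chain update with unpaired data and one-shot speaker adaptation; asr, tts, and spk_encoder are hypothetical placeholder modules with assumed methods, not the paper's released code.

```python
def speech_chain_step(asr, tts, spk_encoder, unpaired_speech, unpaired_text):
    # speech -> text -> speech: ASR transcribes real speech, TTS (conditioned on
    # a one-shot speaker embedding of that same speech) resynthesizes it, and
    # the reconstruction error trains TTS.
    hyp_text = asr.transcribe(unpaired_speech)
    spk_emb = spk_encoder(unpaired_speech)          # one-shot speaker adaptation
    recon = tts.synthesize(hyp_text, spk_emb)
    tts_loss = tts.reconstruction_loss(recon, unpaired_speech)

    # text -> speech -> text: TTS reads unpaired text, and the ASR loss on the
    # synthesized speech trains the recognizer.
    synth = tts.synthesize(unpaired_text, spk_emb)
    asr_loss = asr.recognition_loss(synth, unpaired_text)
    return tts_loss, asr_loss
```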