Non-native children speech recognition through transfer learning
This work deals with non-native children's speech and investigates both multi-task and transfer learning approaches to adapt a multi-language Deep Neural Network (DNN) to speakers, specifically children, learning a foreign language. The application scenario is characterized by young students learning English and German who read sentences in these second languages as well as in their mother tongue. The paper analyzes and discusses techniques for training effective DNN-based acoustic models starting from native children's speech and performing adaptation with limited non-native audio material. A multi-lingual model is adopted as the baseline, in which a common phonetic lexicon, defined in terms of the units of the International Phonetic Alphabet (IPA), is shared across the three languages at hand (Italian, German and English); DNN adaptation methods based on transfer learning are evaluated on significant non-native evaluation sets. Results show that the resulting non-native models yield a significant improvement over a mono-lingual system adapted to speakers of the target language.
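For illustration, a minimal PyTorch sketch of the transfer-learning pattern the abstract describes: freeze the lower layers of an acoustic model trained on native speech and fine-tune the remaining layers on the limited non-native material. The layer sizes, the number of frozen layers, and the senone count are hypothetical placeholders, not the paper's configuration.

```python
# Hypothetical sketch of DNN adaptation by transfer learning: freeze the
# lower layers trained on native speech, fine-tune the rest on limited
# non-native data. All shapes and counts are illustrative.
import torch
import torch.nn as nn

acoustic_model = nn.Sequential(
    nn.Linear(440, 1024), nn.ReLU(),   # lower layers: generic feature extraction
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),  # upper layers: adapted to non-native speech
    nn.Linear(1024, 4200),             # tied-state (senone) posteriors
)

# Freeze the first two affine layers learned from native children's speech.
for layer in list(acoustic_model.children())[:4]:
    for p in layer.parameters():
        p.requires_grad = False

# Optimize only the remaining layers on the small non-native set.
optimizer = torch.optim.SGD(
    (p for p in acoustic_model.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

def adapt_step(feats, senone_targets):
    """One fine-tuning step on a minibatch of non-native frames."""
    optimizer.zero_grad()
    loss = criterion(acoustic_model(feats), senone_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```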
Automatic speech recognition with deep neural networks for impaired speech
The final publication is available at https://link.springer.com/chapter/10.1007%2F978-3-319-49169-1_10
Automatic Speech Recognition has reached almost human performance in some controlled scenarios. However, recognition of impaired speech is a difficult task for two main reasons: data is (i) scarce and (ii) heterogeneous. In this work we train different architectures on a database of dysarthric speech. A comparison between architectures shows that, even with a small database, hybrid DNN-HMM models outperform classical GMM-HMM models in terms of word error rate. A DNN improves the word error rate by 13% for subjects with dysarthria with respect to the best classical architecture. This improvement is larger than that given by other deep neural networks such as CNNs, TDNNs and LSTMs. All the experiments have been done with the Kaldi speech recognition toolkit, for which we have adapted several recipes to deal with dysarthric speech and work on the TORGO database. These recipes are publicly available.
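Since the comparison above is expressed in word error rate, a generic WER implementation (word-level edit distance) may help make the measure concrete; this is standard practice, not code from the adapted Kaldi recipes.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the light", "turn of the light"))  # 0.25
```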
Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech
Rapid population aging has stimulated the development of assistive devices that provide personalized medical support to those in need who suffer from various etiologies. One prominent clinical application is a computer-assisted speech training system which enables personalized speech therapy for patients impaired by communicative disorders in their own home environment. Such a
system relies on robust automatic speech recognition (ASR) technology to provide accurate articulation feedback. With the long-term aim of developing off-the-shelf ASR systems that can be incorporated in a clinical context without prior speaker information, we compare the ASR performance of
speaker-independent bottleneck and articulatory features on dysarthric speech
used in conjunction with dedicated neural network-based acoustic models that
have been shown to be robust against spectrotemporal deviations. We report ASR
performance of these systems on two dysarthric speech datasets of different
characteristics to quantify the achieved performance gains. Despite the
remaining performance gap between dysarthric and normal speech, significant
improvements have been reported on both datasets using speaker-independent ASR
architectures.
Comment: to appear in Computer Speech & Language, https://doi.org/10.1016/j.csl.2019.05.002 (arXiv admin note: substantial text overlap with arXiv:1807.1094)
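As a rough illustration of the bottleneck-feature idea (a sketch under assumed dimensions, not the paper's architecture): a DNN with a narrow hidden layer is trained as a frame classifier, and the activations of that narrow layer are then used as features for the downstream ASR system.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Frame classifier with a narrow 'bottleneck' hidden layer."""
    def __init__(self, feat_dim=40, bn_dim=40, n_targets=2000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, bn_dim),          # narrow bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bn_dim, n_targets)
        )

    def forward(self, x):
        return self.classifier(self.encoder(x))

    @torch.no_grad()
    def bottleneck_features(self, x):
        """After training, extract bottleneck activations as ASR features."""
        return self.encoder(x)
```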
Structured Output Layer with Auxiliary Targets for Context-Dependent Acoustic Modelling
In previous work we have introduced a multi-task training technique for neural network acoustic modelling, in which context-dependent and context-independent targets are jointly learned. In this paper, we extend the approach by structuring the output layer such that the context-dependent outputs depend on the context-independent outputs, thus using the context-independent predictions at run-time. We have also investigated the applicability of this idea to unsupervised speaker adaptation as an approach to overcome the data sparsity issues that come to the fore when estimating systems with a large number of context-dependent states and limited data. We have experimented with various amounts of training material (from 10 to 300 hours) and find that the proposed techniques are particularly well suited to data-constrained conditions, allowing better use of large context-dependent state-clustered trees. Experimental results are reported for large vocabulary speech recognition using the Switchboard and TED corpora. Index Terms: multitask learning, structured output layer, adaptation, deep neural network
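An illustrative PyTorch sketch of such a structured output layer, with hypothetical sizes: the context-independent (CI) head is computed first, and its predictions are fed, together with the shared hidden representation, into the context-dependent (CD) head, so that CI predictions are also available at run-time.

```python
import torch
import torch.nn as nn

class StructuredOutputModel(nn.Module):
    """CD outputs conditioned on CI outputs (multi-task, structured output layer)."""
    def __init__(self, feat_dim=440, hid=1024, n_ci=120, n_cd=6000):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
        )
        self.ci_head = nn.Linear(hid, n_ci)          # context-independent targets
        self.cd_head = nn.Linear(hid + n_ci, n_cd)   # tied context-dependent states

    def forward(self, x):
        h = self.body(x)
        ci_logits = self.ci_head(h)
        # The CD head sees the hidden state AND the CI predictions.
        cd_logits = self.cd_head(torch.cat([h, ci_logits.softmax(dim=-1)], dim=-1))
        return ci_logits, cd_logits

def joint_loss(ci_logits, cd_logits, ci_y, cd_y, alpha=0.5):
    """Multi-task loss over both target sets (weighting is illustrative)."""
    ce = nn.functional.cross_entropy
    return alpha * ce(ci_logits, ci_y) + (1 - alpha) * ce(cd_logits, cd_y)
```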
Spoken command recognition for robotics
In this thesis, I investigate spoken command recognition technology for robotics. While high
robustness is expected, the distant and noisy conditions in which the system has to operate
make the task very challenging. Unlike commercial systems which all rely on a "wake-up"
word to initiate the interaction, the pipeline proposed here directly detects and recognizes
commands from the continuous audio stream. In order to keep the task manageable despite
low-resource conditions, I propose to focus on a limited set of commands, thus trading off
flexibility of the system against robustness.
Domain and speaker adaptation strategies based on a multi-task regularization paradigm
are first explored. More precisely, two different methods are proposed, both relying on a tied loss function that penalizes the distance between the outputs of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% in frame error rate (FER) is obtained for the task-independent network, this only partially carries over to the phone error rate (PER), with a 1.5% improvement. Similarly, a second method explores parallel training of the canonical network with a privileged model that has access to i-vectors. This method proves less effective, with only a 1.2% improvement in FER.
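A minimal sketch of a tied loss of this kind (the weighting, the mean-squared distance, and the function name are illustrative assumptions, not the thesis's exact formulation):

```python
import torch
import torch.nn.functional as F

def tied_loss(canonical_out, task_out, targets, lam=0.1):
    """Joint loss for the canonical and task-dependent networks.

    Each network gets its own cross-entropy term, plus a tied term that
    penalizes the distance between their (pre-softmax) outputs, so the
    two networks regularize one another.
    """
    ce = F.cross_entropy(canonical_out, targets) + F.cross_entropy(task_out, targets)
    tie = F.mse_loss(canonical_out, task_out)  # distance between the outputs
    return ce + lam * tie
```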
In order to make the developed technology more accessible, I also investigated the use of a sequence-to-sequence (S2S) architecture for command classification. The use of an attention-based encoder-decoder model reduced the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in this context. In order to improve the flexibility of the trained system, I also explored strategies for few-shot learning, which make it possible to extend the set of commands with minimal data requirements. By retraining a model on the combination of original and new commands, I achieved 40.5% accuracy on the new commands with only 10 examples for each of them. This score rises to 81.5% with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation, achieved even better scores, with 68.8% and 88.4% accuracy for 10 and 100 examples respectively, while being faster to train. This high performance comes at the expense of the original categories, though, on which accuracy deteriorated. These results are very promising, as the methods make it possible to easily extend an existing S2S model with minimal resources.
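One plausible reading of the retraining strategy, sketched below with hypothetical dimensions: grow the classifier's output layer with rows for the new commands, keep the original weights, and fine-tune on the combined original and new examples.

```python
import torch
import torch.nn as nn

def extend_output_layer(old_head: nn.Linear, n_new: int) -> nn.Linear:
    """Grow a classification head by n_new command classes, keeping the
    weights already learned for the original classes."""
    new_head = nn.Linear(old_head.in_features, old_head.out_features + n_new)
    with torch.no_grad():
        new_head.weight[: old_head.out_features] = old_head.weight
        new_head.bias[: old_head.out_features] = old_head.bias
    return new_head

# e.g. a 20-command head extended with 5 new commands, then fine-tuned
# on the combined set of original and new examples.
head = extend_output_layer(nn.Linear(256, 20), n_new=5)
```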
Finally, a full spoken command recognition system (named iCubrec) has been developed
for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to offer a fully hands-free experience. By segmenting only the regions that are likely to contain commands, the VAD module also greatly reduces the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to reject non-command speech in a live demonstration of the system.
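The thesis's actual VAD is not detailed here, but a crude energy-based detector in the same spirit might look as follows; the frame size, hop, and threshold are assumptions.

```python
import numpy as np

def energy_vad(signal, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Flag frames whose log energy is within `threshold_db` of the peak.
    Returns one boolean per frame; True marks likely speech."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(1, 1 + (len(signal) - frame) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame] ** 2) for i in range(n_frames)
    ])
    log_e = 10.0 * np.log10(energies + 1e-12)
    return log_e > (log_e.max() + threshold_db)  # threshold relative to peak
```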
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks.
Comment: PhD Thesis Unitn, 201
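A small sketch of the data-contamination recipe that such simulated-data training typically builds on: convolve clean speech with a room impulse response and add noise scaled to a target SNR. The function and its arguments are illustrative; the thesis's actual contamination pipeline is not reproduced here.

```python
import numpy as np

def contaminate(clean, rir, noise, snr_db=10.0):
    """Simulate distant-talking speech: reverberate the clean signal with a
    room impulse response, then add noise scaled to the requested SNR."""
    reverberant = np.convolve(clean, rir)[: len(clean)]
    noise = noise[: len(reverberant)]
    # Scale noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return reverberant + noise
```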
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Self-supervised learning (SSL) for rich speech representations has achieved
empirical success in low-resource Automatic Speech Recognition (ASR) and other
speech processing tasks, as it mitigates the need for large amounts of transcribed speech; this has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become
increasingly large, which contradicts the limited on-device resources. This gap
could be more severe in multilingual/multitask scenarios requiring
simultaneously recognizing multiple languages or executing multiple speech
processing tasks. Additionally, strongly overparameterized speech SSL models
tend to suffer from overfitting when finetuned on low-resource speech corpora. This work aims to enhance the practical usage of speech SSL models
towards a win-win in both enhanced efficiency and alleviated overfitting via
our proposed S3-Router framework, which for the first time shows that simply discarding no more than 10% of model weights, by finetuning only the connections of speech SSL models, can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly,
S3-Router can serve as an all-in-one technique to enable (1) a new
finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a
state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively
analyze the learned speech representations. We believe S3-Router provides a new perspective on the practical deployment of speech SSL models. Our code is available at: https://github.com/GATECH-EIC/S3-Router.
Comment: Accepted at NeurIPS 202
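A hedged sketch of the core mechanism as the abstract describes it: keep the pretrained weights frozen and train only per-connection scores, discarding the lowest-scoring fraction of weights with a straight-through estimator. This illustrates the idea; it is not the released S3-Router code (see the linked repository).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Pretrained weights stay frozen; only per-connection scores are trained.
    The lowest-scoring fraction of weights is dropped at forward time,
    emulating 'finetuning connections instead of weights'."""
    def __init__(self, pretrained: nn.Linear, sparsity=0.1):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        self.scores = nn.Parameter(torch.rand_like(self.weight))  # trainable
        self.sparsity = sparsity  # fraction of weights to discard (here <= 10%)

    def forward(self, x):
        k = max(1, int(self.scores.numel() * self.sparsity))
        threshold = torch.kthvalue(self.scores.flatten(), k).values
        hard_mask = (self.scores > threshold).float()
        # Straight-through estimator: hard mask in the forward pass,
        # gradients flow to the scores in the backward pass.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```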