Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, although it is not difficult to collect a large set of audio data for
each user, it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed over a personalized audio data set recorded from
Facebook posts demonstrated that very good improvements can be achieved in both
frame accuracy and word accuracy over popularly-considered baselines such as
fDLR, speaker code and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.
Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
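The joint training idea in this abstract can be illustrated with a minimal sketch. It assumes a single shared hidden layer feeding two softmax heads, one over phoneme states and one over acoustic-token states, and a summed cross-entropy loss; the abstract does not give the actual PTDNN architecture, so the class name `MultiTaskNet`, the function `joint_loss`, the layer sizes, and the `token_weight` parameter are all hypothetical.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, b, x):
    """Affine map W @ x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultiTaskNet:
    """Hypothetical two-head network: one shared hidden layer, two outputs."""

    def __init__(self, n_in, n_hidden, n_phoneme_states, n_token_states, seed=0):
        rng = random.Random(seed)
        self.W_h = init_matrix(n_hidden, n_in, rng)
        self.b_h = [0.0] * n_hidden
        self.W_p = init_matrix(n_phoneme_states, n_hidden, rng)
        self.b_p = [0.0] * n_phoneme_states
        self.W_t = init_matrix(n_token_states, n_hidden, rng)
        self.b_t = [0.0] * n_token_states

    def forward(self, x):
        # The hidden layer is shared, so both tasks shape the same features.
        h = [math.tanh(v) for v in linear(self.W_h, self.b_h, x)]
        p_phone = softmax(linear(self.W_p, self.b_p, h))
        p_token = softmax(linear(self.W_t, self.b_t, h))
        return p_phone, p_token


def joint_loss(net, x, phoneme_label=None, token_label=None, token_weight=1.0):
    """Cross-entropy summed over whichever labels exist for this frame.

    Frames from untranscribed audio carry only a token label; frames from
    the small transcribed set may carry both. token_weight is an assumed
    knob for balancing the two tasks.
    """
    p_phone, p_token = net.forward(x)
    loss = 0.0
    if phoneme_label is not None:
        loss += -math.log(p_phone[phoneme_label])
    if token_label is not None:
        loss += token_weight * -math.log(p_token[token_label])
    return loss
```

Calling `joint_loss(net, frame, token_label=2)` covers a frame from unlabeled audio, while passing `phoneme_label` as well covers a frame from the limited transcribed set.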
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of `model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research in this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 201
Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages
Subword modeling for zero-resource languages aims to learn low-level
representations of speech audio without using transcriptions or other resources
from the target language (such as text corpora or pronunciation dictionaries).
A good representation should capture phonetic content and abstract away from
other types of variability, such as speaker differences and channel noise.
Previous work in this area has primarily focused on unsupervised learning from
target language data only, and has been evaluated only intrinsically. Here we
directly compare multiple methods, including some that use only target language
speech data and some that use transcribed speech from other (non-target)
languages, and we evaluate using two intrinsic measures as well as on a
downstream unsupervised word segmentation and clustering task. We find that
combining two existing target-language-only methods yields better features than
either method alone. Nevertheless, even better results are obtained by
extracting target language bottleneck features using a model trained on other
languages. Cross-lingual training using just one other language is enough to
provide this benefit, but multilingual training helps even more. In addition to
these results, which hold across both intrinsic measures and the extrinsic
task, we discuss the qualitative differences between the different types of
learned features.
Comment: 17 pages, 6 figures, 7 tables. Accepted for publication in Computer
Speech and Language. arXiv admin note: text overlap with arXiv:1803.0886
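Bottleneck feature extraction, as mentioned in this abstract, can be sketched as a forward pass that stops at a narrow hidden layer. The sketch below assumes a small feed-forward network with tanh activations and a linear bottleneck; the weights are random placeholders standing in for a model trained on other languages, and the names `make_layer` and `bottleneck_features` as well as the layer sizes are hypothetical rather than taken from the paper.

```python
import math
import random


def dense(W, b, x):
    """Affine map W @ x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def make_layer(n_out, n_in, rng):
    """Placeholder weights; a real system would load trained parameters."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def bottleneck_features(layers, bottleneck_index, frame):
    """Run one acoustic frame through the network up to the bottleneck.

    Layers after the bottleneck (including the source-language output
    layer) are never evaluated; only the narrow activation is returned.
    """
    h = frame
    for i, (W, b) in enumerate(layers):
        h = dense(W, b, h)
        if i == bottleneck_index:
            return h  # assumed linear bottleneck, no activation applied
        h = [math.tanh(v) for v in h]
    raise ValueError("bottleneck_index is past the last layer")
```

With layers sized 13 to 32 to 8 to 40, calling `bottleneck_features(layers, 1, frame)` would return an 8-dimensional vector for each 13-dimensional target-language frame.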
Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives
Over the past few years, adversarial training has become an extremely active
research topic and has been successfully applied to various Artificial
Intelligence (AI) domains. Because it is a potentially crucial technique for the
development of the next generation of emotional AI systems, we herein provide a
comprehensive overview of the application of adversarial training to affective
computing and sentiment analysis. Various representative adversarial training
algorithms are explained and discussed accordingly, aimed at tackling diverse
challenges associated with emotional AI systems. Further, we highlight a range
of potential future research directions. We expect that this overview will help
facilitate the development of adversarial training for affective computing and
sentiment analysis in both the academic and industrial communities.
Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications
The representation learning of speech, without textual resources, is an area
of significant interest for many low resource speech applications. In this
paper, we describe an approach to self-supervised representation learning from
raw audio using a hidden unit clustering (HUC) framework. The input to the
model consists of audio samples that are windowed and processed with 1-D
convolutional layers. The learned "time-frequency" representations from the
convolutional neural network (CNN) module are further processed with long short
term memory (LSTM) layers which generate a contextual vector representation for
every windowed segment. The HUC framework, allowing the categorization of the
representations into a small number of phoneme-like units, is used to train the
model for learning semantically rich speech representations. The targets
consist of phoneme-like pseudo labels for each audio segment and these are
generated with an iterative k-means algorithm. We explore techniques that
improve the speaker invariance of the learned representations and illustrate
the effectiveness of the proposed approach on two settings, i) completely
unsupervised speech applications on the sub-tasks described as part of the
ZeroSpeech 2021 challenge and ii) semi-supervised automatic speech recognition
(ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi
dataset. In these experiments, we achieve state-of-the-art results for various
ZeroSpeech tasks. Further, on the ASR experiments, the HUC representations are
shown to improve significantly over other established benchmarks based on
Wav2vec, HuBERT and Best-RQ.
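The pseudo-label step in this abstract can be illustrated with plain k-means over segment representations. This is a minimal sketch assuming standard Lloyd iterations with a deterministic, evenly spaced initialization; the abstract does not specify how the iterative k-means is initialized or interleaved with model training, and the function name `kmeans_pseudo_labels` and its parameters are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Index of the nearest centroid for every vector."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
        for v in vectors
    ]


def kmeans_pseudo_labels(vectors, k, n_iters=10):
    """Cluster segment representations; the cluster index is the pseudo label.

    Assumes len(vectors) >= k. Centroids start from evenly spaced input
    vectors so that the result is deterministic.
    """
    step = len(vectors) // k
    centroids = [list(vectors[i * step]) for i in range(k)]
    labels = assign(vectors, centroids)
    for _ in range(n_iters):
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(vectors, centroids)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids
```

The returned labels could then serve as classification targets for each windowed segment, with clustering repeated as the representations change during training.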