Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) has recently become an important challenge
when using deep learning (DL). It requires large-scale training datasets and
high computational and storage resources. Moreover, DL techniques, and machine
learning (ML) approaches in general, assume that training and testing data
come from the same domain, with the same input feature space and data
distribution characteristics. This assumption, however, does not hold in
some real-world artificial intelligence (AI) applications. Moreover, there are
situations where real data is difficult or expensive to gather, or the events
of interest occur rarely, so the data requirements of DL models cannot be met. Deep transfer
learning (DTL) has been introduced to overcome these issues; it helps
develop high-performing models using real datasets that are small or slightly
different but related to the training data. This paper presents a comprehensive
survey of DTL-based ASR frameworks to shed light on the latest developments and
help academics and professionals understand current challenges. Specifically,
after presenting the DTL background, a well-designed taxonomy is adopted to
inform the state-of-the-art. A critical analysis is then conducted to identify
the limitations and advantages of each framework. Moving on, a comparative
study is introduced to highlight the current challenges before deriving
opportunities for future research.
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification, covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of 'model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the 'transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research in this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field. Comment: 13 pages, APSIPA 201
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27
figure
Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision
Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties.
The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings.
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared both to unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a large number of target languages, in the setting where no annotated training data is available in the target language.
PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
Improving the Generalizability of Speech Emotion Recognition: Methods for Handling Data and Label Variability
Emotion is an essential component in our interaction with others. It transmits information that helps us interpret the content of what others say. Therefore, detecting emotion from speech is an important step towards enabling machine understanding of human behaviors and intentions. Researchers have demonstrated the potential of emotion recognition in areas such as interactive systems in smart homes and mobile devices, computer games, and computational medical assistants. However, emotion communication is variable: individuals may express emotion in a manner that is uniquely their own; different speech content and environments may shape how emotion is expressed and recorded; individuals may perceive emotional messages differently. Practically, this variability is reflected in both the audio-visual data and the labels used to create speech emotion recognition (SER) systems. SER systems must be robust and generalizable to handle the variability effectively.
The focus of this dissertation is on the development of speech emotion recognition systems that handle variability in emotion communications. We break the dissertation into three parts, according to the type of variability we address: (I) in the data, (II) in the labels, and (III) in both the data and the labels.
Part I: The first part of this dissertation focuses on handling variability present in data. We approximate variations in environmental properties and expression styles by corpus and gender of the speakers. We find that training on multiple corpora and controlling for the variability in gender and corpus using multi-task learning result in more generalizable models, compared to the traditional single-task models that do not take corpus and gender variability into account. Another source of variability present in the recordings used in SER is the phonetic modulation of acoustics. On the other hand, phonemes also provide information about the emotion expressed in speech content. We discover that we can make more accurate predictions of emotion by explicitly considering both roles of phonemes.
Part II: The second part of this dissertation addresses variability present in emotion labels, including the differences between emotion expression and perception, and the variations in emotion perception. We discover that it is beneficial to jointly model both the perception of others and how one perceives one’s own expression, compared to focusing on either one. Further, we show that the variability in emotion perception is a modelable signal and can be captured using probability distributions that describe how groups of evaluators perceive emotional messages.
Part III: The last part of this dissertation presents methods that handle variability in both data and labels. We reduce the data variability due to non-emotional factors using deep metric learning and model the variability in emotion perception using soft labels. We propose a family of loss functions and show that by pairing examples that potentially vary in expression styles and lexical content and preserving the real-valued emotional similarity between them, we develop systems that generalize better across datasets and are more robust to over-training.
These works demonstrate the importance of considering data and label variability in the creation of robust and generalizable emotion recognition systems. We conclude this dissertation with the following future directions: (1) the development of real-time SER systems; (2) the personalization of general SER systems. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147639/1/didizbq_1.pd
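The abstract above mentions soft labels and probability distributions over how groups of evaluators perceive emotion, but does not say how they are built. A minimal sketch of one common construction follows, assuming a hypothetical four-class label set and a simple vote-counting scheme; the names `EMOTIONS`, `soft_label`, and `cross_entropy` are illustrative and not taken from the dissertation.

```python
import math
from collections import Counter

# Hypothetical label set, chosen for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, classes=EMOTIONS, smoothing=0.0):
    """Turn a list of evaluator votes into a probability distribution.

    Assumption: each evaluator casts one vote; optional additive
    smoothing keeps unseen classes from getting exactly zero mass.
    """
    counts = Counter(votes)
    total = len(votes) + smoothing * len(classes)
    return [(counts[c] + smoothing) / total for c in classes]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Five evaluators disagree about one utterance.
votes = ["happy", "happy", "neutral", "happy", "sad"]
target = soft_label(votes)

# A hypothetical model prediction over the same four classes.
predicted = [0.05, 0.70, 0.15, 0.10]
loss = cross_entropy(target, predicted)
```

Training against `target` rather than a single majority label lets the loss reflect evaluator disagreement instead of discarding it.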
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Self-supervised learning (SSL) for rich speech representations has achieved
empirical success in low-resource Automatic Speech Recognition (ASR) and other
speech processing tasks, which can mitigate the necessity of a large amount of
transcribed speech and thus has driven a growing demand for on-device ASR and
other speech processing. However, advanced speech SSL models have become
increasingly large, which contradicts the limited on-device resources. This gap
could be more severe in multilingual/multitask scenarios requiring
simultaneously recognizing multiple languages or executing multiple speech
processing tasks. Additionally, strongly overparameterized speech SSL models
tend to overfit when finetuned on low-resource speech
corpora. This work aims to enhance the practical usage of speech SSL models
towards a win-win in both enhanced efficiency and alleviated overfitting via
our proposed S-Router framework, which for the first time discovers that
simply discarding no more than 10% of model weights via only finetuning model
connections of speech SSL models can achieve better accuracy than standard
weight finetuning on downstream speech processing tasks. More importantly,
S-Router can serve as an all-in-one technique to enable (1) a new
finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a
state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively
analyze the learned speech representation. We believe S-Router has provided
a new perspective for practical deployment of speech SSL models. Our code is
available at: https://github.com/GATECH-EIC/S3-Router. Comment: Accepted at NeurIPS 202
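The abstract above describes finetuning model connections rather than weights, discarding no more than 10% of them. The sketch below illustrates only the general idea of a score-based binary mask over frozen weights. It is an assumption-laden toy, not the authors' implementation: the functions `topk_mask` and `masked_linear`, the score values, and the top-k selection rule are all hypothetical, and the training of the scores is omitted.

```python
def topk_mask(scores, keep_ratio):
    """Binary mask keeping the highest-scoring fraction of connections.

    Assumption: one score per weight; ties at the threshold are all
    kept, so slightly more than keep_ratio may survive when scores tie.
    """
    flat = sorted((s for row in scores for s in row), reverse=True)
    n_keep = max(1, round(keep_ratio * len(flat)))
    threshold = flat[n_keep - 1]
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def masked_linear(x, weights, mask):
    """Linear layer whose frozen weights are gated by a binary mask."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


# Frozen pretrained weights (2 outputs, 5 inputs); values are made up.
weights = [[0.5, -1.0, 0.25, 2.0, 1.0],
           [1.5, 0.0, -0.5, 1.0, -2.0]]
# Per-connection scores; in a real setup these would be the trainable part.
scores = [[0.9, 0.1, 0.8, 0.7, 0.6],
          [0.5, 0.4, 0.3, 0.2, 0.95]]

mask = topk_mask(scores, keep_ratio=0.9)  # discard 10% of connections
y = masked_linear([1.0, 1.0, 1.0, 1.0, 1.0], weights, mask)
```

Here the weights never change; only the mask decides which connections take part in the forward pass, which is the sense in which connections rather than weights are tuned in this sketch.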