63 research outputs found
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Self-supervised learning (SSL) for rich speech representations has achieved
empirical success in low-resource Automatic Speech Recognition (ASR) and other
speech processing tasks, which can mitigate the necessity of a large amount of
transcribed speech and thus has driven a growing demand for on-device ASR and
other speech processing. However, advanced speech SSL models have become
increasingly large, which contradicts the limited on-device resources. This gap
could be more severe in multilingual/multitask scenarios requiring
simultaneously recognizing multiple languages or executing multiple speech
processing tasks. Additionally, strongly overparameterized speech SSL models
tend to suffer from overfitting when being finetuned on low-resource speech
corpora. This work aims to enhance the practical usage of speech SSL models
towards a win-win in both enhanced efficiency and alleviated overfitting via
our proposed S3-Router framework, which for the first time discovers that
simply discarding no more than 10% of model weights via only finetuning model
connections of speech SSL models can achieve better accuracy than standard
weight finetuning on downstream speech processing tasks. More importantly,
S3-Router can serve as an all-in-one technique to enable (1) a new
finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a
state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively
analyze the learned speech representation. We believe S3-Router has provided
a new perspective for practical deployment of speech SSL models. Our codes are
available at: https://github.com/GATECH-EIC/S3-Router.
Comment: Accepted at NeurIPS 202
Efficient Online Processing with Deep Neural Networks
The capabilities and adoption of deep neural networks (DNNs) grow at an
exhilarating pace: Vision models accurately classify human actions in videos
and identify cancerous tissue in medical scans as precisely as human experts;
large language models answer wide-ranging questions, generate code, and write
prose, becoming the topic of everyday dinner-table conversations. Even though
their uses are exhilarating, the continually increasing model sizes and
computational complexities have a dark side. The economic cost and negative
environmental externalities of training and serving models are in evident
disharmony with financial viability and climate action goals.
Instead of pursuing yet another increase in predictive performance, this
dissertation is dedicated to the improvement of neural network efficiency.
Specifically, a core contribution addresses the efficiency aspects during
online inference. Here, the concept of Continual Inference Networks (CINs) is
proposed and explored across four publications. CINs extend prior
state-of-the-art methods developed for offline processing of spatio-temporal
data and reuse their pre-trained weights, improving their online processing
efficiency by an order of magnitude. These advances are attained through a
bottom-up computational reorganization and judicious architectural
modifications. The benefit to online inference is demonstrated by reformulating
several widely used network architectures into CINs, including 3D CNNs,
ST-GCNs, and Transformer Encoders. An orthogonal contribution tackles the
concurrent adaptation and computational acceleration of a large source model
into multiple lightweight derived models. Drawing on fusible adapter networks
and structured pruning, Structured Pruning Adapters achieve superior predictive
accuracy under aggressive pruning using significantly fewer learned weights
compared to fine-tuning with pruning.
Comment: PhD Dissertation
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Machine Learning for Microcontroller-Class Hardware -- A Review
The advancements in machine learning opened a new opportunity to bring
intelligence to the low-end Internet-of-Things nodes such as microcontrollers.
Conventional machine learning deployment has a high memory and compute footprint,
hindering direct deployment on ultra resource-constrained
microcontrollers. This paper highlights the unique requirements of enabling
onboard machine learning for microcontroller class devices. Researchers use a
specialized model development workflow for resource-limited applications to
ensure the compute and latency budget is within the device limits while still
maintaining the desired performance. We characterize a closed-loop widely
applicable workflow of machine learning model development for microcontroller
class devices and show that several classes of applications adopt a specific
instance of it. We present both qualitative and numerical insights into
different stages of model development by showcasing several use cases. Finally,
we identify the open research challenges and unsolved questions demanding
careful considerations moving forward.
Comment: Accepted for publication at IEEE Sensors Journal
Recommending on graphs: a comprehensive review from a data perspective
Recent advances in graph-based learning approaches have demonstrated their
effectiveness in modelling users' preferences and items' characteristics for
Recommender Systems (RSS). Most of the data in RSS can be organized into graphs
where various objects (e.g., users, items, and attributes) are explicitly or
implicitly connected and influence each other via various relations. Such a
graph-based organization brings benefits to exploiting potential properties in
graph learning (e.g., random walk and network embedding) techniques to enrich
the representations of the user and item nodes, which is an essential factor
for successful recommendations. In this paper, we provide a comprehensive
survey of Graph Learning-based Recommender Systems (GLRSs). Specifically, we
start from a data-driven perspective to systematically categorize various
graphs in GLRSs and analyze their characteristics. Then, we discuss the
state-of-the-art frameworks with a focus on the graph learning module and how
they address practical recommendation challenges such as scalability, fairness,
diversity, explainability and so on. Finally, we share some potential research
directions in this rapidly growing area.
Comment: Accepted by UMUA
Adaptation of speech recognition systems to selected real-world deployment conditions
This habilitation thesis deals with adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented in the form of a collection of twelve articles
dealing with this task; I am the main author or a co-author of these
articles. They were published during my work on several consecutive
research projects. I participated in these projects
as a member of the research team as well as an investigator or
co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
The two mentioned identification tasks are in particular investigated
under the demanding and less explored frame-wise scenario,
which is the only one suitable for processing of on-line data streams.
Low-Rank Representation For Enhanced Deep Neural Network Acoustic Models
Automatic speech recognition (ASR) is a fascinating area of research towards realizing human-machine interactions. After more than 30 years of exploitation of Gaussian Mixture Models (GMMs), state-of-the-art systems currently rely on Deep Neural Networks (DNNs) to estimate class-conditional posterior probabilities. The posterior probabilities are used for acoustic modeling in hidden Markov models (HMMs), and form a hybrid DNN-HMM which is now the leading-edge approach to solving ASR problems. The present work builds upon the hypothesis that the optimal acoustic models are sparse and lie on multiple low-rank probability subspaces. Hence, the main goal of this Master project was to investigate different ways to restructure the DNN outputs using low-rank representation. By exploiting a large number of training posterior vectors, the underlying low-dimensional subspace can be identified, and low-rank decomposition enables separation of the “optimal” posteriors from the spurious (unstructured) uncertainties at the DNN output. Experiments demonstrate that low-rank representation can enhance posterior probability estimation and lead to higher ASR accuracy. The posteriors are grouped according to their subspace similarities and structured through low-rank decomposition. Furthermore, a novel hashing technique is proposed that exploits the low-rank property of posterior subspaces and enables fast search in the space of posterior exemplars.