MobileASR: A resource-aware on-device learning framework for user voice personalization applications on mobile phones
We describe a comprehensive methodology for developing user-voice
personalized automatic speech recognition (ASR) models by effectively training
models on mobile phones, allowing user data and models to be stored and used
locally. To achieve this, we propose a resource-aware sub-model-based training
approach that considers the RAM and battery capabilities of mobile phones. By
considering the evaluation metric and resource constraints of the mobile
phones, we are able to perform efficient training and halt the process
accordingly. To simulate real users, we use speakers with various accents. The
entire on-device training and evaluation framework was then tested on various
mobile phones across brands. We show that fine-tuning the models and selecting
the right hyperparameter values is a trade-off between the lowest achievable
performance metric, on-device training time, and memory consumption. Overall,
our methodology offers a comprehensive solution for developing personalized ASR
models while leveraging the capabilities of mobile phones, and balancing the
need for accuracy with resource constraints.
Comment: Accepted in AIMLSystems 202
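The abstract above describes halting on-device training based on the evaluation metric and on resource constraints. A minimal sketch of such a check, assuming it runs between epochs, might look like the following. The names `ResourceBudget` and `should_halt`, and every threshold value, are hypothetical; the abstract gives no concrete numbers or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceBudget:
    """Hypothetical limits; the abstract does not state concrete thresholds."""
    min_battery_pct: float = 20.0   # stop below this battery level
    min_free_ram_mb: float = 300.0  # stop below this much free RAM
    patience: int = 2               # recent epochs allowed without improvement
    min_delta: float = 0.001        # smallest WER drop counted as improvement


def should_halt(wer_history, battery_pct, free_ram_mb, budget=ResourceBudget()):
    """Return (halt, reason) for one check between training epochs."""
    if battery_pct < budget.min_battery_pct:
        return True, "battery"
    if free_ram_mb < budget.min_free_ram_mb:
        return True, "ram"
    if len(wer_history) > budget.patience:
        # Compare the best WER in the last `patience` epochs with the best before them.
        best_before = min(wer_history[:-budget.patience])
        recent_best = min(wer_history[-budget.patience:])
        if best_before - recent_best < budget.min_delta:
            return True, "plateau"
    return False, "continue"
```

Checking resources before the metric means a low battery stops training even while the word error rate is still improving, which is one plausible reading of "resource-aware".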
Designing Human-Centered Collective Intelligence
Human-Centered Collective Intelligence (HCCI) is an emergent research area that seeks to bring together major research areas like machine learning, statistical modeling, information retrieval, market research, and software engineering to address challenges pertaining to deriving intelligent insights and solutions through the collaboration of several intelligent sensors, devices and data sources. An archetypal contextual CI scenario might be concerned with deriving affect-driven intelligence through multimodal emotion detection sources in a bid to determine the likability of one movie trailer over another. On the other hand, the key tenets of designing robust and evolutionary software and infrastructure architecture models to address cross-cutting quality concerns are of keen interest in the “Cloud” age of today. Some of the key quality concerns of interest in CI scenarios span the gamut of security and privacy, scalability, performance, fault-tolerance, and reliability. I present recent advances in CI system design with a focus on highlighting optimal solutions for the aforementioned cross-cutting concerns. I also describe a number of design challenges and a framework that I have determined to be critical to designing CI systems. With inspiration from machine learning, computational advertising, ubiquitous computing, and sociable robotics, this literature incorporates theories and concepts from various viewpoints to empower the collective intelligence engine, ZOEI, to discover affective state and emotional intent across multiple mediums. The discerned affective state is used in recommender systems, among others, to support content personalization. I dive into the design of optimal architectures that allow humans and intelligent systems to work collectively to solve complex problems. I present an evaluation of various studies that leverage the ZOEI framework to design collective intelligence
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in
multimedia search engines, we have identified and analyzed gaps within European research effort during our second year.
In this period we focused on three directions: technological issues, user-centred issues and use cases, and
socio-economic and legal aspects. These were assessed by two central studies: first, a concerted vision of the functional
breakdown of a generic multimedia search engine, and second, representative use-case descriptions with a related discussion
of the requirements for technological challenges. Both studies were carried out in cooperation and consultation with the
community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our
Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as
National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core
technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research
challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal
challenges
Modeling Spoken Information Queries for Virtual Assistants: Open Problems, Challenges and Opportunities
Virtual assistants are becoming increasingly important speech-driven
Information Retrieval platforms that assist users with various tasks.
We discuss open problems and challenges with respect to modeling spoken
information queries for virtual assistants, and list opportunities where
Information Retrieval methods and research can be applied to improve the
quality of virtual assistant speech recognition.
We discuss how query domain classification, knowledge graphs and user
interaction data, and query personalization can be helpful to improve the
accurate recognition of spoken information domain queries. Finally, we also
provide a brief overview of current problems and challenges in speech
recognition.
Comment: SIGIR '23. The 46th International ACM SIGIR Conference on Research &
Development in Information Retrieval
Targeted Subset Selection for Limited-data ASR Accent Adaptation
We study the task of adapting an existing ASR model to a non-native accent
while being constrained by a transcription budget on the duration of utterances
selected from a large unlabeled corpus. We propose a subset selection approach
using the recently proposed submodular mutual information functions, in which
we identify a diverse set of utterances that match the target accent. This is
specified through a few target utterances and achieved by modelling the
relationship between the target and the selected subsets using these functions.
The model adapts to the accent through fine-tuning with utterances selected and
transcribed from the unlabeled corpus. We also use an accent classifier to
learn accent-aware feature representations. Our method is also able to exploit
samples from other accents to perform out-of-domain selections for low-resource
accents which are not available in these corpora. We show that the targeted
subset selection approach improves significantly upon random sampling, by
around 5% to 10% (absolute) in most cases, and is around 10x more
label-efficient. We also compare with an oracle method where we specifically
pick from the target accent and our method is comparable to the oracle in its
selections and WER performance.
Comment: Under review (INTERSPEECH 2022)
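The selection step in the abstract above, picking utterances that match a few target-accent examples under a duration budget, could be sketched as a greedy maximization of a facility-location-style mutual information score. This is one member of the submodular mutual information family, and whether the paper uses this exact variant is an assumption. The function names, the `eta` weight, and the similarity matrix layout (`sim[q][u]` between target utterance `q` and candidate `u`) are all hypothetical.

```python
def flqmi(selected, sim, eta=1.0):
    """Facility-location-style relevance of `selected` candidates to the targets.

    sim[q][u] is the similarity between target utterance q and candidate u.
    """
    if not selected:
        return 0.0
    # How well each target is covered by its closest selected candidate.
    cover = sum(max(row[u] for u in selected) for row in sim)
    # How close each selected candidate is to its nearest target.
    relevance = sum(max(row[u] for row in sim) for u in selected)
    return cover + eta * relevance


def greedy_select(sim, durations, budget_sec, eta=1.0):
    """Add the candidate with the best score gain per second until the budget is spent."""
    selected, used = [], 0.0
    remaining = set(range(len(durations)))
    while remaining:
        base = flqmi(selected, sim, eta)
        best, best_ratio = None, 0.0
        for u in sorted(remaining):
            if used + durations[u] > budget_sec:
                continue  # candidate would exceed the transcription budget
            ratio = (flqmi(selected + [u], sim, eta) - base) / durations[u]
            if ratio > best_ratio:
                best, best_ratio = u, ratio
        if best is None:
            break
        selected.append(best)
        used += durations[best]
        remaining.remove(best)
    return selected
```

In practice the similarities would come from utterance features such as the accent-aware representations the abstract mentions; here they are simply passed in as a nested list.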
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
With the rapid development of speech assistants, adapting server-intended
automatic speech recognition (ASR) solutions to run directly on the device has become
crucial. Researchers and industry prefer to use end-to-end ASR systems for
on-device speech recognition tasks. This is because end-to-end systems can be
made resource-efficient while maintaining a higher quality compared to hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenging task associated with speech assistants is
personalization, which mainly lies in handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel
Georgian tasks. To address the aforementioned problems, we propose a method of
dynamic acoustic unit augmentation based on the BPE-dropout technique. It
non-deterministically tokenizes utterances to extend the tokens' contexts and
to regularize their distribution for the model's recognition of unseen words.
It also reduces the need for optimal subword vocabulary size search. The
technique provides a steady improvement in regular and personalized
(OOV-oriented) speech recognition tasks (at least 6% relative WER and 25%
relative F-score) at no additional computational cost. Owing to the use of
BPE-dropout, our monolingual Turkish Conformer established a competitive result
with 22.2% character error rate (CER) and 38.9% word error rate (WER), which is
close to the best published multilingual system.
Comment: 16 pages, 7 figures
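The non-deterministic tokenization mentioned in the abstract above can be illustrated with a small sketch of BPE-dropout: ordinary BPE segmentation in which each candidate merge is skipped with probability `p` at every step. The function name, the toy merge table, and the default `p` are assumptions for illustration and are not taken from the paper.

```python
import random


def bpe_dropout_encode(word, merges, p=0.1, rng=random):
    """Segment `word` with BPE, skipping each candidate merge with probability p."""
    # Earlier entries in `merges` have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Candidate merges at this step; each survives with probability 1 - p.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= p
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens
```

With `p=0` this reduces to deterministic BPE, and with `p=1` every merge is dropped and the word stays split into characters. Values in between yield different segmentations of the same word across calls, which is what exposes the model to more varied token contexts.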