13 research outputs found
Learning speech embeddings for speaker adaptation and speech understanding
In recent years, deep neural network models have gained popularity as a modeling approach for many speech processing tasks, including automatic speech recognition (ASR) and spoken language understanding (SLU). This dissertation has two main goals. The first is to propose modeling approaches that learn speaker embeddings for speaker adaptation or learn semantic speech embeddings. The second is to introduce training objectives that achieve fairness for the ASR and SLU problems. For speaker adaptation, we introduce an auxiliary network to an ASR model and learn to simultaneously detect speaker changes and adapt to the speaker in an unsupervised way. We show that this joint model leads to lower error rates compared to a two-step approach in which the signal is segmented into single-speaker regions and then fed into an adaptation model. We then reformulate the speaker adaptation problem from a counterfactual fairness point of view and introduce objective functions that match the ASR performance of individuals in the dataset to that of their counterfactual counterparts. We show that we can achieve a lower error rate in an ASR system while reducing the performance disparity between protected groups. In the second half of the dissertation, we focus on SLU and tackle two problems associated with SLU datasets. The first is the lack of large speech corpora. To handle this issue, we propose to use available non-parallel text data so that we can leverage the information in text to guide learning of the speech embeddings. We show that this technique increases intent classification accuracy compared to a speech-only system. The second is the label imbalance problem in the datasets, which is also related to fairness, since a model trained on skewed data usually leads to biased results.
To achieve fair SLU, we propose to maximize the F-measure instead of performing conventional cross-entropy minimization, and show that it is possible to increase the number of classes with nonzero recall. In the last two chapters, we provide additional discussion of the impact of these projects from both technical and social perspectives, propose directions for future research, and summarize the findings.
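The F-measure is not differentiable when computed from hard decisions, so maximizing it requires a smooth surrogate. A minimal numpy sketch of one common surrogate, a soft macro-F1 loss (an illustration of the idea, not necessarily the dissertation's exact formulation):

```python
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    """Differentiable (soft) macro-F1 loss.

    probs:  (N, C) predicted class probabilities
    labels: (N, C) one-hot ground truth
    Soft TP/FP/FN are computed from probabilities instead of hard
    decisions, so every class contributes a gradient, which helps
    rare classes attain nonzero recall.
    """
    tp = (probs * labels).sum(axis=0)
    fp = (probs * (1 - labels)).sum(axis=0)
    fn = ((1 - probs) * labels).sum(axis=0)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)  # per-class soft F1
    return 1.0 - f1.mean()                  # minimize 1 - macro-F1
```

Minimizing this loss directly rewards balanced per-class performance, unlike cross-entropy, which a majority class can dominate.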
Biased Self-supervised learning for ASR
Self-supervised learning via masked prediction pre-training (MPPT) has shown
impressive performance on a range of speech-processing tasks. This paper
proposes a method to bias self-supervised learning towards a specific task. The
core idea is to slightly finetune the model that is used to obtain the target
sequence. This leads to better performance and a substantial increase in
training speed. Furthermore, this paper proposes a variant of MPPT that allows
low-footprint streaming models to be trained effectively by computing the MPPT
loss on masked and unmasked frames. These approaches are evaluated for
automatic speech recognition on the Librispeech corpus, where 100 hours of data
served as the labelled data and 860 hours as the unlabelled data. The biased
training outperforms the unbiased training by 15.5% after 250k updates and
23.8% after 100k updates on test-other. For the streaming models, the
pre-training approach yields a reduction in word error rate of 44.1%.
Comment: Submitted to ICASSP 202
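A minimal numpy sketch of the masked-prediction loss computed on both masked and unmasked frames, the variant the abstract describes for streaming models (an illustration of the idea, not the paper's implementation):

```python
import numpy as np

def mppt_loss(logits, targets, mask, unmasked_weight=1.0):
    """Masked-prediction loss over masked AND unmasked frames.

    logits:  (T, V) student predictions over V discrete target units
    targets: (T,)   target unit indices (e.g. from a quantized teacher)
    mask:    (T,)   boolean, True where the input frame was masked
    Standard MPPT sums the loss over masked frames only; the extra
    unmasked-frame term (scaled by unmasked_weight) supplies a
    training signal on every frame.
    """
    # numerically stable log-softmax over the unit vocabulary
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]  # per-frame NLL
    return nll[mask].sum() + unmasked_weight * nll[~mask].sum()
```

Setting `unmasked_weight=0` recovers the standard masked-only objective, so the two variants can be compared in the same training loop.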
Augmenting text for spoken language understanding with Large Language Models
Spoken semantic parsing (SSP) involves generating machine-comprehensible
parses from input speech. Training robust models for existing application
domains represented in training data or extending to new domains requires
corresponding triplets of speech-transcript-semantic parse data, which is
expensive to obtain. In this paper, we address this challenge by examining
methods that can use transcript-semantic parse data (unpaired text) without
corresponding speech. First, when unpaired text is drawn from existing textual
corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways
to generate speech representations for unpaired text. Experiments on the STOP
dataset show that unpaired text from existing and new domains improves
performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we
consider the setting when unpaired text is not available in existing textual
corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired
text for existing and new domains. Experiments show that examples and words
that co-occur with intents can be used to generate unpaired text with Llama
2.0. Using the generated text with JAT and TTS for spoken semantic parsing
improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains
respectively.
Comment: Submitted to ICASSP 202
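A hypothetical illustration of how such a generation prompt might be assembled from intent exemplars and co-occurring words (the function and the prompt wording are assumptions for illustration, not the paper's actual prompts):

```python
def build_prompt(intent, exemplars, cooccurring_words, n=10):
    """Assemble a prompt asking an LLM (e.g. Llama 2) to generate
    unpaired transcript text for a target intent."""
    lines = [f"Generate {n} utterances a user might say for the intent '{intent}'."]
    if exemplars:
        lines.append("Examples:")
        lines += [f"- {e}" for e in exemplars]
    if cooccurring_words:
        # steer generation toward vocabulary seen with this intent
        lines.append("Use words such as: " + ", ".join(cooccurring_words) + ".")
    return "\n".join(lines)
```

The generated utterances would then be paired with JAT embeddings or TTS audio to stand in for real speech during training.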
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
Neural network pruning offers an effective method for compressing a
multilingual automatic speech recognition (ASR) model with minimal performance
loss. However, it entails several rounds of pruning and re-training for each
language. In this work, we propose the use of an adaptive
masking approach in two scenarios for pruning a multilingual ASR model
efficiently, resulting in either sparse monolingual models or a sparse
multilingual model (named Dynamic ASR Pathways). Our approach dynamically
adapts the sub-network, avoiding premature decisions about a fixed sub-network
structure. We show that our approach outperforms existing pruning methods when
targeting sparse monolingual models. Further, we illustrate that Dynamic ASR
Pathways jointly discovers and trains better sub-networks (pathways) of a
single multilingual model by adapting from different sub-network
initializations, thereby reducing the need for language-specific pruning.
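A minimal sketch of magnitude-based mask re-selection, the general mechanism behind adapting a sub-network during training instead of fixing it after one pruning round (the paper's actual pruning criterion and schedule may differ):

```python
import numpy as np

def adapt_mask(weights, sparsity):
    """Re-derive a pruning mask from current weight magnitudes.

    Called periodically during training, this lets the surviving
    sub-network ("pathway") change as weights evolve, avoiding a
    premature commitment to one fixed structure.
    Returns a boolean mask where True = keep the weight.
    """
    k = int(weights.size * sparsity)  # number of weights to prune
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    # k-th smallest magnitude becomes the pruning threshold
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > thresh
```

In a multilingual setting, one such mask per language (or one shared mask) can be re-selected at intervals, which is the spirit of the "pathways" being jointly discovered and trained.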
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Large-scale generative models such as GPT and DALL-E have revolutionized
natural language processing and computer vision research. These models not only
generate high fidelity text or image outputs, but are also generalists which
can solve tasks not explicitly taught. In contrast, speech generative models
are still primitive in terms of scale and task generalization. In this paper,
we present Voicebox, the most versatile text-guided generative model for speech
at scale. Voicebox is a non-autoregressive flow-matching model trained to
infill speech given audio context and text, using over 50K hours of
speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can
perform many different tasks through in-context learning, but is more flexible
as it can also condition on future context. Voicebox can be used for mono or
cross-lingual zero-shot text-to-speech synthesis, noise removal, content
editing, style conversion, and diverse sample generation. In particular,
Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both
intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs
0.681) while being up to 20 times faster. See voicebox.metademolab.com for a
demo of the model.
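For context, models of this kind are typically trained with a conditional flow-matching objective of the following general form (a sketch from the flow-matching literature, with conditioning c standing in for the audio context and text; Voicebox's exact loss and masking details are in the paper):

```latex
x_t = \big(1 - (1-\sigma_{\min})\,t\big)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\,x_0,\,x_1}\,
    \big\| v_\theta(x_t, t \mid c) - \big(x_1 - (1-\sigma_{\min})\,x_0\big) \big\|^2
```

where $x_0$ is Gaussian noise, $x_1$ the target speech features, and $v_\theta$ the learned vector field; sampling integrates $v_\theta$ from noise to speech in a fixed number of steps, which is what makes the model non-autoregressive and fast.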
THE EFFECT OF PROVIDING INFORMATION THROUGH BOOKLET MEDIA ON THE COMPLIANCE LEVEL OF TYPE 2 DM PATIENTS
Adherence is a major component of successful diabetes treatment and is influenced by knowledge and skills regarding disease management. Providing information through health education using a multimedia approach, for example booklets, can help patients master information more effectively. This study aimed to determine the effect of providing information through booklet media on the compliance level of type 2 DM patients. The study used a pre-experimental method with a one-group pre-test-post-test design and included 36 samples selected by purposive sampling. Data were collected using questionnaires; data analysis consisted of univariate and bivariate analyses. At pre-test, 29 respondents (80.6%) were less compliant, while at post-test 34 respondents (94.4%) were compliant. The Wilcoxon signed-rank test yielded Zstat = 4.949 > Ztable = 1.96 and P-value = 0.001 < α = 0.05, from which it can be concluded that providing information through booklet media has a significant effect on the compliance level of type 2 DM patients. It is recommended that the hospital use booklet media when providing information to type 2 DM patients about diabetes mellitus and its treatment therapy, so that the information conveyed is easier to understand.
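The Zstat-versus-Ztable comparison above comes from the normal approximation to the Wilcoxon signed-rank test. A stdlib-only sketch of that statistic (no tie handling; the pre/post scores below are hypothetical, since the study reports only summary statistics):

```python
import math

def wilcoxon_signed_rank_z(pre, post):
    """Normal-approximation Wilcoxon signed-rank z statistic for two
    related samples; |z| > 1.96 rejects H0 at alpha = 0.05, matching
    the Zstat > Ztable comparison in the abstract."""
    diffs = [b - a for a, b in zip(pre, post) if b != a]  # drop zero diffs
    n = len(diffs)
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))
    # rank 1..n by absolute difference; sum ranks of positive diffs
    w_plus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd

# hypothetical paired compliance scores, for illustration only
pre  = [2, 3, 3, 4, 2, 5, 3, 4, 4, 3]
post = [3, 5, 6, 8, 7, 11, 10, 12, 13, 13]
z = wilcoxon_signed_rank_z(pre, post)
```

With every post-test score higher than its pre-test counterpart, the statistic exceeds the 1.96 critical value, mirroring the study's conclusion.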
Novel Hepatitis B Virus Capsid Assembly Modulator Induces Potent Antiviral Responses In Vitro and in Humanized Mice
Hepatitis B virus (HBV) affects an estimated 250 million chronic carriers worldwide. Though several vaccines exist, they are ineffective for those already infected. HBV persists due to the formation of covalently closed circular DNA (cccDNA), the viral minichromosome, in the nucleus of hepatocytes. Current nucleoside analogs and interferon therapies rarely clear cccDNA, requiring lifelong treatment. Our group identified GLP-26, a novel glyoxamide derivative that alters HBV nucleocapsid assembly and prevents viral DNA replication. GLP-26 exhibited single-digit nanomolar anti-HBV activity, inhibition of HBV e antigen (HBeAg) secretion, and reduced cccDNA amplification, in addition to showing a promising preclinical profile. Strikingly, long-term combination treatment with entecavir in a humanized mouse model induced a decrease in viral loads and viral antigens that was sustained for up to 12 weeks after treatment cessation.
2′-Chloro,2′-fluoro Ribonucleotide Prodrugs with Potent Pan-genotypic Activity against Hepatitis C Virus Replication in Culture
Pan-genotypic nucleoside HCV inhibitors display a high genetic barrier to drug
resistance and are the preferred direct-acting agents to achieve complete
sustained virologic response in humans. Herein, we report the discovery of a
β-d-2′-Cl,2′-F-uridine phosphoramidate nucleotide, 16, as a nontoxic
pan-genotypic anti-HCV agent. Phosphoramidate 16 in its 5′-triphosphate form
specifically inhibited HCV NS5B polymerase with no marked inhibition of human
polymerases and cellular mitochondrial RNA polymerase. Studies on the
intracellular half-life of phosphoramidate 16-TP in live cells demonstrated a
favorable half-life of 11.6 h, suggesting once-a-day dosing. Stability in
human blood and favorable metabolism in human intestinal microsomes and liver
microsomes make phosphoramidate 16 a prospective candidate for further studies
to establish its potential value as a new anti-HCV agent.
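The once-a-day dosing suggestion follows from first-order decay: with an 11.6 h intracellular half-life, roughly a quarter of the active triphosphate still remains at the end of a 24 h dosing interval. A quick check of that arithmetic:

```python
# Fraction of 16-TP remaining after a 24 h dosing interval,
# given the reported 11.6 h intracellular half-life.
half_life_h = 11.6
interval_h = 24.0
fraction_remaining = 0.5 ** (interval_h / half_life_h)
# roughly 0.24, i.e. about a quarter of the compound persists
# between daily doses, supporting once-a-day dosing
```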