Forgetting Private Textual Sequences in Language Models via Leave-One-Out Ensemble
Recent research has shown that language models have a tendency to memorize
rare or unique token sequences in the training corpus. After deploying a model,
practitioners might be asked to delete personal information from the model
at individuals' request. Re-training the underlying model every time
individuals exercise their right to be forgotten is
computationally expensive. We employ a teacher-student framework and propose a
novel leave-one-out ensemble method to unlearn the targeted textual sequences
that need to be forgotten from the model. In our approach, multiple teachers
are trained on disjoint sets; for each targeted sequence to be removed, we
exclude the teacher trained on the set containing this sequence and aggregate
the predictions from remaining teachers to provide supervision during
fine-tuning. Experiments on LibriSpeech and WikiText-103 datasets show that the
proposed method achieves superior privacy-utility trade-offs compared to other
counterparts.
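A minimal sketch of the leave-one-out supervision step is given below; the tensor shapes, helper names, and the KL-based distillation loss are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def leave_one_out_targets(teacher_logits, holdout_idx):
    """Average teacher predictions, excluding the teacher whose training shard
    contained the sequence that must be forgotten.

    teacher_logits: tensor of shape (num_teachers, seq_len, vocab_size)
    holdout_idx:    index of the teacher trained on that sequence's shard
    """
    keep = [i for i in range(teacher_logits.size(0)) if i != holdout_idx]
    probs = F.softmax(teacher_logits[keep], dim=-1)   # per-teacher distributions
    return probs.mean(dim=0)                          # leave-one-out soft targets

def unlearning_loss(student_logits, soft_targets):
    # KL divergence between the student and the leave-one-out ensemble,
    # used as the supervision signal while fine-tuning the deployed model.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    soft_targets, reduction="batchmean")
```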
Learning a Dual-Mode Speech Recognition Model via Self-Pruning
There is growing interest in unifying the streaming and full-context
automatic speech recognition (ASR) networks into a single end-to-end ASR model
to simplify model training and deployment for both use cases. In real-world
ASR applications, however, streaming ASR models typically operate under
tighter storage and computational constraints - e.g., on embedded devices - than
server-side full-context models. Motivated by the recent progress in
Omni-sparsity supernet training, where multiple subnetworks are jointly
optimized in one single model, this work aims to jointly learn a compact sparse
on-device streaming ASR model, and a large dense server non-streaming model, in
a single supernet. We further show that performing supernet training on both
wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not only
substantially improves the large non-streaming model, as shown in prior work,
but also improves the compact sparse streaming model.
Comment: 7 pages, 1 figure. Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 202
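The joint training idea can be sketched roughly as follows; the model interface, the magnitude-based mask construction, and the loss combination are assumptions made for illustration and simplify the Omni-sparsity supernet procedure considerably.

```python
import torch

def prune_mask(weight, sparsity):
    """Binary mask keeping the largest-magnitude entries of a weight tensor."""
    n_keep = int(weight.numel() * (1.0 - sparsity))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - n_keep).values
    return (weight.abs() > threshold).float()

def supernet_step(model, batch, asr_loss, optimizer, sparsity=0.7):
    """One training step updating a single set of weights for both modes."""
    optimizer.zero_grad()

    # Dense, full-context (non-streaming) pass with the unmodified weights.
    dense_loss = asr_loss(model(batch, streaming=False), batch["targets"])
    dense_loss.backward()

    # Sparse, streaming pass: temporarily zero out low-magnitude weights.
    originals = {n: p.data.clone() for n, p in model.named_parameters()}
    for n, p in model.named_parameters():
        p.data.mul_(prune_mask(p.data, sparsity))
    sparse_loss = asr_loss(model(batch, streaming=True), batch["targets"])
    sparse_loss.backward()
    for n, p in model.named_parameters():
        p.data.copy_(originals[n])          # restore the shared dense weights

    optimizer.step()
    return dense_loss.item() + sparse_loss.item()
```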
Contextual Biasing of Named-Entities with Large Language Models
This paper studies contextual biasing with Large Language Models (LLMs),
where during second-pass rescoring additional contextual information is
provided to an LLM to boost Automatic Speech Recognition (ASR) performance. We
propose to leverage prompts for an LLM, without fine-tuning, during rescoring;
the prompts incorporate a biasing list and few-shot examples that serve as
additional information when calculating the score for a hypothesis. In addition to
few-shot prompt learning, we propose multi-task training of the LLM to predict
both the entity class and the next token. To improve the efficiency for
contextual biasing and to avoid exceeding LLMs' maximum sequence lengths, we
propose dynamic prompting, where we select the most likely class using the
class tag prediction, and only use entities in this class as contexts for next
token prediction. Word Error Rate (WER) evaluation is performed on i) an
internal calling, messaging, and dictation dataset, and ii) the SLUE-Voxpopuli
dataset. Results indicate that biasing lists and few-shot examples can achieve
17.8% and 9.6% relative improvement compared to first-pass ASR, and that
multi-task training and dynamic prompting can achieve 20.0% and 11.3% relative
WER improvement, respectively.
Comment: 5 pages, 4 figures. Conference: ICASSP 202
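A rough sketch of dynamic prompting during second-pass rescoring is shown below; the prompt template, the llm_logprob scoring callable, and the interpolation weight are hypothetical stand-ins, not the paper's exact format.

```python
# Few-shot examples prepended to every prompt (illustrative content).
FEW_SHOT = "context: call Anna Smith\nhypothesis: call anna smith\n\n"

def build_dynamic_prompt(biasing_lists, predicted_class, hypothesis):
    """Use only the entities of the most likely class as context, keeping the
    prompt well under the LLM's maximum sequence length."""
    entities = biasing_lists.get(predicted_class, [])
    context = ", ".join(entities)
    return f"{FEW_SHOT}context: {context}\nhypothesis: {hypothesis}"

def rescore(llm_logprob, biasing_lists, predicted_class,
            hypotheses, first_pass_scores, weight=0.5):
    """Combine first-pass ASR scores with LLM log-probabilities per hypothesis."""
    rescored = []
    for hyp, asr_score in zip(hypotheses, first_pass_scores):
        prompt = build_dynamic_prompt(biasing_lists, predicted_class, hyp)
        rescored.append(asr_score + weight * llm_logprob(prompt, hyp))
    return max(zip(rescored, hypotheses))[1]   # best hypothesis after rescoring
```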
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding
End-to-end (E2E) spoken language understanding (SLU) systems that generate a
semantic parse from speech have become more promising recently. This approach
uses a single model that utilizes audio and text representations from
pre-trained automatic speech recognition (ASR) models, and outperforms traditional
pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems
still show weakness when text representation quality is low due to ASR
transcription errors. To overcome this issue, we propose a novel E2E SLU system
that enhances robustness to ASR errors by fusing audio and text representations
based on the estimated modality confidence of ASR hypotheses. We introduce two
novel techniques: 1) an effective method to encode the quality of ASR
hypotheses and 2) an effective approach to integrate them into E2E SLU models.
We show accuracy improvements on the STOP dataset and present analyses that
demonstrate the effectiveness of our approach.
Comment: INTERSPEECH 202
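One plausible form of confidence-based fusion is sketched below; the confidence-feature encoder and the gating scheme are assumptions for illustration, not the exact architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    def __init__(self, dim, conf_feat_dim):
        super().__init__()
        # Maps ASR-hypothesis quality features (e.g., token-level scores) to a
        # scalar confidence in [0, 1].
        self.conf_head = nn.Sequential(nn.Linear(conf_feat_dim, 1), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio_repr, text_repr, conf_feats):
        conf = self.conf_head(conf_feats)                  # (batch, 1)
        # Down-weight the text branch when the ASR hypothesis looks unreliable,
        # so the SLU head can fall back on the audio representation.
        fused = torch.cat([audio_repr, conf * text_repr], dim=-1)
        return self.proj(fused)
```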
Learning ASR pathways: A sparse multilingual ASR model
Neural network pruning compresses automatic speech recognition (ASR) models
effectively. However, in multilingual ASR, language-agnostic pruning may lead
to severe performance drops on some languages, because a single language-agnostic
pruning mask may not fit all languages and may discard important language-specific
parameters. In this work, we present ASR pathways, a sparse multilingual ASR
model that activates language-specific sub-networks ("pathways"), such that the
parameters for each language are learned explicitly. With the overlapping
sub-networks, the shared parameters can also enable knowledge transfer for
lower-resource languages via joint multilingual training. We propose a novel
algorithm to learn ASR pathways, and evaluate the proposed method on 4
languages with a streaming RNN-T model. Our proposed ASR pathways outperform
both dense models and a language-agnostically pruned model, and provide better
performance on low-resource languages compared to the monolingual sparse
models.
Comment: Accepted by ICASSP 202
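The pathway mechanism can be illustrated with a simplified module that applies a per-language binary mask to a shared weight matrix; how the masks are learned (e.g., by per-language iterative pruning) is omitted, and the class below is a hypothetical stand-in.

```python
import torch
import torch.nn as nn

class PathwayLinear(nn.Module):
    """Shared linear layer whose active parameters depend on the language."""

    def __init__(self, in_dim, out_dim, languages):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        # One binary mask ("pathway") per language; overlapping entries are
        # shared parameters that receive gradients from several languages
        # during joint multilingual training.
        self.masks = {lang: torch.ones(out_dim, in_dim) for lang in languages}

    def forward(self, x, lang):
        return x @ (self.weight * self.masks[lang]).t()
```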
End-to-End Speech Recognition Contextualization with Large Language Models
In recent years, Large Language Models (LLMs) have garnered significant
attention from the research community due to their exceptional performance and
generalization capabilities. In this paper, we introduce a novel method for
contextualizing speech recognition models by incorporating LLMs. Our approach
casts speech recognition as a mixed-modal language modeling task based on a
pretrained LLM. We provide audio features, along with optional text tokens for
context, to train the system to complete transcriptions in a decoder-only
fashion. As a result, the system is implicitly incentivized to learn how to
leverage unstructured contextual information during training. Our empirical
results demonstrate a significant improvement in performance, with a 6% WER
reduction when additional textual context is provided. Moreover, we find that
our method performs competitively, improving WER by 7.5% overall and by 17% on
rare words against a baseline contextualized RNN-T system that has been
trained on a speech dataset more than twenty-five times larger. Overall, we
demonstrate that by adding only a handful of trainable parameters via
adapters, we can unlock contextualized speech recognition capability for the
pretrained LLM while keeping the same text-only input functionality.
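A minimal sketch of the mixed-modal, decoder-only training step is shown below; the adapter architecture, the llm(inputs_embeds=...) interface, and the loss masking are assumptions for illustration rather than the authors' code.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Small trainable adapter projecting audio features into the LLM's embedding space."""

    def __init__(self, audio_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats):            # (batch, frames, audio_dim)
        return self.proj(audio_feats)          # (batch, frames, llm_dim)

def training_step(llm, adapter, embed, audio_feats, context_ids, transcript_ids):
    # Prefix = projected audio features plus optional context-token embeddings.
    prefix = torch.cat([adapter(audio_feats), embed(context_ids)], dim=1)
    inputs = torch.cat([prefix, embed(transcript_ids)], dim=1)
    logits = llm(inputs_embeds=inputs)         # assumed decoder-only interface
    # Next-token loss on transcription positions only; the pretrained LLM stays
    # frozen and only the adapter parameters are updated.
    p = prefix.size(1)
    pred = logits[:, p - 1:-1, :]              # positions predicting transcript tokens
    return nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), transcript_ids.reshape(-1))
```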
Towards Selection of Text-to-speech Data to Augment ASR Training
This paper presents a method for selecting appropriate synthetic speech
samples from a given large text-to-speech (TTS) dataset as supplementary
training data for an automatic speech recognition (ASR) model. We trained a
neural network, which can be optimised using cross-entropy loss or ArcFace
loss, to measure the similarity of synthetic data to real speech. We found
that incorporating synthetic samples with considerable dissimilarity to real
speech, owing in part to lexical differences, into ASR training is crucial for
boosting recognition performance. Experimental results on Librispeech test sets
indicate that, in order to maintain the same speech recognition accuracy as
when using all TTS data, our proposed solution can reduce the size of the TTS
data down below its , which is superior to several baseline methods.
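A simple selection loop consistent with this finding might look as follows; the scorer interface and the keep-both-ends heuristic are assumptions for the sketch, not the paper's exact criterion.

```python
def select_tts_samples(scorer, tts_samples, keep_fraction=0.5):
    """Keep a subset of synthetic utterances for ASR training.

    scorer: callable mapping one synthetic sample to a similarity-to-real-speech score.
    """
    scored = sorted(tts_samples, key=scorer)       # ascending similarity to real speech
    n_keep = int(len(scored) * keep_fraction)
    n_dissimilar = n_keep // 2
    n_similar = n_keep - n_dissimilar
    # Keep both highly similar samples and clearly dissimilar ones (e.g.,
    # lexically novel), since the latter were found important for ASR gains.
    return scored[:n_dissimilar] + scored[len(scored) - n_similar:]
```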
Anchored Speech Recognition with Neural Transducers
Neural transducers have achieved human-level performance on standard speech
recognition benchmarks. However, their performance significantly degrades in
the presence of cross-talk, especially when the primary speaker has a low
signal-to-noise ratio. Anchored speech recognition refers to a class of methods
that use information from an anchor segment (e.g., wake-words) to recognize
device-directed speech while ignoring interfering background speech. In this
paper, we investigate anchored speech recognition to make neural transducers
robust to background speech. We extract context information from the anchor
segment with a tiny auxiliary network, and use encoder biasing and joiner
gating to guide the transducer towards the target speech. Moreover, to improve
the robustness of context embedding extraction, we propose auxiliary training
objectives to disentangle lexical content from speaking style. We evaluate our
methods on synthetic LibriSpeech-based mixtures comprising several SNR and
overlap conditions; they reduce word error rates by 19.6% relative over a
strong baseline when averaged over all conditions.
Comment: To appear at IEEE ICASSP 202
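The encoder-biasing and joiner-gating mechanisms can be sketched as below; the auxiliary GRU, the additive bias, and the sigmoid gate are illustrative choices, not the exact components used in the paper.

```python
import torch
import torch.nn as nn

class AnchorConditioning(nn.Module):
    """Condition a neural transducer on context extracted from an anchor segment."""

    def __init__(self, feat_dim, enc_dim, joiner_dim, ctx_dim=64):
        super().__init__()
        self.aux = nn.GRU(feat_dim, ctx_dim, batch_first=True)  # tiny auxiliary network
        self.enc_bias = nn.Linear(ctx_dim, enc_dim)
        self.joiner_gate = nn.Linear(ctx_dim, joiner_dim)

    def forward(self, anchor_feats, enc_out, joiner_act):
        # Context embedding from the anchor (e.g., wake-word) segment.
        _, ctx = self.aux(anchor_feats)
        ctx = ctx[-1]                                            # (batch, ctx_dim)
        # Encoder biasing: steer encoder outputs towards the target speaker.
        biased_enc = enc_out + self.enc_bias(ctx).unsqueeze(1)
        # Joiner gating: suppress activations driven by background speech.
        gate = torch.sigmoid(self.joiner_gate(ctx)).unsqueeze(1).unsqueeze(1)
        return biased_enc, joiner_act * gate
```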