CUED at ProbSum 2023: Hierarchical Ensemble of Summarization Models
In this paper, we consider the challenge of summarizing patients' medical
progress notes in a limited data setting. For the Problem List Summarization
(shared task 1A) at the BioNLP Workshop 2023, we demonstrate that Clinical-T5
fine-tuned on 765 medical clinic notes outperforms other extractive,
abstractive and zero-shot baselines, yielding reasonable baseline systems for
medical note summarization. Further, we introduce Hierarchical Ensemble of
Summarization Models (HESM), consisting of token-level ensembles of diverse
fine-tuned Clinical-T5 models, followed by Minimum Bayes Risk (MBR) decoding.
Our HESM approach led to a considerable summarization performance boost, and
when evaluated on held-out challenge data achieved a ROUGE-L of 32.77, the
best-performing system on the shared task leaderboard.
Comment: BioNLP Workshop @ ACL 2023
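As a rough illustration of the two components named above (not the authors' released implementation), a token-level ensemble averages the next-token distributions of several fine-tuned models at each decoding step, and MBR decoding then selects the candidate whose expected similarity to the other candidates is highest. The helper names and the choice of similarity metric below are assumptions; ROUGE-L would be a natural choice here.

# Illustrative sketch only: token-level ensembling and MBR selection.
# Assumes HuggingFace-style seq2seq models; `similarity` (e.g. ROUGE-L) is a placeholder.
import torch

def ensemble_next_token_logprobs(models, input_ids, decoder_input_ids):
    """Average the next-token distributions of all ensemble members."""
    probs = []
    for model in models:
        logits = model(input_ids=input_ids,
                       decoder_input_ids=decoder_input_ids).logits[:, -1, :]
        probs.append(logits.softmax(dim=-1))
    return torch.stack(probs).mean(dim=0).log()   # use these log-probs during beam search

def mbr_select(candidates, similarity):
    """Minimum Bayes Risk: pick the candidate most similar, on average, to the others."""
    def expected_gain(c):
        return sum(similarity(c, other) for other in candidates if other is not c)
    return max(candidates, key=expected_gain)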
End-to-End Speech Recognition Contextualization with Large Language Models
In recent years, Large Language Models (LLMs) have garnered significant
attention from the research community due to their exceptional performance and
generalization capabilities. In this paper, we introduce a novel method for
contextualizing speech recognition models incorporating LLMs. Our approach
casts speech recognition as a mixed-modal language modeling task based on a
pretrained LLM. We provide audio features, along with optional text tokens for
context, to train the system to complete transcriptions in a decoder-only
fashion. As a result, the system is implicitly incentivized to learn how to
leverage unstructured contextual information during training. Our empirical
results demonstrate a significant improvement in performance, with a 6% WER
reduction when additional textual context is provided. Moreover, we find that
our method performs competitively, improving overall WER by 7.5% and WER on
rare words by 17% against a baseline contextualized RNN-T system trained on a
speech dataset more than twenty-five times larger. Overall, we demonstrate that
by adding only a handful of trainable parameters via adapters, we can unlock
contextualized speech recognition capability for the pretrained LLM while
keeping the same text-only input functionality.
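A minimal sketch of the mixed-modal setup described above, assuming a frozen decoder-only LLM whose token embedding table is accessible; the adapter module and the exact input layout are illustrative assumptions, not the paper's implementation.

# Sketch: map audio features into the LLM embedding space with a small trainable adapter,
# then concatenate [audio; optional context text; transcript] for decoder-only training.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Small trainable projector; the LLM itself stays frozen."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats):            # (B, T_audio, audio_dim)
        return self.proj(audio_feats)          # (B, T_audio, llm_dim)

def build_inputs(adapter, embed_tokens, audio_feats, context_ids, transcript_ids):
    """The training loss is applied only on the transcript positions."""
    parts = [adapter(audio_feats)]
    if context_ids is not None:                # contextual text is optional
        parts.append(embed_tokens(context_ids))
    parts.append(embed_tokens(transcript_ids))
    return torch.cat(parts, dim=1)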
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization amongst many other closed and open-domain
tasks. This is unlike prior approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach shows the ability to interchange
text and audio modalities and utilize the prior context in a conversation to
provide better results.
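A hedged sketch of how text and audio turns could be used interchangeably within a single conversational prompt, as described above; the turn format, `adapter`, and `embed_tokens` are placeholders, not the authors' code.

# Sketch: every turn, whether text token ids or audio features, becomes a span of
# LLM input embeddings, so prior turns in either modality provide context.
import torch

def embed_turn(turn, adapter, embed_tokens):
    if turn["modality"] == "text":
        return embed_tokens(turn["ids"])       # (1, T_text, d)
    return adapter(turn["audio"])              # (1, T_audio, d)

def build_conversation(turns, adapter, embed_tokens):
    """Concatenate all prior turns so the model conditions on the full dialogue."""
    return torch.cat([embed_turn(t, adapter, embed_tokens) for t in turns], dim=1)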
Prompting Large Language Models with Speech Recognition Abilities
Large language models have proven themselves highly flexible, able to solve a
wide range of generative tasks, such as abstractive summarization and
open-ended question answering. In this paper we extend the capabilities of LLMs
by directly attaching a small audio encoder, allowing them to perform speech
recognition. By prepending a sequence of audio embeddings to the text token
embeddings, the LLM can be converted into an automatic speech recognition (ASR)
system and used in the same manner as its textual counterpart.
Experiments on Multilingual LibriSpeech (MLS) show that incorporating a
conformer encoder into the open-sourced LLaMA-7B allows it to outperform
monolingual baselines by 18% and perform multilingual speech recognition
despite LLaMA being trained overwhelmingly on English text. Furthermore, we
perform ablation studies investigating whether the LLM can be kept completely
frozen during training to maintain its original capabilities, as well as the
effects of scaling up the audio encoder and increasing the audio encoder stride
to generate fewer embeddings. The results of these studies show that
multilingual ASR is possible even when the LLM is frozen or when strides of
almost 1 second are used in the audio encoder, opening up the possibility for
LLMs to operate on long-form audio.
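The striding ablation mentioned above can be pictured with a small sketch: stacking adjacent conformer output frames reduces the number of audio embeddings prepended to the (possibly frozen) LLM. The module and function names below are illustrative assumptions.

# Sketch: reduce the audio sequence length by stacking frames, project into the LLM
# embedding space, and prepend the result to the text prompt embeddings.
import torch
import torch.nn as nn

def stack_frames(x, factor: int):
    """Concatenate `factor` adjacent frames, shortening the sequence by that factor."""
    b, t, d = x.shape
    t = (t // factor) * factor                  # drop any trailing remainder
    return x[:, :t].reshape(b, t // factor, d * factor)

class AudioFrontend(nn.Module):
    def __init__(self, enc_dim: int, llm_dim: int, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, conformer_out):           # (B, T, enc_dim)
        return self.proj(stack_frames(conformer_out, self.stack))

def asr_inputs(frontend, embed_tokens, conformer_out, prompt_ids):
    """Prepend audio embeddings; the LLM then decodes the transcript autoregressively."""
    return torch.cat([frontend(conformer_out), embed_tokens(prompt_ids)], dim=1)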
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models
Automatic Speech Recognition (ASR) models need to be optimized for specific
hardware before they can be deployed on devices. This can be done by tuning the
model's hyperparameters or exploring variations in its architecture.
Re-training and re-validating models after making these changes can be a
resource-intensive task. This paper presents TODM (Train Once Deploy Many), a
new approach to efficiently train many sizes of hardware-friendly on-device ASR
models with comparable GPU-hours to that of a single training job. TODM
leverages insights from prior work on Supernet, where Recurrent Neural Network
Transducer (RNN-T) models share weights within a Supernet. It reduces layer
sizes and widths of the Supernet to obtain subnetworks, making them smaller
models suitable for all hardware types. We introduce a novel combination of
three techniques to improve the outcomes of the TODM Supernet: adaptive
dropouts, an in-place Alpha-divergence knowledge distillation, and the use of
ScaledAdam optimizer. We validate our approach by comparing Supernet-trained
versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using
LibriSpeech. Results demonstrate that our TODM Supernet either matches or
surpasses the performance of manually tuned models, with up to a 3% relative
improvement in word error rate (WER), while efficiently keeping the cost of
training many models at a small constant.
Comment: Meta AI; Submitted to ICASSP 202
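A highly simplified sketch of the train-once-deploy-many training loop implied above: each step updates the full supernet on the task loss and a few randomly sampled subnetworks with in-place distillation from the full model. The `Supernet` interface is hypothetical, and plain KL divergence stands in for the paper's alpha-divergence distillation.

# Sketch only: weight-sharing supernet training with in-place knowledge distillation.
import torch.nn.functional as F

def train_step(supernet, batch, optimizer, num_subnets: int = 2):
    optimizer.zero_grad()
    # Largest configuration: task loss, and teacher distribution for distillation.
    full_logits = supernet(batch["inputs"], config=supernet.max_config())
    loss = supernet.task_loss(full_logits, batch["targets"])
    teacher = full_logits.detach().log_softmax(dim=-1)
    # Randomly sampled subnetworks share the same weights but use smaller widths/depths.
    for _ in range(num_subnets):
        cfg = supernet.sample_subnet_config()
        sub_logits = supernet(batch["inputs"], config=cfg)
        kd = F.kl_div(sub_logits.log_softmax(dim=-1), teacher,
                      log_target=True, reduction="batchmean")
        loss = loss + supernet.task_loss(sub_logits, batch["targets"]) + kd
    loss.backward()
    optimizer.step()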
Efficient Sample-Specific Encoder Perturbations
Encoder-decoder foundation models have displayed state-of-the-art performance on a range of autoregressive sequence tasks. This paper proposes a simple, lightweight and inference-efficient approach to modifying the behaviour of an encoder-decoder system according to a specific attribute of interest. Specifically, we show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model to trigger the decoder to generate improved decodings. This work explores a specific realization of this framework focused on improving the COMET performance of Flan-T5 on Machine Translation and the WER of Whisper foundation models on Speech Recognition. Results display consistent improvements in performance evaluated through COMET and WER respectively. Furthermore, experiments also show that the proxies are robust to the exact nature of the data used to train them and can extend to other domains.
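A minimal sketch of the proxy idea described above, assuming a HuggingFace-style frozen encoder-decoder model; the proxy architecture and function names are assumptions for illustration.

# Sketch: only the small proxy is trained; the foundation model's encoder and decoder
# stay frozen, and the proxy adds a sample-specific perturbation to the encoder output
# before decoding.
import torch
import torch.nn as nn

class PerturbationProxy(nn.Module):
    def __init__(self, d_model: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, enc_out):                 # (B, T, d_model)
        return enc_out + self.net(enc_out)      # perturbed encoder states

def decode_with_proxy(encoder, decode_fn, proxy, inputs):
    with torch.no_grad():
        enc_out = encoder(**inputs).last_hidden_state        # frozen encoder
    return decode_fn(encoder_hidden_states=proxy(enc_out))   # frozen decoding routine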