8 research outputs found

CUED at ProbSum 2023: Hierarchical Ensemble of Summarization Models

Author: Fathullah Yassir
Gales Mark
Liusie Adian
Manakul Potsawee
Raina Vatsal
Raina Vyas
Publication date: 08/06/2023

In this paper, we consider the challenge of summarizing patients' medical progress notes in a limited data setting. For the Problem List Summarization (shared task 1A) at the BioNLP Workshop 2023, we demonstrate that Clinical-T5 fine-tuned on 765 medical clinic notes outperforms other extractive, abstractive and zero-shot baselines, yielding reasonable baseline systems for medical note summarization. Further, we introduce Hierarchical Ensemble of Summarization Models (HESM), consisting of token-level ensembles of diverse fine-tuned Clinical-T5 models, followed by Minimum Bayes Risk (MBR) decoding. Our HESM approach led to a considerable summarization performance boost and, when evaluated on held-out challenge data, achieved a ROUGE-L of 32.77, making it the best-performing system on the shared task leaderboard.

Comment: BioNLP Workshop @ ACL 202
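As an illustrative aside on the MBR decoding step mentioned in this abstract, the sketch below picks, from a set of candidate summaries, the one with the highest average similarity to the other candidates. It is a minimal sketch only: the function names `unigram_f1` and `mbr_select` are hypothetical, and the token-overlap F1 used as the utility is an assumption standing in for whatever similarity measure the authors used.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Token-overlap F1 between two strings (an assumed stand-in utility)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Treat every other candidate as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(unigram_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]
```

In this sketch the candidates would come from the different ensemble members; a candidate that agrees with most of the others wins over an outlier.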

arXiv.org e-Print Archive

End-to-End Speech Recognition Contextualization with Large Language Models

Author: Fathullah Yassir
Fuegen Christian
Kalinli Ozlem
Lakomkin Egor
Seltzer Michael L.
Wu Chunyang
Publication date: 19/09/2023

In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively, improving by 7.5% WER overall and 17% WER on rare words against a baseline contextualized RNN-T system that has been trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
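For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how word error rate (WER) is commonly computed, as word-level edit distance divided by reference length, together with a relative reduction helper. The function names are hypothetical, and the abstract does not state whether its reported reductions are relative or absolute, so the second helper is an assumption about one possible reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, a hypothetical drop from 10.0% to 9.4% WER would be a 6% relative reduction under this definition.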

arXiv.org e-Print Archive

Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data

Author: Fathullah Yassir
Fuegen Christian
Jia Junteng
Kalinli Ozlem
Lakomkin Egor
Mahadeokar Jay
Seltzer Mike
Shangguan Yuan
Wu Chunyang
Publication date: 12/11/2023

In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of LLM capabilities, without using any carefully curated paired data. The proposed model can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform speech question answering, speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. Experiments show that our end-to-end approach is on par with or outperforms a cascaded system (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike a cascade, our approach shows the ability to interchange text and audio modalities and utilize the prior context in a conversation to provide better results.

arXiv.org e-Print Archive

Prompting Large Language Models with Speech Recognition Abilities

Author: Fathullah Yassir
Fuegen Christian
Guo Jinxi
Jia Junteng
Kalinli Ozlem
Lakomkin Egor
Li Ke
Mahadeokar Jay
Seltzer Mike
Shangguan Yuan
Wu Chunyang
Xiong Wenhan
Publication date: 21/07/2023

Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder striding to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
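The prepending idea in this abstract can be pictured with a small sketch: audio embeddings are placed in front of the text token embeddings to form one input sequence, and a larger encoder stride yields fewer audio embeddings for the same clip. This is only a schematic illustration in plain Python lists; the function names, the 80 ms and 960 ms strides, and the 10 second clip length are assumptions chosen for the example, not values taken from the paper.

```python
def prepend_audio(audio_embeddings: list[list[float]],
                  text_embeddings: list[list[float]]) -> list[list[float]]:
    """Concatenate audio embeddings in front of text embeddings."""
    dims = {len(vec) for vec in audio_embeddings + text_embeddings}
    if len(dims) > 1:
        # Both modalities must share the LLM's embedding dimension.
        raise ValueError("all embeddings must have the same dimension")
    return audio_embeddings + text_embeddings


def num_audio_embeddings(duration_ms: int, stride_ms: int) -> int:
    """Embeddings emitted for a clip, assuming one per stride (rounded up)."""
    if duration_ms < 0 or stride_ms <= 0:
        raise ValueError("duration must be >= 0 and stride must be > 0")
    return -(-duration_ms // stride_ms)  # integer ceiling division


# Illustrative values only: a 10 s clip at two assumed strides.
short_stride_count = num_audio_embeddings(10_000, 80)
long_stride_count = num_audio_embeddings(10_000, 960)
```

Under these assumed numbers, the longer stride cuts the audio prefix from 125 embeddings to 11, which is the kind of saving that makes long-form audio more tractable for a fixed context length.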

arXiv.org e-Print Archive

TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

Author: Chandra Vikas
Dalmia Ayushi
Fathullah Yassir
Jia Junteng
Kalinli Ozlem
Krishnamoorthi Raghuraman
Lei Xin
Li Danni
Mahadeokar Jay
Seltzer Mike
Shangguan Yuan
Wang Dilin
Wu Chunyang
Yang Haichuan
Publication date: 05/09/2023

Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with comparable GPU-hours to that of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, an in-place Alpha-divergence knowledge distillation, and the use of ScaledAdam optimizer. We validate our approach by comparing Supernet-trained versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, by up to 3% relative in word error rate (WER), while efficiently keeping the cost of training many models at a small constant.

Comment: Meta AI; Submitted to ICASSP 202
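The weight-sharing idea described in this abstract can be sketched roughly as follows: a subnetwork is obtained by keeping fewer layers and a narrower slice of each layer's weights from the Supernet. This is a toy illustration under stated assumptions, not the TODM procedure: the function name is hypothetical, layers are modelled as square matrices in nested lists, and the choice to keep the first layers and the leading rows and columns is an assumption made for simplicity.

```python
def extract_subnetwork(supernet_layers: list[list[list[float]]],
                       depth: int, width: int) -> list[list[list[float]]]:
    """Keep the first `depth` layers and the leading `width` x `width` block
    of each layer's weight matrix (assumed selection rule)."""
    if not 1 <= depth <= len(supernet_layers):
        raise ValueError("depth must be between 1 and the number of layers")
    subnetwork = []
    for layer in supernet_layers[:depth]:
        if not 1 <= width <= len(layer):
            raise ValueError("width must be between 1 and the layer size")
        # Note: list slicing copies values here; a real framework would
        # typically use views so that subnetworks share the same parameters.
        subnetwork.append([row[:width] for row in layer[:width]])
    return subnetwork


# A toy Supernet with 3 layers of 4x4 weights.
toy_supernet = [
    [[float(l * 100 + r * 10 + c) for c in range(4)] for r in range(4)]
    for l in range(3)
]
small_model = extract_subnetwork(toy_supernet, depth=2, width=3)
```

Different `depth` and `width` settings would give differently sized models from one set of trained weights, which is the property that lets a single training run serve several hardware targets.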

arXiv.org e-Print Archive

Efficient Sample-Specific Encoder Perturbations

Author: Fathullah Yassir
Gales Mark
Publication venue: Department of Engineering Student
Publication date: 27/04/2024

Encoder-decoder foundation models have displayed state-of-the-art performance on a range of autoregressive sequence tasks. This paper proposes a simple, lightweight and inference-efficient approach to modifying the behaviour of such an encoder-decoder system according to a specific attribute of interest. Specifically, we show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model to trigger the decoder to generate improved decodings. This work explores a specific realization of this framework focused on improving the COMET performance of Flan-T5 on Machine Translation and the WER of Whisper foundation models on Speech Recognition. Results display consistent improvements in performance evaluated through COMET and WER respectively. Furthermore, experiments also show that the proxies are robust to the exact nature of the data used to train them and can extend to other domains.

Gates Cambridge Trust Cambridge University Press & Assessmen

Apollo (Cambridge)

CUED at ProbSum 2023: Hierarchical Ensemble of Summarization Models

Author: Gales Mark
Liusie Adian
Manakul Potsawee
Raina Vatsal
Raina Vyas
Fathullah Yassir
Publication venue: Department of Engineering Student
Publication date: 13/07/2023

Apollo (Cambridge)