Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
We propose Speculative Decoding (SpecDec), the first formal study of
exploiting the idea of speculative execution to accelerate autoregressive
(AR) decoding. Speculative Decoding has two innovations:
Spec-Drafter -- an independent model specially optimized for efficient and
accurate drafting -- and Spec-Verification -- a reliable method for verifying
the drafted tokens efficiently in the decoding paradigm. Experimental results
on various seq2seq tasks including machine translation and abstractive
summarization show that our approach can achieve around 5x speedup for the
popular Transformer architectures with comparable generation quality to beam
search decoding, refreshing the impression that the draft-then-verify
paradigm introduces only a 1.4x~2x speedup. In addition to the
remarkable speedup, we also demonstrate 3 additional advantages of SpecDec,
revealing its practical value for accelerating generative models in real-world
applications. Our models and codes are available at
https://github.com/hemingkx/SpecDec.
Comment: (Early 2022) Initially announced under the name "Generalized
Aggressive Decoding"; (September 2022) renamed to "Speculative Decoding" for
the ICLR'23 submission (https://openreview.net/pdf?id=H-VlwsYvVi), marking
the point at which "Speculative Decoding" was publicly proposed. EMNLP'23
Findings camera-ready.
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
In this work, we propose FastCoT, a model-agnostic framework based on
parallel decoding without any further training of an auxiliary model or
modification to the LLM itself. FastCoT uses a size-varying context window
whose size changes with position to conduct parallel decoding and
auto-regressive decoding simultaneously, thus fully utilizing GPU computation
resources. In FastCoT, the parallel decoding part provides the LLM with a quick
glance of the future composed of approximate tokens, which could lead to faster
answers compared to regular autoregressive decoding used by causal
transformers. We also provide an implementation of parallel decoding within
the LLM, which supports KV-cache generation and batch processing. Through extensive
experiments, we demonstrate that FastCoT saves inference time by nearly 20%
with only a negligible performance drop compared to the regular approach.
Additionally, we show that the context window size exhibits considerable
robustness across different tasks.
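A toy sketch of the parallel "glance of the future": Jacobi-style sweeps
that refine a window of approximate future tokens. This is an illustrative
simplification, not the FastCoT implementation; `predict` is a hypothetical
next-token function standing in for the LLM, and the fixed `window` stands
in for FastCoT's size-varying context window:

```python
from typing import Callable, List

Token = int
PAD: Token = -1  # placeholder for a not-yet-resolved future position

def glance_of_future(
    predict: Callable[[List[Token]], Token],  # hypothetical next-token fn (stands in for the LLM)
    context: List[Token],
    window: int,
    iters: int = 3,
) -> List[Token]:
    """Refine `window` approximate future tokens by repeated parallel sweeps."""
    guesses: List[Token] = [PAD] * window
    for _ in range(iters):
        # Re-predict every window position from the current guesses; in a real
        # implementation all positions are computed in one batched forward pass.
        new = [predict(context + guesses[:i]) for i in range(window)]
        if new == guesses:  # fixed point: the guesses are self-consistent
            break
        guesses = new
    return guesses  # approximate tokens: the "quick glance of the future"
```

Per the abstract, these sweeps run alongside the autoregressive step on GPU
capacity that would otherwise sit idle, so the approximate preview of the
reasoning ahead comes at little extra cost.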
BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder
We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech
recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced
encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR
has been actively studied, aiming to utilize versatile linguistic knowledge for
generating accurate text. One crucial factor that makes this integration
challenging lies in the vocabulary mismatch: the vocabulary constructed for a
pre-trained LM is generally too large for E2E-ASR training and is likely to
be mismatched with the target ASR domain. To overcome this issue, we
propose BECTRA, an extended version of our previous BERT-CTC, that realizes
BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based
model, which adopts BERT-CTC for its encoder and trains an ASR-specific decoder
using a vocabulary suitable for a target task. With the combination of the
transducer and BERT-CTC, we also propose a novel inference algorithm for taking
advantage of both autoregressive and non-autoregressive decoding. Experimental
results on several ASR tasks, varying in amounts of data, speaking styles, and
languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing
with the vocabulary mismatch while exploiting BERT knowledge.
Comment: Submitted to ICASSP 2023.
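The combined autoregressive/non-autoregressive inference can be illustrated
with a generic NAR-draft-then-AR-rescore sketch; this is a simplification
under assumed placeholder functions (`nar_nbest`, `ar_score`), not BECTRA's
actual inference algorithm:

```python
from typing import Callable, List

def nar_then_ar_decode(
    nar_nbest: Callable[[object], List[List[int]]],  # NAR pass: audio -> N-best token drafts
    ar_score: Callable[[object, List[int]], float],  # AR pass: transducer log-prob of a draft
    audio: object,
) -> List[int]:
    """Draft hypotheses non-autoregressively, then keep the draft the
    autoregressive decoder scores highest."""
    drafts = nar_nbest(audio)
    return max(drafts, key=lambda hyp: ar_score(audio, hyp))
```

The design intent mirrored here is that the NAR pass (BERT-CTC-style)
supplies fast, linguistically informed candidates, while the AR transducer
decoder, trained on the task-specific vocabulary, makes the final decision.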