Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
We propose Speculative Decoding (SpecDec), the first formal study of
exploiting the idea of speculative execution to accelerate autoregressive
(AR) decoding. Speculative Decoding has two innovations:
Spec-Drafter -- an independent model specially optimized for efficient and
accurate drafting -- and Spec-Verification -- a reliable method for verifying
the drafted tokens efficiently in the decoding paradigm. Experimental results
on various seq2seq tasks including machine translation and abstractive
summarization show that our approach can achieve around 5x speedup for the
popular Transformer architectures with comparable generation quality to beam
search decoding, refreshing the impression that the draft-then-verify
paradigm introduces only a 1.4x~2x speedup. In addition to the
remarkable speedup, we also demonstrate 3 additional advantages of SpecDec,
revealing its practical value for accelerating generative models in real-world
applications. Our models and codes are available at
https://github.com/hemingkx/SpecDec.
Comment: (Early 2022) Initially announced under the name "Generalized
Aggressive Decoding"; (September 2022) renamed to "Speculative Decoding" for
the ICLR'23 submission (https://openreview.net/pdf?id=H-VlwsYvVi), marking
the point at which "Speculative Decoding" was publicly proposed. EMNLP'23
Findings camera-ready.
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
In this work, we propose FastCoT, a model-agnostic framework based on
parallel decoding without any further training of an auxiliary model or
modification to the LLM itself. FastCoT uses a size-varying context window
whose size changes with position to conduct parallel decoding and
auto-regressive decoding simultaneously, thus fully utilizing GPU computation
resources. In FastCoT, the parallel decoding part provides the LLM with a quick
glance of the future composed of approximate tokens, which could lead to faster
answers compared to regular autoregressive decoding used by causal
transformers. We also provide an implementation of parallel decoding within
the LLM, which supports KV-cache generation and batch processing. Through extensive
experiments, we demonstrate that FastCoT saves inference time by nearly 20%
with only a negligible performance drop compared to the regular approach.
Additionally, we show that the context window size exhibits considerable
robustness across different tasks.
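A toy sketch of the parallel "glance of the future": Jacobi-style sweeps
that refine a window of approximate future tokens. This is an illustrative
simplification, not the FastCoT implementation; `predict` is a hypothetical
next-token function standing in for the LLM, and the fixed `window` stands
in for FastCoT's size-varying context window:

```python
from typing import Callable, List

Token = int
PAD: Token = -1  # placeholder for a not-yet-resolved future position

def glance_of_future(
    predict: Callable[[List[Token]], Token],  # hypothetical next-token fn (stands in for the LLM)
    context: List[Token],
    window: int,
    iters: int = 3,
) -> List[Token]:
    """Refine `window` approximate future tokens by repeated parallel sweeps."""
    guesses: List[Token] = [PAD] * window
    for _ in range(iters):
        # Re-predict every window position from the current guesses; in a real
        # implementation all positions are computed in one batched forward pass.
        new = [predict(context + guesses[:i]) for i in range(window)]
        if new == guesses:  # fixed point: the guesses are self-consistent
            break
        guesses = new
    return guesses  # approximate tokens: the "quick glance of the future"
```

Per the abstract, these sweeps run alongside the autoregressive step on GPU
capacity that would otherwise sit idle, so the approximate preview of the
reasoning ahead comes at little extra cost.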
BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder
We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech
recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced
encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR
has been actively studied, aiming to utilize versatile linguistic knowledge for
generating accurate text. One crucial factor that makes this integration
challenging lies in the vocabulary mismatch: the vocabulary constructed for a
pre-trained LM is generally too large for E2E-ASR training and is likely to
be mismatched with the target ASR domain. To overcome this issue, we
propose BECTRA, an extended version of our previous BERT-CTC, that realizes
BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based
model, which adopts BERT-CTC for its encoder and trains an ASR-specific decoder
using a vocabulary suitable for a target task. With the combination of the
transducer and BERT-CTC, we also propose a novel inference algorithm for taking
advantage of both autoregressive and non-autoregressive decoding. Experimental
results on several ASR tasks, varying in amounts of data, speaking styles, and
languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing
with the vocabulary mismatch while exploiting BERT knowledge.
Comment: Submitted to ICASSP 2023.
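The combined autoregressive/non-autoregressive inference can be illustrated
with a generic NAR-draft-then-AR-rescore sketch; this is a simplification
under assumed placeholder functions (`nar_nbest`, `ar_score`), not BECTRA's
actual inference algorithm:

```python
from typing import Callable, List

def nar_then_ar_decode(
    nar_nbest: Callable[[object], List[List[int]]],  # NAR pass: audio -> N-best token drafts
    ar_score: Callable[[object, List[int]], float],  # AR pass: transducer log-prob of a draft
    audio: object,
) -> List[int]:
    """Draft hypotheses non-autoregressively, then keep the draft the
    autoregressive decoder scores highest."""
    drafts = nar_nbest(audio)
    return max(drafts, key=lambda hyp: ar_score(audio, hyp))
```

The design intent mirrored here is that the NAR pass (BERT-CTC-style)
supplies fast, linguistically informed candidates, while the AR transducer
decoder, trained on the task-specific vocabulary, makes the final decision.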