Autoregressive models, despite their commendable performance across a myriad of
generative tasks, face challenges stemming from their inherently sequential
structure. Inference on these models is, by design, constrained by a temporal
dependency: the current token's probability distribution is conditioned on the
preceding tokens. This characteristic severely impedes computational efficiency
during inference, as a typical inference request can require thousands of
tokens, and generating each token requires loading the entire model weights,
making inference memory-bound. This overhead becomes even more pronounced in
real deployments, where requests arrive at random times and require varying
generation lengths.
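To make the memory-bound nature concrete, below is a minimal sketch of a
standard autoregressive decoding loop. It is illustrative only, not the paper's
implementation: `model.forward` is a hypothetical stand-in that returns
next-token logits. The point is that every iteration streams the full weight
set from memory to emit a single token.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Naive autoregressive decoding: one full forward pass per token.

    `model.forward` is a hypothetical API returning next-token logits.
    Each iteration must read the entire model weights to produce one
    token, so per-token latency is dominated by memory bandwidth
    rather than compute.
    """
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)  # reads ALL model weights
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        tokens.append(next_id)
        if next_id == eos_id:  # requests finish after varying lengths
            break
    return tokens
```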
Existing solutions, such as dynamic batching and concurrent model instances,
introduce significant response delays and bandwidth contention, falling short
of optimal latency and throughput. To address these shortcomings, we propose
Flover, a temporal
fusion framework for efficiently inferring multiple requests in parallel. We
deconstruct the general generation pipeline into pre-processing and token
generation, and equip the framework with a dedicated work scheduler that fuses
the generation process temporally across all requests. By orchestrating
token-level parallelism, Flover attains optimal hardware efficiency and
significantly conserves system resources.
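The sketch below illustrates the fusion idea under our own assumptions (names
and the `model.forward_batched` API are hypothetical, not Flover's actual
code): instead of running each request in its own instance, a scheduler
advances all in-flight requests by one token per iteration, so each traversal
of the model weights is amortized across every active request.

```python
from dataclasses import dataclass

EOS_ID = 2  # hypothetical end-of-sequence token id

@dataclass
class Request:
    tokens: list       # prompt + tokens generated so far
    max_len: int       # requested maximum generation length
    done: bool = False

def fused_decode_step(model, active):
    """Advance every in-flight request by one token in a single fused pass.

    One batched forward pass (one weight load) serves all active
    requests, amortizing the memory traffic that a per-request loop
    would pay once per request per token.
    """
    batch = [r.tokens for r in active]       # fuse at token granularity
    next_ids = model.forward_batched(batch)  # one weight load, many tokens
    for req, tok in zip(active, next_ids):
        req.tokens.append(tok)
        if tok == EOS_ID or len(req.tokens) >= req.max_len:
            req.done = True                  # eligible for buffer eviction
```

A driver loop would call `fused_decode_step` until every request is done,
admitting newly pre-processed requests between steps at any iteration.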
By further employing a fast buffer reordering algorithm that allows memory
eviction of finished tasks, Flover brings over 11x inference speedup on GPT
and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA
FasterTransformer.
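The abstract does not spell out the reordering algorithm, but one plausible
reading, sketched below purely as our own illustration, is to swap finished
requests to the tail of the shared buffer so their memory can be evicted in
one contiguous block without fragmenting the state of still-running requests.

```python
def compact_buffers(requests, buffers):
    """Partition buffers so active requests stay contiguous at the front.

    `buffers[i]` holds the per-request state (e.g., a KV cache) of
    `requests[i]`. Finished entries are swapped to the tail; everything
    past index `live` can then be evicted as a single block.
    """
    live = 0
    for i, req in enumerate(requests):
        if not req.done:
            requests[live], requests[i] = requests[i], requests[live]
            buffers[live], buffers[i] = buffers[i], buffers[live]
            live += 1
    evictable = buffers[live:]   # memory of finished tasks, freed together
    del requests[live:], buffers[live:]
    return evictable
```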
Crucially, by leveraging advanced tensor parallelism, Flover proves efficacious
across diverse computational landscapes, from single-GPU setups to distributed
scenarios, thereby offering robust performance optimization that adapts to
varied use cases.

Comment: In Proceedings of the 30th IEEE International Conference on High
Performance Computing, Data, and Analytics (HiPC).