Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Transformers have emerged as the underpinning architecture for Large Language
Models (LLMs). In generative language models, the inference process involves
two primary phases: prompt processing and token generation. Token generation,
which constitutes the majority of the computational workload, primarily entails
vector-matrix multiplications and interactions with the Key-Value (KV) Cache.
This phase is constrained by memory bandwidth due to the overhead of
transferring weights and KV cache values from the memory system to the
computing units. This memory bottleneck becomes particularly pronounced in
applications that require long-context and extensive text generation, both of
which are increasingly crucial for LLMs.
This paper introduces "Keyformer", an inference-time approach that mitigates
the challenges associated with KV cache size and memory bandwidth
utilization. Keyformer leverages the observation that approximately 90% of the
attention weight in generative inference focuses on a specific subset of
tokens, referred to as "key" tokens. Keyformer retains only the key tokens in
the KV cache by identifying these crucial tokens using a novel score function.
This approach effectively reduces both the KV cache size and memory bandwidth
usage without compromising model accuracy. We evaluate Keyformer's performance
across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ
various positional embedding algorithms. Our assessment encompasses a variety
of tasks, with a particular emphasis on summarization and conversation tasks
involving extended contexts. Keyformer's KV cache reduction lowers inference
latency by 2.1x and improves token generation throughput by 2.4x while
preserving the model's accuracy.
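
To make the idea concrete, here is a minimal, hypothetical sketch of attention-score-based KV cache pruning. It is not Keyformer's actual score function or implementation; the function name, the accumulated-attention heuristic, and the always-kept recent window are illustrative assumptions only.

```python
# Hypothetical sketch (not the paper's implementation): prune a KV cache by
# keeping the tokens that have accumulated the most attention weight, plus a
# small window of the most recent tokens.
import numpy as np

def prune_kv_cache(keys, values, attn_weights, budget, recent_window=8):
    """Keep at most `budget` tokens in the KV cache.

    keys, values:  arrays of shape (seq_len, head_dim)
    attn_weights:  array of shape (num_queries, seq_len) of softmax-normalized
                   attention weights observed during generation
    budget:        number of tokens to retain
    recent_window: most recent tokens that are always kept
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, np.arange(seq_len)

    recent_window = min(recent_window, budget)

    # Score each cached token by its total attention mass; this is a simple
    # stand-in for the score function the paper uses to identify key tokens.
    scores = attn_weights.sum(axis=0)

    # Always retain the most recent tokens; fill the rest of the budget with
    # the highest-scoring "key" tokens from the remaining history.
    recent = np.arange(seq_len - recent_window, seq_len)
    history = np.arange(seq_len - recent_window)
    top_history = history[np.argsort(scores[history])[::-1][: budget - recent_window]]
    keep = np.sort(np.concatenate([top_history, recent]))

    return keys[keep], values[keep], keep
```

Keeping a recent window alongside the high-scoring tokens reflects the common observation that newly generated tokens attend heavily to their immediate predecessors; the specific window size and scoring rule here are placeholders, not values from the paper.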
Beyond the socket: NUMA-aware GPUs
GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimize the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.
We would like to thank the anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925).
Using Low Cost Erasure and Error Correction Schemes to Improve Reliability of Commodity DRAM Systems