5 research outputs found

Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Author: Adnan Muhammad
Arunkumar Akhil
Jain Gaurav
Kamath Purushotham
Nair Prashant J.
Soloveychik Ilya
Publication date: 05/04/2024

Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.

arXiv.org e-Print Archive

Beyond the socket: NUMA-aware GPUs

Author: Arunkumar Akhil
Bolotin Evgeny
Ebrahimi Eiman
Jaleel Aamer
Nellans David
Ramirez Alex
Ugljesa Milic
Villa Oreste
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/10/2017

GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket. We would like to thank anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925). Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Using Low Cost Erasure and Error Correction Schemes to Improve Reliability of Commodity DRAM Systems

Author: Akhil Arunkumar
Carole-Jean Wu
Chaitali Chakrabarti
David Blaauw
Hsing-Min Chen
Supreet Jeloka
Trevor Mudge
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Beyond the socket: NUMA-aware GPUs

Author: Arunkumar Akhil
Bolotin Evgeny
Ebrahimi Eiman
Jaleel Aamer
Nellans David
Ramirez Alex
Ugljesa Milic
Villa Oreste
Publication venue: 'Association for Computing Machinery (ACM)'

RECERCAT

Synthesis of 2-deoxy- d

Author: Aft
Ahangaran
Akbarzadeh
Akhil K. Dubey
Alessio
Ammar
Anand Ballal
Anselmo
Armarego
Arunkumar S. Koijam
Asadabad
Barar
Basu
Basu
Blanářová
Brown
Butler
Byrn
Caminade
Canetta
Chandan Kumar
Chen
Chen
Chin
Choi
Das
Dasari
Dhar
Fang
Ferjaoui
Florea
Gabano
Gao
Gao
Gatti
Ge
Ghosn
Gibson
Gonzalez
Guardia
Gupta
Hall
Hall
He
He
Hofmann
Housman
Huang
Huang
Hufschmid
Huseynov
Häfeli
Jain
Jiang
Johnstone
Jung
K. Shitaljit Sharma
Kandasamy
Karasawa
Karimzadeh
Kenny
Khot
Kolosnjaj-Tabi
Lee
Li
Li
Liberti
Lim
Liu
Locke
Lombardo
Ma
Maharramov
Mahmoudi
Maki
Mansoor
Mansoori
Medina
Mees
Monaco
Montagner
Montalbetti
Morel
Mosmann
Neamtu
Nemirovski
Novohradsky
Otto
Pinheiro
Prasad P. Phadnis
Pyrz
Qi
Rahimi
Rajesh K. Vatsa
Ravera
Ravera
Reddy
Rosenberg
Russell
Rybak
Sadhukha
Santos
Sastry
Sato
Sedletska
Selvan
Senapati
Shan
Sharma
Shi
Singh
Singh
Song
Spicer
Stöber
Sudip Mukherjee
Tian
Tiwari
Unsoy
Valeur
Verma
Wang
Wang
Wexselblatt
Wierzbinski
Wlassoff
Wu
Wu
Wu
Xi
Xie
Xu
Xu
Yang
Yang
Ye
Yew
Yi
Yu
Yuvakkumar
Zanellato
Zerrouki
Zhang
Zhang
Zhao
Zhao
Zheng
Publication venue: 'Royal Society of Chemistry (RSC)'
Publication date: 01/01/2020

Crossref