Splitwise: Efficient generative LLM inference using phase splitting
Recent innovations in generative large language models (LLMs) have made their
applications and use-cases ubiquitous. This has led to large-scale deployments
of these models, using complex, expensive, and power-hungry AI accelerators,
most commonly GPUs. These developments make LLM inference efficiency an
important challenge. Based on our extensive characterization, we find that
there are two main phases during an LLM inference request: a compute-intensive
prompt computation, and a memory-intensive token generation, each with distinct
latency, throughput, memory, and power characteristics. Despite
state-of-the-art batching and scheduling, the token generation phase
underutilizes compute resources. Specifically, unlike compute-intensive prompt
computation phases, token generation phases do not require the compute
capability of the latest GPUs, and can be run with lower power and cost.
With Splitwise, we propose splitting the two phases of an LLM inference
request onto separate machines. This allows us to use hardware that is
well-suited for each phase, and provision resources independently per phase.
However, splitting an inference request across machines requires state transfer
from the machine running prompt computation over to the machine generating
tokens. We implement and optimize this state transfer using the fast back-plane
interconnects available in today's GPU clusters.
We use the Splitwise technique to design LLM inference clusters using the
same or different types of machines for the prompt computation and token
generation phases. Our clusters are optimized for three key objectives:
throughput, cost, and power. In particular, we show that we can achieve 1.4x
higher throughput at 20% lower cost than current designs. Alternatively, we can
achieve 2.35x more throughput with the same cost and power budgets.
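A minimal, self-contained sketch of the phase-splitting idea in Python. The pool classes, the toy next-token rule, and the plain object hand-off standing in for the KV-cache transfer are all hypothetical illustrations, not the paper's implementation:

```python
# Minimal sketch of Splitwise-style phase splitting (hypothetical toy,
# not the paper's implementation). Two machine pools: compute-optimized
# hardware for the compute-bound prompt phase, cheaper/lower-power
# hardware for the memory-bound token phase; the KV cache built during
# prompt computation is handed off between them.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for the per-layer key/value tensors built during prefill.
    entries: list = field(default_factory=list)

class PromptMachine:
    """Compute-optimized: processes all prompt tokens in one batch."""
    def prefill(self, prompt_tokens):
        cache = KVCache(entries=list(prompt_tokens))  # toy "attention state"
        first_token = max(prompt_tokens) + 1          # dummy next-token rule
        return first_token, cache

class TokenMachine:
    """Memory-optimized: generates one token per step from the cache."""
    def decode(self, cache, last_token, max_new_tokens):
        out = [last_token]
        for _ in range(max_new_tokens - 1):
            cache.entries.append(out[-1])             # KV cache grows per step
            out.append(out[-1] + 1)                   # dummy generation rule
        return out

def serve(prompt_tokens, max_new_tokens):
    # 1) Prompt phase on the compute-optimized pool.
    first, cache = PromptMachine().prefill(prompt_tokens)
    # 2) State transfer: in Splitwise this crosses the cluster's fast
    #    back-plane interconnect; here it is just a Python object hand-off.
    # 3) Token phase on the memory-optimized pool.
    return TokenMachine().decode(cache, first, max_new_tokens)

print(serve([3, 1, 4], max_new_tokens=4))  # -> [5, 6, 7, 8]
```

In a real deployment the hand-off in step 2 is the costly part, which is why the paper optimizes it over the GPU cluster's back-plane interconnects.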
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
Machine learning models are increasingly being trained across multiple GPUs
and multiple machines. In this setting, data is transferred between GPUs using
communication collectives such as AlltoAll and AllReduce, which can become a
significant bottleneck in large models. It is important to use efficient
algorithms for collective communication. We introduce TACCL, a tool that allows
algorithm designers to guide a synthesizer into automatically generating
algorithms for a given hardware configuration and communication collective.
TACCL uses the novel communication sketch abstraction to obtain crucial
information from the designer that is used to significantly reduce the state
space and guide the synthesizer towards better algorithms. TACCL also uses a
novel encoding of the problem that allows it to scale beyond single-node
topologies. We use TACCL to synthesize algorithms for three collectives and two
hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms
synthesized by TACCL outperform the NVIDIA Collective Communication Library
(NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end
training of Transformer-XL and BERT models by 11% to 2.3x for different
batch sizes.
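For background on what a hand-designed collective algorithm looks like, here is a toy simulation of the classic ring AllReduce (reduce-scatter followed by all-gather). This is general background on collectives, not an algorithm synthesized by TACCL, and the one-chunk-per-rank layout is a simplifying assumption:

```python
# Toy simulation of ring AllReduce (general background on collective
# algorithms; NOT an algorithm synthesized by TACCL). Each of n ranks
# holds n chunks; after reduce-scatter + all-gather, every rank holds
# the elementwise sum. Per-rank traffic is 2*(n-1)/n of the buffer,
# which is why the ring schedule is a common bandwidth-optimal baseline.
import copy

def ring_allreduce(data):
    """data[r][c] is the value of chunk c on rank r; returns the final
    state, in which every chunk on every rank equals the sum over ranks."""
    n = len(data)
    data = copy.deepcopy(data)
    # Reduce-scatter: after n-1 steps, rank r owns the fully reduced
    # chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):                 # all ranks send in parallel
            c = (r - step) % n             # chunk rank r forwards this step
            data[(r + 1) % n][c] += data[r][c]
    # All-gather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

print(ring_allreduce([[1, 2], [3, 4]]))    # -> [[4, 6], [4, 6]]
```

TACCL's communication sketches let a designer constrain which links and chunk routings the synthesizer may consider for a given topology, rather than fixing a schedule like this one entirely by hand.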
Genomic Survey of E. coli From the Bladders of Women With and Without Lower Urinary Tract Symptoms
Urinary tract infections (UTIs) are one of the most common human bacterial infections. While UTIs are commonly associated with colonization by Escherichia coli, members of this species also have been found within the bladders of individuals with no lower urinary tract symptoms (no LUTS), a condition also known as asymptomatic bacteriuria. Prior studies have found that both uropathogenic E. coli (UPEC) strains and E. coli isolates not associated with UTIs encode virulence factors. Thus, the reason(s) why E. coli sometimes causes UTI-like symptoms remain(s) elusive. In this study, the genomes of 66 E. coli isolates from adult female bladders were sequenced. These isolates were collected from four cohorts of women: (1) without lower urinary tract symptoms, (2) with overactive bladder symptoms, (3) with urgency urinary incontinence, and (4) with a clinical diagnosis of UTI. Comparative genomic analyses were conducted, including core and accessory genome analyses, virulence and motility gene analyses, and antibiotic resistance prediction and testing. We found that the genomic content of these 66 E. coli isolates does not correspond to the participants' symptom status. We thus looked beyond the E. coli genomes to the composition of the entire urobiome and found that the presence of E. coli alone was not sufficient to distinguish between the urobiomes of individuals with UTI and those with no LUTS. Because E. coli presence, abundance, and genomic content appear to be weak predictors of UTI status, we hypothesize that UTI symptoms associated with the detection of E. coli are more likely the result of the composition of the urobiome as a whole.
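As an illustration of the core/accessory genome split mentioned above, here is a hypothetical Python sketch over a toy gene presence/absence table. The 95% threshold, the gene names, and the isolates are illustrative only and are not taken from the study's pipeline:

```python
# Hypothetical sketch of a core/accessory pangenome split from a gene
# presence/absence table; thresholds and data are illustrative, not the
# study's actual analysis.
def partition_pangenome(presence, core_fraction=0.95):
    """presence maps gene name -> set of isolate ids carrying that gene."""
    n_isolates = len({iso for isos in presence.values() for iso in isos})
    core = {g for g, isos in presence.items()
            if len(isos) >= core_fraction * n_isolates}
    accessory = set(presence) - core
    return core, accessory

genes = {
    "gapA": {"iso1", "iso2", "iso3"},   # housekeeping gene: in every isolate
    "fimH": {"iso1", "iso2", "iso3"},   # adhesin, widespread
    "papG": {"iso1"},                   # virulence factor in a subset only
}
core, accessory = partition_pangenome(genes)
print(sorted(core), sorted(accessory))  # -> ['fimH', 'gapA'] ['papG']
```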
Optimizing ML systems without using experts
The growth of large deep learning networks to billions and trillions of parameters has enabled them to achieve state-of-the-art results in various fields, including vision, language, speech, and game-playing. This success of deep networks has also impacted the field of databases in an interesting way: to improve performance, database indexes are now being redesigned as learned models that fit the underlying data. Both deep networks and learned indexes have high resource usage and strict throughput requirements. Minor inefficiencies in resource utilization within these machine learning (ML) systems can incur heavy costs, making it important to optimize their resource efficiency.

What makes this difficult is that the execution environment of ML systems is highly heterogeneous. A deep neural network is made of operators with disparate resource utilization profiles, connected in different ways, and it can be executed on different types of hardware accelerators, each with distinct performance characteristics. Further, even the input workload to a learned index can vary. For every new neural network architecture, hardware accelerator topology, or index workload, either an expert must hand-craft a solution for efficient resource utilization from a large search space, or we must settle for a generic solution that may leave performance on the table.

In this dissertation, we ask: is it possible to build solutions that optimize ML systems by performing instance-specific optimization under the hood, so that they can be used by non-experts? We demonstrate how to build tools that optimize the execution of deep networks and learned indexes for different use cases while minimizing manual effort. In the first part of this dissertation, we present MONeT, an automated framework that jointly optimizes different memory-saving techniques for any deep network architecture; using MONeT, model training on a single GPU always stays within a user-provided memory budget while using less compute than standalone memory-saving techniques. In the second part, we present TACCL, a semi-automated tool that generates efficient communication algorithms based on the hardware topology and the size of the data to transfer in distributed deep learning; using TACCL, network utilization can be improved, making distributed ML execution faster. In the third and final part, we present MaPLE, a parameterized learned index that achieves high performance on a wide variety of workload patterns while maintaining a memory footprint similar to that of a state-of-the-art learned index. The solutions proposed in this dissertation search a large state space to produce performant solutions for each particular use case that match or outperform the previous state of the art without manual tuning from an expert.
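For readers unfamiliar with learned indexes, here is a toy Python sketch of the general idea: a model predicts a key's position in sorted data, and a recorded maximum error bounds the local search. It illustrates the concept behind indexes like MaPLE but is not MaPLE itself; the linear model and the example keys are illustrative assumptions:

```python
# Toy learned index (general background on the idea; NOT MaPLE).
# A linear model predicts a key's rank in a sorted array; the model's
# maximum training error bounds the window searched at lookup time.
import bisect

class ToyLearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Fit position ~ slope * key + intercept over (key, rank) pairs.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Worst-case prediction error defines the search window size.
        self.err = max(abs(self._predict(k) - i)
                       for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)                 # error-bounded window
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = ToyLearnedIndex([2, 3, 5, 7, 11, 13, 17])
print(idx.lookup(11))  # -> 4 (position of 11 in the sorted keys)
print(idx.lookup(4))   # -> None (key absent)
```

The trade-off a parameterized index can tune is visible even here: a tighter-fitting model shrinks `err` and speeds up lookups, while a simpler model keeps the memory footprint small.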