Languages and Compilers for Writing Efficient High-Performance Computing Applications
Many everyday applications, such as web search, speech recognition, and weather prediction, are executed on high-performance systems containing thousands of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). These applications can be written either in low-level programming languages, such as NVIDIA CUDA, or in domain-specific languages, like Halide for image processing and PyTorch for machine learning. Despite the popularity of these languages, programmers face several challenges when developing efficient high-performance computing applications. First, since every hardware architecture supports a different low-level programming model, programmers need to rewrite their applications in another programming language to utilize new hardware. Second, writing efficient code involves restructuring the computation to ensure (i) regular memory access patterns, (ii) non-divergent control flow, and (iii) complete utilization of the different programmer-managed caches. Furthermore, since these low-level optimizations are known only to hardware experts, it is difficult for a domain expert to write optimized code for new computations. Third, existing domain-specific languages suffer from optimization barriers in their language constructs that prevent new optimizations, and hence these languages provide sub-optimal performance.
To address these challenges, this thesis presents the following novel abstractions and compiler techniques for writing image processing and machine learning applications that run efficiently on a variety of high-performance systems. First, this thesis presents techniques to optimize image processing programs on GPUs using the features of modern GPUs. These techniques improve the concurrency and register usage of the generated code to provide better performance than the state-of-the-art. Second, this thesis presents NextDoor, the first system to provide an abstraction for writing graph sampling applications and executing them efficiently on GPUs. Third, this thesis presents CoCoNet, a domain-specific language to co-optimize communication and computation in distributed machine learning workloads. By breaking the optimization barriers in existing domain-specific languages, these techniques help programmers write correct and efficient code for diverse high-performance computing workloads.
Fast Kronecker Matrix-Matrix Multiplication on GPUs
Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of
a matrix with the Kronecker Product of several smaller matrices. Kron-Matmul is
a core operation for many scientific and machine learning computations.
State-of-the-art Kron-Matmul implementations utilize existing tensor algebra
operations, such as matrix multiplication, transpose, and tensor matrix
multiplication. However, this design choice prevents several
Kron-Matmul-specific optimizations, thus leaving significant performance on
the table. To address this issue, we present FastKron, an efficient technique
for Kron-Matmul on single and multiple GPUs. FastKron is independent of
existing linear algebra operations, enabling several new optimizations for
Kron-Matmul. Thus, it performs up to 40.7x and 7.85x faster than existing
implementations on 1 and 16 GPUs, respectively.
Comment: Accepted at PPoPP 202
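To make the core operation concrete: the Kronecker product of the factors never needs to be formed explicitly, because the multiplication can be carried out as a sequence of small contractions over a reshaped view of the input. The NumPy sketch below illustrates only that reshape-and-contract identity; it is not FastKron's GPU implementation, and the function name kron_matmul is an illustrative placeholder.

```python
import numpy as np
from functools import reduce

def kron_matmul(x, factors):
    """Compute x @ kron(factors[0], ..., factors[-1]) without ever
    materializing the (often huge) Kronecker product.

    x       : (M, p_1 * p_2 * ... * p_n) array
    factors : list of (p_i, q_i) arrays
    returns : (M, q_1 * q_2 * ... * q_n) array
    """
    m = x.shape[0]
    p_dims = [f.shape[0] for f in factors]
    # View the columns of x as a multi-index (p_1, ..., p_n).
    t = x.reshape(m, *p_dims)
    for f in factors:
        # Contract the leading remaining p_i axis (always axis 1).
        # tensordot appends the new q_i axis at the end, so after all
        # factors the axes are ordered (m, q_1, ..., q_n).
        t = np.tensordot(t, f, axes=([1], [0]))
    return t.reshape(m, -1)

# Quick check against the naive approach on small random factors.
rng = np.random.default_rng(0)
fs = [rng.standard_normal((3, 3)) for _ in range(4)]
x = rng.standard_normal((8, 3 ** 4))
assert np.allclose(kron_matmul(x, fs), x @ reduce(np.kron, fs))
```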
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Machine Learning (ML) models contain highly parallel computations, such as
matrix multiplication, convolutions, and dropout. These computations are
commonly executed on Graphics Processing Units (GPUs) by dividing the
computation into independent processing blocks, known as tiles. Since the
number of tiles is usually higher than the number of execution units of a
GPU, tiles are executed on all execution units in waves. However, the tiles
executed in the last wave can under-utilize the execution units because the
number of tiles is not always a multiple of the number of execution units.
This under-utilization can be reduced by executing multiple independent
kernels concurrently on a GPU, but this is not currently possible for
dependent kernels.
In this paper, we present cuSync, a framework to write custom fine-grained
synchronization policies for dependent kernels to improve GPU utilization.
cuSync synchronizes tiles instead of kernels, which allows tiles of multiple
dependent kernels to execute concurrently. Using cuSync we expressed several
synchronization policies in a few lines of code and reduced the inference
times of GPT-3 and ResNet-38 by up to 1.19x and 1.16x, respectively.
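The under-utilization described above is a wave-quantization effect: when the tile count is not a multiple of the number of execution units, the last wave runs partially empty. A back-of-the-envelope sketch of that arithmetic (the tile and unit counts below are illustrative assumptions, not figures from the paper):

```python
import math

def wave_stats(num_tiles: int, num_units: int):
    """Number of waves and the fraction of execution units that are
    busy during the final (possibly partial) wave."""
    waves = math.ceil(num_tiles / num_units)
    last_wave_tiles = num_tiles - (waves - 1) * num_units
    return waves, last_wave_tiles / num_units

# e.g. 130 tiles on a GPU with 108 execution units: the first wave is
# full, but the second wave runs only 22 tiles (~20% utilization).
print(wave_stats(130, 108))   # -> (2, 0.2037...)
```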
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Over the past few years, Large Language Models of Code (Code LLMs) have
started to have a significant impact on programming practice. Code LLMs are
also emerging as a building block for research in programming languages and
software engineering. However, the quality of code produced by a Code LLM
varies significantly across programming languages. Code LLMs produce impressive
results on programming languages that are well represented in their training
data (e.g., Java, Python, or JavaScript), but struggle with low-resource
languages, like OCaml and Racket.
This paper presents an effective approach for boosting the performance of
Code LLMs on low-resource languages using semi-synthetic data. Our approach
generates high-quality datasets for low-resource languages, which can then be
used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T,
translates training data from high-resource languages into training data for
low-resource languages. We apply our approach to generate tens of thousands of
new, validated training items for Racket, OCaml, and Lua from Python. Moreover,
we use an open dataset (The Stack) and model (StarCoderBase), which allow us to
decontaminate benchmarks and train models on this data without violating the
model license.
With MultiPL-T generated data, we present fine-tuned versions of
StarCoderBase that achieve state-of-the-art performance for Racket, OCaml, and
Lua on benchmark problems. For Lua, our fine-tuned model matches the
performance of StarCoderBase on Python -- a very high-resource language -- on
the MultiPL-E benchmarks. For Racket and OCaml, we double their performance on
MultiPL-E, bringing it close to that of higher-resource languages such as Ruby
and C#.
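In outline, the translate-and-validate pipeline described above could be sketched as follows. The helper callables translate and passes_tests are hypothetical placeholders standing in for the LLM-backed translation and test-execution steps; they are not MultiPL-T's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class TrainingItem:
    source_code: str   # original high-resource (Python) function
    target_code: str   # validated translation (e.g., Racket, OCaml, Lua)

def build_dataset(
    python_items: Iterable[str],
    target_language: str,
    translate: Callable[[str, str], Optional[str]],  # hypothetical LLM-backed translator
    passes_tests: Callable[[str, str], bool],        # hypothetical test runner
) -> list[TrainingItem]:
    """Translate high-resource training items into a low-resource language
    and keep only candidates that pass their tests."""
    dataset = []
    for item in python_items:
        candidate = translate(item, target_language)
        if candidate is not None and passes_tests(candidate, target_language):
            dataset.append(TrainingItem(source_code=item, target_code=candidate))
    return dataset
```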