35 research outputs found
Texturizing PPCG: Supporting Texture Memory in a Polyhedral Compiler
In this paper, we discuss techniques to transform sequential programs into
texture/surface-memory-optimized CUDA programs. We achieve this by using PPCG,
an automatic parallelizing compiler based on the polyhedral model. We implemented
a static analysis in PPCG that validates the semantics of the texturized
(transformed) program. Depending on the results of the analysis, our algorithm
chooses to use texture and/or surface memory and alters the abstract syntax
tree accordingly. We also modified the code-generation phase of PPCG to handle
various subtleties. We evaluated the texturization algorithm on the PolyBench
(4.2.1 beta) benchmarks and observed up to 1.6x speedup, with a geometric mean
of 1.103x. Note that while the title and many passages of the paper use the
term "texture memory", the optimizations cover both texture and surface memory.
Optimization and parallelization of tensor and ODE/PDE computations on GPU
We propose a multi-level GPU-based parallelization algorithm to solve the multi-compartment
Hodgkin-Huxley (HH) model equation, which requires solving the Hines matrix. We use
a 'parallel-in-time' algorithm (like the Parareal strategy) to obtain outer-level parallelism,
and an Exact Domain Decomposition (EDD) algorithm with fine decomposition for
inner-level parallelism. We show that our technique can also be applied to any differential
equation, such as the heat equation, that induces tridiagonal systems.
Typically, a solution to the HH equation runs for hundreds to tens of thousands of time steps,
solving a Hines matrix at each time step. Previous solutions to this problem, by Michael
Mascagni et al. (1991) and Hines et al. (2008), tackled only the parallel solution of the
Hines matrix itself.
Our approach uses the dynamic parallelism of CUDA to achieve multi-level parallelism
on GPUs. Our solution outperforms the sequential-in-time method on standard neuron morphologies
by up to 2.5x. We also show that the iterative part of the Parareal method converges in 5-7
iterations on average, with an accuracy of 10^-6.
We also propose a GPU optimization for the Higher Order Tensor Renormalization Group
(HOTRG) problem, where the tensor contraction operations inside HOTRG are optimized by a
multi-GPU implementation using the cuBLASXt API.
Vectorization, Obfuscation and P4 LLVM Tool-chain
This thesis broadly focuses on three different areas: loop vectorization, code obfuscation,
and the P4LLVM compiler. The work on loop vectorization starts with a
comparison of the auto-vectorization of the GCC, ICC, and LLVM compilers, showing their
strengths and weaknesses. As an attempt to improve LLVM's auto-vectorization, we
propose improving loop distribution using exact dependences from Polly. Our work
on loop distribution shows promising results. We developed an LLVM-based code
obfuscation engine with various obfuscation techniques implemented as transformation
passes; our techniques are novel and differ from existing works [1]. In hardware
circuit obfuscation, several methods have been proposed at the hardware level to secure
the IP. Our approach is to obfuscate the circuits at the software level, using code
obfuscation techniques.
Optimizations In Compiler: Vectorization, Reordering, Register Allocation And Verification Of Explicitly Parallel Programs
Compiler optimizations form a very important part of compiler development, as they make a major
difference between an average and a great compiler. A compiler has various modules, which
open up opportunities for optimization in several areas. In this thesis, a comparative study of
vectorization is done, exposing the strengths and weaknesses of various contemporary compilers.
Additionally, a study of the impact of vectorization on tiled code is performed. Different strategies
for loop nest optimization are explored. An algorithm for statement reordering in loops to enhance
performance has been developed. An integer linear programming formulation is given to improve loop
parallelism, which makes use of loop unrolling and explicitly parallel directives. Finally, an attempt
at optimal loop distribution is made. Following the loop nest optimization chapter, an explanation
of interprocedural register allocation (IPRA) for ARM32 and AArch64 is given. Additionally, a
brief description of the problems in implementing IPRA for those architectures is presented. We
conclude the chapter with performance results for IPRA on those platforms. In the last
chapter, a description of VoPiL, a static OpenMP verifier in LLVM, is presented, together with a
brief description of the analysis and the results.
Polyhedral Compilation: Applications, Approximations and GPU-specific Optimizations
Polyhedral compilation has been successful in analyzing, optimizing, and automatically parallelizing
affine computations for modern heterogeneous target architectures. Many tools have been
developed to automate the process of program analysis and transformation for the affine control parts
of programs, including widely used open-source and production compilers such as GCC, LLVM,
and IBM/XL. This thesis contributes to the polyhedral model in three orthogonal dimensions, as
follows:
• Applications: Applies polyhedral loop transformations to deep learning computation kernels
to demonstrate the effectiveness of complex loop transformations on these kernels.
• Approximations: Develops two efficient algorithms to over-approximate convex polyhedra
into U-TVPI polyhedra, with applications in polyhedral compilation as well as automated
program verification.
• GPU-Specific Optimizations: Builds an end-to-end, fully automatic compiler framework to
generate cache-optimized CUDA code beginning from a sequential C program, using polyhedral
modelling techniques.
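For context (the standard definition, not taken from the thesis), a U-TVPI (unit two-variables-per-inequality) polyhedron is one whose constraints each relate at most two variables with unit coefficients:

```latex
a\,x_i + b\,x_j \le c, \qquad a, b \in \{-1, 0, +1\},\ c \in \mathbb{Z}
```

Over-approximating a general convex polyhedron into this class trades some precision for the fast, shortest-path-style closure and satisfiability algorithms that such constraint systems admit.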
LLOV: A Fast Static Data-Race Checker for OpenMP Programs
In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time,
writing sound parallel programs is highly difficult. While parallel programming is easier with frameworks
such as OpenMP, the possibility of data races in these programs still persists. In this paper, we propose a
fast, lightweight, language agnostic, and static data race checker for OpenMP programs based on the LLVM
compiler framework. We compare our tool with other state-of-the-art data race checkers on a variety of
well-established benchmarks. We show that the precision, accuracy, and F1 score of our tool are comparable
to those of other checkers while being orders of magnitude faster. To the best of our knowledge, ours is the only
tool among the state-of-the-art data race checkers that can verify a FORTRAN program to be data race free.
RL4ReAl: Reinforcement Learning for Register Allocation
We propose a novel solution for the Register Allocation problem, leveraging
multi-agent hierarchical Reinforcement Learning. We formalize the constraints
that precisely define the problem for a given instruction-set architecture,
while ensuring that the generated code preserves semantic correctness. We also
develop a gRPC based framework providing a modular and efficient compiler
interface for training and inference. Experimental results match or outperform
the LLVM register allocators, targeting Intel x86 and ARM AArch64
OpenMP aware MHP Analysis for Improved Static Data-Race Detection
Data races, a major source of bugs in concurrent programs, can result in loss of manpower and time as well as data loss due to system failures. OpenMP, the de facto shared memory parallelism framework used in the HPC community, also suffers from data races. To detect race conditions in OpenMP programs and improve turnaround time and/or developer productivity, we present a data-flow-analysis-based, fast, static data race checker in the LLVM compiler framework. Our tool can detect races in the presence or absence of explicit barriers, with implicit or explicit synchronization. In addition, our tool works effectively for the OpenMP target offloading constructs and also supports the frequently used OpenMP constructs.
We formalize and provide a data flow analysis framework to perform Phase Interval Analysis (PIA) of OpenMP programs. Phase intervals are then used to compute the MHP (and its complement NHP) sets for the programs, which, in turn, are used to detect data races statically.
We evaluate our work using multiple OpenMP race detection benchmarks and real-world applications. Our experiments show that the checker is comparable to the state-of-the-art in various performance metrics, with around 90% accuracy, almost perfect recall, and a significantly lower runtime and memory footprint. © 2021 IEEE
LLOV: A Fast Static Data-Race Checker for OpenMP Programs
In the era of Exascale computing, writing efficient parallel programs is
indispensable and at the same time, writing sound parallel programs is very
difficult. Specifying parallelism with frameworks such as OpenMP is relatively
easy, but data races in these programs are an important source of bugs. In this
paper, we propose LLOV, a fast, lightweight, language agnostic, and static data
race checker for OpenMP programs based on the LLVM compiler framework. We
compare LLOV with other state-of-the-art data race checkers on a variety of
well-established benchmarks. We show that the precision, accuracy, and F1
score of LLOV are comparable to those of other checkers while being orders of magnitude
faster. To the best of our knowledge, LLOV is the only tool among the
state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program
to be data race free.
Comment: Accepted in ACM TACO, August 202