
    Texturizing PPCG: Supporting Texture Memory in a Polyhedral Compiler

    In this paper, we discuss techniques to transform sequential programs into texture/surface-memory-optimized CUDA programs. We achieve this using PPCG, an automatic parallelizing compiler based on the polyhedral model. We implemented a static analysis in PPCG that validates the semantics of the texturized transformed program. Depending on the results of the analysis, our algorithm chooses to use texture and/or surface memory and alters the abstract syntax tree accordingly. We also modified the code-generation phase of PPCG to handle various subtleties. We evaluated the texturization algorithm on the PolyBench (4.2.1 beta) benchmark suite and observed up to 1.6x speedup, with a geometric mean of 1.103x. Although the title and much of the paper use the term texture memory, the optimizations target both texture and surface memory.
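
    As a rough illustration of what "texture/surface-memory-optimized CUDA" means here, the sketch below reads a read-only input array through a 1D texture object instead of plain global loads. It is a hand-written, hypothetical example (the kernel, the makeTex helper, and all names are ours), not code emitted by the modified PPCG.

        // Hypothetical sketch: route read-only loads of x through the texture cache.
        #include <cuda_runtime.h>

        __global__ void saxpy_tex(int n, float a, cudaTextureObject_t xTex, float *y) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                y[i] += a * tex1Dfetch<float>(xTex, i);   // load x[i] via texture memory
        }

        // Bind a linear device buffer to a 1D texture object.
        cudaTextureObject_t makeTex(float *d_x, int n) {
            cudaResourceDesc res = {};
            res.resType = cudaResourceTypeLinear;
            res.res.linear.devPtr = d_x;
            res.res.linear.desc = cudaCreateChannelDesc<float>();
            res.res.linear.sizeInBytes = (size_t)n * sizeof(float);
            cudaTextureDesc td = {};
            td.readMode = cudaReadModeElementType;
            cudaTextureObject_t tex = 0;
            cudaCreateTextureObject(&tex, &res, &td, NULL);
            return tex;
        }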

    Optimization and parallelization of tensor and ODE/PDE computations on GPU

    We propose a multi-level GPU-based parallelization algorithm to solve the multi-compartment Hodgkin-Huxley (HH) model equation, which requires solving the Hines matrix. We use a ‘parallel-in-time’ algorithm (like the Parareal strategy) to obtain outer-level parallelism, and an Exact Domain Decomposition (EDD) algorithm with fine decomposition for inner-level parallelism. We show that our technique can also be applied to any differential equation, such as the heat equation, that induces tridiagonal systems. Typically, a solution to the HH equation runs for hundreds to tens of thousands of time steps, solving a Hines matrix at each time step. Previous solutions to this problem, by Mascagni et al. (1991) and Hines et al. (2008), tackled only solving the Hines matrix in parallel. Our approach uses the dynamic parallelism of CUDA to achieve multi-level parallelism on GPUs. Our solution outperforms the sequential-in-time method on standard neuron morphologies by up to 2.5x. We also show that the iterative part of the Parareal method converges in 5-7 iterations on average with an accuracy of 10⁻⁶. We also propose a GPU optimization for the Higher Order Tensor Renormalization Group (HOTRG) problem, where the tensor contraction operations inside HOTRG are optimized by a multi-GPU implementation using the cuBLAS-XT API.
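
    For context, the per-time-step solve in the tridiagonal case (e.g., the 1-D heat equation mentioned above) is the classic serial Thomas algorithm sketched below; this serial baseline, not the paper's EDD/Parareal scheme, is what the multi-level GPU parallelization replaces. The code and its naming are ours, for illustration only.

        // Serial Thomas algorithm for a tridiagonal system.
        // a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal,
        // d: right-hand side, overwritten with the solution.
        void thomas(const double *a, double *b, const double *c, double *d, int n) {
            for (int i = 1; i < n; i++) {            // forward elimination
                double m = a[i] / b[i - 1];
                b[i] -= m * c[i - 1];
                d[i] -= m * d[i - 1];
            }
            d[n - 1] /= b[n - 1];                    // back substitution
            for (int i = n - 2; i >= 0; i--)
                d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
        }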

    Vectorization, Obfuscation and P4 LLVM Tool-chain

    This thesis broadly focuses on three different areas: loop vectorization, code obfuscation, and the P4LLVM compiler. The work on loop vectorization starts with a comparison of the auto-vectorization of the GCC, ICC, and LLVM compilers, showing their strengths and weaknesses. As an attempt to improve LLVM's auto-vectorization, we propose improving loop distribution using exact dependences from Polly; our work on loop distribution shows promising results. We developed an LLVM-based code obfuscation engine with various obfuscation techniques as transformation passes; our techniques are novel and differ from existing works [1]. In hardware circuit obfuscation, several methods have been proposed at the hardware level to secure the IP; our approach is to obfuscate the circuits at the software level using code obfuscation techniques.
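
    As a hedged sketch of the loop-distribution idea (ours, not the thesis' Polly-based pass): splitting a loop that mixes a vectorizable statement with one carrying a loop-carried dependence lets the vectorizer handle the first loop in isolation. The arrays are assumed not to alias.

        // Before distribution: S2's recurrence on c blocks vectorization of the loop.
        void before(float *a, const float *b, float *c, int n) {
            for (int i = 1; i < n; i++) {
                a[i] = b[i] + 1.0f;      // S1: no loop-carried dependence
                c[i] = c[i - 1] * a[i];  // S2: loop-carried dependence on c
            }
        }

        // After distribution: the first loop is vectorizable; the second stays serial.
        void after(float *a, const float *b, float *c, int n) {
            for (int i = 1; i < n; i++)
                a[i] = b[i] + 1.0f;
            for (int i = 1; i < n; i++)
                c[i] = c[i - 1] * a[i];
        }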

    Optimizations In Compiler: Vectorization, Reordering, Register Allocation And Verification Of Explicitly Parallel Programs

    Compiler optimizations form a very important part of compiler development, as they make a major difference between an average and a great compiler. A compiler has various modules, each of which opens opportunities for optimization in different spheres. In this thesis, a comparative study of vectorization is performed, exposing the strengths and weaknesses of various contemporary compilers. Additionally, a study of the impact of vectorization on tiled code is performed. Different strategies for loop nest optimization are explored. An algorithm for statement reordering in loops to enhance performance has been developed. An Integer Linear Programming formulation is presented to improve loop parallelism, making use of loop unrolling and explicitly parallel directives. Finally, an attempt at optimal loop distribution is made. Following the loop nest optimization chapter, an explanation of interprocedural register allocation (IPRA) for ARM32 and AArch64 is given, along with a brief description of the problems in implementing IPRA for those architectures. We conclude that chapter with performance results for IPRA on those platforms. In the last chapter, a description of VoPiL, a static OpenMP verifier in LLVM, is presented, together with a brief description of the analysis and its results.
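
    To make the unrolling/explicit-parallelism point concrete, here is a hypothetical example of the kind of transformed loop such a formulation can select: the loop is unrolled by a factor of 4 and marked explicitly parallel with an OpenMP directive. The function, its names, and the assumption that n is a multiple of 4 are ours, not taken from the thesis.

        #include <omp.h>

        // Unrolled by 4 and explicitly parallel; n is assumed to be a multiple of 4.
        void scale(float *a, const float *b, int n) {
            #pragma omp parallel for
            for (int i = 0; i < n; i += 4) {
                a[i]     = 2.0f * b[i];
                a[i + 1] = 2.0f * b[i + 1];
                a[i + 2] = 2.0f * b[i + 2];
                a[i + 3] = 2.0f * b[i + 3];
            }
        }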

    Polyhedral Compilation: Applications, Approximations and GPU-specific Optimizations

    Polyhedral compilation has been successful in analyzing, optimizing, and automatically parallelizing affine computations for modern heterogeneous target architectures. Many tools have been developed to automate the process of program analysis and transformation for the affine control parts of programs, including widely used open-source and production compilers such as GCC, LLVM, and IBM XL. This thesis makes contributions to the polyhedral model in three orthogonal dimensions:
    • Applications: applies polyhedral loop transformations to deep learning computation kernels to demonstrate the effectiveness of complex loop transformations on these kernels.
    • Approximations: develops two efficient algorithms to over-approximate convex polyhedra by U-TVPI polyhedra, with applications in polyhedral compilation as well as automated program verification (a small worked example follows below).
    • GPU-specific optimizations: builds an end-to-end, fully automatic compiler framework that generates cache-optimized CUDA code from a sequential C program using polyhedral modeling techniques.
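
    For readers unfamiliar with U-TVPI (unit two-variables-per-inequality) polyhedra, every constraint has the form ±x ± y <= c. The following tiny, hand-made example (ours, not from the thesis, and not necessarily produced by its algorithms) shows the kind of over-approximation meant here:

        P        = { (x, y) :  0 <= x,  0 <= y,  x + 2y <= 4 }
        UTVPI(P) = { (x, y) :  0 <= x <= 4,  0 <= y <= 2,  x + y <= 4 }

    Every constraint of UTVPI(P) uses unit coefficients on at most two variables, and every point of P satisfies them, so UTVPI(P) is a sound over-approximation of P that a U-TVPI-based analysis can manipulate more cheaply than the original polyhedron.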

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    In the era of exascale computing, writing efficient parallel programs is indispensable, and at the same time, writing sound parallel programs is highly difficult. While parallel programming is easier with frameworks such as OpenMP, the possibility of data races in these programs still persists. In this paper, we propose a fast, lightweight, language-agnostic, static data race checker for OpenMP programs based on the LLVM compiler framework. We compare our tool with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and F1 score of our tool are comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, this work is the only tool among the state-of-the-art data race checkers that can verify a FORTRAN program to be data race free.
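
    A minimal example of the kind of defect such a checker targets (ours, not one of the paper's benchmarks): the reduction clause is missing, so every thread updates the shared accumulator concurrently.

        #include <omp.h>

        int sum_with_race(const int *a, int n) {
            int sum = 0;
            #pragma omp parallel for        // missing reduction(+:sum)
            for (int i = 0; i < n; i++)
                sum += a[i];                // data race on the shared variable 'sum'
            return sum;
        }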

    RL4ReAl: Reinforcement Learning for Register Allocation

    We propose a novel solution to the register allocation problem, leveraging multi-agent hierarchical reinforcement learning. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Experimental results match or outperform the LLVM register allocators, targeting Intel x86 and ARM AArch64.

    OpenMP aware MHP Analysis for Improved Static Data-Race Detection

    Data races, a major source of bugs in concurrent programs, can result in loss of manpower and time, as well as data loss due to system failures. OpenMP, the de facto shared-memory parallelism framework used in the HPC community, also suffers from data races. To detect race conditions in OpenMP programs and improve turnaround time and/or developer productivity, we present a fast, static, data-flow-analysis-based data race checker in the LLVM compiler framework. Our tool can detect races in the presence or absence of explicit barriers, with implicit or explicit synchronization. In addition, our tool works effectively for the OpenMP target offloading constructs and also supports the frequently used OpenMP constructs. We formalize and provide a data flow analysis framework to perform Phase Interval Analysis (PIA) of OpenMP programs. Phase intervals are then used to compute the MHP (and its complement NHP) sets for the programs, which, in turn, are used to detect data races statically. We evaluate our work using multiple OpenMP race detection benchmarks and real-world applications. Our experiments show that the checker is comparable to the state of the art in various performance metrics, with around 90% accuracy, almost perfect recall, and a significantly lower runtime and memory footprint. © 2021 IEEE
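
    A small, hand-written illustration of the phase idea (ours, not code from the paper): the explicit barrier separates the parallel region into two phases, so the writes to a[] in phase 0 cannot happen in parallel with the reads of a[] in phase 1, and a phase-aware MHP analysis will not report a race between them.

        #include <omp.h>

        void stencil(double *a, double *b, int n) {
            #pragma omp parallel
            {
                #pragma omp for nowait
                for (int i = 0; i < n; i++)      // phase 0: produce a[]
                    a[i] = i;
                #pragma omp barrier              // phase boundary
                #pragma omp for
                for (int i = 1; i < n; i++)      // phase 1: consume neighbouring a[]
                    b[i] = a[i - 1] + a[i];
            }
        }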

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    In the era of exascale computing, writing efficient parallel programs is indispensable, and at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this paper, we propose LLOV, a fast, lightweight, language-agnostic, static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and F1 score of LLOV are comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free. Comment: Accepted in ACM TACO, August 202