
    Loopy: Programmable and Formally Verified Loop Transformations

    This paper presents Loopy, a system for programming loop transformations. Manual loop transformation can be tedious and error-prone, while fully automated methods do not guarantee improvements. Loopy takes a middle path: a programmer specifies a loop transformation at a high level, which is then carried out automatically by Loopy and formally verified to guard against specification and implementation mistakes. Loopy's notation offers considerable flexibility in assembling transformations, while automation and checking prevent errors. Loopy is implemented for the LLVM framework, building on a polyhedral compilation library. Experiments show substantial improvements over fully automated loop transformations, using simple and direct specifications.
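    The abstract does not reproduce Loopy's specification notation, so the sketch below is only illustrative of the kind of transformation such a system automates and verifies. It shows a classic loop tiling in CUDA-compatible C++; the "spec" comment, the function names, and the tile size are hypothetical, not Loopy's actual syntax.

        // Illustrative only: a classic tiling transformation of the kind a
        // high-level specification might request. The spec comment below is
        // hypothetical, not Loopy's notation.

        // Original loop nest: row-major traversal of an n x n matrix.
        void scale(float* a, int n, float s) {
          for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
              a[i * n + j] *= s;
        }

        // spec: tile(i, j, 32)   <-- hypothetical high-level directive
        // Result of the transformation: the same iterations, regrouped into
        // 32x32 tiles for locality. A verifier then checks that the two loop
        // nests compute the same result.
        void scale_tiled(float* a, int n, float s) {
          const int T = 32;  // tile size taken from the (hypothetical) spec
          for (int ii = 0; ii < n; ii += T)
            for (int jj = 0; jj < n; jj += T)
              for (int i = ii; i < ii + T && i < n; ++i)
                for (int j = jj; j < jj + T && j < n; ++j)
                  a[i * n + j] *= s;
        }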

    Hedging Bets in Markov Decision Processes

    The classical model of Markov decision processes with costs or rewards, while widely used to formalize optimal decision making, cannot capture scenarios where the agent pursues multiple objectives during the system's evolution but only one of these objectives is actualized upon termination. We introduce the model of Markov decision processes with alternative objectives (MDPAO) for formalizing optimization in such scenarios. Computing the strategy that optimizes the expected cost/reward upon termination requires balancing the values of the alternative objectives, which in turn requires analysis of the underlying infinite-state process that tracks the accumulated values of all the objectives. While the decidability of computing the exact optimal strategy for the general model remains open, we present the following results. First, for a Markov chain with alternative objectives, the optimal expected cost/reward can be computed in polynomial time. Second, for a single-state process with two actions and multiple objectives, we show how to compute the optimal decision strategy. Third, for a process with only two alternative objectives, we present a reduction to the minimum expected accumulated reward problem for one-counter MDPs, which yields decidability for this case under some technical restrictions. Finally, we show that the optimal cost/reward can be approximated up to a constant additive factor for the general problem.
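    The optimization the abstract alludes to can be stated compactly. The notation below is our own hedged sketch, not the paper's formal definition: along a run, each objective $i$ accumulates a value $x_i(t)$, and upon termination at a (random) time $T$ exactly one objective $O$ is actualized, so for the reward variant a strategy $\sigma$ is chosen to achieve

        % Hedged sketch; the notation is ours, not necessarily the paper's.
        % x_i(t): value accumulated for objective i after t steps
        % T: termination time;  O: the single objective actualized at T
        \[
          \sup_{\sigma}\; \mathbb{E}^{\sigma}\!\bigl[\, x_{O}(T) \,\bigr].
        \]

    Because the best action at step $t$ can depend on the entire accumulated vector $(x_1(t), \dots, x_k(t))$, the analysis must reason about an infinite-state process over these values, which is the source of the decidability difficulties described above.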

    Static Analysis for GPU Program Performance

    GPUs have become popular due to their high computational power. Data scientists rely on GPUs to process the large volumes of data generated by their systems. From a humble beginning as graphics accelerators for arcade games, they have become essential compute units in many important applications. The programming infrastructure for GPU programs is still rudimentary: the GPU programmer needs to understand the intricacies of GPU architecture, tune various execution parameters, and optimize parts of the program using low-level primitives. GPU compilers are still far from the automation provided by CPU compilers, where the programmer is often oblivious to the details of the underlying architecture. In this work, we present lightweight formal approaches to improve the performance of general GPU programs. This enables our tools to be fast, correct, and accessible to everyone. We present three techniques. First, we present a compile-time analysis to identify uncoalesced accesses in GPU programs. Uncoalesced accesses are a well-documented memory access pattern that leads to poor performance. Second, we present an analysis to verify block-size independence of GPU programs. Block size is an execution parameter that must be tuned to utilize GPU resources optimally. We present a static analysis to verify block-size independence for synchronization-free GPU programs, ensuring that modifying the block size does not break program functionality. Finally, we present a compile-time optimization that leverages cache reuse in GPUs to improve the performance of GPU programs. GPUs often abandon cache-reuse-based performance improvement in favor of thread-level parallelism, where a large number of threads are executed to hide the latency of memory and compute operations. We define a compile-time analysis to identify programs with significant intra-thread locality and little inter-thread locality, where cache reuse is useful, and a transformation that modifies the block size, which indirectly influences the hardware thread scheduler to improve cache utilization. We have implemented the above approaches in LLVM and evaluated them on various benchmarks. The uncoalesced access analysis identifies 111 uncoalesced accesses, the block-size independence analysis verifies 35 block-size independent kernels, and the cache reuse optimization improves performance by an average of 1.3x on two Nvidia GPUs. The approaches are fast, finishing within a few seconds for most programs.
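    To make the first analysis concrete, here is a minimal CUDA sketch, our own example rather than one from the dissertation, contrasting a coalesced access with the uncoalesced pattern such an analysis would flag.

        // Minimal CUDA sketch (not from the dissertation) of the access
        // patterns targeted by an uncoalesced-access analysis.
        __global__ void coalesced(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          // Adjacent threads touch adjacent addresses, so one memory
          // transaction can serve a whole warp.
          if (i < n) out[i] = 2.0f * in[i];
        }

        __global__ void uncoalesced(const float* in, float* out, int n,
                                    int stride) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          // Adjacent threads touch addresses `stride` elements apart, so a
          // warp's loads scatter across many cache lines and the hardware
          // issues many transactions. A compile-time analysis can flag the
          // indexing expression in[i * stride].
          if (i * stride < n) out[i] = 2.0f * in[i * stride];
        }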

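    The block-size independence property can be illustrated the same way. In this hedged sketch (again our own example), the first kernel's result depends only on each thread's global index, so regrouping threads into a different block size leaves the output unchanged; the second kernel indexes its output by blockIdx.x, so changing the block size changes the result. Both kernels are synchronization-free, matching the scope of the analysis described above.

        // Hedged CUDA sketch (our example) of block-size (in)dependence.
        __global__ void independent(const float* in, float* out, int n) {
          // Each output element is written by exactly one thread, chosen by
          // its global index; the block/grid decomposition is irrelevant as
          // long as enough threads are launched.
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] + 1.0f;
        }

        __global__ void dependent(const float* in, float* out, int n) {
          // The output location depends on blockIdx.x, so the per-block
          // partial sums change when the block size changes: this kernel is
          // not block-size independent, and a verifier should reject it.
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) atomicAdd(&out[blockIdx.x], in[i]);
        }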