103 research outputs found

    Automatic Parallelization With Statistical Accuracy Bounds

    Get PDF
    Traditional parallelizing compilers are designed to generate parallel programs that produce identical outputs as the original sequential program. The difficulty of performing the program analysis required to satisfy this goal and the restricted space of possible target parallel programs have both posed significant obstacles to the development of effective parallelizing compilers. The QuickStep compiler is instead designed to generate parallel programs that satisfy statistical accuracy guarantees. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler and a significant expansion in the range of parallel programs that it can legally generate. QuickStep exploits this flexibility to take a fundamentally different approach from traditional parallelizing compilers. It applies a collection of transformations (loop parallelization, loop scheduling, synchronization introduction, and replication introduction) to generate a search space of parallel versions of the original sequential program. It then searches this space (prioritizing the parallelization of the most time-consuming loops in the application) to find a final parallelization that exhibits good parallel performance and satisfies the statistical accuracy guarantee. At each step in the search it performs a sequence of trial runs on representative inputs to examine the performance, accuracy, and memory accessing characteristics of the current generated parallel program. An analysis of these characteristics guides the steps the compiler takes as it explores the search space of parallel programs. Results from our benchmark set of applications show that QuickStep can automatically generate parallel programs with good performance and statistically accurate outputs. For two of the applications, the parallelization introduces noise into the output, but the noise remains within acceptable statistical bounds. The simplicity of the compilation strategy and the performance and statistical acceptability of the generated parallel programs demonstrate the advantages of the QuickStep approach

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Get PDF
    Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach of optimizing the address generation of these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting the presence of common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decrease in address arithmetic overhead, access to data in registers as opposed to caches, etc. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge to optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, little work has been done in the area of applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding theminimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive due to its high cost. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving high level of performance on a parallel machine. Chapter 5 presents an approach using the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves the data locality but significantly reduces communication overhead

    Beyond shared memory loop parallelism in the polyhedral model

    Get PDF
    2013 Spring.Includes bibliographical references.With the introduction of multi-core processors, motivated by power and energy concerns, parallel processing has become main-stream. Parallel programming is much more difficult due to its non-deterministic nature, and because of parallel programming bugs that arise from non-determinacy. One solution is automatic parallelization, where it is entirely up to the compiler to efficiently parallelize sequential programs. However, automatic parallelization is very difficult, and only a handful of successful techniques are available, even after decades of research. Automatic parallelization for distributed memory architectures is even more problematic in that it requires explicit handling of data partitioning and communication. Since data must be partitioned among multiple nodes that do not share memory, the original memory allocation of sequential programs cannot be directly used. One of the main contributions of this dissertation is the development of techniques for generating distributed memory parallel code with parametric tiling. Our approach builds on important contributions to the polyhedral model, a mathematical framework for reasoning about program transformations. We show that many affine control programs can be uniformized only with simple techniques. Being able to assume uniform dependences significantly simplifies distributed memory code generation, and also enables parametric tiling. Our approach implemented in the AlphaZ system, a system for prototyping analyses, transformations, and code generators in the polyhedral model. The key features of AlphaZ are memory re-allocation, and explicit representation of reductions. We evaluate our approach on a collection of polyhedral kernels from the PolyBench suite, and show that our approach scales as well as PLuTo, a state-of-the-art shared memory automatic parallelizer using the polyhedral model. Automatic parallelization is only one approach to dealing with the non-deterministic nature of parallel programming that leaves the difficulty entirely to the compiler. Another approach is to develop novel parallel programming languages. These languages, such as X10, aim to provide highly productive parallel programming environment by including parallelism into the language design. However, even in these languages, parallel bugs remain to be an important issue that hinders programmer productivity. Another contribution of this dissertation is to extend the array dataflow analysis to handle a subset of X10 programs. We apply the result of dataflow analysis to statically guarantee determinism. Providing static guarantees can significantly increase programmer productivity by catching questionable implementations at compile-time, or even while programming

    Parallelizing Sequential Programs With Statistical Accuracy Tests

    Get PDF
    We present QuickStep, a novel system for parallelizing sequential programs. QuickStep deploys a set of parallelization transformations that together induce a search space of candidate parallel programs. Given a sequential program, representative inputs, and an accuracy requirement, QuickStep uses performance measurements, profiling information, and statistical accuracy tests on the outputs of candidate parallel programs to guide its search for a parallelizationthat maximizes performance while preserving acceptable accuracy. When the search completes, QuickStep produces an interactive report that summarizes the applied parallelization transformations, performance, and accuracy results for the automatically generated candidate parallel programs. In our envisioned usage scenarios, the developer examines this report to evaluate the acceptability of the final parallelization and to obtain insight into how the original sequential program responds to different parallelization strategies. Itis also possible for the developer (or even a user of the program who has no software development expertise whatsoever) to simply use the best parallelization out of the box without examining the report or further investigating the parallelization. Results from our benchmark set of applications show that QuickStep can automatically generate accurate and efficient parallel programs---the automatically generated parallel versions of five of our six benchmark applications run between 5.0 and 7.7 times faster on 8 cores than the original sequential versions. Moreover, a comparison with the Intel icc compiler highlights how QuickStep can effectively parallelize applications with features (such as the use of modern object-oriented programming constructs or desirable parallelizations with infrequent but acceptable data races) that place them inherently beyond the reach of standard approaches

    Run-time optimization of adaptive irregular applications

    Get PDF
    Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underneath execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data, In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses effcient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems

    Enhanced applicability of loop transformations

    Get PDF

    Search-based Model-driven Loop Optimizations for Tensor Contractions

    Get PDF
    Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. The Tensor Contraction Engine (TCE) is a high-level program synthesis system that facilitates the generation of high-performance parallel programs from tensor contraction equations. We are developing a new software infrastructure for the TCE that is designed to allow experimentation with optimization algorithms for modern computing platforms, including for heterogeneous architectures employing general-purpose graphics processing units (GPGPUs). In this dissertation, we present improvements and extensions to the loop fusion optimization algorithm, which can be used with cost models, e.g., for minimizing memory usage or for minimizing data movement costs under a memory constraint. We show that our data structure and pruning improvements to the loop fusion algorithm result in significant performance improvements that enable complex cost models being use for large input equations. We also present an algorithm for optimizing the fused loop structure of handwritten code. It determines the regions in handwritten code that are safe to be optimized and then runs the loop fusion algorithm on the dependency graph of the code. Finally, we develop an optimization framework for generating GPGPU code consisting of loop fusion optimization with a novel cost model, tiling optimization, and layout optimization. Depending on the memory available on the GPGPU and the sizes of the tensors, our framework decides which processor (CPU or GPGPU) should perform an operation and where the result should be moved. We present extensive measurements for tuning the loop fusion algorithm, for validating our optimization framework, and for measuring the performance characteristics of GPGPUs. Our measurements demonstrate that our optimization framework outperforms existing general-purpose optimization approaches both on multi-core CPUs and on GPGPUs

    Empirically Tuning HPC Kernels with iFKO

    Get PDF
    iFKO (iterative Floating point Kernel Optimizer) is an open-source iterative empirical compilation framework which can be used to tune high performance computing (HPC) kernels. The goal of our research is to advance iterative empirical compilation to the degree that the performance it can achieve is comparable to that delivered by painstaking hand tuning in assembly. This will allow many HPC researchers to spend precious development time on higher level aspects of tuning such as parallelization, as well as enabling computational scientists to develop new algorithms that demand new high performance kernels. At present, algorithms that cannot use hand-tuned performance libraries tend to lose to even inferior algorithms that can. We discuss our new autovectorization technique (speculative vectorization) which can autovectorize loops past dependent branches by speculating along frequently taken paths, even when other paths cannot be effectively vectorized. We implemented this technique in iFKO and demonstrated significant speedup for kernels that prior vectorization techniques could not optimize. We have developed an optimization for two dimensional array indexing that is critical for allowing us to heavily unroll and jam loops without restriction from integer register pressure. We then extended the state of the art single basic block vectorization method, SLP, to vectorize nested loops. We have also introduced optimized reductions that can retain full SIMD parallelization for the entire reduction, as well as doing loop specialization and unswitching as needed to address vector alignment issues and paths inside the loops which inhibit autovectorization. We have also implemented a critical transformation for optimal vectorization of mixed-type data. Combining all these techniques we can now fully vectorize the loopnests for our most complicated kernels, allowing us to achieve performance very close to that of hand-tuned assembly

    Performance analysis and tuning in multicore environments

    Get PDF
    Performance analysis is the task of monitor the behavior of a program execution. The main goal is to find out the possible adjustments that might be done in order improve the performance. To be able to get that improvement it is necessary to find the different causes of overhead. Nowadays we are already in the multicore era, but there is a gap between the level of development of the two main divisions of multicore technology (hardware and software). When we talk about multicore we are also speaking of shared memory systems, on this master thesis we talk about the issues involved on the performance analysis and tuning of applications running specifically in a shared Memory system. We move one step ahead to take the performance analysis to another level by analyzing the applications structure and patterns. We also present some tools specifically addressed to the performance analysis of OpenMP multithread application. At the end we present the results of some experiments performed with a set of OpenMP scientific application.Análisis de rendimiento es el área de estudio encargada de monitorizar el comportamiento de la ejecución de programas informáticos. El principal objetivo es encontrar los posibles ajustes que serán necesarios para mejorar el rendimiento. Para poder obtener esa mejora es necesario encontrar las principales causas de overhead. Actualmente estamos sumergidos en la era multicore, pero existe una brecha entre el nivel de desarrollo de sus dos principales divisiones (hardware y software). Cuando hablamos de multicore también estamos hablando de sistemas de memoria compartida. Nosotros damos un paso más al abordar el análisis de rendimiento a otro nivel por medio del estudio de la estructura de las aplicaciones y sus patrones. También presentamos herramientas de análisis de aplicaciones que son específicas para el análisis de rendimiento de aplicaciones paralelas desarrolladas con OpenMP. Al final presentamos los resultados de algunos experimentos realizados con un grupo de aplicaciones científicas desarrolladas bajo este modelo de programación.L'Anàlisi de rendiment és l'àrea d'estudi encarregada de monitorar el comportament de l'execució de programes informàtics. El principal objectiu és trobar els possibles ajustaments que seran necessaris per a millorar el rendiment. Per a poder obtenir aquesta millora és necessari trobar les principals causes de l'overhead (excessos de computació no productiva). Actualment estem immersos en l'era multicore, però existeix una rasa entre el nivell de desenvolupament de les seves dues principals divisions (maquinari i programari). Quan parlam de multicore, també estem parlant de sistemes de memòria compartida. Nosaltres donem un pas més per a abordar l'anàlisi de rendiment en un altre nivell per mitjà de l'estudi de l'estructura de les aplicacions i els seus patrons. També presentem eines d'anàlisis d'aplicacions que són específiques per a l'anàlisi de rendiment d'aplicacions paral·leles desenvolupades amb OpenMP. Al final, presentem els resultats d'alguns experiments realitzats amb un grup d'aplicacions científiques desenvolupades sota aquest model de programació
    corecore