742 research outputs found

    PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs

    Full text link
    We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming

    Parallel machine architecture and compiler design facilities

    Get PDF
    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of Delta project (which objective is to provide a facility to allow rapid prototyping of parallelized compilers that can target toward different machine architectures) is summarized. Included are the surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role

    Automatic Parallelization of Affine Loops using Dependence and Cache analysis in a Binary Rewriter

    Get PDF
    Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. Bridging this disconnect by manually rewriting source code in parallel is prohibitively expensive. Automatic parallelization technology is therefore an attractive alternative. We present a method to perform automatic parallelization in a binary rewriter. The input to the binary rewriter is the serial binary executable program and the output is a parallel binary executable. The advantages of parallelization in a binary rewriter versus a compiler include (i) compatibility with all compilers and languages; (ii) high economic feasibility from avoiding repeated compiler implementation; (iii) applicability to legacy binaries; and (iv) applicability to assembly-language programs. Adapting existing parallelizing compiler methods that work on source code to work on binary programs instead is a significant challenge. This is primarily because symbolic and array index information used in existing compiler parallelizers is not available in a binary. We show how to adapt existing parallelization methods to achieve equivalent parallelization from a binary without such information. We have also designed a affine cache reuse model that works inside a binary rewriter building on the parallelization techniques. It quantifies cache reuse in terms of the number of cache lines that will be required when a loop dimension is considered for the innermost position in a loop nest. This cache metric can be used to reason about affine code that results when affine code is transformed using affine transformations. Hence, it can be used to evaluate candidate transformation sequences to improve run-time directly from a binary. Results using our x86 binary rewriter called SecondWrite on a suite of dense- matrix regular programs from Polybench suite of benchmarks shows an geomean speedup of 6.81X from binary and 8.9X from source with 8 threads compared to the input serial binary on a x86 Xeon E5530 machine; and 8.31X from binary and 9.86X from source with 24 threads compared to the input serial binary on a x86 E7450 machine. Such regular loops are an important component of scientific and multi- media workloads, and are even present to a limited extent in otherwise non-regular programs. Further in this thesis we present a novel algorithm that enhances the past techniques significantly for loops with unknown loop bounds by guessing the loop bounds using only the memory expressions present in a loop. It then inserts run-time checks to see if these guesses were indeed correct and if correct executes the parallel version of the loop, else the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001 and unlike previous methods the speedups from binary are as good as from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. Among the 8 affine benchmarks among these suites, the best existing binary parallelization method achieves an geo-mean speedup of 1.33X, whereas our method achieves a speedup of 2.96X. This is close to the speedup from source code of 2.8X

    XARK: an extensible framework for automatic recognition of computational kernels

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in ACM Transactions on Programming Languages and Systems. The final authenticated version is available online at: http://dx.doi.org/10.1145/1391956.1391959[Abstract] The recognition of program constructs that are frequently used by software developers is a powerful mechanism for optimizing and parallelizing compilers to improve the performance of the object code. The development of techniques for automatic recognition of computational kernels such as inductions, reductions and array recurrences has been an intensive research area in the scope of compiler technology during the 90's. This article presents a new compiler framework that, unlike previous techniques that focus on specific and isolated kernels, recognizes a comprehensive collection of computational kernels that appear frequently in full-scale real applications. The XARK compiler operates on top of the Gated Single Assignment (GSA) form of a high-level intermediate representation (IR) of the source code. Recognition is carried out through a demand-driven analysis of this high-level IR at two different levels. First, the dependences between the statements that compose the strongly connected components (SCCs) of the data-dependence graph of the GSA form are analyzed. As a result of this intra-SCC analysis, the computational kernels corresponding to the execution of the statements of the SCCs are recognized. Second, the dependences between statements of different SCCs are examined in order to recognize more complex kernels that result from combining simpler kernels in the same code. Overall, the XARK compiler builds a hierarchical representation of the source code as kernels and dependence relationships between those kernels. This article describes in detail the collection of computational kernels recognized by the XARK compiler. Besides, the internals of the recognition algorithms are presented. The design of the algorithms enables to extend the recognition capabilities of XARK to cope with new kernels, and provides an advanced symbolic analysis framework to run other compiler techniques on demand. Finally, extensive experiments showing the effectiveness of XARK for a collection of benchmarks from different application domains are presented. In particular, the SparsKit-II library for the manipulation of sparse matrices, the Perfect benchmarks, the SPEC CPU2000 collection and the PLTMG package for solving elliptic partial differential equations are analyzed in detail.Ministeiro de Educación y Ciencia; TIN2004-07797-C02Ministeiro de Educación y Ciencia; TIN2007-67537-C03Xunta de Galicia; PGIDIT05PXIC10504PNXunta de Galicia; PGIDIT06PXIB105228P

    Automatic parallelization of irregular and pointer-based computations: perspectives from logic and constraint programming

    Get PDF
    Irregular computations pose some of the most interesting and challenging problems in automatic parallelization. Irregularity appears in certain kinds of numerical problems and is pervasive in symbolic applications. Such computations often use dynamic data structures which make heavy use of pointers. This complicates all the steps of a parallelizing compiler, from independence detection to task partitioning and placement. In the past decade there has been significant progress in the development of parallelizing compilers for logic programming and, more recently, constraint programming. The typical applications of these paradigms frequently involve irregular computations, which arguably makes the techniques used in these compilers potentially interesting. In this paper we introduce in a tutorial way some of the problems faced by parallelizing compilers for logic and constraint programs. These include the need for inter-procedural pointer aliasing analysis for independence detection and having to manage speculative and irregular computations through task granularity control and dynamic task allocation. We also provide pointers to some of the progress made in these áreas. In the associated talk we demónstrate representatives of several generations of these parallelizing compilers

    Parallelizing irregular and pointer-based computations automatically: perspectives from logic and constraint programming

    Get PDF
    Irregular computations pose sorne of the most interesting and challenging problems in automatic parallelization. Irregularity appears in certain kinds of numerical problems and is pervasive in symbolic applications. Such computations often use dynamic data structures, which make heavy use of pointers. This complicates all the steps of a parallelizing compiler, from independence detection to task partitioning and placement. Starting in the mid 80s there has been significant progress in the development of parallelizing compilers for logic pro­gramming (and more recently, constraint programming) resulting in quite capable paralle­lizers. The typical applications of these paradigms frequently involve irregular computations, and make heavy use of dynamic data structures with pointers, since logical variables represent in practice a well-behaved form of pointers. This arguably makes the techniques used in these compilers potentially interesting. In this paper, we introduce in a tutoríal way, sorne of the problems faced by parallelizing compilers for logic and constraint programs and provide pointers to sorne of the significant progress made in the area. In particular, this work has resulted in a series of achievements in the areas of inter-procedural pointer aliasing analysis for independence detection, cost models and cost analysis, cactus-stack memory management, techniques for managing speculative and irregular computations through task granularity control and dynamic task allocation such as work-stealing schedulers), etc
    corecore