
    XARK: an extensible framework for automatic recognition of computational kernels

    This is a post-peer-review, pre-copyedit version of an article published in ACM Transactions on Programming Languages and Systems. The final authenticated version is available online at: http://dx.doi.org/10.1145/1391956.1391959
    [Abstract] The recognition of program constructs that are frequently used by software developers is a powerful mechanism for optimizing and parallelizing compilers to improve the performance of the object code. The development of techniques for the automatic recognition of computational kernels such as inductions, reductions and array recurrences was an intensive research area in compiler technology during the 1990s. This article presents a new compiler framework that, unlike previous techniques that focus on specific and isolated kernels, recognizes a comprehensive collection of computational kernels that appear frequently in full-scale real applications. The XARK compiler operates on top of the Gated Single Assignment (GSA) form of a high-level intermediate representation (IR) of the source code. Recognition is carried out through a demand-driven analysis of this high-level IR at two different levels. First, the dependences between the statements that compose the strongly connected components (SCCs) of the data-dependence graph of the GSA form are analyzed. As a result of this intra-SCC analysis, the computational kernels corresponding to the execution of the statements of the SCCs are recognized. Second, the dependences between statements of different SCCs are examined in order to recognize more complex kernels that result from combining simpler kernels in the same code. Overall, the XARK compiler builds a hierarchical representation of the source code as kernels and dependence relationships between those kernels. This article describes in detail the collection of computational kernels recognized by the XARK compiler. In addition, the internals of the recognition algorithms are presented. The design of the algorithms makes it possible to extend the recognition capabilities of XARK to cope with new kernels, and provides an advanced symbolic analysis framework to run other compiler techniques on demand. Finally, extensive experiments showing the effectiveness of XARK for a collection of benchmarks from different application domains are presented. In particular, the SparsKit-II library for the manipulation of sparse matrices, the Perfect benchmarks, the SPEC CPU2000 collection and the PLTMG package for solving elliptic partial differential equations are analyzed in detail.
    Ministerio de Educación y Ciencia; TIN2004-07797-C02. Ministerio de Educación y Ciencia; TIN2007-67537-C03. Xunta de Galicia; PGIDIT05PXIC10504PN. Xunta de Galicia; PGIDIT06PXIB105228P.
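    The kernel classes named above can be illustrated with a short, self-contained loop. The C sketch below is ours (not taken from the article); it shows the three kinds of computations a recognizer in the style of XARK classifies, each of which forms its own SCC in the data-dependence graph of the GSA form:

        #include <stddef.h>

        /* Illustrative loop containing three kernel classes; a[0] is
         * assumed to be initialized by the caller. */
        void kernels(const double *b, double *a, double *sum, size_t n) {
            size_t j = 0;     /* induction variable: j == 2*i after iteration i */
            double s = 0.0;   /* reduction accumulator */
            for (size_t i = 1; i < n; i++) {
                j = j + 2;                  /* linear induction */
                s = s + b[i];               /* scalar reduction */
                a[i] = a[i - 1] + b[i];     /* first-order array recurrence */
            }
            *sum = s + (double)j;           /* keep both results live */
        }

    The self-dependences of j, s and a yield three separate SCCs, classified as induction, reduction and recurrence by the intra-SCC analysis; inter-SCC dependences, such as the final statement combining s and j, are what the second analysis level examines.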

    A Novel Compiler Support for Automatic Parallelization on Multicore Systems

    [Abstract] The widespread use of multicore processors is not a consequence of significant advances in parallel programming. Instead, multicore processors arose due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge because of the complex program analyses it requires and the unknowns that remain at compile time. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of the domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flow). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000. (The Spanish and Galician abstracts duplicate the English abstract and are omitted here.)
    Ministerio de Economía y Competitividad; TIN2010-16735. Ministerio de Educación y Cultura; AP2008-0101.
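    The role of the kernel-centric representation can be seen in a small example. The two C routines below are ours, for illustration only; they compute the same scalar reduction under two syntactic variants (array subscripts versus pointer traversal) that such a representation would map to the same domain-independent kernel:

        #include <stddef.h>

        double sum_array(const double *v, size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n; i++)
                s += v[i];                /* array-subscript form */
            return s;
        }

        double sum_pointer(const double *v, size_t n) {
            double s = 0.0;
            for (const double *p = v; p != v + n; p++)
                s += *p;                  /* pointer-traversal form */
            return s;
        }

    Because both variants collapse to a single reduction kernel in the intermediate representation, the parallelization decision is made once, independently of the surface syntax.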

    Run-time optimization of adaptive irregular applications

    Compared to traditional compile-time optimization, run-time optimization can offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of the input data, the program's dynamic behavior, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on the program's input data and its dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of a program's dynamic memory access patterns to the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework that automatically and adaptively selects at run time the best-performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line, automatically generated prediction models and on characteristics of the algorithm's input data that are collected and analyzed dynamically. In this dissertation, we specialize this framework for the automatic selection of reduction algorithms. We identified a small set of machine-independent, high-level characterization parameters and deployed an off-line, systematic experimental process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for reduction algorithm selection, the performance of the selected algorithm is within 2% of optimal and on average 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
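    As a concrete reference point for the selection framework, the following OpenMP sketch shows a replicated-buffer parallel reduction of the kind the dissertation uses as its baseline; the code and names are ours, for illustration, not the dissertation's implementation:

        #include <stdlib.h>

        /* Irregular (histogram-style) reduction: hist[idx[i]] += val[i]. */
        void hist_reduce(const int *idx, const double *val, size_t n,
                         double *hist, size_t bins) {
            #pragma omp parallel
            {
                /* replicated buffer: one private copy per thread */
                double *priv = calloc(bins, sizeof(double));
                #pragma omp for nowait
                for (size_t i = 0; i < n; i++)
                    priv[idx[i]] += val[i];   /* race-free private update */
                #pragma omp critical
                for (size_t b = 0; b < bins; b++)
                    hist[b] += priv[b];       /* merge into shared result */
                free(priv);
            }
        }

    Replication makes every update race-free at the price of extra memory proportional to threads times bins, plus a serialized merge; when the access pattern is sparse or clustered, other reduction schemes win, which is precisely what motivates selecting the algorithm at run time from the measured characteristics of the index array.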

    Hybrid analysis of memory references and its application to automatic parallelization

    Executing sequential code in parallel on a multithreaded machine has been an elusive goal of the academic and industrial research communities for many years. It has recently become more important due to the widespread introduction of multicores in PCs. Automatic multithreading has not been achieved because classic static compiler analysis was not powerful enough and program behavior was found to be, in many cases, input dependent. Speculative thread-level parallelization was a welcome avenue for advancing parallelization coverage, but its performance was not always optimal due to the sometimes unnecessary overhead of checking every dynamic memory reference. In this dissertation we introduce a novel analysis technique, Hybrid Analysis, which unifies static and dynamic memory reference techniques into a seamless compiler framework that extracts almost all of the available parallelism from scientific codes while incurring close to the minimum necessary run-time overhead. We show how to extract maximum information from the quantities that could not be sufficiently analyzed through static compiler methods, and how to generate sufficient conditions which, when evaluated dynamically, can validate optimizations. Our techniques have been fully implemented in the Polaris compiler and have resulted in whole-program speedups on a large number of industry-standard benchmark applications.
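    The flavor of such dynamically evaluated sufficient conditions can be sketched in a few lines. The guard below (disjointness of the read and write regions of a shifted copy) is an illustrative example of ours, not the dissertation's actual predicate:

        #include <stddef.h>

        /* Writes a[lo..hi), reads a[lo+k..hi+k); assumes lo <= hi and the
         * caller guarantees a[] is large enough for both regions. */
        void shift_left(double *a, size_t lo, size_t hi, size_t k) {
            if (k >= hi - lo) {
                /* run-time sufficient condition holds: the read and write
                 * regions are disjoint, so the loop is safely parallel */
                #pragma omp parallel for
                for (size_t i = lo; i < hi; i++)
                    a[i] = a[i + k];
            } else {
                for (size_t i = lo; i < hi; i++)  /* sequential fallback */
                    a[i] = a[i + k];
            }
        }

    When the O(1) predicate holds, the loop runs fully parallel with no per-reference checking; when it fails, the code falls back to the sequential version, so the overhead paid is a single comparison rather than the speculative bookkeeping of checking every memory reference.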

    Compilation techniques for automatic extraction of parallelism and locality in heterogeneous architectures

    [Abstract] High-performance computing has become a key enabler of innovation in science and industry. This fact has unleashed a continuous demand for more computing power, which the silicon industry has satisfied with parallel and heterogeneous architectures and complex memory hierarchies. As a consequence, software developers have been challenged to write new codes, and to rewrite old ones, to run efficiently on these new systems. Unfortunately, success stories are scarce and require huge investments in human workforce. Current compilers generate peak-performance binary code for single-core architectures. Building on this success, this thesis explores new ideas in compiler design to meet this challenge through the automatic extraction of parallelism and locality. First, we present a new compiler intermediate representation based on diKernels, named KIR, which is insensitive to syntactic variations in the source code and exposes multiple levels of parallelism. On top of the KIR, we build a source-to-source approach that generates parallel code annotated with compiler directives: OpenMP for multicores and OpenHMPP for GPUs. Finally, we model program behavior from the point of view of memory accesses through the reconstruction of affine loops for sequential and parallel codes. The experimental evaluations throughout the thesis corroborate the effectiveness and efficiency of the proposed solutions. (The Spanish and Galician abstracts duplicate the English abstract and are omitted here.)
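    For the multicore target, the source-to-source flow described above amounts to emitting directive-annotated code. The sketch below is ours and only illustrates the shape of such output for an independent, assignment-style computation; it is not actual output of the compiler:

        #include <stddef.h>

        /* Every iteration is an independent instance of an assignment-style
         * diKernel, so the loop can be annotated as fully parallel. */
        void scale(const double *in, double *out, double c, size_t n) {
            #pragma omp parallel for
            for (size_t i = 0; i < n; i++)
                out[i] = c * in[i];
        }

    Because the KIR classifies the loop body as a fully parallel kernel, the emitted annotation needs no reduction or synchronization clauses.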

    Analytical cost metrics: days of future past

    Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general-purpose workstations, tablets, phones and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for metrics such as time, energy, memory accesses, and silicon area. These models are used to predict application performance and to guide performance tuning and chip design. The idea is to work with domain-specific accelerators where analytical cost models can be used accurately for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in several ways. For stencil applications and GPU architectures, analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed-form solutions for off-chip data movement are built and used to minimize the total data-movement cost of a minimum-operation-count tree.
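    As an illustration of the kind of closed-form model involved, a roofline-style bound combines a compute term and a data-movement term; the formulas below are a generic sketch under standard assumptions, not the dissertation's exact models:

        T \approx \max\!\left( \frac{W}{P_{\mathrm{peak}}},\; \frac{Q}{B} \right),
        \qquad
        E \approx \pi_0\, T + \epsilon_W\, W + \epsilon_Q\, Q

    Here W counts arithmetic operations, Q counts bytes moved through the memory hierarchy, P_peak and B are the device's peak throughput and memory bandwidth, and the energy model charges a static power \pi_0 over the runtime plus per-operation and per-byte energies \epsilon_W and \epsilon_Q. Minimizing such models over tuning or architecture parameters is exactly the mathematical-optimization formulation described above.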

    HPCCP/CAS Workshop Proceedings 1998

    This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high-performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high-performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey.