1,435 research outputs found

    A Novel Compiler Support for Automatic Parallelization on Multicore Systems

    Get PDF
    [Abstract] The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.[Resumen] El uso generalizado de procesadores multinúcleo no es consecuencia de avances significativos en programación paralela. Por el contrario, los procesadores multinúcleo surgen debido a la complejidad de construir chips mononúcleo que sean eficiente energéticamente y tengan altas velocidades de reloj. La paralelización automática de aplicaciones secuenciales es la solución ideal para hacer la programación paralela tan fácil como escribir programas para ordenadores secuenciales. Sin embargo, la paralelización automática continua a ser un gran reto debido a su necesidad de complejos análisis del programa y la existencia de incógnitas durante la compilación. Este artículo propone un nuevo método para convertir una aplicación secuencial en su contrapartida paralela que pueda ser ejecutada en los procesadores multinúcleo actuales. Este método depende de una representación intermedia basada en el concepto de núcleos independientes del dominio (p. ej., asignación, reducción, recurrencia). Esta visión centrada en núcleos oculta la complejidad de los detalles de implementación, permitiendo la construcción de la versión paralela incluso cuando el código fuente de la aplicación secuencial contiene diferentes variantes de las computaciones (p. ej., punteros, arrays, flujos de control complejos). Se presentan experimentos que evalúan la efectividad y el rendimiento de nuestra aproximación con respecto al estado del arte. La serie programas de prueba consiste en códigos sintéticos que representan núcleos independientes del dominio comunes, rutinas de álgebra lineal densa/dispersa y de procesamiento de imagen, y aplicaciones completas del SPEC CPU2000.[Resumo] O uso xeralizado de procesadores multinúcleo non é consecuencia de avances significativos en programación paralela. Pola contra, os procesadores multinúcleo xurden debido á complexidade de construir chips mononúcleo que sexan eficientes enerxéticamente e teñan altas velocidades de reloxo. A paralelización automática de aplicacións secuenciais é a solución ideal para facer a programación paralela tan sinxela como escribir programas para ordenadores secuenciais. Sen embargo, a paralelización automática continua a ser un gran reto debido a súa necesidade de complexas análises do programa e a existencia de incógnitas durante a compilación. Este artigo propón un novo método para convertir unha aplicación secuencias na súa contrapartida paralela que poida ser executada nos procesadores multinúcleo actuais. Este método depende dunha representación intermedia baseada no concepto dos núcleos independentes do dominio (p. ex., asignación, reducción, recurrencia). Esta visión centrada en núcleos oculta a complexidade dos detalles de implementación, permitindo a construcción da versión paralela incluso cando o código fonte da aplicación secuencial contén diferentes variantes das computacións (p. ex., punteiros, arrays, fluxos de control complejo). Preséntanse experimentos que evalúan a efectividade e o rendemento da nosa aproximación con respecto ao estado da arte. A serie de programas de proba consiste en códigos sintéticos que representan núcleos independentes do dominio comunes, rutinas de álxebra lineal densa/dispersa e de procesamento de imaxe, e aplicacións completas do SPEC CPU2000.Ministerio de Economía y Competitividad; TIN2010-16735Ministerio de Educación y Cultura; AP2008-0101

    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    Full text link
    This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Get PDF
    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.Comment: 32 pages, 11 figure

    Software Engineering for Multicore Systems - An Experience Report

    Get PDF

    Skalabilna implementacija dekodera po normi MPEG korištenjem tokovnog programskog jezika

    Get PDF
    In this paper, we describe a scalable and portable parallelized implementation of a MPEG decoder using a streaming computation paradigm, tailored to new generations of multi--core systems. A novel, hybrid approach towards parallelization of both new and legacy applications is described, where only data--intensive and performance--critical parts are implemented in the streaming domain. An architecture--independent \u27StreamIt\u27 language is used for design, optimization and implementation of parallelized segments, while the developed \u27StreamGate\u27 interface provides a communication mechanism between the implementation domains. The proposed hybrid approach was employed in re--factoring of a reference MPEG video decoder implementation; identifying the most performance--critical segments and re-implementing them in \u27StreamIt\u27 language, with \u27StreamGate\u27 interface as a communication mechanism between the host and streaming kernel. We evaluated the scalability of the decoder with respect to the number of cores, video frame formats, sizes and decomposition. Decoder performance was examined in the presence of different processor load configurations and with respect to the number of simultaneously processed frames.U ovom radu opisujemo skalabilnu i prenosivu implementaciju dekodera po normi MPEG ostvarenu korištenjem paradigme tokovnog računarstva, prilagođenu novim generacijama višejezgrenih računala. Opisan je novi, hibridni pristup paralelizaciji novih ili postojećih aplikacija, gdje se samo podatkovno intenzivni i računski zahtjevni dijelovi implementiraju u tokovnoj domeni. Arhitekturno neovisni jezik StreamIt koristi se za oblikovanje, optimiranje i izvedbu paraleliziranih segmenata aplikacije, dok razvijeno sučelje \u27StreamGate\u27 omogućava komunikaciju između domena implementacije. Predloženi hibridni pristup razvoju paraleliziranih aplikacija iskorišten je u preoblikovanju referentnog dekodera video zapisa po normi MPEG; identificirani su računski zahtjevni segmenti aplikacije i ponovno implementirani u jeziku StreamIt, sa sučeljem \u27StreamGate\u27 kao poveznicom između slijedne i tokovne domene. Ispitivana su svojstva skalabilnosti s obzirom na ciljani broj jezgri, format video zapisa i veličinu okvira te dekompoziciju ulaznih podataka. Svojstva dekodera  su praćena u prisustvu različitih opterećenja ispitnog računala, i s obzirom na broj istovremeno obrađivanih okvira
    corecore