
    Restructuring Fortran legacy applications for parallel computing in multiprocessors

    Multi-core computers are now ubiquitous, yet automatic parallelization of sequential programs remains a challenge. In this context, we propose a set of code transformations that a tool can apply automatically to transform sequential legacy systems into parallel versions. We implement these transformations with a lightweight source code analysis based on a rewritable AST (Abstract Syntax Tree). Since it is not always possible to parallelize code automatically, we also implemented specific analyses that report source changes which would enable parallelization. Finally, we present examples in which these transformations were applied, together with the corresponding performance experiments.
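
    As a rough illustration of the rewritable-AST idea described above (a toy model, not the paper's actual tool or its Fortran front end), the following Python sketch tags a loop as parallelizable only when a deliberately crude dependence test finds no statement reading a variable that the loop body also writes:

```python
# Toy rewritable AST: a loop body is a list of assignments, each with
# an explicit write target and the set of variables it reads.
from dataclasses import dataclass

@dataclass
class Assign:
    target: str        # variable written by the statement
    reads: set         # variables read on the right-hand side

@dataclass
class Loop:
    index: str         # loop index variable
    body: list         # Assign nodes
    parallel: bool = False

def tag_parallel(loop: Loop) -> Loop:
    """Deliberately crude test: refuse to parallelize if any statement
    reads a variable that the body also writes, since that may be a
    loop-carried dependence (e.g. an accumulator or recurrence)."""
    written = {s.target for s in loop.body}
    for stmt in loop.body:
        if stmt.reads & written:
            return loop          # possible dependence: leave sequential
    loop.parallel = True         # safe under this toy criterion
    return loop

# a(i) = b(i) + c(i): nothing read is written in the body, so it is tagged.
print(tag_parallel(Loop("i", [Assign("a", {"b", "c"})])).parallel)  # True
```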

    The virtual time machine

    Existing multiprocessors and multicomputers require the programmer or compiler to perform data dependence analysis at compile time. We propose a parallel computer that performs this task at runtime. In particular, the Virtual Time Machine (VTM) detects violations of data dependence constraints as they occur and automatically recovers from them. A sophisticated memory system addressed by both a spatial and a temporal coordinate is used to implement this mechanism efficiently. Although initially targeted at discrete event simulation applications, many of the ideas used in the machine architecture apply directly to the more general realm of parallel computation. The long-term goal of this work is to develop a general purpose parallel computer that supports a wide range of parallel programming paradigms. This paper outlines the motivations behind the VTM architecture, the underlying computation model, a proposed implementation, and initial performance results. A recurring theme throughout the paper is our contention that existing shared memory and message-based machines do not pay adequate attention to the dimension of time. We argue that this architectural deficiency is the underlying reason behind many difficult problems in parallel computation today.
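
    The core mechanism, memory addressed by a spatial plus a temporal coordinate with automatic recovery from dependence violations, can be sketched in software as follows (a minimal model assuming Time Warp-style rollback, not the proposed hardware):

```python
# A memory word that keeps timestamped versions. A write arriving "in
# the past" of an already-serviced read is a detected data dependence
# violation; the affected readers are reported for rollback.
import bisect

class TemporalWord:
    def __init__(self, initial=0):
        self.versions = [(0, initial)]   # sorted (virtual time, value)
        self.reads = []                  # (read time, reader id)

    def read(self, t, reader):
        self.reads.append((t, reader))
        i = bisect.bisect_right(self.versions, (t, float("inf"))) - 1
        return self.versions[i][1]       # latest version at or before t

    def write(self, t, value):
        bisect.insort(self.versions, (t, value))
        # Any read at a later virtual time consumed a stale version:
        # return those readers so they can be rolled back and re-run.
        return [r for (rt, r) in self.reads if rt > t]

w = TemporalWord()
w.read(5, "P1")            # P1 reads the initial value at virtual time 5
victims = w.write(3, 42)   # a late write at time 3 invalidates P1's read
print(victims)             # ['P1']
```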

    Extending Static Synchronization Beyond SIMD and VLIW

    A key advantage of SIMD (Single Instruction stream, Multiple Data stream) architectures is that synchronization is effected statically at compile time, so the execution-time cost of synchronization between “processes” is essentially zero. VLIW (Very Long Instruction Word) machines are successful in large part because they preserve this property while providing more flexibility in the kinds of operations that can be parallelized. In this paper, we propose a new kind of architecture, the “static barrier MIMD” or SBM, which can be viewed as a further generalization of the parallel execution abilities of static synchronization machines. Barrier MIMDs are asynchronous Multiple Instruction stream, Multiple Data stream architectures capable of parallel execution of loops, subprogram calls, and variable execution-time instructions, yet they need little or no run-time synchronization. When a group of processors within a barrier MIMD has just encountered a barrier, any conceptual synchronizations between the processors are accomplished statically at zero cost, as in a SIMD or VLIW machine and using similar compiler technology. Unlike those machines, however, as execution continues the relative timing of the processors becomes less precisely knowable as a static, compile-time quantity. Where this imprecision becomes too large, the compiler simply inserts a synchronization barrier to ensure that the timing imprecision at that point is zero, and again employs purely static, implicit synchronization. Both the architecture and the supporting compiler technology are discussed in detail.
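
    The barrier-placement idea, tracking how far processors can drift apart and realigning them when static timing knowledge degrades, can be illustrated with a small scheduling sketch (an assumed timing model, not the paper's compiler algorithm):

```python
# Track the spread between the earliest and latest possible completion
# times of each processor's instruction stream; insert a barrier when
# the spread exceeds a tolerance, resetting timing uncertainty to zero.
def place_barriers(streams, tolerance):
    """streams: per-processor lists of (min_cycles, max_cycles) per
    instruction. Returns the instruction indices at which to barrier."""
    barriers = []
    lo = [0] * len(streams)   # earliest possible time per processor
    hi = [0] * len(streams)   # latest possible time per processor
    for i in range(len(streams[0])):
        for p, stream in enumerate(streams):
            lo[p] += stream[i][0]
            hi[p] += stream[i][1]
        # Worst-case skew: one processor as late as possible versus
        # another as early as possible.
        if max(hi) - min(lo) > tolerance:
            barriers.append(i)
            lo = [0] * len(streams)   # the barrier realigns all streams
            hi = [0] * len(streams)
    return barriers

# Two processors; processor 0 has a variable-latency instruction at step 1.
streams = [[(1, 1), (1, 8), (1, 1)], [(1, 1), (1, 1), (1, 1)]]
print(place_barriers(streams, tolerance=4))  # [1]
```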

    Loop Parallelization using Dynamic Commutativity Analysis


    Programming parallel supercomputers


    Automatic Parallelization With Statistical Accuracy Bounds

    Traditional parallelizing compilers are designed to generate parallel programs that produce the same outputs as the original sequential program. The difficulty of the program analysis required to satisfy this goal and the restricted space of possible target parallel programs have both posed significant obstacles to the development of effective parallelizing compilers. The QuickStep compiler is instead designed to generate parallel programs that satisfy statistical accuracy guarantees. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler and a significant expansion in the range of parallel programs it can legally generate. QuickStep exploits this flexibility to take a fundamentally different approach from traditional parallelizing compilers. It applies a collection of transformations (loop parallelization, loop scheduling, synchronization introduction, and replication introduction) to generate a search space of parallel versions of the original sequential program. It then searches this space (prioritizing the parallelization of the most time-consuming loops in the application) to find a final parallelization that exhibits good parallel performance and satisfies the statistical accuracy guarantee. At each step in the search it performs a sequence of trial runs on representative inputs to examine the performance, accuracy, and memory accessing characteristics of the current generated parallel program. An analysis of these characteristics guides the compiler as it explores the search space of parallel programs. Results from our benchmark set of applications show that QuickStep can automatically generate parallel programs with good performance and statistically accurate outputs. For two of the applications, the parallelization introduces noise into the output, but the noise remains within acceptable statistical bounds. The simplicity of the compilation strategy and the performance and statistical acceptability of the generated parallel programs demonstrate the advantages of the QuickStep approach.
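
    The search strategy described above might look roughly like the following (a hypothetical harness: run_trials, the transforms, and the loop ordering are stand-ins, not QuickStep's actual interfaces):

```python
def search(program, hot_loops, transforms, run_trials, distortion_bound):
    """Greedy sketch: try each transform on each hot loop, keeping a
    candidate only if trial runs stay within the statistical accuracy
    bound and improve running time."""
    best = program
    best_time, _ = run_trials(best)
    for loop in hot_loops:                   # most time-consuming loops first
        for transform in transforms:         # parallelize, schedule, sync, ...
            candidate = transform(best, loop)
            time, distortion = run_trials(candidate)
            if distortion <= distortion_bound and time < best_time:
                best, best_time = candidate, time
    return best

# Toy usage: pretend each parallelized loop halves running time and
# adds a small, bounded amount of output distortion.
def parallelize(prog, loop):
    return {"par": prog["par"] | {loop}}

def run_trials(prog):
    return 100 / 2 ** len(prog["par"]), 0.01 * len(prog["par"])

best = search({"par": set()}, ["loopA", "loopB"], [parallelize], run_trials, 0.05)
print(sorted(best["par"]))  # ['loopA', 'loopB']
```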

    Evaluation of strategies for the development of efficient code for Raspberry Pi devices

    The Internet of Things (IoT) is faced with challenges that require green solutions and energy-efficient paradigms. Architectures such as ARM have evolved significantly in recent years, with improvements to processor efficiency, essential for always-on devices, as a focal point. However, as far as software is concerned, few approaches analyse the advantages of writing efficient code when programming IoT devices. This proposal therefore aims to improve source code optimization to achieve better execution times. In addition, the importance of various techniques for writing efficient code for Raspberry Pi devices is analysed, with the objective of increasing execution speed. A complete set of tests has been developed exclusively for analysing and measuring the improvements achieved when applying each of these techniques, raising awareness of the significant impact the recommended techniques can have. Funding: Ministerio de Economía y Competitividad and FEDER funds, project TIN2015-69957-R (I+D+i); European Union, European Regional Development Fund (FEDER), Programa Operativo Extremadura 2014-2020, ref. 2018.14.02.332A.444.00.
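
    The kind of measurement such a test suite performs can be approximated with a small timing harness (a hypothetical example; the paper's actual techniques and test programs are not listed in this abstract):

```python
# Compare a naive kernel against an optimized variant of the same
# computation, as one might when measuring a coding technique's
# impact on a Raspberry Pi.
import timeit

def naive(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def optimized(n):
    # Same result via the closed form for the sum of squares below n.
    return (n - 1) * n * (2 * n - 1) // 6

assert naive(1000) == optimized(1000)
for fn in (naive, optimized):
    t = timeit.timeit(lambda: fn(100_000), number=20)
    print(f"{fn.__name__:>9}: {t:.4f} s")
```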

    A Novel Compiler Support for Automatic Parallelization on Multicore Systems

    The widespread use of multicore processors is not a consequence of significant advances in parallel programming. Rather, multicore processors arose from the difficulty of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge because it requires complex program analysis and must cope with unknowns at compile time. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernels (e.g., assignment, reduction, recurrence). This kernel-centric view hides the complexity of implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000. Funding: Ministerio de Economía y Competitividad, TIN2010-16735; Ministerio de Educación y Cultura, AP2008-0101.
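
    As a loose illustration of the kernel-centric view (a toy classifier over assumed dependence facts, not the paper's intermediate representation), the three kernel classes named above can be distinguished by how a loop's writes relate to its reads:

```python
# Classify a loop as one of the domain-independent kernels named in
# the abstract, from two dependence facts about its update statement.
def classify(writes_self: bool, reads_prev_iter: bool) -> str:
    """writes_self: the statement updates the target using its own
    value (e.g. s = s + a[i]); reads_prev_iter: it reads a value
    produced by an earlier iteration (e.g. a[i] = a[i-1] + 1)."""
    if reads_prev_iter:
        return "recurrence"   # loop-carried flow dependence on the array
    if writes_self:
        return "reduction"    # associative update of an accumulator
    return "assignment"       # fully parallel elementwise write

# s += a[i] -> reduction; a[i] = a[i-1] -> recurrence; a[i] = b[i] -> assignment
print(classify(True, False), classify(False, True), classify(False, False))
```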