
    Feedback Driven Annotation and Refactoring of Parallel Programs


    Automatic Parallelization With Statistical Accuracy Bounds

    Traditional parallelizing compilers are designed to generate parallel programs that produce outputs identical to those of the original sequential program. The difficulty of performing the program analysis required to satisfy this goal and the restricted space of possible target parallel programs have both posed significant obstacles to the development of effective parallelizing compilers. The QuickStep compiler is instead designed to generate parallel programs that satisfy statistical accuracy guarantees. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler and a significant expansion in the range of parallel programs that it can legally generate. QuickStep exploits this flexibility to take a fundamentally different approach from traditional parallelizing compilers. It applies a collection of transformations (loop parallelization, loop scheduling, synchronization introduction, and replication introduction) to generate a search space of parallel versions of the original sequential program. It then searches this space (prioritizing the parallelization of the most time-consuming loops in the application) to find a final parallelization that exhibits good parallel performance and satisfies the statistical accuracy guarantee. At each step in the search it performs a sequence of trial runs on representative inputs to examine the performance, accuracy, and memory access characteristics of the currently generated parallel program. An analysis of these characteristics guides the steps the compiler takes as it explores the search space of parallel programs. Results from our benchmark set of applications show that QuickStep can automatically generate parallel programs with good performance and statistically accurate outputs. For two of the applications, the parallelization introduces noise into the output, but the noise remains within acceptable statistical bounds. The simplicity of the compilation strategy and the performance and statistical acceptability of the generated parallel programs demonstrate the advantages of the QuickStep approach.
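
    As a rough illustration of the accuracy-test step described above, the following is a minimal C/OpenMP sketch, not QuickStep's actual implementation; the function names, the placeholder computation, and the distortion metric are all hypothetical. A candidate parallel loop is accepted only if its output on a representative input stays within the statistical accuracy bound of the sequential output.

/*
 * Minimal sketch of an accuracy check that gates a candidate
 * parallelization: run the sequential and the candidate parallel loop on a
 * representative input, then accept the parallel version only if the
 * relative distortion of its output stays within the accuracy bound.
 */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Relative distortion between the sequential and parallel outputs. */
static double distortion(const double *seq, const double *par, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += fabs(par[i] - seq[i]);
        den += fabs(seq[i]);
    }
    return den > 0.0 ? num / den : num;
}

/* Candidate transformation: parallelize the loop, here with OpenMP. */
static void candidate_parallel_loop(const double *in, double *out, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        out[i] = sqrt(in[i]) * 0.5;      /* placeholder computation */
}

static void sequential_loop(const double *in, double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = sqrt(in[i]) * 0.5;
}

/* Accept the candidate only if its output distortion is within `bound`. */
int accept_parallelization(const double *in, size_t n, double bound) {
    double seq[1024], par[1024];         /* assumes n <= 1024 in this sketch */
    sequential_loop(in, seq, n);
    candidate_parallel_loop(in, par, n);
    double d = distortion(seq, par, n);
    printf("distortion = %g (bound %g)\n", d, bound);
    return d <= bound;
}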

    A Novel Compiler Support for Automatic Parallelization on Multicore Systems

    The widespread use of multicore processors is not a consequence of significant advances in parallel programming. Rather, multicore processors arise from the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge because of the complex program analysis it requires and the unknowns that persist at compile time. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of the domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments evaluating the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000. Funding: Ministerio de Economía y Competitividad (TIN2010-16735); Ministerio de Educación y Cultura (AP2008-0101).
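
    The kernel-centric idea can be illustrated with a minimal C/OpenMP sketch (not the paper's compiler or intermediate representation; the function names are invented for illustration): the two sequential routines below are syntactic variations of the same reduction kernel, one using array indexing and one using pointer traversal, and once the reduction is recognized both map to the same parallel form.

/*
 * Two syntactic variations of the same domain-independent reduction kernel,
 * and the parallel form they both map to once the kernel is recognized.
 */
#include <stddef.h>

/* Reduction written with array indexing. */
double sum_indexed(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same reduction written with pointer traversal. */
double sum_pointer(const double *a, size_t n) {
    double s = 0.0;
    for (const double *p = a; p != a + n; p++)
        s += *p;
    return s;
}

/* Parallel counterpart, here expressed with an OpenMP reduction clause. */
double sum_parallel(const double *a, size_t n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}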

    High Lundquist Number Simulations of Parker's Model of Coronal Heating: Scaling and Current Sheet Statistics Using Heterogeneous Computing Architectures

    Parker's model [Parker, Astrophys. J., 174, 499 (1972)] is one of the most discussed mechanisms for coronal heating and has generated much debate. We have recently obtained new scaling results for a 2D version of this problem suggesting that the heating rate becomes independent of resistivity in a statistical steady state [Ng and Bhattacharjee, Astrophys. J., 675, 899 (2008)]. Our numerical work has now been extended to 3D using high-resolution MHD simulations. Random photospheric footpoint motion is applied for a time much longer than the correlation time of the motion to obtain converged average coronal heating rates. Simulations are performed for different values of the Lundquist number to determine the scaling. In the high-Lundquist-number limit (S > 1000), the coronal heating rate obtained is consistent with a trend that is independent of the Lundquist number, as predicted by previous analysis and 2D simulations. We present a scaling analysis showing that when the dissipation time is comparable to or larger than the correlation time of the random footpoint motion, the heating rate tends to become independent of the Lundquist number, and that the magnetic energy production is also reduced significantly. We also present a comprehensive reprogramming of our simulation code to run on NVIDIA graphics processing units using the Compute Unified Device Architecture (CUDA) and report code performance on several large-scale heterogeneous machines.
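
    For reference (a standard definition, not quoted from the abstract above): the Lundquist number is conventionally S = L v_A / eta, the ratio of the resistive diffusion time to the Alfvén crossing time, where L is a characteristic length scale, v_A the Alfvén speed, and eta the magnetic diffusivity; S > 1000 therefore corresponds to a nearly ideal, weakly resistive regime.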

    Development of an oceanographic application in HPC

    High Performance Computing (HPC) is used for running advanced application programs efficiently, reliably, and quickly. In earlier decades, the performance of HPC applications was evaluated in terms of speed, thread scalability, and memory hierarchy. Today it is also essential to consider the energy, or power, consumed by the system while executing an application. In fact, high power consumption is one of the biggest problems for the HPC community and one of the major obstacles to exascale system design. The new generations of HPC systems aim at exaflop performance and will demand even more energy for processing and cooling; the growth of HPC systems is now limited by energy issues. Recently, many research centers have focused on automatic tuning of HPC applications, which requires a thorough study of those applications in terms of power efficiency. In this context, this work studies an oceanographic application, named OceanVar, that implements the Domain Decomposition based 4D Variational model (DD-4DVar), a widely used HPC application, evaluating not only the classic aspects of performance but also aspects related to power efficiency in different case studies. The work was carried out at BSC (Barcelona Supercomputing Center), Spain, within the Mont-Blanc project, running the tests first on an HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology. The thesis first introduces the concept of data assimilation, the context in which it is developed, and a brief description of the 4DVAR mathematical model. After a close examination of the problem, the Matlab description of the data-assimilation problem was ported to a sequential version in C. Next, after identifying the most time-consuming computational kernels, a parallel version of the application was developed in a multiprocessor programming style using the MPI (Message Passing Interface) protocol. The experimental results show that, when running on the HCA server (an Intel architecture), the efficiency of the two most expensive functions remains approximately 80% as the number of processes grows. When running on the ARM architecture, specifically on the Thunder mini-cluster, the trend obtained is instead a "superlinear speedup", which in our case can be explained by a more efficient use of resources (cache memory access) compared with the sequential case. The second part of the work presents an analysis of the aspects of this application that affect energy efficiency. After a brief discussion of the energy consumption characteristics of the Thunder chip within the current technology landscape, the energy consumption of the Thunder mini-cluster was measured with a power meter (a Yokogawa power meter) in order to obtain an overview of the power-to-solution of this application, to be used as a baseline for subsequent analyses with other parallel programming styles. Finally, a comprehensive performance evaluation, aimed at assessing the quality of the MPI parallelization, is conducted using Paraver, a performance tool developed by BSC.
Paraver is a performance analysis and visualisation tool that can be used to analyse MPI, threaded, or mixed-mode programs and is key to profiling and optimising code for High Performance Computing. Graphical representations of its statistics make it easy for a developer to identify performance problems, such as load-imbalanced decompositions, excessive communication overheads, and a poor average number of floating-point operations per second. Paraver can also report statistics based on hardware counters provided by the underlying hardware. This project used Paraver configuration files to analyse selected metrics for this application. To explain the performance trend observed on the Thunder mini-cluster, traces were extracted from the various case studies; the results are as expected, namely a drastic drop in cache misses from the case ppn (processes per node) = 1 to ppn = 16, which reflects a more efficient use of cluster resources as the number of processes increases.
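
    The MPI parallelization style described above can be sketched as follows; this is a minimal, self-contained example, not OceanVar's actual code, and the kernel body is a placeholder. The iteration space of an expensive kernel is block-decomposed across processes and the partial results are combined with a reduction.

/*
 * Minimal MPI sketch of block domain decomposition plus reduction.
 * Compile with mpicc, run with mpirun -np <p> ./a.out
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block decomposition of the iteration space [0, N). */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    /* Placeholder for the expensive kernel: each process works on its block. */
    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += (double)i * 1e-6;

    /* Combine the partial results on rank 0. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global result = %f\n", global);

    MPI_Finalize();
    return 0;
}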

    Coz: Finding Code that Counts with Causal Profiling

    Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of a potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9% and SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code. (Published at SOSP 2015; Best Paper Award.)
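
    The developer-facing side of this approach can be sketched as follows; this is a minimal example that assumes Coz's coz.h header and its COZ_PROGRESS progress-point macro are available, and the workload is a hypothetical placeholder. The program marks each completed unit of work, and the profiler is typically invoked as "coz run --- ./a.out".

/*
 * Sketch of annotating throughput with a Coz progress point: each visit to
 * COZ_PROGRESS counts as one completed unit of work, and Coz's virtual
 * speedup experiments estimate how much faster that work would complete if a
 * given line of code were optimized.
 */
#include <coz.h>
#include <stdio.h>

/* Hypothetical unit of work whose throughput we want to optimize. */
static void handle_request(int i) {
    volatile double x = 0.0;
    for (int k = 0; k < 10000; k++)
        x += k * 0.5;
    (void)i;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        handle_request(i);
        COZ_PROGRESS;   /* one visit = one completed unit of work */
    }
    printf("done\n");
    return 0;
}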