29 research outputs found

    Programming models for parallel computing

    Get PDF
    Mit dem Auftauchen von Multicore Prozessoren beginnt parallele Programmierung den Massenmarkt zu erobern. Derzeit ist der Parallelismus noch relativ eingeschränkt, da aktuelle Prozessoren nur über eine geringe Anzahl an Kernen verfügen, doch schon bald wird der Schritt zu Prozessoren mit Hunderten an Kernen vollzogen sein. Während sich die Hardware unaufhaltsam in Richtung Parallelismus weiterentwickelt, ist es für Softwareentwickler schwierig, mit diesen Entwicklungen Schritt zu halten. Parallele Programmierung erfordert neue Ansätze gegenüber den bisher verwendeten sequentiellen Programmiermodellen. In der Vergangenheit war es ausreichend, die nächste Prozessorgeneration abzuwarten, um Computerprogramme zu beschleunigen. Heute jedoch kann ein sequentielles Programm mit einem neuen Prozessor sogar langsamer werden, da die Geschwindigkeit eines einzelnen Prozessorkerns nun oft zugunsten einer größeren Gesamtzahl an Kernen in einem Prozessor reduziert wird. Angesichts dieser Tatsache wird es in der Softwareentwicklung in Zukunft notwendig sein, Parallelismus explizit auszunutzen, um weiterhin performante Programme zu entwickeln, die auch auf zukünftigen Prozessorgenerationen skalieren. Die Problematik liegt dabei darin, dass aktuelle Programmiermodelle weiterhin auf dem sogenannten "Assembler der parallelen Programmierung", d.h. auf Multithreading für Shared-Memory- sowie auf Message Passing für Distributed-Memory Architekturen basieren, was zu einer geringen Produktivität und einer hohen Fehleranfälligkeit führt. Um dies zu ändern, wird an neuen Programmiermodellen, -sprachen und -werkzeugen, die Parallelismus auf einer höheren Abstraktionsebene als bisherige Programmiermodelle zu behandeln versprechen, geforscht. Auch wenn bereits einige Teilerfolge erzielt wurden und es gute, performante Lösungen für bestimmte Bereiche gibt, konnte bis jetzt noch kein allgemeingültiges paralleles Programmiermodell entwickelt werden - viele bezweifeln, dass das überhaupt möglich ist. Das Ziel dieser Arbeit ist es, einen Überblick über aktuelle Entwicklungen bei parallelen Programmiermodellen zu geben. Da homogenen Multi- und Manycore Prozessoren in nächster Zukunft die meiste Bedeutung zukommen wird, wird das Hauptaugenmerk darauf gelegt, inwieweit die behandelten Programmiermodelle für diese Plattformen nützlich sind. Durch den Vergleich unterschiedlicher, auch experimenteller Ansätze soll erkennbar werden, wohin die Entwicklung geht und welche Werkzeuge aktuell verwendet werden können.With the emergence of multi-core processors in the consumer market, parallel computing is moving to the mainstream. Currently parallelism is still very restricted as modern consumer computers only contain a small number of cores. Nonetheless, the number is constantly increasing, and the time will come when we move to hundreds of cores. For software developers it is becoming more difficult to keep up with these new developments. Parallel programming requires a new way of thinking. No longer will a new processor generation accelerate every existing program. On the contrary, some programs might even get slower because good single-thread performance of a processor is traded in for a higher level of parallelism. For that reason, it becomes necessary to exploit parallelism explicitly and to make sure that the program scales well. Unfortunately, parallelism in current programming models is mostly based on the "assembler of parallel programming", namely low level threading for shared multiprocessors and message passing for distributed multiprocessors. This leads to low programmer productivity and erroneous programs. Because of this, a lot of effort is put into developing new high level programming models, languages and tools that should help parallel programming to keep up with hardware development. Although there have been successes in different areas, no good all-round solution has emerged until now, and there are doubts that there ever will be one. The aim of this work is to give an overview of current developments in the area of parallel programming models. The focus is put onto programming models for multi- and many-core architectures as this is the area most relevant for the near future. Through the comparison of different approaches, including experimental ones, the reader will be able to see which existing programming models can be used for which tasks and to anticipate future developments

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    Get PDF
    In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time, writing sound parallel programs is highly difficult. While parallel programming is easier with frameworks such as OpenMP, the possibility of data races in these programs still persists. In this paper, we propose a fast, lightweight, language agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare our tool with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and the F1 score of our tool is comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, this work is the only tool among the state-of-the-art data race checkers that can verify a FORTRAN program to be data race free

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    Full text link
    In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this paper, we propose LLOV, a fast, lightweight, language agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and the F1 score of LLOV is comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.Comment: Accepted in ACM TACO, August 202

    HPCCP/CAS Workshop Proceedings 1998

    Get PDF
    This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey

    Array optimizations for high productivity programming languages

    Get PDF
    While the HPCS languages (Chapel, Fortress and X10) have introduced improvements in programmer productivity, several challenges still remain in delivering high performance. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This dissertation addresses the problem of efficient code generation for high-level array accesses in the X10 language. The X10 language supports rank-independent specification of loop and array computations using regions and points. Three aspects of high-level array accesses in X10 are important for productivity but also pose significant performance challenges: high-level accesses are performed through Point objects rather than integer indices, variables containing references to arrays are rank-independent, and array subscripts are verified as legal array indices during runtime program execution. Our solution to the first challenge is to introduce new analyses and transformations that enable automatic inlining and scalar replacement of Point objects. Our solution to the second challenge is a hybrid approach. We use an interprocedural rank analysis algorithm to automatically infer ranks of arrays in X10. We use rank analysis information to enable storage transformations on arrays. If rank-independent array references still remain after compiler analysis, the programmer can use X10's dependent type system to safely annotate array variable declarations with additional information for the rank and region of the variable, and to enable the compiler to generate efficient code in cases where the dependent type information is available. Our solution to the third challenge is to use a new interprocedural array bounds analysis approach using regions to automatically determine when runtime bounds checks are not needed. Our performance results show that our optimizations deliver performance that rivals the performance of hand-tuned code with explicit rank-specific loops and lower-level array accesses, and is up to two orders of magnitude faster than unoptimized, high-level X10 programs. These optimizations also result in scalability improvements of X10 programs as we increase the number of CPUs. While we perform the optimizations primarily in X10, these techniques are applicable to other high-productivity languages such as Chapel and Fortress

    Type Oriented Parallel Programming

    Get PDF
    Context: Parallel computing is an important field within the sciences. With the emergence of multi, and soon many, core CPUs this is moving more and more into the domain of general computing. HPC programmers want performance, but at the moment this comes at a cost; parallel languages are either efficient or conceptually simple, but not both. Aim: To develop and evaluate a novel programming paradigm which will address the problem of parallel programming and allow for languages which are both conceptually simple and efficient. Method: A type-based approach, which allows the programmer to control all aspects of parallelism by the use and combination of types has been developed. As a vehicle to present and analyze this new paradigm a parallel language, Mesham, and associated compilation tools have also been created. By using types to express parallelism the programmer can exercise efficient, flexible control in a high level abstract model yet with a sufficiently rich amount of information in the source code upon which the compiler can perform static analysis and optimization. Results: A number of case studies have been implemented in Mesham. Official benchmarks have been performed which demonstrate the paradigm allows one to write code which is comparable, in terms of performance, with existing high performance solutions. Sections of the parallel simulation package, Gadget-2, have been ported into Mesham, where substantial code simplifications have been made. Conclusions: The results obtained indicate that the type-based approach does satisfy the aim of the research described in this thesis. By using this new paradigm the programmer has been able to write parallel code which is both simple and efficient

    Une approche basée sur les modèles pour le développement d'applications de simulation numérique haute-performance

    Get PDF
    Le développement et la maintenance d'applications de simulation numérique haute-performance sont des activités complexes. Cette complexité découle notamment du couplage fort existant entre logiciel et matériel ainsi que du manque d'accessibilité des solutions de programmation actuelles et du mélange des préoccupations qu'elles induisent. Dans cette thèse nous proposons une approche pour le développement d'applications de simulation numérique haute-performance qui repose sur l'ingénierie des modèles. Afin à la fois de réduire les couts et les délais de portage sur de nouvelles architectures matérielles mais également de concentrer les efforts sur des interventions à plus haute valeur ajoutée, cette approche nommée MDE4HPC, définit un langage de modélisation dédié. Ce dernier permet aux numériciens de décrire la résolution de leurs modèles théoriques dans un langage qui d'une part leur est familier et d'autre part est indépendant d'une quelconque architecture matérielle. Les différentes préoccupations logicielles sont séparées grâce à l'utilisation de plusieurs modèles et de plusieurs points de vue sur ces modèles. En fonction des plateformes d'exécution disponibles, ces modèles abstraits sont alors traduits en implémentations exécutables grâce à des transformations de modèles mutualisables entre les divers projets de développement. Afin de valider notre approche nous avons développé un prototype nommé ArchiMDE. Grâce à cet outil nous avons développé plusieurs applications de simulation numérique pour valider les choix de conception réalisés pour le langage de modélisation.The development and maintenance of high-performance scientific computing software is a complex task. This complexity results from the fact that software and hardware are tightly coupled. Furthermore current parallel programming approaches lack of accessibility and lead to a mixing of concerns within the source code. In this thesis we define an approach for the development of high-performance scientific computing software which relies on model-driven engineering. In order to reduce both duration and cost of migration phases toward new hardware architectures and also to focus on tasks with higher added value this approach called MDE4HPC defines a domain-specific modeling language. This language enables applied mathematicians to describe their numerical model in a both user-friendly and hardware independent way. The different concerns are separated thanks to the use of several models as well as several modeling viewpoints on these models. Depending on the targeted execution platforms, these abstract models are translated into executable implementations with model transformations that can be shared among several software developments. To evaluate the effectiveness of this approach we developed a tool called ArchiMDE. Using this tool we developed different numerical simulation software to validate the design choices made regarding the modeling language

    Optimization techniques for fine-grained communication in PGAS environments

    Get PDF
    Partitioned Global Address Space (PGAS) languages promise to deliver improved programmer productivity and good performance in large-scale parallel machines. However, adequate performance for applications that rely on fine-grained communication without compromising their programmability is difficult to achieve. Manual or compiler assistance code optimization is required to avoid fine-grained accesses. The downside of manually applying code transformations is the increased program complexity and hindering of the programmer productivity. On the other hand, compiler optimizations of fine-grained accesses require knowledge of physical data mapping and the use of parallel loop constructs. This thesis presents optimizations for solving the three main challenges of the fine-grain communication: (i) low network communication efficiency; (ii) large number of runtime calls; and (iii) network hotspot creation for the non-uniform distribution of network communication, To solve this problems, the dissertation presents three approaches. First, it presents an improved inspector-executor transformation to improve the network efficiency through runtime aggregation. Second, it presents incremental optimizations to the inspector-executor loop transformation to automatically remove the runtime calls. Finally, the thesis presents a loop scheduling loop transformation for avoiding network hotspots and the oversubscription of nodes. In contrast to previous work that use static coalescing, prefetching, limited privatization, and caching, the solutions presented in this thesis focus cover all the aspect of fine-grained communication, including reducing the number of calls generated by the compiler and minimizing the overhead of the inspector-executor optimization. A performance evaluation with various microbenchmarks and benchmarks, aiming at predicting scaling and absolute performance numbers of a Power 775 machine, indicates that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while in applications with irregular accesses the transformations are expected to yield from 1.12X up to 6.3X speedup. The loop scheduling shows performance gains from 3-25% for NAS FT and bucket-sort benchmarks, and up to 3.4X speedup for the microbenchmarks
    corecore