19 research outputs found

    Static Validation of Barriers and Worksharing Constructs in OpenMP Applications

    Get PDF
    International audienceThe OpenMP specification requires that all threads in a team execute the same sequence of worksharing and barrier regions. An improper use of such directive may lead to deadlocks. In this paper we propose a static analysis to ensure this property is verified. The well-defined semantic of OpenMP programs makes compiler analysis more effective. We propose a new compile-time method to identify in OpenMP codes the potential improper uses of barriers and work-sharing constructs, and the execution paths that are responsible for these issues. We implemented our method in a GCC compiler plugin and show the small im-pact of our analysis on performance for NAS-OMP benchmarks and a test case for a production industrial code

    Analyse statique/dynamique pour la validation et l'amélioration des applications parallèles multi-modèles

    Get PDF
    Supercomputing plays an important role in several innovative fields, speeding up prototyping or validating scientific theories. However, supercomputers are evolving rapidly with now millions of processing units, posing the questions of their programmability. Despite the emergence of more widespread and functional parallel programming models, developing correct and effective parallel applications still remains a complex task. Although debugging solutions have emerged to address this issue, they often come with restrictions. However programming model evolutions stress the requirement for a convenient validation tool able to handle hybrid applications. Indeed as current scientific applications mainly rely on the Message Passing Interface (MPI) parallel programming model, new hardwares designed for Exascale with higher node-level parallelism clearly advocate for an MPI+X solutions with X a thread-based model such as OpenMP. But integrating two different programming models inside the same application can be error-prone leading to complex bugs - mostly detected unfortunately at runtime. In an MPI+X program not only the correctness of MPI should be ensured but also its interactions with the multi-threaded model, for example identical MPI collective operations cannot be performed by multiple nonsynchronized threads. This thesis aims at developing a combination of static and dynamic analysis to enable an early verification of hybrid HPC applications. The first pass statically verifies the thread level required by an MPI+OpenMP application and outlines execution paths leading to potential deadlocks. Thanks to this analysis, the code is selectively instrumented, displaying an error and synchronously interrupting all processes if the actual scheduling leads to a deadlock situation.L’utilisation du parallélisme des architectures actuelles dans le domaine du calcul hautes performances, oblige à recourir à différents langages parallèles. Ainsi, l’utilisation conjointe de MPI pour le parallélisme gros grain, à mémoire distribuée et OpenMP pour du parallélisme de thread, fait partie des pratiques de développement d’applications pour supercalculateurs. Des erreurs, liées à l’utilisation conjointe de ces langages de parallélisme, sont actuellement difficiles à détecter et cela limite l’écriture de codes, permettant des interactions plus poussées entre ces niveaux de parallélisme. Des outils ont été proposés afin de palier ce problème. Cependant, ces outils sont généralement focalisés sur un type de modèle et permettent une vérification dite statique (à la compilation) ou dynamique (à l’exécution). Pourtant une combinaison statique/- dynamique donnerait des informations plus pertinentes. En effet, le compilateur est en mesure de donner des informations relatives au comportement général du code, indépendamment du jeu d’entrée. C’est par exemple le cas des problèmes liés aux communications collectives du modèle MPI. Cette thèse a pour objectif de développer des analyses statiques/dynamiques permettant la vérification d’une application parallèle mélangeant plusieurs modèles de programmation, afin de diriger les développeurs vers un code parallèle multi-modèles correct et performant. La vérification se fait en deux étapes. Premièrement, de potentielles erreurs sont détectées lors de la phase de compilation. Ensuite, un test au runtime est ajouté pour savoir si le problème va réellement se produire. Grâce à ces analyses combinées, nous renvoyons des messages précis aux utilisateurs et évitons les situations de blocage

    On the benefits of tasking with OpenMP

    Get PDF
    Tasking promises a model to program parallel applications that provides intuitive semantics. In the case of tasks with dependences, it also promises better load balancing by removing global synchronizations (barriers), and potential for improved locality. Still, the adoption of tasking in production HPC codes has been slow. Despite OpenMP supporting tasks, most codes rely on worksharing-loop constructs alongside MPI primitives. This paper provides insights on the benefits of tasking over the worksharing-loop model by reporting on the experience of taskifying an adaptive mesh refinement proxy application: miniAMR. The performance evaluation shows the taskified implementation being 15–30% faster than the loop-parallel one for certain thread counts across four systems, three architectures and four compilers thanks to better load balancing and system utilization. Dynamic scheduling of loops narrows the gap but still falls short of tasking due to serial sections between loops. Locality improvements are incidental due to the lack of locality-aware scheduling. Overall, the introduction of asynchrony with tasking lives up to its promises, provided that programmers parallelize beyond individual loops and across application phases.Peer ReviewedPostprint (author's final draft

    A Functional Safety OpenMP∗ for Critical Real-Time Embedded Systems

    Get PDF
    OpenMP* has recently gained attention in the embedded domain by virtue of the augmentations implemented in the last specification. Yet, the language has a minimal impact in the embedded real-time domain mostly due to the lack of reliability and resiliency mechanisms. As a result, functional safety properties cannot be guaranteed. This paper analyses in detail the latest specification to determine whether and how the compliant OpenMP implementations can guarantee functional safety. Given the conclusions drawn from the analysis, the paper describes a set of modifications to the specification, and a set of requirements for compiler and runtime systems to qualify for safety critical environments. Through the proposed solution, OpenMP can be used in critical real-time embedded systems without compromising functional safety.This work was funded by the EU project P-SOCRATES (FP7-ICT-2013- 10) and the Spanish Ministry of Science and Innovation under contract TIN2015- 65316-P.Peer ReviewedPostprint (author's final draft

    Enhancing the Speed and Automation of Assisted Parallelization

    Get PDF
    Parallelization is a technique that boosts the performance of a program beyond optimizations of the sequential algorithm. Utilizing the technique requires deep program knowledge and is usually complex and time-consuming. Software tools have been proposed to discover parallelism opportunities. Tools relying on static analysis follow a conservative path and tend to miss many opportunities, whereas dynamic analysis suffers from a vast runtime overhead, often resulting in a slowdown of 100x. In this dissertation, we present two methods that help programmers parallelize programs. We abandon the idea of fully automated parallelization and instead pinpoint programmers to potential parallelism opportunities in the source code. Our first method detects data dependences using a hybrid approach, mitigating the limitations of both static and dynamic analyses. Our second method exploits the identified dependences to provide practical hints for parallelizing a sequential program. Data dependence analysis can be performed statically or dynamically. Static analysis is fast, but it overestimates the number of data dependences. Dynamic analysis records dependences that actually occur during program execution but suffers from high runtime overhead and is also input sensitive. We have proposed a hybrid approach that considerably reduces the overhead and does not overestimate data dependences. Our approach first detects memory-access instructions that create statically-identifiable data dependences. Then, it excludes these instructions from the instrumentation, avoiding their associated overhead at runtime. We have implemented our approach in DiscoPoP, a parallelism discovery tool, and evaluated it with 49 benchmarks from three benchmark suites (i.e., Polybench, NPB, and BOTS) and two simulation applications (namely, EOS-MBPT and LULESH). The median reduction of the profiling time across all programs was 76%. Additionally, we proposed a method that uses the identified dependences to make recommendations on how to parallelize a program with OpenMP. OpenMP allows programmers to annotate code sections in a program with parallelization constructs. Programming with OpenMP is not easy. Programmers need to determine which construct to insert where in the source code to maximize performance and preserve correctness. Another task is classifying variables inside the constructs according to their data-sharing semantics. Performing these tasks manually is complex and error-prone. We have proposed an approach that automates these tasks. Our approach receives as input parallel design patterns derived from the extracted data dependences and maps them to appropriate OpenMP constructs and source-code lines. Further, it classifies the variables within those constructs. After integrating our parallelization approach into DiscoPoP, we used it to parallelize our test programs. We compared their execution times with their sequential versions. We observed a speedup of up to 1.35x for EOS-MBPT and 8x for LULESH. For the benchmarks, we further compared our parallelizations with those generated by three state-of-the-art parallelization tools: We produced faster codes in most cases with an average speedup relative to any of the three ranging from 1.8 to 2.7. Also, we automatically reclassified variables of OpenMP programs parallelized manually or with the help of these tools, reducing their execution time by up to 29%. Moreover, we found that the inherent input sensitivity of the dynamic dependence analysis, if running the target program with a range of representative inputs, does not make the resulting parallel programs harder to validate than those parallelized manually. Finally, our approach has been extended to suggest offloading suitable kernels onto the GPU using OpenMP

    Portable Checkpointing for Parallel Applications

    Full text link
    High Performance Computing (HPC) systems represent the peak of modern computational capability. As ever-increasing demands for computational power have fuelled the demand for ever-larger computing systems, modern HPC systems have grown to incorporate hundreds, thousands or as many as 130,000 processors. At these scales, the huge number of individual components in a single system makes the probability that a single component will fail quite high, with today's large HPC systems featuring mean times between failures on the order of hours or a few days. As many modern computational tasks require days or months to complete, fault tolerance becomes critical to HPC system design. The past three decades have seen significant amounts of research on parallel system fault tolerance. However, as most of it has been either theoretical or has focused on low-level solutions that are embedded into a particular operating system or type of hardware, this work has had little impact on real HPC systems. This thesis attempts to address this lack of impact by describing a high-level approach for implementing checkpoint/restart functionality that decouples the fault tolerance solution from the details of the operating system, system libraries and the hardware and instead connects it to the APIs implemented by the above components. The resulting solution enables applications that use these APIs to become self-checkpointing and self-restarting regardless of the the software/hardware platform that may implement the APIs. The particular focus of this thesis is on the problem of checkpoint/restart of parallel applications. It presents two theoretical checkpointing protocols, one for the message passing communication model and one for the shared memory model. The former is the first protocol to be compatible with application-level checkpointing of individual processes, while the latter is the first protocol that is compatible with arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel APIs, respectively, and are first in providing checkpoint/restart to arbitrary implementations of these popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms and are shown to feature low overheads

    The hArtes Tool Chain

    Get PDF
    This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform

    Doctor of Philosophy

    Get PDF
    dissertationHigh Performance Computing (HPC) on-node parallelism is of extreme importance to guarantee and maintain scalability across large clusters of hundreds of thousands of multicore nodes. HPC programming is dominated by the hybrid model "MPI + X", with MPI to exploit the parallelism across the nodes, and "X" as some shared memory parallel programming model to accomplish multicore parallelism across CPUs or GPUs. OpenMP has become the "X" standard de-facto in HPC to exploit the multicore architectures of modern CPUs. Data races are one of the most common and insidious of concurrent errors in shared memory programming models and OpenMP programs are not immune to them. The OpenMP-provided ease of use to parallelizing programs can often make it error-prone to data races which become hard to find in large applications with thousands lines of code. Unfortunately, prior tools are unable to impact practice owing to their poor coverage or poor scalability. In this work, we develop several new approaches for low overhead data race detection. Our approaches aim to guarantee high precision and accuracy of race checking while maintaining a low runtime and memory overhead. We present two race checkers for C/C++ OpenMP programs that target two different classes of programs. The first, ARCHER, is fast but requires large amount of memory, so it ideally targets applications that require only a small portion of the available on-node memory. On the other hand, SWORD strikes a balance between fast zero memory overhead data collection followed by offline analysis that can take a long time, but it often report most races quickly. Given that race checking was impossible for large OpenMP applications, our contributions are the best available advances in what is known to be a difficult NP-complete problem. We performed an extensive evaluation of the tools on existing OpenMP programs and HPC benchmarks. Results show that both tools guarantee to identify all the races of a program in a given run without reporting any false alarms. The tools are user-friendly, hence serve as an important instrument for the daily work of programmers to help them identify data races early during development and production testing. Furthermore, our demonstrated success on real-world applications puts these tools on the top list of debugging tools for scientists at large

    SIMD@OpenMP : a programming model approach to leverage SIMD features

    Get PDF
    SIMD instruction sets are a key feature in current general purpose and high performance architectures. SIMD instructions apply in parallel the same operation to a group of data, commonly known as vector. A single SIMD/vector instruction can, thus, replace a sequence of scalar instructions. Consequently, the number of instructions can be greatly reduced leading to improved execution times. However, SIMD instructions are not widely exploited by the vast majority of programmers. In many cases, taking advantage of these instructions relies on the compiler. Nevertheless, compilers struggle with the automatic vectorization of codes. Advanced programmers are then compelled to exploit SIMD units by hand, using low-level hardware-specific intrinsics. This approach is cumbersome, error prone and not portable across SIMD architectures. This thesis targets OpenMP to tackle the underuse of SIMD instructions from three main areas of the programming model: language constructions, compiler code optimizations and runtime algorithms. We choose the Intel Xeon Phi coprocessor (Knights Corner) and its 512-bit SIMD instruction set for our evaluation process. We make four contributions aimed at improving the exploitation of SIMD instructions in this scope. Our first contribution describes a compiler vectorization infrastructure suitable for OpenMP. This infrastructure targets for-loops and whole functions. We define a set of attributes for expressions that determine how the code is vectorized. Our vectorization infrastructure also implements support for several advanced vector features. This infrastructure is proven to be effective in the vectorization of complex codes and it is the basis upon which we build the following two contributions. The second contribution introduces a proposal to extend OpenMP 3.1 with SIMD parallelism. Essential parts of this work have become key features of the SIMD proposal included in OpenMP 4.0. We define the "simd" and "simd for" directives that allow programmers to describe SIMD parallelism of loops and whole functions. Furthermore, we propose a set of optional clauses that leads the compiler to generate a more efficient vector code. These SIMD extensions improve the programming efficiency when exploiting SIMD resources. Our evaluation on the Intel Xeon Phi coprocessor shows that our SIMD proposal allows the compiler to efficiently vectorize codes poorly or not vectorized automatically with the Intel C/C++ compiler. In the third contribution, we propose a vector code optimization that enhances overlapped vector loads. These vector loads redundantly read from memory scalar elements already loaded by other vector loads. Our vector code optimization improves the memory usage of these accesses by means of building a vector register cache and exploiting register-to-register instructions. Our proposal also includes a new clause (overlap) in the context of the SIMD extensions for OpenMP of our first contribution. This new clause allows enabling, disabling and tuning this optimization on demand. The last contribution tackles the exploitation of SIMD instructions in the OpenMP barrier and reduction primitives. We propose a new combined barrier and reduction tree scheme specifically designed to make the most of SIMD instructions. Our barrier algorithm takes advantage of simultaneous multi-threading technology (SMT) and it utilizes SIMD memory instructions in the synchronization process. The four contributions of this thesis are an important step in the direction of a more common and generalized use of SIMD instructions. Our work is having an outstanding impact on the whole OpenMP community, ranging from users of the programming model to compiler and runtime implementations. Our proposals in the context of OpenMP improves the programmability of the programming model, the overhead of runtime services and the execution time of applications by means of a better use of SIMD.Los juegos de instrucciones SIMD son un componente clave en las arquitecturas de propósito general y de alto rendimiento actuales. Estas instrucciones aplican en paralelo la misma operación a un conjunto de datos, conocido como vector. Una instrucción SIMD/vectorial puede sustituir una secuencia de instrucciones escalares. Así, el número de instrucciones puede ser reducido considerablemente, dando lugar a mejores tiempos de ejecución. No obstante, las instrucciones SIMD no son explotadas ampliamente por la mayoría de programadores. En general, beneficiarse de estas instrucciones depende del compilador. Sin embargo, los compiladores tienen dificultades con la vectorización automática de códigos por lo que los programadores avanzados se ven obligados a explotar las unidades SIMD manualmente, empleando intrínsecas de bajo nivel específicas del hardware. Esta aproximación es costosa, propensa a errores y no portable entre arquitecturas. Esta tesis se centra en el modelo de programación OpenMP para abordar el poco uso de las instrucciones SIMD desde tres áreas: construcciones del lenguaje, optimizaciones de código del compilador y algoritmos del runtime. Hemos escogido el coprocesador Intel Xeon Phi (Knights Corner) y su juego de instrucciones SIMD de 512 bits para nuestra evaluación. Realizamos cuatro contribuciones para mejorar la explotación de las instrucciones SIMD en este ámbito. Nuestra primera contribución describe una infraestructura de vectorización de compilador adecuada para OpenMP. Esta infraestructura tiene como objetivo la vectorización de bucles y funciones. Para ello definimos un conjunto de atributos que determina como se vectoriza el código. Nuestra evaluación demuestra la efectividad de esta infraestructura en la vectorización de códigos complejos. Esta infraestructura es la base de las dos propuestas siguientes. En la segunda contribución proponemos una extensión SIMD para de OpenMP 3.1. Partes esenciales de este trabajo se han convertido en características clave de la propuesta sobre SIMD incluida en OpenMP 4.0. Definimos las directivas ‘simd’ y ‘simd for’ que permiten a los programadores describir paralelismo SIMD de bucles y funciones. Además, proponemos un conjunto de cláusulas opcionales que permiten que el compilador genere código vectorial más eficiente. Nuestra evaluación muestra que nuestra propuesta SIMD permite al compilador vectorizar eficientemente códigos pobremente o no vectorizados automáticamente con el compilador Intel C/C++

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    Get PDF
    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis
    corecore