262 research outputs found

    An LLVM Instrumentation Plug-in for Score-P

    Full text link
    Reducing application runtime, scaling parallel applications to higher numbers of processes/threads, and porting applications to new hardware architectures are tasks necessary in the software development process. Therefore, developers have to investigate and understand application runtime behavior. Tools such as monitoring infrastructures that capture performance relevant data during application execution assist in this task. The measured data forms the basis for identifying bottlenecks and optimizing the code. Monitoring infrastructures need mechanisms to record application activities in order to conduct measurements. Automatic instrumentation of the source code is the preferred method in most application scenarios. We introduce a plug-in for the LLVM infrastructure that enables automatic source code instrumentation at compile-time. In contrast to available instrumentation mechanisms in LLVM/Clang, our plug-in can selectively include/exclude individual application functions. This enables developers to fine-tune the measurement to the required level of detail while avoiding large runtime overheads due to excessive instrumentation.Comment: 8 page

    Towards A Quasi High Level Compiler Comparative and Attributive Model for OpenMP Programs

    Get PDF
    In order to understand the behavior of OpenMP programs, special tools and adaptive techniques are needed for performance analysis. However, these tools provide low level profile information at the assembly and functions boundaries via instrumentation at the binary or code level, which are very hard to interpret. Hence, this thesis proposes a new model for OpenMP enabled compilers that assesses the performance differences in well defined formulations by dividing OpenMP program conditions into four distinct states which account for all the possible cases that an OpenMP program can take. An improved version of the standard performance metrics is proposed: speedup, overhead and efficiency based on the model categorization that is state\u27s aware. Moreover, an algorithmic approach to find patterns between OpenMP compilers is proposed which is verified along with the model formulations experimentally. Finally, the thesis reveals the mathematical model behind the optimum performance for any OpenMP program

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    Get PDF
    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable a portable data acquisition and a subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specification. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis

    Application Acceleration on FPGAs with OmpSs@FPGA

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.OmpSs@FPGA is the flavor of OmpSs that allows offloading application functionality to FPGAs. Similarly to OpenMP, it is based on compiler directives. While the OpenMP specification also includes support for heterogeneous execution, we use OmpSs and OmpSs@FPGA as prototype implementation to develop new ideas for OpenMP. OmpSs@FPGA implements the tasking model with runtime support to automatically exploit all SMP and FPGA resources available in the execution platform. In this paper, we present the OmpSs@FPGA ecosystem, based on the Mercurium compiler and the Nanos++ runtime system. We show how the applications are transformed to run on the SMP cores and the FPGA. The application kernels defined as tasks to be accelerated, using the OmpSs directives are: 1) transformed by the compiler into kernels connected with the proper synchronization and communication ports, 2) extracted to intermediate files, 3) compiled through the FPGA vendor HLS tool, and 4) used to configure the FPGA. Our Nanos++ runtime system schedules the application tasks on the platform, being able to use the SMP cores and the FPGA accelerators at the same time. We present the evaluation of the OmpSs@FPGA environment with the Matrix Multiplication, Cholesky and N-Body benchmarks, showing the internal details of the execution, and the performance obtained on a Zynq Ultrascale+ MPSoC (up to 128x). The source code uses OmpSs@FPGA annotations and different Vivado HLS optimization directives are applied for acceleration.This work is partially supported by the European Union H2020 program through the EuroEXA project (grant 754337), and HiPEAC (GA 687698), by the Spanish Government through Programa Severo Ochoa (SEV-2015- 0493), by the Spanish Ministry of Science and Technology (TIN2015-65316-P) and the Departament d’Innovació Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051).Peer ReviewedPostprint (author's final draft

    Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver

    Get PDF
    Cutting-edge functionalities in embedded systems require the use of parallel architectures to meet their performance requirements. This imposes the introduction of a new layer in the software stacks of embedded systems: the parallel programming model. Unfortunately, the tools used to analyze embedded systems fall short to characterize the performance of parallel applications at a parallel programming model level, and correlate this with information about non-functional requirements such as real-time, energy, memory usage, etc. HPC tools, like Extrae, are designed with that level of abstraction in mind, but their main focus is on performance evaluation. Overall, providing insightful information about the performance of parallel embedded applications at the parallel programming model level, and relate it to the non-functional requirements, is of paramount importance to fully exploit the performance capabilities of parallel embedded architectures. This paper contributes to the state-of-the-art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted memory, limited drivers) to support HPC analysis tools; (2) porting Extrae, a powerful tracing tool from the HPC domain, to the GR740 platform, a SoC used in the space domain; and (3) augmenting Extrae with new features needed to correlate the parallel execution with the following non-functional requirements: energy, temperature and memory usage. Finally, the paper presents the usefulness of Extrae to characterize OpenMP applications and its non-functional requirements, evaluating different aspects of the applications running in the GR740.This work has been partially funded from the HP4S (High Performance Parallel Payload Processing for Space) project under the ESA-ESTEC ITI contract № 4000124124/18/NL/CRS.Peer ReviewedPostprint (author's final draft

    Performance Optimization Strategies for Transactional Memory Applications

    Get PDF
    This thesis presents tools for Transactional Memory (TM) applications that cover multiple TM systems (Software, Hardware, and hybrid TM) and use information of all different layers of the TM software stack. Therefore, this thesis addresses a number of challenges to extract static information, information about the run time behavior, and expert-level knowledge to develop these new methods and strategies for the optimization of TM applications

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    Get PDF
    In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time, writing sound parallel programs is highly difficult. While parallel programming is easier with frameworks such as OpenMP, the possibility of data races in these programs still persists. In this paper, we propose a fast, lightweight, language agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare our tool with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and the F1 score of our tool is comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, this work is the only tool among the state-of-the-art data race checkers that can verify a FORTRAN program to be data race free

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    Full text link
    In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this paper, we propose LLOV, a fast, lightweight, language agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and the F1 score of LLOV is comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.Comment: Accepted in ACM TACO, August 202

    Dynamic Determinacy Race Detection for Task-Parallel Programs with Promises

    Get PDF
    Much of the past work on dynamic data-race and determinacy-race detection algorithms for task parallelism has focused on structured parallelism with fork-join constructs and, more recently, with future constructs. This paper addresses the problem of dynamic detection of data-races and determinacy-races in task-parallel programs with promises, which are more general than fork-join constructs and futures. The motivation for our work is twofold. First, promises have now become a mainstream synchronization construct, with their inclusion in multiple languages, including C++, JavaScript, and Java. Second, past work on dynamic data-race and determinacy-race detection for task-parallel programs does not apply to programs with promises, thereby identifying a vital need for this work. This paper makes multiple contributions. First, we introduce a featherweight programming language that captures the semantics of task-parallel programs with promises and provides a basis for formally defining determinacy using our semantics. This definition subsumes functional determinacy (same output for same input) and structural determinacy (same computation graph for same input). The main theoretical result shows that the absence of data races is sufficient to guarantee determinacy with both properties. We are unaware of any prior work that established this result for task-parallel programs with promises. Next, we introduce a new Dynamic Race Detector for Promises that we call DRDP. DRDP is the first known race detection algorithm that executes a task-parallel program sequentially without requiring the serial-projection property; this is a critical requirement since programs with promises do not satisfy the serial-projection property in general. Finally, the paper includes experimental results obtained from an implementation of DRDP. The results show that, with some important optimizations introduced in our work, the space and time overheads of DRDP are comparable to those of more restrictive race detection algorithms from past work. To the best of our knowledge, DRDP is the first determinacy race detector for task-parallel programs with promises
    corecore