33 research outputs found

    Profiling a parallel domain specific language using off-the-shelf tools

    Get PDF
    Profiling tools are essential for understanding and tuning the performance of both parallel programs and parallel language implementations. Assessing the performance of a program in a language with high-level parallel coordination is often complicated by the layers of abstraction present in the language and its implementation. This thesis investigates whether it is possible to profile parallel Domain Specific Languages (DSLs) using existing host language profiling tools. The key challenge is that the host language tools report the performance of the DSL runtime system (RTS) executing the application rather than the performance of the DSL application. The key questions are whether a correct, effective and efficient profiler can be constructed using host language profiling tools; is it possible to effectively profile the DSL implementation, and what capabilities are required of the host language profiling tools? The main contribution of this thesis is the development of an execution profiler for the parallel DSL, Haskell Distributed Parallel Haskell (HdpH) using the host language profiling tools. We show that it is possible to construct a profiler (HdpHProf) to support performance analysis of both the DSL applications and the DSL implementation. The implementation uses several new GHC features, including the GHC-Events Library and ThreadScope, develops two new performance analysis tools for DSL HdpH internals, i.e. Spark Pool Contention Analysis, and Registry Contention Analysis. We present a critical comparative evaluation of the host language profiling tools that we used (GHC-PPS and ThreadScope) with another recent functional profilers, EdenTV, alongside four important imperative profilers. This is the first report on the performance of functional profilers in comparison with well established industrial standard imperative profiling technologies. We systematically compare the profilers for usability and data presentation. We found that the GHC-PPS performs well in terms of overheads and usability so using it to profile the DSL is feasible and would not have significant impact on the DSL performance. We validate HdpHProf for functional correctness and measure its performance using six benchmarks. HdpHProf works correctly and can scale to profile HdpH programs running on up to 192 cores of a 32 nodes Beowulf cluster. We characterise the performance of HdpHProf in terms of profiling data size and profiling execution runtime overhead. It shows that HdpHProf does not alter the behaviour of the GHC-PPS and retains low tracing overheads close to the studied functional profilers; 18% on average. Also, it shows a low ratio of HdpH trace events in GHC-PPS eventlog, less than 3% on average. We show that HdpHProf is effective and efficient to use for performance analysis and tuning of the DSL applications. We use HdpHProf to identify performance issues and to tune the thread granularity of six HdpH benchmarks with different parallel paradigms, e.g. divide and conquer, flat data parallel, and nested data parallel. This include identifying problems such as, too small/large thread granularity, problem size too small for the parallel architecture, and synchronisation bottlenecks. We show that HdpHProf is effective and efficient for tuning the parallel DSL implementation. We use the Spark Pool Contention Analysis tool to examine how the spark pool implementation performs when accessed concurrently. We found that appropriate thread granularity can significantly reduce both conflict ratios, and conflict durations, by more than 90%. We use the Registry Contention Analysis tool to evaluate three alternatives of the registry implementations. We found that the tools can give a better understanding of how different implementations of the HdpH RTS perform

    Trace-based Performance Analysis for Hardware Accelerators

    Get PDF
    This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness to power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach is using program states in order to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU---a highly parallel, multi-hybrid plasma physics simulation---that benefited greatly from the developed performance analysis methods.Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat

    Performance analysis of a hybrid MPI/OpenMP application on multi-core clusters

    Get PDF
    The mixing of shared memory and message passing programming models within a single application has often been suggested as a method for improving scientific application performance on clusters of shared memory or multi-core systems. DL_POLY, a large scale molecular dynamics application programmed using message passing programming has been modified to add a layer of shared memory threading and the performance analysed on two multi-core clusters. At lower processor numbers, the extra overheads from shared memory threading in the hybrid code outweigh performance benefits gained over the pure MPI code. On larger core counts the hybrid model performs better than pure MPI, with reduced communication time decreasing the overall runtime

    Performance Optimization Strategies for Transactional Memory Applications

    Get PDF
    This thesis presents tools for Transactional Memory (TM) applications that cover multiple TM systems (Software, Hardware, and hybrid TM) and use information of all different layers of the TM software stack. Therefore, this thesis addresses a number of challenges to extract static information, information about the run time behavior, and expert-level knowledge to develop these new methods and strategies for the optimization of TM applications

    Dataflow Programming Paradigms for Computational Chemistry Methods

    Get PDF
    The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability. This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster methods (CC), which are the gold standard to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains. Two programming paradigms for expressing CC methods into a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, is the programming model where the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, where a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms

    HPC-enabling technologies for high-fidelity combustion simulations

    Get PDF
    With the increase in computational power in the last decade and the forthcoming Exascale supercomputers, a new horizon in computational modelling and simulation is envisioned in combustion science. Considering the multiscale and multiphysics characteristics of turbulent reacting flows, combustion simulations are considered as one of the most computationally demanding applications running on cutting-edge supercomputers. Exascale computing opens new frontiers for the simulation of combustion systems as more realistic conditions can be achieved with high-fidelity methods. However, an efficient use of these computing architectures requires methodologies that can exploit all levels of parallelism. The efficient utilization of the next generation of supercomputers needs to be considered from a global perspective, that is, involving physical modelling and numerical methods with methodologies based on High-Performance Computing (HPC) and hardware architectures. This review introduces recent developments in numerical methods for large-eddy simulations (LES) and direct-numerical simulations (DNS) to simulate combustion systems, with focus on the computational performance and algorithmic capabilities. Due to the broad scope, a first section is devoted to describe the fundamentals of turbulent combustion, which is followed by a general description of state-of-the-art computational strategies for solving these problems. These applications require advanced HPC approaches to exploit modern supercomputers, which is addressed in the third section. The increasing complexity of new computing architectures, with tightly coupled CPUs and GPUs, as well as high levels of parallelism, requires new parallel models and algorithms exposing the required level of concurrency. Advances in terms of dynamic load balancing, vectorization, GPU acceleration and mesh adaptation have permitted to achieve highly-efficient combustion simulations with data-driven methods in HPC environments. Therefore, dedicated sections covering the use of high-order methods for reacting flows, integration of detailed chemistry and two-phase flows are addressed. Final remarks and directions of future work are given at the end. }The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the CoEC project, grant agreement No. 952181 and the CoE RAISE project grant agreement no. 951733.Peer ReviewedPostprint (published version

    A Unified Infrastructure for Monitoring and Tuning the Energy Efficiency of HPC Applications

    Get PDF
    High Performance Computing (HPC) has become an indispensable tool for the scientific community to perform simulations on models whose complexity would exceed the limits of a standard computer. An unfortunate trend concerning HPC systems is that their power consumption under high-demanding workloads increases. To counter this trend, hardware vendors have implemented power saving mechanisms in recent years, which has increased the variability in power demands of single nodes. These capabilities provide an opportunity to increase the energy efficiency of HPC applications. To utilize these hardware power saving mechanisms efficiently, their overhead must be analyzed. Furthermore, applications have to be examined for performance and energy efficiency issues, which can give hints for optimizations. This requires an infrastructure that is able to capture both, performance and power consumption information concurrently. The mechanisms that such an infrastructure would inherently support could further be used to implement a tool that is able to do both, measuring and tuning of energy efficiency. This thesis targets all steps in this process by making the following contributions: First, I provide a broad overview on different related fields. I list common performance measurement tools, power measurement infrastructures, hardware power saving capabilities, and tuning tools. Second, I lay out a model that can be used to define and describe energy efficiency tuning on program region scale. This model includes hardware and software dependent parameters. Hardware parameters include the runtime overhead and delay for switching power saving mechanisms as well as a contemplation of their scopes and the possible influence on application performance. Thus, in a third step, I present methods to evaluate common power saving mechanisms and list findings for different x86 processors. Software parameters include their performance and power consumption characteristics as well as the influence of power-saving mechanisms on these. To capture software parameters, an infrastructure for measuring performance and power consumption is necessary. With minor additions, the same infrastructure can later be used to tune software and hardware parameters. Thus, I lay out the structure for such an infrastructure and describe common components that are required for measuring and tuning. Based on that, I implement adequate interfaces that extend the functionality of contemporary performance measurement tools. Furthermore, I use these interfaces to conflate performance and power measurements and further process the gathered information for tuning. I conclude this work by demonstrating that the infrastructure can be used to manipulate power-saving mechanisms of contemporary x86 processors and increase the energy efficiency of HPC applications

    Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs

    Get PDF
    Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation unbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis in techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on the parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate potential parallelism of any task decomposition. Furthermore, we describe an iterative trialand-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore potential of automating the iterative approach by capturing the programmers’ experience into an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador – a tool to help porting MPI applications to MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador already proved itself useful, by being included in the academic classes on parallel programming at UPC.La programación paralela consiste en dividir un problema de computación entre múltiples unidades de procesamiento y definir como interactúan (comunicación y sincronización) para garantizar un resultado correcto. El rendimiento de un programa paralelo normalmente está muy lejos de ser óptimo: el desequilibrio de la carga computacional y la excesiva interacción entre las unidades de procesamiento a menudo causa ciclos perdidos, reduciendo la eficiencia de la computación paralela. En esta tesis proponemos técnicas orientadas a explotar mejor el paralelismo en aplicaciones paralelas, poniendo énfasis en técnicas que incrementan el asincronismo. En teoría, estas técnicas prometen múltiples beneficios. Primero, tendrían que mitigar el retraso de la comunicación y la sincronización, y por lo tanto incrementar el rendimiento global. Además, la calibración de la paralelización tendría que exponer un paralelismo adicional, incrementando la escalabilidad de la ejecución. Finalmente, un incremente en el asincronismo proveería una tolerancia mayor a redes de comunicación lentas y ruido externo. En la primera parte de la tesis, estudiamos el potencial para la calibración del paralelismo a través de MPI. En concreto, exploramos técnicas automáticas para solapar la comunicación con la computación. Proponemos una técnica de mensajería especulativa que incrementa el solapamiento y no requiere cambios en la aplicación MPI original. Nuestra técnica identifica automáticamente la actividad MPI de la aplicación y la reinterpreta usando solicitudes MPI no bloqueantes situadas óptimamente. Demostramos que esta técnica maximiza el solapamiento y, en consecuencia, acelera la ejecución y permite una mayor tolerancia a las reducciones de ancho de banda. Aún así, en el caso de cargas de trabajo científico realistas, mostramos que el potencial de solapamiento está significativamente limitado por el patrón según el cual cada proceso MPI opera localmente en el paso de mensajes. En la segunda parte de esta tesis, exploramos el potencial para calibrar el paralelismo híbrido MPI/OmpSs. Intentamos obtener una comprensión mejor del paralelismo de aplicaciones híbridas MPI/OmpSs para evaluar de qué manera se ejecutarían en futuras máquinas. Exploramos como las aplicaciones MPI/OmpSs pueden escalar en una máquina paralela con centenares de núcleos por nodo. Además, investigamos cómo este paralelismo de cada nodo se reflejaría en las restricciones de la red de comunicación. En especia, nos concentramos en identificar secciones críticas de código en MPI/OmpSs. Hemos concebido una técnica que rápidamente evalúa, para una aplicación MPI/OmpSs dada y la máquina objetivo seleccionada, qué sección de código tendría que ser optimizada para obtener la mayor ganancia de rendimiento. También estudiamos técnicas para explorar rápidamente el paralelismo potencial de OmpSs inherente en las aplicaciones. Proporcionamos mecanismos para evaluar fácilmente el paralelismo potencial de cualquier descomposición en tareas. Además, describimos una aproximación iterativa para buscar una descomposición en tareas que mostrará el suficiente paralelismo en la máquina objetivo dada. Para finalizar, exploramos el potencial para automatizar la aproximación iterativa. En el trabajo expuesto en esta tesis hemos diseñado herramientas que pueden ser útiles para otros investigadores de este campo. La más avanzada es Tareador, una herramienta para ayudar a migrar aplicaciones al modelo de programación MPI/OmpSs. Tareador proporciona una interfaz simple para proponer una descomposición del código en tareas OmpSs. Tareador también calcula dinámicamente las dependencias de datos entre las tareas anotadas, y automáticamente estima el potencial de paralelización OmpSs. Por último, Tareador da indicaciones adicionales sobre como completar el proceso de migración a OmpSs. Tareador ya se ha mostrado útil al ser incluido en las clases de programación de la UPC

    Analyse statique/dynamique pour la validation et l'amélioration des applications parallèles multi-modèles

    Get PDF
    Supercomputing plays an important role in several innovative fields, speeding up prototyping or validating scientific theories. However, supercomputers are evolving rapidly with now millions of processing units, posing the questions of their programmability. Despite the emergence of more widespread and functional parallel programming models, developing correct and effective parallel applications still remains a complex task. Although debugging solutions have emerged to address this issue, they often come with restrictions. However programming model evolutions stress the requirement for a convenient validation tool able to handle hybrid applications. Indeed as current scientific applications mainly rely on the Message Passing Interface (MPI) parallel programming model, new hardwares designed for Exascale with higher node-level parallelism clearly advocate for an MPI+X solutions with X a thread-based model such as OpenMP. But integrating two different programming models inside the same application can be error-prone leading to complex bugs - mostly detected unfortunately at runtime. In an MPI+X program not only the correctness of MPI should be ensured but also its interactions with the multi-threaded model, for example identical MPI collective operations cannot be performed by multiple nonsynchronized threads. This thesis aims at developing a combination of static and dynamic analysis to enable an early verification of hybrid HPC applications. The first pass statically verifies the thread level required by an MPI+OpenMP application and outlines execution paths leading to potential deadlocks. Thanks to this analysis, the code is selectively instrumented, displaying an error and synchronously interrupting all processes if the actual scheduling leads to a deadlock situation.L’utilisation du parallélisme des architectures actuelles dans le domaine du calcul hautes performances, oblige à recourir à différents langages parallèles. Ainsi, l’utilisation conjointe de MPI pour le parallélisme gros grain, à mémoire distribuée et OpenMP pour du parallélisme de thread, fait partie des pratiques de développement d’applications pour supercalculateurs. Des erreurs, liées à l’utilisation conjointe de ces langages de parallélisme, sont actuellement difficiles à détecter et cela limite l’écriture de codes, permettant des interactions plus poussées entre ces niveaux de parallélisme. Des outils ont été proposés afin de palier ce problème. Cependant, ces outils sont généralement focalisés sur un type de modèle et permettent une vérification dite statique (à la compilation) ou dynamique (à l’exécution). Pourtant une combinaison statique/- dynamique donnerait des informations plus pertinentes. En effet, le compilateur est en mesure de donner des informations relatives au comportement général du code, indépendamment du jeu d’entrée. C’est par exemple le cas des problèmes liés aux communications collectives du modèle MPI. Cette thèse a pour objectif de développer des analyses statiques/dynamiques permettant la vérification d’une application parallèle mélangeant plusieurs modèles de programmation, afin de diriger les développeurs vers un code parallèle multi-modèles correct et performant. La vérification se fait en deux étapes. Premièrement, de potentielles erreurs sont détectées lors de la phase de compilation. Ensuite, un test au runtime est ajouté pour savoir si le problème va réellement se produire. Grâce à ces analyses combinées, nous renvoyons des messages précis aux utilisateurs et évitons les situations de blocage
    corecore