11 research outputs found

    Heterogeneous parallel virtual machine: A portable program representation and compiler for performance and energy optimizations on heterogeneous parallel systems

    Programming heterogeneous parallel systems, such as the SoCs (System-on-Chip) on mobile and edge devices is extremely difficult; the diverse parallel hardware they contain exposes vastly different hardware instruction sets, parallelism models and memory systems. Moreover, a wide range of diverse hardware and software approximation techniques are available for applications targeting heterogeneous SoCs, further exacerbating the programmability challenges. In this thesis, we alleviate the programmability challenges of such systems using flexible compiler intermediate representation solutions, in order to benefit from the performance and superior energy efficiency of heterogeneous systems. First, we develop Heterogeneous Parallel Virtual Machine (HPVM), a parallel program representation for heterogeneous systems, designed to enable functional and performance portability across popular parallel hardware. HPVM is based on a hierarchical dataflow graph with side effects. HPVM successfully supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling. We use the HPVM representation to implement an HPVM prototype, defining the HPVM IR as an extension of the Low Level Virtual Machine (LLVM) IR. Our results show comparable performance with optimized OpenCL kernels for the target hardware from a single HPVM representation using translators from HPVM virtual ISA to native code, IR optimizations operating directly on the HPVM representation, and the capability for supporting flexible runtime scheduling schemes from a single HPVM representation. We extend HPVM to ApproxHPVM, introducing hardware-independent approximation metrics in the IR to enable maintaining accuracy information at the IR level and mapping of application-level end-to-end quality metrics to system level "knobs". 
The approximation metrics quantify the acceptable accuracy loss for individual computations. Application programmers only need to specify high-level, end-to-end quality metrics instead of detailed parameters for individual approximation methods. The ApproxHPVM system then automatically tunes the accuracy requirements of individual computations and maps them to approximate hardware when possible. ApproxHPVM results show significant performance and energy improvements for popular deep learning benchmarks. Finally, we extend ApproxHPVM to ApproxTuner, a compiler and runtime system for approximation. ApproxTuner extends ApproxHPVM with a wide range of hardware and software approximation techniques. It uses a three-step approximation tuning strategy that combines development-time, install-time, and dynamic tuning. Our strategy ensures software portability, even though approximations have highly hardware-dependent performance, and enables efficient dynamic approximation tuning despite the expensive offline steps. ApproxTuner results show significant performance and energy improvements across 7 deep neural networks and 3 image processing benchmarks, and ensure that high-level end-to-end quality specifications are satisfied during adaptive approximation tuning

    Contributions to the efficient use of general purpose coprocessors: kernel density estimation as case study

    142 p. The high performance computing landscape is shifting from assemblies of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices. Accelerators provide greater theoretical performance than traditional multi-core CPUs, but exploiting their computing power remains a challenging task. This dissertation discusses the issues that arise when trying to use general purpose accelerators efficiently. As a contribution to aid in this task, we present a thorough survey of performance modeling techniques and tools for general purpose coprocessors. We then use the statistical technique Kernel Density Estimation (KDE) as a case study. KDE is a memory-bound application that poses several challenges for its adaptation to the accelerator-based model. We present a novel algorithm for computing KDE, called S-KDE, that considerably reduces its computational complexity. Furthermore, we have carried out two parallel implementations of S-KDE, one for multi- and many-core processors and another for accelerators. The latter has been implemented in OpenCL to make it portable across a wide range of devices. We have evaluated the performance of each implementation of S-KDE on a variety of architectures, trying to highlight the bottlenecks and the limits that the code reaches on each device. Finally, we present an application of our S-KDE algorithm in the field of climatology: a novel methodology for the evaluation of environmental models
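    For orientation, the baseline computation that the abstract above starts from can be sketched as a textbook Gaussian kernel density estimate. This is not the S-KDE algorithm, whose details the abstract does not give; the function name `gaussian_kde` and its parameters are illustrative assumptions. The double loop touches every sample for every query point, which is the cost a reduced-complexity algorithm would aim to cut.

```python
import math


def gaussian_kde(samples, query_points, bandwidth):
    """Naive 1-D Gaussian KDE: O(len(samples) * len(query_points)).

    Hypothetical helper for illustration only; this is the textbook
    estimator, not the S-KDE algorithm from the dissertation.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    # Normalisation constant of the Gaussian kernel, averaged over n samples.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in query_points:
        total = 0.0
        for s in samples:
            u = (x - s) / bandwidth
            total += math.exp(-0.5 * u * u)
        densities.append(norm * total)
    return densities
```

    Each query point reads the whole sample array and does little arithmetic per element, which is consistent with the abstract's description of KDE as memory bound.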

    Traçage et profilage de systèmes hétérogènes

    SUMMARY: Heterogeneous systems are increasingly present in all computers. Indeed, many tasks require the use of specialized coprocessors. These coprocessors have enabled very significant performance gains that have led to scientific discoveries, notably deep learning, which only re-emerged with the arrival of general-purpose programming of graphics processors. These coprocessors are increasingly complex. The collaboration and cohabitation of these chips within a single system lead to behaviors that cannot be predicted using static analysis. Moreover, the use of parallel systems with thousands of threads of execution, and of specialized programming models, makes such systems very difficult to understand. These comprehension problems not only make programming slower and more costly, but also prevent the diagnosis of performance problems.
    ABSTRACT: Heterogeneous systems are becoming increasingly relevant and important with the emergence of powerful specialized coprocessors. Because of the nature of certain problems, like graphics display, deep learning and physics simulation, these devices have become a necessity. The power derived from their highly parallel or very specialized architecture is essential to meet the demands of these problems. Because these use cases are common on everyday devices like cellphones and computers, highly parallel coprocessors are added to these devices and collaborate with standard CPUs. The cooperation between these different coprocessors makes the system very difficult to analyze and understand. The highly parallel workload and specialized programming models make programming applications very difficult. Troubleshooting performance issues is even more complex. Since these systems communicate through many layers, the abstractions hide many performance defects

    Enabling high performance dynamic language programming for micro-core architectures

    Micro-core architectures are intended to deliver high performance at a low overall power consumption by combining many simple central processing unit (CPU) cores, with an associated small amount of memory, onto a single chip. This technology is not only of great interest for embedded, Edge and IoT applications but also for High-Performance Computing (HPC) accelerators. However, micro-core architectures are difficult to program and exploit, not only because each technology is different, with its own idiosyncrasies, but also because they each present a different low-level interface to the programmer. Furthermore, micro-cores have very constrained amounts of on-chip, scratchpad memory (often around 32KB), further hampering programmer productivity by requiring the programmer to manually manage the regular loading and unloading of data from the host to the device during program execution. To help address these issues, dynamic languages such as Python have been ported to several micro-core architectures but these are often delivered as interpreters with the associated performance penalty over natively compiled languages, such as C. The research questions for this thesis target four areas of concern for dynamic programming languages on micro-core architectures: (RQ1) how to manage the limited on-chip memory for data, (RQ2) how to manage the limited on-chip memory for code, (RQ3) how to address the low runtime performance of virtual machines and (RQ4) how to manage the idiosyncratic architectures of micro-core architectures. The focus of this work is to address these concerns whilst maintaining the programmer productivity benefits of dynamic programming languages, using ePython as the research vehicle. 
Therefore, key areas of design (such as abstractions for offload) and implementation (novel compiler and runtime techniques for these technologies) are considered, resulting in a number of approaches that are not only applicable to the compilation of Python codes but also more generally to other dynamic languages on micro-core architectures. RQ1 was addressed by providing support for kernels with arbitrary data size through high-level programming abstractions that enable access to the memory hierarchies of micro-core devices, allowing the deployment of real-world applications, such as a machine learning code to detect cancer cells in full-sized scan images. A new abstract machine, Olympus, addressed RQ2 by supporting the compilation of dynamic languages, such as Python, to micro-core native code. Olympus enables ePython to close the kernel runtime performance gap with native C, matching C for the LINPACK and an iterative Fibonacci benchmark, and to provide, on average, around 75% of native C runtime performance across four benchmarks running on a set of eight CPU architectures. Olympus also addresses RQ3 by providing dynamic function loading, supporting kernel codes larger than the on-chip memory, whilst still retaining the runtime performance benefits of native code generation. Finally, RQ4 was addressed by the Eithne benchmarking framework, which not only enabled a single benchmarking code to be deployed, unchanged, across different CPU architectures, but also provided the underlying communications framework for Olympus. The portability of end-user ePython codes and the underlying Olympus abstract machine was validated by running a set of four benchmarks on eight different CPU architectures, from a single codebase

    A GPU performance estimation model based on micro-benchmarks and black-box kernel profiling

    Over the last decade GPUs have been established in the High Performance Computing sector as compute accelerators. The primary characteristics that justify this modern trend are the exceptionally high compute throughput and the remarkable power efficiency of GPUs. However, GPU performance is highly sensitive to many factors, e.g. the type of memory access patterns, branch divergence, the degree of parallelism and potential latencies. 
Consequently, the execution time of a kernel on a GPU is difficult to predict. Unless the kernel is latency bound, a rough estimate of the execution time on a particular GPU can be obtained by applying the roofline model, which maps the program's operational intensity to the peak expected performance on a particular processor. Though this approach is straightforward, it cannot provide accurate prediction results. In this thesis, after validating the roofline principle on GPUs by employing a micro-benchmark, an analytical throughput-oriented performance model is proposed. In particular, this improves on the roofline model following a quantitative approach, and a completely automated GPU performance prediction technique is presented. In this respect, the proposed model utilizes micro-benchmarking and profiling in a “black-box” fashion, as no inspection of source/binary code is required. The proposed model combines GPU and kernel parameters in order to characterize the performance limiting factor and to predict the execution time on target hardware, by taking into account the efficiency of beneficial computational instructions. In addition, the “quadrant-split” visual representation is proposed, which captures the characteristics of multiple processors in relation to a particular kernel. The experimental evaluation combines test executions on stencil computations (red/black SOR, LMSOR), matrix multiplication (SGEMM) and a total of 28 kernels of the Rodinia benchmark suite, all applied on six CUDA GPUs. The observed absolute error in predictions was 27.66% in the average case. Special cases of mispredicted results were investigated and justified. Moreover, the aforementioned micro-benchmark was used as a subject for performance prediction and the exhibited results were very accurate. 
Furthermore, the performance model was also examined in a cross-vendor configuration by applying the prediction method on the same kernel source codes through the HIP programming environment supported on the AMD ROCm platform. Prediction errors were comparable to CUDA experiments despite the significant architectural differences evident between different vendor GPUs
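    The rough estimate that the abstract attributes to the plain roofline model can be sketched as follows. This is only the standard roofline bound, not the refined model proposed in the thesis; the function names and the peak and bandwidth figures in the usage note are hypothetical.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable throughput (GFLOP/s) under the basic roofline model.

    The kernel is limited either by peak compute or by memory bandwidth
    multiplied by its operational intensity, whichever is lower.
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def estimate_time_s(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Rough execution-time estimate in seconds (ignores latency effects)."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    attainable = roofline_gflops(peak_gflops, bandwidth_gbs, intensity)
    return flops / (attainable * 1e9)
```

    For a hypothetical device with a 1000 GFLOP/s peak and 100 GB/s of bandwidth, a kernel with an intensity of 2 FLOPs per byte is bandwidth bound at 200 GFLOP/s, while one at 50 FLOPs per byte hits the compute ceiling. Because the bound ignores latency and instruction mix, it is only a coarse estimate, which is the limitation the abstract sets out to address.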

    GPU Array Access Auto-Tuning

    GPUs have been used for years in compute-intensive applications. Their massively parallel processing capabilities can speed up calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece of achieving high performance. Nearly as important as suitable algorithms is the actual implementation and the usage of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time-consuming task that requires a deep understanding of the algorithms and the underlying hardware. Unlike CPUs, the internal structure of GPUs has changed significantly and will likely change even more over the years. Therefore it does not suffice to optimize the code once during development; it has to be optimized for each new GPU generation that is released. To efficiently (re-)optimize code for the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer. In particular, NVIDIA, the leading manufacturer of GPUs today, has applied significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive target for an auto-tuner. In this thesis we introduce the MATOG auto-tuner, which automatically optimizes array access for NVIDIA CUDA applications. In order to achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This makes it possible to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime. 
To show MATOG's capabilities, we evaluated it on a variety of different applications, ranging from simple algorithms up to complex applications on the last four hardware generations, with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-, mid-, high-end and HPC) and generations. In some cases it is able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU by dynamically changing data layouts throughout the execution

    ApproxHPVM: A retargetable compiler framework for accuracy-aware optimizations

    With the increasing need for machine learning and data processing near the edge, software stacks and compilers must provide optimizations for alleviating the computational burden on low-end edge devices. Approximate computing can help bridge the gap between increasing computational demands and limited compute power on such devices. We present ApproxHPVM, a portable optimizing compiler and runtime system that enables flexible, optimized use of multiple software and hardware approximations in a unified easy-to-use framework. ApproxHPVM uses a portable compiler IR and compiler analyses that are designed to enable accuracy-aware performance and energy tuning on heterogeneous systems with multiple compute units and approximation methods. ApproxHPVM automatically translates end-to-end application-level quality metrics into accuracy requirements for individual operations. ApproxHPVM uses a hardware-agnostic accuracy-tuning phase to do this translation that provides greater portability across heterogeneous hardware platforms. ApproxHPVM incorporates three main components: (a) a compiler IR with hardware-agnostic approximation metrics, (b) a hardware-agnostic accuracy-tuning phase to identify error-tolerant computations, and (c) an accuracy-aware hardware scheduler that maps error-tolerant computations to approximate hardware components. As ApproxHPVM does not incorporate any hardware-specific knowledge as part of the IR, it can serve as a portable virtual ISA that can be shipped to all kinds of hardware platforms. We evaluate ApproxHPVM on 9 benchmarks from the deep learning domain and 5 image-processing benchmarks. Our results show that our framework can offload chunks of approximable computations to special-purpose accelerators that provide significant gains in performance and energy, while staying within user-specified application-level quality metrics with high probability. 
Across the 14 benchmarks, we observe from 1-9x performance speedups and 1.1-11.3x energy reduction for very small reductions in accuracy. ApproxTuner extends ApproxHPVM with a flexible system for dynamic approximation tuning. The key contribution in ApproxTuner is a novel three-phase approach to approximation-tuning that consists of development-time, install-time, and run-time phases. Our approach decouples tuning hardware-independent and hardware-specific approximations, thus providing retargetability across devices. To enable efficient autotuning of approximation choices, we present a novel accuracy-aware tuning technique called predictive approximation-tuning. It can optimize the application during development-time and can also refine the optimization with (previously unknown) hardware-specific approximations at install time. We evaluate ApproxTuner across 11 benchmarks from deep learning and image processing domains. For the evaluated convolutional neural networks, we show that using only hardware-independent approximation choices provides a mean speedup of 2.2x (max 2.7x) on GPU, and 1.4x mean speedup (max 1.9x) on the CPU, while staying within 2 percentage points of inference accuracy loss. For two different accuracy-prediction models, our predictive tuning strategy speeds up tuning by 13.7x and 17.9x compared to conventional empirical tuning while achieving comparable benefits

    Generalized database index structures on massively parallel processor architectures

    Height-balanced search trees are ubiquitous in database management systems as well as in other applications that require efficient access methods in order to identify entries in large data volumes. They can be configured with various strategies for structuring the search space for a given data set and for pruning it when different kinds of search queries are answered. In order to facilitate the development of application-specific tree variants, index frameworks, such as GiST, exist that provide a reusable library of commonly shared tree management functionality. By specializing internal data organization strategies, the framework can be customized to create an index that is efficient for an application's data access characteristics. Because the majority of the framework's code can be reused, development and testing efforts are significantly lower compared to an implementation from scratch. However, none of the existing frameworks supports the execution of index operations on massively parallel processor architectures, such as GPUs. Enabling the use of such processors for generalized index frameworks is the goal of this thesis. By compiling state-of-the-art techniques from a wide range of CPU- and GPU-optimized indexes, a GiST extension is developed that abstracts the physical execution aspect of generic, tree-based search queries. Tree traversals are broken down into vectorized processing primitives that can be scheduled to one of the available (co-)processors for execution. Further, a CPU-based implementation is provided as well as a new GPU-based algorithm that, unlike prior art in this area, does not require that the index is fully stored inside a GPU's main memory buffer. The applicability of the extended framework is assessed for image rendering engines and, based on microbenchmarks, the parallelized algorithm performance is compared for different CPU and GPU generations. 
It will be shown that cases exist where the GPU clearly outperforms the CPU and vice versa. In order to leverage the strengths of each processor type, an adaptive scheduler is presented that can be calibrated to schedule index operations to the best-fitting device in a hybrid system. With the help of a tree traversal simulation, different scheduling strategies are evaluated and it will be shown that the adaptive scheduler can be used to make near-optimal decisions.
    Search trees are ubiquitous in database systems and other applications that need an efficient way to search large data sets for entries that satisfy given search criteria. They can be configured with various strategies for structuring the search space and for excluding regions that are irrelevant to a search result from processing. The development of application-specific indexes is supported by frameworks such as GiST. However, none of the frameworks existing today supports the use of massively parallel processor architectures such as GPUs. Making such processors usable for generic index frameworks is the goal of this work. To this end, techniques from a wide variety of CPU- and GPU-optimized indexes are analyzed and used to develop a GiST extension that abstracts the computations required for a search in search trees. Traversal operations are mapped onto vectorized primitives that can be implemented on parallel processors. The use of this extension is demonstrated by way of example with a CPU algorithm. Furthermore, a new GPU-based algorithm is presented that, in contrast to previous approaches, supports dynamic reloading of index data into the GPU's main memory. 
The practicality of the extended framework is examined using applications from computer graphics, and the performance of the algorithms used is analyzed with the help of a benchmark on different CPU and GPU models. It is shown under which conditions the parallel GPU-based execution is faster than the CPU-based variant, and vice versa. In order to exploit the strengths of both processor types in a hybrid system, a scheduler is developed that, after a calibration phase, can choose the most suitable processor for a given operation. With the help of a simulator for tree traversals, a wide variety of scheduling strategies are compared. It is shown that the scheduler's decisions barely deviate from the optimum and that, depending on the simulated load, the achievable throughput for the parallel execution of several search operations can be increased by an order of magnitude or more through hybrid scheduling