
    uiCA : Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures

    Performance models that statically predict the steady-state throughput of basic blocks on particular microarchitectures, such as IACA, Ithemal, llvm-mca, OSACA, or CQA, can guide optimizing compilers and aid manual software optimization. However, their utility heavily depends on the accuracy of their predictions. The average error of existing models compared to measurements on the actual hardware has been shown to lie between 9% and 36%. But how good is this? To answer this question, we propose an extremely simple analytical throughput model that may serve as a baseline. Surprisingly, this model is already competitive with the state of the art, indicating that there is significant potential for improvement. To explore this potential, we develop a simulation-based throughput predictor. To this end, we propose a detailed parametric pipeline model that supports all Intel Core microarchitecture generations released between 2011 and 2021. We evaluate our predictor on an improved version of the BHive benchmark suite and show that its predictions are usually within 1% of measurement results, improving upon prior models by roughly an order of magnitude. The experimental evaluation also demonstrates that several microarchitectural details that were considered rather insignificant in previous work are in fact essential for accurate prediction. Our throughput predictor is available as open source.
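The abstract does not spell out the baseline model, but a port-based throughput bound illustrates how simple such an analytical model can be. The sketch below is generic and invented for this example (including the port assignments and the even-split assumption); it is not uiCA's actual baseline.

```python
# Hypothetical port-based throughput baseline (illustrative, not uiCA's model).
# Each instruction is annotated with the ports its micro-ops can execute on;
# the predicted steady-state cycles per iteration is the load on the most
# contended port.

from collections import defaultdict

def baseline_throughput(block):
    """block: list of (mnemonic, [list of candidate ports, one per micro-op])."""
    load = defaultdict(float)
    for _, uops in block:
        for ports in uops:
            # Optimistic assumption: each micro-op spreads evenly over its ports.
            share = 1.0 / len(ports)
            for p in ports:
                load[p] += share
    return max(load.values())

# Two ALU ops (ports 0,1,5,6) and one load (ports 2,3) per iteration:
block = [("add", [[0, 1, 5, 6]]), ("add", [[0, 1, 5, 6]]), ("mov", [[2, 3]])]
print(baseline_throughput(block))  # 0.5 cycles/iteration
```

Such a bound ignores dependencies, the front end, and buffer sizes, which is exactly the kind of detail the paper's simulation-based predictor adds.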

    Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels

    Energy optimization is an increasingly important aspect of today's high-performance computing applications. In particular, dynamic voltage and frequency scaling (DVFS) has become a widely adopted solution for balancing performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies manually to minimize energy consumption while maximizing performance. This article focuses on modeling the energy consumption and speedup of GPU applications under different frequency configurations. The task is not straightforward, because of the large set of possible and uniformly distributed configurations and because of the multi-objective nature of the problem, which minimizes energy consumption while maximizing performance. This article proposes a machine-learning-based method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. The method is based on two models, for speedup and for normalized energy, predicted relative to the default frequency configuration. These are then combined into a multi-objective approach that predicts a Pareto set of frequency configurations. Results show that our approach is very accurate at predicting the extrema and the Pareto set, and finds frequency configurations that dominate the default configuration in either energy or performance.
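The final Pareto step described above can be sketched as follows. The frequency configurations and the predicted (speedup, energy) values are made up for illustration, and the prediction models themselves are not shown.

```python
# Given per-configuration predictions of speedup (higher is better) and
# normalized energy (lower is better), keep only the Pareto-optimal
# frequency configurations.

def pareto_set(configs):
    """configs: dict mapping (core_mhz, mem_mhz) -> (speedup, energy)."""
    items = list(configs.items())
    front = []
    for cfg, (s, e) in items:
        dominated = any(
            (s2 >= s and e2 <= e) and (s2 > s or e2 < e)
            for _, (s2, e2) in items
        )
        if not dominated:
            front.append(cfg)
    return front

preds = {
    (1300, 2500): (1.00, 1.00),  # default configuration
    (1000, 2500): (0.90, 0.70),  # slower but far more energy-efficient
    (1400, 2500): (1.05, 1.10),  # faster at an energy cost
    (1000, 2000): (0.85, 0.80),  # dominated by (1000, 2500)
}
print(pareto_set(preds))  # [(1300, 2500), (1000, 2500), (1400, 2500)]
```

A configuration "dominates" the default exactly when it appears on this front with at least one strictly better objective, which is the property the evaluation in the article measures.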

    Vectorization system for unstructured codes with a Data-parallel Compiler IR

    With Dennard scaling coming to an end, Single Instruction Multiple Data (SIMD) offers a way to improve the compute throughput of CPUs. One fundamental technique in SIMD code generators is the vectorization of data-parallel code regions. This has applications in outer-loop vectorization, whole-function vectorization, and the vectorization of explicitly data-parallel languages. This thesis makes contributions to the reliable vectorization of data-parallel code regions with unstructured, reducible control flow; reducibility, which holds in practice, means that every control-flow loop has exactly one entry point. We present P-LLVM, a novel, full-featured intermediate representation for vectorizers that gives the code region a well-defined semantics at every stage of the vectorization pipeline. Partial control-flow linearization is a novel partial if-conversion scheme, an essential technique for vectorizing divergent control flow. Unlike prior techniques, partial linearization runs in linear time, inserts no additional branches or blocks, and comes with proven guarantees on the control flow it retains. Divergent control induces value divergence at join points in the control-flow graph (CFG). We present a novel control-divergence analysis for directed acyclic graphs with optimal running time and prove that it is correct and precise under common static assumptions. We extend this technique to obtain a quadratic-time control-divergence analysis for arbitrary reducible CFGs. For this analysis, we show on a range of realistic examples how earlier approaches are either less precise or incorrect. We present a feature-complete divergence analysis for P-LLVM programs; it is the first to analyze stack-allocated objects in an unstructured control setting. Finally, we generalize single-dimensional vectorization of outer loops to multi-dimensional tensorization of loop nests. SIMD targets benefit from tensorization through more opportunities to reuse loaded values and through more efficient memory access behavior. The techniques were implemented in the Region Vectorizer (RV) for vectorization and in TensorRV for loop-nest tensorization. Our evaluation validates that the general-purpose RV vectorization system matches the performance of more specialized approaches: RV performs on par with the ISPC compiler, which only supports its structured domain-specific language, on a range of tree-traversal codes with complex control flow, and RV outperforms the loop vectorizers of state-of-the-art compilers, as we show for the SPEC2017 nab_s benchmark and the XSBench proxy application.
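To make the role of if-conversion concrete, the toy sketch below shows the technique that partial linearization refines: a branch on which SIMD lanes disagree is replaced by executing both sides for all lanes and blending the results under the branch mask. This is a generic textbook sketch, not RV's actual algorithm, which retains uniform branches and linearizes only where necessary.

```python
# If-conversion in miniature, modeling SIMD lanes as list elements.

def vectorized_abs(xs):
    # Scalar source: "if x < 0: r = -x else: r = x", now executed per lane.
    mask = [x < 0 for x in xs]       # branch condition, one bit per lane
    then_vals = [-x for x in xs]     # "then" side, computed for all lanes
    else_vals = xs                   # "else" side, computed for all lanes
    # Blend: each lane keeps the side selected by its mask bit.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

print(vectorized_abs([-3, 1, -2, 5]))  # [3, 1, 2, 5]
```

Full linearization applies this to every branch; the thesis's partial scheme avoids it where control flow can be kept, which is why the retained-control-flow guarantees matter.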

    Optimizing SIMD execution in HW/SW co-designed processors

    SIMD accelerators are ubiquitous in microprocessors across computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner, and their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness; as a result, they lose significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to compile-time static vectorization. Several environments provide the runtime profiling and optimization support required for dynamic vectorization, the most prominent being 1) dynamic binary translators and optimizers (DBTOs) and 2) hardware/software (HW/SW) co-designed processors. The HW/SW co-designed environment offers several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. We therefore use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. To reduce the number of permutation instructions, we propose Selective Writing, which selectively writes to different parts of a vector register and thereby avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume a significant amount of real estate on the chip, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units, but it has energy and performance overheads of its own. We propose to selectively devectorize the vector code when the higher SIMD lanes are used only intermittently, keeping them idle and power-gated for the maximum duration and thereby reducing overall leakage energy.
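The variable-length idea can be sketched in a few lines. The function and the set of widths below are hypothetical, and the thesis operates on program binaries rather than abstract operation counts; the sketch only shows how leftover isomorphic operations are packed into successively narrower vectors instead of being left scalar.

```python
# Illustrative sketch of variable-length packing: instead of vectorizing only
# at the full SIMD width, cover the remainder with narrower vector ops.

def pack_variable_length(n_ops, widths=(8, 4, 2)):
    """Return (width, count) packs covering n_ops isomorphic scalar operations."""
    packs = []
    for w in widths:
        count, n_ops = divmod(n_ops, w)
        if count:
            packs.append((w, count))
    if n_ops:                       # whatever is left stays scalar
        packs.append((1, n_ops))
    return packs

# 13 isomorphic operations -> one 8-wide op, one 4-wide op, one scalar op,
# instead of one 8-wide op plus five scalar ops.
print(pack_variable_length(13))  # [(8, 1), (4, 1), (1, 1)]
```

Greedy packing like this is what improves dynamic instruction stream coverage: operations that a fixed-width vectorizer would leave scalar still execute on (narrower) SIMD lanes.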

    Unsupervised machine learning for event categorization in business intelligence

    The amount of data and information available for business intelligence purposes is increasing rapidly. Data quality and quantity are important for making correct business decisions, but the volume of data is becoming difficult to process. Machine learning methods are an increasingly powerful tool for dealing with this volume. One such approach is the automatic annotation and location of business-intelligence-relevant actions and events in news data. A study of the literature in this field, however, revealed that there is little standardization or objectivity regarding the categories into which these events and actions are sorted; categorization has often been done in a subjective, laborious manner. The goal of this thesis is to provide information and recommendations on how to create more objective, less time-consuming initial categorizations of actions and events by studying common unsupervised learning methods for this task. The literature and theory needed to follow the research and methodology are reviewed: the context and evolution of business intelligence up to today, especially its relationship to the modern big data problem, which in turn relates to machine learning, artificial intelligence, and in particular natural language processing. The relevant methods of these fields are covered to explain the steps taken toward the goal of this thesis. All approaches aided in understanding the behaviour of unsupervised learning methods and how that behaviour should be taken into account when creating categorizations. Different natural language preprocessing steps are combined with different text vectorization methods. Specifically, three text tokenization methods (plain, N-gram, and chunk tokenization) are tested with two popular vectorization methods: bag-of-words and term frequency-inverse document frequency (TF-IDF). Two types of unsupervised methods are tested on these vectorizations: clustering, a more traditional data subcategorization process, and topic modelling, a fuzzy, probability-based method for the same task. From both learning methods, three algorithms are studied by the interpretability and categorization value of their top cluster or topic representative terms. The top-term representations are also compared to the true contents of the topics or clusters via content analysis. Of the studied methods, plain and chunk tokenization yielded the results most comprehensible to a human reader. The choice of vectorization made no major difference to top-term interpretability or to the congruence between top terms and contents. K-means clustering and Latent Dirichlet Allocation were deemed the most useful for creating event and action categorizations: K-means clustering created a good basis for an initial categorization framework, with top terms congruent with the contents of the clusters, and Latent Dirichlet Allocation found latent topics in the text documents that provided serendipitous, fruitful insights for a category creator to take into account.
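A minimal, self-contained version of the plain-tokenization pipeline described above is sketched below: TF-IDF vectorization followed by k-means clustering. The thesis uses richer tooling and real news data; the four toy documents and the naive deterministic centroid seeding here are choices made only for this example.

```python
# Plain tokenization -> TF-IDF vectors -> k-means clustering (pure stdlib).

import math
from collections import Counter

def tfidf(docs):
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for d in tokenized for t in set(d))   # document frequency
    vocab = sorted(df)
    n = len(docs)
    vecs = []
    for d in tokenized:
        tf = Counter(d)
        vecs.append([tf[t] / len(d) * math.log(n / df[t]) for t in vocab])
    return vocab, vecs

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vecs, k, iters=10):
    centroids = vecs[::len(vecs) // k][:k]   # naive deterministic seeding
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in vecs]
        for c in range(k):
            members = [v for v, l in zip(vecs, labels) if l == c]
            if members:                      # recompute centroid as the mean
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = ["merger acquisition deal", "acquisition merger talks",
        "product launch event", "launch event keynote"]
vocab, vecs = tfidf(docs)
print(kmeans(vecs, 2))  # [0, 0, 1, 1]: the two latent themes are separated
```

Inspecting the highest-weighted vocabulary terms per cluster centroid would then yield the "top representative terms" that the thesis evaluates for interpretability.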

    Combining static and dynamic approaches to model the performance of HPC loops

    The complexity of CPUs has increased considerably since their beginnings, introducing mechanisms such as register renaming, out-of-order execution, vectorization, prefetchers, and multi-core environments to keep performance rising with each product generation. However, so has the difficulty of making proper use of all these mechanisms, of evaluating whether a program makes good use of a machine, of judging whether users' needs match a CPU's design, or, for CPU architects, of knowing how each feature really affects customers. This thesis focuses on increasing the observability of potential bottlenecks in HPC computational loops and of how these bottlenecks relate to each other in modern microarchitectures. We first introduce a framework combining CQA and DECAN (respectively static and dynamic analysis tools) to obtain detailed performance metrics on small codelets in various execution scenarios. We then present PAMDA, a performance analysis methodology that leverages elements obtained from codelet analysis to detect potential performance problems in HPC applications and to help resolve them. We also describe work extending the Cape linear model to cover Sandy Bridge in more detail and to give the model more flexibility for HW/SW co-design purposes; it is directly used in VP3, a tool that evaluates the performance gains vectorizing loops could provide. Finally, we describe UFS, an approach combining static analysis and cycle-accurate simulation to estimate a loop's execution time very quickly while accounting for out-of-order limitations in modern CPUs.

    Analytical Query Processing Using Heterogeneous SIMD Instruction Sets

    Numerous applications gather increasing amounts of data, which have to be managed and queried, and several hardware developments help to meet this challenge. The growing capacity of main memory enables database systems to keep all their data in memory. Additionally, the hardware landscape is becoming more diverse: a plethora of homogeneous and heterogeneous co-processors is available, where heterogeneity refers not only to different computing power but also to different instruction set architectures (ISAs). For instance, modern IntelÂź CPUs offer different instruction sets supporting the Single Instruction Multiple Data (SIMD) paradigm, e.g. SSE, AVX, and AVX-512. Database systems have started to exploit SIMD to increase performance. However, this remains a challenging task, because existing algorithms were mainly developed for scalar processing and because there is a huge variety of instruction sets, which were never standardized and have no unified interface. Porting a system to another hardware architecture therefore requires completely rewriting the source code, even if the architectures are not fundamentally different and are designed by the same company. Moreover, operations on large registers, the core principle of SIMD processing, behave counter-intuitively in several cases. This is especially true for analytical query processing, where different memory access patterns and the data dependencies caused by data compression challenge the limits of the SIMD principle. Finally, there are physical constraints on the use of such instructions, affecting CPU frequency scaling, which is further influenced by the use of multiple cores: the supply power of a CPU is limited, so not all transistors can be powered at the same time. Hence, there is a complex relationship between performance and power, and therefore also between performance and energy consumption. This thesis addresses the specific challenges introduced by the application of SIMD in general, and by the heterogeneity of SIMD ISAs in particular. Its goal is to exploit the potential of heterogeneous SIMD ISAs to increase both performance and energy efficiency.
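The unified-interface idea can be shown in miniature: a kernel written once against an abstract "vector extension" runs unchanged at any register width. Real implementations would map load/add/store to SSE, AVX, or AVX-512 intrinsics in a compiled language; the Python model below, with invented names, only illustrates the shape of such an abstraction.

```python
# A width-agnostic kernel over a pluggable, simulated vector extension.

class SimVec:
    """Simulated vector extension with a fixed number of lanes."""
    def __init__(self, lanes):
        self.lanes = lanes
    def load(self, data, i):                 # load one register's worth
        return data[i:i + self.lanes]
    def add(self, a, b):                     # lane-wise addition
        return [x + y for x, y in zip(a, b)]
    def store(self, data, i, v):             # store one register's worth
        data[i:i + self.lanes] = v

def vec_add(ext, a, b):
    """Elementwise add, written once; assumes len(a) % ext.lanes == 0."""
    out = [0] * len(a)
    for i in range(0, len(a), ext.lanes):
        ext.store(out, i, ext.add(ext.load(a, i), ext.load(b, i)))
    return out

a, b = list(range(8)), [10] * 8
# Same kernel, different "ISA" widths, identical results:
assert vec_add(SimVec(4), a, b) == vec_add(SimVec(8), a, b)
print(vec_add(SimVec(4), a, b))  # [10, 11, 12, 13, 14, 15, 16, 17]
```

Porting to a new ISA then means supplying a new extension object rather than rewriting every operator, which is precisely the rewriting burden the abstract describes.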