673 research outputs found

    Automatic Generation of Models of Microarchitectures

    Get PDF
    Detailed microarchitectural models are necessary to predict, explain, or optimize the performance of software running on modern microprocessors. Building such models often requires a significant manual effort, as the documentation provided by hardware manufacturers is typically not precise enough. The goal of this thesis is to develop techniques for generating microarchitectural models automatically. In the first part, we focus on recent x86 microarchitectures. We implement a tool to accurately evaluate small microbenchmarks using hardware performance counters. We then describe techniques to automatically generate microbenchmarks for measuring the performance of individual instructions and for characterizing cache architectures. We apply our implementations to more than a dozen different microarchitectures. In the second part of the thesis, we study more general techniques to obtain models of hardware components. In particular, we propose the concept of gray-box learning, and we develop a learning algorithm for Mealy machines that exploits prior knowledge about the system to be learned. Finally, we show how this algorithm can be adapted to minimize incompletely specified Mealy machines, a well-known NP-complete problem. Our implementation outperforms existing exact minimization techniques by several orders of magnitude on a number of hard benchmarks; it is even competitive with state-of-the-art heuristic approaches.
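
    The measurement methodology rests on a simple distinction: throughput is measured with many independent instances of an instruction, latency with a dependency chain through a single register. Below is a minimal sketch of that idea with illustrative helper names and register choices; it is not the thesis's actual tool, which drives hardware performance counters directly.

    # Hedged sketch: generating throughput vs. latency microbenchmarks for one
    # instruction. Names, registers, and the unroll factor are illustrative.

    def throughput_benchmark(instr, regs, unroll=100):
        """Independent instances of `instr`: measures reciprocal throughput."""
        # Rotating through several destination registers keeps the instances
        # (mostly) independent so they can execute in parallel.
        body = [instr.format(dst=regs[i % len(regs)], src=regs[(i + 1) % len(regs)])
                for i in range(unroll)]
        return "\n".join(body)

    def latency_benchmark(instr, reg, unroll=100):
        """A dependency chain through one register: measures latency."""
        # Every instance reads the result of the previous one, serializing them.
        return "\n".join(instr.format(dst=reg, src=reg) for _ in range(unroll))

    def cycles_per_instruction(measured_cycles, unroll, repetitions):
        # With loop overhead amortized over many unrolled copies, cycles per
        # dynamic instruction approaches the quantity of interest.
        return measured_cycles / (unroll * repetitions)

    if __name__ == "__main__":
        print(throughput_benchmark("add {dst}, {src}", ["rax", "rbx", "rcx", "rdx"], unroll=4))
        print(latency_benchmark("add {dst}, {src}", "rax", unroll=4))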

    Portable compiler optimisation across embedded programs and microarchitectures using machine learning

    Get PDF
    Building an optimising compiler is a difficult and time-consuming task which must be repeated for each generation of a microprocessor. As the underlying microarchitecture changes from one generation to the next, the compiler must be retuned to optimise specifically for that new system. It may take several releases of the compiler to effectively exploit a processor's performance potential, by which time a new generation has appeared and the process starts again. We address this challenge by developing a portable optimising compiler. Our approach employs machine learning to automatically learn the best optimisations to apply for any new program on a new microarchitectural configuration. It achieves this by learning a model off-line which maps a microarchitecture description plus the hardware counters from a single run of the program to the best compiler optimisation passes. Our compiler gains 67% of the maximum speedup obtainable by an iterative compiler search using 1000 evaluations. We obtain, on average, a 1.16x speedup over the highest default optimisation level across an entire microarchitecture configuration space, achieving a 4.3x speedup in the best case. We demonstrate the robustness of this technique by applying it to an extended microarchitectural space where we achieve comparable performance.
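
    The core of the approach is a learned mapping from a microarchitecture description plus one profiling run's hardware counters to a good optimisation sequence. Below is a minimal sketch of such a mapping using a nearest-neighbour model and made-up feature names; the paper's own features, learner, and pass lists may differ.

    # Hedged sketch: off-line learning of (microarchitecture + counters) -> flags.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Each row: [issue_width, l1d_kb, l2_kb, branch_miss_rate, ipc, cache_miss_rate]
    train_features = np.array([
        [2, 32, 256, 0.05, 0.9, 0.12],
        [4, 64, 512, 0.02, 1.8, 0.04],
        [2, 16, 128, 0.08, 0.7, 0.20],
    ])
    # Label: index into a table of optimisation sequences found best by iterative search.
    best_sequence = ["-O2 -funroll-loops", "-O3 -fvectorize", "-Os"]
    train_labels = np.array([0, 1, 2])

    model = KNeighborsClassifier(n_neighbors=1).fit(train_features, train_labels)

    new_program = np.array([[4, 32, 512, 0.03, 1.5, 0.06]])  # one profiling run
    print("predicted flags:", best_sequence[model.predict(new_program)[0]])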

    Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

    Full text link
    Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements.
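
    The two bounds can be illustrated on a toy dependency graph: the throughput bound ignores dependencies, while the critical path is the longest latency chain through the kernel. Below is a minimal sketch with invented latencies and port counts; these are not OSACA's actual machine files.

    # Hedged sketch: lower bound (port pressure) vs. upper bound (critical path).
    from functools import lru_cache

    # instruction -> (latency in cycles, instructions it depends on)
    kernel = {
        "load_a":  (4, []),
        "load_b":  (4, []),
        "fmadd":   (4, ["load_a", "load_b"]),
        "store_c": (1, ["fmadd"]),
    }

    def critical_path(graph):
        @lru_cache(maxsize=None)
        def finish(node):
            lat, deps = graph[node]
            return lat + max((finish(d) for d in deps), default=0)
        return max(finish(n) for n in graph)

    # Throughput bound: busiest resource per iteration (toy machine: 2 load
    # ports, 2 FMA ports, 1 store port, one uop per instruction).
    throughput_bound = max(2 / 2,   # loads
                           1 / 2,   # FMAs
                           1 / 1)   # stores

    print("lower bound (throughput):", throughput_bound, "cycles/iteration")
    print("upper bound (critical path):", critical_path(kernel), "cycles/iteration")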

    Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2020. 8. ์ „๋™์„.๋‹จ์ผ ISA ์ด๊ธฐ์ข… ๋ฉ€ํ‹ฐ ์ฝ”์–ด ํ”„๋กœ์„ธ์„œ๊ฐ€ ๋ชจ๋ฐ”์ผ ์ปดํ“จํŒ…์— ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ˜๋ฉด, ์ฝ”๋“œ ์ƒ์„ฑ์€ ๋‹จ์ผ ๋Œ€์ƒ ์ฝ”์–ด์— ๋Œ€ํ•œ ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ์— ํ”„๋กœ์„ธ์„œ์˜ ๋‹ค๋ฅธ ์ฝ”์–ด์—์„œ๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ ์ƒ์„ฑ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์šฐ์„  ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜์™€ ์ƒ๊ด€์—†์ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ์ฝ”์–ด๋ฅผ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก FMV (Function-Multi-Versioning)๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ Cortex-A55 / 75 ์ฝ”์–ด์˜ ์„ฑ๋Šฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ปดํŒŒ์ผ๋Ÿฌ์— ๊ฐ„๋‹จํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ๋ฐฑ์—”๋“œ ์ตœ์ ํ™” ํŒจ์Šค๋ฅผ ์ถ”๊ฐ€ํ•  ๊ฒƒ์„ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ”„๋กœ๊ทธ๋žจ์„ ๋ถ„์„ํ•˜๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜์— ๋งž๊ฒŒ ์—ฌ๋Ÿฌ ๋ฒ„์ „์˜ Function์„ ์ƒ์„ฑํ•˜๋Š” Automated Flow๋ฅผ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ ์‹œ, ์‹คํ–‰ ์ค‘์ธ ์ฝ”์–ด๋Š” ์—ฐ์‚ฐ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์  ๋ฒ„์ „์˜ Function์„ ์„ ํƒํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ฐฉ๋ฒ•๋ก ์„ ํ†ตํ•ด, TensorFlow Lite๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ์‚ผ์„ฑ์˜ Exynos 9820 ํ”„๋กœ์„ธ์„œ์˜ Cortex-A55 ๋ฐ Cortex-A75 ์ฝ”์–ด์—์„œ ์„ฑ๋Šฅ์„ CNN ๋ชจ๋ธ์—์„œ 11.2 % ๋ฐ 17.9 %, NLP ๋ชจ๋ธ์—์„œ 10.9 %, 4.5 % ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generations optimize the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application codes utilizing a core at full capacity regardless of its microarchitecture. We also propose to add a simple but powerful backend optimization pass in the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. 
Measurements confirm that the methodology improves the performance of Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.Chapter 1 Introduction 1 1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors 1 1.2 Proposed approach 3 Chapter 2 Related works and Motivation 6 2.1 Function Multi Versioning (FMV) 6 2.2 General Matrix Multiplication (GEMM) 8 2.3 Function Multi Versioning (FMV) 10 Chapter 3 Microarchitecture-Aware Code Generation 12 3.1 FMV Scheme for Heterogeneous Multi-Core Processors 12 3.1.1 Runtime Selector 14 3.1.2 Multi-Versioned Functions 15 3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation 16 3.2.1 LLVM Backend Passes 17 3.2.2 Load Split Optimization Pass 18 3.2.3 Unified Automatic Microarchitecture-aware Code Generation Flow 20 Chapter 4 Experimental Evaluation 22 4.1 Experimental Setup 22 4.2 Floating-point CNN Models 23 4.3 Quantized CNN Models 27 4.4 Question Answering Models 28 Chapter 5 Conclusion 31 Bibliography 32 Abstract in Korean 35Maste
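
    The FMV scheme boils down to keeping several tuned versions of a hot function and letting the running core pick one at call time. Below is a minimal sketch of such a runtime selector with hypothetical function names and a stand-in for core detection; in practice FMV resolution is handled by the compiler and loader (e.g. ifunc-style resolvers), not application code.

    # Hedged sketch: runtime dispatch between per-microarchitecture versions.
    def gemm_generic(a, b):      # portable fallback
        ...

    def gemm_cortex_a55(a, b):   # version tuned for the little in-order core
        ...

    def gemm_cortex_a75(a, b):   # version tuned for the big out-of-order core
        ...

    VERSIONS = {"cortex-a55": gemm_cortex_a55, "cortex-a75": gemm_cortex_a75}

    def current_core_microarch():
        # Stand-in for querying the microarchitecture of the core we run on.
        return "cortex-a75"

    def gemm(a, b):
        # Runtime selector: dispatch to the version matching the current core.
        return VERSIONS.get(current_core_microarch(), gemm_generic)(a, b)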

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    Full text link
    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, whose programming is conventionally an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, they still leave programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
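
    The analytical model makes design space exploration cheap: cycles and resource use are estimated per configuration instead of synthesizing each one. Below is a minimal sketch of such a model-driven search with invented cost formulas and budgets; it is not AutoAccel's actual model.

    # Hedged sketch: analytical-model-based design space exploration.
    N = 1_000_000          # loop trip count of the kernel (assumed)
    DSP_BUDGET = 1024      # available DSP slices on the target FPGA (assumed)

    def estimate(parallel, unroll):
        pipeline_ii = 1                               # ideal initiation interval
        cycles = N / (parallel * unroll) * pipeline_ii
        dsps = 5 * parallel * unroll                  # assumed cost per lane
        return cycles, dsps

    best = None
    for parallel in (1, 2, 4, 8, 16):
        for unroll in (1, 2, 4, 8):
            cycles, dsps = estimate(parallel, unroll)
            if dsps <= DSP_BUDGET and (best is None or cycles < best[0]):
                best = (cycles, dsps, parallel, unroll)

    print("best config: cycles=%.0f dsps=%d parallel=%d unroll=%d" % best)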

    Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

    Full text link
    An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally, we give an outlook on how the method may be generalized to new architectures.
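
    Under the stated assumptions (infinite first-level cache, perfect out-of-order scheduling), throughput prediction reduces to distributing uops over execution ports and taking the busiest port as the cycle bound. Below is a minimal sketch with an invented port assignment; it is not a real OSACA machine model.

    # Hedged sketch: port-pressure-based throughput prediction.
    from collections import defaultdict

    # instruction -> (uops, ports that can execute it)
    port_model = {
        "vfmadd": (1, ("P0", "P1")),
        "vmovupd_load": (1, ("P2", "P3")),
        "vmovupd_store": (1, ("P4",)),
    }

    # one iteration of a toy triad-like kernel
    loop_body = ["vmovupd_load", "vmovupd_load", "vfmadd", "vmovupd_store"]

    pressure = defaultdict(float)
    for instr in loop_body:
        uops, ports = port_model[instr]
        for p in ports:                  # distribute uops evenly over eligible ports
            pressure[p] += uops / len(ports)

    print("predicted cycles/iteration:", max(pressure.values()))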

    Machine Learning for Microprocessor Performance Bug Localization

    Full text link
    The validation process for microprocessors is a very complex task that consumes substantial engineering time during the design process. Bugs that degrade overall system performance, without affecting its functional correctness, are particularly difficult to debug given the lack of a golden reference for bug-free performance. This work introduces two automated performance bug localization methodologies based on machine learning that aim to aid the debugging process. Our results show that, for the evaluated microprocessor core performance bugs whose average IPC impact is greater than 1%, our best-performing technique is able to localize the exact microarchitectural unit of the bug about 77% of the time, while achieving a top-3 unit accuracy (out of 11 possible locations) of over 90% for bugs with the same average IPC impact. The proposed system in our simulation setup requires only a few seconds to perform a bug location inference, which leads to a reduced debugging time.
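
    The approach can be pictured as a supervised classifier over performance-counter statistics whose label is the responsible microarchitectural unit. Below is a minimal sketch with random stand-in data and illustrative unit names; it is not the paper's feature set or model.

    # Hedged sketch: counter deltas -> most likely buggy unit, with top-3 output.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    UNITS = ["fetch", "rename", "lsq", "rob", "branch_pred", "l1d", "l2"]

    # rows: per-workload deltas of counters such as IPC, misspeculation rate,
    # queue occupancy, miss rates (random stand-in training data)
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 8))
    y_train = rng.integers(0, len(UNITS), size=200)

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # Top-3 prediction for one new buggy configuration, analogous to the
    # top-3 unit accuracy metric reported above.
    x_new = rng.normal(size=(1, 8))
    probs = clf.predict_proba(x_new)[0]
    top3 = [UNITS[clf.classes_[i]] for i in np.argsort(probs)[::-1][:3]]
    print("most likely buggy units:", top3)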

    Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

    Full text link
    Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.
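
    The Roofline part of such a model is a one-line bound: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. Below is a minimal sketch with assumed machine numbers for a triad-like kernel; these figures are not actual Kerncraft output.

    # Hedged sketch: Roofline bound for a streaming kernel.
    PEAK_GFLOPS = 48.0        # single-core double-precision peak (assumed)
    MEM_BW_GBS  = 20.0        # attainable main-memory bandwidth (assumed)

    # STREAM-triad-like kernel: a[i] = b[i] + s * c[i]
    flops_per_it = 2          # one multiply, one add
    bytes_per_it = 4 * 8      # load b, c, a (write-allocate) and store a, 8 B each

    arithmetic_intensity = flops_per_it / bytes_per_it            # FLOP/byte
    roofline_gflops = min(PEAK_GFLOPS, MEM_BW_GBS * arithmetic_intensity)

    print("arithmetic intensity: %.3f FLOP/B" % arithmetic_intensity)
    print("Roofline prediction:  %.2f GFLOP/s" % roofline_gflops)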