673 research outputs found

    Automatic Generation of Models of Microarchitectures

    Get PDF
    Detailed microarchitectural models are necessary to predict, explain, or optimize the performance of software running on modern microprocessors. Building such models often requires a significant manual effort, as the documentation provided by hardware manufacturers is typically not precise enough. The goal of this thesis is to develop techniques for generating microarchitectural models automatically. In the first part, we focus on recent x86 microarchitectures. We implement a tool to accurately evaluate small microbenchmarks using hardware performance counters. We then describe techniques to automatically generate microbenchmarks for measuring the performance of individual instructions and for characterizing cache architectures. We apply our implementations to more than a dozen different microarchitectures. In the second part of the thesis, we study more general techniques to obtain models of hardware components. In particular, we propose the concept of gray-box learning, and we develop a learning algorithm for Mealy machines that exploits prior knowledge about the system to be learned. Finally, we show how this algorithm can be adapted to minimize incompletely specified Mealy machines, a well-known NP-complete problem. Our implementation outperforms existing exact minimization techniques by several orders of magnitude on a number of hard benchmarks; it is even competitive with state-of-the-art heuristic approaches.
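
    The measurement methodology rests on a simple distinction: throughput is measured with many independent instances of an instruction, latency with a dependency chain through a single register. Below is a minimal sketch of that idea with illustrative helper names and register choices; it is not the thesis's actual tool, which drives hardware performance counters directly.

    # Hedged sketch: generating throughput vs. latency microbenchmarks for one
    # instruction. Names, registers, and the unroll factor are illustrative.

    def throughput_benchmark(instr, regs, unroll=100):
        """Independent instances of `instr`: measures reciprocal throughput."""
        # Rotating through several destination registers keeps the instances
        # (mostly) independent so they can execute in parallel.
        body = [instr.format(dst=regs[i % len(regs)], src=regs[(i + 1) % len(regs)])
                for i in range(unroll)]
        return "\n".join(body)

    def latency_benchmark(instr, reg, unroll=100):
        """A dependency chain through one register: measures latency."""
        # Every instance reads the result of the previous one, serializing them.
        return "\n".join(instr.format(dst=reg, src=reg) for _ in range(unroll))

    def cycles_per_instruction(measured_cycles, unroll, repetitions):
        # With loop overhead amortized over many unrolled copies, cycles per
        # dynamic instruction approaches the quantity of interest.
        return measured_cycles / (unroll * repetitions)

    if __name__ == "__main__":
        print(throughput_benchmark("add {dst}, {src}", ["rax", "rbx", "rcx", "rdx"], unroll=4))
        print(latency_benchmark("add {dst}, {src}", "rax", unroll=4))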

    Portable compiler optimisation across embedded programs and microarchitectures using machine learning

    Get PDF
    Building an optimising compiler is a difficult and time-consuming task which must be repeated for each generation of a microprocessor. As the underlying microarchitecture changes from one generation to the next, the compiler must be retuned to optimise specifically for that new system. It may take several releases of the compiler to effectively exploit a processor's performance potential, by which time a new generation has appeared and the process starts again. We address this challenge by developing a portable optimising compiler. Our approach employs machine learning to automatically learn the best optimisations to apply for any new program on a new microarchitectural configuration. It achieves this by learning a model off-line which maps a microarchitecture description plus the hardware counters from a single run of the program to the best compiler optimisation passes. Our compiler gains 67% of the maximum speedup obtainable by an iterative compiler search using 1000 evaluations. We obtain, on average, a 1.16x speedup over the highest default optimisation level across an entire microarchitecture configuration space, achieving a 4.3x speedup in the best case. We demonstrate the robustness of this technique by applying it to an extended microarchitectural space where we achieve comparable performance.
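
    The core of the approach is a learned mapping from a microarchitecture description plus one profiling run's hardware counters to a good optimisation sequence. Below is a minimal sketch of such a mapping using a nearest-neighbour model and made-up feature names; the paper's own features, learner, and pass lists may differ.

    # Hedged sketch: off-line learning of (microarchitecture + counters) -> flags.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Each row: [issue_width, l1d_kb, l2_kb, branch_miss_rate, ipc, cache_miss_rate]
    train_features = np.array([
        [2, 32, 256, 0.05, 0.9, 0.12],
        [4, 64, 512, 0.02, 1.8, 0.04],
        [2, 16, 128, 0.08, 0.7, 0.20],
    ])
    # Label: index into a table of optimisation sequences found best by iterative search.
    best_sequence = ["-O2 -funroll-loops", "-O3 -fvectorize", "-Os"]
    train_labels = np.array([0, 1, 2])

    model = KNeighborsClassifier(n_neighbors=1).fit(train_features, train_labels)

    new_program = np.array([[4, 32, 512, 0.03, 1.5, 0.06]])  # one profiling run
    print("predicted flags:", best_sequence[model.predict(new_program)[0]])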

    Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

    Full text link
    Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements.
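
    The two bounds can be illustrated on a toy dependency graph: the throughput bound ignores dependencies, while the critical path is the longest latency chain through the kernel. Below is a minimal sketch with invented latencies and port counts; these are not OSACA's actual machine files.

    # Hedged sketch: lower bound (port pressure) vs. upper bound (critical path).
    from functools import lru_cache

    # instruction -> (latency in cycles, instructions it depends on)
    kernel = {
        "load_a":  (4, []),
        "load_b":  (4, []),
        "fmadd":   (4, ["load_a", "load_b"]),
        "store_c": (1, ["fmadd"]),
    }

    def critical_path(graph):
        @lru_cache(maxsize=None)
        def finish(node):
            lat, deps = graph[node]
            return lat + max((finish(d) for d in deps), default=0)
        return max(finish(n) for n in graph)

    # Throughput bound: busiest resource per iteration (toy machine: 2 load
    # ports, 2 FMA ports, 1 store port, one uop per instruction).
    throughput_bound = max(2 / 2,   # loads
                           1 / 2,   # FMAs
                           1 / 1)   # stores

    print("lower bound (throughput):", throughput_bound, "cycles/iteration")
    print("upper bound (critical path):", critical_path(kernel), "cycles/iteration")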

    Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2020. 8. ์ „๋™์„.๋‹จ์ผ ISA ์ด๊ธฐ์ข… ๋ฉ€ํ‹ฐ ์ฝ”์–ด ํ”„๋กœ์„ธ์„œ๊ฐ€ ๋ชจ๋ฐ”์ผ ์ปดํ“จํŒ…์— ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ˜๋ฉด, ์ฝ”๋“œ ์ƒ์„ฑ์€ ๋‹จ์ผ ๋Œ€์ƒ ์ฝ”์–ด์— ๋Œ€ํ•œ ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ์— ํ”„๋กœ์„ธ์„œ์˜ ๋‹ค๋ฅธ ์ฝ”์–ด์—์„œ๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ ์ƒ์„ฑ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์šฐ์„  ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜์™€ ์ƒ๊ด€์—†์ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ์ฝ”์–ด๋ฅผ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก FMV (Function-Multi-Versioning)๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ Cortex-A55 / 75 ์ฝ”์–ด์˜ ์„ฑ๋Šฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ปดํŒŒ์ผ๋Ÿฌ์— ๊ฐ„๋‹จํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ๋ฐฑ์—”๋“œ ์ตœ์ ํ™” ํŒจ์Šค๋ฅผ ์ถ”๊ฐ€ํ•  ๊ฒƒ์„ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ”„๋กœ๊ทธ๋žจ์„ ๋ถ„์„ํ•˜๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜์— ๋งž๊ฒŒ ์—ฌ๋Ÿฌ ๋ฒ„์ „์˜ Function์„ ์ƒ์„ฑํ•˜๋Š” Automated Flow๋ฅผ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ ์‹œ, ์‹คํ–‰ ์ค‘์ธ ์ฝ”์–ด๋Š” ์—ฐ์‚ฐ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์  ๋ฒ„์ „์˜ Function์„ ์„ ํƒํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ฐฉ๋ฒ•๋ก ์„ ํ†ตํ•ด, TensorFlow Lite๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ์‚ผ์„ฑ์˜ Exynos 9820 ํ”„๋กœ์„ธ์„œ์˜ Cortex-A55 ๋ฐ Cortex-A75 ์ฝ”์–ด์—์„œ ์„ฑ๋Šฅ์„ CNN ๋ชจ๋ธ์—์„œ 11.2 % ๋ฐ 17.9 %, NLP ๋ชจ๋ธ์—์„œ 10.9 %, 4.5 % ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generations optimize the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application codes utilizing a core at full capacity regardless of its microarchitecture. We also propose to add a simple but powerful backend optimization pass in the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. 
Measurements confirm that the methodology improves the performance of Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.Chapter 1 Introduction 1 1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors 1 1.2 Proposed approach 3 Chapter 2 Related works and Motivation 6 2.1 Function Multi Versioning (FMV) 6 2.2 General Matrix Multiplication (GEMM) 8 2.3 Function Multi Versioning (FMV) 10 Chapter 3 Microarchitecture-Aware Code Generation 12 3.1 FMV Scheme for Heterogeneous Multi-Core Processors 12 3.1.1 Runtime Selector 14 3.1.2 Multi-Versioned Functions 15 3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation 16 3.2.1 LLVM Backend Passes 17 3.2.2 Load Split Optimization Pass 18 3.2.3 Unified Automatic Microarchitecture-aware Code Generation Flow 20 Chapter 4 Experimental Evaluation 22 4.1 Experimental Setup 22 4.2 Floating-point CNN Models 23 4.3 Quantized CNN Models 27 4.4 Question Answering Models 28 Chapter 5 Conclusion 31 Bibliography 32 Abstract in Korean 35Maste
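
    The FMV scheme boils down to keeping several tuned versions of a hot function and letting the running core pick one at call time. Below is a minimal sketch of such a runtime selector with hypothetical function names and a stand-in for core detection; in practice FMV resolution is handled by the compiler and loader (e.g. ifunc-style resolvers), not application code.

    # Hedged sketch: runtime dispatch between per-microarchitecture versions.
    def gemm_generic(a, b):      # portable fallback
        ...

    def gemm_cortex_a55(a, b):   # version tuned for the little in-order core
        ...

    def gemm_cortex_a75(a, b):   # version tuned for the big out-of-order core
        ...

    VERSIONS = {"cortex-a55": gemm_cortex_a55, "cortex-a75": gemm_cortex_a75}

    def current_core_microarch():
        # Stand-in for querying the microarchitecture of the core we run on.
        return "cortex-a75"

    def gemm(a, b):
        # Runtime selector: dispatch to the version matching the current core.
        return VERSIONS.get(current_core_microarch(), gemm_generic)(a, b)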

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    Full text link
    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, whose programming is conventionally an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, they still leave programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
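
    The analytical model makes design space exploration cheap: cycles and resource use are estimated per configuration instead of synthesizing each one. Below is a minimal sketch of such a model-driven search with invented cost formulas and budgets; it is not AutoAccel's actual model.

    # Hedged sketch: analytical-model-based design space exploration.
    N = 1_000_000          # loop trip count of the kernel (assumed)
    DSP_BUDGET = 1024      # available DSP slices on the target FPGA (assumed)

    def estimate(parallel, unroll):
        pipeline_ii = 1                               # ideal initiation interval
        cycles = N / (parallel * unroll) * pipeline_ii
        dsps = 5 * parallel * unroll                  # assumed cost per lane
        return cycles, dsps

    best = None
    for parallel in (1, 2, 4, 8, 16):
        for unroll in (1, 2, 4, 8):
            cycles, dsps = estimate(parallel, unroll)
            if dsps <= DSP_BUDGET and (best is None or cycles < best[0]):
                best = (cycles, dsps, parallel, unroll)

    print("best config: cycles=%.0f dsps=%d parallel=%d unroll=%d" % best)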

    Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

    Full text link
    An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally, we give an outlook on how the method may be generalized to new architectures.
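
    Under the stated assumptions (infinite first-level cache, perfect out-of-order scheduling), throughput prediction reduces to distributing uops over execution ports and taking the busiest port as the cycle bound. Below is a minimal sketch with an invented port assignment; it is not a real OSACA machine model.

    # Hedged sketch: port-pressure-based throughput prediction.
    from collections import defaultdict

    # instruction -> (uops, ports that can execute it)
    port_model = {
        "vfmadd": (1, ("P0", "P1")),
        "vmovupd_load": (1, ("P2", "P3")),
        "vmovupd_store": (1, ("P4",)),
    }

    # one iteration of a toy triad-like kernel
    loop_body = ["vmovupd_load", "vmovupd_load", "vfmadd", "vmovupd_store"]

    pressure = defaultdict(float)
    for instr in loop_body:
        uops, ports = port_model[instr]
        for p in ports:                  # distribute uops evenly over eligible ports
            pressure[p] += uops / len(ports)

    print("predicted cycles/iteration:", max(pressure.values()))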

    Machine Learning for Microprocessor Performance Bug Localization

    Full text link
    The validation process for microprocessors is a very complex task that consumes substantial engineering time during the design process. Bugs that degrade overall system performance, without affecting its functional correctness, are particularly difficult to debug given the lack of a golden reference for bug-free performance. This work introduces two automated performance bug localization methodologies based on machine learning that aim to aid the debugging process. Our results show that, for the evaluated microprocessor core performance bugs whose average IPC impact is greater than 1%, our best-performing technique is able to localize the exact microarchitectural unit of the bug about 77% of the time, while achieving a top-3 unit accuracy (out of 11 possible locations) of over 90% for bugs with the same average IPC impact. The proposed system in our simulation setup requires only a few seconds to perform a bug location inference, which leads to a reduced debugging time.
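
    The approach can be pictured as a supervised classifier over performance-counter statistics whose label is the responsible microarchitectural unit. Below is a minimal sketch with random stand-in data and illustrative unit names; it is not the paper's feature set or model.

    # Hedged sketch: counter deltas -> most likely buggy unit, with top-3 output.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    UNITS = ["fetch", "rename", "lsq", "rob", "branch_pred", "l1d", "l2"]

    # rows: per-workload deltas of counters such as IPC, misspeculation rate,
    # queue occupancy, miss rates (random stand-in training data)
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 8))
    y_train = rng.integers(0, len(UNITS), size=200)

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # Top-3 prediction for one new buggy configuration, analogous to the
    # top-3 unit accuracy metric reported above.
    x_new = rng.normal(size=(1, 8))
    probs = clf.predict_proba(x_new)[0]
    top3 = [UNITS[clf.classes_[i]] for i in np.argsort(probs)[::-1][:3]]
    print("most likely buggy units:", top3)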

    Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

    Full text link
    Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.
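
    The Roofline part of such a model is a one-line bound: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. Below is a minimal sketch with assumed machine numbers for a triad-like kernel; these figures are not actual Kerncraft output.

    # Hedged sketch: Roofline bound for a streaming kernel.
    PEAK_GFLOPS = 48.0        # single-core double-precision peak (assumed)
    MEM_BW_GBS  = 20.0        # attainable main-memory bandwidth (assumed)

    # STREAM-triad-like kernel: a[i] = b[i] + s * c[i]
    flops_per_it = 2          # one multiply, one add
    bytes_per_it = 4 * 8      # load b, c, a (write-allocate) and store a, 8 B each

    arithmetic_intensity = flops_per_it / bytes_per_it            # FLOP/byte
    roofline_gflops = min(PEAK_GFLOPS, MEM_BW_GBS * arithmetic_intensity)

    print("arithmetic intensity: %.3f FLOP/B" % arithmetic_intensity)
    print("Roofline prediction:  %.2f GFLOP/s" % roofline_gflops)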