Taming hardware event samples for FDO compilation
Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulty of generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet hardware event samples are typically not precise at the instruction or basic-block granularity, and these inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics that improve the precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead. © 2010 ACM.
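The abstract above describes constructing estimated edge profiles from noisy basic-block sample counts. A minimal sketch of one such smoothing idea (this is an illustration, not the paper's supervised-learning heuristics): apportion each block's sampled count among its successor edges in proportion to the successors' own sampled counts, so that outgoing edge counts obey flow conservation even when the raw samples do not.

```python
# Illustrative sketch only: derive flow-consistent edge counts from noisy
# basic-block sample counts. Not the paper's learned heuristics.

def smooth_edges(cfg, bb_samples):
    """cfg: {block: [successor blocks]}; bb_samples: {block: noisy count}.
    Returns {(src, dst): estimated edge count}."""
    edges = {}
    for src, succs in cfg.items():
        if not succs:
            continue
        total = sum(bb_samples[s] for s in succs)
        for dst in succs:
            # Apportion the source block's count among its successors in
            # proportion to their (noisy) sampled counts.
            share = bb_samples[dst] / total if total else 1 / len(succs)
            edges[(src, dst)] = bb_samples[src] * share
    return edges

# A diamond CFG whose branch samples (62 + 35) disagree with the
# entry count (100); smoothing restores flow conservation.
cfg = {"entry": ["A", "B"], "A": ["exit"], "B": ["exit"], "exit": []}
samples = {"entry": 100, "A": 62, "B": 35, "exit": 100}
est = smooth_edges(cfg, samples)
```

After smoothing, the edges leaving `entry` sum to exactly its block count, even though the raw successor samples did not.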
Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors
M.S. thesis, Seoul National University, Graduate School of Convergence Science and Technology, August 2020.
While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generation optimizes the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application code utilizing a core at full capacity regardless of its microarchitecture. We also propose adding a simple but powerful backend optimization pass to the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. Measurements confirm that the methodology improves the performance of the Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, and 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.
Chapter 1 Introduction
1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors
1.2 Proposed Approach
Chapter 2 Related Works and Motivation
2.1 Function Multi Versioning (FMV)
2.2 General Matrix Multiplication (GEMM)
2.3 Function Multi Versioning (FMV)
Chapter 3 Microarchitecture-Aware Code Generation
3.1 FMV Scheme for Heterogeneous Multi-Core Processors
3.1.1 Runtime Selector
3.1.2 Multi-Versioned Functions
3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation
3.2.1 LLVM Backend Passes
3.2.2 Load Split Optimization Pass
3.2.3 Unified Automatic Microarchitecture-Aware Code Generation Flow
Chapter 4 Experimental Evaluation
4.1 Experimental Setup
4.2 Floating-point CNN Models
4.3 Quantized CNN Models
4.4 Question Answering Models
Chapter 5 Conclusion
Bibliography
Abstract in Korean
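The FMV scheme the thesis describes pairs multiple per-microarchitecture versions of each hot function with a runtime selector. A conceptual sketch of that dispatch structure (real FMV is done by the compiler, which emits multiple code versions plus a resolver; here the version registry and the core query are mocked, and `current_microarch` is a hypothetical stand-in for reading the running core's identification register):

```python
# Conceptual sketch of Function-Multi-Versioning (FMV) dispatch.
# In a real toolchain the compiler emits the versions and the resolver;
# here both the registry and the microarchitecture query are mocked.

VERSIONS = {}

def fmv(microarch):
    """Register a function version for a given microarchitecture."""
    def register(fn):
        VERSIONS.setdefault(fn.__name__, {})[microarch] = fn
        return fn
    return register

def current_microarch():
    # Hypothetical stand-in: a real selector would identify the core it is
    # running on (e.g., via the MIDR part number on Arm).
    return "cortex-a55"

def dispatch(name, *args):
    # Pick the version tuned for the core we are running on right now,
    # falling back to a generic version.
    versions = VERSIONS[name]
    fn = versions.get(current_microarch(), versions["generic"])
    return fn(*args)

@fmv("generic")
def gemm_kernel(n):
    return f"generic kernel, n={n}"

@fmv("cortex-a55")
def gemm_kernel(n):  # tuned version (e.g., with split wide loads)
    return f"A55-tuned kernel, n={n}"

print(dispatch("gemm_kernel", 64))
```

Because both definitions share the name `gemm_kernel`, the decorator registers each under the same key before the second shadows the first, which is enough for a sketch of per-core selection.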
Analysis of Application Delivery Platform for Software Defined Infrastructures
Application Service Providers (ASPs) obtaining resources from multiple clouds have to contend with the different management and control platforms employed by cloud service providers (CSPs) and network service providers (NSPs). Distributing applications over multiple clouds has a number of benefits, but the absence of a common multi-cloud management platform that would allow ASPs dynamic, real-time control over resources across multiple clouds and the interconnecting networks makes this task arduous. OpenADN, being developed at Washington University in Saint Louis, fills this gap. However, the performance issues of such a complex, distributed, and multi-threaded platform, if not tackled appropriately, may neutralize some of the gains accruable to the ASPs. In this paper, we establish the need for, and methods of, collecting precise and fine-grained behavioral data on OpenADN-like platforms that can be used to optimize their behavior in order to control operational cost, performance (e.g., latency), and energy consumption.
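The kind of fine-grained behavioral data collection argued for above has to stay cheap on the hot path. One common pattern, sketched here as an illustration (the event names and the `handle_request` workload are invented for the example, not taken from OpenADN): each thread appends timestamped events to its own lock-free buffer, and the buffers are merged offline for analysis.

```python
# Illustrative sketch: per-thread buffers of timestamped events keep the
# hot path cheap (no lock on the record path); buffers are merged later.

import threading
import time

_local = threading.local()
_all_buffers = []
_registry_lock = threading.Lock()

def record(event):
    buf = getattr(_local, "buf", None)
    if buf is None:
        buf = _local.buf = []
        with _registry_lock:       # lock taken once per thread, not per event
            _all_buffers.append(buf)
    # Monotonic clock: immune to wall-clock adjustments.
    buf.append((time.monotonic(), threading.get_ident(), event))

def handle_request(i):              # invented stand-in for platform work
    record(f"request {i} start")
    # ... application work would happen here ...
    record(f"request {i} end")

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Merge all per-thread buffers into one time-ordered trace.
events = sorted(e for buf in _all_buffers for e in buf)
```

The design choice here is the usual overhead/precision trade: per-event locking would perturb exactly the latency behavior being measured, so synchronization is confined to buffer registration.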
Establishing a base of trust with performance counters for enterprise workloads
Understanding the performance of large, complex enterprise-class applications is an important, yet nontrivial task. Methods using hardware performance counters, such as profiling through event-based sampling, are often favored over instrumentation for analyzing such large codes, but rarely provide good accuracy at the instruction level. This work evaluates the accuracy of multiple event-based sampling techniques and quantifies the impact of a range of improvements suggested in recent years. The evaluation is performed on instances of three modern CPU architectures, using designated kernels and full applications. We conclude that precisely distributed events considerably improve accuracy, with further improvements possible when using Last Branch Records. We also present practical recommendations for hardware architects, tool developers, and performance engineers, aimed at improving the quality of results.
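The instruction-level inaccuracy discussed above largely comes from sampling "skid": a sample is attributed to an instruction some distance after the one that actually triggered the event. A toy simulation (numbers and skid model are invented for illustration) shows how completely skid can shift cost off the truly hot instruction, which is the effect precisely distributed events are designed to remove:

```python
# Toy simulation of sampling skid: every sample lands 1-3 instructions
# after the one that triggered it, so the hot instruction's cost leaks
# onto its neighbors. Skid distances and counts are illustrative.

import random
from collections import Counter

random.seed(0)
N_INSTR = 10
HOT = 3                        # instruction that actually triggers events
exact = Counter({HOT: 1000})   # ground truth: all events on one instruction

skidded = Counter()
for _ in range(1000):
    skid = random.randint(1, 3)            # retirement delay, in instructions
    skidded[(HOT + skid) % N_INSTR] += 1   # sample is attributed downstream

# With this skid model, no sample is attributed to the hot instruction.
misattributed = 1 - skidded[HOT] / 1000
```

Aggregated over a whole function the totals still match, which is why skid is tolerable for function-level profiles but ruinous at instruction granularity.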
SiblingRivalry: Online Autotuning Through Local Competitions
Modern high performance libraries, such as ATLAS and FFTW, and programming languages, such as PetaBricks, have shown that autotuning computer programs can lead to significant speedups. However, autotuning can be burdensome to the deployment of a program, since the tuning process can take a long time and should be re-run whenever the program, microarchitecture, execution environment, or tool chain changes. Failure to re-autotune programs often leads to widespread use of sub-optimal algorithms. With the growth of cloud computing, where computations can run in environments with unknown load and migrate between different (possibly unknown) microarchitectures, the need for online autotuning has become increasingly important.
We present SiblingRivalry, a new model for always-on online autotuning that allows parallel programs to continuously adapt and optimize themselves to their environment. In our system, requests are processed by dividing the available cores in half and processing two identical requests in parallel, one on each half. Half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration chosen by our self-adapting evolutionary algorithm. When the faster configuration completes, its results are returned and the slower configuration is terminated. Over time, this constant experimentation allows programs to adapt to changing dynamic environments and often outperform the original algorithm that uses the entire system.
United States Dept. of Energy (DOE Award DE-SC0005288).
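The racing mechanism described above can be sketched compactly (this is a minimal illustration, not the PetaBricks implementation: the workloads, delays, and configuration names are mocked, and the evolutionary search that picks the experimental configuration is omitted): run the safe and experimental configurations on the same request in parallel and return whichever finishes first.

```python
# Minimal sketch of SiblingRivalry-style racing: the same request runs
# under a known-safe and an experimental configuration in parallel, and
# the first finisher wins. Configurations and workloads are mocked.

import concurrent.futures as cf
import time

def run_config(name, delay, request):
    time.sleep(delay)             # stand-in for real computation under a config
    return name, request * 2      # (winning config name, result)

def race(request, safe_delay, experimental_delay):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_config, "safe", safe_delay, request),
            pool.submit(run_config, "experimental", experimental_delay, request),
        ]
        done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()            # terminate the slower sibling (best effort)
        return next(iter(done)).result()

winner, result = race(21, safe_delay=0.2, experimental_delay=0.01)
# An autotuner would now credit the winner and evolve the loser's config.
```

In the real system the slower half is forcibly terminated and its cores reclaimed; Python's `Future.cancel` cannot stop an already-running thread, so the sketch only models the first-finisher-wins protocol, not the preemption.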