8 research outputs found

    Taming hardware event samples for FDO compilation

    Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulties in generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet, hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics for improved precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead. © 2010 ACM.
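    The paper's actual heuristics are learned; as a minimal illustration of the underlying idea (turning noisy basic-block sample counts into an edge profile), the sketch below splits each block's sample count across its outgoing edges in proportion to the successors' counts. The CFG, block names, and counts are hypothetical, not from the paper.

```python
# Minimal sketch (not the paper's learned heuristics): estimating an
# edge profile from noisy basic-block sample counts by splitting each
# block's count across outgoing edges in proportion to successor counts.

def estimate_edges(bb_counts, successors):
    """Return {(src, dst): estimated_count} for every CFG edge."""
    edges = {}
    for bb, succs in successors.items():
        total = sum(bb_counts[s] for s in succs)
        for s in succs:
            # Fall back to an even split if all successors sampled 0.
            share = bb_counts[s] / total if total else 1.0 / len(succs)
            edges[(bb, s)] = bb_counts[bb] * share
    return edges

# Tiny diamond CFG: entry -> {then, else} -> exit, with noisy counts.
bb_counts = {"entry": 100, "then": 90, "else": 10, "exit": 98}
successors = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"]}
edges = estimate_edges(bb_counts, successors)
# edge (entry, then) receives 100 * 90/100 = 90 estimated samples
```

    A real smoothing pass would additionally enforce flow conservation (incoming edge counts summing to the block count), which is where the paper's supervised corrections come in.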

    Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors

    Thesis (Master's) -- Seoul National University, Graduate School of Convergence Science and Technology, Dept. of Intelligent Convergence Systems, August 2020. Advisor: 전동석. While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generation optimizes the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application code utilizing a core at full capacity regardless of its microarchitecture. We also propose adding a simple but powerful backend optimization pass to the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. Measurements confirm that the methodology improves the performance of the Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, and by 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.
    Contents: Chapter 1, Introduction (1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors; 1.2 Proposed Approach); Chapter 2, Related Works and Motivation (2.1 Function Multi Versioning (FMV); 2.2 General Matrix Multiplication (GEMM); 2.3 Function Multi Versioning (FMV)); Chapter 3, Microarchitecture-Aware Code Generation (3.1 FMV Scheme for Heterogeneous Multi-Core Processors; 3.1.1 Runtime Selector; 3.1.2 Multi-Versioned Functions; 3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation; 3.2.1 LLVM Backend Passes; 3.2.2 Load Split Optimization Pass; 3.2.3 Unified Automatic Microarchitecture-aware Code Generation Flow); Chapter 4, Experimental Evaluation (4.1 Experimental Setup; 4.2 Floating-point CNN Models; 4.3 Quantized CNN Models; 4.4 Question Answering Models); Chapter 5, Conclusion; Bibliography; Abstract in Korean.
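    The thesis implements FMV at the compiler level (multi-versioned functions plus a runtime selector); the sketch below only illustrates the dispatch idea in Python. The version names, the detection stub, and the toy GEMM are assumptions for illustration, not the thesis's code.

```python
# Illustrative FMV-style dispatch: one version of a hot function per
# microarchitecture, with the running core choosing its version at call
# time. Core names and the detection stub are hypothetical.

def gemm_a55(a, b):
    """Version that would be tuned for the little (Cortex-A55) core."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gemm_a75(a, b):
    """Version that would be tuned for the big (Cortex-A75) core."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

VERSIONS = {"cortex-a55": gemm_a55, "cortex-a75": gemm_a75}

def current_core():
    # Stub: a real runtime selector would identify the executing core
    # (e.g. via its microarchitecture ID register).
    return "cortex-a75"

def gemm(a, b):
    """Dispatcher: route the call to the version for the running core."""
    return VERSIONS[current_core()](a, b)

result = gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# -> [[19, 22], [43, 50]]
```

    In the compiled setting this dispatch is resolved once per call site with negligible overhead, which is what makes per-microarchitecture versions of only the hot functions practical.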

    Analysis of Application Delivery Platform for Software Defined Infrastructures

    Application Service Providers (ASPs) obtaining resources from multiple clouds have to contend with the different management and control platforms employed by cloud service providers (CSPs) and network service providers (NSPs). Distributing applications across multiple clouds has a number of benefits, but the absence of a common multi-cloud management platform that would give ASPs dynamic, real-time control over resources across multiple clouds and the interconnecting networks makes this task arduous. OpenADN, being developed at Washington University in Saint Louis, fills this gap. However, performance issues of such a complex, distributed, and multi-threaded platform, if not tackled appropriately, may neutralize some of the gains accruable to the ASPs. In this paper, we establish the need for, and methods of, collecting precise and fine-grained behavioral data of OpenADN-like platforms that can be used to optimize their behavior in order to control operational cost, performance (e.g., latency), and energy consumption.

    Establishing a base of trust with performance counters for enterprise workloads

    Understanding the performance of large, complex enterprise-class applications is an important, yet nontrivial task. Methods using hardware performance counters, such as profiling through event-based sampling, are often favored over instrumentation for analyzing such large codes, but rarely provide good accuracy at the instruction level. This work evaluates the accuracy of multiple event-based sampling techniques and quantifies the impact of a range of improvements suggested in recent years. The evaluation is performed on instances of three modern CPU architectures, using designated kernels and full applications. We conclude that precisely distributed events considerably improve accuracy, with further improvements possible when using Last Branch Records. We also present practical recommendations for hardware architects, tool developers, and performance engineers, aimed at improving the quality of results.
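    To make "accuracy at the instruction level" concrete, the sketch below compares a sampled per-instruction histogram against exact execution counts using histogram intersection. This metric and the counts are illustrative assumptions, not necessarily what the paper measures; they only show why "skid" (samples attributed a few instructions late) hurts and why precisely distributed events help.

```python
# Sketch: quantifying per-instruction sampling accuracy as the overlap
# between normalized sampled counts and normalized exact counts.
# The metric (histogram intersection) and counts are illustrative.

def overlap(sampled, exact):
    """Histogram intersection of two per-address count maps, in [0, 1]."""
    s_total = sum(sampled.values())
    e_total = sum(exact.values())
    addrs = set(sampled) | set(exact)
    return sum(min(sampled.get(a, 0) / s_total, exact.get(a, 0) / e_total)
               for a in addrs)

exact   = {0x400100: 50, 0x400104: 50, 0x400108: 0}
skidded = {0x400100: 20, 0x400104: 50, 0x400108: 30}  # samples skid forward
precise = {0x400100: 48, 0x400104: 52}  # precisely distributed events
# overlap(skidded, exact) ≈ 0.70; overlap(precise, exact) ≈ 0.98
```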

    SiblingRivalry: Online Autotuning Through Local Competitions

    Modern high performance libraries, such as ATLAS and FFTW, and programming languages, such as PetaBricks, have shown that autotuning computer programs can lead to significant speedups. However, autotuning can be burdensome to the deployment of a program, since the tuning process can take a long time and should be re-run whenever the program, microarchitecture, execution environment, or tool chain changes. Failure to re-autotune programs often leads to widespread use of sub-optimal algorithms. With the growth of cloud computing, where computations can run in environments with unknown load and migrate between different (possibly unknown) microarchitectures, the need for online autotuning has become increasingly important. We present SiblingRivalry, a new model for always-on online autotuning that allows parallel programs to continuously adapt and optimize themselves to their environment. In our system, requests are processed by dividing the available cores in half and processing two identical requests in parallel, one on each half. Half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration chosen by our self-adapting evolutionary algorithm. When the faster configuration completes, its results are returned, and the slower configuration is terminated. Over time, this constant experimentation allows programs to adapt to changing dynamic environments and often outperform the original algorithm that uses the entire system.United States. Dept. of Energy (DOE Award DE-SC0005288)
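    The race-and-terminate scheme described above can be sketched as follows. This is a minimal illustration, not the SiblingRivalry implementation: the real system pins each configuration to its half of the physical cores and kills the loser, whereas this sketch merely races two thread-based stand-ins and returns the first result.

```python
# Minimal sketch of the SiblingRivalry idea: run a known-safe
# configuration and an experimental one on the same request in
# parallel, and return whichever finishes first.
import concurrent.futures as cf
import time

def safe_config(request):
    time.sleep(0.05)  # stand-in for the trusted, slower algorithm
    return ("safe", request * 2)

def experimental_config(request):
    time.sleep(0.01)  # stand-in for a candidate tuned variant
    return ("experimental", request * 2)

def race(request):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(safe_config, request),
                   pool.submit(experimental_config, request)]
        done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # the real system terminates the loser outright
        return next(iter(done)).result()

winner, answer = race(21)
# the experimental variant wins here because it is faster
```

    Because both halves process the same request, an experimental configuration that is slow or wrong costs only half the machine for one request, which is what makes always-on experimentation safe.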