Taming hardware event samples for FDO compilation
Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulty of generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet hardware event samples are typically not precise at the instruction or basic-block granularity, and these inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics that improve the precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead. © 2010 ACM.
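The abstract above describes constructing estimated edge profiles from noisy basic-block sample counts. A minimal sketch of one such smoothing idea (this is an illustration, not the paper's supervised-learning heuristics): apportion each block's sampled count among its successor edges in proportion to the successors' own sampled counts, so that outgoing edge counts obey flow conservation even when the raw samples do not.

```python
# Illustrative sketch only: derive flow-consistent edge counts from noisy
# basic-block sample counts. Not the paper's learned heuristics.

def smooth_edges(cfg, bb_samples):
    """cfg: {block: [successor blocks]}; bb_samples: {block: noisy count}.
    Returns {(src, dst): estimated edge count}."""
    edges = {}
    for src, succs in cfg.items():
        if not succs:
            continue
        total = sum(bb_samples[s] for s in succs)
        for dst in succs:
            # Apportion the source block's count among its successors in
            # proportion to their (noisy) sampled counts.
            share = bb_samples[dst] / total if total else 1 / len(succs)
            edges[(src, dst)] = bb_samples[src] * share
    return edges

# A diamond CFG whose branch samples (62 + 35) disagree with the
# entry count (100); smoothing restores flow conservation.
cfg = {"entry": ["A", "B"], "A": ["exit"], "B": ["exit"], "exit": []}
samples = {"entry": 100, "A": 62, "B": 35, "exit": 100}
est = smooth_edges(cfg, samples)
```

After smoothing, the edges leaving `entry` sum to exactly its block count, even though the raw successor samples did not.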
Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors
M.S. thesis, Seoul National University, Graduate School of Convergence Science and Technology, August 2020.
While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generation optimizes the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application code utilizing a core at full capacity regardless of its microarchitecture. We also propose adding a simple but powerful backend optimization pass to the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. Measurements confirm that the methodology improves the performance of the Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, and 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.
Chapter 1 Introduction
1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors
1.2 Proposed Approach
Chapter 2 Related Works and Motivation
2.1 Function Multi Versioning (FMV)
2.2 General Matrix Multiplication (GEMM)
2.3 Function Multi Versioning (FMV)
Chapter 3 Microarchitecture-Aware Code Generation
3.1 FMV Scheme for Heterogeneous Multi-Core Processors
3.1.1 Runtime Selector
3.1.2 Multi-Versioned Functions
3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation
3.2.1 LLVM Backend Passes
3.2.2 Load Split Optimization Pass
3.2.3 Unified Automatic Microarchitecture-Aware Code Generation Flow
Chapter 4 Experimental Evaluation
4.1 Experimental Setup
4.2 Floating-point CNN Models
4.3 Quantized CNN Models
4.4 Question Answering Models
Chapter 5 Conclusion
Bibliography
Abstract in Korean
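The FMV scheme the thesis describes pairs multiple per-microarchitecture versions of each hot function with a runtime selector. A conceptual sketch of that dispatch structure (real FMV is done by the compiler, which emits multiple code versions plus a resolver; here the version registry and the core query are mocked, and `current_microarch` is a hypothetical stand-in for reading the running core's identification register):

```python
# Conceptual sketch of Function-Multi-Versioning (FMV) dispatch.
# In a real toolchain the compiler emits the versions and the resolver;
# here both the registry and the microarchitecture query are mocked.

VERSIONS = {}

def fmv(microarch):
    """Register a function version for a given microarchitecture."""
    def register(fn):
        VERSIONS.setdefault(fn.__name__, {})[microarch] = fn
        return fn
    return register

def current_microarch():
    # Hypothetical stand-in: a real selector would identify the core it is
    # running on (e.g., via the MIDR part number on Arm).
    return "cortex-a55"

def dispatch(name, *args):
    # Pick the version tuned for the core we are running on right now,
    # falling back to a generic version.
    versions = VERSIONS[name]
    fn = versions.get(current_microarch(), versions["generic"])
    return fn(*args)

@fmv("generic")
def gemm_kernel(n):
    return f"generic kernel, n={n}"

@fmv("cortex-a55")
def gemm_kernel(n):  # tuned version (e.g., with split wide loads)
    return f"A55-tuned kernel, n={n}"

print(dispatch("gemm_kernel", 64))
```

Because both definitions share the name `gemm_kernel`, the decorator registers each under the same key before the second shadows the first, which is enough for a sketch of per-core selection.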
Analysis of Application Delivery Platform for Software Defined Infrastructures
Application Service Providers (ASPs) obtaining resources from multiple clouds have to contend with the different management and control platforms employed by cloud service providers (CSPs) and network service providers (NSPs). Distributing applications over multiple clouds has a number of benefits, but the absence of a common multi-cloud management platform that would allow ASPs dynamic, real-time control over resources across multiple clouds and the interconnecting networks makes this task arduous. OpenADN, being developed at Washington University in Saint Louis, fills this gap. However, the performance issues of such a complex, distributed, and multi-threaded platform, if not tackled appropriately, may neutralize some of the gains accruable to the ASPs. In this paper, we establish the need for, and methods of, collecting precise and fine-grained behavioral data on OpenADN-like platforms that can be used to optimize their behavior in order to control operational cost, performance (e.g., latency), and energy consumption.
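The kind of fine-grained behavioral data collection argued for above has to stay cheap on the hot path. One common pattern, sketched here as an illustration (the event names and the `handle_request` workload are invented for the example, not taken from OpenADN): each thread appends timestamped events to its own lock-free buffer, and the buffers are merged offline for analysis.

```python
# Illustrative sketch: per-thread buffers of timestamped events keep the
# hot path cheap (no lock on the record path); buffers are merged later.

import threading
import time

_local = threading.local()
_all_buffers = []
_registry_lock = threading.Lock()

def record(event):
    buf = getattr(_local, "buf", None)
    if buf is None:
        buf = _local.buf = []
        with _registry_lock:       # lock taken once per thread, not per event
            _all_buffers.append(buf)
    # Monotonic clock: immune to wall-clock adjustments.
    buf.append((time.monotonic(), threading.get_ident(), event))

def handle_request(i):              # invented stand-in for platform work
    record(f"request {i} start")
    # ... application work would happen here ...
    record(f"request {i} end")

threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Merge all per-thread buffers into one time-ordered trace.
events = sorted(e for buf in _all_buffers for e in buf)
```

The design choice here is the usual overhead/precision trade: per-event locking would perturb exactly the latency behavior being measured, so synchronization is confined to buffer registration.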
Establishing a base of trust with performance counters for enterprise workloads
Understanding the performance of large, complex enterprise-class applications is an important, yet nontrivial task. Methods using hardware performance counters, such as profiling through event-based sampling, are often favored over instrumentation for analyzing such large codes, but rarely provide good accuracy at the instruction level. This work evaluates the accuracy of multiple event-based sampling techniques and quantifies the impact of a range of improvements suggested in recent years. The evaluation is performed on instances of three modern CPU architectures, using designated kernels and full applications. We conclude that precisely distributed events considerably improve accuracy, with further improvements possible when using Last Branch Records. We also present practical recommendations for hardware architects, tool developers, and performance engineers, aimed at improving the quality of results.
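The instruction-level inaccuracy discussed above largely comes from sampling "skid": a sample is attributed to an instruction some distance after the one that actually triggered the event. A toy simulation (numbers and skid model are invented for illustration) shows how completely skid can shift cost off the truly hot instruction, which is the effect precisely distributed events are designed to remove:

```python
# Toy simulation of sampling skid: every sample lands 1-3 instructions
# after the one that triggered it, so the hot instruction's cost leaks
# onto its neighbors. Skid distances and counts are illustrative.

import random
from collections import Counter

random.seed(0)
N_INSTR = 10
HOT = 3                        # instruction that actually triggers events
exact = Counter({HOT: 1000})   # ground truth: all events on one instruction

skidded = Counter()
for _ in range(1000):
    skid = random.randint(1, 3)            # retirement delay, in instructions
    skidded[(HOT + skid) % N_INSTR] += 1   # sample is attributed downstream

# With this skid model, no sample is attributed to the hot instruction.
misattributed = 1 - skidded[HOT] / 1000
```

Aggregated over a whole function the totals still match, which is why skid is tolerable for function-level profiles but ruinous at instruction granularity.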
SiblingRivalry: Online Autotuning Through Local Competitions
Modern high performance libraries, such as ATLAS and FFTW, and programming languages, such as PetaBricks, have shown that autotuning computer programs can lead to significant speedups. However, autotuning can be burdensome to the deployment of a program, since the tuning process can take a long time and should be re-run whenever the program, microarchitecture, execution environment, or tool chain changes. Failure to re-autotune programs often leads to widespread use of sub-optimal algorithms. With the growth of cloud computing, where computations can run in environments with unknown load and migrate between different (possibly unknown) microarchitectures, the need for online autotuning has become increasingly important.
We present SiblingRivalry, a new model for always-on online autotuning that allows parallel programs to continuously adapt and optimize themselves to their environment. In our system, requests are processed by dividing the available cores in half and processing two identical requests in parallel, one on each half. Half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration chosen by our self-adapting evolutionary algorithm. When the faster configuration completes, its results are returned and the slower configuration is terminated. Over time, this constant experimentation allows programs to adapt to changing dynamic environments and often outperform the original algorithm that uses the entire system.
United States Dept. of Energy (DOE Award DE-SC0005288).
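The racing mechanism described above can be sketched compactly (this is a minimal illustration, not the PetaBricks implementation: the workloads, delays, and configuration names are mocked, and the evolutionary search that picks the experimental configuration is omitted): run the safe and experimental configurations on the same request in parallel and return whichever finishes first.

```python
# Minimal sketch of SiblingRivalry-style racing: the same request runs
# under a known-safe and an experimental configuration in parallel, and
# the first finisher wins. Configurations and workloads are mocked.

import concurrent.futures as cf
import time

def run_config(name, delay, request):
    time.sleep(delay)             # stand-in for real computation under a config
    return name, request * 2      # (winning config name, result)

def race(request, safe_delay, experimental_delay):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_config, "safe", safe_delay, request),
            pool.submit(run_config, "experimental", experimental_delay, request),
        ]
        done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()            # terminate the slower sibling (best effort)
        return next(iter(done)).result()

winner, result = race(21, safe_delay=0.2, experimental_delay=0.01)
# An autotuner would now credit the winner and evolve the loser's config.
```

In the real system the slower half is forcibly terminated and its cores reclaimed; Python's `Future.cancel` cannot stop an already-running thread, so the sketch only models the first-finisher-wins protocol, not the preemption.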