Search CORE

365 research outputs found

Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors

Author: 박준모
Publication venue: 서울대학교 대학원
Publication date: 01/08/2020
Field of study

학위논문 (석사) -- 서울대학교 대학원 : 융합과학기술대학원 융합과학부(지능형융합시스템전공), 2020. 8. 전동석.단일 ISA 이기종 멀티 코어 프로세서가 모바일 컴퓨팅에 널리 사용되는 반면, 코드 생성은 단일 대상 코어에 대한 최적화된 코드를 생성하기에 프로세서의 다른 코어에서는 적합하지 않을 수 있다. 본 논문에서는 이 문제를 완화하기 위해 마이크로 아키텍처를 인식할 수 있는 코드 생성 방법을 제시한다. 우선 마이크로 아키텍처와 상관없이 애플리케이션 코드를 실행하기 위해 모든 코어를 최대한 활용할 수 있도록 FMV (Function-Multi-Versioning)를 제안한다. 또한 Cortex-A55 / 75 코어의 성능을 더욱 향상시키기 위해 컴파일러에 간단하지만 강력한 백엔드 최적화 패스를 추가할 것을 제안한다. 이러한 기술을 기반으로 프로그램을 분석하고 서로 다른 마이크로 아키텍처에 맞게 여러 버전의 Function을 생성하는 Automated Flow를 개발하였다. 이를 통해 프로그램 실행 시, 실행 중인 코어는 연산 성능을 극대화하기 위해 최적 버전의 Function을 선택한다. 본 논문에서 제시한 방법론을 통해, TensorFlow Lite를 실행하는 동안 삼성의 Exynos 9820 프로세서의 Cortex-A55 및 Cortex-A75 코어에서 성능을 CNN 모델에서 11.2 % 및 17.9 %, NLP 모델에서 10.9 %, 4.5 % 향상시키는 것을 확인하였다.While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generations optimize the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application codes utilizing a core at full capacity regardless of its microarchitecture. We also propose to add a simple but powerful backend optimization pass in the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. Measurements confirm that the methodology improves the performance of Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.Chapter 1 Introduction 1 1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors 1 1.2 Proposed approach 3 Chapter 2 Related works and Motivation 6 2.1 Function Multi Versioning (FMV) 6 2.2 General Matrix Multiplication (GEMM) 8 2.3 Function Multi Versioning (FMV) 10 Chapter 3 Microarchitecture-Aware Code Generation 12 3.1 FMV Scheme for Heterogeneous Multi-Core Processors 12 3.1.1 Runtime Selector 14 3.1.2 Multi-Versioned Functions 15 3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation 16 3.2.1 LLVM Backend Passes 17 3.2.2 Load Split Optimization Pass 18 3.2.3 Unified Automatic Microarchitecture-aware Code Generation Flow 20 Chapter 4 Experimental Evaluation 22 4.1 Experimental Setup 22 4.2 Floating-point CNN Models 23 4.3 Quantized CNN Models 27 4.4 Question Answering Models 28 Chapter 5 Conclusion 31 Bibliography 32 Abstract in Korean 35Maste

SNU Open Repository and Archive

Energy challenges for ICT

Author: Fagas Giorgos
Gallagher John P.
Luca Gammaitoni
Paul Douglas J.
Publication venue: Intech
Publication date: 01/01/2017
Field of study

The energy consumption from the expanding use of information and communications technology (ICT) is unsustainable with present drivers, and it will impact heavily on the future climate change. However, ICT devices have the potential to contribute signi - cantly to the reduction of CO2 emission and enhance resource e ciency in other sectors, e.g., transportation (through intelligent transportation and advanced driver assistance systems and self-driving vehicles), heating (through smart building control), and manu- facturing (through digital automation based on smart autonomous sensors). To address the energy sustainability of ICT and capture the full potential of ICT in resource e - ciency, a multidisciplinary ICT-energy community needs to be brought together cover- ing devices, microarchitectures, ultra large-scale integration (ULSI), high-performance computing (HPC), energy harvesting, energy storage, system design, embedded sys- tems, e cient electronics, static analysis, and computation. In this chapter, we introduce challenges and opportunities in this emerging eld and a common framework to strive towards energy-sustainable ICT

A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior

Author: Adegbija Tosiron
Criswell Keeley
Publication venue
Publication date: 16/07/2019
Field of study

Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.Comment: To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS

arXiv.org e-Print Archive

Multithreading Aware Hardware Prefetching for Chip Multiprocessors

Author: Albarakat Laith Mohammad
Publication venue
Publication date: 05/02/2018
Field of study

To take advantage of the processing power in the Chip Multiprocessors design, applications must be divided into semi-independent processes that can run concur- rently on multiple cores within a system. Therefore, programmers must insert thread synchronization semantics (i.e. locks, barriers, and condition variables) to synchro- nize data access between processes. Indeed, threads spend long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution to wait for load data accesses to complete. Furthermore, there are often independent instructions which include load instructions beyond synchronization semantics that could be executed in parallel while a thread waits on the synchronization semantics. The conveniences of the cache memories come with some extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem. However, Cache Coherence adds considerable overhead to memory accesses. Having aggressive prefetcher on different cores of a Chip Multiprocessor can definitely lead to significant system performance degradation when running multi-threaded applications. This result of prefetch-demand interference when a prefetcher in one core ends up pulling shared data from a producing core before it has been written, the cache block will end up transitioning back and forth between the cores and result in useless prefetch, saturating the memory bandwidth and substantially increase the latency to critical shared data. We present a hardware prefetcher that enables large performance improvements from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand interference. Furthermore, it will utilize the time that a thread spends waiting on syn- chronization semantics to run ahead of the critical section to speculate and prefetch independent load instruction data beyond the synchronization semantics

Texas A&M Repository

Improving heterogeneous system efficiency : architecture, scheduling, and machine learning

Author: Nemirovsky Daniel A.
Publication venue: Universitat Politècnica de Catalunya
Publication date: 30/10/2017
Field of study

Computer architects are beginning to embrace heterogeneous systems as an effective method to utilize increases in transistor densities for executing a diverse range of workloads under varying performance and energy constraints. As heterogeneous systems become more ubiquitous, architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. In recognizing hardware diversity, state-of-the-art heterogeneous schedulers are able to produce significant performance improvements over their predecessors and enable more flexible system designs. Nearly all of these, however, are unable to efficiently identify the mapping schemes which will result in the highest system performance. Accurately estimating the performance of applications on different heterogeneous resources can provide a significant advantage to heterogeneous schedulers for identifying a performance maximizing mapping scheme to improve system performance. Recent advances in machine learning techniques including artificial neural networks have led to the development of powerful and practical prediction models for a variety of fields. As of yet, however, no significant leaps have been taken towards employing machine learning for heterogeneous scheduling in order to maximize system throughput. The core issue we approach is how to understand and utilize the rise of heterogeneous architectures, benefits of heterogeneous scheduling, and the promise of machine learning techniques with respect to maximizing system performance. We present studies that promote a future computing model capable of supporting massive hardware diversity, discuss the constraints faced by heterogeneous designers, explore the advantages and shortcomings of conventional heterogeneous schedulers, and pioneer applying machine learning to optimize mapping and system throughput. The goal of this thesis is to highlight the importance of efficiently exploiting heterogeneity and to validate the opportunities that machine learning can offer for various areas in computer architecture.Arquitectos de computadores estan empesando a diseñar systemas heterogeneos como una manera efficiente de usar los incrementos en densidades de transistors para ejecutar una gran diversidad de programas corriendo debajo de differentes condiciones y requisitos de energia y rendimiento (performance). En cuanto los sistemas heterogeneos van ganando popularidad de uso, arquitectos van a necesitar a diseñar nuevas formas de hacer el scheduling de las applicaciones en los cores distintos de los CPUs. Schedulers nuevos que tienen en cuenta la heterogeniedad de los recursos en el hardware logran importantes beneficios en terminos de rendimiento en comparacion con schedulers hecho para sistemas homogenios. Pero, casi todos de estos schedulers heterogeneos no son capaz de poder identificar la esquema de mapping que produce el rendimiento maximo dado el estado de los cores y las applicaciones. Estimando con precision el rendimiento de los programas ejecutando sobre diferentes cores de un CPU es un a gran ventaja para poder identificar el mapping para lograr el mejor rendimiento posible para el proximo scheduling quantum. Desarollos nuevos en la area de machine learning, como redes neurales, han producido predictores muy potentes y con gran precision in disciplinas numerosas. Pero en estos momentos, la aplicacion de metodos de machine learning no se han casi explorados para poder mejorar la eficiencia de los CPUs y menos para mejorar los schedulers para sistemas heterogeneos. El tema de enfoque en esta tesis es como poder entender y utilizar los sistemas heterogeneos, los beneficios de scheduling para estos sistemas, y como aprovechar las promesas de los metodos de machine learning con respeto a maximizer el redimiento de el Sistema. Presentamos estudios que dan una esquema para un modelo de computacion para el futuro capaz de dar suporte a recursos heterogeneos en gran escala, discutimos las restricciones enfrentados por diseñadores de sistemas heterogeneos, exploramos las ventajas y desventajas de las ultimas schedulers heterogeneos, y abrimos el camino de usar metodos de machine learning para optimizer el mapping y rendimiento de un sistema heterogeneo. El objetivo de esta tesis es destacar la imporancia de explotando eficientemente la heterogenidad de los recursos y tambien validar las oportunidades para mejorar la eficiencia en diferente areas de arquitectura de computadoras que pueden ser realizadas gracias a machine learning.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

Recommended from our members

Achieving Accurate Predictions of Future Events Under Hardware Heterogeneity

Author: Prodromou Andreas
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Heterogeneous hardware is becoming increasingly available in modern hardware, while research breakthroughs enforce the expectation that heterogeneity will keep increasing in the future. Significant gains can be achieved via appropriate utilization of heterogeneity, in terms of performance and power consumption, however, poor utilization can have a detrimental effect. Intelligent scheduling and resource management is a crucial challenge we need to overcome in order to harvest the full potential of heterogeneous hardware. As systems become larger and include greater levels of hardware diversity, the importance of intelligent scheduling and resource management is further accentuated.This dissertation presents techniques that aid the process of scheduling and resource management in the presence of heterogeneous hardware, via accurately predicting upcoming runtime events. With a proactive and accurate view of the near future, schedulers can utilize the underlying hardware more efficiently, and fully take advantage of the available benefits.By adapting a majority element heuristic, this dissertation significantly improves the accuracy of predicting memory addresses about to be accessed, while reducing prediction-related costs by a factor of ten thousand compared to previously proposed predictive approaches. Coupled with novel microarchitectural modifications, accurate address predictions are shown to improve the performance of heterogeneous memory architectures.Machine learning-based performance predictors are further presented, capable of predicting a program's performance when executed on a given general-purpose core. Trained to model the subtleties of the interaction between hardware and software, these predictors are capable of generating highly accurate predictions even for cores with varied Instruction Set Architectures. Utilizing these performance predictions for job scheduling, is shown to improve overall system performance. The trained predictors are further examined and interpreted in order to visualize the correlations between features picked up and amplified during training.Finally, this dissertation demonstrates that scheduling algorithms cannot guarantee deriving an optimal schedule during realistic execution scenarios due to the underlying hardware heterogeneity, the wide range of runtime requirements of software, as well as prediction error from performance predictors. In response, deep neural networks are trained to select one scheduling approach from a list of options with varied overheads and correctness guarantees. The scheduling approach chosen, is the one which will most likely return the highest-performance schedule with the lowest overhead, given a particular instance of the job-to-core assignment problem

eScholarship - University of California