365 research outputs found

    Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2020. 8. ์ „๋™์„.๋‹จ์ผ ISA ์ด๊ธฐ์ข… ๋ฉ€ํ‹ฐ ์ฝ”์–ด ํ”„๋กœ์„ธ์„œ๊ฐ€ ๋ชจ๋ฐ”์ผ ์ปดํ“จํŒ…์— ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ˜๋ฉด, ์ฝ”๋“œ ์ƒ์„ฑ์€ ๋‹จ์ผ ๋Œ€์ƒ ์ฝ”์–ด์— ๋Œ€ํ•œ ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ์— ํ”„๋กœ์„ธ์„œ์˜ ๋‹ค๋ฅธ ์ฝ”์–ด์—์„œ๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ ์ƒ์„ฑ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์šฐ์„  ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜์™€ ์ƒ๊ด€์—†์ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋“  ์ฝ”์–ด๋ฅผ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก FMV (Function-Multi-Versioning)๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๋˜ํ•œ Cortex-A55 / 75 ์ฝ”์–ด์˜ ์„ฑ๋Šฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ปดํŒŒ์ผ๋Ÿฌ์— ๊ฐ„๋‹จํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ๋ฐฑ์—”๋“œ ์ตœ์ ํ™” ํŒจ์Šค๋ฅผ ์ถ”๊ฐ€ํ•  ๊ฒƒ์„ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ”„๋กœ๊ทธ๋žจ์„ ๋ถ„์„ํ•˜๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์ดํฌ๋กœ ์•„ํ‚คํ…์ฒ˜์— ๋งž๊ฒŒ ์—ฌ๋Ÿฌ ๋ฒ„์ „์˜ Function์„ ์ƒ์„ฑํ•˜๋Š” Automated Flow๋ฅผ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ ์‹œ, ์‹คํ–‰ ์ค‘์ธ ์ฝ”์–ด๋Š” ์—ฐ์‚ฐ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์  ๋ฒ„์ „์˜ Function์„ ์„ ํƒํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ฐฉ๋ฒ•๋ก ์„ ํ†ตํ•ด, TensorFlow Lite๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ์‚ผ์„ฑ์˜ Exynos 9820 ํ”„๋กœ์„ธ์„œ์˜ Cortex-A55 ๋ฐ Cortex-A75 ์ฝ”์–ด์—์„œ ์„ฑ๋Šฅ์„ CNN ๋ชจ๋ธ์—์„œ 11.2 % ๋ฐ 17.9 %, NLP ๋ชจ๋ธ์—์„œ 10.9 %, 4.5 % ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generations optimize the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application codes utilizing a core at full capacity regardless of its microarchitecture. We also propose to add a simple but powerful backend optimization pass in the compiler to further boost the performance of Cortex-A55/75 cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. Measurements confirm that the methodology improves the performance of Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9% for CNN models, 10.9% and 4.5% for NLP models, respectively, while running TensorFlow Lite.Chapter 1 Introduction 1 1.1 Deep Learning on Single-ISA Heterogeneous Multi-Core Processors 1 1.2 Proposed approach 3 Chapter 2 Related works and Motivation 6 2.1 Function Multi Versioning (FMV) 6 2.2 General Matrix Multiplication (GEMM) 8 2.3 Function Multi Versioning (FMV) 10 Chapter 3 Microarchitecture-Aware Code Generation 12 3.1 FMV Scheme for Heterogeneous Multi-Core Processors 12 3.1.1 Runtime Selector 14 3.1.2 Multi-Versioned Functions 15 3.2 Load Split Optimization and LLVM-Based Unified Automatic Code Generation 16 3.2.1 LLVM Backend Passes 17 3.2.2 Load Split Optimization Pass 18 3.2.3 Unified Automatic Microarchitecture-aware Code Generation Flow 20 Chapter 4 Experimental Evaluation 22 4.1 Experimental Setup 22 4.2 Floating-point CNN Models 23 4.3 Quantized CNN Models 27 4.4 Question Answering Models 28 Chapter 5 Conclusion 31 Bibliography 32 Abstract in Korean 35Maste

    Energy challenges for ICT

    Get PDF
    The energy consumption from the expanding use of information and communications technology (ICT) is unsustainable with present drivers, and it will impact heavily on the future climate change. However, ICT devices have the potential to contribute signi - cantly to the reduction of CO2 emission and enhance resource e ciency in other sectors, e.g., transportation (through intelligent transportation and advanced driver assistance systems and self-driving vehicles), heating (through smart building control), and manu- facturing (through digital automation based on smart autonomous sensors). To address the energy sustainability of ICT and capture the full potential of ICT in resource e - ciency, a multidisciplinary ICT-energy community needs to be brought together cover- ing devices, microarchitectures, ultra large-scale integration (ULSI), high-performance computing (HPC), energy harvesting, energy storage, system design, embedded sys- tems, e cient electronics, static analysis, and computation. In this chapter, we introduce challenges and opportunities in this emerging eld and a common framework to strive towards energy-sustainable ICT

    A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior

    Full text link
    Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing.Comment: To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS

    Multithreading Aware Hardware Prefetching for Chip Multiprocessors

    Get PDF
    To take advantage of the processing power in the Chip Multiprocessors design, applications must be divided into semi-independent processes that can run concur- rently on multiple cores within a system. Therefore, programmers must insert thread synchronization semantics (i.e. locks, barriers, and condition variables) to synchro- nize data access between processes. Indeed, threads spend long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution to wait for load data accesses to complete. Furthermore, there are often independent instructions which include load instructions beyond synchronization semantics that could be executed in parallel while a thread waits on the synchronization semantics. The conveniences of the cache memories come with some extra cost in Chip Multiprocessors. Cache Coherence mechanisms address the Memory Consistency problem. However, Cache Coherence adds considerable overhead to memory accesses. Having aggressive prefetcher on different cores of a Chip Multiprocessor can definitely lead to significant system performance degradation when running multi-threaded applications. This result of prefetch-demand interference when a prefetcher in one core ends up pulling shared data from a producing core before it has been written, the cache block will end up transitioning back and forth between the cores and result in useless prefetch, saturating the memory bandwidth and substantially increase the latency to critical shared data. We present a hardware prefetcher that enables large performance improvements from prefetching in Chip Multiprocessors by significantly reducing prefetch-demand interference. Furthermore, it will utilize the time that a thread spends waiting on syn- chronization semantics to run ahead of the critical section to speculate and prefetch independent load instruction data beyond the synchronization semantics

    Improving heterogeneous system efficiency : architecture, scheduling, and machine learning

    Get PDF
    Computer architects are beginning to embrace heterogeneous systems as an effective method to utilize increases in transistor densities for executing a diverse range of workloads under varying performance and energy constraints. As heterogeneous systems become more ubiquitous, architects will need to develop novel CPU scheduling techniques capable of exploiting the diversity of computational resources. In recognizing hardware diversity, state-of-the-art heterogeneous schedulers are able to produce significant performance improvements over their predecessors and enable more flexible system designs. Nearly all of these, however, are unable to efficiently identify the mapping schemes which will result in the highest system performance. Accurately estimating the performance of applications on different heterogeneous resources can provide a significant advantage to heterogeneous schedulers for identifying a performance maximizing mapping scheme to improve system performance. Recent advances in machine learning techniques including artificial neural networks have led to the development of powerful and practical prediction models for a variety of fields. As of yet, however, no significant leaps have been taken towards employing machine learning for heterogeneous scheduling in order to maximize system throughput. The core issue we approach is how to understand and utilize the rise of heterogeneous architectures, benefits of heterogeneous scheduling, and the promise of machine learning techniques with respect to maximizing system performance. We present studies that promote a future computing model capable of supporting massive hardware diversity, discuss the constraints faced by heterogeneous designers, explore the advantages and shortcomings of conventional heterogeneous schedulers, and pioneer applying machine learning to optimize mapping and system throughput. The goal of this thesis is to highlight the importance of efficiently exploiting heterogeneity and to validate the opportunities that machine learning can offer for various areas in computer architecture.Arquitectos de computadores estan empesando a diseรฑar systemas heterogeneos como una manera efficiente de usar los incrementos en densidades de transistors para ejecutar una gran diversidad de programas corriendo debajo de differentes condiciones y requisitos de energia y rendimiento (performance). En cuanto los sistemas heterogeneos van ganando popularidad de uso, arquitectos van a necesitar a diseรฑar nuevas formas de hacer el scheduling de las applicaciones en los cores distintos de los CPUs. Schedulers nuevos que tienen en cuenta la heterogeniedad de los recursos en el hardware logran importantes beneficios en terminos de rendimiento en comparacion con schedulers hecho para sistemas homogenios. Pero, casi todos de estos schedulers heterogeneos no son capaz de poder identificar la esquema de mapping que produce el rendimiento maximo dado el estado de los cores y las applicaciones. Estimando con precision el rendimiento de los programas ejecutando sobre diferentes cores de un CPU es un a gran ventaja para poder identificar el mapping para lograr el mejor rendimiento posible para el proximo scheduling quantum. Desarollos nuevos en la area de machine learning, como redes neurales, han producido predictores muy potentes y con gran precision in disciplinas numerosas. Pero en estos momentos, la aplicacion de metodos de machine learning no se han casi explorados para poder mejorar la eficiencia de los CPUs y menos para mejorar los schedulers para sistemas heterogeneos. El tema de enfoque en esta tesis es como poder entender y utilizar los sistemas heterogeneos, los beneficios de scheduling para estos sistemas, y como aprovechar las promesas de los metodos de machine learning con respeto a maximizer el redimiento de el Sistema. Presentamos estudios que dan una esquema para un modelo de computacion para el futuro capaz de dar suporte a recursos heterogeneos en gran escala, discutimos las restricciones enfrentados por diseรฑadores de sistemas heterogeneos, exploramos las ventajas y desventajas de las ultimas schedulers heterogeneos, y abrimos el camino de usar metodos de machine learning para optimizer el mapping y rendimiento de un sistema heterogeneo. El objetivo de esta tesis es destacar la imporancia de explotando eficientemente la heterogenidad de los recursos y tambien validar las oportunidades para mejorar la eficiencia en diferente areas de arquitectura de computadoras que pueden ser realizadas gracias a machine learning.Postprint (published version
    • โ€ฆ
    corecore