5,294 research outputs found

    A survey on scheduling and mapping techniques in 3D Network-on-chip

    Full text link
    Networks-on-Chip (NoCs) have been widely employed in the design of multiprocessor systems-on-chip (MPSoCs) as a scalable communication solution. NoCs enable communication between on-chip Intellectual Property (IP) cores and allow those cores to achieve higher performance by outsourcing their communication tasks. Mapping and scheduling methodologies are key elements in assigning application tasks, allocating the tasks to the IPs, and organising the communication among them to achieve specified objectives. The goal of this paper is to present a detailed state-of-the-art survey of research in the field of mapping and scheduling of applications on 3D NoCs, classifying the works along several dimensions and giving some potential research directions.
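
    To make the mapping objective above concrete, here is a small illustrative sketch (not taken from the survey itself; all task names, traffic volumes, and mesh dimensions are hypothetical): it scores a task-to-core mapping on a 3D mesh NoC by weighting each communication edge's traffic volume with the Manhattan hop distance between the mapped cores. Minimizing a cost of this kind is a typical objective of the surveyed mapping techniques.

```python
# Illustrative sketch: scoring a task-to-core mapping on a 3D mesh NoC by
# hop-distance-weighted communication volume. Names and numbers are made up.
from itertools import product

def hops(a, b):
    # Manhattan distance between two (x, y, z) mesh coordinates.
    return sum(abs(i - j) for i, j in zip(a, b))

def mapping_cost(comm, mapping):
    """comm: {(src_task, dst_task): volume}; mapping: task -> (x, y, z)."""
    return sum(vol * hops(mapping[s], mapping[d]) for (s, d), vol in comm.items())

# Toy 2x2x2 mesh and a 3-task pipeline with assumed traffic volumes.
cores = list(product(range(2), range(2), range(2)))
comm = {("A", "B"): 64, ("B", "C"): 32}
mapping = {"A": cores[0], "B": cores[1], "C": cores[5]}
print(mapping_cost(comm, mapping))  # 64*1 + 32*1 = 96
```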

    Theory and design of portable parallel programs for heterogeneous computing systems and networks

    Get PDF
    A recurring problem with high-performance computing is that advanced architectures generally achieve only a small fraction of their peak performance on many portions of real application sets. The corollary of Amdahl's law is that such architectures often spend most of their time on tasks (codes/algorithms and the data sets upon which they operate) for which they are unsuited. Heterogeneous Computing (HC) is needed in the mid-1990s and beyond due to ever-increasing super-speed requirements and the growing number of projects with these requirements. HC is defined as a special form of parallel and distributed computing that performs computations using a single autonomous computer operating in both SIMD and MIMD modes, or using a number of connected autonomous computers. Physical implementation of a heterogeneous network or system is currently possible due to existing technological advances in networking and supercomputing. Unfortunately, software solutions for heterogeneous computing are still in their infancy. Theoretical models, software tools, and intelligent resource-management schemes need to be developed to support heterogeneous computing efficiently. In this thesis, we present a heterogeneous model of computation which encapsulates all the essential parameters for designing efficient software and hardware for HC. We also study a portable parallel programming tool, called Cluster-M, which implements this model. Furthermore, we study and analyze the hardware and software requirements of HC and show that Cluster-M satisfies the requirements of HC environments.

    Mapping of portable parallel programs

    Get PDF
    An efficient parallel program designed for a parallel architecture includes a detailed outline of accurate assignments of concurrent computations onto processors, and of data transfers onto communication links, such that the overall execution time is minimized. This process may be complex depending on the application task and the target multiprocessor architecture. Furthermore, this process has to be repeated for every different architecture even though the application task may be the same. Consequently, this has a major impact on the ever-increasing cost of software development for multiprocessor systems. A remedy for this problem would be to design portable parallel programs which can be mapped efficiently onto any computer system. In this dissertation, we present a portable programming tool called Cluster-M. The three components of Cluster-M are the Specification Module, the Representation Module, and the Mapping Module. In the Specification Module, for a given problem, a machine-independent program is generated and represented in the form of a clustered task graph called a Spec graph. Similarly, in the Representation Module, for a given architecture or heterogeneous suite of computers, a clustered system graph called a Rep graph is generated. The Mapping Module is responsible for efficient mapping of Spec graphs onto Rep graphs. As part of this module, we present the first algorithm which produces a near-optimal mapping of an arbitrary non-uniform machine-independent task graph with M modules onto an arbitrary non-uniform task-independent system graph having N processors, in O(MP) time, where P = max(M, N). Our experimental results indicate that Cluster-M produces better or similar mapping results compared to other leading techniques which work only for restricted task or system graphs.
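
    The abstract does not reproduce the algorithm itself, so the following is only a toy greedy sketch in the spirit of clustering-based mapping onto non-uniform processors (not the actual Cluster-M algorithm): it repeatedly places the heaviest remaining task cluster on the processor that would finish it earliest, which runs in roughly O(MN) time for M clusters and N processors. The loads and speeds are invented illustration values.

```python
# Toy greedy mapping sketch: place the heaviest unmapped task cluster on the
# processor with the earliest estimated finish time for it.
def greedy_map(task_loads, proc_speeds):
    """task_loads: {task: work}; proc_speeds: {proc: relative speed}."""
    finish = {p: 0.0 for p in proc_speeds}   # estimated finish time per processor
    placement = {}
    for task in sorted(task_loads, key=task_loads.get, reverse=True):
        # Choose the processor that would complete this cluster earliest.
        best = min(finish, key=lambda p: finish[p] + task_loads[task] / proc_speeds[p])
        finish[best] += task_loads[task] / proc_speeds[best]
        placement[task] = best
    return placement, max(finish.values())

placement, makespan = greedy_map({"t1": 8, "t2": 4, "t3": 4}, {"p1": 2.0, "p2": 1.0})
print(placement, makespan)  # t1 and t3 on the fast p1, t2 on p2; makespan 6.0
```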

    SDF/L graph scheduling techniques on heterogeneous multi-core processors

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021.8. Ha Soonhoi.Although dataflow models are known to thrive at exploiting task-level parallelism of an application, it is difficult to exploit the parallelism of data. Data-level parallelism can be represented well with loop structures, but these structures are not explicitly specified in most existing dataflow models. SDF/L model was introduced to overcome this shortcoming by specifying the loop structures explicitly in a hierarchical fashion. To the best of our knowledge however, scheduling of SDF/L graph onto heterogeneous processors has not been considered in any previous work. In this dissertation, we introduce a scheduling technique of an application represented by the SDF/L model onto heterogeneous processors. In the proposed method, we explore the mapping of tasks using an evolutionary meta-heuristic and schedule hierarchically in a bottom-up fashion, creating parallel loop schedules at lower levels first and then re-using them when constructing the schedule at a higher level. To verify the efficiency of the proposed scheduling methodology, we apply it to benchmark examples and randomly generated SDF/L graphs.๋ฐ์ดํ„ฐํ”Œ๋กœ์šฐ ๋ชจ๋ธ์€ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ํƒœ์Šคํฌ๋ฅผ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌํ•  ๋•Œ ์ข‹์€ ๋ชจ๋ธ๋กœ ์•Œ๋ ค์ ธ ์žˆ์ง€๋งŒ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ‘๋ ฌ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ์— ํ™œ์šฉํ•˜๊ธฐ๋Š” ์–ด๋ ต๋‹ค. ๋ฐ์ดํ„ฐ ์ˆ˜์ค€ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋Š” ๋ฃจํ”„ ๊ตฌ์กฐ๋ฅผ ํ†ตํ•ด ํ‘œํ˜„๋  ์ˆ˜ ์žˆ์œผ๋‚˜ ๊ธฐ์กด ๋ฐ์ดํ„ฐํ”Œ๋กœ์šฐ ๋ชจ๋ธ์—์„œ ๋ช…์‹œ์ ์œผ๋กœ ๋ฃจํ”„ ๊ตฌ์กฐ๋Š” ๋ช…์„ธํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์—†์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋‹จ์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ๊ณ„์ธต์  ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฃจํ”„ ๊ตฌ์กฐ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ช…์„ธํ•  ์ˆ˜ ์žˆ๋Š” SDF/L ๋ชจ๋ธ์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๊ธฐ์ข… ํ”„๋กœ์„ธ์„œ์— ๋Œ€ํ•œ SDF/L ๊ทธ๋ž˜ํ”„์˜ ์Šค์ผ€์ค„๋ง์€ ์ด์ „๊นŒ์ง€ ๊ณ ๋ ค๋˜์ง€ ์•Š์€ ๊ฒƒ์œผ๋กœ ํŒŒ์•…๋œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” SDF/L ๋ชจ๋ธ๋กœ ํ‘œํ˜„๋˜๋Š” ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์ด๊ธฐ์ข… ํ”„๋กœ์„ธ์„œ์— ๋Œ€ํ•˜์—ฌ ์Šค์ผ€์ค„๋งํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์—์„œ๋Š” ๋จผ์ € ์ง„ํ™”์  ๋ฉ”ํƒ€ ํœด๋ฆฌ์Šคํ‹ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํƒœ์Šคํฌ ๋งคํ•‘์„ ํƒ์ƒ‰ํ•œ๋‹ค. ์ดํ›„ ํ•˜์œ„ ์ˆ˜์ค€์—์„œ ๋ณ‘๋ ฌ ๋ฃจํ”„ ์Šค์ผ€์ค„์„ ๋งŒ๋“  ๋‹ค์Œ ์ƒ์œ„ ์ˆ˜์ค€์—์„œ ์Šค์ผ€์ค„ ๊ตฌ์„ฑํ•  ๋•Œ ์žฌ์‚ฌ์šฉํ•˜๋Š” ์ƒํ–ฅ์‹์˜ ๊ณ„์ธต์  ํƒœ์Šคํฌ ์Šค์ผ€์ค„๋ง์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ์Šค์ผ€์ค„๋ง ๊ธฐ๋ฒ•์˜ ํšจ์œจ์„ฑ์„ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด ๋ฒค์น˜๋งˆํฌ ์˜ˆ์ œ์™€ ๋ฌด์ž‘์œ„๋กœ ์ƒ์„ฑ๋œ SDF/L ๊ทธ๋ž˜ํ”„์— ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์˜€๋‹ค.Chapter 1 Introduction 1 Chapter 2 Related Work 6 2.1 SDF Scheduling with Data-level Parallelism 8 2.2 Hierarchical Scheduling 9 Chapter 3 Problem and Challenges 11 3.1 Notations and Problem Description 11 3.2 Challenges 12 Chapter 4 Proposed methodology 15 4.1 Mapping Exploration 15 4.2 Priority Assignment and List Scheduling Heuristic 17 4.3 Hierarchical Scheduling 18 4.4 Complexity 23 Chapter 5 Experiments 24 5.1 Benchmarks 25 5.2 Randomly Generated Graphs 30 Chapter 6 Conclusions 35 Bibliography 37 ์š” ์•ฝ 41์„

    ์ž„๋ฒ ๋””๋“œ ์‹œ์Šคํ…œ์—์„œ ์—ฌ๋Ÿฌ ์ปจ๋ณผ๋ฃจ์…˜ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๋ฅผ ์œ„ํ•œ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•˜๋Š” ์†Œํ”„ํŠธ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ฒ•

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2021.2. Advisor: Ha Soonhoi. Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. To cope with the increasing computational complexity, it is common to use an energy-efficient accelerator, such as a mobile GPU or a digital signal processor (DSP) array, or to develop a customized neural processor chip called a neural processing unit (NPU). In the application domain, many optimization techniques have been proposed to change the application algorithm in order to reduce the computational amount and memory usage, either by developing new deep learning networks or by software optimization techniques that take advantage of the statistical nature of deep learning algorithms. Another approach is hardware-aware software optimization, which finds the performance bottleneck first and then distributes the workload evenly by scheduling it across the available computing resources. This dissertation covers hardware-aware software optimization, which is based on a hardware processor or platform. First, we devise a systematic optimization methodology through the experience of participating in the Low Power Image Recognition Challenge (LPIRC) and build a deep learning framework called C-GOOD (C-code Generation Framework for Optimized On-device Deep Learning) based on the devised methodology.
For hardware independence, C-GOOD generates C code that can be compiled for and run on any embedded device. Also, C-GOOD is equipped with various options for application-domain optimization that can be applied according to the devised methodology, as well as facilities for measuring system performance. By applying the devised methodology to three hardware platforms, the NVIDIA Jetson TX2, Odroid XU4, and the Samsung Reconfigurable Processor (SRP), we demonstrate that the devised methodology is independent of the hardware platform and that application-domain optimizations can be performed easily with C-GOOD. Recently, embedded devices have been equipped with heterogeneous processing elements (PEs), and at the same time the need to run multiple deep learning applications concurrently on embedded systems such as self-driving cars and smartphones is increasing. For such systems, we devise an end-to-end methodology to schedule deep learning applications onto heterogeneous PEs and implement a scheduling framework according to the methodology; it covers everything from profiling on real embedded devices to verifying the schedule results on those devices. In this methodology, we use a genetic algorithm (GA)-based scheduling technique for scheduling deep learning applications onto heterogeneous PEs and consider several practical issues, such as DVFS and CPU hot-plug, in the profiling step. Furthermore, we schedule multiple applications with different throughput constraints, considering the schedulability of mapped tasks on each processor. After implementing a deep learning inference engine that can utilize heterogeneous PEs using a low-level library of the ARM Compute Library (ACL), we verify the devised methodology by running two widely used convolutional neural networks (CNNs) on a Galaxy S9 smartphone and a Hikey 970 board. While the previous optimization methods focus on the computation and the processing elements, the performance bottleneck of deep learning accelerators is the communication between off-chip and on-chip memory. Moreover, the off-chip DRAM access volume has a significant effect on the energy consumption of an NPU. To reduce the impact of off-chip DRAM access on the performance and energy of an NPU, we devise compiler techniques for an NPU to manage multi-bank on-chip memory with two different objectives: one is to minimize the off-chip memory access volume, and the other is to minimize the processing delay caused by unhidden DRAM accesses. The main idea is that by organizing on-chip memory into multiple banks, we may hide the off-chip DRAM access delay by prefetching data into unused banks during computation, and reduce the off-chip DRAM access volume by storing the output feature map data of each layer in on-chip memory. By running CNN benchmarks on a cycle-level NPU simulator, we demonstrate the trade-off relation between the two objectives. The devised multi-bank on-chip memory management (MOMM) techniques are extended to consider layer fusion, which aims to maximally reuse feature maps between layers. Since the pure layer fusion technique incurs extra computation overhead and increases DRAM access for filter weights, a hybrid fusion technique, positioned between per-layer processing and pure layer fusion, is presented based on the devised MOMM techniques with the two objectives.
Experimental results confirm the superiority of the hybrid fusion technique over both the per-layer processing technique and the pure layer fusion technique.
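
    As a hedged sketch of the GA-based mapping step (the dissertation's framework is far more elaborate and works from real device profiles; the layer names, the two PEs, and all latencies below are invented), the following toy genetic algorithm assigns each network layer to one of two hypothetical PEs and minimizes the load of the most loaded PE:

```python
# Toy GA sketch: chromosomes assign each layer to a PE; fitness is the load
# of the busiest PE, estimated from assumed per-layer latencies.
import random

PROFILE = {  # hypothetical profiled latency (ms) of each layer on each PE
    "conv1": {"CPU": 9.0, "GPU": 3.0}, "conv2": {"CPU": 7.0, "GPU": 2.5},
    "fc":    {"CPU": 1.0, "GPU": 2.0},
}
LAYERS, PES = list(PROFILE), ["CPU", "GPU"]

def fitness(chrom):
    # Sum per-PE load; the most loaded PE bounds overall throughput.
    load = {p: 0.0 for p in PES}
    for layer, pe in zip(LAYERS, chrom):
        load[pe] += PROFILE[layer][pe]
    return max(load.values())

def ga(pop_size=20, gens=50, pmut=0.2):
    pop = [[random.choice(PES) for _ in LAYERS] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]           # keep the fitter half
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(LAYERS))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < pmut:              # point mutation
                child[random.randrange(len(LAYERS))] = random.choice(PES)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)

best = ga()
print(best, fitness(best))
```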
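
    The double-buffering intuition behind the MOMM techniques can be shown with a back-of-the-envelope model (an assumption-laden simplification, not the dissertation's cycle-level simulator): with at least two banks, the DRAM fetch of the next tile overlaps the computation of the current one, so fetch latency is hidden whenever computation dominates.

```python
# Simplified timing model of single-bank vs. double-buffered tile processing.
# All tile counts and per-tile times are hypothetical illustration values.
def single_bank_time(n_tiles, fetch, compute):
    # With one bank, DRAM fetch and computation strictly alternate.
    return n_tiles * (fetch + compute)

def double_buffered_time(n_tiles, fetch, compute):
    # Only the first fetch is exposed; afterwards each tile costs
    # max(fetch, compute) because prefetch overlaps computation.
    return fetch + (n_tiles - 1) * max(fetch, compute) + compute

for fetch, compute in [(2.0, 5.0), (5.0, 2.0)]:
    print(single_bank_time(8, fetch, compute),      # 56.0 in both cases
          double_buffered_time(8, fetch, compute))  # 42.0 in both cases
```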

    Mapping Framework for Heterogeneous Reconfigurable Architectures: Combining Temporal Partitioning and Multiprocessor Scheduling

    Get PDF
    • โ€ฆ
    corecore