468 research outputs found

    AMC: Advanced Multi-accelerator Controller

    Get PDF
    The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip. A generic FPGA based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules accelerators. In a real environment, such HLS multi-accelerator systems do not give a perfect performance due to memory bandwidth issues. Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-acceleratorโ€™s memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and efficient scheduler in the HLS-based multi-accelerator environment called Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory intensive accelerators, High Performance Computing (HPC) applications and implemented and tested on a Xilinx Virtex-5 ML505 evaluation FPGA board. The performance of the system is compared against the microprocessor-based systems that have been integrated with the operating system. Results show that the AMC based HLS multi-accelerator system achieves 10.4x and 7x of speedup compared to the MicroBlaze and Intel Core based HLS multi-accelerator systems.Peer ReviewedPostprint (authorโ€™s final draft

    Scalability of broadcast performance in wireless network-on-chip

    Get PDF
    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.Peer ReviewedPostprint (published version

    A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)

    Full text link
    Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multi-core neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.Comment: 17 pages, 14 figure

    ์–‘์žํ™”๋œ ํ•™์Šต์„ ํ†ตํ•œ ์ €์ „๋ ฅ ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ ๊ฐ€์†๊ธฐ ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2022.2. ์ „๋™์„.๋”ฅ๋Ÿฌ๋‹์˜ ์‹œ๋Œ€๊ฐ€ ๋„๋ž˜ํ•จ์— ๋”ฐ๋ผ, ์‹ฌ์ธต ์ธ๊ณต ์‹ ๊ฒฝ๋ง (DNN)์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์š”๊ตฌ๋˜๋Š” ํ•™์Šต ๋ฐ ์ถ”๋ก  ์—ฐ์‚ฐ๋Ÿ‰ ๋˜ํ•œ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜์˜€๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ์‹œ๋Œ€์˜ ๋„๋ž˜์™€ ํ•จ๊ป˜ ๋‹ค์–‘ํ•œ ์ž‘์—…์— ๋Œ€ํ•œ ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ ๋ฐ ํŠน์ • ์šฉ๋„์— ๋Œ€ํ•ด ํ›ˆ๋ จ๋œ ์‹ ๊ฒฝ๋ง ์ถ”๋ก  ์ˆ˜ํ–‰ ์ธก๋ฉด์—์„œ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง (DNN) ์ฒ˜๋ฆฌ์— ๋Œ€ํ•œ ์ปดํ“จํŒ… ์š”๊ตฌ๊ฐ€ ๊ทน์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜์˜€์œผ๋ฉฐ, ์ด๋Ÿฌํ•œ ์ถ”์„ธ๋Š” ์ธ๊ณต์ง€๋Šฅ์˜ ์‚ฌ์šฉ์ด ๋”์šฑ ๋ฒ”์šฉ์ ์œผ๋กœ ์ง„ํ™”ํ•จ์— ๋”ฐ๋ผ ๋”์šฑ ๊ฐ€์†ํ™” ๋  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ์—ฐ์‚ฐ ์š”๊ตฌ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์„ผํ„ฐ ๋‚ด๋ถ€์— ๋ฐฐ์น˜ํ•˜๊ธฐ ์œ„ํ•œ FPGA (Field-Programmable Gate Array) ๋˜๋Š” ASIC (Application-Specific Integrated Circuit) ๊ธฐ๋ฐ˜ ์‹œ์Šคํ…œ์—์„œ ์ €์ „๋ ฅ์„ ์œ„ํ•œ SoC (System-on-Chip)์˜ ๊ฐ€์† ๋ธ”๋ก์— ์ด๋ฅด๊ธฐ๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๋งž์ถคํ˜• ํ•˜๋“œ์›จ์–ด๊ฐ€ ์‚ฐ์—… ๋ฐ ํ•™๊ณ„์—์„œ ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š”, ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ํ›ˆ๋ จ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋งž์ถคํ˜• ์ง‘์  ํšŒ๋กœ ํ•˜๋“œ์›จ์–ด๋ฅผ ๋ณด๋‹ค ์—๋„ˆ์ง€ ํšจ์œจ์ ์œผ๋กœ ์„ค๊ณ„ํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์•ˆํ•˜๊ณ  ์‹ค์ œ ์ €์ „๋ ฅ ์ธ๊ณต ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•˜๊ณ  ์ œ์ž‘ํ•˜์—ฌ, ๊ทธ ํšจ์œจ์„ ํ‰๊ฐ€ํ•˜๊ณ ์ž ํ•œ๋‹ค. ํŠนํžˆ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ €์ „๋ ฅ ๊ณ ์„ฑ๋Šฅ ์„ค๊ณ„ ๋ฐฉ๋ฒ•๋ก ์„ ํฌ๊ฒŒ ์„ธ ๊ฐ€์ง€๋กœ ๋ถ„๋ฅ˜ํ•˜์—ฌ ๋ถ„์„์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ๋ถ„๋ฅ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. (1) ํ›ˆ๋ จ ์•Œ๊ณ ๋ฆฌ์ฆ˜. ํ‘œ์ค€์ ์œผ๋กœ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ์€ ์—ญ์ „ํŒŒ (Back-Propagation) ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ˆ˜ํ–‰๋˜์ง€๋งŒ, ๋” ํšจ์œจ์ ์ธ ํ•˜๋“œ์›จ์–ด ๊ตฌํ˜„์„ ์œ„ํ•ด ์ŠคํŒŒ์ดํฌ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ†ต์‹ ํ•˜๋Š” ๋‰ด๋Ÿฐ์ด ์žˆ๋Š” ๋‰ด๋กœ๋ชจํ”ฝ ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋˜๋Š” ๋น„๋Œ€์นญ ํ”ผ๋“œ๋ฐฑ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ์ƒ๋ฌผํ•™์  ๋ชจ์‚ฌ๋„๊ฐ€ ๋†’์€ (Bio-Plausible) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™œ์šฉํ•˜์—ฌ ๋” ํšจ์œจ์ ์ธ ํ›ˆ๋ จ ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์กฐ์‚ฌ ๋ฐ ์ œ์‹œํ•˜๊ณ , ๊ทธ ํ•˜๋“œ์›จ์–ด ํšจ์œจ์„ฑ์„ ๋ถ„์„ํ•˜์˜€๋‹ค. (2) ์ €์ •๋ฐ€๋„ ์ˆ˜ ์ฒด๊ณ„ ํ™œ์šฉ. ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” DNN ๊ฐ€์†๊ธฐ์—์„œ ํšจ์œจ์„ฑ์„ ๋†’์ด๋Š” ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋Š” ์ˆ˜์น˜ ์ •๋ฐ€๋„๋ฅผ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. DNN์˜ ์ถ”๋ก  ๋‹จ๊ณ„์— ๋‚ฎ์€ ์ •๋ฐ€๋„ ์ˆซ์ž๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ์ž˜ ์—ฐ๊ตฌ๋˜์—ˆ์ง€๋งŒ, ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด DNN์„ ํ›ˆ๋ จํ•˜๋Š” ๊ฒƒ์€ ์ƒ๋Œ€์ ์œผ ๊ธฐ์ˆ ์  ์–ด๋ ค์›€์ด ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๊ณผ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ DNN์„ ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ํ›ˆ๋ จํ•˜๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด ์ˆ˜ ์ฒด๊ณ„๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. (3) ์‹œ์Šคํ…œ ๊ตฌํ˜„ ๊ธฐ๋ฒ•. ์ง‘์  ํšŒ๋กœ์—์„œ ๋งž์ถคํ˜• ํ›ˆ๋ จ ์‹œ์Šคํ…œ์„ ์‹ค์ œ๋กœ ์‹คํ˜„ํ•  ๋•Œ, ๊ฑฐ์˜ ๋ฌดํ•œํ•œ ์„ค๊ณ„ ๊ณต๊ฐ„์€ ์นฉ ๋‚ด๋ถ€์˜ ๋ฐ์ดํ„ฐ ํ๋ฆ„, ์‹œ์Šคํ…œ ๋ถ€ํ•˜ ๋ถ„์‚ฐ, ๊ฐ€์†/๊ฒŒ์ดํŒ… ๋ธ”๋ก ๋“ฑ ๋‹ค์–‘ํ•œ ์š”์†Œ์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ์˜ ํ’ˆ์งˆ์ด ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋” ๋‚˜์€ ์„ฑ๋Šฅ๊ณผ ํšจ์œจ์„ฑ์œผ๋กœ ์ด์–ด์ง€๋Š” ๋‹ค์–‘ํ•œ ์„ค๊ณ„ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•˜๊ณ  ๋ถ„์„ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ฒซ์งธ๋กœ, ์†๊ธ€์”จ ๋ถ„๋ฅ˜ ํ•™์Šต์„ ์œ„ํ•œ ๋‰ด๋กœ๋ชจํ”ฝ ํ•™์Šต ์‹œ์Šคํ…œ์„ ์ œ์ž‘ํ•˜์—ฌ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ์ด ํ•™์Šต ์‹œ์Šคํ…œ์€ ์ „ํ†ต์ ์ธ ๊ธฐ๊ณ„ ํ•™์Šต์˜ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋‚ฎ์€ ํ›ˆ๋ จ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ณตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•˜์—ฌ ์„ค๊ณ„๋˜์—ˆ๋‹ค. ์ด ๋ชฉ์ ์„ ๋‹ฌ์„ฑํ•˜๊ธฐ ์œ„ํ•ด, ๋” ์ ์€ ์—ฐ์‚ฐ ์š”๊ตฌ๋Ÿ‰๊ณผ ๋ฒ„ํผ ๋ฉ”๋ชจ๋ฆฌ ํ•„์š”์น˜๋ฅผ ์œ„ํ•ด ๊ธฐ์กด์˜ ๋‰ด๋กœ๋ชจํ”ฝ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ˆ˜์ •ํ•˜์˜€์œผ๋ฉฐ, ์ด ๊ณผ์ •์—์„œ ํ›ˆ๋ จ ์„ฑ๋Šฅ ์†์‹ค ์—†์ด ๊ธฐ์กด ์—ญ์ „ํŒŒ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๊ทผ์ ‘ํ•œ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์—…๋ฐ์ดํŠธ๋ฅผ ๊ฑด๋„ˆ๋›ฐ๋Š” ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๊ณ  Lock-Free ๋งค๊ฐœ๋ณ€์ˆ˜ ์—…๋ฐ์ดํŠธ ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜์—ฌ ํ›ˆ๋ จ์— ์†Œ๋ชจ๋˜๋Š” ์—๋„ˆ์ง€๋ฅผ ํ›ˆ๋ จ์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ ๋™์ ์œผ๋กœ ๊ฐ์†Œ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์‹œ์Šคํ…œ ๊ตฌํ˜„ ๊ธฐ๋ฒ• ๋˜ํ•œ ์†Œ๊ฐœํ•˜๊ณ  ๊ทธ ์„ฑ๋Šฅ์„ ๋ถ„์„ํ•˜์˜€๋‹ค. ์ด๋Ÿฐ ๊ธฐ๋ฒ•์„ ํ†ตํ•ด, ์ด ํ•™์Šต ์‹œ์Šคํ…œ์€ ๊ธฐ์กด์˜ ํ›ˆ๋ จ ์‹œ์Šคํ…œ ๋Œ€๋น„ ๋›ฐ์–ด๋‚œ ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ-์—๋„ˆ์ง€ ์†Œ๋ชจ๋Ÿ‰ ๊ด€๊ณ„๋ฅผ ๋ณด์ด๋ฉด์„œ๋„ ๊ธฐ์กด์˜ ์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ธฐ๋ฐ˜์˜ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜์˜€๋‹ค. ๋‘˜์งธ๋กœ, ํŠน์ˆ˜ ๋ช…๋ น์–ด ์ฒด๊ณ„ ๋ฐ ๋งž์ถคํ˜• ์ˆ˜ ์ฒด๊ณ„๋ฅผ ํ™œ์šฉํ•œ ํ”„๋กœ๊ทธ๋žจ ๊ฐ€๋Šฅํ•œ DNN ํ›ˆ๋ จ์šฉ ํ”„๋กœ์„ธ์„œ๊ฐ€ ์„ค๊ณ„๋˜๊ณ  ์ œ์ž‘๋˜์—ˆ๋‹ค. ๊ธฐ์กด DNN ์ถ”๋ก ์šฉ ๊ฐ€์†๊ธฐ๋Š” 8๋น„ํŠธ ์ •์ˆ˜ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•˜์ง€๋งŒ, DNN ํ•™์Šต ์„ค๊ณ„์‹œ 8๋น„ํŠธ ์ˆ˜ ์ฒด๊ณ„๋ฅผ ์ด์šฉํ•˜๋ฉฐ ํ›ˆ๋ จ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๋ณด์ด์ง€ ์•Š๋Š” ๊ฒƒ์€ ์ƒ๋‹นํ•œ ๊ธฐ์ˆ ์  ๋‚œ์ด๋„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์—ˆ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ณต์œ ํ˜• ๋ฉฑ์ง€์ˆ˜ ํŽธํ–ฅ๊ฐ’์„ ํ™œ์šฉํ•˜๋Š” 8๋น„ํŠธ ๋ถ€๋™ ์†Œ์ˆ˜์  ์ˆ˜ ์ฒด๊ณ„๋ฅผ ์ƒˆ๋กœ์ด ์ œ์•ˆํ•˜์˜€์œผ๋ฉฐ, ์ด ์ˆ˜ ์ฒด๊ณ„์˜ ํšจ์šฉ์„ฑ์„ ๋ณด์ด๊ธฐ ์œ„ํ•ด ์ด DNN ํ›ˆ๋ จ ํ”„๋กœ์„ธ์„œ๊ฐ€ ์„ค๊ณ„๋˜์—ˆ๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ด ํ”„๋กœ์„ธ์„œ๋Š” ๋‹จ์ˆœํ•œ MAC ๊ธฐ๋ฐ˜ Matrix-Multiplication ๊ฐ€์†๊ธฐ๊ฐ€ ์•„๋‹Œ, Fused-Multiply-Add ํŠธ๋ฆฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ๊ฐ€์†๊ธฐ ๊ตฌ์กฐ๋ฅผ ์ฑ„ํƒํ•˜๋ฉด์„œ๋„, ์นฉ ๋‚ด๋ถ€์—์„œ์˜ ๋ฐ์ดํ„ฐ ์ด๋™๋Ÿ‰ ์ตœ์ ํ™” ๋ฐ ์ปจ๋ณผ๋ฃจ์…˜์˜ ๊ณต๊ฐ„์„ฑ์„ ๊ทน๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์ „๋‹ฌ ์œ ๋‹›์„ ์ž…์ถœ๋ ฅ๋ถ€์— 2D๋กœ ์ œ์ž‘ํ•˜์—ฌ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜์—์„œ์˜ ์ปจ๋ณผ๋ฃจ์…˜ ์ถ”๋ก  ๋ฐ ํ›ˆ๋ จ ๋‹จ๊ณ„์—์„œ์˜ ๊ณต๊ฐ„์„ฑ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•˜์˜€๋‹ค. ๋ณธ DNN ํ›ˆ๋ จ ํ”„๋กœ์„ธ์„œ๋Š” ๋งž์ถคํ˜• ๋ฒกํ„ฐ ์—ฐ์‚ฐ๊ธฐ, ๊ฐ€์† ๋ช…๋ น์–ด ์ฒด๊ณ„, ์™ธ๋ถ€ DRAM์œผ๋กœ์˜ ์ง์ ‘์ ์ธ ์ ‘๊ทผ ์ œ์–ด ๋ฐฉ์‹ ๋“ฑ์„ ํ†ตํ•ด ํ•œ ํ”„๋กœ์„ธ์„œ ๋‚ด์—์„œ DNN ํ›ˆ๋ จ์˜ ๋ชจ๋“  ๋‹จ๊ณ„๋ฅผ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ๋ฐ ํ™˜๊ฒฝ์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์—ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ณธ ํ”„๋กœ์„ธ์„œ๋Š” ๊ธฐ์กด์˜ ์—ฐ๊ตฌ์—์„œ ์ œ์‹œ๋˜์—ˆ๋˜ ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์„œ์— ๋น„ํ•ด ๋™์ผ ๋ชจ๋ธ์„ ์ฒ˜๋ฆฌํ•˜๋ฉด์„œ 2.48๋ฐฐ ๊ฐ€๋Ÿ‰ ๋” ๋†’์€ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ, 43% ์ ์€ DRAM ์ ‘๊ทผ ์š”๊ตฌ๋Ÿ‰, 0.8%p ๋†’์€ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ์ด๋ ‡๊ฒŒ ์†Œ๊ฐœ๋œ ๋‘ ๊ฐ€์ง€ ์„ค๊ณ„๋Š” ๋ชจ๋‘ ์‹ค์ œ ์นฉ์œผ๋กœ ์ œ์ž‘๋˜์–ด ๊ฒ€์ฆ๋˜์—ˆ๋‹ค. ์ธก์ • ๋ฐ์ดํ„ฐ ๋ฐ ์ „๋ ฅ ์†Œ๋ชจ๋Ÿ‰์„ ํ†ตํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋œ ์ €์ „๋ ฅ ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ ์‹œ์Šคํ…œ ์„ค๊ณ„ ๊ธฐ๋ฒ•์˜ ํšจ์œจ์„ ๊ฒ€์ฆํ•˜์˜€์œผ๋ฉฐ, ํŠนํžˆ ์ƒ๋ฌผํ•™์  ๋ชจ์‚ฌ๋„๊ฐ€ ๋†’์€ ํ›ˆ๋ จ ์•Œ๊ณ ๋ฆฌ์ฆ˜, ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ์— ์ตœ์ ํ™”๋œ ์ˆ˜ ์ฒด๊ณ„, ๊ทธ๋ฆฌ๊ณ  ํšจ์œจ์ ์ธ ์‹œ์Šคํ…œ ๊ตฌํ˜„ ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์‹œ์Šคํ…œ์˜ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ๊ฐœ์„ ํ•˜๋Š” ๋ชฉํ‘œ๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋Š”์ง€ ์ •๋Ÿ‰์ ์œผ๋กœ ๋ถ„์„ํ•˜์˜€๋‹ค.With the advent of the deep learning era, the computational need for processing deep neural networks (DNN) have increased dramatically, both in terms of performing training the neural networks on various tasks as well as in performing inference on the trained neural networks for specific use cases. To address those needs, many custom hardware ranging from systems based on field-programmable gate arrays (FPGA) or application-specific integrated circuits (ASIC) for deployment inside data centers to acceleration blocks in system-on-chip (SoC) for low-power processing in mobile devices were proposed. In this dissertation, custom integrated circuits hardware for energy efficient processing of training neural networks are designed, fabricated, and measured for evaluation of different methodologies that could be utilized for more energy efficient processing under same training performance constraints. In particular, these methodologies are categorized to three different categories for evaluation: (1) Training algorithm. While standard deep neural network training is performed with the back-propagation (BP) algorithm, we investigate various training algorithms, such as neuromorphic learning algorithms with spiking neurons or bio-plausible algorithms with asymmetric feedback for exploiting computational properties for more efficient hardware implementation. (2) Low-precision arithmetic. One of the most powerful methods for increased efficiency in DNN accelerators is through scaling numerical precision. While utilizing low precision numerics for inference phase of DNNs is well studied, training DNNs without performance degradation is relatively more challenging. A novel numerical scheme for training DNNs in various models and scenarios is proposed in this dissertation. (3) System implementation techniques. In actual realization of a custom training system in integrated circuits, nearly infinite design space leads to vastly different quality of results depending on dataflow inside the chip, system load balancing, acceleration and gating blocks, et cetera. Different design techniques which leads to better performance and efficiency are introduced in this dissertation. First, a neuromorphic learning system for classifying handwritten digits (MNIST) is introduced. This learning system aims to deliver low training overhead while maintaining the training performance of classical machine learning. In order to achieve this goal, a neuromorphic learning algorithm is modified for lower operation count and memory buffer requirement while maintaining or even obtaining higher machine learning performance. Moreover, implementation techniques such as update skipping mechanism and lock-free parameter updates allow even lower training overhead, dynamically reducing training energy overhead from 25.6% to 7.5%. With these proposed methodologies, this system greatly improves the accuracy-energy trade-off in on-chip learning system as well as showing close learning performance to classical DNN training through back propagation. Second, a programmable DNN training processor with a custom numerical format is introduced. While prior DNN inference accelerators have utilized 8-bit integers, implementing 8-bit numerics for a training accelerator remained to be a challenge due to higher precision requirements in the backward step of DNN training. To overcome this limitation, a custom 8-bit floating point format dubbed 8-bit floating point with shared exponent bias (FP8-SEB) is introduced in this dissertation. Moreover, a processing architecture of 24-way fused-multiply-adder (FMA) tree greatly increases processing energy efficiency per MAC, while complemented with a novel 2-dimensional routing data-path for making use of spatiality to increase data reuse in both forward, backward, and weight gradient step of convolutional neural networks. This DNN training processor is implemented with a custom vector processing unit, acceleration instructions, and DMA in external DRAMs for end-to-end DNN training in various models and datasets. Compared against prior low-precision training processor in ResNet-18 training, this work achieves 2.48ร— higher energy efficiency, 43% less DRAM accesses, and 0.8\p higher training accuracy. Both of the designs introduced are fabricated in real silicon and verified both in simulations and in physical measurements. Design methodologies are carefully evaluated using simulations of the fabricated chip and measurements with monitored data and power consumption under varying conditions that expose the design techniques in effect. The efficiency of various biologically plausible algorithms, novel numerical formats, and system implementation techniques are analyzed in discussed in this dissertations based on the obtained measurements.Abstract i Contents iv List of Tables vii List of Figures viii 1 Introduction 1 1.1 Study Background 1 1.2 Purpose of Research 6 1.3 Contents 8 2 Hardware-Friendly Learning Algorithms 9 2.1 Modified Learning Rule for Neuromorphic System 9 2.1.1 The Segregated Dendrites Algorithm 9 2.1.2 Modification of the Segregated Dendrites Algorithm 13 2.2 Non-BP Learning Rules on DNN Training Processor 18 2.2.1 Feedback Alignment and Direct Feedback Alignment 18 2.2.2 Reduced Memory Access in Non-BP Learning Rules 23 3 Optimal Numerical Format for DNN Training 27 3.1 Related Works 27 3.2 Proposed FP8 with Shared Exponent Bias 30 3.3 Training Results with FP8-SEB 33 3.4 Fused Multiply Adder Tree for FP8-SEB 37 4 System Implementations 41 4.1 Neuromorphic Learning System 41 4.1.1 Bio-Plausibility 41 4.1.2 Top Level Architecture 43 4.1.3 Lock-Free Weight Updates 47 4.1.4 Update Skipping Mechanism 48 4.2 Low-Precision DNN Training System 51 4.2.1 Top Level Architecture 52 4.2.2 Optimized Auxiliary Instructions in the Vector Processing Unit 55 4.2.3 Buffer Organization 57 4.2.4 Input-Output 2D Spatial Routing for FMA Trees 60 5 Measurement Results 70 5.1 Measurement Results on the Neuromorphic Learning System 70 5.1.1 Measurement Results and Test Setup . 70 5.1.2 Comparison against other works 73 5.1.3 Scalability of the Learning Algorithm 77 5.2 Measurements Results on the Low-Precision DNN Training Processor 79 5.2.1 Measurement Results in Benchmarked Tests 79 5.2.2 Comparison Against Other DNN Training Processors 89 6 Conclusion 93 6.1 Discussion for Future Works 93 6.1.1 Scaling to CNNs in the Neuromorphic System 93 6.1.2 Discussions for Improvements on DNN Training Processor 96 6.2 Conclusion 99 Abstract (In Korean) 108๋ฐ•

    An Energy-Efficient Reconfigurable Circuit Switched Network-on-Chip

    Get PDF
    Network-on-Chip (NoC) is an energy-efficient on-chip communication architecture for multi-tile System-on-Chip (SoC) architectures. The SoC architecture, including its run-time software, can replace inflexible ASICs for future ambient systems. These ambient systems have to be flexible as well as energy-efficient. To find an energy-efficient solution for the communication network we analyze three wireless applications. Based on their communication requirements we observe that revisiting of the circuit switching techniques is beneficial. In this paper we propose a new energy-efficient reconfigurable circuit-switched Network-on-Chip. By physically separating the concurrent data streams we reduce the overall energy consumption. The circuit-switched router has been synthesized and analyzed for its power consumption in 0.13 ยฟm technology. A 5-port circuit-switched router has an area of 0.05 mm2 and runs at 1075 MHz. The proposed architecture consumes 3.5 times less energy compared to its packet-switched equivalen

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Tinsel: a manythread overlay for FPGA clusters

    Get PDF
    Commodity FPGA boards with advanced networking facilities have great potential in the construction of high-performance compute clusters that scale. However, low-level design tools and long synthesis times are major barriers to productivity for application developers. In this paper, we explore the potential of a distributed soft-processor overlay, programmed in software at a high-level of abstraction, to deliver a useful level of performance for FPGA clusters. In particular, we demonstrate the use of hardware multhreading to achieve a fast, space-efficient, high-throughput overlay, and compare a 12-FPGA instance of it (12,288 RISC-V threads) against a conventional Xeon cluster on the problem of distributed graph processing.This work was supported by EPSRC grant EP/N031768/1 (POETS project)

    Tiled microprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (p. 251-258).Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to 100's or 1000's of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing them.(cont.) To develop these ideas, the thesis details the design, implementation and analysis of a tiled microprocessor prototype, the Raw Microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.by Michael Bedford Taylor.Ph.D

    Doctor of Philosophy

    Get PDF
    dissertationIn recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence available off-chip bandwidth, will at best increase at sublinear rate and at worst, stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM) have begun to emerge and will be integrated into the memory hierarchy sooner than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform access cache (NUCA), we must figure out the optimal bank. Similarly, in a multi-dual inline memory module (DIMM) non uniform memory access (NUMA) main memory, we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data. In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" In this dissertation we argue for a hardware-software codesign approach to tackle the above mentioned problems at different levels of the memory hierarchy. The proposed methods utilize techniques like page coloring and shadow addresses and are able to handle a large number of problems ranging from managing wire-delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures
    • โ€ฆ
    corecore