8 research outputs found

    Algorithm and Architecture Co-design for High-performance Digital Signal Processing.

    CMOS scaling has been the driving force behind the revolution of digital signal processing (DSP) systems, but scaling is slowing down and the CMOS device is approaching its fundamental scaling limit. At the same time, DSP algorithms are continuing to evolve, so there is a growing gap between the increasing complexities of the algorithms and what is practically implementable. The gap can be bridged by exploring the synergy between algorithm and hardware design, using the so-called co-design techniques. In this thesis, algorithm and architecture co-design techniques are applied to X-ray computed tomography (CT) image reconstruction. Analysis of fixed-point quantization and CT geometry identifies an optimal word length and a mismatch between the object and projection grids. A water-filling buffer is designed to resolve the grid mismatch, and is combined with parallel fixed-point arithmetic units to improve the throughput. The analysis eventually leads to an out-of-order scheduling architecture that reduces the off-chip memory access by three orders of magnitude. The co-design techniques are further applied to the design of neural networks for sparse coding. Analysis of the neuron spiking dynamics leads to the optimal tuning of network size, spiking rate, and update step size to keep the spiking sparse. The resulting sparsity enables a bus-ring architecture to achieve both high throughput and scalability. A 65nm CMOS chip implementing the architecture demonstrates feature extraction at a throughput of 1.24G pixel/s at 1.0V and 310MHz. The error tolerance of sparse coding can be exploited to enhance the energy efficiency. As a natural next step after the sparse coding chip, a neural-inspired inference module (IM) is designed for object recognition. The object recognition chip consists of an IM based on sparse coding and an event-driven classifier. A learning co-processor is integrated on chip to enable on-chip learning. 
The throughput and energy efficiency are further improved using architectural techniques including sub-dividing the IM and classifier into modules and optimal pipelining. The result is a 65nm CMOS chip that performs sparse coding at 10.16G pixel/s at 1.0V and 635MHz. The co-design techniques can be applied to the design of other advanced DSP algorithms for emerging applications.
    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113344/1/jungkook_1.pd
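The abstract above mentions that an analysis of fixed-point quantization identifies an optimal word length. As a rough illustration of what such an analysis can look like, the sketch below sweeps the word length of a signed fixed-point format and measures the signal-to-quantization-noise ratio (SQNR) on random test data. The function names, the uniform test signal, and the choice of `word_length - 1` fractional bits are assumptions made for this example, not details taken from the thesis.

```python
import math
import random


def quantize(x, word_length, frac_bits):
    """Round x to a signed fixed-point value, saturating at the format limits."""
    scale = 1 << frac_bits
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    code = max(lo, min(hi, round(x * scale)))
    return code / scale


def sqnr_db(samples, word_length, frac_bits):
    """Signal-to-quantization-noise ratio in dB for the given format."""
    signal = sum(s * s for s in samples)
    noise = sum((s - quantize(s, word_length, frac_bits)) ** 2 for s in samples)
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


# Hypothetical test signal: uniform samples in [-1, 1).
rng = random.Random(0)
samples = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

# Sweep a few word lengths; each extra bit should add roughly 6 dB of SQNR.
sweep = {w: sqnr_db(samples, w, w - 1) for w in (8, 12, 16)}
```

In a real design flow the sweep would be run on representative data for the target algorithm, and the shortest word length that still meets the accuracy requirement would be chosen.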

    A Survey of Spiking Neural Network Accelerator on FPGA

    Because they can implement customized topologies, FPGAs are increasingly used to deploy SNNs in both embedded and high-performance applications. In this paper, we survey state-of-the-art SNN implementations and their applications on FPGA. We collect the recent widely used spiking neuron models, network structures, and signal encoding formats, and then enumerate the related hardware design schemes for FPGA-based SNN implementations. Compared with previous surveys, this manuscript enumerates the application instances that applied these technical schemes in recent research. Based on that, we discuss the actual acceleration potential of implementing SNNs on FPGA. Building on this discussion, we outline upcoming trends and offer guidelines for further advances in related subjects.
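The survey abstract refers to widely used spiking neuron models without naming them. One common textbook choice is the leaky integrate-and-fire (LIF) neuron, sketched below in discrete time. The parameter names and default values (`leak`, `threshold`, `v_reset`) are assumptions chosen for illustration and are not taken from the survey.

```python
def lif_spike_train(input_current, leak=0.9, threshold=1.0, v_reset=0.0):
    """Simulate a discrete-time leaky integrate-and-fire neuron.

    Each step, the membrane potential decays by `leak`, accumulates the input,
    and emits a spike (1) when it reaches `threshold`, after which it resets.
    """
    v = 0.0
    spikes = []
    for current in input_current:
        v = leak * v + current
        if v >= threshold:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes


# A constant input of 0.3 charges the neuron until it fires every fourth step.
spikes = lif_spike_train([0.3] * 8)
```

The update needs only a multiply, an add, and a comparison per step, which is one reason this kind of model maps well onto FPGA logic.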

    Hardware Considerations for Signal Processing Systems: A Step Toward the Unconventional.

    Signal processing algorithms are becoming more computationally intensive and power hungry, while the demand for mobile products and low-power devices is also increasing. An integrated ASIC solution is one of the primary ways chip developers can improve performance and add functionality while keeping the power budget low. This work discusses ASIC hardware for both conventional and unconventional signal processing systems, and how integration, error resilience, emerging devices, and new algorithms can be leveraged by signal processing systems to further improve performance and enable new applications. Specifically, this work presents three case studies: 1) a conventional and highly parallel mixed-signal cross-correlator ASIC for a weather satellite performing real-time synthetic aperture imaging, 2) an unconventional native stochastic computing architecture enabled by memristors, and 3) two unconventional sparse neural network ASICs for feature extraction and object classification. As improvements from technology scaling alone slow down, and the demand for energy-efficient mobile electronics increases, such optimization techniques at the device, circuit, and system level will become more critical to advance signal processing capabilities in the future.
    PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116685/1/knagphil_1.pd
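The second case study above involves stochastic computing. The abstract does not describe the memristor-based architecture itself, so the sketch below only shows the textbook idea behind stochastic computing: a value in [0, 1] is encoded as the fraction of ones in a random bitstream, and multiplying two independent streams reduces to a bitwise AND. The function names, stream length, and seed are assumptions for this example.

```python
import random


def to_bitstream(p, length, rng):
    """Encode probability p as a random bitstream whose mean approximates p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stochastic_multiply(a, b, length=4096, seed=0):
    """Estimate a * b by AND-ing two independent unipolar bitstreams."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length


# The estimate approaches 0.5 * 0.4 = 0.2 as the streams get longer.
estimate = stochastic_multiply(0.5, 0.4)
```

The result is approximate by construction, which is why stochastic computing is usually paired with error-resilient applications.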

    Design of a Low-Power Deep Learning Training Accelerator Using Quantized Learning

    Doctoral thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), 2022.2. Dongsuk Jeon.
    With the advent of the deep learning era, the computational demand for processing deep neural networks (DNNs) has increased dramatically, both for training neural networks on various tasks and for running inference with trained networks in specific use cases. 
To address those needs, a wide range of custom hardware has been proposed, from systems based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) for deployment inside data centers to acceleration blocks in systems-on-chip (SoCs) for low-power processing in mobile devices. In this dissertation, custom integrated-circuit hardware for energy-efficient neural network training is designed, fabricated, and measured in order to evaluate methodologies that enable more energy-efficient processing under the same training performance constraints. These methodologies fall into three categories: (1) Training algorithm. While standard deep neural network training is performed with the back-propagation (BP) algorithm, we investigate other training algorithms, such as neuromorphic learning algorithms with spiking neurons and bio-plausible algorithms with asymmetric feedback, whose computational properties can be exploited for more efficient hardware implementation. (2) Low-precision arithmetic. One of the most powerful ways to increase efficiency in DNN accelerators is to scale numerical precision. While low-precision numerics for the inference phase of DNNs are well studied, training DNNs at low precision without performance degradation is considerably more challenging. This dissertation proposes a novel numerical scheme for training DNNs across various models and scenarios. (3) System implementation techniques. When a custom training system is realized in integrated circuits, the nearly infinite design space leads to vastly different quality of results depending on the dataflow inside the chip, system load balancing, acceleration and gating blocks, and other factors. This dissertation introduces design techniques that lead to better performance and efficiency. First, a neuromorphic learning system for classifying handwritten digits (MNIST) is introduced. 
This learning system aims to deliver low training overhead while maintaining the training performance of classical machine learning. To achieve this goal, a neuromorphic learning algorithm is modified to lower its operation count and memory buffer requirement while maintaining, or even improving, machine learning performance. Moreover, implementation techniques such as an update-skipping mechanism and lock-free parameter updates lower the training overhead further, dynamically reducing the training energy overhead from 25.6% to 7.5%. With these methodologies, the system greatly improves the accuracy-energy trade-off of on-chip learning systems while achieving learning performance close to that of classical DNN training through back-propagation. Second, a programmable DNN training processor with a custom numerical format is introduced. While prior DNN inference accelerators have used 8-bit integers, implementing 8-bit numerics in a training accelerator has remained a challenge because of the higher precision requirements in the backward step of DNN training. To overcome this limitation, this dissertation introduces a custom 8-bit floating-point format dubbed 8-bit floating point with shared exponent bias (FP8-SEB). Moreover, a processing architecture based on a 24-way fused-multiply-adder (FMA) tree greatly increases the processing energy efficiency per MAC, and it is complemented by a novel 2-dimensional routing data-path that exploits spatiality to increase data reuse in the forward, backward, and weight-gradient steps of convolutional neural networks. This DNN training processor is implemented with a custom vector processing unit, acceleration instructions, and direct memory access (DMA) to external DRAM for end-to-end DNN training on various models and datasets. Compared with a prior low-precision training processor on ResNet-18 training, this work achieves 2.48× higher energy efficiency, 43% fewer DRAM accesses, and 0.8 percentage points higher training accuracy. 
Both designs are fabricated in real silicon and verified both in simulations and in physical measurements. The design methodologies are carefully evaluated using simulations of the fabricated chip and measurements with monitored data and power consumption under varying conditions that expose the design techniques in effect. The efficiency of various biologically plausible algorithms, novel numerical formats, and system implementation techniques is analyzed and discussed in this dissertation based on the obtained measurements.
    Table of contents:
    1 Introduction
        1.1 Study Background
        1.2 Purpose of Research
        1.3 Contents
    2 Hardware-Friendly Learning Algorithms
        2.1 Modified Learning Rule for Neuromorphic System
            2.1.1 The Segregated Dendrites Algorithm
            2.1.2 Modification of the Segregated Dendrites Algorithm
        2.2 Non-BP Learning Rules on DNN Training Processor
            2.2.1 Feedback Alignment and Direct Feedback Alignment
            2.2.2 Reduced Memory Access in Non-BP Learning Rules
    3 Optimal Numerical Format for DNN Training
        3.1 Related Works
        3.2 Proposed FP8 with Shared Exponent Bias
        3.3 Training Results with FP8-SEB
        3.4 Fused Multiply Adder Tree for FP8-SEB
    4 System Implementations
        4.1 Neuromorphic Learning System
            4.1.1 Bio-Plausibility
            4.1.2 Top Level Architecture
            4.1.3 Lock-Free Weight Updates
            4.1.4 Update Skipping Mechanism
        4.2 Low-Precision DNN Training System
            4.2.1 Top Level Architecture
            4.2.2 Optimized Auxiliary Instructions in the Vector Processing Unit
            4.2.3 Buffer Organization
            4.2.4 Input-Output 2D Spatial Routing for FMA Trees
    5 Measurement Results
        5.1 Measurement Results on the Neuromorphic Learning System
            5.1.1 Measurement Results and Test Setup
            5.1.2 Comparison against other works
            5.1.3 Scalability of the Learning Algorithm
        5.2 Measurements Results on the Low-Precision DNN Training Processor
            5.2.1 Measurement Results in Benchmarked Tests
            5.2.2 Comparison Against Other DNN Training Processors
    6 Conclusion
        6.1 Discussion for Future Works
            6.1.1 Scaling to CNNs in the Neuromorphic System
            6.1.2 Discussions for Improvements on DNN Training Processor
        6.2 Conclusion
    Abstract (In Korean)
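The abstract above names an 8-bit floating-point format with a shared exponent bias (FP8-SEB) but does not give its bit layout. The sketch below is therefore only an illustration of the general idea under loudly stated assumptions: a 1-4-3 sign/exponent/mantissa split, one bias chosen per tensor from its largest magnitude, round-to-nearest, gradual underflow, and no special values. None of these choices, nor the function names, are confirmed by the source.

```python
import math

EXP_BITS = 4  # assumed exponent width (not stated in the abstract)
MAN_BITS = 3  # assumed mantissa width (not stated in the abstract)
MAX_CODE = 2 ** EXP_BITS - 1


def choose_shared_bias(values):
    """Pick one bias per tensor so the largest magnitude uses the top exponent code."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0
    top_exp = math.frexp(max_abs)[1] - 1  # floor(log2(max_abs))
    return MAX_CODE - top_exp


def quantize_fp8(x, bias):
    """Round x to the nearest value representable with the given shared bias."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    min_exp, max_exp = -bias, MAX_CODE - bias
    exp = math.frexp(mag)[1] - 1
    exp = max(min_exp, min(max_exp, exp))  # clamp into the representable range
    step = 2.0 ** (exp - MAN_BITS)         # spacing between neighboring values
    largest = (2.0 - 2.0 ** -MAN_BITS) * 2.0 ** max_exp
    return sign * min(round(mag / step) * step, largest)


# Hypothetical tensors with very different dynamic ranges.
activations = [0.5, -3.0, 0.02, 100.0]
gradients = [1e-4, -3e-5, 2e-6]
act_bias = choose_shared_bias(activations)
grad_bias = choose_shared_bias(gradients)
quantized_grads = [quantize_fp8(g, grad_bias) for g in gradients]
```

In this sketch, quantizing the small gradient values with the activation bias would flush `1e-4` to zero, while a bias chosen for the gradient tensor keeps the relative error within about 6%. That difference is the motivation for letting each tensor share its own bias.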

    Energy-Efficient Neural Network Hardware Design and Circuit Techniques to Enhance Hardware Security

    University of Minnesota Ph.D. dissertation. May 2019. Major: Electrical Engineering. Advisor: Chris Kim. 1 computer file (PDF); ix, 108 pages.
    Artificial intelligence (AI) algorithms and hardware are being developed at a rapid pace for emerging applications such as self-driving cars, speech/image/video recognition, and deep learning. Today's AI tasks are mostly performed at remote datacenters, while in the future, more AI workloads are expected to run on edge devices. To fulfill this goal, innovative design techniques are needed to improve the energy efficiency, form factor, and security of AI chips. This dissertation focuses on two topics to address these challenges: building energy-efficient AI chips based on various neural network architectures, and designing "chip fingerprint" circuits as well as counterfeit chip sensors to improve hardware security. First, to deploy AI tasks on edge devices, we develop energy- and area-efficient computing platforms. One is a novel time-domain computing scheme for fully connected multi-layer perceptron (MLP) neural networks, and the other is an efficient binarized architecture for long short-term memory (LSTM) neural networks. Second, to enhance hardware security and ensure secure data communication between edge devices, the authenticity of each chip must be verified. A Physical Unclonable Function (PUF) is a circuit primitive that can serve as a chip "fingerprint" by generating a unique ID for each chip. Another security concern comes from counterfeit ICs; recycled and remarked ICs account for more than 80% of counterfeit electronics. To effectively detect counterfeit chips that have been physically compromised, we developed a passive IC tamper sensor. This sensor is demonstrated to efficiently and reliably detect suspicious activities such as high temperature cycling, ambient humidity rise, and increased dust particles in the chip cavity.
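The abstract above mentions a binarized LSTM architecture without describing its arithmetic. A common building block in binarized networks generally is a dot product over values restricted to +1 and -1, computed with XNOR and a population count. The sketch below illustrates that generic technique only; the bit packing convention and function names are assumptions, and the dissertation's actual circuit is not described in the source.

```python
def pack_signs(values):
    """Pack signs into an integer: bit i is 1 when values[i] >= 0 (i.e. +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors of +1/-1 using XNOR and popcount."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # positions with equal sign
    return 2 * matches - n  # each match adds +1, each mismatch adds -1


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack_signs(a), pack_signs(b), len(a))
```

Replacing multiply-accumulate operations with bitwise logic in this way is what makes binarized architectures attractive for energy- and area-constrained edge devices.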

    Thermal Management of Electronics and Optoelectronics: From Heat Source Characterization to Heat Mitigation at the Device and Package Levels

    Thermal management of electronic and optoelectronic devices has become increasingly challenging. For electronic devices, the challenge arises primarily from the drive for miniaturized, high-performance devices, leading to escalating power density. For optoelectronics, the recent widespread use of organic light emitting diode (OLED) displays in mobile platforms and flexible electronics presents new challenges for heat dissipation. Furthermore, the performance and reliability of increasingly high-power semiconductor lasers used for telecommunications and other applications hinge on proper thermal management. For example, small, concentrated hotspots may trigger thermal runaway and premature device destruction. Emerging challenges in thermal management of devices require innovative methods to characterize and mitigate heat generation and temperature rise at the device level as well as the package level. The first part of this dissertation discusses device-level thermal management. A thermal imaging microscope with high spatial resolution (~450 nm) is created for hotspot detection in the context of diode lasers under back-irradiance (BI). Laser facet temperature maps reveal the existence of a critical BI spot location that increases the laser's active region temperature by nearly a factor of 3. An active solid-state cooling strategy that could scale down to the size of hotspots in modern devices is then explored, utilizing energy filtering at carbon nanotube (CNT) junctions as a means to provide thermionic cooling at nanometer spatial scales. The CNT cooler exhibits a large effective Seebeck coefficient of 386 μV/K and a relatively moderate thermal conductivity, together giving rise to a high cooling capacity (2.3 × 10^6 W/cm^2). Thermal management at the package level is then considered. Heat transfer in polymers is first studied, owing to their prevalence in thermal interface materials as well as organic devices (e.g., OLEDs). 
Employing molecular design principles developed to engineer the thermal properties of polymers, molecular-scale electrostatic repulsive forces are utilized to modify chain morphologies in amorphous polymers, leading to spin-cast films that are free of ceramic or metallic fillers yet have thermal conductivities as high as 1.17 W m^-1 K^-1, which is approximately 6 times that of typical amorphous polymers. Electronics packaging designs incorporating phase change materials (PCMs) are then considered as a means to mitigate bursty heat sources; PCM incorporation in a packaged accelerator chip intended for large-scale object identification is found to suppress the peak die temperature by 17%.
    PhD, Mechanical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/150013/1/chenlium_1.pd