75 research outputs found

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Get PDF
    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating point adder, multiply, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large

    ๊ทผ์‚ฌ ์ปดํ“จํŒ…์„ ์ด์šฉํ•œ ํšŒ๋กœ ๋…ธํ™” ๋ณด์ƒ๊ณผ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์‹ ๊ฒฝ๋ง ๊ตฌํ˜„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์ดํ˜์žฌ.Approximate computing reduces the cost (energy and/or latency) of computations by relaxing the correctness (i.e., precision) of computations up to the level, which is dependent on types of applications. Moreover, it can be realized in various hierarchies of computing system design from circuit level to application level. This dissertation presents the methodologies applying approximate computing across such hierarchies; compensating aging-induced delay in logic circuit by dynamic computation approximation (Chapter 1), designing energy-efficient neural network by combining low-power and low-latency approximate neuron models (Chapter 2), and co-designing in-memory gradient descent module with neural processing unit so as to address a memory bottleneck incurred by memory I/O for high-precision data (Chapter 3). The first chapter of this dissertation presents a novel design methodology to turn the timing violation caused by aging into computation approximation error without the reliability guardband or increasing the supply voltage. It can be realized by accurately monitoring the critical path delay at run-time. The proposal is evaluated at two levels: RTL component level and system level. The experimental results at the RTL component level show a significant improvement in terms of (normalized) mean squared error caused by the timing violation and, at the system level, show that the proposed approach successfully transforms the aging-induced timing violation errors into much less harmful computation approximation errors, therefore it recovers image quality up to perceptually acceptable levels. It reduces the dynamic and static power consumption by 21.45% and 10.78%, respectively, with 0.8% area overhead compared to the conventional approach. The second chapter of this dissertation presents an energy-efficient neural network consisting of alternative neuron models; Stochastic-Computing (SC) and Spiking (SP) neuron models. SC has been adopted in various fields to improve the power efficiency of systems by performing arithmetic computations stochastically, which approximates binary computation in conventional computing systems. Moreover, a recent work showed that deep neural network (DNN) can be implemented in the manner of stochastic computing and it greatly reduces power consumption. However, Stochastic DNN (SC-DNN) suffers from problem of high latency as it processes only a bit per cycle. To address such problem, it is proposed to adopt Spiking DNN (SP-DNN) as an input interface for SC-DNN since SP effectively processes more bits per cycle than SC-DNN. Moreover, this chapter resolves the encoding mismatch problem, between two different neuron models, without hardware cost by compensating the encoding mismatch with synapse weight calibration. A resultant hybrid DNN (SPSC-DNN) consists of SP-DNN as bottom layers and SC-DNN as top layers. Exploiting the reduced latency from SP-DNN and low-power consumption from SC-DNN, the proposed SPSC-DNN achieves improved energy-efficiency with lower error-rate compared to SC-DNN and SP-DNN in same network configuration. The third chapter of this dissertation proposes GradPim architecture, which accelerates the parameter updates by in-memory processing which is codesigned with 8-bit floating-point training in Neural Processing Unit (NPU) for deep neural networks. By keeping the high precision processing algorithms in memory, such as the parameter update incorporating high-precision weights in its computation, the GradPim architecture can achieve high computational efficiency using 8-bit floating point in NPU and also gain power efficiency by eliminating massive high-precision data transfers between NPU and off-chip memory. A simple extension of DDR4 SDRAM utilizing bank-group parallelism makes the operation designs in processing-in-memory (PIM) module efficient in terms of hardware cost and performance. The experimental results show that the proposed architecture can improve the performance of the parameter update phase in the training by up to 40% and greatly reduce the memory bandwidth requirement while posing only a minimal amount of overhead to the protocol and the DRAM area.๊ทผ์‚ฌ ์ปดํ“จํŒ…์€ ์—ฐ์‚ฐ์˜ ์ •ํ™•๋„์˜ ์†์‹ค์„ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋ณ„ ์ ์ ˆํ•œ ์ˆ˜์ค€๊นŒ์ง€ ํ—ˆ์šฉํ•จ์œผ๋กœ์จ ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ๋น„์šฉ (์—๋„ˆ์ง€๋‚˜ ์ง€์—ฐ์‹œ๊ฐ„)์„ ์ค„์ธ๋‹ค. ๊ฒŒ๋‹ค๊ฐ€, ๊ทผ์‚ฌ ์ปดํ“จํŒ…์€ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ ์„ค๊ณ„์˜ ํšŒ๋กœ ๊ณ„์ธต๋ถ€ํ„ฐ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ๊ณ„์ธต๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๊ณ„์ธต์— ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ทผ์‚ฌ ์ปดํ“จํŒ… ๋ฐฉ๋ฒ•๋ก ์„ ๋‹ค์–‘ํ•œ ์‹œ์Šคํ…œ ์„ค๊ณ„์˜ ๊ณ„์ธต์— ์ ์šฉํ•˜์—ฌ ์ „๋ ฅ๊ณผ ์—๋„ˆ์ง€ ์ธก๋ฉด์—์„œ ์ด๋“์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•๋“ค์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋Š”, ์—ฐ์‚ฐ ๊ทผ์‚ฌํ™” (computation Approximation)๋ฅผ ํ†ตํ•ด ํšŒ๋กœ์˜ ๋…ธํ™”๋กœ ์ธํ•ด ์ฆ๊ฐ€๋œ ์ง€์—ฐ์‹œ๊ฐ„์„ ์ถ”๊ฐ€์ ์ธ ์ „๋ ฅ์†Œ๋ชจ ์—†์ด ๋ณด์ƒํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ (์ฑ•ํ„ฐ 1), ๊ทผ์‚ฌ ๋‰ด๋Ÿฐ๋ชจ๋ธ (approximate neuron model)์„ ์ด์šฉํ•ด ์—๋„ˆ์ง€ ํšจ์œจ์ด ๋†’์€ ์‹ ๊ฒฝ๋ง์„ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ• (์ฑ•ํ„ฐ 2), ๊ทธ๋ฆฌ๊ณ  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์œผ๋กœ ์ธํ•œ ๋ณ‘๋ชฉํ˜„์ƒ ๋ฌธ์ œ๋ฅผ ๋†’์€ ์ •ํ™•๋„ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์—ฐ์‚ฐ์„ ๋ฉ”๋ชจ๋ฆฌ ๋‚ด์—์„œ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ ์™„ํ™”์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์„ (์ฑ•ํ„ฐ3) ์ œ์•ˆํ•˜์˜€๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ฑ•ํ„ฐ๋Š” ํšŒ๋กœ์˜ ๋…ธํ™”๋กœ ์ธํ•œ ์ง€์—ฐ์‹œ๊ฐ„์œ„๋ฐ˜์„ (timing violation) ์„ค๊ณ„๋งˆ์ง„์ด๋‚˜ (reliability guardband) ๊ณต๊ธ‰์ „๋ ฅ์˜ ์ฆ๊ฐ€ ์—†์ด ์—ฐ์‚ฐ์˜ค์ฐจ (computation approximation error)๋ฅผ ํ†ตํ•ด ๋ณด์ƒํ•˜๋Š” ์„ค๊ณ„๋ฐฉ๋ฒ•๋ก  (design methodology)๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์ฃผ์š”๊ฒฝ๋กœ์˜ (critical path) ์ง€์—ฐ์‹œ๊ฐ„์„ ๋™์ž‘์‹œ๊ฐ„์— ์ •ํ™•ํ•˜๊ฒŒ ์ธก์ •ํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค. ์—ฌ๊ธฐ์„œ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์€ RTL component์™€ system ๋‹จ๊ณ„์—์„œ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. RTL component ๋‹จ๊ณ„์˜ ์‹คํ—˜๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด ์ œ์•ˆํ•œ ๋ฐฉ์‹์ด ํ‘œ์ค€ํ™”๋œ ํ‰๊ท ์ œ๊ณฑ์˜ค์ฐจ๋ฅผ (normalized mean squared error) ์ƒ๋‹นํžˆ ์ค„์˜€์Œ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  system ๋‹จ๊ณ„์—์„œ๋Š” ์ด๋ฏธ์ง€์ฒ˜๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ์ด๋ฏธ์ง€์˜ ํ’ˆ์งˆ์ด ์ธ์ง€์ ์œผ๋กœ ์ถฉ๋ถ„ํžˆ ํšŒ๋ณต๋˜๋Š” ๊ฒƒ์„ ๋ณด์ž„์œผ๋กœ์จ ํšŒ๋กœ๋…ธํ™”๋กœ ์ธํ•ด ๋ฐœ์ƒํ•œ ์ง€์—ฐ์‹œ๊ฐ„์œ„๋ฐ˜ ์˜ค์ฐจ๊ฐ€ ์—๋Ÿฌ์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์€ ์—ฐ์‚ฐ์˜ค์ฐจ๋กœ ๋ณ€๊ฒฝ๋˜๋Š” ๊ฒƒ์„ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๊ฒฐ๋ก ์ ์œผ๋กœ, ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•๋ก ์„ ๋”ฐ๋ž์„ ๋•Œ 0.8%์˜ ๊ณต๊ฐ„์„ (area) ๋” ์‚ฌ์šฉํ•˜๋Š” ๋น„์šฉ์„ ์ง€๋ถˆํ•˜๊ณ  21.45%์˜ ๋™์ ์ „๋ ฅ์†Œ๋ชจ์™€ (dynamic power consumption) 10.78%์˜ ์ •์ ์ „๋ ฅ์†Œ๋ชจ์˜ (static power consumption) ๊ฐ์†Œ๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ์ฑ•ํ„ฐ๋Š” ๊ทผ์‚ฌ ๋‰ด๋Ÿฐ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜๋Š” ๊ณ -์—๋„ˆ์ง€ํšจ์œจ์˜ ์‹ ๊ฒฝ๋ง์„ (neural network) ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•œ ๋‘ ๊ฐ€์ง€์˜ ๊ทผ์‚ฌ ๋‰ด๋Ÿฐ๋ชจ๋ธ์€ ํ™•๋ฅ ์ปดํ“จํŒ…๊ณผ (stochastic computing) ์ŠคํŒŒ์ดํ‚น๋‰ด๋Ÿฐ (spiking neuron) ์ด๋ก ๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋ธ๋ง๋˜์—ˆ๋‹ค. ํ™•๋ฅ ์ปดํ“จํŒ…์€ ์‚ฐ์ˆ ์—ฐ์‚ฐ๋“ค์„ ํ™•๋ฅ ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ ์ด์ง„์—ฐ์‚ฐ์„ ๋‚ฎ์€ ์ „๋ ฅ์†Œ๋ชจ๋กœ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ตœ๊ทผ์— ํ™•๋ฅ ์ปดํ“จํŒ… ๋‰ด๋Ÿฐ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง (deep neural network)๋ฅผ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ํ™•๋ฅ ์ปดํ“จํŒ…์„ ๋‰ด๋Ÿฐ๋ชจ๋ธ๋ง์— ํ™œ์šฉํ•  ๊ฒฝ์šฐ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์ด ๋งค ํด๋ฝ์‚ฌ์ดํด๋งˆ๋‹ค (clock cycle) ํ•˜๋‚˜์˜ ๋น„ํŠธ๋งŒ์„ (bit) ์ฒ˜๋ฆฌํ•˜๋ฏ€๋กœ, ์ง€์—ฐ์‹œ๊ฐ„ ์ธก๋ฉด์—์„œ ๋งค์šฐ ๋‚˜์  ์ˆ˜ ๋ฐ–์— ์—†๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ŠคํŒŒ์ดํ‚น ๋‰ด๋Ÿฐ๋ชจ๋ธ๋กœ ๊ตฌ์„ฑ๋œ ์ŠคํŒŒ์ดํ‚น ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ํ™•๋ฅ ์ปดํ“จํŒ…์„ ํ™œ์šฉํ•œ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์™€ ๊ฒฐํ•ฉํ•˜์˜€๋‹ค. ์ŠคํŒŒ์ดํ‚น ๋‰ด๋Ÿฐ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋งค ํด๋ฝ์‚ฌ์ดํด๋งˆ๋‹ค ์—ฌ๋Ÿฌ ๋น„ํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ์ž…๋ ฅ ์ธํ„ฐํŽ˜์ด์Šค๋กœ ์‚ฌ์šฉ๋  ๊ฒฝ์šฐ ์ง€์—ฐ์‹œ๊ฐ„์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ํ™•๋ฅ ์ปดํ“จํŒ… ๋‰ด๋Ÿฐ๋ชจ๋ธ๊ณผ ์ŠคํŒŒ์ดํ‚น ๋‰ด๋Ÿฐ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋ถ€ํ˜ธํ™” (encoding) ๋ฐฉ์‹์ด ๋‹ค๋ฅธ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•ด๋‹น ๋ถ€ํ˜ธํ™” ๋ถˆ์ผ์น˜ ๋ฌธ์ œ๋ฅผ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šตํ•  ๋•Œ ๊ณ ๋ คํ•จ์œผ๋กœ์จ, ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์˜ ๊ฐ’์ด ๋ถ€ํ˜ธํ™” ๋ถˆ์ผ์น˜๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ์กฐ์ ˆ (calibration) ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ๋ถ„์„์˜ ๊ฒฐ๊ณผ๋กœ, ์•ž ์ชฝ์—๋Š” ์ŠคํŒŒ์ดํ‚น ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ๋ฐฐ์น˜ํ•˜๊ณ  ๋’ท ์ชฝ์• ๋Š” ํ™•๋ฅ ์ปดํ“จํŒ… ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ๋ฐฐ์น˜ํ•˜๋Š” ํ˜ผ์„ฑ์‹ ๊ฒฝ๋ง์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ํ˜ผ์„ฑ์‹ ๊ฒฝ๋ง์€ ์ŠคํŒŒ์ดํ‚น ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ๋งค ํด๋ฝ์‚ฌ์ดํด๋งˆ๋‹ค ์ฒ˜๋ฆฌ๋˜๋Š” ๋น„ํŠธ ์–‘์˜ ์ฆ๊ฐ€๋กœ ์ธํ•œ ์ง€์—ฐ์‹œ๊ฐ„ ๊ฐ์†Œ ํšจ๊ณผ์™€ ํ™•๋ฅ ์ปดํ“จํŒ… ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ์ €์ „๋ ฅ ์†Œ๋ชจ ํŠน์„ฑ์„ ๋ชจ๋‘ ํ™œ์šฉํ•จ์œผ๋กœ์จ ๊ฐ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ๋”ฐ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ๋Œ€๋น„ ์šฐ์ˆ˜ํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์ •ํ™•๋„ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ฉด์„œ ๋‹ฌ์„ฑํ•œ๋‹ค. ์„ธ ๋ฒˆ์งธ ์ฑ•ํ„ฐ๋Š” ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ 8๋น„ํŠธ ๋ถ€๋™์†Œ์ˆซ์  ์—ฐ์‚ฐ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ์‹ ๊ฒฝ๋ง์ฒ˜๋ฆฌ์œ ๋‹›์˜ (neural processing unit) ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ์„ (parameter update) ๋ฉ”๋ชจ๋ฆฌ-๋‚ด-์—ฐ์‚ฐ์œผ๋กœ (in-memory processing) ๊ฐ€์†ํ•˜๋Š” GradPIM ์•„ํ‚คํ…์ณ๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. GradPIM์€ 8๋น„ํŠธ์˜ ๋‚ฎ์€ ์ •ํ™•๋„ ์—ฐ์‚ฐ์€ ์‹ ๊ฒฝ๋ง์ฒ˜๋ฆฌ์œ ๋‹›์— ๋‚จ๊ธฐ๊ณ , ๋†’์€ ์ •ํ™•๋„๋ฅผ ๊ฐ€์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜๋Š” ์—ฐ์‚ฐ์€ (ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ) ๋ฉ”๋ชจ๋ฆฌ ๋‚ด๋ถ€์— ๋‘ ์œผ๋กœ์จ ์‹ ๊ฒฝ๋ง์ฒ˜๋ฆฌ์œ ๋‹›๊ณผ ๋ฉ”๋ชจ๋ฆฌ๊ฐ„์˜ ๋ฐ์ดํ„ฐํ†ต์‹ ์˜ ์–‘์„ ์ค„์—ฌ, ๋†’์€ ์—ฐ์‚ฐํšจ์œจ๊ณผ ์ „๋ ฅํšจ์œจ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋˜ํ•œ, GradPIM์€ bank-group ์ˆ˜์ค€์˜ ๋ณ‘๋ ฌํ™”๋ฅผ ์ด๋ฃจ์–ด ๋‚ด ๋†’์€ ๋‚ด๋ถ€ ๋Œ€์—ญํญ์„ ํ™œ์šฉํ•จ์œผ๋กœ์จ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํฌ๊ฒŒ ํ™•์žฅ์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค. ๋˜ํ•œ ์ด๋Ÿฌํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์กฐ์˜ ๋ณ€๊ฒฝ์ด ์ตœ์†Œํ™”๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€์ ์ธ ํ•˜๋“œ์›จ์–ด ๋น„์šฉ๋„ ์ตœ์†Œํ™”๋˜์—ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด GradPIM์ด ์ตœ์†Œํ•œ์˜ DRAM ํ”„๋กœํ† ์ฝœ ๋ณ€ํ™”์™€ DRAM์นฉ ๋‚ด์˜ ๊ณต๊ฐ„์‚ฌ์šฉ์„ ํ†ตํ•ด ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต๊ณผ์ • ์ค‘ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ์— ํ•„์š”ํ•œ ์‹œ๊ฐ„์„ 40%๋งŒํผ ํ–ฅ์ƒ์‹œ์ผฐ์Œ์„ ๋ณด์˜€๋‹ค.Chapter I: Dynamic Computation Approximation for Aging Compensation 1 1.1 Introduction 1 1.1.1 Chip Reliability 1 1.1.2 Reliability Guardband 2 1.1.3 Approximate Computing in Logic Circuits 2 1.1.4 Computation approximation for Aging Compensation 3 1.1.5 Motivational Case Study 4 1.2 Previous Work 5 1.2.1 Aging-induced Delay 5 1.2.2 Delay-Configurable Circuits 6 1.3 Proposed System 8 1.3.1 Overview of the Proposed System 8 1.3.2 Proposed Adder 9 1.3.3 Proposed Multiplier 11 1.3.4 Proposed Monitoring Circuit 16 1.3.5 Aging Compensation Scheme 19 1.4 Design Methodology 20 1.5 Evaluation 24 1.5.1 Experimental setup 24 1.5.2 RTL component level Adder/Multiplier 27 1.5.3 RTL component level Monitoring circuit 30 1.5.4 System level 31 1.6 Summary 38 Chapter II: Energy-Efficient Neural Network by Combining Approximate Neuron Models 40 2.1 Introduction 40 2.1.1 Deep Neural Network (DNN) 40 2.1.2 Low-power designs for DNN 41 2.1.3 Stochastic-Computing Deep Neural Network 41 2.1.4 Spiking Deep Neural Network 43 2.2 Hybrid of Stochastic and Spiking DNNs 44 2.2.1 Stochastic-Computing vs Spiking Deep Neural Network 44 2.2.2 Combining Spiking Layers and Stochastic Layers 46 2.2.3 Encoding Mismatch 47 2.3 Evaluation 49 2.3.1 Latency and Test Error 49 2.3.2 Energy Efficiency 51 2.4 Summary 54 Chapter III: GradPIM: In-memory Gradient Descent in Mixed-Precision DNN Training 55 3.1 Introduction 55 3.1.1 Neural Processing Unit 55 3.1.2 Mixed-precision Training 56 3.1.3 Mixed-precision Training with In-memory Gradient Descent 57 3.1.4 DNN Parameter Update Algorithms 59 3.1.5 Modern DRAM Architecture 61 3.1.6 Motivation 63 3.2 Previous Work 65 3.2.1 Processing-In-Memory 65 3.2.2 Co-design Neural Processing Unit and Processing-In-Memory 66 3.2.3 Low-precision Computation in NPU 67 3.3 GradPIM 68 3.3.1 GradPIM Architecture 68 3.3.2 GradPIM Operations 69 3.3.3 Timing Considerations 70 3.3.4 Update Phase Procedure 73 3.3.5 Commanding GradPIM 75 3.4 NPU Co-design with GradPIM 76 3.4.1 NPU Architecture 76 3.4.2 Data Placement 79 3.5 Evaluation 82 3.5.1 Evaluation Methodology 82 3.5.2 Experimental Results 83 3.5.3 Sensitivity Analysis 88 3.5.4 Layer Characterizations 90 3.5.5 Distributed Data Parallelism 90 3.6 Summary 92 3.6.1 Discussion 92 Bibliography 113 ์š”์•ฝ 114Docto

    EXPERIMENTAL STUDIES ON MULTI-OPERAND ADDERS

    Get PDF

    A Reconfigurable Digital Multiplier and 4:2 Compressor Cells Design

    Get PDF
    With the continually growing use of portable computing devices and increasingly complex software applications, there is a constant push for low power high speed circuitry to support this technology. Because of the high usage and large complex circuitry required to carry out arithmetic operations used in applications such as digital signal processing, there has been a great focus on increasing the efficiency of computer arithmetic circuitry. A key player in the realm of computer arithmetic is the digital multiplier and because of its size and power consumption, it has moved to the forefront of today\u27s research. A digital reconfigurable multiplier architecture will be introduced. Regulated by a 2-bit control signal, the multiplier is capable of double and single precision multiplication, as well as fault tolerant and dual throughput single precision execution. The architecture proposed in this thesis is centered on a recursive multiplication algorithm, where a large multiplication is carried out using recursions of simpler submultiplier modules. Within each sub-multiplier module, instead of carry save adder arrays, 4:2 compressor rows are utilized for partial product reduction, which present greater efficiency, thus result in lower delay and power consumption of the whole multiplier. In addition, a study of various digital logic circuit styles are initially presented, and then three different designs of 4:2 compressor in Domino Logic are presented and simulation results confirm the property of proposed design in terms of delay, power consumption and operation frequenc

    Power-Aware Design Methodologies for FPGA-Based Implementation of Video Processing Systems

    Get PDF
    The increasing capacity and capabilities of FPGA devices in recent years provide an attractive option for performance-hungry applications in the image and video processing domain. FPGA devices are often used as implementation platforms for image and video processing algorithms for real-time applications due to their programmable structure that can exploit inherent spatial and temporal parallelism. While performance and area remain as two main design criteria, power consumption has become an important design goal especially for mobile devices. Reduction in power consumption can be achieved by reducing the supply voltage, capacitances, clock frequency and switching activities in a circuit. Switching activities can be reduced by architectural optimization of the processing cores such as adders, multipliers, multiply and accumulators (MACS), etc. This dissertation research focuses on reducing the switching activities in digital circuits by considering data dependencies in bit level, word level and block level neighborhoods in a video frame. The bit level data neighborhood dependency consideration for power reduction is illustrated in the design of pipelined array, Booth and log-based multipliers. For an array multiplier, operands of the multipliers are partitioned into higher and lower parts so that the probability of the higher order parts being zero or one increases. The gating technique for the pipelined approach deactivates part(s) of the multiplier when the above special values are detected. For the Booth multiplier, the partitioning and gating technique is integrated into the Booth recoding scheme. In addition, a delay correction strategy is developed for the Booth multiplier to reduce the switching activities of the sign extension part in the partial products. A novel architecture design for the computation of log and inverse-log functions for the reduction of power consumption in arithmetic circuits is also presented. This also utilizes the proposed partitioning and gating technique for further dynamic power reduction by reducing the switching activities. The word level and block level data dependencies for reducing the dynamic power consumption are illustrated by presenting the design of a 2-D convolution architecture. Here the similarities of the neighboring pixels in window-based operations of image and video processing algorithms are considered for reduced switching activities. A partitioning and detection mechanism is developed to deactivate the parallel architecture for window-based operations if higher order parts of the pixel values are the same. A neighborhood dependent approach (NDA) is incorporated with different window buffering schemes. Consideration of the symmetry property in filter kernels is also applied with the NDA method for further reduction of switching activities. The proposed design methodologies are implemented and evaluated in a FPGA environment. It is observed that the dynamic power consumption in FPGA-based circuit implementations is significantly reduced in bit level, data level and block level architectures when compared to state-of-the-art design techniques. A specific application for the design of a real-time video processing system incorporating the proposed design methodologies for low power consumption is also presented. An image enhancement application is considered and the proposed partitioning and gating, and NDA methods are utilized in the design of the enhancement system. Experimental results show that the proposed multi-level power aware methodology achieves considerable power reduction. Research work is progressing In utilizing the data dependencies in subsequent frames in a video stream for the reduction of circuit switching activities and thereby the dynamic power consumption

    REAL-TIME ADAPTIVE PULSE COMPRESSION ON RECONFIGURABLE, SYSTEM-ON-CHIP (SOC) PLATFORMS

    Get PDF
    New radar applications need to perform complex algorithms and process a large quantity of data to generate useful information for the users. This situation has motivated the search for better processing solutions that include low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, hardware implementation of adaptive pulse compression algorithms for real-time transceiver optimization is presented, and is based on a System-on-Chip architecture for reconfigurable hardware devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve the computation of computing-intensive tasks such matrix multiplication and matrix inversion, which are essential units to solve the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operations using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms

    Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation

    Get PDF
    Deep Learning has achieved great success in recent years. In many fields of applications, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve a performance that is even better than human-level. However, behind this superior performance is the expensive hardware cost required to implement deep learning operations. Deep learning operations are both computation intensive and memory intensive. Many research works in the literature focused on improving the efficiency of deep learning operations. In this thesis, special focus is put on improving deep learning computation and several efficient arithmetic unit architectures are proposed and optimized for deep learning computation. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep learning specific arithmetic units; (3) the optimization of deep learning computation using 3D memory architecture. Deep learning models are usually trained on graphics processing unit (GPU) and the computations are done with single-precision floating-point numbers. However, recent works proved that deep learning computation can be accomplished with low precision numbers. The half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to the single-precision numbers. In conventional floating-point arithmetic units, single-precision and beyond are well supported to achieve a better precision. However, for deep learning computation, since the computations are intensive, low precision computation is desired to achieve better throughput. As the popularity of half-precision raises, half-precision operations are also need to be supported. Moreover, the deep learning computation contains many dot-product operations and therefore, the support of mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half/single/double/quadruple-precision FMA operations. In addition, it also supports 2-term mixed-precision dot-product operations. Compared to the conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot-product only bring minor resource overhead. The proposed FMA can be used as general-purpose arithmetic unit. Due to the support of parallel half-precision computations and mixed-precision dot-product computations, it is especially suitable for deep learning computation. For the design of deep learning specific computation unit, more optimizations can be performed. First, a fixed-point and floating-point merged multiply-accumulate (MAC) unit is proposed. As deep learning computation can be accomplished with low precision number formats, the support of high precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide a large dynamic range to handle small gradients for deep learning training. For deep learning inference, 8-bit fixed-point 2-term dot-product computation is supported. Second, a flexible multiple-precision MAC unit architecture is proposed. The proposed MAC unit supports both fixed-point operations and floating-point operations. For floating-point format, the proposed unit supports one 16-bit MAC operation or sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-width of exponent and mantissa can be flexibly exchanged. By setting the bit-width of exponent to zero, the proposed MAC unit also supports fixed-point operations. For fixed-point format, the proposed unit supports one 16-bit MAC or sum of two 8-bit multiplications plus a 16-bit addend. Moreover, the proposed unit can be further divided to support sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports accumulating of eight 1-bit logic AND operations to enable the support of binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format in deep learning computation, is proposed to facilitate the use of posit format in deep learning computation. In addition to the above mention arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and HMC logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces the memory bandwidth requirements and thus reduces the energy consumed by memory data transfer

    Variable block size motion estimation hardware for video encoders.

    Get PDF
    Li, Man Ho.Thesis submitted in: November 2006.Thesis (M.Phil.)--Chinese University of Hong Kong, 2007.Includes bibliographical references (leaves 137-143).Abstracts in English and Chinese.Abstract --- p.iAcknowledgement --- p.ivChapter 1 --- Introduction --- p.1Chapter 1.1 --- Motivation --- p.3Chapter 1.2 --- The objectives of this thesis --- p.4Chapter 1.3 --- Contributions --- p.5Chapter 1.4 --- Thesis structure --- p.6Chapter 2 --- Digital video compression --- p.8Chapter 2.1 --- Introduction --- p.8Chapter 2.2 --- Fundamentals of lossy video compression --- p.9Chapter 2.2.1 --- Video compression and human visual systems --- p.10Chapter 2.2.2 --- Representation of color --- p.10Chapter 2.2.3 --- Sampling methods - frames and fields --- p.11Chapter 2.2.4 --- Compression methods --- p.11Chapter 2.2.5 --- Motion estimation --- p.12Chapter 2.2.6 --- Motion compensation --- p.13Chapter 2.2.7 --- Transform --- p.13Chapter 2.2.8 --- Quantization --- p.14Chapter 2.2.9 --- Entropy Encoding --- p.14Chapter 2.2.10 --- Intra-prediction unit --- p.14Chapter 2.2.11 --- Deblocking filter --- p.15Chapter 2.2.12 --- Complexity analysis of on different com- pression stages --- p.16Chapter 2.3 --- Motion estimation process --- p.16Chapter 2.3.1 --- Block-based matching method --- p.16Chapter 2.3.2 --- Motion estimation procedure --- p.18Chapter 2.3.3 --- Matching Criteria --- p.19Chapter 2.3.4 --- Motion vectors --- p.21Chapter 2.3.5 --- Quality judgment --- p.22Chapter 2.4 --- Block-based matching algorithms for motion estimation --- p.23Chapter 2.4.1 --- Full search (FS) --- p.23Chapter 2.4.2 --- Three-step search (TSS) --- p.24Chapter 2.4.3 --- Two-dimensional Logarithmic Search Algorithm (2D-log search) --- p.25Chapter 2.4.4 --- Diamond Search (DS) --- p.25Chapter 2.4.5 --- Fast full search (FFS) --- p.26Chapter 2.5 --- Complexity analysis of motion estimation --- p.27Chapter 2.5.1 --- Different searching algorithms --- p.28Chapter 2.5.2 --- Fixed-block size motion estimation --- p.28Chapter 2.5.3 --- Variable block size motion estimation --- p.29Chapter 2.5.4 --- Sub-pixel motion estimation --- p.30Chapter 2.5.5 --- Multi-reference frame motion estimation . --- p.30Chapter 2.6 --- Picture quality analysis --- p.31Chapter 2.7 --- Summary --- p.32Chapter 3 --- Arithmetic for video encoding --- p.33Chapter 3.1 --- Introduction --- p.33Chapter 3.2 --- Number systems --- p.34Chapter 3.2.1 --- Non-redundant Number System --- p.34Chapter 3.2.2 --- Redundant number system --- p.36Chapter 3.3 --- Addition/subtraction algorithm --- p.38Chapter 3.3.1 --- Non-redundant number addition --- p.39Chapter 3.3.2 --- Carry-save number addition --- p.39Chapter 3.3.3 --- Signed-digit number addition --- p.40Chapter 3.4 --- Bit-serial algorithms --- p.42Chapter 3.4.1 --- Least-significant-bit (LSB) first mode --- p.42Chapter 3.4.2 --- Most-significant-bit (MSB) first mode --- p.43Chapter 3.5 --- Absolute difference algorithm --- p.44Chapter 3.5.1 --- Non-redundant algorithm for absolute difference --- p.44Chapter 3.5.2 --- Redundant algorithm for absolute difference --- p.45Chapter 3.6 --- Multi-operand addition algorithm --- p.47Chapter 3.6.1 --- Bit-parallel non-redundant adder tree implementation --- p.47Chapter 3.6.2 --- Bit-parallel carry-save adder tree implementation --- p.49Chapter 3.6.3 --- Bit serial signed digit adder tree implementation --- p.49Chapter 3.7 --- Comparison algorithms --- p.50Chapter 3.7.1 --- Non-redundant comparison algorithm --- p.51Chapter 3.7.2 --- Signed-digit comparison algorithm --- p.52Chapter 3.8 --- Summary --- p.53Chapter 4 --- VLSI architectures for video encoding --- p.54Chapter 4.1 --- Introduction --- p.54Chapter 4.2 --- Implementation platform - (FPGA) --- p.55Chapter 4.2.1 --- Basic FPGA architecture --- p.55Chapter 4.2.2 --- DSP blocks in FPGA device --- p.56Chapter 4.2.3 --- Advantages employing FPGA --- p.57Chapter 4.2.4 --- Commercial FPGA Device --- p.58Chapter 4.3 --- Top level architecture of motion estimation processor --- p.59Chapter 4.4 --- Bit-parallel architectures for motion estimation --- p.60Chapter 4.4.1 --- Systolic arrays --- p.60Chapter 4.4.2 --- Mapping of a motion estimation algorithm onto systolic array --- p.61Chapter 4.4.3 --- 1-D systolic array architecture (LA-ID) --- p.63Chapter 4.4.4 --- 2-D systolic array architecture (LA-2D) --- p.64Chapter 4.4.5 --- 1-D Tree architecture (GA-1D) --- p.64Chapter 4.4.6 --- 2-D Tree architecture (GA-2D) --- p.65Chapter 4.4.7 --- Variable block size support in bit-parallel architectures --- p.66Chapter 4.5 --- Bit-serial motion estimation architecture --- p.68Chapter 4.5.1 --- Data Processing Direction --- p.68Chapter 4.5.2 --- Algorithm mapping and dataflow design . --- p.68Chapter 4.5.3 --- Early termination scheme --- p.69Chapter 4.5.4 --- Top-level architecture --- p.70Chapter 4.5.5 --- Non redundant positive number to signed digit conversion --- p.71Chapter 4.5.6 --- Signed-digit adder tree --- p.73Chapter 4.5.7 --- SAD merger --- p.74Chapter 4.5.8 --- Signed-digit comparator --- p.75Chapter 4.5.9 --- Early termination controller --- p.76Chapter 4.5.10 --- Data scheduling and timeline --- p.80Chapter 4.6 --- Decision metric in different architectural types . . --- p.80Chapter 4.6.1 --- Throughput --- p.81Chapter 4.6.2 --- Memory bandwidth --- p.83Chapter 4.6.3 --- Silicon area occupied and power consump- tion --- p.83Chapter 4.7 --- Architecture selection for different applications . . --- p.84Chapter 4.7.1 --- CIF and QCIF resolution --- p.84Chapter 4.7.2 --- SDTV resolution --- p.85Chapter 4.7.3 --- HDTV resolution --- p.85Chapter 4.8 --- Summary --- p.86Chapter 5 --- Results and comparison --- p.87Chapter 5.1 --- Introduction --- p.87Chapter 5.2 --- Implementation details --- p.87Chapter 5.2.1 --- Bit-parallel 1-D systolic array --- p.88Chapter 5.2.2 --- Bit-parallel 2-D systolic array --- p.89Chapter 5.2.3 --- Bit-parallel Tree architecture --- p.90Chapter 5.2.4 --- MSB-first bit-serial design --- p.91Chapter 5.3 --- Comparison between motion estimation architectures --- p.93Chapter 5.3.1 --- Throughput and latency --- p.93Chapter 5.3.2 --- Occupied resources --- p.94Chapter 5.3.3 --- Memory bandwidth --- p.95Chapter 5.3.4 --- Motion estimation algorithm --- p.95Chapter 5.3.5 --- Power consumption --- p.97Chapter 5.4 --- Comparison to ASIC and FPGA architectures in past literature --- p.99Chapter 5.5 --- Summary --- p.101Chapter 6 --- Conclusion --- p.102Chapter 6.1 --- Summary --- p.102Chapter 6.1.1 --- Algorithmic optimizations --- p.102Chapter 6.1.2 --- Architecture and arithmetic optimizations --- p.103Chapter 6.1.3 --- Implementation on a FPGA platform . . . --- p.104Chapter 6.2 --- Future work --- p.106Chapter A --- VHDL Sources --- p.108Chapter A.1 --- Online Full Adder --- p.108Chapter A.2 --- Online Signed Digit Full Adder --- p.109Chapter A.3 --- Online Pull Adder Tree --- p.110Chapter A.4 --- SAD merger --- p.112Chapter A.5 --- Signed digit adder tree stage (top) --- p.116Chapter A.6 --- Absolute element --- p.118Chapter A.7 --- Absolute stage (top) --- p.119Chapter A.8 --- Online comparator element --- p.120Chapter A.9 --- Comparator stage (top) --- p.122Chapter A.10 --- MSB-first motion estimation processor --- p.134Bibliography --- p.13

    Techniques for Efficient Implementation of FIR and Particle Filtering

    Full text link
    • โ€ฆ
    corecore