1,320 research outputs found

    Application-Driven Synthesis of Energy-Efficient Reconfigurable-Precision Operators

    Get PDF
    The increasing performance demands in emerging Internet of Things applications clash with the low energy budgets of end-nodes. Therefore, hardware operators able to reconfigure their computational precision at runtime are increasingly employed in these devices, to obtain good-enough results at minimal energy costs. Among the many methods proposed to implement such operators, Dynamic Voltage and Accuracy Scaling (DVAS) is particularly promising, due to its broad applicability and low overheads. However, a straight-forward application of DVAS conflicts with the optimizations performed by classic EDA algorithms, and does not yield the expected results. In this paper, we propose a novel synthesis algorithm for reconfigurable-precision circuits, that allows to integrate DVAS in a standard implementation flow. Moreover, we show how this algorithm can exploit information about the application, namely on the frequency of usage of each precision, to further reduce the total energy consumption. Applying our method to the popular LeNet neural network for digit recognition, we are able to reduce the energy due to Multiply-And-Accumulate (MAC) operations by 25%, compared to a straight-forward application of DVAS

    Design automation of approximate circuits with runtime reconfigurable accuracy

    Get PDF
    Leveraging the inherent error tolerance of a vast number of application domains that are rapidly growing, approximate computing arises as a design alternative to improve the efficiency of our computing systems by trading accuracy for energy savings. However, the requirement for computational accuracy is not fixed. Controlling the applied level of approximation dynamically at runtime is a key to effectively optimize energy, while still containing and bounding the induced errors at runtime. In this paper, we propose and implement an automatic and circuit independent design framework that generates approximate circuits with dynamically reconfigurable accuracy at runtime. The generated circuits feature varying accuracy levels, supporting also accurate execution. Extensive experimental evaluation, using industry strength flow and circuits, demonstrates that our generated approximate circuits improve the energy by up to 41% for 2% error bound and by 17.5% on average under a pessimistic scenario that assumes full accuracy requirement in the 33% of the runtime. To demonstrate further the efficiency of our framework, we considered two state-of-the-art technology libraries which are a 7nm conventional FinFET and an emerging technology that boosts performance at a high cost of increased dynamic power

    A Reconfigurable Digital Multiplier and 4:2 Compressor Cells Design

    Get PDF
    With the continually growing use of portable computing devices and increasingly complex software applications, there is a constant push for low power high speed circuitry to support this technology. Because of the high usage and large complex circuitry required to carry out arithmetic operations used in applications such as digital signal processing, there has been a great focus on increasing the efficiency of computer arithmetic circuitry. A key player in the realm of computer arithmetic is the digital multiplier and because of its size and power consumption, it has moved to the forefront of today\u27s research. A digital reconfigurable multiplier architecture will be introduced. Regulated by a 2-bit control signal, the multiplier is capable of double and single precision multiplication, as well as fault tolerant and dual throughput single precision execution. The architecture proposed in this thesis is centered on a recursive multiplication algorithm, where a large multiplication is carried out using recursions of simpler submultiplier modules. Within each sub-multiplier module, instead of carry save adder arrays, 4:2 compressor rows are utilized for partial product reduction, which present greater efficiency, thus result in lower delay and power consumption of the whole multiplier. In addition, a study of various digital logic circuit styles are initially presented, and then three different designs of 4:2 compressor in Domino Logic are presented and simulation results confirm the property of proposed design in terms of delay, power consumption and operation frequenc

    Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

    Full text link
    Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator, that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of BitFusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, BitFusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction at 45 nm node when BitFusion area and frequency are set to those of Stripes. Scaling to GPU technology node of 16 nm, BitFusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while BitFusion merely consumes 895 milliwatts of power

    Towards Energy-Efficient and Reliable Computing: From Highly-Scaled CMOS Devices to Resistive Memories

    Get PDF
    The continuous increase in transistor density based on Moore\u27s Law has led us to highly scaled Complementary Metal-Oxide Semiconductor (CMOS) technologies. These transistor-based process technologies offer improved density as well as a reduction in nominal supply voltage. An analysis regarding different aspects of 45nm and 15nm technologies, such as power consumption and cell area to compare these two technologies is proposed on an IEEE 754 Single Precision Floating-Point Unit implementation. Based on the results, using the 15nm technology offers 4-times less energy and 3-fold smaller footprint. New challenges also arise, such as relative proportion of leakage power in standby mode that can be addressed by post-CMOS technologies. Spin-Transfer Torque Random Access Memory (STT-MRAM) has been explored as a post-CMOS technology for embedded and data storage applications seeking non-volatility, near-zero standby energy, and high density. Towards attaining these objectives for practical implementations, various techniques to mitigate the specific reliability challenges associated with STT-MRAM elements are surveyed, classified, and assessed herein. Cost and suitability metrics assessed include the area of nanomagmetic and CMOS components per bit, access time and complexity, Sense Margin (SM), and energy or power consumption costs versus resiliency benefits. In an attempt to further improve the Process Variation (PV) immunity of the Sense Amplifiers (SAs), a new SA has been introduced called Adaptive Sense Amplifier (ASA). ASA can benefit from low Bit Error Rate (BER) and low Energy Delay Product (EDP) by combining the properties of two of the commonly used SAs, Pre-Charge Sense Amplifier (PCSA) and Separated Pre-Charge Sense Amplifier (SPCSA). ASA can operate in either PCSA or SPCSA mode based on the requirements of the circuit such as energy efficiency or reliability. Then, ASA is utilized to propose a novel approach to actually leverage the PV in Non-Volatile Memory (NVM) arrays using Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce the latency and power consumption while maintaining acceptable access time

    ๊ทผ์‚ฌ ์ปดํ“จํŒ…์„ ์ด์šฉํ•œ ํšŒ๋กœ ๋…ธํ™” ๋ณด์ƒ๊ณผ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์‹ ๊ฒฝ๋ง ๊ตฌํ˜„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์ดํ˜์žฌ.Approximate computing reduces the cost (energy and/or latency) of computations by relaxing the correctness (i.e., precision) of computations up to the level, which is dependent on types of applications. Moreover, it can be realized in various hierarchies of computing system design from circuit level to application level. This dissertation presents the methodologies applying approximate computing across such hierarchies; compensating aging-induced delay in logic circuit by dynamic computation approximation (Chapter 1), designing energy-efficient neural network by combining low-power and low-latency approximate neuron models (Chapter 2), and co-designing in-memory gradient descent module with neural processing unit so as to address a memory bottleneck incurred by memory I/O for high-precision data (Chapter 3). The first chapter of this dissertation presents a novel design methodology to turn the timing violation caused by aging into computation approximation error without the reliability guardband or increasing the supply voltage. It can be realized by accurately monitoring the critical path delay at run-time. The proposal is evaluated at two levels: RTL component level and system level. The experimental results at the RTL component level show a significant improvement in terms of (normalized) mean squared error caused by the timing violation and, at the system level, show that the proposed approach successfully transforms the aging-induced timing violation errors into much less harmful computation approximation errors, therefore it recovers image quality up to perceptually acceptable levels. It reduces the dynamic and static power consumption by 21.45% and 10.78%, respectively, with 0.8% area overhead compared to the conventional approach. The second chapter of this dissertation presents an energy-efficient neural network consisting of alternative neuron models; Stochastic-Computing (SC) and Spiking (SP) neuron models. SC has been adopted in various fields to improve the power efficiency of systems by performing arithmetic computations stochastically, which approximates binary computation in conventional computing systems. Moreover, a recent work showed that deep neural network (DNN) can be implemented in the manner of stochastic computing and it greatly reduces power consumption. However, Stochastic DNN (SC-DNN) suffers from problem of high latency as it processes only a bit per cycle. To address such problem, it is proposed to adopt Spiking DNN (SP-DNN) as an input interface for SC-DNN since SP effectively processes more bits per cycle than SC-DNN. Moreover, this chapter resolves the encoding mismatch problem, between two different neuron models, without hardware cost by compensating the encoding mismatch with synapse weight calibration. A resultant hybrid DNN (SPSC-DNN) consists of SP-DNN as bottom layers and SC-DNN as top layers. Exploiting the reduced latency from SP-DNN and low-power consumption from SC-DNN, the proposed SPSC-DNN achieves improved energy-efficiency with lower error-rate compared to SC-DNN and SP-DNN in same network configuration. The third chapter of this dissertation proposes GradPim architecture, which accelerates the parameter updates by in-memory processing which is codesigned with 8-bit floating-point training in Neural Processing Unit (NPU) for deep neural networks. By keeping the high precision processing algorithms in memory, such as the parameter update incorporating high-precision weights in its computation, the GradPim architecture can achieve high computational efficiency using 8-bit floating point in NPU and also gain power efficiency by eliminating massive high-precision data transfers between NPU and off-chip memory. A simple extension of DDR4 SDRAM utilizing bank-group parallelism makes the operation designs in processing-in-memory (PIM) module efficient in terms of hardware cost and performance. The experimental results show that the proposed architecture can improve the performance of the parameter update phase in the training by up to 40% and greatly reduce the memory bandwidth requirement while posing only a minimal amount of overhead to the protocol and the DRAM area.๊ทผ์‚ฌ ์ปดํ“จํŒ…์€ ์—ฐ์‚ฐ์˜ ์ •ํ™•๋„์˜ ์†์‹ค์„ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋ณ„ ์ ์ ˆํ•œ ์ˆ˜์ค€๊นŒ์ง€ ํ—ˆ์šฉํ•จ์œผ๋กœ์จ ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ๋น„์šฉ (์—๋„ˆ์ง€๋‚˜ ์ง€์—ฐ์‹œ๊ฐ„)์„ ์ค„์ธ๋‹ค. ๊ฒŒ๋‹ค๊ฐ€, ๊ทผ์‚ฌ ์ปดํ“จํŒ…์€ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ ์„ค๊ณ„์˜ ํšŒ๋กœ ๊ณ„์ธต๋ถ€ํ„ฐ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ๊ณ„์ธต๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๊ณ„์ธต์— ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ทผ์‚ฌ ์ปดํ“จํŒ… ๋ฐฉ๋ฒ•๋ก ์„ ๋‹ค์–‘ํ•œ ์‹œ์Šคํ…œ ์„ค๊ณ„์˜ ๊ณ„์ธต์— ์ ์šฉํ•˜์—ฌ ์ „๋ ฅ๊ณผ ์—๋„ˆ์ง€ ์ธก๋ฉด์—์„œ ์ด๋“์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•๋“ค์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋Š”, ์—ฐ์‚ฐ ๊ทผ์‚ฌํ™” (computation Approximation)๋ฅผ ํ†ตํ•ด ํšŒ๋กœ์˜ ๋…ธํ™”๋กœ ์ธํ•ด ์ฆ๊ฐ€๋œ ์ง€์—ฐ์‹œ๊ฐ„์„ ์ถ”๊ฐ€์ ์ธ ์ „๋ ฅ์†Œ๋ชจ ์—†์ด ๋ณด์ƒํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ (์ฑ•ํ„ฐ 1), ๊ทผ์‚ฌ ๋‰ด๋Ÿฐ๋ชจ๋ธ (approximate neuron model)์„ ์ด์šฉํ•ด ์—๋„ˆ์ง€ ํšจ์œจ์ด ๋†’์€ ์‹ ๊ฒฝ๋ง์„ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ• (์ฑ•ํ„ฐ 2), ๊ทธ๋ฆฌ๊ณ  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์œผ๋กœ ์ธํ•œ ๋ณ‘๋ชฉํ˜„์ƒ ๋ฌธ์ œ๋ฅผ ๋†’์€ ์ •ํ™•๋„ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์—ฐ์‚ฐ์„ ๋ฉ”๋ชจ๋ฆฌ ๋‚ด์—์„œ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ ์™„ํ™”์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์„ (์ฑ•ํ„ฐ3) ์ œ์•ˆํ•˜์˜€๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ฑ•ํ„ฐ๋Š” ํšŒ๋กœ์˜ ๋…ธํ™”๋กœ ์ธํ•œ ์ง€์—ฐ์‹œ๊ฐ„์œ„๋ฐ˜์„ (timing violation) ์„ค๊ณ„๋งˆ์ง„์ด๋‚˜ (reliability guardband) ๊ณต๊ธ‰์ „๋ ฅ์˜ ์ฆ๊ฐ€ ์—†์ด ์—ฐ์‚ฐ์˜ค์ฐจ (computation approximation error)๋ฅผ ํ†ตํ•ด ๋ณด์ƒํ•˜๋Š” ์„ค๊ณ„๋ฐฉ๋ฒ•๋ก  (design methodology)๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์ฃผ์š”๊ฒฝ๋กœ์˜ (critical path) ์ง€์—ฐ์‹œ๊ฐ„์„ ๋™์ž‘์‹œ๊ฐ„์— ์ •ํ™•ํ•˜๊ฒŒ ์ธก์ •ํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค. ์—ฌ๊ธฐ์„œ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์€ RTL component์™€ system ๋‹จ๊ณ„์—์„œ ํ‰๊ฐ€๋˜์—ˆ๋‹ค. RTL component ๋‹จ๊ณ„์˜ ์‹คํ—˜๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด ์ œ์•ˆํ•œ ๋ฐฉ์‹์ด ํ‘œ์ค€ํ™”๋œ ํ‰๊ท ์ œ๊ณฑ์˜ค์ฐจ๋ฅผ (normalized mean squared error) ์ƒ๋‹นํžˆ ์ค„์˜€์Œ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  system ๋‹จ๊ณ„์—์„œ๋Š” ์ด๋ฏธ์ง€์ฒ˜๋ฆฌ ์‹œ์Šคํ…œ์—์„œ ์ด๋ฏธ์ง€์˜ ํ’ˆ์งˆ์ด ์ธ์ง€์ ์œผ๋กœ ์ถฉ๋ถ„ํžˆ ํšŒ๋ณต๋˜๋Š” ๊ฒƒ์„ ๋ณด์ž„์œผ๋กœ์จ ํšŒ๋กœ๋…ธํ™”๋กœ ์ธํ•ด ๋ฐœ์ƒํ•œ ์ง€์—ฐ์‹œ๊ฐ„์œ„๋ฐ˜ ์˜ค์ฐจ๊ฐ€ ์—๋Ÿฌ์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์€ ์—ฐ์‚ฐ์˜ค์ฐจ๋กœ ๋ณ€๊ฒฝ๋˜๋Š” ๊ฒƒ์„ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๊ฒฐ๋ก ์ ์œผ๋กœ, ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•๋ก ์„ ๋”ฐ๋ž์„ ๋•Œ 0.8%์˜ ๊ณต๊ฐ„์„ (area) ๋” ์‚ฌ์šฉํ•˜๋Š” ๋น„์šฉ์„ ์ง€๋ถˆํ•˜๊ณ  21.45%์˜ ๋™์ ์ „๋ ฅ์†Œ๋ชจ์™€ (dynamic power consumption) 10.78%์˜ ์ •์ ์ „๋ ฅ์†Œ๋ชจ์˜ (static power consumption) ๊ฐ์†Œ๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ์ฑ•ํ„ฐ๋Š” ๊ทผ์‚ฌ ๋‰ด๋Ÿฐ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜๋Š” ๊ณ -์—๋„ˆ์ง€ํšจ์œจ์˜ ์‹ ๊ฒฝ๋ง์„ (neural network) ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•œ ๋‘ ๊ฐ€์ง€์˜ ๊ทผ์‚ฌ ๋‰ด๋Ÿฐ๋ชจ๋ธ์€ ํ™•๋ฅ ์ปดํ“จํŒ…๊ณผ (stochastic computing) ์ŠคํŒŒ์ดํ‚น๋‰ด๋Ÿฐ (spiking neuron) ์ด๋ก ๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋ธ๋ง๋˜์—ˆ๋‹ค. ํ™•๋ฅ ์ปดํ“จํŒ…์€ ์‚ฐ์ˆ ์—ฐ์‚ฐ๋“ค์„ ํ™•๋ฅ ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ ์ด์ง„์—ฐ์‚ฐ์„ ๋‚ฎ์€ ์ „๋ ฅ์†Œ๋ชจ๋กœ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ตœ๊ทผ์— ํ™•๋ฅ ์ปดํ“จํŒ… ๋‰ด๋Ÿฐ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง (deep neural network)๋ฅผ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ํ™•๋ฅ ์ปดํ“จํŒ…์„ ๋‰ด๋Ÿฐ๋ชจ๋ธ๋ง์— ํ™œ์šฉํ•  ๊ฒฝ์šฐ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์ด ๋งค ํด๋ฝ์‚ฌ์ดํด๋งˆ๋‹ค (clock cycle) ํ•˜๋‚˜์˜ ๋น„ํŠธ๋งŒ์„ (bit) ์ฒ˜๋ฆฌํ•˜๋ฏ€๋กœ, ์ง€์—ฐ์‹œ๊ฐ„ ์ธก๋ฉด์—์„œ ๋งค์šฐ ๋‚˜์  ์ˆ˜ ๋ฐ–์— ์—†๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ŠคํŒŒ์ดํ‚น ๋‰ด๋Ÿฐ๋ชจ๋ธ๋กœ ๊ตฌ์„ฑ๋œ ์ŠคํŒŒ์ดํ‚น ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ํ™•๋ฅ ์ปดํ“จํŒ…์„ ํ™œ์šฉํ•œ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์™€ ๊ฒฐํ•ฉํ•˜์˜€๋‹ค. ์ŠคํŒŒ์ดํ‚น ๋‰ด๋Ÿฐ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋งค ํด๋ฝ์‚ฌ์ดํด๋งˆ๋‹ค ์—ฌ๋Ÿฌ ๋น„ํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ์ž…๋ ฅ ์ธํ„ฐํŽ˜์ด์Šค๋กœ ์‚ฌ์šฉ๋  ๊ฒฝ์šฐ ์ง€์—ฐ์‹œ๊ฐ„์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ํ™•๋ฅ ์ปดํ“จํŒ… ๋‰ด๋Ÿฐ๋ชจ๋ธ๊ณผ ์ŠคํŒŒ์ดํ‚น ๋‰ด๋Ÿฐ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋ถ€ํ˜ธํ™” (encoding) ๋ฐฉ์‹์ด ๋‹ค๋ฅธ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•ด๋‹น ๋ถ€ํ˜ธํ™” ๋ถˆ์ผ์น˜ ๋ฌธ์ œ๋ฅผ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šตํ•  ๋•Œ ๊ณ ๋ คํ•จ์œผ๋กœ์จ, ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์˜ ๊ฐ’์ด ๋ถ€ํ˜ธํ™” ๋ถˆ์ผ์น˜๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ์กฐ์ ˆ (calibration) ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ๋ถ„์„์˜ ๊ฒฐ๊ณผ๋กœ, ์•ž ์ชฝ์—๋Š” ์ŠคํŒŒ์ดํ‚น ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ๋ฐฐ์น˜ํ•˜๊ณ  ๋’ท ์ชฝ์• ๋Š” ํ™•๋ฅ ์ปดํ“จํŒ… ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ๋ฐฐ์น˜ํ•˜๋Š” ํ˜ผ์„ฑ์‹ ๊ฒฝ๋ง์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ํ˜ผ์„ฑ์‹ ๊ฒฝ๋ง์€ ์ŠคํŒŒ์ดํ‚น ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ๋งค ํด๋ฝ์‚ฌ์ดํด๋งˆ๋‹ค ์ฒ˜๋ฆฌ๋˜๋Š” ๋น„ํŠธ ์–‘์˜ ์ฆ๊ฐ€๋กœ ์ธํ•œ ์ง€์—ฐ์‹œ๊ฐ„ ๊ฐ์†Œ ํšจ๊ณผ์™€ ํ™•๋ฅ ์ปดํ“จํŒ… ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ์ €์ „๋ ฅ ์†Œ๋ชจ ํŠน์„ฑ์„ ๋ชจ๋‘ ํ™œ์šฉํ•จ์œผ๋กœ์จ ๊ฐ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ ๋”ฐ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ ๋Œ€๋น„ ์šฐ์ˆ˜ํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์ •ํ™•๋„ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ฉด์„œ ๋‹ฌ์„ฑํ•œ๋‹ค. ์„ธ ๋ฒˆ์งธ ์ฑ•ํ„ฐ๋Š” ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์„ 8๋น„ํŠธ ๋ถ€๋™์†Œ์ˆซ์  ์—ฐ์‚ฐ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ์‹ ๊ฒฝ๋ง์ฒ˜๋ฆฌ์œ ๋‹›์˜ (neural processing unit) ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ์„ (parameter update) ๋ฉ”๋ชจ๋ฆฌ-๋‚ด-์—ฐ์‚ฐ์œผ๋กœ (in-memory processing) ๊ฐ€์†ํ•˜๋Š” GradPIM ์•„ํ‚คํ…์ณ๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. GradPIM์€ 8๋น„ํŠธ์˜ ๋‚ฎ์€ ์ •ํ™•๋„ ์—ฐ์‚ฐ์€ ์‹ ๊ฒฝ๋ง์ฒ˜๋ฆฌ์œ ๋‹›์— ๋‚จ๊ธฐ๊ณ , ๋†’์€ ์ •ํ™•๋„๋ฅผ ๊ฐ€์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜๋Š” ์—ฐ์‚ฐ์€ (ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ) ๋ฉ”๋ชจ๋ฆฌ ๋‚ด๋ถ€์— ๋‘ ์œผ๋กœ์จ ์‹ ๊ฒฝ๋ง์ฒ˜๋ฆฌ์œ ๋‹›๊ณผ ๋ฉ”๋ชจ๋ฆฌ๊ฐ„์˜ ๋ฐ์ดํ„ฐํ†ต์‹ ์˜ ์–‘์„ ์ค„์—ฌ, ๋†’์€ ์—ฐ์‚ฐํšจ์œจ๊ณผ ์ „๋ ฅํšจ์œจ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋˜ํ•œ, GradPIM์€ bank-group ์ˆ˜์ค€์˜ ๋ณ‘๋ ฌํ™”๋ฅผ ์ด๋ฃจ์–ด ๋‚ด ๋†’์€ ๋‚ด๋ถ€ ๋Œ€์—ญํญ์„ ํ™œ์šฉํ•จ์œผ๋กœ์จ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํฌ๊ฒŒ ํ™•์žฅ์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค. ๋˜ํ•œ ์ด๋Ÿฌํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์กฐ์˜ ๋ณ€๊ฒฝ์ด ์ตœ์†Œํ™”๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€์ ์ธ ํ•˜๋“œ์›จ์–ด ๋น„์šฉ๋„ ์ตœ์†Œํ™”๋˜์—ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด GradPIM์ด ์ตœ์†Œํ•œ์˜ DRAM ํ”„๋กœํ† ์ฝœ ๋ณ€ํ™”์™€ DRAM์นฉ ๋‚ด์˜ ๊ณต๊ฐ„์‚ฌ์šฉ์„ ํ†ตํ•ด ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต๊ณผ์ • ์ค‘ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ์— ํ•„์š”ํ•œ ์‹œ๊ฐ„์„ 40%๋งŒํผ ํ–ฅ์ƒ์‹œ์ผฐ์Œ์„ ๋ณด์˜€๋‹ค.Chapter I: Dynamic Computation Approximation for Aging Compensation 1 1.1 Introduction 1 1.1.1 Chip Reliability 1 1.1.2 Reliability Guardband 2 1.1.3 Approximate Computing in Logic Circuits 2 1.1.4 Computation approximation for Aging Compensation 3 1.1.5 Motivational Case Study 4 1.2 Previous Work 5 1.2.1 Aging-induced Delay 5 1.2.2 Delay-Configurable Circuits 6 1.3 Proposed System 8 1.3.1 Overview of the Proposed System 8 1.3.2 Proposed Adder 9 1.3.3 Proposed Multiplier 11 1.3.4 Proposed Monitoring Circuit 16 1.3.5 Aging Compensation Scheme 19 1.4 Design Methodology 20 1.5 Evaluation 24 1.5.1 Experimental setup 24 1.5.2 RTL component level Adder/Multiplier 27 1.5.3 RTL component level Monitoring circuit 30 1.5.4 System level 31 1.6 Summary 38 Chapter II: Energy-Efficient Neural Network by Combining Approximate Neuron Models 40 2.1 Introduction 40 2.1.1 Deep Neural Network (DNN) 40 2.1.2 Low-power designs for DNN 41 2.1.3 Stochastic-Computing Deep Neural Network 41 2.1.4 Spiking Deep Neural Network 43 2.2 Hybrid of Stochastic and Spiking DNNs 44 2.2.1 Stochastic-Computing vs Spiking Deep Neural Network 44 2.2.2 Combining Spiking Layers and Stochastic Layers 46 2.2.3 Encoding Mismatch 47 2.3 Evaluation 49 2.3.1 Latency and Test Error 49 2.3.2 Energy Efficiency 51 2.4 Summary 54 Chapter III: GradPIM: In-memory Gradient Descent in Mixed-Precision DNN Training 55 3.1 Introduction 55 3.1.1 Neural Processing Unit 55 3.1.2 Mixed-precision Training 56 3.1.3 Mixed-precision Training with In-memory Gradient Descent 57 3.1.4 DNN Parameter Update Algorithms 59 3.1.5 Modern DRAM Architecture 61 3.1.6 Motivation 63 3.2 Previous Work 65 3.2.1 Processing-In-Memory 65 3.2.2 Co-design Neural Processing Unit and Processing-In-Memory 66 3.2.3 Low-precision Computation in NPU 67 3.3 GradPIM 68 3.3.1 GradPIM Architecture 68 3.3.2 GradPIM Operations 69 3.3.3 Timing Considerations 70 3.3.4 Update Phase Procedure 73 3.3.5 Commanding GradPIM 75 3.4 NPU Co-design with GradPIM 76 3.4.1 NPU Architecture 76 3.4.2 Data Placement 79 3.5 Evaluation 82 3.5.1 Evaluation Methodology 82 3.5.2 Experimental Results 83 3.5.3 Sensitivity Analysis 88 3.5.4 Layer Characterizations 90 3.5.5 Distributed Data Parallelism 90 3.6 Summary 92 3.6.1 Discussion 92 Bibliography 113 ์š”์•ฝ 114Docto
    • โ€ฆ
    corecore