312 research outputs found

    Harnessing resilience: biased voltage overscaling for probabilistic signal processing

    A central component of modern computing is the idea that computation requires determinism. Contrary to this belief, the primary contribution of this work shows that useful computation can be accomplished in an error-prone fashion. Focusing on low-power computing and the increasing push toward energy conservation, the work seeks to sacrifice accuracy in exchange for energy savings. Probabilistic computing forms the basis for this error-prone computation by diverging from the requirement of determinism and allowing for randomness within computing. Implemented as probabilistic CMOS (PCMOS), the approach realizes enormous energy savings in applications that require probability at an algorithmic level. Extending probabilistic computing to applications that are inherently deterministic, the biased voltage overscaling (BIVOS) technique presented here constrains the randomness introduced through PCMOS. In doing so, BIVOS limits the magnitude of any resulting deviations and realizes energy savings with minimal impact on application quality. Implemented for a ripple-carry adder, an array multiplier, and a finite-impulse-response (FIR) filter, a BIVOS solution substantially reduces energy consumption and does so with improved error rates compared to an energy-equivalent reduced-precision solution. When applied to H.264 video decoding, a BIVOS solution achieves a 33.9% reduction in energy consumption while maintaining a peak signal-to-noise ratio of 35.0 dB (compared to 14.3 dB for a comparable reduced-precision solution). While the work presented here focuses on a specific technology, the technique realized through BIVOS has far broader implications. It is the departure from the conventional mindset that useful computation requires determinism that represents the primary innovation of this work. With applicability to emerging and yet-to-be-discovered technologies, BIVOS has the potential to contribute to computing in a variety of fashions.
    PhD. Committee Chair: Anderson, David; Committee Member: Conte, Thomas; Committee Member: Ferri, Bonnie; Committee Member: Hasler, Paul; Committee Member: Mooney, Vincen
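The reason biasing helps can be illustrated with a small calculation. The sketch below is not the model used in the thesis: it only assumes that each output bit has some probability of being wrong, and the probability values are made up so that both profiles have the same total. The quantity computed, the sum of p_i * 2^i, is a simple upper bound on the mean error magnitude.

```python
def expected_error_weight(flip_probs):
    """Upper bound on mean error magnitude: sum of p_i * 2**i, where
    flip_probs[i] is the assumed probability that bit i is wrong
    (index 0 is the least significant bit)."""
    return sum(p * (1 << i) for i, p in enumerate(flip_probs))


# Hypothetical per-bit error probabilities for an 8-bit result (LSB first).
# Both profiles sum to 0.32; the values are illustrative, not measured.
uniform = [0.04] * 8
biased = [0.10, 0.08, 0.06, 0.04, 0.02, 0.01, 0.005, 0.005]

uniform_weight = expected_error_weight(uniform)  # every bit equally unreliable
biased_weight = expected_error_weight(biased)    # errors pushed toward the LSBs

print(uniform_weight, biased_weight)
```

With these assumed numbers the biased profile gives a much smaller bound, because errors are concentrated in the bits that carry the least weight.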

    High Speed and Low Power Consumption Carry Skip Adder using Binary to Excess-One Converter

    The Arithmetic and Logic Unit (ALU) is a vital component of any CPU. Within the ALU, adders play a major role not only in addition but also in many other basic arithmetic operations such as subtraction and multiplication. An efficient adder is therefore required for good performance of the ALU and, in turn, the processor. For speed optimization in adders, the most important factor is carry generation. In a fast adder, the generated carry should be driven to the output as quickly as possible, reducing the worst-case path delay, which determines the ultimate speed of the digital structure. In a conventional carry skip adder, a multiplexer is used as the skip logic, which provides good performance and efficient operation with minimal circuitry. Although it affords significant advantages, the multiplexer can introduce a large critical path delay, which leads to increased area usage and power consumption. The basic idea of this paper is to use Binary to Excess-1 Converters (BEC) to achieve lower area and power consumption.
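The abstract does not give the circuit, so the following is only a sketch of what a Binary to Excess-1 Converter computes, using the commonly cited gate-level form for four bits. The function name `bec4` is illustrative, and how the paper wires the BEC into the skip logic is not stated here.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter: returns (b + 1) mod 16,
    built only from NOT, AND and XOR on the individual bits."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1               # NOT b0
    x1 = b1 ^ b0              # flips when b0 is 1
    x2 = b2 ^ (b0 & b1)       # flips when both lower bits are 1
    x3 = b3 ^ (b0 & b1 & b2)  # flips when all three lower bits are 1
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


# Exhaustive check against ordinary integer arithmetic.
all_match = all(bec4(b) == (b + 1) % 16 for b in range(16))
print(all_match)
```

The converter needs no carry chain of full adders, which is the usual argument for its small area.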

    A low-power geometric mapping co-processor for high-speed graphics application

    In this article we present a novel design of a low-power geometric mapping co-processor that can be used in high-performance graphics systems. The processor can carry out any single transformation, or a combination of transformations, belonging to the affine transformation family, ranging from 1-D to 3-D. It allows interactive operations that can be defined either by a user (allowing it to be a stand-alone geometric transformation processor) or by a host processor (allowing it to be a co-processor that accelerates certain graphics operations). It occupies a silicon area of 6 mm² and consumes 40 mW of power when synthesized with 0.25 µm technology.
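As a reminder of what combining affine transformations means, here is a minimal 2-D sketch using homogeneous 3x3 matrices. It illustrates the mathematics only and says nothing about the co-processor's datapath. All function names are illustrative.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translate(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def scale(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def apply(m, x, y):
    """Apply a homogeneous 2-D affine matrix to the point (x, y)."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# The right-most matrix acts first: scale by 2, then translate by (3, 4).
scale_then_translate = matmul3(translate(3, 4), scale(2, 2))
# Reversed order: translate by (3, 4), then scale by 2.
translate_then_scale = matmul3(scale(2, 2), translate(3, 4))

print(apply(scale_then_translate, 1, 1))  # (5, 6)
print(apply(translate_then_scale, 1, 1))  # (8, 10)
```

Because a chain of transformations collapses into one matrix, a combined transformation costs the same per point as a single one.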

    Arithmetic logic UNIT (ALU) design using reconfigurable CMOS logic

    Using the reconfigurable logic of multi-input floating gate MOSFETs, a 4-bit ALU has been designed for 3 V operation. The ALU can perform four arithmetic and four logical operations. Multi-input floating gate (MIFG) transistors have shown promise in realizing increased functionality on a chip. A multi-input floating gate MOS transistor accepts multiple input signals, calculates the weighted sum of all input signals, and uses the result to control the ON and OFF states of the transistor. This extends the transistor's function beyond simple switching and changes the way a logic function can be realized. Implementing a design using multi-input floating gate MOSFETs reduces the transistor count and the number of interconnections. With fewer devices, a design becomes more area efficient and consumes less power. Several applications place a premium on smaller chip area and reduced power. Multi-input floating gate devices are used in memories and in analog and digital circuits. In the present work we show a successful implementation of multi-input floating gate MOSFETs in ALU design. Adders built with different design methods are compared with respect to transistor count. Our design, implemented using multi-input floating gate MOSFETs, uses the fewest transistors of the designs compared. The design was fabricated by MOSIS in a standard double-polysilicon CMOS process in 1.5 µm technology. Experimental waveforms and delay measurements are also presented.
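The weighted-sum behaviour described above can be modelled as a threshold function. The sketch below is a behavioural illustration only: the unit weights and the threshold of 2 are assumptions chosen to produce a full-adder carry, not values taken from the paper, and the function names are hypothetical.

```python
def mifg_is_on(inputs, weights, threshold):
    """Behavioural model: the device turns ON when the weighted sum
    of its inputs reaches the threshold."""
    weighted_sum = sum(w * v for w, v in zip(weights, inputs))
    return weighted_sum >= threshold


def carry_out(a, b, cin):
    """Full-adder carry as a single threshold decision (assumed weights:
    1 for each input, threshold 2, i.e. a majority function)."""
    return int(mifg_is_on((a, b, cin), (1, 1, 1), 2))


# Compare with the arithmetic definition of the carry for all 8 input cases.
carry_ok = all(carry_out(a, b, c) == (a + b + c) >> 1
               for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(carry_ok)
```

A majority function that would take several gates in conventional logic becomes one threshold decision here, which is the kind of transistor-count saving the abstract reports.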

    A Low-Area, Energy-Efficient 64-Bit Reconfigurable Carry Select Modified Tree-Based Adder for Media Signal Processing

    Multimedia systems play an essential part in our daily lives and have drastically improved the quality of life over time. Multimedia devices such as cellphones, radios, televisions, and computers require low-area, low-power reconfigurable adders to run computationally demanding algorithms for real-time audio/video signal and image processing, such as the discrete cosine transform, inverse discrete cosine transform, and fast Fourier transform. In this thesis, a novel 64-bit reconfigurable adder is proposed and implemented to reduce area and power consumption. The adder can be reconfigured at run time to different word lengths, i.e., one 64-bit, two 32-bit, four 16-bit, or eight 8-bit additions, depending on the partition signal command. A Carry Select Modified Tree (CSMT) based adder is used in the reconfigurable adder to reduce the area by 22% and the power consumption by 47% compared to the conventional design. The proposed adder, implemented in 180 nm CMOS technology at a 1.8 V supply, has a worst-case delay of 20.67 ns, an overall area of 36,417 μm², and a power consumption of 447.93 μW.
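The effect of the partition signal can be shown with a behavioural model in which carries are blocked at lane boundaries. This sketch describes only the arithmetic result of a partitioned addition; it does not model the CSMT structure, and the function name and parameters are illustrative.

```python
def partitioned_add(a, b, lane_bits, total_bits=64):
    """Add two total_bits-wide words as independent lanes of lane_bits each.
    The carry out of each lane is dropped instead of entering the next lane."""
    if total_bits % lane_bits != 0:
        raise ValueError("lane_bits must divide total_bits")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift
    return result


# Same operands, different partitioning: the carry out of bit 7 either
# stops at the 8-bit lane boundary or propagates within a 16-bit lane.
print(hex(partitioned_add(0x00FF, 0x0001, 8)))   # 0x0
print(hex(partitioned_add(0x00FF, 0x0001, 16)))  # 0x100
```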

    Subthreshold FIR Filter Architecture for Ultra Low Power Applications

    A low-power asynchronous data-path for a FIR filter bank

    Energy Aware Design and Analysis for Synchronous and Asynchronous Circuits

    Power dissipation has become a major concern for IC designers. Various low-power design techniques have been developed for synchronous circuits. Asynchronous circuits, however, have gained more interest recently due to their benefits in lower noise, easier timing control, etc. Yet few publications on energy reduction techniques for asynchronous logic are available. Power awareness indicates the ability of the system power to scale with changing conditions and quality requirements. Scalability is an important figure of merit since it allows the end user to implement operational policy, just as the user of mobile multimedia equipment needs to choose between better quality and longer battery operation time. This dissertation discusses power/energy optimization and performs analysis on both synchronous and asynchronous logic. The major contributions of this dissertation include: 1) A 2-Dimensional Pipeline Gating technique for synchronous pipelined circuits is proposed to improve their power awareness. This technique gates the corresponding clock lines connected to registers in both the vertical direction (the data flow direction) and the horizontal direction (registers within each pipeline stage) based on the current input precision. 2) Two energy reduction techniques, Signal Bypassing & Insertion and Zero Insertion, have been developed for NCL circuits. Both techniques use Nulls to replace redundant Data 0's based on the current input precision in order to reduce switching activity; Signal Bypassing & Insertion is for non-pipelined NCL circuits, and Zero Insertion is for their pipelined counterparts. A dynamic active-bit detection scheme is also developed as an extension. 3) Two energy estimation techniques, Equivalent Inverter Modeling based on Input Mapping at the transistor level and Switching Activity Modeling at the gate level, have been proposed. The former is for CMOS gates with feedback and the latter is for NCL circuits.
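Both the precision-based gating and the active-bit detection mentioned above depend on knowing how many operand bits actually carry information. The sketch below shows one simple way to express that for unsigned operands. It is an assumption-laden illustration: the dissertation's detection circuit and its handling of signed data are not described in the abstract, and the function names are hypothetical.

```python
def active_bits(value, width):
    """Number of low-order bits needed to hold an unsigned value."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return value.bit_length()


def gate_mask(value, width):
    """Mask with 1s at the upper bit positions that hold only leading zeros.
    In a gated design these positions would not need to be clocked."""
    full = (1 << width) - 1
    used = (1 << active_bits(value, width)) - 1
    return full & ~used


print(active_bits(5, 16))       # 3
print(bin(gate_mask(5, 8)))     # 0b11111000
```

The lower the input precision, the more register bits fall under the mask, which is how power can scale with the data being processed.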

    Circuit Aging Compensation and Energy-Efficient Neural Network Implementation Using Approximate Computing

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. Hyuk-Jae Lee.
    Approximate computing reduces the cost (energy and/or latency) of computations by relaxing their correctness (i.e., precision) up to a level that depends on the type of application. Moreover, it can be realized at various levels of computing system design, from the circuit level to the application level. This dissertation presents methodologies that apply approximate computing across these levels: compensating for aging-induced delay in logic circuits by dynamic computation approximation (Chapter 1), designing an energy-efficient neural network by combining low-power and low-latency approximate neuron models (Chapter 2), and co-designing an in-memory gradient descent module with a neural processing unit to address the memory bottleneck incurred by memory I/O for high-precision data (Chapter 3). The first chapter of this dissertation presents a novel design methodology that turns the timing violations caused by aging into computation approximation error, without a reliability guardband or an increase in supply voltage. It is realized by accurately monitoring the critical path delay at run time. The proposal is evaluated at two levels: the RTL component level and the system level. The experimental results at the RTL component level show a significant improvement in the (normalized) mean squared error caused by timing violations. At the system level, they show that the proposed approach successfully transforms aging-induced timing violation errors into much less harmful computation approximation errors, and thereby recovers image quality to perceptually acceptable levels. It reduces dynamic and static power consumption by 21.45% and 10.78%, respectively, with 0.8% area overhead compared to the conventional approach.
The second chapter of this dissertation presents an energy-efficient neural network consisting of alternative neuron models: Stochastic-Computing (SC) and Spiking (SP) neuron models. SC has been adopted in various fields to improve the power efficiency of systems by performing arithmetic computations stochastically, which approximates the binary computation of conventional computing systems. Moreover, recent work showed that a deep neural network (DNN) can be implemented with stochastic computing and that doing so greatly reduces power consumption. However, a stochastic DNN (SC-DNN) suffers from high latency because it processes only one bit per cycle. To address this problem, the chapter proposes adopting a spiking DNN (SP-DNN) as an input interface for the SC-DNN, since SP effectively processes more bits per cycle than SC-DNN. The chapter also resolves the encoding mismatch between the two neuron models without hardware cost by compensating for it through synapse weight calibration. The resulting hybrid DNN (SPSC-DNN) consists of SP-DNN as the bottom layers and SC-DNN as the top layers. Exploiting the reduced latency of SP-DNN and the low power consumption of SC-DNN, the proposed SPSC-DNN achieves improved energy efficiency with a lower error rate than SC-DNN and SP-DNN in the same network configuration. The third chapter of this dissertation proposes the GradPIM architecture, which accelerates parameter updates by in-memory processing that is co-designed with 8-bit floating-point training in a Neural Processing Unit (NPU) for deep neural networks. By keeping the high-precision processing in memory, such as the parameter update that incorporates high-precision weights in its computation, the GradPIM architecture can achieve high computational efficiency using 8-bit floating point in the NPU and also gain power efficiency by eliminating massive high-precision data transfers between the NPU and off-chip memory.
A simple extension of DDR4 SDRAM that utilizes bank-group parallelism makes the operation designs in the processing-in-memory (PIM) module efficient in terms of hardware cost and performance. The experimental results show that the proposed architecture can improve the performance of the parameter update phase of training by up to 40% and greatly reduce the memory bandwidth requirement, while adding only a minimal amount of overhead to the protocol and the DRAM area.
    Korean abstract (translated): Approximate computing reduces the cost of computation (energy or latency) by tolerating a loss of accuracy up to a level appropriate for each application. Moreover, approximate computing can be applied at many layers of computing system design, from the circuit layer to the application layer. This dissertation proposes methods that apply approximate computing at several layers of system design to obtain gains in power and energy: a method that uses computation approximation to compensate for the delay increase caused by circuit aging without additional power consumption (Chapter 1), a method that builds an energy-efficient neural network from approximate neuron models (Chapter 2), and a method that relieves the bottleneck caused by memory bandwidth by performing computations on high-precision data inside the memory (Chapter 3). The first chapter proposes a design methodology that compensates for timing violations caused by circuit aging through computation approximation error, without a reliability guardband or an increase in supply voltage. This requires accurately measuring the critical path delay at run time. The proposed methodology was evaluated at the RTL component level and the system level. The RTL component level results show that the proposed approach significantly reduces the normalized mean squared error. At the system level, image quality in an image processing system recovers to a perceptually sufficient level, confirming that timing violation errors caused by circuit aging are converted into computation errors of smaller magnitude. Overall, the proposed methodology reduces dynamic power consumption by 21.45% and static power consumption by 10.78% at the cost of 0.8% additional area. The second chapter proposes a highly energy-efficient neural network that uses approximate neuron models. The two approximate neuron models used in this dissertation are based on stochastic computing and spiking neuron theory. Stochastic computing performs arithmetic operations stochastically and thereby carries out binary computation with low power consumption. Recent work has shown that a deep neural network can be implemented with stochastic-computing neuron models. However, when stochastic computing is used for neuron modeling, the deep neural network processes only one bit per clock cycle, so latency is inevitably very poor. To address this problem, this dissertation combines a spiking deep neural network built from spiking neuron models with a stochastic-computing deep neural network. A spiking neuron model can process several bits per clock cycle, so using it as the input interface of the deep neural network reduces latency. However, stochastic-computing and spiking neuron models use different encoding schemes. This dissertation resolves the encoding mismatch by taking it into account when training the model parameters, so that the parameter values are calibrated for the mismatch. As a result of this analysis, a hybrid neural network is proposed that places a spiking deep neural network at the front and a stochastic-computing deep neural network at the back. By exploiting both the latency reduction from the larger number of bits processed per clock cycle in the spiking network and the low power consumption of the stochastic-computing network, the hybrid network achieves better energy efficiency than either network used alone, with similar or better accuracy. The third chapter proposes the GradPIM architecture, which uses in-memory processing to accelerate the parameter update of a neural processing unit that trains deep neural networks with 8-bit floating-point arithmetic. GradPIM leaves the low-precision 8-bit computation in the neural processing unit and keeps the computation on high-precision data (the parameter update) inside the memory, which reduces data traffic between the neural processing unit and memory and achieves high computational and power efficiency. In addition, GradPIM achieves bank-group-level parallelism and exploits the high internal bandwidth, greatly extending the memory bandwidth. Because changes to the memory structure are minimized, the additional hardware cost is also minimal. Experimental results show that GradPIM improves the time required for the parameter update during deep neural network training by 40%, with minimal changes to the DRAM protocol and minimal area use within the DRAM chip.
    Chapter I: Dynamic Computation Approximation for Aging Compensation
        1.1 Introduction
            1.1.1 Chip Reliability
            1.1.2 Reliability Guardband
            1.1.3 Approximate Computing in Logic Circuits
            1.1.4 Computation approximation for Aging Compensation
            1.1.5 Motivational Case Study
        1.2 Previous Work
            1.2.1 Aging-induced Delay
            1.2.2 Delay-Configurable Circuits
        1.3 Proposed System
            1.3.1 Overview of the Proposed System
            1.3.2 Proposed Adder
            1.3.3 Proposed Multiplier
            1.3.4 Proposed Monitoring Circuit
            1.3.5 Aging Compensation Scheme
        1.4 Design Methodology
        1.5 Evaluation
            1.5.1 Experimental setup
            1.5.2 RTL component level Adder/Multiplier
            1.5.3 RTL component level Monitoring circuit
            1.5.4 System level
        1.6 Summary
    Chapter II: Energy-Efficient Neural Network by Combining Approximate Neuron Models
        2.1 Introduction
            2.1.1 Deep Neural Network (DNN)
            2.1.2 Low-power designs for DNN
            2.1.3 Stochastic-Computing Deep Neural Network
            2.1.4 Spiking Deep Neural Network
        2.2 Hybrid of Stochastic and Spiking DNNs
            2.2.1 Stochastic-Computing vs Spiking Deep Neural Network
            2.2.2 Combining Spiking Layers and Stochastic Layers
            2.2.3 Encoding Mismatch
        2.3 Evaluation
            2.3.1 Latency and Test Error
            2.3.2 Energy Efficiency
        2.4 Summary
    Chapter III: GradPIM: In-memory Gradient Descent in Mixed-Precision DNN Training
        3.1 Introduction
            3.1.1 Neural Processing Unit
            3.1.2 Mixed-precision Training
            3.1.3 Mixed-precision Training with In-memory Gradient Descent
            3.1.4 DNN Parameter Update Algorithms
            3.1.5 Modern DRAM Architecture
            3.1.6 Motivation
        3.2 Previous Work
            3.2.1 Processing-In-Memory
            3.2.2 Co-design Neural Processing Unit and Processing-In-Memory
            3.2.3 Low-precision Computation in NPU
        3.3 GradPIM
            3.3.1 GradPIM Architecture
            3.3.2 GradPIM Operations
            3.3.3 Timing Considerations
            3.3.4 Update Phase Procedure
            3.3.5 Commanding GradPIM
        3.4 NPU Co-design with GradPIM
            3.4.1 NPU Architecture
            3.4.2 Data Placement
        3.5 Evaluation
            3.5.1 Evaluation Methodology
            3.5.2 Experimental Results
            3.5.3 Sensitivity Analysis
            3.5.4 Layer Characterizations
            3.5.5 Distributed Data Parallelism
        3.6 Summary
            3.6.1 Discussion
    Bibliography
    Abstract (in Korean)
    Docto
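The one-bit-per-cycle behaviour that the second chapter of this thesis identifies as the latency problem of stochastic computing can be seen in a small simulation. The sketch below assumes the standard unipolar encoding, in which a value in [0, 1] is the probability of a 1 in a bitstream and multiplication is a bitwise AND of two independent streams. It is a generic illustration, not the neuron model from the thesis, and all names are illustrative.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a random bitstream whose fraction
    of 1s approximates p (unipolar encoding)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(p_a, p_b, length, rng):
    """Multiply two values by ANDing independent bitstreams, one bit per
    simulated clock cycle, then decode by counting 1s."""
    a = to_bitstream(p_a, length, rng)
    b = to_bitstream(p_b, length, rng)
    ones = sum(x & y for x, y in zip(a, b))
    return ones / length


# 10,000 cycles are spent to approximate a single product, 0.5 * 0.4.
estimate = sc_multiply(0.5, 0.4, 10_000, random.Random(0))
print(estimate)
```

The multiplier itself is a single AND gate, which explains the power advantage, while the long bitstream needed for a reasonable estimate explains the latency cost.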