53 research outputs found

    A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

    Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7 x over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7 x energy efficiency improvement of NTX over contemporary GPUs at 4.4 x less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95 percent parallel and energy efficiency, while providing 2.1 x energy savings or 3.1 x performance improvement over a GPU-based system

    Circuits and Systems Advances in Near Threshold Computing

    Modern society is witnessing a sea change in ubiquitous computing, in which people have embraced computing systems as an indispensable part of day-to-day existence. Computation, storage, and communication abilities of smartphones, for example, have undergone monumental changes over the past decade. However, global emphasis on creating and sustaining green environments is leading to a rapid and ongoing proliferation of edge computing systems and applications. As a broad spectrum of healthcare, home, and transport applications shift to the edge of the network, near-threshold computing (NTC) is emerging as one of the promising low-power computing platforms. An NTC device sets its supply voltage close to its threshold voltage, dramatically reducing the energy consumption. Despite showing substantial promise in terms of energy efficiency, NTC is yet to see widescale commercial adoption. This is because circuits and systems operating with NTC suffer from several problems, including increased sensitivity to process variation, reliability problems, performance degradation, and security vulnerabilities, to name a few. To realize its potential, we need designs, techniques, and solutions to overcome these challenges associated with NTC circuits and systems. The readers of this book will be able to familiarize themselves with recent advances in electronics systems, focusing on near-threshold computing
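    As a rough illustration of why setting the supply voltage close to the threshold voltage reduces energy: dynamic switching energy in CMOS scales with the square of the supply voltage (E = C * V^2). The sketch below assumes a hypothetical switched capacitance of 1 pF and hypothetical supply voltages of 1.0 V (nominal) and 0.5 V (near threshold). None of these values come from the book, and leakage energy and the slower clock at low voltage are deliberately ignored.

```python
def dynamic_energy_joules(capacitance_farads: float, vdd_volts: float) -> float:
    """Energy of one full switching event, E = C * V^2 (leakage ignored)."""
    return capacitance_farads * vdd_volts ** 2


# Hypothetical values, chosen only for illustration.
SWITCHED_CAPACITANCE = 1e-12  # 1 pF (assumption)
NOMINAL_VDD = 1.0             # volts (assumption)
NEAR_THRESHOLD_VDD = 0.5      # volts (assumption)

nominal_energy = dynamic_energy_joules(SWITCHED_CAPACITANCE, NOMINAL_VDD)
ntc_energy = dynamic_energy_joules(SWITCHED_CAPACITANCE, NEAR_THRESHOLD_VDD)

# Halving the supply voltage cuts dynamic energy per operation by a factor of four.
energy_ratio = nominal_energy / ntc_energy
```

    The quadratic dependence is the reason for the energy savings; the process-variation, reliability, and performance problems listed in the abstract are the cost of operating that close to the threshold.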

    Clock multiplication techniques for high-speed I/Os

    Generation of a low-jitter, high-frequency clock from a low-frequency reference clock using classical analog phase-locked loops (PLLs) requires a large loop filter capacitor and a power-hungry oscillator. Digital PLLs can help reduce area, but their jitter performance is severely degraded by quantization error. This dissertation explores different clock multiplication techniques suitable for high-speed wireline systems. With an emphasis on ring-oscillator-based architectures using cascaded stages, three possible architectures are explored. First, a scrambling time-to-digital converter (STDC) is presented to improve deterministic jitter (DJ) performance when used with a low-frequency reference clock. A cascaded architecture with a digital multiplying delay-locked loop (MDLL) as the first stage and a hybrid analog/digital PLL as the second stage is used to achieve low random jitter in a power-efficient manner. Fabricated in a 90 nm CMOS process, the prototype frequency synthesizer consumes 4.76 mW from a 1.0 V supply and generates 160 MHz and 2.56 GHz output clocks from a 1.25 MHz crystal reference frequency. The long-term absolute jitter of the 60 MHz digital MDLL and 2.56 GHz digital PLL outputs is 2.4 ps rms and 4.18 ps rms, while the peak-to-peak jitter is 22.1 ps and 35.2 ps, respectively. The proposed frequency synthesizer occupies an active die area of 0.16 mm² and achieves a power efficiency of 1.86 mW/GHz. Second, a hybrid phase/current-mode phase interpolator (HPC-PI) is presented to improve the phase noise performance of ring-oscillator-based fractional-N PLLs. The proposed HPC-PI alleviates the bandwidth trade-off between VCO phase noise suppression and ΔΣ quantization noise suppression. By combining the phase detection and interpolation functions into an XOR phase detector/interpolator (XOR PD-PI) block, accurate quantization error cancellation is achieved without using calibration.
Use of a digital MDLL in front of the fractional-N PLL helps alleviate the bandwidth limitation imposed by the reference frequency and enables further bandwidth extension. The extended bandwidth helps suppress the ring-VCO phase noise and lower the in-band noise floor. Fabricated in a 65 nm CMOS process, the prototype generates fractional frequencies from 4.25 to 4.75 GHz, with an in-band phase noise floor of -104 dBc/Hz and 1.5 ps rms integrated jitter. The clock multiplier achieves a power efficiency of 2.4 mW/GHz and a FoM of -225.8 dB. Finally, efficient clock generation, recovery, and distribution techniques for flexible-rate transceivers are presented. Using a fixed-frequency low-jitter clock provided by an integer-N PLL, fractional frequencies are generated and recovered locally using multi-phase fractional clock multipliers. Fabricated in a 65 nm CMOS process, the prototype transceiver can be programmed to operate at any rate from 3 to 10 Gb/s. At 10 Gb/s, the integrated jitter of the Tx output and the recovered clock is 360 fs rms and 758 fs rms, respectively.
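    The figures quoted for the 90 nm prototype can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 160 MHz output is the first-stage (MDLL) clock feeding the second-stage PLL, and that the power efficiency is normalized to the 2.56 GHz output; both are readings of the abstract, not statements from the dissertation itself. The function names are hypothetical.

```python
# Figures quoted in the abstract for the 90 nm prototype.
REF_MHZ = 1.25         # crystal reference frequency
MDLL_OUT_MHZ = 160.0   # assumed to be the first-stage output
PLL_OUT_MHZ = 2560.0   # second-stage output (2.56 GHz)
POWER_MW = 4.76        # total power consumption


def multiplication_factor(f_out_mhz: float, f_in_mhz: float) -> float:
    """Ratio between the output and input frequency of one stage."""
    return f_out_mhz / f_in_mhz


def power_efficiency_mw_per_ghz(power_mw: float, f_out_mhz: float) -> float:
    """Power normalized to the output frequency, in mW/GHz."""
    return power_mw / (f_out_mhz / 1000.0)


stage1 = multiplication_factor(MDLL_OUT_MHZ, REF_MHZ)      # 1.25 MHz -> 160 MHz
stage2 = multiplication_factor(PLL_OUT_MHZ, MDLL_OUT_MHZ)  # 160 MHz -> 2.56 GHz
efficiency = power_efficiency_mw_per_ghz(POWER_MW, PLL_OUT_MHZ)
```

    Under these assumptions the two stages multiply by 128 and 16 (2048 overall), and the efficiency rounds to the 1.86 mW/GHz quoted in the abstract.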

    Design Techniques for Energy-Quality Scalable Digital Systems

    Energy efficiency is one of the key design goals in modern computing. Increasingly complex tasks are being executed in mobile devices and Internet of Things end-nodes, which are expected to operate for long time intervals, on the order of months or years, with the limited energy budgets provided by small form-factor batteries. Fortunately, many such tasks are error resilient, meaning that they can tolerate some relaxation in the accuracy, precision or reliability of internal operations without a significant impact on the overall output quality. The error resilience of an application may derive from a number of factors. The processing of analog sensor inputs measuring quantities from the physical world may not always require maximum precision, as the amount of information that can be extracted is limited by the presence of external noise. Outputs destined for human consumption may also contain small or occasional errors, thanks to the limited capabilities of our vision and hearing systems. Finally, some computational patterns commonly found in domains such as statistics, machine learning and operational research naturally tend to reduce or eliminate errors. Energy-Quality (EQ) scalable digital systems systematically trade off the quality of computations against energy efficiency, by relaxing the precision, the accuracy, or the reliability of internal software and hardware components in exchange for energy reductions. This design paradigm is believed to offer one of the most promising solutions to the pressing need for low-energy computing. Despite these high expectations, the current state of the art in EQ scalable design suffers from important shortcomings. First, the great majority of techniques proposed in the literature focus only on processing hardware and software components. Nonetheless, for many real devices, processing contributes only a small portion of the total energy consumption, which is dominated by other components (e.g.
I/O, memory or data transfers). Second, in order to fulfill its promises and become widespread in commercial devices, EQ scalable design needs to achieve industrial-level maturity. This involves moving from purely academic research based on high-level models and theoretical assumptions to engineered flows compatible with existing industry standards. Third, the time-varying nature of error tolerance, both among different applications and within a single task, should become more central in the proposed design methods. This involves designing "dynamic" systems in which the precision or reliability of operations (and consequently their energy consumption) can be tuned at runtime, rather than "static" solutions, in which the output quality is fixed at design time. This thesis introduces several new EQ scalable design techniques for digital systems that take the previous observations into account. Besides processing, the proposed methods also apply the principles of EQ scalable design to interconnects and peripherals, which are often relevant contributors to the total energy in sensor nodes and mobile systems, respectively. Regardless of the target component, the presented techniques pay special attention to the accurate evaluation of the benefits and overheads deriving from EQ scalability, using industrial-level models, and to the integration with existing standard tools and protocols. Moreover, all the works presented in this thesis allow the dynamic reconfiguration of output quality and energy consumption. More specifically, the contribution of this thesis is divided into three parts. In a first body of work, the design of EQ scalable modules for processing hardware data paths is considered. Three design flows are presented, targeting different technologies and exploiting different ways to achieve EQ scalability, i.e. timing-induced errors and precision reduction.
These works are inspired by previous approaches from the literature, namely Reduced-Precision Redundancy and Dynamic Accuracy Scaling, which are rethought to make them compatible with standard Electronic Design Automation (EDA) tools and flows, providing solutions to overcome their main limitations. The second part of the thesis investigates the application of EQ scalable design to serial interconnects, which are the de facto standard for data exchanges between processing hardware and sensors. In this context, two novel bus encodings are proposed, called Approximate Differential Encoding and Serial-T0, that exploit the statistical characteristics of data produced by sensors to reduce the energy consumption on the bus at the cost of controlled data approximations. The two techniques achieve different results for data of different origins, but share the common features of allowing runtime reconfiguration of the allowed error and being compatible with standard serial bus protocols. Finally, the last part of the manuscript is devoted to the application of EQ scalable design principles to displays, which are often among the most energy-hungry components in mobile systems. The two proposals in this context leverage the emissive nature of Organic Light-Emitting Diode (OLED) displays to save energy by altering the displayed image, thus inducing an output quality reduction that depends on the amount of such alteration. The first technique implements an image-adaptive form of brightness scaling, whose outputs are optimized in terms of the balance between power consumption and similarity with the input. The second approach achieves concurrent power reduction and image enhancement by means of an adaptive polynomial transformation. Both solutions focus on minimizing the overheads associated with a real-time implementation of the transformations in software or hardware, so that these do not offset the savings in the display.
For each of these three topics, results show that the aforementioned goal of building EQ scalable systems that are compatible with existing best practices and mature enough to be integrated in commercial devices can be effectively achieved. Moreover, they also show that very simple and similar principles can be applied to design EQ scalable versions of different system components (processing, peripherals and I/O), and to equip these components with knobs for the runtime reconfiguration of the energy versus quality trade-off.

    Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

    High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC), to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with a very low programming effort. NeuroCluster occupies only 8 percent of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall, 11 W is consumed in a single SMC device, with an energy efficiency of 22.5 GFLOPS/W, which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy-efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.
    Authors: Azarkhish, Erfan; Rossi, Davide; Loi, Igor; Benini, Luca

    Design of Low-Power Deep Learning Training Accelerators through Quantized Learning (양자화된 학습을 통한 저전력 딥러닝 훈련 가속기 설계)

    Doctoral dissertation -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), February 2022. 전동석.
    With the advent of the deep learning era, the computational demand for processing deep neural networks (DNNs) has increased dramatically, both for training neural networks on various tasks and for performing inference with the trained networks in specific use cases.
To address these needs, a wide range of custom hardware has been proposed, from systems based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) for deployment inside data centers to acceleration blocks in systems-on-chip (SoCs) for low-power processing in mobile devices. In this dissertation, custom integrated circuits for energy-efficient neural network training are designed, fabricated, and measured in order to evaluate different methodologies for more energy-efficient processing under the same training performance constraints. In particular, these methodologies are grouped into three categories for evaluation: (1) Training algorithm. While standard deep neural network training is performed with the back-propagation (BP) algorithm, we investigate alternative training algorithms, such as neuromorphic learning algorithms with spiking neurons and bio-plausible algorithms with asymmetric feedback, whose computational properties can be exploited for more efficient hardware implementation. (2) Low-precision arithmetic. One of the most powerful methods for increasing efficiency in DNN accelerators is scaling numerical precision. While the use of low-precision numerics in the inference phase of DNNs is well studied, training DNNs at low precision without performance degradation is more challenging. A novel numerical scheme for training DNNs across various models and scenarios is proposed in this dissertation. (3) System implementation techniques. In the actual realization of a custom training system in integrated circuits, the nearly infinite design space leads to vastly different quality of results depending on the dataflow inside the chip, system load balancing, acceleration and gating blocks, and so on. Different design techniques that lead to better performance and efficiency are introduced in this dissertation. First, a neuromorphic learning system for classifying handwritten digits (MNIST) is introduced.
This learning system aims to deliver low training overhead while maintaining the training performance of classical machine learning. To achieve this goal, a neuromorphic learning algorithm is modified for a lower operation count and memory buffer requirement while maintaining or even improving machine learning performance. Moreover, implementation techniques such as an update skipping mechanism and lock-free parameter updates allow even lower training overhead, dynamically reducing the training energy overhead from 25.6% to 7.5%. With these proposed methodologies, this system greatly improves the accuracy-energy trade-off in on-chip learning systems while achieving learning performance close to that of classical DNN training through back-propagation. Second, a programmable DNN training processor with a custom numerical format is introduced. While prior DNN inference accelerators have utilized 8-bit integers, implementing 8-bit numerics for a training accelerator has remained a challenge due to the higher precision requirements in the backward step of DNN training. To overcome this limitation, a custom 8-bit floating-point format with a shared exponent bias (FP8-SEB) is introduced in this dissertation. Moreover, a processing architecture based on a 24-way fused multiply-adder (FMA) tree greatly increases processing energy efficiency per MAC, and is complemented by a novel 2-dimensional routing data path that exploits spatiality to increase data reuse in the forward, backward, and weight gradient steps of convolutional neural networks. This DNN training processor is implemented with a custom vector processing unit, acceleration instructions, and direct memory access (DMA) to external DRAM for end-to-end DNN training across various models and datasets. Compared against a prior low-precision training processor in ResNet-18 training, this work achieves 2.48× higher energy efficiency, 43% fewer DRAM accesses, and 0.8 percentage points higher training accuracy.
Both designs are fabricated in silicon and verified both in simulation and through physical measurements. The design methodologies are carefully evaluated using simulations of the fabricated chips and measurements of monitored data and power consumption under varying conditions that expose the design techniques in effect. The efficiency of the biologically plausible algorithms, novel numerical formats, and system implementation techniques is analyzed and discussed in this dissertation based on the obtained measurements.
    Contents:
    1 Introduction: 1.1 Study Background; 1.2 Purpose of Research; 1.3 Contents
    2 Hardware-Friendly Learning Algorithms: 2.1 Modified Learning Rule for Neuromorphic System (2.1.1 The Segregated Dendrites Algorithm; 2.1.2 Modification of the Segregated Dendrites Algorithm); 2.2 Non-BP Learning Rules on DNN Training Processor (2.2.1 Feedback Alignment and Direct Feedback Alignment; 2.2.2 Reduced Memory Access in Non-BP Learning Rules)
    3 Optimal Numerical Format for DNN Training: 3.1 Related Works; 3.2 Proposed FP8 with Shared Exponent Bias; 3.3 Training Results with FP8-SEB; 3.4 Fused Multiply Adder Tree for FP8-SEB
    4 System Implementations: 4.1 Neuromorphic Learning System (4.1.1 Bio-Plausibility; 4.1.2 Top Level Architecture; 4.1.3 Lock-Free Weight Updates; 4.1.4 Update Skipping Mechanism); 4.2 Low-Precision DNN Training System (4.2.1 Top Level Architecture; 4.2.2 Optimized Auxiliary Instructions in the Vector Processing Unit; 4.2.3 Buffer Organization; 4.2.4 Input-Output 2D Spatial Routing for FMA Trees)
    5 Measurement Results: 5.1 Measurement Results on the Neuromorphic Learning System (5.1.1 Measurement Results and Test Setup; 5.1.2 Comparison against other works; 5.1.3 Scalability of the Learning Algorithm); 5.2 Measurements Results on the Low-Precision DNN Training Processor (5.2.1 Measurement Results in Benchmarked Tests; 5.2.2 Comparison Against Other DNN Training Processors)
    6 Conclusion: 6.1 Discussion for Future Works (6.1.1 Scaling to CNNs in the Neuromorphic System; 6.1.2 Discussions for Improvements on DNN Training Processor); 6.2 Conclusion
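    The abstract above names an 8-bit floating-point format with a shared exponent bias (FP8-SEB) but does not define its bit layout, so the following is only a minimal sketch of the general idea, not the dissertation's format. It assumes 1 sign bit, 4 exponent bits and 3 mantissa bits, no subnormal numbers (underflow flushes to zero), saturation on overflow, and a single bias per tensor chosen so that the largest magnitude lands on the top exponent code. The function names `choose_shared_bias` and `quantize` are hypothetical.

```python
import math

EXP_BITS = 4  # assumed exponent width (not stated in the abstract)
MAN_BITS = 3  # assumed mantissa width (not stated in the abstract)
MAX_CODE = 2 ** EXP_BITS - 1


def choose_shared_bias(values):
    """Pick one exponent bias for the whole tensor so that the largest
    magnitude maps to the highest exponent code."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0
    _, e = math.frexp(max_abs)       # max_abs = m * 2**e with 0.5 <= m < 1
    return MAX_CODE - (e - 1)        # stored code = true exponent + bias


def quantize(values, bias):
    """Round each value to the nearest representable number and return the
    dequantized floats (what the hardware would effectively compute with)."""
    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)
            continue
        m, e = math.frexp(abs(v))
        exponent = e - 1                              # |v| = (2m) * 2**exponent, 1 <= 2m < 2
        frac = round((2 * m - 1) * 2 ** MAN_BITS)     # rounded mantissa field
        if frac == 2 ** MAN_BITS:                     # rounding carried into the exponent
            frac, exponent = 0, exponent + 1
        code = exponent + bias
        if code < 0:                                  # too small to represent: flush to zero
            out.append(0.0)
            continue
        if code > MAX_CODE:                           # too large: saturate
            code, frac = MAX_CODE, 2 ** MAN_BITS - 1
        magnitude = (1 + frac / 2 ** MAN_BITS) * 2.0 ** (code - bias)
        out.append(math.copysign(magnitude, v))
    return out
```

    For example, for the tensor `[0.5, 0.3, -0.01, 0.0]`, `choose_shared_bias` places the 16 exponent codes just below the largest magnitude, so all nonzero entries stay representable. Calling `quantize([0.3], 0)` with a bias of zero instead flushes the value to zero, which illustrates why the bias is adapted to the range of each tensor rather than fixed.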

    Clock Generator Circuits for Low-Power Heterogeneous Multiprocessor Systems-on-Chip

    In this work concepts and circuits for local clock generation in low-power heterogeneous multiprocessor systems-on-chip (MPSoCs) are researched and developed. The targeted systems feature a globally asynchronous locally synchronous (GALS) clocking architecture and advanced power management functionality, as for example fine-grained ultra-fast dynamic voltage and frequency scaling (DVFS). To enable this functionality compact clock generators with low chip area, low power consumption, wide output frequency range and the capability for ultra-fast frequency changes are required. They are to be instantiated individually per core. For this purpose compact all digital phase-locked loop (ADPLL) frequency synthesizers are developed. The bang-bang ADPLL architecture is analyzed using a numerical system model and optimized for low jitter accumulation. A 65nm CMOS ADPLL is implemented, featuring a novel active current bias circuit which compensates the supply voltage and temperature sensitivity of the digitally controlled oscillator (DCO) for reduced digital tuning effort. Additionally, a 28nm ADPLL with a new ultra-fast lock-in scheme based on single-shot phase synchronization is proposed. The core clock is generated by an open-loop method using phase-switching between multi-phase DCO clocks at a fixed frequency. This allows instantaneous core frequency changes for ultra-fast DVFS without re-locking the closed loop ADPLL. The sensitivity of the open-loop clock generator with respect to phase mismatch is analyzed analytically and a compensation technique by cross-coupled inverter buffers is proposed. The clock generators show small area (0.0097mm2 (65nm), 0.00234mm2 (28nm)), low power consumption (2.7mW (65nm), 0.64mW (28nm)) and they provide core clock frequencies from 83MHz to 666MHz which can be changed instantaneously. The jitter performance is compliant to DDR2/DDR3 memory interface specifications. 
Additionally, high-speed clocks for novel serial on-chip data transceivers are generated. The ADPLL circuits have been verified successfully by 3 testchip implementations. They enable efficient realization of future low-power MPSoCs with advanced power management functionality in deep-submicron CMOS technologies.