
    Application specific asynchronous microengines for efficient high-level control

    Technical report. Despite the growing interest in asynchronous circuits, programmable asynchronous controllers based on the idea of microprogramming have not been actively pursued. Since programmable control is widely used in many commercial ASICs (to allow late correction of design errors, to easily upgrade product families under time-to-market pressure, and even to support efficient run-time modifications to control in adaptive systems), we consider it crucial that self-timed techniques support efficient programmable control. This is especially true given that asynchronous (self-timed) circuits are well suited to realizing reactive and control-intensive designs. We offer a practical solution to programmable asynchronous control in the form of application-specific microprogrammed asynchronous controllers (or microengines). The features of our solution include a modular and easily extensible datapath structure, support for the two main styles of handshaking (two-phase and four-phase), and several efficiency measures based on exploiting concurrency between operations and on employing efficient circuit structures. Our results demonstrate that the proposed microengine can yield high performance, in fact performance close to that offered by automated high-level synthesis tools targeting custom hard-wired burst-mode machines.
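The two handshaking styles named in the abstract differ in how many signal transitions each data transfer costs. A minimal Python sketch of the two protocols (illustrative only, not the paper's circuits):

```python
# Toy models of the two handshake styles named in the abstract
# (illustrative only; not the paper's circuit implementation).

def four_phase(transfers):
    """Four-phase (return-to-zero): req and ack both rise, then both
    fall, before the next transfer can start. Four edges per transfer."""
    events = []
    for i in range(transfers):
        events += [(i, "req+"), (i, "ack+"), (i, "req-"), (i, "ack-")]
    return events

def two_phase(transfers):
    """Two-phase (transition signalling): each toggle of req, answered
    by a toggle of ack, completes a transfer. Two edges per transfer."""
    events, req, ack = [], 0, 0
    for i in range(transfers):
        req ^= 1
        events.append((i, f"req->{req}"))
        ack ^= 1
        events.append((i, f"ack->{ack}"))
    return events

# Two-phase needs half as many signal edges per transfer, which is why
# supporting both styles lets a controller trade robustness for speed.
assert len(four_phase(10)) == 40
assert len(two_phase(10)) == 20
```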

    A DRAM-based processing-in-memory microarchitecture for memory-intensive machine learning applications

    Doctoral dissertation -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), February 2022. Advisor: Jung Ho Ahn (์•ˆ์ •ํ˜ธ). Recently, as research on neural networks has gained significant traction, a number of memory-intensive neural network models such as recurrent neural network (RNN) models and recommendation models have been introduced to process various tasks. RNN models and recommendation models spend most of their execution time processing matrix-vector multiplication (MV-mul) and processing embedding layers, respectively. A fundamental primitive of embedding layers, tensor gather-and-reduction (GnR), gathers embedding vectors and then reduces them to a new embedding vector. Because the matrices in RNNs and the embedding tables in recommendation models have poor reusability, and because their ever-increasing sizes make them too large to fit in the on-chip storage of devices, the performance and energy efficiency of MV-mul and GnR are determined by those of main-memory DRAM. Therefore, computing these operations within DRAM draws significant attention. In this dissertation, we first propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul for concurrently processing memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput compared to the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 with a memory-intensive workload. Then we propose TRiM, an NDP architecture for accelerating recommendation systems.
    Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank/bank-group/bank level. We modify the interface of DRAM to provide commands effectively to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to 7.7× and 3.9× speedups and reduces the energy consumption of embedding-vector gather-and-reduction by 55% and 50% over the baseline and a state-of-the-art NDP architecture, respectively, with a minimal area overhead equivalent to 2.66% of the DRAM chip.
    Contents:
    1 Introduction
        1.1 Accelerating RNNs on Edge
        1.2 Accelerating Recommendation Models
        1.3 Research Contributions
        1.4 Outline
    2 Background
        2.1 Memory-intensive Machine Learning Applications
        2.2 DRAM Organization and Operations
    3 MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks
        3.1 Background and Motivation
            3.1.1 Energy-efficient RNN Mobile Inference
            3.1.2 How to Improve the Energy Efficiency and Bandwidth of DRAM Accesses in MV-mul
        3.2 MV-mul in DRAM
            3.2.1 Exploiting Quantization and Sparsity in RNN's Matrix Elements
            3.2.2 The Operation Sequence of MV-mul in DRAM
            3.2.3 Concurrently Serving Requests from Processors and Performing MV-mul in DRAM
            3.2.4 Put It All Together: MViD Architecture
            3.2.5 Additional Optimization Schemes
        3.3 Evaluation
            3.3.1 Power/Area/Timing Analysis
            3.3.2 Performance/Energy Evaluation
        3.4 Discussion
    4 TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory
        4.1 Prior NDP Architectures for Accelerating Tensor Gather-and-Reduction
            4.1.1 Tensor Gather-and-Reduction in RecSys
            4.1.2 Prior NDP Accelerators for GnR
            4.1.3 Quantitative Analysis
            4.1.4 Additional Schemes for Accelerating GnR
        4.2 Tensor Reduction in Memory
            4.2.1 Basic Concept for TRiM
            4.2.2 How to Provision C/A Bandwidth
            4.2.3 Exploring NDP Unit Placement
            4.2.4 TRiM-G Organization and Operations
            4.2.5 Host-side Architecture for TRiM
            4.2.6 Schemes for Improving Reliability
        4.3 Experimental Setup
        4.4 Evaluation
            4.4.1 Performance and Energy Efficiency
            4.4.2 Sensitivity Study of Hot-entry Replication
            4.4.3 Design Overhead
        4.5 Discussion
    5 Discussion
    6 Related Work
    7 Conclusion
    References
    Abstract in Korean (๊ตญ๋ฌธ์ดˆ๋ก)
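The GnR primitive the dissertation accelerates is easy to pin down in software. A reference sketch of its semantics in NumPy (illustrative only; this is what the in-DRAM reduction units compute, not how they compute it):

```python
import numpy as np

def gnr(embedding_table, indices, weights=None):
    """Tensor gather-and-reduction (GnR) as described in the abstract:
    gather the embedding vectors selected by `indices`, then reduce
    (sum) them into a single new embedding vector. Reference semantics
    only, not the in-DRAM implementation."""
    gathered = embedding_table[indices]           # gather step
    if weights is not None:                       # optional per-vector scaling
        gathered = gathered * weights[:, None]
    return gathered.sum(axis=0)                   # reduction step

# 4 embedding vectors of dimension 3, with made-up values
table = np.arange(12, dtype=np.float32).reshape(4, 3)
out = gnr(table, np.array([0, 2]))                # rows [0,1,2] + [6,7,8]
assert np.allclose(out, [6.0, 8.0, 10.0])
```

Because each lookup touches a different, rarely reused set of rows, the gather step is bandwidth-bound in DRAM, which is the motivation for moving the reduction into the memory device.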

    Self-timed circuits using DCVSL semi-bundled delay wrappers

    Journal article. We present a technique for generating robust self-timed completion signals for general dynamic datapath circuits. The wrapper circuit is based on our previous domino semi-bundled delay (SBD) circuits, but uses DCVSL circuits in the wrapper for higher performance. We describe the basic SBD-DCVSL building blocks in the template with respect to their circuit structures and operational behavior. These DCVSL SBD circuits show better performance, exhibit reduced overhead, and require smaller operating margins for the matched delay than the domino version. The DCVSL wrapper can also identify a class of delay faults in the datapath.
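The operating-margin claim can be made concrete with a toy timing check: a bundled (matched) delay is only safe if it exceeds the slowest datapath delay by some margin, so shrinking the required margin directly shrinks the delay line. All numbers below are hypothetical, and this is a first-order model, not the paper's circuit analysis:

```python
def matched_delay_ok(path_delays_ps, matched_delay_ps, margin=1.10):
    """Bundled-data timing check (toy model): the matched delay that
    produces the completion signal must exceed the slowest datapath
    delay by a safety margin. The abstract's point is that the DCVSL
    wrapper lets this margin be smaller than in the domino version.
    All numbers here are hypothetical."""
    return matched_delay_ps >= max(path_delays_ps) * margin

paths = [180, 220, 205]                 # hypothetical datapath delays (ps)
assert matched_delay_ok(paths, 250)     # 250 >= 220 * 1.10 = 242
assert not matched_delay_ok(paths, 240) # 240 <  242: unsafe
```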

    Optimization of DSSS Receivers Using Hardware-in-the-Loop Simulations

    Over the years, there has been significant interest in defining a hardware abstraction layer to facilitate code reuse in software-defined radio (SDR) applications. Designers are looking for a way to enable application software to specify a waveform, configure the platform, and control digital signal processing (DSP) functions in a hardware platform in a way that insulates it from the details of realization. This thesis presents a tool-based methodology for developing and optimizing a direct-sequence spread spectrum (DSSS) transceiver deployed in custom hardware such as field-programmable gate arrays (FPGAs). The system model consists of a transmitter which employs a quadrature phase shift keying (QPSK) modulation scheme, an additive white Gaussian noise (AWGN) channel, and a receiver whose main parts are an analog-to-digital converter (ADC), a digital down converter (DDC), an image-rejection low-pass filter (LPF), a carrier phase-locked loop (PLL), a tracking loop, a down-sampler, spread-spectrum correlators, and a rectangular-to-polar converter. The design methodology is based on a new programming model for FPGAs developed in industry by Xilinx Inc. The Xilinx System Generator for DSP software tool provides design portability and streamlines system development by enabling engineers to create and validate a system model in Xilinx FPGAs. By providing hierarchical modeling and automatic HDL code generation for programmable devices, designs can be easily verified through hardware-in-the-loop (HIL) simulations. HIL provides a significant increase in simulation speed, which allows optimization of the receiver design with respect to the datapath size of the receiver's different functional parts. The parameterized datapath points used in the simulation are ADC resolution, DDC datapath size, LPF datapath size, correlator height, correlator datapath size, and rectangular-to-polar datapath size. These parameters are changed in the software environment and tested for bit error rate (BER) performance through real-time hardware simulations. The final result is a system design with minimum hardware area occupancy for an acceptable BER degradation.
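The sweep-a-parameter-and-measure-BER loop described above can be mimicked in a few lines of software. A minimal QPSK-over-AWGN BER sketch (a floating-point software model; the parameter values are illustrative and not from the thesis):

```python
import numpy as np

# Minimal software model of the QPSK-over-AWGN link: transmit random
# bits, add calibrated noise, hard-decide, and count bit errors.
rng = np.random.default_rng(0)

def qpsk_ber(ebn0_db, nbits=200_000):
    bits = rng.integers(0, 2, nbits)
    # Gray-mapped QPSK: one bit on I, one on Q, amplitude 1 per axis
    i = 1 - 2.0 * bits[0::2]
    q = 1 - 2.0 * bits[1::2]
    ebn0 = 10 ** (ebn0_db / 10)
    sigma = np.sqrt(1 / (2 * ebn0))   # per-dimension noise std for Eb = 1
    ri = i + sigma * rng.standard_normal(i.size)
    rq = q + sigma * rng.standard_normal(q.size)
    errs = np.count_nonzero((ri < 0) != (i < 0)) \
         + np.count_nonzero((rq < 0) != (q < 0))
    return errs / nbits

# BER should fall monotonically as Eb/N0 improves.
bers = [qpsk_ber(x) for x in (0, 4, 8)]
assert bers[0] > bers[1] > bers[2]
```

The thesis's HIL flow plays the same role as this loop, but runs the quantized receiver datapath in the FPGA, so each sweep point reports the BER cost of a given fixed-point width rather than of noise alone.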
    • โ€ฆ
    corecore