193 research outputs found

    SPEC2: SPECtral SParsE CNN Accelerator on FPGAs

    Full text link
    To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into the frequency domain significantly reduces the computational complexity of the sliding-window operations performed in the spatial domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense convolutional kernels into sparse ones. To obtain a high-throughput FPGA implementation, we propose SPEC2 -- the first work to prune and accelerate spectral CNNs. First, we propose a systematic pruning algorithm based on the Alternating Direction Method of Multipliers (ADMM). The offline pruning iteratively sets the majority of spectral weights to zero, without using any handcrafted heuristics. Then, we design an optimized pipeline architecture on FPGA that has efficient random access into the sparse kernels and exploits various dimensions of parallelism in convolutional layers. Overall, SPEC2 achieves high inference throughput with extremely low computational complexity and negligible accuracy degradation. We demonstrate SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy loss for LeNet and <1% accuracy loss for VGG16. The resulting accelerators achieve up to 24x higher throughput compared with state-of-the-art FPGA implementations of VGG16.
    Comment: This is a 10-page conference paper at the 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC
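
    To make the saving from frequency-domain convolution concrete, here is a minimal NumPy sketch (not the SPEC2 kernel or its FPGA pipeline) showing how zero-padded FFTs turn the sliding-window operation into an elementwise product of spectra; the array sizes and function names are illustrative assumptions.

```python
import numpy as np

def conv2d_direct(x, k):
    """'Valid' 2-D convolution by explicit sliding windows (O(H*W*Kh*Kw) multiplies)."""
    H, W = x.shape
    Kh, Kw = k.shape
    out = np.zeros((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * k[::-1, ::-1])  # true convolution flips the kernel
    return out

def conv2d_fft(x, k):
    """Same result via the FFT: convolution becomes an elementwise product of spectra."""
    H, W = x.shape
    Kh, Kw = k.shape
    size = (H + Kh - 1, W + Kw - 1)                     # zero-pad so circular == linear convolution
    spec = np.fft.rfft2(x, size) * np.fft.rfft2(k, size)
    full = np.fft.irfft2(spec, size)
    return full[Kh - 1:H, Kw - 1:W]                     # crop back to the 'valid' region

x = np.random.rand(16, 16)
k = np.random.rand(3, 3)
assert np.allclose(conv2d_direct(x, k), conv2d_fft(x, k))
```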

    DRAM-Based Processing-in-Memory Microarchitecture for Memory-Intensive Machine Learning Applications

    Get PDF
    Ph.D. dissertation -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), February 2022. Advisor: Jung Ho Ahn. Recently, as research on neural networks has gained significant traction, a number of memory-intensive neural network models, such as recurrent neural network (RNN) models and recommendation models, have been introduced to process various tasks. RNN models and recommendation models spend most of their execution time processing matrix-vector multiplication (MV-mul) and embedding layers, respectively. A fundamental primitive of embedding layers, tensor gather-and-reduction (GnR), gathers embedding vectors and then reduces them to a new embedding vector. Because the matrices in RNNs and the embedding tables in recommendation models have poor reusability, and their ever-increasing sizes have become too large to fit in the on-chip storage of devices, the performance and energy efficiency of MV-mul and GnR are determined by those of main-memory DRAM. Therefore, computing these operations within DRAM draws significant attention. In this dissertation, we first propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul for concurrently processing memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput compared to the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 with a memory-intensive workload. We then propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank/bank-group/bank level. We modify the DRAM interface to effectively issue commands to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to a 7.7× and 3.9× speedup and reduces the energy consumption of embedding-vector gather and reduction by 55% and 50% over the baseline and the state-of-the-art NDP architecture, respectively, with minimal area overhead equivalent to 2.66% of the DRAM chips.
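
    For reference, the tensor gather-and-reduction (GnR) primitive that TRiM accelerates can be written down in a few lines of NumPy. This is only a hypothetical host-side sketch of the operation itself, not the in-DRAM reduction units or the MViD MAC datapath; the table size, query count, and optional per-query weights are assumptions.

```python
import numpy as np

def gnr(embedding_table, indices_per_query, weights=None):
    """Gather-and-reduce: for each query, gather the indexed embedding
    vectors and reduce (sum) them into a single output vector."""
    outputs = []
    for q, idx in enumerate(indices_per_query):
        gathered = embedding_table[idx]                 # gather: random rows of a huge table
        w = np.ones(len(idx)) if weights is None else weights[q]
        outputs.append(w @ gathered)                    # reduce: (weighted) sum -> one vector
    return np.stack(outputs)

table = np.random.rand(1_000_000, 64)                   # embedding table far too large for on-chip SRAM
queries = [np.random.randint(0, 1_000_000, size=30) for _ in range(4)]
print(gnr(table, queries).shape)                        # (4, 64)
```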

    Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

    Get PDF
    Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications, ranging from computer vision for medicine to autonomous driving of modern cars, as well as other sectors such as security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks that require significant computational power during both training and inference. A single inference of a DL model may require billions of multiply-and-accumulate operations, making DL extremely compute- and energy-hungry. In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need arises for cost-effective hardware platforms capable of energy-efficient DL execution. This paper first introduces the key properties of two brain-inspired models, the Deep Neural Network (DNN) and the Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. It summarizes and compares work on the four leading platforms for executing these algorithms -- CPU, GPU, FPGA, and ASIC -- describing the main state-of-the-art solutions and giving particular prominence to the last two, since they offer greater design flexibility and bear the potential for high energy efficiency, especially for the inference process. In addition to hardware solutions, this paper discusses some of the important security issues that these DNN and SNN models may face during their execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and of the hardware systems designed for them.
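
    As a rough sense of scale for the billions of multiply-and-accumulate (MAC) operations mentioned above, the sketch below counts MACs for a single convolutional layer; the layer shape is an arbitrary example, not a figure taken from the survey.

```python
def conv_macs(h_out, w_out, c_out, k_h, k_w, c_in):
    """MACs for one conv layer: every output element needs k_h * k_w * c_in MACs."""
    return h_out * w_out * c_out * k_h * k_w * c_in

# Example layer (assumed shape): 56x56x256 output, 3x3 kernels over 256 input channels
print(f"{conv_macs(56, 56, 256, 3, 3, 256):,} MACs")    # ~1.85 billion MACs for this single layer
```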

    Algorithm/Architecture Co-Design for Low-Power Neuromorphic Computing

    Full text link
    The development of computing systems based on the conventional von Neumann architecture has slowed down in the past decade as complementary metal-oxide-semiconductor (CMOS) technology scaling becomes more and more difficult. To satisfy the ever-increasing demand for computing power, neuromorphic computing has emerged as an attractive alternative. This dissertation focuses on developing learning algorithms, hardware architectures, circuit components, and design methodologies for low-power neuromorphic computing that can be employed in various energy-constrained applications. A top-down approach is adopted in this research. Starting from algorithm-architecture co-design, a hardware-friendly learning algorithm is developed for spiking neural networks (SNNs). The possibility of estimating gradients from spike timings is explored. The learning algorithm is developed for ease of hardware implementation, as well as for compatibility with many well-established learning techniques developed for classic artificial neural networks (ANNs). An SNN hardware equipped with the proposed on-chip learning algorithm is implemented in CMOS technology. In this design, two unique features of SNNs, event-driven computation and inference with progressive precision, are leveraged to reduce energy consumption. In addition to the low-power SNN hardware, accelerators for ANNs are also presented to accelerate the adaptive dynamic programming algorithm. An efficient and flexible single-instruction-multiple-data architecture is proposed to exploit the inherent data-level parallelism in the inference and learning of ANNs. In addition, the accelerator is augmented with a virtual update technique, which improves throughput and energy efficiency remarkably. Lastly, two architecture-circuit-level techniques are introduced to mitigate the degraded reliability of the memory system in neuromorphic hardware owing to the aggressively scaled supply voltage and integration density. The first method uses on-chip feedback to compensate for process variation, and the second improves the throughput and energy efficiency of a conventional error-correction method.
    PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144149/1/zhengn_1.pd
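
    The event-driven computation that this design exploits can be illustrated with a generic leaky integrate-and-fire (LIF) update that only touches the weight matrix when input spikes arrive. This is a textbook-style sketch under assumed sizes and constants, not the dissertation's on-chip learning circuit or its gradient-from-spike-timing algorithm.

```python
import numpy as np

def lif_step(v, spikes_in, weights, leak=0.9, v_th=1.0):
    """One event-driven LIF timestep: accumulate only the columns of `weights`
    whose presynaptic neurons spiked, then leak, threshold, and reset."""
    if spikes_in.any():                                  # event-driven: skip all work on silent steps
        v = v + weights[:, spikes_in].sum(axis=1)
    v = v * leak
    spikes_out = v >= v_th
    v = np.where(spikes_out, 0.0, v)                     # reset neurons that fired
    return v, spikes_out

rng = np.random.default_rng(0)
W = rng.normal(0, 0.3, size=(128, 64))                   # 64 inputs -> 128 neurons (assumed sizes)
v = np.zeros(128)
for t in range(100):
    v, out = lif_step(v, rng.random(64) < 0.05, W)       # sparse 5% input activity
```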

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, in resources, and in power consumption. We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed-precision training (MPT) more stable through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges of heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power consumption efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
    Open Access
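
    To see the structural saving that SPGC's group-convolution pruning targets, the short sketch below compares the weight counts of a standard and a grouped convolution; it does not implement SPGC's channel-permutation algorithm, and the channel and kernel sizes are assumed.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a (possibly grouped) convolution: each output channel
    only sees c_in / groups input channels, so parameters shrink by `groups`."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

dense = conv_params(256, 256, 3)             # standard convolution
gconv = conv_params(256, 256, 3, groups=8)   # structured 8-group convolution
print(dense, gconv, dense / gconv)           # 589824 73728 8.0
```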

    Edge-Cloud Polarization and Collaboration: A Comprehensive Survey for AI

    Full text link
    Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers, pretrained families), the explosion of training data, and soaring computing capabilities. However, edge computing, especially edge-cloud collaborative computing, is still in its infancy, because resource-constrained IoT scenarios allow only very limited algorithms to be deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up a collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential of, and practical experiences with, some ongoing advanced edge AI topics, including pretraining models, graph neural networks, and reinforcement learning. Finally, we discuss the promising directions and challenges in this field.
    Comment: 20 pages, Transactions on Knowledge and Data Engineering

    Reliability and Security of Compute-In-Memory Based Deep Neural Network Accelerators

    Get PDF
    Compute-In-Memory (CIM) is a promising solution for accelerating DNNs at edge devices, utilizing mixed-signal computation. However, it requires more cross-layer design, from the algorithm level down to the hardware implementation, because it behaves differently from a purely digital system. On one side, the mixed-signal computations of CIM face non-negligible variations, which can hamper software performance. On the other side, CIM accelerators have potential software/hardware security vulnerabilities. This research aims to solve the reliability and security issues in CIM designs for accelerating Deep Neural Network (DNN) algorithms, as these issues prevent real-life use of CIM-based accelerators. Several non-ideal effects in CIM accelerators that can cause reliability issues are explored and addressed with software-hardware co-design methods. In addition, different security vulnerabilities for SRAM-based and eNVM-based CIM inference engines are defined, and corresponding countermeasures are proposed.
    Ph.D.
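
    One way to reason about the mixed-signal variations mentioned above is to inject them into a bit-exact model of a CIM matrix-vector MAC and measure the resulting error. The Gaussian cell-variation and ADC-noise model, bit widths, and magnitudes below are illustrative assumptions, not the device model used in this research.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.integers(-4, 4, size=(128, 784)).astype(float)   # assumed low-precision signed weights
x = rng.integers(0, 16, size=784).astype(float)          # assumed 4-bit activations

ideal = W @ x                                            # digital reference MAC
# Mixed-signal column sums with multiplicative cell variation and additive ADC noise (assumed model)
noisy = (W * rng.normal(1.0, 0.02, W.shape)) @ x + rng.normal(0, 2.0, 128)
rel_err = np.abs(noisy - ideal) / (np.abs(ideal) + 1e-9)
print(f"median relative error: {np.median(rel_err):.3%}")
```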

    FPGA-based high-performance neural network acceleration

    Full text link
    In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Advances are rapid, with thousands of papers being published annually. Many types of DNNs have been and continue to be developed -- in this thesis, we address Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) -- each with a different set of target applications and implementation challenges. The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, but also have strict accuracy requirements. Much research has therefore gone into all aspects of improving NN quality and performance: algorithms, code optimization, acceleration with GPUs, and acceleration with hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis, we concentrate on the last of these approaches. There have been many previous efforts to create hardware that accelerates NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware unfriendly. One commonly used approach is to train NN models to follow regular computation and data patterns. This approach, however, can hurt the models' accuracy or lead to models with non-negligible redundancies. This dissertation takes a different approach. Instead of regularizing the model, we create architectures friendly to irregular models. Our thesis is that high-accuracy and high-performance NN inference and training can be achieved by creating a series of novel irregularity-aware architectures for Field-Programmable Gate Arrays (FPGAs). In four different studies on four different NN types, we find that this approach results in speedups of 2.1x to 3255x compared with carefully selected prior art; for inference, there is no change in accuracy. The bulk of this dissertation revolves around these studies, the various workload-balancing techniques, and the resulting NN acceleration architectures. In particular, we propose four different architectures to handle, respectively, data-structure-level, operation-level, bit-level, and model-level irregularities. At the data structure level, we propose AWB-GCN, which uses runtime workload rebalancing to handle Sparse Matrix Multiplication (SpMM) on extremely sparse and unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90% system efficiency, guarantees efficient off-chip memory access, and provides considerable speedups over CPUs (3255x), GPUs (80x), and a prior ASIC accelerator (5.1x). At the operation level, we propose O3BNN-R, which can detect redundant operations and prune them at run time, even those that are highly data-dependent and unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over 30% of the operations without any accuracy loss, yielding speedups over state-of-the-art implementations on CPUs (1122x), GPUs (2.3x), and FPGAs (2.1x). At the bit level, we propose CQNN. CQNN embeds a Coarse-Grained Reconfigurable Architecture (CGRA) which can be programmed at runtime to support NN functions with various data-width requirements. Results show that CQNN can deliver microsecond-level Quantized NN (QNN) inference. At the model level, we propose FPDeep, which targets training in particular.
In order to address model-level irregularity, FPDeep uses a novel model-partitioning scheme to balance workload and storage among nodes. By using a hybrid of model and layer parallelism to train DNNs, FPDeep avoids the large gap that commonly arises between training and testing accuracy due to improper convergence to sharp minimizers (caused by large training batches). Results show that FPDeep provides scalable, fast, and accurate training and achieves 6.6x higher energy efficiency than GPUs.
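
    The data-structure-level irregularity that AWB-GCN rebalances at runtime is easy to quantify by counting nonzeros per row of a power-law sparse matrix and statically assigning row blocks to processing elements (PEs). The sketch below only measures that imbalance under an assumed nonzero distribution; it does not reproduce the AWB-GCN hardware scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rows, num_pes = 4096, 32
# Power-law-ish nonzeros per row: a few very dense rows, many nearly empty ones (assumed distribution)
nnz_per_row = (rng.pareto(1.5, num_rows) * 8).astype(int) + 1

# Static partitioning: give each processing element an equal block of consecutive rows
pe_load = nnz_per_row.reshape(num_pes, -1).sum(axis=1)
print("max / mean PE load with static row blocks:", pe_load.max() / pe_load.mean())
# A large ratio means most PEs sit idle waiting for the most loaded one,
# which is the imbalance that runtime workload rebalancing removes.
```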

    Investigating Opportunities and Challenges in Modeling and Designing Scale-Out DNN Accelerators

    Get PDF
    The rapid growth of deep learning used in practical applications such as speech recognition, computer vision, natural language processing, robotics, and many other fields has opened the gate to new technology possibilities. Unfortunately, traditional hardware systems are being stretched to the maximum to accommodate the intense workloads presented by state-of-the-art deep learning processes, at a time when transistor technology is no longer scaling. To serve the demand for more computational power and more specialized computation, specialized hardware needs to be developed that provides better latency and bandwidth for various demanding applications. The trend in the semiconductor industry is to move towards heterogeneous Systems-on-Chip (SoCs), trading the generality seen in most CPU architectures today for application-specific performance. In most situations, hardware engineers are left to construct systems that serve the needs of various applications, often needing to predict the use cases of the system. As in any field, the ability to predict and act on future innovation trends in the industry is the difference between success and failure. A novel simulator for the design of convolutional neural network accelerators, named SCALE-Sim (Systolic CNN Accelerator Simulator), is presented and described in detail. The simulator is available as an open-source repository and has two primary use cases in which computer architects can extract significant results. The first use case is for system designers who would like to integrate an existing DNN accelerator architecture into a larger SoC and are interested in system-level characterization results. The second use case is for an accelerator architect who would like to use the tool to explore the accelerator design space by sweeping through design parameters.
    M.S.
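
    The kind of first-order number such a simulator produces can be approximated analytically. The sketch below is a generic output-stationary systolic-array cycle model under assumed array and layer dimensions; it is not SCALE-Sim's actual cost model or API.

```python
import math

def os_systolic_cycles(M, N, K, rows, cols):
    """First-order cycle estimate for an output-stationary rows x cols systolic
    array computing an (M x K) @ (K x N) GEMM: each output tile streams K partial
    sums, plus array fill/drain latency."""
    tiles = math.ceil(M / rows) * math.ceil(N / cols)
    return tiles * (K + rows + cols - 1)

# Example: a 3x3 conv layer lowered to GEMM (im2col), mapped to a 32x32 array (assumed sizes)
M, N, K = 64, 56 * 56, 3 * 3 * 64        # output channels, output pixels, kernel volume
print(os_systolic_cycles(M, N, K, 32, 32))
```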