193 research outputs found

    SPEC2: SPECtral SParsE CNN Accelerator on FPGAs

    Full text link
    To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into the frequency domain significantly reduces the computational complexity of the sliding-window operations performed in the spatial domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense convolutional kernels into sparse ones. To obtain a high-throughput FPGA implementation, we propose SPEC2 -- the first work to prune and accelerate spectral CNNs. First, we propose a systematic pruning algorithm based on the Alternating Direction Method of Multipliers (ADMM). The offline pruning iteratively sets the majority of spectral weights to zero, without using any handcrafted heuristics. Then, we design an optimized pipeline architecture on FPGA that has efficient random access into the sparse kernels and exploits various dimensions of parallelism in convolutional layers. Overall, SPEC2 achieves high inference throughput with extremely low computational complexity and negligible accuracy degradation. We demonstrate SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy loss for LeNet and <1% accuracy loss for VGG16. The resulting accelerators achieve up to 24x higher throughput compared with state-of-the-art FPGA implementations of VGG16.
    Comment: This is a 10-page conference paper at the 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC
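
    To make the saving from frequency-domain convolution concrete, here is a minimal NumPy sketch (not the SPEC2 kernel or its FPGA pipeline) showing how zero-padded FFTs turn the sliding-window operation into an elementwise product of spectra; the array sizes and function names are illustrative assumptions.

```python
import numpy as np

def conv2d_direct(x, k):
    """'Valid' 2-D convolution by explicit sliding windows (O(H*W*Kh*Kw) multiplies)."""
    H, W = x.shape
    Kh, Kw = k.shape
    out = np.zeros((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * k[::-1, ::-1])  # true convolution flips the kernel
    return out

def conv2d_fft(x, k):
    """Same result via the FFT: convolution becomes an elementwise product of spectra."""
    H, W = x.shape
    Kh, Kw = k.shape
    size = (H + Kh - 1, W + Kw - 1)                     # zero-pad so circular == linear convolution
    spec = np.fft.rfft2(x, size) * np.fft.rfft2(k, size)
    full = np.fft.irfft2(spec, size)
    return full[Kh - 1:H, Kw - 1:W]                     # crop back to the 'valid' region

x = np.random.rand(16, 16)
k = np.random.rand(3, 3)
assert np.allclose(conv2d_direct(x, k), conv2d_fft(x, k))
```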

    DRAM-Based Processing-in-Memory Microarchitecture for Memory-Intensive Machine Learning Applications

    Get PDF
    Ph.D. dissertation -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), February 2022. Advisor: Jung Ho Ahn. Recently, as research on neural networks has gained significant traction, a number of memory-intensive neural network models, such as recurrent neural network (RNN) models and recommendation models, have been introduced to process various tasks. RNN models and recommendation models spend most of their execution time processing matrix-vector multiplication (MV-mul) and embedding layers, respectively. A fundamental primitive of embedding layers, tensor gather-and-reduction (GnR), gathers embedding vectors and then reduces them to a new embedding vector. Because the matrices in RNNs and the embedding tables in recommendation models have poor reusability, and their ever-increasing sizes have become too large to fit in the on-chip storage of devices, the performance and energy efficiency of MV-mul and GnR are determined by those of main-memory DRAM. Therefore, computing these operations within DRAM draws significant attention. In this dissertation, we first propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul for concurrently processing memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput compared to the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 with a memory-intensive workload. We then propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank/bank-group/bank level. We modify the DRAM interface to effectively issue commands to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to a 7.7× and 3.9× speedup and reduces the energy consumption of embedding-vector gather and reduction by 55% and 50% over the baseline and the state-of-the-art NDP architecture, respectively, with minimal area overhead equivalent to 2.66% of the DRAM chips.
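
    For reference, the tensor gather-and-reduction (GnR) primitive that TRiM accelerates can be written down in a few lines of NumPy. This is only a hypothetical host-side sketch of the operation itself, not the in-DRAM reduction units or the MViD MAC datapath; the table size, query count, and optional per-query weights are assumptions.

```python
import numpy as np

def gnr(embedding_table, indices_per_query, weights=None):
    """Gather-and-reduce: for each query, gather the indexed embedding
    vectors and reduce (sum) them into a single output vector."""
    outputs = []
    for q, idx in enumerate(indices_per_query):
        gathered = embedding_table[idx]                 # gather: random rows of a huge table
        w = np.ones(len(idx)) if weights is None else weights[q]
        outputs.append(w @ gathered)                    # reduce: (weighted) sum -> one vector
    return np.stack(outputs)

table = np.random.rand(1_000_000, 64)                   # embedding table far too large for on-chip SRAM
queries = [np.random.randint(0, 1_000_000, size=30) for _ in range(4)]
print(gnr(table, queries).shape)                        # (4, 64)
```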

    Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

    Get PDF
    Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications, ranging from computer vision for medicine to autonomous driving of modern cars, as well as other sectors such as security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks that require significant computational power during both training and inference. A single inference of a DL model may require billions of multiply-and-accumulate operations, making DL extremely compute- and energy-hungry. In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need arises for cost-effective hardware platforms capable of energy-efficient DL execution. This paper first introduces the key properties of two brain-inspired models, the Deep Neural Network (DNN) and the Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. It summarizes and compares work on the four leading platforms for executing these algorithms -- CPU, GPU, FPGA, and ASIC -- describing the main state-of-the-art solutions and giving particular prominence to the last two, since they offer greater design flexibility and bear the potential for high energy efficiency, especially for the inference process. In addition to hardware solutions, this paper discusses some of the important security issues that these DNN and SNN models may face during their execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and of the hardware systems designed for them.
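
    As a rough sense of scale for the billions of multiply-and-accumulate (MAC) operations mentioned above, the sketch below counts MACs for a single convolutional layer; the layer shape is an arbitrary example, not a figure taken from the survey.

```python
def conv_macs(h_out, w_out, c_out, k_h, k_w, c_in):
    """MACs for one conv layer: every output element needs k_h * k_w * c_in MACs."""
    return h_out * w_out * c_out * k_h * k_w * c_in

# Example layer (assumed shape): 56x56x256 output, 3x3 kernels over 256 input channels
print(f"{conv_macs(56, 56, 256, 3, 3, 256):,} MACs")    # ~1.85 billion MACs for this single layer
```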

    Algorithm/Architecture Co-Design for Low-Power Neuromorphic Computing

    Full text link
    The development of computing systems based on the conventional von Neumann architecture has slowed down in the past decade as complementary metal-oxide-semiconductor (CMOS) technology scaling becomes more and more difficult. To satisfy the ever-increasing demand for computing power, neuromorphic computing has emerged as an attractive alternative. This dissertation focuses on developing learning algorithms, hardware architectures, circuit components, and design methodologies for low-power neuromorphic computing that can be employed in various energy-constrained applications. A top-down approach is adopted in this research. Starting from algorithm-architecture co-design, a hardware-friendly learning algorithm is developed for spiking neural networks (SNNs). The possibility of estimating gradients from spike timings is explored. The learning algorithm is developed for ease of hardware implementation, as well as for compatibility with many well-established learning techniques developed for classic artificial neural networks (ANNs). An SNN hardware equipped with the proposed on-chip learning algorithm is implemented in CMOS technology. In this design, two unique features of SNNs, event-driven computation and inference with progressive precision, are leveraged to reduce energy consumption. In addition to the low-power SNN hardware, accelerators for ANNs are also presented to accelerate the adaptive dynamic programming algorithm. An efficient and flexible single-instruction-multiple-data architecture is proposed to exploit the inherent data-level parallelism in the inference and learning of ANNs. In addition, the accelerator is augmented with a virtual update technique, which improves throughput and energy efficiency remarkably. Lastly, two architecture-circuit-level techniques are introduced to mitigate the degraded reliability of the memory system in neuromorphic hardware owing to the aggressively scaled supply voltage and integration density. The first method uses on-chip feedback to compensate for process variation, and the second improves the throughput and energy efficiency of a conventional error-correction method.
    PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144149/1/zhengn_1.pd
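
    The event-driven computation that this design exploits can be illustrated with a generic leaky integrate-and-fire (LIF) update that only touches the weight matrix when input spikes arrive. This is a textbook-style sketch under assumed sizes and constants, not the dissertation's on-chip learning circuit or its gradient-from-spike-timing algorithm.

```python
import numpy as np

def lif_step(v, spikes_in, weights, leak=0.9, v_th=1.0):
    """One event-driven LIF timestep: accumulate only the columns of `weights`
    whose presynaptic neurons spiked, then leak, threshold, and reset."""
    if spikes_in.any():                                  # event-driven: skip all work on silent steps
        v = v + weights[:, spikes_in].sum(axis=1)
    v = v * leak
    spikes_out = v >= v_th
    v = np.where(spikes_out, 0.0, v)                     # reset neurons that fired
    return v, spikes_out

rng = np.random.default_rng(0)
W = rng.normal(0, 0.3, size=(128, 64))                   # 64 inputs -> 128 neurons (assumed sizes)
v = np.zeros(128)
for t in range(100):
    v, out = lif_step(v, rng.random(64) < 0.05, W)       # sparse 5% input activity
```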

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, in resources, and in power consumption. We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed-precision training (MPT) more stable through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges of heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power consumption efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
    Open Access
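
    To see the structural saving that SPGC's group-convolution pruning targets, the short sketch below compares the weight counts of a standard and a grouped convolution; it does not implement SPGC's channel-permutation algorithm, and the channel and kernel sizes are assumed.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a (possibly grouped) convolution: each output channel
    only sees c_in / groups input channels, so parameters shrink by `groups`."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

dense = conv_params(256, 256, 3)             # standard convolution
gconv = conv_params(256, 256, 3, groups=8)   # structured 8-group convolution
print(dense, gconv, dense / gconv)           # 589824 73728 8.0
```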

    Edge-Cloud Polarization and Collaboration: A Comprehensive Survey for AI

    Full text link
    Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers, pretrained families), the explosion of training data, and soaring computing capabilities. However, edge computing, especially edge-cloud collaborative computing, is still in its infancy, because resource-constrained IoT scenarios allow only very limited algorithms to be deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up a collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential of, and practical experiences with, some ongoing advanced edge AI topics, including pretraining models, graph neural networks, and reinforcement learning. Finally, we discuss the promising directions and challenges in this field.
    Comment: 20 pages, Transactions on Knowledge and Data Engineering

    Reliability and Security of Compute-In-Memory Based Deep Neural Network Accelerators

    Get PDF
    Compute-In-Memory (CIM) is a promising solution for accelerating DNNs at edge devices, utilizing mixed-signal computation. However, it requires more cross-layer design, from the algorithm level down to the hardware implementation, because it behaves differently from a purely digital system. On one side, the mixed-signal computations of CIM face non-negligible variations, which can hamper software performance. On the other side, CIM accelerators have potential software/hardware security vulnerabilities. This research aims to solve the reliability and security issues in CIM designs for accelerating Deep Neural Network (DNN) algorithms, as these issues prevent real-life use of CIM-based accelerators. Several non-ideal effects in CIM accelerators that can cause reliability issues are explored and addressed with software-hardware co-design methods. In addition, different security vulnerabilities for SRAM-based and eNVM-based CIM inference engines are defined, and corresponding countermeasures are proposed.
    Ph.D.
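
    One way to reason about the mixed-signal variations mentioned above is to inject them into a bit-exact model of a CIM matrix-vector MAC and measure the resulting error. The Gaussian cell-variation and ADC-noise model, bit widths, and magnitudes below are illustrative assumptions, not the device model used in this research.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.integers(-4, 4, size=(128, 784)).astype(float)   # assumed low-precision signed weights
x = rng.integers(0, 16, size=784).astype(float)          # assumed 4-bit activations

ideal = W @ x                                            # digital reference MAC
# Mixed-signal column sums with multiplicative cell variation and additive ADC noise (assumed model)
noisy = (W * rng.normal(1.0, 0.02, W.shape)) @ x + rng.normal(0, 2.0, 128)
rel_err = np.abs(noisy - ideal) / (np.abs(ideal) + 1e-9)
print(f"median relative error: {np.median(rel_err):.3%}")
```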

    FPGA-based high-performance neural network acceleration

    Full text link
    In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Advances are rapid, with thousands of papers being published annually. Many types of DNNs have been and continue to be developed -- in this thesis, we address Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) -- each with a different set of target applications and implementation challenges. The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, but also have strict accuracy requirements. Much research has therefore gone into all aspects of improving NN quality and performance: algorithms, code optimization, acceleration with GPUs, and acceleration with hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis, we concentrate on the last of these approaches. There have been many previous efforts to create hardware that accelerates NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware unfriendly. One commonly used approach is to train NN models to follow regular computation and data patterns. This approach, however, can hurt the models' accuracy or lead to models with non-negligible redundancies. This dissertation takes a different approach. Instead of regularizing the model, we create architectures friendly to irregular models. Our thesis is that high-accuracy and high-performance NN inference and training can be achieved by creating a series of novel irregularity-aware architectures for Field-Programmable Gate Arrays (FPGAs). In four different studies on four different NN types, we find that this approach results in speedups of 2.1x to 3255x compared with carefully selected prior art; for inference, there is no change in accuracy. The bulk of this dissertation revolves around these studies, the various workload-balancing techniques, and the resulting NN acceleration architectures. In particular, we propose four different architectures to handle, respectively, data-structure-level, operation-level, bit-level, and model-level irregularities. At the data structure level, we propose AWB-GCN, which uses runtime workload rebalancing to handle Sparse Matrix Multiplication (SpMM) on extremely sparse and unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90% system efficiency, guarantees efficient off-chip memory access, and provides considerable speedups over CPUs (3255x), GPUs (80x), and a prior ASIC accelerator (5.1x). At the operation level, we propose O3BNN-R, which can detect redundant operations and prune them at run time, even those that are highly data-dependent and unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over 30% of the operations without any accuracy loss, yielding speedups over state-of-the-art implementations on CPUs (1122x), GPUs (2.3x), and FPGAs (2.1x). At the bit level, we propose CQNN. CQNN embeds a Coarse-Grained Reconfigurable Architecture (CGRA) which can be programmed at runtime to support NN functions with various data-width requirements. Results show that CQNN can deliver microsecond-level Quantized NN (QNN) inference. At the model level, we propose FPDeep, which targets training in particular.
In order to address model-level irregularity, FPDeep uses a novel model-partitioning scheme to balance workload and storage among nodes. By using a hybrid of model and layer parallelism to train DNNs, FPDeep avoids the large gap that commonly arises between training and testing accuracy due to improper convergence to sharp minimizers (caused by large training batches). Results show that FPDeep provides scalable, fast, and accurate training and achieves 6.6x higher energy efficiency than GPUs.
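
    The data-structure-level irregularity that AWB-GCN rebalances at runtime is easy to quantify by counting nonzeros per row of a power-law sparse matrix and statically assigning row blocks to processing elements (PEs). The sketch below only measures that imbalance under an assumed nonzero distribution; it does not reproduce the AWB-GCN hardware scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rows, num_pes = 4096, 32
# Power-law-ish nonzeros per row: a few very dense rows, many nearly empty ones (assumed distribution)
nnz_per_row = (rng.pareto(1.5, num_rows) * 8).astype(int) + 1

# Static partitioning: give each processing element an equal block of consecutive rows
pe_load = nnz_per_row.reshape(num_pes, -1).sum(axis=1)
print("max / mean PE load with static row blocks:", pe_load.max() / pe_load.mean())
# A large ratio means most PEs sit idle waiting for the most loaded one,
# which is the imbalance that runtime workload rebalancing removes.
```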

    Investigating Opportunities and Challenges in Modeling and Designing Scale-Out DNN Accelerators

    Get PDF
    The rapid growth of deep learning used in practical applications such as speech recognition, computer vision, natural language processing, robotics, and many other fields has opened the gate to new technology possibilities. Unfortunately, traditional hardware systems are being stretched to the maximum to accommodate the intense workloads presented by state-of-the-art deep learning processes, at a time when transistor technology is no longer scaling. To serve the demand for more computational power and more specialized computation, specialized hardware needs to be developed that provides better latency and bandwidth for various demanding applications. The trend in the semiconductor industry is to move towards heterogeneous Systems-on-Chip (SoCs), trading the generality seen in most CPU architectures today for application-specific performance. In most situations, hardware engineers are left to construct systems that serve the needs of various applications, often needing to predict the use cases of the system. As in any field, the ability to predict and act on future innovation trends in the industry is the difference between success and failure. A novel simulator for the design of convolutional neural network accelerators, named SCALE-Sim (Systolic CNN Accelerator Simulator), is presented and described in detail. The simulator is available as an open-source repository and has two primary use cases in which computer architects can extract significant results. The first use case is for system designers who would like to integrate an existing DNN accelerator architecture into a larger SoC and are interested in system-level characterization results. The second use case is for an accelerator architect who would like to use the tool to explore the accelerator design space by sweeping through design parameters.
    M.S.
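
    The kind of first-order number such a simulator produces can be approximated analytically. The sketch below is a generic output-stationary systolic-array cycle model under assumed array and layer dimensions; it is not SCALE-Sim's actual cost model or API.

```python
import math

def os_systolic_cycles(M, N, K, rows, cols):
    """First-order cycle estimate for an output-stationary rows x cols systolic
    array computing an (M x K) @ (K x N) GEMM: each output tile streams K partial
    sums, plus array fill/drain latency."""
    tiles = math.ceil(M / rows) * math.ceil(N / cols)
    return tiles * (K + rows + cols - 1)

# Example: a 3x3 conv layer lowered to GEMM (im2col), mapped to a 32x32 array (assumed sizes)
M, N, K = 64, 56 * 56, 3 * 3 * 64        # output channels, output pixels, kernel volume
print(os_systolic_cycles(M, N, K, 32, 32))
```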