9 research outputs found

    Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

    Get PDF
    In the modern-day era of technology, a paradigm shift has been witnessed in areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as computer vision, image and video processing, and robotics. In the context of mature digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving complex real-life problems. In certain situations, the performance and accuracy of a DNN can even surpass human intelligence. However, DNNs are computationally cumbersome in terms of the resources and time required to handle their computations, and general-purpose architectures such as CPUs struggle to execute such computationally intensive algorithms. Therefore, the research community has invested considerable effort in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) for the effective implementation of these algorithms. This paper brings together the research carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review describes in detail the specialized hardware-based accelerators used in the training and/or inference of DNNs, and compares the accelerators discussed on factors such as power, area, and throughput. Finally, future research and development directions, such as emerging trends in DNN implementation on specialized hardware accelerators, are discussed. This review article is intended to serve as a guide to hardware architectures for accelerating and improving the effectiveness of deep learning research.
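
    The survey's comparison of accelerators by power, area, and throughput boils down to a few simple figures of merit. Below is a minimal sketch, not taken from the paper, of how throughput-per-watt and throughput-per-area might be tabulated when comparing designs; the device names and numbers are hypothetical placeholders.

```python
# Minimal sketch: comparing hypothetical accelerators by the metrics the
# survey uses (power, area, throughput). All numbers are made-up
# placeholders, not measurements from the reviewed papers.

accelerators = {
    #  name         throughput (GOPS)  power (W)  area (mm^2)
    "gpu_example":  (9000.0,           250.0,     815.0),
    "fpga_example": (1500.0,            25.0,     600.0),
    "asic_example": (4000.0,            15.0,      50.0),
}

def efficiency(throughput_gops, power_w, area_mm2):
    """Return (GOPS/W, GOPS/mm^2) figures of merit."""
    return throughput_gops / power_w, throughput_gops / area_mm2

for name, (gops, watts, mm2) in accelerators.items():
    gops_per_w, gops_per_mm2 = efficiency(gops, watts, mm2)
    print(f"{name:13s}  {gops_per_w:8.1f} GOPS/W  {gops_per_mm2:8.1f} GOPS/mm^2")
```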

    Doctor of Philosophy

    Get PDF
    Deep Neural Networks (DNNs) are the state-of-the-art solution in a growing number of tasks including computer vision, speech recognition, and genomics. However, DNNs are computationally expensive, as they are carefully trained to extract and abstract features from raw data using multiple layers of neurons with millions of parameters. In this dissertation, we primarily focus on inference, e.g., using a DNN to classify an input image. This is an operation that will be repeatedly performed on billions of devices in the datacenter, in self-driving cars, in drones, etc. We observe that DNNs spend the vast majority of their runtime performing matrix-by-vector multiplications (MVMs). MVMs have two major bottlenecks: fetching the matrix and performing sum-of-product operations. To address these bottlenecks, we use in-situ computing, where the matrix is stored in programmable resistor arrays, called crossbars, and sum-of-product operations are performed using analog computing. In this dissertation, we propose two hardware units, ISAAC and Newton. In ISAAC, we show that in-situ computing designs can outperform digital DNN accelerators if they leverage pipelining, smart encodings, and can distribute a computation in time and space, within crossbars and across crossbars. In the ISAAC design, roughly half the chip area/power can be attributed to analog-to-digital conversion (ADC), i.e., it remains the key design challenge in mixed-signal accelerators for deep networks. In spite of the ADC bottleneck, ISAAC is able to outperform the computational efficiency of the state-of-the-art design (DaDianNao) by 8x. In Newton, we take advantage of a number of techniques to address ADC inefficiency. These techniques exploit matrix transformations, heterogeneity, and smart mapping of computation to the analog substrate. We show that Newton can increase the efficiency of in-situ computing by an additional 2x. Finally, we show that in-situ computing, unfortunately, cannot be easily adapted to handle the training of deep networks, i.e., it is only suitable for inference of already-trained networks. By improving the efficiency of DNN inference with ISAAC and Newton, we move closer to low-cost deep learning that in turn will have societal impact through self-driving cars, assistive systems for the disabled, and precision medicine.
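
    To make the in-situ MVM idea concrete, here is a minimal NumPy sketch of crossbar-style matrix-by-vector multiplication with bit-serial inputs, ADC digitization of each column, and shift-and-add accumulation. This is an illustrative model of the general technique described above, not the actual ISAAC or Newton pipeline; the bit widths, array sizes, and ADC resolution are arbitrary assumptions.

```python
import numpy as np

# Illustrative model of crossbar-style in-situ MVM: the matrix stays resident
# in the array, inputs are applied bit-serially, each column's analog
# sum-of-products is digitized by an ADC, and partial results are combined
# with shift-and-add. Bit widths and sizes are arbitrary assumptions.

rng = np.random.default_rng(0)
rows, cols = 128, 128
weights = rng.integers(0, 16, (rows, cols))   # 4-bit weights "stored" in the crossbar
x = rng.integers(0, 256, rows)                # 8-bit unsigned input vector

INPUT_BITS = 8
ADC_BITS = 11                                 # sized here to avoid clipping; real
adc_max = 2 ** ADC_BITS - 1                   # designs use encodings to shrink the ADC

result = np.zeros(cols, dtype=np.int64)
for b in range(INPUT_BITS):
    bit_vec = (x >> b) & 1                    # apply one input bit per "cycle"
    column_currents = bit_vec @ weights       # analog sum of products per column
    digitized = np.minimum(column_currents, adc_max)   # ADC read of each column
    result += digitized.astype(np.int64) << b          # shift-and-add across bits

exact = x @ weights
print("max abs error vs. exact MVM:", np.max(np.abs(result - exact)))
```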

    이쒅 μžμ—°μ–΄ 처리 λͺ¨λΈμ„ μœ„ν•œ ν™•μž₯ν˜• 컴퓨터 μ‹œμŠ€ν…œ 섀계

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Jangwoo Kim. Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP)-based neural network models (e.g., Memory Networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only proposal-specific basic operations and/or customize them for specific models only, which leads to low performance improvement and narrow model coverage. Therefore, an ideal NLP accelerator should first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and it should adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/config diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation due to the model/config diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform. As the importance of natural language processing has grown, many companies and research groups have proposed diverse and complex NLP models; that is, NLP models are becoming more complex in structure, larger in scale, and more diverse in kind. To address the complexity, scalability, and diversity of these NLP models, this dissertation presents several key ideas: (1) perform static/dynamic analysis to identify how performance overheads are distributed across different kinds of NLP models; (2) propose a holistic model-parallelization technique that optimizes the memory usage of the major performance bottlenecks identified by this analysis; (3) propose techniques that reduce the computation of several operations, together with a dynamic scheduler that resolves the skewness caused by this computation reduction; (4) propose a technique that derives a design optimized for each model to cope with the performance diversity of current NLP models. Because these key techniques apply generally across many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they are highly effective and can be broadly adopted in computer system design for NLP models.
λ³Έ λ…Όλ¬Έμ—μ„œλŠ” ν•΄λ‹Ή κΈ°μˆ λ“€μ„ μ μš©ν•˜μ—¬ CPU, GPU, FPGA 각각의 ν™˜κ²½μ—μ„œ, μ œμ‹œλœ κΈ°μˆ λ“€μ΄ λͺ¨λ‘ μœ μ˜λ―Έν•œ μ„±λŠ₯ν–₯상을 달성함을 보여쀀닀.1 INTRODUCTION 1 2 Background 6 2.1 Memory Networks 6 2.2 Deep Learning for NLP 9 3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks 14 3.1 Motivation & Design Goals 14 3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements 15 3.1.2 Performance Problems in MemNN - High Computation 16 3.1.3 Performance Problems in MemNN - Shared Cache Contention 17 3.1.4 Design Goals 18 3.2 MnnFast 19 3.2.1 Column-Based Algorithm 19 3.2.2 Zero Skipping 22 3.2.3 Embedding Cache 25 3.3 Implementation 26 3.3.1 General-Purpose Architecture - CPU 26 3.3.2 General-Purpose Architecture - GPU 28 3.3.3 Custom Hardware (FPGA) 29 3.4 Evaluation 31 3.4.1 Experimental Setup 31 3.4.2 CPU 33 3.4.3 GPU 35 3.4.4 FPGA 37 3.4.5 Comparison Between CPU and FPGA 39 3.5 Conclusion 39 4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models 40 4.1 Motivation & Design Goals 40 4.1.1 High Model Complexity 40 4.1.2 High Memory Bandwidth 41 4.1.3 Heavy Computation 42 4.1.4 Huge Performance Variation 43 4.1.5 Design Goals 43 4.2 NLP-Fast 44 4.2.1 Bottleneck Analysis of NLP Models 44 4.2.2 Holistic Model Partitioning 47 4.2.3 Cross-operation Zero Skipping 51 4.2.4 Adaptive Hardware Reconfiguration 54 4.3 NLP-Fast Toolkit 56 4.4 Implementation 59 4.4.1 General-Purpose Architecture - CPU 59 4.4.2 General-Purpose Architecture - GPU 61 4.4.3 Custom Hardware (FPGA) 62 4.5 Evaluation 64 4.5.1 Experimental Setup 65 4.5.2 CPU 65 4.5.3 GPU 67 4.5.4 FPGA 69 4.6 Conclusion 72 5 Related Work 73 5.1 Various DNN Accelerators 73 5.2 Various NLP Accelerators 74 5.3 Model Partitioning 75 5.4 Approximation 76 5.5 Improving Flexibility 78 5.6 Resource Optimization 78 6 Conclusion 80 Abstract (In Korean) 106Docto

    Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

    Get PDF
    Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications, ranging from computer vision for medicine to autonomous driving of modern cars, as well as in other sectors such as security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks, requiring significant computational power during both training and inference. A single inference of a DL model may require billions of multiply-and-accumulate operations, making DL extremely compute- and energy-hungry. In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need arises for cost-effective hardware platforms capable of energy-efficient DL execution. This paper first introduces the key properties of two brain-inspired models, the Deep Neural Network (DNN) and the Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. This work summarizes and compares works targeting the four leading platforms for executing these algorithms, namely CPU, GPU, FPGA, and ASIC, describing the main state-of-the-art solutions and giving particular prominence to the last two platforms since they offer greater design flexibility and the potential for high energy efficiency, especially for the inference process. In addition to hardware solutions, this paper discusses some of the important security issues that these DNN and SNN models may face during their execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and of the hardware systems designed for them.
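
    To ground the claim that a single inference may require billions of multiply-and-accumulate operations, the short sketch below counts MACs for a convolutional layer and a fully-connected layer from their shapes. The layer dimensions are illustrative assumptions, not figures taken from the paper.

```python
# Counting multiply-and-accumulate (MAC) operations from layer shapes.
# The layer dimensions below are illustrative assumptions only.

def conv2d_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    """MACs for a standard 2D convolution (one input image)."""
    return out_h * out_w * out_ch * in_ch * k_h * k_w

def fc_macs(in_features, out_features):
    """MACs for a fully-connected layer (one input vector)."""
    return in_features * out_features

# Example: a 3x3 convolution producing a 56x56x256 output from 256 input channels.
conv = conv2d_macs(out_h=56, out_w=56, out_ch=256, in_ch=256, k_h=3, k_w=3)
fc = fc_macs(in_features=4096, out_features=4096)

print(f"conv layer: {conv / 1e9:.2f} GMACs, fc layer: {fc / 1e6:.1f} MMACs")
# A deep network stacks dozens of such layers, so a single inference
# easily reaches billions of MACs.
```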

    Exploiting and coping with sparsity to accelerate DNNs on CPUs

    Get PDF
    Deep Neural Networks (DNNs) have become ubiquitous, achieving state-of-the-art results across a wide range of tasks. While GPUs and domain-specific accelerators are emerging, general-purpose CPUs hold a firm position in the DNN market due to their high flexibility, high availability, high memory capacity, and low latency. Various working sets in DNN workloads can be sparse, i.e., contain zeros, and the level of sparsity varies depending on its source. First, when the sparsity level is low, traditional sparse algorithms are not competitive with dense algorithms. In such cases, the common practice is to apply dense algorithms to uncompressed sparse inputs. However, this implies that a fraction of the computations are ineffectual because they operate on zero-valued inputs. Second, when the sparsity level is high, one may apply traditional sparse algorithms to compressed sparse inputs. Although such an approach does not induce ineffectual computations, the indirection in a compressed format often causes irregular memory accesses, hampering performance. This thesis studies how to improve DNN training and inference performance on CPUs, both by discovering work-skipping opportunities in the first case and by coping with the irregularity in the second. To tackle the first case, this thesis proposes both a pure software approach and a software-transparent hardware approach. The software approach, called SparseTrain, leverages the moderately sparse activations in Convolutional Neural Networks (CNNs) to speed up their training and inference. Such sparsity changes dynamically and is unstructured, i.e., it has no discernible pattern. SparseTrain detects the zeros inside a dense representation and dynamically skips over useless computations at run time. The hardware approach, called the Sparsity Aware Vector Engine (SAVE), exploits the unstructured sparsity in both the activations and the weights. Similar to SparseTrain, SAVE dynamically detects zeros in a dense representation and then skips the ineffectual work. SAVE augments a CPU's vector processing pipeline: it assembles denser vector operands by combining effectual vector lanes from multiple vector instructions that contain ineffectual lanes. SAVE is general purpose, accelerating any vector workload that has zeros in its inputs, yet it also contains optimizations targeting matrix-multiplication-based DNN models. Both SparseTrain and SAVE accelerate DNN training and inference on CPUs significantly. For the second case, this thesis focuses on a type of DNN that is severely impacted by the irregularity arising from sparsity: Graph Neural Networks (GNNs). GNNs take graphs as input, and graphs often contain highly sparse connections. This thesis proposes software optimizations that (i) overlap the irregular memory accesses with the compute, (ii) compress and decompress the features dynamically, and (iii) improve the temporal reuse of the features. The optimized implementation significantly outperforms a state-of-the-art GNN implementation. In addition, this thesis discusses the idea of offloading a GNN's irregular memory access phase to an augmented Direct Memory Access (DMA) engine as future work.
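
    The core of the software approach described above is detecting zeros inside a dense representation at run time and skipping the work they would generate. A minimal sketch of that idea follows, where each row's zero activations are found on the fly and excluded from the matrix product; it is not the SparseTrain or SAVE implementation, and the shapes are arbitrary assumptions.

```python
import numpy as np

# Run-time zero skipping over a dense activation matrix.
# Shapes are arbitrary; this sketches the general idea only.

rng = np.random.default_rng(0)
m, k, n = 64, 512, 256
activations = np.maximum(rng.standard_normal((m, k)), 0.0)   # ReLU output: ~50% zeros
weights = rng.standard_normal((k, n))

out = np.zeros((m, n))
total, used = 0, 0
for i in range(m):
    nz = np.nonzero(activations[i])[0]            # detect zeros at run time
    total += k
    used += nz.size
    out[i] = activations[i, nz] @ weights[nz, :]  # skip work on zero-valued inputs

print(f"effectual inputs: {used}/{total} ({used / total:.1%})")
print("max abs error vs. dense matmul:",
      np.max(np.abs(out - activations @ weights)))
```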

    Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

    Get PDF
    The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, and sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as to shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial on sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation and the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned-parameter efficiency that could serve as a baseline for comparing different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
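
    As a concrete instance of the pruning techniques this survey covers, the sketch below applies one-shot global magnitude pruning to a weight matrix: the smallest-magnitude fraction of weights is zeroed and a binary mask records the sparsity pattern. This is a generic illustration with an arbitrary sparsity target, not a specific method from the paper.

```python
import numpy as np

# One-shot global magnitude pruning of a weight tensor.
# The 90% sparsity target is an arbitrary choice for illustration.

def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
pruned, mask = magnitude_prune(w, sparsity=0.9)

print(f"remaining weights: {mask.mean():.1%}")   # ~10% of parameters survive
# During sparse training, the mask is typically reapplied after each update
# (or periodically pruned and regrown, as in the growth methods the survey describes).
```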

    Artificial general intelligence: Proceedings of the Second Conference on Artificial General Intelligence, AGI 2009, Arlington, Virginia, USA, March 6-9, 2009

    Get PDF
    Artificial General Intelligence (AGI) research focuses on the original and ultimate goal of AI – to create broad human-like and transhuman intelligence, by exploring all available paths, including theoretical and experimental computer science, cognitive science, neuroscience, and innovative interdisciplinary methodologies. Due to the difficulty of this task, for the last few decades the majority of AI researchers have focused on what has been called narrow AI – the production of AI systems displaying intelligence regarding specific, highly constrained tasks. In recent years, however, more and more researchers have recognized the necessity – and feasibility – of returning to the original goals of the field. Increasingly, there is a call for a transition back to confronting the more difficult issues of human level intelligence and more broadly artificial general intelligence