263 research outputs found
Deep Neural Network Training Accelerator Architecture Design: Accelerating Backward Propagation Using Neuron Sparsity
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8.
Deep neural networks have become one of the most important technologies in the many fields of computer science that try to emulate human perception; in some of those fields, they have already surpassed human performance. Since general-purpose GPUs (GPGPUs) were shown to speed up deep neural networks, the GPU has been the main device for running them. As network complexity grows, deep neural networks require more and more computing resources. However, GPGPUs consume a great deal of energy, so the demand for hardware dedicated to deep neural networks is rising, and to date such dedicated hardware has focused mainly on inference. With complicated network models, training consumes enormous time and energy on conventional devices, so the need for dedicated DNN training hardware is growing as well.
The dissertation explores deep neural network training accelerator architectures. The training process of a deep neural network (DNN) consists of three phases: forward propagation, backward propagation, and weight update. Among these, backward propagation, which calculates the gradients of activations, is the most time-consuming phase. The dissertation therefore proposes hardware architectures that accelerate DNN training by focusing on the backward propagation phase, exploiting the sparsity of neurons incurred by the ReLU layer or the dropout layer.
The first part of the dissertation proposes a hardware architecture to accelerate DNN backward propagation for convolutional layers. We assume the use of the rectified linear unit (ReLU), the most widely used activation function. Since both the output and the derivative of ReLU are zero for negative inputs, the gradient of an activation is also zero wherever that activation is zero. Thus there is no need to compute the gradient of an input activation whose value is zero. Based on this observation, we design an efficient DNN training accelerator that skips the gradient computations for zero activations, and we show the effectiveness of the approach through experiments with our accelerator design.
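The skip condition described above is easy to state in software. Below is a minimal NumPy sketch (function and variable names are my own, not from the dissertation) of masking the backward pass so that positions with zero forward activation receive a zero gradient without computation; the hardware analogue is a bit-vector that lets the datapath skip those positions entirely.

```python
import numpy as np

def relu_backward_skip(grad_out, activations):
    """Backward pass through ReLU: positions whose forward activation is
    zero are known to have zero gradient, so they are never computed.

    grad_out    -- upstream gradient, same shape as the activations
    activations -- ReLU outputs saved from the forward pass
    """
    grad_in = np.zeros_like(grad_out)
    nonzero = activations != 0          # the "bit-vector" of useful positions
    # Only non-zero positions do any work; zero positions are skipped outright.
    grad_in[nonzero] = grad_out[nonzero]
    return grad_in
```

Because ReLU typically zeroes a large fraction of activations, a correspondingly large fraction of the gradient work in this phase can be skipped, which is where the claimed acceleration comes from.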
The second part of the dissertation proposes a hardware architecture for the fully connected layer. Similar to the ReLU layer, the dropout layer yields an explicit zero gradient for each dropped activation, without any gradient computation. Dropout is a regularization technique that addresses the overfitting problem: during DNN training, it randomly disconnects connections between neurons. Since the error does not propagate through the disconnected connections, the corresponding gradients are known to be zero before computation. Making use of this characteristic, the dissertation proposes hardware that accelerates the backward propagation of the fully connected layer, and shows the effectiveness of the approach through simulation.
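As a software sketch of the same skip applied to a fully connected layer followed by dropout (names are illustrative; nothing beyond the abstract is assumed): the dropout mask saved in the forward pass tells the backward pass which gradients are zero before any multiply-accumulate is issued.

```python
import numpy as np

def fc_dropout_backward(grad_out, weights, keep_mask):
    """Backward pass of (fully connected layer -> dropout).

    grad_out  -- gradient w.r.t. the dropout output, shape (batch, n_out)
    weights   -- FC weight matrix, shape (n_in, n_out)
    keep_mask -- boolean mask from the forward pass; False marks a dropped
                 unit, whose gradient is zero with no computation needed
    """
    # Dropped units propagate no error; zeroing them first means their
    # columns contribute nothing and can be skipped entirely in hardware.
    grad_kept = grad_out * keep_mask
    return grad_kept @ weights.T        # error through surviving connections
```

In hardware, the mask is known before the backward matrix multiplication starts, so the dropped rows or columns need never enter the datapath at all.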
1 Introduction 1
1.1 Deep Neural Network Training 4
1.2 Convolutional Neural Network 5
1.2.1 Forward propagation 5
1.2.2 Backward propagation 6
1.2.3 Weight update 6
1.3 Rectified Linear Unit 7
1.4 Dropout 8
1.5 Previous Works 9
2 Acceleration of DNN Backward Propagation on CNN layer 12
2.1 Motivation 12
2.2 Selective Gradient Computation for Zero Activations 17
2.2.1 Baseline Architecture 17
2.2.2 Bit-Vector for Selective Gradient Computation 20
2.2.3 Filter Collector 23
2.2.4 Zero-Gradient Insertion in Write DMA 26
2.3 Overall Architecture 27
2.4 SRAM Buffer 28
2.4.1 Motivation 28
2.4.2 Selective Gradient Computation and SRAM Buffer 29
2.5 Experimental Results 32
2.5.1 Performance Simulator 32
2.5.2 RTL Implementation 33
2.5.3 Performance Improvement 35
2.5.4 Energy Reduction 37
2.5.5 Impacts of SRAM Buffer 39
2.6 Summary 45
3 Acceleration of DNN Backward Propagation on Fully Connected Layer 46
3.1 Motivation 46
3.1.1 Dropout 46
3.1.2 Conventional Dropout Layer Implementations 46
3.1.3 Applications of Dropout 47
3.2 Selective Gradient Computation for Dropped Activations 50
3.2.1 Baseline Architecture 50
3.2.2 Filter Dropper 50
3.3 Overall Architecture 54
3.4 Experimental Results 55
3.4.1 Simulator and Benchmark 55
3.4.2 Results 55
3.5 Summary 56
4 Conclusion 57
The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey
Graph Neural Networks (GNNs) are an emerging research field. This specialized
Deep Neural Network (DNN) architecture is capable of processing graph
structured data and bridges the gap between graph processing and Deep Learning
(DL). As graphs are everywhere, GNNs can be applied to various domains
including recommendation systems, computer vision, natural language processing,
biology and chemistry. With the rapidly growing size of real-world graphs, the
need for efficient and scalable GNN training solutions has become pressing. Consequently,
many works proposing GNN systems have emerged throughout the past few years.
However, there is an acute lack of overview, categorization and comparison of
such systems. We aim to fill this gap by summarizing and categorizing important
methods and techniques for large-scale GNN solutions. In addition, we establish
connections between GNN systems, graph processing systems and DL systems.
Comment: Accepted at ACM Computing Surveys
Distributed Graph Neural Network Training: A Survey
Graph neural networks (GNNs) are deep learning models that are
trained on graphs and have been successfully applied in various domains.
Despite the effectiveness of GNNs, it is still challenging for GNNs to
efficiently scale to large graphs. As a remedy, distributed computing has become a
promising solution for training large-scale GNNs, since it can provide
abundant computing resources. However, the dependencies induced by the graph
structure make high-efficiency distributed GNN training difficult to achieve,
as it suffers from massive communication and workload imbalance. In recent
years, many efforts have been made on distributed GNN training, and an array of
training algorithms and systems have been proposed. Yet, there is a lack of
systematic review on the optimization techniques for the distributed execution
of GNN training. In this survey, we analyze three major challenges in
distributed GNN training, namely massive feature communication, loss of
model accuracy, and workload imbalance. We then introduce a new taxonomy for the
optimization techniques in distributed GNN training that address the above
challenges. The new taxonomy classifies existing techniques into four
categories: GNN data partition, GNN batch generation, GNN execution
model, and GNN communication protocol. We carefully discuss the techniques in
each category. In the end, we summarize existing distributed GNN systems for
multi-GPUs, GPU-clusters and CPU-clusters, respectively, and give a discussion
about future directions in distributed GNN training.
Energy Efficient Learning with Low Resolution Stochastic Domain Wall Synapse Based Deep Neural Networks
We demonstrate that extremely low resolution, quantized (nominally 5-state)
synapses with large stochastic variations in Domain Wall (DW) position can be
both energy efficient and achieve reasonably high testing accuracies compared
to Deep Neural Networks (DNNs) of similar sizes that use floating-point
synaptic weights. Specifically, voltage-controlled DW devices demonstrate
stochastic behavior as modeled rigorously with micromagnetic simulations and
can only encode a limited number of states; however, they can be extremely energy efficient
during both training and inference. We show that by implementing suitable
modifications to the learning algorithms, we can address the stochastic
behavior as well as mitigate the effect of their low resolution to achieve high
testing accuracies. In this study, we propose both in-situ and ex-situ training
algorithms, based on a modification of the algorithm proposed by Hubara et al.
[1], which works well with quantization of synaptic weights. We train several
5-layer DNNs on the MNIST dataset using 2-, 3- and 5-state DW devices as synapses.
For in-situ training, a separate high precision memory unit is adopted to
preserve and accumulate the weight gradients, which are then quantized to
program the low precision DW devices. Moreover, a sizeable noise tolerance
margin is used during the training to address the intrinsic programming noise.
For ex-situ training, a precursor DNN is first trained based on the
characterized DW device model and a noise tolerance margin, which is similar to
the in-situ training. Remarkably, for in-situ inference the energy dissipation
to program the devices is only 13 pJ per inference given that the training is
performed over the entire MNIST dataset for 10 epochs.
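The in-situ scheme described above can be sketched as follows. This is a rough illustration under my own assumptions: the level spacing, tolerance margin, and noise model are invented for the sketch, not taken from the paper. Gradients accumulate in a separate high-precision copy, and a low-precision DW device is reprogrammed to the nearest quantized state only when it drifts outside a noise-tolerance margin.

```python
import numpy as np

LEVELS = np.linspace(-1.0, 1.0, 5)   # nominal 5-state device conductances

def quantize(w):
    """Snap each high-precision weight to the nearest of the 5 states."""
    return LEVELS[np.abs(w[..., None] - LEVELS).argmin(axis=-1)]

def in_situ_step(w_hp, w_dev, grad, lr=0.1, margin=0.3, noise_std=0.02,
                 rng=None):
    """One update: accumulate the gradient in the high-precision memory
    (w_hp), then program the low-precision devices (w_dev) toward the
    quantized target, but only where they exceed the tolerance margin."""
    rng = rng or np.random.default_rng(0)
    w_hp = w_hp - lr * grad                      # high-precision accumulation
    target = quantize(w_hp)                      # desired device states
    reprogram = np.abs(w_dev - target) > margin  # outside the noise margin?
    # Programming is stochastic: a write lands near, not exactly on, target.
    w_dev = np.where(reprogram,
                     target + rng.normal(0.0, noise_std, w_dev.shape),
                     w_dev)
    return w_hp, w_dev
```

The margin plays the role of the noise-tolerance band described in the abstract: small drifts are ignored, so intrinsic programming noise does not trigger constant rewrites.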
Distributed Pruning Towards Tiny Neural Networks in Federated Learning
Neural network pruning is an essential technique for reducing the size and
complexity of deep neural networks, enabling large-scale models on devices with
limited resources. However, existing pruning approaches heavily rely on
training data for guiding the pruning strategies, making them ineffective for
federated learning over distributed and confidential datasets. Additionally,
the memory- and computation-intensive pruning process becomes infeasible for
resource-constrained devices in federated learning. To address these
challenges, we propose FedTiny, a distributed pruning framework for federated
learning that generates specialized tiny models for memory- and
computing-constrained devices. We introduce two key modules in FedTiny to
adaptively search coarse- and finer-pruned specialized models to fit deployment
scenarios with sparse and cheap local computation. First, an adaptive batch
normalization selection module is designed to mitigate biases in pruning caused
by the heterogeneity of local data. Second, a lightweight progressive pruning
module prunes the models at a finer granularity under strict memory and computational
budgets, allowing the pruning policy for each layer to be gradually determined
rather than evaluating the overall model structure. The experimental results
demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art
approaches, particularly when compressing deep models to extremely sparse tiny
models. FedTiny achieves an accuracy improvement of 2.61% while significantly
reducing the computational cost by 95.91% and the memory footprint by 94.01%
compared to state-of-the-art methods.
Comment: This paper has been accepted to ICDCS 202
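The progressive idea above, fixing each layer's sparsity in turn instead of scoring the whole model at once, can be illustrated with a toy magnitude-pruning sketch. The budget-splitting policy below is my own simplification for illustration, not FedTiny's actual module.

```python
import numpy as np

def progressive_prune(layers, budget):
    """Prune layer by layer so that at most `budget` weights survive in
    total, deciding each layer's keep-count in turn; earlier decisions
    shrink the budget left for later layers."""
    pruned, remaining = [], budget
    for i, w in enumerate(layers):
        # Give this layer an equal share of what is left among the
        # layers not yet decided.
        keep = min(w.size, remaining // (len(layers) - i))
        if keep == 0:
            w = np.zeros_like(w)
        elif keep < w.size:
            # Magnitude pruning: zero everything below the keep-th largest.
            thresh = np.sort(np.abs(w), axis=None)[-keep]
            w = np.where(np.abs(w) >= thresh, w, 0.0)
        pruned.append(w)
        remaining -= np.count_nonzero(w)
    return pruned
```

Deciding layers one at a time keeps the peak memory of the pruning step itself small, which is the property the abstract emphasizes for constrained devices.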
- …