263 research outputs found

    ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต ๊ฐ€์†๊ธฐ ๊ตฌ์กฐ ์„ค๊ณ„: ๋‰ด๋Ÿฐ์˜ ์„ฑ๊น€์„ ์ด์šฉํ•œ ์—ญ์ „ํŒŒ ๊ฐ€์†

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์ดํ˜์žฌ.Deep neural network has become one of the most important technologies in the various fields in computer science which tried to follow the human sense. In some fields, their performance defeats that of human sense with the help of the deep neural network. Since the fact that general purpose GPU can speed up deep neural network, GPU became the main device used for deep neural network. As the complexity of deep neural network becomes high that deep neural network requires more and more computing resources. However, general-purpose GPU consumes a lot of energy that the needs of specific hardware for deep neural network are rising. And nowadays, the specific hardwares are focusing on inference. With complicated network models, training a model consumes enormous time and energy using conventional devices. So there are increasing needs specific hardwares for DNN training. The dissertation exploits deep neural network training accelerator architecture. The training process of a deep neural network (DNN) consists of three phases: forward propagation, backward propagation, and weight update. Among these, backward propagation for calculating gradients of activations is the most time consuming phase. The dissertation proposes hardware architectures to accelerate DNN training, focus- ing on the backward propagation phase. The dissertation makes use of the sparsity of the neurons incurred by ReLU layer or dropout layer to accelerate the backward propagation. The first part of the dissertation proposes a hardware architecture to accelerate DNN backward propagation for convolutional layer. We assume using rectified linear unit (ReLU), which is the most widely used activation function. Since the output as well as the derivative of ReLU is zero for negative inputs, the gradient for activation is also zero for negative values. Thus, it is not needed to calculate the gradient of input activation if the input activation value is zero. Based on this observation, we design an efficient DNN accelerating hardware that skips the gradient computations for zero activations. We show the effectiveness of the approach through experiments with our accelerator design. The second part of the dissertation proposes a hardware architecture for fully connected layer. Similar to ReLU layer, dropoout layer has explicit zero gradient for the dropped activation without gradient computation. Dropout is one of the regulariza- tion techniques which can solve the overfitting problem. During the DNN training, the dropout disconnect connections between neurons. Since the error does not propagated through the disconnected connections, we can detect zero gradient becomre computation. Making use of this characteristics, the dissertation proposes a hardware which can accelerate the backward propagation of fully connected layer. Further, the dissertation showed the effectiveness of the approach through simulation.์‹ฌ์ธต์‹ ๊ฒฝ๋ง์€ ์ปดํ“จํ„ฐ ๊ณผํ•™์˜ ๋‹ค์–‘ํ•œ ๋ถ„์•ผ ์ค‘ ์ธ๊ฐ„์˜ ๊ฐ๊ฐ์„ ์ซ“๋Š” ๋ถ„์•ผ์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ธฐ์ˆ ์ด ๋˜์–ด์™”๋‹ค. ๋ช‡๋ช‡ ๋ถ„์•ผ์—์„œ๋Š” ์ด๋ฏธ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ๋„์›€์œผ๋กœ ์ธ๊ฐ„์˜ ๊ฐ ๊ฐ์„ ๋›ฐ์–ด๋„˜์€ ๋ถ„์•ผ๋„ ์กด์žฌํ•œ๋‹ค. GPGPU๋ฅผ ์ด์šฉํ•œ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ๊ฐ€์†์ด ๊ฐ€๋Šฅํ•ด์ง„ ์ดํ›„, GPU๋Š” ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์— ์žˆ์–ด ๊ฐ€์žฅ ์ฃผ์š”ํ•œ ์žฅ์น˜๋กœ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ์˜ ๋ณต์žก๋„๊ฐ€ ๋†’์•„์ง์— ๋”ฐ๋ผ ์—ฐ์‚ฐ์— ๋” ๋งŽ์€ ์ปดํ“จํŒ… ์ž์›์„ ์š”๊ตฌํ•˜๊ณ  ์žˆ๋‹ค. 
๊ทธ๋Ÿฌ๋‚˜ GPGPU๋Š” ์—๋„ˆ์ง€ ์†Œ๋ชจ๊ฐ€ ํฌ๊ธฐ์— ํšจ์œจ์ ์ธ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ์ „์šฉ ํ•˜๋“œ์›จ์–ด ๊ฐœ๋ฐœ์— ๋Œ€ํ•œ ์š”๊ตฌ๊ฐ€ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ํ˜„์žฌ๊นŒ์ง€ ์ด๋Ÿฌํ•œ ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋Š” ์ฃผ๋กœ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ์ถ”๋ก ์— ์ง‘์ค‘๋˜์–ด ์™”๋‹ค. ๋ณต์žกํ•œ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์€ ํ•™์Šต์— ๊ธด ์‹œ๊ฐ„์ด ๋“ค๊ณ  ๋งŽ์€ ์—๋„ˆ์ง€๋ฅผ ์†Œ ๋ชจํ•œ๋‹ค. ์ด์— ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต์„ ์œ„ํ•œ ์ „์šฉ ํ•˜๋“œ์›จ์–ด์— ๋Œ€ํ•œ ์š”๊ตฌ๊ฐ€ ๋Š˜์–ด๊ฐ€๊ณ  ์žˆ๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์€ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต ๊ฐ€์†๊ธฐ ๊ตฌ์กฐ๋ฅผ ํƒ์ƒ‰ํ•˜์˜€๋‹ค. ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต ์€ ์ˆœ์ „ํŒŒ, ์—ญ์ „ํŒŒ, ๊ฐ€์ค‘์น˜ ๊ฐฑ์‹  ์ด๋ ‡๊ฒŒ ์„ธ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค. ์ด ์ค‘ ์•กํ‹ฐ๋ฒ ์ด ์…˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ตฌํ•˜๋Š” ์—ญ์ „ํŒŒ ๋‹จ๊ณ„๊ฐ€ ๊ฐ€์žฅ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š” ๋‹จ๊ณ„์ด๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ์—ญ์ „ํŒŒ ๋‹จ๊ณ„์— ์ค‘์ ์„ ๋‘” ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต์„ ๊ฐ€์†ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ReLU ๋ ˆ์ด์–ด ํ˜น์€ dropout ๋ ˆ์ด์–ด๋กœ ์ธํ•ด ์ƒ๊ธด ๋‰ด๋ก ์˜ ์„ฑ๊น€์„ ์ด์šฉํ•˜์—ฌ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต์˜ ์—ญ์ „ํŒŒ๋ฅผ ๊ฐ€์†ํ•œ๋‹ค. ํ•™์œ„๋…ผ๋ฌธ์˜ ์ฒซ ๋ถ€๋ถ„์€ ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง์˜ ์—ญ์ „ํŒŒ๋ฅผ ๊ฐ€์†ํ•˜๋Š” ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต ํ•˜ ๋“œ์›จ์–ด์ด๋‹ค. ๊ฐ€์žฅ ๋งŽ์ด ์“ฐ์ด๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜์ธ ReLU๋ฅผ ์ด์šฉํ•˜๋Š” ์‹ ๊ฒฝ๋ง์„ ๊ฐ€์ •ํ–ˆ๋‹ค. ์Œ์ˆ˜ ์ž…๋ ฅ๊ฐ’์— ๋Œ€ํ•œ ReLU ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ๋„ํ•จ์ˆ˜๊ฐ€ 0์ด ๋˜์–ด ํ•ด๋‹น ์•กํ‹ฐ๋ฒ ์ด์…˜์˜ ๊ทธ ๋ž˜๋””์–ธํŠธ ๋˜ํ•œ 0์ด ๋œ๋‹ค. ์ด ๊ฒฝ์šฐ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ฐ’์— ๋Œ€ํ•œ ๊ณ„์‚ฐ ์—†์ด๋„ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ฐ’์ด 0์ด ๋˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๊ธฐ์— ํ•ด๋‹น ๊ทธ๋ž˜๋””์–ธํŠธ๋Š” ๊ณ„์‚ฐํ•˜์ง€ ์•Š์•„๋„ ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ ์ด์šฉํ•˜์—ฌ 0๊ฐ’์ธ ์•กํ‹ฐ๋ฒ ์ด์…˜์— ๋Œ€ํ•œ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ์„ ๊ฑด๋„ˆ ๋›ธ ์ˆ˜ ์žˆ๋Š” ํšจ์œจ์ ์ธ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ๊ฐ€์† ํ•˜๋“œ์›จ์–ด๋ฅผ ์„ค๊ณ„ํ–ˆ๋‹ค. ๋˜ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๋ณธ ํ•˜๋“œ์›จ์–ด์˜ ํšจ์œจ์„ฑ์„ ๊ฒ€์ฆํ–ˆ๋‹ค. ํ•™์œ„๋…ผ๋ฌธ์˜ ๋‘๋ฒˆ์งธ ๋ถ€๋ถ„์€ ์™„์ „์—ฐ๊ฒฐ ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต์„ ๊ฐ€์†ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ตฌ์กฐ ์ œ์•ˆ์ด๋‹ค. ReLU ๋ ˆ์ด์–ด์™€ ๋น„์Šทํ•˜๊ฒŒ dropout ๋ ˆ์ด์–ด ๋˜ํ•œ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ ์—†์ด๋„ ๊ทธ ๊ฒฐ๊ณผ๊ฐ€ 0์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. Dropout์€ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง์˜ ๊ณผ์ ํ•ฉ์„ ํ•ด๊ฒฐํ•˜๋Š” ์ผ๋ฐ˜ํ™” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ, ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ํ•™์Šต ๊ณผ์ • ๋™์•ˆ์—๋งŒ ๋ฌด์ž‘์œ„๋กœ ์‹ ๊ฒฝ๋ง์˜ ์—ฐ๊ฒฐ์„ ๋Š์–ด ๋†“๋Š”๋‹ค. ์‹ ๊ฒฝ๋ง์ด ๋Š์–ด์ง„ ๊ฒฝ๋กœ๋กœ๋Š” ์—ญ์ „ํŒŒ ๋‹จ๊ณ„์—์„œ ์—๋Ÿฌ๊ฐ€ ์ „ํŒŒ๋˜์ง€ ์•Š๊ธฐ์— ํ•ด๋‹น ๊ทธ๋ž˜๋””์–ธํŠธ ๊ฐ’ ๋˜ํ•œ 0์ž„์„ ๋ฏธ๋ฆฌ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์ด ํŠน์„ฑ์„ ์ด์šฉํ•˜์—ฌ ์™„์ „์—ฐ๊ฒฐ ์‹ ๊ฒฝ๋ง์˜ ์—ญ์ „ํŒŒ๋ฅผ ๊ฐ€์†ํ•  ์ˆ˜ ์žˆ๋Š” ํ•˜๋“œ์›จ์–ด๋ฅผ ์„ค๊ณ„ํ–ˆ๋‹ค. 
๋˜ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ํ†ตํ•ด ๋ณธ ํ•˜๋“œ์›จ ์–ด์˜ ํšจ์œจ์„ฑ์„ ๊ฒ€์ฆํ–ˆ๋‹ค.1 Introduction 1 1.1 Deep Neural Network Training 4 1.2 Convolutional Neural Network 5 1.2.1 Forward propagation 5 1.2.2 Backward propagation 6 1.2.3 Weight update 6 1.3 Rectified Linear Unit 7 1.4 Dropout 8 1.5 Previous Works 9 2 Acceleration of DNN Backward Propagation on CNN layer 12 2.1 Motivation 12 2.2 Selective Gradient Computation for Zero Activations 17 2.2.1 Baseline Architecture 17 2.2.2 Bit-Vector for Selective Gradient Computation 20 2.2.3 Filter Collector 23 2.2.4 Zero-Gradient Insertion in Write DMA 26 2.3 Overall Architecture 27 2.4 SRAM Buffer 28 2.4.1 Motivation 28 2.4.2 Selective Gradient Computation and SRAM Buffer 29 2.5 Experimental Results 32 2.5.1 Performance Simulator 32 2.5.2 RTL Implementation 33 2.5.3 Performance Improvement 35 2.5.4 Energy Reduction 37 2.5.5 Impacts of SRAM Buffer 39 2.6 Summary 45 3 Acceleration of DNN Backward Propagation on Fully Connected Layer 46 3.1 Motivation 46 3.1.1 Dropout 46 3.1.2 Conventional Dropout Layer Implementations 46 3.1.3 Applications of Dropout 47 3.2 Selective Gradient Computation for Dropped Activations 50 3.2.1 Baseline Architecture 50 3.2.2 Filter Dropper 50 3.3 Overall Architecture 54 3.4 Experimental Results 55 3.4.1 Simulator and Benchmark 55 3.4.2 Results 55 3.5 Summary 56 4 Conclusion 57Docto

    The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey

    Full text link
    Graph Neural Networks (GNNs) are an emerging research field. This specialized Deep Neural Network (DNN) architecture can process graph-structured data and bridges the gap between graph processing and Deep Learning (DL). As graphs are everywhere, GNNs can be applied to various domains including recommendation systems, computer vision, natural language processing, biology, and chemistry. With the rapidly growing size of real-world graphs, the need for efficient and scalable GNN training solutions has arisen. Consequently, many works proposing GNN systems have emerged over the past few years. However, there is an acute lack of overview, categorization, and comparison of such systems. We aim to fill this gap by summarizing and categorizing important methods and techniques for large-scale GNN solutions. In addition, we establish connections between GNN systems, graph processing systems, and DL systems. Comment: Accepted at ACM Computing Surveys.
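
    For readers coming from either side of that gap, the compute pattern of a single GNN layer can be summarized with a minimal, framework-free sketch. This is a hypothetical mean-aggregation layer for illustration only; it is not taken from any system covered by the survey.

```python
# One message-passing layer: each node averages its neighbours' features
# ("graph processing" side) and applies a learned linear transform plus ReLU
# ("deep learning" side).
import numpy as np

def gnn_layer(features, edges, weight):
    """features: (N, F) node features; edges: list of (src, dst); weight: (F, F_out)."""
    n = features.shape[0]
    agg = np.zeros_like(features)
    deg = np.zeros(n)
    for src, dst in edges:                  # message + aggregate
        agg[dst] += features[src]
        deg[dst] += 1
    agg /= np.maximum(deg, 1)[:, None]      # mean over in-neighbours
    return np.maximum(agg @ weight, 0.0)    # update: linear transform + ReLU

if __name__ == "__main__":
    feats = np.random.randn(4, 3)
    w = np.random.randn(3, 2)
    print(gnn_layer(feats, [(0, 1), (1, 2), (2, 0), (3, 0)], w).shape)  # (4, 2)
```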

    Distributed Graph Neural Network Training: A Survey

    Full text link
    Graph neural networks (GNNs) are a type of deep learning model trained on graphs and have been successfully applied in various domains. Despite their effectiveness, it is still challenging for GNNs to scale efficiently to large graphs. As a remedy, distributed computing is a promising solution for training large-scale GNNs, since it can provide abundant computing resources. However, the dependencies induced by the graph structure make high-efficiency distributed GNN training difficult, as it suffers from massive communication and workload imbalance. In recent years, many efforts have been made on distributed GNN training, and an array of training algorithms and systems have been proposed. Yet, there is a lack of systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques in distributed GNN training that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. Finally, we summarize existing distributed GNN systems for multi-GPUs, GPU clusters, and CPU clusters, respectively, and discuss future directions for distributed GNN training.
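
    Two of the challenges named above, feature communication and workload imbalance, follow directly from how the graph is partitioned across workers. The sketch below is a hypothetical illustration of a naive hash partition and the statistics it induces; it does not reproduce any particular technique from the survey's taxonomy.

```python
# Hash-partition nodes across workers and measure two proxies the survey
# highlights: cross-partition edges (feature communication volume) and
# per-worker node counts (workload imbalance).
from collections import Counter

def partition_nodes(num_nodes, num_workers):
    """Naive hash partition: node v goes to worker v mod num_workers."""
    return {v: v % num_workers for v in range(num_nodes)}

def partition_stats(edges, assignment, num_workers):
    cross = sum(1 for u, v in edges if assignment[u] != assignment[v])
    load = Counter(assignment.values())
    return {
        "cross_partition_edges": cross,   # edges whose features must be communicated
        "worker_loads": [load.get(w, 0) for w in range(num_workers)],
    }

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    assign = partition_nodes(4, 2)
    print(partition_stats(edges, assign, 2))
```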

    Energy Efficient Learning with Low Resolution Stochastic Domain Wall Synapse Based Deep Neural Networks

    Full text link
    We demonstrate that extremely low-resolution quantized (nominally 5-state) synapses with large stochastic variations in Domain Wall (DW) position can be both energy efficient and achieve reasonably high testing accuracies compared to Deep Neural Networks (DNNs) of similar sizes that use floating-point synaptic weights. Specifically, voltage-controlled DW devices exhibit stochastic behavior, as modeled rigorously with micromagnetic simulations, and can encode only a limited number of states; however, they can be extremely energy efficient during both training and inference. We show that by implementing suitable modifications to the learning algorithms, we can address the stochastic behavior as well as mitigate the effect of their low resolution to achieve high testing accuracies. In this study, we propose both in-situ and ex-situ training algorithms, based on a modification of the algorithm proposed by Hubara et al. [1], which works well with quantization of synaptic weights. We train several 5-layer DNNs on the MNIST dataset using 2-, 3-, and 5-state DW devices as synapses. For in-situ training, a separate high-precision memory unit is adopted to preserve and accumulate the weight gradients, which are then quantized to program the low-precision DW devices. Moreover, a sizeable noise tolerance margin is used during training to address the intrinsic programming noise. For ex-situ training, a precursor DNN is first trained based on the characterized DW device model and a noise tolerance margin, similar to the in-situ training. Remarkably, for in-situ inference the energy dissipation to program the devices is only 13 pJ per inference, given that the training is performed over the entire MNIST dataset for 10 epochs.
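
    The in-situ scheme described above (high-precision gradient accumulation, quantization to a few device states, and a noise tolerance margin) can be illustrated with a small sketch. The quantizer, margin, and noise model below are assumptions made for illustration only, not the paper's exact algorithm or device model.

```python
# Sketch of quantized in-situ training with a shadow weight: a full-precision
# accumulator collects gradient updates, and the device weight seen by the
# network is that value quantized to a few states plus programming noise.
import numpy as np

def quantize_to_states(w, num_states=5, w_max=1.0):
    """Map real-valued weights onto num_states evenly spaced levels in [-w_max, w_max]."""
    levels = np.linspace(-w_max, w_max, num_states)
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def program_device(shadow_w, old_device_w, margin=0.05, noise_std=0.02, rng=None):
    """Re-program only weights whose quantized target moved beyond the noise margin."""
    if rng is None:
        rng = np.random.default_rng()
    target = quantize_to_states(shadow_w)
    update_mask = np.abs(target - old_device_w) > margin          # noise tolerance margin
    noisy_target = target + rng.normal(0.0, noise_std, target.shape)  # stochastic DW programming noise
    return np.where(update_mask, noisy_target, old_device_w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shadow = rng.normal(scale=0.5, size=(4, 4))   # high-precision accumulator after some updates
    device = quantize_to_states(np.zeros((4, 4)))
    print(program_device(shadow, device, rng=rng))
```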

    Distributed Pruning Towards Tiny Neural Networks in Federated Learning

    Full text link
    Neural network pruning is an essential technique for reducing the size and complexity of deep neural networks, enabling large-scale models on devices with limited resources. However, existing pruning approaches rely heavily on training data to guide the pruning strategies, making them ineffective for federated learning over distributed and confidential datasets. Additionally, the memory- and computation-intensive pruning process becomes infeasible for resource-constrained devices in federated learning. To address these challenges, we propose FedTiny, a distributed pruning framework for federated learning that generates specialized tiny models for memory- and computation-constrained devices. We introduce two key modules in FedTiny to adaptively search for coarse- and finer-pruned specialized models that fit deployment scenarios with sparse and cheap local computation. First, an adaptive batch normalization selection module is designed to mitigate biases in pruning caused by the heterogeneity of local data. Second, a lightweight progressive pruning module further prunes the models under strict memory and computational budgets, allowing the pruning policy for each layer to be determined gradually rather than by evaluating the overall model structure at once. The experimental results demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art approaches, particularly when compressing deep models to extremely sparse tiny models. FedTiny achieves an accuracy improvement of 2.61% while significantly reducing the computational cost by 95.91% and the memory footprint by 94.01% compared to state-of-the-art methods. Comment: This paper has been accepted to ICDCS 202
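
    The progressive pruning module described above decides each layer's pruning before moving on, rather than scoring the whole model at once. The sketch below illustrates only that layer-by-layer structure using plain magnitude pruning under a parameter budget; it is not FedTiny's actual policy or codebase.

```python
# Progressive layer-wise magnitude pruning toward a global parameter budget:
# each layer's mask is fixed before the next layer is considered.
import numpy as np

def prune_layer(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping roughly keep_ratio of them."""
    k = max(1, int(weights.size * keep_ratio))
    threshold = np.sort(np.abs(weights), axis=None)[-k]   # k-th largest magnitude
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def progressive_prune(layers, total_budget):
    """Prune layers one by one so the kept parameters roughly fit total_budget."""
    total = sum(w.size for w in layers)
    keep_ratio = total_budget / total          # uniform per-layer target, for simplicity
    pruned = []
    for w in layers:                           # each layer is decided before moving on
        pruned.append(prune_layer(w, keep_ratio))
    return pruned

if __name__ == "__main__":
    layers = [np.random.randn(64, 32), np.random.randn(32, 10)]
    pruned = progressive_prune(layers, total_budget=500)
    print([int((w != 0).sum()) for w in pruned])
```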
    • โ€ฆ
    corecore