263 research outputs found
Deep Neural Network Training Accelerator Architecture Design: Accelerating Backward Propagation Using Neuron Sparsity
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8.
Deep neural networks have become one of the most important technologies in the many fields of computer science that try to emulate human perception; in some of those fields, they have already surpassed human performance. Since general-purpose GPUs (GPGPUs) were shown to speed up deep neural networks, the GPU has been the main device for running them. As network complexity grows, deep neural networks require more and more computing resources. However, GPGPUs consume a great deal of energy, so the demand for hardware dedicated to deep neural networks is rising, and to date such dedicated hardware has focused mainly on inference. With complicated network models, training consumes enormous time and energy on conventional devices, so the need for dedicated DNN training hardware is growing as well.
The dissertation explores deep neural network training accelerator architectures. The training process of a deep neural network (DNN) consists of three phases: forward propagation, backward propagation, and weight update. Among these, backward propagation, which calculates the gradients of activations, is the most time-consuming phase. The dissertation therefore proposes hardware architectures that accelerate DNN training by focusing on the backward propagation phase, exploiting the sparsity of neurons incurred by the ReLU layer or the dropout layer.
The first part of the dissertation proposes a hardware architecture to accelerate DNN backward propagation for convolutional layers. We assume the use of the rectified linear unit (ReLU), the most widely used activation function. Since both the output and the derivative of ReLU are zero for negative inputs, the gradient of an activation is also zero wherever that activation is zero. Thus there is no need to compute the gradient of an input activation whose value is zero. Based on this observation, we design an efficient DNN training accelerator that skips the gradient computations for zero activations, and we show the effectiveness of the approach through experiments with our accelerator design.
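The skip condition described above is easy to state in software. Below is a minimal NumPy sketch (function and variable names are my own, not from the dissertation) of masking the backward pass so that positions with zero forward activation receive a zero gradient without computation; the hardware analogue is a bit-vector that lets the datapath skip those positions entirely.

```python
import numpy as np

def relu_backward_skip(grad_out, activations):
    """Backward pass through ReLU: positions whose forward activation is
    zero are known to have zero gradient, so they are never computed.

    grad_out    -- upstream gradient, same shape as the activations
    activations -- ReLU outputs saved from the forward pass
    """
    grad_in = np.zeros_like(grad_out)
    nonzero = activations != 0          # the "bit-vector" of useful positions
    # Only non-zero positions do any work; zero positions are skipped outright.
    grad_in[nonzero] = grad_out[nonzero]
    return grad_in
```

Because ReLU typically zeroes a large fraction of activations, a correspondingly large fraction of the gradient work in this phase can be skipped, which is where the claimed acceleration comes from.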
The second part of the dissertation proposes a hardware architecture for the fully connected layer. Similar to the ReLU layer, the dropout layer yields an explicit zero gradient for each dropped activation, without any gradient computation. Dropout is a regularization technique that addresses the overfitting problem: during DNN training, it randomly disconnects connections between neurons. Since the error does not propagate through the disconnected connections, the corresponding gradients are known to be zero before computation. Making use of this characteristic, the dissertation proposes hardware that accelerates the backward propagation of the fully connected layer, and shows the effectiveness of the approach through simulation.
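As a software sketch of the same skip applied to a fully connected layer followed by dropout (names are illustrative; nothing beyond the abstract is assumed): the dropout mask saved in the forward pass tells the backward pass which gradients are zero before any multiply-accumulate is issued.

```python
import numpy as np

def fc_dropout_backward(grad_out, weights, keep_mask):
    """Backward pass of (fully connected layer -> dropout).

    grad_out  -- gradient w.r.t. the dropout output, shape (batch, n_out)
    weights   -- FC weight matrix, shape (n_in, n_out)
    keep_mask -- boolean mask from the forward pass; False marks a dropped
                 unit, whose gradient is zero with no computation needed
    """
    # Dropped units propagate no error; zeroing them first means their
    # columns contribute nothing and can be skipped entirely in hardware.
    grad_kept = grad_out * keep_mask
    return grad_kept @ weights.T        # error through surviving connections
```

In hardware, the mask is known before the backward matrix multiplication starts, so the dropped rows or columns need never enter the datapath at all.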
1 Introduction 1
1.1 Deep Neural Network Training 4
1.2 Convolutional Neural Network 5
1.2.1 Forward propagation 5
1.2.2 Backward propagation 6
1.2.3 Weight update 6
1.3 Rectified Linear Unit 7
1.4 Dropout 8
1.5 Previous Works 9
2 Acceleration of DNN Backward Propagation on CNN layer 12
2.1 Motivation 12
2.2 Selective Gradient Computation for Zero Activations 17
2.2.1 Baseline Architecture 17
2.2.2 Bit-Vector for Selective Gradient Computation 20
2.2.3 Filter Collector 23
2.2.4 Zero-Gradient Insertion in Write DMA 26
2.3 Overall Architecture 27
2.4 SRAM Buffer 28
2.4.1 Motivation 28
2.4.2 Selective Gradient Computation and SRAM Buffer 29
2.5 Experimental Results 32
2.5.1 Performance Simulator 32
2.5.2 RTL Implementation 33
2.5.3 Performance Improvement 35
2.5.4 Energy Reduction 37
2.5.5 Impacts of SRAM Buffer 39
2.6 Summary 45
3 Acceleration of DNN Backward Propagation on Fully Connected Layer 46
3.1 Motivation 46
3.1.1 Dropout 46
3.1.2 Conventional Dropout Layer Implementations 46
3.1.3 Applications of Dropout 47
3.2 Selective Gradient Computation for Dropped Activations 50
3.2.1 Baseline Architecture 50
3.2.2 Filter Dropper 50
3.3 Overall Architecture 54
3.4 Experimental Results 55
3.4.1 Simulator and Benchmark 55
3.4.2 Results 55
3.5 Summary 56
4 Conclusion 57
The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey
Graph Neural Networks (GNNs) are an emerging research field. This specialized
Deep Neural Network (DNN) architecture is capable of processing graph
structured data and bridges the gap between graph processing and Deep Learning
(DL). As graphs are everywhere, GNNs can be applied to various domains
including recommendation systems, computer vision, natural language processing,
biology and chemistry. With the rapidly growing size of real-world graphs, the
need for efficient and scalable GNN training solutions has become pressing. Consequently,
many works proposing GNN systems have emerged throughout the past few years.
However, there is an acute lack of overview, categorization and comparison of
such systems. We aim to fill this gap by summarizing and categorizing important
methods and techniques for large-scale GNN solutions. In addition, we establish
connections between GNN systems, graph processing systems and DL systems.
Comment: Accepted at ACM Computing Surveys
Distributed Graph Neural Network Training: A Survey
Graph neural networks (GNNs) are deep learning models that are
trained on graphs and have been successfully applied in various domains.
Despite the effectiveness of GNNs, it is still challenging for GNNs to
efficiently scale to large graphs. As a remedy, distributed computing has become a
promising solution for training large-scale GNNs, since it can provide
abundant computing resources. However, the dependencies induced by the graph
structure make high-efficiency distributed GNN training difficult to achieve,
as it suffers from massive communication and workload imbalance. In recent
years, many efforts have been made on distributed GNN training, and an array of
training algorithms and systems have been proposed. Yet, there is a lack of
systematic review on the optimization techniques for the distributed execution
of GNN training. In this survey, we analyze three major challenges in
distributed GNN training, namely massive feature communication, loss of
model accuracy, and workload imbalance. We then introduce a new taxonomy for the
optimization techniques in distributed GNN training that address the above
challenges. The new taxonomy classifies existing techniques into four
categories: GNN data partition, GNN batch generation, GNN execution
model, and GNN communication protocol. We carefully discuss the techniques in
each category. In the end, we summarize existing distributed GNN systems for
multi-GPUs, GPU-clusters and CPU-clusters, respectively, and give a discussion
about future directions in distributed GNN training.
Energy Efficient Learning with Low Resolution Stochastic Domain Wall Synapse Based Deep Neural Networks
We demonstrate that extremely low resolution, quantized (nominally 5-state)
synapses with large stochastic variations in Domain Wall (DW) position can be
both energy efficient and achieve reasonably high testing accuracies compared
to Deep Neural Networks (DNNs) of similar sizes that use floating-point
synaptic weights. Specifically, voltage-controlled DW devices demonstrate
stochastic behavior as modeled rigorously with micromagnetic simulations and
can only encode a limited number of states; however, they can be extremely energy efficient
during both training and inference. We show that by implementing suitable
modifications to the learning algorithms, we can address the stochastic
behavior as well as mitigate the effect of their low resolution to achieve high
testing accuracies. In this study, we propose both in-situ and ex-situ training
algorithms, based on a modification of the algorithm proposed by Hubara et al.
[1], which works well with quantization of synaptic weights. We train several
5-layer DNNs on the MNIST dataset using 2-, 3- and 5-state DW devices as synapses.
For in-situ training, a separate high precision memory unit is adopted to
preserve and accumulate the weight gradients, which are then quantized to
program the low precision DW devices. Moreover, a sizeable noise tolerance
margin is used during the training to address the intrinsic programming noise.
For ex-situ training, a precursor DNN is first trained based on the
characterized DW device model and a noise tolerance margin, which is similar to
the in-situ training. Remarkably, for in-situ inference the energy dissipation
to program the devices is only 13 pJ per inference given that the training is
performed over the entire MNIST dataset for 10 epochs.
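The in-situ scheme described above can be sketched as follows. This is a rough illustration under my own assumptions: the level spacing, tolerance margin, and noise model are invented for the sketch, not taken from the paper. Gradients accumulate in a separate high-precision copy, and a low-precision DW device is reprogrammed to the nearest quantized state only when it drifts outside a noise-tolerance margin.

```python
import numpy as np

LEVELS = np.linspace(-1.0, 1.0, 5)   # nominal 5-state device conductances

def quantize(w):
    """Snap each high-precision weight to the nearest of the 5 states."""
    return LEVELS[np.abs(w[..., None] - LEVELS).argmin(axis=-1)]

def in_situ_step(w_hp, w_dev, grad, lr=0.1, margin=0.3, noise_std=0.02,
                 rng=None):
    """One update: accumulate the gradient in the high-precision memory
    (w_hp), then program the low-precision devices (w_dev) toward the
    quantized target, but only where they exceed the tolerance margin."""
    rng = rng or np.random.default_rng(0)
    w_hp = w_hp - lr * grad                      # high-precision accumulation
    target = quantize(w_hp)                      # desired device states
    reprogram = np.abs(w_dev - target) > margin  # outside the noise margin?
    # Programming is stochastic: a write lands near, not exactly on, target.
    w_dev = np.where(reprogram,
                     target + rng.normal(0.0, noise_std, w_dev.shape),
                     w_dev)
    return w_hp, w_dev
```

The margin plays the role of the noise-tolerance band described in the abstract: small drifts are ignored, so intrinsic programming noise does not trigger constant rewrites.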
Distributed Pruning Towards Tiny Neural Networks in Federated Learning
Neural network pruning is an essential technique for reducing the size and
complexity of deep neural networks, enabling large-scale models on devices with
limited resources. However, existing pruning approaches heavily rely on
training data for guiding the pruning strategies, making them ineffective for
federated learning over distributed and confidential datasets. Additionally,
the memory- and computation-intensive pruning process becomes infeasible for
resource-constrained devices in federated learning. To address these
challenges, we propose FedTiny, a distributed pruning framework for federated
learning that generates specialized tiny models for memory- and
computing-constrained devices. We introduce two key modules in FedTiny to
adaptively search coarse- and finer-pruned specialized models to fit deployment
scenarios with sparse and cheap local computation. First, an adaptive batch
normalization selection module is designed to mitigate biases in pruning caused
by the heterogeneity of local data. Second, a lightweight progressive pruning
module prunes the models at a finer granularity under strict memory and computational
budgets, allowing the pruning policy for each layer to be gradually determined
rather than evaluating the overall model structure. The experimental results
demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art
approaches, particularly when compressing deep models to extremely sparse tiny
models. FedTiny achieves an accuracy improvement of 2.61% while significantly
reducing the computational cost by 95.91% and the memory footprint by 94.01%
compared to state-of-the-art methods.
Comment: This paper has been accepted to ICDCS 202
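The progressive idea above, fixing each layer's sparsity in turn instead of scoring the whole model at once, can be illustrated with a toy magnitude-pruning sketch. The budget-splitting policy below is my own simplification for illustration, not FedTiny's actual module.

```python
import numpy as np

def progressive_prune(layers, budget):
    """Prune layer by layer so that at most `budget` weights survive in
    total, deciding each layer's keep-count in turn; earlier decisions
    shrink the budget left for later layers."""
    pruned, remaining = [], budget
    for i, w in enumerate(layers):
        # Give this layer an equal share of what is left among the
        # layers not yet decided.
        keep = min(w.size, remaining // (len(layers) - i))
        if keep == 0:
            w = np.zeros_like(w)
        elif keep < w.size:
            # Magnitude pruning: zero everything below the keep-th largest.
            thresh = np.sort(np.abs(w), axis=None)[-keep]
            w = np.where(np.abs(w) >= thresh, w, 0.0)
        pruned.append(w)
        remaining -= np.count_nonzero(w)
    return pruned
```

Deciding layers one at a time keeps the peak memory of the pruning step itself small, which is the property the abstract emphasizes for constrained devices.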
- …