Automatic differentiation in machine learning: a survey
Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in
machine learning. Automatic differentiation (AD), also called algorithmic
differentiation or simply "autodiff", is a family of techniques similar to but
more general than backpropagation for efficiently and accurately evaluating
derivatives of numeric functions expressed as computer programs. AD is a small
but established field with applications in areas including computational fluid
dynamics, atmospheric sciences, and engineering design optimization. Until very
recently, the fields of machine learning and AD have largely been unaware of
each other and, in some cases, have independently discovered each other's
results. Despite its relevance, general-purpose AD has been missing from the
machine learning toolbox, a situation slowly changing with its ongoing adoption
under the names "dynamic computational graphs" and "differentiable
programming". We survey the intersection of AD and machine learning, cover
applications where AD has direct relevance, and address the main implementation
techniques. By precisely defining the main differentiation techniques and their
interrelationships, we aim to bring clarity to the usage of the terms
"autodiff", "automatic differentiation", and "symbolic differentiation" as
these are encountered more and more in machine learning settings.
Comment: 43 pages, 5 figures
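As a minimal, self-contained illustration of the idea (a sketch, not code from the survey), forward-mode AD can be implemented with dual numbers: each value carries its derivative, and overloaded arithmetic propagates both through an ordinary numeric program:

```python
# Minimal forward-mode automatic differentiation with dual numbers (illustrative).

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def f(x):
    # An ordinary numeric program: f(x) = x*x + 3*x
    return x * x + 3 * x

x = Dual(2.0, 1.0)   # seed the derivative dx/dx = 1
y = f(x)
print(y.val, y.dot)  # 10.0 7.0, since f(2) = 10 and f'(2) = 2*2 + 3 = 7
```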
Neuron-level fuzzy memoization in RNNs
The final publication is available at ACM via http://dx.doi.org/10.1145/3352460.3358309
Recurrent Neural Networks (RNNs) are a key technology for applications such as automatic speech recognition or machine translation. Unlike conventional feed-forward DNNs, RNNs remember past information to improve the accuracy of future predictions and are therefore very effective for sequence-processing problems.
For each application run, each recurrent layer is executed many times to process a potentially large sequence of inputs (words, images, audio frames, etc.). In this paper, we make the observation that a neuron's output exhibits only small changes across consecutive invocations. We exploit this property to build a neuron-level fuzzy memoization scheme, which dynamically caches each neuron's output and reuses it whenever the current output is predicted to be similar to a previously computed result, thereby avoiding the output computation.
The main challenge in this scheme is determining whether a neuron's output for the current input in the sequence will be similar to a recently computed result. To this end, we extend the recurrent layer with a much simpler Bitwise Neural Network (BNN) and show that the BNN and RNN outputs are highly correlated: if two BNN outputs are very similar, the corresponding outputs of the original RNN layer are likely to exhibit negligible changes. The BNN provides a low-cost and effective mechanism for deciding when fuzzy memoization can be applied with a small impact on accuracy.
We evaluate our memoization scheme on top of a state-of-the-art accelerator for RNNs, for a variety of neural networks from multiple application domains. We show that our technique avoids more than 24.2% of computations, resulting in 18.5% energy savings and a 1.35x speedup on average.
Peer Reviewed. Postprint (author's final draft).
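A toy software sketch of the reuse decision described above; the sign-based fingerprint, the Hamming-distance THRESHOLD, and the tanh neuron are illustrative stand-ins, not the paper's BNN predictor or accelerator mechanism:

```python
import numpy as np

# Toy neuron-level fuzzy memoization: reuse a cached output when a cheap binary
# "fingerprint" of the computation barely changes between consecutive inputs.

THRESHOLD = 2  # max Hamming distance of fingerprints for which we allow reuse

def binary_fingerprint(w, x):
    # Cheap surrogate of the dot product using only the signs of the operands.
    return (np.sign(w) * np.sign(x)) > 0  # boolean vector

class MemoizedNeuron:
    def __init__(self, w):
        self.w = w
        self.cached_fp = None
        self.cached_out = None

    def forward(self, x):
        fp = binary_fingerprint(self.w, x)
        if self.cached_fp is not None and np.sum(fp != self.cached_fp) <= THRESHOLD:
            return self.cached_out             # fuzzy hit: skip the dot product
        out = np.tanh(self.w @ x)              # full computation on a miss
        self.cached_fp, self.cached_out = fp, out
        return out

neuron = MemoizedNeuron(np.random.randn(64))
x = np.random.randn(64)
print(neuron.forward(x), neuron.forward(x + 1e-3))  # second call likely reuses the cache
```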
DRAM-Based Neural Network Accelerator Architecture for Binary Neural Networks
Thesis (Ph.D.) -- Graduate School of Seoul National University, Department of Computer Science and Engineering, College of Engineering, February 2021.
In convolutional neural network applications, most of the computation consists of the multiplications and accumulations of the convolution and fully-connected layers. From the hardware perspective (i.e., in the gate-level circuits), these operations are performed as many dot products between feature-map and kernel vectors. Since the feature maps and kernels are matrices, the vectors converted from the 3D or 4D tensors are reused many times across the matrix multiplications. As DNN throughput increases, the power consumption and the performance bottleneck caused by data movement become more critical. More importantly, off-chip memory accesses dominate total power, since an off-chip access consumes several hundred times more power than the computation itself. Accelerator throughput is on the order of several hundred GOPS to several TOPS, while memory bandwidth is at most 25.6 or 34 GB/s (with DDR4 or LPDDR4, respectively).
Both the data-movement power and the performance bottleneck are improved by reducing the network size and/or the amount of data moved. Among algorithmic approaches, quantization is widely used; Binary Neural Networks (BNNs) reduce precision drastically, down to 1 bit. Their accuracy is much lower than that of FP16 networks, but it is continuously improving through ongoing research. On the dataflow side, redundant data movement can be reduced by increasing data reuse. Both methods are widely applied in accelerators because they require no additional computation during inference.
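For context on why 1-bit precision is attractive in hardware (a generic sketch, not code or a design from the dissertation), a binarized dot product over values in {-1, +1} reduces to XNOR plus popcount on bit-packed operands:

```python
# Sketch of a binarized (1-bit) dot product: {-1, +1} values are packed as bits
# (1 encodes +1, 0 encodes -1), and multiply-accumulate becomes XNOR + popcount.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors packed LSB-first into ints."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # bit is 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n                      # agreements minus disagreements

# a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  ->  1 - 1 - 1 + 1 = 0
a_bits = 0b1101   # LSB-first packing of a
b_bits = 0b1011   # LSB-first packing of b
print(binary_dot(a_bits, b_bits, 4))  # 0
```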
In this dissertation, I present 1) a DRAM-based accelerator architecture and 2) a DRAM refresh method that mitigates the performance loss caused by refresh. The two methods are orthogonal: both can be integrated into a DRAM chip and operate independently.
First, we propose a DRAM-based accelerator architecture capable of massive, large-vector dot-product operations. For CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that use the memory cell array for vector dot products are being actively studied. Since DRAM stores all of the neural network data, it is well placed to reduce the amount of data transferred. The proposed architecture operates by exploiting the basic operations of DRAM.
The second method reduces the performance degradation and power consumption caused by DRAM refresh. Since DRAM cannot read or write data while performing a periodic refresh, system performance decreases. The proposed refresh method tests the retention characteristics inside the DRAM chip during self-refresh and extends the refresh period according to those characteristics. Since it operates independently inside the DRAM, it can be applied to any system that uses DRAM, including deep neural network accelerators.
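As a rough back-of-the-envelope illustration of why extending the refresh period helps (using typical DDR4 timing values, tRFC and tREFI, as assumptions rather than figures from the dissertation), the fraction of time a DRAM device is busy refreshing is approximately tRFC / tREFI:

```python
# Rough DRAM refresh overhead estimate: fraction of time the device is busy
# refreshing ~= tRFC / tREFI. The timing values are typical DDR4 (8 Gb) numbers
# used for illustration only, not measurements from the dissertation.

tRFC  = 350e-9   # duration of one refresh command, ~350 ns
tREFI = 7.8e-6   # average interval between refresh commands, 64 ms / 8192

def refresh_overhead(trfc, trefi):
    return trfc / trefi

base    = refresh_overhead(tRFC, tREFI)       # ~4.5% of time unavailable
relaxed = refresh_overhead(tRFC, 4 * tREFI)   # e.g., a 4x longer, retention-aware period
print(f"baseline: {base:.1%}, with 4x longer refresh period: {relaxed:.1%}")
```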
We also surveyed system integration with a software stack so that the in-DRAM accelerator can be used from a deep learning framework. As a result, we expect the in-DRAM accelerator to be controllable with the memory-controller implementation verified in the previous experiment. In addition, we added a performance-simulation capability for the in-DRAM accelerator to PyTorch: when a neural network is run in PyTorch, it reports the computation latency and data-movement latency of the layers executed on the in-DRAM accelerator. Being able to predict hardware performance while co-designing the network is a significant advantage.
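A minimal sketch of how per-layer latency reporting can be attached to a PyTorch model with forward hooks, in the spirit of the simulation described above; the bandwidth, throughput, and operation counts below are made-up placeholders, not the dissertation's cost model:

```python
import torch
import torch.nn as nn

# Report a rough per-layer compute and data-movement latency estimate via
# forward hooks. The constants and the cost model are illustrative assumptions.
BYTES_PER_SEC = 25.6e9   # assumed off-chip memory bandwidth
OPS_PER_SEC   = 1e12     # assumed accelerator throughput

def report_latency(module, inputs, output):
    if not isinstance(module, (nn.Conv2d, nn.Linear)):
        return
    n_bytes = sum(t.numel() * t.element_size() for t in inputs) \
              + output.numel() * output.element_size()
    n_ops = 2 * sum(p.numel() for p in module.parameters())  # crude proxy; ignores conv reuse
    print(f"{module.__class__.__name__}: "
          f"compute ~{n_ops / OPS_PER_SEC * 1e6:.3f} us, "
          f"data movement ~{n_bytes / BYTES_PER_SEC * 1e6:.3f} us")

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(16 * 30 * 30, 10))
for m in model.modules():
    m.register_forward_hook(report_latency)

model(torch.randn(1, 3, 32, 32))  # prints one line per Conv2d/Linear layer
```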
Korean abstract: In convolutional neural network (CNN) applications, most of the computation consists of the multiplications and accumulations that occur in the convolution and fully-connected layers. At the gate-logic level, these are executed as massive numbers of vector dot products that repeatedly reuse the input and kernel vectors. Deep neural network computation is better suited to large numbers of small, simple arithmetic units than to general-purpose ones. Once an accelerator's throughput rises above a certain level, its performance is limited by the data transfers needed for the computation; moving data off-chip from memory consumes several hundred times more energy than the arithmetic itself, and while compute units can deliver hundreds of giga- to several tera-operations per second, memory transfers only a few tens of gigabytes per second.
A way to address both the power and the performance problems caused by data movement is to reduce the size of the data being transferred. Among algorithmic approaches, quantizing the network data to a lower precision is widely used; binary neural networks (BNNs) push precision down to a single bit. Although their accuracy is lower than that of 16-bit networks, it is continuously being improved by ongoing research. Architecturally, repeated transfers of the same data can be reduced by reusing the transferred data. Both methods are widely adopted in accelerators because they can be applied without extra computation during inference.
In this dissertation, I propose a DRAM-based accelerator architecture and a technique that mitigates the performance loss caused by DRAM refresh. The two methods can be integrated into a single DRAM chip and operate independently.
The first is a study of a DRAM-based accelerator capable of massive vector dot-product operations. For CNN accelerators to which BNNs can be applied, computing-in-memory (CIM) structures that use the memory cell array for vector dot products are being actively studied. DRAM is particularly attractive because it already holds all of the neural network data, which helps reduce the amount of data transferred. We propose a method that computes using the basic operations of DRAM, without changing the DRAM cell-array structure.
The second is a method that extends the DRAM refresh period to reduce the resulting performance degradation and power consumption. Whenever DRAM performs a refresh, data cannot be read or written, so the performance of the system or the accelerator drops. We propose testing the retention characteristics inside the DRAM chip and extending the refresh period accordingly. Because it operates independently inside the DRAM, it can be applied to any system that uses DRAM, including deep neural network accelerators.
In addition, we surveyed system-integration methods, including the software stack, so that the proposed accelerator can be used easily from widely used deep learning frameworks such as PyTorch. We expect that the in-DRAM accelerator can be controlled by adding the memory controller verified in the DRAM refresh experiments and a custom compiler to the existing TVM compiler and the FPGA-based TVM/VTA accelerator. Furthermore, a simulation capability was added to PyTorch so that the performance of the in-DRAM accelerator and the neural network can be predicted at design time: when a network is run in PyTorch, the computation latency and data-movement latency of the layers executed on the in-DRAM accelerator are reported.
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Background
2.1 Neural Network Operation
2.2 Data Movement Overhead
2.3 Binary Neural Networks
2.4 Computing-in-Memory
2.5 Memory Bottleneck due to Refresh
Chapter 3 In-DRAM Neural Network Accelerator
3.1 Backgrounds
3.1.1 DRAM Hierarchy
3.1.2 DRAM Basic Operation
3.1.3 DRAM Commands with Timing Parameters
3.1.4 Bit-wise Operation in DRAM
3.2 Motivations
3.3 Proposed Architecture
3.3.1 Operation Examples of Row Operator
3.3.2 Convolutions on DRAM Chip
3.4 Data Flow
3.4.1 Input Broadcasting in DRAM
3.4.2 Input Data Movement With M2V
3.4.3 Internal Data Movement With SiD
3.4.4 Data Partitioning for Parallel Operation
3.5 Experiments
3.5.1 Performance Estimation
3.5.2 Configuration of In-DRAM Accelerator
3.5.3 Improving the Accuracy of BNN
3.5.4 Comparison with the Existing Works
3.6 Discussion
3.6.1 Performance Comparison with ASIC Accelerators
3.6.2 Challenges of The Proposed Architecture
3.7 Conclusion
Chapter 4 Reducing DRAM Refresh Power Consumption by Runtime Profiling of Retention Time and Dual-row Activation
4.1 Introduction
4.2 Background
4.3 Related Works
4.4 Observations
4.5 Solution Overview
4.6 Runtime Profiling
4.6.1 Basic Operation
4.6.2 Profiling Multiple Rows in Parallel
4.6.3 Temperature, Data Backup and Error Check
4.7 Dual-row Activation
4.8 Experiments
4.8.1 Experimental Setup
4.8.2 Refresh Period Improvement
4.8.3 Power Reduction
4.9 Conclusion
Chapter 5 System Integration
5.1 Integrate The Proposed Methods
5.2 Software Stack
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
DyCL: Dynamic Neural Network Compilation Via Program Rewriting and Graph Optimization
A DL compiler's primary function is to translate DNN programs written in
high-level DL frameworks such as PyTorch and TensorFlow into portable
executables. These executables can then be flexibly executed by the deployed
host programs. However, existing DL compilers rely on a tracing mechanism,
which involves feeding a runtime input to a neural network program and tracing
the program execution paths to generate the computational graph necessary for
compilation. Unfortunately, this mechanism falls short when dealing with modern
dynamic neural networks (DyNNs) that possess varying computational graphs
depending on the inputs. Consequently, conventional DL compilers struggle to
accurately compile DyNNs into executable code. To address this limitation, we
propose DyCL, a general approach that enables any existing DL compiler to successfully compile DyNNs. DyCL tackles the dynamic nature of DyNNs by introducing a compilation mechanism that redistributes the control and data flow of the original DNN programs during the compilation process. Specifically, DyCL develops program analysis and program transformation techniques to convert a dynamic neural network into multiple sub-neural networks. Each sub-neural network is devoid of conditional statements and is compiled independently. Furthermore, DyCL synthesizes a host module that models the control flow of the DyNNs and facilitates the invocation of the sub-neural networks. Our evaluation demonstrates the effectiveness of DyCL, achieving a 100% success rate in compiling all dynamic neural networks. Moreover, the compiled executables generated by DyCL exhibit significantly improved performance, running faster than the original DyNNs executed on general-purpose DL frameworks.
Comment: This paper has been accepted to ISSTA 202
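To illustrate the underlying problem and the flavor of the decomposition (a simplified mock-up, not DyCL's actual implementation; torch.jit.trace stands in for a DL compiler), a model with input-dependent control flow can be split into branch-free sub-networks plus a small host function that reproduces the control flow:

```python
import torch
import torch.nn as nn

# A toy dynamic neural network: which branch runs depends on the input,
# so a single traced computational graph cannot capture both paths.
class DyNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.small = nn.Linear(8, 4)
        self.large = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

    def forward(self, x):
        if x.abs().mean() > 1.0:   # input-dependent control flow
            return self.large(x)
        return self.small(x)

model = DyNN()
example = torch.randn(1, 8)

# Mock-up of the decomposition: each branch becomes a branch-free sub-network
# compiled on its own, and a host function re-creates the control flow.
sub_small = torch.jit.trace(model.small, example)
sub_large = torch.jit.trace(model.large, example)

def host(x):
    # Host module: chooses which compiled sub-network to invoke.
    return sub_large(x) if x.abs().mean() > 1.0 else sub_small(x)

print(host(torch.randn(1, 8)).shape)      # torch.Size([1, 4])
print(host(3 * torch.randn(1, 8)).shape)  # torch.Size([1, 4]), possibly the other branch
```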