9 research outputs found
MAPLE: Microprocessor A Priori for Latency Estimation
Modern deep neural networks must demonstrate state-of-the-art accuracy while
exhibiting low latency and energy consumption. As such, neural architecture
search (NAS) algorithms take these two constraints into account when generating
a new architecture. However, efficiency metrics such as latency are typically
hardware dependent, requiring the NAS algorithm to either measure or predict
each architecture's latency. Measuring the latency of every evaluated
architecture adds a significant amount of time to the NAS process. Here we
propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not
rely on transfer learning or domain adaptation but instead generalizes to new
hardware by incorporating prior hardware characteristics during training. MAPLE takes
advantage of a novel quantitative strategy to characterize the underlying
microprocessor by measuring relevant hardware performance metrics, yielding a
fine-grained and expressive hardware descriptor. Moreover, MAPLE exploits the
tightly coupled I/O between the CPU and GPU and their dependency: it predicts
DNN latency on the GPU while measuring hardware performance counters on the CPU
that feeds that GPU. Using this quantitative strategy as the hardware
descriptor, MAPLE can generalize to new hardware via a few-shot adaptation
strategy: with as few as 3 samples it exhibits a 6% improvement over
state-of-the-art methods that require as many as 10 samples. Experimental
results show that increasing the number of few-shot adaptation samples to 10
improves accuracy significantly, by 12% over the state-of-the-art methods.
Furthermore, MAPLE exhibits 8-10% better accuracy, on average, than relevant
baselines at any number of adaptation samples.
Comment: 13 pages, 4 figures
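The few-shot idea above can be pictured with a minimal, self-contained sketch (not the authors' code): architecture features are concatenated with a vector of CPU performance counters that describes the device, a regressor is trained on known devices, and a handful of measurements adapts it to a new device. All names, dimensions, and data below are synthetic placeholders.

```python
# Illustrative sketch of hardware-descriptor-based latency prediction in the
# spirit of MAPLE (not the authors' implementation). All data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def make_features(arch_encoding, hw_counters):
    # Concatenate an architecture encoding with a vector of CPU
    # performance-counter measurements that characterizes the device.
    return np.concatenate([arch_encoding, hw_counters])

# Synthetic training pool: architectures measured on several known devices.
archs = rng.random((200, 16))          # 16-dim architecture encodings
devices = rng.random((4, 8))           # 8 hardware counters per known device
X, y = [], []
for dev in devices:
    for a in archs:
        X.append(make_features(a, dev))
        y.append(a.sum() * (1.0 + dev.mean()))   # stand-in "latency"
model = GradientBoostingRegressor().fit(np.array(X), np.array(y))

# Few-shot adaptation: only 3 measurements on an unseen device, then refit
# with those samples included.
new_dev = rng.random(8)
few_archs = rng.random((3, 16))
X_new = [make_features(a, new_dev) for a in few_archs]
y_new = [a.sum() * (1.0 + new_dev.mean()) for a in few_archs]
model.fit(np.array(X + X_new), np.array(y + y_new))

query = rng.random(16)
print(model.predict(make_features(query, new_dev)[None, :]))
```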
Weight-dependent Gates for Differentiable Neural Network Pruning
In this paper, we propose a simple and effective network pruning framework,
which introduces novel weight-dependent gates to prune filters adaptively. We
argue that the pruning decision should depend on the convolutional weights, in
other words, it should be a learnable function of filter weights. We thus
construct the weight-dependent gates (W-Gates) to learn the information from
filter weights and obtain binary filter gates to prune or keep the filters
automatically. To prune the network under hardware constraints, we train a
Latency Predict Net (LPNet) to estimate the hardware latency of candidate
pruned networks. Based on the proposed LPNet, we can optimize W-Gates and the
pruning ratio of each layer under a latency constraint. The whole framework is
differentiable and can be optimized by gradient-based methods to achieve a
compact network with a better trade-off between accuracy and efficiency. We have
demonstrated the effectiveness of our method on ResNet34, ResNet50 and
MobileNet V2, achieving up to 1.33/1.28/1.1 higher Top-1 accuracy with lower
hardware latency on ImageNet. Compared with state-of-the-art pruning methods,
our method achieves superior performance.
Comment: ECCV workshop
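The gating mechanism can be illustrated with a minimal PyTorch-style sketch (not the authors' implementation): a tiny fully connected layer scores each filter from its own weights, a straight-through estimator binarizes the scores into keep/prune gates, and a toy scalar stands in for the LPNet latency term. Class and variable names are illustrative.

```python
# Illustrative sketch of weight-dependent gates with a straight-through
# estimator; the latency term is a toy stand-in for an LPNet prediction.
import torch
import torch.nn as nn

class BinarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores):
        return (scores > 0).float()        # hard 0/1 gate per filter
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                    # straight-through gradient

class GatedConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2)
        self.gate_fc = nn.Linear(c_in * k * k, 1)   # score from filter weights

    def forward(self, x):
        w = self.conv.weight.flatten(1)                       # [c_out, c_in*k*k]
        gates = BinarySTE.apply(self.gate_fc(w).squeeze(1))   # [c_out]
        return self.conv(x) * gates.view(1, -1, 1, 1), gates

conv = GatedConv(16, 32)
x = torch.randn(2, 16, 8, 8)
y, gates = conv(x)
task_loss = y.mean()                 # placeholder for the task (CE) loss
latency_proxy = gates.sum()          # stand-in for a differentiable latency
(task_loss + 1e-3 * latency_proxy).backward()
```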
Hardware-Friendly Neural Network Architecture and Accelerator Design for Efficient Inference
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020.
Research on deep learning, currently the most prominent machine learning approach, is being actively pursued on both the hardware and the software side. On the software side, optimizations such as mobile-oriented neural network architecture design and compression of trained models are studied to enable efficient inference while maintaining high accuracy; on the hardware side, accelerators are designed to deliver fast, energy-efficient inference for already-trained models. Going beyond these existing optimization and design methods, this dissertation aims to build a more efficient inference system by applying new hardware design techniques and model transformation methods.
First, a more efficient deep learning accelerator is designed by adopting stochastic computing, a circuit design approach based on probabilistic arithmetic that can implement the same arithmetic with far fewer transistors than a conventional binary circuit. In particular, multiplication, the most frequent operation in deep learning, requires an array multiplier in a binary circuit but can be implemented with a single AND gate in stochastic computing. Prior stochastic-computing accelerators, however, showed recognition accuracy well below that of binary circuits. To address this, the accelerator in this dissertation uses unipolar encoding to improve arithmetic accuracy, and the overhead of the stochastic number generators is reduced by letting several neurons share one generator.
Second, to obtain a larger inference speed-up, a method that transforms the neural network architecture is proposed instead of compressing a trained model. Prior results show that applying model compression to recent architectures yields high compression ratios for the weight parameters but only marginal gains in actual inference speed. The limited speed-up stems from structural limitations of the architecture itself, so changing the architecture is the most fundamental remedy. Based on this observation, this dissertation proposes an architecture transformation method that achieves a larger speed-up than prior work.
Finally, a neural architecture search method is proposed that widens the search space so that each layer can have a different structure while remaining trainable. Prior architecture search explores the structure of a cell, the basic search unit, and builds one large network by replicating the found cell; because a single cell structure is reused everywhere, position-dependent information such as the input feature map size and the weight parameter size is ignored. This dissertation proposes a method that resolves these issues while training stably, and additionally devises a penalty that constrains not only the computation amount but also the number of memory accesses, helping the search find more efficient architectures.
Deep learning is the most promising machine learning algorithm, and it is already used in real life: the latest smartphones use neural networks for better photography and voice recognition. However, as the performance of neural networks has improved, their hardware cost has increased dramatically. Until the past few years, much research focused on only a single side, hardware or software, so the actual cost was hardly improved. Therefore, hardware and software co-optimization is needed to achieve further improvement. For this reason, this dissertation proposes an efficient inference system that considers the hardware accelerator together with the network architecture design.
The first part of the dissertation is a deep neural network accelerator with stochastic computing. The main goal is an efficient stochastic-computing hardware design for convolutional neural networks. It includes a stochastic ReLU and an optimized max function, which are key components of convolutional neural networks. To avoid the range limitation problem of stochastic numbers and to increase the signal-to-noise ratio, we perform weight normalization and upscaling. In addition, to reduce the overhead of binary-to-stochastic conversion, we propose a scheme for sharing stochastic number generators among the neurons of the convolutional neural network.
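The transistor saving comes from how unipolar stochastic computing represents a value p in [0, 1] as a bitstream whose fraction of ones is p, so multiplication reduces to a bitwise AND. A minimal software emulation of that arithmetic (an illustration only, not the dissertation's hardware):

```python
# Software emulation of unipolar stochastic-computing multiplication:
# AND-ing two independent bitstreams estimates the product of their values.
import numpy as np

rng = np.random.default_rng(0)

def to_bitstream(p, length):
    # Stochastic number generator: each bit is 1 with probability p.
    return (rng.random(length) < p).astype(np.uint8)

def sc_multiply(a, b, length=4096):
    # Streams must be uncorrelated; fully correlated streams would compute
    # min(a, b) instead, which is why sharing SNGs in hardware needs care.
    return (to_bitstream(a, length) & to_bitstream(b, length)).mean()

print(sc_multiply(0.8, 0.5))   # ~0.4, up to sampling noise
```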
The second part of the dissertation is neural architecture transformation. Network recasting is proposed to enable this transformation. The primary goal of the method is to accelerate inference through the transformation, but there are many other practical applications. The method is based on block-wise recasting: it recasts each source block in a pre-trained teacher network into a target block in a student network. For the recasting, the target block is trained such that its output activations approximate those of the source block. Recasting block by block in this sequential manner transforms the network architecture while preserving accuracy. The method can transform an arbitrary teacher network type into an arbitrary student network type, and it can even generate a mixed-architecture network consisting of two or more block types. Network recasting can produce a network with fewer parameters and/or activations, which reduces inference time significantly. Naturally, it can also be used for network compression by recasting a trained network into a smaller network of the same type.
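The block-wise recasting step can be pictured with a minimal PyTorch-style sketch (not the dissertation's code); the teacher and student blocks and the random inputs below are placeholders for real blocks and real activations.

```python
# Illustrative sketch of recasting one block: a student block is trained so
# that its output activations match the teacher block's output activations.
import torch
import torch.nn as nn

teacher_block = nn.Sequential(              # e.g. a larger source block
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
).eval()

student_block = nn.Sequential(              # smaller target block
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
)

opt = torch.optim.Adam(student_block.parameters(), lr=1e-3)
for _ in range(100):                        # would iterate over real data
    x = torch.randn(8, 32, 16, 16)          # block input activations
    with torch.no_grad():
        target = teacher_block(x)           # source block's output activations
    loss = nn.functional.mse_loss(student_block(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
# Repeating this block by block, then fine-tuning end to end, transforms the
# architecture while approximately preserving the teacher's behavior.
```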
The third part of the dissertation is a fine-grained neural architecture search. InheritedNAS is a fine-grained architecture search method that uses the coarse-grained architecture found by cell-based architecture search. A fine-grained architecture has a very large search space, so it is hard to search directly. A stage-independent search is therefore proposed: it divides the entire network into several stages and trains each stage independently, and a two-point matching distillation method is proposed to break the dependency between stages. Operation pruning is then applied to remove unimportant operations, using block-wise rather than node-wise pruning. In addition, a hardware-aware latency penalty is proposed that covers not only FLOPs but also memory accesses.
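A minimal sketch of a hardware-aware penalty in this spirit (not the dissertation's formulation), assuming a differentiable search where each candidate operation of a layer has a softmax-weighted architecture parameter and pre-profiled FLOP and memory-access costs; all numbers below are placeholders.

```python
# Illustrative penalty that charges a candidate architecture for both FLOPs
# and memory accesses, differentiable in the architecture parameters.
import torch

# Per-candidate-operation costs for one searchable layer (placeholder values,
# e.g. conv3x3, conv5x5, conv7x7, skip).
flops = torch.tensor([90.0, 150.0, 310.0, 0.0])
mem_accesses = torch.tensor([40.0, 60.0, 120.0, 5.0])

alpha = torch.zeros(4, requires_grad=True)     # architecture parameters

def hw_penalty(alpha, lam_flops=1e-3, lam_mem=1e-3):
    p = torch.softmax(alpha, dim=0)            # expected operation selection
    expected_flops = (p * flops).sum()
    expected_mem = (p * mem_accesses).sum()
    return lam_flops * expected_flops + lam_mem * expected_mem

task_loss = torch.tensor(1.0)                  # placeholder for the CE loss
loss = task_loss + hw_penalty(alpha)
loss.backward()                                # gradient steers alpha toward cheaper ops
```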
1 Introduction 1
1.1 DNN Accelerator with Stochastic Computing 2
1.2 Neural Architecture Transformation 4
1.3 Fine-Grained Neural Architecture Search 6
2 Background 8
2.1 Stochastic Computing 8
2.2 Neural Network 10
2.2.1 Network Compression 10
2.2.2 Neural Network Accelerator 13
2.3 Knowledge Distillation 17
2.4 Neural Architecture Search 19
3 DNN Accelerator with Stochastic Computing 23
3.1 Motivation 23
3.1.1 Multiplication Error on Stochastic Computing 23
3.1.2 DNN with Stochastic Computing 24
3.2 Unipolar SC Hardware for CNN 25
3.2.1 Overall Hardware Design 25
3.2.2 Stochastic ReLU function 27
3.2.3 Stochastic Max function 30
3.2.4 Efficient Average Function 36
3.3 Weight Modulation for SC Hardware 38
3.3.1 Weight Normalization for SC 38
3.3.2 Weight Upscaling for Output Layer 43
3.4 Early Decision Termination 44
3.5 Stochastic Number Generator Sharing 49
3.6 Experiments 53
3.6.1 Accuracy of CNN using Unipolar SC 53
3.6.2 Synthesis Result 57
3.7 Summary 58
4 Neural Architecture Transformation 59
4.1 Motivation 59
4.2 Network Recasting 61
4.2.1 Recasting from DenseNet to ResNet and ConvNet 63
4.2.2 Recasting from ResNet to ConvNet 63
4.2.3 Compression 63
4.2.4 Block Training 65
4.2.5 Sequential Recasting and Fine-tuning 67
4.3 Experiments 69
4.3.1 Visualization of Filter Reduction 70
4.3.2 CIFAR 71
4.3.3 ILSVRC2012 73
4.4 Summary 76
5 Fine-Grained Neural Architecture Search 77
5.1 Motivation 77
5.1.1 Search Space Reduction Versus Diversity 77
5.1.2 Hardware-Aware Optimization 78
5.2 InheritedNAS 79
5.2.1 Stage Independent Search 79
5.2.2 Operation Pruning 82
5.2.3 Entire Search Procedure 83
5.3 Hardware-aware Penalty Design 85
5.4 Experiments 87
5.4.1 Fine-Grained Architecture Search 88
5.4.2 Penalty Analysis 90
5.5 Summary 92
6 Conclusion 93
Abstract (In Korean) 113