9 research outputs found
MAPLE: Microprocessor A Priori for Latency Estimation
Modern deep neural networks must demonstrate state-of-the-art accuracy while
exhibiting low latency and energy consumption. As such, neural architecture
search (NAS) algorithms take these two constraints into account when generating
a new architecture. However, efficiency metrics such as latency are typically
hardware dependent, requiring the NAS algorithm to either measure or predict
each architecture's latency. Measuring the latency of every evaluated
architecture adds a significant amount of time to the NAS process. Here we
propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not
rely on transfer learning or domain adaptation but instead generalizes to new
hardware by incorporating prior hardware characteristics during training. MAPLE takes
advantage of a novel quantitative strategy to characterize the underlying
microprocessor by measuring relevant hardware performance metrics, yielding a
fine-grained and expressive hardware descriptor. Moreover, MAPLE exploits the
tightly coupled I/O between the CPU and GPU and their dependency: it predicts
DNN latency on the GPU while measuring hardware performance counters on the CPU
that feeds that GPU. Using this quantitative strategy as the hardware
descriptor, MAPLE can generalize to new hardware via a few-shot adaptation
strategy: with as few as 3 samples it exhibits a 6% improvement over
state-of-the-art methods that require as many as 10 samples. Experimental
results show that increasing the number of few-shot adaptation samples to 10
improves accuracy significantly, by 12% over the state-of-the-art methods.
Furthermore, MAPLE exhibits 8-10% better accuracy, on average, than relevant
baselines at any number of adaptation samples.
Comment: 13 pages, 4 figures
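The few-shot idea above can be pictured with a minimal, self-contained sketch (not the authors' code): architecture features are concatenated with a vector of CPU performance counters that describes the device, a regressor is trained on known devices, and a handful of measurements adapts it to a new device. All names, dimensions, and data below are synthetic placeholders.

```python
# Illustrative sketch of hardware-descriptor-based latency prediction in the
# spirit of MAPLE (not the authors' implementation). All data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def make_features(arch_encoding, hw_counters):
    # Concatenate an architecture encoding with a vector of CPU
    # performance-counter measurements that characterizes the device.
    return np.concatenate([arch_encoding, hw_counters])

# Synthetic training pool: architectures measured on several known devices.
archs = rng.random((200, 16))          # 16-dim architecture encodings
devices = rng.random((4, 8))           # 8 hardware counters per known device
X, y = [], []
for dev in devices:
    for a in archs:
        X.append(make_features(a, dev))
        y.append(a.sum() * (1.0 + dev.mean()))   # stand-in "latency"
model = GradientBoostingRegressor().fit(np.array(X), np.array(y))

# Few-shot adaptation: only 3 measurements on an unseen device, then refit
# with those samples included.
new_dev = rng.random(8)
few_archs = rng.random((3, 16))
X_new = [make_features(a, new_dev) for a in few_archs]
y_new = [a.sum() * (1.0 + new_dev.mean()) for a in few_archs]
model.fit(np.array(X + X_new), np.array(y + y_new))

query = rng.random(16)
print(model.predict(make_features(query, new_dev)[None, :]))
```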
Weight-dependent Gates for Differentiable Neural Network Pruning
In this paper, we propose a simple and effective network pruning framework,
which introduces novel weight-dependent gates to prune filters adaptively. We
argue that the pruning decision should depend on the convolutional weights, in
other words, it should be a learnable function of filter weights. We thus
construct the weight-dependent gates (W-Gates) to learn the information from
filter weights and obtain binary filter gates to prune or keep the filters
automatically. To prune the network under hardware constraints, we train a
Latency Predict Net (LPNet) to estimate the hardware latency of candidate
pruned networks. Based on the proposed LPNet, we can optimize W-Gates and the
pruning ratio of each layer under a latency constraint. The whole framework is
differentiable and can be optimized by gradient-based methods to achieve a
compact network with a better trade-off between accuracy and efficiency. We have
demonstrated the effectiveness of our method on ResNet34, ResNet50 and
MobileNet V2, achieving up to 1.33/1.28/1.1 higher Top-1 accuracy with lower
hardware latency on ImageNet. Compared with state-of-the-art pruning methods,
our method achieves superior performance.
Comment: ECCV workshop
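The gating mechanism can be illustrated with a minimal PyTorch-style sketch (not the authors' implementation): a tiny fully connected layer scores each filter from its own weights, a straight-through estimator binarizes the scores into keep/prune gates, and a toy scalar stands in for the LPNet latency term. Class and variable names are illustrative.

```python
# Illustrative sketch of weight-dependent gates with a straight-through
# estimator; the latency term is a toy stand-in for an LPNet prediction.
import torch
import torch.nn as nn

class BinarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores):
        return (scores > 0).float()        # hard 0/1 gate per filter
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                    # straight-through gradient

class GatedConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2)
        self.gate_fc = nn.Linear(c_in * k * k, 1)   # score from filter weights

    def forward(self, x):
        w = self.conv.weight.flatten(1)                       # [c_out, c_in*k*k]
        gates = BinarySTE.apply(self.gate_fc(w).squeeze(1))   # [c_out]
        return self.conv(x) * gates.view(1, -1, 1, 1), gates

conv = GatedConv(16, 32)
x = torch.randn(2, 16, 8, 8)
y, gates = conv(x)
task_loss = y.mean()                 # placeholder for the task (CE) loss
latency_proxy = gates.sum()          # stand-in for a differentiable latency
(task_loss + 1e-3 * latency_proxy).backward()
```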
Hardware-Friendly Neural Network Architecture and Accelerator Design for Efficient Inference
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020.
Research on deep learning, currently the most prominent machine learning approach, is being actively pursued on both the hardware and the software side. On the software side, optimizations such as mobile-oriented neural network architecture design and compression of trained models are studied to enable efficient inference while maintaining high accuracy; on the hardware side, accelerators are designed to deliver fast, energy-efficient inference for already-trained models. Going beyond these existing optimization and design methods, this dissertation aims to build a more efficient inference system by applying new hardware design techniques and model transformation methods.
First, a more efficient deep learning accelerator is designed by adopting stochastic computing, a circuit design approach based on probabilistic arithmetic that can implement the same arithmetic with far fewer transistors than a conventional binary circuit. In particular, multiplication, the most frequent operation in deep learning, requires an array multiplier in a binary circuit but can be implemented with a single AND gate in stochastic computing. Prior stochastic-computing accelerators, however, showed recognition accuracy well below that of binary circuits. To address this, the accelerator in this dissertation uses unipolar encoding to improve arithmetic accuracy, and the overhead of the stochastic number generators is reduced by letting several neurons share one generator.
Second, to obtain a larger inference speed-up, a method that transforms the neural network architecture is proposed instead of compressing a trained model. Prior results show that applying model compression to recent architectures yields high compression ratios for the weight parameters but only marginal gains in actual inference speed. The limited speed-up stems from structural limitations of the architecture itself, so changing the architecture is the most fundamental remedy. Based on this observation, this dissertation proposes an architecture transformation method that achieves a larger speed-up than prior work.
Finally, a neural architecture search method is proposed that widens the search space so that each layer can have a different structure while remaining trainable. Prior architecture search explores the structure of a cell, the basic search unit, and builds one large network by replicating the found cell; because a single cell structure is reused everywhere, position-dependent information such as the input feature map size and the weight parameter size is ignored. This dissertation proposes a method that resolves these issues while training stably, and additionally devises a penalty that constrains not only the computation amount but also the number of memory accesses, helping the search find more efficient architectures.
Deep learning is the most promising machine learning algorithm, and it is already used in real life: the latest smartphones use neural networks for better photography and voice recognition. However, as the performance of neural networks has improved, their hardware cost has increased dramatically. Until the past few years, much research focused on only a single side, hardware or software, so the actual cost was hardly improved. Therefore, hardware and software co-optimization is needed to achieve further improvement. For this reason, this dissertation proposes an efficient inference system that considers the hardware accelerator together with the network architecture design.
The first part of the dissertation is a deep neural network accelerator with stochastic computing. The main goal is an efficient stochastic-computing hardware design for convolutional neural networks. It includes a stochastic ReLU and an optimized max function, which are key components of convolutional neural networks. To avoid the range limitation problem of stochastic numbers and to increase the signal-to-noise ratio, we perform weight normalization and upscaling. In addition, to reduce the overhead of binary-to-stochastic conversion, we propose a scheme for sharing stochastic number generators among the neurons of the convolutional neural network.
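The transistor saving comes from how unipolar stochastic computing represents a value p in [0, 1] as a bitstream whose fraction of ones is p, so multiplication reduces to a bitwise AND. A minimal software emulation of that arithmetic (an illustration only, not the dissertation's hardware):

```python
# Software emulation of unipolar stochastic-computing multiplication:
# AND-ing two independent bitstreams estimates the product of their values.
import numpy as np

rng = np.random.default_rng(0)

def to_bitstream(p, length):
    # Stochastic number generator: each bit is 1 with probability p.
    return (rng.random(length) < p).astype(np.uint8)

def sc_multiply(a, b, length=4096):
    # Streams must be uncorrelated; fully correlated streams would compute
    # min(a, b) instead, which is why sharing SNGs in hardware needs care.
    return (to_bitstream(a, length) & to_bitstream(b, length)).mean()

print(sc_multiply(0.8, 0.5))   # ~0.4, up to sampling noise
```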
The second part of the dissertation is neural architecture transformation. Network recasting is proposed to enable this transformation. The primary goal of the method is to accelerate inference through the transformation, but there are many other practical applications. The method is based on block-wise recasting: it recasts each source block in a pre-trained teacher network into a target block in a student network. For the recasting, the target block is trained such that its output activations approximate those of the source block. Recasting block by block in this sequential manner transforms the network architecture while preserving accuracy. The method can transform an arbitrary teacher network type into an arbitrary student network type, and it can even generate a mixed-architecture network consisting of two or more block types. Network recasting can produce a network with fewer parameters and/or activations, which reduces inference time significantly. Naturally, it can also be used for network compression by recasting a trained network into a smaller network of the same type.
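The block-wise recasting step can be pictured with a minimal PyTorch-style sketch (not the dissertation's code); the teacher and student blocks and the random inputs below are placeholders for real blocks and real activations.

```python
# Illustrative sketch of recasting one block: a student block is trained so
# that its output activations match the teacher block's output activations.
import torch
import torch.nn as nn

teacher_block = nn.Sequential(              # e.g. a larger source block
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
).eval()

student_block = nn.Sequential(              # smaller target block
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
)

opt = torch.optim.Adam(student_block.parameters(), lr=1e-3)
for _ in range(100):                        # would iterate over real data
    x = torch.randn(8, 32, 16, 16)          # block input activations
    with torch.no_grad():
        target = teacher_block(x)           # source block's output activations
    loss = nn.functional.mse_loss(student_block(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
# Repeating this block by block, then fine-tuning end to end, transforms the
# architecture while approximately preserving the teacher's behavior.
```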
The third part of the dissertation is a fine-grained neural architecture search. InheritedNAS is a fine-grained architecture search method that uses the coarse-grained architecture found by cell-based architecture search. A fine-grained architecture has a very large search space, so it is hard to search directly. A stage-independent search is therefore proposed: it divides the entire network into several stages and trains each stage independently, and a two-point matching distillation method is proposed to break the dependency between stages. Operation pruning is then applied to remove unimportant operations, using block-wise rather than node-wise pruning. In addition, a hardware-aware latency penalty is proposed that covers not only FLOPs but also memory accesses.
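A minimal sketch of a hardware-aware penalty in this spirit (not the dissertation's formulation), assuming a differentiable search where each candidate operation of a layer has a softmax-weighted architecture parameter and pre-profiled FLOP and memory-access costs; all numbers below are placeholders.

```python
# Illustrative penalty that charges a candidate architecture for both FLOPs
# and memory accesses, differentiable in the architecture parameters.
import torch

# Per-candidate-operation costs for one searchable layer (placeholder values,
# e.g. conv3x3, conv5x5, conv7x7, skip).
flops = torch.tensor([90.0, 150.0, 310.0, 0.0])
mem_accesses = torch.tensor([40.0, 60.0, 120.0, 5.0])

alpha = torch.zeros(4, requires_grad=True)     # architecture parameters

def hw_penalty(alpha, lam_flops=1e-3, lam_mem=1e-3):
    p = torch.softmax(alpha, dim=0)            # expected operation selection
    expected_flops = (p * flops).sum()
    expected_mem = (p * mem_accesses).sum()
    return lam_flops * expected_flops + lam_mem * expected_mem

task_loss = torch.tensor(1.0)                  # placeholder for the CE loss
loss = task_loss + hw_penalty(alpha)
loss.backward()                                # gradient steers alpha toward cheaper ops
```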
1 Introduction 1
1.1 DNN Accelerator with Stochastic Computing 2
1.2 Neural Architecture Transformation 4
1.3 Fine-Grained Neural Architecture Search 6
2 Background 8
2.1 Stochastic Computing 8
2.2 Neural Network 10
2.2.1 Network Compression 10
2.2.2 Neural Network Accelerator 13
2.3 Knowledge Distillation 17
2.4 Neural Architecture Search 19
3 DNN Accelerator with Stochastic Computing 23
3.1 Motivation 23
3.1.1 Multiplication Error on Stochastic Computing 23
3.1.2 DNN with Stochastic Computing 24
3.2 Unipolar SC Hardware for CNN 25
3.2.1 Overall Hardware Design 25
3.2.2 Stochastic ReLU function 27
3.2.3 Stochastic Max function 30
3.2.4 Efficient Average Function 36
3.3 Weight Modulation for SC Hardware 38
3.3.1 Weight Normalization for SC 38
3.3.2 Weight Upscaling for Output Layer 43
3.4 Early Decision Termination 44
3.5 Stochastic Number Generator Sharing 49
3.6 Experiments 53
3.6.1 Accuracy of CNN using Unipolar SC 53
3.6.2 Synthesis Result 57
3.7 Summary 58
4 Neural Architecture Transformation 59
4.1 Motivation 59
4.2 Network Recasting 61
4.2.1 Recasting from DenseNet to ResNet and ConvNet 63
4.2.2 Recasting from ResNet to ConvNet 63
4.2.3 Compression 63
4.2.4 Block Training 65
4.2.5 Sequential Recasting and Fine-tuning 67
4.3 Experiments 69
4.3.1 Visualization of Filter Reduction 70
4.3.2 CIFAR 71
4.3.3 ILSVRC2012 73
4.4 Summary 76
5 Fine-Grained Neural Architecture Search 77
5.1 Motivation 77
5.1.1 Search Space Reduction Versus Diversity 77
5.1.2 Hardware-Aware Optimization 78
5.2 InheritedNAS 79
5.2.1 Stage Independent Search 79
5.2.2 Operation Pruning 82
5.2.3 Entire Search Procedure 83
5.3 Hardware-aware Penalty Design 85
5.4 Experiments 87
5.4.1 Fine-Grained Architecture Search 88
5.4.2 Penalty Analysis 90
5.5 Summary 92
6 Conclusion 93
Abstract (In Korean) 113