97 research outputs found

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms have been and are being developed to accelerate this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations of widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters, for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva program under Grant IJCI-2017-33511.
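
    As a concrete illustration of how such an evaluation can be set up, the hedged sketch below uses cuDNN's built-in algorithm search (cudnnFindConvolutionForwardAlgorithm) to benchmark every available forward-convolution algorithm for one layer configuration; the layer shape and the build command are illustrative assumptions, not taken from the paper.

```cpp
// Minimal sketch (not the paper's code): ask cuDNN to time all forward
// convolution algorithms for one illustrative layer configuration.
// Build with something like: nvcc find_algo.cpp -lcudnn   (cuDNN 7.x API shown)
#include <cudnn.h>
#include <cstdio>

#define CHECK(call)                                                        \
  do {                                                                     \
    cudnnStatus_t s = (call);                                              \
    if (s != CUDNN_STATUS_SUCCESS) {                                       \
      std::fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s));   \
      return 1;                                                            \
    }                                                                      \
  } while (0)

int main() {
  cudnnHandle_t handle;
  CHECK(cudnnCreate(&handle));

  // Input tensor: N x C x H x W, 32-bit FP, NCHW layout (assumed shape).
  cudnnTensorDescriptor_t xDesc, yDesc;
  CHECK(cudnnCreateTensorDescriptor(&xDesc));
  CHECK(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   32, 64, 56, 56));

  // Filter: K output channels, C input channels, 3x3 spatial size (assumed).
  cudnnFilterDescriptor_t wDesc;
  CHECK(cudnnCreateFilterDescriptor(&wDesc));
  CHECK(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   128, 64, 3, 3));

  // Convolution: padding 1, stride 1, dilation 1.
  cudnnConvolutionDescriptor_t convDesc;
  CHECK(cudnnCreateConvolutionDescriptor(&convDesc));
  CHECK(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                        CUDNN_CROSS_CORRELATION,
                                        CUDNN_DATA_FLOAT));
  // For FP16 data one would also enable Tensor Core math:
  // cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);

  // Output tensor shape is derived from input, filter, and convolution params.
  int n, c, h, w;
  CHECK(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                              &n, &c, &h, &w));
  CHECK(cudnnCreateTensorDescriptor(&yDesc));
  CHECK(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   n, c, h, w));

  // Benchmark the available forward algorithms; results come back fastest-first.
  cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  int returned = 0;
  CHECK(cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc,
                                             yDesc,
                                             CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                             &returned, perf));
  for (int i = 0; i < returned; ++i)
    std::printf("algo %d: %.3f ms (status %d)\n",
                perf[i].algo, perf[i].time, perf[i].status);

  cudnnDestroyTensorDescriptor(xDesc);
  cudnnDestroyTensorDescriptor(yDesc);
  cudnnDestroyFilterDescriptor(wDesc);
  cudnnDestroyConvolutionDescriptor(convDesc);
  cudnnDestroy(handle);
  return 0;
}
```

    For FP16 inputs, enabling CUDNN_TENSOR_OP_MATH on the convolution descriptor (commented out above) is what allows cuDNN to dispatch Tensor Core kernels on Volta GPUs, which is the effect the paper highlights for 16-bit FP data.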

    Performance benchmarking, analysis, and optimization of deep learning inference

    The world sees a proliferation of deep learning (DL) models and their wide adoption in different application domains. This has made the performance benchmarking, understanding, and optimization of DL inference an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible computing system to serve DL models with the desired latency, throughput, and energy requirements while maximizing resource utilization. However, DL faces the following challenges in performance engineering. Benchmarking: while there have been significant efforts to develop benchmark suites that evaluate widely used DL models, developing, maintaining, and running benchmarks takes a non-trivial amount of effort, and DL benchmarking has been hampered in part by the lack of representative and up-to-date benchmarking suites. Performance Understanding: understanding the performance of DL workloads is challenging because their characteristics depend on the interplay between the models, frameworks, system libraries, and the hardware (the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which largely limits the types of analysis that can be performed on model execution. Optimization Advising: the current DL optimization process is manual and ad hoc, requiring a great deal of effort and expertise. Existing tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow DL characterization/optimization cycles that cannot keep up with the fast pace at which new DL innovations are introduced. Evaluation and Comparison: the current DL landscape is fast-paced and rife with non-uniform models and hardware/software (HW/SW) stacks, but it lacks a DL benchmarking platform to facilitate the evaluation and comparison of DL innovations, be they models, frameworks, libraries, or hardware. Because of this, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone, stifling the adoption of the innovations. This thesis addresses the above challenges in DL performance engineering. First, we introduce DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model's performance using the performance of the generated benchmarks. Then, we present XSP, an across-stack profiling design that correlates profiles from different sources to obtain a holistic and hierarchical view of DL model execution. XSP innovatively leverages distributed tracing and accurately captures the profiles at each level of the HW/SW stack in spite of the profiling overhead. Next, we propose Benanza, a systematic DL benchmarking and analysis design that guides researchers to potential optimization opportunities and assesses hypothetical execution scenarios on GPUs. Finally, we design MLModelScope, a consistent, reproducible, and scalable DL benchmarking platform to facilitate the evaluation and comparison of DL innovations. This thesis also briefly discusses TrIMS, TOPS, and CommScope, which were developed based on needs observed during the performance benchmarking and optimization work to solve related problems in the DL domain.
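
    The benchmark-composition idea behind DLBricks can be made concrete with a small sketch (not the thesis's code): benchmark each unique layer configuration once, then reconstruct an estimate of the whole model's latency from those per-layer measurements. The layer signatures and latency numbers below are made-up placeholders.

```cpp
// Illustrative sketch of benchmark composition (not the DLBricks implementation):
// estimate a model's latency from the measured latencies of its unique layers.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
  // Hypothetical per-layer benchmark results, keyed by a layer "signature"
  // (operator type + shape). Values are measured latencies in milliseconds.
  std::map<std::string, double> benchmark_ms = {
      {"conv3x3_64x64x56x56", 0.42},
      {"conv1x1_256x64x56x56", 0.18},
      {"relu_256x56x56", 0.03},
  };

  // A model is a sequence of layer signatures; repeated signatures reuse the
  // same benchmark result instead of being measured again.
  std::vector<std::string> model = {
      "conv3x3_64x64x56x56", "relu_256x56x56",
      "conv1x1_256x64x56x56", "relu_256x56x56",
  };

  double estimate_ms = 0.0;
  for (const auto& layer : model) estimate_ms += benchmark_ms.at(layer);
  std::printf("estimated sequential latency: %.2f ms\n", estimate_ms);
  return 0;
}
```

    A real design would also have to account for framework overhead and operator overlap; the point of the sketch is only the reuse of unique-layer measurements to avoid benchmarking the full model.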

    DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

    Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. Optimizing GPU designs for efficient CNN training acceleration requires accurately modeling how performance improves when computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
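
    To make the idea of an analytical traffic model concrete, the sketch below computes the lower-bound ("compulsory") DRAM traffic and the arithmetic intensity of a single convolution layer, assuming perfect on-chip reuse, and turns them into a roofline-style time bound. DeLTA's actual model is more detailed (it tracks traffic at every memory hierarchy level and the reuse pattern of the parallel convolution algorithm); the formulas, layer shape, and peak numbers here are only illustrative assumptions.

```cpp
// Illustrative back-of-the-envelope traffic model for one convolution layer
// (not DeLTA's model): compulsory DRAM traffic assumes each input, weight,
// and output element crosses the DRAM bus exactly once (perfect reuse).
#include <cstdio>

int main() {
  // Assumed layer shape (ResNet-like 3x3 convolution) with FP32 data.
  const double N = 32, C = 64, H = 56, W = 56;   // input:  N x C x H x W
  const double K = 128, R = 3, S = 3;            // filter: K x C x R x S
  const double Ho = 56, Wo = 56;                 // output (stride 1, pad 1)
  const double bytes_per_elem = 4.0;

  // Each output element needs C*R*S multiply-adds, i.e. 2*C*R*S FLOPs.
  const double flops = 2.0 * N * K * Ho * Wo * C * R * S;
  const double traffic_bytes =
      (N * C * H * W + K * C * R * S + N * K * Ho * Wo) * bytes_per_elem;
  const double intensity = flops / traffic_bytes;  // FLOPs per DRAM byte

  // Roofline-style bound from assumed peak compute and bandwidth; the values
  // stand in for a Volta-class GPU and are not measurements.
  const double peak_flops = 14e12, peak_bw = 900e9;
  const double t_compute = flops / peak_flops, t_memory = traffic_bytes / peak_bw;

  std::printf("FLOPs: %.3g, DRAM bytes: %.3g, intensity: %.1f FLOP/B\n",
              flops, traffic_bytes, intensity);
  std::printf("roofline time bound: %.3f ms\n",
              1e3 * (t_compute > t_memory ? t_compute : t_memory));
  return 0;
}
```

    A model like DeLTA refines exactly this kind of estimate by accounting for how much of the reuse actually happens at each level of the GPU memory hierarchy rather than assuming it is perfect.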

    An Open-Source Deep Learning Primitive Library with a cuDNN-like Interface

    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2019. Jae Wook Lee.
    Deep neural networks (DNNs) are a key enabler of today's intelligent applications and services. cuDNN is the de-facto standard library of deep learning primitives, which makes it easy to develop sophisticated DNN models. However, cuDNN is proprietary software from NVIDIA, and thus does not allow the user to customize the library based on her needs. Furthermore, it only targets NVIDIA GPUs and cannot support other hardware devices such as manycore CPUs and FPGAs. In this thesis we propose OpenDNN, an open-source, cuDNN-like DNN primitive library that can flexibly support multiple hardware devices. In particular, we demonstrate the portability and flexibility of OpenDNN by porting it to multiple popular DNN frameworks and hardware devices, including GPUs, CPUs, and FPGAs.
    Table of contents: Abstract; Contents; List of Tables; List of Figures; Chapter 1 Introduction; Chapter 2 Background (2.1 Deep Neural Network, 2.2 Heterogeneous Computer); Chapter 3 OpenDNN API (3.1 Overview, 3.2 Context Manager, 3.3 Descriptor Manager, 3.4 Computation Functions, 3.5 Summary); Chapter 4 Backend Devices (4.1 CPU, 4.2 GPU, 4.3 FPGA); Chapter 5 OpenDNN-enabled DNN Frameworks (5.1 Caffe, 5.2 TensorFlow, 5.3 DarkNet); Chapter 6 Evaluation (6.1 Programmable Effort, 6.2 Performance); Chapter 7 Related Work; Chapter 8 Conclusion; Bibliography; Abstract (in Korean); Acknowledgements.
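
    To show what a "cuDNN-like interface" means in practice, here is a hedged interface sketch organized the way the abstract describes (a context manager, descriptor managers, and computation functions). The opendnn* names and signatures below are illustrative placeholders and are not guaranteed to match the actual OpenDNN API.

```cpp
// Illustrative header sketch of a cuDNN-style primitive interface, following
// the structure listed in the abstract (context, descriptors, computation
// functions). Names and signatures are placeholders, not OpenDNN's real API.
#include <cstddef>

typedef struct opendnnContext*      opendnnHandle_t;      // context manager
typedef struct opendnnTensorStruct* opendnnTensorDesc_t;  // descriptor manager
typedef struct opendnnFilterStruct* opendnnFilterDesc_t;
typedef struct opendnnConvStruct*   opendnnConvDesc_t;

// Context management: one handle per device/stream.
void opendnnCreate(opendnnHandle_t* handle);
void opendnnDestroy(opendnnHandle_t handle);

// Descriptor management: shapes and layer parameters are described once and
// then reused across calls, exactly as in cuDNN.
void opendnnCreateTensorDescriptor(opendnnTensorDesc_t* desc);
void opendnnSetTensor4dDescriptor(opendnnTensorDesc_t desc,
                                  int n, int c, int h, int w);
void opendnnCreateFilterDescriptor(opendnnFilterDesc_t* desc);
void opendnnSetFilter4dDescriptor(opendnnFilterDesc_t desc,
                                  int k, int c, int h, int w);
void opendnnCreateConvolutionDescriptor(opendnnConvDesc_t* desc);
void opendnnSetConvolution2dDescriptor(opendnnConvDesc_t desc,
                                       int pad_h, int pad_w,
                                       int stride_h, int stride_w,
                                       int dilation_h, int dilation_w);

// Computation functions: the backend (CPU, GPU, or FPGA) is chosen when the
// library is built, so frameworks call the same entry point on every device.
void opendnnConvolutionForward(opendnnHandle_t handle,
                               opendnnTensorDesc_t x_desc, const float* x,
                               opendnnFilterDesc_t w_desc, const float* w,
                               opendnnConvDesc_t conv_desc,
                               void* workspace, size_t workspace_bytes,
                               opendnnTensorDesc_t y_desc, float* y);
```

    The intent of mirroring cuDNN's handle/descriptor/compute structure is that a framework integration can follow the same call sequence it already uses for cuDNN, which is how the thesis demonstrates portability across frameworks such as Caffe, TensorFlow, and DarkNet and across CPU, GPU, and FPGA backends.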