97 research outputs found

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms have been and are being developed to accelerate this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations of widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters, for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva program under Grant IJCI-2017-33511.
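
    As a concrete illustration of how such an evaluation can be set up, the hedged sketch below uses cuDNN's built-in algorithm search (cudnnFindConvolutionForwardAlgorithm) to benchmark every available forward-convolution algorithm for one layer configuration; the layer shape and the build command are illustrative assumptions, not taken from the paper.

```cpp
// Minimal sketch (not the paper's code): ask cuDNN to time all forward
// convolution algorithms for one illustrative layer configuration.
// Build with something like: nvcc find_algo.cpp -lcudnn   (cuDNN 7.x API shown)
#include <cudnn.h>
#include <cstdio>

#define CHECK(call)                                                        \
  do {                                                                     \
    cudnnStatus_t s = (call);                                              \
    if (s != CUDNN_STATUS_SUCCESS) {                                       \
      std::fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s));   \
      return 1;                                                            \
    }                                                                      \
  } while (0)

int main() {
  cudnnHandle_t handle;
  CHECK(cudnnCreate(&handle));

  // Input tensor: N x C x H x W, 32-bit FP, NCHW layout (assumed shape).
  cudnnTensorDescriptor_t xDesc, yDesc;
  CHECK(cudnnCreateTensorDescriptor(&xDesc));
  CHECK(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   32, 64, 56, 56));

  // Filter: K output channels, C input channels, 3x3 spatial size (assumed).
  cudnnFilterDescriptor_t wDesc;
  CHECK(cudnnCreateFilterDescriptor(&wDesc));
  CHECK(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   128, 64, 3, 3));

  // Convolution: padding 1, stride 1, dilation 1.
  cudnnConvolutionDescriptor_t convDesc;
  CHECK(cudnnCreateConvolutionDescriptor(&convDesc));
  CHECK(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                        CUDNN_CROSS_CORRELATION,
                                        CUDNN_DATA_FLOAT));
  // For FP16 data one would also enable Tensor Core math:
  // cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);

  // Output tensor shape is derived from input, filter, and convolution params.
  int n, c, h, w;
  CHECK(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                              &n, &c, &h, &w));
  CHECK(cudnnCreateTensorDescriptor(&yDesc));
  CHECK(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   n, c, h, w));

  // Benchmark the available forward algorithms; results come back fastest-first.
  cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  int returned = 0;
  CHECK(cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc,
                                             yDesc,
                                             CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                             &returned, perf));
  for (int i = 0; i < returned; ++i)
    std::printf("algo %d: %.3f ms (status %d)\n",
                perf[i].algo, perf[i].time, perf[i].status);

  cudnnDestroyTensorDescriptor(xDesc);
  cudnnDestroyTensorDescriptor(yDesc);
  cudnnDestroyFilterDescriptor(wDesc);
  cudnnDestroyConvolutionDescriptor(convDesc);
  cudnnDestroy(handle);
  return 0;
}
```

    For FP16 inputs, enabling CUDNN_TENSOR_OP_MATH on the convolution descriptor (commented out above) is what allows cuDNN to dispatch Tensor Core kernels on Volta GPUs, which is the effect the paper highlights for 16-bit FP data.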

    Performance benchmarking, analysis, and optimization of deep learning inference

    The world sees a proliferation of deep learning (DL) models and their wide adoption in different application domains. This has made the performance benchmarking, understanding, and optimization of DL inference an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible computing system to serve DL models with the desired latency, throughput, and energy requirements while maximizing resource utilization. However, DL faces the following challenges in performance engineering. Benchmarking: while there have been significant efforts to develop benchmark suites that evaluate widely used DL models, developing, maintaining, and running benchmarks takes a non-trivial amount of effort, and DL benchmarking has been hampered in part by the lack of representative and up-to-date benchmarking suites. Performance Understanding: understanding the performance of DL workloads is challenging because their characteristics depend on the interplay between the models, frameworks, system libraries, and the hardware (the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which largely limits the types of analysis that can be performed on model execution. Optimization Advising: the current DL optimization process is manual and ad hoc, requiring a great deal of effort and expertise. Existing tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow DL characterization/optimization cycles that cannot keep up with the fast pace at which new DL innovations are introduced. Evaluation and Comparison: the current DL landscape is fast-paced and rife with non-uniform models and hardware/software (HW/SW) stacks, but it lacks a DL benchmarking platform to facilitate the evaluation and comparison of DL innovations, be they models, frameworks, libraries, or hardware. Because of this, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone, stifling the adoption of the innovations. This thesis addresses the above challenges in DL performance engineering. First, we introduce DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model's performance using the performance of the generated benchmarks. Then, we present XSP, an across-stack profiling design that correlates profiles from different sources to obtain a holistic and hierarchical view of DL model execution. XSP innovatively leverages distributed tracing and accurately captures the profiles at each level of the HW/SW stack in spite of the profiling overhead. Next, we propose Benanza, a systematic DL benchmarking and analysis design that guides researchers to potential optimization opportunities and assesses hypothetical execution scenarios on GPUs. Finally, we design MLModelScope, a consistent, reproducible, and scalable DL benchmarking platform to facilitate the evaluation and comparison of DL innovations. This thesis also briefly discusses TrIMS, TOPS, and CommScope, which were developed based on needs observed during the performance benchmarking and optimization work to solve related problems in the DL domain.
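
    The benchmark-composition idea behind DLBricks can be made concrete with a small sketch (not the thesis's code): benchmark each unique layer configuration once, then reconstruct an estimate of the whole model's latency from those per-layer measurements. The layer signatures and latency numbers below are made-up placeholders.

```cpp
// Illustrative sketch of benchmark composition (not the DLBricks implementation):
// estimate a model's latency from the measured latencies of its unique layers.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
  // Hypothetical per-layer benchmark results, keyed by a layer "signature"
  // (operator type + shape). Values are measured latencies in milliseconds.
  std::map<std::string, double> benchmark_ms = {
      {"conv3x3_64x64x56x56", 0.42},
      {"conv1x1_256x64x56x56", 0.18},
      {"relu_256x56x56", 0.03},
  };

  // A model is a sequence of layer signatures; repeated signatures reuse the
  // same benchmark result instead of being measured again.
  std::vector<std::string> model = {
      "conv3x3_64x64x56x56", "relu_256x56x56",
      "conv1x1_256x64x56x56", "relu_256x56x56",
  };

  double estimate_ms = 0.0;
  for (const auto& layer : model) estimate_ms += benchmark_ms.at(layer);
  std::printf("estimated sequential latency: %.2f ms\n", estimate_ms);
  return 0;
}
```

    A real design would also have to account for framework overhead and operator overlap; the point of the sketch is only the reuse of unique-layer measurements to avoid benchmarking the full model.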

    DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

    Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. Optimizing GPU designs for efficient CNN training acceleration requires accurately modeling how performance improves when computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
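
    To make the idea of an analytical traffic model concrete, the sketch below computes the lower-bound ("compulsory") DRAM traffic and the arithmetic intensity of a single convolution layer, assuming perfect on-chip reuse, and turns them into a roofline-style time bound. DeLTA's actual model is more detailed (it tracks traffic at every memory hierarchy level and the reuse pattern of the parallel convolution algorithm); the formulas, layer shape, and peak numbers here are only illustrative assumptions.

```cpp
// Illustrative back-of-the-envelope traffic model for one convolution layer
// (not DeLTA's model): compulsory DRAM traffic assumes each input, weight,
// and output element crosses the DRAM bus exactly once (perfect reuse).
#include <cstdio>

int main() {
  // Assumed layer shape (ResNet-like 3x3 convolution) with FP32 data.
  const double N = 32, C = 64, H = 56, W = 56;   // input:  N x C x H x W
  const double K = 128, R = 3, S = 3;            // filter: K x C x R x S
  const double Ho = 56, Wo = 56;                 // output (stride 1, pad 1)
  const double bytes_per_elem = 4.0;

  // Each output element needs C*R*S multiply-adds, i.e. 2*C*R*S FLOPs.
  const double flops = 2.0 * N * K * Ho * Wo * C * R * S;
  const double traffic_bytes =
      (N * C * H * W + K * C * R * S + N * K * Ho * Wo) * bytes_per_elem;
  const double intensity = flops / traffic_bytes;  // FLOPs per DRAM byte

  // Roofline-style bound from assumed peak compute and bandwidth; the values
  // stand in for a Volta-class GPU and are not measurements.
  const double peak_flops = 14e12, peak_bw = 900e9;
  const double t_compute = flops / peak_flops, t_memory = traffic_bytes / peak_bw;

  std::printf("FLOPs: %.3g, DRAM bytes: %.3g, intensity: %.1f FLOP/B\n",
              flops, traffic_bytes, intensity);
  std::printf("roofline time bound: %.3f ms\n",
              1e3 * (t_compute > t_memory ? t_compute : t_memory));
  return 0;
}
```

    A model like DeLTA refines exactly this kind of estimate by accounting for how much of the reuse actually happens at each level of the GPU memory hierarchy rather than assuming it is perfect.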

    An Open-Source Deep Learning Primitive Library with a cuDNN-like Interface

    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2019. Jae Wook Lee.
    Deep neural networks (DNNs) are a key enabler of today's intelligent applications and services. cuDNN is the de-facto standard library of deep learning primitives, which makes it easy to develop sophisticated DNN models. However, cuDNN is proprietary software from NVIDIA, and thus does not allow the user to customize the library based on her needs. Furthermore, it only targets NVIDIA GPUs and cannot support other hardware devices such as manycore CPUs and FPGAs. In this thesis we propose OpenDNN, an open-source, cuDNN-like DNN primitive library that can flexibly support multiple hardware devices. In particular, we demonstrate the portability and flexibility of OpenDNN by porting it to multiple popular DNN frameworks and hardware devices, including GPUs, CPUs, and FPGAs.
    Table of contents: Abstract; Contents; List of Tables; List of Figures; Chapter 1 Introduction; Chapter 2 Background (2.1 Deep Neural Network, 2.2 Heterogeneous Computer); Chapter 3 OpenDNN API (3.1 Overview, 3.2 Context Manager, 3.3 Descriptor Manager, 3.4 Computation Functions, 3.5 Summary); Chapter 4 Backend Devices (4.1 CPU, 4.2 GPU, 4.3 FPGA); Chapter 5 OpenDNN-enabled DNN Frameworks (5.1 Caffe, 5.2 TensorFlow, 5.3 DarkNet); Chapter 6 Evaluation (6.1 Programmable Effort, 6.2 Performance); Chapter 7 Related Work; Chapter 8 Conclusion; Bibliography; Abstract (in Korean); Acknowledgements.
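
    To show what a "cuDNN-like interface" means in practice, here is a hedged interface sketch organized the way the abstract describes (a context manager, descriptor managers, and computation functions). The opendnn* names and signatures below are illustrative placeholders and are not guaranteed to match the actual OpenDNN API.

```cpp
// Illustrative header sketch of a cuDNN-style primitive interface, following
// the structure listed in the abstract (context, descriptors, computation
// functions). Names and signatures are placeholders, not OpenDNN's real API.
#include <cstddef>

typedef struct opendnnContext*      opendnnHandle_t;      // context manager
typedef struct opendnnTensorStruct* opendnnTensorDesc_t;  // descriptor manager
typedef struct opendnnFilterStruct* opendnnFilterDesc_t;
typedef struct opendnnConvStruct*   opendnnConvDesc_t;

// Context management: one handle per device/stream.
void opendnnCreate(opendnnHandle_t* handle);
void opendnnDestroy(opendnnHandle_t handle);

// Descriptor management: shapes and layer parameters are described once and
// then reused across calls, exactly as in cuDNN.
void opendnnCreateTensorDescriptor(opendnnTensorDesc_t* desc);
void opendnnSetTensor4dDescriptor(opendnnTensorDesc_t desc,
                                  int n, int c, int h, int w);
void opendnnCreateFilterDescriptor(opendnnFilterDesc_t* desc);
void opendnnSetFilter4dDescriptor(opendnnFilterDesc_t desc,
                                  int k, int c, int h, int w);
void opendnnCreateConvolutionDescriptor(opendnnConvDesc_t* desc);
void opendnnSetConvolution2dDescriptor(opendnnConvDesc_t desc,
                                       int pad_h, int pad_w,
                                       int stride_h, int stride_w,
                                       int dilation_h, int dilation_w);

// Computation functions: the backend (CPU, GPU, or FPGA) is chosen when the
// library is built, so frameworks call the same entry point on every device.
void opendnnConvolutionForward(opendnnHandle_t handle,
                               opendnnTensorDesc_t x_desc, const float* x,
                               opendnnFilterDesc_t w_desc, const float* w,
                               opendnnConvDesc_t conv_desc,
                               void* workspace, size_t workspace_bytes,
                               opendnnTensorDesc_t y_desc, float* y);
```

    The intent of mirroring cuDNN's handle/descriptor/compute structure is that a framework integration can follow the same call sequence it already uses for cuDNN, which is how the thesis demonstrates portability across frameworks such as Caffe, TensorFlow, and DarkNet and across CPU, GPU, and FPGA backends.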