43 research outputs found

OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA

Author: Andrade Diego
Fraguela Basilio B.
López Castro Roberto
Publication venue: 'MDPI AG'
Publication date: 01/01/2021

Improving the performance of the convolution operation has become a key target for High Performance Computing (HPC) developers due to its prevalence in deep learning, applied mainly to video processing. The improvement is being driven by algorithmic and implementation innovations. Algorithmically, the convolution can be computed directly from its mathematical definition, but other methods transform it into a Fast Fourier Transform (FFT) or a GEneral Matrix Multiplication (GEMM). In this latter group, the Winograd algorithm is a state-of-the-art variant that is especially suitable for small convolutions. In this paper, we present openCNN, an optimized CUDA C++ implementation of the Winograd convolution algorithm. Our approach achieves speedups of up to 1.76× on Turing RTX 2080Ti and up to 1.85× on Ampere RTX 3090 with respect to Winograd convolution in cuDNN 8.2.0. OpenCNN is released as open-source software.

This research was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), by the predoctoral grant of Roberto L. Castro (FPU19/03974), and by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2021/30). CITIC, Centro de Investigación de Galicia ref. ED431G 2019/01, receives financial support from Consellería de Educación, Universidade e Formación Profesional, Xunta de Galicia, through the ERDF (80%) and Secretaría Xeral de Universidades (20%).
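As a rough illustration of the Winograd minimal filtering idea mentioned in this abstract, the sketch below computes two outputs of a 3-tap 1D filter, the case usually written F(2,3), with 4 multiplications instead of the 6 a direct evaluation needs. It is a plain-Python toy under our own assumptions, not openCNN's CUDA code; the function names `direct_f23` and `winograd_f23` are hypothetical.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3): 4 multiplications.

    The filter-side terms (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2 depend
    only on the filter, so they can be precomputed once and reused across
    every input tile.
    """
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2

    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]

    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [1.0, 0.0, -1.0]
    print(direct_f23(d, g), winograd_f23(d, g))
```

The saving comes from trading multiplications for additions, which is why the technique tends to pay off for small filters such as 3×3.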

Multidisciplinary Digital Publishing Institute

Repositorio da Universidade da Coruña

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Author: Ben-Nun Tal
Hoefler Torsten
Publication date: 15/09/2018

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

arXiv.org e-Print Archive

Repository for Publications and Research Data

Efficient and portable Winograd convolutions for multi-core processors

Author: Alonso-Jordá Pedro
Castelló Adrián
Dolz Manuel F.
Martínez Héctor
Quintana-Orti Enrique S.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/02/2023

We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is improved by introducing vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE, which exploit the single-instruction multiple-data capabilities of current processors, together with OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct processors (Intel Xeon Skylake, ARM Cortex A57 and Fujitsu A64FX) show that the impact is affordable and still renders a Winograd-based solution that is competitive when compared with the lowering GEMM-based convolution.

Repositori Institucional de la Universitat Jaume I

Optimizing Depthwise Separable Convolution Operations on GPUs

Author: Lu G
Wang Z
Zhang W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/05/2021

The depthwise separable convolution is widely used to reduce the computation overhead of multi-channel 2D convolutions. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-sized model training and for the typical scenario of model inference, where the model takes in a few samples at once. This paper aims to bridge the gap of optimizing depthwise separable convolutions by targeting the GPU architecture. We achieve this by designing two novel algorithms to improve the column and row reuse of convolution operations to reduce the number of memory operations. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads to improve the GPU utilization and to hide the memory access latency. We apply our approach on two GPU platforms, NVIDIA RTX 2080Ti and NVIDIA Jetson AGX Xavier GPUs, and two data types: 32-bit floating point (FP32) and 8-bit integer (INT8). We compared our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2× (up to 3×) performance improvement over cuDNN.
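To make the "reduced computation overhead" claim concrete, the sketch below counts multiply-accumulate operations for a standard 2D convolution and for its depthwise separable counterpart (a per-channel K×K depthwise step followed by a 1×1 pointwise step). It assumes stride 1 and an output of the same spatial size as the input; the function names and the example layer shape are illustrative and do not come from the paper.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard KxK convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise KxK step plus a 1x1 pointwise step."""
    depthwise = h * w * c_in * k * k   # one KxK filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    h, w, c_in, c_out, k = 56, 56, 64, 128, 3
    std = standard_conv_macs(h, w, c_in, c_out, k)
    sep = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.4f}")
```

Under these assumptions the ratio simplifies to 1/c_out + 1/k², so the arithmetic shrinks substantially and memory traffic becomes relatively more important, which is consistent with the abstract's focus on reducing memory operations.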

White Rose Research Online