832 research outputs found

    MG3MConv: Multi-Grained Matrix-Multiplication-Mapping Convolution Algorithm toward the SW26010 Processor

    As the core of artificial intelligence applications, convolution has become a hot research topic in high-performance computing. With the rapid development of the emerging SW26010 processor in artificial intelligence, there is an urgent need for high-performance convolution algorithms on the processor. However, current support for convolution on SW26010 is still rudimentary. The only existing studies achieve sufficient peak runtime performance but lack adaptability to various convolution scenes. To improve convolution algorithms on SW26010, we propose a multi-grained matrix-multiplication-mapping convolution algorithm called MG3MConv, which targets the architectural features of SW26010. MG3MConv supports diversified mapping schemes of convolution tasks based on the concept of the thread block proposed in this paper. All the architecture-oriented optimization methods are elaborately designed at four levels to fully exploit the hardware efficiency of SW26010. The experiments show that the hardware efficiency of MG3MConv reaches up to 84.78%, which is 1.75 times that of cuDNN on the NVIDIA K80m GPU. Moreover, MG3MConv outperforms cuDNN in most convolution scenes. We also use six representative CNNs as real-world cases, and the hardware efficiency of MG3MConv reaches up to 67.04% on the VGG network model, which is 1.37 times and 1.96 times that of cuDNN and swDNN, respectively.

    Performance–energy trade-offs of deep learning convolution algorithms on ARM processors

    In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) inference on a series of ARM-based processor architectures. Specifically, we evaluate the NVIDIA Denver2 and Carmel processors, as well as the ARM Cortex-A57 and Cortex-A78AE CPUs as part of a recent set of NVIDIA Jetson platforms. The performance–energy evaluation is carried out using the ResNet-50 v1.5 convolutional neural network (CNN) on varying configurations of convolution algorithms, number of threads/cores, and operating frequencies on the tested processor cores. The results demonstrate that the best throughput is obtained on all platforms with the Winograd convolution operator running on all the cores at their highest frequency. However, if the goal is to reduce the energy footprint, there is no rule of thumb for the optimal configuration. Funding for open access charge: CRUE-Universitat Jaume

    Performance–energy trade-offs of deep learning convolution algorithms on ARM processors

    In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) inference on a series of ARM-based processor architectures. Specifically, we evaluate the NVIDIA Denver2 and Carmel processors, as well as the ARM Cortex-A57 and Cortex-A78AE CPUs as part of a recent set of NVIDIA Jetson platforms. The performance–energy evaluation is carried out using the ResNet-50 v1.5 convolutional neural network (CNN) on varying configurations of convolution algorithms, number of threads/cores, and operating frequencies on the tested processor cores. The results demonstrate that the best throughput is obtained on all platforms with the Winograd convolution operator running on all the cores at their highest frequency. However, if the goal is to reduce the energy footprint, there is no rule of thumb for the optimal configuration. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research was funded by Project PID2020-113656RB-C21/C22 supported by MCIN/AEI/10.13039/501100011033. Manuel F. Dolz was also supported by the Plan Gen–T grant CDEIGENT/2018/014 of the Generalitat Valenciana. Héctor Martínez is a POSTDOC_21_00025 fellow supported by Junta de Andalucía. Adrián Castelló is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. Antonio Maciá is a PRE2021-099284 fellow supported by MCIN/AEI/10.13039/501100011033.
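
    For readers unfamiliar with the "explicit lowering" approach named in the abstract above, the following is a minimal, unoptimized sketch of the general idea (often called im2col): each input patch is copied into a row of a matrix, so the convolution becomes a matrix-vector product. It is not the code evaluated in the paper. The function names `im2col`, `conv2d_lowered` and `conv2d_direct` are hypothetical, and the sketch assumes a single channel, stride 1, no padding, and the deep learning convention in which the kernel is not flipped.

```python
def im2col(image, kh, kw):
    """Copy every kh x kw patch of `image` into one flat row (explicit lowering)."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_lowered(image, kernel):
    """Convolution expressed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    # One dot product per output element; a real code would call a tuned GEMM here.
    products = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    out_w = len(image[0]) - kw + 1
    return [products[r:r + out_w] for r in range(0, len(products), out_w)]


def conv2d_direct(image, kernel):
    """Reference: direct sliding-window computation, used only for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
lowered = conv2d_lowered(image, kernel)
direct = conv2d_direct(image, kernel)
```

    Both routines return the same 2 x 2 result here. The lowered form trades extra memory for the patch matrix against the ability to reuse a highly tuned matrix-multiplication routine.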

    Efficient and portable Winograd convolutions for multi-core processors

    We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is improved by introducing vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE to exploit the single-instruction multiple-data capabilities of current processors, as well as OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct processors (Intel Xeon Skylake, ARM Cortex A57 and Fujitsu A64FX) show that the impact is affordable and still renders a Winograd-based solution that is competitive when compared with the lowering GEMM-based convolution.
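
    As a rough illustration of the Winograd idea referred to above, the sketch below shows the smallest one-dimensional variant, usually written F(2,3): two outputs of a 3-tap filter are computed with four multiplications instead of six. This is a textbook form, not the vectorised, multi-threaded implementation described in the abstract, and the function names `winograd_f23` and `direct_f23` are hypothetical.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over four inputs d, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on g, so it can be precomputed once per filter.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Input transform followed by the four element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


fast = winograd_f23([1, 2, 3, 4], [2, 4, 6])
slow = direct_f23([1, 2, 3, 4], [2, 4, 6])
```

    Both calls yield 28 and 40 for this input. Two-dimensional variants used for CNN layers nest the same transforms over rows and columns.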

    BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images

    In cryo-electron microscopy (EM), molecular structures are determined from large numbers of projection images of individual particles. To harness the full power of this single-molecule information, we use the Bayesian inference of EM (BioEM) formalism. By ranking structural models using posterior probabilities calculated for individual images, BioEM in principle addresses the challenge of working with highly dynamic or heterogeneous systems not easily handled in traditional EM reconstruction. However, the calculation of these posteriors for large numbers of particles and models is computationally demanding. Here we present highly parallelized, GPU-accelerated computer software that performs this task efficiently. Our flexible formulation employs CUDA, OpenMP, and MPI parallelization combined with both CPU and GPU computing. The resulting BioEM software scales nearly ideally both on pure CPU and on CPU+GPU architectures, thus enabling Bayesian analysis of tens of thousands of images in a reasonable time. The general mathematical framework and robust algorithms are not limited to cryo-electron microscopy but can be generalized for electron tomography and other imaging experiments.