
    Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation

    The increasing complexity of modern hardware requires sophisticated programming techniques for programs to run efficiently. At the same time, the increased power of modern hardware enables more advanced analyses to be included in compilers. This thesis focuses on one particular optimisation technique that improves the utilisation of vector units. The foundation of this technique is the ability to choose memory mappings for the data structures of a given program. Usually, programming languages use a fixed layout for logical data structures in physical memory. Such a static mapping often has a negative effect on the usability of vector units. In this thesis we consider a compiler for a programming language that allows every data structure in a program to have its own data layout. We make sure that data layouts across the program are sound, and, most importantly, we solve the problem of automatic data layout reconstruction. To do this consistently, we formulate it as a type inference problem, where a type encodes the data layout for a given structure as well as the implied program transformations. We prove that type-implied transformations preserve the semantics of the original programs, and we demonstrate significant performance improvements when targeting SIMD-capable architectures.
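    To make the layout choice concrete, here is a minimal C sketch (not taken from the thesis; all names are illustrative) of the classic array-of-structures versus structure-of-arrays trade-off that such a compiler could select between automatically. The SoA form gives unit-stride accesses that vectorise cleanly, while the AoS form interleaves fields in memory.

        /* Illustrative sketch of the layout choice the thesis automates:
         * the same logical array of points stored as AoS vs SoA.
         * Names (PointAoS, PointsSoA, scale_*) are hypothetical. */
        #include <stddef.h>

        /* Array-of-Structures: x and y interleaved in memory. */
        typedef struct { float x, y; } PointAoS;

        /* Structure-of-Arrays: each field contiguous, SIMD-friendly. */
        typedef struct { float *x, *y; } PointsSoA;

        void scale_aos(PointAoS *p, size_t n, float s) {
            for (size_t i = 0; i < n; i++) {  /* strided access: x0 y0 x1 y1 ... */
                p[i].x *= s;
                p[i].y *= s;
            }
        }

        void scale_soa(PointsSoA p, size_t n, float s) {
            for (size_t i = 0; i < n; i++)    /* unit stride: vectorises cleanly */
                p.x[i] *= s;
            for (size_t i = 0; i < n; i++)
                p.y[i] *= s;
        }

    A vectorising compiler can typically turn each scale_soa loop into packed SIMD operations, whereas the interleaved AoS accesses require gather or shuffle work.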

    Efficient and portable Winograd convolutions for multi-core processors

    We take a step forward towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is enhanced via the introduction of vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE, to exploit the single-instruction multiple-data capabilities of current processors, as well as OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct platforms, equipped with Intel Xeon Skylake, ARM Cortex-A57 and Fujitsu A64FX processors, show that the impact is affordable and still renders a Winograd-based solution that is competitive when compared with the lowering GEMM-based convolution.
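    For reference, the arithmetic saving that the Winograd algorithm exploits can be seen in the 1-D minimal-filtering kernel F(2,3), sketched below in plain C (a standard textbook formulation, not the paper's tuned code): two outputs of a 3-tap convolution are produced with 4 multiplications instead of 6, and the 2-D variants used for CNN layers nest this construction.

        /* 1-D Winograd minimal-filtering kernel F(2,3): two outputs of a
         * 3-tap convolution with 4 multiplies instead of 6. d is the input
         * tile, g the filter, y the output. */
        void winograd_f23(const float d[4], const float g[3], float y[2]) {
            /* Filter transform (can be precomputed once per filter). */
            float u0 = g[0];
            float u1 = 0.5f * (g[0] + g[1] + g[2]);
            float u2 = 0.5f * (g[0] - g[1] + g[2]);
            float u3 = g[2];

            /* Input transform. */
            float v0 = d[0] - d[2];
            float v1 = d[1] + d[2];
            float v2 = d[2] - d[1];
            float v3 = d[1] - d[3];

            /* Element-wise products: the 4 multiplies. */
            float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;

            /* Output transform: y[0] = d0*g0 + d1*g1 + d2*g2,
             *                   y[1] = d1*g0 + d2*g1 + d3*g2. */
            y[0] = m0 + m1 + m2;
            y[1] = m1 - m2 - m3;
        }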

    Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs

    The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computationally intensive layers and, iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimisation, as the space of approaches becomes too large to test exhaustively and obtain a globally optimised solution, leading to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimised solution that speeds up the performance and reduces the memory on embedded CPU platforms. Finally, we present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms, achieving up to 4x improvement in performance and over 2x reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation.
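    As a rough illustration of the search component, the following C sketch (hypothetical; the paper uses a full Reinforcement Learning agent coupled to an inference framework) runs an epsilon-greedy bandit over a toy configuration space and learns which configuration minimises measured latency. measure_latency_ms and its numbers are invented stand-ins, not an API or data from the paper.

        #include <stdio.h>
        #include <stdlib.h>

        #define N_CONFIGS 8   /* e.g. 4 layer algorithms x 2 thread counts */

        /* Hypothetical benchmark hook: in a real framework this would run
         * one inference pass under the given configuration and time it.
         * A fixed table of invented latencies stands in here. */
        static double measure_latency_ms(int config) {
            static const double fake_ms[N_CONFIGS] =
                {9.1, 7.4, 8.8, 6.2, 7.9, 6.5, 8.3, 7.0};
            return fake_ms[config];
        }

        int explore(int steps, double epsilon) {
            double q[N_CONFIGS] = {0};   /* running reward estimates */
            int    n[N_CONFIGS] = {0};   /* visit counts */
            for (int t = 0; t < steps; t++) {
                int a;
                if ((double)rand() / RAND_MAX < epsilon) {
                    a = rand() % N_CONFIGS;     /* explore a random config */
                } else {
                    a = 0;                      /* exploit the best so far */
                    for (int i = 1; i < N_CONFIGS; i++)
                        if (q[i] > q[a]) a = i;
                }
                double reward = -measure_latency_ms(a); /* lower latency = higher reward */
                n[a]++;
                q[a] += (reward - q[a]) / n[a]; /* incremental mean update */
            }
            int best = 0;
            for (int i = 1; i < N_CONFIGS; i++)
                if (q[i] > q[best]) best = i;
            return best;
        }

        int main(void) {
            srand(42);
            printf("best config index: %d\n", explore(200, 0.2));
            return 0;
        }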

    Optimising Convolutional Neural Networks Inference on Low-Powered GPUs

    No abstract available

    Big data, modeling, simulation, computational platform and holistic approaches for the fourth industrial revolution

    Traditionally, the mathematical process starts by proving the existence and uniqueness of a solution through theorems, corollaries, lemmas and propositions, dealing with simple, non-complex models. Such proofs govern an infinite space of solutions but are limited in practice to small-scale simulations on a single desktop CPU, where accuracy, consistency and stability are easily controlled thanks to the small data scale. The fourth industrial revolution (4IR), however, can be described as the advent of cyber-physical systems involving entirely new capabilities for researchers and machines (Xing, 2017). From a numerical perspective, 4IR requires a transition from simple models and small-scale simulation to complex models and big data for visualising real-world applications, a demanding but exciting opportunity. Big data analytics and its classification are thus a way to overcome these limitations. Applications of 4IR extend the models, their derivatives and discretisations, the dimensions of space and time, the behaviour of initial and boundary conditions, grid generation, data extraction, numerical methods and high-resolution image processing. In statistics, big data is characterised by data growth; from a numerical perspective, a few classification strategies are investigated together with specific classifier tools. This paper investigates a conceptual framework for big data classification that governs the mathematical modelling, selects a suitable numerical method, handles large sparse simulations and exploits parallel computing on high-performance computing (HPC) platforms. The conceptual framework helps big data providers, algorithm providers and system analysts to classify and recommend specific strategies for generating, handling and analysing big data. All these perspectives take a holistic view of technology, and in the current research the framework is described in holistic terms: 4IR makes it possible to explain holistically the importance of big data, complex modelling, large sparse simulation and HPC platforms. Numerical analysis and parallel performance evaluation serve as the indicators for assessing a classification strategy. This research supports accurate decisions, predictions and trending practice on how to obtain approximate solutions for science and engineering applications. In conclusion, the classification strategies help to generate fine granular meshes and to identify the root causes of failures and issues in real-time solutions, while big data-driven evolution and high-speed technology transfer boost economic and social development in the 4IR (Xing, 2017; Marwala et al., 2017).

    Performance–energy trade-offs of deep learning convolution algorithms on ARM processors

    In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) inference on a series of ARM-based processor architectures. Specifically, we evaluate the NVIDIA Denver2 and Carmel processors, as well as the ARM Cortex-A57 and Cortex-A78AE CPUs as part of a recent set of NVIDIA Jetson platforms. The performance–energy evaluation is carried out using the ResNet-50 v1.5 convolutional neural network (CNN) on varying configurations of convolution algorithms, number of threads/cores, and operating frequencies on the tested processor cores. The results demonstrate that the best throughput is obtained on all platforms with the Winograd convolution operator running on all the cores at their highest frequency. However, if the goal is to reduce the energy footprint, there is no rule of thumb for the optimal configuration.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research was funded by Project PID2020-113656RB-C21/C22 supported by MCIN/AEI/10.13039/501100011033. Manuel F. Dolz was also supported by the Plan Gen–T grant CDEIGENT/2018/014 of the Generalitat Valenciana. Héctor Martínez is a POSTDOC_21_00025 fellow supported by Junta de Andalucía. Adrián Castelló is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. Antonio Maciá is a PRE2021-099284 fellow supported by MCIN/AEI/10.13039/501100011033.
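    The trade-off in the conclusion can be made concrete with a small C sketch: energy is average power multiplied by runtime, so the fastest configuration is not necessarily the most energy-efficient one. All figures below are invented for illustration and are not the paper's measurements.

        #include <stdio.h>

        /* For each (algorithm, cores, frequency) configuration we assume a
         * measured runtime and average power draw; energy = power x time. */
        typedef struct {
            const char *name;
            double time_s;    /* runtime of one inference pass */
            double power_w;   /* average power during the run */
        } Config;

        int main(void) {
            Config cfgs[] = {                                /* invented numbers */
                {"winograd, 8 cores, max freq", 0.10, 12.0},
                {"lowering, 8 cores, max freq", 0.16, 11.0},
                {"winograd, 4 cores, low freq", 0.22,  4.5},
            };
            int n = sizeof cfgs / sizeof cfgs[0];
            for (int i = 0; i < n; i++) {
                double energy_j = cfgs[i].power_w * cfgs[i].time_s;
                printf("%-30s time %.2f s  energy %.2f J\n",
                       cfgs[i].name, cfgs[i].time_s, energy_j);
            }
            /* The fastest run (0.10 s, 1.20 J) uses more total energy than
             * the slower low-frequency run (0.22 s, 0.99 J). */
            return 0;
        }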
