
    Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

    The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. To tackle this diversity, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator adheres to the modern implementations of GEMM advocated by GotoBLAS2, BLIS, OpenBLAS, etc., carefully accounting for the amount of data transferred across the memory hierarchy by the different algorithmic variants of the kernel. Armed with this tool, a small collection of experiments provides the data necessary to calibrate the simulator and deliver highly accurate estimates of the execution time for a given processor architecture. Comment: 12 pages, 2 tables, 6 figures
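
    The blocked GEMM family referenced above follows the well-known GotoBLAS2/BLIS loop nest built around a small micro-kernel. The sketch below (NumPy, with placeholder block sizes MC, NC, KC and a micro-tile MR x NR that are not taken from the paper) illustrates the loop structure whose data movement across the cache hierarchy such a simulator has to account for; it is an illustration of the algorithm, not the simulator itself.

```python
# Minimal sketch of the GotoBLAS2/BLIS-style blocked GEMM loop structure whose
# memory traffic such a simulator models. Block sizes are illustrative only.
import numpy as np

MC, NC, KC = 64, 256, 128   # cache-level block sizes (placeholders)
MR, NR = 8, 4               # micro-tile handled by the micro-kernel (placeholders)

def blocked_gemm(A, B, C):
    """C += A @ B using the five-loop blocked algorithm around a micro-kernel."""
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, NC):                # loop 1: panels of columns of B/C
        for pc in range(0, k, KC):            # loop 2: panel of B (kept in a large cache)
            Bp = B[pc:pc+KC, jc:jc+NC]
            for ic in range(0, m, MC):        # loop 3: block of A (kept in a smaller cache)
                Ap = A[ic:ic+MC, pc:pc+KC]
                for jr in range(0, Bp.shape[1], NR):      # loop 4: micro-panel of B
                    for ir in range(0, Ap.shape[0], MR):  # loop 5: micro-panel of A
                        # micro-kernel: MR x NR rank-KC update (here just NumPy)
                        C[ic+ir:ic+ir+MR, jc+jr:jc+jr+NR] += (
                            Ap[ir:ir+MR, :] @ Bp[:, jr:jr+NR]
                        )

# quick check on random operands
A = np.random.rand(128, 256); B = np.random.rand(256, 192); C = np.zeros((128, 192))
blocked_gemm(A, B, C)
assert np.allclose(C, A @ B)
```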

    Efficient and portable Winograd convolutions for multi-core processors

    We take a step toward developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is enhanced via the introduction of vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE, to exploit the single-instruction multiple-data capabilities of current processors, as well as OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct platforms, equipped with Intel Xeon Skylake, ARM Cortex-A57 and Fujitsu A64FX processors, show that the impact is affordable and still renders a Winograd-based solution that is competitive when compared with the lowering, GEMM-based convolution.
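
    As a reference for the arithmetic being vectorised, the following minimal NumPy sketch applies the standard Winograd F(2x2, 3x3) transforms (the usual Lavin-Gray matrices B, G, A) to a single 4x4 input tile and a 3x3 filter. It illustrates the algorithm only, under the assumption of a single tile in double precision; it is not the paper's SIMD/OpenMP implementation.

```python
# Minimal Winograd F(2x2, 3x3) sketch for a single tile (illustration only;
# the vectorised, multi-threaded implementation is not reproduced here).
import numpy as np

# Standard F(2x2, 3x3) transform matrices
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """2x2 output of the 3x3 correlation of filter g with the 4x4 input tile d."""
    U = G @ g @ G.T            # filter transform (4x4)
    V = BT @ d @ BT.T          # input transform  (4x4)
    M = U * V                  # element-wise product replaces most multiplications
    return AT @ M @ AT.T       # output transform (2x2)

# check against a direct sliding-window computation
d = np.random.rand(4, 4); g = np.random.rand(3, 3)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```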

    Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

    We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). In addition, we fully automate the generation process by also leveraging the Apache TVM framework to derive a complete variety of processor-specific micro-kernels for GEMM. This is in contrast with the convention in high-performance libraries, which hand-encode a single micro-kernel per architecture using assembly code. Overall, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM 1) improves portability and maintainability and, globally, streamlines the software life cycle; 2) provides high flexibility to easily tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par with (or, for specific matrix shapes, even superior to) that of hand-tuned libraries; and 3) features a small memory footprint. Comment: 35 pages, 22 figures. Submitted to ACM TOM
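
    To give a flavour of the kind of specification involved, the sketch below declares GEMM with TVM's classic tensor-expression scheduling API and applies a simple blocked, vectorised, parallel schedule. The tile sizes are placeholders and the schedule is a textbook one, not the generator described in the paper; newer TVM releases may favour the TensorIR path over te.create_schedule.

```python
# Minimal sketch: GEMM declared with Apache TVM's tensor-expression API plus a
# simple blocked/vectorised schedule (placeholder sizes, not the paper's generator).
import numpy as np
import tvm
from tvm import te

M = N = K = 256
A = te.placeholder((M, K), name="A", dtype="float32")
B = te.placeholder((K, N), name="B", dtype="float32")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = C.op.axis
io, jo, ii, ji = s[C].tile(i, j, x_factor=32, y_factor=32)   # cache blocking
ko, ki = s[C].split(k, factor=4)
s[C].reorder(io, jo, ko, ii, ki, ji)
s[C].vectorize(ji)                                           # SIMD over inner columns
s[C].parallel(io)                                            # threads over row blocks

gemm = tvm.build(s, [A, B, C], target="llvm", name="gemm")

# run and validate against NumPy
dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(M, K).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(K, N).astype("float32"), dev)
c = tvm.nd.array(np.zeros((M, N), dtype="float32"), dev)
gemm(a, b, c)
np.testing.assert_allclose(c.numpy(), a.numpy() @ b.numpy(), rtol=1e-4)
```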

    Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

    © ACM, 2024. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Trans. Math. Softw. 50(1), Article 6 (March 2024), 34 pages, https://doi.org/10.1145/3638532. This work was supported by the research projects PID2020-113656RB-C22 (MCIN/AEI/10.13039/501100011033) and PID2021-126576NB-I00, and by CM via the Multiannual Agreement with Complutense University in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, under projects PR65/19-22445 and CM S2018/TCS-4423. A. Castelló is an FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. H. Martínez is a postdoctoral fellow supported by the Consejería de Transformación Económica, Industria, Conocimiento y Universidades de la Junta de Andalucía. This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558 (eFlows4HPC project). The JU receives support from the European Union's Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, and Norway. Alaejos-López, G.; Castelló, A.; Alonso-Jordá, P.; Igual, FD.; Martínez, H.; Quintana-Ortí, ES. (2024). Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM. ACM Transactions on Mathematical Software. 50(1). https://doi.org/10.1145/3638532

    The Sound Emission Board of the KM3NeT Acoustic Positioning System

    We describe the sound emission board proposed for installation in the acoustic positioning system of the future KM3NeT underwater neutrino telescope. The KM3NeT European consortium aims to build a multi-cubic-kilometre underwater neutrino telescope in the deep Mediterranean Sea. In this kind of telescope, the mechanical structures holding the optical sensors, which detect the Cherenkov radiation produced by muons emanating from neutrino interactions, are not completely rigid and can move by up to dozens of meters in undersea currents. Knowledge of the position of the optical sensors to an accuracy of about 10 cm is needed for adequate muon track reconstruction. A positioning system based on the acoustic triangulation of sound transit-time differences between fixed seabed emitters and receiving hydrophones attached to the kilometre-scale vertical flexible structures carrying the optical sensors is being developed. In this paper, we describe the sound emission board developed in the framework of the KM3NeT project, which is fully adapted to the chosen FFR SX30 ultrasonic transducer and fulfils the requirements imposed by the collaboration in terms of cost, high reliability, low power consumption, high acoustic emission power for short signals, low intrinsic noise, and the capacity to use arbitrary signals in emission mode. Comment: 9 pages, 4 figures
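
    To illustrate the positioning principle only (not the collaboration's reconstruction software), the sketch below recovers a hydrophone position by non-linear least squares from its acoustic transit times to a few fixed seabed emitters. It simplifies the scheme by assuming known emission times (absolute transit times rather than transit-time differences), a constant sound speed, and made-up emitter and hydrophone coordinates.

```python
# Toy acoustic-positioning illustration (made-up numbers; not KM3NeT code):
# recover a hydrophone position from sound transit times to fixed seabed emitters.
import numpy as np
from scipy.optimize import least_squares

C_SOUND = 1500.0  # nominal speed of sound in sea water, m/s (assumed constant)

# hypothetical seabed emitter positions (x, y, z) in metres
emitters = np.array([[  0.0,   0.0, 0.0],
                     [200.0,   0.0, 0.0],
                     [  0.0, 200.0, 0.0],
                     [200.0, 200.0, 5.0]])

true_pos = np.array([90.0, 120.0, 150.0])   # hydrophone on a flexible vertical line
transit_times = np.linalg.norm(emitters - true_pos, axis=1) / C_SOUND  # simulated data

def residuals(p):
    """Predicted minus measured transit times for a candidate position p."""
    return np.linalg.norm(emitters - p, axis=1) / C_SOUND - transit_times

estimate = least_squares(residuals, x0=np.array([100.0, 100.0, 100.0])).x
print(estimate)   # ~ [90, 120, 150] for this noise-free toy example
```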