Performance Analysis of Matrix Multiplication for Deep Learning on the Edge
The devices designed for the Internet-of-Things encompass a large variety of
distinct processor architectures, forming a highly heterogeneous zoo. In order
to tackle this, we employ a simulator to estimate the performance of the
matrix-matrix multiplication (GEMM) kernel on processors designed to operate at
the edge. Our simulator adheres to the modern implementations of GEMM,
advocated by GotoBLAS2, BLIS, OpenBLAS, etc., to carefully account for the
amount of data transfers across the memory hierarchy of different algorithmic
variants of the kernel. A small collection of
experiments provides the necessary data to calibrate the simulator and deliver
highly accurate estimations of the execution time for a given processor
architecture.
Comment: 12 pages, 2 tables, 6 figures
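As a rough illustration of the blocked GEMM structure that GotoBLAS2, BLIS and OpenBLAS share, the sketch below tiles the three loops of the matrix product so that each block can stay resident in a level of the memory hierarchy. It is not the simulator described in the abstract. The function name `blocked_gemm` and the block-size parameters `mc`, `nc`, `kc` are assumptions chosen for this example, and real libraries additionally pack the blocks into contiguous buffers and call a hand-tuned micro-kernel.

```python
def blocked_gemm(A, B, C, mc=2, nc=2, kc=2):
    """Compute C += A * B on lists of lists, using three levels of blocking.

    mc, nc, kc are illustrative block sizes (assumed names, not from the paper).
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):              # block of columns of B and C
        for pc in range(0, k, kc):          # block of the shared dimension
            for ic in range(0, m, mc):      # block of rows of A and C
                # Plain triple loop standing in for the micro-kernel.
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        acc = 0
                        for p in range(pc, min(pc + kc, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = blocked_gemm(A, B, [[0, 0], [0, 0]])
print(C)  # [[58, 64], [139, 154]]
```

Changing `mc`, `nc` and `kc` does not change the result, only the order in which blocks of the operands are touched, which is what determines the data traffic between memory levels.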
Efficient and portable Winograd convolutions for multi-core processors
We take a step towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is increased by using vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE to exploit the single-instruction multiple-data capabilities of current processors, together with OpenMP pragmas to exploit multi-threaded parallelism. While this sacrifices a fraction of the computational performance, our experimental results on three distinct processors (Intel Xeon Skylake, ARM Cortex A57 and Fujitsu A64FX) show that the impact is affordable and still yields a Winograd-based solution that is competitive with the lowering GEMM-based convolution.
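To make the idea behind Winograd convolution concrete, the sketch below implements the smallest textbook case, F(2,3), which produces two outputs of a 3-tap one-dimensional filter with four multiplications instead of six. The abstract does not say which Winograd variants the authors use, so this choice, and the function names `winograd_f23` and `direct_f23`, are assumptions made purely for illustration. The code is scalar Python, with no vector instructions or OpenMP.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over four inputs d, via Winograd F(2,3)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (can be precomputed once per filter).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2.0
    u2 = (g0 - g1 + g2) / 2.0
    u3 = g2
    # Input transform followed by the four element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform.
    return (m1 + m2 + m3, m2 - m3 - m4)


def direct_f23(d, g):
    """Reference result: the same two outputs computed with six multiplications."""
    return (d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2])


d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -2.0]
print(winograd_f23(d, g), direct_f23(d, g))
```

The two-dimensional variants used for convolution layers apply the same three transforms along both spatial dimensions.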
Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM
We explore the utilization of the Apache TVM open source framework to
automatically generate a family of algorithms that follow the approach taken by
popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in
order to obtain high-performance blocked formulations of the general matrix
multiplication (GEMM). In addition, we fully automate the generation
process, by also leveraging the Apache TVM framework to derive a complete
variety of the processor-specific micro-kernels for GEMM. This is in contrast
with the convention in high performance libraries, which hand-encode a single
micro-kernel per architecture using assembly code. Overall, the combination
of our TVM-generated blocked algorithms and micro-kernels for GEMM 1) improves
portability and maintainability, and streamlines the software life
cycle; 2) provides high flexibility to easily tailor and optimize the solution
to different data types, processor architectures, and matrix operand shapes,
yielding performance on a par (or even superior for specific matrix shapes)
with that of hand-tuned libraries; and 3) features a small memory footprint.
Comment: 35 pages, 22 figures. Submitted to ACM TOM
Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM
© ACM, 2024. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Trans. Math. Softw. 50, 1, Article 6 (March 2024), 34 pages. https://doi.org/10.1145/3638532
This work was supported by the research projects PID2020-113656RB-C22 (MCIN/AEI/10.13039/501100011033) and PID2021-126576NB-I00; and CM via Multiannual Agreement with Complutense University in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT under projects PR65/19-22445 and CM S2018/TCS-4423. A. Castello is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. H. Martinez is a postdoctoral fellow supported by the Consejeria de Transformacion Economica, Industria, Conocimiento y Universidades de la Junta de Andalucia. This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558 (eFlows4HPC project). The JU receives support from the European Union's Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, and Norway.
Alaejos-López, G.; Castelló, A.; Alonso-Jordá, P.; Igual, FD.; Martínez, H.; Quintana-Ortí, ES. (2024). Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM. ACM Transactions on Mathematical Software. 50(1). https://doi.org/10.1145/3638532
The Sound Emission Board of the KM3NeT Acoustic Positioning System
We describe the sound emission board proposed for installation in the
acoustic positioning system of the future KM3NeT underwater neutrino telescope.
The KM3NeT European consortium aims to build a multi-cubic kilometre underwater
neutrino telescope in the deep Mediterranean Sea. In this kind of telescope the
mechanical structures holding the optical sensors, which detect the Cherenkov
radiation produced by muons emanating from neutrino interactions, are not
completely rigid and can move up to dozens of meters in undersea currents.
Knowledge of the position of the optical sensors to an accuracy of about 10 cm
is needed for adequate muon track reconstruction. A positioning system based on
the acoustic triangulation of sound transit time differences between fixed
seabed emitters and receiving hydrophones attached to the kilometre-scale
vertical flexible structures carrying the optical sensors is being developed.
In this paper, we describe the sound emission board developed in the framework
of the KM3NeT project, which is fully adapted to the chosen FFR SX30 ultrasonic
transducer and fulfils the requirements imposed by the collaboration in terms
of cost, high reliability, low power consumption, high acoustic emission power
for short signals, low intrinsic noise and capacity to use arbitrary signals in
emission mode.
Comment: 9 pages, 4 figures
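The positioning principle mentioned in this abstract can be sketched as follows: each transit time is converted to a range using the speed of sound, and the hydrophone position is recovered from the ranges to several emitters at known positions. This is a simplified illustration, not the KM3NeT procedure. It assumes absolute transit times rather than time differences, a constant nominal sound speed of 1500 m/s, exactly four emitters that are not coplanar, and noise-free measurements. The names `locate`, `det3` and `SOUND_SPEED`, and all coordinates, are hypothetical.

```python
import math

SOUND_SPEED = 1500.0  # m/s, nominal sea-water value (assumption, not from the paper)


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def locate(emitters, transit_times, c=SOUND_SPEED):
    """Estimate a receiver position from four emitter positions and transit times."""
    ranges = [c * t for t in transit_times]
    p0, r0 = emitters[0], ranges[0]
    n0 = sum(v * v for v in p0)
    # Subtracting the first range equation from the others removes the
    # quadratic term and leaves a 3x3 linear system A x = b.
    A, b = [], []
    for p, r in zip(emitters[1:4], ranges[1:4]):
        A.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        b.append(sum(v * v for v in p) - n0 - r * r + r0 * r0)
    d = det3(A)
    if abs(d) < 1e-9:
        raise ValueError("emitters are coplanar or otherwise degenerate")
    # Solve with Cramer's rule.
    position = []
    for col in range(3):
        M = [row[:] for row in A]
        for i in range(3):
            M[i][col] = b[i]
        position.append(det3(M) / d)
    return tuple(position)


# Hypothetical geometry, coordinates in metres.
emitters = [(0.0, 0.0, 0.0), (1000.0, 0.0, 0.0), (0.0, 1000.0, 0.0), (0.0, 0.0, 100.0)]
true_position = (300.0, 400.0, 500.0)
times = [math.dist(e, true_position) / SOUND_SPEED for e in emitters]
print(locate(emitters, times))
```

With noisy measurements or more than four emitters, a least-squares fit would replace the exact linear solve, and emitters lying in nearly the same plane would make the vertical coordinate poorly determined.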