Theory and practice of classical matrix-matrix multiplication for hierarchical memory architectures
Matrix-matrix multiplication is perhaps the most important operation used as a
basic building block in dense linear algebra. A computer with a hierarchical memory architecture has memory organized in layers, with small, fast memories close to the processor and large, slow memories further away from it.
Classical matrix-matrix multiplication is particularly well suited to such architectures, as it exhibits a high degree of data reuse, so expensive data movements can be amortized over a large amount of computation. This dissertation advances the theory of how to optimally reuse data during matrix-matrix multiplication on hierarchical memory architectures, and it uses this understanding to develop new practical matrix-matrix multiplication algorithms with improved data-movement properties.
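The data-reuse argument above can be illustrated with a toy blocked multiplication: once a b x b block of each operand is in fast memory, it supports O(b) arithmetic per element moved. This is a minimal sketch for intuition only; the function name and blocking scheme are illustrative, not the dissertation's algorithms.

```python
# Minimal sketch of classical blocked matrix-matrix multiplication.
# Each b x b block of C is updated from b x b blocks of A and B, so data
# loaded into fast memory is reused across many arithmetic operations.
def blocked_matmul(A, B, b):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # block update: C[i0:i0+b, j0:j0+b] += A-block * B-block
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        aik = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

In practice the block size is chosen so that the working set of three blocks fits in a given cache level, which is exactly the tuning question the dissertation studies.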
Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM
We explore the use of the Apache TVM open-source framework to
automatically generate a family of algorithms that follow the approach taken by
popular linear algebra libraries, such as GotoBLAS2, BLIS, and OpenBLAS, to
obtain high-performance blocked formulations of the general matrix
multiplication (GEMM). In addition, we fully automate the generation
process by also leveraging the Apache TVM framework to derive a complete
variety of processor-specific micro-kernels for GEMM. This is in contrast
with the convention in high-performance libraries, which hand-encode a single
micro-kernel per architecture in assembly code. Overall, the combination
of our TVM-generated blocked algorithms and micro-kernels for GEMM 1) improves
portability and maintainability and streamlines the software life
cycle; 2) provides high flexibility to easily tailor and optimize the solution
to different data types, processor architectures, and matrix operand shapes,
yielding performance on a par with that of hand-tuned libraries (or even
superior for specific matrix shapes); and 3) features a small memory footprint.
Comment: 35 pages, 22 figures. Submitted to ACM TOM
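As a rough illustration of the blocked formulation that GotoBLAS2/BLIS-style libraries (and the TVM-generated code) follow, here is a schematic loop nest around a micro-kernel. All names and the tiny block sizes (`mc`, `kc`, `nc`, `mr`, `nr`) are illustrative placeholders, not TVM output; real libraries tune them per architecture and add data packing.

```python
# Schematic sketch (not performance code) of a BLIS/GotoBLAS2-like GEMM:
# cache-blocking loops around a small "micro-kernel" that updates an
# mr x nr tile of C with a rank-kc contribution.
def micro_kernel(C, A, B, i0, j0, k0, mr, nr, kc, n):
    for i in range(i0, min(i0 + mr, n)):
        for j in range(j0, min(j0 + nr, n)):
            acc = 0.0
            for k in range(k0, min(k0 + kc, n)):
                acc += A[i][k] * B[k][j]
            C[i][j] += acc

def blis_like_gemm(A, B, mc=4, kc=4, nc=4, mr=2, nr=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for jc in range(0, n, nc):              # partition columns of B/C
        for pc in range(0, n, kc):          # partition the reduction dim
            for ic in range(0, n, mc):      # partition rows of A/C
                for jr in range(jc, min(jc + nc, n), nr):
                    for ir in range(ic, min(ic + mc, n), mr):
                        micro_kernel(C, A, B, ir, jr, pc, mr, nr, kc, n)
    return C
```

The micro-kernel is the only part that libraries traditionally hand-write in assembly; the abstract's point is that TVM can generate this innermost piece as well as the surrounding blocking.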
Practical fast matrix multiplication algorithms
Matrix multiplication is a core building block for numerous scientific computing and, more recently, machine learning applications. Strassen's algorithm, the original Fast Matrix Multiplication (FMM) algorithm, has long fascinated computer scientists due to its startling property of reducing the number of computations required for multiplying n x n matrices from O(n³) to O(n^2.807). Over the last half century, this has fueled many theoretical improvements, such as other variations of Strassen-like FMM algorithms. Previous implementations of these FMM algorithms led to the "street wisdom" that they are only practical for large, relatively square matrices, that they require considerable workspace, and that thread-level parallelism is difficult to achieve with them. The thesis of this work dispels these notions by demonstrating significant benefits for small and non-square matrices, requiring no workspace beyond what is already incorporated in high-performance implementations of matrix multiplication, and achieving performance benefits on multi-core, many-core, and distributed memory architectures.
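Strassen's recursion mentioned above can be sketched as follows. This is the textbook formulation for power-of-two sizes (seven recursive products M1..M7 instead of eight, giving the O(n^2.807) bound), not the thesis's high-performance implementation.

```python
# Textbook sketch of Strassen's algorithm for 2^k x 2^k matrices.
def strassen(A, B, cutoff=2):
    n = len(A)
    if n <= cutoff:  # fall back to the classical product for small blocks
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    def quad(M, r, c):
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A,0,0), quad(A,0,h), quad(A,h,0), quad(A,h,h)
    B11, B12, B21, B22 = quad(B,0,0), quad(B,0,h), quad(B,h,0), quad(B,h,h)
    # seven recursive products instead of eight
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The naive splitting above copies submatrices at every level, which is exactly the workspace cost the thesis shows can be avoided by fusing these additions into the packing steps of a high-performance GEMM.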
Optimización del producto matricial sobre dispositivos de bajo consumo para inferencia en Deep Learning [Optimization of the matrix product on low-power devices for inference in Deep Learning]
[EN] The use of machine learning in deep neural networks has experienced a boom in the last decade, mainly due to a combination of several factors, including the abundance of data to train such systems (big data), increased computing power (NVIDIA graphics processors, Google TPUs, etc.), advances in algorithmic learning techniques (transformer neural networks for language processing), and the availability of user-friendly environments for the task.
There are currently different software packages for training deep neural networks on computer clusters (TensorFlow and PyTorch) and even the same packages have specialized versions (TensorFlow Lite, NVIDIA RT, QNNPACK, etc.) to perform the inference process on low-power processors, such as those that can be found in an Android or iOS mobile phone or in a driverless car.
Many of these systems deal with convolutional neural networks, especially those that process images. At a lower level of detail, we can observe that training and inference in the convolutional layers of these networks reduce to a matrix product with particular, well-defined characteristics that require special treatment when it comes to optimization. This master's thesis deals with the optimization of this operation, in particular on the ARM architecture, whose multicore processors can be found in most of the low-power devices on which inference with a previously trained network is intended to run. The proposed optimization is inspired by BLIS, a package of optimized numerical linear algebra routines, from which the basic algorithms on which the work builds are obtained. The project will allow the student to acquire a good knowledge of the computational aspects of inference with deep neural networks, as well as to deepen their understanding of the interaction between the algorithm and the processor architecture and how this interaction determines performance.
Stabile, EB. (2021). Optimización del producto matricial sobre dispositivos de bajo consumo para inferencia en Deep Learning. Universitat Politècnica de València. http://hdl.handle.net/10251/172885
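One common way convolutional layers reduce to a matrix product, as described above, is the im2col lowering. The following is a hedged sketch under simplifying assumptions (single channel, stride 1, no padding); the function names are illustrative and this is not necessarily the exact lowering used in the thesis.

```python
# Sketch of the im2col lowering that turns a convolution into a GEMM:
# each sliding window of the input becomes one column of a matrix, and
# the convolution is a (1 x kh*kw) by (kh*kw x P) matrix product.
def im2col(image, kh, kw):
    H, W = len(image), len(image[0])
    patches = [[image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
               for i in range(H - kh + 1) for j in range(W - kw + 1)]
    # transpose so each patch is a column: shape (kh*kw) x P
    return [list(col) for col in zip(*patches)]

def conv_as_gemm(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    M = im2col(image, kh, kw)  # (kh*kw) x P
    # kernel as a row vector times the im2col matrix = flattened output
    return [sum(flat_k[r] * M[r][p] for r in range(len(flat_k)))
            for p in range(len(M[0]))]
```

The resulting GEMM operands are typically short and wide (or tall and thin), which is the "particular, well-defined" shape the abstract says requires special treatment when tuning BLIS-style blocked algorithms.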