
    DGEMM on Integer Matrix Multiplication Unit

    Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point computation is commonplace, with the input and output values and the model parameters quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMUs). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication using lower-precision computing units, and show the advantages and disadvantages of using IMMUs. Experiments using integer Tensor Cores show that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum circuit simulation by up to 4.33x while maintaining FP64 accuracy.
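    The core idea of the Ozaki scheme is to split the input matrices into slices whose pairwise products can be evaluated exactly by a low-precision unit and then accumulated back at high precision. The NumPy sketch below illustrates that splitting idea with integer slices and per-row/per-column power-of-two scales; it is a rough stand-in for the paper's integer Tensor Core implementation, and the slice width `bits` and the number of slices are illustrative choices rather than the authors' parameters.

```python
import numpy as np

def split_to_int_slices(M, num_splits=6, bits=10, axis=1):
    # Split M into integer slices with per-row (axis=1) or per-column (axis=0)
    # power-of-two scales, so each slice carries roughly `bits` mantissa bits.
    slices, scales = [], []
    R = M.astype(np.float64).copy()
    for _ in range(num_splits):
        amax = np.max(np.abs(R), axis=axis, keepdims=True)
        amax[amax == 0] = 1.0                       # avoid log2(0) on all-zero rows/columns
        scale = 2.0 ** (bits - 1 - np.floor(np.log2(amax)))
        S = np.round(R * scale)                     # |S| <= 2**bits, far below the int64 limit
        slices.append(S.astype(np.int64))
        scales.append(scale)
        R = R - S / scale                           # the remainder keeps the lower mantissa bits
    return slices, scales

def intmul_fp64(A, B, num_splits=6, bits=10):
    # Accumulate the exact integer products of all slice pairs, then rescale in FP64.
    A_sl, A_sc = split_to_int_slices(A, num_splits, bits, axis=1)   # row-wise scales
    B_sl, B_sc = split_to_int_slices(B, num_splits, bits, axis=0)   # column-wise scales
    C = np.zeros((A.shape[0], B.shape[1]))
    for Ai, sa in zip(A_sl, A_sc):
        for Bj, sb in zip(B_sl, B_sc):
            C += (Ai @ Bj) / (sa * sb)              # int64 matrix product, exact for these ranges
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((128, 96)), rng.standard_normal((96, 64))
print(np.max(np.abs(intmul_fp64(A, B) - A @ B)))    # small residual vs. plain FP64 matmul
```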

    Traitement STAP en environnement hétérogène. Application à la détection radar et implémentation sur GPU (STAP processing in a heterogeneous environment: application to radar detection and GPU implementation)

    Space-time adaptive processing (STAP) jointly exploits the spatial and temporal dimensions of the signals received by an antenna array, whereas conventional array processing uses only the spatial dimension for filtering. STAP is particularly effective at suppressing the ground echoes received by an airborne radar, for which there is a direct relation between the angle of arrival and the Doppler frequency. However, although the principles of STAP are now well established, its practical use in real environments still runs into unresolved difficulties in the operational radar context. The first difficulty, addressed in the first phase of the thesis, is theoretical: defining procedures for estimating the clutter covariance matrix from a selection of representative training data, in a context of non-homogeneous clutter and a sometimes high density of targets of interest. The second difficulty is technological and lies in the physical implementation of the algorithms, given their large computational load. This point, crucial for airborne operation, is explored in the second phase of the thesis through an analysis of the feasibility of a GPU implementation of the most demanding stages of a STAP processing chain.
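    As background for the covariance-estimation step described above, the sketch below shows the classical sample-matrix-inversion (SMI) formulation of STAP in NumPy: estimate the clutter covariance from selected training snapshots and form the adaptive space-time weights. The crude power-based cell selection and the diagonal loading are simplistic stand-ins for the thesis's selection procedures, included only to make the example self-contained.

```python
import numpy as np

def select_training_cells(snapshots, max_dev_db=3.0):
    # Crude power-based selection: keep range cells whose power stays within
    # max_dev_db of the median power, as a stand-in for more elaborate rules
    # that reject heterogeneous clutter and cells contaminated by targets.
    power = np.sum(np.abs(snapshots) ** 2, axis=1)
    deviation_db = 10.0 * np.log10(power / np.median(power))
    return snapshots[np.abs(deviation_db) <= max_dev_db]

def smi_stap_weights(training_snapshots, steering, loading=1e-2):
    # Sample Matrix Inversion: estimate R from training data, w = R^-1 s / (s^H R^-1 s).
    X = select_training_cells(training_snapshots)            # shape (K, N*M) space-time snapshots
    K, NM = X.shape
    R = X.conj().T @ X / K                                    # sample clutter covariance estimate
    R += loading * np.real(np.trace(R)) / NM * np.eye(NM)     # diagonal loading for robustness
    Rinv_s = np.linalg.solve(R, steering)
    return Rinv_s / (steering.conj() @ Rinv_s)                # unit gain in the steered direction

# Illustrative use with synthetic complex Gaussian snapshots (N antennas, M pulses).
rng = np.random.default_rng(0)
N, M, K = 4, 8, 200
X = (rng.standard_normal((K, N * M)) + 1j * rng.standard_normal((K, N * M))) / np.sqrt(2)
s = np.ones(N * M, dtype=complex)                             # toy space-time steering vector
w = smi_stap_weights(X, s)
```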

    Solução numérica massivamente paralela de problemas potenciais utilizando o método dos elementos de contorno (Massively parallel numerical solution of potential problems using the boundary element method)

    Master's dissertation, Universidade de Brasília, Faculdade Gama/Faculdade de Tecnologia, Programa de Mestrado em Integridade de Materiais da Engenharia, 2013. A potential problem is a physical problem whose solution satisfies Laplace's equation, a second-order partial differential equation. The widespread occurrence of potential problems in nature has motivated a research area dedicated to their study. For problems in several dimensions an analytical treatment may be impractical, so numerical modeling is commonly used to obtain solutions. Several numerical methods can solve Laplace's equation, among them the Finite Element Method (FEM), the Finite Volume Method (FVM), the Finite Difference Method (FDM) and the Boundary Element Method (BEM). Of these, the BEM is the most recent and is currently widely used for large-scale and semi-infinite domain problems. The BEM uses a mathematical formulation based on Green's theorem to reduce the dimension of the problem by one, which yields a computational gain at the cost of greater mathematical effort. A side effect is that the resulting matrices are full and non-symmetric, so the computational cost remains high. Computer users' demand for higher graphics quality has pushed manufacturers to develop new technologies, such as graphics cards with dedicated processors and memory. This kind of hardware drew the attention of the scientific community because it is parallel by nature and can reach teraflop-scale single-precision and gigaflop-scale double-precision performance. This work implements a parallel library, adapted to the architecture of graphics cards and written in the OpenCL programming language, for solving the linear systems obtained from the BEM discretization of potential problems, in order to evaluate the viability of a hybrid environment containing one or more Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The implementation was validated on the problem of potential flow around an impenetrable circular cylinder, and several algorithm optimization techniques were evaluated to provide a knowledge base for future work using GPUs. The results show that a trivially parallelized implementation of the Jacobi iterative method for linear systems, similar to the N-body problem, does not deliver enough performance to justify massively parallel computing, mainly because of the high number of memory accesses. However, with the optimization techniques presented, a speedup of up to 5.5x over the serial algorithm is obtained. The work also points out limitations in the memory allocation provided by AMD ATI's OpenCL implementation.
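    Because the reported results hinge on the Jacobi iteration, a minimal textbook version is sketched below in NumPy (not the dissertation's OpenCL code). Each component of the new iterate is computed independently of the others, which is what makes the method map naturally onto one GPU work-item per unknown; on the dense, non-symmetric matrices produced by the BEM, convergence is only guaranteed under conditions such as diagonal dominance.

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=20000):
    # Jacobi iteration: x_{k+1} = D^{-1} (b - (A - D) x_k).
    # Every component of x_{k+1} depends only on the previous iterate,
    # so each one can be updated by an independent GPU work-item.
    D = np.diag(A).copy()
    R = A - np.diag(D)
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_new = (b - R @ x) / D
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Illustrative use on a diagonally dominant system, where convergence is guaranteed.
n = 500
rng = np.random.default_rng(1)
A = rng.random((n, n)) + n * np.eye(n)
b = rng.random(n)
x = jacobi(A, b)
print(np.max(np.abs(A @ x - b)))
```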

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work surveys the neural network accelerator optimization approaches used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, giving chip designers an overview of the design choices for implementing efficient low-power neural network accelerators.
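    As one concrete example of the kind of optimization such a construction kit covers, the sketch below applies simple post-training 8-bit weight quantization in NumPy and reports the resulting memory reduction; the symmetric per-tensor scheme and the 4x saving it yields for FP32 weights are generic illustrations, not figures taken from the surveyed accelerators.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor post-training quantization: w ~ scale * q with q in int8.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)            # example FP32 weight tensor
q, scale = quantize_int8(w)
print("memory reduction: %.1fx" % (w.nbytes / q.nbytes))    # 4.0x for FP32 -> int8
print("max abs error:    %.4f" % np.max(np.abs(w - scale * q.astype(np.float32))))
```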