MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications
Emerging deep learning workloads urgently need fast general matrix
multiplication (GEMM). To meet such demand, one of the critical features of
machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix
Cores, and Google TPUs is support for mixed-precision GEMM. For DNN
models, lower-precision floating-point formats and computation offer acceptable
accuracy with significant improvements in performance, area, and memory
footprint. While promising, the error resilience of mixed-precision computation
remains unexplored. To this end, we develop a fault injection framework that
systematically injects faults into the mixed-precision
computation results. We investigate how the faults affect the accuracy of
machine learning applications. Based on the error resilience characteristics,
we offer lightweight error detection and correction solutions that
significantly improve the overall model accuracy if the models experience
hardware faults. The solutions can be efficiently integrated into the
accelerator's pipelines.
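The injection mechanism can be illustrated with a minimal sketch (not the MPGemmFI tool itself): flip a single bit in one FP16 element of a GEMM output, with FP32 accumulation standing in for the mixed-precision hardware path. Function names and the injection site are illustrative.

```python
import numpy as np

def flip_bit_fp16(value, bit):
    """Flip one bit of an FP16 value by reinterpreting its 16-bit pattern."""
    raw = np.float16(value).view(np.uint16)
    flipped = raw ^ np.uint16(1 << bit)
    return flipped.view(np.float16)

def inject_fault(c, row, col, bit):
    """Return a copy of the GEMM output C with one bit flipped at (row, col)."""
    faulty = c.copy()
    faulty[row, col] = flip_bit_fp16(faulty[row, col], bit)
    return faulty

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float16)
b = rng.standard_normal((8, 4)).astype(np.float16)
# FP16 inputs, FP32 accumulation, FP16 output: a toy stand-in for tensor-core GEMM.
c = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)
c_faulty = inject_fault(c, 1, 2, 14)  # flip a high exponent bit of one element
```

A study like the paper's would sweep the injection site and bit position, then measure the downstream effect on model accuracy.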
Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs
Abstract: The Deep Learning (DL) community found in pruning techniques a good way to reduce the models' resource and energy consumption. These techniques lead to smaller sparse models, but sparse computations on GPUs only outperform their dense counterparts at extremely high levels of sparsity. However, pruning to such sparsity levels can seriously harm the accuracy of the Neural Networks (NNs). To alleviate this, novel performance-aware pruning techniques favor the generation of more regular sparse matrices that can improve the exploitation of the underlying hardware. Nevertheless, an important drawback is that these techniques heavily condition the location of the non-pruned values, which can strongly degrade the accuracy of the models. This paper focuses on improving the performance of the SpMM routine on DL workloads by combining performance-aware pruning with pruning-independent SpMM kernels to relax input-format constraints. We start with a microarchitecture-level performance study of SOTA SpMM implementations to identify their main bottlenecks and flaws. Then, the paper centers on maximizing the performance of the routine by adjusting the parameters of performance-aware pruning techniques to the hardware properties. This second study explains the intrinsic causes of the observed performance results. We show that, following this approach, a generic SpMM routine can perform up to 49% and 77% better for half and single precision, respectively, than with non-performance-aware pruning, providing speedups over cuBLAS of up to 1.87× and 4.20×, respectively. Additionally, the performance achieved on half precision is boosted with a new Ampere-ready specialized implementation for the column-vector sparse format, CLASP, which achieves a 2.42× speedup over cuBLAS. Finally, we also introduce ad-colPrune, a novel pruning technique that widens the design space of possible trade-offs between performance and accuracy.
© 2022 Association for Computing Machinery. This research was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00, AEI, 10.13039/501100011033), the Ministry of Education (predoctoral grant of Roberto L. Castro, FPU19/03974), and by Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (ED431C 2021/30). We also acknowledge the support from CITIC, funded by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ED431G 2019/01). Finally, we acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers.
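The idea of shaping sparsity to the hardware can be sketched with a toy pass inspired by (but not identical to) the paper's column-vector format: within each column, rows are grouped into fixed-length vertical vectors, and only the highest-magnitude vectors survive, yielding the regular pattern a column-vector SpMM kernel can exploit. Function names and parameters are hypothetical.

```python
import numpy as np

def prune_column_vectors(w, vec_len, keep_ratio):
    """Toy hardware-aware pruning: per column, keep the top-magnitude
    vertical vectors of length vec_len and zero the rest."""
    m, n = w.shape
    assert m % vec_len == 0
    blocks = w.reshape(m // vec_len, vec_len, n)   # (num_vectors, vec_len, n)
    scores = np.abs(blocks).sum(axis=1)            # magnitude per vector, per column
    k = max(1, int(keep_ratio * blocks.shape[0]))
    keep = np.argsort(-scores, axis=0)[:k]         # top-k vector indices per column
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=0)
    return (blocks * mask[:, None, :]).reshape(m, n)

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 6))
x = rng.standard_normal((6, 3))
w_sparse = prune_column_vectors(w, vec_len=4, keep_ratio=0.5)
y = w_sparse @ x  # SpMM computed densely here; a real kernel exploits the regularity
```

The regular placement of nonzeros is exactly what trades accuracy for performance: a pruning-independent kernel, as the paper argues, relaxes that constraint.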
Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core
In atomistic spin dynamics simulations, the time cost of constructing the
space- and time-displaced pair correlation function in real space increases
quadratically with the number of spins, leading to significant computational
effort. The GEMM subroutine can be adopted to accelerate the calculation of the
dynamical spin-spin correlation function, but the computational cost of
simulating large spin systems on CPUs remains expensive. In
this work, we perform the simulation on the graphics processing unit (GPU), a
hardware solution widely used as an accelerator for scientific computing and
deep learning. We show that GPUs can accelerate the simulation up to 25-fold
compared to multi-core CPUs when using the GEMM subroutine on both. To hide
memory latency, we fuse the element-wise operation into the GEMM kernel, which
can improve the performance by 26%–33% compared to the unfused implementation.
Furthermore, we perform the on-the-fly calculation in the epilogue of the GEMM
subroutine to avoid saving intermediate results to global memory, which makes
large-scale atomistic spin dynamics simulation feasible and affordable.
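The epilogue-fusion idea can be modeled in a few lines (a conceptual sketch, not the CUDA kernel): instead of materializing the full GEMM result and making a second element-wise pass over memory, the element-wise operation is applied tile by tile as each output tile is produced. The squaring epilogue here is a stand-in for whatever the correlation calculation actually applies.

```python
import numpy as np

def gemm_unfused(a, b, epilogue):
    c = a @ b           # full intermediate result written out
    return epilogue(c)  # second pass re-reads it from memory

def gemm_fused_epilogue(a, b, epilogue, tile=2):
    """Toy model of epilogue fusion: the element-wise op runs on each
    output tile right after it is computed, so the raw GEMM result is
    never stored as a whole."""
    m, n = a.shape[0], b.shape[1]
    out = np.empty((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            out[i:i+tile, j:j+tile] = epilogue(a[i:i+tile] @ b[:, j:j+tile])
    return out

square = lambda c: c * c  # hypothetical element-wise epilogue
rng = np.random.default_rng(3)
a = rng.standard_normal((4, 6))
b = rng.standard_normal((6, 4))
```

On a GPU the payoff is bandwidth: the fused version skips one round trip of the intermediate matrix through global memory.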
Exascale Deep Learning to Accelerate Cancer Research
Deep learning, through the use of neural networks, has demonstrated
remarkable ability to automate many routine tasks when presented with
sufficient data for training. The neural network architecture (e.g. number of
layers, types of layers, connections between layers, etc.) plays a critical
role in determining what, if anything, the neural network is able to learn from
the training data. The trend for neural network architectures, especially those
trained on ImageNet, has been to grow ever deeper and more complex. The result
has been ever-increasing accuracy on benchmark datasets at the cost of
increased computational demands. In this paper we demonstrate that neural
network architectures can be automatically generated, tailored for a specific
application, with dual objectives: accuracy of prediction and speed of
prediction. Using MENNDL--an HPC-enabled software stack for neural architecture
search--we generate a neural network with comparable accuracy to
state-of-the-art networks on a cancer pathology dataset that is also
faster at inference. The speedup in inference is necessary because of the
volume and velocity of cancer pathology data; specifically, the previous
state-of-the-art networks are too slow for individual researchers without
access to HPC systems to keep pace with the rate of data generation. Our new
model enables researchers with modest computational resources to analyze newly
generated data faster than it is collected.
Comment: Submitted to IEEE Big Data
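The dual-objective search can be illustrated with a toy version (random sampling over a made-up space, not MENNDL's evolutionary algorithm): each candidate architecture gets an accuracy proxy and a latency proxy, and only the Pareto-optimal candidates survive. All names and scoring functions here are fabricated for illustration.

```python
import random

def sample_architecture(rng):
    # Hypothetical two-knob search space: depth and layer width.
    return {"layers": rng.randint(2, 12), "width": rng.choice([32, 64, 128])}

def score(arch):
    # Stand-in objectives: accuracy grows with capacity, latency with depth.
    capacity = arch["layers"] * arch["width"]
    accuracy = capacity / (capacity + 500.0)  # saturating proxy, not a real metric
    latency = float(arch["layers"])           # serial depth dominates inference time
    return accuracy, latency

def pareto_front(archs):
    """Keep candidates no other candidate beats on both objectives at once."""
    front = []
    for a in archs:
        acc_a, lat_a = score(a)
        dominated = any(
            score(b)[0] >= acc_a and score(b)[1] <= lat_a
            and score(b) != (acc_a, lat_a)
            for b in archs
        )
        if not dominated:
            front.append(a)
    return front

rng = random.Random(0)
candidates = [sample_architecture(rng) for _ in range(20)]
front = pareto_front(candidates)
```

A real search evaluates accuracy by training each candidate and measures latency on the target hardware, which is why MENNDL needs an HPC system to run it.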
Latency and accuracy optimized mobile face detection
Abstract. Face detection is a preprocessing step in many computer vision applications. Important factors are accuracy, inference duration, and energy efficiency of the detection framework. Computationally light detectors that execute in real-time are a requirement for many application areas, such as face tracking and recognition. Typical operating platforms in everyday use are smartphones and embedded devices, which have limited computation capacity.
The capability of face detectors is comparable to that of a human in easy detection tasks; when the conditions change, the challenges become different. Current challenges in face detection include atypically posed and tiny faces; partially occluded faces and dim or bright environments also pose challenges for detection systems. State-of-the-art face detection employs deep learning methods called neural networks, which loosely imitate the mammalian brain. The most relevant technology is the convolutional neural network, which is designed for local feature description.
In this thesis, the main computational optimization approach is neural network quantization. The network models were delegated to digital signal processors and graphics processing units. Quantization was shown to reduce the latency of computation substantially. The most energy-efficient inference was achieved through digital signal processor delegation. Multithreading was used for inference acceleration. It reduced the amount of energy consumption per algorithm run.
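The quantization step can be sketched as symmetric per-tensor int8 weight quantization, one common post-training scheme; the thesis's exact scheme and delegate toolchain may differ, and the function names here are illustrative.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: the scale maps the largest
    weight magnitude to 127 (a common post-training choice)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.standard_normal((16, 16)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()  # rounding error is bounded by scale / 2
```

Storing `q` instead of `w` quarters the model size, and integer arithmetic is what lets DSP delegation deliver the energy savings reported above.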