543 research outputs found
Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs
The most recent generations of graphics processing units (GPUs) boost the execution of the convolutional operations required by machine learning applications by resorting to specialized, efficient in-chip accelerators (Tensor Core Units, or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend of heavily stressing TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resilience of arithmetic units to hardware faults plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly those suited to neural network execution. However, a reliability characterization of TCUs supporting different arithmetic formats is still lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that the posit format is less affected by faults than the floating-point one (by up to three orders of magnitude for 16 bits and up to twenty orders of magnitude for 32 bits). We also identified the most sensitive fault locations (i.e., those that produce the largest errors), thus paving the way to adopting smart hardening solutions.
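Not the paper's fault-injection flow, but a minimal illustration (plain Python, IEEE 754 binary32 only; the `flip_bit` helper is hypothetical) of why fault location drives error magnitude: a flip in a high exponent bit corrupts a value catastrophically, while a low mantissa bit barely perturbs it.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 binary32 encoding of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return y

print(flip_bit(1.0, 30))  # exponent MSB flip: 1.0 becomes inf
print(flip_bit(1.0, 0))   # mantissa LSB flip: ~1.0000001
```

Posit formats spend fewer bits on a fixed exponent field (the regime encoding is tapered), which is one intuition for why single-bit faults tend to cause smaller numerical deviations there.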
Evaluating the Potential of Disaggregated Memory Systems for HPC Applications
Disaggregated memory is a promising approach that addresses the limitations
of traditional memory architectures by enabling memory to be decoupled from
compute nodes and shared across a data center. Cloud platforms have deployed
such systems to improve overall system memory utilization, but performance can
vary across workloads. High-performance computing (HPC) is crucial in
scientific and engineering applications, where HPC machines also face the issue
of underutilized memory. As a result, improving system memory utilization while
understanding workload performance is essential for HPC operators. Therefore,
learning the potential of a disaggregated memory system before deployment is a
critical step. This paper proposes a methodology for exploring the design space
of a disaggregated memory system. It incorporates key metrics that affect
performance on disaggregated memory systems: memory capacity, local and remote
memory access ratio, injection bandwidth, and bisection bandwidth, providing an
intuitive approach to guide machine configurations based on technology trends
and workload characteristics. We apply our methodology to analyze thirteen
diverse workloads, including AI training, data analysis, genomics, protein,
fusion, atomic nuclei, and traditional HPC bookends. Our methodology
demonstrates the ability to comprehend the potential and pitfalls of a
disaggregated memory system and provides motivation for machine configurations.
Our results show that eleven of our thirteen applications can leverage
disaggregated memory at injection bandwidth without affecting performance,
while one pays a rack-level bisection bandwidth penalty and two pay a
system-wide bisection bandwidth penalty. In addition, we show that intra-rack
memory disaggregation would meet the applications' memory requirements and
provide enough remote memory bandwidth.
Comment: The submission builds on the following conference paper: N. Ding, S.
Williams, H.A. Nam, et al. Methodology for Evaluating the Potential of
Disaggregated Memory Systems, 2nd International Workshop on RESource
DISaggregation in High-Performance Computing (RESDIS), November 18, 2022. It
is now submitted to the CCPE journal for review.
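As a back-of-the-envelope sketch (not the paper's methodology; the function names and numbers below are illustrative), the injection-bandwidth check in such a design space can be phrased as: the remote share of a workload's memory traffic must fit within the node's injection bandwidth.

```python
def remote_bw_demand(total_bw_gbs: float, remote_ratio: float) -> float:
    """Remote memory bandwidth a workload needs, in GB/s."""
    return total_bw_gbs * remote_ratio

def fits_injection(total_bw_gbs: float, remote_ratio: float,
                   injection_bw_gbs: float) -> bool:
    """True if the remote traffic fits within the injection bandwidth."""
    return remote_bw_demand(total_bw_gbs, remote_ratio) <= injection_bw_gbs

# e.g., 100 GB/s of total traffic with 20% remote accesses needs 20 GB/s,
# which fits a hypothetical 25 GB/s injection link.
print(fits_injection(100.0, 0.2, 25.0))  # True
```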
SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation
Graph neural networks (GNNs) have emerged as a powerful tool to process
graph-based data in fields like communication networks, molecular interactions,
chemistry, social networks, and neuroscience. GNNs are characterized by the
ultra-sparse nature of their adjacency matrix that necessitates the development
of dedicated hardware beyond general-purpose sparse matrix multipliers. While
there has been extensive research on designing dedicated hardware accelerators
for GNNs, few have extensively explored the impact of the sparse storage format
on the efficiency of the GNN accelerators. This paper proposes SCV-GNN with the
novel sparse compressed vectors (SCV) format optimized for the aggregation
operation. We use Z-Morton ordering to derive a data-locality-based computation
ordering and partitioning scheme. The paper also presents how the proposed
SCV-GNN is scalable on a vector processing system. Experimental results over
various datasets show that the proposed method achieves a geometric mean
speedup of and over compressed sparse column (CSC) and compressed sparse row
(CSR) aggregation operations, respectively. The proposed method also reduces
memory traffic by a factor of and over CSC and CSR, respectively. Thus, the
proposed novel aggregation format reduces both the latency and the memory
accesses of GNN inference.
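The Z-Morton ordering used for data locality can be sketched with a standard bit-interleaving encoder (this toy version is not the SCV-GNN implementation): tiles sorted by their Morton code stay close in memory when they are close in 2D coordinate space.

```python
def morton2(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) code: x bits at even positions, y bits at odd."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions
    return z

tiles = [(x, y) for x in range(4) for y in range(4)]
tiles.sort(key=lambda t: morton2(*t))  # Z-curve traversal order
print(tiles[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```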
Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units
Tensor processing units (TPUs), specialized hardware accelerators for machine
learning tasks, have shown significant performance improvements when executing
convolutional layers in convolutional neural networks (CNNs). However, they
struggle to maintain the same efficiency in fully connected (FC) layers,
leading to suboptimal hardware utilization. In-memory analog computing (IMAC)
architectures, on the other hand, have demonstrated notable speedup in
executing FC layers. This paper introduces a novel, heterogeneous,
mixed-signal, and mixed-precision architecture that integrates an IMAC unit
with an edge TPU to enhance mobile CNN performance. To leverage the strengths
of TPUs for convolutional layers and IMAC circuits for dense layers, we propose
a unified learning algorithm that incorporates mixed-precision training
techniques to mitigate potential accuracy drops when deploying models on the
TPU-IMAC architecture. The simulations demonstrate that the TPU-IMAC
configuration achieves up to performance improvements and memory reductions
compared to conventional TPU architectures for various CNN models while
maintaining comparable accuracy. The TPU-IMAC architecture shows potential for
applications where energy efficiency and high performance are essential, such
as edge computing and real-time processing in mobile devices. The unified
training algorithm and the integration of IMAC and TPU architectures
contribute to the potential impact of this research on the broader machine
learning landscape.
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose among a subset of hard-coded optimisations, or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
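One of the classic LIFT-style rewrite rules behind such tiling transformations is split-join: `map(f)` rewrites to `join ∘ map(map(f)) ∘ split(n)`, exposing a tileable nesting without changing the result. A toy Python rendering (illustrative, not the thesis' IR):

```python
def split(n, xs):
    """Partition xs into tiles of (at most) n elements."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    """Flatten the tiles back into one list."""
    return [x for xs in xss for x in xs]

def tiled_map(f, n, xs):
    # join(map(map(f), split(n, xs)))
    return join([[f(x) for x in tile] for tile in split(n, xs)])

xs = list(range(8))
assert tiled_map(lambda v: v * v, 4, xs) == [v * v for v in xs]
```

The equivalence holds for any tile size, which is exactly what lets a compiler expose the tile size as an explorable tuning parameter.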
Code Detection for Hardware Acceleration Using Large Language Models
Large language models (LLMs) have been massively applied to many tasks, often
surpassing state-of-the-art approaches. While their effectiveness in code
generation has been extensively studied (e.g., AlphaCode), their potential for
code detection remains unexplored.
This work presents the first analysis of code detection using LLMs. Our study
examines essential kernels, including matrix multiplication, convolution, and
the fast Fourier transform, implemented in C/C++. We propose both a preliminary,
naive prompt and a novel prompting strategy for code detection.
Results reveal that conventional prompting achieves great precision but poor
accuracy (68.8%, 22.3%, and 79.2% for GEMM, convolution, and FFT, respectively)
due to a high number of false positives. Our novel prompting strategy
substantially reduces false positives, resulting in excellent overall accuracy
(91.1%, 97.9%, and 99.7%, respectively). These results pose a considerable
challenge to existing state-of-the-art code detection methods.
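The precision-versus-accuracy gap the authors report comes straight from the standard confusion-matrix definitions; with a hypothetical confusion matrix (not the paper's data), a large false-positive count drags accuracy down:

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts with many false positives:
tp, tn, fp, fn = 50, 10, 40, 0
print(round(precision(tp, fp), 3))  # 0.556
print(accuracy(tp, tn, fp, fn))     # 0.6
```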
TensorMD: Scalable Tensor-Diagram based Machine Learning Interatomic Potential on Heterogeneous Many-Core Processors
Molecular dynamics simulations have emerged as a potent tool for
investigating the physical properties and kinetic behaviors of materials at the
atomic scale, particularly in extreme conditions. Ab initio accuracy is now
achievable with machine learning based interatomic potentials. With recent
advancements in high-performance computing, highly accurate and large-scale
simulations become feasible. This study introduces TensorMD, a new machine
learning interatomic potential (MLIP) model that integrates physical principles
and tensor diagrams. The tensor formalism provides more efficient computation
and greater flexibility for use with other scientific codes. Additionally, we
propose several portable optimization strategies and develop a highly
optimized version for the new Sunway supercomputer. Our optimized TensorMD
achieves unprecedented performance on the new Sunway, enabling simulations of
up to 52 billion atoms with a time-to-solution of 31 ps/step/atom and setting
new records for HPC + AI + MD.
Adversarial Deep Learning and Security with a Hardware Perspective
Adversarial deep learning is the field of study that analyzes deep learning in the presence of adversarial entities. This entails understanding the capabilities, objectives, and attack scenarios available to the adversary in order to develop defensive mechanisms and avenues of robustness available to the benign parties. Understanding this facet of deep learning helps us improve the safety of deep learning systems against external threats from adversaries. Of equal importance, this perspective also helps the industry understand and respond to critical failures in the technology. The expectation of future success has driven significant interest in developing this technology broadly. Adversarial deep learning stands as a balancing force to ensure these developments remain grounded in the real world and proceed along a responsible trajectory. Recently, the growth of deep learning has begun intersecting with the computer hardware domain to improve performance and efficiency for resource-constrained application domains. The works investigated in this dissertation constitute our pioneering efforts to migrate adversarial deep learning into the hardware domain alongside its parent field of research.
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
Traditional pruning methods are known to be challenging to apply to Large
Language Models (LLMs) for Generative AI because of their unaffordable training
process and large computational demands. For the first time, we introduce the
information entropy of hidden state features into the design of a pruning
metric, namely E-Sparse, to improve the accuracy of N:M sparsity on LLMs.
E-Sparse uses information richness to gauge channel importance, and further
incorporates several novel techniques to put it into effect: (1) it introduces
information entropy to enhance the significance of parameter weights and input
feature norms as a novel pruning metric, and performs N:M sparsity without
modifying the remaining weights; (2) it designs a global naive shuffle and a
local block shuffle to quickly optimize the information distribution and
adequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is
implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere
GPUs. Extensive experiments on the LLaMA family and OPT models show that
E-Sparse can significantly speed up the model inference over the dense model
(up to 1.53X) and obtain significant memory savings (up to 43.52%), with
acceptable accuracy loss.
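A plain-magnitude version of N:M structured pruning (keep the N most important weights in every group of M) can be sketched as below; E-Sparse's contribution is to replace the plain |w| importance with an entropy-augmented metric, which this toy code does not implement.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights per group of m; zero the rest."""
    out = []
    for i in range(0, len(weights), m):
        group = weights[i:i + m]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]),
                      reverse=True)[:n]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

print(nm_prune([0.1, -0.9, 0.3, 0.5]))  # [0.0, -0.9, 0.0, 0.5]
```

The 2:4 pattern matches the structured sparsity that NVIDIA Ampere tensor cores can execute directly, which is why the kernel is implemented as a Sparse-GEMM.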
ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures
Vision Transformers are being increasingly deployed in safety-critical
applications that demand high reliability. It is crucial to ensure the
correctness of their execution in spite of potential errors such as transient
hardware errors. We propose a novel algorithm-based resilience framework called
ALBERTA that allows us to perform end-to-end resilience analysis and protection
of transformer-based architectures. First, our work develops an efficient
process of computing and ranking the resilience of transformer layers. We find
that due to the large size of transformer models, applying traditional network
redundancy to a subset of the most vulnerable layers provides high error
coverage albeit with impractically high overhead. We address this shortcoming
by providing a software-directed, checksum-based error detection technique
aimed at protecting the most vulnerable general matrix multiply (GEMM) layers
in the transformer models that use either floating-point or integer arithmetic.
Results show that our approach achieves over 99% coverage for errors that
result in a mismatch at less than 0.2% computation overhead. Lastly, we present
the applicability of our framework in various modern GPU architectures under
different numerical precisions. We introduce an efficient self-correction
mechanism for resolving erroneous detections with an average overhead of less
than 0.002% (with a 2% overhead to resolve each erroneous detection).
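The checksum idea is the classic algorithm-based fault tolerance (ABFT) invariant for GEMM: the column sums of C = A·B must equal the product of A's column-sum vector with B. A minimal sketch (plain Python lists, not the framework's GPU implementation):

```python
def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def colsums(M):
    """Column sums of a matrix, as a flat list."""
    return [sum(col) for col in zip(*M)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
check = matmul([colsums(A)], B)[0]  # checksum row of the correct result
assert check == colsums(C)          # fault-free: checksums agree

C[0][0] += 1                        # inject a transient error into C
assert check != colsums(C)          # mismatch flags the corrupted output
```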