889 research outputs found
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided, handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation.
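To illustrate the idea of restricting a tuning search to valid combinations only, here is a minimal sketch. It is not the LIFT compiler's constraint inference: the function `valid_tunings`, its parameters, and the four constraints (divisibility, work-group limit, even work split, local-memory capacity) are assumptions chosen for illustration.

```python
from itertools import product


def valid_tunings(n, max_workgroup, local_mem_bytes, elem_bytes=4):
    """Enumerate (tile, workgroup) pairs that satisfy every constraint.

    All constraints below are hypothetical examples of the kind of rule a
    compiler could infer; they are not taken from the thesis.
    """
    candidates = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
    result = []
    for tile, workgroup in product(candidates, repeat=2):
        if n % tile != 0:
            continue  # tiles must cover the input exactly
        if workgroup > max_workgroup:
            continue  # device limit on threads per work-group
        if tile % workgroup != 0:
            continue  # every thread must receive the same amount of work
        if tile * elem_bytes > local_mem_bytes:
            continue  # one tile must fit in local memory
        result.append((tile, workgroup))
    return result


tunings = valid_tunings(n=96, max_workgroup=8, local_mem_bytes=64)
```

Of the 121 raw (tile, work-group) pairs, only those passing all checks remain, so a tuner would never spend time on a configuration that cannot run.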
The Sparse Abstract Machine
We propose the Sparse Abstract Machine (SAM), an abstract machine model for
targeting sparse tensor algebra to reconfigurable and fixed-function spatial
dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse
primitives that encompass a large space of scheduled tensor algebra
expressions. SAM dataflow graphs naturally separate tensor formats from
algorithms and are expressive enough to incorporate arbitrary iteration
orderings and many hardware-specific optimizations. We also present Custard, a
compiler from a high-level language to SAM that demonstrates SAM's usefulness
as an intermediate representation. We automatically bind from SAM to a
streaming dataflow simulator. We evaluate the generality and extensibility of
SAM, explore the performance space of sparse tensor algebra optimizations using
SAM, and show SAM's ability to represent dataflow hardware. Comment: 18 pages, 17 figures, 3 tables
Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs
Sparse linear iterative solvers are essential for many large-scale
simulations. Much of the runtime of these solvers is often spent in the
implicit evaluation of matrix polynomials via a sequence of sparse
matrix-vector products. A variety of approaches has been proposed to make these
polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial
preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular
practice to approximate triangular solves by a matrix polynomial to increase
parallelism. Such algorithms allow the polynomial to be evaluated using a so-called
matrix power kernel (MPK), which computes the product between a power of a
sparse matrix A and a dense vector x, or a related operation. Recently we have
shown that using the level-based formulation of sparse matrix-vector
multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we
can perform temporal cache blocking of MPK to increase its performance. In this
work, we demonstrate the application of this cache-blocking optimization in
sparse iterative solvers.
By integrating the RACE library into the Trilinos framework, we demonstrate
the speedups achieved in (preconditioned) s-step GMRES, polynomial
preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we
achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms
with moderate contributions from subspace orthogonalization, the gain reduces
significantly, which is often caused by the insufficient quality of the
orthogonalization routines. Finally, we showcase the application of
RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind)
and highlight the new opportunities and perspectives opened up by RACE as a
cache-blocking technique for MPK-enabled sparse solvers. Comment: 25 pages, 11 figures, 3 tables
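As a point of reference for what a matrix power kernel computes, the following is a minimal, unblocked sketch in pure Python. It is the naive baseline of back-to-back sparse matrix-vector products, not the RACE cache-blocked variant; the function names and the CSR example matrix are assumptions made for illustration.

```python
def csr_matvec(indptr, indices, data, x):
    """Return y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def matrix_power_kernel(indptr, indices, data, x, power):
    """Return [x, A x, A^2 x, ..., A^power x] via repeated SpMV."""
    vectors = [list(x)]
    for _ in range(power):
        vectors.append(csr_matvec(indptr, indices, data, vectors[-1]))
    return vectors


# 3x3 tridiagonal example: A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
krylov = matrix_power_kernel(indptr, indices, data, [1.0, 1.0, 1.0], 2)
```

Each call to `csr_matvec` streams the whole matrix from memory again, which is the traffic that temporal cache blocking aims to reduce.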
Fast multiplication of random dense matrices with fixed sparse matrices
This work focuses on accelerating the multiplication of a dense random matrix
with a (fixed) sparse matrix, which is frequently used in sketching algorithms.
We develop a novel scheme that takes advantage of blocking and recomputation
(on-the-fly random number generation) to accelerate this operation. The
techniques we propose decrease memory movement, thereby increasing the
algorithm's parallel scalability in shared memory architectures. On the Intel
Frontera architecture, our algorithm can achieve 2x speedups over libraries
such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we
can obtain a parallel efficiency of up to approximately 45%. We also present a
theoretical analysis for the memory movement lower bound of our algorithm,
showing that under mild assumptions, it's possible to beat the data movement
lower bound of general matrix-matrix multiply (GEMM) by a factor of √M,
where M is the cache size. Finally, we incorporate our sketching algorithm
into a randomized least squares solver. For extremely over-determined sparse
input matrices, we show that our results are competitive with SuiteSparse; in
some cases, we obtain a speedup of 10x over SuiteSparse.
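The recomputation idea (generating random entries on the fly instead of storing the dense matrix) can be sketched as follows. This is a simplified illustration without blocking, not the authors' algorithm: the seeding scheme, the Gaussian distribution, and the names `random_column` and `sketch_sparse` are assumptions.

```python
import random


def random_column(seed, i, k):
    """Regenerate column i of the k-row random matrix S from (seed, i)."""
    rng = random.Random(seed * 1_000_003 + i)
    return [rng.gauss(0.0, 1.0) for _ in range(k)]


def sketch_sparse(k, n_cols, entries, seed):
    """Return B = S @ A (k x n_cols), with A given as (row, col, value) triples.

    S is never stored: column S[:, row] is recomputed for each nonzero of A.
    """
    b = [[0.0] * n_cols for _ in range(k)]
    for row, col, value in entries:
        s_col = random_column(seed, row, k)
        for r in range(k):
            b[r][col] += value * s_col[r]
    return b


# A is 4 x 2 with four nonzeros; the sketch B = S @ A has k = 3 rows.
entries = [(0, 0, 1.0), (2, 0, 3.0), (1, 1, -2.0), (3, 1, 0.5)]
b = sketch_sparse(k=3, n_cols=2, entries=entries, seed=7)
```

Because every column of S is a deterministic function of the seed and its index, the result matches a product with a fully materialized S while only ever holding one column in memory.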
OpenLB User Guide: Associated with Release 1.6 of the Code
OpenLB is an object-oriented implementation of LBM. It is the first
implementation of a generic platform for LBM programming, which is shared with
the open source community (GPLv2). Since the first release in 2007, the code
has been continuously improved and extended, which is documented by thirteen
releases as well as the corresponding release notes which are available on the
OpenLB website (https://www.openlb.net). The OpenLB code is written in C++ and
is used by application programmers as well as developers, with the ability to
implement custom models. OpenLB supports complex data structures that allow
simulations in complex geometries and parallel execution using MPI, OpenMP and
CUDA on high-performance computers. The source code uses the concepts of
interfaces and templates, so that efficient, direct and intuitive
implementations of the LBM become possible. The efficiency and scalability have
been checked and proved by code reviews. This user manual and a source code
documentation by DoxyGen are available on the OpenLB project website.
Finding Morton-Like Layouts for Multi-Dimensional Arrays Using Evolutionary Algorithms
The layout of multi-dimensional data can have a significant impact on the
efficacy of hardware caches and, by extension, the performance of applications.
Common multi-dimensional layouts include the canonical row-major and
column-major layouts as well as the Morton curve layout. In this paper, we
describe how the Morton layout can be generalized to a very large family of
multi-dimensional data layouts with widely varying performance characteristics.
We posit that this design space can be efficiently explored using a
combinatorial evolutionary methodology based on genetic algorithms. To this
end, we propose a chromosomal representation for such layouts as well as a
methodology for estimating the fitness of array layouts using cache simulation.
We show that our fitness function correlates to kernel running time in real
hardware, and that our evolutionary strategy allows us to find candidates with
favorable simulated cache properties in four out of the eight real-world
applications under consideration in a small number of generations. Finally, we
demonstrate that the array layouts found using our evolutionary method perform
well not only in simulated environments but that they can effect significant
performance gains -- up to a factor of ten in extreme cases -- in real hardware.
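One way to picture a family of Morton-like layouts is as a choice of which dimension supplies each bit of the linear offset. The sketch below follows that reading; the `pattern` encoding is an assumption made for illustration and is not necessarily the chromosomal representation used in the paper.

```python
def layout_index(coords, pattern):
    """Map multi-dimensional coords to a linear offset by bit interleaving.

    pattern[j] names the dimension that supplies output bit j (least
    significant first); each dimension's bits are consumed low to high.
    """
    next_bit = [0] * len(coords)
    index = 0
    for out_bit, dim in enumerate(pattern):
        bit = (coords[dim] >> next_bit[dim]) & 1
        index |= bit << out_bit
        next_bit[dim] += 1
    return index


# Two layouts for a 4 x 4 array indexed as (x, y):
MORTON_4X4 = [0, 1, 0, 1]     # alternate x and y bits (Morton curve)
ROW_MAJOR_4X4 = [0, 0, 1, 1]  # all x bits first, then all y bits
```

For example, `layout_index((3, 0), MORTON_4X4)` gives 5 while `layout_index((3, 0), ROW_MAJOR_4X4)` gives 3. Every other arrangement of the same pattern entries yields another valid layout, which is what makes the space large enough to search.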
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs, to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. 
Simultaneously, DOPpler breaks long-held assumptions about the serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target devices.
Decryption Failure Attacks on Post-Quantum Cryptography
This dissertation discusses mainly new cryptanalytical results related to issues of securely implementing the next generation of asymmetric cryptography, or Public-Key Cryptography (PKC). PKC, as it has been deployed until today, depends heavily on the integer factorization and the discrete logarithm problems. Unfortunately, it has been well-known since the mid-90s that these mathematical problems can be solved due to Peter Shor's algorithm for quantum computers, which achieves the answers in polynomial time. The recently accelerated pace of R&D towards quantum computers, eventually of sufficient size and power to threaten cryptography, has led the crypto research community towards a major shift of focus. A project towards standardization of Post-quantum Cryptography (PQC) was launched by the US-based standardization organization, NIST. PQC is the name given to algorithms designed for running on classical hardware/software whilst being resistant to attacks from quantum computers. PQC is well suited for replacing the current asymmetric schemes. A primary motivation for the project is to guide publicly available research toward the singular goal of finding weaknesses in the proposed next generation of PKC. For public key encryption (PKE) or digital signature (DS) schemes to be considered secure they must be shown to rely heavily on well-known mathematical problems with theoretical proofs of security under established models, such as indistinguishability under chosen ciphertext attack (IND-CCA). Also, they must withstand serious attack attempts by well-renowned cryptographers both concerning theoretical security and the actual software/hardware instantiations. It is well-known that security models, such as IND-CCA, are not designed to capture the intricacies of inner-state leakages. Such leakages are named side-channels, which is currently a major topic of interest in the NIST PQC project. This dissertation focuses on two things, in general: 1) how does the low but non-zero probability of decryption failures affect the cryptanalysis of these new PQC candidates? And 2) how might side-channel vulnerabilities inadvertently be introduced when going from theory to the practice of software/hardware implementations? Of main concern are PQC algorithms based on lattice theory and coding theory. The primary contributions are the discovery of novel decryption failure side-channel attacks, improvements on existing attacks, an alternative implementation to a part of a PQC scheme, and some more theoretical cryptanalytical results.
Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training
Deep neural networks have achieved great success in many data processing
applications. However, their high computational complexity and storage cost make
deep learning hard to use on resource-constrained devices, and the high power
consumption is not environmentally friendly. In this paper, we focus on
low-rank optimization for efficient deep learning techniques. In the space
domain, deep neural networks are compressed by low rank approximation of the
network parameters, which directly reduces the storage requirement with a
smaller number of network parameters. In the time domain, the network
parameters can be trained in a few subspaces, which enables efficient training
for fast convergence. The model compression in the spatial domain is summarized
into three categories as pre-train, pre-set, and compression-aware methods,
respectively. With a series of integrable techniques discussed, such as sparse
pruning, quantization, and entropy coding, we can combine them in an
integration framework with lower computational complexity and storage. Besides
summarizing recent technical advances, we report two findings to motivate
future work: one is that the effective rank outperforms other sparse measures
for network compression; the other is a spatial and temporal balance for
tensorized neural networks.
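The storage saving from a low-rank approximation follows from simple counting: an m x n weight matrix holds m·n parameters, while a rank-r factorization W ≈ U V (U is m x r, V is r x n) holds r·(m + n). The sketch below shows that arithmetic and a matrix-vector product through the factors; the function names and the example sizes are assumptions for illustration, not taken from the paper.

```python
def dense_params(m, n):
    """Parameter count of a full m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Parameter count of the factors U (m x r) and V (r x n)."""
    return r * (m + n)


def low_rank_matvec(u, v, x):
    """Compute (U @ V) @ x as U @ (V @ x), never forming U @ V."""
    t = [sum(v_row[j] * x[j] for j in range(len(x))) for v_row in v]
    return [sum(u_row[k] * t[k] for k in range(len(t))) for u_row in u]


# A 1024 x 1024 layer compressed to rank 32 keeps 1/16 of the parameters.
ratio = dense_params(1024, 1024) / low_rank_params(1024, 1024, 32)

# Tiny rank-1 example: U is 3 x 1, V is 1 x 4.
y = low_rank_matvec([[1.0], [2.0], [3.0]], [[1.0, 0.0, 1.0, 0.0]], [1.0, 2.0, 3.0, 4.0])
```

Evaluating U (V x) costs roughly r·(m + n) multiplications instead of m·n, so the same factorization reduces compute as well as storage whenever r is small relative to m and n.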
Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology
Processing-in-memory (PIM) has been explored for decades by computer
architects, yet it has never seen the light of day in real-world products due
to its high design overheads and lack of a killer application. With the
advent of critical memory-intensive workloads, several commercial PIM
technologies have been introduced to the market ranging from domain-specific
PIM architectures to more general-purpose PIM architectures. In this work, we
dive deep into UPMEM's commercial PIM technology, a general-purpose PIM-enabled
parallel architecture that is highly programmable. Our first key contribution
is the development of a flexible simulation framework for PIM. The simulator we
developed (aka PIMulator) enables the compilation of UPMEM-PIM source code
into machine-level instructions, which are subsequently consumed
by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's
PIM design through a detailed characterization study. Building on top of our
characterization, we conduct a series of case studies to pathfind important
architectural features that we deem will be critical for future PIM
architectures to support.