Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels
Useful models of loop kernel runtimes on out-of-order architectures require
an analysis of the in-core performance behavior of instructions and their
dependencies. While an instruction throughput prediction sets a lower bound to
the kernel runtime, the critical path defines an upper bound. Such predictions
are an essential part of analytic (i.e., white-box) performance models like the
Roofline and Execution-Cache-Memory (ECM) models. They enable a better
understanding of the performance-relevant interactions between hardware
architecture and loop code. The Open Source Architecture Code Analyzer (OSACA)
is a static analysis tool for predicting the execution time of sequential
loops. It previously supported only x86 (Intel and AMD) architectures and
simple, optimistic full-throughput execution. We have heavily extended OSACA to
support ARM instructions and critical path prediction including the detection
of loop-carried dependencies, which turns it into a versatile
cross-architecture modeling tool. We show runtime predictions for code on Intel
Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on
machine models from available documentation and semi-automatic benchmarking.
The predictions are compared with actual measurements.
Comment: 6 pages, 3 figures
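The two bounds named in this abstract can be illustrated on a toy kernel. The following is a minimal sketch, not OSACA's implementation or API: the instruction list, port names, and latencies are invented for illustration, and each instruction is assumed to occupy exactly one port for one cycle.

```python
from collections import defaultdict

# Hypothetical toy kernel, listed in program order:
# (name, port, latency in cycles, names of producer instructions).
# Ports and latencies are illustrative assumptions, not a real machine model.
KERNEL = [
    ("load_a", "P2", 4, []),
    ("load_b", "P2", 4, []),
    ("mul",    "P0", 4, ["load_a", "load_b"]),
    ("add",    "P1", 4, ["mul"]),
    ("store",  "P4", 1, ["add"]),
]


def throughput_bound(kernel):
    """Cycles per iteration if only port occupancy limits execution."""
    pressure = defaultdict(int)
    for _name, port, _latency, _deps in kernel:
        pressure[port] += 1  # assumption: one cycle of occupancy per instruction
    return max(pressure.values())


def critical_path(kernel):
    """Length in cycles of the longest latency chain in the dependency graph."""
    finish = {}
    for name, _port, latency, deps in kernel:
        # An instruction can start once all of its producers have finished.
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency
    return max(finish.values())


if __name__ == "__main__":
    print("throughput bound (cycles):", throughput_bound(KERNEL))
    print("critical path (cycles):", critical_path(KERNEL))
```

For this toy input the port-pressure bound is 2 cycles (two loads share port P2) and the longest dependency chain is 13 cycles. Loop-carried dependencies, which the abstract also mentions, are not modeled here.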
Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures
An accurate prediction of scheduling and execution of instruction streams is
a necessary prerequisite for predicting the in-core performance behavior of
throughput-bound loop kernels on out-of-order processor architectures. Such
predictions are an indispensable component of analytical performance models,
such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a
deep understanding of the performance-relevant interactions between hardware
architecture and loop code. We present the Open Source Architecture Code
Analyzer (OSACA), a static analysis tool for predicting the execution time of
sequential loops comprising x86 instructions under the assumption of an
infinite first-level cache and perfect out-of-order scheduling. We show the
process of building a machine model from available documentation and
semi-automatic benchmarking, and carry it out for the latest Intel Skylake and
AMD Zen micro-architectures. To validate the constructed models, we apply them
to several assembly kernels and compare runtime predictions with actual
measurements. Finally we give an outlook on how the method may be generalized
to new architectures.
Comment: 11 pages, 4 figures, 7 tables
Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation
High-dimensional sparse data emerge in many critical application domains such
as cybersecurity, healthcare, anomaly detection, and trend analysis. To quickly
extract meaningful insights from massive volumes of these multi-dimensional
data, scientists employ unsupervised analysis tools based on tensor
decomposition (TD) methods. However, real-world sparse tensors exhibit highly
irregular shapes, data distributions, and sparsity, which pose significant
challenges for making efficient use of modern parallel architectures. This
study breaks the prevailing assumption that compressing sparse tensors into
coarse-grained structures (i.e., tensor slices or blocks) or along a particular
dimension/mode (i.e., mode-specific) is more efficient than keeping them in a
fine-grained, mode-agnostic form. Our novel sparse tensor representation,
Adaptive Linearized Tensor Order (ALTO), encodes tensors in a compact format
that can be easily streamed from memory and is amenable to both caching and
parallel execution. To demonstrate the efficacy of ALTO, we accelerate popular
TD methods that compute the Canonical Polyadic Decomposition (CPD) model across
a range of real-world sparse tensors. Additionally, we characterize the major
execution bottlenecks of TD methods on multiple generations of the latest Intel
Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic
adaptation heuristics to automatically select the best algorithm based on the
sparse tensor characteristics. Across a diverse set of real-world data sets,
ALTO outperforms the state-of-the-art approaches, achieving more than an
order-of-magnitude speedup over the best mode-agnostic formats. Compared to the
best mode-specific formats, which require multiple tensor copies, ALTO achieves
more than 5.1x geometric mean speedup at a fraction (25%) of their storage.
Comment: We extend the results of our previous ICS paper to significantly
improve the parallel performance of the Canonical Polyadic Alternating Least
Squares (CP-ALS) algorithm for normally distributed data and the Canonical
Polyadic Alternating Poisson Regression (CP-APR) algorithm for non-negative
count data
Execution-Cache-Memory modeling and performance tuning of sparse matrix-vector multiplication and Lattice quantum chromodynamics on A64FX
The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics. For SpMV we show why the compressed row storage (CRS) matrix storage format is not a good practical choice on this architecture and how the SELL-C-sigma format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW.
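For readers unfamiliar with the CRS format mentioned in this abstract, the following is a minimal reference sketch of SpMV over CRS arrays. It is a plain-Python illustration of the data layout only, not the paper's tuned kernel; the function name and the example matrix are assumptions made for this sketch, and the abstract itself does not spell out why CRS performs poorly on the A64FX.

```python
def spmv_crs(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CRS form.

    values  -- nonzero entries, stored row by row
    col_idx -- column index of each entry in `values`
    row_ptr -- row_ptr[i]:row_ptr[i + 1] is the slice of `values` for row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # The inner loop length equals the number of nonzeros in this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    # Illustrative 3x3 matrix:
    # [[2, 0, 1],
    #  [0, 3, 0],
    #  [4, 0, 5]]
    values = [2.0, 1.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(spmv_crs(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

SELL-C-sigma, the alternative named in the abstract, reorganizes rows into fixed-height chunks; it is not shown here.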