26,636 research outputs found
Validation of hardware events for successful performance pattern identification in High Performance Computing
Hardware performance monitoring (HPM) is a crucial ingredient of performance
analysis tools. While there are interfaces like LIKWID, PAPI or the kernel
interface perf\_event which provide HPM access with some additional features,
many higher level tools combine event counts with results retrieved from other
sources like function call traces to derive (semi-)automatic performance
advice. However, although HPM is available for x86 systems since the early 90s,
only a small subset of the HPM features is used in practice. Performance
patterns provide a more comprehensive approach, enabling the identification of
various performance-limiting effects. Patterns address issues like bandwidth
saturation, load imbalance, non-local data access in ccNUMA systems, or false
sharing of cache lines. This work defines HPM event sets that are best suited
to identify a selection of performance patterns on the Intel Haswell processor.
We validate the chosen event sets for accuracy in order to arrive at a reliable
pattern detection mechanism and point out shortcomings that cannot be easily
circumvented due to bugs or limitations in the hardware
Computer Architectures to Close the Loop in Real-time Optimization
© 2015 IEEE.Many modern control, automation, signal processing and machine learning applications rely on solving a sequence of optimization problems, which are updated with measurements of a real system that evolves in time. The solutions of each of these optimization problems are then used to make decisions, which may be followed by changing some parameters of the physical system, thereby resulting in a feedback loop between the computing and the physical system. Real-time optimization is not the same as fast optimization, due to the fact that the computation is affected by an uncertain system that evolves in time. The suitability of a design should therefore not be judged from the optimality of a single optimization problem, but based on the evolution of the entire cyber-physical system. The algorithms and hardware used for solving a single optimization problem in the office might therefore be far from ideal when solving a sequence of real-time optimization problems. Instead of there being a single, optimal design, one has to trade-off a number of objectives, including performance, robustness, energy usage, size and cost. We therefore provide here a tutorial introduction to some of the questions and implementation issues that arise in real-time optimization applications. We will concentrate on some of the decisions that have to be made when designing the computing architecture and algorithm and argue that the choice of one informs the other
Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures
An accurate prediction of scheduling and execution of instruction streams is
a necessary prerequisite for predicting the in-core performance behavior of
throughput-bound loop kernels on out-of-order processor architectures. Such
predictions are an indispensable component of analytical performance models,
such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a
deep understanding of the performance-relevant interactions between hardware
architecture and loop code. We present the Open Source Architecture Code
Analyzer (OSACA), a static analysis tool for predicting the execution time of
sequential loops comprising x86 instructions under the assumption of an
infinite first-level cache and perfect out-of-order scheduling. We show the
process of building a machine model from available documentation and
semi-automatic benchmarking, and carry it out for the latest Intel Skylake and
AMD Zen micro-architectures. To validate the constructed models, we apply them
to several assembly kernels and compare runtime predictions with actual
measurements. Finally we give an outlook on how the method may be generalized
to new architectures.Comment: 11 pages, 4 figures, 7 table
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Intel Xeon Phi is a recently released high-performance coprocessor which
features 61 cores each supporting 4 hardware threads with 512-bit wide SIMD
registers achieving a peak theoretical performance of 1Tflop/s in double
precision. Many scientific applications involve operations on large sparse
matrices such as linear solvers, eigensolver, and graph mining algorithms. The
core of most of these applications involves the multiplication of a large,
sparse matrix with a dense vector (SpMV). In this paper, we investigate the
performance of the Xeon Phi coprocessor for SpMV. We first provide a
comprehensive introduction to this new architecture and analyze its peak
performance with a number of micro benchmarks. Although the design of a Xeon
Phi core is not much different than those of the cores in modern processors,
its large number of cores and hyperthreading capability allow many application
to saturate the available memory bandwidth, which is not the case for many
cutting-edge processors. Yet, our performance studies show that it is the
memory latency not the bandwidth which creates a bottleneck for SpMV on this
architecture. Finally, our experiments show that Xeon Phi's sparse kernel
performance is very promising and even better than that of cutting-edge general
purpose processors and GPUs
Performance comparison between Java and JNI for optimal implementation of computational micro-kernels
General purpose CPUs used in high performance computing (HPC) support a
vector instruction set and an out-of-order engine dedicated to increase the
instruction level parallelism. Hence, related optimizations are currently
critical to improve the performance of applications requiring numerical
computation. Moreover, the use of a Java run-time environment such as the
HotSpot Java Virtual Machine (JVM) in high performance computing is a promising
alternative. It benefits from its programming flexibility, productivity and the
performance is ensured by the Just-In-Time (JIT) compiler. Though, the JIT
compiler suffers from two main drawbacks. First, the JIT is a black box for
developers. We have no control over the generated code nor any feedback from
its optimization phases like vectorization. Secondly, the time constraint
narrows down the degree of optimization compared to static compilers like GCC
or LLVM. So, it is compelling to use statically compiled code since it benefits
from additional optimization reducing performance bottlenecks. Java enables to
call native code from dynamic libraries through the Java Native Interface
(JNI). Nevertheless, JNI methods are not inlined and require an additional cost
to be invoked compared to Java ones. Therefore, to benefit from better static
optimization, this call overhead must be leveraged by the amount of computation
performed at each JNI invocation. In this paper we tackle this problem and we
propose to do this analysis for a set of micro-kernels. Our goal is to select
the most efficient implementation considering the amount of computation defined
by the calling context. We also investigate the impact on performance of
several different optimization schemes which are vectorization, out-of-order
optimization, data alignment, method inlining and the use of native memory for
JNI methods.Comment: Part of ADAPT Workshop proceedings, 2015 (arXiv:1412.2347
- …