Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
end, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup are required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library.
Comment: 10 pages, 7 figures, ICPP 201
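For readers unfamiliar with the CSR format named in the abstract above, the following is a minimal sketch of the baseline CSR SpMV kernel, not the paper's optimizer. The function name `csr_spmv` and the array names `values`, `col_idx`, and `row_ptr` are illustrative choices, not identifiers taken from the paper or from Intel MKL.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    (All names are illustrative assumptions, not the paper's API.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x: a typical source of irregular
            # memory traffic in SpMV.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Small 3x3 example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_spmv(values, col_idx, row_ptr, x)
print(y)
```

The inner loop length varies per row and the reads of `x` are indirect, which is one plausible reason the bottleneck depends on both the matrix structure and the target architecture.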
Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.
Parallel code-specific CPU simulation with dynamic phase convergence modeling for HW/SW co-design
While SystemC models provide a promising solution to the complex problem of HW/SW co-design within the system-on-chip paradigm, they require detailed annotation of transaction-level energy and performance data within the model. While this data can be obtained through source code profiling of an application running on the target processor, doing so when the target CPU hardware is not available typically requires time-consuming CPU simulation, which is often too slow to be practical for large programs. Additionally, while SystemC modeling with the TLM 2.0 standard is widely adopted for SoC modeling, the process of transforming C/C++ code to SystemC code with TLM 2.0 functionality remains non-trivial. Herein we propose an automated framework that:
1. Enables high-speed, code-specific CPU profiling support for both Sniper and gem5 using parallelized dynamic steady-state phase convergence modeling, providing automatic annotation of energy and performance within source code.
2. Provides an automated C to SystemC TLM 2.0 code generation flow that utilizes the back-annotated source code to produce a SystemC module for seamless incorporation into the virtual prototype.
Maximum speedups obtained using Sniper and gem5 are 48.76x and 562x respectively, while average speedups are 31.5x and 323.1x. Sniper results maintain an average accuracy within 0.89% for latency and 0.10% for energy, while gem5 achieves average accuracies within 4.16% and 2.87% for latency and energy respectively.
Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling
High-performance, multi-core processors are the key to accelerating workloads
in several application domains. To continue to scale performance at the limit
of Moore's Law and Dennard scaling, software and hardware designers have turned
to dynamic solutions that adapt to the needs of applications in a transparent,
automatic way. For example, modern hardware improves its performance and power
efficiency by changing the hardware configuration, like the frequency and
voltage of cores, according to a number of parameters such as the technology
used, the workload running, etc. With this level of dynamism, it is essential
to simulate next-generation multi-core processors in a way that can both
respond to system changes and accurately determine system performance metrics.
Currently, no sampled simulation platform can achieve these goals of dynamic,
fast, and accurate simulation of multi-threaded workloads.
In this work, we propose a solution that allows for fast, accurate simulation
in the presence of both hardware and software dynamism. To accomplish this
goal, we present Pac-Sim, a novel methodology for fast, accurate sampled
simulation that requires no upfront analysis of the workload.
With our proposed methodology, it is now possible to simulate long-running
dynamically scheduled multi-threaded programs with significant simulation
speedups even in the presence of dynamic hardware events. We evaluate Pac-Sim
using the multi-threaded SPEC CPU2017, NPB, and PARSEC benchmarks with both
static and dynamic thread scheduling. The experimental results show that
Pac-Sim achieves a very low sampling error of 1.63% and 3.81% on average for
statically and dynamically scheduled benchmarks, respectively. Pac-Sim also
demonstrates significant simulation speedups as high as 523.5x
(210.3x on average) for the train input set of SPEC CPU2017.
Comment: 14 pages, 14 figures