9,556 research outputs found
Efficient multicore-aware parallelization strategies for iterative stencil computations
Stencil computations consume a major part of runtime in many scientific
simulation codes. As prototypes for this class of algorithms we consider the
iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient
parallel implementations for cache-based multicore architectures. Temporal
cache blocking is a known advanced optimization technique, which can reduce the
pressure on the memory bus significantly. We apply and refine this optimization
for a recently presented temporal blocking strategy designed to explicitly
utilize multicore characteristics. Especially for the case of Gauss-Seidel
smoothers we show that simultaneous multi-threading (SMT) can yield substantial
performance improvements for our optimized algorithm.Comment: 15 pages, 10 figure
RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance
Hera-JVM: abstracting processor heterogeneity behind a virtual machine
Heterogeneous multi-core processors, such as the Cell processor, can deliver exceptional performance, however, they are notoriously difficult to program effectively. We present Hera-JVM, a runtime system which hides a processor’s heterogeneity behind a homogeneous virtual machine interface. Preliminary results of three benchmarks running under Hera-JVM are presented. These results suggest a set of application behaviour characteristics that the runtime system should take into account when placing threads on different core types.
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest for low-power high performance computing, this type
of architectures is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric--static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric
scheduling attain important gains in performance with respect to its
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency
On the Impact of Memory Allocation on High-Performance Query Processing
Somewhat surprisingly, the behavior of analytical query engines is crucially
affected by the dynamic memory allocator used. Memory allocators highly
influence performance, scalability, memory efficiency and memory fairness to
other processes. In this work, we provide the first comprehensive experimental
analysis on the impact of memory allocation for high-performance query engines.
We test five state-of-the-art dynamic memory allocators and discuss their
strengths and weaknesses within our DBMS. The right allocator can increase the
performance of TPC-DS (SF 100) by 2.7x on a 4-socket Intel Xeon server
- …