2,964 research outputs found
SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators
Modern SoCs integrate multiple CPU cores and Hardware Accelerators (HWAs)
that share the same main memory system, causing interference among memory
requests from different agents. The result of this interference, if not
controlled well, is missed deadlines for HWAs and low CPU performance.
State-of-the-art mechanisms designed for CPU-GPU systems strive to meet a
target frame rate for GPUs by prioritizing the GPU close to the time when it
has to complete a frame. We observe two major problems when such an approach is
adapted to a heterogeneous CPU-HWA system. First, HWAs miss deadlines because
they are prioritized only close to their deadlines. Second, such an approach
does not consider the diverse memory access characteristics of different
applications running on CPUs and HWAs, leading to low performance for
latency-sensitive CPU applications and deadline misses for some HWAs, including
GPUs.
In this paper, we propose a Simple Quality of service Aware memory Scheduler
for Heterogeneous systems (SQUASH), that overcomes these problems using three
key ideas, with the goal of meeting deadlines of HWAs while providing high CPU
performance. First, SQUASH prioritizes a HWA when it is not on track to meet
its deadline any time during a deadline period. Second, SQUASH prioritizes HWAs
over memory-intensive CPU applications based on the observation that the
performance of memory-intensive applications is not sensitive to memory
latency. Third, SQUASH treats short-deadline HWAs differently as they are more
likely to miss their deadlines and schedules their requests based on worst-case
memory access time estimates.
Extensive evaluations across a wide variety of different workloads and
systems show that SQUASH achieves significantly better CPU performance than the
best previous scheduler while always meeting the deadlines for all HWAs,
including GPUs, thereby largely improving frame rates
A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials
We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and
Topological Materials), a library and code generator based on a domain-specific
language tailored to implement the specific stencil-like algorithms that can
describe Dirac and topological materials such as graphene and topological
insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP)
code is fully vectorized using Single Instruction Multiple Data (SIMD)
extensions. It is significantly faster than matrix-based approaches on the node
level and performs in accordance with the roofline model. We demonstrate the
chip-level performance and distributed-memory scalability of basic building
blocks such as sparse matrix-(multiple-) vector multiplication on modern
multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i)
explore the scattering of a Dirac wave on an array of gate-defined quantum
dots, to (ii) calculate a bunch of interior eigenvalues for strong topological
insulators, and to (iii) discuss the photoemission spectra of a disordered Weyl
semimetal.Comment: 16 pages, 2 tables, 11 figure
Exploring performance and power properties of modern multicore chips via simple machine models
Modern multicore chips show complex behavior with respect to performance and
power. Starting with the Intel Sandy Bridge processor, it has become possible
to directly measure the power dissipation of a CPU chip and correlate this data
with the performance properties of the running code. Going beyond a simple
bottleneck analysis, we employ the recently published Execution-Cache-Memory
(ECM) model to describe the single- and multi-core performance of streaming
kernels. The model refines the well-known roofline model, since it can predict
the scaling and the saturation behavior of bandwidth-limited loop kernels on a
multicore chip. The saturation point is especially relevant for considerations
of energy consumption. From power dissipation measurements of benchmark
programs with vastly different requirements to the hardware, we derive a
simple, phenomenological power model for the Sandy Bridge processor. Together
with the ECM model, we are able to explain many peculiarities in the
performance and power behavior of multicore processors, and derive guidelines
for energy-efficient execution of parallel programs. Finally, we show that the
ECM and power models can be successfully used to describe the scaling and power
behavior of a lattice-Boltzmann flow solver code.Comment: 23 pages, 10 figures. Typos corrected, DOI adde
- …