2,052 research outputs found
MARACAS: a real-time multicore VCPU scheduling framework
This paper describes a multicore scheduling and load-balancing framework called MARACAS, to address shared cache and memory bus contention. It builds upon prior work centered around the concept of virtual CPU (VCPU) scheduling. Threads are associated with VCPUs that have periodically replenished time budgets. VCPUs are guaranteed to receive their periodic budgets even if they are migrated between cores. A load balancing algorithm ensures VCPUs are mapped to cores to fairly distribute surplus CPU cycles, after ensuring VCPU timing guarantees. MARACAS uses surplus cycles to throttle the execution of threads running on specific cores when memory contention exceeds a certain threshold. This enables threads on other cores to make better progress without interference from co-runners. Our scheduling framework features a novel memory-aware scheduling approach that uses performance counters to derive an average memory request latency. We show that latency-based memory throttling is more effective than rate-based memory access control in reducing bus contention. MARACAS also supports cache-aware scheduling and migration using page recoloring to improve performance isolation amongst VCPUs. Experiments show how MARACAS reduces multicore resource contention, leading to improved task progress.http://www.cs.bu.edu/fac/richwest/papers/rtss_2016.pdfAccepted manuscrip
A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures
In order to improve system performance efficiently, a number of systems
choose to equip multi-core and many-core processors (such as GPUs). Due to
their discrete memory these heterogeneous architectures comprise a distributed
system within a computer. A data-flow programming model is attractive in this
setting for its ease of expressing concurrency. Programmers only need to define
task dependencies without considering how to schedule them on the hardware.
However, mapping the resulting task graph onto hardware efficiently remains a
challenge. In this paper, we propose a graph-partition scheduling policy for
mapping data-flow workloads to heterogeneous hardware. According to our
experiments, our graph-partition-based scheduling achieves comparable
performance to conventional queue-base approaches.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
Best practices for HPM-assisted performance engineering on modern multicore processors
Many tools and libraries employ hardware performance monitoring (HPM) on
modern processors, and using this data for performance assessment and as a
starting point for code optimizations is very popular. However, such data is
only useful if it is interpreted with care, and if the right metrics are chosen
for the right purpose. We demonstrate the sensible use of hardware performance
counters in the context of a structured performance engineering approach for
applications in computational science. Typical performance patterns and their
respective metric signatures are defined, and some of them are illustrated
using case studies. Although these generic concepts do not depend on specific
tools or environments, we restrict ourselves to modern x86-based multicore
processors and use the likwid-perfctr tool under the Linux OS.Comment: 10 pages, 2 figure
Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach
Computationally-intensive loops are the primary source of parallelism in
scientific applications. Such loops are often irregular and a balanced
execution of their loop iterations is critical for achieving high performance.
However, several factors may lead to an imbalanced load execution, such as
problem characteristics, algorithmic, and systemic variations. Dynamic loop
self-scheduling (DLS) techniques are devised to mitigate these factors, and
consequently, improve application performance. On distributed-memory systems,
DLS techniques can be implemented using a hierarchical master-worker execution
model and are, therefore, called hierarchical DLS techniques. These techniques
self-schedule loop iterations at two levels of hardware parallelism: across and
within compute nodes. Hybrid programming approaches that combine the message
passing interface (MPI) with open multi-processing (OpenMP) dominate the
implementation of hierarchical DLS techniques. The MPI-3 standard includes the
feature of sharing memory regions among MPI processes. This feature introduced
the MPI+MPI approach that simplifies the implementation of parallel scientific
applications. The present work designs and implements hierarchical DLS
techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques
are considered in the evaluation proposed herein. The results indicate certain
performance advantages of the proposed approach compared to the hybrid
MPI+OpenMP approach
- …