Enhancing Resource Management through Prediction-based Policies
Task-based programming models are emerging as a promising alternative to make
the most of multi-/many-core systems. These programming models rely on runtime
systems, and their goal is to improve application performance by properly
scheduling application tasks to cores. Additionally, these runtime systems
offer policies to cope with application phases that lack sufficient parallelism to fill
all cores. However, these policies are usually static and favor either
performance or energy efficiency. In this paper, we have extended a task-based
runtime system with a lightweight monitoring and prediction infrastructure that
dynamically predicts the optimal number of cores required for each application
phase, thus improving both performance and energy efficiency. Through the
execution of several benchmarks in multi-/many-core systems, we show that our
prediction-based policies achieve competitive performance while improving energy
efficiency when compared to state-of-the-art policies.
Comment: Postprint submitted and published at Euro-Par 2020: International
European Conference on Parallel and Distributed Computing (Springer)
(https://link.springer.com/chapter/10.1007%2F978-3-030-57675-2_31)
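The abstract describes predicting, per application phase, how many cores are worth keeping. As a minimal sketch of that idea (not the paper's actual infrastructure: the phase identifiers, the exponential moving average, and the `CorePredictor` class are all invented here for illustration), a runtime could track how many cores each phase actually keeps busy and grant only that many next time:

```python
# Hypothetical sketch of a prediction-based resource policy. A phase's observed
# core usage is smoothed with an exponential moving average, and the runtime
# grants only the predicted demand next time, releasing idle cores to save energy.

class CorePredictor:
    def __init__(self, total_cores, alpha=0.5):
        self.total_cores = total_cores
        self.alpha = alpha              # EMA smoothing factor
        self.predicted = {}             # phase id -> predicted core demand

    def observe(self, phase, busy_cores):
        """Record how many cores the phase actually kept busy."""
        prev = self.predicted.get(phase, float(busy_cores))
        self.predicted[phase] = self.alpha * busy_cores + (1 - self.alpha) * prev

    def cores_for(self, phase):
        """Cores to grant the next time this phase runs (capped, at least 1)."""
        demand = self.predicted.get(phase, self.total_cores)
        return max(1, min(self.total_cores, round(demand)))

p = CorePredictor(total_cores=16)
p.observe("reduction", 4)   # a phase with little parallelism
p.observe("reduction", 6)
print(p.cores_for("reduction"))   # -> 5: the other cores can be released
```

Unknown phases default to all cores, so the policy only throttles once it has observed a phase at least once.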
A power-aware, self-adaptive macro data flow framework
The dataflow programming model has been extensively used as an effective solution to implement efficient parallel programming frameworks. However, the amount of resources allocated to the runtime support is usually fixed once by the programmer or the runtime, and kept static during the entire execution. While there are cases where such a static choice may be appropriate, other scenarios may require the parallelism degree to change dynamically during the application execution. In this paper we propose an algorithm for multicore shared-memory platforms that dynamically selects the optimal number of cores to be used as well as their clock frequency according to either the workload pressure or explicit user requirements. We implement the algorithm for both structured and unstructured parallel applications and we validate our proposal over three real applications, showing that it is able to save a significant amount of power, while not impairing performance and not requiring additional effort from the application programmer.
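The core decision described above, jointly picking a core count and a clock frequency that sustain the workload pressure at minimal power, can be illustrated with a toy search (the power and service-time models below are invented placeholders, not the paper's algorithm):

```python
# Illustrative sketch: choose the (cores, frequency) pair with the lowest power
# whose predicted throughput meets the current workload pressure. The linear
# power model and frequency-scaled service time are assumptions for illustration.

def choose_config(arrival_rate, freqs_ghz, max_cores,
                  base_service_time=1.0, power_per_core_per_ghz=2.0):
    best = None
    for f in freqs_ghz:
        service_time = base_service_time / f        # tasks finish faster at higher clocks
        for n in range(1, max_cores + 1):
            throughput = n / service_time           # tasks per second with n cores
            if throughput >= arrival_rate:          # config sustains the input pressure
                power = n * f * power_per_core_per_ghz
                if best is None or power < best[0]:
                    best = (power, n, f)
                break                               # adding cores only costs more power
    return best  # (power, cores, freq), or None if the pressure cannot be met

print(choose_config(arrival_rate=6.0, freqs_ghz=[1.0, 2.0], max_cores=8))
```

A real self-adaptive framework would rerun such a decision periodically as the measured workload pressure changes, rather than once at startup.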
Adaptive, efficient, parallel execution of parallel programs
Future multicore processors will be heterogeneous, be increasingly less reliable, and operate in dynamically changing operating conditions. Such environments will result in a constantly varying pool of hardware resources which can greatly complicate the task of efficiently exposing a program's parallelism onto these resources. Coupled with this uncertainty is the diverse set of efficiency metrics that users may desire. This paper proposes Varuna, a system that dynamically, continuously, rapidly and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna is applicable to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without needing to change the source code of either. We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered.
Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve the memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve the cache efficiency and the memory throughput. This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads to share the cache and the minimum number of threads bypassing the cache that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline. This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks based on the scheduling priority of their fetching warp at replacement. Dynamic-AgeLRU selects the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization.
Compared to the LRU algorithm, the three above-mentioned variants of the AgeLRU algorithm enable increases in performance of 4%, 8% and 28% respectively across a set of cache-sensitive benchmarks. This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority together with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize the cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that the RPR algorithm results in a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks. The techniques proposed in this dissertation are able to alleviate the cache thrashing, cache pollution and resource saturation problems effectively. We believe when these techniques are combined, they will synergistically further improve GPU cache efficiency and the overall memory system throughput.
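The AgeLRU idea, victim selection driven by the fetching warp's scheduling priority, can be sketched as follows (a simplified illustration with assumed details, not the dissertation's hardware design: the tuple layout and tie-breaking shown here are inventions for clarity):

```python
# Simplified sketch of AgeLRU-style victim selection: each cached block records
# the scheduling priority of the warp that fetched it; on replacement, the
# victim is a block fetched by the lowest-priority (youngest) warp, breaking
# ties by LRU order, so blocks of older, higher-priority warps stay cached.

def agelru_victim(cache_set):
    """cache_set: list of (block_addr, warp_priority, last_use_tick).

    Returns the address of the block to evict.
    """
    # Sort key: lowest warp priority first, then least recently used.
    return min(cache_set, key=lambda b: (b[1], b[2]))[0]

cache_set = [("A", 3, 10), ("B", 1, 50), ("C", 1, 20), ("D", 2, 5)]
print(agelru_victim(cache_set))   # -> "C": lowest warp priority, then LRU
```

A Dynamic-AgeLRU variant would additionally fall back to plain LRU when it detects that the workload is not thrashing.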
Adaptive parallelism mapping in dynamic environments using machine learning
Modern day hardware platforms are parallel and diverse, ranging from mobiles to
data centers. Mainstream parallel applications execute in the same system competing
for resources. This resource contention may lead to a drastic degradation in a program’s
performance. In addition, the execution environment, composed of workloads
and hardware resources, is dynamic and unpredictable. Efficient matching of program
parallelism to machine parallelism under uncertainty is hard. The mapping policies
that determine the optimal allocation of work to threads should anticipate these variations.
This thesis proposes solutions to the mapping of parallel programs in dynamic environments.
It employs predictive modelling techniques to determine the best degree of
parallelism. Firstly, this thesis proposes a machine learning-based model to determine
the optimal thread number for a target program co-executing with varying workloads.
For this purpose, the offline-trained model uses static code features and dynamic
runtime information as input.
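The first contribution, an offline-trained model mapping static code features plus dynamic runtime information to a thread count, can be illustrated with a toy predictor (the feature set, training samples, and nearest-neighbour model here are invented stand-ins; the thesis's actual model and features may differ):

```python
# Hypothetical illustration: predict the best thread count for a program from
# static code features plus dynamic runtime information, using a
# nearest-neighbour lookup over samples measured offline.

TRAIN = [
    # (loop_count, mem_ratio, co_runner_load) -> best thread count, found offline
    ((4, 0.2, 0.1), 8),
    ((4, 0.2, 0.9), 2),   # heavy co-running workload: fewer threads avoid contention
    ((16, 0.7, 0.5), 4),
]

def predict_threads(features):
    """Return the thread count of the closest offline training sample."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(TRAIN, key=lambda s: dist(s[0], features))[1]

print(predict_threads((4, 0.2, 0.8)))   # -> 2: closest to the contended sample
```

The dynamic feature (`co_runner_load` above) is what lets the same program receive different thread counts as the competing workload changes.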
Next, this thesis proposes a novel solution to monitor the proposed offline model
and adjust its decisions in response to the environment changes. It develops a second
predictive model that determines what the environment should look like if the current
thread prediction were optimal. Depending on how close this prediction is to the
actual environment, the predicted thread numbers are adjusted.
Furthermore, considering the multitude of potential execution scenarios where no
single policy is best suited in all cases, this work proposes an approach based on the
idea of mixture of experts. It considers a number of offline experts or mapping policies,
each specialized for a given scenario, and learns online which expert is best
for the current execution. When evaluated on highly dynamic executions, these solutions
are shown to surpass default, state-of-the-art adaptive, and analytic approaches.
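The mixture-of-experts approach above can be sketched with a standard multiplicative-weights update, a common online-learning rule for picking among experts (the thesis's actual learner may differ; the loss values and learning rate below are illustrative):

```python
# Sketch of online expert selection: each offline mapping policy is an expert,
# and the weight of experts whose suggestions perform poorly on the current
# execution decays over time (multiplicative-weights update).

def update_weights(weights, losses, eta=0.5):
    """losses in [0, 1], lower is better for the current execution interval."""
    new = [w * (1 - eta * l) for w, l in zip(weights, losses)]
    total = sum(new)
    return [w / total for w in new]          # renormalize to a distribution

w = [1 / 3] * 3                              # three mapping policies, initially equal
for losses in [[0.9, 0.1, 0.5]] * 4:         # expert 1 keeps performing best
    w = update_weights(w, losses)
best = max(range(3), key=lambda i: w[i])
print(best)   # -> 1: expert 1 dominates after a few intervals
```

The runtime would then apply the currently highest-weighted policy's mapping decision for the next interval, so a policy that is wrong for the current scenario is phased out rather than followed indefinitely.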