43,269 research outputs found
The Performance And Power Impact Of Using Multiple Dram Address Mapping Schemes In Multicore Processors
Lowest-level cache misses are satisfied by the main memory through a specific address mapping scheme that is hard-coded in the memory controller. A dynamic address mapping scheme technique is investigated to provide higher performance and lower power consumption, and a method to throttle memory to meet a specific power budget. Several experiments are conducted on single and multithreaded synthetic memory traces -to study extreme cases- and validate the usability of the proposed dynamic mapping scheme over the fixed one. Results show that applications’ performance varies according to the mapping scheme used, and a dynamic mapping scheme achieves up to 2x increase in peak bandwidth utilization and around 30% higher energy efficiency than a system using only a single fixed scheme Moreover, the technique can be used to limit memory accesses into a subset of the memory devices by controlling data allocation at a finer granularity, providing a method to throttle main memory by allowing unaccessed devices to be put into power-down mode, hence saving power to meet a certain power budget
An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines
International audienceNowadays shared memory HPC platforms expose a large number of cores organized in a hierarchical way. Parallel application programmers strug- gle to express more and more fine-grain parallelism and to ensure locality on such NUMA platforms. Independent loops stand as a natural source of paral- lelism. Parallel environments like OpenMP provide ways of parallelizing them efficiently, but the achieved performance is closely related to the choice of pa- rameters like the granularity of work or the loop scheduler. Considering that both can depend on the target computer, the input data and the loop workload, the application programmer most of the time fails at designing both portable and ef- ficient implementations. We propose in this paper a new OpenMP loop scheduler, called adaptive, that dynamically adapts the granularity of work considering the underlying system state. Our scheduler is able to perform dynamic load balancing while taking memory affinity into account on NUMA architectures. Results show that adaptive outperforms state-of-the-art OpenMP loop schedulers on memory- bound irregular applications, while obtaining performance comparable to static on parallel loops with a regular workload
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Recommended from our members
Computer-aided programming for multiprocessing systems
As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This report discusses parallel models of computation and tools for computer-aided programming (CAP). Program development tools are necessary since programmers are not able to develop complex parallel programs efficiently. In particular, a CAP tool, named Hypertool, is described here. It performs scheduling and handles the communication primitive insertion automatically so that many errors are eliminated. It also generates the performance estimates and other program quality measures to help programmers in improving their algorithms and programs. Experiments have shown that up to a 300% performance improvement can be achieved by computer-aided programming
- …