43,269 research outputs found

    The Performance And Power Impact Of Using Multiple Dram Address Mapping Schemes In Multicore Processors

    Get PDF
    Lowest-level cache misses are satisfied by the main memory through a specific address mapping scheme that is hard-coded in the memory controller. A dynamic address mapping scheme technique is investigated to provide higher performance and lower power consumption, and a method to throttle memory to meet a specific power budget. Several experiments are conducted on single and multithreaded synthetic memory traces -to study extreme cases- and validate the usability of the proposed dynamic mapping scheme over the fixed one. Results show that applications’ performance varies according to the mapping scheme used, and a dynamic mapping scheme achieves up to 2x increase in peak bandwidth utilization and around 30% higher energy efficiency than a system using only a single fixed scheme Moreover, the technique can be used to limit memory accesses into a subset of the memory devices by controlling data allocation at a finer granularity, providing a method to throttle main memory by allowing unaccessed devices to be put into power-down mode, hence saving power to meet a certain power budget

    An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines

    Get PDF
    International audienceNowadays shared memory HPC platforms expose a large number of cores organized in a hierarchical way. Parallel application programmers strug- gle to express more and more fine-grain parallelism and to ensure locality on such NUMA platforms. Independent loops stand as a natural source of paral- lelism. Parallel environments like OpenMP provide ways of parallelizing them efficiently, but the achieved performance is closely related to the choice of pa- rameters like the granularity of work or the loop scheduler. Considering that both can depend on the target computer, the input data and the loop workload, the application programmer most of the time fails at designing both portable and ef- ficient implementations. We propose in this paper a new OpenMP loop scheduler, called adaptive, that dynamically adapts the granularity of work considering the underlying system state. Our scheduler is able to perform dynamic load balancing while taking memory affinity into account on NUMA architectures. Results show that adaptive outperforms state-of-the-art OpenMP loop schedulers on memory- bound irregular applications, while obtaining performance comparable to static on parallel loops with a regular workload

    Towards an Adaptive Skeleton Framework for Performance Portability

    Get PDF
    The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregularly-parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregularly parallel programs. The approach combines declarative parallelism with JIT technology, dynamic scheduling, and dynamic transformation. We present the design of an adaptive skeleton library, with a task graph implementation, JIT trace costing, and adaptive transformations. We outline the architecture of the protoype adaptive skeleton execution framework in Pycket, describing tasks, serialisation, and the current scheduler.We report a preliminary evaluation of the prototype framework using 4 micro-benchmarks and a small case study on two NUMA servers (24 and 96 cores) and a small cluster (17 hosts, 272 cores). Key results include Pycket delivering good sequential performance e.g. almost as fast as C for some benchmarks; good absolute speedups on all architectures (up to 120 on 128 cores for sumEuler); and that the adaptive transformations do improve performance
    • …
    corecore