Search CORE

43,269 research outputs found

The Performance And Power Impact Of Using Multiple Dram Address Mapping Schemes In Multicore Processors

Author: Jadaa Rami
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2011
Field of study

Lowest-level cache misses are satisfied by the main memory through a specific address mapping scheme that is hard-coded in the memory controller. A dynamic address mapping scheme technique is investigated to provide higher performance and lower power consumption, and a method to throttle memory to meet a specific power budget. Several experiments are conducted on single and multithreaded synthetic memory traces -to study extreme cases- and validate the usability of the proposed dynamic mapping scheme over the fixed one. Results show that applications’ performance varies according to the mapping scheme used, and a dynamic mapping scheme achieves up to 2x increase in peak bandwidth utilization and around 30% higher energy efficiency than a system using only a single fixed scheme Moreover, the technique can be used to limit memory accesses into a subset of the memory devices by controlling data allocation at a finer granularity, providing a method to throttle main memory by allowing unaccessed devices to be put into power-down mode, hence saving power to meet a certain power budget

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines

Author: A. Mahéo
E. Ayguadé
F. Broquedis
F. Broquedis
L. Huang
M. Frigo
M. Ihmsen
S. Che
S. Subramaniam
S.L. Olivier
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

International audienceNowadays shared memory HPC platforms expose a large number of cores organized in a hierarchical way. Parallel application programmers strug- gle to express more and more fine-grain parallelism and to ensure locality on such NUMA platforms. Independent loops stand as a natural source of paral- lelism. Parallel environments like OpenMP provide ways of parallelizing them efficiently, but the achieved performance is closely related to the choice of pa- rameters like the granularity of work or the loop scheduler. Considering that both can depend on the target computer, the input data and the loop workload, the application programmer most of the time fails at designing both portable and ef- ficient implementations. We propose in this paper a new OpenMP loop scheduler, called adaptive, that dynamically adapts the granularity of work considering the underlying system state. Our scheduler is able to perform dynamic load balancing while taking memory affinity into account on NUMA architectures. Results show that adaptive outperforms state-of-the-art OpenMP loop schedulers on memory- bound irregular applications, while obtaining performance comparable to static on parallel loops with a regular workload

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Towards an Adaptive Skeleton Framework for Performance Portability

Author: Maier Patrick
Morton John Magnus
Trinder Phil
Publication venue: School of Computing Science, University of Glasgow
Publication date: 21/12/2015
Field of study

The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregularly-parallel problems, where the number and size of tasks is unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregularly parallel programs. The approach combines declarative parallelism with JIT technology, dynamic scheduling, and dynamic transformation. We present the design of an adaptive skeleton library, with a task graph implementation, JIT trace costing, and adaptive transformations. We outline the architecture of the protoype adaptive skeleton execution framework in Pycket, describing tasks, serialisation, and the current scheduler.We report a preliminary evaluation of the prototype framework using 4 micro-benchmarks and a small case study on two NUMA servers (24 and 96 cores) and a small cluster (17 hosts, 272 cores). Key results include Pycket delivering good sequential performance e.g. almost as fast as C for some benchmarks; good absolute speedups on all architectures (up to 120 on 128 cores for sumEuler); and that the adaptive transformations do improve performance

Enlighten

Recommended from our members

Computer-aided programming for multiprocessing systems

Author: Gajski Daniel D.
Wu Min-You
Publication venue: eScholarship, University of California
Publication date: 30/06/1988
Field of study

As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This report discusses parallel models of computation and tools for computer-aided programming (CAP). Program development tools are necessary since programmers are not able to develop complex parallel programs efficiently. In particular, a CAP tool, named Hypertool, is described here. It performs scheduling and handles the communication primitive insertion automatically so that many errors are eliminated. It also generates the performance estimates and other program quality measures to help programmers in improving their algorithms and programs. Experiments have shown that up to a 300% performance improvement can be achieved by computer-aided programming

eScholarship - University of California