4,264 research outputs found
Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
Energy efficiency is becoming increasingly important for computing systems,
in particular for large scale HPC facilities. In this work we evaluate, from an
user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS)
techniques, assisted by the power and energy monitoring capabilities of modern
processors in order to tune applications for energy efficiency. We run selected
kernels and a full HPC application on two high-end processors widely used in
the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate
the available trade-offs between energy-to-solution and time-to-solution,
attempting a function-by-function frequency tuning. We finally estimate the
benefits obtainable running the full code on a HPC multi-GPU node, with respect
to default clock frequency governors. We instrument our code to accurately
monitor power consumption and execution time without the need of any additional
hardware, and we enable it to change CPUs and GPUs clock frequencies while
running. We analyze our results on the different architectures using a simple
energy-performance model, and derive a number of energy saving strategies which
can be easily adopted on recent high-end HPC systems for generic applications
Runtime-guided mitigation of manufacturing variability in power-constrained multi-socket NUMA nodes
This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), by
the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund
programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). This work was also partially performed
under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-689878).
Finally, the authors are grateful to the reviewers for their valuable comments, to the RoMoL team, to Xavier Teruel and Kallia Chronaki from the Programming Models group
of BSC and the Computation Department of LLNL for their technical support and useful feedback.Peer ReviewedPostprint (published version
Measuring concurrency in CCS
A research report submitted to the Faculty of Science,
University of the Witwatersrand, Johannesburg,
in partial fulfilment of the requirements for
the degree of Master of ScienceThis research report investigates the application of Charron-Bost's measure of
currency m to Milner's Calculus of Communicating Systems (CCS). The aim of this
is twofold: first to evaluate the measure m in terms of criteria gathered from the
literature: and second to determine the feasiblllty of measuring concurrency in CCS
and hence provide a new tool for understanding concurrency using CCS. The approach
taken is to identify the differences hetween the message-passing formalism in
which the measure m is defined, and CCS and to modify this formalism to-enable
the mapping of CCS agents to it. A software tool, the Concurrency Measurement
Tool, is developed to permit experimentation with chosen CCS agents. These experiments
show that the measure m, although intuitively appealing, is defined by an
algebraic expression that is ill-behaved. A new measure is defined and it is shown
that it matches the evaluation criteria better than m, although it is still not ideal.
This work demonstrates that it is feasible to measure concurrency in CCS and that
a methodology has been developed for evaluating concurrency measures.Andrew Chakane 201
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
A simple approach to distributed objects in prolog
We present the design of a distributed object system for Prolog, based on adding remote execution and distribution capabilities to a previously existing object system. Remote execution brings RPC into a Prolog system, and its semantics is easy to express in terms of well-known Prolog builtins. The final distributed object design features state mobility and user-transparent network behavior. We sketch an implementation which provides distributed garbage collection and some degree of tolerance to network failures. We provide a preliminary study of the overhead of the communication mechanism for some test cases
Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications
Many scientific problems require multiple distinct computational tasks to be
executed in order to achieve a desired solution. We introduce the Ensemble
Toolkit (EnTK) to address the challenges of scale, diversity and reliability
they pose. We describe the design and implementation of EnTK, characterize its
performance and integrate it with two distinct exemplar use cases: seismic
inversion and adaptive analog ensembles. We perform nine experiments,
characterizing EnTK overheads, strong and weak scalability, and the performance
of two use case implementations, at scale and on production infrastructures. We
show how EnTK meets the following general requirements: (i) implementing
dedicated abstractions to support the description and execution of ensemble
applications; (ii) support for execution on heterogeneous computing
infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv)
fault tolerance. We discuss novel computational capabilities that EnTK enables
and the scientific advantages arising thereof. We propose EnTK as an important
addition to the suite of tools in support of production scientific computing
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model
We present a method for parallel block-sparse matrix-matrix multiplication on
distributed memory clusters. By using a quadtree matrix representation, data
locality is exploited without prior information about the matrix sparsity
pattern. A distributed quadtree matrix representation is straightforward to
implement due to our recent development of the Chunks and Tasks programming
model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined
with the Chunks and Tasks model leads to favorable weak and strong scaling of
the communication cost with the number of processes, as shown both
theoretically and in numerical experiments.
Matrices are represented by sparse quadtrees of chunk objects. The leaves in
the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by
the matrix library and may occur at any level in the hierarchy and/or within
the submatrix leaves. In case graphics processing units (GPUs) are available,
both CPUs and GPUs are used for leaf-level multiplication work, thus making use
of the full computing capacity of each node.
The performance is evaluated for matrices with different sparsity structures,
including examples from electronic structure calculations. Compared to methods
that do not exploit data locality, our locality-aware approach reduces
communication significantly, achieving essentially constant communication per
node in weak scaling tests.Comment: 35 pages, 14 figure
- …