6,239 research outputs found
A review of High Performance Computing foundations for scientists
The increase of existing computational capabilities has made simulation
emerge as a third discipline of Science, lying midway between experimental and
purely theoretical branches [1, 2]. Simulation enables the evaluation of
quantities which otherwise would not be accessible, helps to improve
experiments and provides new insights on systems which are analysed [3-6].
Knowing the fundamentals of computation can be very useful for scientists, for
it can help them to improve the performance of their theoretical models and
simulations. This review includes some technical essentials that can be useful
to this end, and it is devised as a complement for researchers whose education
is focused on scientific issues and not on technological respects. In this
document we attempt to discuss the fundamentals of High Performance Computing
(HPC) [7] in a way which is easy to understand without much previous
background. We sketch the way standard computers and supercomputers work, as
well as discuss distributed computing and discuss essential aspects to take
into account when running scientific calculations in computers.Comment: 33 page
DReAM: An approach to estimate per-Task DRAM energy in multicore systems
Accurate per-task energy estimation in multicore systems would allow performing per-task energy-aware task scheduling and energy-aware billing in data centers, among other applications. Per-task energy estimation is challenged by the interaction between tasks in shared resources, which impacts tasksâ energy consumption in uncontrolled ways. Some accurate mechanisms have been devised recently to estimate per-task energy consumed on-chip in multicores, but there is a lack of such mechanisms for DRAM memories. This article makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations. In particular, the contributions of this article are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate yet low cost implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is much more accurate than these other methods.Peer ReviewedPostprint (author's final draft
Exploring performance and power properties of modern multicore chips via simple machine models
Modern multicore chips show complex behavior with respect to performance and
power. Starting with the Intel Sandy Bridge processor, it has become possible
to directly measure the power dissipation of a CPU chip and correlate this data
with the performance properties of the running code. Going beyond a simple
bottleneck analysis, we employ the recently published Execution-Cache-Memory
(ECM) model to describe the single- and multi-core performance of streaming
kernels. The model refines the well-known roofline model, since it can predict
the scaling and the saturation behavior of bandwidth-limited loop kernels on a
multicore chip. The saturation point is especially relevant for considerations
of energy consumption. From power dissipation measurements of benchmark
programs with vastly different requirements to the hardware, we derive a
simple, phenomenological power model for the Sandy Bridge processor. Together
with the ECM model, we are able to explain many peculiarities in the
performance and power behavior of multicore processors, and derive guidelines
for energy-efficient execution of parallel programs. Finally, we show that the
ECM and power models can be successfully used to describe the scaling and power
behavior of a lattice-Boltzmann flow solver code.Comment: 23 pages, 10 figures. Typos corrected, DOI adde
Parallel Construction of Wavelet Trees on Multicore Architectures
The wavelet tree has become a very useful data structure to efficiently
represent and query large volumes of data in many different domains, from
bioinformatics to geographic information systems. One problem with wavelet
trees is their construction time. In this paper, we introduce two algorithms
that reduce the time complexity of a wavelet tree's construction by taking
advantage of nowadays ubiquitous multicore machines.
Our first algorithm constructs all the levels of the wavelet in parallel in
time and bits of working space, where
is the size of the input sequence and is the size of the alphabet. Our
second algorithm constructs the wavelet tree in a domain-decomposition fashion,
using our first algorithm in each segment, reaching time and
bits of extra space, where is the
number of available cores. Both algorithms are practical and report good
speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Understanding the thermal implications of multicore architectures
Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can be integrated on a chip. On the other hand, today's higher than ever power densities have made temperature one of the main limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are explored and improvements for thread migration schemes are proposed.Peer ReviewedPostprint (published version
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance
EPOBF: Energy Efficient Allocation of Virtual Machines in High Performance Computing Cloud
Cloud computing has become more popular in provision of computing resources
under virtual machine (VM) abstraction for high performance computing (HPC)
users to run their applications. A HPC cloud is such cloud computing
environment. One of challenges of energy efficient resource allocation for VMs
in HPC cloud is tradeoff between minimizing total energy consumption of
physical machines (PMs) and satisfying Quality of Service (e.g. performance).
On one hand, cloud providers want to maximize their profit by reducing the
power cost (e.g. using the smallest number of running PMs). On the other hand,
cloud customers (users) want highest performance for their applications. In
this paper, we focus on the scenario that scheduler does not know global
information about user jobs and user applications in the future. Users will
request shortterm resources at fixed start times and non interrupted durations.
We then propose a new allocation heuristic (named Energy-aware and Performance
per watt oriented Bestfit (EPOBF)) that uses metric of performance per watt to
choose which most energy-efficient PM for mapping each VM (e.g. maximum of MIPS
per Watt). Using information from Feitelson's Parallel Workload Archive to
model HPC jobs, we compare the proposed EPOBF to state of the art heuristics on
heterogeneous PMs (each PM has multicore CPU). Simulations show that the EPOBF
can reduce significant total energy consumption in comparison with state of the
art allocation heuristics.Comment: 10 pages, in Procedings of International Conference on Advanced
Computing and Applications, Journal of Science and Technology, Vietnamese
Academy of Science and Technology, ISSN 0866-708X, Vol. 51, No. 4B, 201
- âŠ