3,133 research outputs found
Development of an oceanographic application in HPC
High Performance Computing (HPC) is used for running advanced application programs
efficiently, reliably, and quickly.
In earlier decades, performance analysis of HPC applications was evaluated based on
speed, scalability of threads, memory hierarchy. Now, it is essential to consider the
energy or the power consumed by the system while executing an application.
In fact, the High Power Consumption (HPC) is one of biggest problems for the High
Performance Computing (HPC) community and one of the major obstacles for exascale
systems design.
The new generations of HPC systems intend to achieve exaflop performances and will
demand even more energy to processing and cooling. Nowadays, the growth of HPC
systems is limited by energy issues
Recently, many research centers have focused the attention on doing an automatic tuning
of HPC applications which require a wide study of HPC applications in terms of power
efficiency.
In this context, this paper aims to propose the study of an oceanographic application,
named OceanVar, that implements Domain Decomposition based 4D Variational model
(DD-4DVar), one of the most commonly used HPC applications, going to evaluate not
only the classic aspects of performance but also aspects related to power efficiency in
different case of studies.
These work were realized at Bsc (Barcelona Supercomputing Center), Spain within the
Mont-Blanc project, performing the test first on HCA server with Intel technology and then on a mini-cluster Thunder with ARM technology.
In this work of thesis it was initially explained the concept of assimilation date, the
context in which it is developed, and a brief description of the mathematical model
4DVAR.
After this problem’s close examination, it was performed a porting from Matlab
description of the problem of data-assimilation to its sequential version in C language.
Secondly, after identifying the most onerous computational kernels in order of time, it
has been developed a parallel version of the application with a parallel multiprocessor
programming style, using the MPI (Message Passing Interface) protocol.
The experiments results, in terms of performance, have shown that, in the case of
running on HCA server, an Intel architecture, values of efficiency of the two most
onerous functions obtained, growing the number of process, are approximately equal to
80%.
In the case of running on ARM architecture, specifically on Thunder mini-cluster,
instead, the trend obtained is labeled as "SuperLinear Speedup" and, in our case, it can
be explained by a more efficient use of resources (cache memory access) compared with
the sequential case.
In the second part of this paper was presented an analysis of the some issues of this
application that has impact in the energy efficiency.
After a brief discussion about the energy consumption characteristics of the Thunder
chip in technological landscape, through the use of a power consumption detector, the
Yokogawa Power Meter, values of energy consumption of mini-cluster Thunder were
evaluated in order to determine an overview on the power-to-solution of this application
to use as the basic standard for successive analysis with other parallel styles.
Finally, a comprehensive performance evaluation, targeted to estimate the goodness of
MPI parallelization, is conducted using a suitable performance tool named Paraver,
developed by BSC.
Paraver is such a performance analysis and visualisation tool which can be used to
analyse MPI, threaded or mixed mode programmes and represents the key to perform a parallel profiling and to optimise the code for High Performance Computing.
A set of graphical representation of these statistics make it easy for a developer to
identify performance problems. Some of the problems that can be easily identified are
load imbalanced decompositions, excessive communication overheads and poor average
floating operations per second achieved.
Paraver can also report statistics based on hardware counters, which are provided by the
underlying hardware.
This project aimed to use Paraver configuration files to allow certain metrics to be
analysed for this application.
To explain in some way the performance trend obtained in the case of analysis on the
mini-cluster Thunder, the tracks were extracted from various case of studies and the
results achieved is what expected, that is a drastic drop of cache misses by the case ppn
(process per node) = 1 to case ppn = 16.
This in some way explains a more efficient use of cluster resources with an increase of
the number of processes
QPACE 2 and Domain Decomposition on the Intel Xeon Phi
We give an overview of QPACE 2, which is a custom-designed supercomputer
based on Intel Xeon Phi processors, developed in a collaboration of Regensburg
University and Eurotech. We give some general recommendations for how to write
high-performance code for the Xeon Phi and then discuss our implementation of a
domain-decomposition-based solver and present a number of benchmarks.Comment: plenary talk at Lattice 2014, to appear in the conference proceedings
PoS(LATTICE2014), 15 pages, 9 figure
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programmin
A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem
Branch-and-Bound (B&B) algorithms are time intensive tree-based exploration
methods for solving to optimality combinatorial optimization problems. In this
paper, we investigate the use of GPU computing as a major complementary way to
speed up those methods. The focus is put on the bounding mechanism of B&B
algorithms, which is the most time consuming part of their exploration process.
We propose a parallel B&B algorithm based on a GPU-accelerated bounding model.
The proposed approach concentrate on optimizing data access management to
further improve the performance of the bounding mechanism which uses large and
intermediate data sets that do not completely fit in GPU memory. Extensive
experiments of the contribution have been carried out on well known FSP
benchmarks using an Nvidia Tesla C2050 GPU card. We compared the obtained
performances to a single and a multithreaded CPU-based execution. Accelerations
up to x100 are achieved for large problem instances
- …