1,643 research outputs found
Using Cognitive Computing for Learning Parallel Programming: An IBM Watson Solution
While modern parallel computing systems provide high performance resources,
utilizing them to the highest extent requires advanced programming expertise.
Programming for parallel computing systems is much more difficult than
programming for sequential systems. OpenMP is an extension of the C++ programming
language that lets programmers express parallelism using compiler directives. While
OpenMP eases parallel programming by reducing the lines of code that the
programmer needs to write, deciding how and when to use these compiler
directives is up to the programmer. Novice programmers may make mistakes that
may lead to performance degradation or unexpected program behavior. Cognitive
computing has shown impressive results in various domains, such as health or
marketing. In this paper, we describe the use of the IBM Watson cognitive system
for the education of novice parallel programmers. Using the dialogue service of
IBM Watson, we have developed a solution that assists the programmer in avoiding
common OpenMP mistakes. To evaluate our approach, we conducted a survey
with a number of novice parallel programmers at Linnaeus University and
obtained encouraging results with respect to the usefulness of our approach.
Large-Eddy Simulations of Flow and Heat Transfer in Complex Three-Dimensional Multilouvered Fins
The paper describes the computational procedure and
results from large-eddy simulations in a complex three-dimensional
louver geometry. The three-dimensionality in the
louver geometry occurs along the height of the fin, where the
angled louver transitions to the flat landing and joins with the
tube surface. The transition region is characterized by a swept
leading edge and decreasing flow area between louvers.
Preliminary results show a high energy compact vortex jet
forming in this region. The jet forms in the vicinity of the louver
junction with the flat landing and is drawn under the louver in
the transition region. Its interaction with the surface of the
louver produces vorticity of the opposite sign, which aids in
augmenting heat transfer on the louver surface. The top surface
of the louver in the transition region experiences large velocities
in the vicinity of the surface and exhibits higher heat transfer
coefficients than the bottom surface.
Air Conditioning and Refrigeration Project 9
Power efficient job scheduling by predicting the impact of processor manufacturing variability
Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability.
In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that they decrease job turnaround time by up to 31% compared to contemporary scheduling policies used on production clusters, while saving up to 5.5% energy.
Postprint (author's final draft)
Hierarchical Parallelisation of Functional Renormalisation Group Calculations -- hp-fRG
The functional renormalisation group (fRG) has evolved into a versatile tool
in condensed matter theory for studying important aspects of correlated
electron systems. Practical applications of the method often involve a high
numerical effort, raising the question of how far High Performance Computing
(HPC) can benefit the approach. In this work we report on a multi-level
parallelisation of the underlying computational machinery and show that this
can speed up the code by several orders of magnitude. This in turn can extend
the applicability of the method to otherwise inaccessible cases. We exploit
three levels of parallelisation: Distributed computing by means of Message
Passing (MPI), shared-memory computing using OpenMP, and vectorisation by means
of SIMD units (single-instruction-multiple-data). Results are provided for two
distinct High Performance Computing (HPC) platforms, namely the IBM-based
BlueGene/Q system JUQUEEN and an Intel Sandy-Bridge-based development cluster.
We discuss how certain issues and obstacles were overcome in the course of
adapting the code. Most importantly, we conclude that this vast improvement can
actually be accomplished by introducing only moderate changes to the code, such
that this strategy may serve as a guideline for other researchers to likewise
improve the efficiency of their codes.
Domain knowledge specification for energy tuning
To overcome the challenges of energy consumption of HPC systems, the European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy-efficient Exascale computing) project uses an online auto-tuning approach to improve the energy efficiency of HPC applications. The READEX methodology pre-computes optimal system configurations, such as the CPU frequency, at design time for instances of program regions and switches at runtime to the configuration given in the tuning model when the region is executed. READEX goes beyond previous approaches by exploiting dynamic changes in a region's characteristics through region- and characteristic-specific system configurations. While the tool suite supports an automatic approach, specifying domain knowledge such as the structure and characteristics of the application and application tuning parameters can significantly help to create a more refined tuning model. This paper presents the means available for an application expert to provide domain knowledge and presents tuning results for some benchmarks.
Web of Science316art. no. E465
NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units
We present and discuss the characteristics and performance, both in terms of
computational speed and precision, of a code which numerically
integrates the equations of motion of N 'particles' that interact via Newtonian
gravitation and move in a smooth external galactic field. The force
on every particle is evaluated by direct summation of the contributions of
all the other particles in the system, avoiding truncation error. The time
integration is done with second-order and sixth-order symplectic schemes. The
code, NBSymple, has been parallelized twice: by means of the Compute Unified
Device Architecture (CUDA), to make the all-pair force evaluation as fast as
possible on high-performance NVIDIA TESLA C 1060 Graphic Processing Units, while
the O(N) computations are distributed over various CPUs by means of the OpenMP
application programming interface. The code works in either single-precision or
double-precision floating-point arithmetic. Single precision makes the best use
of GPU performance but, of course, limits the precision of the simulation in some
critical situations. We find a good compromise in using a software
reconstruction of double precision for those variables that are most critical
for the overall precision of the code. The code is available on the web site
astrowww.phys.uniroma1.it/dolcetta/nbsymple.html
Comment: Paper composed of 29 pages, including 9 figures. Submitted to New
Astronomy
Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite
The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. Finally, we evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.
FISH: A 3D parallel MHD code for astrophysical applications
FISH is a fast and simple ideal magneto-hydrodynamics code that scales to
~10 000 processes for a Cartesian computational domain of ~1000^3 cells. The
simplicity of FISH has been achieved by the rigorous application of the
operator splitting technique, while second order accuracy is maintained by the
symmetric ordering of the operators. Between directional sweeps, the
three-dimensional data is rotated in memory so that the sweep is always
performed in a cache-efficient way along the direction of contiguous memory.
Hence, the code only requires a one-dimensional description of the conservation
equations to be solved. This approach also enables an elegant novel
parallelisation of the code that is based on persistent communications with MPI
for cubic domain decomposition on machines with distributed memory. This scheme
is then combined with an additional OpenMP parallelisation of different sweeps
that can take advantage of clusters of shared memory. We document the detailed
implementation of a second order TVD advection scheme based on flux
reconstruction. The magnetic fields are evolved by a constrained transport
scheme. We show that the subtraction of a simple estimate of the hydrostatic
gradient from the total gradients can significantly reduce the dissipation of
the advection scheme in simulations of gravitationally bound hydrostatic
objects. Through its simplicity and efficiency, FISH is as well-suited for
hydrodynamics classes as for large-scale astrophysical simulations on
high-performance computer clusters. In preparation for the release of a public
version, we demonstrate the performance of FISH in a suite of astrophysically
orientated test cases.
Comment: 27 pages, 11 figures