361 research outputs found
An investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation
Optimization of atmospheric transport models on HPC platforms
The performance and scalability of atmospheric transport models on high performance computing environments is often far from optimal for multiple reasons including, for example, sequential input and output, synchronous communications, work unbalance, memory access latency or lack of task overlapping. We investigate how different software optimizations and porting to non general-purpose hardware architectures improve code scalability and execution times considering, as an example, the FALL3D volcanic ash transport model. To this purpose, we implement the FALL3D model equations in the WARIS framework, a software designed from scratch to solve in a parallel and efficient way different geoscience problems on a wide variety of architectures. In addition, we consider further improvements in WARIS such as hybrid MPI-OMP parallelization, spatial blocking, auto-tuning and thread affinity. Considering all these aspects together, the FALL3D execution times for a realistic test case running on general-purpose cluster architectures (Intel Sandy Bridge) decrease by a factor between 7 and 40 depending on the grid resolution. Finally, we port the application to Intel Xeon Phi (MIC) and NVIDIA GPUs (CUDA) accelerator-based architectures and compare performance, cost and power consumption on all the architectures. Implications on time-constrained operational model configurations are discussed.We thank M.S. Osores from the Argentinean National Scientific and Technical Research Council (CONICET) for providing hourly
column heights for the Cordón Caulle eruption simulation and the constructive
comments from two anonymous reviewers. This work was supported by NVIDIA through the UPC/BSC GPU Center of Excellence,
and the Spanish Ministry of Science and Technology through the TIN2012-34557 project. Finally, we dedicate this work to our colleague and co-author Nacho Navarro, who sadly passed away during the reviewing process.Peer ReviewedPostprint (author's final draft
Effective Large Scale Computing Software for Parallel Mesh Generation
Scientists commonly turn to supercomputers or Clusters of Workstations with hundreds (even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software is then used to decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The size of the final mesh is limited by the amount of aggregate memory of the parallel machine. Also, requesting many compute nodes on a shared computing resource may result in a long waiting, far surpassing the time it takes to solve the problem.;These two problems (i.e., insufficient memory when computing on a small number of nodes, and long waiting times when using many nodes from a shared computing resource) can be addressed by using out-of-core algorithms. These are algorithms that keep most of the dataset out-of-core (i.e., outside of memory, on disk) and load only a portion in-core (i.e., into memory) at a time.;We explored two approaches to out-of-core computing. First, we presented a traditional approach, which is to modify the existing in-core algorithms to enable out-of-core computing. While we achieved good performance with this approach the task is complex and labor intensive. An alternative approach, we presented a runtime system designed to support out-of-core applications. It requires little modification of the existing in-core application code and still produces acceptable results. Evaluation of the runtime system showed little performance degradation while simplifying and shortening the development cycle of out-of-core applications. The overhead from using the runtime system for small problem sizes is between 12% and 41% while the overlap of computation, communication and disk I/O is above 50% and as high as 61% for large problems.;The main contribution of our work is the ability to utilize computing resources more effectively. The user has a choice of either solving larger problems, that otherwise would not be possible, or solving problems of the same size but using fewer computing nodes, thus reducing the waiting time on shared clusters and supercomputers. We demonstrated that the latter could potentially lead to substantially shorter wall-clock time
Porting of DSMC to multi-GPUs using OpenACC
The Direct Simulation Monte Carlo has become the method of choice for studying gas flows characterized by variable rarefaction and non-equilibrium effects, rising interest in industry for simulating flows in micro-, and nano-electromechanical systems.
However, rarefied gas dynamics represents an open research challenge from the computer science perspective, due to its computational expense compared to continuum computational fluid dynamics methods.
Fortunately, over the last decade, high-performance computing has seen an exponential growth of performance. Actually, with the breakthrough of General-Purpose GPU computing, heterogeneous systems have become widely used for scientific computing, especially in large-scale clusters and supercomputers.
Nonetheless, developing efficient, maintainable and portable applications for hybrid systems is, in general, a non-trivial task.
Among the possible approaches, directive-based programming models, such as OpenACC, are considered the most promising for porting scientific codes to hybrid CPU/GPU systems, both for their simplicity and portability.
This work is an attempt to port a simplified version of the fm dsmc code developed at FLOW Matters Consultancy B.V., a start-up company supporting this project, on a multi-GPU distributed hybrid system, such as Marconi100 hosted at CINECA, using OpenACC.
Finally, we perform a detailed performance analysis of our DSMC application on Volta (NVIDIA V100 GPU) architecture based computing platform as well as a comparison with previous results obtained with x64 86 (Intel Xeon CPU) and ppc64le (IBM Power9 CPU) architectures
Analysis of hybrid parallelization strategies: simulation of Anderson localization and Kalman Filter for LHCb triggers
This thesis presents two experiences of hybrid programming applied to condensed matter and high energy physics. The two projects differ in various aspects, but both of them aim to analyse the benefits of using accelerated hardware to speedup the calculations in current science-research scenarios. The first project enables massively parallelism in a simulation of the Anderson localisation phenomenon in a disordered quantum system. The code represents a Hamiltonian in momentum space, then it executes a diagonalization of the corresponding matrix using linear algebra libraries, and finally it analyses the energy-levels spacing statistics averaged over several realisations of the disorder. The implementation combines different parallelization approaches in an hybrid scheme. The averaging over the ensemble of disorder realisations exploits massively parallelism with a master-slave configuration based on both multi-threading and message passing interface (MPI). This framework is designed and implemented to easily interface similar application commonly adopted in scientific research, for example
in Monte Carlo simulations. The diagonalization uses multi-core and GPU hardware interfacing with MAGMA, PLASMA or MKL libraries. The access to the libraries is modular to guarantee portability, maintainability and the extension in a near future.
The second project is the development of a Kalman Filter, including the porting on GPU architectures and autovectorization for online LHCb triggers. The developed codes provide information about the viability and advantages for the application of GPU technologies in the first triggering step for Large Hadron Collider beauty experiment (LHCb). The optimisation introduced on both codes for CPU and GPU delivered a relevant speedup on the Kalman Filter. The two GPU versions in CUDA R and OpenCLTM have similar performances and are adequate to be considered in the upgrade and in the corresponding implementations of the Gaudi framework. In both projects we implement optimisation techniques in the CPU code. This report presents extensive
benchmark analyses of the correctness and of the performances for both projects
On Evaluating Commercial Cloud Services: A Systematic Review
Background: Cloud Computing is increasingly booming in industry with many
competing providers and services. Accordingly, evaluation of commercial Cloud
services is necessary. However, the existing evaluation studies are relatively
chaotic. There exists tremendous confusion and gap between practices and theory
about Cloud services evaluation. Aim: To facilitate relieving the
aforementioned chaos, this work aims to synthesize the existing evaluation
implementations to outline the state-of-the-practice and also identify research
opportunities in Cloud services evaluation. Method: Based on a conceptual
evaluation model comprising six steps, the Systematic Literature Review (SLR)
method was employed to collect relevant evidence to investigate the Cloud
services evaluation step by step. Results: This SLR identified 82 relevant
evaluation studies. The overall data collected from these studies essentially
represent the current practical landscape of implementing Cloud services
evaluation, and in turn can be reused to facilitate future evaluation work.
Conclusions: Evaluation of commercial Cloud services has become a world-wide
research topic. Some of the findings of this SLR identify several research gaps
in the area of Cloud services evaluation (e.g., the Elasticity and Security
evaluation of commercial Cloud services could be a long-term challenge), while
some other findings suggest the trend of applying commercial Cloud services
(e.g., compared with PaaS, IaaS seems more suitable for customers and is
particularly important in industry). This SLR study itself also confirms some
previous experiences and reveals new Evidence-Based Software Engineering (EBSE)
lessons
Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell
As the number of cores in manycore systems grows exponentially, the number of failures is
also predicted to grow exponentially. Hence massively parallel computations must be able to
tolerate faults. Moreover new approaches to language design and system architecture are needed
to address the resilience of massively parallel heterogeneous architectures.
Symbolic computation has underpinned key advances in Mathematics and Computer Science,
for example in number theory, cryptography, and coding theory. Computer algebra software
systems facilitate symbolic mathematics. Developing these at scale has its own distinctive
set of challenges, as symbolic algorithms tend to employ complex irregular data and control
structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel
High Performance Computing platforms. A key element of SymGridParII is a domain specific
language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for
scalable distributed-memory parallelism, and employs work stealing to load balance dynamically
generated irregular task sizes.
To investigate providing scalable fault tolerant symbolic computation we design, implement
and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles
faults, using task replication as a key recovery strategy. The scheduler supports load balancing
with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault
tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel
skeletons that encapsulate common parallel programming patterns. The user is oblivious to
many failures, they are instead handled by the scheduler.
An operational semantics describes small-step reductions on states. A simple abstract machine
for scheduling transitions and task evaluation is presented. It defines the semantics of
supervised futures, and the transition rules for recovering tasks in the presence of failure. The
transition rules are demonstrated with a fault-free execution, and three executions that recover
from faults.
The fault tolerant work stealing has been abstracted in to a Promela model. The SPIN
model checker is used to exhaustively search the intersection of states in this automaton to
validate a key resiliency property of the protocol. It asserts that an initially empty supervised
future on the supervisor node will eventually be full in the presence of all possible combinations
of failures.
The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling
achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when
executing Summatory Liouville up to 1400 cores of a HPC architecture. Moreover, supervision
overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in
the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey
mechanism has been developed for stress testing resiliency with random failure combinations.
All unit tests pass in the presence of random failure, terminating with the expected results
Runtime support for load balancing of parallel adaptive and irregular applications
Applications critical to today\u27s engineering research often must make use of the increased memory and processing power of a parallel machine. While advances in architecture design are leading to more and more powerful parallel systems, the software tools needed to realize their full potential are in a much less advanced state. In particular, efficient, robust, and high-performance runtime support software is critical in the area of dynamic load balancing. While the load balancing of loosely synchronous codes, such as field solvers, has been studied extensively for the past 15 years, there exists a class of problems, known as asynchronous and highly adaptive , for which the dynamic load balancing problem remains open. as we discuss, characteristics of this class of problems render compile-time or static analysis of little benefit, and complicate the dynamic load balancing task immensely.;We make two contributions to this area of research. The first is the design and development of a runtime software toolkit, known as the Parallel Runtime Environment for Multi-computer Applications, or PREMA, which provides interprocessor communication, a global namespace, a framework for the implementation of customized scheduling policies, and several such policies which are prevalent in the load balancing literature. The PREMA system is designed to support coarse-grained domain decompositions with the goals of portability, flexibility, and maintainability in mind, so that developers will quickly feel comfortable incorporating it into existing codes and developing new codes which make use of its functionality. We demonstrate that the programming model and implementation are efficient and lead to the development of robust and high-performance applications.;Our second contribution is in the area of performance modeling. In order to make the most effective use of the PREMA runtime software, certain parameters governing its execution must be set off-line. Optimal values for these parameters may be determined through repeated executions of the target application; however, this is not always possible, particularly in large-scale environments and long-running applications. We present an analytic model that allows the user to quickly and inexpensively predict application performance and fine-tune applications built on the PREMA platform
Supercomputing Frontiers
This open access book constitutes the refereed proceedings of the 7th Asian Conference Supercomputing Conference, SCFA 2022, which took place in Singapore in March 2022. The 8 full papers presented in this book were carefully reviewed and selected from 21 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platform, container image configuration workflow, large-scale applications, and scheduling
- …