Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes
OpenMP was not well recognized at the beginning of the project, around 2003, because of its limited use in DoE production applications and the immature hardware support for an efficient implementation. In recent years, however, it has gradually been adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked diligently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to ensure that OpenMP evolves in a direction aligned with DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, and runtime library support.
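The MPI+OpenMP hybrid style mentioned above pairs message passing between nodes with shared-memory threading within each node. A minimal sketch of the pattern in C (illustrative only, not taken from the project's codebase):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* Request thread support so OpenMP threads can coexist with MPI;
         * FUNNELED means only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared-memory parallelism within the node. */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i);   /* stand-in for real per-rank work */

        /* Message passing across nodes: combine the per-rank partial sums. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }

Compiled with, e.g., mpicc -fopenmp, each rank runs as one multithreaded process per node, the usual deployment for this hybrid model.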
X10 for high-performance scientific computing
High performance computing is a key technology that enables large-scale physical simulation in modern science. While great advances have been made in methods and algorithms for scientific computing, the most commonly used programming models encourage a fragmented view of computation that maps poorly to the underlying computer architecture.
Scientific applications typically manifest physical locality, meaning that interactions between entities or events that are nearby in space or time are stronger than more distant interactions. Linear-scaling methods exploit physical locality by approximating distant interactions, reducing computational complexity so that cost is proportional to system size. In these methods, the computation required for each portion of the system differs depending on that portion’s contribution to the overall result. To support productive development, application programmers need programming models that cleanly map aspects of the physical system being simulated to the underlying computer architecture while also supporting the irregular workloads that arise from the fragmentation of a physical system.
X10 is a new programming language for high-performance computing that uses the asynchronous partitioned global address space (APGAS) model, which combines explicit representation of locality with asynchronous task parallelism. This thesis argues that the X10 language is well suited to expressing the algorithmic properties of locality and irregular parallelism that are common to many methods for physical simulation.
The work reported in this thesis was part of a co-design effort involving researchers at IBM and ANU, in which two significant computational chemistry codes were developed in X10 with the aim of improving the expressiveness and performance of the language. The first is a Hartree–Fock electronic structure code, implemented using the novel Resolution of the Coulomb Operator approach. The second evaluates electrostatic interactions between point charges, using either the smooth particle mesh Ewald method or the fast multipole method; the latter is used to simulate ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer. We compare the performance of both X10 applications to state-of-the-art software packages written in other languages.
This thesis presents improvements to the X10 language and runtime libraries for managing and visualizing the data locality of parallel tasks, communication using active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples.
This work demonstrates that X10 can achieve performance comparable to established programming languages when running on a single core. More importantly, X10 programs can achieve high parallel efficiency on a multithreaded architecture, given a divide-and-conquer pattern of parallel tasks and appropriate use of worker-local data. For distributed memory architectures, X10 supports the use of active messages to construct local, asynchronous communication patterns which outperform global, synchronous patterns. Although point-to-point active messages may be implemented efficiently, productive application development also requires collective communications; more work is required to integrate both forms of communication in the X10 language.
The exploitation of locality is the key insight in both linear-scaling methods and the APGAS programming model; their combination represents an attractive opportunity for future co-design efforts.
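The thesis's X10 code is not reproduced in this abstract; as a rough analogue of the divide-and-conquer task pattern it credits for multithreaded efficiency, here is a sketch in C using OpenMP tasks, with taskwait standing in for X10's finish construct (hypothetical code, not from the thesis):

    #include <stdio.h>

    /* Recursively sum an array by spawning a task for one half and
     * recursing on the other, with a sequential cutoff for small
     * subproblems: the divide-and-conquer task pattern described above. */
    static double sum(const double *a, int n) {
        if (n < 1024) {                      /* sequential cutoff */
            double s = 0.0;
            for (int i = 0; i < n; i++) s += a[i];
            return s;
        }
        double left, right;
        #pragma omp task shared(left)        /* child task, like X10 async */
        left = sum(a, n / 2);
        right = sum(a + n / 2, n - n / 2);
        #pragma omp taskwait                 /* join, like X10 finish */
        return left + right;
    }

    int main(void) {
        static double a[1 << 20];
        for (int i = 0; i < (1 << 20); i++) a[i] = 1.0;
        double total = 0.0;
        #pragma omp parallel
        #pragma omp single                   /* one thread seeds the task tree */
        total = sum(a, 1 << 20);
        printf("total = %f\n", total);       /* expect 1048576.0 */
        return 0;
    }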
Techniques, Tricks and Algorithms for Efficient GPU-Based Processing of Higher Order Hyperbolic PDEs
GPU computing is expected to play an integral part in all modern exascale supercomputers, and higher order Godunov schemes are expected to make up a significant fraction of the application mix on such machines. It is therefore very important to prepare the community of users of higher order schemes for hyperbolic PDEs for this emerging opportunity. We focus on three broad, high-impact areas where higher order Godunov schemes are used. The first is computational fluid dynamics (CFD). The second is computational magnetohydrodynamics (MHD), which has an involution constraint that has to be mimetically preserved. The third is computational electrodynamics (CED), which has involution constraints and also extremely stiff source terms. Together, these three diverse uses of higher order Godunov methodology cover many of the most important application areas. In all three cases, we show that the optimal use of algorithms, techniques and tricks, along with the use of OpenACC, yields superlative speedups on GPUs. As a bonus, we find a most remarkable and desirable result: some higher order schemes, with their larger operation count per zone, show better speedup than lower order schemes on GPUs. In other words, the GPU is an optimal stratagem for overcoming the higher computational complexity of higher order schemes. Several avenues for future improvement have also been identified. A scalability study is presented for a real-world application using GPUs and comparable numbers of high-end multicore CPUs. It is found that GPUs offer a substantial performance benefit over a comparable number of CPUs, especially when all the methods designed in this paper are used.
Comment: 73 pages, 17 figures
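Since the paper's kernels are not reproduced in this abstract, here is a minimal sketch of the OpenACC approach it names, in C, using a hypothetical first-order upwind update for linear advection (higher order schemes add reconstruction and Riemann-solver work per zone, which is the extra arithmetic the paper reports as mapping well to GPUs):

    #include <stdio.h>

    #define N 1024

    /* One upwind (Godunov-type) update step for linear advection with
     * CFL number c, offloaded to the GPU via OpenACC. */
    void advect(const double *restrict u, double *restrict un, double c) {
        #pragma acc parallel loop copyin(u[0:N]) copyout(un[1:N-2])
        for (int i = 1; i < N - 1; i++)
            un[i] = u[i] - c * (u[i] - u[i - 1]);   /* assumes wave speed > 0 */
        un[0] = u[0];                /* crude boundary handling on the host */
        un[N - 1] = u[N - 1];
    }

    int main(void) {
        static double u[N], un[N];
        for (int i = 0; i < N; i++)                 /* square-wave initial data */
            u[i] = (i > N / 4 && i < N / 2) ? 1.0 : 0.0;
        advect(u, un, 0.5);
        printf("un[%d] = %f\n", N / 4 + 1, un[N / 4 + 1]);
        return 0;
    }

Built with an OpenACC compiler (e.g., nvc -acc), the same annotated loop compiles for either CPU or GPU, which is part of what makes this approach attractive for migrating existing higher order codes.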