60 research outputs found
Introducing Molly: Distributed Memory Parallelization with LLVM
Programming for distributed memory machines has always been a tedious task, but a necessary one, because compilers have not been able to optimize sufficiently well for such machines themselves. Molly is an extension to the LLVM compiler toolchain that can distribute and reorganize workload and data if the program is organized as statically determined loop control flow. Such loops are represented as polyhedral integer-point sets, which allow program transformations to be applied to them. Memory distribution and layout can be declared by the programmer as needed, and the necessary asynchronous MPI communication is generated automatically. The primary motivation is to run Lattice QCD simulations on IBM Blue Gene/Q supercomputers, but since the implementation is not yet complete, this paper demonstrates its capabilities on Conway's Game of Life.
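The kind of statically determined loop nest that such polyhedral analysis targets can be illustrated with a plain Game of Life step. This is a sketch of the workload, not Molly's API: the loop bounds depend only on the grid parameters, so the iteration domain is exactly the integer-point set {(i, j) : 0 <= i < n, 0 <= j < m} that a polyhedral compiler can transform and distribute.

```python
# Illustrative sketch (not Molly itself): a Game of Life step whose loop
# bounds are statically known, so the iteration domain is the polyhedral
# integer-point set {(i, j) : 0 <= i < n, 0 <= j < m}.

def life_step(grid):
    n, m = len(grid), len(grid[0])
    new = [[0] * m for _ in range(n)]
    for i in range(n):          # statically determined control flow:
        for j in range(m):      # bounds depend only on parameters n, m
            live = sum(grid[(i + di) % n][(j + dj) % m]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)
                       if (di, dj) != (0, 0))
            new[i][j] = 1 if live == 3 or (grid[i][j] and live == 2) else 0
    return new
```

A vertical "blinker" of three cells, for example, becomes horizontal after one step and returns to vertical after two.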
Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models
Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Designing programs productively and effectively to exploit these hardware features is crucial to achieving the best performance. Several highly parallel programming models in active development allow programmers to write efficient code for such architectures, and performance profiling is an essential technique for getting the most out of them.
In this dissertation, I propose a new performance measurement and mapping technique that associates performance data with program variables instead of code blocks. To validate the applicability of this data-centric profiling idea, I designed and implemented profilers for PGAS and CUDA. For PGAS, I developed ChplBlamer for both single-node and multi-node Chapel programs; the tool also provides new features such as data-centric inter-node load-imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer likewise attributes performance data to program variables, a feature not found in any previous CUDA profiler. Guided by the insights from these tools, I optimized several widely studied benchmarks and significantly improved program performance, by a factor of up to 4x for Chapel and 47x for CUDA kernels.
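The core of data-centric attribution is mapping a sampled memory address back to the variable whose allocation contains it, rather than to the code location where the sample fired. A hypothetical minimal sketch of that mapping (not ChplBlamer's or CUDABlamer's actual implementation):

```python
# Hypothetical sketch of data-centric attribution: keep a sorted table of
# allocations and map each sampled address to the owning variable.

import bisect

class DataCentricProfiler:
    def __init__(self):
        self._starts, self._ranges = [], []   # kept sorted by base address

    def note_alloc(self, name, base, size):
        idx = bisect.bisect_left(self._starts, base)
        self._starts.insert(idx, base)
        self._ranges.insert(idx, (base, base + size, name))

    def attribute(self, samples):
        """Count samples per variable; addresses outside any range -> 'unknown'."""
        counts = {}
        for addr in samples:
            idx = bisect.bisect_right(self._starts, addr) - 1
            name = "unknown"
            if idx >= 0:
                lo, hi, n = self._ranges[idx]
                if lo <= addr < hi:
                    name = n
            counts[name] = counts.get(name, 0) + 1
        return counts
```

With an allocation table of `A` at 0x1000 (256 bytes) and `B` at 0x2000 (64 bytes), the samples `[0x1004, 0x1008, 0x2010, 0x3000]` attribute as two hits to `A`, one to `B`, and one unknown.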
Maximizing Communication Overlap with Dynamic Program Analysis
We present a dynamic program analysis approach to optimizing communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis then determines the optimal program points for maximal overlap, considering several programming constructs: non-blocking one-sided communication operations, non-blocking collectives, and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user, and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across the boundaries of software modules or libraries while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the expected performance improvements after communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap, which indicates that manual optimization leaves untapped performance. Although demonstrated mainly for the UPC programming language, the methodology can easily be adapted to any other communication and synchronization API.
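The underlying window computation can be sketched on a toy trace. This is an illustrative model, not the paper's tool (which instruments real binaries): a non-blocking operation on a buffer can be posted right after the last prior write to that buffer, and its completion waited on just before the first later read, with everything in between available for overlap.

```python
# Toy sketch of the overlap-window computation: a trace is a list of
# (operation, buffer) events; `post` is the index where the communication
# is currently issued.

def overlap_window(trace, buf, post):
    last_write = max((i for i, (op, b) in enumerate(trace[:post])
                      if b == buf and op == "write"), default=-1)
    first_read = next((i for i, (op, b) in enumerate(trace[post + 1:], post + 1)
                       if b == buf and op == "read"), len(trace))
    return last_write + 1, first_read   # [earliest post, latest wait)
```

For the trace `[("write","a"), ("write","b"), ("read","b"), ("send","a"), ("read","b"), ("read","a")]`, the send on `a` issued at index 3 can be hoisted to index 1 and waited on just before index 5, overlapping with the intervening work on `b`.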
Automating Data-Layout Decisions in Domain-Specific Languages
A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. 
The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and was an order of magnitude better than that of existing production DSLs.
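An index-space transformation of the kind such a layout algebra automates can be sketched concretely. This is a hypothetical example, not the framework's actual algebra: the innermost site index is split by the SIMD width so that a vector load reads the same field of adjacent sites contiguously (an AoS-to-AoSoA relayout).

```python
# Hypothetical sketch of a data-layout transform: relayout a flat
# array-of-structures [site0.f0, site0.f1, site1.f0, ...] into
# AoSoA blocks [block][field][lane], where lane runs over `simd`
# consecutive sites.

def aos_to_aosoa(data, n_fields, simd):
    """Assumes len(data) is a multiple of n_fields * simd."""
    sites = [data[i:i + n_fields] for i in range(0, len(data), n_fields)]
    out = []
    for b in range(0, len(sites), simd):
        block = sites[b:b + simd]
        for f in range(n_fields):
            for lane in range(simd):
                out.append(block[lane][f])
    return out
```

For 4 sites with 2 fields and SIMD width 2, `[a0,b0,a1,b1,a2,b2,a3,b3]` becomes `[a0,a1,b0,b1,a2,a3,b2,b3]`: each width-2 vector load now touches one field of two adjacent sites.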
UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models
The complexity of heterogeneous computing architectures, as well as the
demand for productive and portable parallel application development, have
driven the evolution of parallel programming models to become more
comprehensive and complex than before. Enhancing the conventional compilation
technologies and software infrastructure to be parallelism-aware has become one
of the main goals of recent compiler development. In this paper, we propose the
design of a unified parallel intermediate representation (UPIR) for multiple
parallel programming models, enabling unified compiler transformations
across them. UPIR specifies three commonly used parallelism patterns (SPMD,
data and task parallelism), data attributes and explicit data movement and
memory management, and synchronization operations used in parallel programming.
We demonstrate UPIR via a prototype implementation in the ROSE compiler for
unifying IR for both OpenMP and OpenACC and in both C/C++ and Fortran, for
unifying the transformation that lowers both OpenMP and OpenACC code to LLVM
runtime, and for exporting UPIR to an LLVM MLIR dialect.
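The idea of a unified parallel IR can be sketched as a small set of model-neutral nodes that different front ends map onto, so one lowering pass serves all models. This is a hypothetical toy, not UPIR's actual node set or the ROSE implementation:

```python
# Hypothetical toy of a unified parallel IR: OpenMP and OpenACC front ends
# map their directives onto shared parallelism-pattern nodes, and a single
# lowering pass handles both (strings stand in for runtime calls).

from dataclasses import dataclass

@dataclass
class SpmdRegion:          # SPMD pattern: a team of workers runs the body
    num_workers: int
    body: str

@dataclass
class Worksharing:         # data parallelism: iterations split over workers
    trip_count: int
    body: str

def from_openmp(directive, **kw):
    return {"parallel": SpmdRegion, "for": Worksharing}[directive](**kw)

def from_openacc(directive, **kw):
    return {"parallel": SpmdRegion, "loop": Worksharing}[directive](**kw)

def lower(node):
    """One lowering for every front end."""
    if isinstance(node, SpmdRegion):
        return f"fork({node.num_workers}); {node.body}; join()"
    return f"static_schedule({node.trip_count}); {node.body}"
```

An OpenMP `parallel` and an OpenACC `parallel` then lower to identical runtime code, which is the point of unifying the transformation.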
Static Local Concurrency Errors Detection in MPI-RMA Programs
Communications are a critical part of HPC simulations and one of the main focuses of application developers when scaling on supercomputers. While classical message passing (also called two-sided communication) is the dominant communication paradigm, one-sided communications are often praised as efficient for overlapping communication with computation, but challenging to program. Their usage is therefore generally abstracted through languages and memory abstractions that ease programming (e.g., PGAS). Consequently, little work has been done to help programmers use intermediate runtime layers such as MPI-RMA, which are often reserved for expert programmers. Indeed, programming with MPI-RMA presents several challenges: the asynchronous nature of one-sided communications must be handled to ensure the proper semantics of the program while preserving its memory consistency. To help programmers detect memory errors such as race conditions as early as possible, this paper proposes a new static analysis of MPI-RMA codes that reports to the programmer, at compile time, the errors it can detect. The detection is based on a novel local concurrency error detection algorithm that tracks accesses through BFS searches on the control-flow graphs of a program. We show on several tests and an MPI-RMA variant of the GUPS benchmark that the static analysis detects such errors in user codes. The erroneous codes have been integrated into the MPI Bugs Initiative open-source test suite.
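The shape of such a BFS-based check can be sketched on a toy control-flow graph. This is an illustrative model, not the paper's algorithm: starting from a node that begins a one-sided access to a buffer, walk successors breadth-first; any local access to the same buffer reached before a synchronization node on that path is a potential local concurrency error.

```python
# Toy sketch of a BFS local concurrency check on a CFG. cfg maps a node to
# its successor list; labels maps a node to (kind, buffer), where kind is
# 'rma_get', 'write', 'read', 'sync', or 'nop'. An MPI_Get writes its local
# buffer asynchronously, so both local reads and writes of that buffer
# conflict until a synchronization closes the epoch.

from collections import deque

def find_conflicts(cfg, labels, start, buf):
    conflicts, seen = [], {start}
    q = deque(cfg.get(start, []))
    while q:
        n = q.popleft()
        if n in seen:
            continue
        seen.add(n)
        kind, b = labels[n]
        if kind == "sync":          # epoch closed on this path; stop here
            continue
        if b == buf and kind in ("write", "read"):
            conflicts.append(n)     # access before synchronization
        q.extend(cfg.get(n, []))
    return conflicts
```

On a diamond CFG where one branch writes the buffer before any `sync` and the other synchronizes first, only the unsynchronized access is flagged.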
MDMP: Managed Data Message Passing
MDMP is a new parallel programming approach that aims to give users an easy way to add parallelism to programs, to optimise the message-passing costs of traditional scientific simulation algorithms, and to enable existing MPI-based parallel programs to be optimised and extended without requiring the whole code to be rewritten from scratch. MDMP uses a directive-based approach that lets users specify what communications should take place in the code; it then implements those communications for the user in an optimal manner, using both the information provided by the user and data collected by instrumenting the code and gathering information on the data to be communicated. This work presents the basic concepts and functionality of MDMP and discusses the performance that can be achieved using our prototype implementation of MDMP on some model scientific simulation applications. Comment: Submitted to SC13, 10 pages, 5 figures.
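The managed-communication idea (declare *what* to communicate; let the runtime decide *how*) can be sketched in miniature. MDMP's real interface is directive-based and MPI-backed; this hypothetical model captures only the concept, with per-destination aggregation standing in for the runtime's optimization:

```python
# Hypothetical sketch of managed communication: users declare sends, and a
# deferred flush chooses the delivery strategy, here aggregating payloads
# per destination to reduce the number of messages.

class ManagedComms:
    def __init__(self):
        self.pending = []

    def send(self, dest, payload):
        """The 'what': record a communication declared by the user."""
        self.pending.append((dest, payload))

    def flush(self):
        """The 'how': one aggregated message per destination."""
        batches = {}
        for dest, payload in self.pending:
            batches.setdefault(dest, []).append(payload)
        self.pending.clear()
        return batches
```

Declaring three sends to two ranks yields two aggregated messages, one per destination, instead of three point-to-point transfers.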
An Abstract Model for Parallel Execution of Prolog
Logic programming has been used with great success in a broad range of fields, from artificial intelligence to general-purpose applications. Through their declarative semantics, making use of logical conjunctions and disjunctions, logic programming languages present two types of implicit parallelism: and-parallelism and or-parallelism.
This thesis focuses mainly on Prolog as a logic programming language, presenting an abstract model for the parallel execution of Prolog programs that builds on the Extended Andorra Model (EAM) proposed by David H. D. Warren, which exploits the implicit parallelism in the language. A meta-compiler implementation of an intermediate language for the proposed model is also presented.
This work also surveys the state of the art in implemented Prolog compilers, both sequential and parallel, along with a walk-through of current parallel programming frameworks. The model most widely used for Prolog compiler implementation, the Warren Abstract Machine (WAM), is also analyzed, as well as the WAM's successor for supporting parallelism, the EAM.
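Or-parallelism, one of the two forms of implicit parallelism mentioned above, means exploring the alternative clauses of a predicate concurrently. A minimal Python analogy (not the EAM or the proposed model, just the concept):

```python
# Minimal analogy of or-parallelism: each alternative clause of a goal is
# tried in its own worker, and every successful branch contributes its
# solutions, as the alternative branches of a Prolog choice point would.

from concurrent.futures import ThreadPoolExecutor

def or_parallel_solve(goal, clauses):
    """clauses: callables mapping a goal to a (possibly empty) solution list."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda clause: clause(goal), clauses))
    return [sol for branch in results for sol in branch]
```

For a goal `color(X)` with three alternative clauses, the branches run concurrently while `map` preserves clause order, so the solution order matches sequential Prolog's.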
Generalizing Hierarchical Parallelism
Since the days of OpenMP 1.0, computer hardware has become more complex, typically by specializing compute units for coarse- and fine-grained parallelism in incrementally deeper hierarchies of parallelism. Newer versions of OpenMP have reacted by introducing new mechanisms for querying or controlling its individual levels, each time adding another concept, such as places, teams, and progress groups. In this paper we propose going back to the roots of OpenMP, in the form of nested parallelism, for a simpler model and more flexible handling of arbitrarily deep hardware hierarchies. Comment: IWOMP'23 preprint.
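The compositional appeal of nested parallelism can be illustrated with a small analogy. OpenMP itself would express this with nested `parallel` regions; this Python sketch only mirrors the structure, with an outer level standing for coarse-grained parallelism and an inner level, opened by each outer worker, for fine-grained work:

```python
# Analogy of nested parallelism: an outer pool maps to coarse hardware
# parallelism (e.g., one worker per "team"), and each outer worker opens an
# inner pool for fine-grained work. Deeper hierarchies compose by nesting
# the same single mechanism again.

from concurrent.futures import ThreadPoolExecutor

def nested_sum_of_squares(data, outer_workers, inner_workers):
    chunks = [data[i::outer_workers] for i in range(outer_workers)]

    def coarse(chunk):                       # one "team" per chunk
        with ThreadPoolExecutor(inner_workers) as inner:
            return sum(inner.map(lambda x: x * x, chunk))  # fine-grained

    with ThreadPoolExecutor(outer_workers) as outer:
        return sum(outer.map(coarse, chunks))
```

Each additional hardware level is handled by nesting one more region, rather than by a level-specific concept such as places or teams.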