Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on an iPSC/860 to demonstrate the usefulness of our methods.
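The copy-reuse idea in the abstract above can be illustrated with a minimal sketch. It assumes a hypothetical `DistributedArray` with a write counter and a hypothetical `GhostCache`; neither name comes from the paper, and the runtime code a real compiler generates will differ.

```python
class DistributedArray:
    """Toy stand-in for an array owned by another processor (hypothetical)."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 0      # bumped on every write
        self.fetch_count = 0  # counts simulated communication steps

    def write(self, index, value):
        self.values[index] = value
        self.version += 1

    def gather(self, indices):
        """Simulate one communication step fetching off-processor elements."""
        self.fetch_count += 1
        return {i: self.values[i] for i in indices}


class GhostCache:
    """Keeps copies of off-processor data and reuses them while still valid."""

    def __init__(self):
        self._copies = {}  # (array id, sorted indices) -> (version, data)

    def get(self, array, indices):
        key = (id(array), tuple(sorted(indices)))
        cached = self._copies.get(key)
        if cached is not None and cached[0] == array.version:
            return cached[1]          # values unmodified: reuse stored copy
        data = array.gather(indices)  # otherwise communicate again
        self._copies[key] = (array.version, data)
        return data


remote = DistributedArray([10.0, 20.0, 30.0, 40.0])
cache = GhostCache()

# Two loops that read the same off-processor locations share one fetch.
loop1 = sum(cache.get(remote, [1, 3]).values())
loop2 = max(cache.get(remote, [3, 1]).values())
fetches_before_write = remote.fetch_count

remote.write(3, 99.0)  # a write invalidates the stored copy
loop3 = sum(cache.get(remote, [1, 3]).values())
fetches_after_write = remote.fetch_count
```

The version counter is the simplest possible validity check; it is used here only to show that reuse is safe exactly when no intervening write has occurred.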
FluidFFT: common API (C++ and Python) for Fast Fourier Transform HPC libraries
The Python package fluidfft provides a common Python API for performing Fast
Fourier Transforms (FFT) in sequential, in parallel and on GPU with different
FFT libraries (FFTW, P3DFFT, PFFT, cuFFT). fluidfft is a comprehensive FFT
framework which allows Python users to easily and efficiently perform FFT and
the associated tasks, such as computing linear operators and energy spectra.
We describe the architecture of the package composed of C++ and Cython FFT
classes, Python "operator" classes and Pythran functions. The package supplies
utilities to easily test itself and benchmark the different FFT solutions for a
particular case and on a particular machine. We present a performance scaling
analysis on three different computing clusters and a microbenchmark showing
that fluidfft is an interesting solution to write efficient Python applications
using FFT.
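As a rough illustration of one task the abstract mentions, computing an energy spectrum from a Fourier transform, here is a minimal sketch in pure Python. It deliberately does not use the fluidfft API; the function names `dft` and `energy_spectrum` and the 1/n normalization are assumptions made for this example, and the naive O(n^2) transform stands in for a real FFT library.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, normalized by 1/n (assumed convention)."""
    n = len(signal)
    return [
        sum(signal[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n)) / n
        for k in range(n)
    ]


def energy_spectrum(signal):
    """Energy per Fourier mode, E_k = |u_k|^2 / 2."""
    return [0.5 * abs(c) ** 2 for c in dft(signal)]


n = 16
# A single cosine with wavenumber 3 puts its energy in modes 3 and n - 3.
u = [math.cos(2 * math.pi * 3 * j / n) for j in range(n)]
spectrum = energy_spectrum(u)

# With the 1/n normalization, Parseval's identity says the spectrum sums
# to the mean energy in physical space.
mean_energy = sum(0.5 * v * v for v in u) / n
```

Checking that `sum(spectrum)` matches `mean_energy` is a cheap sanity test for whichever normalization a given FFT backend uses.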
Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors
Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
An assessment of the connection machine
The CM-2 is an example of a connection machine. The strengths and problems of this implementation are considered as well as important issues in the architecture and programming environment of connection machines in general. These are contrasted to the same issues in Multiple Instruction/Multiple Data (MIMD) microprocessors and multicomputers
AD in Fortran, Part 1: Design
We propose extensions to Fortran which integrate forward and reverse
Automatic Differentiation (AD) directly into the programming model.
Irrespective of implementation technology, embedding AD constructs directly
into the language extends the reach and convenience of AD while allowing
abstraction of concepts of interest to scientific-computing practice, such as
root finding, optimization, and finding equilibria of continuous games.
Multiple different subprograms for these tasks can share common interfaces,
regardless of whether and how they use AD internally. A programmer can maximize
a function F by calling a library maximizer, XSTAR=ARGMAX(F,X0), which
internally constructs derivatives of F by AD, without having to learn how to
use any particular AD tool. We illustrate the utility of these extensions by
example: programs become much more concise and closer to traditional
mathematical notation. A companion paper describes how these extensions can be
implemented by a program that generates input to existing Fortran-based AD
tools.
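The idea of a library maximizer that builds derivatives by AD internally can be sketched outside Fortran. The following is a minimal forward-mode example using dual numbers; the `Dual` class, the `derivative` helper, and the fixed-step gradient ascent inside `argmax` are assumptions made for illustration and are not the constructs proposed in the paper.

```python
class Dual:
    """Number a + b*eps with eps**2 == 0; the eps part carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual._lift(other) - self

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def derivative(f, x):
    """Forward-mode AD: seed the input with derivative 1 and read it back."""
    return f(Dual(x, 1.0)).deriv


def argmax(f, x0, step=0.1, iterations=200):
    """Maximize f by gradient ascent; the caller never touches AD directly."""
    x = x0
    for _ in range(iterations):
        x = x + step * derivative(f, x)
    return x


def f(x):
    return 3.0 - (x - 2.0) * (x - 2.0)  # concave, maximum at x = 2


xstar = argmax(f, 0.0)
```

The caller writes `argmax(f, 0.0)` and supplies only `f`, which mirrors the point of the abstract: the optimizer's interface stays the same regardless of how derivatives are obtained.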
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent
a powerful and affordable tool for scientists who look to speed up simulations
of complex systems. However, porting code to such devices requires a detailed
understanding of heterogeneous programming tools and effective strategies for
parallelization. In this paper we present a source-to-source compilation
approach with whole-program analysis to automatically transform single-threaded
FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized
kernels.
The main contributions of our work are: (1) whole-source refactoring to allow
any subroutine in the code to be offloaded to an accelerator. (2) Minimization
of the data transfer between the host and the accelerator by eliminating
redundant transfers. (3) Pragmatic auto-parallelization of the code to be
offloaded to the accelerator by identification of parallelizable maps and
reductions.
We have validated the code transformation performance of the compiler on the
NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy
Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow
water component of the ocean model Gmodel; the Linear Baroclinic Model, an
atmospheric climate model and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on a 2-D Shallow
Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and
produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated
versions of the 2DSW and the UFLES are respectively 9x and 20x faster on GPU than the
original code on CPU; in both cases this is the same performance as manually
ported code.
Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as full
paper from ParCFD conference entr
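The "parallelizable maps and reductions" named in contribution (3) of the abstract above can be illustrated with a minimal sketch. It shows one sequential loop and an equivalent form split into a map step and a reduction step. The function names and the thread pool are assumptions chosen for illustration; the compiler described in the abstract targets FORTRAN 77 and OpenCL, not Python.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

heights = [1.0, 2.0, 3.0, 4.0]


def sequential(values):
    """Original-style loop: an elementwise update fused with an accumulation."""
    scaled = [0.0] * len(values)
    total = 0.0
    for i in range(len(values)):
        scaled[i] = 0.5 * values[i]      # independent per element: a map
        total += scaled[i] * scaled[i]   # associative accumulation: a reduction
    return scaled, total


def scale(v):
    return 0.5 * v


def square(v):
    return v * v


def as_map_and_reduction(values):
    """Same computation with the map and the reduction separated."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        scaled = list(pool.map(scale, values))    # iterations may run in any order
        squares = list(pool.map(square, scaled))
    total = reduce(operator.add, squares, 0.0)    # combine partial results
    return scaled, total


seq_result = sequential(heights)
par_result = as_map_and_reduction(heights)
```

Once a loop is recognized as a map, its iterations have no dependence on each other, which is what allows them to be assigned to separate work items on an accelerator.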
Array languages and the N-body problem
This paper is a description of the contributions to the SICSA multicore challenge on many body
planetary simulation made by a compiler group at the University of Glasgow. Our group is part of
the Computer Vision and Graphics research group and we have for some years been developing array
compilers because we think these are a good tool both for expressing graphics algorithms and for
exploiting the parallelism that computer vision applications require.
We shall describe experiments using two languages on two different platforms and we shall compare
the performance of these with reference C implementations running on the same platforms. Finally
we shall draw conclusions both about the viability of the array language approach as compared to
other approaches used in the challenge and also about the strengths and weaknesses of the two, very
different, processor architectures we used.
Learning from the Success of MPI
The Message Passing Interface (MPI) has been extremely successful as a
portable way to program high-performance parallel computers. This success has
occurred in spite of the view of many that message passing is difficult and
that other approaches, including automatic parallelization and directive-based
parallelism, are easier to use. This paper argues that MPI has succeeded
because it addresses all of the important issues in providing a parallel
programming model.
Comment: 12 pages, 1 figur
The PISCES 2 parallel programming environment
PISCES 2 is a programming environment for scientific and engineering computations on MIMD parallel computers. It is currently implemented on a flexible FLEX/32 at NASA Langley, a 20-processor machine with both shared and local memories. The environment provides an extended Fortran for applications programming, a configuration environment for setting up a run on the parallel machine, and a run-time environment for monitoring and controlling program execution. This paper describes the overall design of the system and its implementation on the FLEX/32. Emphasis is placed on several novel aspects of the design: the use of a carefully defined virtual machine, programmer control of the mapping of virtual machine to actual hardware, forces for medium-granularity parallelism, and windows for parallel distribution of data. Some preliminary measurements of storage use are included.