Aspects of practical implementations of PRAM algorithms
The PRAM is a shared-memory model of parallel computation that abstracts away from inessential engineering details. It provides a very simple, architecture-independent model and a good programming environment. Theoreticians in the computer science community have shown that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large-scale general-purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software, and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we present the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimensional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation: balanced tree computations, Fast Fourier Transforms and matrix multiplications.
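As a rough illustration of what a balanced tree computation looks like, the sketch below combines values pairwise, level by level, and counts the levels. It is a sequential simulation written for this listing, not the thesis's bulk synchronous algorithm; the function name `tree_reduce` and the example input are assumptions.

```python
from operator import add


def tree_reduce(values, op=add):
    """Combine values pairwise, level by level, like a balanced binary tree.

    Returns (result, rounds). `rounds` is the tree height, i.e. the number
    of parallel steps if every pair in a level were combined at once.
    This is a sequential simulation (hypothetical helper, not from the thesis).
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # The pairs within one level are independent of each other, so a
        # parallel machine could evaluate the whole level in a single step.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired element moves up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_reduce(range(1, 9))  # eight leaves, three levels
```

With n leaves the loop runs about log2(n) times, which is why tree-shaped computations map well onto parallel machines when `op` is associative.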
Fast hyperbolic Radon transform represented as convolutions in log-polar coordinates
The hyperbolic Radon transform is a commonly used tool in seismic processing,
for instance in seismic velocity analysis, data interpolation and for multiple
removal. A direct implementation by summation of traces with different moveouts
is computationally expensive for large data sets. In this paper we present a
new method for fast computation of the hyperbolic Radon transforms. It is based
on using a log-polar sampling with which the main computational parts reduce to
computing convolutions. This allows for fast implementations by means of FFT.
In addition to the FFT operations, interpolation procedures are required for
switching between coordinates in the time-offset, Radon, and log-polar domains.
Graphical Processor Units (GPUs) are suitable to use as a computational
platform for this purpose, due to the hardware supported interpolation routines
as well as optimized routines for FFT. Performance tests show large speed-ups
of the proposed algorithm. Hence, it is suitable to use in iterative methods,
and we provide examples for data interpolation and multiple removal using this
approach.
Comment: 21 pages, 10 figures, 2 tables
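The speed-up described above rests on the convolution theorem: once an operation is written as a convolution, it can be evaluated with FFTs. The sketch below shows only that generic step for a 1-D circular convolution using NumPy. It does not implement the log-polar resampling or the hyperbolic Radon transform itself, and the function names and random test signals are assumptions.

```python
import numpy as np


def circular_convolve_direct(x, h):
    """Circular convolution by explicit summation, O(n^2) operations."""
    n = len(x)
    out = np.zeros(n, dtype=complex)
    for k in range(n):
        for j in range(n):
            out[k] += x[j] * h[(k - j) % n]
    return out


def circular_convolve_fft(x, h):
    """The same circular convolution via the convolution theorem, O(n log n)."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h))


# Illustrative inputs (not data from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = rng.standard_normal(64)

direct = circular_convolve_direct(x, h)
fast = circular_convolve_fft(x, h)
```

Both routines return the same values up to floating-point rounding; only the operation count differs.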
Accelerated Modeling of Near and Far-Field Diffraction for Coronagraphic Optical Systems
Accurately predicting the performance of coronagraphs and tolerancing optical
surfaces for high-contrast imaging requires a detailed accounting of
diffraction effects. Unlike simple Fraunhofer diffraction modeling, near and
far-field diffraction effects, such as the Talbot effect, are captured by
plane-to-plane propagation using Fresnel and angular spectrum propagation. This
approach requires a sequence of computationally intensive Fourier transforms
and quadratic phase functions, which limit the design and aberration
sensitivity parameter space which can be explored at high-fidelity in the
course of coronagraph design. This study presents the results of optimizing the
multi-surface propagation module of the open source Physical Optics Propagation
in PYthon (POPPY) package. This optimization was performed by implementing and
benchmarking Fourier transforms and array operations on graphics processing
units, as well as optimizing multithreaded numerical calculations using the
NumExpr python library where appropriate, to speed the end-to-end simulation of
observatory and coronagraph optical systems. Using realistic systems, this
study demonstrates a greater than five-fold decrease in wall-clock runtime over
POPPY's previous implementation and describes opportunities for further
improvements in diffraction modeling performance.
Comment: Presented at SPIE ASTI 2018, Austin Texas. 11 pages, 6 figures
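For readers unfamiliar with angular spectrum propagation, a minimal NumPy sketch of the textbook method follows: transform the field, multiply by a phase transfer function, and transform back. This is not POPPY's implementation or API; the function name, the choice to discard evanescent components, and the example wavelength, pixel size and aperture are all assumptions.

```python
import numpy as np


def angular_spectrum_propagate(field, wavelength, pixel_size, distance):
    """Propagate a sampled complex field between parallel planes.

    Textbook angular spectrum method (hypothetical helper, not POPPY code).
    Assumption: evanescent components are set to zero.
    """
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_size)  # spatial frequencies, cycles/length
    fy = np.fft.fftfreq(ny, d=pixel_size)
    fxx, fyy = np.meshgrid(fx, fy)         # both arrays have shape (ny, nx)
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    propagating = arg > 0
    kz = 2.0 * np.pi * np.sqrt(np.where(propagating, arg, 0.0))
    transfer = np.where(propagating, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)


# Illustrative values: 1 micron light, 10 micron pixels, circular aperture.
n = 64
yy, xx = np.mgrid[:n, :n] - n // 2
aperture = (xx**2 + yy**2 <= 10**2).astype(complex)
propagated = angular_spectrum_propagate(aperture, 1e-6, 10e-6, 5e-3)
```

Each plane-to-plane step costs one forward and one inverse 2-D FFT plus an element-wise multiply, which is why the FFT and array operations are the natural targets for acceleration in a multi-surface model.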
L-PICOLA: A parallel code for fast dark matter simulation
Robust measurements based on current large-scale structure surveys require
precise knowledge of statistical and systematic errors. This can be obtained
from large numbers of realistic mock galaxy catalogues that mimic the observed
distribution of galaxies within the survey volume. To this end we present a
fast, distributed-memory, planar-parallel code, L-PICOLA, which can be used to
generate and evolve a set of initial conditions into a dark matter field much
faster than a full non-linear N-Body simulation. Additionally, L-PICOLA has the
ability to include primordial non-Gaussianity in the simulation and simulate
the past lightcone at run-time, with optional replication of the simulation
volume. Through comparisons to fully non-linear N-Body simulations we find that
our code can reproduce the power spectrum and reduced bispectrum of dark
matter to within 2% and 5% respectively on all scales of interest to
measurements of Baryon Acoustic Oscillations and Redshift Space Distortions,
but 3 orders of magnitude faster. The accuracy, speed and scalability of this
code, alongside the additional features we have implemented, make it extremely
useful for both current and next generation large-scale structure surveys.
L-PICOLA is publicly available at https://cullanhowlett.github.io/l-picola
Comment: 22 Pages, 20 Figures. Accepted for publication in Astronomy and Computing
Ordered fast Fourier transforms on a massively parallel hypercube multiprocessor
Design alternatives for ordered Fast Fourier Transform (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication, which is known to dominate the overall computing time. To this end, the ordering and computational phases of the FFT were combined, and sequence-to-processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely standard-order and A-order, which can be implemented with equal ease on the Connection Machine, where orderings are determined by geometries and priorities. If the sequence has N = 2^r elements and the hypercube has P = 2^d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions, which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine.
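The transmission counts quoted above can be checked with a little arithmetic. The sketch below evaluates both formulas and confirms that their difference is r - d + 1; the function names and the example sizes (r = 20, d = 16) are illustrative assumptions, not values taken from the abstract.

```python
def standard_order_transmissions(r, d):
    """Parallel transmissions for a standard-order FFT: d + r/2 + 1."""
    return d + r / 2 + 1


def a_order_transmissions(r, d):
    """Parallel transmissions for an A-order FFT: 2d - r/2."""
    return 2 * d - r / 2


# Illustrative sizes: N = 2**20 elements on P = 2**16 processors.
r, d = 20, 16
standard = standard_order_transmissions(r, d)  # 16 + 10 + 1
a_order = a_order_transmissions(r, d)          # 32 - 10
saving = standard - a_order                    # equals r - d + 1
```

Algebraically, (d + r/2 + 1) - (2d - r/2) = r - d + 1, so the saving grows with the number of elements held per processor.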