We introduce a computation model for developing and analyzing
parallel algorithms on distributed memory machines. The model allows
the design of algorithms using a single address space and does not
assume any particular interconnection topology. We capture performance
by incorporating a cost measure for interprocessor communication
induced by remote memory accesses. The cost measure includes
parameters reflecting memory latency, communication bandwidth, and
spatial locality. Our model allows for the initial placement of the input
data and supports pipelined prefetching.
We use our model to develop parallel algorithms for various data
rearrangement problems, load balancing, sorting, FFT, and matrix
multiplication. We show that most of these algorithms achieve optimal
or near-optimal communication complexity while simultaneously
guaranteeing an optimal speedup in computational complexity.
(Also cross-referenced as UMIACS-TR-94-5.)