Algorithms and programming tools for image processing on the MPP, part 2 by Reeves, Anthony P.
A1 gori thms and Programing Tools for 
Image Processing on the MPP: #2 
Report for the Period 
March 1986 t o  Augus t  1985 
Anthony P. Reeves 
School o f  E l e c t r i c a l  Engineering 
Cornel 1 Univers i ty  
I thaca,  New York 14853 
Work Supported by NASA Grant NAG 5-403 
https://ntrs.nasa.gov/search.jsp?R=19870008007 2020-03-20T12:59:50+00:00Z
Algorithms and Programming Tools for 
Image Processing on the MPl? #2 
Report for the Period 
March 1986 to August 1985 
Anthony P. Reeves 
School of Electrical Engineering 
Cornell University 
Ithaca, New York 14853 
Work Supported by NASA Grant NAG 5-403 
Summczry 
The work reported here was conducted by Maria Gutierrez and Marc 
Willebeek-LeMair, who are graduate students at Cornell University, and myself. The 
work for this period iof the grant falls into two main categories algorithms for the 
MPP and performance analysis of data manipulations for the MPP and related archi- 
tectures. Maria has developed a number of novel algorithms for image warping and 
pyramid image filtering. Marc has investigated techniques for the parallel processing 
of a large number of independent irregular shaped regions on the MPP. In addition 
some new utilities for dealing with very long vectors and for sorting have been 
developed. Documentation pages for the new algorithms which are available for dis- 
tribution are given in Appendix A. Not all algorithms have been made to work on 
the MPP. The effort in the final period of the grant will concentrate on the MPP im- 
plementation. 
The performance of the MPP for a number of basic data manipulations has been 
determined. From these results it is possible to predict the efficiency of the MPP for a 
number of algorithms and applications Some of these results have been published 
11,Z 31 and these papers are included as Appendices C, D and E The Parallel Pascal 
development system, which is a portable programming environment for the MPP, has 
been improved and better documentation including a tutorial has been written. This 
environment, allows programs for the MPP to be developed on any conventional mm- 
puter system; it consists of a set of system programs and a library of general purpose 
Parallel Pascal functions The new tutorial is included as Appendix EL 
2 
During this report period Idria, Man: and mysel have visited the NA 4 God- 
dard Space Flight Center. Algorithms have been tested on the MPP and a presenta- 
tion on the development system was made to the MPP users group We have distri- 
buted the uND< version of the Parallel Pascal System to number of new sites. 
Some of the highlights of the rmlts  of this research are listed below. 
Image Processing Algorithms 
Algorithms for image warping are described in Appendix A. The two main 
functions are nearest neighbor, " W A R P ,  and bilinear interpolation BLWARP. Both 
of these functions are guided by the same heuristic technique which is very &cient 
for small arbitrary warps but can also deal with large image distortions 
Building on the pyramid processing primitives, which were mentioned in the 
previous report, Laplacian and Gaussian pyramid image filters have been implemented 
by the functions LAFUICIAN and GAUSSIAN respectively as outlined in Appendix 
A. These algorithms are used to decompose an image into a number of bandpass 
filtered subimages A number of interesting &cient image analysis and image filter- 
ing algorithms have been based on this pyramid of subimages. 
Local Region Processing 
A new approach to the parallel processing of independent regions in parallel on 
the MPP is being investigated. For each region in an image a tree is created which 
spans the region and can be used to compute features of the region. Special provisions 
have been made for generating trees for non-simply connected regions Techniques 
for parallel region merging have been developed. After merging two regions, a new 
tree is generated which covers the region in a minimum amount of time Using these 
tree procedures, an image segmentation algorithm, based on the split and merge para- 
digm, has been implemented. An initial paper on these techniques is in preparation. 
General Utilities 
A general purpose sorting algorithm has been implemented; this is described in 
the SORT documentation page in Appendix A. Based on the bitonic sorting technique 
this program can sort the rows, the columns or treat the whole PE matrix as a long 
3 
vector. Any basic data type can be sorted. 
There are new utilities for local mean, local median and local maximum filters; 
see LMEAN, LMEDIAN and LMAXIMUM in Appendix A. Also, a general purpose 
binary matching algorithm (COMPN) has been developed. 
Performance analysis 
An analysis of different data permutations and manipulations on the MPP is 
presented in [l] which is also included in Appendix C This analysis expresses the 
co6t of data manipulations in terms of elemental arithmetic operations; Boolean, in- 
teger and floating point data types are considered. Results are computed for arrays of 
size 128 x 128,256 x 256, and 512 x 512. An extended version of this paper, which 
includes a general description of the MPP, is given in [21 which is also included as A p  
pendix C 
There has been much recent interest in the implementation of parallel pyramid 
data processors Such a processor could be maded with the MPP by augmenting the 
processor array with a pyramid structure of additional processing elements. A py- 
ramid processor based on the MPP is considered in 131 which is also included as Ap- 
pendix D. The results from an analysis of the proposed system indicate that, in gen- 
eral, there would be little advantage in having the additional pyramid hardware for 
implementing many of the pyramid algorithms 
References 
1. 
2 
3 
A. P. Reeves and C H. Moura, "Data Manipulations on the Massively Parallel 
Processor," Proceedings o f  the Nineteenth Hawaii lnterncrtwd Conference on 
System Sciences, p p  222-229 (January, 19862 
A. P. Reeves, 'The Massively Parallel Processor: A Highly Parallel scientific 
Computer," p p  239-252 in Data Analysis in Astrommy 11, ed. V. Di Gesu, Ple- 
num Press (1986). 
A. P. Reeves, ''Pyramid Algorithms on Processor Arrays," p p  195-213 in Pyrm-  
kid Systems for Computer Visison, ed. V.  Cantoni and S Levialdi, Academic 
Press (1986). 
BLWARP ( 2 ) 
Appendix A 
PPSPDS Users Manual BLWARP ( 2 ) 
NAME 
blwarp - Bilinear interpolation warping 
{$library blwsrp.pl} 
SYNOPSIS 
function blwarp( mx:pli; rp:pli;cp:pli; rf:plr; cfip1r):pli; extern rl, rh, cl, ch; 
plr = array [rl..rh,cl..ch] of btype; 
pli = array [rl..rh,cl..ch] of itype; 
plb = array [rl..rh,cl..ch] of boolean; 
Where btype is any type and itype is an integer or subrange base type 
rl = the smallest row number of the input matrix 
rh = the largest row number of the input matrix 
cl = the smallest column number of the input matrix 
ch = the largest column number of the input matrix 
Blwarp is a two dimensional bilinear interpolation warping function. It first finds the four ver- 
tices of the squares that contain each of the points that we want to interpolate. The function 
nnwarp is used to find these vertices and the points are given by the matrices (rp+rf) and 
(cp+cf). Rp and cp define the mapping to be implemented. 
The bilinear interpolation is performed in the following way: 
TYPES 
EXTERN CONSTANTS 
DESCRIPTION 
- 
cf 
P1 -+ * P 2  
I I  
I I  
rf I I 
* P  - 
P3 * * P 4  
for all points P in the matrix. 
AUTHOR 
Maria C. Gutierrez 
rotation(2),~warp(2) 
SEE ALSO 
A. P. Reeves PPL 1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
m 
I 
I 
I 
COMPN(2)  PPSPDS Users Manual 
~ ~- 
COMPN(2)  
N M  
compn - near neigbor comparison function 
{library mx.pl } 
SYNOPSIS 
compn (m:mtype; w:wtype ):mtype; extern sise; 
mtype = parallel array [lol..hil, lo2..hi2] of boolean; 
wtype = array [O..size, O..size] of 0 . 2 ;  
Compn compares the local neighborhood of each element of the boolean input matrix m with the 
window w. If a match occures then the result element is true, otherwise it is false. A zero in w 
matches with false in m, a one in w matches with true in m and a two in w is a don’t care. 
A. P. Reeves 
TYPES 
DESCRIPTION 
AUTHOR 
A. P. Reeves PPL 1 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
GAUSSIAN ( 2 ) PPSPDS Users Manual GAUSSIAN ( 2 ) 
NAME 
gaussian - gaussian pyramid: low pass filter 
{(library Pyramid.pl } 
SYNOPSIS 
function gaussian( image:gtype; weight:real):rtype; 
extern nrows ncols; 
TYPES 
gtype = parallel array [O..nrows, O..ncols] of (real or integer); 
rtype = parallel array [O..nrows, O..ncols] of real; 
itype = parallel array (O..nrows, O..ncols] of integer; 
btype = parallel array [O..nrows, O..ncols] of boolean; 
nrows = The largest row number of the image matrix 
ncols = The largest column number of the image matrix 
idl,id2,upl,dnl,up2,dn2: itype 
p yr msk: b type; 
Id1 and id2 are two global index identifying matrices as created by twodid (see mat(2)). These 
must be initialized first. The other matrices specify transformations for managing pyramid data; 
see pyramid(2) for more information. 
GUU88iUn is a low pass filter function for pyramid structure images stored in a t w o  dimensional 
matrix. The successive levels of the pyramid structure are stored in successive rows of the image 
matrix with the highest level image (1 pixel image) being located at position [O,O] in the input 
matrix. Each level is a low pass filtered copy of its predecessor and i t  is obtained by convolving 
the predecessor with a Gaussian weighting kernel. This kernel is generated from the weight 
parameter; if a gaussian kernel is desired the weight should be equal to 0.4. 
The pyramid operations require several constant matrices; these are declared globally for 
efficiency. The global variables are generated with the functions twodid (see mat(2)), pynnskg 
and pyrgcn (see pyramid(2)). 
Maria C. Gutierrez 
pyramid( 2), xshift( 2), xconv( 2), mat( 2), gather( 2) 
EXTERN CONSTANTS 
VARS 
DESCRIPTION 
AUTHOR 
SEE ALSO 
A. P. Reeves PPL 1 
I PPSPDS Users Manual 
I 
1 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
" W A R P ( 2 )  NNWARP(2)  
NAME 
nnwarp - Near neighbor warping 
{$library nnwarp.pl } 
SYNOPSIS 
function nnwarp(mx:pa; r:pi; c:pi; mask:pb):pa;extern 101, hil, 102, hi2; 
pa = array [lol..hil,lo2..hi2] of btype; 
pi = array [lol..hil,lo2..hi2] of itype; 
pb = array [lol..hil,lo2..hi2] of boolean; 
Where btype is any type and itype is an integer or subrange base type. 
lo1 = the smallest row number of the input matrix 
102 = the largest row number of the input matrix 
hi1 = the smallest column number of the input matrix 
hi2 = the largest column number of the input matrix 
idl,id2: pi; 
Id1 and id2 are two global index identifying matrices as created by twodid (see mat(2)). These 
must be initialized before using nnwarp. 
Nnwarp is a two dimensional near neighbor warping function. The transformation implemented 
by nnwarp is as follows: 
TYPES 
EXTERN CONSTANTS 
VARS 
DESCRIPTION 
nnwarp[i,j] := mx[r[i,j],c[i,j]]; 
The r and c matrices contain the row and column indices, respectively, of where each element in 
nnwarp is to be obtained from in mx. i.e. r and c define the mapping to be implemented. 
Maria C. Gutierrez 
mat(2) 
AUTHOR 
SEE ALSO 
A. P. Reeves PPL 1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
B 
I 
I 
I 
I 
I 
I 
I 
I 
I 
, LWLACIAN(2)  PPSPDS Users Manual LAF'LACLAN ( 2 ) 
NAME 
laplacian - laplacian pyramid: band pass filter 
{$library Pyramid.pl  } 
SYNOPSIS 
function laplacian(imsge:gtype; weight:resl):rtype; 
extern n r o w s  ncols; 
TYPES 
gtype = parallel array [O..nrows, O..ncols] of (real or integer); 
rtype = parallel array [O..nrows, O..ncols] of real; 
itype = parallel array [O..nrows, O..ncols] of integer; 
btype = parallel array [O..nrows, O..ncols] of boolean; 
nrows = The largest row number of the image matrix 
ncols = The largest column number of the image matrix 
idl,id2,upl,dnl,up2,dn2: itype 
pyrmsk: btype; 
Id1 and id2 are two global index identifying matrices as created by twodid (see mat(2)). These 
must be initialized first. The other matrices specify transformations for managing pyramid data; 
see pyramid(2) for more information. 
Laplacian is a band pass filter function for pyramid structure images stored in a two dimensional 
matrix. The successive levels of the pyramid structure are stored in successive rows of the image 
matrix with the highest level image (1 pixel image) being located at position [O,O] in the input 
matrix. 
In the Laplacian pyramid, each level is a band pass filtered copy of its predecessor and it is 
obtained by the difference of two levels of the Gaussian pyramid. See gaussian(2) for more infor- 
mation about the Gaussian pyramid. 
The pyramid operations require several constant matrices; these are declared globally for 
efficiency. The global variables are generated with the functions twodid (see mat(2)), pynnskp 
and pytpen (see pyramid(2)). 
Maria C. Gutierrez 
Pyramid(2), pyramid(2), xshift(2), xconv(2), mat(2), gather(2) 
EXTERN CONSTANTS 
VAFtS 
DESCRIPTION 
AUTHOR 
SEE ALSO 
A. P. Reeves PPL 1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
' LMAxIMuM(2)  PPSPDS Users Manual 
NAME 
SYNOPSIS 
lmaximum - local maximum filter function 
{$library 1mean.pl } 
func t ion  lmaximum(m:mtype; w :integer):mtype; extern shifttype; 
TYPES 
mtype = parallel array[il..ih,jl..jh] of btype; 
shifttype = shift, lshift, lshiftg, crshift, or crshiftg 
Where btype is integer or real 
Lmazimum computes the local maximum of the elements in the square (w*w) window. If the win- 
dow goes beyond the input data then all values outside the data border are assumed zero. The 
maximum value is returned at the location of the central element of the window. For the case 
that  the window size is even, the central element is located above and to the left of the center of 
the window. 
DESCRIPTION 
The extern parameter shift can be selected according to the type of shift operation desired. If, 
for example, the local mean function is to be used on large arrays (exceeding the 128*128 array 
size of the MPP) then the lshift or lshiftg options can be used. In addition, there is a crshift and 
a crshiftg operation available. For more information on these functions consult the User's 
Manual. 
AUTHOR 
Marc Willebeek-LeMair 
SEE ALSO 
lmean(2), lmedian(2) 
A. P. Reeves PDS 1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
PPSPDS Users Manual 
NAME 
lmean - local mean filter function 
{$library 1mean.pl } 
SYNOPSIS 
function lmean(m: mtype): mtype; extern w, il, ih, jl, jh, shifttype; 
TYPES 
mtype = array[il..ih,jl..jh] of btype; 
shifttype = shift, lshift, lshiftg, crshift, crshiftg 
Where btype is integer or real 
Lmean computes the local mean of the elements in the square (w*w) window. The mean value is 
returned at the location of the central element of the window. For the case that  the window size 
is even, the central element is located above and to the left of the center of the window. Along 
the border of the input array where the window contains elements outside the array boundary (of 
unknown value) the mean is set to zero. Therefore, the output array contains a border of zeroes 
half a window-length wide. 
DESCRIPTION 
The extern parameter shift can be selected according to the type of shift operation desired. If, 
for example, the local mean function is to be used on large arrays (exceeding the 128428 array 
size of the MPP) then the lshift or lshiftg options can be used. In addition, there is a crshift and 
a crshiftg operation available. For more information on these functions consult the User's 
Manual. 
The window size (w) is specified as an extern parameter since it is used in the variable declara- 
tion of lmean as a dimension parameter. 
AUTHOR 
M. Willebeek-LeMair 
SEE ALSO 
lmaximum(2), lmedian(2) 
A. P. Reeves PDS 1 
.I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
m 
, L&4EDIAN(1W) PPSPDS Users Manual LMEDIAN ( 1W ) 
NAME 
SYNOPSIS 
lmedian - local median filter function 
{$library Imean.pl } 
function lmedian(m: mtype; w: integer): mtype; extern il, ih, jl, jh, shifttype; 
mtype = array(il..ih,jl..jh] of btype; 
shifttype = shift, lshift, lshiftg, crshift, or crshiftg 
Where btype is integer or real 
Lmedion computes the local median of the elements in the square window of area wa. If the win- 
dow goes beyond the input data then all values outside the data border are assumed zero. The 
median value is returned at the location of the central element of the window. For the case that 
the window size is even, the central element is located above and to the left of the center of the 
window. 
M. Willebeek-LeMair 
TYPES 
DESCRIPTION 
AUTHOR 
SEE ALSO 
lmaximum( I), lmean( 1) 
A. P. Reeves PDS 1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
SORT ( 2 ) PPS-PDS Users Manual 
NAME 
sort,rowsort,colsort - General two dimensional sort procedure 
{library sort.pl } 
SYNOPSIS 
SORT ( 2 ) 
procedure sort(var x:pd; var y:pd;);extern pb, n; 
procedure rowsort(var x:pd; var y:pd;);extern pb, n; 
procedure colsort(var x:pd; var y:pd;);extern pb, n; 
pd = array [lol..hil,lo2..hi2] of btype; 
pb = array [lol..hil,lo2..hi2] of boolean; 
Where btype is the base type of the data array to be sorted and n is the dimension of one side of 
the matrix. i.e., (hi1 - lo1 - 1). (n must be a power of 2) 
TYPES 
VARS 
idl,id2: pi; 
Id1 and id2 are two global index identifying matrices as created by twodid (see mat(2)). These 
must be initialized before using sort. 
Sort is a general purpose two dimensional sorting procedure based on the bitonic sorting algo- 
rithm. It is designed for a parallel computer architecture which involves a square mesh con- 
nected processor array. The matrix to be sorted is the parameter x and the result is returned in 
y. X will be modified by this procedure, if this is not desirable for the given application then 
simply do not specify x to  be a var parameter. 
Rowsort is similar to sort except that each row of the matrix is individually sorted. Colsort sorts 
the columns of the matrix. 
DESCRIPTION 
AUTHOR 
A. P. Reeves 
mat( 2) 
SEE ALSO 
A. P. Reeves PPL 1 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
VSHlFT ( 2 ) PPS-PDS Users Manual VSHIFT ( 2 ) 
NAME 
vshift - matrix spiral shift function 
{library vshif't.pl } 
SYNOPSIS 
function vshift(s:atype; s:integer):atype; extern n, mtype; 
TYPES 
atype = array[il..ih,jl..jh] of btype; 
mtype = array[il..ih,jl..jh] of boolean; 
Where btype is any base type 
n = the number of columns of m (jh-jl) 
Vashift performs a spiral shift on a matrix by a specified number of elements (s). That  is, the last 
element in the first row of the matrix spirals around to the first element of the second row, and 
so forth. The last element of the last row is discarded. Zeroes are inserted at the first element 
of the first row. The following is an example of a 5 6  array (m), spiral shifted by 3 elements (9) 
using the vshift function. 
DESCRIPTION 
1 2 3 4 5  0 0 0 1 2  4 5 6 7 8  
6 7 8 9 1 0  3 4 5 6 7  9 10 11 12 13 
11 12 13 14 15 8 9 10 11 12 14 15 16 17 18 
16 17 18 19 20 13 14 15 16 17 19 20 21 22 23 
21 22 23 24 25 18 19 20 21 22 24 25 0 0 0 
(a) Original matrix m. (b) Vshift(m;s) using s = 3. (c) Vshift(m;s) using s = -3. 
This function is useful in instances where a two dimensional array is being treated as a large vec- 
tor. 
AUTHOR 
M. Willebeek-LeMair 
SEE ALSO 
vrotate(2) 
A. P. Reeves PDS 1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
1 
I 
I 
I 
I 
I 
I 
W O T A T E  ( 2 ) PPSPDS Users Manual 
NAME 
vrotate - vector spiral rotate function 
{library vshift.pl } 
SYNOPSIS 
function vrotate(s:atype; e:integer):atype; extern n, mtype; 
TYPES 
atype = array[il..ih,jl..jh] of btype; 
mtype = array[il..ih,jl..jh] of boolean; 
Where btype is any base type 
n = the number of columns of m (jh-jl) 
DESCRIPTION 
Vrotatc performs a spiral rotate on a matr.2 by a specified number 0- elements , , That is, the 
last element in the first row of the matrix spirals around to the first element of the second row, 
and so forth. The last element of the last row is inserted at the location of the first element of 
the first row. The following is an example of a 5 4  array (m), spiral rotated by 3 elements (s) 
using the vrotate function. 
1 2 3 4 5  23 34 25 1 2 3 4 5 6 7  
6 7 8 9 1 0  4 5 6 7 8  9 10 11 12 13 
11 12 13 14 15 8 9 10 11 12 14 15 16 17 18 
16 17 18 19 20 13 14 15 16 17 19 20 21 22 23 
21 22 23 24 25 18 19 20 21 22 24 25 1 2 3 
(a) Original matrix m. (b) Vrotate(m;s) using s = 3. (c) Vrotate(m;s) using s = -3. 
This function is useful in instances where a two dimensional array is being treated as a large vec- 
tor. 
AUTHOR 
M. Willebeek-LeMair 
SEE ALSO 
vshift(2) 
A. P. Reeves PDS 1 
-I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
1 
I 
I 
I 
I 
1 
I 
Appendix B 
PARALLEL PASCAL DEVELOPMENT SYSTEM: TUTORIAL 
Anthony P. Reeves 
School of Electrical Engineering 
Cornel1 University 
Ithaca, New York 14853 
Abstract 
This document demonstrates various features of the Parallel Pascal Development system which 
consists of a Parallel Pascal Translator and a library of support functions. A basic knowledge of the 
Parallel Pascal programming language is assumed. An example compilation is presented and the con- 
tents of the generated files is discussed. Advanced debugging procedures are considered. 
Compiling a Program 
An example program which is stored in a file called square.pp is shown below: 
The first step is to create a source Parallel Pascal file having a .pp extension with any text editor. 
program square(input, output); 
{$library mat } 
const 
dim1 = 2; 
dim2 = 2; 
ar = array [l..diml, l..dim2] of integer; 
a:ar; 
type 
var 
procedure writemx( mx:ar; fmt:integer); extern 1, dim:!; 
begin 
read( a); 
a := a * a; 
writemx(a, 4); 
end. 
To compile this program for a conventional computer simply type 
If the computer system has a Pascal interpreter in addition to a compiler for fast program development 
this can be selected with a -i flag; Le., 
pp square.pp -i 
Assuming that  the first command is typed, selecting the local Pascal compiler, then the following mes- 
sages will be output to the terminal 
*** Pascal Library Processor & *** 
*** Parallel Pascal Translator *** 
Syntax Analysis Complete, 
1% 
I 
I 
I 
I 
I 
i 
I 
I 
1 
I 
I 
I 
I 
I 
1 
I 
1 
I 
TUTORIAL Parallel Pascal TUTORIAL 
No Errors detected. 
*** Pascal Compiler *** 
This indicates that  the compilation was successful. The following files are created: 
pplist 
square. p 
square 
When using the VMS operating system the file aquarc.p may be named square.pas. 
a listing of the Parallel Pascal Program 
a Pascal version of the Parallel Pascal Program 
a binary file ready for execution 
The listing file pplist has the following contents: 
1 program square(input, output); 
2 { library mat } 
3 const 
4 dim1 = 2; 
5 dim2 = 2; 
6 type 
7 
8 var 
9 a m ;  
10 procedure writenu( mx:ar; fmt:integer); 
'ar = array [l..diml, l..dim2] of integer; 
11 (*S- 
*)(*%+*)begin 
12 read(a); 
13 a : = a * a ;  
14 writemx(a, 4); 
15 end. 
Syntax Analysis Complete, No errors detected. 
In this program a single library mat is referenced which contains the procedure writemz. In the listing 
file the extern statement in line 10 is replaced by the library preprocessor with the body of the procedure 
writemx at line 11. The presence of an inserted body is indicated by the sequence (*%. . .*)(*&+*); 
Since this is usually of no interest to the programmer the body itself is not listed. In this way all line 
numbers in the listing file correspond exactly with the line numbers of the source program. 
Running a Program 
The exact procedure for running a program varies from system to system. An example dialogue 
with a UNM operating system is shown below for the program square. Input from the user is shown in 
italics. 
% 8qUare 
1 2 3 4  
1 4  
9 16 
% square 
1 
9 8  
5 
A. P. Reeves PP 2 
-1. 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
1 
TUTORIAL Parallel Pasca l  TUTORIAL 
1 81 
72 25 
% square < infile > ofile 
% 
These examples demonstrate the free format used for data input. The last command is an example 
of how, with UNIX, data may read from a file infile rather than the terminal and the result written to a 
file ofile. 
Program Debugging 
The usual cycle of events when errors are detected is to examine the listing file and then re-edit the 
source program. This procedure is slightly more difficult if an error is found in the body of a library 
function since these bodies are not listed. In general, errors in library functions are rare and are usually 
caused by an incorrect procedure declaration. 
Consider the program square in which such an error has been made; i.e., 
program square(input, output); 
{$library mat } 
const 
dim1 = 2; 
dim2 = 2; 
ar = array [l..diml, l..dim2] of integer; 
a:ar; 
type 
V a r  
procedure wr i t em(  mx:ar; fmt:real); extern 1, dim2; 
begin 
read( a); 
a := a * a; 
writemx(a, 4); 
end. 
On typing the command 
the result is: 
*** Pascal Library Processor & *** 
*** Parallel Pascal Translator *** 
Syntax Analysis complete, 
1 errors detected. 
*** No Compilation *** 
Only the file pplist is created; its contents are: 
1 program square(input, output); 
2 {library mat } 
3 const 
A. P. Reeves PP 3 
1. 
1 
D 
I 
1 
I 
D 
I 
I 
I 
I 
D 
I 
D 
I 
I 
I 
i 
1 
TUTORIAL Parallel Pascal 
4 diml = 2; 
5 d i m 2 = 2 ;  
6 type 
7 
8 var 
9 a:ar; 
10 procedure writemx( mx:ar; fmt:real); 
ll 
ar = array [l..diml, l..dim2] of integer; 
11 (*$I- 
**** -116 
*)( * $i+ *)begin 
12 read(a); 
13 a := a a; 
14 writemx(a, 4); 
15 end. 
* 
Syntax Analysis Complete, 1 errors detected. 
TUTORIAL 
116: error in type of standard procedure parameter 
The error has occurred on line 11 of the listing, i.e. within the body of the library procedure, but it is not 
clear what the cause is. In order to make library function bodies appear in the listing, specify the -6 flag 
for the compiler; Le., 
pp square.pp -s 
The messages to the terminal will be the same as before; however, p p l i d  will now contain the library pro- 
cedure bodies as shown below: 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
program square(input, output); 
{ library mat } 
const 
diml = 2; 
dim2 = 2; 
ar = array [l..diml, l..dim2] of integer; 
a:=; 
type 
V a r  
procedure wr i t em(  mx:ar; fmt:real); 
(procedure writemx(mx;fmt:integer); extern 101, hil;} 
V M  
i:integer; 
for i := 1 to dim2 do 
begin 
begin 
write(m[i,] :fmt); 
-116 **** 
18 writeln; 
19 end 
A. P. Reevee PP 4 
I- 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
TUTORIAL Parallel Pascal 
20 end; 
21 begin 
22 read(a); 
23 a : = a * a ;  
24 writemx(a, 4); 
25 end. 
Syntax Analysis Complete, 1 errors detected. 
TUTORIAL 
116: error in type of standard procedure parameter 
The reason for the error is now clear. The format parameter jmt must be declared to  be of type integer 
rather than real as shown in the comment on line 11. When using the -8 option, line numbers in the list- 
ing file which follow the first library function no longer correspond to  the line numbers in the source file. 
Using the Library Preprocessor 
Standard Pascal has no library facility; all subprograms Le., procedures and functions, must be 
present in the source program. A library preprocessor was developed to allow the use of libraries without 
violating the rules of standard Pascal. The header line of a library subprogram is specified in the source 
program with an extern directive. The library preprocessor replaces the extern directive with the 
appropriate subprogram body. The type information for the library subprogram is extracted from the 
declaration statement in the source program. Therefore, library subprograms can be -written to work 
with any user specified array type. 
If a library subprogram is to be used for more than one array type in the same block, then a sub- 
program declaration statement for each unique argument type is necessary. Each unique version of the 
subprogram is identified by a user specified extension to the subprogram name in both declaration and 
usage. 
For example, consider the ceiling function as defined below: 
function ceiling(x:xtype) : rtype; 
begin 
where x < 0.0 do 
otherwise 
ceiling := trunc(x) 
where x-trunc(x) = 0.0 do 
otherwise 
ceiling := trunc(x) 
ceiling := trunc(x)+l; 
end; 
The following program fragment illustrates how more than one version of this function could be specified 
for the library preprocessor. 
. . .  
tYPe 
ar = array [1..10] of real; 
ai = array [1..10] of integer; 
br = array [1..8, 1..8] of real; 
bi = array [1..8, 1..8] of integer; 
{$library math } 
A. P. Reeves PP 
-I* 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I1 
I 
I 
TUTORIAL Parallel Pascal TUTORIAL 
function ceiling.a(x:ar) :ai; extern; 
function ceiling.b(x:br) :bi; extern; 
V8U 
ax:ar; ay:ai; bx:br; by:bi; 
begin 
. . .  
ay := ceiling.a(ax); 
by := ceiling.b(bx); 
. . .  
A library file consists of the bodies of a set of library subprograms separated by their names pre- 
ceded by a #symbol. The following is the contents of a library file, such as math.pl, which contains the 
two library subprograms used in this tutorial. The usage of system library subprograms and their loca- 
tion is given in section 2 of the Parallel Pascal Development System Manual. In addition, by convention, 
a descriptive comment line for the format of the subprogram header is included with the body of each 
library subprogram. For further information on the structure of the library files see extern(1) and for the 
usage of these subprograms see mat(2) and math(2). 
#ceiling 
{ 
function ceiling(x:real(array)) : integer(array); extern; 
1 
begin 
where x < 0.0 do 
otherwise 
ceiling := trunc(x) 
where x-trunc(x) = 0.0 do 
otherwise 
ceiling := trunc(x) 
ceiling := trunc(x)+l; - 
end; 
#write= 
{ 
procedure writemx( mx:matrix;fmt:integer); extern 101, hil; 
1 
Yar 
i:integer; 
for i := $3 to 84 do 
write(mx[i,]:fmt); 
writeln; 
begin 
begin 
end 
end; 
#end 
A. P. Reeves PP 6 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Paral le l  Pascal TUTORIAL 
For the Intrepid Exp lo re r  
If errors are reported after the translator these are generated by the local Pascal Compiler or 
linker. In this caSe the translator usually reports No error8 detected then error messages appear on the 
terminal. The cause of these errors may either be an incorrect translation by the translator or by a limi- 
tation in the local Pascal compiler. 
The translator generates a standard Pascal program and the intrepid explorer may wish to examine 
this to find the cause for more obscure errors. But beware, this program is typically six times longer than 
the original source program with many obscure variable names. 
For example, the file s q u a w p  contains the following: 
program square(input, output); 
const 
dim1 = 2; 
dim2 = 2; 
ar =array[1..2,1..2] of integer; 
plltyp0 = ar; 
plltypl = array [1..2] of integer; 
type 
var 
a:ar; 
pllibool : boolean; 
pllvar0: ar; 
pllidxl: integer; 
pllidx0: integer; 
procedure writemx ( mx:ar; fmtinteger); 
var 
i:integer; 
pllvar0: p!ltypl; 
pllidx0: integer; 
for i:=1 to dim2 do 
begin 
begin 
for pllidxO:=l to 2 do 
writeln; 
write(mx[i,pllidxO] :fmt); 
end 
end; 
for pllidxO:=l to 2 do 
begin 
for pllidxl:=l to 2 do 
read( a[pllidxO,pllidxl]); 
for pllidxO:=l to 2 do 
for pllidxl:=l to 2 do 
a[pllidx~,pllidxl]:=a[pllidxO,pllidxl]*a[pllidxO,pllidxl] ; 
writemx(a,4); 
end. 
A. P. Reevea PP 7 
*I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
.I 
I 
I 
~~~ ~ _ _  
TUTORIAL Paral le l  Pascal TUTORIAL 
If still a glutton for more punishment, there is a debug compiler option {§c+ } which inserts the 
source program text as comment statements into the translated pascal file. This makes the pascal file 
easier to follow. For example, for full commented output add the debug option as shown below: 
{&+ } 
program square(input, output); 
{$library mat } 
const 
{ comment option } 
diml = 2; 
dim2 = 2; 
ar = array [l..diml, l..dim2] of integer; 
a:ar; 
type 
V a r  
procedure writemx( mx:ar; fmtinteger); extern 
begin 
read( a); 
a := a * a; 
writemx(a, 4); 
end. 
1, dim2; 
The file 8quare.p generated by compilation will now contain additional comment statements as follows: 
program square(input, output); 
const 
diml = 2; 
dim2 = 2; 
ar =array[1..2,1..2] of integer; 
plltyp0 = ar; 
plltypl = array [1..2] of integer; 
type 
VBT 
VBT 
a:ar; 
pllibool : boolean; 
pllvaro: ar; 
pllidxl: integer; 
pllidxo: integer; 
procedure writemx ( mx:ar; fmt:integer); 
i:integer ; 
pllvar0: plltypl; 
pllidx0: integer; 
for i:=l to dim2 do 
begin 
begin 
A. P. Reeves 
(*****> TRANSLATE A PROCEDURE <***** 
< < < ORIGINAL > > > 
< < < TRANSLATED > > > *) 
write(mx[i,] :fmt); 
PP 8 
~ 
I 
.I’ * 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
TUTORIAL Parallel Pascal TUTORIAL 
for pllidxO:=l to 2 do 
write(mx[i,pllidxO] :fmt); 
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
writeln; 
end 
end; 
begin 
(*****> TRANSLATE A PROCEDURE <***** 
<<< ORIGINAL >>> 
<<< TRANSLATED >>> *) 
read( a); 
for pllidxO:=l to 2 do 
for pllidxl:=l to 2 do 
read( a[pllidxO,pllidxl]); 
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
(***** > TRANSLATE AN ASSIGNMENT < ***** 
< < < ORIGINAL > > > 
< < < TRANSLATED > > > 
a:=a*a; 
*) 
for pllidxO:=l to 2 do 
for pllidxl:=l to 2 do 
a[pllidxO,pllidxl] :=a[pllidxO,pllidxl] *a[pllidxO,pllidxl] ;
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
write=( a,4); 
end. 
(*****> TRANSLATE A PROCEDURE <***** 
<<< ORIGINAL >>> 
<<< TRANSLATED >>> *) 
writemx(a,4); 
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
A. P. Reeves PP 9 
Appendix C 
DATA MAMputAnoNS ON 
T m M U s w L  Y PABALLEt PROCESSOR 
Anthony P. Reeves and C r i s t b  It Fxancfort de Sellor Mourr 
1- New York 14853 
school of Electrical Engineering 
cornell univariry 
Abstrau 
The dedgn of an eSective promaor in- . networkisama~probleminthedaignof 
tivcly expensive for such sg3tew an &ordabb network must be deigned which is capable of 
implementing the rpsrcmt tasks without a siflcant lcm in proamor uriliution. In this paper the 
m u h  network of the W v e l y  purllei Rocearrr LMPP) b anal- for a number of imparrrnt 
data snanipukths, The AMPP is a SIMD computer with 16384 proaring elemcnm connected in a 
128 x 118 mesh. The MPP is an inmesting system to study since it represents a minimal archi- 
t e c t ~  in both praesirrg element complexity and interconnection topology. The transfer mi4 
which is the number of elemenul pnrann opmtions required to execute a data mtnipuhtim, is 
introduced to m a w e  the performance of the AWP for different data rmnipuktiam and to esti- 
mate the effectivenu of hrdwM enhancemenu Thir ryp of analysis can be uad to compra 
the performance of different interconnection schema and may be applied to other highly prallel 
highly pdkl campltar - Adrru infcrc-schcmahathe - is pmhibi- 
L , I "  
In any highly parallel pnreaar ckaign a ma* considention ia the 
pmamr interamneetion shema Pmcsrmr must ommunicam 
with a sped which does me impede data w s  howcva, a 
general pioccs~e intcrconneccian network is usually prohibitiMly 
expensive when a lage number af praas~n M involved. A 
major design task is to daign a r#pincd nemork which is rda 
quare for the anticipated  task^ for the system. 
The ;MamveIy Parallel Roasm consists of 16384 b i t d l  Pro- 
cessing Elements (PES) connected in 128 x 128 ma& [ll That b 
each PE i s  connected to i u  4 adjacunt neighbon in a planrr ma- 
The wo d i m e ~ ~ i 0 ~ 1  grid is one of the simpbt inwranmdon 
topologies to implemeas siace the P e t  themselvu M e out in a 
planar grid fashion and all interconnections are between ad-t 
components Furthermore, thic topology is ideal for m o  dimcn- 
s i 0 ~ 1  filtering 0pcmti0~ which are common to low level image 
procerdng such as snail window convolution. 
The PFs are bitrrial, Le. the data paths are all one bit wide. Thir 
organization offen the maximum Rexibility, at the expnm of the 
highur degree of parallelirm. with the minimum number of con- 
tml lina For exanp1e. aa an alternative to the MPP conside 
2048 &bit wide P I 3  (on the MPP one chip COD- 8 l-bit PE'd 
'@e &bit version would have a la rich aet of insPuccionr re+ 
picrcdto pdefilled IYytaopentiona while the bit- proccron 
can procea any data formrt The advantag3 gainad with the bbit 
Proceedlngs o f  the Nineteenth Hawaii International 
Conference on System Sciences (HICSS-19) 1986 
syxtun is that full pma!morutilipriDnis achiwed with urayrof 
2048 elements while arrays of 16384 elements are rrquirsd for 
full utilization of the MPP. The MPF' PE is well matched to loat 
level image pmcemiq which often involve very l q e  data 
amy3 of shortintegen which may be from 1 to 16bia 
h tbirpaper the es&Venem ofthe MPP uchitccnw for wiav 
~ ~ d a t a n u o i p l l . t i o n r i s c o n d c h n d  The MPP offen a 
&pie brric model for analysis since it involm just mah inter- 
.ad b i t e  Prh ' totbirschemetospsd 
up SDm dpulationa M -d. The minim.l architectlam 
of the MPP is of pniculu in- to sndy. * any uchimc- 
trvc rwdi&srdanr to hproM pUf- arould 8mOn 
complex PE or a mare denw intaconnaxion wtcgy. Thr M p p  
pgrmmad in a high level laaguap alld I4uallol Pucrl I21 
The algorithms in this paper will be described in PMllCl 
notation. 
Ll TIIE MPP PROCESSNG ELEMPPT 
The iMpp pmaring element is shown in Fig. 1. -411 data paths are 
one bit wide and there am 8 ms on atingle a m  chip with the 
w ~ o r y o a ~ ~ o r y c h i p a  k e p t f r t h e s h i f t  
register. the Qdgn is eumially I minim.l architecum of thb 
t3pa TYM single bit full adder b uad for uithmaic opera- 
and the Boolean -* which implamma all  16 pmible two 
inpt logial functiws, is wed for all other opexatiorm The NN 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
IC 
I 
select unit is the interface to the interpnraso~i network and is 
wed to select a value from one of the four adjacent PES in the 
mcah. 
' h e  s register is wed for VO. A bit plana is slid into the S rrgi+. 
tm indepndent of the PE proccaing apration and it is then 
loadd into the local memory by cycle stealing one cycle- The G 
register is used in masked opentionz When masking is enabled 
only PFs in which the G register is set prform MY opratiOIU the 
remainder are idlc The marked operation is a very similar control 
feature in SMD deoigns. Not shown in Fig. 1. b an OR bur output 
from the PE All the outpuu are connected (OM) together 90 
that the control unit can determine if any biu are set in a bitphne 
in a single instruction. On the imp the local m c m ~ r y  h a  1024 
words (bits) and is implemented with bipolar Chips which have a 
35 os access timc 
The main novel feature of the MPP PE arcbitan~e b the 
reconfigurable shift register. It may be configund unda prognm 
control to have a length from 2 to 30 birr Impved @- 
is achieved by keeping operands ckcukting in the shift e 
which greatly reduces the number of loal memory acce.mm and 
insuuctions. It speeds up integer multiplhtian by a futcu of two 
and 3 h  hat M important e g c t  on lbtbppoint perf- 
1.2 PERFOLlMANcE EVALUATION 
In & to d p  the 
pmcesing qed of the PE and the speed of the il- 
of the in=- net- 
work for diSemit manipstinns it n - to cimmewciza the 
network. On the MPP both of these are data dependent we ham 
coacidcred three mpraentative ringledit Bod.QI data, bbit 
trt0g.r data and =-bit na-t G.ol) data For e u h  of them 
data types we have atbated a rypicrd time for an elexaenal 
o p n t i o ~  Them gcimrtar am of a reaumabla osda for thir 
the innruction cycle time for aman- -and operation on 
the MPP ir 100 ns An elemental boolean opention may \a con.- 
sidered to aka 100 nr: hmwver.it may ba argued that an opn- 
tion should involve two and have all vuiabla in 
memory in Which -tbra ~ ~ ~ ~ c r i a n r )  would 
require Mons. For our analpis a two instruction (200 MI model 
was urd to repnrnt Boolean instruction timea For the red and 
integer data a convenient n u m k  midway between the times for 
addition and multiplication was urd It should be remembmd 
that elemental operations also include many other functions such 
as -endental functions since these a n  be computed in tima 
comparable to a multiplication on a bit-serial uchirccrura BY 
adding a large amcunt of additional hrdwua to ach PE it b PC+ 
./sible to increase the sped of multiplication by 10 tima or molc 
m i n i m a l P E a r r b i ~ b u t m 8 n o t v u y ~ k  Forenmpk 
" [31 
For each of the data manipulations considered. times for the three 
different data typs will be computed. The performance of the 
MPP for each manipulation will be indicated by the ntio of the 
data transfer time to an elemental PE operation on the same data 
type; this will be called the transfer rocio. One way to look at 
this mi0 is the number of eiemental data operations which must 
be performed tetwnn data masfen for the data transfen not to 
be the dominant cast for the algorithm On the MPP data may be 
shifted between adjacent PES in OIU instruction time (100 nr) 
concurrently with a PE pmccrring iprmrcrion. 
For many applicaths the physical dhensions of the e l  
hardware are tmalla than tb4 dimensionsof the army to be pm- 
cesmi. In this am the dam array is Dmcea#d as a set of blocks An 
exfcnsiam of the data manipulation algorithns to deal with this 
SituationisdLnUrsd 
The program and algorithm exampla given in this papr usc the 
Parallel Pasal ~ t a t i ~ ~  This notation involves the following five 
expresioPr involving whole amp are prmittcd; for mixed 
operations between a E.lat and an array the scahr is repli- 
cated to form a c o n f a b l a  axray. 
the where - do - othorrna ' control statement b availabk 
This statement is a parallel version of the if - then - elst 
statemens the control elpMion must evaluate to a Bookan 
army. All m y  Migrunena within the controlled state- 
ments must k conformable with the control army and are 
marked by i t  
the functioru any and min are the army reduction functions 
or and minimum respectively. 
array W may be elided for subarray selectha For exam- 
pie, coprider a matrix 4 Qil spcidu the i'th row, .Lj 
rpcw the j'th column and (dJ specihs the whole maair. 
basic parallel data manipllathl opaatioru including 
machine primitive manipulatio~, are available as built in 
functions shift. mace. expand and trmupose. 
2. SHIFT AND ROTATE OPEELATIONS 
TJia only prmutation function which is dkectly implemented by 
the AMPP is the near neighbor rotate (or shift). The direction of the 
rotation may be in any of the foursvdinal dirationr In -1 
PasA the main permuutim functions are multklement mate 
and shift functionsother pnnutatiala Ifc built on - primi- 
ti-fa 
The rotate function takm as arguments the amy to be shifted and 
a displacement for each of the anap dimenrianr For examplo 
consider a- dimrruinn.l vrty aspecW by 
. *  
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I: 
I 
Boolean inteser rea1 Boolean i n m r  ral 
1 03 1.6 6.4 1.0 0.32 0.16 
64 6 5  51 210 33 10 52 
bibjl * 4 rfii). 
More formally, the data array involved in the prmuution are 
for  ir 1 to nmr* do 
apin 
but the situation is much better for real dam. 
3. MATPIXMAPPING 
The hrsr step is to compute matrica rr and rc which specify the 
distance data muse be moved rather than the atnolute loation of 
the data. 
In thh alporitam. data is only explicitly moved in the upward and 
left directions data which must be moved in the right or down- 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
oprat ian~in&f.  
B o o h  intemr ml 
Total Car 
original 29000 75000 230000 
optmimd 1MK)O 64ooo UOOOO 
Withrhift 16OOO SO000 17oo00 
m+ter 
ward directions is taken care of by the toroidal end conaecti0nS 
Specihed by the rmae primitive operation. 
This algorithm involves 0 ( n 3  operatiom for an n x n ma- 
Each iteration requires each PE to compute two 7-bit compnriSnS 
(which are shown in the where sta-t>. thwe take 1.4 W. - 
total c01f for 16384 iteratiom is shown in Table 2 It is of no 
surprise rhrt the wm are 10 high for arbimry mappings on the 
very restricted inraprocaot network of the MPP. HOWBV% 
table 2 is useful in that it provides an upper bound for 
dau mappings 
Transfer Ratio 
B o o h i n t e q a  red 
r m  15000 5800 
90000 13Ooo sux) 
8#xx) loo00 4200 
A h a 1  problem with thir algorithm ir that it hat a complexity Of 
slightly worse than O(n'X For exampk a 256 x 256 matrix 
requira 16 timm the computation required for a 128 I 128 nuuix. 
Alprithm 
0fob.l 
ROW 
Rowwith 
4 DATA BROAXAST 
oper8tionnnin~. Trrpdcr Ratio 
Boolean in- real boleur in- real 
0.4 3 . 2 1 3  2 0.640.32 
14 100 410 68 21 10 
14 16 21 68 32 os2 
The second type is a row or column brodM in wbich asiagle 
row (or column) is bmdcM toau man(0rcolumna) in the d t  
column may be reprscntrd in patallel h l  by 
forrmlly. t h ~ m  &mthS for rhe k'th M OT 
fori:-1tonrowdo (rowdiruibution t 
bid * dkJ 
for j + 1 to ncol do { column distribution 
GI* &k 
h fact the function expond is included in the k l l e l  pual 
,. ' without a loop: r-g. 
,language so that this impomnt tmdm.rflatioa may ba rpcciaed 
b :- expand(dkJ1k ( row distribution 1 
For global brodast the bntstep is to transfer the value to be d b  
tributed from the matrix to the control unit. The MPP bar a g low 
a hardware mechmha which can detect if any bit in bitpkne 
is o c t  This mechanism can also be used toextract the b u r n  or 
minimum value in a matrix in a similar amaut of timu To 
upaft a value from a matrix 6rst set a mask which is true only 
at the F% of interut (14 instructiod Then d thir mask with 
each bit p b  of the data matrix and detect if any Valu is set 
with the global of mahanum . Grinspucuanr . w h n i s t h e  
number of bia in the dataL This value is then distributed to 
ms with the inaauccion spurn (which may b considered L1 a 
globll brmdarr m e c u  tbir requirw n i,mtsuc- 
A simple algorithm to do row (or column) distribution is to take 
the row (or column) and distribute it .QOP t!m matrix in 128 
srcp Far example, the following algorithm wil l  distribute the 
k'th row of a Boolean matrix o. 
mask r idr - k 
braandmask 
fori - 1 to nrow do 
b * b or rotadb, 1, Ox 
For integer and real data the above algorithm can be repeated for 
each dam bit p l a r ~  algorithm hat complexity O(ma 1 where 
A is the number of bia in each data element and m is the length 
of the row. On the MPP a faster solution ir pamible involving the 
PEshift rrginar All bits of the row to be diauibuted are &n 
shifted in a d w n t  rowa of a single bitphne then, as the data 
pls by a PE. th PE loads the data into iushift  re-. The 
complexity of tbir dgoritbm is O(n+mL The fa0 of -emit 
diatribution algoritbnu are given in Tabb 2.
Tabb 3 : Car for the Ddbution Algorithml 
0 1 2 3 4 S 6 7  inputvectar 
0 4 1 S 2 6 3 7  shuEkdvecmr 
A 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
i I I+’ 
I 
I 
I 
/ 
s. SHUFFLES AND PYLLAMIDS 
In this section shu& pennuutbns are considered. In addition to 
their usunl applications to algmithmr such a the Fast Fourier 
Transform (FFT) they are a h  the basic prmutation uxd to d- 
pulate ppnmid data SU-  he shufEe pnnutatkm in one 
dimension is illuspated below the vector is rplit into two equal 
halva and then interleaved w in shu-g a pack of carb. 
The t w u e r u i o n a l  shufae of a matrix may be considered = a 
wmbination of two orrhogonal OnbdimCnrioMi &U&S hnr all 
the rows are rhufaed and then the columnr The algorithms we 
have consided CM also be divided into two half shume open- 
tiont for example, a half shufae may map only elements 0.1.2 
and 3 of the above input vector. 
Detailed algorithms will only be given far a half s h u a  opera- 
a full shume reqUirar two half sh& and a ~wo-~~IB&M~ 
shuae nquirea four half sh- The invarr &u& permucatia 
is a h  of in- The shua algorithm M a h  perform inverso 
shufAa for a similar eoa 
Operations on pyramd data suuctura are of interat to some 
image pmceaing apppliariopr A pynmid may be Eonddrrad to be 
a connected aet of mulrirrsOluriOn mamcu At the apex is one ele- 
ment, at the next level there a four elements in a 2 x 2 matrix 
uch comecud to the apex. At the next level there is a 4 I 4 
matrix which may be considered ala disjoint bt d four 2 x 2 ab- 
mauicu with each subnupix being connected to om of the ele- 
meats in the matrix at the next higher level In geaed.  the k’th 
level is a ZC x P matrix with a c h  element being d t d  with 
one element in the level above and four in the lewl tclow. On 
the Mpp. the top6 lcvalr of apJrrmid maybe stored in 1 single 
are m d  to perform shift8 in dl or a a1ecte.d atbet of lewb 
rimulmmously. The other f-ol opnrioo is to move data 
up or down the leveia of the pynmd The -0~1 half 
mutations for th- pymnid opm- 
Two half shume algorithms are conridcrrd; the hnr ir the drop 
algorithm. 
ma- (the PJmmid hu 164 x 64 -1 Mmkedahift Opaati0n.l 
in- shume and t w ~ l u l o n a l  shume are the raspctive pa- 
when mark do 
b * q  
md; 
In this algorithm the input matrix a is slid over the PFs and when 
the comcf elements are aligned with their denhati- they are 
dmpped into them. The spcific half shufae performed is deter- 
mined by the setup paramctctr For uample, if d is set to 1 and 
md isset to-1 then an inverse half shume will be implemented. 
The second algorithm is called the take algorithm. 
mask * faw 1 take algorithm 1 
f o r i r 1 t o ~ 0 ~ d i v Z - l d o  
mask~NowdiV2-115.true; 
asin 
where mask do 
b * e  
b :- S h i f t  (b. -1 , 0); 
mask * shift (b, 1.0); 
end; 
b [OJ r a [Os 
For this algorithm the reault matrix b is slid over the PE‘r When 
it is at the correa location to maive data it rakas from the PE 
before ap ing  on to its dnal datination. 
The coat of them algorithms is shown in table 4. These algorithm 
have a similar performance when the PE M h i t e c n u c  is simple but 
thc toke algorithm is better when the shift regher is used. This is 
kuux the matrix being mondand mndifiad isthesmc far this 
algorithm (b) and it can be CfBEiently operated on when Rored in 
theshiftrcgisrcr. 
&MATRECWAILpINo 
In many image pmerring appriationr a mrrptrg or rubber shat 
dhtmion of amatrixb needed. Tho warp exhibiu a locllitp prc- 
patp which ir ch.ractdsic of laany impomnt dul mmipuk- 
tiom Th.t is tha ad- tetwwa neigh- matrix elemrmu 
is lOmBly &mined by the tl.nrfosm. -om laany eIrmmtr 
willb movd by similar amounts t, the disuaca to be maMd 
matricaa Gr and rc as denad in action 3) will contab Inany of 
the nmaorsimiktvalutd elements 
We have developd a heurbtic algorithm which attempm to only 
maLC m o w  which are needed by the pvricular datamanipula- 
tion; thh is in contrast to the simple mapping algorithm of -on 
3 which pmacb through all n2 poarible data displacemenu 
Tho algorithm finc slides (rotates) a as many ~ocations up aad left 
aa poaible such that future brck?ncking will not be neccmuy. If 
any element of a k correctly @tk& over b (Le. (w - 0)  and 
? 
, 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Al@thm 
Takeor h o p  
DropwithSR 
Takewith= 
shift re- 
Operatum cos in ws. Transfer Ratio 
Boolean integer real Boolean in- real 
32 220 830 160 4 s  21 
32 170 630 160 35 16 
32 110 420 160 23 11 
Gc - 0)) then big updatsd. 0th- ar. wbich it a copy of the 
current version of a, is slid in the upwards direction until all out- 
standing elemenu of b. for which the current rc - Q are s- 
The algorithm then shifu a far a posible up and left again and 
repeats the above procedure until all elemenu of the result mask 
ye f a 4  Le. b b complete. 
The following variables are urd in the algorithm : 
~~ ~~~~ ~ ~ _ _  
else 
begin (satisfy each element for the given column) 
when re  = 0 do 
m r r  
otherwire 
mt r Now; 
rit * min(m. 1.2k 
mnsktrrm-rir; 
1 the next seven statemenu implement t 
( the statement (UT - rotate (a. rit, 0) I 
( but a h  take advanrage of the previous shifts 
if a 0 0  then 
btgh 
M * q  
lasfra*Q 
end; 
am :- m u d  cur, rit - kmir. 0 ); 
M*** 
end: 
agin 
where m k t r  do (update b for the current location of a) 
b r o c r ;  
-*Now; 
re *&  
mask * f.lsy 
snd: 
ma; 
Variable functioas 
elements need to be mapped, the mapping may k acbiemd in a 
very small number of i t e m h  It is m t  to rcpnrcnt the oprt 
m k  :the result mask, trim d u r a  indiate elmmenu o f b  
which have not pat received the correct elernant of& 
i a :  mw and column dic9nsar for the upleft mova 
of this algorithm by any single number since io performance is SD 
dependent on the speci6ed dau mapping. For a more detailed 
analysis of this algorithm and i o  applications see [41 
7. LARGE ARL4YS 
Frequently the data to be procaacd by a parrllcl pmceuor will be 
in the format of axnyr which exceed the 6xed range of parrllcl- 
ism of the hardware. Theref- it b ne- to have special 
algorithms that will deal with large arrays by baking them 
down into block, mmrgabb by the hudanzr, without laaing 
track of the r e l a t i d i p  between merant b l a h  
One scheme, which is frequently used on the MPP, is to parcitia~ 
the large vray into block, which are conveniently stored in a 
four dimensional array. The a g e  of the Bxsr dimension of this 
array spec- the number of blocks in uch row of the large 
matrix and the range o f t h c d d i m c n d n n  spec- the number 
of block, in each column. G i m  .conceptual large matrix 
I. 
, 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Element i,j of the large matrix is mapped into the array a u 
rpcifid by 
For ezample, a 512 x 256 matrix a u l d  ba stored in eight blocks as 
lo : array [14112&12a1-128l of rral: 
This data structure allow blockr to be manipulated indepn- 
dently. However, it still the pori- rela-ip Of 
those blocks in the original large matrix. 
To simplify the mrnipllation of large arrap on the LWP, w0 
Parallel Pascal libnry funcdorrr irocau and Lhift have ban 
develops nluefunctiontmkammyugumwltandmdi, 
placement argumea- like tha prhitiw matrix rootr and ahift 
functionr, howsva. in thia am the m y  vgumrnt ia a f w  
dimcadnlul urap which is UMtd like 8 colmpaully large 
mauix. 
h h y  prognms caa b convertad to opmm a0 blocked mtha 
than coDv?a- ma- by rimply rapluing a - at 
mat4 and shift with lrotata and khitt -y. This is m4 
for the data !MllipUl.th algorithm pr=ntK% hoarrva. exap 
for near neighbor typa algorithm& 8 dif[crant rmtcgy ia UUdy  
mon dedsnt n m  tnasfar m t h  for far Merant ai& laqe  
arrays are shown in Tabla 5. It should b noted that not much 
pmcerring can be dona with S12 I 512 real ma- on the 
current 'MPP since a singla matrix nquirea half of tha loal 
memory. 
Single clement shifts hrva slightly VOCI v8lw f a  larga unyr 
due to the small overhaad inmltnd in novingdauat tha edgad 
bladrs to other bl& Bmdcan opratiOar have bccgl trader 
ratiosfor luge amy since once a block ha been set up it can be 
swred in several ruult blocltr TIM BuiUa algorithm for ach 
block of a iarge matrix b -tially the mma aa for a -11 
matrix (simple muked swapping of ma& hrlvcr doa -rI, the 
uansfer ratio is not signidcrntly churged far large amys On the 
other hand, for porrmiddata maum the data~torrgc beyond 
level seven baomm much dmpier rad more diciea~ The 
transfer ratio in this- is greatly improvedfor luge 
The ccst of the simple mapping alprithm hu a complexi~ of 
Oh'>  therefore the aandsr ratia are four tinea higher for the 
2S6 I 256 mamix and 16 tima higher for as12 x 5 1 2 ~ p i L  In 
m m p r a c r i a l c u s a t h e ~  algorithmwillbafarruprimto 
thc d p b  algorithm:-,fOr 8 PuU Mdom Mm thsn 
/ia no kna~n efilcicnt aolutina 
For the heuristic mapping algorithm a good straw for lKge 
array to sari through the result blocks and perf- o p r a w  
on only the input blocks that contribute to the current d t  
block proasad This algorithm L shown below. 
Perm2 is the heuristic algorithm prarentcd prrvioruly with the 
modibation that tha initial m& value ia as an vgumcnt 
That is, only elements sleeted by the nuak are permuted. An 
additional -up is achievad by thh since the hcurirric worka 
much be- when only a s u b t  of elements are to b pernut& 
. . a .  .- 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Dam 1MMipUlatim 
Data shift 
a) 1 element 
b) worn cax 
Bmadcan 
a) Global 
e) Row with 
shift re-r 
S h a e  
(I-dimensiona1) 
psramid 
(up or down) 
d Y s i z e  
128 x 128 256 x 256 512 I 512 
Boolean intcqer real Boolean in- real Boolean in- m l  
1.0 0.32 0.16 2.0 OS 0.24 2.0 05 0.24 
33 10.2 5.2 33 10 5.2 33 10 5.2 
I , 
2 0.64 13 0.88 0.28 0.14 0.59 0.20 0.09 
68 3.2 0.52 3s 1.68 0.30 183 0.92 0.19 
640 90 42 640 90 42 640 90 42 
320 45 21 58 4.3 U 16 1.2 0.98 
have a much more & k t  implemeatath on the Mpp. 
On the DAP row and column dircribution is implemented directly 
by special hardware b u a  For the MPP we can see fromTable 3 
data operations and pamibly v u y  little advantage for integer 
operations. A secood architecture modikatioa that lw bemi 
the pyramid layer structure in hard- FnrmTabla 5 we can 
that no advancage would be gained from thi3 hardware for real 
mantly pmpoad by scycral m o r a  b to directly imphmmt 
rc that the MPP rlrerdy implanalu pynmid ~ t i o l l a o n  large 
ppramidr Vary Cscientlx theIdtm& is little advlnpgr 
to be gainedfrom theadditionotspecialhardumm. 
I 
I 
I 
I 
1. 
2 
1 
4. 
5. 
6. 
,I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
Appendix D 
THE MASSIVELY PARALLEL PROCESSOR 
A HIGHLY .PARALLEL SCIENTIFIC COMPUTER 
Anthony P. Reeves 
School of Electrical Engineering 
Cornell University 
Ithaca, New York 14853 
INTRODUCI'ION 
The Massively Patallel Processor W P )  [1,21 is a highly parallel scientific computer 
which was originally intended for image processing and analysis applications but it is also 
suitable for a large range of other scientific applications. Currently the highest degree of 
parallelism is achieved with the SIMD type of parallel computer architecture. With this 
scheme a single program sequence unit broadcasts a sequence of instructions to a large 
number of slave Processing Elements @E's). All PES perform the same function at the 
same time but on Merent  data elements; in this way a whole data structure such as a 
matrix can be manipulated with a single instruction. The alternative highly parallel 
organization, the MlMD type, is to have an instruction unit with every PE This scheme is 
much more flexible but also much more complex and expensive. 
Computers based on the SIMD scheme are usually very effective for applications 
which have a good match to the constraints of the architecture. Furthermore, they are 
usually also extensible in that it is possible to increase the performance for larger data 
structures by simply increasing the number of PES 
In order to utilize the features of a SIMD system, as with all computer designs, it is 
important for the programmer to have some knowledge of the underlying architecture; 
for example, it is important to know that some matrix operations have the same cost as a 
scalar operation. For these systems a special programming environment is usually used 
and, in general, serial programs designed for conventional serial computers must be refor- 
mulated for highly parallel architectures. The MPP is programmed in a high level 
language called Parallel Pascal [31 Therefore, the main advantage of SIMD systems is a 
much more cost effective method for doing scientific computing than conventional com- 
puter or supercomputer systems The major disadvantage is that the user must become 
familiar with a new kind of programming environment. 
A major consideration in the design of a highly parallel mmputer architecture is the 
procesror interconnection network. Processors must communicate with a speed which 
does not impede data processing; however, a general processor interconnection network is 
usually prohibitively expensive when a large number of pmessors are involved. A 
major design task is to design a restricted network which is adequate for the anticipated 
tasks for the system. The mesh interconnection scheme, used on the MPP, is simple to 
- 
I 
I 
I 
1 
I 
1 
I 
I 
I 
1 
I 
I 
I 
I 
1 
I 
I 
I 
I 
implement and is well suited to a large number of image processing algorithms. 
Tbe first section of this chapter describes the architecture of the MPP. Then the con- 
venient data strutters which can be manipulated by the MPP are outlined and the high 
level programming environment is discussed The performance of the processor intercon- 
nection network for important data manipulations is considered in detail since this indi- 
cates which algorithms can be efkiently implemented on the MPP. Finally, some 
current applications of the MPP are outlined. 
The impact of Techndogy 
The implementation of highly parallel processon is ma& possible with VLSI tech- 
nology. The MPP was designed with the technology available in 1977. The custom p r e  
cessor chip has 8OOO transistors on an area of 235 x 131 m i l s  and contains 8 bit-serial 
PES. I t  is mounted in a 52 pin fiat pack and requires 200 mW at 10 MHZ; 5 micrometer 
design rules were used. The SIMD mesh architecture can directly take advantage of the 
ongoing major advances in VLSI technology. A number of more advanced chips have 
been developed since the MPP design. For example, the GAPP chip [41 developed with 
the technology of 1982 has 72 bit-serial PE's, each having 128 bits of local memory. This 
chip requires 500 mW at 10 MHz ITT [51 is predicting that by 1987 they will be able to 
make 16 16-bit PES (with four spare) on a single chip with each PE having lk  words of 
local memory. This chip would use 1.25 micrometer design rules and would involve 
600,OOO transistors on an area of 450 x 600 mils For the more distant future, advantage 
can be taken of wafer scale integration as soon as it becomes economically available. 
Techniques for dealing with the fault tolerance needed with such a technology have 
already been considered [63. 
THEMPPARcHxTEcruRE 
The Massively Parallel Processor consists of 16384 bit+erial Processing Elements 
@E's) connected in 128 x 128 mesh [l]. That is each PE is connected to its 4 adjacent 
neighbors in a planar matrix The two dimensional grid is one of the simplest intercon- 
nection topologies to implement, since the PE's themselves are set out in a planar grid 
fashion and all interconnections are between adjacent components. Furthermore, this 
topology is ideal for two dimensional filtering operations which are common to low level 
image processing such as small window convolution. 
The PES are bit-serial, Le. the data paths are all one bit wide. This organization 
offers the maximum flexibility, at the expense of the highest degree of parallelism, with 
the minimum number of control lines. For example, as an alternative to the MPP con- 
sider 2048 %bit wide PES (on the MPP one chip contains 8 1-bit PE'd The &bit version 
would have a less rich set of instructions restricted to predefined byte operations while 
the bit-serial processors can process any data format The advantage gained with the 8- 
bit system is that full processor utilization is achieved with arrays of 2048 elements 
while arrays of 16384 elements are required for full utilization of the MPP. The MPP PE 
is well matched to low level image processing tasks which often involve very large data 
arrays of short integers which may be from 1 to 16 bits. 
The effectiveness of the MPP architecture for various interprocessor data manipula- 
tions is considered. The MPP offers a simple basic model for analysis since it involves just 
mesh interconnections and bit-serial PES. The minimal architecture of the MPP is of par- 
ticular interest to study, since any architecture modifications to improve performance 
would result in a more complex PE or a more dense interconnection strategy. 
T h e  MPP Processing Element 
The MPP processing element is shown in Fig. 1. All data paths are one bit wide and 
there are 8 PES on a single CMOS chip with the local memory on external memory chips. 
Except for the shift register, the design is essentially a minimal architecture of this type 
The single bit full adder is used for arithmetic operations and the Boolean processor, 
which implements all 16 possible two input logical functions, is used for all other 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
V 
Progro m 
... 
... 
.. . 
-.. 
/ /  
I + 1 i 
Figure 1. The MPP Processing Element 
operations The NN select unit is the interface to the interprocessor network and is used 
to select a value from one of the four adjacent PES in the mesh. 
The S register is used for VO. A bitplane is slid into the S registen independent of 
the PE processing operation and it is then loaded into the local memory by cycle stealing 
one cycle. The G register is used in masked operations When masking is enabled only 
PES in which the G register is set perform any operations; the remainder are idle. The 
masked operation is a very common control feature in SIMD designs. Not shown in Fig. 
1. is an OR bus output from the PE All these outputs are connected (ORed) together so 
that the control unit can determine if any bits are set in a bitplane in a single instruction. 
On the MPP the local memory has 1024 words (bits) and is implemented with bipolar 
chips which have a 35 ns access time. 
The main novel feature of the MPP PE architecture is the reconfigurable shift regis- 
ter. It may be configured under program control to have a length from 2 to 30 bits 
Improved performance is achieved by keeping operands circulating in the shift register 
which greatly reduces the number of local memory accesses and instructions It speeds 
up integer multiplication by a factor of two and also has an important effect on fioating- 
point performance. 
I 
I 
I 
I 
I 
I 
I 
I 
I 
s 
I 
1 
1 
I 
I 
I 
I 
I 
I 
Host Computer 
(VAX 11/780) 
Array Edge Connections 
The interprocessor connections at the edge of the processor array may either be con- 
nected to zero or to the opposite edge of the array. With the latter option rotation per- 
mutations can be easily implemented. This is particularly useful for processing arrays 
which are larger than the dimensions of the PE array. A third option is to connect the 
opposite horizontal edges displaced by one bit position. With this option the array is con- 
nected in a spiral by the horizontal connections and can be treated like a one-dimensional 
vector of 16384 elements. 
Data Bus Staging Memory 
B 32 Mb 4 
The MPP Control Unit 
A number of processors are used to control the MPP processor Array; their organiza- 
tion is shown in Fig. 2 The concept is to always provide the array with data and 
instructions on every clock cycle. The host computer is a VAX 11/780; this is the most 
convenient level for the user to interact since it provides a conventional environment 
with direct connection to terminals and other standard peripherals. The user usually 
controls the MPP by developing a complete subroutine which is down loaded from the 
VAX to the main control unit (MCU) where it is executed. The MCU is a high speed 
S b i t  minicomputer which has direct accesp to the microprogrammed array control unit 
(ACU). It communicates to the ACU by means of macro instructions of the form "add 
array A to array B". The ACU contains runtime microcode to implement such operations 
without missing any clock cycles A first in-fint out (FIFO) buffer is used tb connect the 
MCU to the ACU so that the next macro operation generation in the MCU can be over- 
lapped with the execution in the ACU. A separate VO control unit (IOCU) is used to 
control input and output operations to the processor array. It controls the swapping of 
bitplanes between the processor array and the staging memory independent of the array 
processing activity. Processing is only halted for one cycle in order to load OT store a bit- 
plans 
- Ill - 
Wain Control I/O Control I 
Unit (WCU) 3 Unit (IOCU) + in 
VO and the MPP Staging Memory 
The staging memory is a large data store which is used as a data interface between 
peripheral devices and the processor arrax it provides two main functions First, it per- 
forms &cient data format conversion between the data element stream which is most 
commonly used for storing array data to the bitplane format used by the MPP. Second, 
Ill 
ir 
Figure 2. MPP System Organization 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
it provides space to store large (image) data structures which are too large for the prow+ 
sor array local memory. 
The VO bandwidth of the processor array is 128 bits every 100 ns; Le., 160 
Mbytes/second When fully codigured with 64Mb of memory the staging memory can 
sustain the MPP VO rate Currently the MPP is configured with 32 Mb and can sustain 
half the optimal VO rate. In addition to data reformating, hardware in the staging 
memory permits the access of 128 x 128 blocks of data from arbitrary locations in a large 
(image) data structure. This feature is particularly useful for the spooling scheme which 
is outlined in the following section. 
MPP DATA STRUCKIRES 
The 128 x 128 array is the equivalent to the natural word s h  on a conventional 
computer since elementary MPP instructions manipulate 128 x 128 bitplanes Stacks of 
bitplanes representing integers and 32-bit real numbers are also basic instructions in the 
MCU which are supported by runtime subroutines in the microprogrammed control uni t  
The fundamental data structures for the MPP are shown in Fig. 3.a. The long vec- 
tor format is supported by the hardware spiral edge connections However, for long data 
shifts the vertical interprocessor C O M ~ C ~ ~ O I U  can also be used. The 128 x 128 matrix is 
the most natural data structure for the MPP. Higher dimensional data structures may 
also be implemented. If the last two dimensions of the data structure are 128 x 128 then 
higher dimensions are simply processed serially. If the last two dimensions are less than 
128 x 128 then it may be possible to pack more than two dimensions into a single bit- 
plane. For example, a 16 x 8 x 8 x 4 x 4 data structure can be efficiently packed into a 
128 x 128 array. Convenient data manipulation routines for this data structure can be 
ef€iciently developed at the Parallel Pascal level. 
Frequently, the data to be processed by a parallel processor will be in the format of 
arrays which exceed the fixed range of parallelism of the hardware. Therefore, it is 
necessary to have special algorithms that will deal with large arrays by breaking them 
down into blocks manageable by the hardware, without loosing track of the relationship 
between different blocks 
There are two main schemes for storing large arrays on processor arrays the bZocked 
scheme and the crinkled scheme; these are illustrated in Fig. 3.b. Consider that a M x M 
array is distributed on an N x N prooessor array where K = M / N is an integer. In the 
blocked scheme a large array is considered as a set of K x K blocks of size N x N each of 
which is distributed on the processor. Therefore, elements which ar allocated to a single 
PE are some multiple of N apart on the large array. In the crinkled scheme each PE con- 
tains a K x K matrix of adjacent elements of the large array. Therefore, each parallel 
processor array contains a sampled version of the large array. For conventional array 
operations which involve large array shift and rotate operations both blocked and crin- 
kled schemes can be implemented with only a very small amount of overhead. The crin- 
kled scheme is slightly more efficient when shift distances are very small and the blocked 
scheme has a slight advantage when the shift distance is of the order of N. 
The third type of data structure which can be manipulated on the MPP is the huge 
array which is much too large to fit into the 2 Mb MPP local storaga This scheme, the 
spooled organization, involves the staging memory and is illustrated in Fig 3.c In the 
spooled scheme the data is stored in the staging memory and is processed one block at a 
time by the processor array. The VO operations to the staging memory are overlapped 
with data processing so that if the computation applied to each block is large enough then 
the cost of spooling will be negligible. However, if only a few operations are applied to 
each block the the I/O time will dominate. For near neighbor operations one possibility is 
to perform a sequence of operations on each block without regard to other blocks The 
boundary elements of the result array will not be valid. This is circumvented by read- 
ing overlapping blocks and only writing the valid portions of the result blocks back to 
memory. 
. 
n-dimensional array  
-16384- 
Large Matrix 
Block 
1 
I 256 
El a 
-256- 
Spooled Matrix 
A matrix is 
processed as a 
sequence of , 
over lapped blocks 
-128- 
Crinkled 
I mu*-- 
'i" * - .  c 0 . . *  0 
-256- 
Staging 
Memory 
32 Mb 
I I 
I Processor 
Array 
128 x 128 
(2 Mb) 
n 
Figure 3. Data Structures for the MPP 
THE MPP PROGRAMMING ENVR0"T 
There are three fundamental classes of operations on array data which are fre- 
quently implemented as primitives on array computers but which are not available in 
conventional programming languages, these are: data reduction, data permutation and 
data broadcast These operations have been included as primitives in the high level 
language for the MPP called Parallel Pascal. Mechanisms for the selection of subarrays 
and for selective operations on a subset of elements are also important language features 
High level programming languages for mesh connected SIMD computers usually 
have operations similar to matrix algebra primitives since entire arrays are manipulated 
with each machine instruction. A description of Parallel Pascal features is given else- 
where in this proceedings [71 A brief synopsis of important features of this language fol- 
lows. 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
m 
Parallel 
designed for 
Pascal is an extended version of the Pascal programming language 
the convenient and efficient programming of parallel computers. 
which is 
It is the 
first-high level programming language to be implemented on the MPP. Parallel Pascal 
was designed with the MPP as the initial target architecture; however, it is also suitable 
for a large range of other parallel processors. A more detailed discussion of the language 
design is given in Dl 
In Parallel Pascal all conventional expressions are extended to array data types. In a 
parallel expression all operations must have conformable array arguments. A scalar is 
considered to be conformable to any type compatible array and is conceptually converted 
to a conformable array with all elements having the scalar value. 
In many highly parallel Computers including the MPP there are at least two 
different primary memory systems one in the host and one in the processor array. Paral- 
lel Pascal provides the reserved word pwauel to allow programmers to specify the 
memory in which an array should reside. 
Reduction Functions 
Array reduction operations are achieved with a set of standard functions in Parallel 
Pascal. The bit argument of a reduction function specifies the array to be reduced and 
the following arguments specify which dimensions are to be reduced. The numeric 
reduction functions maximum, minimum, sum and product and the Boolean reduction 
functions any and all are implemented. 
Permutation Functions 
One of the most important features of a parallel programming language is the facil- 
ity to specify parallel array data permutations. In Parallel Pascal three such operations 
are available as primitive standard functions shift, rotate and transpose. 
The shift and rotate primitives are found in many parallel hardware architectures 
and also, in many algorithms. The-Wt function shi f ts data by the amount specified for 
each dimension and shifts zeros (null elements) in at the edges of the array. Elements 
shifted out of the array are discarded. The rotate function is similar to the shift function 
except that data shifted out of the array is inserted at the opposite edge so that no data is 
lost. The ht argument to the shift and rotate functions is the array to be shifted; then 
there is an ordered set of parameters, each one specifies the amount of shift in its 
corresponding dimension. 
While transpose is not a simple function to hplement with many parallel architec- 
tures, a significant number of matrix algorithms involve this function; therefore, it has 
been ma& available as a primitive function in Parallel Pascal. The first parameter to 
transpose is the array to be transposed and the following two parameters specify which 
dimensions are to be interchanged. If only one dimension is specified then the array is 
flipped about that dimension. 
Distribution Functions 
The distribution of scalars to 'brays is done implicitly in parallel expressions To 
distribute an array to a larger number of dimensions the expand standard function is 
available. This function increases the rank of an array by one by repeating the contents 
of the array along a new dimension. The fvst parameter of expand specifies the array to 
be expanded, the second parameter specifies the number of the new dimension and the 
last parameter specifies the range of the new dimension. 
This function is used to maintain a higher degree of parallelism in a parallel state- 
ment which may result in a clearer expression of the operation and a more direct parallel 
implementation. In a conventional serial environment such a function would simply 
waste space. For example, to distribute a Nelement vector A over all rows of a N x N 
matrix, the expression is "expand(k1,llN)"; as an alternative, to distribute the vector 
over the columns, the second argument to expand should be changed to 2 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
S u b h a y  Selection 
selection of a portion of an array by selecting either a a l e  index value or dl index 
values for each dimension is frequently used in many parallel algorithms; sg, to select 
the ith row of a matrix which is a vector. In Parallel Pascal all index values can be 
specified by eliding the index value for that dimension. 
conditioaal Execntion 
An important feature of any parallel programming language is the ability to have 
an operation operate on a subset of the elements of an array. In Parallel Pascal a where - 
do - orherwise programming construct is available which is similar to the conventional if 
- then - else statement except that the control expression results in a Boolean array rather 
than a Boolean scalar. All parallel statements enclosed by the where statement must 
have results which are the same size as the controlling array. Only result elements 
which correspond to true elements in the controlling array will be modified. Unlike the 
if statement, both clauses of the where statement are always executed. 
System Support 
In addition to the MPP Parallel Pascal compiler there is a Parallel Pascal translator 
and a library preprocessor to aide high level program development. The translator 
translates a Parallel Pascal program into a standard Pascal form. In this way, conven- 
tional serial computers can be used to develop and test Parallel Pascal programs if they 
have a standard Pascal compiler. 
Standard Pascal has no library facility; all subprograms Le., procedum and func- 
tions, must be present in the source program. A library preprocessor was developed to 
allow the use of libraries without violating the rules of standard Pascal. 
MPP Compiler Restrictions 
The Parallel Pascal compiler for the MPP cunently has several restrictions The 
most important of these is that the range of the last two dimensions of a parallel array 
are constrained to be 128, Le., to exactly fit the parallel array size of the MPP. It is p i -  
ble that language support could have been provided to mask the hardware details of the 
MPP array size from the programmer; however, this would be very diflticult to do and 
efEcient code generation for arbitrary sized arrays could not be guaranteed. Matrices 
which are smaller than 128 x 128 can usually be fit into a 128 x 128 array by the pro- 
grammer. Frequently, arrays which are larger than 128 x 128 are required and these are 
usually fit into arrays which have a conceptual size which is a multiple of 128'x 128. 
For example, a large matrix of dimensions (m * 128) x b * 128) is specified by a four 
dimensional array which has the dimensions m x n x 128 x 128. There are two funda- 
mental methods for packing the large matrix data into this four dimensional array (see 
Fig. 3b), this packing may be directly achieved by the staging memory in both cases 
Host programs for the.MPP can be run either on the main control unit (MCU) or on 
the VA)(; in the latter case the MCU simply relays commands from the VAX to the PE 
array. The advantages of running on the VAX is a good programming environment, 
floating point arithmetic support and large memory (or virtual memory). The advantage 
of running on the MCU is more direct control of the MPP array. 
Compiler directives are used to specify if the generated code should run on the MCU 
or the VAX. With the current implementation of the code generator, only complete pr* 
cedures can be assigned to the MCU and only programs on the MCU can manipulate 
parallel arrays Therefore, the p r o g r a ~ e r  must isolate sections of code which deal with 
the PE array in procedures which are directed to the MCU. 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
MPP PERFORMANCE EVALUATION 
The peak arithmetic performance of the MPP is in the order of 400 million floating 
point operations per second &PLOPS) for 32 bit data and 3OOO million operations per 
second (MOPS) for &bit integer data. In order to sustain this performance the data 
matrices to be processed must be as large as the processor array or larger and the amount 
of time transferring data between processors should be relatively small compared to the 
time spent on arithmetic computations. For image processing the former constraint is 
rarely a problem; however, the latter constraint requires careful study. 
In order to analyze the effectiveness of the interconnection network for different 
manipulations it is necessary to characterize the processing speed of the PE and the speed 
of the interconnection network On the MPP both of these are data dependent; we have 
considered three representative cases singlebit Bodean data, %bit integer data and 32-bit 
floating-point (red) data. For each of these data types we have estimated a typical time 
for an elemental operation. These estimates are of a reasonable order for this minimal PE 
architecture but are not very precise. For example, the instruction cycle time for a 
memory access and operation on the MPP is 100 ns. An elemental boolean operation may 
be considered to take 100 ns; however, it may be argued that an operation should involve 
two operands and have all variables in memory in which case three memory accesses 
(instructions) would require 3CXlns For our analysis a two instruction 0 0  ns) model 
was used to represent Boolean instruction times For the real and integer data a con- 
venient number midway between the times for addition and multiplication was used; 
this was 5p.s for an integer operation and 40 p. for a real operation. It should be 
remembered that elemental operations also include many other functions such as tran- 
scendental functions since these can be computed in times comparable to a multiplication 
on a bit-serial architecture. By adding a large amount of additional hardware to each PE 
it is possible to increase the speed of multiplication by 10 times or more 181 
For each of the data manipulations considered, times for the three different data 
types was computed. The performance of the MPP for each manipulation is indicated by 
the ratio of the data transfer time to an elemental PE operation on the same data type; 
this will be called the tr& fer ruria One way to look at this ratio is the number of ele- 
mental data operations which must be performed between data transfers for the data 
transfers not to be the dominant cost for the algorithm. On the MPP data may be shifted 
between adjacent PES in one instruction time (100 ns) concurrently with a PE processing 
i n s t r U C t i O n .  
Shift and Rotate Operations 
The only permutation function which is directly implemented by the MPP is the 
near neighbor rotate (or shift). The direction of the rotation may be in any of the four 
cardinal directions. The rotation utilizes the toroidal end around edge connections of the 
mesh. The shift function is similar except that the mesh is not toroidally connected and 
zeroes are shifted into elements at the edge of the array; therefore, the shift function is 
not a permutation function in the strict sense. The concept of the rotate and shift func- 
tions extends to n dimensions; on the MPP the last two dimensions of the array 
correspond to the parallel hardware dimensions and are executed in parallel, higher 
dimension operations are implemented in serial. The cost of the rotate function is depen- 
dent on the distance rotated. It also depends on the size of the data elements to be per- 
muted. 
The transfer ratios for the shift operation are given in Table 1. Ratios are given for 
shift distances of 1 and 64 elements; 64 is the largest shift which will normally be 
required in a single dimension on a 128 x 128 matrix since a shift of 65 can be obtained 
with a rotate of -63 and a mask operation. The worst case figures for a two dimensional 
shift is 64 in each direction; le., twice the figures given in Table 1. 
For single element shifts the interconnection network is more than adequate for all 
data types. For maximum distance shifts the ratio of 33 for Boolean data could cause 
problems 'for some algorithms but the situation is much better for real data. 
I' 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
1 
64 
Table 1: The Cost for Shift and Rotate Operations 
Boolean integer real Boolean integer real 
a2 1.6 64 1.0 0.32 0.16 
6-5 51 210 33 10 5.2 
I distance I I 1 I 
Important Data Manipulations 
A simple algorithm to perform any arbitrary data mapping on the MPP is as fol- 
lows. Start with the address of where the data is to come from in each PE For each PE 
compute the distance that the data must be moved to reach that PE Using the spiral 
interconnections, rotate the data 16384 times. After each rotation compare the distance 
in each PE with the distance moved and if they match then store the data for that PE 
The transfer ratios for this algorithm are 82000, 10000 and 4200 for Boolean integer and 
real data types respectively. Obviously, this is much too slow for mosf practical applica- 
tions. Fortunately, for most applications only a small number of regular data mappings 
are required; efficient algorithms can be developed for most of these mappings. 
The transfer ratio for a number of important data manipulations is shown in Table 
2 The figures for large arrays correspond to the blocked data scheme For large arrays 
the transfer ratio is normalized by the cost of an operation on the whole array. 
This technique may be applied to almost any parallel architecture. It is expected 
that the results obtained for the MPP would be very similar to those obtained for other 
like architectures such as the Distributed Array Processor (DAP)  191 or NCR's GAPP pro- 
cessor chip which contains 72 PES with local memory 141. The results given in Table 2 
could be used by programmers to predict the performance of algorithms on the MPP. A 
more detailed analysis of data mappings on the MPP is given in [IO1 
For the MPP, the results indicate that, although arbitrary data mappings may be 
very costly, some important data manipulations can be done very efficiently. The shift 
register, which has a 2 times speedup factor for multi-bit arithmetic also has a significant 
effect on the implementation of several of the multi-bit data manipulations studied. 
Epecially interesting is the improvement of over 10 times for real data distribution. The 
shuffle cannot be implemented fast enough for eflicient implementation; however, 
other data mapping strategies for the FFT, such as butterfly permutations, are well 
known which have a much more ef€icient implementation on the MPP. 
On the DAP row and column distribution is implemented directly by special 
hardware buses. For the MPP we can see from Table 2 that no advantage would be 
gained from this hardware for real data operations and possibly very little advantage for 
integer operations 
The MPP can effectively implement algorithms on pyramid data structures Hor- 
izontal operations are done with near neighbor operations; Le. single element shifts. Vert- 
ical operations require data mappings similar to the shu& permutation; the times for 
these operations are given in Table 2 These times for numeric data am quite reasonable 
for many pyramid algorithms. A detailed analysis of pyramid operations on the MPP is 
Sorting has been proposed as one technique for doing arbitrary data permutations on 
the MPP; the cost of bitonic sorting on the MPP is given in Table 2 This ast is very 
high although not as high as the arbitrary data mapping algorithm. 
The last row in Table 2 shows the transfer ratio for swapping a matrix with the 
staging memory. In this case the transfer ratio indicates the number of operations which 
given in 1111 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
:2-dimensional) 
Transpose 840 110 44 840 110 44 840 110 44 
Table 2 Transfer Ratios for Different Data Manipulations and Array Sizes 
P 
Pyramid Up 
[sum reduce) 
Data Manipulation 
190 43 21 190 43 21 190 43 21 
330 45 21 110 15 785 28 4 23 
330 44 19 
19000 1100 280 
256 20 10 
110 15 64 28 39 1.7 
12000 870 230 14000 790 210 
256 20 10 256 20 10 
must be performed on each swapped matrix for there to be no signikant overhead due to 
swapping. Since the transfer ratio does not change with array size, this suggests that 
spooling with large matrices for near neighbor operations would be more ef€icient than 
spooling with 128 x 128 blocks 
APPLICATIONS 
The MPP was originally designed for processing multispectral satellite imagery and 
synthetic apertute radar imagery. Both of these applications have now been demon- 
strated on the MPP. Other uses of the MPP are now being explored. There are now 36 
active projects on the MPR these can grouped as follows: physics (lo), earth sciences (5) 
signal and image processing (71, and computer science (14). A full list of these projects is 
given in Appendix A. 
CONCLUSION 
The Massively Parallel Processor is a highly effective computer for a large range of 
scientific applications. It is representative of the clasp of highly parallel SIMD mesh con- 
nected computers Novel features of the MPP design are the staging memory and the PE 
shift register. The MPP has demonstrated its capability to implement the image process- 
ing algorithms for which it was originally designed; however, the current system lacks 
the very high speed peripheral devices needed to optimize its performance. It is also being 
used for a much broader range of scientiiic applications. The main limitation with the 
MPP when used for other applications is the limited amount of local memory (1024 
bits/PE). This should not be a problem with future systems especially since the recent 
advances in memory technology. This problem has been offset on the MPP by judicious 
use of the staging memory. 
The problem frequently cited for mesh c o ~ e ~ t e d  SIMD architectures is the 
inefficiency of the mesh interconnection scheme when used for other than near neighbor 
tasks. However, this is not a problem for many practical applications on the MPP; for 
example, FFI' and pyramid operations can be effectively implemented especially for very 
large data structures. 
I' 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
n 
I 
I 
I 
I 
I 
I 
The MPP is more cod effective for suitable applications than supercomputers of a 
similar agc Future SIMD mesh connected computers may be anticipated which will take 
advantage of recent VISI technology and will be much more powerful than the MPP. 
These systems can be expected to be much more cod effective than more conventional 
supercomputers for suitable applications such as low level image processing. These archi- 
tectures can also be effectively implemented at smaller scales: for example, as attached 
processoff to microprocessor systems. 
The cost of using a highly parallel computer is the change of programming style and 
the need to reformulate existing programs However, programming in an appropriate 
high level language is often not conceptually more dmcult than programming a conven- 
tional computer. In fact, in some respects, it is simpler since arrays are manipulated 
without the multiple do loops required in conventional serial programming languages. 
REFERENCES 
1. 
2 
3 
4 
5. 
60 
7. 
a 
9. 
100 
11. 
K. E Batcher, "Design of a Massively Parallel Processor," IEEE Transactions on 
Computers C-29(9) pp. 836-840 (September 1981). 
J. L Potter, The Massively Parallel Processor, MIT Pres (1985). 
A. P. Reeves, 'Tarallel Pascal: An Extended Pascal for Parallel Computexs," Jd 
of ParaUel and Distributed Computing 1 pp. 64-80 (19842 
NCR Corporation, Geometric Arithmetic ParaUel Processor, NCR, Dayton, Ohio 
(1984). 
S G. Morton, E Abreau, and E Tse, "IIT CAP-Toward a Personal Supercomputer," 
IEEE Micro, pp. 3749 (December 1985). 
A. P. Reeves, "Fault Tolerance in Highly Parallel Mesh Connected Processors," in 
Computing Struchrres for Image Processing, ed. M. J. R Duff, Academic Pres 
(1983). 
A. P. Reeves, 'languages for Parallel Processors," International Workshsop on Data 
AnCrLaysis in Astromniy, , Erice, 1taIfipri.l 1986). 
A. P. Reeves, "The Anatomy of VLSI Binary Array Processors," in Languages and 
Architectures for Image Processing, ed. M. J. R Duff and S Levialdi, Academic 
Press (1981). 
R W. Gostick, "Software and Algorithms for the Distributed-hray Processor," ICL 
A. P. Reeves and C H. Moura, 'Data Manipulations on the Massively Parallel Pro- 
cessor," Proceedings o f  the Nineteenth Hmaii International Con f m e  on System 
A. P. Reeves, "Pyramid Algorithms on Processor Arrays," Proceedings of the lVAT0 
Advanced Research Workshop on Pyramidal Systems for Image Processing and 
Computer Vision,, Maratea, Italy(May 19861 
Techni~d Jountal, pp. 116-135 (May 1979). 
Sciences, pp. 222-229 (January, 1986). 
APPENDIX A. Current Research Using the MPP 
Dr. John A. Barnden 
Indiana University 
Diagramtic Information-Processing in Neural 
Arrays 
Dr. Richard S. Bucy 
Univ. of Southern California dulation 
Fixed Point Optimal Nonlinear Phase Demo- 
Dr. Gregory R Carmichael 
University of Iowa 
Tropospheric Trace Gas Modeling on the MPP 
I 
I 
I 
I 
1 
1 
1 
I 
I 
1 
1 
I 
I 
I 
1 
I 
I 
I 
I 
Dr. Tara Prasad Das 
State University of New York at Al- 
b Y  
Investigations on Electronic Structures and As- 
sociated Hyperfine Systems Using the MPP 
Dr. Edward w. Davis 
North Carolina State University 
Graphic Applications of the MPP 
Dr. Howard B. Demuth 
University of Idaho 
Sorting and Signal Processing Algorithms A 
Comparison of Parallel Architectures 
Dr. James A. Earl 
University of Maryland 
Numerical Calculations of Charges Particle 
Transport 
Mr. Eugene W. Greenstadt 
TRW 
Space Plasma Graphics Animation 
Dr. Chester E Grosch 
NASA-Langley Research Center 
Adapting a Navier-Stokes Code to the MPP 
Dr. Robert J. Gurney 
Goddard Space Flight Center 
A Physically-Based Numerical Hillslope Hy- 
drological Model with Remote Sensing Cali- 
bration 
Dr. Martin Hagan 
University of Tulsa 
Sorting and Signal Processing Algorithms A 
Comparison of Parallel Architectures 
Dr. Harold M. Hastings 
Hofstra University 
Applications of Stochastic and Reaction - 
Diffusion Cellular Automata 
Dr. Sara Ridgway Heap 
W a r d  Space Flight Center 
Automatic Detection and Classification of 
Galaxies on "DeepSky" Pictures 
Dr. Nathan Ida 
The University of Akron 
Solution of Complex; Linear Systems of Equa- 
tions 
Dr. Robert V. Kenyon 
MIT Man Vehicle Laboratory 
Application of Parallel Computers to Biomedi- 
cal Image Analysis 
Dr. Daniel A. Klinglesmith, 111 
Goddard Space Flight Center 
Comet Haley Large-Scale Image Analysis 
Dr. Daniel A. Klinglesmith, 111 
W a r d  Space Flight Center * 
FORTH, an Interactive Language for Control- 
ling the MPP 
Dr.chinSLiIl 
Southwest Research Institute 
Simulation of Beam Plasma Interactions Util- 
izing the MPP 
Dr. Stephen A. Mango 
Naval Research Laboratory 
Synthetic Aperture Radar Processor System 
Improvements 
Dr. Michael A. McAnulty 
University of Alabama 
Algorithmic Commonalities in The Parallel 
Environment 
Dr. A.P. Mulhaupt 
University of New Mexico 
Kalman Filtering and Boolean Delay Equa- 
tions on an MPP 
Dr. John T. O'Donnell 
Indiana University 
Mr.Martinotga 
USDA-Statistical Reporting Service 
6RS) 
Dr. Hx. Ramapriyan 
W a r d  Space Flight Center 
Dr. John Reif 
Harvard University 
Computer Science 
Dr. L.R. Owen Storey 
Stanford University 
Dr. James P. Strong 
W a r d  Space Flight Center 
., 
Dr. Francis Sullivan 
National Bureau of Standards 
Dr. Peter Suranyi 
University of Cincinnati 
Dr. James C Tilton 
Goddard Space Flight Center 
Dr. William Tobocman 
Case Western Reserve University 
Mr. Lloyd A. Treinish 
Goddard Space Flight Center 
Dr. Scott Von h v e n  
KMS Fusion, Inc. 
Dr. Elden C Whipple, Jr. 
ucsD/cAss/c-a)1 
Dr. Richard L White 
Space Telescope Science Institute 
Dr.LoLYiIl 
Goddard Space Flight Center 
Simulating an Applicative Programming 
Storage Architecture Using the NASA MPP 
A Comparison of the MPP with Other Super- 
computers for Landsat Data Processing 
Development of Automatic Techniques for 
Detection of Geological Fracture Patterns 
Parallel Solution of Very Large Sparse Linear 
Systems 
Particle Simulation of Plasmas on the MPP 
Development of Improved Techniques for 
Generating Topographic Maps from Spacecraft 
Imagery 
Phase Separation by king Spin Simulations 
A Study of Phase Transitions in Lattice Field 
Theories on the MPP 
Use of Spatial Information for Accurate Infor- 
mation Extraction 
Wave Scattering by Arbitrarily Shaped Tar- 
gets Direct and Inverse 
Animated Computer Graphics Models of Space 
and Earth Sciences Data Generated via the 
MPP 
Free-Electron Laser Simulations on the MPP 
A Magnetospheric Interactive Model Incor- 
porating Current Sheets (MIMICS) 
The Dynamics of Collisionless Stellar Systems 
Reconstruction of Coded-Aperature X-ray Im- 
age 
Appendix E I Proceedings of the Nato Advanced Research Workshop on Pyramidal Systems f o r  Image 
Processing and Computer Vision Maratea, I t a l y ,  May 1986 
PYRAMID fiG0-S ON PROCESSOR ARRAYS 
Anthony P. Remes 
School of Electrical Engineering 
Cornell University 
Ithaca, New York 14853 
A class of adaptive grid size algorithms, called pyramid algorithms, have received much 
attention in recent years for computer vision applications Special hardware architectures 
which are optimal for grid resolution transformations have been proposed to implement these 
algorithms. In this chapter analysis techniques are described which measure the efFectiveness 
of different architectures. A cornpatison is made between the more conventional planar or 
fiat architecture and the optimal pyramid architecture. It is shown that in many uses *he 
additional hardware of the pyramid scheme offers little improvement in performance. 
INTRODUCTION 
The general concept of pyramid algorithms is to decompose a large matrix into a 
number of lower resolution matrices €or more convenient processing. The most typical 
pyramid structure c~nsista of a square base matrix with dimensions N x N (where N is a 
power of 2) and logz N reduced matrices (layers) each layer having dimensions half that of 
the layer below until a single element apex (layer 0) is reached. A number of algorithms for 
image processing have been based on this data structwe. Cther applications such as multigrid 
techniques for PDE's can also be considered in the pyramid framework 
In order to achieve the very high data processing rates needed €or image processing 
applications a aumber of highly parallel computer architectures have been desised and 
implemented; several of such systems are now comniercially available. A very effective 
parallel computer architecture for conventional image processing applications is the mesh 
connected procesor array. For processing pyramid data structures a number of researchers 
have proposed hardware modificatiom to the two d imens id  processor array. These 
modifications directly match the pyramid data structure by having interconnected layers of 
processor mays attached to the base proceswr array. In this paper techniques are developed 
to investigate the effidency with which pyramid algorithms can be implemented on processor 
arrays and the effectiveness of enhanced hardware schemes is considered. 
A major Consideration in the design of a highly parallel computer architecture is the 
procesror interconnection network Processors must communicate with a speed which does 
not impede data processing; however, a general processor interconnection network is usually 
prohibitively expensive when a large aumber of processors are involved. A major design 
task is to design a restricted network which is adequate for the anticipated tasks for the sys- 
tem. The mesh interconnection scheme is simple to implement and is well suited to a large 
number of image processing algorithms. The first section of this paper describes the Mas- 
sively Parallel Processor (MPP) 111 and the performance of the processor interconnection net- 
work is considered in detaiL 
N 
N/ logzN 
The second section of this paper deals with the implementation of pyramid algorithms 
on the MPP and also considers the effect of adding a pyramid hardware structure to the 
MPP. A two dimensional processor array will be called a frcrt processor and a processor array 
with additional hardware for pyramid data structures will be called a pyramid processor. 
The general techniques used to analyze these architectures may be used to determine the 
effectiveness of other highly parallel processor systems for pyramid applications. 
The hardware enhancement of a pyramid of processing arrays approaches - 4 times the 
3 
number of procgsing elements @Fs) required for the base array, as the number of layers 
increases. Therefore, the maximum possible increase in arithmetic capability is However, 
the major advantage claimed for the pyramid processor is the speedup in interprocessor data 
routing. In order to compute a global function over an N x N processor array N processing 
steps are required However, on a pyramid processor global idormation may be extractad at 
the apex of the pyramid after log2 N steps The potential gain for the pyramid system is 
therefore N/"; this value is shown in Table 1. for different values of N. The MPP has a 
128 x128 base array therefore the potential speedup is in the order of 18. 
The pyramid computer has an improved performance for two types of operations (a) 
multiresolution algorithms which involve frequent changes between matrix resolutions and 
(b) global feature extraction operations. However, the additional pyramid hardware must be 
justiiied by the multiresolution operations alone since there are more cost effective hardware 
enhancements for global feature extraction 121. 
3' 
4 a  16 32 64 128 256 512 1024 
2 233 4 64  10.66 1829 32 56.8 1024 
Table 1: Data routing advantage of the Pyramid Architecture 
The cost of the additional hardware for the pyramid processor has been considered by 
4 
3 some to be simply - times the cost of the &tt processor since this is the increase in the 
number of PES, The cost will, in general, be much higher than this since each PE now 
requires more interconnections (an increase from 4 to 9 in our example) and interprocessor 
C O M ~ C ~ ~ O X S  are no longer local which could be particularly expensive for multiprocessor chip 
designs. 
FLAT PROCESSOR ARRAYS 
The Massively Parallel Processor consists of 16384 bit-serial Processing Elements (PE's) 
connected in 128 x 128 mesh 111 That is each PE is connected to its 4 adjacent neighbors in a 
planar matrix. The two dimensional grid is one of the simplest interconnection topologies to 
implement, since the PES themselves are set out in a planar grid fashion and all interconnec- 
tions are between adjacent components Furthermore, this topology is ideal for two dimen- 
sional atering operations which are common to low level image processing such as small 
window convolution. 
The PES are bit-serial, Le. the data paths are all one bit wide. This organization offers 
the maximum flexibility, at the expense of the highest degree of parallelism, with the 
minimum number of control lines. For example, as an alternative to the MPP consider 2048 
&bit wide PES (on the MPP one chip contains 8 1-bit PES). The %bit version would have a 
less rich set of instructions restricted to predefined byte operations while the bit-serial proces- 
sors can process any data format The advantage gained with the %bit system is that full 
processor utilization is achieved with arrays of 2048 elements while arrays of 16384 ele- 
ments are required for full utilization of the MPP. The MPP PE is well matched low level 
image processing tasks which often involve very large data arrays of short integers which 
may be from 1 to 16 bits. 
The effectiveness of the MPP architecture for various interprocessor data manipulations 
is considered. The M p p  offers a simple basic model for analysis since it involves just mesh 
interconnections and bit-serial PEk The minimal architecture of the MPP is of particular 
interest to study, since any architecture modifications to improve performance would result 
in a more complex PE or a more dense interconnection strategy. The MPP is programmed in 
a high level language called Parallel Pascal b]. 
The MPP Processing Element 
The MPP processing element is shown in Fig. 1. All data paths are one bit wide and 
there are 8 PES on a single CMOS chip with the local memory on external memory chips. 
Except for the shift register, the design is essentially a minimal architecture of this type. 
The single bit full adder is used for arithmetic operations and the Boolean processor, which 
implements all 16 possible two input logical functions, is used for all other operations. The 
NN select unit is the interface to the interprocessor network and is used to select a value 
from one of the four adjacent PES in the mesh. 
The S register is used for VO. A bit plane is slid into the S registers independent of the 
PE processing operation and it is then loaded into the local memory by cycle stealing one 
cycle. The G register is used in masked operations. When masking is enabled only PE's in 
which the G register is set perform any operatioxxq the remainder are idle. The masked opera- 
tion is a very common control feature in S A D  designs. Not shown in Fig. 1. is an OR bus 
I 128- 
Prog ra m 
1 
Control I 
Unit 
I &  
I/ 
To NN 
PES 
W C - A
- Boolean N-bit shift register 
processor . 
d 7 + 
Figure 1. The MPP Processing Element 
28 
output from the PE All these outputs are connected (ORed) together so that the control unit 
can determine if any bits are set in a bitplane in a single instruction. On the MPP the local 
memory has 1024 words (bits) and is implemented with bipolar chips which have a 35 ns 
access time. 
The main novel feature of the MPP PE architecture is the reconfigurable shift register. 
It may be configured under program control to have a length from 2 to 30 bits Improved 
performance is achieved by keeping operands circulating in the shift register which greatly 
reduces the number of local memory accesses and instructions. It speeds up integer multipli- 
cation by a factor of two and also has an important effect on floating-point performance. 
Flat Array Performance Evaluation 
In order to a n a l p  the effectiveness of the interconnection network for different mani- 
pulations it is necessary to characterize the processing speed of the PE and the speed of the 
interconnection network. On the MPP both of these are data dependent; we have considered 
three representative cases singlebit Bodean data, &bit integer data and 32-bit floating-point 
(real) data. For each of these data types we have estimated a typicat time for an elemental 
operation. These estimates are of a rearanable order for this minimal PE architecture but arc 
not very precise. For example, the instruction cycle time for a memory access and operation 
on the MPP is 100 ns. An elemental boolean operation may be considered to take 100 ns; 
however, it may be argued that an operation should involve two operands and have all vari- 
ables in memory in which case three memory accesses (instructions) would require 300x1~ 
For our analysis a two instruction (200 ns) model was used to represent Boolean instruction 
times. For the real and integer data a convenient number midway between the times for 
addition and multiplication was used; this was 5 ps for an integer operation and 40 1~s. for a 
real operation. It should be remembered that elemental operations also include many other 
functions such as transcendental fundions since these can be computed in times comparable 
to a multiplication on a bit-serial architecture. By adding a large amount of additional 
hardware to each PE it is v i b l e  to increase the speed of multiplication by 10 times or more 
[41. 
For each of the data manipulations considered, times for the three different data types 
was computed. The performance of the MPP for each manipulation is indicated by the ratio 
of the data transfer time to an elemental PE operation on the same data type: this will be 
called the transfer ratio. One way to look at this ratio is the number of elemental data 
operations which must be performed between data transfers for the data transfers not to be 
the dominant cost for the algorithm. On the MPP data may be shifted between adjacent PES 
in one instruction time (100 IS.) concurrently with a PE processing instruction. 
I 
I 
I 
1 
1 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
1 
shift 
distance 
For many applications the physical dimensions of the parallel hardware are smaller 
than the dimensions of the array to be processed. In this case the data array is processed as a 
set of blocks An extension of the data manipulation algorithms to deal with this situation is 
discussed. 
Operation cost in ps. Transfer Ratio 
Shift and Rotate Operations 
The only permutation function which is directly implemented by the MPP is the near 
neighbor rotate (or shift). The direction of the rotation may be in any of the four cardinal 
directions The rotation utilizes the toroidal end around edge connections of the mesh. The 
shift function is similar except that the mesh is not toroidally connected and zeroes are 
shifted into elements at the edge of the array; therefore, the shift function is not a permuta- 
tion function in the strict sense. The concept of the rotate and shift functions extends to n 
dimensions; on the MPP the last two dimensions of the array correspond to the parallel 
hardware dimensions and are executed in parallel, higher dimension operations are imple- 
mented in serial. The cost of the rotate function is dependent on the distance rotated. It also 
depends on the s h  of the data elements to be permuted. 
The transfer ratios for the shift operation are given in Table 2 Ratia are given for 
shift distances of 1 and 64 elements; 64 is the largest shift which will normally be required 
in a single dimension on a 128 x 128 matrix since a shift of 65 can be obtained with a rotate 
of -63 and a mask operation. The worst case figures for a two dimensional shift is 64 in each 
directiorc Le, twice the figures given in Table 2 
For single element shifts the interconnection network is more than adequate for all data 
tw For maximum distance shifts the ratio of 33 for Boolean data could cause problems for 
some algorithms but the situation is much better for real data. 
1 
64 
Large Array Processing 
Frequently the data to be processed by a p a l l e l  processor will be in the format of 
arrays which exceed the k e d  range of parallelism of the hardware. Therefore, it is 
Boolean integer real Boolelill integer real 
a2 1.6 64 1.0 032 ai6 
6.5 51 210 33 10 5.2 
Table 2 The Cost for Shift and Rotate Operations 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
necessary to have special algorithms that will deal 
into blocks manageable by the hardware, without 
different blocks 
with large arrays by breaking them down 
loosing track of the relationships between 
There are two main schemes for storing large arrays on processor arrays the b&xked 
scheme and the crinkled scheme. Consider that a M x M array is distributed on an N x N 
procesror array where K - M / N is an integer. In the blocked scheme a large array is con- 
sidered as a set of K x K blocks of size N x N each of which is distributed on the processor. 
Therefore, elements which ar allocated to a single PE are some multiple of N apart on the 
large array. In the crinkled scheme each PE contains a K x K matrix of adjacent elements of 
the large array. Therefore, each parallel processor array contains a sampled version of the 
large array. For conventional array operations which involve large array shift and rotate 
operations both blocked and crinkled schemes can be implemented with only a very small 
amount of overhead. The crinkled scheme is slightly more efticient when shift distances are 
very small and the blocked scheme has a slight advantage when the shift distance is of the 
order of N. 
The Performance of the MPP 
The transfer ratio for a number of important data manipulations is shown in Table 3. 
The figures for large arrays comespond to the blocked data scheme. For large arrays the 
transfer ratio is normalized by the cost of an operation on the whole array; Le, K x R array 
operations 
This technique may be applied to almost any parallel architecture. It is expected that 
the results obtained for the MPP would be very similar to those obtained for other like 
architectures such as the Distributed Array Processor DAP) [51 or NCR's GAPP processor 
chip which contains 72 PES with local memory [41 The results given in Table 3 could be 
used by programmers to predict the performance of algorithms on the MPP. 
For the MPP, the results indicate that, although arbitrary data mappings may be very 
custly, some important data manipulations can be done very efticiently. The shift register, 
which has a 2 times speedup factor for multi-bit arithmetic also has a significant effect on 
the implementation of several of the multi-bit data manipulations studied. Especially 
interesting is the improvement of over 10 times for real data distribution. The shuffle cannot 
be implemented fast enough for dcient  FFI' implementation; however, other data mapping 
strategies for the FFI: such as butterfly permutations, are well known which have a much 
more scient implementation on the MPP. A more detailed analysis of data mappings on the 
MPP is given in 171 
On the DAP row and column distribution is implemented directly by special hardware 
buses. For the MPP we can see from Table 3 that no advantage would be gained from this 
~ 
i 
I 
I 
I 
I 
I 
I 
I 
I 
I 
i 
I 
I 
I 
I 
I 
I 
I 
I 
Data Manipulation 
Data Shift 
a) 1 element 
b) worst case 
Broadcast 
c) Row (or column) 
ShuBle 
(2-dimensional) 
Transpose 
a> Global 
hardware for real data operations and possibly very little advantage for integer operations. 
128 x 128 
Boolean integer real 
1.0 0.32 0.16 
33 10.2 5.2 
2 0.64 a32 
68 3 2  0.52 
640 90 42 
840 105 44 
PYRAMID PROCESSING 
256 x 256 
Boolean integer real 
A Pyramid Architecture 
A hypothetical architecture of a pyramid machine based on the MPP is illustrated in 
Fig. 2. The PE is similar to the MPP PE with the addition of five additional interprocessor 
connections. Four of these are connected to the four children at the next lower level and the 
Mth is connected to the parent PE in the layer above. 
512 x 512 
Boolean integer real 
Pyramid Primitive Operations 
We consider processing operations on pyramid data structures to be one of three types 
elemental, in which no communication between PES is involved, horizantal, in which adja- 
cent elements in the same pyramid level are combined, and vertical, in which elements at 
Merent adjacent levels are combined. Furthermore, we consider that thae operations may 
be applied to the total pyramid or to a subset of levels. Other relevant pyramid operations 
include feature extraction and data broadcast. 
Pyramid Operations on a Flat Array 
An efEcient scheme for mapping a pyramid data structure to a N x N processor array is 
as follows. All levels of the pyramid which have dimensions less than N x N are mapped 
into a single plane. The level with size N x N is mapped into a second plane and levels 
lower than this are mapped in such a way that each PE is allocated to a subpyramid with 
Table 3 Transfer Ratios for Different Data Manipulations and Array Sizes 
35 
640 90 42 640 90 42 
840 105 44 1840 105 44 1 
..-.. 
1 
... .. . 
I 
: 
I 
c1 c2 c3 c4 
PR 
To NN 
/ PES 
N 
S m 
L A 
~ 
Full - Boolean - N-bit shift register adder - 
T 
processor 
t 
Local memory 
Figure 2. The Pyramid Processing Element 
adjacent elements. This is achieved by storing each level with the crinkled storage format. 
Therefore, for a k-level pyramid the top n - 1 levels h = logl2 N) are stored in one plane, 
and the remaining k - n levels are stored in 2a = -(2 *-$-I) planes. h t a  transitions k-n-1 
i I 0  3 
between the top n - 1 levels is achieved with perfect shum permutations. The perfect 
shuffle for an 8 x 8 matrix is shown in Fig. 3. Each quadrant of the matrix is distributed 
over the whole matrix A 4 x 4 pyramid structure embedded in an 8 x 8 matrix is shown in 
Fig. 4 With this organization, a transition to a lower level can be achieved by a shuf€ie and 
a transition to a higher level can be achieved with an inverse shuflie for all levels simultane- 
ously. This scheme make good use of memory and is optimal for horizontal operations but is 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
(a) 
I33 137134138 
(b) 
Figure 3. The Two Dimensional Perfect ShufEe. 
The shuffled version of matrix (a) is shown in (b). 
relatively slow for vertical operations. 
A Comparative Analysis 
To evaluate the cost of macro operations a mrmczlized cost is used which is similar in 
concept to the transfer ratia It is obtained by dividing the actual cost by the cost of an ele- 
mental operation on the data structure being processed by the macro operation. 
Elemental operations 
The actual time capt in term of the number of processor array arithmetic operations is 
shown in Fig. 5 for an elemental operation applied to the whole pyramid data structure and 
for an architecture with a 128 x 128 base size. The cost for a fiat array is shown by the solid 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Figure 4. A 4 x 4 base pyramid embedded in an 8 x 8 mesh 
30 
25 
c 
20 
0 
I O  
5 
Flat architecture 
-0- Pyramid architecture 
- 0 
0 
I 
I '" I 
1 
I 
I 
I 
I 
I 
I 
I ! I ; /E' 
I /  ." 
/ /  -.-.-.----.-.-.--O I I I 
I 2 4 8 16 32 64 128256 512 1024 
0 
Pyramid Base Width 
Figure 5. The Actual Cost for Elemental Pyramid Operations 
line. While the base of the pyramid to be processed is less than 128 x 128 then the whole 
pyramid is embedded in a single array and only one operation is required. For a 128 x 128 
base pyramid 2 operations are required and a blocked processing strategy is used for larger 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
pyramids. The cost for the pyramid architecture is shown by the broken line In this case 
the 128 x 128 pyramid is exactly matched to the hardware and only requires one operation. 
Larger pyramids must be blocked using the pyramid base leveL 
The normalized cost of a elemental operation on the whole pyramid data structure is 
shown in Fig. 6. Since we normalize by the cost of an operation on the whole pyramid, in 
this case the m l t  is a constant 1 for all pyramid sizes For comparison purposes all normali- 
zations are done with respect to an operation on a flat array. The broken line in Fig. 6 shows 
the normalized cost for an elemental operation for a true pyramid architecture with a 128 x 
128 base. When the base of the pyramid to be processed is less than 128 then the cost is the 
same for both architectures When the base is 128 x 128 then a 50% reduction in normalized 
cost for the pyramid architecture is observed, For pyramids with bases larger than 128 x 
128 both architectures must process the data in blocks and the advantage of the pyramid 
architecture rapidly diminishes. It is important to note that the normalized cost for this 
graph and for all following graphs is plotted on a logarithmic scale 
- - _ _  
Large Pyramid Processing 
When the pyramid structure ro be p r o c d  has a base which is larger than the proces- 
sor array then a blocking scheme must be used for the lower large levels of the pyramid. 
An effective scheme is to use the crinkled storage strategy for each large leveL In this way 
each PE contains a complete sub pyramid; therefore, level changes for the large levels are 
done locally within PES and no interprocessor communications are necessary. The crinkled 
- Fiat architecture --- Pyramid architecture 
c l o t  cn 
0 
0 
W 
I I 
I 
I 
I 
I 
I 
0.1 I I I I I I 1 I I I 
I 2 4 8 16 32 64 128256 512 1024 
Pyramid Base Width 
Figure 6. The Normalized Cost for Elemental Pyramid Operations 
storage scheme is also very efficient for horizontal operations. I 
Convolve 
5 x 5  
Symmetric 
Separable 
Horizontal Operations 
Adjacent element information is readily available for both the architectures considered. 
Therefore, the normalized cost for macro operations involving horizontal near neighbor infor- 
mation will be similar to Fig. 6 except that the graph will be translated upwards by the 
time for the macro operation to be implemented in terms of elemental operations. The cost 
of some important horizontal operations is shown in Fig. 7. For reference, the cost of an add 
operation is first shown; it is less than one since an add requires less time than a multiply for 
integer and real data types. A 3 x 3 mean operation is achieved with four near neighbor 
additions, while a 3 x 3 convolution requires 9 multiplications and 8 near neighbor additions. 
A symmetric 5 x 5 convolution, such as that used by Burt for Pyramid building [81 requires 
6 multiplications and 8 near neighbor additions. An example of an expensive macro opera- 
tion is the 10 by 10 convolution which requires 100 multiplications and 99 near neighbor 
I 
I 
I 
1 
I 
I 
, additions. 
I 
Convoivc 
io x IO 
0 
1000 
I O 0  
c 
v) 
0 
0 
0 
N IO 
6 
.-  
E 
z 
I 
0.1 
A Boolean 
0 Real 
- - 0 Integer 
c 
Add 
- 
0 a 
Mean 
3 x 3  
---- 
A- 
0 0 
0 0 
Convolve 
3 x3 
-- - - 
3 a  
0 A 
3 4  
---1--- 
Figure 7. The Normalized Cost of some Horizontal Operations 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
E 
1 
I 
I 
I 
I 
1 
Vertical Pyramid Operations 
It is in the vertical operations that the pyramid architecture is expected to excel in com- 
parison to a flat architecture. An important operation in many pyramid algorithms is to 
build a pyramid from an image located at the lowest level of the pyramid. We consider a 
very simple building operation in which each higher level is constructed from the mean of 
the four sons of the adjacent lower level. 
Pyramid Building 
The cost of pyramid building is shown for a flat processor in Fig. 8 and for a pyramid 
processor in Fig. 9. The building cost is normalized by an elemental operation applied to the 
whole pyramid data structure. We see that for the worst case base width of 128 the flat 
processor is several times slower that the pyramid however this advantage rapidly diminishes 
if large pyramids are processed. The Boolean case is particularly bad for the flat array 
IOOC 
I oc 
c 
v) 
0 
0 
U 
Q) 
N 10 
0 
E 
0 z 
.-  
L 
I 
0. I 
A Boolean 
0 Integer 
0 Real 
.---- 
I 0 
I 
1 I I I I I I 1 I 
I 2 4 8 16 32 64 128 256 512 IO24 
Pyramid Base Width 
Figure 8. The Normalized Cost for Simple Pyramid Building on a Flat Processor Array 
i 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1 
IOOC 
IO0 
c 
v) 
0 
0 
U a 
N 10 
0 
€ 
0 z 
.-  
L 
I 
0. I 
A Boolean I 
0 Integer I 
o Real I 
I 
I 
I 
.-------- --I----- 
I ‘0 
I \O 
I 2 4 8 16 32 64 128 256 512 1024 
Pyramid Base Width 
Figure 9. The Normalized cost for Simple Pyramid Building on a Pyramid Processor Array 
because the overhead to do the pyramid data manipulations cannot be amortized over the 
number of bits in the data as it can for the numeric data formats. The flat array although 
slower than the pyramid may not be significantly slower in a pyramid building algorithm. 
For example consider the overlapped algorithm of Burt [8]. At each level a 5 x 5 convolution 
must be computed before proceeding to the next. The cost of computing these convolutions is 
comparable to the data building cost, 
Another way to look at the vertical operation cost is to consider a single level reduction; 
this is shown for both &it and pyramid processors in Fig. 10. In this case the cost is normal- 
ized by an elemental operation on the level at which the reduced data is generated. The 
architectures have the same performance when the base is smaller than the level being prw 
cessed. The worst case for the flat processor OCCUIS for the level which is half the base size 
since this requires the most interprocessor data transfers If many operations fall on this 
level transition then one technique is to consider the processor array to have a size of 64 x 64. 
The arithmetic processing speed is reduced by four (since only a quarter of the processors are 
I 
I 
I 
I 
I 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
‘ 0 4 4  - Fiat architecture I 
--- Pyramid architecture I 
1000 
100 
c 
in 
0 
0 
U 
IO .-  
0 
0 z 
L. E d  
I 
I 
0. I 
A Boolean I 
I I I I I I I I I 1 
I 2 4 8 16 32 64 128 256 512 1024 
Width of Level 
Figure 10. Tie Normalized Cost for Single Layer Reduction 
u s e d G t  the data transition is managed at least ten times faster. 
Data Reductia 
In Fig. 11 and Fig. 12 the cost for computing the global sum of the elements in the base 
of the pyramid is g ivn  In this case the cost is normalized by an elemental operation over 
the whole pyramid data structure. The advantage of the pyramid architecture is not as great 
as in the pyramid building example because the flat array is no longer forced to emulate the 
pyramid in the highest levels since only the final sum is required, 
CONCLUSIONS 
A general technique for determining the cost of pyramid algorithms has been presented. 
This technique may be used on any fiat array architecture to evaluate its performance for 
pyramid algorithms and to evaluate the benefit of a pyramid processor hardware enhance- 
ment. Once the graphs similar to those given in Figs. 6-12 have been obtained then the mix 
i 
I 
1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
1000 
100 
c 
v) 
8 
U 
10 -  
0 
E 
0' 
7 
I 
0. I 
A Booiean I 
0 Integer I 
I 
I 
I 
I 
0 Real 
.--------------- 
.- 
.- 
I I I I I I I I I I 
I 2 4 8 16 32 64 128 256 512 1024 
Pyramid Base Width 
Figure 11. The Normalized Cost for Sum Reduction on a Flat Processor Array 
of vertical to other pyramid operations for the desired application must also be determined 
for an accurate evaluation. 
1. 
2 
3. 
4 
From the results given in this paper the following observations can be made. 
Pyramid algorithms can be effectively implemented on a flat array in many cases; each 
application should be examined individually in detail. 
If most of the operations are performed in the base of the pyramid then the pyramid 
hardware will offer little advantage 
The fiat array is substantially worse for Boolean data types than for numeric data types 
Therefore, a preponderance of Boolean data structures favors the pyramid approach 
However, if numeric data types dominate then the case for pyramid hardware is much 
weaker. 
The p e i d  hardware can only be justifred by the need for a large number of pyramid 
level changes in a algorithm. It m o t  be justised by the global data reduction func- 
tion alone since other techniques can be used on the flat array. 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
5. 
6. 
IOOC 
I oc 
c 
in s 
z 
0 
10 
.-  
6 z 
I 
0. I 
I 
I 
I 
A Boolean 
o Integer 
0 Real 
a 
I 
I I I I I I I I I 
I 2 4 8 16 32 64 128 256 512 1024 
Pyramid Base Width 
Figure 12. The Normalized Cost for Sum Reduction on a Pyramid Processor Array 
The ratio of the hardware base size and the pyramid base size will have a major effect 
on the performance. For numeric data processing one way to inmase the effectiveness 
of the flat array is to use a smaller number of multi-bit procesos therefore the 
hardware base size is reduced without loss of processing power. 
When the pyramid to be processed is much larger than the hardware processor array 
then there is very little advantage in having the pyramid hardware. This is the case 
when the MPP is used to process pyramids with a base size of 512 x 512 
In summary while the analysis is complex and very application dependent there are a 
number of situations which may be identified for which justikation for the pyramid archi- 
tecture on be determined. The pyramid hardware is hard to justify if either (a) large 
pyramid processing techniques are used, (b) the operations to be performed are numeric, or 
(c> vertical operations do not constitute a large part of the pyramid algorithms to be imple- 
mented. The strongest case for the pyramid hardware exists when Boolean data dominates 
., . 
I 
I 
I 
I 
I 
I 
I 
I 
1 
i 
I 
I 
I 
I 
I 
I 
I 
I 
the processing and the 
pyramid data structure. 
REFERENCES 
number of PES exactly matches the number of elements in the 
1. 
2 
3 
4 
5. 
6 
7. 
a 
K. E Batcher, “Design of a Massively Parallel Processor,” IEEE Transactions on Com- 
puters C-29(9) p p  836-840 (September 1981). 
A. P. Reeves, “On Ef€icient Global Information Extraction Methods For Parallel Proces- 
sors,” Computer Graphics and Image Processing 14 pp. 159-169 (1980). 
A. P. Reeves, ‘Tarallel Pascal: An Extended Pascal for Parallel Computers,” Jd o f  
Paral&l and Distrihted Computing 1 pp. 64-80 (1984). 
A. P. Reeves, “The Anatomy of VLSI Binary Array Processors,” in Lcarguages mzd 
Architectures for Image Processing, ed. M. J. B. DUE and S Levialdi, Academic Press 
(1981). 
R W. Gostick, “Software and Algorithms for the Distributed-Array Processor,” ICL 
Technical J d  pp. 116-135 (May 1979). 
NCR Corporation, Geometric Arithmetic Parallel Processor, NCR, Dayton, Ohio (1984). 
A. P. Reeves and C H. Moura, “Data Manipulations on the Massively Parallel Proces- 
sor,’’ Proceedings o f  the Nineteenth Hawaii Internationul Conference on System Sci- 
e~ces, pp. 222-229 (January, 1986). 
P. J. Burt, ‘Fast Filter Transforms for Image Processing,” Computer Graphics and 
Image Processing 16 pp. 20-51 (1981). 
