Deriving algorithms on reconfigurable networks based on function decomposition  by Chen, Gen-Huey et al.
Theoretical Computer Science 120 (1993) 215-227 
Elsevier 
Deriving algorithms on 
reconfigurable networks 
based on function decomposition 
Gen-Huey Chen and Biing-Feng Wang 
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 
Taiwan 
Hungwen Li 
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, USA 
Communicated by M.S. Paterson 
Received November 1990 
Revised June 1992 
Abstract 
Chen, G.-H., B.-F. Wang and H. Li, Deriving algorithms on reconfigurable networks based on 
function decomposition, Theoretical Computer Science 120 (1993) 215-227. 
In this paper, a new approach, which is based on function decomposition, is proposed for deriving 
algorithms on processor arrays with reconfigurable bus systems. The effectiveness of this approach is 
shown through some important applications. They include computing the logical exclusive-OR of 
n bits, summing n bits, summing n m-bit binary integers, and multiplying two n-bit binary integers. 
All these applications are solved in O(1) time. 
1. Introduction 
Many networks of processors have been supplemented with buses to decrease their 
diameters in order to enhance the system performance [1, 3, 11, 15]. Buses give
networks of processors greater communication capabilities and allow broadcasting 
and long-distance communication to be completed in negligible time. Recently, some 
parallel computers have been further equipped with reconfigurable bus systems to 
solve problems more efficiently [8, 9, 12, 17, 21]. A reconfigurable bus system is a bus
system whose configurations can be dynamically changed. Bus automata [17],
polymorphic-torus networks [9], reconfigurable meshes [12] and mesh-connected array
processors with bypass capability [8] are four examples.
Correspondence to: Gen-Huey Chen, Department of Computer Science and Information Engineering,
National Taiwan University, Taipei, Taiwan. Fax: (886)-(2)-3628167.
0304-3975/93/$06.00 © 1993 Elsevier Science Publishers B.V. All rights reserved
A bus automaton can be viewed as a cellular automaton with a locally switchable 
global communication network. By adjusting the local switches properly, straight, 
zig-zag and staircase sub-buses can be formed. Efficient algorithms on bus automata 
have been proposed for many applications such as pattern recognition [16], language 
parsing [13] and string comparison [4]. A polymorphic-torus network consists of 
a rectangular array of processors that are connected to a grid-shaped physical bus 
network. There is a wrapped-around connection on each row and each column in the 
polymorphic-torus network. The configurations of the physical bus network are 
dynamically changeable by adjusting a programmable internal network within each 
processor. Efficient embeddings of trees, rings, meshes, pyramids and hypercubes can 
be realized by properly establishing the programmable local switches [9]. Like the 
polymorphic-torus network, a reconfigurable mesh also consists of a rectangular array 
of processors that are connected to a grid-shaped reconfigurable bus system. Within 
each processor, four locally controllable bus switches are built to adjust the configura- 
tions of the bus system. Some graph, image and geometry problems [12] have been 
efficiently solved on reconfigurable meshes. 
Conceptually, the polymorphic-torus network and the reconfigurable mesh are 
functionally equivalent and belong to the family of two-dimensional processor arrays 
with reconfigurable bus systems (abbreviated to 2-d PARBSs). 
In this paper, a new approach, which is based on function decomposition, is proposed
for deriving algorithms on PARBSs. Using this approach, many important problems can 
be solved efficiently. They include computing the logical exclusive-OR of n bits, summing 
n bits, summing n m-bit binary integers and multiplying two n-bit binary integers. 
2. Processor arrays with reconfigurable bus systems
A 2-d $N_1 \times N_2$ PARBS consists of an $N_1 \times N_2$ array of processors, which are
connected to a grid-shaped reconfigurable bus system. Each processor is identified by
a unique index $(i, j)$, $0 \le i \le N_1 - 1$, $0 \le j \le N_2 - 1$. The processor with index $(i, j)$ is
denoted by $P_{i,j}$. Within each processor, four ports, denoted by U, D, L and R (standing
for up, down, left and right, respectively), are provided; ports L, R are built in the
i-direction, and ports U, D are built in the j-direction. Through these ports, processors
are connected to the reconfigurable bus system.
In Fig. 1, a 6 x 4 PARBS is shown. (For ease of description, all PARBSs shown 
in this paper have the i-direction representing the horizontal direction and the 
j-direction representing the vertical direction.) The configurations of the reconfigur- 
able bus system are dynamically changeable by adjusting the local connections among 
ports within each processor. For example, by connecting port L to port R within each 
processor, horizontally straight buses are established to connect the processors of the 
same row together (see Fig. 2a). These horizontally straight buses are further split into 
sub-buses, if some processors disconnect their local connections (see Fig. 2b). Besides,
other configurations of the reconfigurable bus system are allowed, as long as they can 
be formed by properly adjusting the local connections among ports within each 
processor. When no local connection among ports is set within each processor, 
a square PARBS is functionally equivalent to a mesh-connected computer.
Fig. 1. A 2-d 6 x 4 PARBS.
Fig. 2. Two configurations of the reconfigurable bus system shown in Fig. 1.
When a bus configuration is established, processors that are attached to the same
bus can communicate with one another by broadcasting on the common bus. Like
[2, 9, 17], we assume that each broadcast takes one time unit. When more than one
processor attempts to broadcast values on the same bus simultaneously, a collision
occurs, and the final value received is undefined.
In this paper, we use $\{g_1\}, \{g_2\}, \ldots, \{g_t\}$ to represent the local connection within
a processor, where each $g_i$, $1 \le i \le t$, denotes a group of ports that are connected
together within the processor. For example, in Fig. 2a, the local connection within
each processor can be represented by {L, R}. Further, if the local connection within
a processor is represented by {U, L}, {D, R}, it means that two connections exist
within the processor; one connects ports U and L together, and the other connects
ports D and R together.
Although only the 2-d PARBS is introduced in this section, extension to higher 
dimensions is rather straightforward. 
3. Deriving algorithms based on function decomposition 
We use $U_{i,j}$, $D_{i,j}$, $L_{i,j}$ and $R_{i,j}$ to denote the four ports U, D, L and R of processor
$P_{i,j}$, respectively.
Definition 3.1. Let $f$ be a function that maps a set of integers to another set of integers,
and let $S$ be a subset of the domain of $f$. A function-mapping configuration of a 2-d $w \times h$
PARBS with respect to $f$ and $S$, which is denoted by $\mathrm{FMC}(f, S, w \times h)$, is a
bus configuration established on a 2-d $w \times h$ PARBS that has the following four
properties:
(1) $S \subseteq \{0, 1, \ldots, h-1\}$,
(2) $f(S) \subseteq \{0, 1, \ldots, h-1\}$, where $f(S) = \{f(j) \mid j \in S\}$,
(3) for each $j \in S$, ports $L_{0,j}$ and $R_{w-1,f(j)}$ are connected to the same bus, and
(4) no two ports $R_{w-1,j}$ and $R_{w-1,j'}$ are connected to the same bus, where
$0 \le j \le h-1$, $0 \le j' \le h-1$ and $j \ne j'$.
As two illustrative examples, an $\mathrm{FMC}(f(j) = 2j+1, \{0, 1, 2\}, 3 \times 6)$ and an
$\mathrm{FMC}(f(j) = j \operatorname{div} 2, \{0, 1, 2, 3, 4, 5\}, 3 \times 6)$ are depicted in Fig. 3.
Note that an $\mathrm{FMC}(f, S, w \times h)$ is also an $\mathrm{FMC}(f, S', w \times h)$ for any set $S' \subseteq S$. Also,
by definition, if processor $P_{0,j}$, $j \in S$, broadcasts a value on the bus to which port $L_{0,j}$ is
connected, only one processor ($P_{w-1,f(j)}$) on column $w-1$ can receive the value from
its port R ($R_{w-1,f(j)}$). Thus, an $\mathrm{FMC}(f, S, w \times h)$ is actually an efficient embedding of
function $f$ with domain restricted to $S$ into a $w \times h$ PARBS. For example, we can
determine any one value $f(j)$, $j \in S$, in O(1) time as follows. We first let processor
$P_{0,j}$ broadcast a signal on the bus to which port $L_{0,j}$ is connected. Then, $f(j)$ is
determined as $v$ if processor $P_{w-1,v}$ on column $w-1$ received the signal from port
$R_{w-1,v}$.
Fig. 3. Two function-mapping configurations.
We say that two function-mapping configurations $\mathrm{FMC}(f_0, S_0, w_0 \times h)$
and $\mathrm{FMC}(f_1, S_1, w_1 \times h)$ are composable if $f_0(S_0) \subseteq S_1$. For example, the two
function-mapping configurations of Fig. 3 are composable. Similarly, $r$ function-mapping
configurations, denoted by $\mathrm{FMC}(f_k, S_k, w_k \times h)$, $k = 0, 1, \ldots, r-1$, are composable
if $f_k f_{k-1} \cdots f_0(S_0) \subseteq S_{k+1}$ for $k = 0, 1, \ldots, r-2$. (Throughout this paper we use
$f_k f_{k-1} \cdots f_0(S_0)$ to denote $f_k(f_{k-1}(\cdots(f_0(S_0))\cdots))$.)
Let $c_k$ denote an $\mathrm{FMC}(f_k, S_k, w_k \times h)$, $k = 0, 1, \ldots, r-1$, and suppose $c_0, c_1, \ldots, c_{r-1}$
are composable. The composition of $c_0, c_1, \ldots, c_{r-1}$ is a bus configuration on a 2-d
$w \times h$ PARBS, where $w = w_0 + w_1 + \cdots + w_{r-1}$. The 2-d $w \times h$ PARBS is composed of
$r$ 2-d PARBSs: SUB-PARBS$_0$, SUB-PARBS$_1$, ..., SUB-PARBS$_{r-1}$, where each
SUB-PARBS$_k$, $0 \le k \le r-1$, is of dimension $w_k \times h$ and contains processors ranging
from column $w_0 + w_1 + \cdots + w_{k-1}$ to column $(w_0 + w_1 + \cdots + w_k) - 1$. The bus
configuration of each SUB-PARBS$_k$, $0 \le k \le r-1$, is established as $c_k$.
As an illustrative example, Fig. 4 shows the composition of the two function- 
mapping configurations of Fig. 3. It is not difficult to see that the composition is 
indeed an $\mathrm{FMC}(f(j) = (2j+1) \operatorname{div} 2, \{0, 1, 2\}, 6 \times 6)$. In fact, for any two composable
function-mapping configurations, we have the following lemma. 
Lemma 3.2. Let $c_0$, $c_1$ denote an $\mathrm{FMC}(f_0, S_0, w_0 \times h)$ and an $\mathrm{FMC}(f_1, S_1, w_1 \times h)$,
respectively, and suppose they are composable. The composition of $c_0$ and $c_1$ is an
$\mathrm{FMC}(f_1 f_0, S_0, w \times h)$, where $w = w_0 + w_1$.
Lemma 3.2 is clear, and the proof is omitted. 
Fig. 4. The composition of the two function-mapping configurations depicted in Fig. 3.
Let us consider Fig. 4 again. According to Lemma 3.2, we know that if processor
$P_{0,j}$, $j \in \{0, 1, 2\}$, broadcasts a signal on the bus to which port $L_{0,j}$ is connected, only
one processor ($P_{5,(2j+1) \operatorname{div} 2}$) on column 5 can receive the signal from port
R ($R_{5,(2j+1) \operatorname{div} 2}$). However, processor $P_{2,2j+1}$ is now no longer the unique one on
column 2 that can receive the signal from port R, although the bus configuration
established on the sub-PARBS that contains processors ranging from column 0 to
column 2 is an $\mathrm{FMC}(f(j) = 2j+1, \{0, 1, 2\}, 3 \times 6)$. For example, both processors
$P_{2,0}$ and $P_{2,1}$ can receive the signal from port $R_{2,0}$ and port $R_{2,1}$, respectively, if
processor $P_{0,0}$ broadcasts a signal on the bus to which port $L_{0,0}$ is connected. Note
that this situation occurs because the function $f(j) = j \operatorname{div} 2$ is not one-to-one.
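The loss of uniqueness on the boundary column can be reproduced with a small abstract model. The following Python sketch is illustrative only: the helper names `compose` and `rows_hearing_signal` are not from the paper, and the model assumes, as in Fig. 3, that the second configuration has $S_1 = \{0, \ldots, h-1\}$, so that port R of every row of the boundary column is wired into the second sub-PARBS.

```python
def compose(f0, f1):
    """Function realized by the composition of two FMCs (Lemma 3.2)."""
    return lambda j: f1(f0(j))


def rows_hearing_signal(f0, f1, h, j):
    """Rows v of the boundary column whose port R shares a bus with L_{0,j}.

    Abstract model, assuming S_1 = {0, ..., h-1}: port R of row v on the
    boundary column is wired to port L of row v of the second sub-PARBS,
    which lies on the bus that ends at row f1(v) of the last column.
    """
    target = f1(f0(j))
    return [v for v in range(h) if f1(v) == target]


f0 = lambda j: 2 * j + 1   # first configuration of Fig. 3
f1 = lambda j: j // 2      # second configuration of Fig. 3
f = compose(f0, f1)

print([f(j) for j in (0, 1, 2)])           # values of (2j + 1) div 2
print(rows_hearing_signal(f0, f1, 6, 0))   # rows 0 and 1 for a signal from P_{0,0}
```

Because $j \operatorname{div} 2$ sends rows 0 and 1 to the same output row, both rows appear in the result for $j = 0$; a one-to-one second function would leave a single row.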
By applying Lemma 3.2 repeatedly, we have the following lemma. 
Lemma 3.3. Let $c_k$ denote an $\mathrm{FMC}(f_k, S_k, w_k \times h)$, $k = 0, 1, \ldots, r-1$, and suppose
$c_0, c_1, \ldots, c_{r-1}$ are composable. The composition of $c_0, c_1, \ldots, c_{r-1}$ is an
$\mathrm{FMC}(f_{r-1} \cdots f_1 f_0, S_0, w \times h)$, where $w = w_0 + w_1 + \cdots + w_{r-1}$.
Suppose a given problem can be expressed as a function $F : X \to Y$, where $X$ denotes
the set of all possible input instances and $Y = \{0, 1, 2, \ldots, h-1\}$ denotes the set of their
corresponding answers. According to the above discussion, we can derive an algorithm
on a 2-d PARBS for the problem, if for each $x \in X$,
(1) there exist an integer $j_x$ and $r$ functions $f_{x,0}, f_{x,1}, \ldots, f_{x,r-1}$ such that
$F(x) = f_{x,r-1} \cdots f_{x,1} f_{x,0}(j_x)$, where $0 \le j_x \le h-1$,
(2) there exist $r$ composable function-mapping configurations $c_{x,0}, c_{x,1}, \ldots, c_{x,r-1}$,
where $c_{x,k}$ is an $\mathrm{FMC}(f_{x,k}, S_{x,k}, w_{x,k} \times h)$, $0 \le k \le r-1$, and $j_x \in S_{x,0}$, and
(3) there is a way to construct each $c_{x,k}$, $0 \le k \le r-1$.
The derived algorithm runs on a 2-d $w \times h$ PARBS ($w = w_{x,0} + w_{x,1} + \cdots + w_{x,r-1}$),
where the composition of $c_{x,0}, c_{x,1}, \ldots, c_{x,r-1}$ is established.
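Conditions (1) and (2) can be checked mechanically for a candidate decomposition. The sketch below is a hedged illustration in Python, not part of the paper: the names `are_composable` and `evaluate` are hypothetical, functions are plain Python callables, and the sets $S_{x,k}$ are Python sets. It tests the composability condition $f_k \cdots f_0(S_0) \subseteq S_{k+1}$ and evaluates $F(x)$ by following the signal from row $j_x$.

```python
def are_composable(fs, Ss):
    """Check that f_k ... f_0(S_0) is a subset of S_{k+1} for k = 0, ..., r-2."""
    image = set(Ss[0])
    for k in range(len(fs) - 1):
        image = {fs[k](j) for j in image}
        if not image <= set(Ss[k + 1]):
            return False
    return True


def evaluate(fs, j_x):
    """F(x) = f_{r-1}( ... f_1(f_0(j_x)) ... ), the row reached on the last column."""
    row = j_x
    for f in fs:
        row = f(row)
    return row


# The two configurations of Fig. 3: f_0(j) = 2j + 1 on {0, 1, 2}, f_1(j) = j div 2 on {0, ..., 5}.
fs = [lambda j: 2 * j + 1, lambda j: j // 2]
print(are_composable(fs, [{0, 1, 2}, set(range(6))]))   # composable
print(are_composable(fs, [{0, 1, 2}, {0, 1, 2}]))       # f_0(S_0) = {1, 3, 5} is not inside S_1
print(evaluate(fs, 2))                                  # (2*2 + 1) div 2
```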
The proposed approach first decomposes the function $F$ into $r$ composable functions
$f_{x,0}, f_{x,1}, \ldots, f_{x,r-1}$, and then constructs their corresponding function-mapping
configurations. The effectiveness of the proposed approach is shown by some important
applications in the following section.
4. Applications 
4.1. Computing the logical exclusive-OR of n bits 
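This problem can be expressed as a function $F : X \to Y$, where
$X = \{(b_0, b_1, \ldots, b_{n-1}) \mid b_k = 0, 1 \text{ for } k = 0, \ldots, n-1\}$, $Y = \{0, 1\}$ and
$F(b_0, b_1, \ldots, b_{n-1}) = b_0 \oplus b_1 \oplus \cdots \oplus b_{n-1}$. Here, $\oplus$ denotes the logical exclusive-OR operation.
Define two functions $\mathrm{XOR}_0$, $\mathrm{XOR}_1$ as follows: $\mathrm{XOR}_0 = \{(0, 0), (1, 1)\}$ and
$\mathrm{XOR}_1 = \{(0, 1), (1, 0)\}$ (i.e., for each $b \in \{0, 1\}$, $\mathrm{XOR}_0(b) = b$ and $\mathrm{XOR}_1(b) = \bar{b}$). Then,
$F(x) = \mathrm{XOR}_{b_{n-1}} \cdots \mathrm{XOR}_{b_1} \mathrm{XOR}_{b_0}(0)$ for each $x = (b_0, b_1, \ldots, b_{n-1}) \in X$ (i.e., $j_x = 0$, $r = n$
and $f_{x,k} = \mathrm{XOR}_{b_k}$ for $k = 0, \ldots, n-1$). A function-mapping configuration $c_{x,k}$, which is
an $\mathrm{FMC}(\mathrm{XOR}_{b_k}, \{0, 1\}, 2 \times 3)$, can be established on a 2-d $2 \times 3$ PARBS as follows
(see Fig. 5).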
Step 0: Initially, $b_k$ is stored in processor $P_{0,0}$.
Step 1: The value $b_k$ is broadcast to all processors.
Processor $P_{0,0}$ sends a copy of $b_k$ to processor $P_{1,0}$. Then, all the processors
establish the local connection {U, D} to form straight buses along the j-direction.
Then, processors $P_{0,0}$ and $P_{1,0}$ broadcast $b_k$ on the established buses.
After step 1, each processor owns a copy of $b_k$.
Step 2: Establish the bus configuration as an $\mathrm{FMC}(\mathrm{XOR}_{b_k}, \{0, 1\}, 2 \times 3)$.
Each processor $P_{0,j}$, $0 \le j \le 2$, establishes the local connection {L, R} if $j \le 1$ and
$b_k = 0$, {L, U} if $j = 0$ and $b_k = 1$, {L, R}, {U, D} if $j = 1$ and $b_k = 1$, and {D, R} if $j = 2$ and
$b_k = 1$. And each processor $P_{1,j}$, $0 \le j \le 2$, establishes the local connection {L, R} if $j \le 1$
and $b_k = 0$, {U, R} if $j = 0$ and $b_k = 1$, {L, D}, {U, R} if $j = 1$ and $b_k = 1$, and {L, D} if $j = 2$
and $b_k = 1$.
Fig. 5. An $\mathrm{FMC}(\mathrm{XOR}_{b_k}, \{0, 1\}, 2 \times 3)$, $0 \le k \le n-1$. (a) $b_k = 0$. (b) $b_k = 1$.
Since $\mathrm{XOR}_{b_k} \cdots \mathrm{XOR}_{b_1} \mathrm{XOR}_{b_0}(\{0, 1\}) \subseteq \{0, 1\}$ for $k = 0, 1, \ldots, n-2$, $c_{x,0}, c_{x,1}, \ldots, c_{x,n-1}$
are composable and their composition is an
$\mathrm{FMC}(\mathrm{XOR}_{b_{n-1}} \cdots \mathrm{XOR}_{b_1} \mathrm{XOR}_{b_0}, \{0, 1\}, 2n \times 3)$. For example, the composition of
$c_{x,0}, c_{x,1}, c_{x,2}, c_{x,3}$ for $x = (b_0, b_1, b_2, b_3) = (1, 0, 1, 1)$ is depicted in Fig. 6. The composition of
$c_{x,0}, c_{x,1}, \ldots, c_{x,n-1}$ can be established in O(1) time on a 2-d $2n \times 3$ PARBS if $b_k$,
$0 \le k \le n-1$, is initially stored in processor $P_{2k,0}$. Thus, the problem of computing the
logical exclusive-OR of n bits can be solved in O(1) time on a 2-d $2n \times 3$ PARBS.
Note that since the functions $\mathrm{XOR}_0$, $\mathrm{XOR}_1$ are one-to-one, all the values
$b_0 \oplus b_1 \oplus \cdots \oplus b_i$, $i = 0, 1, \ldots, n-2$, are also computed simultaneously.
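The local connections of Step 2 can be checked with a small port-level simulation. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes that port R of $P_{i,j}$ is wired to port L of $P_{i+1,j}$ and that port U of $P_{i,j}$ is wired to port D of $P_{i,j+1}$ (row index increasing upward), and it models buses with a union-find structure. All function names (`find`, `union`, `local_groups`, `build_buses`, `xor_on_parbs`) are hypothetical.

```python
from itertools import combinations


def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)


def local_groups(side, j, b):
    """Local connection of P_{side,j} inside one 2 x 3 block, as listed in Step 2."""
    if b == 0:
        return [("L", "R")] if j <= 1 else []
    if side == 0:
        return {0: [("L", "U")], 1: [("L", "R"), ("U", "D")], 2: [("D", "R")]}[j]
    return {0: [("U", "R")], 1: [("L", "D"), ("U", "R")], 2: [("L", "D")]}[j]


def build_buses(bits, height=3):
    """Merge ports into buses for the 2n x 3 composition; bit b_k controls columns 2k, 2k+1."""
    width = 2 * len(bits)
    parent = {(i, j, p): (i, j, p)
              for i in range(width) for j in range(height) for p in "UDLR"}
    for i in range(width):
        for j in range(height):
            if i + 1 < width:    # assumed fixed wiring between horizontal neighbours
                union(parent, (i, j, "R"), (i + 1, j, "L"))
            if j + 1 < height:   # assumed fixed wiring between vertical neighbours
                union(parent, (i, j, "U"), (i, j + 1, "D"))
            for group in local_groups(i % 2, j, bits[i // 2]):
                for p, q in combinations(group, 2):
                    union(parent, (i, j, p), (i, j, q))
    return parent, width


def xor_on_parbs(bits):
    """Row of the last column whose port R hears a signal sent from port L of P_{0,0}."""
    parent, width = build_buses(bits)
    source = find(parent, (0, 0, "L"))
    hits = [v for v in range(3) if find(parent, (width - 1, v, "R")) == source]
    assert len(hits) == 1   # property (4) of Definition 3.1
    return hits[0]


print(xor_on_parbs([1, 0, 1, 1]))   # the instance of Fig. 6
```

Under these wiring assumptions the row reported for $(1, 0, 1, 1)$ is 1, which equals $1 \oplus 0 \oplus 1 \oplus 1$.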
4.2. Summing n bits 
This problem can be expressed as a function $F : X \to Y$, where
$X = \{(b_0, b_1, \ldots, b_{n-1}) \mid b_k = 0, 1 \text{ for } k = 0, \ldots, n-1\}$, $Y = \{0, 1, \ldots, n\}$ and
$F(b_0, b_1, \ldots, b_{n-1}) = b_0 + b_1 + \cdots + b_{n-1}$.
Define two integer functions $\mathrm{INCR}_0$, $\mathrm{INCR}_1$ as follows: for each integer $j$,
$\mathrm{INCR}_0(j) = j$ and $\mathrm{INCR}_1(j) = j + 1$. Then, $F(x) = \mathrm{INCR}_{b_{n-1}} \cdots \mathrm{INCR}_{b_1} \mathrm{INCR}_{b_0}(0)$ for
each $x = (b_0, b_1, \ldots, b_{n-1}) \in X$ (i.e., $j_x = 0$, $r = n$ and $f_{x,k} = \mathrm{INCR}_{b_k}$ for $k = 0, 1, \ldots, n-1$).
Clearly, a function-mapping configuration $c_{x,k}$, which is an
$\mathrm{FMC}(\mathrm{INCR}_{b_k}, \{0, 1, \ldots, k\}, 1 \times (n+1))$, can be established in O(1) time on a 2-d $1 \times (n+1)$ PARBS if
$b_k$, $0 \le k \le n-1$, is initially stored in processor $P_{0,0}$. Figure 7 shows an example of
$n = 5$.
Fig. 6. The composition of an $\mathrm{FMC}(\mathrm{XOR}_1, \{0, 1\}, 2 \times 3)$, an $\mathrm{FMC}(\mathrm{XOR}_0, \{0, 1\}, 2 \times 3)$, an
$\mathrm{FMC}(\mathrm{XOR}_1, \{0, 1\}, 2 \times 3)$, and an $\mathrm{FMC}(\mathrm{XOR}_1, \{0, 1\}, 2 \times 3)$.
Fig. 7. An $\mathrm{FMC}(\mathrm{INCR}_{b_k}, \{0, 1, \ldots, k\}, 1 \times 6)$, $0 \le k \le 4$. (a) $b_k = 0$. (b) $b_k = 1$.
Since $\mathrm{INCR}_{b_k} \cdots \mathrm{INCR}_{b_1} \mathrm{INCR}_{b_0}(\{0\}) \subseteq \{0, 1, \ldots, k+1\}$ for $k = 0, 1, \ldots, n-2$,
$c_{x,0}, c_{x,1}, \ldots, c_{x,n-1}$ are composable. The composition of $c_{x,0}, c_{x,1}, \ldots, c_{x,n-1}$ can be
established in O(1) time on a 2-d $n \times (n+1)$ PARBS if $b_k$, $0 \le k \le n-1$, is initially
stored in processor $P_{k,0}$. Thus, the problem of summing n bits can be solved in O(1)
time on a 2-d $n \times (n+1)$ PARBS. Note that the derived algorithm can be
adapted for the problem of computing the prefix sums of n bits, since the functions
$\mathrm{INCR}_0$, $\mathrm{INCR}_1$ are one-to-one.
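As an informal illustration of the prefix-sum remark, the following Python sketch follows the signal abstractly, column by column, instead of simulating ports. The names `incr` and `prefix_sums` are hypothetical. Since each $\mathrm{INCR}_b$ is one-to-one, exactly one row of column $k$ carries the signal, and that row index is $b_0 + \cdots + b_k$.

```python
def incr(b):
    """INCR_b(j) = j + b for b in {0, 1}."""
    return lambda j: j + b


def prefix_sums(bits):
    """Row index carrying the signal after each column of the n x (n+1) composition."""
    row, sums = 0, []            # j_x = 0: the signal enters at port L of P_{0,0}
    for k, b in enumerate(bits):
        row = incr(b)(row)       # column k realizes INCR_{b_k}
        assert row <= k + 1      # image stays inside {0, ..., k+1}, so the FMCs compose
        sums.append(row)
    return sums


print(prefix_sums([1, 0, 1, 1, 0]))   # last entry is the sum of the five bits
```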
4.3. Summing n m-bit binary integers and multiplying two n-bit binary integers 
The multiplication of two n-bit binary integers $A = a_{n-1} a_{n-2} \cdots a_0$ and
$B = b_{n-1} b_{n-2} \cdots b_0$ can be computed by summing n 2n-bit integers $M_0, M_1, \ldots, M_{n-1}$,
where $M_k = (A * b_k) * 2^k$, $0 \le k \le n-1$. Therefore, we only show how to solve the
problem of summing n m-bit integers.
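The reduction from multiplication to summation can be checked numerically. The short Python sketch below is illustrative only (the name `partial_products` is hypothetical); it forms the integers $M_k = (A * b_k) * 2^k$ and confirms that each fits in $2n$ bits and that they add up to $A * B$.

```python
def partial_products(a, b, n):
    """M_k = (A * b_k) * 2^k for k = 0, ..., n-1, where b_k is bit k of B."""
    return [(a * ((b >> k) & 1)) << k for k in range(n)]


n = 4
a, b = 0b1011, 0b0110            # A = 11, B = 6
ms = partial_products(a, b, n)
print(ms, sum(ms), a * b)        # the M_k, their sum, and the product A * B
```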
Let $A_0 = a_{0,m-1} a_{0,m-2} \cdots a_{0,0}$, $A_1 = a_{1,m-1} a_{1,m-2} \cdots a_{1,0}$, ...,
$A_{n-1} = a_{n-1,m-1} a_{n-1,m-2} \cdots a_{n-1,0}$ be n m-bit integers and $S = s_{m-1} s_{m-2} \cdots s_0$ be their
sum. While computing S, the carry to bit position m is ignored if it is generated.
Mathematically, $S = (A_0 + A_1 + \cdots + A_{n-1}) \bmod 2^m$. Define $u_{-1} = 0$ and
$u_t = (a_{0,t} + a_{1,t} + \cdots + a_{n-1,t}) + (u_{t-1} \operatorname{div} 2)$ for $t = 0, 1, \ldots, m-1$, where $u_{t-1} \operatorname{div} 2$ is the carry to bit
position $t$. Then, $s_t = u_t \bmod 2$, for $t = 0, 1, \ldots, m-1$. Since the carry to bit position $t$ is
less than $n$, we have $u_t \le 2n - 1$. The problem of summing n m-bit integers can be solved
in O(1) time if all $u_t$'s can be determined in O(1) time.
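The recurrence for $u_t$ can be exercised directly. The Python sketch below is a plain sequential check, not the O(1)-time PARBS procedure; the names `column_sums_with_carry` and `sum_mod_2m` are hypothetical. It computes every $u_t$, rebuilds $S$ from the bits $s_t = u_t \bmod 2$, and lets the bound $u_t \le 2n - 1$ be verified on examples.

```python
def column_sums_with_carry(numbers, m):
    """u_t = (a_{0,t} + ... + a_{n-1,t}) + (u_{t-1} div 2), with u_{-1} = 0."""
    u, prev = [], 0
    for t in range(m):
        column = sum((a >> t) & 1 for a in numbers)   # bits at position t
        prev = column + prev // 2                     # add the carry into position t
        u.append(prev)
    return u


def sum_mod_2m(numbers, m):
    """S = (A_0 + ... + A_{n-1}) mod 2^m, rebuilt from s_t = u_t mod 2."""
    u = column_sums_with_carry(numbers, m)
    return sum((u_t % 2) << t for t, u_t in enumerate(u))


numbers, m = [13, 7, 10], 4
print(column_sums_with_carry(numbers, m))    # the values u_0, ..., u_3
print(sum_mod_2m(numbers, m), sum(numbers) % 2 ** m)
```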
The function-mapping configuration for computing $u_t$ can be easily established; it is
merely the combination of the function-mapping configurations for division by
2 (shown in Fig. 3b) and for summation of n bits (shown in Fig. 7), respectively. The
details are as follows.
We express the problem of computing $u_t$ as a function $F : X \to Y$, where
$X = \{(a_{0,m-1} a_{0,m-2} \cdots a_{0,0},\ a_{1,m-1} a_{1,m-2} \cdots a_{1,0},\ \ldots,\ a_{n-1,m-1} a_{n-1,m-2} \cdots a_{n-1,0}) \mid a_{k,l} = 0, 1$
for $k = 0, \ldots, n-1$ and $l = 0, \ldots, m-1\}$, $Y = \{0, 1, \ldots, 2n-1\}$ and
$F(a_{0,m-1} a_{0,m-2} \cdots a_{0,0},\ a_{1,m-1} a_{1,m-2} \cdots a_{1,0},\ \ldots,\ a_{n-1,m-1} a_{n-1,m-2} \cdots a_{n-1,0}) = u_t$. We also define $n+1$
functions $sf_0, sf_1, \ldots, sf_n$ as follows: for each $j \in \{0, 1, \ldots, 2n-1\}$, $sf_k(j) = k + (j \operatorname{div} 2)$,
$k = 0, 1, \ldots, n$. Then,
$F(x) = sf_{a_{0,t} + a_{1,t} + \cdots + a_{n-1,t}} \cdots sf_{a_{0,1} + a_{1,1} + \cdots + a_{n-1,1}}\, sf_{a_{0,0} + a_{1,0} + \cdots + a_{n-1,0}}(0)$
for each $x = (a_{0,m-1} a_{0,m-2} \cdots a_{0,0},\ a_{1,m-1} a_{1,m-2} \cdots a_{1,0},\ \ldots,\ a_{n-1,m-1} a_{n-1,m-2} \cdots a_{n-1,0}) \in X$
(i.e., $j_x = 0$, $r = t+1$ and $f_{x,k} = sf_{a_{0,k} + a_{1,k} + \cdots + a_{n-1,k}}$ for $k = 0, 1, \ldots, t$). A function-mapping
configuration $c_{x,k}$, which is an $\mathrm{FMC}(sf_{a_{0,k} + a_{1,k} + \cdots + a_{n-1,k}}, \{0, 1, \ldots, 2n-1\}, 2n \times 2n)$,
can be established on a 2-d $2n \times 2n$ PARBS as follows. (See Fig. 8,
where an example with $n = 3$, $a_{0,k} = 1$, $a_{1,k} = 0$ and $a_{2,k} = 1$ is shown.)
Step 0: Initially, each $a_{i,k}$ is stored in processor $P_{i+n,0}$, $0 \le i \le n-1$.
Step 1: The value $a_{i,k}$ is broadcast to each processor $P_{i+n,j}$, $0 \le i \le n-1$,
$0 \le j \le 2n-1$.
Each processor $P_{i+n,j}$, $0 \le i \le n-1$, $0 \le j \le 2n-1$, establishes the local connection
{U, D} to form straight buses along the j-direction. Then, each processor $P_{i+n,0}$,
$0 \le i \le n-1$, broadcasts $a_{i,k}$ on the established bus to which it is connected.
After step 1, each processor $P_{i+n,j}$, $0 \le i \le n-1$, $0 \le j \le 2n-1$, owns a copy of $a_{i,k}$.
Step 2: Establish the bus configuration as an $\mathrm{FMC}(sf_{a_{0,k} + a_{1,k} + \cdots + a_{n-1,k}}, \{0, 1, \ldots, 2n-1\}, 2n \times 2n)$.
Each processor $P_{i,j}$, $0 \le i \le n-1$, $0 \le j \le 2n-1$, establishes the local connection
{L, D}, {U, R} if $i < j$, {U, L, R} if $i = j$ and {L, R} if $i > j$. And, each processor
$P_{i+n,j}$, $0 \le i \le n-1$, $0 \le j \le 2n-1$, establishes the local connection {L, R} if $a_{i,k} = 0$ and
{L, U}, {D, R} if $a_{i,k} = 1$.
Fig. 8. An $\mathrm{FMC}(sf_{1+0+1}, \{0, 1, \ldots, 5\}, 6 \times 6)$.
Note that $c_{x,k}$ is really the composition of $n+1$ composable function-mapping
configurations; the first one is an $\mathrm{FMC}(f(j) = j \operatorname{div} 2, \{0, 1, \ldots, 2n-1\}, n \times 2n)$, and the
$l$th one is an $\mathrm{FMC}(\mathrm{INCR}_{a_{l-2,k}}, \{0, 1, \ldots, n+l-3\}, 1 \times 2n)$ for $l = 2, 3, \ldots, n+1$. Therefore,
$c_{x,k}$ is an $\mathrm{FMC}(sf_{a_{0,k} + a_{1,k} + \cdots + a_{n-1,k}}, \{0, 1, \ldots, 2n-1\}, 2n \times 2n)$.
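The local connections of Step 2 for $c_{x,k}$ can also be checked at the port level. The Python sketch below is an illustration under assumptions, not the authors' code: it assumes that port R of $P_{i,j}$ is wired to port L of $P_{i+1,j}$, that port U of $P_{i,j}$ is wired to port D of $P_{i,j+1}$, and it represents buses with a union-find structure. The names `find`, `union`, `local_groups` and `rows_reached` are hypothetical.

```python
from itertools import combinations


def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)


def local_groups(i, j, a):
    """Local connection of P_{i,j} in the 2n x 2n block, as listed in Step 2."""
    n = len(a)
    if i < n:                                    # columns 0..n-1: division by 2
        if i < j:
            return [("L", "D"), ("U", "R")]
        return [("U", "L", "R")] if i == j else [("L", "R")]
    # columns n..2n-1: one INCR column per bit a_{i-n,k}
    return [("L", "R")] if a[i - n] == 0 else [("L", "U"), ("D", "R")]


def rows_reached(a, j):
    """Rows of the last column whose port R shares a bus with port L of P_{0,j}."""
    size = 2 * len(a)
    parent = {(i, r, p): (i, r, p)
              for i in range(size) for r in range(size) for p in "UDLR"}
    for i in range(size):
        for r in range(size):
            if i + 1 < size:     # assumed fixed wiring between horizontal neighbours
                union(parent, (i, r, "R"), (i + 1, r, "L"))
            if r + 1 < size:     # assumed fixed wiring between vertical neighbours
                union(parent, (i, r, "U"), (i, r + 1, "D"))
            for group in local_groups(i, r, a):
                for p, q in combinations(group, 2):
                    union(parent, (i, r, p), (i, r, q))
    source = find(parent, (0, j, "L"))
    return [v for v in range(size) if find(parent, (size - 1, v, "R")) == source]


print(rows_reached([1, 0, 1], 5))   # the instance of Fig. 8 with j = 5
```

For the instance of Fig. 8 and $j = 5$, the single row reported is 4, which equals $sf_2(5) = 2 + (5 \operatorname{div} 2)$.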
Clearly, $c_{x,0}, c_{x,1}, \ldots, c_{x,t}$, $0 \le t \le m-1$, are composable, and their composition can
be established in O(1) time on a 2-d $2(t+1)n \times 2n$ PARBS if each $a_{i,k}$, $0 \le i \le n-1$,
$0 \le k \le t$, is initially stored in processor $P_{(2k+1)n+i,0}$. Thus, the value $u_t$ can be
computed in O(1) time on a 2-d $2(t+1)n \times 2n$ PARBS. Using this result, the problem
of summing n m-bit integers can be solved in O(1) time on a 2-d $2mn \times 2mn$ PARBS if
each $a_{i,k}$, $0 \le i \le n-1$, $0 \le k \le m-1$, is initially stored in processor $P_{(2k+1)n+i,0}$ and
proper data routing is performed.
Based on the above discussion, we know that the problem of multiplying two n-bit
binary integers can be solved in O(1) time on a 2-d $4n^2 \times 4n^2$ PARBS ($m = 2n$ in this
case), if proper data routing is performed to create the necessary n 2n-bit binary
integers. We omit the routing procedure here. It is not difficult for an interested reader 
to work out the details. 
5. Reduce the size of PARBSs 
It is possible for a given problem to derive more than one algorithm by function 
decomposition. Different algorithms may require PARBSs of different sizes. For the 
purpose of cost-effectiveness, we wish to minimize the size of the PARBS (i.e., the 
number of processors used). So, in this section, we suggest two approaches that are
useful in reducing the size of the PARBS. 
The first approach is to decompose $F$, which expresses the input problem, into
composable functions $f_{x,k}$ more carefully. It is clear that the decomposition of $F$ is not
unique, and different decompositions of $F$ may result in PARBSs of different sizes. If
we choose the $f_{x,k}$'s carefully, a smaller PARBS can be obtained. For example, let us
consider again the problem of computing the logical exclusive-OR of n bits. If
$f_{x,k} = \mathrm{XOR}_{b_k}$ is chosen as in the previous section, the resulting PARBS is of size $2n \times 3$.
If, instead, we let (assume n is even) $f_{x,k} = \{(0, 1), (1, 2)\}$ if $k$ is even and $b_k = 0$,
$f_{x,k} = \{(0, 2), (1, 1)\}$ if $k$ is even and $b_k = 1$, $f_{x,k} = \{(1, 0), (2, 1)\}$ if $k$ is odd and $b_k = 0$, and
$f_{x,k} = \{(1, 1), (2, 0)\}$ if $k$ is odd and $b_k = 1$, then the resulting PARBS is of size $n \times 3$, since
each of $\mathrm{FMC}(f_{x,2i}, \{0, 1\}, 1 \times 3)$ and $\mathrm{FMC}(f_{x,2i+1}, \{1, 2\}, 1 \times 3)$ can be established in
O(1) time on a 2-d $1 \times 3$ PARBS.
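That this alternative decomposition still computes the exclusive-OR can be confirmed by following the row index through the four tables. The Python sketch below is illustrative; the names `EVEN`, `ODD` and `xor_compact` are hypothetical, and each table simply transcribes the pairs given in the text.

```python
# f_{x,k} for even k, indexed by b_k: maps rows {0, 1} to rows {1, 2}
EVEN = {0: {0: 1, 1: 2}, 1: {0: 2, 1: 1}}
# f_{x,k} for odd k, indexed by b_k: maps rows {1, 2} back to rows {0, 1}
ODD = {0: {1: 0, 2: 1}, 1: {1: 1, 2: 0}}


def xor_compact(bits):
    """Row reached on the last column of the n x 3 composition (n even)."""
    assert len(bits) % 2 == 0
    row = 0                                   # j_x = 0
    for k, b in enumerate(bits):
        row = (EVEN if k % 2 == 0 else ODD)[b][row]
    return row


print(xor_compact([1, 0, 1, 1]))   # exclusive-OR of the four bits
```

After every even-indexed column the row index is the running exclusive-OR plus one, and every odd-indexed column removes that offset again, which is why each step fits in a single column.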
The second approach is to transform the input problem into another equivalent 
problem. The solution of the original problem can be obtained by solving the newly 
obtained problem. Besides, more importantly, solving the new problem requires fewer 
processors than solving the original problem. For example, let us consider again the 
problem of summing n bits, which we have solved in the previous section in O(1) time 
on a 2-d n x (n + 1) PARBS. In the following, we show that this problem can be solved 
with fewer processors. 
Let $p_0, p_1, \ldots, p_{t-1}$ be pairwise relatively prime positive integers. According to the Chinese
remainder theorem [6], a nonnegative integer $i$ smaller than the product $p_0 p_1 \cdots p_{t-1}$ can
be uniquely determined if the values $i \bmod p_0, i \bmod p_1, \ldots, i \bmod p_{t-1}$ are known. Thus,
choosing $t = 2$, $p_0 = \lceil n^{1/2} \rceil$ and $p_1 = \lceil n^{1/2} \rceil + 1$, the sum of n bits $b_0, b_1, \ldots, b_{n-1}$ can be
determined if the values $(b_0 + b_1 + \cdots + b_{n-1}) \bmod p_0$ and $(b_0 + b_1 + \cdots + b_{n-1}) \bmod p_1$
are computed. These two values can be easily computed in O(1) time on 2-d PARBSs
of size $2n \times (p_0 + 1)$ and $2n \times (p_1 + 1)$, respectively. Since the same PARBS can be used
for mod $p_0$ and then for mod $p_1$, the problem of summing n bits can be solved in O(1)
time on a 2-d $2n \times (\lceil n^{1/2} \rceil + 2)$ PARBS. Indeed, by properly choosing
$p_0, p_1, \ldots, p_{t-1}$, it is possible to compute the sum of n bits in O(1) time on a 2-d
PARBS of size $O(n \times n^{\varepsilon})$ for any fixed $\varepsilon > 0$ (the interested readers may consult any
book, e.g. [14], on the theory of numbers for useful properties of integers).
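The choice $p_0 = \lceil n^{1/2} \rceil$, $p_1 = p_0 + 1$ can be checked numerically. In the Python sketch below the names `moduli` and `sum_from_residues` are hypothetical, and the recovery step is a brute-force search used only for illustration; it is not the PARBS procedure and not an efficient Chinese remainder computation.

```python
from math import isqrt


def moduli(n):
    """p_0 = ceil(sqrt(n)) and p_1 = p_0 + 1 for n >= 1 (consecutive, hence coprime)."""
    p0 = isqrt(n - 1) + 1
    return p0, p0 + 1


def sum_from_residues(r0, r1, p0, p1):
    """The unique s in [0, p0 * p1) with s mod p0 = r0 and s mod p1 = r1."""
    for s in range(p0 * p1):
        if s % p0 == r0 and s % p1 == r1:
            return s
    raise ValueError("residues are inconsistent")


bits = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]        # n = 10
p0, p1 = moduli(len(bits))
total = sum(bits)
print(p0, p1, sum_from_residues(total % p0, total % p1, p0, p1))
```

Because $p_0 p_1 > n$, every possible sum in $\{0, \ldots, n\}$ is recovered from its two residues.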
One more example is the problem of summing n m-bit integers $A_i = a_{i,m-1} a_{i,m-2} \cdots a_{i,0}$,
$i = 0, 1, \ldots, n-1$, which we have solved in the previous section in O(1) time on
a 2-d $2mn \times 2mn$ PARBS. Let $l = \lceil \log_2(n+1) \rceil$ and $X_j = x_{j,l-1} x_{j,l-2} \cdots x_{j,0}$ be the sum
of $a_{n-1,j}, a_{n-2,j}, \ldots, a_{0,j}$ for $j = 0, 1, \ldots, m-1$. We have
$A_0 + A_1 + \cdots + A_{n-1} = X_0 2^0 + X_1 2^1 + \cdots + X_{m-1} 2^{m-1}$. Also, let $Y_i$ denote an integer whose binary representation
is obtained by packing $X_i, X_{i+l}, X_{i+2l}, \ldots, X_{i+\lfloor (m-1-i)/l \rfloor l}$ into an $(m+l)$-bit integer
word (i.e., $Y_i = X_i 2^i + X_{i+l} 2^{i+l} + X_{i+2l} 2^{i+2l} + \cdots + X_{i+\lfloor (m-1-i)/l \rfloor l}\, 2^{i+\lfloor (m-1-i)/l \rfloor l}$) for
$i = 0, 1, \ldots, l-1$. Then, we have $A_0 + A_1 + \cdots + A_{n-1} = Y_0 + Y_1 + \cdots + Y_{l-1}$, which implies
that the problem of summing n m-bit integers can be transformed into the problem of
summing $l$ $(m+l)$-bit integers. The transformation requires the computation of all
$X_j$'s, which can be performed in O(1) time on a 2-d PARBS of size $O(n \times mn^{\varepsilon})$,
according to the discussion in the previous paragraph. Then, the $Y_i$'s can be obtained by
proper data routing (note that each $Y_i$ is distributed over $l + m$ processors, each
holding one bit). The routing procedure is rather lengthy, and is omitted in this paper.
The interested readers are encouraged to work out the details. By applying the result
of the previous section, the sum of $Y_0, Y_1, \ldots, Y_{l-1}$ can be computed in O(1) time on
a 2-d $2l(m+l) \times 2l(m+l)$ PARBS. Thus, the multiplication of two n-bit integers can be
performed in O(1) time on a 2-d PARBS of size $O(n \log n \times n^{1+\varepsilon})$.
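The packing transformation can be verified on small inputs. The Python sketch below is a sequential illustration with hypothetical names (`column_sums`, `packed_words`); it computes the column sums $X_j$, packs every $l$th one into a word $Y_i$, and lets one confirm that the $Y_i$'s add up to the original sum. It uses the identity that `n.bit_length()` equals $\lceil \log_2(n+1) \rceil$ for $n \ge 1$.

```python
def column_sums(numbers, m):
    """X_j = a_{0,j} + a_{1,j} + ... + a_{n-1,j} for j = 0, ..., m-1."""
    return [sum((a >> j) & 1 for a in numbers) for j in range(m)]


def packed_words(numbers, m):
    """Y_i = sum of X_j * 2^j over all j in {i, i+l, i+2l, ...} with j <= m-1."""
    l = len(numbers).bit_length()            # equals ceil(log2(n + 1)) for n >= 1
    xs = column_sums(numbers, m)
    return [sum(xs[j] << j for j in range(i, m, l)) for i in range(l)]


numbers, m = [13, 7, 10, 15, 9], 4
ys = packed_words(numbers, m)
print(ys, sum(ys), sum(numbers))             # the Y_i, their sum, and A_0 + ... + A_{n-1}
```

Since each $X_j$ is at most $n < 2^l$, the terms packed into one $Y_i$ occupy disjoint groups of $l$ bit positions, so each $Y_i$ fits in $m + l$ bits.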
6. Discussion and conclusion 
In this paper, we have proposed a new approach, which is based on function 
decomposition, for deriving algorithms on PARBSs. Efficient algorithms for some 
important problems were also derived as examples. One more example is the problem 
of adding two n-bit binary integers, which can be found in [22]. 
Although many algorithm design strategies have been suggested for single-processor
computers (e.g., divide-and-conquer, dynamic programming and branch-and-bound)
and parallel random access machines (e.g., recursive doubling and
divide-and-conquer), no algorithm design strategy was found earlier for networks of
processors. Essentially, the function decomposition approach is an algorithm design
strategy for the PARBS. Its central idea is like “divide-and-conquer”. To solve 
a complex problem on the PARBS, we first decompose it into many subproblems (the 
divide step). Then, each of these subproblems is solved on the PARBS (the conquer 
step). Finally, the solutions of these subproblems on PARBSs are combined to form 
the solution to the original problem (the merge step). 
In the previous section, we have successfully used the Chinese remainder theorem to 
reduce the size of PARBSs used. In fact, this idea can be used elsewhere. For example, 
it is possible to reduce the number of processors used in Ben-Asher's sorting algorithm
and in the algorithm for summing n m-bit integers [2].
Finally, before ending this paper, the implementation aspect of the PARBS is 
discussed. As far as the authors know, there are three VLSI implementations that 
demonstrate the feasibility and benefits of the 2-d PARBS: the YUPPIE (Yorktown 
ultra-parallel polymorphic image engine) chip [lo], the GCN (gated-connection 
network) chip [19,20] and the chip developed by Gray and Kean [7]. Certainly, there 
is a gap between the theoretical PARBS model and their physical implementations. 
For example, the O(1) time claim about the broadcast delay in the PARBS is not true 
for the above three chips. However, the broadcast delay is very small. For example, 
only 16 machine cycles are required to broadcast on a one million-processor 
YUPPIE. The GCN has further shortened the delay by adopting precharged circuits. 
It can be expected that the gap will narrow considerably as hardware technology
progresses. In particular, it has been shown in [18]
that the O(1) time claim may become reasonable if the reconfigurable bus system
is implemented using optical fibers [5] as the underlying global bus system and
electrically controlled coupler switches (ECS) [5] for connecting or disconnecting two
fibers.
Acknowledgment 
The authors wish to express their gratitude to the anonymous referees for their 
valuable comments, which have improved this paper a lot. 
References 
[1] A. Aggarwal, Optimal bounds for finding maximum on array of processors with k global buses, IEEE
Trans. Comput. C-35 (1) (1986) 62-64. 
[2] Y. Ben-Asher, D. Peleg, R. Ramaswami and A. Schuster, The power of reconfiguration, J. Parallel 
Distributed Comput. 13 (2) (1991) 139-153. 
[3] S.H. Bokhari, Finding maximum on an array processor with a global bus, IEEE Trans. Comput.
C-33 (2) (1984) 133-139. 
[4] D.M. Champion and J. Rothstein, Immediate parallel solution of the longest common subsequence 
problem, in: Proc. 1987 Internat. Conf. on Parallel Processing (1987) 70-77.
[5] D.G. Feitelson, Optical Computing (MIT Press, Cambridge, MA, 1988).
[6] R.L. Graham, D.E. Knuth and O. Patashnik, Concrete Mathematics (Addison-Wesley, New York,
1989). 
[7] J.P. Gray and T.A. Kean, Configurable hardware: a new paradigm for computation, in: Proc. 10th 
Caltech Conf. on VLSI (1989) 279-295.
[8] D. Kim and K. Hwang, Mesh-connected array processors with bypass capability for signal/image
processing, in: Proc. Hawaii Conf. on System Science, 1988.
[9] H. Li and M. Maresca, Polymorphic-torus network, IEEE Trans. Comput. C-38 (9) (1989) 1345-1351. 
[10] M. Maresca and H. Li, Connection autonomy in SIMD computers: a VLSI implementation,
J. Parallel and Distributed Comput. 7 (2) (1989) 302-320.
[11] P. Mckinley, Multicast routing in spanning bus hypercubes, in: Proc. 1988 Internat. Conf. on Parallel
Processing Vol. 2 (1988) 204-211. 
[12] R. Miller, V.K. Prasanna Kumar, D. Reisis and Q.F. Stout, Data movement operations and 
applications on reconfigurable VLSI arrays, in: Proc. Internat. Conf. on Parallel Processing, Vol. 1
(1988) 205-208. 
[13] J.M. Moshell and J. Rothstein, Bus automata and immediate languages, Inform. and Control 40 (1) 
(1979) 88-121. 
[14] I. Niven and H.S. Zuckerman, An Introduction to the Theory of Numbers (Wiley, New York, 3rd edn., 
1972). 
[15] V.K. Prasanna Kumar and C.S. Raghavendra, Array processor with multiple broadcasting, J. Parallel
and Distributed Comput. 4 (2) (1987) 173-190.
[16] J. Rothstein, Toward pattern-recognizing visual prostheses, in: Proc. IFAC Symp., Control Aspects of 
Prosthetics and Orthotics, Columbus, 1982 (Pergamon, Oxford, 1982) 87-89. 
[17] J. Rothstein, Bus automata, brains, and mental models, IEEE Trans. Systems, Man Cybernet. 
SMC-18,4 (1988) 522-531. 
[18] A. Schuster and Y. Ben-Asher, Algorithms and optic implementation for reconfigurable networks, in: 
Proc. 5th Jerusalem Conf. on Information Technology, 1990.
[19] D.B. Shu, L.W. Chow and J.G. Nash, A content-addressable, bit-serial associative processor, in: Proc.
IEEE Workshop on VLSI Signal Processing, Monterey, CA, 1988.
[20] D.B. Shu and J.G. Nash, The gated interconnection network for dynamic programming, in:
S.K. Tewksbury et al. ed., Concurrent Computations (Plenum, New York, 1988).
[21] B.F. Wang and G.H. Chen, Constant time algorithms for the transitive closure problem and some 
related graph problems on processor arrays with reconfigurable bus systems, IEEE Transactions on 
Parallel and Distributed Systems, 1, 500-507. 
[22] B.F. Wang, G.H. Chen and H. Li, Configurational computation: a new computation method on 
processor arrays with reconfigurable bus systems, in: Proc. Internat. Conf. on Parallel Processing,
Vol. 3 (1991) 42-49. 
