A fast parallel algorithm for special linear systems of equations using
processor arrays with reconfigurable bus systems by Wankar, Rajeev et al.
A Fast Parallel Algorithm for Special Linear Systems of










*School of Computer Science, DAVV Indore, India
Abstract
A parallel algorithm using Processor Arrays with Reconfigurable Bus Systems has been designed to solve dense
Symmetric Positive Definite (SPD) systems of equations Ax = b. The key content of this report is the parallelisa-
tion of the algorithm by Delosme & Ipson [8]. In order to design a parallel algorithm for PARBS, many proce-
dures involved in [8] are handled in a slightly different way. The parallel time and processor’s complexity of each
step of the algorithm is calculated. The parallel time complexity is O(n) using number of Processing
Elements.
Introduction: There are many direct and iterative methods for solving system of equations
with SPD matrix [13,14,25,28,30]. One of the numerically stable approaches
to solve a system of equations is LU decomposition[13]. Besides factorization this method
involves a forward elimination and a back substitution process which cannot be parallelised
easily. In order to construct a parallel algorithm for this problem, we need to investigate a
method which does not involve these steps. One such method was proposed by Delosme and
Ipson. Other than these two parallel steps the algorithm presented in [8] includes many steps
which can be exploited for parallelism to reduce the complexity of the algorithm. Such work
has been carried out and a parallel algorithm based on the PRAM model of computation has
been published in [38]. Implementing this method on processor arrays with a reconfigurable
bus system requires a special technique with modifications to the existing algorithm. In Section
3 this implementation has been discussed. The report is designed in the following way: In Sec
tion 1 we give some important results from [8], in Section 2 we present a brief description of
the architecture used for designing the problem. In Section 3 we give a PARBS solution of the
problem and discuss the time complexity of each of the steps.
1. This work is supported by the German Academic Exchange Services (DAAD) under the “Sandwich
Model” fellowship with the author who is permanently working in the Department of Computer Sci-
ence, North Maharashtra University, Jalgaon (MS), India.
2n 2n× 5n×
Ax b= An n×A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 1
Section 1: Algorithm based on hyperbolic Cholesky factorization.
In this section a brief discussion on the method of Delosme and Ipsen [8], based on hyperbolic
rotations, for the solution of linear systems of equations
(1.1)
with symmetric positive definite coefficient matrix A, is given. Cholesky factor of A is first
determined by essentially pre-multiplying A with appropriate hyperbolic rotations, whose
product is called Q. Next simple matrix vector multiplication involving Q to the right hand side
of (1.1) provides a novel way of solving the system. Avoiding forward elimination and back
substitution is a very desirable feature for a parallel implementation.
1.1 The hyperbolic Cholesky factorization.
The computation of the Cholesky decomposition,
,  upper triangular (1.2)
of real symmetric positive definite(spd) matrix A = aij , by means of hyperbolic rotations
is called hyperbolic Cholesky factorization. Its derivation is based on a particular decompo-
sition of the matrix A as given in [8]. Since A is spd, ak,k is strictly positive and A(k) can be
written as the difference of the outer product.
(1.3)
where vk and wk are row vectors with elements
(1.4)
(1.5)
That is, vk consists of non zero rows of A(k) scaled by the square root of the diagonal element,




A UT U= Un n×
n n×



























=2 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
one has A = VTV-WTW. (1.7)
Lemma 1.1: Let R and S be upper triangular matrices such that RTR-STS is positive def-
inite, and let , .
 = where and
(1.8)
then is pseudoorthogonal, is upper triangular, and is strictly upper triangular (upper tri-
angular with zero diagonal).
Remarks:
1. Since the matrix constitute disjoint rotations, they commute and can be applied in any




2. The diagonal elements of have same sign as the corresponding diagonal elements of R;
thus if R has a positive diagonal,  also has a positive diagonal.
Theorem 1.1 (The Hyperbolic Cholesky Algorithm):
Let A be a spd matrix and V and W be upper triangular matrices as defined in (1.4) and
(1.5), so that A = VTV-WTW. Set
(1.12)
and apply the sequence of operations
n n×
μk k, rk k,
1–





Qˆ Q˜ n( )…Q˜ 1( )=
q˜ ij
k( )
1 i j= k or i j= n k+≠ ≠
1 μk
2–
–( ) 1– 2⁄ i j k or i j= n k+= = =
1 μk
2






qˆk k, qˆn k n k+,+ 1 μk
2
–( ) 1 2⁄– 1 k n≤ ≤,= =
qˆk n k+, qˆn k k,+ 1 μk
2
–( ) 1 2⁄– μk– 1 k n≤ ≤,= =








=A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 3
(1.13)
Where is obtained from R(l) and S(l) in the same way as is constructed from R and S in
lemma 1.2, and where P is the circular permutation matrix with P1,n = 1 and Pi,i-1 = 1,
, then R(n) = U, the Cholesky factor of A, and S(n) = 0.
Remarks:
The transformation performed by the hyperbolic Cholesky algorithm is denoted by
(1.14)
The matrix Q is pseudoorthogonal, since it is the product of pseudoorthogonal matrices.
Application of hyperbolic rotations to the solution of linear systems:
The hyperbolic Cholesky algorithm determines simultaneously the Cholesky factor of an
spd matrix and a set of n(n-1)/2 parameters , which defines the
hyperbolic rotations that make up the matrix Q. For the solution of the spd system ,
the above algorithm could be used to find followed by forward elimination to solve system
and by back substitution to solve . Instead the above algorithm can also be
used to find parameter followed by the application of the hyperbolic rotation to the b in a
particular way to get the solution vector x.
The algorithm for the solution of can be summed up as follows: Let be the upper
part of and its strictly upper triangular part ( = D2 + ). Using the hyperbolic
Cholesky algorithm as specified in the theorem, express the matrix Q as a product of hyper-




and D = diag , so
R l 1+( )
S l 1+( )
I 0
0 P




l 0 … n 1–, ,=
Qˆ l( ) Qˆ
n n×
2 i n≤ ≤
Q I 0
0 P
Qˆ n 1–( )… I 0
0 P





l( ) 1 l n 1–≤ ≤ 1 k n≤ ≤, ,
Ax b=
U


















, ,( )4 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
apply the same operation to  to obtain
(1.17)
and then apply these operations essentially in reverse order to get form
(1.18)
where  is an arbitrary real number, which can be equal to 0 or 1 for convenience.
and QT is given by
(1.19)
Section 2: The Computational Model.
Recently the Processor Array with Reconfigurable Bus Systems (PARBS) has drawn lot of
attention in the scientific community for its high performance computing with general purpose
processors. Various models of PARBS appear in the literature[5]. Common models are Bus
automaton, Polymorphic torus network[16], Reconfigurable Meshes[29,19], Bypass capabili-
ties etc.
An N1 x N2 x ...x Nr PARBS consist of an array of N1 x N2 x ...x Nr processors that is con-
nected to a r-dimensional grid shaped reconfigurable bus system. The processing elements are
connected to a bus through a fixed number of I/O ports. The ability to change the configuration
of the bus system by adjusting local or global switches makes this architecture interesting to
obtain various computational configurations like row, zig-zag, staircase & diagonal at run time.
A two dimensional processor array with a reconfigurable bus system of size N2, consisting of
identical processors, connected to a NxN rectangular mesh system, is called reconfigurable
mesh. In figure 1, we see processing elements connected to a grid of buses and a PE with its








x I I,( )ΔQT αU
T– b
1 α–( )L T– b
=
α
QT Qˆ 0( ) I 0
0 P 1–
Qˆ 1( ) I 0
0 P 1–
…Qˆ n 1–( ) I 0
0 P 1–
=A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 5
The basic computational unit of the reconfigurable mesh is the Processing Element (PE) which
consists of switches, small storage and an ALU. A PE is capable of performing following oper-
ations in one unit of time
1. Setting up a connection pattern.
2. Read from a or write onto a sub bus or memory
3. Performing logical and arithmetic operations
4. Disconnecting itself from the bus.
Higher dimensional PE’s can be formed in the same way. Figure 2 (b) shows a PE in 3-dimen-











Figure 1. Reconfigurable Mesh of size 4x4
x11 x12 x13 x14
x21 x22 x23 x24
x31 x32 x33 x34
x41 x42 x43 x44










Figure 2(b). A PE in 3-D.6 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
We assume that constant time is needed to broadcast values through the established buses irre-
spective of their distance from the first processor. Although it is a theoretical assumption and
somewhat unrealistic on the current architecture, with the advancement in the fibre optics com-
munication technology, this architecture is expected to gain wide popularity.
Section 3:
The parallel algorithm for spd systems works in stages, we first present a pseudocode for the
algorithm and then present the PARBS implementation.
Begin
[1] Read A,b
[2] for all k, in parallel do
          for all j, in parallel do
                 begin
               end;
[3]
[4] for all k, in parallel do
          for all j, in parallel do
                begin
               end;
[5] for all i, in parallel do
          for all j, in parallel do
                begin
1 k n≤ ≤
1 j n≤ ≤
V k j,
Ak j, Ak k,⁄ if j k≥
0 otherwise
=





1 k n≤ ≤
1 j n≤ ≤
Ri j, V i j,=
Ri n+ j, W i j,=
1 i 2n≤ ≤
1 j 2n≤ ≤A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 7




          [6.1] E = interchange( )
          [6.3] if ( ) then
                   begin
                         [6.3.1] F = qmult(Q,R)
                         [6.3.2] R = interchange(F)
                         [6.3.3] Q = qconstruct(R)
                         [6.3.4]  = qmult(Q,E)
        end;
            p = p + 1;
end;
[7] in parallel do
           begin
                  [7.1] Y = vmult(E, (bb = dmult( ,b)))
                  [7.2] Z = emult(H= (umult((I,I), )),ET)
           end;
[8] x = vmult( )
End.
The important results used in the design process are presented next.
Lemma 3.1. A semi group computation of a1+a2+...+an where ‘+’ denotes an associative oper-
ator can be performed in O(log n) time on a linear PARBS of n processors. Initially each proc-
essor owns a value. The computed value is stored in the leftmost processor. [29]
Lemma 3.2. On a 2-D, mesh, n data items d1, d2, ..., dn can be moved from the first col-
umn to the first row in O(1) time. [5]
Qi j,
tot 1 if i j else 0==







Z α1 α–⎝ ⎠⎛ ⎞ Y,
n n×8 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
Theorem 1. The transpose of a given square matrix Anxn can be obtained in constant time on
PARBS. Initially P(i,j,1) owns the values ai,j, after the computa-
tion is over P(i,j,1) contains the transpose of A.
Proof:
Step1. Establish straight bus in k-direction. Each processor connects port F to B. After this step
n2 straight buses are established, each one for n2 PARBSi,j, .
Step 2. Processor P(i,j,1) broadcast values A[i,j] on the buses such that A[i,j] is broadcasted on
P(i,j,j), . (figure 3(a)).
Step 3. Break-up the established sub buses (i.e. disconnect port F to B).
Step 4. Establish sub buses  in j-direction, i.e. connect port L to R.
Step 5. Broadcast values of A from P(i,j,k) to P(i,i,k), , . (figure
3(b)).
Step 6. Disconnect port L to R and establish new sub buses in i-direction and transmit data
items of A from processor P(i,i,k) to processor P(k,i,k), , (figure
3(c)).
Step 7. Disconnect established sub buses and establish new sub buses in k-direction and broad-
cast values back to P(i,j,1), (figure 3(d)).
n n× n× 1 i n≤ ≤ 1 j n≤ ≤,
1 i n≤ ≤ 1 j n≤ ≤,
1 i n≤ ≤ 1 j n≤ ≤,
1 i n≤ ≤ 1 j n≤ ≤, 1 k n≤ ≤
1 i n≤ ≤ 1 k n≤ ≤,A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 9
It is not difficult to see that the transpose of a matrix Anxn can be obtained in O(1) time with
 processors.
3.2 PARBS implementation of SPD solver.
Now we present an algorithm for SPD system of equations on PARBS.
Algorithm:
Phase I: We assume the existence of two working spaces where most of the operations are per-
formed.We assume that the values of A, b and Qtot are initially stored at the first layer of a













Figure 3(a): After steps 1-3
Figure 3(b): After steps 4-5
k=1 k=2 k=3 k=4














a11 a21 a31 a41
a12 a22 a32 a42
a13 a23 a33 a43
a14 a24 a34 a44
a11 a21 a31 a41
a12 a22 a32 a42
a13 a23 a33 a43
a14 a24 a34 a44
k=1 k=2 k=3 k=4
Figure 3(c): After step 6

















2n 2n× 5n× 1 i n≤ ≤ 1 j n≤ ≤,10 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
b is stored at processors P[i+n,n+1,1], . We assume that the identity matrix is stored
at the processors P[i,j+n,1], , as initial value for D.
Step 1: Establish sub buses in j-direction i.e. connect port L to R, broadcast values of A[i,i] to
P[i,j,1] on the established buses for the first rows and A[i+n,i] on the remaining
rows.
Step 2: For the first rows on the established sub buses, divide values of A[i,j], D[i,j] at
P[i,j,1] by the value received in step 1, i.e. by A[i,i] if the index , otherwise con-
vert A[i,j] = 0, .
Step 3: For the remaining rows on the established sub buses, perform the same operation as
in step 2, if the index (j+n) > i. After completion of this step we get initial values of
Rnxn at the position P[i,j,1], and a diagonal matrix D at P[i,j+n,1],
.
Phase II: The processes involved in this phase are repeated n times.
In step 6.1 of the algorithm we multiply a matrix with another matrix ,
where P is a circular permutation matrix. The process of the multiplication is a
sequence of row interchange operations. The 2nth row of the second matrix becomes n+1st row,
n+1st becomes n+2nd and so on. 2n-1st row becomes 2nth row. P-RAM based algorithm for the
procedure computation is given as a procedure 3.1.
This operation can be performed on a PARBS in constant time. We establish a sub
bus in k-direction and send values of Qtot to the processors P[i,j,1] of working space I in con-
stant time.
Step 1: Establish sub buses in k-direction and send the row vectors of Qtot from the n+1st row
1 i n≤ ≤




1 i n≤ ≤ 1 j n≤ ≤,
n
1 i 2n≤ ≤ 1 j n≤ ≤,
1 i n≤ ≤ 1 j n≤ ≤,
2n 2n× 2n 2n× I 0
0 P
n n×
for i:= 1 to n in parallel do
     for j:= 1 to c1 in parallel do
E[i,j] := mat[i,j];
for j:= 1 to c1 do in parallel do
E[n+1,j] := mat[2*n,j];
for i:= (n+1) to r1-1 in parallel do
     for j:= 1 to c1 do in parallel do
 E[i+1,j]:= mat[i,j];
Where c1, r1 denotes column, row index of the first matrix.
Procedure 3.1  Row interchange procedure
2n 2n× n×A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 11
to the 2nth row to processor P[i+n,j,k], , .
Step 2: Establish sub buses in i-direction. Processors P[i+n,j,i], , broadcast values of
Qtot[i+n,j] on the established sub buses. After the process is over Qtot[i+n,j] is stored
on processors P[i+n,j,i].
Step 3: Disconnect established connections and establish new sub buses in k-direction and
broadcast values of these row vectors from P[i+n,j,k], , to
P[i+n,j,1] and values at P[n+1,j,n] to P[n+1,j,1].
Step 4: At the end of the step 3, required result is available at processors P[i,j,1] of working
space I. Establish sub buses in k-direction and broadcast values back to the first layer.
In step 6.3.1 of the algorithm, the matrix Q which is initially an identity matrix and stored at
the first layer, is multiplied by R, obtained in the step 4. We establish sub buses in k-direction
and broadcast the values of Q and R to the working space I. Initially we assume that the values
are stored at P[i,j,1] of this working space. Due to the special structure of matrix Q, we design
a procedure which takes constant time to multiply two matrices on a
PARBS. We first present P-RAM based algorithm for the procedure, followed by a PARBS
mapping of it. Here r1 denotes row index of the first matrix and c2 denote column index of the
second matrix as given in procedure 3.2.
Sub phase I:
Step 1: Establish the straight sub bus in k-direction. Each processor connects port F to B.
Step 2: Processor P[i,j,1] broadcast values Q[i,j] on the connected sub buses. After this step,
processor P[i,j,k] owns the value of Q[i,j], (Figure 4(a)).
Step 3: Break-up the established sub buses (each processor disconnect port F to B).
Step 4: Establish straight buses in j-direction.
Step 5: Processors P[i,k,k] broadcast the values Q[i,k] on the bus which is connected to
. After step 5 processor P[i,j,k], owns the value of Q[i,k],
(Figure 4(b)).
1 i n≤ ≤ 1 j n≤ ≤, 1 k n≤ ≤
1 i n≤ ≤
2 i n≤ ≤ 1 k n 1–≤ ≤
2n 2n× 2n 2n× 2n×
for i:= 1 to (r1 div 2) in parallel do
     for j:= 1 to c2 in parallel do
        begin
 F[i,j]:= Q[i,i] * R[i,j] + Q[i,i+n] * R[i+n,j];
 F[i+n,j] := Q[i+n,i] * R[i,j] + Q[i+n,i+n]*R[i+n,j];
end;
Procedure 3.2 Multiplication by Q
1 i n≤ ≤ 1 k n≤ ≤, 1 j n≤ ≤12 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
Sub phase II.
Step 6: Establish sub buses in k-direction. Processors P[i,j,1], of working space I, broadcast
values of R[i,j] on the established sub buses.
Step 7: Establish straight buses in i-direction. (Figure 4(c)).
Step 8: Processors P[k,j,k] broadcast values R[k,j] on the sub buses. After this step processor
P[i,j,k] owns the value of R[k,j], (Figure 4(d)).
Step 9: Break-up established sub buses.
Multiplication Phase: Each processor P[i,j,k] computes the value of Q[i,k] * R[k,j], which
takes constant time.
Summing Phase: Establish a straight bus in k-direction and broadcast values, obtained after
the multiplication phase, from P[i,j,i+n] to P[i,j,1] and P[i+n,j,i+n] to P[i+n,j,i], ,
. Broadcasting values in this way takes constant time. Perform the summation at the
respective processors and send values back to the second layer by establishing sub buses in k-
direction.
1 i n≤ ≤
1 j n≤ ≤
q11 0 q13 0
q11 0 q13 0
q11 0 q13 0
q11 0 q13 0
q31 0 q33 0
q31 0 q33 0
q31 0 q33 0
q31 0 q33 0
0 q22 0 q24
0 q22 0 q24
0 q22 0 q24
0 q22 0 q24
0 q42 0 q44
0 q42 0 q44
0 q42 0 q44
0 q42 0 q44
k
Figure 4(a): After step 1-3 of sub phase I.
i=1 i=2 i=3 i=4j
0 0 0 0
q13 q13 q13 q13
0 0 0 0
q11 q11 q11 q11
q24 q24 q24 q24
0 0 0 0
q22 q22 q22 q22
0 0 0 0
0 0 0 0
q33 q33 q33 q33
0 0 0 0
q31 q31 q31 q31
q44 q44 q44 q44
0 0 0 0
q42 q42 q42 q42
0 0 0 0
k
Figure 4(b): After step 4-5 of sub phase I.



















Figure 4(c): After step 6-7 of sub phase II.
i=1 i=2 i=3 i=4jA Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 13
The next step 6.3.2 of the algorithm is again an interchange operation which has been
described in step 6.1. After the process is over we obtain the new value of R and store it at the
first layer in constant time.
In step 6.3.3 we construct Q, we first present P-RAM algorithm (procedure 3.3) for the process
and then PARBS implementation of it.
Step 1: Establish a sub bus in k-direction, processors P[i,j,1], , and broad-
cast values of R as the initial value for Q to working space I.
Step 2: Establish sub buses in i-direction (Connect port U and D). Processors P[i,i,1] broadcast
values of Q[i,i],  on the established sub buses.
Step 3: Divide the values of Q by the value of Q[i,i], received in the last phase i.e. =
Q[n+k,k]/Q[k,k], .
Step 4: Processors P[i+n,i,1], broadcast the values of on the established sub buses.
Step 5: Disconnect port U to D.


















Figure 4(d): After step 8 of sub phase II
i=1 i=2 i=3 i=4
j
  for i:= 1 to 2*n in parallel do
     for j:= 1 to 2*n in parallel do
Q[i,j] := 0.0;
  for k:= 1 to n in parallel do
     begin
qq:= mat1[n+k,k] / mat1[k,k];
Q[k,k] := 1/sqrt(1-sqr(qq));
Q[n+k,n+k] := Q[k,k];
 Q[k,n+k] := -(qq/sqrt(1-sqr(qq)));
Q[n+k,k] := Q[k,n+k];
 end;
Procedure 3.3   Construction of Q
1 i 2n≤ ≤ 1 j n≤ ≤
1 i n≤ ≤
ρk
1 k n≤ ≤
1 i n≤ ≤ ρk
1 i n≤ ≤14 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
of  on the established sub buses.
Step 7: Processors P[k,k,1] and P[n+k,n+k,1],  perform the operation
and Processors P[k,n+k,1] and P[n+k,k,1], ,
perform .
Step 8: Disconnect the established sub buses. After the process is over, we get the value of
in constant time.
The step 6.3.4 is a multiplication of matrix Q with the matrix E obtained in the step 6.1. As we
described in step 6.3.1, the procedure qmult multiplies two matrices in constant time. The
resultant matrix Qtot is stored at first layer.
Once the values are received after step 6 of pseudocode, we can use them to avoid the forward
elimination and back substitution process. The two processes 7.1 and 7.2, given below are exe-
cuted parallely. We broadcast values of b to the working space I, and D to working space I and
II both, by establishing the sub buses in k-direction. The two steps 7.1 and 7.2 are described
separately. The step 7.1 consists of two steps dmult( ,b) and vmult(E, result of procedure
dmult). We first describe dmult.
Phase III.
Step 1: The broadcasted diagonal matrix D at the processor P[i,j+n,1], ,
of working space I and b at P[i+n,n+1,1], . We establish sub buses
in j-direction so there are sub buses which are connected to the ports L and R of the
processors.
Step 2: Broadcast the values of b from the processors P[i+n,n+1,1] to processors P[i+n,i+n,1],
 on the established sub buses.
Step 3: Disconnect the established sub buses and connect new sub buses in i-direction. Proces-
sors P[i+n,i+n,1],  send values of b to processors P[i,i+n,1], .
Step 4: Perform the multiplication operation of the received values on the respective proces-
sors.
Step 5: On the established sub buses send the result back from processors P[i,i+n,1] to
P[i+n,i+n,1],
Step 6: Disconnect ports U and D and establish sub buses in j-direction i.e. connect port L to
R, broadcast values from P[i,i+n,1] to P[i,1,1] and from P[i+n,i+n,1] to P[i+n,1,1],
.
The second step of 7.1 is a procedure vmult(Q * column vector resulted from the last step).
ρk
1 k n≤ ≤ Q k k,[ ] =
Q n k n k+,+[ ] 1 1 ρk–( )⁄= 1 k n≤ ≤





n n× 1 i n≤ ≤
1 j n≤ ≤ 1 i n≤ ≤
n
1 i n≤ ≤
1 i n≤ ≤ 1 i n≤ ≤
1 i n≤ ≤
1 i n≤ ≤A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 15
This is a simple matrix vector multiplication which can be done in O(log n) time on a
PARBS using a standard procedure described in [10]. The resultant
matrix Y is stored at processors P[i,1,1],  of working space I.
The step 7.2 consists of two multiplication steps. First the , obtained in the form of D from
the first layer, is multiplied by matrix (I, I), made up of identity matrix I, fol-
lowed by a procedure emult. It can be observed from the property of such matrix that the
resultant matrix is a matrix with it’s rows 1 to n are added to n+1 to 2n. Here we describe a pro-
cedure which takes constant time for such row additions. This operation can be done in only
one step.
Step 1: The values of D are available in P[i,i+n,1], of working space II. We establish
a sun bus in j-direction by connecting port L to R, broadcast the value of D from proc-
essor P[i,i+n,1] to processors P[i,i,1], .
We have already proved that the transpose of a matrix can be obtained in constant time
using PARBS, the second procedure emult is a simple vector matrix multiplication
and it can be executed on PARBS of size  in O(log n) time [10].
In step 8 of the algorithm the resultant matrix of the last step, Z, is multiplied by Y by taking
appropriate value of as 0 or 1. This is a vector matrix multiplication which can be executed
on PARBS of size  in O(log n) time [10].
It can be observed that all steps in the loop of step 6 takes constant time and repeated n times,
the algorithm presented above can obtain the solution of spd system in O(n) time using
 processors, where  processors are required for working space.
Concluding Remarks: For the solution of linear systems of equations with symmet-
ric positive definite matrix , an algorithm based on hyperbolic Cholesky factorization on
processor arrays with reconfigurable systems is presented and its parallel time and processors
complexity is calculated. The processors complexity is on the upper bound and with the clever
use, can be further reduced. Since the matrix multiplication, due to the summation of n values
in an inner vector product, is bounded by O(log n) parallel time, fast and efficient matrix
manipulation on a linear array with a reconfigurable pipeline bus system using optical buses
has been proposed recently [19]. One of the fast algorithm for solving linear system of equa-
tions with this latest pipeline architecture, takes parallel time with
processors, where [19]. The algorithm presented in section 3 has
parallel time complexity of O(n). It can be considered as a fast algorithm from the practical
point of view but not the efficient parallel algorithm as it does not belongs to Nick’s
class[12,35]. The work can be further extended to find the efficient parallel algorithm for the
solution of spd systems of equations using a pipelined bus architecture.
2n 2n× 2n× 2n 1×
1 i 2n≤ ≤
Δ
n 2n× n n×
1 i n≤ ≤






2n 2n× 5n× 2n 2n× 4n×
Ax b=
An n×
O Nlog( )1 δ+
O N4 87--⎝ ⎠⎛ ⎞
Nlog( )δ
⁄⎝ ⎠⎛ ⎞ 0 δ 1≤ ≤16 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
References:
1. Aggarwal A., “Optimal Bounds for Finding Maximum on Array of Processors with k Glo-
bal Buses”, IEEE Transaction on Computers, pp. 62-64, Vol. c-35, No. 1, January
1986.
2. Aho A., J. Hopcroft and J. Ullman, The Design and Analysis of Computer Algorithms,
Addison-Wesley, 1974.
3. Alaghband Geeta, “Parallel sparse matrix solution and performance”, Parallel Comput-
ing, pp.1407-1430, 21(1995).
4. Bauer B. E., Practical Parallel Programming, Academic Press Inc., 1992.
5. Biing-Feng Wang and Gen-Huey Chen, “Constant Time Algorithms for the Transitive
Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus
Systems”, IEEE Transaction on Parallel and Distributed Systems, pp. 500-507, Vol.
1, No. 4, October 1990.
6. Biing-Feng Wang, Gen-Huey Chen and Ferng-Ching Lin, “Constant Time Sorting on a
Processor Array with Reconfigurable Bus System”, Information Processing Letters, pp.
187-192, 34(1990).
7. Bokhari S. H., “Finding Maximum on an Array Processor with a Global Bus”, IEEE
Transaction on  Computers, pp. 133-139, Vol. c-33, No. 2, February 1984.
8. Delosme J.M., Ilse C.F. Ipsen, “Parallel Solution of Symmetric Positive Definite Systems
With Hyperbolic Rotations”, Linear Algebra and its Applications, Special Volume on
Parallel Computing, North Holland, NY, pp. 75-111, Vol. 77, May 1986.
9. Duff I.S., A.M. Erisman and J.K. Ried, Direct Methods for Sparse Matrices, Clarendon
Press Oxford, 1986.
10. G-H Chen, B-F Wang and C.J. Lu, “On the parallel computation of the algebraic path
problem”, IEEE Transaction on Parallel and Distributed Systems, pp. 251-256, Vol. 3,
No. 2, March 92.
11. G-H Chen, S. Olariu et al., “Constant Time Tree Algorithms on Reconfigurable Meshes
on Size n x n”, Journal of Parallel and Distributed Computing, pp. 137-150, 26
(1995).
12. Gibbons A. And Wojciech Rytter, Efficient Parallel Algorithms, Cambridge University
Press, Cambridge, May 1988.
13. Golub G.H. and Charles and F. Van Loan, Matrix Computations, John Hopkins Press,
Baltimore, MA, 1983.
14. Golub G.H. and James Ortega, Scientific Computing: An Introduction with ParallelA Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 17
Computing, Academic Press Inc., 1993.
15. Horowitz E. And S.Sahni, Fundamentals of Computer Algorithms, Addison-Wesley,
Computer Science Press, NY, 1978.
16. Hungwen Li and Massimo Maresca, “Polymorphic Torus Network”, IEEE Transaction
on  Computers, pp. 1345-1351, Vol. 38, No. 9, September 1989.
17. Hwang Kai and F.A.Briggs, Computer Architectures and Parallel Processing,
McGraw-Hill International Edition, 1985.
18. Jain Abhay, N.S. Chaudhari, “Efficient parallel recognition of context-free languages”,
Parallel Computing, pp. 1303-1321, 20(1994).
19. Ju-Wook J. and V.K. Prasanna, “An Optimal Sorting Algorithm on Reconfigurable Mesh”,
Journal of Parallel and Distributed Computing, pp. 31-41, 25 (1995).
20. K. Li, Y. Pan and S.Q. Zheng, “Fast and Efficient Parallel Matrix Manipulation on a Lin-
ear Array with a Reconfigurable Pipelined Bus System”, High Performance Computing
Systems and Applications, J.Schaeffer and R.Unrau eds. , Kluwer Academic Press,
1998.
21. Koji Nakano and Koichi Wada, “Integer Summing Algorithms on Reconfigurable
Meshes”, IEEE First International Conference on Algorithms and Architectures for
Parallel Processing, Brisbane, Australia, pp. 19-21, April 1995.
22. Koji Nakano and Stephan Olariu, “An Optimal Algorithm for the Angle Restricted all
Nearest Neighbour Problem on Reconfigurable Mesh with Applications”, Proc. 10th
International Parallel Processing Symposium, Honolulu, Hawaii, April 15-19, 1996.
23. Koji Nakano, “Optimal Initializing Algorithms for a Reconfigurable Mesh”, Journal of
Parallel and Distributed Computing, pp. 218-223, 24 (1995).
24. Koji Nakano, “Prefix sums algorithms on Reconfigurable Meshes”, Parallel Processing
Letters, pp. 23-35, Vol.5 No.1 (1995).
25. Lancaster P.,M. Tismenetsky, The Theory of Matrices, Academic Press Inc., (1985).
26. Lester B. P., The Art of Parallel Programming, Prentice Hall, Englewood Cliffs, N.J.
(1993).
27. Linda Kaufman, “The Generalized Householder Transformation and Sparse Matrices”,
Linear Algebra and its applications, pp. 221-234, 90 (1987).
28. Louis A. Hageman and David M. Young, Applied Iterative Methods, Academic Press
Inc., NY, 1981.
29. Miller R., V.K. Prasanna Kumar, Dionisios Reisis and Quentin F. Stout, “Meshes with
Reconfigurable Buses”, Proc. 15th MIT Conference on Advance Research in VLSI,
pp. 163-178, March 1988.18 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
30. Stewart G.W., Introduction to Matrix Computations, Academic Press Inc., SanDiego
California 1973.
31. Susanne E.H., “VLSI Algorithms for the Connected Component Problems”, SIAM Jour-
nal  Computing, pp. 355-365, Vol. 12, No. 2, May 1983.
32. Tatsuya Hoyashi, Koji Nakano and Stepahn Olariu, “Efficient List Ranking on Reconfig-
urable Meshes with Applications”, Hayashi Technical Report, May 1996.
33. Ten-Hwang Lai, M.J.Shwng, “Triangulation on Reconfigurable Meshes: A Natural
Decomposition Approach”, Journal of Parallel and Distributed Computing, 38-51, 30
(1995).
34. V.Bokka, H.Gurla, S. Olaru and J.L. Schwing, “Constant Time Convexity Problems on
Reconfigurable Meshes”, Journal of Parallel and Distributed Computing, pp.86-99,
27 (1995).
35. Vipin Kumar et at., Introduction to Parallel Computing, Design and Analysis of Algo-
rithms, The Benjamin Cummings Publishing Company Inc. , California, 1994.
36. Voevodin V.X., Mathematical Foundations of Parallel Computing, World Scientific
Publishing Co., 1986.
37. Wankar R. and N.S.Chaudhari, “Parallel Cholesky Factorization Algorithm”, National
Conference on current trends in Information Technology, pp. 23-24, June 1995, Bho-
pal.
38. Wankar R., “Parallel Algorithms for Solving Symmetric Positive Definite System (SPD)
of Equations”, International Journal of Management and Systems, pp. 311-324, Vol.
11, No. 3, Sept-Dec 95.
39. Wei Shu, “Parallel implementation of a sparse simplex algorithm on MIMD distributed
memory computers”, Journal of Parallel and Distributed computing. pp. 25-40,
31(1995).
40. Yakowitz S., F. Szidarovszky, An Introduction to Numerical Computations, Mac-
Millan Publishing Co., 1989.
41. Yen-Cheng Chen and Wen-Tsuen Chen, “Constant Time Sorting on Reconfigurable
Meshes”, IEEE Transaction on  Computers, pp. 749-751, Vol. 43, No. 6, June 1994.
42. Yen-Cheng Chen et al., “Designing Efficient Parallel Algorithms on Mesh-Connected
Computers with Multiple Broadcasting”, IEEE Transaction on Parallel and Distrib-
uted Systems, pp. 241-245, Vol. 1, No. 2, April 1990.
43. Yosi Ben-Asher, Dan Gordon and Assaf Schuster, “Efficient Self-Simulation Algorithms
for Reconfigurable Arrays”, Journal of Parallel and Distributed Computing, pp. 01-
22, 30 (1995)A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems 19
44. Yue-Li Wang, H.C.Chen and Chen-Yu Lee, “An O(log n) parallel algorithm for construct-
ing a spanning tree on permutation graphs”, Information Processing Letters, pp. 83-87,
56(1995).20 A Fast Parallel Algorithm for Special Linear Systems of Equations using Processor Arrays with Reconfigurable Bus Systems
