Architectures for block Toeplitz systems by Bouras, Ilias et al.
SIGNAL 
PROCESSING 
ELSEVIER Signal Processing 51 (1996) 167-190 
Architectures for block Toeplitz systems 
Ilias Bourasa, George-Othon Glentisb, Nicholas Kalouptsidis”,* 
“Institute of Microelectronics. DEMOKRITOS, Athens 153 IO, Greece 
bDepartment of Electrical Engineering, Universiiy of Twente, EL-BSC, P.O. Box 217, 7500 AE Enschede, The Netherlands 
‘Department of Informatics, University of Athens, Panepistimiopolis, TYPA Buildings, 157 71 Athens, Greece 
Received 8 April 1994; revised 22 June 1995 and 14 December 1995 
Abstract 
In this paper efficient VLSI architectures of highly concurrent algorithms for the solution of block linear systems with 
Toeplitz or near-to-Toeplitz entries are presented. The main features of the proposed scheme are the use of scalar only 
operations, multiplications/divisions and additions, and the local communication which enables the development of 
wavefront array architecture. Both the mean squared error and the total squared error formulations are described and 
a variety of implementations are given. 
Zusammenfassung 
In dieser Arbeit werden effiziente VLSI Architekturen fur hochgradig parallele Algorithmen zur Losung von 
blocklinearen Systemen vorgestellt, die eine Toplitzstruktur oder Fast-Toplitz-struktur aufweisen. Die Hauptmerkmale 
des vorgeschlagenen Schemas ind die Verwendung ausschlieglich skalarer Operationen, und zwar sowohl hinsichtlich 
der Multiplikationen/Divisionen als such der Additionen, sowie eine lokale Kommunikationsstruktur, die die Entwick- 
lung einer Wellenfrontfeld-Architektur errnoglicht. Es wird sowohl ein Ansatz fur den mittleren quadratischen Fehler als 
such ein Ansatz fur den gesamten quadratischen Fehler formuliert, und eine Vielzahl von Implementierungen wird 
angegeben. 
R&sum6 
Dans cet article sont present&es des architectures VLSI efficaces d’algorithmes hautement concurrents pour 
la resolution de systemes lineaires par blocs avec des entrees Toepliz ou presque Toepliz. Les caracttristiques 
principales des methodes propostes sont l’utilisation doperations scalaires (multiplications/divisions et additions) 
uniquement, ainsi que la communication locale, qui permet le developpement dune architecture en reseau de front 
d’onde. Les formulations des erreurs quadratiques moyenne et totale sont d&rites, et plusieurs implementations 
distinctes sont don&es. 
Kqvwords: Block Toeplitz matrices; Multichannel Schur algorithms; Parallel processing; VLSI implementation 
*Corresponding author. Fax: 301 72 28 981 
0165-1684/96/$15.00 0 1996 Elsevier Science B.V. All rights reserved 
PZISO165-1684(96)00041-2 
168 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
1. Introduction 
In this paper efficient VLSI architectures for 
block Toeplitz and Toeplitz like linear systems are 
developed. Block Toeplitz solvers are encountered 
in a wide range of applications. Typical examples 
include multichannel spectral estimation, multi- 
variable system identification and realization, 
multichannel one dimensional (1-D) and two di- 
mensional (2-D) Wiener filtering and smoothing, 
nonlinear filtering using truncated Volterra series 
[S, 10, 13, 15, 18, 191. 
High performance massively parallel architec- 
tures are required in most real time digital signal 
processing applications. Advanced algorithms used 
for adaptive antennas and adaptive beam-forming 
have a data throughput of hundreds of millions of 
operations per second. The requirements of real 
time image processing are even more demanding, 
since several billions of operations per second are 
needed. VLSI array processors provide a low cost 
solution to high speed, heavily loaded real time 
applications [111. 
Real time operation of block Toeplitz solvers is 
an extremely heavy task. Since block Toeplitz sys- 
tems result from multichannel and multidimen- 
sional applications the computational oad is ex- 
pected to be tremendous even for moderate cases. 
Computation savings can be obtained by applying 
the efficient Levinson-Wiggins-Robinson (LWR) 
type algorithms that take into account the special 
structure of block Toeplitz systems [8, 17, 22, 231. 
The computational complexity of these techniques 
is O(k3p2), where k is the number of channels 
(or the block order of the block system) and p is 
the number of taps associated with each input 
channel (or the dimensions of each entry Toeplitz 
matrix). 
A major drawback of the LWR type algorithms 
is the need of inner products computations, 
thus prohibiting full parallelism. A remedy to 
this bottleneck is the engagement of Schur type 
algorithms, which bypass the need for inner 
product estimation, thus being suitable for parallel 
implementation on a general purpose parallel 
machine or on dedicated VLSI hardware, using 
a systolic or a wavefront array architecture [l-4,6, 
7, 9, 121. 
In this paper several VLSI architectures are pro- 
posed for the Schur type block Toeplitz solvers 
recently developed in [l, 3,4,9]. Different through- 
put rates can be achieved depending on the avail- 
able hardware. A planar array processor is first 
designed that computes the desired solution in 
O(kp) time units utilizing O(k2p) processing ele- 
ments. A reduced processor architecture is then 
derived that computes the output in O(k2p) time 
units, requiring however O(kp) processing elements 
only. Finally, a superfast highly pipelineable three 
dimensional structure is developed which in a full 
pipeline mode has a constant throughput rate. 
These algorithms offer significant advantages 
over their predecessors, ince inner products are 
now bypassed, and they are free of matrix opera- 
tions. On the contrary, the parallel structures pre- 
sented in [6, 7, 211 require additional parts that 
implement matrix inversion and multiplication, as 
well as inner products, of order k. Moreover, they 
suffer from communication, since matrices of order 
k must be transmitted. 
The proposed implementation of the highly con- 
current Schur type algorithms involves the follow- 
ing steps. (i) Derivation of dependence graph 
showing the flow of computations. (ii) Employ- 
ment of the canonical mapping methodology and 
the default schedule, to derive a signal flow graph. 
(iii) Algorithm reorganization to ensure local flow 
of data and event-driven local control. (iv) Mini- 
mization of the number of the over-all processing 
elements as well as computational and communica- 
tion tasks. (v) Programming in Occam and algo- 
rithm validation. 
The paper is organized as follows. First, the 
block Toeplitz and near-to-Toeplitz solvers of 
[l-4] are briefly discussed in Section 3. New sys- 
tolic and wavefront architectures are proposed in 
Section 4, and some hardware implementation 
issues are presented to give directions to hardware 
designers. The estimation of the correlation lags 
used for the initialization of the algorithms is also 
discussed. An alternative, more efficient from the 
hardware point of view, architecture is described in 
Section 5 and a description of the main processor 
unit is given. Finally, in Section 6 a highly pipelined 
3-D wavefront array, suitable for adaptive process- 
ing, is proposed. 
I. Bouras et al. 1 Signal Processing 51 (1996) 167-190 169 
2. Problem formulation 
Let us consider a multichannel FIR (finite im- 
pulse response) filter of k input channels and 
a single output. The extension to the multi-output 
case is straightforward. Let k be the number of filter 
input channels; xi(n), i = 1, . . . , k, be the corres- 
ponding input. Then, the filter output y(n) is esti- 
mated by the discrete time equation 
Y(n) = - i f Ci(l)Xi(fl - I + 1)~ (1) 
i=l I=1 
where Ci(l), 1 = 1, . . . ,pi, are the filter coefficients 
assigned to the ith channel. The number of filter 
coefficients pi, i = 1, . . . , k, is in general different for 
each input channel, i.e., pi # pj, i, j = 1, . . . , k. The 
multichannel filter is completely characterized by 
the parameter set (Pk,cpJ where 
pk = bI,P2v ... ,Pkl 
is a multi-index that specifies the lengths of the 
input registers Cl-43 and cPl is a block vector that 
carries the filter coefficients 
CPt = cc;: cfz’ . . CE]‘, 
where C; = [CL(l) Ci(2) . . . ci(pi)], i = 1,2, ,.. , k. 
Superscript T means transpose. Eq. (1) can be com- 
pactly written as a linear regression 
y(n) = - $(fi)e,, . (2) 
The regressor vector x,,(n) consists of blocks, each 
one being formed by successive samples and of 
varying dimension 
x,,(n) = [x;:(n) x;:(n) ... gml’ 3 
where x:(n) = [xi(n) Xi(n - 1) . . . xi(n - pi + l)], 
i=lT 2-3 ..’ > k. 
Let 
e;,(n) = z(n) - y(n) = z(n) +x+%& 
be the instantaneous output error between the de- 
sired response signal z(n) and the filter’s output 
y(n). The error signal &(n) is forced to be ortho- 
gonal to the regressor vector x,,(n). This can be 
expressed as 
<x,,(n), e;,(n) > = 0, (3) 
where (. ) is a suitable orthogonalization operator. 
The choice of (‘) gives rise to several algorithms 
for the determination of the optimal filter cPx. Two 
particular cases are considered in this paper. 
(a) The mean square error (MSE). Let a(.) de- 
note the expectation operator. Then, Eq. (3) takes 
the form 
&(x,,(n)e~,(n)) = 0. (4) 
The mean squared error filter is then estimated 
by the solution of the linear system of equations 
R&Q = - dPI 9 
where 
(5) 
R,, = &(x&)x;f;,(n)), d,, = &&)z(n)) (6) 
are the autocorrelation matrix and the crosscor- 
relation vector, respectively. 
R,, is a symmetric positive definite block matrix 
of block order k with Toeplitz entry matrices. This 
means that R,, consists of kZ Toeplitz submatrices 
Rij of dimensions pi x pj, i, j = 1, . . . , k. Thus, 
[&I Rt2 ... &k) 
& R 
Rm = [PijIjz > ::: ,i = . 
22 ... R 
. . f” , (7) . . , . . 
LR:, R2’k ... 
where each entry Rij is a Toeplitz matrix 
Rij = &(xi,(n)xg(n)), i = 1, . . . , k, j = 1, . . . , k. 
(8) 
In a similar way dpk is a block vector that contains 
k subvectors of dimensions pi x 1, each. 
dij = B(xh,(n)zj(n)), i = 1, . . . , k, j = 1. . . . ,q. (9) 
(b) The total squared error (TSE). Operator ( . ) 
now takes the form If=‘=, (.). The orthogonality 
condition (3) is expressed by 
“iO (x,,(+$,(n)) = 0. (IO) 
The corresponding least squared filter is deter- 
mined by the linear system 
R,,(W,,(W = - d,,(N) 7 (11) 
170 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
where 
UN) = 5 %(n)x:,(n)? d,,(N) = 5 x,,(n)+) 
?I=0 n=O 
(12) 
are the sampled autocorrelation matrix and the 
crosscorrelation vector, respectively. 
Block matrix R,,(N) consists of k2 near-to-Toep- 
litz entry matrices, R,j(N), of dimensions pi x pj, 
given by 
Rij(N) = : ~j,(n)~t(n), i, j = 1, . . . , k. 
n=O 
Similarly, block vector d,_(N) consists of k sub- 
vectors 
di(N) = 5 $(n)dn)(n), i = 1, . . . ,k. 
n=O 
3. Algorithm description 
Linear systems (5) and (11) can efficiently be 
solved by the algorithms recently proposed in 
[l-4], using a channel decomposition technique. 
The main advantage of this novel approach is the 
need of scalar operations only. This is an attractive 
feature when VLSI implementation is under con- 
sideration. 
Eq. (5) determines the optimal FIR filter in the 
MSE sense. The pertinent filter is obtained as the 
solution of a block linear system of equations with 
Toeplitz matrices as entries. While a standard 
linear system solver can be utilized, the structure of 
matrix R,,k enables the development of efficient al- 
gorithms for the computation of the optimal filter 
cPk. The derivation of such fast algorithms is based 
on the nesting properties of matrix RpI that permit 
the order recursive estimation of the optimal filter, 
starting from c1 and going up to the final filter cPx. 
A novel algorithmic family that recursively solves 
the block Toeplitz linear system of Eq. (5) has been 
derived in [l-4]. Several fast order recursive algo- 
rithms have been presented, possessing the follow- 
ing desired properties: 
- Multichannel FIR filters with different input 
orders, pi # pj can be easily handled. 
- The new algorithms offer a significant advantage 
over their well-known predecessors [6-8, 12, 17, 
22, 231 since they manage to get free of matrix 
operations altogether. 
- Very highly concurrent structures can be extrac- 
ted using a Schur type format, involving scalar 
operations only. 
A set of k single channel forward and backward 
predictors is utilized by both the MSE and the TSE 
algorithm. Single channel predictors are assigned 
to each input channel. The forward v-channel pre- 
dictors aih, v = 1, . . . , k, are determined by the or- 
thogonality condition 
<xi*(n), e:(n) > = 0, 
where 
e:(n) = x,(n) + x~~(n)u~, 
is the forward prediction error and xik(n) is an 
alternative regressor vector defined by 
x’k(n) = 1 Cxpitn - l)li= 1, .,, ,v Cx:x,,(~)li=v+l, . . ..k 1 * (13) 
The v first vector components of x:,(n) contain the 
elements of the channels delayed by one unit. 
Clearly, xifi(n) =x,,(n), gk(n) = x,,(n - 1). 
The backward single channel predictors are ob- 
tained from the orthogonality condition 
(XL@), e;iP + l’(n)) = 0) 
where 
&p+ l+) = 
Pk xp+ 1 (n - PJI + 1) + x;:‘Tw;: l, 
p = 0,l , . . ..k- 1. 
In this paper we will concentrate on the Schur 
type fast order recursive algorithms developed in 
[l-4] for the solution of linear systems (5) and (11). 
For presentation reasons filters with equal number 
of delay elements for all input channels are con- 
sidered, i.e., pi = pj = p, i, j = 1, . . . , k. The MSE 
multichannel Schur algorithm is summarized in 
Table 1. 
The solution vector c is recursively computed by 
the algorithm, using Eq. (3) of Table 1. Two sets of 
auxiliary vectors are utilized, a”, v = 1,2, . . . , k, and 
bP, p = 1,2, . . . , k, corresponding to single channel 
forward and backward predictors, respectively. The 
I. Bouras et al. 1 Signal Processing 51 (1996) 167-190 171 
Table 1 
The highly parallel algorithm for block Toeplitz systems (MSE 
case) 
FORj=OTOp-1 
FORi=OTOk-1 
kh,+i+, = - e~~~,'+l(j)/e~',i=,',"')(j) 
INPARALLELDO 1 
FORn=jTOp-lINPARALLELDO2 
ech ,,+i+I(n) = eZk+dn) + e~~++iliSdk~~+i+l 
ENDPAR 
ENDPARl 
1 =left_rotate[l 2 . . . k],+l 
FORV=~TO~INPARALLFJLDO~ 
1( = I(v) 
INPAFIALLELDO4 
ki;+i+ 1 = - e,fi’;f’( j + I,)/e$$( j + I*) 
kb’ nl+i+l = - em,+i J(p’v)( j + Il)/ai:+i+ 1 
ENDPAR 
FORn=jTOp-lINPAFUILLELDO6 
e&‘$“+ l(n) = e$;f)(n) + z~e,$!!~!I(n)k!;+,+ I 
e,$$l(n) = z~e,f,l”;~)(n) + e$zf:I(n)k,$+i+, 
ENDPAR 
INPARALLELDO? 
ENDPAR 
ENDPAR 
END FOR i 
END FOR j 
(1) 
(2) 
(3) 
(4) 
(5) 
(6) 
(7) 
(8) 
(9) 
(10) 
algorithm computes the solution in p steps. Each 
step consists of k phases. During each phase the 
solution c is augmented by a new element. Thus, 
after the completion of a single step, c has been 
increased by k elements. S’ and TN that appear in 
Tables 1 and 2 are suitable permutation matrices 
[l-4]. 
A notable feature of the algorithm is the cyclic 
way in which the forward predictors, a’, and the 
backward predictors, b”, are updated. At the first 
phase (i = 0) of a new step, the forward predictors 
are coupled together with the backward predictors 
in a particular way which is [d,bk], [a2,b’], 
[a3, b2], [ak, bk- ‘1. At the second phase (i = l), the 
predictors are coupled together as [a’,bk-‘1, 
[a2,bk], [a3,b’], [ak,bke2]. Finally, at the last 
phase (i = k - l), the pairs are [d,bl], [a2,b2], 
[u3, b3], [uk, bk]. The same coupling holds for the 
forward and backward error variables, eftvvh) and 
ebu‘sh). The cyclic nature of the parameters coupling 
is nicely described by introducing the index se- 
quence [l 2 . . . k] and the rotation operator 
I = left_rotate[l 2 . . . k]i+l, 
which rotates [l 2 . . . k] (i + 1) times to the left. 
In this way, the backward predictors b” associated 
with the forward predictors (I” at phase i are found 
if we set p = I(v), where I(v) means the vth element 
of the index sequence I. The operator z,” that ap- 
pears in Eqs. (6) and (7) of Table 1 denotes a shift 
with respect o n. It is activated by the superscript q, 
q = 1 if v = k, otherwise q = 0. Variables I1 and 
l2 denote a delay with respect o n, and are defined 
as ll = 1 if p G i, otherwise l1 = 0, and l2 = 1 if 
12 < i, otherwise l2 = 0. The algorithm is initialized 
as 
ech(n) = dh(n), 
ef(Y’h)(n) = rv,h(n), 
where 
d,(n) = a[Xh(l - n)z(l)l? 
rv, hb) = 8 cxvtl - nbh@)l 
are cross-correlation and autocorrelation lags. 
The overall complexity of the proposed algo- 
rithm is O(2k2p2 + kp2) MADS (multiplications 
and divisions). On the contrary, Cholesky’s de- 
composition method requires 0(k3p3/6) MADS. 
The Wiggins-Levinson-Robinson (WLR) algo- 
rithm requires roughly the same amount of opera- 
tions; however extra linear system solvers of order 
k are required. 
In a parallel processing environment, the pro- 
posed algorithm enables full parallelism using 
172 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
Table 2 
The highly parallel algorithm for block near to Toeplitz systems (TSE case) 
FORj=OTOp-1 
FORi=OTOk-1 
INPARALLELDO 1 
k&+i+l(N) = - e~~+l~(jIN)/e~~~(,‘+i!.i”‘(jIN) 
k&+i+l(N)= -~~~~(jjN)/e~~f!,“‘“(j(N- 1) 
Q,+dN) = - s~~+?(jlN)la,,+dN) 
ENDPARl 
INPARALLFJLDO~ 
FORn=jTOp-lINPARALLELDO3 
e~+i+l(nlN)=e,~+i(nlN)+e~~~‘=i’;:’(nIN)k,,+i+l(N) 
eif:/)(n - 11 N - 1) = e$G’:‘=i”(n - 1) N) + ct::/)(n - 11 N)EL,+i(N) 
ENDPAR 
FORn=jTOp-lINPARALLELDO4 
C,+i+l(nlN) = di,+dnlN) + C~~i’*h’(nIN)k~~+i+l(N) 
ENDPAR 
bk!i(N - 1) = bL:ii(N) + W,,+i(N)&+i(N) 
S~~i+Iw_,+i+,(N) = (““d’“)) + yii(; - ‘)) k:k+i+,(N) 
amk+i+l(N) = a,,+hV) - d$?h+~ INKk+~+I(N) 
ENDPAR 
I = left_rotate[l 2 . . . k],+l 
FORv=1TOkINPAFULJdLDO3 
p = I(v) 
INPARALLELDO4 
kL;+i+l(N)= -e~~‘;t’(j+I,IN)/(zPN)e~~f’(j+/z1N) 
k;;+i+,(N) = - ef(rsY) mk+i (I + hIN)ld:+i+I(N) 
ENDPAR 
FORn=jTOp-lINPARALLELDO6 
e;c,“;:! l(n 1 N) = eL>F’ (nIN)+(z~z.)‘e~l”;f:~(nIN)k~:+i.,(N) 
e%r;f!l(nlN) = (z,z,)qe~~;f’(nlN) + e~~f!l(nlN)k~~+i+l(N) 
ENDPAR 
(1) 
(2) 
(3) 
(4) 
(5) 
(6) 
(7) 
(8) 
(9) 
(10) 
(11) 
(12) 
(13) 
(14) 
I. Bouras et al. J Signal Processing 51 (1996) 167-190 173 
Table 2 (continued) 
INPARALLELDO7 
(15) 
ENDPAR 
ENDPAR 
END FOR i 
END FOR j 
scalar operations only. The multichannel imple- 
mentation of the parallel structure presented in [12, 
211 requires additional parts that implement 
matrix inversion and multiplication, as well as in- 
ner products, of order k. Moreover, it suffers from 
communication, since matrices of order k must be 
transmitted. 
The total least squared error (TSE) FIR filter is 
determined by Eq. (11). Matrices that compose the 
block matrix R,_(N) are no longer Toeplitz. It does 
however have a Toeplitz like structure which en- 
ables the development of highly concurrent algo- 
rithms. The TSE Schur algorithm is summarized in 
Table 2. It has the same structure as the MSE 
counterpart of Table 1. Some extra variables are 
now introduced to compensate for the close-to- 
Toeplitz structure of the data matrix. The rotation 
operator I is used again in order to couple suitably 
the various variables. Operator 2% denotes a 
shift with respect to N. The algorithm is in- 
itialized as 
ech(n) = dh(n), 
e”“h’(n) = r,&(n), 
where 
d,(n) = 2 xh(l - n)z(b 
I=0 
N 
r,,h(n) = 2 x,(l - n)xh(l). 
I=0 
(14) 
4. Systolidwavefront architectures 
In this section several systolic/wavefront imple- 
mentations of the highly parallel Schur type algo- 
rithms of Tables 1 and 2 are derived. The estimation 
task is naturally decomposed into separate sub- 
modules. First, autocorrelation and crosscorrelation 
variables are estimated, as are required for the initia- 
lization of both algorithms. In the MSE case, auto- 
correlation and crosscorrelation lags are assumed to 
be known in advance. Otherwise, they are esti- 
mated on the basis of the available data record [ 151 
d,(n) = ; i xh(l - n)$), 
I-O 
l”,h(n) = $ i xd - nbh(l). 
1-O 
In the TSE case, initialization variables are esti- 
mated according to Eq. (14). Both cases can be 
handled by a similar array processor. 
4.1. Systolic estimation of correlation lags 
A systolic/wavefront array that estimates auto- 
correlation and crosscorrelation lags for the initia- 
lization phase is first developed. The variables 
r,,h(n) and d,(n) of Eq. (14) are fed into the 
algorithm. These are computed by the systolic ar- 
ray of Fig. l(a). The arcs of the graph illustrate the 
114 I. Bouras et al. 1 Signal Processing 51 (1996) 167-190 
(a) 
(b) 
q(1) . . . q(N - 1) q(N) 
Q(1) . . . r*(N - 1) Xl(N) 
z,(l) . . . q(iv - 1) q(N) 
xt(1) . . . xr(N - 1) 21(N) 
x2(1) 
xs(N- - 1) 
22(N) 
tr(l) 
c2(N-- 1) 
x2(N) 
x,(N’ - 1) 
xl(N) 
Fig. 1. (a) A 2-D systolic/wavefront array for the computation of the initial values (3-channel case with filters of order 3). Boxes denote 
delay elements and circles denote Processor Elements that perform multiplication and division. (b) Functional block for each node of 
the array. MUL, REG, MUX, DEC and ADD denote a multiplier, a register, a multiplexer, a decoder and an adder, respectively. 
I. Bouras et al. / Signal Processing 51 (1996) 167-190 175 
computational flow among the various processor 
elements. Circles denote processor elements execut- 
ing multiplication and addition and boxes denote 
delay elements. The functional operation of the 
processor element (PE) used in this array is illus- 
trated in Fig. 1 (b). In the sequel abbreviation PE is 
utilized to denote a processor element. 
Each array utilizes kp PEs, where k is the number 
of channels and p is the order of the filter. The 
computation scheme is terminated after N + k time 
units, N being the number of available data. The 
time unit consists of the time needed to perform 
a multiplication and an addition. The absolute 
value of this time unit is dependent on the specific 
circuits that will be used. When all necessary com- 
putations have been finished, initial values for ef, e* 
and ec are located at the nodes of the array. At 
every time step each processor (j, 1), j = 1, . . . , k, 
1 = 0, . . , p, is fed with inputs xl(n), xl(n), 
n = N, . . . ,I, from the left and from the top, respec- 
tively. The result of the multiplication between 
the two inputs is added to the resulting output 
of the previous computation step. At the end of 
computation (N + k time units) the outputs 
rl,j(l) = 
e*(lgj)(/) = e f",j)(l) are located at nodes 
(j, I). We can activate the data path created by the 
multiplexer and the decoders (denoted with 
a dashed line in Fig. l(b)) in order to obtain the 
results serially from the rows of the array of 
Fig. l(a) and pass them to controller for further 
manipulation. The overall architecture is depicted 
in Fig. 2. 
A column of dividers is used inside the controller 
in order to divide the outputs with the number of 
available data (in the MSE case). The array of Fig. 
1 needs k(p + 1) PEs and k(k - 1)/2 registers as 
delay units. Each PE is composed of one multiplier, 
one adder, three registers, two decoders and one 
multiplexer. The overall time of the array function 
is 
Tall = Tcomp + Tres , 
where Tcomp is the computation time and Tres 
is the time needed to obtain the results. 
Tcomp = Nkt, + (k - l)t,, where t, is the cycle 
time of a PE. t, = t,,,,,,, + tadd, where tmul, is the 
delay of a multiplier and tadd is the delay of an 
adder. Tres = kt,,,, where t,,, is the time that a re- 
sult variable is obtained from every PE. 
Completion of the initial values computations 
requires either k + 1 such arrays (k for the compu- 
tation of e*(n) and e”(n) and 1 for the computation 
%1(l) zz(N - 1) 
w(N) - 
rx(N - 1) 
r,(N) - - 
zi(ll Z,(N - 1) +,(W _3 > 
r,(l) ‘,(N - 1) cj(N) -, ARRAY.1 ’ CONTROLLER 
=j(l) s,(N - 1) rj(N) ___) 3 
ZJl) +,(N - 1) r,(N) + 
2, e’(e’) 
Fig. 2. The overall architecture for the solution of block Toeplitz or Toeplitz like systems (3-channel case with filters of order 3). 
ARRAY.1 is a 2-D systolic array for the estimation of the correlation lags used for the initialization of the algorithms. CONTROLLER 
is an interface for the rearrangement of the initial values. ARRAY.2 is a 2-D wavefront array which implements the proposed algorithms. 
176 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
of et(n)) or a sequential implementation of the same 
array k + 1 times. 
A suitable overall system architecture is depicted 
in Fig. 2. The computed initial values of e’, eb and 
ec are obtained serially from every row of the 
ARRAY.1 of Fig. 2. Then they pass through a con- 
troller composed of a column of (p + 1) dividers, 
a column of p + 1 FIFOs of length k and glue logic, 
that places them in an appropriate order. The re- 
sults are stored in a RAM of size 2(k + 1) x kp 
words, in the order presented in Fig. 7, and can 
directly be the input of input of ARRAY.2 of Fig. 2. 
ARRAY.2 is depicted in detail in Fig. 7 and analyti- 
cally presented in Section 4.2. After the computa- 
tion of the filter parameters the results are serially 
obtained from the columns of the above-mentioned 
array and are stored in a multiport RAM of size 
2(k + 1) x kp words. 
4.2. Main algorithm. A two dimensional approach 
The main array processor architecture that im- 
plements the proposed Schur algorithms is de- 
veloped. Since the algorithms dependence graph 
(DG) representation is complicated we will only 
give the DG for some equations of the first algo- 
rithm (MSE case). In a similar way all the other 
equations of the algorithms are handled. Consider 
Eqs. (l), (2) and (3) of Table 1. Eqs. (l), (2) are 
represented by the DG of Fig. 3. The DG of Fig. 4 
represents Eq. (3). In order to simplify the exposi- 
tion the 3-channel case with filter order 3 is depic- 
ted. Nodes of the arrays denoted by boxes perform 
division, while nodes denoted by circles perform 
multiplication and addition. The last index of vari- 
ables ef, eb, b, c and the index of reflection coeffic- 
ient k’ are for representation purposes and denote 
the phase of the algorithm. As is obvious from 
Tables 1 and 2 the algorithms are composed of two 
parts, one for the computation of the various errors 
and another for the computation of the various 
predictors and the resulting filter. The variables of 
the first part are decreased by one element in every 
phase of the algorithm while the variables of the 
second part are augmented by one element in the 
same phase. As we see from Fig. 3, the node 
responsible for the computation of the reflection 
coefficients is not necessary for subsequent compu- 
tations. These nodes are located in such array posi- 
tions that if a projection along any axis of the array 
is made, nodes performing division are used. Using 
a rearrangement of the input variables a triangular 
array results with the nodes performing division 
located at the first column. In conjunction with the 
array of Fig. 4 we finally have the DG graph, 
presented in Fig. 5 for all the equations ((l), (2), (3)) 
of Table 1. A projection along the vertical axis in 
conjunction with the consideration of the reflection 
coefficient being transferred rather than broad- 
casted, in order to ensure locality, results in the DG 
of Fig. 6(a). Using the canonical methodology [l 1, 
161, we finally have the wavefront array of Fig. 6(b). 
The depicted array consists of appropriate PEs 
provided with handshaking ports. The functional 
block of such a PE is illustrated in Fig. 8(b). 
In a similar way using the DGs for the remaining 
equations we obtain the main array processor 
architecture. 
The final configuration for the 3-channel case 
with filters of order 3 is depicted in a two dimen- 
sional (2-D) graph in Fig. 7. The array discussed 
above is located in the last row of the array with its 
PEs denoted by an asterisk. O((k + 1)kp) processor 
cells are required to cope with the k channel FIR 
filtering. The overall processing time is O(kp) time 
units. 
Each node of the graph is represented by indices 
(v,n) where v = 1, . . . ,(k + 1) and v = 0, . . . , 
((kp) - 1). Two types of processors are utilized. The 
first is a multiply and add unit and is denoted by 
a circle. The second, denoted by a box, apart 
from multiplication and addition, performs a divis- 
ion as well. Their functional operation is illustrated 
in Fig. 8. The arcs of the graph illustrate the com- 
putational flow among the various processor ele- 
ments. 
The function of the array can be divided in three 
phases. First, the PEs are serially loaded with the 
initial values, as illustrated in Fig. 7. These values 
in the MSE case are assumed to be known in 
advance. Otherwise they are computed in the array 
presented in Section 4.1 and are fed in the array as 
depicted in Fig. 2. As illustrated in Fig. 8 we have 
created a data path inside the PEs (indicated with 
dashed lines) using multiplexers and decoders, 
I. Bouras et al. / Signal Processing 51 (1996) 167-190 117 
e*lJ(O)(O) e”l,‘(l)(O) W(2)(0) e’yO)(O) eblJ(l)(0) W(2)(0) eb1,3(0)(O) eb1j3(l)(0) e*‘J(2)(0) 
I I 
1 I 
I I 
I 1 
eb2)3(1)(7) eb2*s(2)(7) 
ec3(1)(7) eC3(2) 
IAl ’ 
L-z@ 
Fig. 3. The Dependence Graph for Eqs. (1) and (2) of Table 1 (3-channel case with filters of order 3). The circles denote nodes that 
perform multiplication and addition, while boxes denote nodes that perform division. 
through which the data input occurs. These 
variables have been properly rearranged in order 
to ensure local communication as well as to 
minimize the use of division cells. Indeed, the 
proposed architecture involves k + 1 PEs that ex- 
ecute division. 
The PEs located at nodes of the form (v,O), 
v = 1, . . . , k, compute variables k*” and kbp, while 
178 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
Fig. 4. The Dependence Graph for Eq. (3) of Table 1 (3-channel case with filters of order 3). The circles denote nodes that perform 
multiplication and addition. 
the first processor of the last row (k + 1,0) (PE 
denoted by a box with an asterisk) computes kc. 
The resulting outputs are serially transmitted to the 
neighboring nodes (v, 1). All other processors (v, n) 
receive the reflection coefficients, denoted by k, 
from the (v,n - 1) processor and pass them to 
(v,n + 1). Once the values of reflection coefficients 
kSy and kbp reach processor (v, n), they are used for 
the computation of ebwTh(n) and ef”*h(n) (or 
b[( 1 n - kpl) and a;( 1 n - kp I)), since the new values 
of b and II are created at the places of eb and e/ that 
are not necessary for subsequent computations. 
PEs (k + 1,n) (PEs denoted by a circle with an 
asterisk) compute ec (or c). The PEs (v,O) do not 
compute the variables ef and e’. When computa- 
tions are completed, the new values of ebp (or b’) 
are passed to the (v - 1, n) processor in order to 
meet the appropriate e” (or a”) for the next phase. 
Concerning the new values of efv (or a”) and ec (or 
c), they are passed to (v, n - 1) and (k + 1,n - 1) 
processors, respectively. The circulation of es 
(or a) and ec (or c) is necessary to ensure that the 
I. Bouras et al. f Signal Processing 51 (1996) 167-190 119 
ec2(lP) ec3(l)(0) ee’GW e"(2)(0) eC3(2)(0) 
eb3 
Fig. 5. The Dependence Graph for Eqs. (l), (2) and (3) of Table 1, resulting from the rearrangement of input data (3-channel case with 
filters of order 3). The circles denote nodes that perform multiplication and addition, while boxes denote nodes that perform division. 
computation of the reflection coefficients (kc, kb, kf) 
is always performed by (v,O) processors and at the 
same time create an empty place at the (v, kp - 1) 
processors. In a similar fashion variables ec flow 
from processors (k + 1,n) to processors 
(k + 1, n - l), leaving an empty place at (k + 1, 
kp - 1) PE. 
In each phase of the algorithm there are empty 
places where the new values of a,b, c are created 
from the values of kf, kb and kc respectively. At the 
end, the resulting values of a,b and c are placed at 
corresponding places where initialization variables 
eI eb ec were stored. > 3 
The new values of 4” are always created at the 
(v, kp - 1) PE, v = 1, . . . , k. The new values of c are 
created at the (k + 1, kp - 1) PE. The new values of 
b”appearatPE(v,kp-l-m),m=O, . . ..kp.Al- 
bit control signal is used to indicate the PE where 
the new values of b’ are created. Initially all PEs 
stay in a ‘false’ mode except the ones with indices 
(v, kp - 1). Thus, the first values of b” are generated 
at these nodes. Once this has been completed, the 
180 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
e*3*3(2)(8) b:(l)(S) b%(l)@) b;(l)(8) b:(2)(8) b;(2)(8) G(2)(8) b:(W) b:(W) 
I I I I I 1 I I # 
I I I I I 1 I I I 
1 I I I I I I 0 I 
I I I I I I I I I 
eb3t3(0)(2) eb3s1(l)(2) ebq2(l)(2) ebq3(l)(2) eba1(2)(2) e”“*(2)(2) eb3p3(2)(2) b:(l)(2) b:(l)(2) 
eb’~*(o)(l) eba,3(0)(l) ebzl’(l)(l) eb2s2(l)(l) eb2.3(l)(1) eb2p1(2)(l) eb2z2(2)(1) ebz3(2)(l) b:(l)(l) 
ebl,* (O)(O) ebl,*(0)(O) eb1j3(0)(O) W(l)(O) ebl**(l)(0) f~“‘~(l)(O) eblll (2)(O) eb1t’(2)(0) eb1j3(2)(0) 
(a) ClQ)P) . c*(1)(9) . c3(1)W Cl cw) c*(W) c3Gw Cl(W) c*wJ) Q(3)(9) 
7 7 
(b) cl(r)(s) c*(1)(9) c3OP) ~lcw c2tw) c3w4 ClwJ) c*WP) c3cw) 
Fig. 6. (a) The Dependence Graph of Eqs. (l), (2) and (3) of Table 1 resulting from the projection along the vertical axis of the DG of 
Fig. 5 (3-channel case with filters of order 3). (b) The wavefront array that implements Eqs. (l), (2) and (3) of Table 1 (3-channel case with 
filters of order 3). Circles denote PEs that perform multiplication and addition, while boxes denote PEs that perform division. 
PEs (v, kp - 1) are deactivated, passing the control 
signal to their neighboring PEs (v, kp - 2) making 
them active and so on. When the PEs (v, 0) become 
active and the last values of b” are generated, an- 
other control signal is activated, which is sent to- 
gether with the last values of reflection coefficients 
and signals the end of the computation. As is obvi- 
ous from the above description both the control 
and data flow is local leading to a wavefront array. 
In the final phase of the array function the resulting 
values of u, b and c are placed at corresponding 
places where initialization variables ef, eb, ec were 
loaded as described earlier. We can now employ the 
data path used in the initial phase, to obtain the 
results serially from the columns of the array and 
store them in a multiport RAM of size 2(k + 1) x kp 
words. 
As we can conclude from Fig. 7 the array needs 
(k + 1)kp PEs. Each of k’(p - 1) PEs (denoted by 
a circle in Fig. 7) is composed of two multipliers, 
two adders, five handshaking modules, four regis- 
ters, two decoders, three multiplexers and a control 
I. Bouras et al. / Signal Processing 51 (1996) 167-190 181 
e” (b) e” (b) e’ (b) eb (b) e’ (b) eb (b) eb (b) e” (b) e” (b) 
1 
eb2J (0) 
’ eJl,‘(O) 
eQ3 (0) eba,l (1) ebaBa(l) eba13(1) ebaol (2) ebasa (2) ebq3 (2) eb7’ (3) 
ef’qo) eJ’,’ (1) eJ1oa (1) eJ1s3 (1) eJ1zl (2) eJIBa (2) eJ1t3(2) eJ’,’ (3) 
eb3-3 (0) 
2 eJas3(o) 
ebal (1) ebaa (1) eba3(1) eb3,’ (2) eb3J(2) ebq3 (2) eb31 (3) ebQa (3) 
eJa,‘(l) eJasa(l) eJaB3(1) eJa*r (2) eJaaa (2) eJap3(2) eJa,’ (3) eJala (3) 
ebltl(0) 
3 eJ3,1 (1) 
eblna (0) eb113 (0) e”‘,‘(l) eblta(l) eb1j3(1) eblsl (2) eblja (2) eb113(2) 
eJ3ja (1) eJ3,3(l) eJ3j1 (2) eJ3za (2) eJ3t3 (2) eJ3,’ (3) eJ3,a(3) eJ3t3(3) 
4 
eblrL (0) eblga(0) eb113(0) e’l,’ (1) eblaa (1) eb1n3 (1) eblB1 (2) ebl,a (2) eb1r3(2) 
eC1(0) eta(O) eC3(0) eel (1) eta(1) ee3(1) eC1(2) eta(2) ec3(2) 
Fig. 7. A 2-D wavefront array for the 3-channel case with filters of order 3. The circles denote PEs that perform multiplication and 
addition while boxes denote PEs that apart from multiplication and addition perform division as well. At the end of the computation the 
results are located at the PEs of the array. 
unit of about ten gates of glue logic. Their 
functional block is depicted in Fig. 8(a). The k 
PEs (denoted by a box in Fig. 7) have the same 
amount of hardware with the above-mentioned 
PEs plus two dividers in order to perform the 
computation of the reflection coefficients. The 
k(p - 1) PEs (denoted by a circle with an asterisk 
in Fig. 7) are composed of one multiplier, one 
adder, five handshaking modules, three registers, 
two decoders, two multiplexers and a control unit 
of about ten gates of glue logic, each. Their func- 
tional block is depicted in Fig. 8(b). One PE (de- 
noted by a box with an asterisk in Fig. 7) is com- 
posed of the hardware mentioned above and one 
divider. 
The overall time of the array function is 
Tall = Tinit + Tcomp + Tres, 
182 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
I- 
ef (4 
kj, kb 
ADO 
T 
MUL 
Fig. 8. Functional blocks for the nodes of the array of Fig. 7. (a) PE denoted by circle (MSE and TSE cases). 
where Tinit is the time needed for the initialization 
of the array, Tcomp is the computation time and 
Tres is the time needed to obtain the results. Note 
that Tinit = 2(k + l)tload, where rIoad is the time in 
which every variable e is loaded in a PE. 
Tcomp = Trcomp + Tkdel , 
Trcomp = kpt, where kp are the needed cycles of 
a PE for the overall computation and t, is the cycle 
time of a PE. t, = 2tkload + tmult 4 tadd where t&,ad is 
the time for a reflection coefficient to be stored in 
a register in the PE. Tkdel = k(p - l)tkdc, where 
tr& is the time that a reflection coefficient needs 
to pass from a PE to its neighbor. Tkdel is the 
time that a reflection coefficient needs to arrive 
at the last PEs of the array (v,k(p - 1)). 
Tres = 2(k + l)t,,, where t,,, is the time in which 
every variable of the results II (orb, or c) is obtained 
from a PE. 
OCCAM description of the basic processor units 
is provided in Fig. 9. Readers unfamiliar with 
OCCAM language may look at the detailed 
description of the code of Fig. 9(a) given in 
Appendix A. 
A similar array processor can be utilized for the 
implementation of the total least squared Schur 
algorithm tabulated in Table 2. The only difference 
is the functional operation of the PEs with indices 
(k + 1,n) (n = 0, . . . , kp - 1). The functional block 
of such a PE is illustrated in Fig. 10). The PE 
(k + 1,0) computes the values of kc, k”, kb and 
I. Bouras et al. / Signal Processing 51 (1996) 167-190 183 
Fig. 8. (b) PE denoted by circle with an asterisk (MSE case). MUL, ADD, REG, DEC and MUX denote a multiplier, an adder, a register, 
mode 1 
1 MU. 1 ‘:-_-I--------------________i 1 
kc 
kc 
’ e* (b) 
a decoder and a multiplexer, respectively. 
passes them to its neighbor (k + l,l). Each PE 
(k + 1,n) receives the values of reflection coeffi- 
cients from PE (k + 1, n - 1) and forwards them to 
the next PE (k + 1,n + 1). When these values reach 
a PE they are used for the computation of the e’(N) 
(or c(N)), E(N) (or w(N)) and eb(N - 1) (or 
b(N - 1)). When the above computations have 
been completed, PEs (k + 1, n) forward the com- 
puted values of e’(N) (or c(N)) and E(N) ( or w(N)) 
to PEs (k + 1,n - 1) while they receive the com- 
puted values of e’(N) (or c(N)) and E(N) (or w(N)) 
from PEs (k + 1, n + 1). The new values of c(N) 
and w(N) are generated at PE (k + 1, kp - 1). 
eb(N - 1) (or b(N - 1)) are forwarded to PEs (k, n) 
while eb(N - 1) (or b(N - 1)) are received from 
PEs (1, n). All the other processors of the array have 
the same functionality as the ones discussed in the 
MSE case. The data and control flow remain the 
same as well. Variables ef (or a) and e* (or b) have 
been replaced by ef(N) (or a(N)) and eb(N) (or 
b(N)), respectively. 
The amount of hardware is augmented since the 
PEs of the last row of the array (denoted with an 
asterisk in Fig. 7) have the following extra circuits 
each: one multiplier, one adder, three handshaking 
modules and two registers. The size of the off chip 
RAM is now (2k + 3) x kp words. 
The overall time of the array function is again 
Tall = Tinit + Tcomp + Tres, 
where Tinit is the time needed for the initialization 
of the array, Tcomp is the computation time and 
Tres is the time needed to obtain the results of the 
computation from the array. Tinit = 3(k + l)tload, 
where tload is the time in which every variable e is 
loaded in a PE. 
Tcomp = Trcomp + Tkdel. 
Trcomp = kpt, where kp are the needed cycles of 
a PE for the overall computation and C, is the cycle 
time of a PE. t, = 3tkload + 2~~~ + tadd where 
tkload is the time for a reflection coefficient to be 
stored in a register in the PE. Tkdel = k(p - l)tkdel 
where lkdel is the time that a reflection coefficient 
needs to pass from a PE to its neighbor. Tkdel is the 
time that a reflection coefficient needs to arrive 
at the last PEs of the array (v, k(p - 1)). 
Tres = 3(k + l)t,,, where t,,, is the time in which 
every variable of the results II (or 6, or c, or IV) is 
obtained from a PE. 
184 I. Bouras ef al. J Signal Processing 51 (1996) 167-190 
WHILE NOT stop 
SEQ 
chan.1 ? kf ; kb ; stop 
chan.6 ! kf ; kb ; stop 
IF 
NOT model 
PAR 
SEQ 
ef:= tempef + (tempeb*kf) 
chan.2 ! ef ; model 
eb := tempeb + (tempef*kb) 
model 
PAR 
SEQ 
ef:= tempef + (tempeb*kf) 
chan.2 ! ef ; model 
eb := kb 
model := FALSE 
PAR 
chan.3 ! eb 
chan.5 ? tempef ; mode 
chan.4 ? tempeb 
a) 
WHILE NOT stop 
SEQ 
chan.1 ? kc ; stop 
chan.6 ! kc ; stop 
ec := ec + (eb*kc) 
chan.3 ? eb 
chan.5 ! eb 
chan.2 ! ec ; mode 
chan.5 ? ec ; mode 
b) 
chaa.3 
chan.2 chan.5 
* 
:ti:- 
chan.1 chan.6 
chall.4 
chan.3 
chan.2 chan.5 
+= 
chan.1 chan.6 
chin.4 
WHILE NOT stop 
SE8 
PAR 
kf := 
kb := 
-$myf[Frb) 
em e 
IF 
model 
stop := TRUE 
TRUE 
SKIP 
chan.1 ! kf ; kb ; stop 
IF 
NOT model 
PAR 
af := af + (tempefLkf) 
eb := tempeb + (tempef+kb) 
model 
eb := kb 
PAR 
chan.2 ? tempef ; model 
chan.3 ! eb 
chan.4 ? tempeb c) 
Fig. 9. OCCAM description for some of the PEs of the array of Fig. 7. The boundary processors n = (kp - 1) have some 
modifications since they do not compute the a or c but assign to them the new values of reflection coefficients. (a) PE denoted by 
a circle (MSE and TSE cases). (b) PE denoted by a circle with an asterisk (MSE case). (c) PE denoted by a box (MSE and TSE 
case). 
1. Bouras et al. / Signal Processing 51 (I 996) 167-I 90 185 
eb(N - 1) (b(N - 1)) 
-Dl 
Q 
mu.u 
ec 
kc,iib, 
1 
ADD ADD ADD 
T T 1’ 
MUL MUL MUL 
-IT-l 
i 
kw z 
Fig. 10. Functional block for the PE of the array depicted in Fig. 7, which is denoted by a circle with an asterisk (TSE case). MUL, ADD 
and REG denote a multidier, an adder and a register, respectively 
4.3. Implementation issues 
The wavefront arrays discussed so far are the two 
dimensional, thus being suitable for VLSI imple- 
mentation. Although the number of PEs is extreme- 
ly large the increasingly diminished dimensions of 
the devices in current technologies along with new 
technologies as wafer scale integration (WSI) or 
multiple chip modules (MCM) allow for a possible 
implementation of the algorithm in a single chip, in 
the near future. 
Dedicated PEs must be designed in order to 
maximize performance. Simulations have shown 
that both algorithms are well behaved using fixed- 
point arithmetic with word-length greater than 
16-bits. Thus fixed-point arithmetic will be used for 
hardware implementation of the arrays. All PEs 
have a very simple structure. They all consist of 
a multiplier, an adder, a couple of registers and 
some glue logic for the control. Since wave-front 
arrays are employed, the ports must be imple- 
mented by handshaking circuits. Notice that the 
PEs assigned to (v,O) include extra circuitry to 
perform division. 
Multiple data paths can be utilized to achieve full 
parallelism. Even in the case of PEs assigned to 
(v,O), more than one divider must be used, when we 
have to compute more than one reflection coeffi- 
cient. However, the large number of PEs makes it 
prohibitive within the existing technology. Many 
choices could be made for the implementation of 
the multiplier and the adder, such as bit-serial or 
full parallel architectures, having trade-offs be- 
tween area and speed. Since the area is the most 
restrictive factor in our case, ‘slow’ but ‘small’ cir- 
cuits must be used in order to implement the 
186 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
algorithm in a single chip. The scheduling of the 
operations is not so vital for the speed of the algo- 
rithms except he priority given to the computation 
of the b(N - 1) (or eb(N - 1)) in the TSE case. 
5. An architecture with reduced number of 
processors 
In this section a more realistic approach towards 
hardware implementation is described. Grouping 
of the PEs of Fig. 7 along the ‘channel’ axis is 
performed. We map every group of k PEs (k is the 
number of channels) to one PE of the new archi- 
tecture. The resultant architecture is illustrated in 
Fig. 11 (3-channel case with filters of order 3). 
Each PE in the new scheme has to handle the 
operations performed by k PEs of Fig. 7. The input 
data are serially loaded in the array in a similar 
eb (b) 
Fig. 11. A 2-D array for the 3-channel case with channel ength 
3, resulting from the folding of the array of Fig. 7. Input and 
output of data are performed in a similar fashion to the one 
described for the array of Fig. 7. 
fashion as in the array of Fig. 7. Every PE of Fig. 11 
has to be loaded with all the input data of k PEs of 
Fig. 7, increasing the time needed for the initializa- 
tion of the array by a factor of k. The same increase 
in time is also there at the end of the computation, 
when we have to output serially the results located 
at the PEs of the array. 
Following the array processor of Fig. 7, multiply 
and add processors are represented by a circle, 
while units equipped with division circuitry are 
denoted by a box. This 2-D array is of primary 
interest when VLSI implementation is considered, 
due to the reduced number of PEs that are utilized. 
Indeed only (k + 1)~ processors are employed. This 
significant hardware reduction is traded off by an 
increase of computation time. The computational 
load of each PE is increased by a factor of k, 
resulting in an overall time of O(k2p) time units. 
Implementation of the algorithm in a massively 
parallel machine composed of standard DSP pro- 
cessors is more effectively carried out by the above 
architecture. Indeed parallel machines have large 
communication cost because their processors are 
intended for general use. The architecture allows 
grouping of the variables and thus reduction of the 
communication overhead. The efficient architec- 
ture depicted in Fig. 6, for the implementation of 
the multichannel Schur algorithm, was pro- 
grammed on Parsytec’s GC512 highly parallel ma- 
chine, consisting of 512 transputers. A particular 
problem encountered was the excessive time 
needed for input and output of data, since only one 
processor was available for the communication 
between the machine and the host computer. 
A suitable dedicated processor for wavefront im- 
plementation is shown in block diagram form in 
Fig. 12. A double data path is utilized to increase 
processor speed, to simplify control and to take 
advantage of the specific structure of the pertinent 
algorithms. Also using multiplexers and decoders 
we create a data path in order to load the initial 
values to the array and to obtain the results from it. 
The circuit that mostly affects the PE speed is the 
multiplier. This is also the case in most DSP ap- 
plications. At present a wide range of fast multipli- 
cation algorithms are available to the designer [14, 
20-J. A parallel multiplier based on the modified 
Booth algorithm seems to be appropriate in our 
I. Bow-as et al. J Signal Processing 51 (1996) 167-190 187 
4 
Fig. 12. The block diagram of an ASIC processor suitable for the 2-D array of Fig. Il. 
case. It assures a high speed performance and 
a compact layout as well. In our case there is an 
extra advantage since in each phase of the algo- 
rithm a value of variable k is multiplied with a num- 
ber of variable e values. Thus, the delay due to the 
modified Booth algorithm encoding, is counted 
only once, for a number of multiplications equal to 
the number of channels. It should be pointed out 
that good performance could be achieved even with 
the use of single-level-metal CMOS technology 
[ 14, 201. This enables the utilization of the remain- 
ing metal levels for the interconnections of the 
whole array. The processor also includes two fast 
adders (e.g., CLA adders), two FIFOs, each one 
equal to at least the number of channels, four ports 
suitable for handshaking, some registers and a simple 
control unit. The FIFOs may need some extra space 
to cope with any communication bottleneck. 
The two FIFOs are initially loaded with initial 
values of efv,h and e bP3h by activating the appropri- 
ate data path. At each phase the new values of kf, kb 
arrive at the processor. These in turn pass to its 
neighbor and simultaneously are stored in the two 
registers. After the computation of the new values 
ef (or a) and eb (or b), the first value of es is passed 
to the neighboring processor while the other ones 
return to the FIFO. After the end of the phase the 
value of er that arrives at the processor is placed in 
the last position of the FIFO. All values of eb (or b) 
pass to the neighboring processor as indicated in 
Fig. 11. Similar operations are performed by the 
processors that compute ec (or c). 
188 I. Bowos et al. / Signal Processing SI (1996) 167-l 90 
e’ l(O) 
ew(O 
ea 3(O) 
e-(o) ey1 
cd?) a;(P) 4(P) 4(P) 
b:@) b:(p) b:(p) 
Fig. 13. A 3-D wavefront array for the 3-channel case, suitable for both the MSE and TSE case. PEs denoted by circles perform 
multiplication and addition. PEs denoted by boxes apart from multiplication and addition, perform division as well. The indices m and 
1 used to describe the input data are m = kp and I= kp - 1. 
I. Bourns et al. 1 Signal Processing 51 (1996) 167-190 189 
In the total squared error case of Table 2, the 
structure of the algorithm is essentially the same, 
i.e., the dependence graph does not change apart 
from some of the nodes which perform some extra 
multiply-add operations. More precisely, the pro- 
cessors computing the vector c have now to per- 
form three operations at a time. One design alter- 
native is to increase the data paths. A possible 
remedy is to allocate the third operation to one of 
the existing data paths slightly increasing the over- 
all computation time of the array. In this case 
a careful scheduling of the operations must be ac- 
complished giving priority to the computation of 
b(N - 1) (or eb(N - 1)). In both cases a third FIFO 
must be used to store the additional variables. 
6. A highly pipelined 3-D array 
An alternative architecture for the implementa- 
tion of the proposed algorithms is depicted in Fig. 
13. It is a three dimensional array consisting of 
(k -t- l)k2p2 PEs, needed to handle the k channel 
setup. As with the previous arrays, the circles rep- 
resent multiply-add processors while boxes per- 
form division as well. The PEs are identified by 
a triplex (v,h, n) (v = 1, . . . , k + 1, h = 1, . . . , kp, 
n = 1, . . . , kp). The PEs (k + l,h, n) compute the 
solution vector c (and vectors w and b(N - 1) in the 
TSE case), while the rest compute the forward and 
backward predictors u and b. The functional opera- 
tions of each PE are similar to that utilized by the 
2-D array counterparts. The input and output data 
are illustrated in Fig. 13. To clarify the computa- 
tional flow, input and output data are indicated in 
the figure. The main advantage of this array is that 
it is susceptible to pipeline with latency kp time 
units. This attractive feature compensates for 
the excessive number of PEs. This particular 
implementation is suitable for block adaptive pro- 
cessing [21,24]. Indeed, it is more efficient han the 
triangular array processors developed in [21] for 
the adaptive Schur algorithm, since matrix opera- 
tions are replaced now by scalar counterparts. 
7. Conclusions 
Efficient VLSI architectures of highly concurrent 
algorithms for the order recursive solution of block 
Toeplitz like systems have been presented. The pro- 
posed structures can be implemented either on 3-D 
or on 2-D arrays of VLSI processors, compromising 
between execution time and number of processors. 
Appendix A 
All processes in Fig. 9(a) with the same indenta- 
tion belong to the same process. For explanation 
reasons we give a number at the right of each 
process. All the subprocesses with the same number 
at their left belong to the same process. The sym- 
bols kf, kb, eb, ef denote the variables kf, kb, eb (or 
b), d (or a), respectively, the symbols tempeb, tem- 
pef denote temporal variables for the eb (or b), ef (or 
a), the stop and model are logical variables and 
chan.1, chan.2, chan.3, etc., are the six channels of 
the processor element. 
WHILE NOT stop 
SEQU) 
(1.a)chan.l ? kf; kb; stop 
(l.b)chan.6 ! kf; kb ; stop 
IF 
(l.c)NOT model 
PAR(2) 
CWSEQP) 
(3.a)ef := tempef + (tempeb*kf) 
(3.b)chan.2 ! ef; model 
(2.b)eb := tempeb + (tempef*kb) 
(l.c)model 
PAR(4) 
WaWQ(3 
(5.a)ef := tempef + (tempeb*kf) 
(5.b)chan.2! ef; model 
(4.b)eb := kb 
(4.c)model:= FALSE 
(l.d)PAR(6) 
(6.a)chan.3 ! eb 
(6.b)chan.5? tempef; model 
(l.e)chanA? tempeb 
While the logical variable stop has the false 
Do sequentially (1) 
(1.a) The variables kf, kb, stop take serially their 
new values from the channel chan.1. 
(1.b) The variables kf, kb, stop output their values 
serially to the channel chan.6. 
190 I. Bouras et al. / Signal Processing 51 (1996) 167-190 
(lx) Zf the logic variable model has the value false 
do in parallel (2) 
(2.a) Do sequentially (3) 
(3.a) Add the result of the multiplication of tempeb 
and kf to the tempef and assign the result to the 
variable ef. 
(3.b) The variables ef and model output their values 
to the channel chan.2. 
(2.b) Add the result of the multiplication of tempef 
and kb to the tempeb and assign the result to the 
variable eb. 
(1.~) If the logic variable model has the value false 
do in parallel (4) 
(4.a) Do sequentially (5) 
(5.a) Add the result of the multiplication of tempeb 
and kf to the tempef and assign the result to the 
variable ef. 
(5.b) The variables ef and model output their values 
to the channel chan.2. 
(4.b) Assign the value of the variable kb to the vari- 
able eb. 
(4.~) Assign the value false to logic variable model. 
(1.d) Do in parallel (6) 
(6.a) Output the value of variable eb to the channel 
chan.3. 
(6.b) The uariables tempefand model take their new 
values from the channel chan.5. 
(1.e) The variable tempeb takes its new value from 
the channel chan.4. 
References 
Cl1 
PI 
c31 
r41 
c51 
G. Glentis and N. Kalouptsidis, “Efficient order recursive 
algorithms for multichannel least squares filtering”, IEEE 
Trans. Signal Process., June 1992, pp. 135441374. 
G. Glentis and N. Kalouptsidis, “Efficient algorithms for 
the solution of block linear systems with Toeplitz entries”, 
Linear Algebra Applications, January 1993. 
G. Glentis and N. Kalouptsidis, “Solution of block linear 
systems with Toeplitz entries using a channel decomposi- 
tion technique”, Signal Processing, Vol. 37, No. 1, May 
1994, pp. 15-60. 
G.-O. Glentis and N. Kalouptsidis, “Efficient multichannel 
FIR filtering using a step versatile order recursive algo- 
rithm”, Signal Processing, Vol. 37, No. 3, June 1994, pp. 
437-462. 
A.K. Jain, Fundamentals of Digital Image Processing, Pren- 
tice-Hall, Englewood Cliffs, NJ, 1989. 
PI 
c71 
PI 
c91 
WI 
Clll 
WI 
Cl31 
Cl41 
Cl51 
II161 
Cl71 
Cl81 
Cl91 
WI 
c211 
PI 
~231 
c241 
I.C. Jou, Y.H. Hu and W.S. Feng, “Novel implementation 
of pipelined Toeplitz system solver”, Proc. IEEE, Vol. 74, 
1986, pp. 1463-1464. 
T. Kailath, Signal Processing in the VLSI Era, Prentice- 
Hall, Englewood Cliffs, NJ, 1985. 
N. Kalouptsidis, G. Carayannis, D. Manolakis and E. 
Koukoutsis, “Efficient recursive in order LS FIR filtering 
and prediction”, IEEE Trans. Acoust. Speech Signal Pro- 
cess., Vol. 33, October 1985, pp. 1175-1187. 
N. Kalouptsidis and S. Theodoridis, “Parallel implementa- 
tion of efficient LS algorithms for filtering and prediction”, 
IEEE Trans. Acoust. Speech Signal Processing, Vol. 35, 
November 1987, pp. 1565-1569. 
N. Kalouptsidis and S. Theodoridis, eds., Adaptive System 
Zdentijication and Signal Processing, Prentice-Hall, Engle- 
wood Cliffs, NJ, 1993. 
S.Y. Kung, VLSI Array Processors, Prentice-Hall, Engle- 
wood Cliffs, NJ, 1988. 
S.Y. Kung and Y.H. Hu, “A highly concurrent algorithm 
and pipelined architecture for solving Toeplitz systems”, 
IEEE Trans. Acoust. Speech Signal Process., Vol. 31, Feb- 
ruary 1983, pp. 66-76. 
L. Ljung and T. Siiderstrlim, Theory and Practice ofRecur- 
sioe Identification, MIT Press, Cambridge, MA, 1982. 
G.K. Ma and F.J. Taylor, “Multiplier policies for digital 
signal processing”, IEEE ASSP Mug., January 1990, pp. 
6-20. 
S.L. Marple, Digital Spectral Analysis with Applications, 
Prentice-Hall, Englewood Cliffs, NJ, 1987. 
D.I. Moldovan and J. Fortes, “Partitioning and mapping 
algorithms onto fixed size arrays”, lEEE Trans. Comput., 
Vol. 35, No. 1, January 1986, pp. 1-12. 
M. Morf, B. Dickinson, T. Kailath and A. Viera, “Efficient 
solution of covariance equations for linear prediction”, 
IEEE Trans. Acoust. Speech Signal Process., Vol. 25, 1977, 
pp. 4299433. 
J.G. Proakis, Digital Communications, McGraw-Hill, New 
York, 1983. 
E. Robinson, Multichannel Time Series Analysis with 
Digital Computer Programs, Holden-Day, San Francisco, 
CA, 1967. 
R. Sharma et al., A 6.75-m 16 x 16-bit multiplier in single- 
level-metal CMOS technology”, IEEE J. Solid-State Cir- 
cuits, Vol. 24, No. 4, August 1989, pp. 922-927. 
P. Strobach, “Recursive triangular array ladder algo- 
rithms”, IEEE Trans. Signal Process., Vol. 39, January 
1991, pp. 122-136. 
S. Treiter, “Principles of digital multichannel filtering”, 
Geophysics, Vol. 35, No. 5, 1970, pp. 785-811. 
R. Wiggins and E.A. Robinson, ‘Recursive solution to 
multichannel filtering problem”, J. Geophys. Res., Vol. 70, 
1965, pp. 1885-1891. 
Xiao-Hu Yu and Zhen-Ya He, “Efficient block implemen- 
tation of exact sequential least squares problems”, IEEE 
Trans. Acoust. Speech Signal Process., Vol. 36, 1988, pp. 
392-399. 
