Transputer implementation of block regularized filtering by Kadlec, Jiří
Kybernetika
Jiří Kadlec
Transputer implementation of block regularized filtering
Kybernetika, Vol. 32 (1996), No. 3, 235--250
Persistent URL: http://dml.cz/dmlcz/125518
Terms of use:
© Institute of Information Theory and Automation AS CR, 1996
Institute of Mathematics of the Academy of Sciences of the Czech Republic provides access to digitized
documents strictly for personal use. Each copy of any part of this document must contain these
Terms of use.
This paper has been digitized, optimized for electronic delivery and stamped with
digital signature within the project DML-CZ: The Czech Digital Mathematics Library
http://project.dml.cz
K Y B E R N E T I K A — V O L U M E 32 ( 1 9 9 6 ) , N U M B E R 3, P A G E S 2 3 5 - 2 5 0 
TRANSPUTER IMPLEMENTATION 
OF BLOCK REGULARIZED FILTERING 
J I Ř Í K A D L E C 
A novel approach to parallel transputer implementation of a regularized exponential 
parameter tracking is described. The proposed block regularized exponential algorithm 
allows to define and automatically adjust mean value of an alternative covariance of the 
estimated parameters in the boundaries of blocks of processed data. The alternative mean 
tracks in blocks the current parameter estimates. That is why the influence of the regular-
ization on the parameter estimates is reduced, and the algorithm remains compatible with 
the completely pipelined parallel transputer implementation. 
1. I N T R O D U C T I O N 
A brief introduction t o the existing parallel technology for implementation of iden­
tification and control algorithms is presented in this section. The objective is to 
provide an overview of the transputer technology. 
H a r d w a r e u s e d in s ignal p r o c e s s i n g a n d contro l app l icat ions 
A wide variety of hardware has been employed for embedded real-time systems, rang­
ing from conventional microprocessors to digital signal processors (DSPs). T h e main 
advantage of DSPs over general purpose microprocessors is t h a t they contain dedi­
cated circuitry which provides high resolution and high speed arithmetic operations. 
Furthermore, these devices contain A/D and D/A converters on-chip and require 
little external circuitry. T h e increasing trend in this area is to provide low-cost 
application specific integrated circuits (ASICs) which, although restrict flexibility, 
increase the computat ional rates and simplify the support hardware requirements. 
One of the most commonly used digital processing devices is the TI TMS 320. 
Despite the success of embedded digital controllers in industrial control applica­
tions, an increasing number of parallel processors, such as the Inmos transputer, are 
appearing on the market. Increased computational speed is of course the primary 
benefit of parallel processing. This allows faster systems to be controlled and gives 
the designer the choice of added complexity in the algorithms. Easy expansion, with­
in a uniform hardware and software base is another feature of concurrent systems, 
236 J. KADLEC 
since it is possible to add more processors as required which has important impli-
cations for reduced development and maintenance costs. Parallel processing also 
offers a closer relationship between the inherent parallelism expressed at the design 
stage, in the form of block diagram descriptions of the system, and the hardware 
implementation. 
Paral le l arch i t ec tures 
The most commonly accepted classification of digital computers was introduced by 
Flynn in 1966 [2], based on the multiplicity of instruction and data streams. This 
results in four categories: 
1. Single instruction stream - Single data stream (SISD) 
2. Single instruction stream - Multiple da ta stream (SIMD) 
3. Multiple instruction stream - Single data stream (MISD) 
4. Multiple instruction stream - Multiple data stream (MIMD) 
where an instruction stream is a sequence of instructions executed by the machine 
and a da ta s tream is a sequence of data including input, partial or temporary results, 
called for by the machine [8]. The most useful parallel processing architectures can 
be classified as SIMD, for example a systolic array, or as MIMD, for example an 
array of t ransputers . 
Systo l i c arrays . The systolic array was first proposed by II. T. Kung and C. E. Leis-
erson in 1980 [19]. This innovative array was originally proposed to exploit VLSI 
technology as it consisted of an array of simple, mostly identical, processing elements 
arranged in a regular, repeated structure. Furthermore, each processing element is 
connected only to its nearest neighbors and communication is achieved by pump-
ing data through the array on each cycle of a clock. As a result data is passed 
through the array in a specific direction and the term 'systolic' is used to describe 
this rhythmical movement which is analogous to the pumping action of the heart. 
During the last ten years, the concept of a systolic array has evolved to provide 
a graphical representation of the computational and da ta flows for a concurrent 
description of an algorithm. Indeed, considerable effort has been devoted to the 
design of systolic systems ranging from convolution to Kalman filtering [20]. The 
general consensus emerging from this work seems to be tha t the structures should be 
referred to as systolic architectures as their actual hardware implementation could 
be either a dedicated integrated circuit (SIMD) or an array of transputers (MIMD). 
Furthermore, the advances in systolic arrays have provided a major stimulus to the 
development of other array processors such as the wavefront array. This latter type, 
which was proposed by S. Y. Kung [20], operates in a similar fashion as the systolic 
array except tha t the communication is asynchronous, as the global clock pulse is 
replaced by a local handshaking type procedure. 
Transputer Implementation of Block Regularized Filtering 237 
Transputers. The Inmos transputer, which was introduced in the mid 1980's, 
is a 32-bit microprocessor specifically designed for implementing parallel systems. 
Essentially, the device is a microprocessor with on-chip RAM and with the novel 
feature of four bi-directional serial links which allow point-to-point communication 
between one transputer and another. This allows arrays of transputers to be con-
nected together to form MIMD systems. 
There are various generations of the transputer commercially available, such as 
the T800. It contains an on-chip 64 bit floating point unit and the hardware floating 
point unit sustains 2.5 million floating point operations per second (FLOPS). In 
addition, the T800 transputers employ four links with speeds 20 Mbits/sec. 
The next generation of the transputer, the T9000, provides further improvements 
over the T800 processor as it contains a pipelined superscalar micro-architecture 
and 16 Kbytes of on-chip instruction and data cache. Furthermore, this processor 
is capable of a peak performance of 25 million FLOPS running at a clock speed of 
50 MHz. In addition, the four communication links provide a total of 80 Mbytes/sec 
bi-directional bandwidth. 
The development of the the T9000 transputer has been delayed for several years in 
early 90's. The vendors of parallel systems developed multiple alternative solutions. 
The TI C40 signal processors operate with 6 eight-bit wide links and 40 million 
FLOPS CPU. The drawback of this processor is the cost. The true Parallel C 
support emerged only in 1995. 
Another example can be the combination of a transputer with a powerful CPU 
like the DEC AXP Alpha. The DEC 21066 Alpha processor is fully pipelined, 
dual-issue 64-bit advanced RISC processor. Running at 233 MHz, it is capable 
of a maximum 233 million FLOPS in IEEE double precision format and performs 
466 million instructions per second. There are 300 MHz versions of this processor 
but not available for commercial market (spring 1995). The four Inmos compatible 
bi-directional transputer links are used to connect multiple Alpha processors into 
network. These "twins" are programmed in Parallel C as standard nodes of the 
transputer array. 
Programming languages. The transputer was originally developed to utilize 
the occam programming language, although over the last few years various other 
compilers have been developed. Occam is a unique parallel programming language 
which has roots in the mathematics of communicating sequential processes [7]. The 
language encompasses a design formalism which has strong modularity and yet is 
simple to use. An occam program consists of a hierarchy of parallel processes which 
can be run on one or more transputers, with the communication synchronized by data 
paths known as channels. These features provide a powerful tool for the development 
of concurrent programs and all of the real-time implementations, presented in the 
following sections, were programmed using the occam language. 
Parallel Implementation Issues 
The innovation of parallel processing has added a new dimension to the design of 
algorithms and programs. Parallel programming is not a simple extension of its serial 
238 J. KADLEC 
counterpart but it requires a complete reconsideration of the implementation process 
in order to fully exploit the possibilities offered by concurrency. Furthermore, the 
widespread availability of parallel processing hardware has led to a diverse range 
of applications from database management systems to real-time embedded control 
systems. 
An efficient parallel implementation must exploit both the capabilities of the 
target architecture and the natural parallelism inherent in the algorithm. There are 
three important considerations involved in the process of partitioning an algorithm 
to obtain a concurrent realization: 
• Subdivision of the algorithm into parallel processes 
• Inter-process communication 
• The efficient execution of each process. 
A crucial requirement, when subdividing an algorithm, is to obtain a balanced 
computational load across each of the processes. This ensures that no one process 
dominates over the others which provides the most effective concurrent realization. 
The procedure involves decomposing the problem into a set of tasks, or processes, 
to extract the natural parallelism inherent in the algorithm. 
Inter-process communication is the major overhead in parallel processing sys-
tems. Indeed, communication delays can considerably degrade the performance of 
a concurrent system and render its application unjustifiable. Thus, it is a crucial 
requirement to obtain an effective trade-off between the communication and com-
putat ion demands. 
The transputer application reported in this paper was programmed in Parallel C 
for a network of T800 transputers and a network of DEC 21066 233 MHz Alpha 
processors. The implementation runs under MS Windows 3.1 version of Matlab 4.2. 
The t ranspute r /Alpha network is interfaced to Matlab via the Alpha-Bridge software 
development support utility Alpha-Bridge [14]. 
A parallel t ransputer implementation of a regularized exponential parameter 
tracking is described in next sections of the paper. The formulation of the regu-
larized parameter tracking takes in to account the transputer implementation re-
quirements outlined above. This results in the final parallel algorithm suitable for 
the transputer implementation. 
2. RECURSIVE IDENTIFICATION 
Recursive parameter estimation and filtering is a key tool in modern application 
areas such as the adaptive signal processing or control. 
When there is little information available about the detailed structure of the 
system, the s tandard Kalman filter model [3, 4] is reduced to a "random-walk of 
parameters" regression type model 
&k+i = 0k + wk 
Vk = <p'k0k + Vk- (1) 
Transputer Implementation of Block Regularized Filtering 239 
Here Ok is the vector of n unknown system parameters, yk is the sampled system 
output , (pk is the vector of n past input and output data, Vk and Wk are assumed to 
be the zero mean, Gaussian, white noise sequences characterized by the covariance 
matr ix Wk and Vk respectively. 
The related estimator takes the form of the standard Kalman-filter da ta update 
step (2) combined with different specifications of the time update part of the recur-
sive parameter estimation. 
D a t a u p d a t e . 
ejfc|Jb-i = 1/k ~ <Pk Qk\k-1 
D D Pk\k-l<pk<p'k
pk\k-l 
r k \ k = i j f c | j f c - i 7 — — ; - = • 1 ' 1 + ¥>k
Pk\k-l<Pk 
Ok\k = 0k\k-i + Pk\k tpkhlk-l- (
2) 
K a l m a n filter-like t i m e u p d a t e . Specified by 0 < Wk 
Ok + l\k = 0k\k 
Pk+i\k = Pk\k + Wk. (3) 
The alternative concept of exponential age-weighting [23] by the scalar exponential 
weighting factor 0 < A& < 1 is frequently used when an effective implementation is 
required [25]. 
E x p o n e n t i a l f o r g e t t i n g t i m e u p d a t e . Specified by 0 < A* < 1. 
Ok + l\k = 0k\k 
Pk+i\k = K'Pklk- (4) 
The adding of Wk to Pk\k (3) is avoided in (4) and the algorithm can be based on 
the effective Q R update of the square root factors of the information matr ix Pj~.k 
and implemented on the triangular systolic array of Gent leman-Kung [24], [21], [28]. 
Limits o f t h e s t a n d a r d t i m e u p d a t e formulat ion 
When the measured da ta contains little information about some of the parameters, 
the parameter estimates become increasingly dependent on the parameters specifying 
the time update . Standard time updates (3) or (4) take into account neither the 
information contained in previous data, nor prior information about the possible 
values of the estimated parameters. Thus, numerical problems could eventually 
occur, even in the case of numerically robust factorised implementations. 
Several techniques to cope with poor informative da ta have been proposed in the 
current l i terature. Directional forgetting [6, 18], the constant trace approach [22] 
and the shift invariant directional forgetting are only a few examples. Their origi-
nal formulation has been designed for efficient sequential implementation. If these 
240 J. KADLEC 
methods are directly mapped onto a systolic implementation, the global feedback 
loops in the signal flow degrade considerably the data throughput of the array. Their 
formulation must therefore be revised in the pipelined transputer implementation to 
avoid low processor utilization. This paper addresses this issue. 
The concept of regularized estimator 
To overcome these limitations, we shall design a systolic regularized estimator using 
the general concept of parameter tracking introduced in the Bayesian formulation 
by Kulhavy in [16] and analyzed in [17]. 
Generalized exponential forgetting time update. Specified by 
^ J ( P * ) - 1 > 0 ; 0 < A » < 1 . 
Ak = {l+^P^iPk)-1}'1 
4+i |* = Ak9k\k + (I-Ak)9*k 
Pk+i\k = {A,P^
1 + ( l - A f c ) ( P ; ) -
1 } " 1 . (5) 
The time update step (5) could be understood as the introduction of an additional, 
alternative reference matrix (Pk )
- 1 > 0 and parameter estimate 6k exhibiting the 
role of an "alternative value" or "regularizing point" in the specification of the time 
update step. 
Regularized information filtering 
The vectors 6k\k ; 9k and (n, n) matrices Pk\k ; Pk can be propagated in transformed, 
inverse or factorised forms. The information filter works with the transformed re-
gression parameter estimates denoted as 
^ . 1 * = ^ * ! * ; dkHPk)-l4t (6) 
Information data update . 
pk\l = m l i + w i 
dk\k = dk\k-i + fk Vk 
0k\k = Pk\kdk\k- (7) 
Regularized information time update . Specified by 
^ ; T O - 1 > 0 ; 0 < ^ < 1 -
dk+i\k = ^kdk\k + (1 - Ajb)c?£ 
pk+\\k = ^ - K I - ^ H P ; ) - 1 . (8) 
Transputer Implementation of Block Regularized Filtering 241 
The information d a t a update (7) is an inverse equivalent of (2) and the regularized 
information t ime update (8) is identical to (5) as follows directly from (6) substituted 
into (5). 
One of possible square root systolic implementations of (7), (8) is proposed in 
next section. 
3. S Q U A R E R O O T SYSTOLIC IMPLEMENTATION 
We propose implementation of the regularized information filter (7), (8) as an estima­
tor implementing recursive Q R decomposition technique [25], based on Gent leman-
Kung array [24], [21], [28]. 
T h e Q R algorithm [5, 24] implements the d a t a update of the factorised informa­
tion matr ix Pk~k (7) and the d a t a update of vector r ^ defined 
Pk\k - Rk\k Rk\k 
~k\k - Rk\k 0*|Jb = 
where Rk\k > 0 is upper triangular matr ix. 
R e g u l a r i z e d Q R f a c t o r i z a t i o n 
Let us define analogically to (9) the factorization 
>-т Rk\kdк\к (9) 
(PT1 = E>*T D * Kк Kк 
r fc — Rк к = (Rк)~ dк- (10) 
Pк\к dк\к 
dк\к * 








' W)"1 dl ' = õ*T õ* Kк Kк ~ 
Щ 4 " 
o * 
т D * Kк 
0 * 
(12) 
where R*k ; Rk\k ', R*k are upper triangular and * is a "don' t care" scalar term. 
Then the regularized information filter (7) - (8) can be implemented in the square 
root factorised form (13), (14). 
S q u a r e r o o t regu lar ized d a t a u p d a t e . 






S q u a r e r o o t regu lar ized t i m e u p d a t e . Specified by 
R4>0;0< A* < 1 . 
Rk+i\kpk+i\k = ^kRk\kRk\k + ( i - ^k)R\ R*k-
It follows directly from compare with (7), (8) and the definition of factorization (9) 
and ( 1 0 ) - ( 1 2 ) . 
(14) 
242 J. KADLEC 
Systolic implementation of the regularized estimator 
The proposed regularized parallel estimator implements (13) and (14). This is in 
fact the standard systolic array for the QR filtering with (n) * (n -f l ) /2 cells. The 
cells store and update the triangular matrix Rk\k and regression parameter estimate 
is updated in the transformed form as rk\k on the extended row of n cells. 
The array requires n + 1 time steps between measurements to complete the re-
cursion and can be formally described as follows. 
Systolic regularized estimator array. 
Ҷ>k 
Q 
г p T _ ^ p T -i 
л J k | J k - l ""-JfclJk 
т т 
Уk Г Jk|Jk-l ""* ГÅ:|Jfc 
<= 1 => QR array 
(l-A,)V*Дf Q...Q Л
1 / 2 f í т -^ Rт 
Лk лjfc|Jfc " ^ ЛJfc + l|Jfc 
(1-A,)'/Vf = > \ l l 2 r T -+rT лk ' k\k тk+l\k 




The array implements one QR update (15) of the data measurement (13) and n 
additional QR updates (16) representing the "addition" of (1 — Xk)
x^2R*k
T and (1 — 
Ajt)1/2r^T as n columns of "artificial" data to implement the time update (14). 
If additional information about the changes of the parameter estimates is avail­
able, it can be built into the regularizing terms r*k ; R*k > 0 related to 9*k and {Pk)~
x 
through the definitions (6), (12). 
The regularized array (15)-(16) provides as a by-product of the data update (15) 
the output filtration error ek\k equally to the Gentleman-Kung array [21], [24], [28]. 
Regression parameter estimates extraction 
The function of array (16) can be combined with the n data updates necessary for 
the parameter estimate extraction technique known as " weight flushing " [25, 28]. 









The regression parameter estimates can be captured as a serial stream of n items 
from the bottom of the array. 
4. BLOCK REGULARIZED ALGORITHM 
In [12], [13] a novel block regularized algorithm has been described. It is in fact a 
modification of the regularized exponential forgetting algorithm which allows parallel 
implementation. 
Transputer Implementation of Block Regularized Filtering -4o 
The Block Regularized Algorithm [12] overcomes both the Pipelining and Com-
plexity problems arising in the implementation of the regularized exponential algo-
ri thm ( 8 ) - ( 7 ) . It is achieved by the additional assumption about the alternative 
covariance P* and the alternative parameter vector 9*. If they both remain constant 
within an block i.e. 
P* = Pjfc*.iv = P*-jv+1 = ••• = -%•_. (18) 
0* = Ql-N = °t-N+i = --- = 9l^l (19) 
the computat ion will be simplified as the ( P * ) " 1 and (P* ) - -0* terms can be accu-
mulated and added in every Nth iteration as shown in (2Q) - /22) 
Pk\k = A f c - i • • • Xk-NPk~-N\k-N + 
+ \k-l • • • \k-N+l<Pk-N+l<Pk-N+i + Aifc_i • • • \k_N+2Vk-N+2<p'k-N+2 
+ ••• + Xk-m-lVk-l + <Pk<Pk + \[k~.N,k](P*)~1 ( 2 0 ) 
dk\k = \k-l • •-\k-Ndk-N\k-N 
+ \k-l • • • \k-N+l<Pk-N+iyk-N+l + Ajfc_i • • • Aifc_Ar+2^ifc-JV+2yifc-.V+2 
+ ••• + \k-i(pk-iyk-i + <pkyk + \k-N,k](
p*)"10* (21) 
Ok\k = pk\kdk\k (22) 
where 0 < \[k-N,k] is a scalar accumulated weight given by (23) 
A[/t_iv,jfc] — Afc_i • • • A)t_7v+i(l — Afc_jv) + Ajfc_i • • • Ajk_Ar+2(l — Afc_Ar+i) 
+ ••• + A * _ i ( l - A * _ 2 ) + ( 1 - A „ _ i ) . (23) 
The regularized exponential algorithm is thus modified by these assumptions (18), 
(19) into The Block Regularized Algorithm [12], 
In the block regularized algorithm the alternative covariance matr ix is added only 
every Nth iteration but in such a way as to achieve exactly the same results as the 
algorithm which adds it every step. Exactly the same parameter estimates will be 
provided, but they will be only available at every Nth iteration. 
Figure 1 demonstrates the block regularized algorithm. For a clearer comparison, 
there is a trace of the parameter estimates of the original regularized exponential al-
gorithm. Notice, tha t both algorithms start and end at precisely the same conditions. 
The block regularized algorithm reduces considerably the number of computations 
necessary to get to the final point. 
Note tha t 6*, P* can change in the time instants k, k + N,... but must remain 
constant in the intermediate N steps. Usually N = n, where n is the number of 
estimated parameters. 
Using the QR orthogonal rotation concept (15), the factorised block restricted 
algorithm can be rewritten into the final QR form, updating the upper triangular 
matr ix Rk\k and vector rk\k-
The final QR block restricted algorithm (24) is a sequence of QR orthogonal 
rotations of the type (15), applied for the real data measurements. It is followed 
by the accumulated regularization of the type (16) performed on the boundaries of 
blocks, combined with the weight flushing (17) of the current parameter estimates. 
244 J. KADLEC 
Ífc-З — гk-2 
L_I = ðï , = * -3 - Vk-2 — Џk-1 Џ 
Fig. 1. Block Regularized Algorithm for time k — 3 to k. 
Block regu lar ized a lgor i thm. The algorithm is specified by the parameters: 
R* > 0; \k-N, • • •, Ajb_i; 0 < Xj < l;N > n. 
fk-i 
R — Rk-N\k-N ;
 r = rk-N\k-N 
c y c l e for j = N - 1, N - 2 , . . . , 0 : 
R * 1 ._ n 
o * J - VW-i 
e n d of cyc le for j 
Rk = R; rk = f 
0Jfc-At = Rk_N
rk-N 
„ * _ o * / I * 
r/fc-At — ^k-N^k-N 
cyc le for i = n, n — 1 
: = Qi.ifc 





e n d o f c y c l e for i 
R 
\/X[k-N,k] R*_jv[-l 











Transputer Implementation of Block Regularized Filtering 245 
5. T R A N S P U T E R ARRAY D E S C R I P T I O N 
The t ransputer array implementing the algorithm (24) is plotted in Figure 2. The 
detailed cell description depends on the specification of the basic "rank-one" update 
at the cell level. See [21], [24] [25], or [28] for different options. In this paper the Q R 







^ з [ l ] ; l ^ + 2 [ 2 ] ; 1 
PJfe+2.1 


























'UM = E.S*U[M]0fc-4[.] 
î ^ - 4 [ l ] 
output 
Fig. 2. Block Regularized QR Array Concept. 
Let us discuss first, how the array selects the new position of the alternative 
est imate 6* for each block of N d a t a measurements in (24). 
246 J. KADLEC 
The basic areas marked in Figure 2 by ABC and BCDE store and update matr ix 
Rk and r^ (respective R and r) as in (24b) and (24g). 
The automatic selection of 0*k_N for each block requires a sequence of additional 
computations (24d) - (24f ) . The equation (24d) computes the parameter estimate 
0k-N• In (24e) the selection of 0*._N is made. Consequently the alternative quadratic 
form J*(Ok) for the block [k — N,k] will have its minimum positioned in the point 
0*k_N = h-N = Rk-N^k-N-
Notice, tha t by using "old" variables Rk_N ', ?k-N which have been available N 
time steps before, Ok-N can be computed off-line at this time. 
Finally, once the alternative parameter 0*k_N and R*k-N f °
r the block [k — N,k] 
are chosen, the vector r*k_N as a matr ix by vector product R*k_N0*k_N is computed 
in (24f). The vector r*k_N, weighted by A[fc_jv,fc] (23), is finally "added" to r& in 
(24g) to realize the accumulated regularization. 
The block regularized algorithm (24) produces as a side effect the parameter esti-
mates Ok-N ; Ok . . . used for the automatic positioning of the alternative quadratic 
form J*(Ok). These parameters have to be distinguished from the parameter es-
t imate Ok-N\k-N i Ok\k • • • which can be computed using the formulae (22h). In 
Figure 3 the difference between Ok and Ok is illustrated. 
The advantage of selecting the alternative quadratic form J*(Ok) in the semi 
constant points Ok-N ] Ok • • •• is an efficient parallel implementation. The compu-
tation oiOk-N (24d) can be started in parallel with the accumulated regularization 
(24g) computation in the previous [k — 2N, k — N] block. In contrast, computation 
of Ok-N\k-N (24h) can start only after finishing of the accumulated regularization 
(24g) of the transformed variables in the previous block ie. after all the computa-
tions of the block [k — 2N, k — N] and therefore it would cause a decrease of the 
throughput of the final pipelined implementation. 
The extended row of n cells, DEFG, performs the matr ix by vector multiplication 
(24f). The matr ix R*k-N is loaded from the north to ABC area. When loaded, R* 
starts to move (with the element RI-NI.71^] h f S 0 through BCDE in the easterly 
direction, to realize (24f) in DEFG. 
The vector 0*k_N moves (the item 0*k_N[n] first) to the array marked DEFG from 
the south to realize (24f). The resulting product rk_N[i] emerges from the northernly 
output of the DEFG array. It is directly loaded back to the north input of the array 
BCDE (synchronously with the rows R*k-N\^\ loaded to the array ABC) to realize 
the accumulated "addition" (24g). 
Simultaneously, both arrays ABC and BCDE provide the parameter estimate 
extraction known as "weight flushing" [25, 28] (24d) by inserting (n, n) unit ma-
trix and a column of zeros on the array data inputs (synchronously with the rows 
[R\-N\}\ > rfc-Arlj] ])• The parameter extraction "filter" works with the "frozen" 
matr ix Rk-N a n d vector fk-N, stored in the areas ABC and BCDE. 
Systo l i c array character is t ics 
The concept of the systolic implementation of the regression parameter estimator 
is presented in Figure 2. Figure 3 presents the use of the input data buffering 
Transputer Implementation of Block Regularized Filtering 247 
and multiplexing, to provide approximately 50 % t h throughput in compare with the 
systolic triangular RLS filter array. 
The cells of the array work in two phases. These phases are distinguished in the 
cells by the sign of an index j propagated through the array. The value of the index 
gives to the cells the timing information within both phases. 
Afc+3 Vfc+4[1] <Pk+4[2] y>fc+4[3] yfc+4 
Afc+2 V>fc+3[1] Pfc+3[2] Vfc+3[3] yfc+3 













.k^[i,l] = Rl_4[i,l]-)I[iJ] 
-•U*1 = E!5^4['.^-4 output 
Fig. 3. Transputer implementation of Block Regularized QR Array. 
In the first phase a block of N > n da ta is processed (n is the number of estimated 
regression parameters) using the standard function of the RLS systolic information 
filter ( N rank-one d a t a updates). 
In the second phase the processing of the input data temporary stops and the 
248 J. KADLEC 
array computes (in n steps, each with complexity comparable to a rank-one data 
update) these combined functions: 
(1) The computat ion of the regression parameter estimate by "weight flushing". The 
parameters emerge as a serial stream of n items. 
(2) The positioning of the minimum of the regularizing semi-constant quadratic form 
(the point $*) to the point of the recently flushed-out regression parameter estimate. 
(3) The regularization, accumulated over last N data measurements. 
The resulting parallel implementation of the estimator has several remarkable 
features: 
• The transputer implementation is completely pipelined with processor utiliza-
tion close to 100 %. 
• The processes perform only the neighbor-to-neighbor one-directional commu-
nication. 
6. CONCLUSIONS 
The block regularized algorithm is described and its square root form derived, which 
allows parallel t ransputer implementation under the assumption, that the alternative 
regularizing cost function is constant within data blocks and adjusted to the new 
value on the boundaries of blocks only. This restriction (see Fig. 1) does not reduce 
significantly the effectiveness of the general regularized concept because the array 
implementation adjusts the minimum of the semi-constant alternative cost to follow 
the current regression parameter estimate. 
It is then shown how it could be implemented on a systolic architecture. Fig-
ure 2 presents the principle, Figure 3 describes the final structure of a compacted 
transputer array, with the input data buffering and multiplexors. 
The final block regularized transputer architecture (Fig. 3) is half as fast as the 
s tandard architecture of McWhirter [24] for the exponentially weighted recursive 
least squares without regularization. It should be noted again, that the processes 
perform nearest-neighbor, one-directional communication. There are no long global 
connections in Figure 3. Data only passes in one direction on any given link. The 
feedback associated with adjustment of the new position of the alternative cost 
function is completely compatible with the pipelined implementation. 
The side effect of the parallel square root information filter implementation is that 
the regression parameter estimates are provided by the accumulated regularization 
process on the boundaries of the computational blocks. The direct use of these 
parameters can be found for example in the design of adaptive optimal controllers, 
or in the high resolution adaptive spectral analysis. 
(Received November 10, 1993.) 
Transputer Implementation of Block Regularized Filtering 249 
REFERENCES 
[I] G. J. Bierman: Factorisation Methods for Discrete Sequential Estimation. Academic 
Press, New York 1977. 
[2] M. J. Flynn: Very high speed computing systems. IEEE Proc. 5Ą (1966), 1901-1909. 
[3] F. M. F. Gaston and G. W. Irwin: The systolic approach to information Kalman filter-
ing. Internat. J. Control 15(1989), 1, 225-228. 
[4] F . M . F . Gaston, G. W. Irwin and J .G. Mc Whirter: Systolic square root covariance 
Kalman filtering. J. VLSI Signal Processing 2 (1990), 37-49. 
[5] W. M. Gentleman and H.T. Kung: Matrix triangularisation by systolic arrays. Proc. 
SPIE, Vol. 298, Real Time Signal Processing IV, 1981, pp. 19-26. 
[6] T. Hägglund: The problem of forgetting old data in recursive estimation. In: Proceed-
ings of the IFAC Workshop on Adaptive Systems in Control and Signal Processing, 
San Francisco, California 1983. 
[7] C. A.R. Hoare: Communicating Sequential Processes. Prentice Hall, Englewood Cliffs, 
N. J. 1985. 
[8] K. Hwang and F. A. Briggs: Computer Architecture and Parallel Processing. McGraw 
Hill, New York 1985. 
[9] J. Kadlec: A joint criterion for exponential directional and mixed parameter tracking. 
In: Preprints of the International Symposium on Adaptive Systems in Control and 
Signal Processing (ACASP'92) Conference, Grenoble 1992, pp. 687-692. 
[10] J. Kadlec: A recursive Gramm-Schmidt identification with directional tracking of 
parameters. In: Preprints of the the 9-th IFAC/IFORS Symposium on Identification 
and System Parameter Estimation, Budapest 1991, pp. 1707-1712. 
[II] J. Kadlec, F. M. F. Gaston and G. W. Irwin: Implementation of the regularized pa-
rameter estimator. In: Proceedings of the 1992 IEEE Workshop on VLSI and Signal 
Processing, Napa 1992, pp. 520-529. 
[12] J. Kadlec, F. M. F. Gaston and G. W. Irwin: Parallel implementation of restricted 
parameter tracking. In: Mathematics in Signal Processing (J. McWhirter, ed.), Oxford 
Publishers Press., Oxford 1992, pp. 315-326. 
[13] J. Kadlec, F. M. F. Gaston and G. W. Irwin: The block regularised parameter estimator 
and its parallel implementation. Automatica 31 (1995), 8, 1125-1136. 
[14] J. Kadlec and N. N. Nakhaee: Alpha Bridge for MATLAB 4. In: World Transputer 
Congress WTC 95, UK 1995 (submitted). 
[15] R. Kulhavý: Directional tracking of regression-type model parameters. In: Preprints 
of the 2nd IFAC Workshop on Adaptive Systems in Control and Signal Processing, 
Lund 1986, pp. 97-102. 
[16] R. Kulhavý: Restricted exponential forgetting in real-time identification. Automatica 
55(1987), 589-600. 
[17] R. Kulhavý and B. Zarrop: On a general concept of forgetting. Internat. J. Control 58 
(19993), 4, 905-924. 
[18] R. Kulhavý and M. Kárný: Tracking of slowly varying parameters by directional 
forgetting. In: Preprints 9th IFAC World Congress, Budapest 1984, pp. 178-183, 
1984. 
[19] H.T. Kung, C. E. Leiserson: Algorithms for VLSI processor arrays. In: Introduction 
to VLSI systems (C. Mead and L. Conway, eds.), Addison-Wesley, Reading, Mass. 
1980. 
[20] S.Y. Kung: VLSI Array Processors. Prentice HaU, Englewood Cliffs, N. J. 1988. 
[21] F. D. Ling, D. Manolakis and J. G. Proakis: A recursive modified Gгam-Schmidt al-
gorithm for least squares estimation. IEEE Trans. Acoust. Speech Signal Process. ЗĄ 
(1986), 829-835. 
250 J. KADLEC 
[22] L. Ljung and S. Gunnarsson: Adaptation and tracking in system identification - a 
survey. Automatica 26 (1990), 7-21. 
[23] L. Ljung and T. Soderstrom: Theory and Practice of Recursive Identification. MIT 
Press, Cambridge, MA 1983. 
[24] J. G. McWhirter: Recursive least squares minimisation using a systolic array. In: Proc. 
SPIE 431, Real Time Signal Processing VI, pp. 105, 1983. 
[25] J .G. McWhirter: Algorithmic engineering - an emerging discipline. SPIE Vol. 1152, 
Advanced Algorithms and Architectures for Signal Processing IV, pp. 2-15, 1989. 
[26] J .E. Parkum, N.K. Poulsen and J. Hoist: Recursive forgetting algorithms. Internat. 
J. Control 55(1992), 109-128. 
[27] V. Peterka: Bayesian approach to system identification. In: Trends and Progress in 
System Identification (P. Eykhoff, ed.), Pergamon Press, Eindhoven 1981, Chap. 8, 
pp. 239-304. 
[28] T.J . Shepherd, J .G. McWhirter and J .E. Hudson: Parallel weight extraction from 
a systolic adaptive beamformer. In: Proc. IMA Internat. Conf. on Mathematics in 
Signal Processing, Warwick 1988, pp. 775-790. 
Ing. Jiří Kadlec, CSc, Ústav teorie informace a automatizace AV ČR (Institute of In­
formation Theory and Automation - Academy of Sciences of the Czech Republic), Pod 
vodárenskou věží 4, 182 08 Praha 8. Czech Republic. 
