Interleaved pipeline structures for two-dimensional recursive filtering by Lu, Tongxin & Azimi-Sadjadi, Mahmood R.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 3, NO. 1, FEBRUARY 1993 87
TABLE I
COMPARISONS OF FSA, TSS, MTSS, AND DSWA/IS WITH FOUR CIF FORMAT SEQUENCES ("CLAIRE," "MISSA," "SALESMAN," AND "SWING")

Interleaved Pipeline Structures for
Two-Dimensional Recursive Filtering
Tongxin Lu and Mahmood R. Azimi-Sadjadi
TABLE II
COMPARISONS OF FSA, TSS, MTSS, AND DSWA/IS ALGORITHMS

w     BMA        Difference   of DFD       PSNR     Positions
                 (DFD)        (bits/pel)   (dB)     (per frame)
7     FSA        2039845      4.766        31.12    295924
      TSS        2278935      4.863        29.67    33750
      MTSS       2091261      4.794        30.77    42513
      DSWA/IS    2125939      4.809        30.57    24506
21    FSA        2584641      5.045        28.98    2397136
      TSS        3366908      5.337        26.18    55350
      MTSS       3089561      5.241        27.14    76351
      DSWA/IS    3336509      5.323        26.45    41904

technique. Four CIF and four CCIR 601 format video sequences,
each of 30 frames, are used as the test pictures. All entries
in Tables I and II represent the average results using a total of
120 frames. The motion-vector search is based on the luminance
component with a search block of 16 × 16 pels. Table I
compares the average performances for the four CIF sequences at
maximum motion displacements of w = 7 and 21. As can be
seen, DSWA/IS is slightly superior to TSS in performance, and
its computations are reduced by about 35% and 44% for
w = 7 and 21, respectively. The experiment with the CCIR 601
pictures is shown in Table II. It also shows that DSWA/IS is
more efficient than TSS. From our experimental results,
it can be concluded that DSWA/IS, which adequately adjusts
the search-window size in every stage, can perform better than
TSS at a lower computational cost.
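As a rough check on the Table II comparison, the sketch below computes the relative reduction in search positions of DSWA/IS with respect to TSS. It assumes that the last column of Table II (positions per frame) is the measure of computation meant in the text; the dictionary layout and the function name are illustrative only.

```python
# Search positions per frame for TSS and DSWA/IS, copied from Table II.
POSITIONS = {
    7: {"TSS": 33750, "DSWA/IS": 24506},
    21: {"TSS": 55350, "DSWA/IS": 41904},
}


def saving_percent(w):
    """Reduction in search positions of DSWA/IS relative to TSS, in percent."""
    tss = POSITIONS[w]["TSS"]
    dswa = POSITIONS[w]["DSWA/IS"]
    return 100.0 * (tss - dswa) / tss


for w in sorted(POSITIONS):
    print(f"w = {w}: {saving_percent(w):.1f}% fewer search positions than TSS")
```

Under this assumption the reductions for the CCIR 601 pictures come out near 27% (w = 7) and 24% (w = 21).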
V. CONCLUSIONS
A new search algorithm, DSWA/IS, is proposed in this paper.
This algorithm shows lower DFD and saves 24-44% of the
computations relative to TSS. Our experimental results also
demonstrate that DSWA/IS is better than TSS even if the search
window size is large (43 × 43 pels). Hence, it is concluded that
this algorithm can be applied to a wide range of applications
such as video phone, video conference, and HDTV.
Abstract-This paper presents new parallel and pipeline structures for
real-time 2-D recursive filtering. For general scalar 2-D recursive filters,
a 2-D multiple-interleaved pipeline architecture is introduced that is
compatible with the nature of the image-scanning scheme. Using this
new structure, the sampling period can be a fraction of the time needed
for one scalar addition operation and the delay is only a few samples. In
addition, this structure does not need any I/O buffers for real-time
implementation.
I. INTRODUCTION
The need for high-speed digital image processing became
evident with the increasing utilization of the relevant techniques
in medical, geophysical, and military applications. Many of these
applications require acquisition, processing, and display of images
in fractions of a second. A number of approaches have
been suggested to achieve high-speed processing for 2-D recursive
filtering using systolic array processors [1]-[11]. The "look-ahead
computing" [4] and the "interleaved pipeline" [6] schemes
have been proposed by several authors, mainly for 1-D filters.
With look-ahead computing, an output sample can be computed
without the need to use the results of several previous output
samples. By interleaving such pipelines, several output samples
can be computed concurrently.
In this paper, new parallel and pipeline architectures are
introduced which can be used to perform high-speed 2-D scalar
recursive-filtering operations. In Section II, a pipeline architecture
for canonical implementation of 2-D recursive filters is
given that requires a pixel-processing period of only one
single-scalar multiplication time plus one single-scalar addition time.
The delay is one multiplication time plus two addition times. By
interleaving multiple look-ahead pipelines in parallel form with
look-ahead computing [4], [6], [8], [9], a multiple look-ahead
architecture is developed. If a quadruple look-ahead pipeline is
used, the sampling period can be only one single addition time,
and the delay from receiving an input pixel to delivering the
corresponding output pixel can be one scalar multiplication plus
two addition times. If a higher sampling rate is desired, more
than four look-ahead architectures can be used. Section III gives
concluding remarks and discussions on the proposed architectures.
Manuscript received February 12, 1991; revised August 27, 1992. This
paper was recommended by Associate Editor Peter Pirsch.
The authors are with the Department of Electrical Engineering,
Colorado State University, Fort Collins, CO 80523.
IEEE Log Number 9204662.
1051-8215/93$03.00 © 1993 IEEE















$$
y(i,j) = \sum_{p} \sum_{\substack{q \\ (p,q) \neq (0,0)}} d_{p,q}\, y(i-p, j-q) + \sum_{p} \sum_{q} c_{p,q}\, x(i-p, j-q) \qquad (2)
$$
Note that a_{i,j} and b_{i,j} are set to zero if the (i, j) pair is outside
the region of support of the filter. Using (2), the computations of
the L successive output pixels y(i, j - L + 1), ..., y(i, j) can
be performed simultaneously in L pipelines with a time difference
of 1/L of the pipeline clock cycle. The region of support
Fig. 1. Support regions for the quarter-plane causal recursive filter.
speed can, of course, be much higher for the parallel block
implementation. Therefore, it is important to search for other
concurrent algorithms and structures which can provide not only
the minimum latency benefit of this conventional pipelined
implementation but also the high-speed capability of the
block-implementation scheme. This objective can be met by using
multiple-interleaved look-ahead pipeline structures, which are
discussed next.




By recursive application of (1) and substituting y(i, j - 1),
y(i, j - 2), ..., y(i, j - L + 1) into the expression for
y(i, j), y(i, j) can be calculated without the need for output
pixels y(i, j - L + 1) to y(i, j - 1). Consequently, all L output
pixels, y(i, j - L + 1), ..., y(i, j - 1) and y(i, j), can be
calculated independently and concurrently. It can be shown that
$$
\qquad + \sum_{p=0}^{N-1} \sum_{\substack{q=0 \\ (p,q) \neq (0,0)}}^{N-1} b_{p,q}\, y(i-p, j-q) \qquad (1)
$$
where {x(i, j)} and {y(i, j)} represent the input and the output
arrays, respectively, and a_{p,q} and b_{p,q} are the coefficients of the
N × N order 2-D recursive filter. The regions of support for a
third-order filter are shown in Fig. 1.
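The difference equation (1) can be evaluated directly in raster-scan order. The following is a minimal sketch rather than the pipelined hardware described in the paper: it assumes that pixels outside the image are zero, and the function name and the 2 × 2 example coefficients are illustrative choices, not values from the paper.

```python
def recursive_filter_2d(x, a, b):
    """Apply the N x N quarter-plane causal recursive filter of Eq. (1).

    x    : input image as a list of rows
    a, b : N x N coefficient lists; b[0][0] is ignored
    Pixels outside the image are treated as zero (an assumption of this sketch).
    """
    n = len(a)
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):              # raster scan: row by row
        for j in range(cols):          # left to right within a row
            acc = 0.0
            for p in range(n):
                for q in range(n):
                    if i - p < 0 or j - q < 0:
                        continue       # outside the image: contributes zero
                    acc += a[p][q] * x[i - p][j - q]
                    if (p, q) != (0, 0):
                        acc += b[p][q] * y[i - p][j - q]
            y[i][j] = acc
    return y


# Impulse response of a small 2 x 2 example filter (illustrative values).
a = [[1.0, 0.0], [0.0, 0.0]]
b = [[0.0, 0.5], [0.5, 0.0]]
impulse = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(recursive_filter_2d(impulse, a, b))
```

Each output pixel needs the previous output in the same row, y(i, j - 1), which is the dependency that limits the sampling period of a single pipeline.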
Fig. 2 shows the signal flow for a third-order (N = 3) 2-D
filter in a canonical form. In this figure, z_2^{-1} corresponds to the
delay between pixels along a row and z_1^{-1} corresponds to the
delay between adjacent rows during the row-scanning process.
This signal flow is ideally suited for pipeline and systolic
implementations. There are nine nodes in the pipeline path of this
figure at which the partial sums of products are generated and
propagated from left to right. The inputs to the right three
nodes are x(i, j) and y(i, j - 1), and to the left six nodes are
x(i, j - 1) and y(i, j - 1). At the present moment, the three
right-most nodes are working on the output pixels along the ith
row, the three nodes in the middle are calculating the partial
sums for the output pixels along row i + 1, and the left-most
three nodes are working on partial sums for the outputs of
row i + 2. For example, while the right-most node performs the
operations for the output y(i, j), the left-most node is generating
the first partial sum for the future output pixel y(i + 2, j + 1).
The advantage of this pipeline structure over the parallel
block-implementation scheme [7]-[9] is that it provides the minimum
latency, which is one multiplication time plus two addition
times. The latency is defined as the time interval elapsed from
the time x(i, j) is applied to the time y(i, j) is generated. In
addition, this direct scalar implementation is compatible with
the raster scanning of images, while for the parallel
vector-implementation schemes, such as block implementation, the
scanning sequence is not followed and as a result some extra latencies
are commonly introduced. The minimum achievable sampling
period for this single pipelined structure is limited to
one multiplication period plus one addition period. This can be
achieved by interleaving the multiplication and addition operations
for the term {+a_{0,0} x(i, j)} at one addition time in advance
of those needed for {+b_{0,1} y(i, j - 1)}. The achievable sampling
A. Direct Pipeline and Systolic Implementation
Consider a linear time-invariant causal 2-D recursive filter
described by the following 2-D difference equation:
$$
y(i,j) = \sum_{p=0}^{N-1} \sum_{q=0}^{N-1} a_{p,q}\, x(i-p, j-q)
$$
II. LOOK-AHEAD COMPUTING AND
MULTIPLE-INTERLEAVED PIPELINE STRUCTURE
In this section, a highly concurrent structure for 2-D recursive
filtering is introduced. The conventional direct implementation
of the 2-D recursive filtering is first briefly reviewed. Unlike
block implementation [7]-[9], scalar implementation does not
provide massive parallelism. However, the 2-D block-implementation
scheme introduces a delay proportional to the number of
rows in a block and hence becomes incompatible with the
row-scanning process. This inevitable delay is undesirable,
particularly for real-time applications. To overcome these
deficiencies, a multiple-interleaved look-ahead implementation scheme is
introduced which is ideally suited for real-time image-filtering
applications.
Fig. 2. Signal flow of the 2-D recursive filter.
Fig. 3. Support regions for the look-ahead structure.
The quadruple look-ahead pipelines can be used to process
1024 × 1024 images at a rate of 30 frames/s. Using 16 interleaved
pipelines, 2048 × 2048 images can be processed at a rate
of 30 frames/s without I/O buffering.
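The two throughput figures above can be checked with simple arithmetic. The sketch below assumes that the chip timings quoted in the text are a multiplication time of 45 ns and an addition time of 19 ns, that a single pipeline period is one multiplication plus two additions, and that L interleaved pipelines divide that period by L. Blanking intervals of the scanning process are ignored, and the function names are illustrative.

```python
T_M_NS = 45.0     # multiplication time quoted in the text
T_A_NS = 19.0     # addition time quoted in the text (assumed reading)
FRAME_RATE = 30   # frames per second


def sampling_period_ns(num_pipelines):
    """Overall sampling period when one pipeline period is shared by L pipelines."""
    pipeline_period = T_M_NS + 2 * T_A_NS
    return pipeline_period / num_pipelines


def required_pixel_period_ns(side):
    """Time available per pixel for a side x side image at FRAME_RATE."""
    return 1e9 / (side * side * FRAME_RATE)


for pipelines, side in [(4, 1024), (16, 2048)]:
    have = sampling_period_ns(pipelines)
    need = required_pixel_period_ns(side)
    print(f"L = {pipelines}: {have:.2f} ns per pixel, "
          f"{side} x {side} at 30 frames/s allows {need:.2f} ns")
```

In both cases the achievable sampling period is shorter than the time available per pixel, which is consistent with the claim that no I/O buffering is needed.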
The look-ahead hardware cost for 2-D recursive filtering can
be obtained in terms of the number of required basic computa-
T_M = 45 ns
T_A = 19 ns




single pipeline period. Assuming that a multiplication takes
twice as long as an addition, the overall sampling period
using the quadruple pipeline structure may be a single addition
time. However, this sampling period is not the minimum achievable
sampling period for the multiple-interleaved pipeline
implementation. If more interleaved pipelines are used, the sampling
period may be even less than an addition operation time. Note
that in this case the data flow in Fig. 6 needs to be rearranged
because the sampling speed exceeds the pipeline data-propagation
speed. However, the arrangement of the pipeline is quite
flexible.
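The interleaving argument can be illustrated with a small scheduling sketch. It measures time in units of the overall sampling period and assumes that column 4m of a row is assigned to pipeline A, 4m + 1 to B, and so on, that a new pixel starts every sampling period, and that an output becomes available one pipeline period (four sampling periods) after it starts. These are simplifying assumptions for illustration, not the clocking scheme of Figs. 6 and 7.

```python
L = 4                 # number of interleaved pipelines
PIPELINES = "ABCD"


def pipeline_for_column(j):
    """Pipeline that computes y(i, j) under a round-robin assignment."""
    return PIPELINES[j % L]


def start_time(j):
    """Sampling period in which the computation of y(i, j) starts."""
    return j


def finish_time(j):
    """A single pipeline period lasts L sampling periods."""
    return start_time(j) + L


j = 8
for k in range(L, 2 * L):
    dep = j - k
    print(f"y(i, {j}) uses y(i, {dep}) from pipeline {pipeline_for_column(dep)}, "
          f"ready at t = {finish_time(dep)}, needed from t = {start_time(j)}")
```

With look-ahead by four columns, the nearest same-row output that is needed, y(i, j - 4), is finished exactly when y(i, j) starts, whereas y(i, j - 1) would not be ready in time.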
III. CONCLUSION AND DISCUSSION
The interleaved look-ahead pipeline in this paper provides an
extremely high-speed (both in sampling period and latency) tool
for 2-D scalar recursive filtering. This is achieved by combining
the look-ahead computing, pipelining, and interleaving schemes.
In addition, its time complexity and processing delay are no
longer functions of the filter order. With today's integrated-circuit
technology, such a special-purpose structure can be directly
connected to the image-acquisition system for real-time 2-D
recursive-filtering operations. Using the available integrated-circuit
chips [1], the quadruple-interleaved look-ahead pipeline
architecture can process 1024 × 1024 size images directly at the
rate of 30 frames/s independent of the filter order. However,
1024 × 1024 is not an upper bound on the sampling rate achievable
by the multiple-interleaved look-ahead pipelines. Let us consider
the same integrated-circuit chips as listed in [1] with the following
specifications:
for (2) and the corresponding signal flow graph are shown in
Figs. 3 and 4, respectively.
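The substitution that leads from (1) to the look-ahead form (2) can be sketched numerically. The code below is one possible way to obtain the coefficients c_{p,q} and d_{p,q}, not necessarily the authors' construction. It assumes zero values outside the image, stores coefficients in dictionaries keyed by (p, q), and uses illustrative coefficient values and function names.

```python
def look_ahead(a, b, L):
    """Derive look-ahead taps (c, d) from Eq. (1) by repeated substitution.

    a, b : dicts {(p, q): coefficient}; b has no (0, 0) entry.
    L    : look-ahead factor; y(i, j-1) .. y(i, j-L+1) are eliminated.
    """
    c, d = dict(a), dict(b)
    for k in range(1, L):
        g = d.pop((0, k), 0.0)          # weight on y(i, j-k), now removed
        # Replace y(i, j-k) by Eq. (1) evaluated at column j-k.
        for (p, q), v in a.items():
            c[(p, q + k)] = c.get((p, q + k), 0.0) + g * v
        for (p, q), v in b.items():
            d[(p, q + k)] = d.get((p, q + k), 0.0) + g * v
    return c, d


def run(x, ff, fb):
    """Raster-scan filter with input taps ff and output taps fb (zero borders)."""
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(v * x[i - p][j - q] for (p, q), v in ff.items()
                      if i >= p and j >= q)
            acc += sum(v * y[i - p][j - q] for (p, q), v in fb.items()
                       if i >= p and j >= q)
            y[i][j] = acc
    return y


N, L = 3, 4
a = {(p, q): 1.0 / (1 + p + q) for p in range(N) for q in range(N)}
b = {(p, q): 0.1 / (p + q) for p in range(N) for q in range(N) if (p, q) != (0, 0)}
c, d = look_ahead(a, b, L)

x = [[float((7 * i + 3 * j) % 5 - 2) for j in range(12)] for i in range(6)]
y_direct = run(x, a, b)
y_ahead = run(x, c, d)
max_err = max(abs(u - v) for ru, rv in zip(y_direct, y_ahead) for u, v in zip(ru, rv))
print("max difference between direct and look-ahead outputs:", max_err)
print("taps per pipeline:", len(c) + len(d))
```

In this sketch the look-ahead form no longer references y(i, j - 1), y(i, j - 2), or y(i, j - 3), and for N = 3 and L = 4 the number of taps equals 2N^2 - 1 + (L - 1)(2N - 1) = 32, the per-pipeline term in the paper's expression for the number of computational units.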
C. Multiple-Interleaved Pipeline
In this section, the multiple-interleaved look-ahead pipeline
structures and in particular the quadruple-interleaved pipeline
structure are introduced. As shown in Fig. 4, the calculation of
y(i, j) does not depend on output pixels y(i, j - 1), y(i, j - 2),
and y(i, j - 3). This implies that any output pixel can be calculated
without the three past nearest neighbor output pixels along
the same row. Consequently, the four output pixels y(i, j - 3),
y(i, j - 2), y(i, j - 1), and y(i, j) can be calculated concurrently
using four identical structures in an interleaved fashion. Though
the speed of each pipeline structure does not change, the overall
sampling speed is improved by a factor of 4. Fig. 5 shows the
signal flow of one of the four identical structures. In this figure,
the horizontal delay is represented by \hat{z}_2^{-1}, which is the delay of
a single pipeline and is four times the overall sampling period,
i.e., \hat{z}_2^{-1} = z_2^{-4}. Also, \hat{z}_1^{-1} = z_1^{-1} is used to represent the
corresponding line delay. This delay can be implemented using
arrays of linear shift registers or line memories. In this signal
flow, y(i, j - 5), y(i, j - 6), and y(i, j - 7) are assumed to be
available in the appropriate pipeline periods. These are generated
using similar computing structures which are not shown in
Fig. 5. Fig. 6 shows the data flow of the last pipeline in Fig. 5,
which is shown in the dashed-line block. The blocks marked A,
B, C, and D are latches clocked by four interleaved clocks A, B,
C, and D, as shown in the timing diagram of Fig. 7. As shown in
this timing diagram, in one pipeline period the relevant
multiplications and additions at each pipeline node are performed, and
the result is added to the previous partial sum generated at
other nodes along the pipeline. Thus, the speed at which these
partial sums are collected is only one addition time.
Fig. 8 shows the overall configuration of the quadruple-interleaved
pipeline structure with four identical pipelines as denoted
by A, B, C, and D. Note that the assignment of clocks in Figs. 6
and 7 corresponds to pipeline A in this figure. Pipeline A
generates y(i, j - 4) at the last pipeline period. Similarly,
y(i, j - 5) is generated from pipeline D and y(i, j - 6) and y(i, j - 7)
are generated from pipelines C and B, respectively. In this
structure, multiplications and additions for the local partial sums
of products are performed outside the pipeline path. Therefore,
as soon as a partial sum in the next node is available, the
pipeline is able to collect all the partial sums along the entire
pipeline path at the speed of a single addition time for each
node. As shown in Figs. 6 and 7, the paths from y(i, j - 4) to
y(i, j) require only one multiplication plus two additions. This
determines the latency and the sampling period of a single
pipeline. However, because there are four identical pipelines in
the interleaved fashion and each of them generates an output
pixel in every time interval of one multiplication plus two addition
operations, the overall sampling period is one-fourth of the
Fig. 5. Signal flow of the look-ahead structure with interleaving.










TABLE I
VALUES OF N_cu FOR TYPICAL VALUES OF L AND N

      Image size              Filter order (N × N)
L     (30 frames/s)    2 × 2   3 × 3   4 × 4   8 × 8
1     512 × 512            7      17      31     127
2     512 × 512           20      44      76     284
4     1024 × 1024         64     128     208     688
8     1024 × 1024        224     370     640    1856
16    2048 × 2048        832    1472    2176    5632
Fig. 7. Timing diagram of Pipeline A in the quadruple-interleaved
pipeline structure.













Fig. 8. The quadruple-interleaved pipeline structure. Pipelines A, B, C,
and D deliver y(i, 4m), y(i, 4m + 1), y(i, 4m + 2), and y(i, 4m + 3),
respectively.
tional units. Each computational unit is composed of one scalar
multiplier. The following equation gives the expression for the
number of computational units needed as a function of the filter
order N and the look-ahead factor L:
$$
N_{cu} = \left[ 2N^2 - 1 + (L - 1)(2N - 1) \right] L
$$
where N_cu is the number of basic computational units. A list of
the values of N_cu for some typical values of L and N is given in
Table I. As a consequence, the interleaved pipeline provides an
efficient architecture for moderate-size images and filter orders.
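The expression for N_cu is easy to tabulate. The short sketch below evaluates it for the values of L and N listed in Table I; the function name and the loop layout are illustrative.

```python
def n_cu(n, look_ahead):
    """Number of basic computational units for filter order n and look-ahead factor L."""
    return (2 * n * n - 1 + (look_ahead - 1) * (2 * n - 1)) * look_ahead


orders = [2, 3, 4, 8]
print("L   " + "  ".join(f"{n}x{n}".rjust(6) for n in orders))
for look_ahead in [1, 2, 4, 8, 16]:
    row = "  ".join(str(n_cu(n, look_ahead)).rjust(6) for n in orders)
    print(f"{look_ahead:<4}{row}")
```

For L = 1 the formula reduces to 2N^2 - 1, the number of multipliers of the direct form. The table's entry for L = 8 with a 3 × 3 filter differs from the formula value of 416, so one of the two may contain a misprint.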
REFERENCES
[1] A. N. Venetsanopoulos, K. M. Ty, and A. C. P. Loui, "High-speed
architectures for digital image processing," IEEE Trans. Circuits
Syst., vol. CAS-34, no. 8, pp. 887-896, Aug. 1987.
[2] T. Aboulnasr and W. Steenaart, "Real-time array processor for 2-D
spatial filtering," IEEE Trans. Circuits Syst., vol. 35, no. 4, pp.
451-455, Apr. 1988.
[3] R. A. Cohen, J. W. Woods, M. Sanya, and J. F. McDonald, "A
video rate architecture for a fully recursive two-dimensional filter,"
in Proc. ICASSP (Dallas, TX, Apr. 1987), pp. 1973-1976.
[4] K. K. Parhi and D. G. Messerschmitt, "Look-ahead computation:
Improving bound in linear recursions," in Proc. ICASSP (Dallas,
TX, Apr. 1987), pp. 1855-1858.
[5] --, "Concurrent architectures for two-dimensional recursive digital
filtering," IEEE Trans. Circuits Syst., vol. 36, pp. 813-829, June
1989.
[6] --, "Pipeline interleaving and parallelism in recursive digital
filters-Parts I and II," IEEE Trans. Acoust., Speech, Signal
Processing, vol. 37, pp. 1099-1134, July 1989.
[7] M. R. Azimi-Sadjadi and A. R. Rostampour, "Parallel and pipeline
architectures for 2-D block processing," IEEE Trans. Circuits Syst.,
vol. 36, pp. 443-448, Mar. 1989.
[8] T. Lu, "Implementations, algorithms and architectures for 2-D
recursive filtering," M.S. thesis, Dept. Elec. Eng., Colorado State
Univ., Fort Collins, 1988.
[9] T. Lu, M. R. Azimi-Sadjadi, and A. R. Rostampour, "Skew-pipeline
and interleaved pipeline structures for 2-D recursive filtering,"
Proc. ICASSP'90 (Albuquerque, NM, Apr. 1990), pp. 1037-1040.
[10] S. Sunder, F. El-Guibaly, and A. Antoniou, "Systolic implementations
of two-dimensional recursive digital filters," in Proc. of
ISCAS (New Orleans, LA, May 1990), pp. 1034-1037.
[11] N. R. Shanbhag, "An improved systolic architecture for 2-D digital
filters," IEEE Trans. Signal Processing, vol. 39, no. 5, pp. 1195-1202,
May 1991.
