Linear operators for digital contour smoothing are described. These operators are defined by circulant Toeplitz matrices and allow to smooth digital contours in the least-squares sense. They minimize the undersampiing, digitizing and quantizing error and allow to calculate invariants, such as curvature, which are not possible to calculate without smoothing. A bit-level systolic array which is capable of realizing the proposed operator is described. This array is easy to implement in VLSI, because the array ceils involved are very simple. Furthermore, the array is completely pipelined on the bit-level, so that it operates with a high clock frequency achieving very high throughputs.
Introduction
In [19] linear operators of the form
where C is an n × n circulant Toeplitz matrix and c is a suitable constant, are investigated. These operators allow to smooth simply closed digital curves in the least-squares sense. They minimize the undersampling, digitizing and quantizing error and allow to calculate invariants, such as curvature, which are not possible to calculate without smoothing [20] . In order to improve the stability of calculated global invariants [2, 6, 7, 12] different techniques have been suggested [22] . For real-time applications the one-dimensional systolic array TOPSS-28 has been developed [14] . The Hughes Research Laboratories Toeplitz systems solver TOPSS-28 allows to perform real-time calculation of splines and to reduce the data required for reference images. A simple algorithm and a pipelined architecture for B-spline interpolation is described in [18] .
In this paper a bit-level systolic array for digital contour smoothing is described. Kung and Leiserson [9, 10] have introduced the notion of systolic arrays, a special kind of highly regular, homogeneous, pipelined processor arrays. These have been shown to hold a great potential for VLSI implementation and to effectively employ the two major principles of parallel processing, namely multiprocessing and pipelining [11] . Meanwhile a lot of different systolic algorithms and arrays have been proposed for different computational problems. However, little attention 0167-8191/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland) has been payed to the fact that a systolic array can benefit from its regular structure for VLSI design only if its processors (cells) are relatively small and simple. In fact, only systolic arrays which have very simple cells operating on the bit-level--we refer to such arrays as to bit-level arrays--have been implemented in single VLSI chips [3, 8, 13] . Having a bit-level systolic algorithm, the corresponding systolic array is extremely easy to design. In this case only one cell of few gates should be designed, simulated and tested. The whole array is generated by replicating this cell [4] . Since the bit-level cells are separated by delay elements, the very high clock frequency results in a high throughput of the array. We refer to such arrays as to completely pipelined arrays, as the pipelining is realized on the finest grain, most primitive level, the bit-level. In this paper a completely pipelined, bit-level systolic array for digital contour smoothing is presented and real-time applications for this array are summarized.
Digital contour smoothing operators
In the following a digital picture Z is a finite rectangular array whose elements are called points or picture elements. Each point P of ~ is defined then by a pair of Cartesian coordinates (x, y), which we may take to be integer valued. A point P = (x, y) in a digital picture Z has two types of neighbors:
(a) its four horizontal and vertical neighbors (u, v) such that I x -u I + I Y -v I = 1, (b) its diagonal neighbors (u, v) such that Ix -u l = I Y -v I = 1. We shall refer to the neighbors of type (a) as 4-neighbors of P, and to the neighbors of both type, as 8-neighbors [16, 17] . A simply closed digital curve is a path 7 = P0, /'1 ..... P, such that [17] (a) Pi = Pj iff i=j, and (b) Pi is a neighbor of Pj iff i =j + 1 (mod n + 1). Let Pk.,(i), k = 0, 1, 2 ..... m < n be a set of orthogonal polynomials defined on the equidistant set of points 0, 1, 2 ..... n where k denotes the degree of the given polynomial. Let us suppose that these polynomials have the form 2i 6i 6i 2 Po,,(i) = 1, ?1,,(i) = 1 ---, e2,,(i ) = 1 --+ n n-1 n(n-1)' iP,,,.,,(i) = --
n m(n+m+l)p P.,+I,,,(i) + ~Pm.,,(i) -2-(2m+--~) ra--l'n(i)"
(1)
Let f: R 1 --* R 1 be defined on the set of points 0, 1, 2 ..... n and let the values of f on these points be f(i), i = 0, 1, 2 ..... n. Let the approximation function have the form
and let this function approximate the function f in the least-squares sense, i.e.,
According to the orthogonality of the polynomials, the coefficients cj are defined by [1] cj(Pj.., Pj.,,) --(f, Pj..), j=O, 1 .... ,m
where and n Ef(i)Pi.,(i) i=0 = (2j+l)n ~j) CJ= ~pj2.,(i) (j+n+l) (j+l) i=of(i)PJ'"(i)' j=0,1 ..... m (5) i=0 n(J)=n(n-1)-.. (n-j+ 1), (j+ n + 1) (j+l)= (j+ n + 1)(j+ n)... (n + 1). (1/c)CX where C is an N x N circulant Toeplitz matrix and c is the sum of all elements in a row of C, whereby the coefficients of the matrix C are defined by (5) . For different values of rn and n + 1 we obtain the following operators [19] :
Let us consider
-m=l, n+l=3: (8) is shown in Fig. l (a) and by operator (10) in Fig. l(b) . A subset of linear operators defined by an N x N circulant Toeplitz matrix C which smooth digital dosed contours in the least-squares sense is suitable for digital contour approximation and these operators will be called feasible [19] . Let us denote
where X' has elements x i and y,. Then a linear operator (1/c)C, where C is an N × N circulant Toeplitz matrix whose dements are defined by (5), is feasible if
(a) Fig. 1 .
According to this definition a feasible operator is defined by the constrained least-squares smoothing with box constraints (12) . It has been shown that (6), (9), (10) are feasible operators [19] . These operators generate points which lie in a corridor defined by (12) , in which also the original curve (see Fig. 2 ) approximated by the 4-connected closed digital curve lies. The generalization of (1)-(5) for digital contour smoothing is based on the following equivalence:
Let i",. denote the radius vector defined by the point Pt = (xi, Yi) and by O = (0, 0) the initial point. Let us denote
II r~l[ = Ixi-Ol+ly,-OI.
Then (1/c)CX is equivalent to the one-dimensional smoothing defined by (1)-(5) on the equidistant set of points defined by (13) and with function values f~ defined by (14), see Fig. 3 . The linear operators described above allow to smooth simply closed digital curves provided that they are defined by uniformly spaced points. They minimize the undersampling, digitizing In the following a bit-level systolic array will be described which performs the matrix-vector multiplication u~-fx where C is defined by (10) . The matrix-matrix multiplication CX can be then performed by two matrix-vector multiplications Cx and Cy. Let us denote the non-zero elements of C by a_ 3 = -2, a_ 2 = 3, a_ 1 = 6, a 0 = 7, a 1 = 6, a 2 = 3, a 3 = -2.
Completely pipelined bit-level systolic array
The matrix-vector multiplication can be represented by the circulant convolution where u i is the ith component of u and Xo=Xu. A word-level systolic array for this transformation is shown in Fig. 4 . The array cells in Fig. 4(a) are combinational circuits which are capable of executing the so-called inner product step in one clock period. Since the primitive operations of thiS algorithm are operations on the word level, the array is qualified as a word-level systolic array and the parallelism involved is called word-level parallelism. The small bars in Fig. 4(b) denote delay elements which are controlled by a common clock. The The coefficients are chosen in such a way that the neighboring coefficients are represented by the neighboring powers of 2. This corresponds to shifting by just one position in the neighboring calls of the corresponding array. According to this, the systolic array shown in Fig.  4 can be realized by two systolic arrays of the same kind which correspond to the matrices C °) and C (2), respectively, and which are chained together as shown in Fig. 5 . In the following the realization of the inner product step on the bit level will be described. Figure 7 shows the implementation of the first two array cells which themselves are realized as linear, vertical, bit-level arrays. The multiplication by 21 in the second cell is realized by shifting the input x-data upwards by one bit position. The addition is carried out in a carry-riple array of full adders. For 8-bit input data and according to the sum 3 )-'. aj = 21, j=-3 the intermediate result will be represented by (8 + [log221])-bit numbers. Hence, one word-level cell, i.e. one column of the bit-level array, consists of (8 + [log221])= 13 bit-level cells (full adders). The clock period of this system is controlled by the carry-rippling delay from the LSB to the MSB cell, which is readily estimated to be 13t where t is the carry delay of a single full adder. To achieve shorter clock period, the carry rippling was eliminated by coupling the full adders lying in one column via delay elements, forming so a vertical pipeline. The input data are scewed (see Fig. 8 ).
Notice, that due to the shifting of the x-data upwards, a retiming is necessary by removing one of the delay elements in each of the z-connections in order to provide data alignment of the x-and z-bits in a given cell. The clock period of this completely pipelined system is controlled by the delay t of a single bit-level cell and it is 13 times shorter than the clock period of the non-pipelined system.
In the next the realization of a multiplication with a negative coefficient is described. In the third column of the array we have a(01)= -2 °. The multiplication by -1 is realized by converting the input x-words into their 2's complements before their addition. This can be I z(12)... done quite easily by inverting the corresponding x-inputs of the full adders (see Fig. 9 , where the small circles denote inverters), and adding a 1 via the carry input of the LSB full adder. The whole array is a regular matrix of full adders arranged in 12 columns (corresponding to the 12 coefficients) and 13 rows (corresponding to the bits of different significance). In Fig. 10 the profile of the x-and z-connections, the number of the delay elements which are arranged in these connections, and the positions at which inverters (shown as small circles) are arranged in the x-inputs of the full adders are shown. The number of delay elements in the carry interconnections is constant for the whole array: one delay element in each carry interconnection. These connections are not considered in Fig. 10 . The number of delay elements in the xand z-interconnections is derived from the corresponding number of delay elements in the original word-level array (Fig. 6 ) according to the following rule: a delay element is removed from the z-connection or x-connection between two cells of the bit-level array, if the x-data are shifted upwards or downwards between these cells, respectively.
The directions of the carry and z-connections are constant for the whole array and these connections are local. The direction of the x-connections is constant within one column, but changes from column to column. However, all x-connections are local due to the fact that the x-data are shifted maximal by one bit position between two neighboring columns. Inverters are arranged in the x-inputs of the full adders in the third, sixth and the seventh column. The general technique for the bit-level systolic arrays design for the circulant convolution is presented in [15] .
The smoothing operator defined by (10) requires also a normalization by c = 21. This can be done in a straightforward way but it depends on the application in which moment it will be performed. In the next section it is shown that this will be performed as one division after the invariants have been calculated from the array CX. Another approach is to use a fast approximation algorithm which takes advantage from the particular value of the denominator [15] .
Applications
Let a i be an arbitrary linear segment in the 2-dimensional Euclidean space R2, see Fig. 11 . The area of a closed polygon represented by the string a 1, a 2 ..... a N can be calculated by [5] According to this the normalization defined by division by c will be performed after the area and the first-order moments have been calculated from the array CX. Similarly we can derive formulas also for the second-order moments [5] . The linear operators described above minimize the undersampling, digitizing and quantizing error and so they are able to improve the stability of calculated local and global invariants such as the area, center of gravity defined by the first-order moments and invariants defined by the second-order moments [2, 6, 7, 12] , the length of a curve, and enable to calculate invariants such as curvature [20] which are not possible to calculate without smoothing. For length and curvature calculation of smoothed digital curves efficient string processing algorithms have been suggested [21] . In [22] another approach for moment invariants calculation is described.
Conclusion
Linear operators for digital contour smoothing are described. These operators are defined by circulant Toeplitz matrices and enable to smooth digital contours in the least-squares sense. A bit-level systolic array which is capable of realizing the proposed operator is presented. Since the array cells are very simple and only one cell type is used, the array is extremely easy to implement in VLSI. The array is completely pipelined on the bit-level, so that it operates at high clock frequency achieving very high throughputs and is devoted to real-time applications. Another example of a bit-level systolic array exploatation is described in [15] .
