Abstract-In this correspondence, we present a bit-serial architecture for convolving/correlating long numerical sequences by long filter functions. Because of its two-level interleaving structure, the proposed device does not require "wait cycles" between consecutive input samples. As a result, it achieves the highest possible throughput. Cascadability, fault tolerance, feasibility in VLSI technology, and computing performances are discussed and analyzed.
I. INTRODUCTION
In this correspondence, we present a bit-serial VLSI architecture for the direct convolution/correlation of long numerical sequences by long filter functions. Such an architecture is a highly modular pipeline structured in two levels of interleaving that avoid the need for "wait cycles" between consecutive input samples. As a result, the proposed device can work "on-the-fly," i.e., its processing can proceed at the same frequency as data input access, achieving the highest possible throughput. Even though, from the computational point of view, indirect approaches (such as the FFT [1]) are less complex than the direct approaches, they require the whole input sequence (from now on composed of $M$ samples) to be acquired beforehand, even when only a single point of the output sequence is required. As a consequence, indirect techniques are not practicable when an "on-the-fly" computation is desired. Moreover, although the direct convolution has a computational complexity growing with $M^2$, it has greater simplicity and smaller sensitivity to error propagation than the indirect approach, especially when long filters are used [2]. Therefore, direct convolution notably contributes to hardware simplification and to overall performance.

Manuscript received March 9, 1998; revised October 1, 1998. The associate editor coordinating the review of this paper and approving it for publication was Dr. Konstantinos Konstantinides.
The author is with the Dipartimento di Elettrotecnica ed Elettronica, Facoltà di Ingegneria, Politecnico di Bari, Bari, Italy (e-mail: marino@poliba.it).
Publisher Item Identifier S 1053-587X(99)03249-3.
1053-587X/99$10.00 © 1999 IEEE
Because of the real-time requirement of many applications needing convolution, a large number of convolver/correlator architectures have been proposed. Among these, the most suitable for VLSI implementation are those designed to exploit the bit-serial approach. In fact, this approach has many advantages with respect to the parallel one [3] , such as a simpler communication strategy (single wires instead of data-buses), a reduced number of pins, and the possibility of achieving very high throughput by pipelining at the bit level. Such advantages increase when long filters have to be realized.
The removal (total or partial) of "wait cycles" between two consecutive input samples is a key to increasing the achievable throughput in bit-serial convolution and correlation. Such a task has been addressed in some architectures [4]-[7], which exploit particular processing element (PE) schemes [6], double adder circuits [4], [6], or automatic word rounding [7]. All of these schemes are fully systolic and require a skewed parallel input of $N$ words (where $N$ is the number of filter taps). As a consequence, even though they are powerful for applications with small values of $N$ (e.g., vector quantization for speech and image coding, where $N$ is the dimension of the vectors [8]), the pin limitation makes them unsuitable for applications where long filter functions are prescribed. Word-serial input schemes are the only practicable solutions for such applications since they need only one input line for any value of $N$ and $M$.
To the best of our knowledge, the only bit-serial device that is able to convolve a word-serial input without the need for "wait cycles" is Dadda's polyphase convolver [9]. Polyphase convolvers require at least three phases, and therefore, they need three multipliers for each tap of the filter. Such multipliers allow the processing of input streams without "wait cycles," but they are underutilized because one or more of them are idle during each clock cycle.
In the next section, we describe an architecture that uses only two multipliers per tap. These multipliers are fully utilized and allow the highest possible throughput in each part of the architecture. As a result, the VLSI area used is optimized. The proposed device produces results without loss of precision and operates at the same frequency as the sampled bit-serial input, up to the maximum allowed by the implementation technology. The pipelining allows easy cascadability, fault tolerance, and potential wafer scale integration. A theoretical comparative evaluation in terms of computing performance and hardware complexity is given in Section III, and some simulation results for a particular implementation technology are reported in Section IV. A concluding remark is given in Section V.
II. THE PROPOSED ARCHITECTURE
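The aperiodic convolution of two numerical sequences (real or complex) $\mathbf{x} = \{x_0, \ldots, x_{M-1}\}$ and $\mathbf{h} = \{h_0, \ldots, h_{N-1}\}$ is the sequence $\mathbf{y}$, where

$$y_m = \sum_{n=0}^{N-1} h_n \, x_{m-n} \qquad (1)$$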
The reference processor structure that we used for the on-the-fly computation of (1) is the pipeline shown in Fig. 1 since, among many alternatives, this requires only one input line (for any $M$ and $N$) and provides the shortest latency. It requires a preloading phase of the filter function coefficients, but this does not restrict the computing performance since the filter function needs infrequent updating during processing. Moreover, because of the trivial relation between convolution and correlation,$^1$ the same device can also be used for computing a correlation (it is sufficient to preload $\mathbf{h}$ in the inverted order).
According to this scheme, at the $m$th step, the $n$th processing element (PE) computes the product $h_n x_m$ and sums this datum with the term $(h_{n+1} x_{m-1} + (h_{n+2} x_{m-2} + (h_{n+3} x_{m-3} + (\cdots))))$ that was computed during the $(m-1)$th step by the $(n+1)$th PE. The first output from the device ($h_0 x_0$) is immediately available while $x_0$ is input, and partial products are generated. No carry propagation occurs in this architecture since the sum computation is distributed. In this scheme, we have introduced two levels of interleaving (IL's) in order to correctly accommodate the format expansion that occurs during the products and the sums required by (1).$^2$
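The word-level data flow described above can be sketched in a few lines of Python. This is only a behavioral illustration, not the bit-serial hardware: the function names `pipeline_convolve` and `direct_convolve` are hypothetical, samples are treated as whole integers, and the input is assumed to be followed by zeros so that the pipeline flushes.

```python
def pipeline_convolve(x, h):
    """Word-level model of the pipeline: at step m, PE n forms h[n]*x[m]
    and adds the partial sum that PE n+1 produced at step m-1."""
    n_taps = len(h)
    # partial[n] holds the output of PE n at the previous step;
    # the extra last entry is the constant 0 fed to the last PE.
    partial = [0] * (n_taps + 1)
    y = []
    for m in range(len(x) + n_taps - 1):
        xm = x[m] if m < len(x) else 0  # zeros flush the pipeline (assumption)
        partial = [h[n] * xm + partial[n + 1] for n in range(n_taps)] + [0]
        y.append(partial[0])            # PE 0 delivers y[m] at step m
    return y


def direct_convolve(x, h):
    """Reference: y[m] = sum over n of h[n] * x[m-n], as in (1)."""
    return [sum(h[n] * x[m - n] for n in range(len(h)) if 0 <= m - n < len(x))
            for m in range(len(x) + len(h) - 1)]


# Correlation with h = [1, 2] is obtained by preloading h in inverted order;
# the correlation values then appear delayed by N-1 steps.
z = pipeline_convolve([1, 2, 3, 4], [2, 1])
correlation = z[1:4]
```

Under these assumptions, the output of PE 0 at step $m$ equals $y_m$ of (1), which is why the first result is available as soon as $x_0$ has been consumed.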
A. First Interleaving Level (Product)
An analogy between multiplication and convolution algorithms suggests that the same architecture that is used for the convolver could be used at a lower level to implement the multipliers for on-the-fly product computation. This is possible when the bits are serially processed.

We adopted the serial/parallel multiplier shown in Fig. 2. Such a device is appropriate because in our scheme, the input $\mathbf{x}$ is serially fed, and the filter coefficients can be stored in parallel once and for all (in a preloading step) since they remain constant during the whole process. The bit serialization does not introduce processing delay, and the generation of the final result requires exactly the minimum number of computing steps. Moreover, in terms of hardware complexity, this multiplier is simpler than a fully parallel one (e.g., the one described in [10]) because the number of logic gates is reduced by a factor of approximately $q_h$.

$^1$ The correlation of two numerical sequences $\mathbf{x}$ and $\mathbf{h}$ is the sequence $\mathbf{y}$, where $y_m = \sum_{n=0}^{N-1} h_n \cdot x_{m+n}$.
$^2$ If $q_x$ and $q_h$ are the number of bits used to quantize, respectively, $\mathbf{x}$ and $\mathbf{h}$, each element of $\mathbf{y}$ might require up to $q_x + q_h + \lceil \log_2(N) \rceil$ bits.
In the following, we will refer to digital convolvers using 8-b quantization both for $\mathbf{x}$ and for $\mathbf{h}$ (i.e., $q_h = q_x = 8$); extension of the model to higher precision is straightforward. With 8-b quantization, each term $h_n x_{m-n}$ may require up to 16 bits. Therefore, eight additional processing steps (beyond the eight steps performed during the bit-serial input of $x_{m-n}$) are needed to output the complete result. This additional delay is a bottleneck to performing on-the-fly convolution. To solve this problem, a pair of multipliers ($M_{i,0}$ and $M_{i,1}$) has been used in each PE$_i$, where both are controlled by an interleaving unit (Fig. 3, IL 1).
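To make the 16-cycle occupancy concrete, the following is a minimal behavioral sketch of a serial/parallel multiplier, assuming unsigned operands and an LSB-first input stream. It is not the gate-level circuit of Fig. 2; the function names and the accumulator model are illustrative assumptions.

```python
def serial_parallel_multiply(x, h, qx=8, qh=8):
    """Behavioral model: x enters one bit per clock cycle (LSB first),
    h is held in parallel, and one product bit leaves per clock cycle."""
    acc = 0
    out_bits = []
    for cycle in range(qx + qh):
        # After the qx input bits, zeros are fed while the result drains.
        x_bit = (x >> cycle) & 1 if cycle < qx else 0
        acc += h * x_bit           # add the coefficient when the input bit is 1
        out_bits.append(acc & 1)   # the current LSB is final and can be emitted
        acc >>= 1                  # shift to align with the next input bit
    return out_bits                # qx + qh bits, LSB first


def from_bits(bits):
    """Rebuild an integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


product_bits = serial_parallel_multiply(200, 123)
```

With $q_x = q_h = 8$ the model needs 16 cycles per product although the input occupies only eight of them, which is the bottleneck that the pair of multipliers per PE is meant to hide.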
Each multiplier receives only one datum $x_m$ out of every two in the bit-serial input stream $\mathbf{x}$; Fig. 4 shows the data input to $M_{i,0}$ (upper side of the axis) and to $M_{i,1}$ (lower side) in each PE$_i$. As a result, exactly 16 clock cycles (from now on, cc's) are given to each multiplier to correctly compute the product and to output the result. This time slice separates the input of the first bit of a datum (i.e., the LSB of $x_m$) from the input of the first bit of the next datum to process (i.e., the LSB of $x_{m+2}$). This is represented in Fig. 4 on the axes related, respectively, to $M_{n+1,0}$, $M_{n+1,1}$, $M_{n,0}$, and $M_{n,1}$.
We shall show that by introducing a second IL, consisting of two full adders in cascade for each multiplier, we also solve the problems caused by carry propagation in the sum chain, and we achieve the highest possible throughput in each part of the whole architecture.
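The timing of the first interleaving level can be checked with a small scheduling sketch. It assumes that sample $x_m$ starts entering at clock cycle $q_x m$ and is routed to the multiplier with index $m \bmod 2$, which it occupies for $q_x + q_h$ cycles; the function names are hypothetical.

```python
def multiplier_schedule(num_samples, qx=8, qh=8):
    """Return, for each of the two multipliers of a PE, the list of
    (sample index, first busy cycle, first free cycle) slots."""
    schedule = {0: [], 1: []}
    for m in range(num_samples):
        start = m * qx                      # LSB of x[m] arrives here
        schedule[m % 2].append((m, start, start + qx + qh))
    return schedule


def is_conflict_free(slots):
    """True if no slot starts before the previous one has finished."""
    return all(prev[2] <= nxt[1] for prev, nxt in zip(slots, slots[1:]))


slots = multiplier_schedule(6)
# A single multiplier serving every sample would be asked to start a new
# product every 8 cycles while each product keeps it busy for 16 cycles.
single = [(m, 8 * m, 8 * m + 16) for m in range(3)]
```

In this model each multiplier finishes a product exactly when its next datum begins to arrive, so neither conflicts nor idle cycles occur.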
B. Second Interleaving Level (Sum)
To focus on the problems resulting from carry propagation, we consider one of the two multipliers $M_{n,j}$. If we assume only one adder circuit in cascade for such a multiplier at a given time, this adder should be used by the 16 bits of $h_n x_{2k+j}$ and immediately after by the bits of $h_n x_{2(k+1)+j}$ (these values have to be added with data coming from PE$_{n+1}$). After the addition of $h_n x_{2k+j}$ with the datum $p_{n+1,2k+j-1}$ coming from PE$_{n+1}$, the most significant part of such a sum will conflict with the least significant part of the following sum in the adder output. This is shown in Fig. 4, where the data on the corresponding axes partially overlap in the additional bits generated by carry propagation.
To solve this problem, we have provided each multiplier $M_{n,j}$ with a pair of output lines $O_{n,j}$ and $O_{n,j+2}$, each of which is input to its own adder, indexed by $(n, j)$ and $(n, j+2)$, respectively (Fig. 3, IL 2). Such adders are realized simply by a full adder with the carry suitably fed back. Fig. 4 shows how the results are interleaved on the output lines by each multiplier. In particular, each of the corresponding axes shows the data that are output by $O_{n,j}$ (upper side) and by $O_{n,j+2}$ (lower side).
In this way, there are 16 void cc's between the MSB of $h_n x_m$ and the LSB of $h_n x_{m+2}$ on each channel $O_{n,j}$. Therefore, each adder may produce a result having up to 16 additional bits (one bit for each void cc). This is shown on the corresponding axes, each of which represents the output of a different adder. Fig. 3 also shows how the output lines of each adder $(n+1, j)$ must be delayed and connected to the adders in PE$_n$ to correctly compute (1). These relationships are also noted in Fig. 4 by solid arrows.
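A full adder with its carry fed back can be modeled as follows. The sketch assumes unsigned, LSB-first bit streams and hypothetical helper names; the `extra_cycles` parameter stands in for the void clock cycles that leave room for the bits generated by carry propagation.

```python
def to_bits(value, width):
    """LSB-first bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits, extra_cycles=0):
    """One full adder plus a one-bit carry register, clocked once per bit."""
    carry = 0
    out = []
    for i in range(max(len(a_bits), len(b_bits)) + extra_cycles):
        a = a_bits[i] if i < len(a_bits) else 0
        b = b_bits[i] if i < len(b_bits) else 0
        out.append(a ^ b ^ carry)                # sum output of the full adder
        carry = (a & b) | (carry & (a ^ b))      # carry fed back for the next bit
    return out


full = from_bits(bit_serial_add(to_bits(65535, 16), to_bits(65535, 16), extra_cycles=1))
truncated = from_bits(bit_serial_add(to_bits(65535, 16), to_bits(65535, 16)))
```

In this model a missing void cycle simply drops the carry; in the hardware described above, the carry would instead overlap the LSB of the next word on the same line, which is the conflict that the second interleaving level avoids.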
Negative terms may be easily managed (e.g., as proposed in [11] ) if they are presented in a two's complement form.
C. Cascadability
As a consequence of the two interleaving levels introduced, $2^{16}$ terms can be correctly added in the case of $q_h = q_x = 8$. This means that filters having $2^{16}$ taps can be implemented by means of the proposed architecture. We would like to remark that this limit can be reached independently of the integration level, i.e., by cascading a suitable number of chips, since the architecture is fully modular. Such a possibility is shown in Fig. 5. An $N$-tap filter can be straightforwardly realized by means of $C$ chips, where each one integrates $N/C$ PE's.
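The tap limit follows from the word-length bound $q_x + q_h + \lceil \log_2(N) \rceil$ and the 16 additional bits that each adder can produce. A short sketch of this arithmetic is given below; the function names are hypothetical, and the 300 PE's per chip used in the example is only an illustrative value.

```python
def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()


def output_bits(qx, qh, n_taps):
    """Upper bound on the word length of each output sample."""
    return qx + qh + ceil_log2(n_taps)


def max_taps(extra_bits):
    """Largest N whose accumulation growth fits in the spare bits."""
    return 2 ** extra_bits


def chips_needed(n_taps, pes_per_chip):
    """Number of cascaded chips when each chip integrates pes_per_chip PE's."""
    return -(-n_taps // pes_per_chip)   # ceiling division


limit = max_taps(16)                    # 16 void cycles, one extra bit each
widest = output_bits(8, 8, limit)       # product bits plus accumulation growth
chips = chips_needed(limit, 300)        # illustrative chip capacity
```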
D. Fault Tolerance
The proposed architecture is a pipeline having identical stages (PE's). Therefore, fault tolerance capability can be added at the expense of a little hardware overhead. A fault-tolerant device is composed of $N$ PE's plus as many spare PE's as the maximum number of tolerable faults. Such PE's are provided with one register storing one single bit $w_i$, which drives five 2:1 demultiplexers, as shown in Fig. 6. Such demultiplexers are used to configure the "logical" connectivity of the pipeline according to a status word $\mathbf{w} = w_0 w_1 w_2 \cdots$ (one bit per PE, spares included), which maps the "health" of the device. The status word $\mathbf{w}$ is determined by end-of-production testing.$^3$ Each bit $w_i$ is set to 1 if PE$_i$ is faulty. In this way, the data flow will bypass all the faulty PE's during processing. The preloading of $\mathbf{w}$ must precede the preloading of $\mathbf{h}$ in the initialization phase.
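The effect of the status word on the logical pipeline can be illustrated with a small mapping sketch. It assumes that the logical taps are assigned, in order, to the fault-free PE's; the function name is hypothetical, and the treatment of unused fault-free spares is not specified in the text, so they are simply left out of the map here.

```python
def logical_to_physical(w, n_taps):
    """Map logical tap n to the index of the physical PE that serves it.

    w[i] == 1 marks PE i as faulty (bypassed); w[i] == 0 marks it as usable.
    """
    healthy = [i for i, bit in enumerate(w) if bit == 0]
    if len(healthy) < n_taps:
        raise ValueError("more faults than spare PE's")
    return healthy[:n_taps]


# Four taps on six physical PE's (two spares), with PE 1 and PE 4 faulty.
tap_map = logical_to_physical([0, 1, 0, 0, 1, 0], n_taps=4)
```

This also shows why $\mathbf{w}$ has to be preloaded before $\mathbf{h}$: the coefficients can reach the right PE's only once the bypass configuration is in place.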
III. COMPARATIVE EVALUATION
In this section, we evaluate the proposed two-level interleaving architecture (from now on, 2L) in terms of throughput $T(2L)$ and hardware complexity $C(2L)$. In order to do comparative evaluations, $T(\cdot)$ and $C(\cdot)$ will be estimated in terms of (cc's)$^{-1}$ and the number of transistors, respectively, since such measures are independent of the implementation technology. In the next section, we shall present the "actual values" of silicon area and operating frequency that have been derived for a specific implementation of 2L.
A. Throughput
Because of its interleaving structure, 2L does not require any "wait cycles" between two consecutive bit-serial input samples. As a consequence, 2L can receive a new input datum every $q_x$ cc's, and it can achieve a throughput $T(2L)$, which is the maximum achievable by a bit-serial architecture:

$$T(2L) = 1/q_x \ (\text{cc's})^{-1} = T_{\max}.$$
This result is reported in row #8 of Table I and represents an impressive speedup with respect to the conventional bit-serial architectures that achieve a throughput (row #0 in Table I)

$$T(\text{Conventional}) = 1/(q_x + q_h + \lceil \log_2(N) \rceil) \qquad (4)$$

since they require $q_h + \lceil \log_2(N) \rceil$ "wait cycles" between two consecutive input data in order to allow the format expansion in $y_n$.
In Table I, we also compare 2L with other architectures that remove (totally or partially) the bottleneck of the "wait cycles." Such architectures are described in [4]-[7] and [9] and are briefly commented on by means of some remarks in Table I. Architectures #1-4 require only $\lceil \log_2(N) \rceil$ "wait cycles" between two consecutive data quantized by $q_x$ bits, and therefore, they can achieve a throughput

$$T(\cdot) = 1/(q_x + \lceil \log_2(N) \rceil) \qquad (5)$$

which is lower than $T_{\max}$. Note that architectures #1 and 2 achieve this target by means of automatic word rounding (remark c), and therefore, they are not suitable for applications requiring high precision, whereas architectures #3 and 4 do not have such a limitation (remark d).

$^3$ Such a test is required only for the devices that were detected as faulty. Since it has to distinguish fault-free and faulty PE's, it has a complexity that is linear in the total number of PE's (spares included). Briefly, it consists of a) verifying a fault-free connectivity (a faulty connectivity implies an unusable array). This is accomplished by setting $w_i = 1$ for each $i$ (all the PE's are bypassed) and transmitting a set of test vectors $\{\mathbf{h}, \mathbf{s}_0, \mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3\}$ across the lines $H$, $S_0$, $S_1$, $S_2$, and $S_3$. We then have b) to characterize PE$_j$ (for each PE index $j$). This is accomplished by setting $w_j = 0$ and $w_i = 1$ (for each $i \neq j$) [...] data that should be computed by PE$_k$ ($k > j$) during normal processing. The test does not need additional pins since in a), the correct transmission of the test vectors is verified through the pins used for connecting more chips. Even though run-time faults cannot be detected during the computation, periodic tests can make the device robust to this kind of fault by simply updating $\mathbf{w}$.

Fig. 7. Hardware complexity (evaluated in terms of number of transistors) for the proposed two-level interleaving architecture (2L), a three-phase convolver (3P), and a four-phase convolver (4P) [9] for different values of $N$ and $q = q_x = q_h$. Note that $\{3P, q = 8\}$ cannot implement filters having $N > 2^8$ taps.

Table I remarks: a) skewed parallel input mode ($N$ input pins); b) word-serial input mode (one input pin); c) automatic rounding (loss of precision); d) full precision.
Architectures #5-7, as well as the proposed one, are able to completely remove the need for "wait cycles." Therefore, they can achieve $T_{\max}$ as well. Note that architecture #5 uses automatic word rounding (remark c), whereas architectures #6 and 7, as well as 2L, compute results with full precision (remark d).
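The three throughput classes discussed in this subsection can be compared numerically. The sketch below only evaluates the stated formulas with exact fractions; the function names are hypothetical, and $N = 256$ is an arbitrary example value.

```python
from fractions import Fraction


def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()


def throughput_conventional(qx, qh, n_taps):
    """Samples per clock cycle with qh + ceil(log2 N) wait cycles, as in (4)."""
    return Fraction(1, qx + qh + ceil_log2(n_taps))


def throughput_partial(qx, n_taps):
    """Samples per clock cycle with ceil(log2 N) wait cycles, as in (5)."""
    return Fraction(1, qx + ceil_log2(n_taps))


def throughput_max(qx):
    """Samples per clock cycle with no wait cycles (T_max)."""
    return Fraction(1, qx)


qx = qh = 8
n_taps = 256
speedup = throughput_max(qx) / throughput_conventional(qx, qh, n_taps)
```

For these example values the wait-cycle-free scheme accepts a sample every 8 cc's instead of every 24, and the gap widens as $N$ grows because of the $\lceil \log_2(N) \rceil$ term.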
B. Hardware Complexity
We would like to point out that all the architectures considered in Table I (except 2L and architecture #7) require a skewed parallel input of $N$ words (remark a). As a consequence, they require $N$ input pins, and therefore, because of the pin limitation, they are not suitable for applications that require long filter functions (e.g., high-precision radars). The only practicable solutions for those applications appear to be 2L and the polyphase convolvers (architecture #7) since such bit-serial schemes require only one word-serial input (remark b). As a consequence, such architectures need a global VLSI area that can be reasonably estimated by the number of transistors (say, $\mathcal{N}$) since they use only one input pin for any value of $N$ and $M$ and have only a few bit-serial data lines, which are not local since they are semisystolic. Conversely, $\mathcal{N}$ gives only a partial and rough estimation of the global VLSI area needed by architectures #1-6. In fact, in such architectures, the VLSI area needed by the necessary $N$ input pins has the same order of magnitude as $\mathcal{N}$ (i.e., $O(N)$). At the current integration levels, such VLSI area becomes preponderant (or at least it cannot be neglected) with respect to the VLSI area needed by the transistors. For this reason, in Fig. 7, we have estimated $C(\cdot)$ by means of the number of transistors $\mathcal{N}$ only for 2L and for two different polyphase convolvers, namely, a three-phase convolver (3P) and a four-phase convolver (4P) [9]. In order to compute $\mathcal{N}$, we have assumed an implementation based on standard cells having the same characteristics as those reported in Table II. In the diagram shown in Fig. 7, we have considered different values of $N$ and $q = q_x = q_h$ (for the sake of simplicity, we assumed the same quantization level for $\mathbf{x}$ and for $\mathbf{h}$). Conversely, $M$ has not been taken into account since $C(\cdot)$ is independent of it for all the architectures.
Note that even though $C(2L)$ and $C(3P)$ are almost equivalent, 2L can implement filters having up to $2^{2q}$ taps, whereas 3P cannot implement filters having more than $2^q$ taps [9]. For such filters, a polyphase convolver requires at least four phases (4P) and a complexity $C(4P)$ that is approximately 32% higher than $C(2L)$.
IV. SIMULATION RESULTS
In order to compute the "abstract values" of silicon area and operating frequency for a specific technology, the above-derived $C(\cdot)$ and $T(\cdot)$ have to be multiplied by the average area needed by a transistor and by the inverse of the clock period, respectively. Note that due to some physical phenomena (e.g., load capacitances, fan-in, and fan-out), "actual values" may be different from the derived "abstract values." Such "actual values," as well as the silicon area needed by the interconnecting wires, can be known only by means of a detailed analysis of the implementation. For such a purpose, an implementation of 2L in standard cells (0.7-μm CMOS, ES2 technology [12]) was evaluated. The complete placement and routing for 300 PE's ($q_h = 8$) in an 11.6 × 11.6 mm$^2$ chip was automatically obtained using the software tool SOLO2030 developed by CADENCE. An operating frequency of 50 MHz was verified through a logical simulation performed by SILOS, in which standard cell characteristic data were used (Fig. 8). Such a result means that an 8-b quantized sequence could be bit-serially convolved by a long filter function at 6.25 Msamples/s without loss of precision.
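The quoted sample rate is simply the clock frequency divided by the number of clock cycles spent per input sample. A minimal sketch of this conversion follows; the function name is hypothetical, and the second figure is only an illustrative comparison obtained from the conventional wait-cycle count for $N = 300$.

```python
def sample_rate(clock_hz, cycles_per_sample):
    """Samples per second for a bit-serial device clocked at clock_hz."""
    return clock_hz / cycles_per_sample


# 2L: one 8-b sample every q_x = 8 clock cycles at 50 MHz.
rate_2l = sample_rate(50e6, 8)

# Conventional scheme at the same clock, N = 300: 8 + 8 + ceil(log2(300)) = 25 cycles.
rate_conventional = sample_rate(50e6, 8 + 8 + 9)
```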
V. CONCLUSION
We have described a bit-serial VLSI architecture for the on-thefly convolution of numerical sequences with long filter functions. Such a device operates without loss of precision, achieving complete hardware exploitation; in addition, it can be fully pipelined. The obtained throughput is the highest possible one since no "wait cycles" are required between consecutive input samples. Its feasibility in VLSI has been verified. Moreover, a comparative evaluation with other existing architectures in terms of throughput and hardware complexity has been provided.
ACKNOWLEDGMENT
The author wishes to thank V. Piuri, Polytechnic of Milan, for suggestions that have improved this correspondence. He would also like to acknowledge the anonymous reviewers for their helpful comments.
