Abstract-This letter presents an improved Toom's algorithm that allows hardware savings without slowing down the processing speed. We derive formulae for the number of multiplications and additions required to compute the linear convolution of size = 2 . We demonstrate the computational advantage of the proposed improved algorithm when compared to previous algorithms, such as the original matrix-vector multiplication and the FFT algorithms.
I. INTRODUCTION

CONVOLUTION is a very important operation in signal and image processing with applications to digital filtering and video image processing. Many approaches have been suggested to achieve high-speed processing for linear convolution and to design efficient convolution architectures [1]-[5].
The proposed work is based on a nontrivial modification of the one-dimensional (1-D) convolution algorithm presented in [1] and shown in Fig. 1 . Using an alternative (permutation-free) construction, we show that the number of lower order parallel convolutions (Stage #2 in Fig. 1 ) can be reduced from three to only two, while keeping the regular topology and simple data flow of the original very large scale integration (VLSI) architecture. Our methodology employs tensor-product decompositions and permutation matrices as the main tools for expressing DSP algorithms.
Let and be two sequences of length . The linear convolution in matrix form is given by , where is the convolution matrix defined by [1] ( 
represents a shuffle permutation matrix on inputs with stride , and represents the tensor-product [6] . Equation (1) can be realized by cascading the three stages: preaddition stage followed by one stage of multiplications , followed by a postaddition stage as shown in Fig. 1 [1] .
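The three-stage structure (preaddition, multiplication, postaddition) can be illustrated on the smallest case, a two-point linear convolution. The following is a minimal sketch under stated assumptions: the matrices `A` and `B` are the textbook three-multiplication (Toom/Karatsuba-style) factors, not necessarily the exact matrices of (1)-(5), and all function names are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Assumed preaddition matrix: (x0, x1) -> (x0, x0 + x1, x1).
A = [[1, 0],
     [1, 1],
     [0, 1]]

# Assumed postaddition matrix: (m0, m1, m2) -> (m0, m1 - m0 - m2, m2).
B = [[1, 0, 0],
     [-1, 1, -1],
     [0, 0, 1]]

def conv2_three_stage(h, x):
    """Two-point linear convolution using three multiplications."""
    hp = matvec(A, h)                    # stage 1: preadditions on h
    xp = matvec(A, x)                    # stage 1: preadditions on x
    m = [a * b for a, b in zip(hp, xp)]  # stage 2: three independent products
    return matvec(B, m)                  # stage 3: postadditions

def conv_direct(h, x):
    """Reference linear convolution by the definition."""
    y = [0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

if __name__ == "__main__":
    print(conv2_three_stage([1, 2], [3, 4]))  # [3, 10, 8]
    print(conv_direct([1, 2], [3, 4]))        # [3, 10, 8]
```

The middle stage is a set of independent products, which is what makes it a natural place for parallel hardware.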
II. IMPROVED PREADDITION AND POSTADDITION
Using the tensor-product property , where [4] - [6] , the term contained in (3) can be simplified to (6) Since the tensor-product is commutative and from (6), then
But since where
From (7) and (8), we have By decomposing the two permutations and in (10) into their serial forms [6] , can be simplified to (11) Note that the above expression for does not include any (explicit) permutations. The realization of ( ) using the original form of (3) and the modified shuffle-free representation of (11) is shown in Figs. 2(a) and 2(b), respectively.
Even though the resulting circuits in Figs. 2(a) and 2(b) are topologically equivalent, removing the shuffle permutations from the tensor formulations can simplify data movement.
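The role of shuffle permutations in tensor-product identities can be checked numerically. The sketch below illustrates the standard commutation property, in which swapping the factors of a tensor product is equivalent to conjugating by a shuffle (stride) permutation. The indexing convention of `shuffle` and all helper names are assumptions made for this illustration and may differ from the convention used in [6].

```python
def kron(A, B):
    """Tensor (Kronecker) product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def shuffle(m, n):
    """Permutation matrix sending index i*n + j to j*m + i (assumed convention)."""
    size = m * n
    P = [[0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            P[j * m + i][i * n + j] = 1
    return P

def matmul(X, Y):
    """Plain matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(r) for r in zip(*X)]

def commuted(A, B):
    """Return P (A kron B) P^T, which should equal B kron A."""
    P = shuffle(len(A), len(B))
    return matmul(matmul(P, kron(A, B)), transpose(P))

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    print(commuted(A, B) == kron(B, A))
```

In a hardware realization each explicit shuffle corresponds to wiring that reorders data, so an expression without them maps onto a simpler data flow.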
Similarly, substituting for in (4) and applying similar tensor-product properties [6] , can be simplified to
The permutation-free recursive realization of the eight-point convolution using three four-point convolutions is shown in Fig. 3(a) . We assume that each addition requires one unit of time.
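The recursive idea of building an eight-point convolution from three four-point convolutions can be sketched in software. This is a minimal illustration assuming the usual three-multiplication split of each sequence into a lower and an upper half; it is not the hardware architecture of Fig. 3(a), and the names `conv_recursive` and `conv_direct` are hypothetical. Sequence lengths are assumed to be equal powers of two.

```python
def conv_direct(h, x):
    """Reference linear convolution by the definition."""
    y = [0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

def conv_recursive(h, x, counter):
    """Linear convolution of two length-n sequences (n a power of two).

    counter is a one-element list used to count scalar multiplications.
    """
    n = len(h)
    if n == 1:
        counter[0] += 1
        return [h[0] * x[0]]
    half = n // 2
    h0, h1 = h[:half], h[half:]
    x0, x1 = x[:half], x[half:]
    # Preadditions: form the sums of the lower and upper halves.
    hs = [a + b for a, b in zip(h0, h1)]
    xs = [a + b for a, b in zip(x0, x1)]
    # Three half-size convolutions (the parallel middle stage).
    z0 = conv_recursive(h0, x0, counter)
    z1 = conv_recursive(hs, xs, counter)
    z2 = conv_recursive(h1, x1, counter)
    # Postadditions: correct the middle term, then overlap-add.
    mid = [b - a - c for a, b, c in zip(z0, z1, z2)]
    y = [0] * (2 * n - 1)
    for k in range(2 * half - 1):
        y[k] += z0[k]
        y[k + half] += mid[k]
        y[k + 2 * half] += z2[k]
    return y

if __name__ == "__main__":
    h = [1, 2, 3, 4, 5, 6, 7, 8]
    x = [2, 0, -1, 3, 1, 4, -2, 5]
    count = [0]
    print(conv_recursive(h, x, count) == conv_direct(h, x))
    print(count[0])  # scalar multiplications used for the eight-point case
```

For eight points the top-level call issues exactly three four-point convolutions, each of which recurses in the same way.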
III. MULTIPLEXED ARCHITECTURE OF THE 1-D CONVOLUTION
A careful scrutiny of the realization shown in Fig. 3(a) reveals that the data movement through the computational stages encounters different amounts of delays. In particular, the computations involved in the matrix affect only the middle four-point convolution in the center stage. Thus, the top and the bottom four-point convolutions can be computed one addition cycle ahead of the middle four-point convolution. This means that only two four-point convolvers are needed at a time. Therefore, through the use of a multiplexer, either the top or the bottom four-point convolver can be removed from the architecture [as shown in Fig. 3(b) ], allowing hardware savings without slowing down the processing speed.
It should be mentioned that the resource sharing shown in Fig. 3(b) could not be observed without the improved shuffle-free architecture. This is evident by comparing the realizations of in the original form [Fig. 2(a)] and the modified shuffle-free form [Fig. 2(b)].
From (4), the number of additions is the number of additions to compute the term plus the additions to compute . The rest are permutations only. Since each of the three-input adders used in computing the matrix can be realized by two two-input adders, the number of additions is given by (14)
The matrix is a special matrix of zeros and ones and has the property that the number of one-entries per row is either one or two; all other entries are zeros. Moreover, the number of one-entries per column is exactly one; all other entries are zeros. Using these properties, we can derive modular realizations for the matrix at different stages.
For a convolution of size , the coordinates of the one-entries are where, , . Let . Then, the coordinates for the one-entries become . Now, observe that if while varies over its entire range, then the set of coordinates for the one-entries is given by , which describes an identity matrix that occupies rows 0 to and columns 0 to of the matrix . Similarly, when and varies over its entire range, the set of coordinates for the one-entries is given by , which describes another identity matrix placed in rows to and columns to . Finally, when and varies over its entire range, then the set of coordinates for the one-entries is given by which describes the third identity matrix placed in rows to and columns to .
Note also that there is no overlap between the row coordinates of and , which means that each row of the matrix contains two one-entries at the row coordinates specified above and only one one-entry in the remaining rows. Since the number of additions required is equal to the number of these rows with two one-entries, the number of additions to compute is given by
Therefore, from (14) and (15), the number of additions needed to compute the postadditions is given by (16) From (13) and (16), the total number of additions needed to compute -point ( ) convolutions is given by
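The counting argument for a zero-one postaddition matrix (three identity blocks, one one-entry per column, one or two per row) can be checked with a small sketch. It assumes an overlap-add layout in which three sub-results of length 2n - 1 are placed at row offsets 0, n, and 2n; this layout, the parameter `n`, and the function names are assumptions for illustration and may not match the exact matrix or the formulae of the letter.

```python
def postaddition_matrix(n):
    """Zero-one matrix that overlap-adds three blocks of length 2n - 1.

    Block b (b = 0, 1, 2) is an identity placed at row offset b * n and
    column offset b * (2n - 1). This layout is an assumption.
    """
    block = 2 * n - 1
    rows, cols = 4 * n - 1, 3 * block
    R = [[0] * cols for _ in range(rows)]
    for b in range(3):
        for k in range(block):
            R[b * n + k][b * block + k] = 1
    return R

def additions_needed(R):
    """A row with t one-entries costs t - 1 two-input additions."""
    return sum(sum(row) - 1 for row in R if sum(row) > 1)

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        R = postaddition_matrix(n)
        row_counts = sorted(set(sum(row) for row in R))
        col_counts = sorted(set(sum(col) for col in zip(*R)))
        print(n, row_counts, col_counts, additions_needed(R))
```

Under this layout the first and third blocks never share a row, so only the overlaps with the middle block produce rows with two one-entries, giving 2(n - 1) additions.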
Since is a diagonal matrix of order , applying the matrix implies performing independent elementwise multiplications, which can be done in parallel. Therefore, the total number of multiplications required is equal to the number of multiplications in the single-core stage and is equal to [1] .
Tables I and II show the computational advantage of the proposed improved algorithm when compared to previous algorithms, such as the original matrix-vector multiplication and the fast Fourier transform (FFT) algorithms. For example, although the number of additions of the proposed algorithm is nearly twice that of the FFT, the number of multiplications is less by 70% for the case .
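The growth of the multiplication count can be compared against direct matrix-vector convolution with a short sketch. It assumes that every halving step spawns three half-size convolutions down to one-point products, so the count follows T(n) = 3 T(n/2) with T(1) = 1. These are arithmetic operation counts under that assumption, not the entries of Tables I and II and not a count of physical multiplier units; the function names are hypothetical.

```python
def mults_recursive(n):
    """Scalar multiplications for an n-point convolution, n a power of two,
    assuming three half-size convolutions per halving step."""
    return 1 if n == 1 else 3 * mults_recursive(n // 2)

def mults_direct(n):
    """Scalar multiplications for the direct n-by-n product form."""
    return n * n

if __name__ == "__main__":
    for alpha in range(1, 7):
        n = 2 ** alpha
        r, d = mults_recursive(n), mults_direct(n)
        print(f"n={n:3d}  recursive={r:5d}  direct={d:5d}  ratio={r / d:.3f}")
```

The ratio of the two counts is (3/4) raised to the number of halving steps, so the saving in multiplications grows with the convolution size.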
V. CONCLUSIONS
In this letter, we presented an improved Toom's algorithm that allows hardware savings without slowing down the processing speed. We demonstrated the computational advantage of the proposed improved algorithm when compared to previous algorithms, such as the original matrix-vector multiplication and the FFT algorithms.
