with less hardware cost compared to the original LEAP circuit, requiring fewer transistors than with conventional CMOS circuits.
It is possible to increase the speed of the modified LEAP circuit further. As mentioned before, the input variable order used to construct a BDD is usually chosen to minimise the number of nodes. In practice, however, designers may define their own variable order according to the actual implementation conditions. For example, if the input variables of a Boolean function do not all arrive at the same time, one might give the slowest input variable the highest priority during Shannon expansion so that the path from this primary input to the output passes through the fewest multiplexers. Another possible improvement is to construct a BDD-like tree in which each node is implemented by a higher-order multiplexer. For example, in Fig. 3a, a radix-2 BDD tree of height 20 has a critical path going through 20 two-to-one multiplexers plus eight inverters. In contrast, the radix-4 BDD-like tree of height 10 has a critical path going through only 10 four-to-one multiplexers plus five inverters, a significant improvement in speed. Fig. 3b shows the four-to-one multiplexer used in the radix-4 implementation.
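The effect of a user-defined variable order can be illustrated with a minimal Python sketch. It is not the synthesis tool described here: the function names (`shannon_tree`, `select_to_output_depth`, `mux_levels`) and the example function `majority` are hypothetical, the tree is not reduced by sharing nodes as a real BDD would be, and `mux_levels` assumes the radix is a power of two.

```python
from itertools import product

def majority(a, b, c):
    """Example Boolean function (carry of a full adder), used only for illustration."""
    return int((a and b) or (c and (a or b)))

def shannon_tree(f, order, fixed=None):
    """Shannon-expand f, taking variables in `order`.
    The first name in `order` drives the root multiplexer (nearest the output).
    Returns a 0/1 leaf or a tuple (var, low_subtree, high_subtree)."""
    fixed = dict(fixed or {})
    if not order:
        return int(bool(f(**fixed)))
    var, rest = order[0], order[1:]
    low = shannon_tree(f, rest, {**fixed, var: 0})
    high = shannon_tree(f, rest, {**fixed, var: 1})
    # Equal cofactors: the output does not depend on var here, so no multiplexer is needed.
    return low if low == high else (var, low, high)

def evaluate(tree, assignment):
    """Follow the multiplexer selections from the root down to a constant leaf."""
    while isinstance(tree, tuple):
        var, low, high = tree
        tree = high if assignment[var] else low
    return tree

def matches(tree, f, names):
    """Check the tree against f on every input combination."""
    return all(
        evaluate(tree, dict(zip(names, bits))) == f(**dict(zip(names, bits)))
        for bits in product((0, 1), repeat=len(names))
    )

def select_to_output_depth(tree, var):
    """Largest number of multiplexers between the select input `var` and the output."""
    if not isinstance(tree, tuple):
        return 0
    node_var, low, high = tree
    if node_var == var:
        return 1
    deeper = max(select_to_output_depth(low, var), select_to_output_depth(high, var))
    return deeper + 1 if deeper else 0

def mux_levels(n_vars, radix):
    """Multiplexer levels when each node consumes log2(radix) variables (radix = 2, 4, 8, ...)."""
    bits = radix.bit_length() - 1
    return -(-n_vars // bits)  # ceiling division
```

If `c` is assumed to be the late-arriving input, the order `("c", "a", "b")` places it one multiplexer from the output, whereas `("a", "b", "c")` places it three multiplexers away. `mux_levels(20, 2)` and `mux_levels(20, 4)` give the heights 20 and 10 quoted for Fig. 3a.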
In fact, the BDD-based synthesis process can combine multiplexers of different orders to achieve optimal speed. For example, the nodes in upper levels can be realised using high-order multiplexers while low-order multiplexers implement the nodes in lower levels. The larger horizontal delay through the controlling input of the high-order multiplexers in the upper levels is then hidden under the vertical delay through the pass-transistor chain and inverters in the lower levels. Fig. 4 compares several synthesised 64 bit CLA adders constructed from two levels of 8 bit CLA blocks. M-LEAP1 and M-LEAP2 are as defined previously; M-LEAP3 is our modified LEAP implementation with user-defined priority; M-LEAP4 is the radix-4 version of our modified LEAP method based on four-to-one multiplexers; CPL and DPL denote the implementations using CPL and DPL multiplexers, respectively [1, 2]. We can see that the modified LEAP circuits synthesised by our logic/circuit generator have better speed and area performance than the other approaches.
Conclusion:
We present a new top-down cell-based design method based on pass-transistor logic with only two types of cell, multiplexers and inverters, in the cell library. A corresponding logic/circuit synthesis tool has also been developed, which takes advantage of the characteristics of pass-transistor logic to minimise delay and area cost. Further performance improvement can be achieved using mixed-radix design and the option of user-defined variable order.
$2N - 1$ input data). In the algorithm which we derive, the variables $a_{-N}$ and $a_N$ also appear as input data, but the value of these variables does not interfere with the final result. Setting $a_{-N} = a_N = 0$ is numerically preferable. The DCT $A = (A_{mn})$ of a 2D matrix $a = (a_{kl})$ (not necessarily a Toeplitz matrix) of size $N \times N$ is given by the following expression: (15) This formula enables us to obtain $A_{mm}$, $m \neq 0$, requiring only the computation of the two additional 1D DCTs, $u_3(m)$ and $u_4(m)$.
The case n = m = 0 once again constitutes an indetermination.
Nevertheless, it can be calculated immediately by using eqn. 16, which has a computational complexity of $O(N)$.
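The $O(N)$ cost of the $n = m = 0$ term can be made concrete with a short Python sketch. It assumes that this term is proportional to the sum of all matrix entries (as it is for a DCT-II, where every cosine equals 1 at zero frequency) and that the Toeplitz matrix is indexed as $a_{kl} = a_{k-l}$. The scale factor is omitted, and the function names are hypothetical rather than taken from the text.

```python
def toeplitz_entry_sum(a, N):
    """Sum of all entries of an N x N Toeplitz matrix in O(N) operations.
    `a` maps each diagonal offset d = k - l, for d in -(N-1)..(N-1), to its value.
    The diagonal with offset d holds N - |d| entries, all equal to a[d]."""
    return sum((N - abs(d)) * a[d] for d in range(-(N - 1), N))

def brute_force_entry_sum(a, N):
    """The same sum taken over all N*N entries, for comparison only (O(N^2))."""
    return sum(a[k - l] for k in range(N) for l in range(N))
```

Both functions return the same value; the first touches each of the $2N - 1$ distinct input values once instead of visiting all $N^2$ entries.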
The simplest way of calculating a bidimensional transform is to break it down into multiple unidimensional transforms on rows and columns. Calculated in this manner, it requires 2N 1D DCTs (one for each row and each column). The algorithmic complexity of each of these 1D DCTs is $O(N \log_2 N)$. More efficient algorithms for calculating a bidimensional transform have also been proposed in the literature. For example, for the cosine transform, [3] reduces the number of 1D DCTs required to N.
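The row-column decomposition can be sketched in Python as follows. The sketch assumes an unnormalised DCT-II, since the normalisation used in this letter is not recoverable from the text, and it evaluates each 1D DCT directly from its definition for clarity rather than with a fast $O(N \log_2 N)$ algorithm. All function names are hypothetical.

```python
import math

def dct_1d(x):
    """Unnormalised DCT-II of a sequence, straight from the definition."""
    N = len(x)
    return [
        sum(x[k] * math.cos(math.pi * (2 * k + 1) * m / (2 * N)) for k in range(N))
        for m in range(N)
    ]

def dct_2d_row_column(a):
    """2D DCT of an N x N matrix using N column DCTs followed by N row DCTs.
    Returns the transform and the number of 1D DCTs that were evaluated."""
    N = len(a)
    calls = 0
    # First step: transform every column of the input.
    cols = []
    for l in range(N):
        cols.append(dct_1d([a[k][l] for k in range(N)]))
        calls += 1
    # Second step: for each frequency m, transform along the remaining index l.
    A = []
    for m in range(N):
        A.append(dct_1d([cols[l][m] for l in range(N)]))
        calls += 1
    return A, calls

def dct_2d_direct(a):
    """Reference 2D DCT computed from the double sum, for checking only."""
    N = len(a)
    c = [[math.cos(math.pi * (2 * k + 1) * m / (2 * N)) for k in range(N)] for m in range(N)]
    return [
        [sum(a[k][l] * c[m][k] * c[n][l] for k in range(N) for l in range(N)) for n in range(N)]
        for m in range(N)
    ]
```

For an $N \times N$ input, `dct_2d_row_column` reports $2N$ one-dimensional transforms, which is the baseline that the Toeplitz-specific method reduces to four.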
Nevertheless, the calculation can be simplified in a much more radical fashion by using the special form of the Toeplitz matrix. We will demonstrate that, by this method, the number of 1D DCTs needed can be reduced to four. The method is based on eliminating the redundant operations that take place in the realisation of the unidimensional transforms carried out on the columns (first step) and on the rows (second step) of the Toeplitz matrix.
The first step, realised directly, involves computing the 1D DCT for each column of the Toeplitz matrix (which we denote $x_{m,k}$, $0 \le k < N$):
with $\theta_m = \pi m / N$. Also
This equation enables us to obtain the 2D DCT, $A_{mn}$, for the co-ordinates $n \neq m$. $u_1(i)$ and $u_2(i)$ ($i = 0, \ldots, N - 1$) are the only 1D DCTs that need to be computed in this case.
For the points of the co-ordinates $n = m$ we cannot apply eqn. 10 as this results in an indetermination. By using l'Hôpital's rule we differentiate the expression with respect to $\theta_n$, keeping $\theta_m$ constant, and then substitute $n = m$. Then we obtain $A_{mm}\, 2 \sin\theta_m$
Algorithm: In summary, the proposed algorithm for computing the 2D DCT of a Toeplitz matrix involves the following steps:
(i) Computation of four 1D DCTs using eqns. 11, 12, 14 and 15. These can be computed, for example, by using the algorithms of [4-6].
(ii) Obtaining $A_{mn}$ for the points of co-ordinates $n \neq m$ by using eqn. 10; for the points $n = m \neq 0$ use eqn. 13; and for $n = m = 0$ use eqn. 16.
In total we need $5N + 3(N^2 - 1)$ additions, $2(N^2 - 1)$ products by trigonometric coefficients and $3N$ products by scale factors, as well as four 1D DCTs of size N. This is a notable reduction with respect to [1], which amongst other operations needs to compute five 1D FFTs of size 2N.
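To give a feel for these counts, the following Python sketch evaluates the stated formulas for a given N. The function name and dictionary keys are hypothetical. The cost of the four 1D DCTs themselves is not included, since it depends on the fast algorithm chosen.

```python
def toeplitz_dct_operation_counts(N):
    """Operation counts quoted in the text, excluding the four 1D DCTs of size N."""
    return {
        "additions": 5 * N + 3 * (N * N - 1),
        "trigonometric_products": 2 * (N * N - 1),
        "scale_factor_products": 3 * N,
        "dcts_1d": 4,            # needed by the proposed method
        "dcts_1d_row_column": 2 * N,  # needed by the plain row-column method
    }
```

For N = 8 this gives 229 additions, 126 products by trigonometric coefficients and 24 products by scale factors, together with four 1D DCTs instead of the 16 used by the row-column method.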
Conclusions:
We have presented an algorithm to compute the 2D cosine transform of a Toeplitz matrix. The algorithm exploits the special form of this type of matrix and essentially requires the evaluation of four 1D cosine transforms. It requires fewer operations and is simpler and more regular than previously proposed algorithms.
