# Hybrid Parallel Counters – Domino and Threshold Logic

Troy D. Townsend, Peter Celinski, Said F. Al-Sarawi and Michael J. Liebelt

The School of Electrical and Electronic Engineering, The University of Adelaide, SA 5005, Australia

E-mail: {troy,celinski,alsarawi,mike}@eleceng.adelaide.edu.au

#### Abstract

Parallel counters are the building blocks of partial product reduction tree (PPRT) circuits, which are required for high-performance multiplication. In this paper we will implement novel counters using a hybrid of domino and threshold logic. A test  $64 \times 64$  PPRT using these counters was found to reduce latency by 39% and device count by 38% compared to the domino logic equivalent.

### 1 Introduction

The partial product reduction tree (PPRT) is the most delay- and area-intensive portion of a high-performance parallel multiplier [3, 5, 6] and is the focus of this paper. The PPRT is typically constructed using small parallel counters [5], as CMOS is most efficient at low fan-in.

The primary performance objective for our designs is latency – throughput can be set by varying the pipeline depth and degree of parallelism; latency reductions save area and power by removing pipeline latches.

We will measure time delays using logical effort analysis [7]. Accurately estimating power consumption and wiring overhead is beyond the scope of this paper, so we will use device count to gauge efficiency.

To implement threshold logic, we will use Charge-Recycling Threshold Logic (CRTL) [2]. Each gate computes a threshold function, which is specified by the gate threshold T and the weights  $w_1, w_2, \ldots, w_n$ ;  $w_i$  is the weight associated with the  $i^{th}$  input variable  $x_i$ . The output y is high when  $\sum_{i=1}^{n} w_i x_i \ge T$  and low otherwise.
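The threshold function defined above can be illustrated with a short behavioural model. This is a minimal sketch of the logic function only, not of the CRTL circuit; the names `threshold_gate` and `majority3` are illustrative and do not come from the paper. The example uses the well-known fact that the carry of a full adder is a 3-input majority function, which is a threshold function with unit weights and $T = 2$.

```python
from typing import Sequence


def threshold_gate(inputs: Sequence[int], weights: Sequence[int], threshold: int) -> int:
    """Return 1 when sum(w_i * x_i) >= T, and 0 otherwise."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


def majority3(a: int, b: int, c: int) -> int:
    """3-input majority (full-adder carry) as a threshold gate: weights 1, 1, 1 and T = 2."""
    return threshold_gate((a, b, c), (1, 1, 1), 2)


if __name__ == "__main__":
    for bits in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
        print(bits, "->", majority3(*bits))
```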

All domino logic in this paper is dual rail, sized to provide an input loading equivalent to a minimum-sized inverter, with such an inverter buffering each gate's output. We have obtained logical effort parameters for domino logic from simulations; the parameters for CRTL are provided by [1]. As both CRTL and domino are typically operated on a 50% duty cycle, we can compare them solely on the basis of evaluation delay.



Figure 1. Hybrid 3:2 counter circuit

### 2 Counter implementations

A standard 3:2 counter ("full adder") circuit [5, 6] and associated model appear to be prevalent; we will implement it using domino logic. This model finds the sum output to be computed at time  $s = \max(b + x_2, d + x_3)$  and the carry output at  $c = d + y_3$ , for input delays  $a \le b \le d$  and path delays  $x_2, x_3$  and  $y_3$ .

The Minnick family of counters [4] is the fastest known threshold logic (TL) implementation. We can construct a hybrid 3:2 counter by using the Minnick design to compute the carry output and domino logic to compute the sum output, as seen in Figure 1.

Some threshold logic families (including CRTL) can support high fan-in much more effectively than CMOS, permitting efficient larger counters. However, CRTL gates can only resolve a limited set of voltage steps, which mandates that we limit fan-in (sum of input weights) to 30 for each CRTL gate [1] – enough for 7:3 and 15:4 counters. We will implement larger hybrid counters by using domino logic (a tree of XOR gates) to implement the least significant output and the Minnick architecture for the others. This is exemplified by Figure 2, which shows a 15:4 hybrid counter.
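The hybrid construction described above can be sketched as a behavioural model. The sketch assumes the commonly described Minnick structure, in which each output bit is a threshold gate fed by all inputs with weight 1 and by the more significant outputs with negative weights; the paper does not list the gate weights, so this structure and all names (`hybrid_counter`, `threshold_gate`) are assumptions made for illustration. The least significant output is modelled as a plain XOR reduction, standing in for the domino XOR tree.

```python
from functools import reduce
from operator import xor


def threshold_gate(inputs, weights, threshold):
    """Return 1 when the weighted sum of the inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def hybrid_counter(bits):
    """Behavioural model of an n:k hybrid counter with n = 2**k - 1 inputs.

    Returns a dict mapping output weight (..., 4, 2, 1) to the output bit.
    """
    n = len(bits)
    k = n.bit_length()
    if n != 2**k - 1:
        raise ValueError("expected 2**k - 1 inputs, e.g. 3, 7 or 15")

    outputs = {}
    # Assumed Minnick-style gates, most significant output first.
    for j in range(k - 1, 0, -1):
        weight = 1 << j
        higher = sorted(outputs)  # weights of outputs already computed
        gate_inputs = list(bits) + [outputs[w] for w in higher]
        gate_weights = [1] * n + [-w for w in higher]
        outputs[weight] = threshold_gate(gate_inputs, gate_weights, weight)

    # Least significant output: parity of the inputs (XOR tree in domino logic).
    outputs[1] = reduce(xor, bits)
    return outputs


if __name__ == "__main__":
    # Five of the seven inputs are high, so the outputs encode binary 101.
    print(hybrid_counter((1, 0, 1, 1, 0, 1, 1)))  # {4: 1, 2: 0, 1: 1}
```

For three inputs the model reduces to a single threshold gate for the carry and an XOR for the sum, which matches the hybrid 3:2 arrangement described in the text.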

### 3 Comparison

In order to optimise signal delays, all counter outputs are synchronised to multiples of a time quantum, which we will name  $\phi$  – in this fashion we can take into account faster and slower paths through a counter rather than assuming all are equal, providing better optimised circuits. For the standard 3:2 counter  $\phi$  is taken to be the larger of the XOR and majority functions' computation times [5].





Figure 2. Hybrid 15:4 counter circuit

Table 1. 3:2 counter comparison

| Circuit | A  | L          | $\phi$   | $x_2$ | $x_3$    | $y_3$    |
|---------|----|------------|----------|-------|----------|----------|
|         | (devices) | $(\kappa)$ | $(\tau)$ | $(\tau)$ | $(\tau)$ | $(\tau)$ |
| Domino  | 66 | 3          | 6.42     | 12.8  | 6.4      | 12.8     |
| Minnick | 36 | 2          | 4.33     | 8.6   | 8.6      | 4.3      |
| Hybrid  | 41 | 2          | 5.42     | 10.8  | 5.4      | 5.4      |

Table 1 shows area, loading and performance (latency and throughput) information for each of the 3:2 counters. Area A is measured in terms of device (transistor or capacitor) count; input load L is measured in terms of  $\kappa$ , the load presented by a minimum-sized inverter. The unit of time is  $\tau$ , the delay of a parasitic-free minimum-sized inverter driving a load of  $\kappa$  (commonly used in logical effort analysis). It is clear from the table that both Minnick and hybrid counters outperform the domino circuit, despite placing a lower load on the input and requiring significantly less area.
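The delay model from Section 2, $s = \max(b + x_2, d + x_3)$ and $c = d + y_3$, can be evaluated directly with the path delays in Table 1. The sketch below assumes the same model applies to all three rows, as the shared columns suggest; the function name and the choice of all inputs arriving at time 0 are illustrative.

```python
def counter_output_times(a, b, d, x2, x3, y3):
    """Return (s, c): sum and carry output times for input arrival times a <= b <= d."""
    if not a <= b <= d:
        raise ValueError("input arrival times must satisfy a <= b <= d")
    s = max(b + x2, d + x3)  # sum output
    c = d + y3               # carry output
    return s, c


# Path delays (x2, x3, y3) in units of tau, taken from Table 1.
TABLE_1 = {
    "Domino": (12.8, 6.4, 12.8),
    "Minnick": (8.6, 8.6, 4.3),
    "Hybrid": (10.8, 5.4, 5.4),
}

if __name__ == "__main__":
    for name, (x2, x3, y3) in TABLE_1.items():
        # Illustrative case: all three inputs arrive together at time 0.
        s, c = counter_output_times(0.0, 0.0, 0.0, x2, x3, y3)
        print(f"{name}: sum at {s:.1f} tau, carry at {c:.1f} tau")
```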

To compare counters for use in heterogeneous circuits, we will restrict each circuit to $\phi = 6.42\tau$ and $L \leq 3\kappa$, matching the parameters of the domino 3:2 counter. Table 2 provides latency and area data for a number of different counters under these conditions. Area (device count) is normalised against the domino 3:2 counter: for a counter reducing $i$ bits (e.g. the 15:4 counters reduce 11 bits), normalised area $A_n = \frac{A}{i \cdot A_r}$, where $A_r$ is the device count of the domino 3:2 counter. For simplicity, the table does not consider "fast inputs"; each of the hybrid counters is slightly faster than listed under some circumstances.
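The normalisation can be checked against the 3:2 device counts in Table 1. This is a minimal sketch; the function and constant names are illustrative, and only the 3:2 counters are evaluated because device counts for the larger counters are not listed in the text.

```python
DOMINO_3_2_DEVICES = 66  # A_r, device count of the domino 3:2 counter (Table 1)


def normalised_area(device_count, n_inputs, n_outputs, reference=DOMINO_3_2_DEVICES):
    """Return A_n = A / (i * A_r), where i = n_inputs - n_outputs bits are reduced."""
    bits_reduced = n_inputs - n_outputs
    if bits_reduced <= 0:
        raise ValueError("a counter must reduce at least one bit")
    return device_count / (bits_reduced * reference)


if __name__ == "__main__":
    # Device counts from Table 1; a 3:2 counter reduces one bit.
    for name, devices in [("Domino", 66), ("Minnick", 36), ("Hybrid", 41)]:
        print(f"{name} 3:2: {normalised_area(devices, 3, 2):.0%}")
```

The Minnick and hybrid values round to 55% and 62%, in agreement with the corresponding rows of Table 2.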

We note that each hybrid is faster than its Minnick equivalent and also that the 7:3 and 15:4 versions require less area. Thus heterogeneous circuits should be constructed solely using hybrid counters in order to achieve lowest latency.

Table 2. Counter comparison with fixed $\phi$ and $L$

| Circuit       | $O_8$    | $O_4$    | $O_2$    | $O_1$    | $A_n$ |
|---------------|----------|----------|----------|----------|-------|
|               | $(\phi)$ | $(\phi)$ | $(\phi)$ | $(\phi)$ |       |
| Minnick 3:2   |          |          | 1        | 2        | 55%   |
| Hybrid 3:2    |          |          | 1        | 2        | 62%   |
| Minnick 7:3   |          | 2        | 3        | 3        | 54%   |
| Hybrid 7:3    |          | 1        | 2        | 3        | 45%   |
| Minnick 15:4  | 3        | 4        | 4        | 4        | 78%   |
| Hybrid 15:4   | 2        | 3        | 3        | 4        | 54%   |

An initial comparison of PPRT performance using these counters was undertaken at $64 \times 64$. The three-greedy algorithm [5] was used for the homogeneous circuits, as the computation requirements for finding optimal circuits [6] were too great; heterogeneous circuits were constructed using the algorithm in [8]. Final latency was found to be $18\phi$ for domino, $15\phi$ for both Minnick 3:2 and hybrid 3:2 homogeneous circuits, and $11\phi$ for the heterogeneous circuit. The area of the homogeneous circuits is proportional to $A_n$ (refer to Table 2); the heterogeneous circuit will reduce device count by at least 38% compared to domino.
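The quoted latencies can be cross-checked against the 39% figure with a few lines of arithmetic. This is a minimal sketch; the names are illustrative, and only the latencies in units of $\phi$ come from the text.

```python
def relative_reduction(baseline, improved):
    """Return the fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Final PPRT latencies in units of phi, as reported for the 64 x 64 test case.
LATENCY_PHI = {
    "domino": 18,
    "minnick_3_2": 15,
    "hybrid_3_2": 15,
    "heterogeneous": 11,
}

if __name__ == "__main__":
    baseline = LATENCY_PHI["domino"]
    for name, latency in LATENCY_PHI.items():
        print(f"{name}: {relative_reduction(baseline, latency):.1%} below domino")
```

The heterogeneous circuit comes out at roughly 38.9% below domino, which rounds to the 39% quoted in the abstract, while the homogeneous Minnick and hybrid 3:2 circuits are each about 16.7% below.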

### 4 Conclusion

A  $64 \times 64$  PPRT circuit implemented with hybrid counters of varying sizes was found to reduce latency by 39% and device count (area) by at least 38% compared to those of domino logic. This is an encouraging result, and future optimisation is expected to improve this even further.

#### References

- [1] P. Celinski, S. D. Cotofana, and D. Abbott. A logical effort based delay model of charge recycling threshold logic gates. In *Proc. ProRISC Workshop on Circuits, Systems and Signal Processing, Veldhoven, Netherlands*, November 2003.
- [2] P. Celinski, J. F. López, S. Al-Sarawi, and D. Abbott. Low power, high speed, charge recycling CMOS threshold logic gate. *IEE Electronics Letters*, 37(17):1067–1069, August 2001.
- [3] L. Dadda. On parallel digital multipliers. In E. E. Swartzlander, Jr., editor, *Computer Arithmetic*, volume I, pages 126–132. IEEE Computer Society Press, 2nd edition, 1990.
- [4] R. C. Minnick. Linear-input logic. *IRE Transactions on Electronic Computers*, EC-10:6–16, March 1961.
- [5] V. G. Oklobdzija, D. Villeger, and S. S. Liu. A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach. *IEEE Transactions on Computers*, 45(3):294–306, March 1996.
- [6] P. F. Stelling, C. U. Martel, V. G. Oklobdzija, and R. Ravi. Optimal circuits for parallel multipliers. *IEEE Transactions on Computers*, 47(3):273–285, 1998.
- [7] I. E. Sutherland, R. F. Sproull, and D. L. Harris. *Logical Effort: Designing Fast CMOS Circuits*. Morgan Kaufmann, 1999.
- [8] T. D. Townsend, P. Celinski, S. F. Al-Sarawi, and M. J. Liebelt. A hybrid approach for multiplier design. To be submitted to Great Lakes Symposium on VLSI, Boston, USA, April 2004.

