Abstract
The most widespread 16-bit multiplier architectures are compared in terms of area occupation, dissipated energy, and EDP (Energy-Delay Product) in view of low-power low-voltage signal processing for digital hearing aids and similar applications. Transistor-level simulations including back-annotated wire parasitics confirm that the propagation of glitches along uneven and re-convergent paths results in large unproductive node activity. Because of their shorter full-adder chains, Wallace-tree multipliers indeed dissipate less energy than the carry-save (CSM) and other traditional array multipliers (6.0 µW/MHz versus 10.9 µW/MHz and more for 0.25 µm CMOS technology at 0.75 V). By combining the Wallace-tree architecture with transmission gates (TGs), a new approach is proposed to improve the energy efficiency further (3.1 µW/MHz), beyond recently published low-power architectures. Besides the reduction of the overall capacitance, minimum-sized transmission gate full-adders act as RC-low-pass filters that attenuate undesired switching. Finally, minimum-size TGs increase the $V_{dd}$ to ground resistance, hence decreasing leakage dissipation (0.55 nW versus 0.84 nW in CSM and 0.94 nW in Wallace).
Introduction
The strict specifications for digital hearing aids impose tight constraints on both area occupation and energy consumption, while relaxing the timing requirements. Thanks to their compact size and relatively large energy storage [1] (around 100 mAh), zinc-air batteries dominate the market of energy sources for these applications. To decrease dissipation and fully exploit the battery lifetime, digital signal processing units (DSPs or ASICs) are supplied at voltages down to 0.9 V or even lower [1, 2], depending on the technology.
Multipliers are at the same time basic modules and significant energy sinks in digital signal processing; for this reason, much effort has been spent on increasing their efficiency. Many works in the past, [3, 4] among others, focused on decreasing the consumption of the single full-adder cell. While this improves the overall multiplier efficiency, it overlooks the most significant limitation in the design of low-power multipliers, namely the large extent of spurious activity caused by unevenly long and re-convergent paths [1]. All the solutions proposed in the literature so far to suppress glitches can be classified into three main families:
(1) the shortening of full-adder chains in the multiplier matrix, which penalizes, for instance, array architectures compared with Wallace trees [5];
(2) the equalization of the internal delays, as in the Leapfrog architecture [6];
(3) the alignment of the internal signals by means of self-timed circuits [7].
The first approach is simple and attractive; its efficiency is confirmed in this work and the new proposed multiplier is also based on the Wallace-tree architecture. By contrast, the efficiency of the second and third techniques depends strongly on the specific technology; they require a very careful transistor-level design and dedicated calibration. The third approach is analyzed in more detail in Sect. 6, where a direct comparison to the proposed multiplier is drawn.
A new idea is presented in this paper; it may be considered as a fourth approach. Transmission gates are used to act as RC-low-pass filters that suppress glitches while reducing the overall capacitance. It will also be shown that minimum-size transmission gates have the additional benefit of reducing the leakage consumption. Meanwhile, the EDP (Energy-Delay Product) is kept quite low, which preserves flexibility in the choice of the working frequency (up to a few MHz).
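The filtering idea can be illustrated with an idealized first-order model. The sketch below assumes a single lumped resistance and capacitance and a rectangular input pulse; the function name `glitch_peak` and the component values are hypothetical and chosen only for illustration, not taken from the simulations reported here. A real transmission gate is nonlinear, so this indicates the trend only.

```python
import math


def glitch_peak(v_dd: float, pulse_width: float, r: float, c: float) -> float:
    """Peak output voltage of an ideal first-order RC low-pass filter
    driven by a rectangular pulse of amplitude v_dd and the given width.

    The output charges as v_dd * (1 - exp(-t / (r * c))) while the pulse
    is high, so the peak is reached at t = pulse_width.
    """
    if r <= 0 or c <= 0 or pulse_width < 0:
        raise ValueError("r and c must be positive, pulse_width non-negative")
    tau = r * c  # time constant of the lumped RC stage
    return v_dd * (1.0 - math.exp(-pulse_width / tau))


# Illustrative (assumed) values: 100 kOhm and 10 fF give tau = 1 ns.
R_ASSUMED = 1e5
C_ASSUMED = 1e-14

short_glitch = glitch_peak(0.75, 0.2e-9, R_ASSUMED, C_ASSUMED)
valid_signal = glitch_peak(0.75, 10e-9, R_ASSUMED, C_ASSUMED)
```

Under these assumed values, a pulse much shorter than the time constant stays well below half the supply voltage, while a pulse lasting many time constants settles close to the full swing.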
The rest of this paper is organized as follows: Section 2 presents a short introduction to the operation of multiplication itself for a better understanding of the different architectures, which are reviewed and compared in Sect. 3. A method to estimate the spurious activity is presented in Sect. 4. It allows the investigation of the reasons for the uneven energy dissipation in the various architectures. In Sect. 5 the new structure is introduced. Its performance is the topic of Sect. 6. Section 7 investigates the benefits of the mixed topology of CMOS and transmission gate full-adders, while Sect. 8 draws the conclusions.
Signed two's complement multiplication
The following notation for a 16-bit unsigned multiplication [8] is used in the next section:

$$Z = \sum_{i=0}^{15} \sum_{j=0}^{15} X_i \, Y_j \, 2^{i+j} \tag{1}$$

where $Z$ represents the product, $X_i$ the i-th bit of the multiplicand and $Y_j$ the j-th bit of the multiplier. The modified Baugh-Wooley algorithm (see paragraph 11.3 of [9]) enables the extension of unsigned to signed multiplication.
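As a quick sanity check of the unsigned notation, the product can be rebuilt from the single-bit terms, each weighted by the power of two given by the sum of its two bit positions. The following is a minimal sketch; the function name `partial_product_multiply` and the parameter `n` are illustrative assumptions.

```python
def partial_product_multiply(x: int, y: int, n: int = 16) -> int:
    """Multiply two n-bit unsigned integers by summing the n*n single-bit
    partial products x_i AND y_j, each weighted by 2**(i + j)."""
    if not (0 <= x < (1 << n) and 0 <= y < (1 << n)):
        raise ValueError("operands must be n-bit unsigned integers")
    z = 0
    for j in range(n):
        y_j = (y >> j) & 1          # j-th bit of the multiplier
        for i in range(n):
            x_i = (x >> i) & 1      # i-th bit of the multiplicand
            z += (x_i & y_j) << (i + j)
    return z
```

For 16-bit operands the double loop visits 256 single-bit terms, which is the set of bits that the architectures discussed next have to generate and sum.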
When the Booth radix-4 recoding [10] is applied, Eq. 1 transforms into the following:

$$Z = \sum_{i=0}^{15} \sum_{j=0}^{7} X_i \, Yb_j \, 2^{i+2j} \tag{2}$$

where $Yb_j$ represents the j-th operand of the multiplier after Booth recoding. As can be seen from Eq. 1 and Eq. 2, Booth recoding allows the number of partial products, and hence the number of additions, to be halved. Yet, the precalculation of $Yb_j$ and the multiplication of the multiplicand by -2, -1 and 2 require extra logic, which is paid for in terms of power dissipation (and area occupation).

Depending on the way the terms $X_i Y_j$ in Eq. 1 and the terms $X_i Yb_j$ in Eq. 2 are generated and summed up, various multiplier architectures are possible. In the next section, the RCM (Ripple-Carry Multiplier), the CSM (Carry-Save Multiplier), the CSM featuring Booth recoding and the Wallace-tree architecture [8] will be examined and compared to each other. The first one is the most straightforward extension of the concept of ripple-carry adder to the multiplier architecture. The last three represent the most widespread multiplier architectures in a variety of applications [11].
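The halving of the partial products can be made concrete with a small model of radix-4 Booth recoding. The sketch below assumes a signed two's complement multiplier and the textbook digit rule (each digit is formed from two multiplier bits plus the bit below them, and takes a value between -2 and 2); the names `booth_radix4_digits` and `booth_multiply` are illustrative and do not describe the recoding logic of any circuit evaluated here.

```python
def booth_radix4_digits(y: int, n: int = 16) -> list[int]:
    """Recode an n-bit two's complement integer into n/2 radix-4 Booth
    digits, each in {-2, -1, 0, 1, 2}, least significant digit first."""
    if n % 2:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y must fit in n-bit two's complement")
    bits = y & ((1 << n) - 1)   # two's complement bit pattern of y
    digits = []
    prev = 0                    # implicit bit below the LSB
    for j in range(n // 2):
        b0 = (bits >> (2 * j)) & 1
        b1 = (bits >> (2 * j + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, n: int = 16) -> int:
    """Sum the n/2 recoded partial products d_j * x * 4**j."""
    return sum((d * x) << (2 * j)
               for j, d in enumerate(booth_radix4_digits(y, n)))
```

A 16-bit multiplier yields eight digits, so only eight partial products are summed instead of sixteen, at the cost of forming the multiples -2x, -x, x and 2x.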
3 Review of multiplier architectures
RCM and CSM differ only in the carry propagation mechanism [8], as illustrated in Fig. 1. In the RCM, the carry bits ripple from right to left, from the LSB to the MSB; in the CSM, they further descend one row. CSM has also been implemented with radix-4 Booth recoding.
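The difference in carry handling can be sketched with a word-level functional model. This is an assumption-laden simplification rather than the gate-level arrays of Fig. 1: in the ripple-carry version every row addition resolves its carries completely, whereas the carry-save version hands the carries down to the next row as a separate vector and merges them only at the end. The function names are illustrative.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise 3:2 compression: one full adder per bit position.
    Returns (sum_vector, carry_vector) with a + b + c == sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def ripple_carry_multiply(x: int, y: int, n: int = 16) -> int:
    """Each row performs a complete addition, so carries ripple
    through the whole row before the next row is accumulated."""
    acc = 0
    for j in range(n):
        if (y >> j) & 1:
            acc += x << j
    return acc


def carry_save_multiply(x: int, y: int, n: int = 16) -> int:
    """Each row only compresses three vectors into two; the carry
    vector descends to the next row and is merged by a final adder."""
    s, c = 0, 0
    for j in range(n):
        pp = (x << j) if (y >> j) & 1 else 0
        s, c = carry_save_add(s, c, pp)
    return s + c  # final vector-merging addition
```

Both functions return the same unsigned product; they differ only in where carries are resolved, which is the property that sets the length of the full-adder chains in the two arrays.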
The Wallace-tree multiplier is substantially different: the partial terms $X_i Y_j$ of Eq. 1 are added all at once before entering the main full-adder network [8]. That results in a quite irregular architecture (Fig. 2), which, nevertheless, makes it possible to shorten the longest path up to the final addition. The latter can be carried out according to well-known adder topologies; in this work, a final RCA (Ripple Carry Adder) has been chosen, trading speed for energy, as shown in [12].
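The shortening of the longest path can be illustrated with a row-based reduction model. The sketch below assumes the classic Wallace scheme in which the partial-product rows are grouped in threes and each group is compressed into two rows by a layer of full adders; it is a word-level approximation, and the name `wallace_multiply` is illustrative, not a description of the exact network in Fig. 2.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise 3:2 compression with a + b + c == sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def wallace_multiply(x: int, y: int, n: int = 16) -> tuple[int, int]:
    """Reduce all n partial-product rows in parallel with 3:2 compressors.
    Returns (product, number_of_reduction_stages)."""
    rows = [(x << j) if (y >> j) & 1 else 0 for j in range(n)]
    stages = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3      # rows that form complete triples
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(carry_save_add(*rows[k:k + 3]))
        reduced.extend(rows[full:])            # leftover rows pass through
        rows = reduced
        stages += 1
    return sum(rows), stages                   # final two-operand addition
```

For 16 rows the row count evolves as 16, 11, 8, 6, 4, 3, 2, so six compressor stages precede the final adder in this model, compared with one row of adders per partial product in an array.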
The four multiplier architectures were simulated with Spectre at transistor-level in a 0.25 µm CMOS technology. In order to satisfy the requirements imposed by hearing aid applications, the supply voltage has been fixed at 0.75 V, less than one third the nominal voltage (2.5 V). All multipliers are fed with the same set of random stimulus vectors; the input signals are aligned in time and they are glitch-free. These assumptions can be considered valid with very good approximation for time-shared units, where input data emanate from registers. All the presented architectures 
