Abstract: This paper describes micro-architecture and circuit techniques for building multipliers with micro-electromechanical (MEM) relays. By optimizing the circuits and micro-architecture to suit relay device characteristics, the performance of the relay-based multiplier is improved by a factor of ~8x over any known static CMOS-style implementation, and ~4x over CMOS pass-gate equivalent implementations. A 16-bit relay multiplier is shown to offer ~10x lower energy per operation at sub-10 MOPS throughputs when compared to an optimized CMOS multiplier at an equivalent 90 nm technology node. To show the viability of this technology, we experimentally demonstrate the operation of the primary multiplier building block: a full (7:3) compressor, built with 98 MEM relays, which is the largest working MEM-relay circuit reported to date.
I. INTRODUCTION
Due to their negligible leakage, MEM relays have recently been proposed as a solution to overcome the minimum energy limitations of CMOS circuits [1]. Although the mechanical movement makes relays slower than CMOS transistors, we have shown in adders and other basic circuit building blocks that appropriate circuit design strategies can make relays more competitive at the circuit and system level [2, 3]. However, for MEM relays to be a practical replacement technology for CMOS, the performance gains must translate across the entire system. In this work we investigate the design of MEM relay-based multiplier structures, since multipliers are commonly the most complex arithmetic blocks. We develop the multiplier micro-architecture and circuit techniques tailored to the relay device properties. These new micro-architectures are optimized around larger compressor circuits to minimize the mechanical delay. The larger pass-gate style compressor circuits are also optimized to operate within a single mechanical delay and to minimize the number of devices. The functionality of the optimized (7:3) compressor is experimentally demonstrated.
II. MEM-RELAY OPERATION AND CIRCUIT DESIGN
The structure and operation of our 4-terminal MEM relay [3] is shown in Fig. 1. When the magnitude of the gate-to-body voltage (|V_GB|) exceeds the pull-in voltage (V_pi), the channel connects the source and drain electrodes and allows current to flow. The current flow is interrupted by an air gap when |V_GB| decreases below the pull-out voltage (V_po) and the relay de-actuates.
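The switching behavior described above can be sketched as a small state machine. This is a minimal behavioral sketch, not the device model of [3]: the class name `Relay`, the numeric thresholds in the usage example, and the assumption that V_po is lower than V_pi (so the relay holds its state between the two thresholds) are illustrative.

```python
class Relay:
    """Behavioral sketch of a 4-terminal relay (hypothetical helper, not from the paper)."""

    def __init__(self, v_pi, v_po):
        # Assumption: pull-out voltage lies below pull-in voltage (hysteresis).
        if not 0 <= v_po < v_pi:
            raise ValueError("expected 0 <= v_po < v_pi")
        self.v_pi = v_pi
        self.v_po = v_po
        self.on = False  # source and drain start disconnected (air gap)

    def apply(self, v_gb):
        """Apply a gate-to-body voltage and return True if source and drain are connected."""
        magnitude = abs(v_gb)  # actuation depends on |V_GB|, not on its sign
        if magnitude > self.v_pi:
            self.on = True      # pull-in: channel touches the source/drain electrodes
        elif magnitude < self.v_po:
            self.on = False     # pull-out: air gap interrupts the current
        # Between v_po and v_pi the previous state is held.
        return self.on
```

For example, with placeholder thresholds `Relay(v_pi=6.0, v_po=3.0)`, applying 7 V turns the relay on, 4 V leaves it on, and 2 V turns it off again.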
In optimized relay-based circuits, all mechanical movement should happen simultaneously to minimize the impact of the relatively slow mechanical delay [2], which favors a tailored pass-gate design. Figure 2(a,b) illustrates the difference between CMOS and MEM-relay logic design styles. A straightforward substitution of CMOS transistors with relays in a standard CMOS 4-input AND logic circuit would result in 4 mechanical delays, as each signal arriving at a gate triggers an additional mechanical delay. The optimized relay design shown in Fig. 2(b) incurs only one mechanical delay, since all mechanical movements happen at the same time. Thus, given a logic function, the design strategy is to stack as many MEM relays in series as possible. The upper bound on the number of MEM-relay devices in a stack is reached when the electrical and mechanical delays are equal. Figure 2(c) compares the electrical delay of a device stack with the mechanical delay for our 1 µm and 90 nm MEM relays [3]. The mechanical delay is obtained for a reasonable range of V_GB overdrives; the comparison shows that this design approach extends to hundreds of series devices and consequently encompasses most practical logic functions.
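The stacking argument can be illustrated with a rough delay model. This is a sketch under stated assumptions: the electrical delay of a stack is approximated by the Elmore delay of a uniform RC ladder, and the values of `T_MECH`, `R_ON` and `C_NODE` are illustrative placeholders, not device parameters from [3].

```python
def elmore_delay(n, r_on, c_node):
    """Elmore delay of n identical series on-relays (assumed uniform RC ladder)."""
    return r_on * c_node * n * (n + 1) / 2


def substituted_cmos_delay(n_inputs, t_mech):
    """Direct substitution case from the text: one mechanical delay per input."""
    return n_inputs * t_mech


def relay_stack_delay(n_inputs, t_mech, r_on, c_node):
    """Pass-gate style: all relays move at once, then the signal ripples electrically."""
    return t_mech + elmore_delay(n_inputs, r_on, c_node)


def max_stack_size(t_mech, r_on, c_node):
    """Largest stack whose electrical delay does not exceed one mechanical delay."""
    n = 0
    while elmore_delay(n + 1, r_on, c_node) <= t_mech:
        n += 1
    return n


# Placeholder values for illustration only (assumptions, not measured data).
T_MECH = 10e-9   # mechanical delay, seconds
R_ON = 1e3       # on-resistance per relay, ohms
C_NODE = 1e-15   # capacitance per internal node, farads

if __name__ == "__main__":
    print(substituted_cmos_delay(4, T_MECH))
    print(relay_stack_delay(4, T_MECH, R_ON, C_NODE))
    print(max_stack_size(T_MECH, R_ON, C_NODE))
```

Under this model the stack limit grows roughly with the square root of the ratio of mechanical delay to per-device RC, which is why a slow mechanical switch paired with a small electrical time constant tolerates very deep stacks.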
III. MEM-RELAY MULTIPLIER DESIGN

A. Design Overview
The main opportunity for innovation in relay multiplier design comes from the logic for the partial product matrix reduction. Figures 3 and 4 show 6-bit examples of the two micro-architectures considered. Figure 3 shows a multiplier composed of the (3:2) compressor (full-adder) and half-adder circuits shown in Fig. 5. In both relay-based adder cells, there is one source/drain input whose output delay is only the electrical delay of the relay. The electrical propagation path allows for stacking (3:2) compressors and half adders without additional mechanical delays. Figure 4 shows the second micro-architecture, which uses a higher compression ratio to decrease the total number of reduction steps. In each reduction step, various (N:3) compressors are stacked in such a way as to avoid paths with mechanical delays. The largest block is the (7:3) compressor with six gate inputs and one source/drain input. For the simple 6-bit multiplier of Figs. 3 and 4, using (N:3) compressors reduces the number of mechanical delays from 4 to 3 (including one mechanical delay for partial product generation).
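The effect of the compression ratio on the number of reduction steps can be sketched with a coarse row-count model. This is an assumption-laden approximation, not the authors' design procedure: it tracks only the tallest column of the partial product matrix, ignores carries arriving from neighboring columns, and charges one mechanical delay per reduction step plus one for partial product generation. The function names are hypothetical.

```python
def next_height(height, max_inputs):
    """Column height after one reduction step using counters with up to max_inputs inputs."""
    full, rest = divmod(height, max_inputs)
    # A full counter with max_inputs inputs produces bit_length(max_inputs) output bits.
    # Leftover bits go through a smaller counter only if that reduces their number.
    return full * max_inputs.bit_length() + min(rest, rest.bit_length())


def reduction_steps(height, max_inputs):
    """Number of steps needed to reduce the tallest column to two rows."""
    steps = 0
    while height > 2:
        height = next_height(height, max_inputs)
        steps += 1
    return steps


def mechanical_delays(operand_bits, max_inputs, generation_delays=1):
    """Reduction steps plus the mechanical delay(s) of partial product generation."""
    return reduction_steps(operand_bits, max_inputs) + generation_delays


if __name__ == "__main__":
    print(mechanical_delays(6, 3))  # (3:2) compressors
    print(mechanical_delays(6, 7))  # up to (7:3) compressors
```

For the 6-bit case this model gives 4 mechanical delays with (3:2) compressors and 3 with (N:3) compressors, matching the counts quoted for Figs. 3 and 4; it should not be expected to reproduce the exact figures for larger multipliers.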
B. Partial Product Generation
As expected, using more complex compressors with larger radix introduces an area/delay tradeoff. A technique that can potentially benefit both area and delay is the modified radix-4 Booth encoding which reduces the total number of partial products by half [4] . The corresponding relay-based partial product generation circuit is shown in Fig. 6 . Although it adds one mechanical delay to the partial product generation step, compared to a simple AND network, it proves to be promising in reducing the overall complexity and delay for larger multipliers. Table I summarizes the trade-offs for a broad range of multipliers, indicating the performance of larger multipliers benefits even more from higher compression ratios, optimized compressor circuits and Booth encoding.
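The halving of the partial product count can be seen in a short sketch of modified radix-4 Booth recoding. This is the textbook recoding, not a model of the relay circuit in Fig. 6; the function name and the choice of a signed two's-complement multiplier with an even bit width are assumptions made for the example.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode a signed two's-complement multiplier into bits/2 digits in {-2,...,2}.

    Digit i has weight 4**i, so each digit selects one partial product
    (0, +/-X or +/-2X) instead of one partial product per multiplier bit.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even in this sketch")
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")
    unsigned = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    previous = 0  # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        low = (unsigned >> i) & 1
        high = (unsigned >> (i + 1)) & 1
        digits.append(low + previous - 2 * high)
        previous = high
    return digits


def booth_multiply(multiplicand, multiplier, bits):
    """Sum the recoded partial products; equals multiplicand * multiplier."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum(d * multiplicand * 4 ** i for i, d in enumerate(digits))
```

A 16-bit multiplier recoded this way yields 8 digits, that is, 8 partial products rather than 16.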
IV. MEM-RELAY MULTIPLIER COMPONENTS DESIGN
The logic function of an (N:3) compressor can be described as:
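One way to write this relation, assuming inputs named A_0 through A_{N-1} (following the A_0 used below) and N no larger than 7, is as the binary count of asserted inputs. The exact notation is an assumption; it is consistent with the statement that the LSB Y_0 is the XOR of all inputs.

```latex
4\,Y_2 + 2\,Y_1 + Y_0 \;=\; \sum_{i=0}^{N-1} A_i ,
\qquad A_i,\, Y_j \in \{0, 1\},\quad N \le 7
```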
The corresponding circuit comprises 3 sub-circuits for generating Y_0, Y_1 and Y_2. The most important design consideration for this relay logic circuit is to include a shoot-through electrical path from an input to the output for all of the sub-circuits, which allows for stacking of compressors without additional mechanical delay penalty. In each circuit, these electrical paths are provided through source/drain connections to . According to (1), the LSB (Y_0) is a 7-input XOR gate. This sub-circuit can be built by cascading 6 two-input relay XOR gates (Fig. 7(b)). An alternative to this implementation is shown in Fig. 7(c), where the body terminal is used in the design to halve the total number of relays. However, for experimental robustness, the former design has been implemented in this work.
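The compressor function and the XOR structure of Y_0 can be checked with a short functional model. This is a logic-level sketch only, assuming the compressor outputs the 3-bit count of asserted inputs; it says nothing about the relay-level circuits of Figs. 7 to 9, and the function names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor


def compressor(inputs):
    """Functional model of an (N:3) compressor, N <= 7: returns (Y2, Y1, Y0)."""
    if len(inputs) > 7:
        raise ValueError("three output bits can count at most 7 inputs")
    total = sum(inputs)  # number of asserted inputs, 0..7
    return (total >> 2) & 1, (total >> 1) & 1, total & 1


def y0_xor_cascade(inputs):
    """LSB built as a cascade of two-input XORs (6 XORs for 7 inputs)."""
    return reduce(xor, inputs)


if __name__ == "__main__":
    # Exhaustively compare the XOR cascade with the LSB of the count.
    for bits in product((0, 1), repeat=7):
        assert y0_xor_cascade(bits) == compressor(bits)[2]
    print("Y0 matches the 7-input XOR for all 128 input vectors")
```

The exhaustive loop covers the same 128 input combinations (0 to 127) that a full functional test of a 7-input block would apply.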
The implementations of the Y_1 and Y_2 sub-circuits are illustrated in Figs. 8 and 9, respectively. In both cases, a (5:3) compressor is used to illustrate the different design steps. Figure 8(a) shows the propagation path logic for Y_1, where A_0 is passed to the output when and is passed when . Figure 8(b) shows the integration of "generate" and "kill" paths, which happen for the cases of and , respectively. Using a similar method, the Y_1 sub-circuit of a full (7:3) compressor can be built, as shown in Fig. 8(c). propagates A_0, generates and kills the output. Fig. 9(c) shows the full Y_2 circuit for a (7:3) compressor. As can be seen in the compressor circuits, is the only complementary signal required for the operation of the multiplier. In this design, that bit is created by and since the added area and energy overhead is relatively small due to the symmetry of those circuits. On the other hand, Y_2 lacks that symmetry, and hence implementation of requires twice the number of relays. As a result, the omission of reduces the total MEM relay count of a (7:3) compressor by 30.
In the proposed (7:3) compressor, the electrical path is provided through the source/drain inputs ( / ), enabling compressor stacking. The resulting compressor uses only 98 MEM relays and yields a critical path of 5 mechanical delays for the 32-bit multiplier, while the static CMOS-style implementation of the same circuit would have 44 mechanical delays. A direct MEM-relay translation of the CMOS pass-gate (7:3) compressor, whose MSB sub-circuit is shown in Fig. 9(d) [5], requires two mechanical delays and would result in 19 mechanical delays in the critical path of a 32-bit multiplier. In the proposed design, the mechanical delay corresponding to the top PMOS in Fig. 9(d) is eliminated by direct implementation of kill/generate paths, and hence the total mechanical delay is ~4x lower.
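The quoted improvement factors follow directly from the critical-path counts above. The short calculation below uses only the three mechanical-delay counts stated for the 32-bit multiplier; the variable names are illustrative, and treating total delay as proportional to the mechanical-delay count is a simplifying assumption.

```python
# Mechanical delays in the critical path of the 32-bit multiplier (from the text).
PROPOSED = 5
STATIC_CMOS_STYLE = 44
PASS_GATE_TRANSLATION = 19

# Assumption: critical-path delay scales with the number of mechanical delays.
speedup_vs_static = STATIC_CMOS_STYLE / PROPOSED         # 8.8, quoted as ~8x
speedup_vs_pass_gate = PASS_GATE_TRANSLATION / PROPOSED  # 3.8, quoted as ~4x

if __name__ == "__main__":
    print(f"vs. static CMOS-style: {speedup_vs_static:.1f}x")
    print(f"vs. pass-gate translation: {speedup_vs_pass_gate:.1f}x")
```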
V. ENERGY/DELAY ESTIMATES OF SCALED MEM RELAY VS. CMOS MULTIPLIERS
To illustrate the potential gains over CMOS implementations, we benchmark the 16-bit relay multiplier, built with our predictive scaled MEM relay parameters [3], against two different 16-bit CMOS multipliers. The first CMOS multiplier is designed using the Dadda tree algorithm [6] with Han-Carlson adders [7], and placed-and-routed using the Nangate 45 nm Standard Cell Library [8], resulting in a total area of 0.007 mm^2. The energy/delay curves are obtained by scaling the supply voltage between 0.7 V and 1.4 V. The second CMOS multiplier employs an optimally tiled compressor tree architecture (OTCT) with radix-4 Booth encoding and an arrival-profile aware completion adder [9]. This multiplier is built in a 90 nm CMOS technology and its total area is 0.03 mm^2. The energy/delay curves have been plotted for the various operating voltages and frequencies reported in [9].
The MEM relay-based multiplier is built with 5610 relays in a projected 90 nm lithographic process, where each relay occupies 12 µm^2, resulting in a total multiplier area of approximately 0.087 mm^2. The energy/delay curves are shown for operating voltages in the range of 2V_pi to 5V_pi. The simulations are based on the MEM-relay model described in [10] and all parasitic and routing wires
VI. THE (7:3) COMPRESSOR EXPERIMENTAL RESULTS
Reliability issues typically associated with MEM relays, such as contact oxidation, welding and charging, can potentially prevent this technology from being used in larger circuit blocks. To illustrate the potential of this technology for building more complex circuits, we demonstrate a fully operational 98-relay (7:3) compressor block. The die photo and the operation of the fabricated compressor are shown in Figs. 11 and 12. A full set of random vectors ranging from 0 to 127 is applied to the compressor, actuating the relay gates with a V_GB of ~10 V and activating different propagate, generate and kill paths in the sub-circuits. The measured output code matches the expected value for every vector. The decline of the output voltage over time is due to the formation of native oxide on the electrode surfaces of the MEM relays, which gradually increases their on-resistance. Before testing a circuit, the native oxide is intentionally broken by temporarily applying a larger source/drain voltage. Hermetic encapsulation of the chips would prevent contact oxidation and remove the need for an initial oxide-breaking step.
VII. CONCLUSION
This work describes micro-architectural and circuit techniques for the design of MEM relay-based multipliers optimized over an area-delay trade-off space. Design analysis shows the performance benefits of using relay-tailored higher-ratio compressors at the micro-architecture level, as well as customized compressor circuits. The operation of the main building block of the MEM-relay based multiplier, the (7:3) compressor, is experimentally demonstrated. This circuit is the largest MEM-relay based circuit successfully tested to date. Simulation results for a 16-bit relay multiplier built in a scaled relay process predict a ~10x improvement in energy efficiency over CMOS designs in the 10 MOPS performance range. The relative performance of the multiplier enhancements is in line with what was previously predicted for a MEM relay 32-bit adder [3], suggesting that complete VLSI systems (e.g. a microprocessor or an ASIC) could expect similar energy/performance improvements from adopting MEM relay technology.
