Abstract - Increasingly complex digital integrated circuits (ICs) often incorporate multiple power domains and therefore require multiple voltage converters to produce the corresponding supply voltages. These converters not only take up substantial on-chip layout area or off-chip space, but also add to the power lost during voltage conversion. To alleviate this problem by reducing the number of voltage converters needed, this paper presents an asynchronous Multi-Threshold NULL Convention Logic (MTNCL) "stacked" circuit architecture. By stacking multiple MTNCL circuits between power and ground, a multiple of VDD can be supplied to the entire stack. With simple control mechanisms, the dynamic range fluctuation problem can be mitigated. A 130nm Bulk CMOS process and a 32nm SOI CMOS process are used to evaluate the effect of stacking different circuitry while running different workloads. Results and physical implementations are discussed to demonstrate the feasibility and advantages of the proposed MTNCL stacking architecture.
I. INTRODUCTION
As semiconductor technology continues to move forward and digital integrated circuits (ICs) designed in smaller process nodes operate at lower voltages, new challenges arise. One such challenge is that, while circuit components continue to shrink and more of them can be implemented on a single chip, the power management system has become increasingly complex because these circuits require a larger number of power-supply rails [1]. Supporting these multiple power domains comes on top of converting the off-chip power down to the lower voltages that the circuit components need to operate reliably at their optimum levels [2]. This has traditionally been accomplished using off-chip and on-chip voltage converters, but as the number of power domains increases, their tradeoffs make them less than ideal: these converters not only take up substantial on-chip layout area and off-chip space, but also add to the power lost during voltage conversion.
An alternative method for alleviating the energy loss caused by multiple voltage converters is to incorporate "stacked" low-voltage logic blocks. This methodology not only reduces the chip current draw, but also simplifies the off-chip and on-chip power delivery system. Voltage stacking connects n circuits in series and supplies the stack with n times the rated normal operating voltage of a single circuit. By increasing the delivered voltage and decreasing the chip current draw, this architecture achieves several benefits:
1. I²R power loss due to board and package resistance is reduced by a factor of n²;
2. IR drop is reduced by a factor of n², because the current, and hence the IR drop, falls by a factor of n and that drop is spread over n stacked circuits;
3. Voltage regulators and converters benefit from a lower step-down ratio that improves their efficiency and reduces their design complexity.
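The scaling in the first two items can be checked with a short numerical sketch. It assumes a fixed total load power, a single lumped series resistance for board and package, and ideal sharing of the stack voltage; the function name and the numbers are hypothetical and chosen only to make the ratios visible. Item 2 is read here as the IR drop relative to the delivered supply voltage.

```python
def delivery_losses(p_load_w, v_dd, r_series_ohm, n):
    """Return (supply current, IR drop, I^2R loss) for n stacked blocks.

    Assumption: the total load power is fixed and the stack is fed at
    n * v_dd, so the supply current is 1/n of the non-stacked case.
    """
    i_supply = p_load_w / (n * v_dd)
    ir_drop = i_supply * r_series_ohm
    i2r_loss = i_supply ** 2 * r_series_ohm
    return i_supply, ir_drop, i2r_loss


# Hypothetical operating point: 1 W load, 1 V blocks, 0.1 ohm series resistance.
n = 2
flat = delivery_losses(p_load_w=1.0, v_dd=1.0, r_series_ohm=0.1, n=1)
stacked = delivery_losses(p_load_w=1.0, v_dd=1.0, r_series_ohm=0.1, n=n)

loss_ratio = flat[2] / stacked[2]   # I^2R loss shrinks by n^2
drop_ratio = flat[1] / stacked[1]   # absolute IR drop shrinks by n
# IR drop as a fraction of the delivered supply voltage shrinks by n^2.
relative_drop_ratio = (flat[1] / 1.0) / (stacked[1] / (n * 1.0))
```

With n = 2 the loss ratio and the relative drop ratio both come out as 4, while the absolute drop ratio is 2.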
Previous research [3] was carried out using synchronous logic in a 150nm FDSOI CMOS process. However, that design worked reliably only if copies of the same circuit were stacked and ran exactly the same workload, because of the middle voltage fluctuation and the current mismatch between the stacked layers. To retain the energy efficiency benefit of voltage stacking while removing these strict constraints on the stacked circuitry, this research implements the voltage stacking methodology using an asynchronous quasi-delay-insensitive (QDI) paradigm named Multi-Threshold NULL Convention Logic (MTNCL). Stacking MTNCL circuits not only provides the benefits discussed above, but also allows the stacked circuits to run different workloads, and even to be of different types and sizes, because of MTNCL's robustness, timing independence, and minimized effect from electromagnetic interference (EMI). MTNCL circuitry can also be put to sleep autonomously, which in this stacked infrastructure allows other circuits to continue running while one or more circuits sleep. By adding simple control logic, the user can choose whether to let the non-sleeping circuit speed up while other circuits sleep, or to reduce energy consumption while maintaining performance. This paper is based on stacking MTNCL logic circuits in both the 130nm Bulk CMOS process and the 32nm Silicon on Insulator (SOI) CMOS process.
II. BACKGROUND

A. NULL Convention Logic
NULL Convention Logic (NCL) is an asynchronous circuit design methodology that uses dual-rail encoding [4]. NCL is QDI and correct by construction, and is therefore not subject to the timing requirements of bounded-delay asynchronous circuits [5]. Each dual-rail signal uses two mutually exclusive rails, rail0 and rail1, which cannot both be asserted at the same time; this yields three valid states and one invalid state. When rail0 = 1 and rail1 = 1, the signal is in the INVALID state. The other three combinations are the VALID states NULL, DATA0 and DATA1: when rail0 = 0 and rail1 = 0, the signal is in the NULL state, equivalent to no data; when rail0 = 1 and rail1 = 0, it is DATA0, equivalent to a logic 0; and when rail0 = 0 and rail1 = 1, it is DATA1, equivalent to a logic 1. There are 27 fundamental NCL gates that cover the majority of the Boolean equivalents and can be used when designing an asynchronous circuit.
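The dual-rail encoding described above can be written out as a small lookup. This is only an illustrative sketch; the function names `encode_dual_rail` and `decode_dual_rail` are hypothetical and do not come from any NCL tool.

```python
def encode_dual_rail(bit):
    """Map None/0/1 to a (rail0, rail1) pair: NULL, DATA0, DATA1."""
    if bit is None:
        return (0, 0)  # NULL: no data present
    if bit == 0:
        return (1, 0)  # DATA0: rail0 asserted
    if bit == 1:
        return (0, 1)  # DATA1: rail1 asserted
    raise ValueError("bit must be None, 0 or 1")


def decode_dual_rail(rail0, rail1):
    """Return 'NULL', 'DATA0' or 'DATA1'; reject the INVALID state."""
    if rail0 and rail1:
        raise ValueError("INVALID: both rails asserted")
    if rail0:
        return "DATA0"
    if rail1:
        return "DATA1"
    return "NULL"
```

For example, `decode_dual_rail(*encode_dual_rail(1))` gives `"DATA1"`, and `decode_dual_rail(1, 1)` raises an error because the rails are mutually exclusive.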
NCL gates also include hysteresis, whereby all n inputs must be de-asserted before the output returns to a logic 0. This ensures that all of the gates in a combinational logic block propagate their data completely. Once this has occurred, all of the inputs return to a logic 0 state, which creates a NULL wave that serves as a buffer before the next DATA wave. The static implementation of an NCL gate is shown in Fig. 1 (left). At both ends of an NCL combinational logic block are delay-insensitive registers, which form a single-stage design or, if more than one stage is required, a pipelined design. These registers use handshaking signals to tell the previous set of registers when they are ready for either a NULL or a DATA wave.
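The hysteresis behavior can be sketched as a behavioral model of an m-of-n threshold gate. The set condition (at least m inputs asserted) is standard NCL behavior rather than something spelled out in the text above, and the class name `ThresholdGate` is hypothetical.

```python
class ThresholdGate:
    """Behavioral model of an NCL m-of-n threshold gate with hysteresis."""

    def __init__(self, threshold, n_inputs):
        if not 1 <= threshold <= n_inputs:
            raise ValueError("threshold must be between 1 and n_inputs")
        self.threshold = threshold
        self.n_inputs = n_inputs
        self.output = 0  # start in the NULL (de-asserted) state

    def evaluate(self, inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        asserted = sum(1 for x in inputs if x)
        if asserted >= self.threshold:
            self.output = 1  # set: enough inputs are asserted
        elif asserted == 0:
            self.output = 0  # reset: every input is de-asserted
        # Otherwise hold the previous output (hysteresis).
        return self.output


gate = ThresholdGate(threshold=2, n_inputs=3)
trace = [gate.evaluate(v) for v in ([1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0])]
```

In this trace the output stays asserted on the third step even though only one input remains high, and it returns to 0 only after all inputs are de-asserted.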
B. Multi-Threshold NULL Convention Logic
As process nodes continue to shrink, leakage power becomes a larger share of the overall power dissipation. Multi-Threshold CMOS (MTCMOS) power gating was developed to reduce leakage in synchronous designs by incorporating two types of transistors, one with a higher threshold voltage (Vt) than the other.
The low-Vt transistors are used to maintain performance because they switch quickly, but their leakage is high. This is offset by also using high-Vt transistors in the circuit design: although they switch more slowly, they allow a much smaller leakage current to flow when turned off. To get the most from both types, the low-Vt transistors are mostly used in the logic blocks, while every path from power to ground or to the output includes a high-Vt transistor to minimize the flow of leakage current.
MTNCL was developed by combining MTCMOS power gating with NCL. MTNCL gates replace the hysteresis function with a sleep function, which consists of a sleep signal connected to a PMOS transistor in the pull-up network and an NMOS transistor in the pull-down network, as shown in the static gate implementation in Fig. 1 (right). This sleep signal serves as a power-gating technique for the circuit and is also responsible for generating the NULL wave: when it is asserted, the gate is disconnected from the power source and the output is tied directly to ground, resulting in an output of 0. Therefore, when a NULL wave is requested by the next register set, the sleep signal only needs to be asserted for the gates in that stage of the pipeline to produce an output of 0. Conversely, when DATA is requested, the sleep signal is de-asserted and the gates are allowed to propagate the data they receive from the previous stage.
Just as in the NCL pipeline architecture, MTNCL has delay-insensitive registers on both ends of any MTNCL combinational logic that latch either the DATA or the NULL wave. When the sleep signal is de-asserted, the rail0 and rail1 inputs propagate to the rail0 and rail1 outputs; when sleep is asserted, both output rails are 0, resulting in a NULL value. Apart from the registers, the MTNCL pipeline architecture is also much like that of NCL, with combinational logic and completion logic for each stage of the pipeline. Similarly, handshaking signals are used in MTNCL pipelining to indicate when a stage is ready for the next NULL or DATA wave. And just as in NCL pipelining, each DATA wave is separated from the next by a NULL wave in order to prevent the data of one wave from corrupting that of another. The difference, however, is that the combinational logic enables or disables the sleep signal for the subsequent stage in the pipeline based upon the next stage's status.
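A behavioral sketch of this sleep function, under the assumption that an awake MTNCL gate simply evaluates its threshold condition with no hysteresis, might look as follows. The function name `mtncl_gate` is hypothetical.

```python
def mtncl_gate(inputs, threshold, sleep):
    """Behavioral model of an MTNCL threshold gate.

    When sleep is asserted the output is forced to 0 (NULL). When sleep is
    de-asserted the output follows the threshold condition directly; unlike
    an NCL gate, there is no state to hold.
    """
    if sleep:
        return 0
    asserted = sum(1 for x in inputs if x)
    return 1 if asserted >= threshold else 0


awake_output = mtncl_gate([1, 1, 0], threshold=2, sleep=False)
asleep_output = mtncl_gate([1, 1, 0], threshold=2, sleep=True)
```

The same inputs produce a 1 when the gate is awake and a 0 when it is asleep, which is how asserting sleep on every gate in a stage generates the NULL wave.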
III. POWER MANAGEMENT
In recent years, the voltages needed to power core rails have continued to drop from 5 V to 0.9 V, while the off-chip supply has remained at 12 V, 5 V, and 3.3 V [1]. For multiple cores on a single chip, dynamic voltage and frequency scaling (DVFS) with fast voltage transitions for each core or block is therefore sought after, because it can reduce power consumption and improve energy efficiency. This has been accomplished with off-chip DC-DC converters that supply multiple on-chip power domains; more recently, there has been a lot of interest in on-chip converters that also implement the multiple power rails [6].
Tradeoffs exist between these off-chip and on-chip conversion designs. The main issue with an off-chip power supply is that when the on-chip current demand increases, the IR drop and I²R power loss due to the board and package resistance also increase. This is amplified by the fact that off-chip power delivery impedance realistically does not scale [3]. The two main types of systems implemented on-chip to supply the necessary voltages to the different components are linear regulators and inductor-based switching regulators. Linear regulators are relatively low cost because they are easy to design, take up little area, have an internal switch, and need only an input and an output capacitor; their drawback is low efficiency. Switching regulators are much more efficient (usually 85% to 95%), but their cost of implementation is much higher due to their complexity, support components, design time, and board area [7]. Two main DC-DC converter topologies are the low dropout (LDO) regulator, an example of a linear regulator, and the switched inductor (SI) buck regulator, an example of an inductor-based switching regulator. Prior research [2] compares the performance of these two topologies when the input voltage is 1.5 V and the output voltage is 1.0 V. The data in Table 1 show that there are clear tradeoffs between the two options, depending on the power supply input and the required output. The efficiency, speed of delivery, area used, and complexity can all be major drawbacks when designing the power delivery system.
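To make the low efficiency of linear regulators concrete, the sketch below computes the ideal upper bound Vout/Vin for the 1.5 V to 1.0 V case mentioned above. This is a textbook bound that ignores quiescent current; it is not a value taken from Table 1, and the function name is hypothetical.

```python
def linear_regulator_efficiency_bound(v_in, v_out):
    """Upper bound on linear-regulator efficiency, ignoring quiescent current.

    The load current flows through the pass device, so at best the fraction
    v_out / v_in of the input power reaches the load.
    """
    if not 0 < v_out <= v_in:
        raise ValueError("require 0 < v_out <= v_in")
    return v_out / v_in


bound = linear_regulator_efficiency_bound(v_in=1.5, v_out=1.0)  # about 0.667
```

Under this bound, a linear regulator stepping 1.5 V down to 1.0 V cannot exceed roughly 67% efficiency, which is well below the 85% to 95% range quoted for switching regulators.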
IV. VOLTAGE STACKING
The main goal of this research is to create a functional voltage stacking model using MTNCL circuits in order to develop a comprehensive blueprint that other designers can use based upon their needs. Reducing the number of on-chip and off-chip DC-DC converters and regulators immediately simplifies the power management system and reduces the power lost in these excess converters and regulators. Demonstrating different combinations of stacked circuits running different workloads takes this research past what was accomplished using synchronous logic and allows for further improvements to the stacked methodology. Such improvements are accomplished by incorporating additional logic into the stacked model in order to balance tradeoffs such as speed versus energy consumption, and to reduce leakage power when idle.
A sample design for stacking two MTNCL pipelined circuits is shown in Fig. 2. The bypass capacitors ensure that when two identical gates operate, the middle voltage remains near half of the supplied voltage (the supply is 2×VDD, so the middle voltage is VDD). Transistors M1 through M4 are additional logic designed to mitigate the effects of stacking circuits of different sizes and to prevent the issues that arise when one or more of the circuits go to sleep. The additional logic is implemented in parallel with the stacked circuits in order to provide a separate path from one supply rail to the next that circumvents the sleeping circuit. This logic is controlled by separate signals, generated at the system controller level, that indicate when a circuit is being put to sleep for an extended period of time.
When either circuit is running, the Awake signal stays high, keeping both of the innermost transistors (M2 and M3) on. If either circuit is put to sleep for an extended period, its Bypass signal is also asserted, turning on the transistor (M1 or M4) in the same row and shorting either 2×VDD to the middle node or the middle node to GND. With this logic, the middle node, which would normally shift drastically towards the circuit that is still running, can be pulled in the opposite direction, thereby increasing the dynamic range and speed of the working circuit. The Awake signal is set low (turning off M2 and M3) when both circuits are put to sleep for an extended period, thereby blocking the direct path from 2×VDD to GND while the two Bypass signals are enabled.
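The Awake and Bypass behavior described above can be summarized as a small truth-table sketch. It abstracts away transistor polarity (each device is simply "on" or "off") and assumes that M1 and M2 form the series path in parallel with the top circuit while M3 and M4 form the path in parallel with the bottom circuit; the function and key names are hypothetical.

```python
def stack_control(top_asleep, bottom_asleep):
    """Return the on/off state of M1..M4 for a two-circuit stack.

    'asleep' means put to sleep for an extended period, as signalled by
    the system controller.
    """
    awake = not (top_asleep and bottom_asleep)  # low only if both sleep
    m1_on = top_asleep        # Bypass signal of the top circuit
    m4_on = bottom_asleep     # Bypass signal of the bottom circuit
    m2_on = awake
    m3_on = awake
    return {
        "M1": m1_on, "M2": m2_on, "M3": m3_on, "M4": m4_on,
        # A row conducts only when both of its series transistors are on.
        "top_row_conducts": m1_on and m2_on,        # 2xVDD tied to middle node
        "bottom_row_conducts": m3_on and m4_on,     # middle node tied to GND
    }


states = {
    (top, bottom): stack_control(top, bottom)
    for top in (False, True)
    for bottom in (False, True)
}
```

In all four combinations at most one row conducts, so under these assumptions the control logic never creates a direct path from 2×VDD to GND.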
V. SIMULATION ANALYSIS
Cadence transistor-level simulations were performed in both the 130nm Bulk CMOS process and the 32nm SOI CMOS process using the same MTNCL circuits. The nominal VDD is 1.2 V for the 130nm process and 0.9 V for the 32nm process. The first set of simulations takes two copies of the same MTNCL 11×7 Dadda multiplier, stacks one on top of the other, and runs them with different workloads to observe the effect on the middle node voltage and on their individual performance. With a supply voltage of 2×VDD and both circuits running their different workloads, the middle node voltage oscillates around VDD. Although the middle voltage fluctuates constantly during the simulation, both circuits maintain proper operation due to the delay insensitivity of MTNCL. To evaluate whether voltage stacking of MTNCL circuits is scalable, a stack of three identical MTNCL Dadda multipliers running different workloads was also simulated, and the results again demonstrated proper operation of all circuit copies. Fig. 3 shows the results from running the same circuits with different workloads in both processes, as well as the three-stacked design in the 130nm process. The simulations clearly show that the middle node voltage fluctuates much more in the 130nm process than in the 32nm process, but both work correctly even when running different workloads.
The next set of simulations stacked an 11-bit MTNCL Ripple Carry Adder (RCA) with the previously used 11×7 Dadda multiplier. The multiplier is about 4 times larger than the RCA, which causes the middle node voltage to shift towards the larger circuit when both are operating, yet both still function correctly due to MTNCL's delay insensitivity. However, as the middle node shifts, the resulting changes in the dynamic ranges of both circuits directly affect the performance of each circuit.
An increase in the dynamic range increases the speed of the circuit, but also increases the active energy consumption; a decrease in the dynamic range has the exact opposite effect, i.e., slowing the circuit down while using less energy. This effect occurs in both processes and Fig. 4 depicts one such simulation where the multiplier is stacked on top of the RCA and the middle node voltage shifts towards the larger circuit, i.e., the multiplier, in both the 130nm and 32nm processes. Fig. 4 also shows how the middle node voltage (highlighted signal) shifts towards the circuit that is still running when the other is put to sleep. Table 2 lists the execution time for 12 input patterns, the corresponding active energy and energy delay product (EDP) for the various simulations when the multipliers are stacked and when they run individually (not stacked) in both processes. Table 2 clearly shows that the overhead of the stacked architecture is negligible (~0.3%).
With the additional control logic shown in Fig. 2 for the two-stacked model, the sizing of the transistors (M1-M2 or M3-M4) in parallel with the sleeping circuit directly affects the performance of the working circuit. Increasing the transistor widths in the row of the sleeping circuit continues to increase the speed and dynamic range of the running circuit, but also increases the active energy consumed. There is an optimal width for each transistor that corresponds to the lowest EDP for that stacked architecture, and it varies with the circuits stacked and the semiconductor process used. Fig. 5 shows the effect of increasing the transistor widths on the same simulation setup as in Fig. 4. Tables 3 and 4 show how increasing the widths of the control transistors in the sleeping circuit's row affects the energy-delay product (EDP) in the 130nm and 32nm processes, respectively. Although an optimal transistor width with the lowest EDP exists for each stacked implementation, it may not be the best option for a specific application. This robust MTNCL voltage stacking architecture therefore allows users to fine-tune the circuit parameters for their needs. For example, a circuit that sleeps for extended periods and is then called to work may need to complete its computation very quickly, at the cost of consuming more energy in that short span. Conversely, a circuit that runs for an extended period could run at a slower speed in order to consume much less energy over its duty cycle. Either option is available by increasing or decreasing the size of the transistors in the additional logic.
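Picking the width with the lowest EDP from a sweep can be sketched as follows. The sweep values are hypothetical placeholders chosen only to exercise the code; they are not taken from Tables 3 and 4, and the function names are assumptions.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP is the product of active energy and execution time."""
    return energy_j * delay_s


def lowest_edp_width(sweep):
    """Return the width whose (delay, energy) pair gives the lowest EDP.

    sweep: list of (width_um, delay_s, energy_j) tuples from simulation.
    """
    if not sweep:
        raise ValueError("sweep must not be empty")
    best = min(sweep, key=lambda point: energy_delay_product(point[2], point[1]))
    return best[0]


# Placeholder sweep: wider transistors run faster but use more energy.
placeholder_sweep = [
    (1.0, 10.0e-9, 1.0e-12),
    (2.0, 8.0e-9, 1.1e-12),
    (4.0, 7.5e-9, 1.4e-12),
]
best_width = lowest_edp_width(placeholder_sweep)
```

A designer who cares more about speed or about energy than about EDP could swap the `key` function for delay alone or energy alone, which mirrors the tradeoff described above.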
VI. PHYSICAL IMPLEMENTATION
While MTNCL voltage stacking looks very promising in the above schematic simulations, it needs to be physically laid out to demonstrate its feasibility. Layout introduces different isolated wells in the 130nm Bulk CMOS process, which may cause problems due to unintended diodes being created between the bulk substrate and the multiple wells. Introducing n stacked circuits requires n-1 additional wells, each with a voltage range from (n-1)×VDD to n×VDD, which can become quite large and cause a breakdown between the substrate and one or more of the isolated wells. The 32nm process does not run into this issue because it is an SOI process, which makes it ideal for implementing larger stacks. The cross-section views of stacking two inverters in both processes are depicted in Fig. 6.
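The growth of the well voltages with stack height can be illustrated with a short sketch. It assumes an ideally balanced stack in which layer k (counting from the bottom, starting at 0) operates between k×VDD and (k+1)×VDD; the function name is hypothetical.

```python
def layer_rails(n, v_dd):
    """Return the (low rail, high rail) voltages of each layer, bottom first.

    Assumption: the stack is ideally balanced, so every layer sees exactly
    v_dd across it.
    """
    if n < 1:
        raise ValueError("need at least one layer")
    return [(k * v_dd, (k + 1) * v_dd) for k in range(n)]


# Three stacked circuits at the 130nm nominal VDD of 1.2 V.
rails = layer_rails(n=3, v_dd=1.2)
top_low, top_high = rails[-1]
```

For three layers at 1.2 V, the top layer sits between roughly 2.4 V and 3.6 V above the substrate, which is the kind of well-to-substrate voltage that has to be checked against the breakdown limits of a bulk process.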
VII. CONCLUSIONS
The energy consumption and speed of the circuits in the MTNCL stacked architecture are comparable to those of the same circuits running individually, while the stacked architecture reduces chip area, overall energy, and design complexity by removing extra DC-DC converters. Unlike the synchronous stacked architecture, the asynchronous MTNCL paradigm allows different combinations of circuits to be stacked and to run different workloads with different operating cycles. In addition, the simple control mechanism incorporated into the stacked architecture lets circuit designers optimize the design for their individual needs. Moreover, the presented MTNCL circuit stacking architecture has the potential to be incorporated into mixed-signal systems to bring the supply voltage of the digital components close to, or even equal to, the supply voltages of the analog/RF components, thereby simplifying the overall power management system design and saving energy from the power sources, e.g., batteries.
