Abstract: Tunnel field-effect transistors (TFETs) are one of the most attractive steep subthreshold slope devices currently being investigated as a means of overcoming the power density and energy inefficiency limitations of CMOS technology. In this paper, we analyze the relationship between devices and RT-level architecture choices. We claim that architectural issues should be considered when evaluating this type of transistor because of the differences in delay versus supply voltage behavior exhibited by TFET logic gates with respect to CMOS gates. More specifically, the potential of pipelining and parallelism, both of which rely on lowering supply voltage, as power reduction techniques is evaluated and compared for CMOS and TFET technologies. The results obtained show significantly larger savings in power and energy per clock cycle for the TFET designs than for their CMOS counterparts, especially at low voltages. Pipelining and parallelism make it possible to fully exploit the distinguishing characteristics of TFETs, and their relevance as competitive TFET circuit design solutions should be explored in greater depth.
INTRODUCTION
Intensive research is currently being conducted into devices with steeper subthreshold slopes (SS) below the physical limit of 60mV/dec of CMOS technologies. A smaller SS makes it possible to lower threshold voltage while keeping leakage current under control, facilitating low voltage operation with acceptable speed and thus overcoming the power density problems and energy inefficiency of scaled CMOS.
Tunnel transistors (TFETs) are one of the most attractive steep subthreshold slope devices [1]-[5].
Experimental TFETs with SS under 60mV/dec have already been obtained in different material systems, including silicon TFETs and III-V TFETs. Band-to-band tunnel field-effect transistors based on two-dimensional transition metal dichalcogenide semiconductors are being explored as a potential means of improving their on currents (I ON ) [6].
Many benchmarking efforts have been made to evaluate gains over CMOS and thereby identify those devices which are the most promising candidates for replacing or complementing CMOS under different metrics or in different application areas. Several works have shown power benefits for iso-performance, or higher performance at iso-power, up to moderate operating frequencies [7]-[12]. The restriction to moderate frequencies arises because current TFETs do not reach the on current values obtained by CMOS transistors at their nominal supply voltages. Nevertheless, opportunities to enhance higher performance domains also exist in applications with stringent power budgets [13].
María J. Avedillo and Juan Núñez, Instituto de Microelectrónica de Sevilla, IMSE-CNM (CSIC/Universidad de Sevilla)
More recently, different studies have addressed analyses in which architecture choices at different levels of abstraction are taken into account. That is, they focus on the relationship between devices and architectures, showing that TFETs are more attractive than CMOS for specific design techniques or computing paradigms.
This suggests that the conclusions drawn from comparisons between TFETs and CMOS could be different if these aspects are taken into account and that it may be possible to extend the application domains of these transistors in different directions. In [14]-[16], for example, processor level issues were addressed, showing speedup over CMOS due to the fact that a large number of TFET cores can function within the thermal limit and that TFETs can operate over a much wider range of microarchitecture complexity. In [17], [18], studies at lower levels of abstraction addressed issues related to circuit logic depth selection. Our work analyzes the relationship between devices and register transfer (RT) level architecture options. The conclusions drawn will be useful in the design of application specific circuits.
In this paper, we explore and compare the impact of pipelining and parallelism, two well-known register transfer level optimization techniques for increasing operating frequency and throughput, in TFET and CMOS technologies. The two techniques can also be applied to reduce supply voltage while maintaining speed, leading to power reductions or, in general, different power-speed trade-offs. However, their impact on power may vary due to the differences in the I ON versus V DD behavior of the two types of transistors.
There is therefore a justifiable need to explore these architecture options when comparing the two technologies.
The rest of the paper is structured as follows. Section 2 compares the delay versus supply voltage behavior of a tunnel transistor technology and a CMOS technology, and explains the rationale behind our work. Simulation results from benchmark circuits are described and discussed in Section 3. Finally, some conclusions are presented in Section 4.
IMPACT OF CONCURRENCY ON POWER
Two well-known techniques for increasing concurrency at circuit level are pipelining and parallelism. They can be used to improve frequency performance or for power optimization [19]. Both techniques are shown in Fig. 1. In Fig. 1b, pipeline registers are added in order to cut down signal propagation paths in the combinational block C shown in Fig. 1a. Thus, assuming equal delay for C1 and C2 and ideal registers, the operating frequency and the throughput (data produced per time unit) can be doubled with respect to the implementation in Fig. 1a. Equivalently, the frequency and throughput of the implementation in Fig. 1a can be maintained at lower V DD by that in Fig. 1b thanks to the shorter signal paths, thus producing power benefits. Different frequency-power trade-offs are possible. In Fig. 1c, a copy of the processing circuit and some extra circuitry are added in order to have two clock cycles available to propagate signals through combinational blocks. Assuming an ideal multiplexer, throughput/frequency can thus also be doubled with respect to the original implementation. Equivalently, the throughput and the frequency of the circuit in Fig. 1a can be maintained at lower V DD . As in the pipelined implementation, different throughput-power trade-offs are possible.
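The ideal timing argument above can be sketched in a few lines of Python. This is a minimal illustration under the same idealizations as the text (equal-delay stages, ideal registers, ideal multiplexer); the function names and the `ways` parameter (number of pipeline stages or parallel copies) are hypothetical and introduced only for this sketch.

```python
def max_throughput(path_delay_s, ways=1):
    """Results per second when block C is split into `ways` ideal pipeline
    stages or replicated in `ways` ideal parallel copies (ways=1: original)."""
    if path_delay_s <= 0 or ways < 1:
        raise ValueError("need path_delay_s > 0 and ways >= 1")
    return ways / path_delay_s


def allowed_path_delay(clock_period_s, ways=1):
    """Total time signals may take to cross block C when the clock period is
    kept fixed: each of `ways` stages gets one period, or each of `ways`
    copies gets `ways` periods."""
    if clock_period_s <= 0 or ways < 1:
        raise ValueError("need clock_period_s > 0 and ways >= 1")
    return ways * clock_period_s


t_c = 1e-9  # hypothetical delay of block C, in seconds
speedup = max_throughput(t_c, ways=2) / max_throughput(t_c)
slack = allowed_path_delay(t_c, ways=2) / t_c
print(speedup, slack)
```

Both views describe the same slack: it can be spent on a higher frequency at the original V DD, or on slower gates at a lower V DD and the original frequency.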
In both cases, frequency and throughput can be maintained at a reduced V DD because timing constraints are relaxed. In the ideal concurrent architectures (Fig. 1b and Fig. 1c), gate delays can be doubled without reducing frequency or throughput. [...] Since gate delay depends on I ON , it is expected that supply voltage can be lowered by a larger amount in the TFET circuit than in the CMOS one for equal delay degradation.
To illustrate these differences in behavior regarding delay and supply voltage, we simulated fan-out 4 (FO4) inverters at electrical level in CMOS and TFET technologies. The TFET was a projected 20nm [...] translates into a greater potential for power savings associated with the use of optimization techniques which relax timing constraints, like pipelining or parallelism, as described below.

As explained, frequency and throughput in ideal implementations of pipelining (Fig. 1b) and parallelism (Fig. 1c) [...] The average dynamic power dissipation of a logic gate can be expressed as:

$$P = \alpha \, C \, f \, V_{DD}^{2} \qquad (1)$$
where C is the total capacitive load, α is the activity factor, f is the operating frequency and V DD the supply voltage. In an ideal implementation of the pipelined circuit (ideal pipeline registers), exactly the same gates as those in the original circuit are operated at the same frequency. The power ratio (PR) between the ideal pipelined design (operated at V DD,RED ) and the original circuit (operated at V DD,ORIG ) can be approximated as:

$$PR = \frac{P_{RED}}{P_{ORIG}} \approx \left( \frac{V_{DD,RED}}{V_{DD,ORIG}} \right)^{2} \qquad (2)$$
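The step from (1) to this ratio can be written out as a short sketch, under the idealizations already stated: the pipelined design contains the same gates, with unchanged loads and activity factors, clocked at the same frequency f. The per-gate index i and the symbols V_{DD,ORIG} and V_{DD,RED} (original and reduced supply voltages) are notation chosen for this sketch.

```latex
\begin{align}
P_{ORIG} &= \sum_i \alpha_i \, C_i \, f \, V_{DD,ORIG}^{2}, \\
P_{PIPE} &= \sum_i \alpha_i \, C_i \, f \, V_{DD,RED}^{2}, \\
PR = \frac{P_{PIPE}}{P_{ORIG}}
   &= \frac{V_{DD,RED}^{2} \sum_i \alpha_i C_i f}{V_{DD,ORIG}^{2} \sum_i \alpha_i C_i f}
    = \left( \frac{V_{DD,RED}}{V_{DD,ORIG}} \right)^{2}.
\end{align}
```

The common factor cancels only if the capacitances do not change with supply voltage, which is why the result is an approximation.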
Note that (2) is independent of C, f, and α, so the selected transistor sizing criteria and interconnection capacitances for the benchmark circuits do not impact this approximation of PR. (2) is also valid for the ideal parallel circuit, where each gate is now replicated (two copies of the original circuit are required) but switching activity is halved.

The previous analysis does not consider speed. However, it is interesting to evaluate the same power impact taking this factor into account. To complete our study, Fig. 2c shows supply voltage versus FO4 delay. Of the four curves depicted, CMOS-ORIG and TFET-ORIG correspond to the V DD required for the CMOS and the TFET FO4 inverters, respectively, to produce a given delay. CMOS-RED (TFET-RED) corresponds to the V DD required for the CMOS (TFET) FO4 inverter to produce twice a given delay, i.e., the V DD,RED for that FO4 delay. FO4 delays of up to 20ps are shown. As expected, for a given speed, the V DD for CMOS-RED (TFET-RED) is smaller than that of CMOS-ORIG (TFET-ORIG). It can also be observed that the TFET inverter (TFET-ORIG curve) is not able to operate as fast as the CMOS inverter, as is also well known.
Furthermore, as found in many previous works, for equal delay the V DD for TFET-ORIG (TFET-RED) is smaller than the V DD for CMOS-ORIG (CMOS-RED), thus producing power advantages for iso-performance. To achieve a delay of 10ps, for instance, V DD is 0.7V for CMOS (the CMOS-ORIG curve) and 0.42V for TFET (the TFET-ORIG curve).
It is even more interesting to analyze the results in Fig. 2c from the perspective of applying concurrency techniques. In this scenario, the ORIG versions of the curves correspond to the V DD required to operate the designs at a given frequency. The RED versions of the curves correspond to the V DD required by the ideal pipeline or parallel implementation operating at the same frequency. That is, minimum supply voltages needed to operate at a given speed, with (RED versions) and without (ORIG versions) applying concurrency can be compared.
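The way an ORIG and a RED voltage are read from one delay-versus-V DD characteristic can be sketched as follows. This is a minimal illustration assuming the characteristic is available as sampled points and that linear interpolation between samples is acceptable; the function names and the sample points are hypothetical and are not data from this work.

```python
from bisect import bisect_left


def vdd_for_delay(curve, delay_ps):
    """Linearly interpolate the supply voltage that yields `delay_ps`.

    `curve` is a list of (vdd_volts, delay_ps) points with distinct delays,
    e.g. sampled from a simulated FO4 delay-versus-V_DD characteristic.
    """
    pts = sorted(curve, key=lambda p: p[1])  # ascending delay
    delays = [d for _, d in pts]
    if not delays[0] <= delay_ps <= delays[-1]:
        raise ValueError("delay outside the characterized range")
    i = bisect_left(delays, delay_ps)
    if delays[i] == delay_ps:
        return pts[i][0]
    (v0, d0), (v1, d1) = pts[i - 1], pts[i]
    return v0 + (v1 - v0) * (delay_ps - d0) / (d1 - d0)


def vdd_orig_and_red(curve, delay_ps, ways=2):
    """V_DD of the original design and of an ideal `ways`-way concurrent
    design that may use `ways` times the gate delay at the same frequency."""
    return vdd_for_delay(curve, delay_ps), vdd_for_delay(curve, ways * delay_ps)


# Made-up points for illustration only (NOT taken from Fig. 2c).
toy_curve = [(1.0, 10.0), (0.8, 20.0), (0.6, 40.0)]
vdd_orig, vdd_red = vdd_orig_and_red(toy_curve, 15.0)
print(vdd_orig, vdd_red)
```

With a real characteristic, the gap between the two returned voltages at each delay is the supply voltage reduction that ideal concurrency makes available.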
Here, a number of important points should be mentioned. Firstly, note that for FO4 delays above 10.5ps the supply voltage of CMOS-RED is larger than that required by TFET-ORIG. Secondly, the smallest delays (highest frequencies) obtained by CMOS-ORIG could be achieved in TFET through the application of ideal concurrency (TFET-RED) at supply voltages of around 0.25V, much smaller than in CMOS (0.9V, or 0.47V when concurrency is applied). This suggests that the application domain in which TFETs can offer power advantages could be extended to higher frequencies. In other words, supply voltage reduction is significant enough to anticipate advantages even taking into account possible differences in interconnection capacitances.
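For the CMOS voltages quoted above (0.9V without concurrency, 0.47V with it), the ideal dynamic power ratio can be evaluated directly. The sketch below takes expression (2) to be the squared supply voltage ratio that follows from the quadratic V DD dependence of dynamic power; the function name `power_ratio` is hypothetical, and the estimate ignores overhead circuitry, glitches and voltage-dependent capacitances.

```python
def power_ratio(vdd_red, vdd_orig):
    """Ideal dynamic power ratio P_RED / P_ORIG at the same frequency,
    assuming identical capacitive load and activity in both designs."""
    if vdd_red <= 0 or vdd_orig <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd_red / vdd_orig) ** 2


# CMOS at its smallest FO4 delay: 0.9V (ORIG) versus 0.47V (RED).
pr_cmos = power_ratio(0.47, 0.9)
print(f"{100 * pr_cmos:.1f}%")  # prints 27.3%
```

The estimate applies within one technology; comparing TFET and CMOS power in absolute terms additionally requires the capacitances of each technology.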
Finally, it is also interesting to consider applications with stringent power budgets requiring low operating voltages. Assume a maximum allowed supply voltage of 0.5V. Note here that the TFET speed could be further increased. TFET-RED, which achieves the smallest delay with 0.26V, can be operated at a larger supply voltage to increase speed. In contrast, CMOS-RED, which achieves the smallest delay with 0.47V, leaves little room for improvement.
These results support our claim that introducing concurrency could be more efficient for optimizing power in TFET technologies than in CMOS, especially at low voltages. However, this model for estimating power advantages is very simple. It takes into account only dynamic power, and not the impact of supply voltage on transistor capacitances. Neither does this simple experiment consider the power and delay overheads associated with the extra circuitry that needs to be added to support pipeline or parallel operations.
Moreover, only the simplest logic gate, the inverter, is evaluated, whereas the behavior of more complex gates is not considered, and other sources of power consumption, like glitches, are ignored. To overcome these limitations, electrical simulations were carried out at circuit level. These are described in the next section.
CIRCUIT LEVEL EVALUATION

A. Applying parallelism
Two versions of an 8-bit adder were evaluated and compared. They were implemented using the architectures shown in Fig. 1a (ORIG) and Fig. 1c (PARA). [...] Power ratio values of between 25% and 60% were obtained for TFET (slightly higher than those estimated for the inverter). This small increase was partially due to the multiplexer added to support the parallelism. Note that this extra circuitry was not taken into account in the preliminary inverter experiment.
For CMOS, larger differences with respect to the inverter experiment were observed. Note that, unlike the inverter, there were no power savings (PR>100%) under V DD,ORIG =0.4V. For purposes of comparison, the simple estimation of PR given by expression (2) using the measured V DD,PARA values is superposed (in black) for each technology. As expected, the measured power ratios were higher than the estimated values. In particular, note that the small differences in V DD,PARA do not explain the larger power ratios obtained for CMOS. Note also that the differences were greater at lower supply voltages. To interpret the results, we analyzed the simulations in depth. It can be concluded that power was wasted in glitches occurring in the parallel designs operated at reduced supply voltages (slower circuits).
The power savings achieved in the TFET technology were significantly larger than those achieved in CMOS at low voltages. As pointed out in Section 2, similar relative power savings can be achieved with CMOS at higher supply voltages. However, these parallel CMOS implementations cannot compete with TFET in terms of power consumption. To operate at the maximum frequency of CMOS-ORIG at its nominal V DD , for example, the CMOS-PARA design requires 0.5V, while the TFET-PARA design requires only around 0.3V.
Significant power advantages can therefore be expected from the TFET-PARA design. In our experiment, we found that this design consumes 10% (3%) of the power of the PARA (ORIG) circuits in CMOS.
Although these results could vary, depending on the selected transistor sizing criteria or due to parasitics associated with interconnections, the differences obtained are large enough to confirm the power savings advantages of TFET designs.
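Taken together, the two percentages reported above also fix the ratio between the two CMOS designs at this operating point. The following is a quick consistency check, assuming both percentages refer to the same operating frequency and treating the rounded values as exact; the power symbols are notation introduced for this sketch.

```latex
\begin{align}
P_{TFET\text{-}PARA} &\approx 0.10 \, P_{CMOS\text{-}PARA} \approx 0.03 \, P_{CMOS\text{-}ORIG} \\
\Rightarrow \quad \frac{P_{CMOS\text{-}PARA}}{P_{CMOS\text{-}ORIG}} &\approx \frac{0.03}{0.10} = 0.30
\end{align}
```

That is, the parallel CMOS design would consume roughly 30% of the power of the original CMOS design at this frequency, subject to the rounding of the reported figures.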
B. Applying pipelining
Two versions of a two-level adder tree were evaluated and compared. Each adder was an RCA, implemented as described in the previous section. A two stage pipeline design with registers between the first and second level adders (PIPE) was compared with the original adder tree (ORIG), and an experiment similar to the one carried out for the parallel benchmark was conducted. To further analyze the results, the power of both the adders and the pipeline registers was measured separately. Fig. 4b shows the power ratios produced without considering the impact of the pipeline registers.
The power ratios obtained were better than those predicted by the simple estimation model (the dashed line), indicating larger power savings. This is partially attributable to the reduction in glitches in the second-level adder associated with the incorporation of the registers.
Again, significant power advantages were achieved in TFET-PIPE in comparison with both CMOS designs. At the maximum frequency of CMOS-ORIG at its nominal voltage, we found that TFET-PIPE required a supply voltage of 0.3V and that its power consumption was 12% and 5% of the values obtained for the CMOS-PIPE (with V DD,PIPE =0.55V) and CMOS-ORIG designs, respectively.
The power savings obtained by both concurrent designs (PIPE and PARA) with respect to the ORIG one translate into energy savings. Note that the comparison between concurrent and original designs is carried out at the same operating frequency, so energy ratios and power ratios are identical. In the same way, the power comparisons between TFET designs and FinFET ones at the end of each sub-section can also be extended to energy.
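The equality of energy and power ratios can be written out in one line. This is a sketch assuming energy is measured per clock cycle, as in the abstract; E denotes energy per cycle, P average power and f the common operating frequency, with the CONC subscript standing for either concurrent design (notation introduced here).

```latex
\begin{equation}
\frac{E_{CONC}}{E_{ORIG}} = \frac{P_{CONC} / f}{P_{ORIG} / f} = \frac{P_{CONC}}{P_{ORIG}} = PR
\end{equation}
```

The frequency cancels only because both designs are compared at the same f; at different frequencies the two ratios would differ.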
CONCLUSIONS
The experiments carried out confirm that, as expected from the delay versus V DD behavior of TFET logic gates, the reductions in supply voltage achieved by applying pipelining or parallelism while maintaining speed are larger in TFET circuits than in CMOS circuits. It has been shown that applying pipelining or parallelism as a means of relaxing timing constraints on signal propagation paths is a more efficient power and energy reduction technique for tunnel transistor circuits than for CMOS. These results suggest that such architectural issues should be considered when evaluating this type of transistor. That is to say, the benchmarking of TFETs versus CMOS should not be limited to comparing logic gates or identical circuit structures, since the impact of RT-level optimization techniques can vary greatly between the two technologies. From a complementary point of view, it is preferable to use RT/logic architecture when designing with TFETs, in order to fully exploit their specific features, including their I ON advantages for low supply voltages. Techniques which make it possible to reduce supply voltage while maintaining operating frequency and throughput should therefore be considered as a means of designing more competitive TFET logic circuits.
As in previous studies at higher abstraction levels, our results also demonstrated the potential of TFETs to achieve power and energy advantages even at frequencies above those of CMOS designs operated at nominal supply voltages in severely power-limited applications.
