Abstract: Field-Effect Transistors (FETs) with on-line controllable polarity are promising candidates to support next-generation Systems-on-Chip (SoCs). Thanks to their enhanced functionality, controllable-polarity FETs enable a superior design of critical components in a SoC, such as processing units and memories, while also providing native solutions to control power consumption. In this paper, we present the efficient design of a SoC core with controllable-polarity FETs. Processing units are sped up at the datapath level, as arithmetic operations require fewer physical resources than in standard CMOS. Power consumption is decreased via embedded power-gating techniques and tunable high-performance/low-power device operation. Memory cells are made smaller by merging the access interface with the storage circuitry. We forecast the advantages deriving from these techniques by evaluating their impact on the design of a SoC for a contemporary telecommunication application. Using a 22-nm vertically-stacked silicon nanowire technology, a coarse-grain evaluation at the block level estimates a delay and power reduction of 20% and 19% respectively, at the cost of a moderate area overhead of 15%, with respect to a state-of-the-art FinFET technology.
I. INTRODUCTION
Driven by Moore's scaling law, the semiconductor industry foresees ever-higher levels of integration. Hence, the System-on-Chip (SoC) approach has developed rapidly over the last fifteen years, producing entire systems on single pieces of silicon. By integrating many processors, memories and peripheral components monolithically, SoCs allow designers to increase the performance and reduce the power consumption of complex electronic systems.
At the advanced 22-nm technology node, FinFET transistors provide an alternative to planar devices in order to build higher-performance low-power SoCs [1, 2]. To push device performance even further, vertically-stacked Silicon NanoWire FETs (SiNWFETs) with gate-all-around control are considered the natural extension of FinFETs [3]. In addition to the enhanced electrostatic control as compared to FinFETs, Multiple-Independent-Gate (MIG) SiNWFETs are also promising thanks to their controllable-polarity behavior [4, 5], i.e., the dynamic reconfiguration of the device polarity from n- to p-type thanks to an extra gate terminal. Devices with a dynamic configuration property enable technological uniformity and allow the fabrication processes to be simplified. Recent works [6, 7, 8] introduced Double-Independent-Gate (DIG) [7] and Three-Independent-Gate (TIG) [8] SiNWFETs to design compact combinational circuits, thanks to the higher device expressiveness. Different techniques have also been proposed for these emerging devices, addressing dedicated logic synthesis tools [9, 10], low-power techniques [8, 11] and memories [12].
In this paper, we review state-of-the-art design techniques exploiting the enhanced functionality of controllable-polarity FETs. In particular, we provide a transversal review of the possibilities unlocked by this device family, by covering logic gate design, memory design, low-power techniques and design methodologies. We also comment on these techniques in light of their applications in SoC design through a contemporary telecommunication application. Using a coarse-grain evaluation at the block level, we forecast that such an approach enables a delay and power reduction of 20% and 19% respectively, at the cost of a moderate area overhead of 15%, as compared to a state-of-the-art FinFET technology at a 22-nm fabrication node.
The remainder of the paper is organized as follows. In Section II, we report on controllable-polarity FET technology with a strong emphasis on MIG SiNWFETs. In Section III, we present its opportunities for arithmetic logic by reviewing compact logic gate implementations and novel design automation methodologies. In Section IV, we introduce low-power design methodologies while, in Section V, we present the design of compact memory structures. In Section VI, we illustrate the impact of all these techniques on the design of a SoC, targeting a contemporary telecommunication application. Finally, we conclude the paper in Section VII.
II. TOWARDS CONTROLLABLE-POLARITY FETS
Multiple-Independent-Gate (MIG) devices are transistors whose electrostatic properties are dynamically controlled via additional gate terminals. MIG devices have been successfully fabricated using carbon nanotube [5], graphene [13] and Silicon NanoWire (SiNW) [7, 14] technologies. As the natural evolution of the FinFET structure, vertically-stacked SiNWs are a promising platform for MIG controllable-polarity devices thanks to their high I_on/I_off ratio and CMOS-compatible fabrication process [7].
Within this family, we focus on Double-Independent-Gate (DIG) [7] and Three-Independent-Gate (TIG) [8] SiNWFETs. DIG and TIG FETs consist of a channel composed of vertically-stacked SiNWs supported by two source and drain pillars, and three separate gated regions. The device structure, represented by three vertically-stacked nanowires, along with the associated symbols, is depicted in Fig. 1-a. The side regions are called the Polarity Gates (PG) while the central region is tied to the Control Gate (CG). The Polarity Gate at Source (PG_S) and the Polarity Gate at Drain (PG_D) tune the Schottky barriers at the source and drain junctions respectively. When tied together, they form a unique PG terminal (and therefore a DIG device) that controls the type of the channel carriers (V_PG = V_DD → n-type, V_PG = V_SS → p-type). When separately controlled, i.e., within a TIG FET, the set of possible functionalities is larger, with an additional control of the device threshold voltage (V_th) and the ability to realize two transistors in a unique device [8]. The control gate modulates the amount of carriers flowing into the channel.
Fig. 1-b shows the I-V characteristic for a 22-nm TIG device simulated using a TCAD physical model [8]. We observe a dynamic reconfiguration behavior in terms of polarity and threshold voltage depending on the gate biases. Note that the DIG behavior can be derived from the conditions where PG_S = PG_D. The level of performance, i.e., the on-current level and the leakage floor, is similar to that of a FinFET Low STandby Power (LSTP) technology [2]. In the following, comparisons will always be carried out with respect to the LSTP technology. The dual-V_th characteristic of the TIG device is also visible (dashed lines: high-V_th, solid lines: low-V_th) with a threshold difference of about 0.3 V. In conventional dual-V_th technology, the high-V_th devices achieve a lower leakage floor but also a reduced on-current compared to low-V_th devices. However, with TIG SiNWFETs, the maximum on-current for both high-V_th and low-V_th configurations is the same, thus leading to more efficient circuits. We refer the interested reader to [7, 8] for more details about the physics of DIG- and TIG-SiNWFETs respectively.
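The polarity control described above can be summarized with a purely behavioral, switch-level sketch. The model below assumes ideal logic levels (1 for V_DD, 0 for V_SS) and ignores every analog effect, including the threshold-voltage tuning; the function names are illustrative and do not come from the cited works.

```python
def dig_polarity(pg: int) -> str:
    """Polarity selected by the polarity gate: PG=1 (V_DD) -> n-type, PG=0 (V_SS) -> p-type."""
    return "n" if pg == 1 else "p"


def dig_conducts(cg: int, pg: int) -> bool:
    """Idealized switch-level conduction of a DIG FET (assumption, not a device model).

    An n-type device is taken to be on for CG=1 and a p-type device for CG=0,
    so the channel conducts exactly when CG equals PG.
    """
    if dig_polarity(pg) == "n":
        return cg == 1
    return cg == 0


if __name__ == "__main__":
    for pg in (0, 1):
        for cg in (0, 1):
            print(f"PG={pg} CG={cg} polarity={dig_polarity(pg)} on={dig_conducts(cg, pg)}")
```

Under these assumptions the device acts as a comparator of its two gate values, which is what makes compact XOR-style gates possible.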
Finally, note that MIG devices are bigger (three gate regions) and have larger parasitic capacitances than unipolar devices. While this makes the individual devices less efficient, we will see in the following that their enhanced functionality opens up a diverse set of opportunities at the circuit level.
III. COMPACT DATA PATH DESIGN
Arithmetic logic is critical in most of today's Integrated Circuits (ICs). Indeed, arithmetic operations are the basis of datapaths that form the reasoning core of logic applications in silicon. Exclusive-OR (XOR) and MAJority (MAJ) logic functions are extensively used in arithmetic circuits; consequently, their physical realization is of paramount importance. In this context, MIG transistors open up new opportunities to implement XOR- and MAJ-based logic gates with few resources [6, 10, 15]. Based on transmission gates, the implementation of 3-input XOR and 3-input MAJ gates, depicted in Fig. 2, enables a full-adder realization with only 8 devices (input inverters apart) and only one transistor per stack. The full-adder forms a fundamental building block for many arithmetic circuits. Such an advantageous full-adder design extends to a whole range of arithmetic primitives, as reported in Table I. The area is normalized to the number of transistors employed to realize the function, whereas the delay is normalized to the delay of a 2-input XOR gate. Compared to an equivalent transmission-gate FinFET implementation, the proposed full-adder shows a 22% improvement in area and a 40% improvement in normalized delay. When used in arithmetic compressors, the proposed full-adders also enable area and delay savings.
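The reason a 3-input XOR and a 3-input MAJ gate suffice for a full-adder is that the sum bit is the XOR of the three inputs and the carry-out is their majority. The short check below is only a logic-level illustration (it says nothing about the transistor-level realization of Fig. 2), and its function names are chosen for this sketch.

```python
from itertools import product


def xor3(a: int, b: int, c: int) -> int:
    """3-input exclusive-OR: 1 when an odd number of inputs are 1."""
    return a ^ b ^ c


def maj3(a: int, b: int, c: int) -> int:
    """3-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def full_adder(a: int, b: int, cin: int) -> tuple:
    """Full-adder built from one XOR3 (sum) and one MAJ3 (carry-out)."""
    return xor3(a, b, cin), maj3(a, b, cin)


def full_adder_is_correct() -> bool:
    """Exhaustively compare against integer addition for all 8 input patterns."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        if a + b + cin != s + 2 * cout:
            return False
    return True


if __name__ == "__main__":
    print("XOR3/MAJ3 full-adder matches integer addition:", full_adder_is_correct())
```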
By employing the arithmetic elements in Table I, we study various industry-standard benchmark circuits comprising adders, multipliers, compressors and counters, as listed in Table II. Note that, for the transmission-gate style, we included additional buffering every four stacked devices. From Table II, we observe that the use of MIG devices consistently improves on conventional CMOS logic in both area (32% on average) and delay (38% on average).
The compact implementations of XOR and MAJ functions with controllable-polarity transistors hold promise for superior automated design of datapaths. However, conventional logic synthesis tools are not adequate to fully harness the advantages brought by the controllable-polarity feature in arithmetic logic, and therefore miss some optimization opportunities. To overcome these limitations, new approaches that better integrate the efficient primitives of controllable-polarity FETs are required. On the one hand, it is possible to propose innovations in the data representation form. For instance, Biconditional Binary Decision Diagrams (BBDDs) [9] are a canonical logic representation form based on the biconditional (XOR) expansion. They provide a one-to-one correspondence between the functionality of a controllable-polarity transistor and its core expansion, thereby enabling an efficient mapping of the devices onto BBDD structures. On the other hand, it is also promising to identify the logic primitives efficiently realized by controllable-polarity FETs in existing data structures. In particular, BDD Decomposition System based on MAJority decomposition (BDS-MAJ) [10] is a logic optimization system driven by binary decision diagrams that supports integrated MUX, XOR, AND, OR and MAJ logic decompositions. Since it provides both XOR and MAJ decompositions, BDS-MAJ is an effective alternative to standard tools to synthesize datapath circuits. In the controllable-polarity transistor context, BDS-MAJ natively and automatically highlights the efficient implementation of arithmetic gates.
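To make the biconditional expansion concrete, the sketch below checks the identity f(v, w, z) = (v XOR w) · f(NOT w, w, z) + (v XNOR w) · f(w, w, z), which is the form of the expansion as we understand it from the BBDD literature. The truth-table encoding and helper names are assumptions made for this illustration; it is not an excerpt of any synthesis tool.

```python
from itertools import product


def make_function(truth_table: int):
    """Build a 3-input Boolean function from an 8-bit truth table (illustrative encoding)."""
    def f(a: int, b: int, c: int) -> int:
        index = (a << 2) | (b << 1) | c
        return (truth_table >> index) & 1
    return f


def biconditional_expand(f, v: int, w: int, z: int) -> int:
    """Evaluate f through its biconditional expansion around the variable pair (v, w)."""
    f_neq = f(1 - w, w, z)  # cofactor used when v != w
    f_eq = f(w, w, z)       # cofactor used when v == w
    differ = v ^ w
    return (differ & f_neq) | ((1 - differ) & f_eq)


def expansion_holds_for_all_3_input_functions() -> bool:
    """Exhaustive check over all 256 functions of three variables."""
    for truth_table in range(256):
        f = make_function(truth_table)
        for v, w, z in product((0, 1), repeat=3):
            if biconditional_expand(f, v, w, z) != f(v, w, z):
                return False
    return True


if __name__ == "__main__":
    print("Biconditional expansion holds:", expansion_holds_for_all_3_input_functions())
```

The branching condition of each expansion step is a comparison of two variables, which mirrors the gate comparison performed by a single controllable-polarity device.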
IV. ADVANCED LOW POWER TECHNIQUES
Thanks to their good electrostatic control (coming from the gate-all-around architecture), MIG devices are promising candidates for low-power applications. However, the gain brought by the technology is not limited to intrinsic device performance. Indeed, the enhanced set of functionalities enables a simple implementation of advanced low-power techniques. In this section, we review a dual-V_th mode of operation and a power-gating technique.
A. Dual-Threshold Voltage Operations
As briefly introduced in Section II, TIG-FETs can be configured in terms of polarity but also in terms of threshold voltage. The dual-V_th characteristics of the TIG SiNWFET are depicted in Fig. 1-b. For the low-V_th configuration (solid lines), PG_S and PG_D are biased with the same voltage. In this configuration, the device switches between the on state and the standard off state [8]. For the high-V_th configuration (dashed lines), the device is wired unconventionally, as compared to a DIG device. Indeed, fixed bias voltages are now applied to CG and PG_S for p-type (CG and PG_D for n-type), while a voltage sweep is applied on PG_D (PG_S). Here, the device switches between the on state and a low-leakage off state.
Such properties are used to create multi-V_th circuits in a simplified way. Indeed, traditional multi-V_th circuits require extra technological steps to build devices with different threshold voltages, which affects the layout regularity and increases the process costs as compared to single-V_th design [17]. Here, the same transistors support the two configurations, leading to a drastic cost reduction. Fig. 3 illustrates two different NAND gate realizations, for high-performance (HP) and low-leakage (LL) applications, each implemented with only 3 transistors. In Fig. 3-a, the HP gate is obtained by connecting the inputs to the CGs of the p-FETs. Thus, the pull-up performance of the logic gate is improved by using the low-V_th configuration of the devices (solid lines in Fig. 1-b). In contrast, the LL gate (Fig. 3-b) is obtained by controlling the p-FETs from PG_D. Leakage power is thereby reduced by forcing the devices into high-V_th operation (dashed lines in Fig. 1-b). In both HP and LL gates, the PG_S and CG terminals of the n-FETs are connected to the input signals. Hence, delay and leakage in the pull-down paths cannot be further tuned.
Extensively studied in [8] by TCAD simulations, such an approach demonstrates the ability to reach the same level of performance as FinFET LSTP transistors at the 22-nm technology node for a slight area overhead of 8%. Therefore, the circuit-level opportunities compensate for the initial limitations found at the device level, such as the increase of parasitic capacitances.
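The 3-transistor NAND gates can be illustrated with a switch-level sketch. It relies on two assumptions that are not spelled out in the text above: a TIG FET is taken to conduct only when its three gates carry the same logic value, and the unused polarity gates are taken to be tied to V_DD (n-FET) or V_SS (p-FETs). The model captures the logic function only; the V_th difference between the HP and LL wirings is an analog effect that it does not represent.

```python
from itertools import product


def tig_conducts(pgs: int, cg: int, pgd: int) -> bool:
    """Assumed idealized rule: a TIG FET is on only when all three gates are equal
    (all 1 for n-type conduction, all 0 for p-type conduction)."""
    return pgs == cg == pgd


def nand_tig(a: int, b: int, low_leakage: bool = False) -> int:
    """Switch-level NAND with one pull-down TIG FET and two parallel pull-up TIG FETs."""
    # Pull-down: inputs on PG_S and CG; PG_D assumed tied to V_DD (1).
    # One device thereby plays the role of two series n-type transistors.
    pull_down = tig_conducts(pgs=a, cg=b, pgd=1)

    if low_leakage:
        # LL wiring: input drives PG_D, the other gates assumed tied to V_SS (0).
        up_a = tig_conducts(pgs=0, cg=0, pgd=a)
        up_b = tig_conducts(pgs=0, cg=0, pgd=b)
    else:
        # HP wiring: input drives CG, both polarity gates assumed tied to V_SS (0).
        up_a = tig_conducts(pgs=0, cg=a, pgd=0)
        up_b = tig_conducts(pgs=0, cg=b, pgd=0)
    pull_up = up_a or up_b

    if pull_up == pull_down:
        raise ValueError("output would be floating or shorted in this idealized model")
    return 1 if pull_up else 0


if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        print(f"a={a} b={b} HP={nand_tig(a, b)} LL={nand_tig(a, b, low_leakage=True)}")
```

Both wirings implement the same Boolean function, which is consistent with the idea that the choice between HP and LL is a bias decision rather than a change of devices.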
B. Embedded Power-Gating
An efficient power-gating implementation is also unlocked by the enhanced functionality offered by MIG-FETs. From a system perspective, power-gating is a common and effective technique to reduce leakage power in conjunction with multi-V th design. Power gating uses sleep transistors to disconnect the power supply from the rest of the circuit during idle time. The main drawbacks of power-gating are due to the series sleep transistor that (i) reduces the speed during normal operation and (ii) increases the circuit area.
By exploiting the on-line control of the device polarity, it is possible to create logic gates with power-gating capabilities and no series sleep transistors [11]. Based on Differential Cascode Voltage Switch Logic (DCVSL), the pull-up devices are not fixed as p-type; instead, their polarity is modulated on-line by a sleep signal connected to the polarity gates. The global concept is depicted in Fig. 4. In standby mode, i.e., when Sleep = 1, the pull-up devices are switched to n-type through the PGs. The CGs are tied to ground by the two additional n-type devices. Therefore, both pull-up devices are in the off-state. This provides the desired disconnection from the power supply. In the active operation mode, i.e., when Sleep = 0, the pull-up devices act as p-type. The CGs (connected to the gate outputs) are no longer tied to ground since the two additional n-type devices are in the off-state. The pull-down networks are now able to drive the outputs and close the standard feedback loop of DCVSL gates.
Applied to arithmetic and computation-intensive circuits, it has been shown in [11] that such a technique leads to area, delay and leakage power savings of 1.4×, 1.3× and 1.9× on average respectively, compared to power-gated circuits using FinFET LSTP devices at the 22-nm technology node.
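The sleep mechanism can be sketched at switch level for a single pull-up device. The sketch assumes ideal logic levels, the idealized rule that a controllable-polarity device conducts when its control gate equals its polarity gate, and the usual DCVSL cross-coupling in which each pull-up control gate follows the opposite output; the function name is hypothetical.

```python
def pull_up_conducts(sleep: int, cross_output: int) -> bool:
    """Idealized state of one DCVSL pull-up device with embedded power-gating.

    sleep        -- 1 in standby mode, 0 in active mode
    cross_output -- logic value of the opposite gate output (assumed cross-coupling)
    """
    # The sleep signal drives the polarity gates: Sleep=1 -> n-type, Sleep=0 -> p-type.
    pg = sleep
    # In standby the extra n-type devices tie the control gate to ground;
    # in active mode the control gate follows the opposite output.
    cg = 0 if sleep == 1 else cross_output
    # Assumed switch-level rule: the channel conducts when CG equals PG.
    return cg == pg


if __name__ == "__main__":
    for sleep in (0, 1):
        for out in (0, 1):
            state = "on" if pull_up_conducts(sleep, out) else "off"
            print(f"Sleep={sleep} opposite_output={out} pull-up {state}")
```

In standby the pull-up is off regardless of the output values, so the supply is disconnected without a series sleep transistor; in active mode the device behaves like the p-type pull-up of a conventional DCVSL gate.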
V. MEMORY OPPORTUNITIES
The performance of modern Systems-on-Chip is mainly influenced by the different sequential elements encountered along the data path. Controllable-polarity devices again open up promising new approaches in this field. In this section, we describe two novel memory designs adapted to local registering and large memory planes.
A. True-Single Phase Clock Flip-Flops
As already introduced, MIG FETs enable highly compact circuits thanks to their intrinsic comparator property. In practice, they enable two major improvements: (i) the compact realization of XOR functions and (ii) the merging of two serial transistors into a single device.
These two properties can be efficiently used in True-Single Phase Clock (TSPC) design [18]. Fig. 5 shows a flip-flop (FF) design built with only 8 transistors, as compared to 15 in its traditional CMOS counterpart. By reducing the number of transistors stacked in the pull-up and pull-down networks and by using the larger functionality set offered by the controllable polarity, it has been shown in [12] that the proposed design leads to data path storage elements with average area and delay savings of 20% and 43% respectively, again compared to FinFET LSTP transistors at the 22-nm technology node.
B. 4-Transistor Pseudo SRAM Cells
Multiple-independent-gate transistors also introduce novel opportunities to design versatile memory arrays. In particular, we introduce a memory cell that can play an active role in future Systems-on-Chip by enabling a dual operation mode between dynamic RAM and traditional SRAM. As the memory cell has both static and dynamic latching modes, we call it a pseudo-Static Random Access Memory (SRAM). The memory cell, depicted in Fig. 6-a, consists of four transistors, realizing two cross-coupled inverters with special properties. First, the bottom transistors are not standard FETs but MIG FETs, where one gate is still connected as in usual inverters while the others provide enhanced controllability. Second, the bottom terminals of the cross-coupled inverters are not grounded, but are connected to BitLines (BLs). By exploiting the controllability of the bottom multi-gate FETs, it is possible to let the BLs write/read the cell by directly forcing/sensing the logic value at the output nodes of the cross-coupled inverters.
The proposed cell has three operation modes, as highlighted in Fig. 6-b, c and d. The signals W (write) and EN (enable) control the bottom multi-gate FETs and thus impose the operation mode.
• When W=EN=1, the memory cell is in writing or static latch mode. Indeed, if both BLs are grounded, then the cell behaves as a static latch, as depicted in Fig. 6-b. When instead the BLs assume non-identical values in this operation mode, the internal nodes of the cross-coupled inverters are forced to assume such values, so that the cell operates similarly to an SR-latch. Note that after a short period (typically a few tens of picoseconds) the memory cell naturally stabilizes to the written values. Simulation waveforms, in a 22-nm silicon nanowire technology, for write/latching operations are reported in Fig. 7.
• When W=EN=0, the cell is in reading mode (Fig. 6-c). The BLs are initially discharged to ground. Subsequently, the bottom FETs charge the BLs to the values stored in the internal nodes. As in standard SRAMs, the reading process can be sped up by using sense amplifiers and related circuitry.
• When W and EN are different, the memory operates as a dynamic latch (Fig. 6-d). The bottom transistors disconnect the cell internal nodes from the BLs. Therefore, the value inside the cell is held as in a dynamic latch. This mode enables an intrinsic power-gating configuration. In this mode, the cell needs a periodic refresh operation (typically every millisecond), performed by configuring it in its static latch mode.
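The three operation modes listed above amount to a small decoding table on W, EN and the bitline values. The sketch below encodes that table; the enum, the function name and the choice to reject the undescribed case where both bitlines are high are assumptions made for illustration.

```python
from enum import Enum


class CellMode(Enum):
    STATIC_LATCH = "static latch"
    WRITE = "write"
    READ = "read"
    DYNAMIC_LATCH = "dynamic latch"


def cell_mode(w: int, en: int, bl: int = 0, bl_bar: int = 0) -> CellMode:
    """Decode the pseudo-SRAM operation mode from the control and bitline values."""
    if w != en:
        # Internal nodes are disconnected from the bitlines.
        return CellMode.DYNAMIC_LATCH
    if w == 0:
        # W = EN = 0: the bottom FETs charge the bitlines from the stored values.
        return CellMode.READ
    # W = EN = 1: behavior depends on the bitlines.
    if bl == 0 and bl_bar == 0:
        return CellMode.STATIC_LATCH
    if bl != bl_bar:
        return CellMode.WRITE
    # Both bitlines high is not covered by the description above.
    raise ValueError("bitline combination not described for W=EN=1")


if __name__ == "__main__":
    print(cell_mode(1, 1, bl=0, bl_bar=0).value)
    print(cell_mode(1, 1, bl=1, bl_bar=0).value)
    print(cell_mode(0, 0).value)
    print(cell_mode(1, 0).value)
```

A refresh in this picture is simply a temporary transition from the dynamic latch mode back to the static latch mode with both bitlines grounded.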
Regarding reading and writing times, simulation results in a 22-nm silicon nanowire technology show that these operations can be accomplished in 14.67 ps and 14.25 ps, respectively. As compared to a standard SRAM cell in 22-nm FinFET technology, the proposed cell is 14% smaller and 16% faster.
The enhanced configurability of the proposed pseudo-SRAM cell is especially interesting in modular systems, where a memory array can be used either as a regular dynamic memory or as a static register array. This selection can be made dynamically, which increases the configurability of some portions of the chip. From a system-level perspective, we expect the pseudo-SRAM cell to provide a valuable alternative to standard SRAM cells in circuits where low-power and high-performance operations are of paramount importance.
VI. SYSTEM-LEVEL IMPACT
In this section, we attempt to forecast the gain brought by MIG devices from a system-level perspective.
A. Methodology
The considered SoC is inspired by a promising linear block error-correcting decoder. In particular, we consider the 1024-bit Polar code decoder presented in [19], with a parallelism degree of 64. The architecture of the Polar code decoder is depicted in Fig. 8 and is divided into four major units: an array of arithmetic processing elements (64 in parallel), a regular SRAM-based memory, a partial-sum logic and an FSM in charge of scheduling the decoding algorithm.
The initial performance of such a SoC platform is estimated at coarse grain for a CMOS 22-nm technology node, scaling the original data from [19], using tri-gate FinFETs with the LSTP option [2]. We employ the techniques presented so far to design an optimized Polar code decoder using TIG controllable-polarity SiNWFET technology. We apply diverse optimization techniques to each major unit in Fig. 8, and we sketch their effect in comparison to an un-optimized design but also to the traditional CMOS technology. Note that all evaluations are done at the block level, i.e., considering the expected impact on the major units, without taking physical design into account. For this reason, the data presented hereafter does not claim to be fully accurate, but rather serves as an indicator of the potential of the new technology. Fig. 9 anticipates the results for CMOS technology and MIG-SiNWFETs, with different levels of optimization. In standard CMOS, the Polar code decoder has an area of 20.17 µm². The un-optimized TIG design (Fig. 9-TIG) is worse than the traditional CMOS design, as MIG-SiNWFETs are bigger (three gate regions) than FinFETs and introduce significant parasitic capacitances.
B. Improvement Evaluation and Discussions
However, it is possible to unlock the potential of controllable-polarity devices by using the design techniques presented so far. First, one can exploit the polarity control to obtain more compact arithmetic gates (Fig. 2), which are of paramount importance in the processing elements and partial-sum logic of the Polar code decoder. Traditional negative unate gates are also more compact thanks to the polarity control of TIG-SiNWFETs (Fig. 3). The combined effect on the decoder design corresponds to {26.66 µm², 0.29 ns, 9.57 mW} (Fig. 9-Compact Gates). Already at this point, the predicted critical path in the SiNWFET design is much shorter than in CMOS, allowing higher throughput. Focusing now on power, we can exploit the power-gating opportunity to disconnect portions of the circuit during idle time in the decoding process. Given the chosen parallelism degree, we can shut down, on average, half of the processing elements, which are only partially active during decoding. This corresponds to a reduction of the power consumption to 8.72 mW (Fig. 9-LowPower) at minor area and delay penalties. Considering the memories, the pseudo-SRAM cells of Fig. 6 find use in the LLR memory region of the Polar code decoder, while the TSPC FFs of Fig. 5 enable an efficient implementation of the partial-sum memories. These techniques reduce both area and power consumption (Fig. 9-Memories). Considering all the optimization techniques for TIG-SiNWFETs, we obtain a design having {23.47 µm², 0.29 ns, 6.98 mW}, thus a throughput of 1126 Mbps and an energy efficiency of 6.19 pJ/bit. With our design techniques, a TIG-SiNWFET-based Polar code decoder can be notably faster (20%) and more energy efficient (32%) than in CMOS, at a moderate area overhead cost (15%).
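The energy-efficiency figure follows directly from the quoted power and throughput, since energy per bit is power divided by throughput. The sketch below reproduces that arithmetic and the relative power savings between optimization stages, using only the numbers stated in the paragraph above; the helper names are illustrative.

```python
def energy_per_bit_pj(power_mw: float, throughput_mbps: float) -> float:
    """Energy per decoded bit in pJ, computed as power divided by throughput."""
    power_w = power_mw * 1e-3
    throughput_bps = throughput_mbps * 1e6
    return power_w / throughput_bps * 1e12


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when going from 'before' to 'after'."""
    return (before - after) / before


if __name__ == "__main__":
    # Fully optimized TIG-SiNWFET design: 6.98 mW at 1126 Mbps.
    print(f"Energy per bit: {energy_per_bit_pj(6.98, 1126):.3f} pJ/bit")
    # Power after compact gates (9.57 mW), power-gating (8.72 mW), all techniques (6.98 mW).
    print(f"Power-gating saving: {relative_reduction(9.57, 8.72):.1%}")
    print(f"Total saving vs. compact gates only: {relative_reduction(9.57, 6.98):.1%}")
```

The computed energy per bit is approximately 6.2 pJ/bit, in line with the value quoted in the text.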
VII. CONCLUSION
In this paper, we presented a complete design framework exploiting controllable-polarity transistors to support the next generations of Systems-on-Chip. Thanks to their enhanced functionality, controllable-polarity FETs enable a superior design of critical components in a SoC, such as the processing units and memories, while also providing native solutions to control the power consumption. Many recently introduced techniques were reviewed and their opportunities were evaluated on a complex arithmetic-intensive SoC. We showed that, with our proposals, the designed SoC can be 20% faster and 32% more energy efficient than its FinFET counterpart at the 22-nm node, at a moderate area overhead of 15%.
