ABSTRACT A new 3D IC fabric named NP-Dynamic Skybridge is proposed that provides fine-grained vertical 3D integration for future technology scaling. Relying on a template of vertical nanowires, it expands our prior work to incorporate and utilize both n- and p-type transistors in a novel NP-Dynamic circuit style compatible with true 3D integration. This enables a wide range of elementary logic functions, leading to more compact circuits, simple clocking schemes for cascading logic stages, and low buffer requirements. We detail new design concepts for larger-scale circuits, and evaluate our approach using a 4-bit nanoprocessor implemented at the 16 nm technology node. A new pipelining scheme specifically designed for our 3D NP-Dynamic circuits is employed in the nanoprocessor. We compare our approach with 2D CMOS as well as the state-of-the-art transistor-level monolithic 3D IC (T-MI) approach. Benchmarking results for the 4-bit nanoprocessor show benefits of up to 56.7x density, 3.8x power and 1.7x throughput over 2D CMOS. Compared with T-MI, our new 3D fabric showed 31x density, 3x power and 1.4x throughput improvement. Additional evaluation of 4-bit and 8-bit CLA designs shows that significantly improved gains can be achieved for our 3D approach over 2D CMOS with increasing circuit bit-width, indicating potential for future scalability.
I. INTRODUCTION
2D CMOS integrated circuit (IC) technology scaling faces severe challenges that result from device scaling limitations [1], [2] and from an interconnect bottleneck that dominates power and performance [3], among others. 3D ICs with die-to-die and layer-to-layer stacking using Through Silicon Vias (TSVs) [4] and Monolithic Inter-layer Vias (MIVs) [5] have been explored in recent years to generate circuits with considerable interconnect savings for continued technology scaling. However, these 3D IC technologies still rely on the device, circuit and interconnect mindset of conventional 2D CMOS, showing only incremental benefits [8] while adding new challenges such as thermal management, manufacturing and reliability issues [4], [5].
In [6], [7], a vertical nanowire based 3D integrated circuit fabric called Skybridge was proposed, showing a pathway for truly fine-grained 3D integration. In this 3D fabric, core IC fabric aspects from device to circuit style, connectivity [9], thermal management [10] and manufacturing pathway [11] are co-architected with 3D compatibility in mind, and the fabric uses the vertical dimension instead of a multi-layered 2D mindset.
In the Skybridge fabric, uniform n-type transistors are used in a dynamic circuit style [6], and n-type transistor based NAND and AND-of-NAND compound gates are the elementary logic functions. Multiple clock signals are used to control the cascading of elementary logic stages, and buffers are used in between stages for signal propagation and restoration in large-scale designs, which introduces limitations.
Here, we propose the NP-Dynamic Skybridge fabric [12], which incorporates both n- and p-type transistors to build a new class of fine-grained 3D integrated circuits. It follows a fabric-centric mindset to integrate both n- and p-type circuit components in the physical layer for 3D compatibility. The use of both n- and p-type transistors allows a wide range of elementary logic functions to be supported, including NAND, AND-of-NAND, NOR and OR-of-NOR, which provides high flexibility for the implementation of logic functions. This results in compact circuit designs with short interconnects and low buffer requirements. The use of the NP-Dynamic circuit style also allows a simple clocking scheme (since alternately cascading n- and p-type logic stages maps intrinsically to the signal monotonicity requirement of dynamic circuits). The resulting design with unique 3D components has good connectivity [9] and routability [13] to address high-density interconnects in larger-scale circuits, which leads to compact circuit design with significant power efficiency and high performance.
In this paper, the NP-Dynamic Skybridge approach is extensively evaluated at the circuit level and system level to investigate its benefits in circuit designs with complex and high-density interconnects. For circuit-level evaluation, its core fabric components, including fabric structures facilitating both n- and p-type integration and elementary logic gates, are presented in detail together with TCAD and HSPICE simulations. For system-level evaluation, a 4-bit Wire Stream Processor (WISP-4) [14] design is used as the benchmark. A new pipelining scheme specifically designed for the NP-Dynamic 3D circuit style is used in the WISP-4 design to reduce the operation time of each pipeline stage for improved throughput. Device-to-circuit fabric evaluation is presented for WISP-4 accounting for characteristics of selected materials, circuit style, placement and routing, and 3D layouts, which shows up to 56.7x density benefit, 3.8x power efficiency and 1.7x benefit in throughput against 2D CMOS at the equivalent technology node. We also compare our approach with the state of the art in the conventional 3D IC direction, the transistor-level monolithic 3D IC (T-MI) [15], at the equivalent technology node. Our 3D approach showed up to 31x density benefit, 3x power efficiency and 1.4x benefit in throughput against T-MI 3D IC. Additionally, we used 4-bit and 8-bit CLAs as benchmarks to evaluate NP-Dynamic Skybridge's gain over 2D CMOS as the bit-width of the design grows: NP-Dynamic's 8-bit CLA design has 24 percent higher latency and 16 percent higher power than its 4-bit CLA design, while 2D CMOS's 8-bit CLA design has 2x the latency and 25 percent higher power than its 4-bit CLA design.
The rest of the paper is organized as follows: Section II presents the proposed 3D fabric's core components. In Section III, we show the elementary gates and their fan-in sensitivity analysis. Section IV shows a device-to-circuit evaluation methodology for this new fabric. Then, in Section V, we show the benchmarking of a 4-bit nanoprocessor (WISP-4) with an optimized pipelining scheme for improved throughput versus 2D CMOS and T-MI 3D. Section VI presents a preliminary study of gain versus bit-width utilizing carry look-ahead adder designs. Section VII concludes the paper.
II. ENABLING 3D INTEGRATION: CORE COMPONENTS
NP-Dynamic Skybridge follows a fabric-centric mindset to create a truly fine-grained 3D integration system in the vertical dimension. Each core component is designed for 3D compatibility and overall system efficiency. These components are assembled on a 3D uniform template of single crystal nanowires that act as scaffolding for vertical assembly.
While extensive device-to-circuit aspects and evaluation are shown in this work, as well as experimental progress on the key process steps involved, a full experimental demonstration of a functional circuit is an ongoing long-term effort. Figure 1A shows the envisioned NP-Dynamic Skybridge. Using a process flow similar to the one described in [11], vertical nanowires are constructed primarily through masking and high aspect ratio etching on heavily doped silicon bulk (other methods are also possible). Architected fabric components are constructed on these nanowires by using material deposition techniques [11]. In this section, we present the core components that enable fine-grained integration of both n- and p-type nanowires in the Skybridge fabric. Material selection and working mechanisms are explained in detail to illustrate how these components are used in unison to achieve the desired functionality and 3D compatibility, with circuits implemented across both horizontal and vertical dimensions.
A. VERTICAL NANOWIRES
Vertical nanowires are the fundamental building blocks that enable vertical stacking of designed core Skybridge components. The nanowires serve multiple functions: they can act as (i) logic nanowires that have stacked transistors to implement required logic gates, (ii) routing nanowires to carry electrical signals along the vertical dimension, and (iii) heat dissipating nanowires to extract and sink heat generated during circuit operation to the bulk substrate [10].
The nanowire formation step precedes all manufacturing steps, and is done after wafer preparation. Wafer preparation involves stacking heavily doped n-type and p-type silicon layers to create a dual-doped silicon wafer ( Figure 1B ). This can be achieved by bonding heavily doped n-type and p-type substrates using techniques that are similar to the ones described in literature [16] , [17] and currently used for conventional 3D ICs. A silicon dioxide layer is used between the n-type and p-type doped silicon layers for isolation. Vertical nanowire patterning can be achieved through inductively coupled plasma etching [11] , [18] and has been experimentally demonstrated as shown in Figure 2A .
B. VERTICAL GATE-ALL-AROUND (VGAA) TRANSISTOR
VGAA junctionless transistors are used as active devices, and are formed on nanowires through consecutive material deposition steps [11]. These junctionless transistors use uniform doping with no abrupt variation in Drain/Source/Channel regions (Figure 1D-E), which simplifies manufacturing requirements and is especially suitable for this fabric. Their channel conduction is modulated by the workfunction difference between the heavily doped channel and the gate [19]. Titanium Nitride (TiN) and Tungsten Nitride (WN) are chosen for n-type and p-type transistors respectively to provide the required workfunction for the accumulation mode when the transistor is ON [20], [21]. 3D TCAD Process and Device simulations [22] were used to extract the device I-V characteristics, shown in Figure 3A. The n-type device had an ON current of 30 μA and an OFF current of 0.1 nA. The p-type device had an ON current of 26 μA and an OFF current of 0.76 nA. Figure 3B shows the TCAD-simulated gate capacitance of the n-type VGAA transistor for various V_ds values. In the saturation state, the VGAA transistor has around 250 aF of gate capacitance. The simulation methodology and assumptions are detailed in Section IV. Nanowire based junctionless transistors have been demonstrated experimentally in our previous work [25]. Compared to a 15 nm FinFET [42], our VGAA junctionless transistor has significantly lower gate capacitance (see Figure 4) due to its higher equivalent electrical oxide thickness (T_ox) [46]. Additionally, the junctionless transistor has lower parasitic capacitance due to its simple device structure [47]. These two factors help offset the performance degradation caused by the lower ON current of the junctionless transistor compared to a junction transistor (see Figure 4). Additionally, a series stack of 3D VGAA transistors was characterized through SPICE behavioral model based simulation (see Section IV-C) and verified with Sentaurus TCAD [22] physical simulation (see Section IV-A). The SPICE simulation of the series stack shows an I_on of 15 μA and an I_off of 15.3 pA (Figure 5A); the TCAD simulation shows an I_on of 15.7 μA and an I_off of 16.1 pA (Figure 5A). The difference between I_TCAD and I_SPICE is within 6.5 percent (Figure 5B). Details of TCAD simulation assumptions and methodology are given in Section IV-A.
VOLUME 5, NO. 2, APRIL-JUNE 2017
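As a quick check of the quoted agreement, the relative mismatch between the TCAD and SPICE values for the series stack can be recomputed from the numbers above. The sketch below is illustrative only: taking the TCAD value as the reference is an assumption, and the helper name `relative_difference` is hypothetical. Units cancel in the ratio, so only the numeric values are used.

```python
def relative_difference(reference: float, model: float) -> float:
    """Return |reference - model| / reference (unit-free)."""
    return abs(reference - model) / reference

# (TCAD value, SPICE value) pairs as quoted in the text for the series stack.
on_current = (15.7, 15.0)
off_current = (16.1, 15.3)

# Assumption: TCAD is treated as the reference in both comparisons.
on_mismatch = relative_difference(*on_current)
off_mismatch = relative_difference(*off_current)

print(f"I_on mismatch:  {on_mismatch:.1%}")
print(f"I_off mismatch: {off_mismatch:.1%}")
```

Under this assumption both mismatches fall below the 6.5 percent bound stated in the text.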
C. OHMIC CONTACTS
In NP-Dynamic Skybridge the input/output ports of different gates are connected using horizontal metallic routing components called bridges (See Section II-E) and vertical coaxial routing structures (See Section II-D). Specific materials are chosen for each doped silicon region to minimize contact resistance between heavily-doped silicon and metals ( Figures 1D and 1E) . Nickel is used for creating a low-resistance Ohmic contact with p-doped silicon and Titanium is chosen for n-doped silicon. Each of these metals has the proper workfunction to eliminate Schottky Barrier in the interface with corresponding doped silicon, achieving low resistance; in addition, they also have good adhesion to doped silicon [26] , [27] . A thin Titanium Nitride layer in the p-type nanowire Ohmic contact is used for avoiding the chemical reaction between Nickel and Tungsten. Figure 2B shows experimental demonstration of Ohmic contact formation through material deposition around vertical nanowire.
D. COAXIAL ROUTING STRUCTURES
Coaxial routing refers to a scheme where an outer signal routing layer runs coaxially with another inner signal routing layer without affecting each other. Every routing layer in such a coaxial structure facilitates signal propagation along the vertical dimension. This is unique and enabled by the fabric's vertical integration approach, and can be manufactured similar to the process flow used in [11] . A coaxial routing structure ( Figure 1G ) consists of two concentric metal layers separated by dielectric layers around a nanowire. The outermost metal shell (M2) and the inner nanowire are used for carrying input/output signals. Electrical coupling noise between the inner nanowire and outer metal shell can be mitigated by pinning the inner metal shell (M1) to a ground (GND) signal for shielding. Figure 1G illustrates this concept; the GND signal is applied to the M1 metal shell which thus acts as a shield layer, and prevents coupling between signals in M2 shell and the inner nanowire.
Given that a nanowire itself can carry a signal and the fabric incorporates both n- and p-type nanowires, support is needed for signal routing between n- and p-regions that bypasses the isolation dielectric layer between them. An inter-region contact structure is designed for this purpose to form a low resistance Ohmic contact between p-type and n-type regions on a single nanowire (Figure 1G). Figure 6 shows the I-V characteristics of the contact structure, obtained by emulating the fabrication process flow in the Synopsys Sentaurus Process and Device simulators [22] (see Section IV-B for details).
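To give a feel for the capacitance that the grounded M1 shield presents in the coaxial structure described above, the textbook formula for an ideal coaxial capacitor, C = 2*pi*eps0*eps_r*L / ln(r_outer / r_inner), can serve as a first-order estimate. This is a sketch under stated assumptions, not the extraction method used in this work: the radii, length and dielectric constant below are hypothetical placeholders, and fringing fields are ignored.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m

def coaxial_capacitance(r_inner: float, r_outer: float,
                        length: float, eps_r: float) -> float:
    """Capacitance (F) of an ideal coaxial capacitor; all lengths in metres."""
    if not 0 < r_inner < r_outer:
        raise ValueError("need 0 < r_inner < r_outer")
    return 2 * math.pi * EPS0 * eps_r * length / math.log(r_outer / r_inner)

# Hypothetical geometry: an 8 nm core radius, a shield at 12 nm radius,
# 100 nm of vertical run, and an SiO2-like dielectric (eps_r = 3.9).
c_shield = coaxial_capacitance(8e-9, 12e-9, 100e-9, 3.9)
print(f"core-to-shield capacitance: {c_shield * 1e18:.1f} aF")
```

The estimate scales linearly with the vertical run length and only logarithmically with the dielectric thickness, which is why shell spacing is a comparatively weak lever for reducing this capacitance.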
E. BRIDGES
Bridges ( Figures 1D and 1E ) connect with Ohmic contacts and coaxial routing structures to carry and propagate signals horizontally in-between nanowires ( Figure 1F ). As shown in Figures 1D and 1E , Tungsten is used as the material to form the bridges because of its good adhesion with Titanium [27] .
III. ELEMENTARY LOGIC CIRCUITS
A. NAND/NOR GATE
NAND and NOR gates can be implemented by stacking n- and p-type transistors respectively on a dual-doped nanowire. Figures 7A and 7B show 5-input NAND and NOR gate implementations respectively. Compared with 2D CMOS, the benefit of our 3D integration approach is evident from Figures 7A and 7B, where 7 transistors and 3 contacts occupy only one nanowire cross-sectional area footprint for NAND and NOR gate implementations.
As shown in Figures 8A and 8B, these elementary gates are controlled with pre-set and evaluate clock signals. The NAND gate has two operation phases. In the pre-set phase, the output node is pre-charged by turning on the transistor controlled by the pre-set signal (see Figure 8A). In the evaluate phase, the transistor controlled by the evaluate signal is turned on and the output value is evaluated through the pull-down network according to the input signals (see Figure 8A). For the NOR gate, the output node is pre-discharged during the pre-set phase, and during the evaluate phase the output is pulled up to a final value that depends on the ON/OFF status of the p-type transistors.
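The two-phase behavior described above can be captured in a small logic-level model. The sketch below is illustrative only: it abstracts each gate to a pre-set value plus a series stack that either conducts or does not, and the function names are hypothetical. It assumes the stacked transistors of each gate are in series, as implied by their placement along a single nanowire.

```python
from itertools import product

def dynamic_nand(inputs):
    """n-type dynamic gate: pre-charge to 1, discharge if the stack conducts."""
    out = 1                  # pre-set phase: output node pre-charged
    if all(inputs):          # evaluate phase: series n-type stack is ON
        out = 0              # only when every input is 1
    return out

def dynamic_nor(inputs):
    """p-type dynamic gate: pre-discharge to 0, charge if the stack conducts."""
    out = 0                  # pre-set phase: output node pre-discharged
    if not any(inputs):      # evaluate phase: series p-type stack is ON
        out = 1              # only when every input is 0
    return out

# Truth table for the 2-input case; the same functions accept 5 inputs.
for bits in product((0, 1), repeat=2):
    print(bits, "NAND:", dynamic_nand(bits), "NOR:", dynamic_nor(bits))
```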
B. COMPOUND LOGIC GATE
In this fabric, compound logic gates can be designed by using a combination of elementary Boolean gates (NAND and NOR) to realize complex logic functions in a single step. Compound logic, in this case OR-of-NORs (AND-of-NANDs), can be implemented by shorting the outputs of a collection of NOR (NAND) gates. Thus, NP-Dynamic Skybridge fabric offers high design flexibility that can produce extremely compact circuits. Figures 7C and 7D illustrate an example, where an AND-OR-INVERTER (AOI) 2x2 gate is built with AND-of-NANDs and OR-of-NORs logic respectively.
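The effect of shorting gate outputs can be checked exhaustively at the logic level. In the sketch below, a shared pre-charged node is discharged by any conducting n-type stack (AND-of-NANDs), and a shared pre-discharged node is charged by any conducting p-type stack (OR-of-NORs). The input grouping used for the OR-of-NORs version is one possible mapping chosen for illustration and may differ from the one drawn in Figure 7D; all function names are hypothetical.

```python
from itertools import product

def n_stack_conducts(inputs):
    return all(inputs)          # series n-type stack is ON when all inputs are 1

def p_stack_conducts(inputs):
    return not any(inputs)      # series p-type stack is ON when all inputs are 0

def and_of_nands(groups):
    """Shared pre-charged node: any conducting n-stack discharges it."""
    return 0 if any(n_stack_conducts(g) for g in groups) else 1

def or_of_nors(groups):
    """Shared pre-discharged node: any conducting p-stack charges it."""
    return 1 if any(p_stack_conducts(g) for g in groups) else 0

def aoi22(a, b, c, d):
    """Reference AOI 2x2 function: NOT(a.b + c.d)."""
    return int(not ((a and b) or (c and d)))

for a, b, c, d in product((0, 1), repeat=4):
    n_style = and_of_nands([(a, b), (c, d)])
    # Assumed grouping: NOT(ab + cd) = NOR(a,c) + NOR(a,d) + NOR(b,c) + NOR(b,d)
    p_style = or_of_nors([(a, c), (a, d), (b, c), (b, d)])
    assert n_style == p_style == aoi22(a, b, c, d)

print("both compound gates match AOI 2x2 on all 16 input patterns")
```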
C. CASCADING LOGIC GATES
As explained in [28], n-type dynamic logic (NAND gate) requires monotonically rising signals as inputs while p-type dynamic logic (NOR gate) requires monotonically falling signals as inputs; this is referred to as the monotonicity requirement for correct operation of dynamic circuits. In NP-Dynamic Skybridge, this requirement is intrinsically met by successively cascading n-type logic stages with p-type dynamic logic stages. This allows cascaded stages of a given circuit to be evaluated in the same clock period, and requires only one set of pre-set and evaluate clocks, leading to a simple clocking scheme.
The circuit schematic in Figure 7E illustrates a cascaded logic design. It was designed with two 2-input NAND gates in the first stage, the outputs of which drive a 2-input NOR gate in the second stage. Figure 7G shows the simulation output from HSPICE for functional validation. Initially, NAND gate output was pre-set to logic '1' and NOR gate output was pre-set to logic '0' simultaneously in a single clock period. During the evaluate period, both NAND and NOR gates are evaluated in the same clock phase because the NAND gate output provides a monotonically-falling output signal that acts as an input to the subsequent NOR gate to satisfy operational requirements.
The 3D layout of the above design is shown in Figure 7F ; three logic nanowires are used for implementing NAND and NOR logic gates; three additional nanowires are used for input/output routing, and two routing nanowires are used for internal input/output connection. 
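The monotonicity argument for the NAND-to-NOR cascade can be illustrated with a logic-level trace of the two phases. This is a minimal sketch, not the HSPICE validation of Figure 7G: node names `n1`, `n2` and `out` are hypothetical, and timing within the evaluate phase is reduced to an ordering of events.

```python
from itertools import product

def evaluate_cascade(a, b, c, d):
    """Return node values (n1, n2, out) after pre-set and after evaluate."""
    # Pre-set phase: NAND outputs pre-charged, NOR output pre-discharged.
    n1, n2, out = 1, 1, 0
    trace = [(n1, n2, out)]
    # Evaluate phase, first stage: a NAND output can only fall (1 -> 0).
    if a and b:
        n1 = 0
    if c and d:
        n2 = 0
    # Evaluate phase, second stage: the p-type stack conducts only once
    # both of its inputs have fallen, so the NOR output can only rise.
    if n1 == 0 and n2 == 0:
        out = 1
    trace.append((n1, n2, out))
    return trace

for a, b, c, d in product((0, 1), repeat=4):
    before, after = evaluate_cascade(a, b, c, d)
    assert after[0] <= before[0] and after[1] <= before[1]  # NAND nodes never rise
    assert after[2] >= before[2]                            # NOR node never falls
    assert after[2] == int(a and b and c and d)             # NOR(NAND, NAND) = a.b.c.d

print("cascade is monotonic and computes a.b.c.d for all 16 input patterns")
```

Because the NOR inputs start high, its p-type stack is guaranteed to be off at the start of the evaluate phase, which is why no separate clock is needed for the second stage.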
D. FAN-IN CONSIDERATIONS
Circuit design using high fan-in gates allows compact implementations with fewer transistors and interconnects. However, the fan-in was severely limited in CMOS circuits (typically with a maximum fan-in of 4) that used complementary MOSFETs in a static circuit-style, due to a prohibitive increase in circuit delay. This is because each gate's diffusion and load capacitances increase significantly with fan-in impacting the overall load capacitance that needs to be driven. In contrast to CMOS, the Skybridge approach builds each elementary gate by using single type uniform transistors in a dynamic circuit style. As a result, capacitances at the gate output node are significantly reduced, which yields better performance compared to CMOS for higher fan-in designs.
We analyzed the sensitivity to fan-in for elementary logic gates to evaluate the feasibility of high fan-in logic in NP-Dynamic Skybridge as follows. Figure 8 shows the circuit schematics used. VGAA junctionless device I-V and C-V characteristics were extracted using TCAD Device simulations (see Section IV-A) to build HSPICE compatible behavioral device models, which were then used to build circuit netlists in HSPICE. For CMOS, an equivalent NAND gate circuit (Figure 8C) was built using 16 nm tri-gate high-performance PTM device models [29]. The output nodes of both CMOS and NP-Dynamic Skybridge gates had a fan-out of 4 inverters. The worst-case delay was measured on the falling transition (90% VDD to 10% VDD) of the output node. Figure 9 compares the simulated gate delays of the CMOS NAND gate and the NP-Dynamic Skybridge NAND and NOR gates. These results are normalized to the delay of a gate with a fan-in of one in the respective technology. The circuit performance degrades with increasing fan-in, and the rate of increase in circuit delay with fan-in is referred to as fan-in sensitivity. Lower fan-in sensitivity is desirable because it enables the use of higher fan-in circuits, which can lead to more compact circuit designs by reducing the number of logic levels. Elementary gates (NAND and NOR) in NP-Dynamic Skybridge have much lower fan-in sensitivity than the 2D CMOS NAND gate due to the use of the dynamic circuit style and the compact 3D implementation. In the CMOS NAND gate, as the fan-in is increased, the load capacitance of the output node increases linearly due to the added drain capacitances from p-type transistors in the pull-up network. Therefore, the delay suffers from both increased load capacitance at the output node and increased RC delay [28] in the pull-down network, which results in a rapid increase of gate delay as circuit fan-in is increased.
NP-Dynamic Skybridge uses dynamic circuit style where the charging (or discharging) network for NAND (or NOR) has only one transistor connected with output node (see Figure 9 ). Thus the load capacitance at output node is constant in spite of increasing fan-in, and only the RC delay in the evaluation path (function network) increases with fan-in. Thus, our 3D NAND and NOR gates exhibit a lower rate of increase in RC delay (See Figure 9 ) with increasing fan-in compared to 2D NAND. Since the p-type VGAA junctionless transistor has a higher parasitic capacitance than n-type (doping concentration of our p-type nanowire is lower than n-type nanowire to achieve low OFF current), the NOR gate has slightly higher fan-in sensitivity than 3D NAND (See Figure 9) .
Typically in standard CMOS designs, the maximum fan-in is limited to 4. As shown in Figure 9, we determined the maximum fan-in of our 3D NAND and NOR gates by using a normalized delay upper bound similar to that of the CMOS NAND gate. This led to a maximum fan-in of 8 and 7 for NP-Dynamic Skybridge NAND and NOR gates respectively.
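The qualitative trend behind the fan-in argument can be reproduced with a first-order Elmore delay model of a series stack. The sketch below is not the HSPICE setup used for Figure 9: all resistance and capacitance values are hypothetical, unit-free placeholders, the pre-set and evaluate transistors in the dynamic stack are ignored, and the only difference between the two styles is whether the output load grows with fan-in.

```python
def elmore_delay(fan_in, r_on, c_internal, c_load):
    """Elmore delay of `fan_in` identical series transistors driving `c_load`,
    with `c_internal` at each intermediate node of the stack."""
    # Node k (counted from the supply rail) sees k * r_on of upstream resistance.
    internal = sum(k * r_on * c_internal for k in range(1, fan_in))
    return internal + fan_in * r_on * c_load

# Hypothetical, unit-free parameters chosen only to show the trend.
R_ON, C_INT, C_FANOUT, C_DRAIN = 1.0, 0.2, 4.0, 1.0

def cmos_nand_delay(fan_in):
    # Static CMOS: one parallel p-type drain is added to the output per input.
    return elmore_delay(fan_in, R_ON, C_INT, C_FANOUT + fan_in * C_DRAIN)

def dynamic_nand_delay(fan_in):
    # Dynamic style: a single pre-set transistor loads the output at any fan-in.
    return elmore_delay(fan_in, R_ON, C_INT, C_FANOUT + C_DRAIN)

for n in range(1, 9):
    cmos = cmos_nand_delay(n) / cmos_nand_delay(1)
    dyn = dynamic_nand_delay(n) / dynamic_nand_delay(1)
    print(f"fan-in {n}: CMOS {cmos:.2f}x, dynamic {dyn:.2f}x")
```

With these placeholder values the normalized dynamic delay stays below the normalized CMOS delay at every fan-in above one, mirroring the direction (though not the magnitudes) of the simulated results.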
IV. DEVICE-TO-CIRCUIT EVALUATION METHODOLOGY
A device-to-circuit simulation methodology that includes detailed effects of material choices, confined dimensions, nanoscale device physics, 3-D circuit style, 3-D interconnect parasitics, and characterization of large-scale circuits, is used in our evaluations. Figure 10 shows the main steps of this methodology. The key metrics of our benchmark circuits presented in Sections V and VI are comprehensively evaluated by using this flow. The following sections describe each phase of this flow in detail.
A. VGAA JUNCTIONLESS TRANSISTOR CHARACTERIZATION
The n-type and p-type VGAA junctionless transistors were extensively characterized using detailed physics-based 3D simulation of the electrostatics and operations using Synopsys Sentaurus TCAD [22]. Sentaurus Process [23] was used to create the device structure (see Figure 11A) emulating the actual process flow; process parameters such as ion implantation dosage, anneal duration and temperature, deposition parameters etc. were similar to our experimental process parameters for junctionless device demonstration [24]. The resulting device structure had a 16 nm long Si channel, 2 nm of HfO2 as gate oxide, an 11.5 nm thick gate electrode, 5 nm long Si3N4 as spacer material, and 22 nm thick S/D contact material (see Figure 14). Gate metal work function is 5.2 eV (TiN) and 4.3 eV (WN) for n-type and p-type transistors respectively [20], [21]. A 16 nm channel length was simulated, following a feature size similar to that of the original Skybridge device [6], [7]. Uniform doping for drain, channel and source was required to form the VGAA junctionless transistor (see Figure 11B), and As and Br were chosen as dopants for n- and p-type devices respectively. The doping concentration was 10^19 cm^-3 for the n-type device and 10^20 cm^-3 for the p-type device. The device structure was then used in Sentaurus Device [24] simulations to extract device characteristics accounting for nanoscale confinement, surface and Coulomb scattering, and mobility degradation effects [22]. The silicon band structure was calculated using the OldSlotboom model [22], and charge transport was modeled using the hydrodynamic transport model [22]; quantum confinement effects were taken into account by using the density gradient quantum correction model [22].
Electron mobility was modeled taking into account effects due to high doping, surface scattering, and high-k scattering. The simulated device characteristics are shown in Figure 3 . These simulated device characteristics were used to generate a behavioral device model for HSPICE circuit simulations.
In order to verify that the behavioral device model can precisely capture the electrostatic characteristics of a 3D gate with multiple VGAA transistors in series, a structure with two n-type VGAA transistors was created in Sentaurus TCAD (see Figure 12A). The key process parameters were kept the same as in the single-device process simulation. Figure 12B shows the TCAD-simulated current density when both transistors are on; the ON current flows through the top transistor's drain contact, drain, channel, source and source contact to the bottom transistor, which validates the normal function of the VGAA transistor series. The measured I_on-V_g curve is shown in Section II-B in comparison with our behavioral SPICE device model based simulation results. The results indicate a minimal difference between TCAD and SPICE simulations of up to 6 percent. This verifies the precision of the behavioral device model in circuit simulation.
B. INTER-REGION CONTACT STRUCTURE SIMULATIONS
The silicon-metal contact interface, where there is a resistive interface region caused by the Schottky barrier between doped silicon and metal, was simulated with nanoscale 3D device physics simulation using Synopsys Sentaurus TCAD. Figure 13 shows the simulated contact structure that includes the interface between Nickel and p-type doped silicon, and the interface between Titanium and n-type doped silicon. The area of each interface region is equal to the area of each contact in the bypass routing structure shown in Figure 1G. The doping concentration was 10^19 cm^-3 for n-type silicon and 10^20 cm^-3 for p-type silicon. The work function was 4.7 eV for Nickel and 5.2 eV for Titanium [31]. The Schottky boundary model was chosen as the physics model of the contact interface. Interface scattering, surface roughness and interface trapped charges are considered in the device simulations.
FIGURE 11. STCAD device simulation: A) Generated n-type VGAA structure with high-density meshing [22] in channel, gate oxide and gate metal regions. B) Uniform heavy doping (10^20 cm^-3) in S/D and channel for our n-type VGAA transistor.
C. CIRCUIT SIMULATIONS
Novel nanoelectronic devices do not have built-in models in traditional circuit simulators such as HSPICE [38]. One general solution is to use device simulation data to create behavioral models of the novel devices that are compatible with HSPICE [38]. The TCAD simulated device characteristics (I_d-V_g, C_g-V_g, etc.) were used to generate a SPICE-compatible behavioral device model for our VGAA transistor. Regression analysis was performed on the device characteristics, and multivariate polynomial fits were extracted using the DataFit software [33]. Mathematical expressions were derived to express the drain current as a function of two independent variables, the Gate-Source (VGS) and Drain-Source (VDS) voltages. These expressions are then incorporated into sub-circuit definitions for voltage-controlled resistors in HSPICE [38]. Capacitance data from Sentaurus Device [24] simulations is directly integrated into HSPICE using voltage-controlled capacitance (VCCAP) elements and a piece-wise linear approximation. The regression fits for current, together with the piece-wise linear model for capacitances and the sub-circuits, define the behavioral SPICE model for the VGAA junctionless transistor. This modeling methodology is similar to our prior work on horizontal nanowire device modeling [30].
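The two ingredients of such a behavioral model, a bivariate polynomial for the drain current and a piece-wise linear table for the gate capacitance, can be sketched as follows. This is an illustration of the general technique only: the table points, the coefficient matrix and the function names are hypothetical and are not the fitted values or the HSPICE sub-circuit used in this work.

```python
from bisect import bisect_right

def pwl(x, xs, ys):
    """Piece-wise linear interpolation over sorted xs, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1              # segment [xs[i], xs[i + 1]]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def drain_current(vgs, vds, coeffs):
    """Evaluate the bivariate polynomial sum(coeffs[i][j] * vgs**i * vds**j)."""
    return sum(c * vgs**i * vds**j
               for i, row in enumerate(coeffs)
               for j, c in enumerate(row))

# Hypothetical capacitance table (arbitrary units) versus gate voltage.
VG_POINTS = [0.0, 0.2, 0.4, 0.6, 0.8]
CG_POINTS = [10.0, 12.0, 18.0, 23.0, 25.0]

# Hypothetical coefficients: I = vgs * vds (arbitrary units), i.e. only the
# vgs^1 * vds^1 term is non-zero. A real fit would have more terms.
COEFFS = [[0.0, 0.0],
          [0.0, 1.0]]

print("C_g at V_g = 0.5:", pwl(0.5, VG_POINTS, CG_POINTS))
print("I_d at (0.8, 0.5):", drain_current(0.8, 0.5, COEFFS))
```

Clamping the table at its end points is a design choice made here for robustness; a circuit simulator may extrapolate instead.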
In addition to accurate device characteristics, our 3D circuit simulations also accounted for 3D layout specific interconnect parasitics and coupling noise effects, considering actual dimensions and material choices. 3D layout mapping into the Skybridge fabric and interconnection followed manufacturing assumptions and the fabric's design rules (see Figure 14). All design rules also followed ITRS guidelines for the 16 nm technology node [34]. Capacitance calculations for the vertical coaxial routing structures followed the methodology in [35], and resistance calculations followed the PTM interconnect model [36]. The PTM model [36] was also used to calculate the resistance and the parasitic and coupling capacitance (RC) of the horizontal metal routing (bridges). The interconnect RC values were calculated in the PTM model tool according to the actual dimensions and spacing of the interconnects in the designed 3D layout as well as the characteristics of the materials used. Figure 7H shows the 3D layout of a 4-bit carry look-ahead adder (CLA) with signal routing, local clock tree design and power delivery network (PDN). The logic nanowires (logic gates) and routing nanowires (vertical coaxial structures) are placed separately, row by row. In each logic nanowire row, the logic nanowires share uniform local clock signal wires (CLK1 and CLK2; see details of the control clock signals in Section V-A). The whole CLA uses a uniform set of {CLK1, CLK2} controls and executes as one stage in our WISP-4 processor pipeline. Thus, the clock signal wires of different logic nanowire rows are also connected through the routing nanowire rows in between. Also, CLK1 and CLK2 are complementary to each other (see Figure 16) and can be generated from one uniform clock source, CLK. Therefore, each stage of the WISP-4 processor uses one uniform clock signal (CLK), and the local complementary signals (CLK1 and CLK2) are generated by inverters.
Due to the uniform distribution of local clock signal wires and the improved routing capacity, the clock tree design of our Skybridge 3D IC (after proper floorplan for each stage's circuit) has much lower complexity than 2D CMOS. The parasitics of the designed clock tree routing and the impact of clock jet (from generating complementary clock signals) were all included in our HSPICE circuit simulation.
D. AREA EVALUATION
The example circuits used for benchmarking were designed and physical 3D layout was manually performed. Area footprint of each design was calculated based on the number of nanowires used and nanowire pitch as per Skybridge design rules (See Figure 14) . 
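The area calculation described above reduces to multiplying the number of nanowire sites by the pitch in each direction. The sketch below assumes a rectangular nanowire array; the array size and pitch values are hypothetical placeholders, since the actual values come from the fabric design rules (Figure 14) and the manual layout.

```python
def footprint_area(rows: int, cols: int,
                   pitch_x_nm: float, pitch_y_nm: float) -> float:
    """Footprint of a rectangular nanowire array, in nm^2."""
    return rows * cols * pitch_x_nm * pitch_y_nm

# Hypothetical layout: a 4 x 6 array of nanowires on a 100 nm pitch.
area_nm2 = footprint_area(rows=4, cols=6, pitch_x_nm=100.0, pitch_y_nm=100.0)
print(f"footprint: {area_nm2 / 1e6:.3f} um^2")
```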
V. CELL-TO-SYSTEM CIRCUIT BENCHMARKING
A. DESIGN FOR A 4-BIT NANOPROCESSOR IN NP-DYNAMIC SKYBRIDGE
A 4-bit wire stream nanoprocessor [14] (WISP-4) was chosen as the benchmark circuit for system-level evaluation. Generally, in nanoscale computing fabrics [30], [39], each pipeline stage with a dynamic circuit implementation has three operation phases: pre-set, evaluate and hold. In this paper, we propose a new pipelining scheme where each pipeline stage has only pre-set and evaluate phases; the hold phase is removed to reduce each stage's operation time, which significantly improves the throughput. We implement this new pipelining scheme by using an enabled latch [40], which is designed with two cascaded clock-enabled inverters. Figure 15A shows the circuit schematic of the enabled latch, and the timing graph (Figure 15B) shows how the latch works to remove the hold phase. During the evaluation phase of stage 1, the latches are enabled by the signal Eva and its complement (see Figure 15B) and the output values go into the latches. After the evaluation phase, the latches are turned off and the output results are held for the evaluation phase of the next stage. Therefore, stage 1 can enter the pre-set phase again immediately, without using a hold phase to wait and hold the values for the next stage. By applying this pipeline scheme, each stage is operated in two phases, pre-set and evaluate, which means the operation time of each stage goes down from 3 × T_critical (T_critical is the critical path delay of the longest stage) to 2 × T_critical. Additionally, this pipelining scheme is noise-resilient, since the output node 'R_hold' (see Figure 15A) of the latch has no coupling with the 'pre1' signal and its complement in the first stage (see Figure 15A), which were found to be the main source of noise imported from the first stage into the second stage [30].
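The throughput effect of removing the hold phase follows directly from the stage operation times quoted above. The sketch below is a simplified calculation that assumes every phase lasts T_critical and ignores any latch overhead; the function name is hypothetical.

```python
from fractions import Fraction

def stage_throughput(phases: int, t_critical: Fraction) -> Fraction:
    """Results per unit time when a stage occupies `phases` * t_critical."""
    return Fraction(1) / (phases * t_critical)

t_crit = Fraction(1)                      # T_critical cancels in the ratio
with_hold = stage_throughput(3, t_crit)   # pre-set, evaluate, hold
without_hold = stage_throughput(2, t_crit)  # pre-set, evaluate

gain = without_hold / with_hold
print(f"throughput gain from dropping the hold phase: {float(gain):.2f}x")
```

Under these assumptions the scheme alone accounts for a 1.5x throughput gain relative to a three-phase pipeline with the same critical path.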
The WISP-4 nanoprocessor is designed with five stages (see Figure 16A) including 'Instruction Fetch' (IF), 'Instruction Decode' (ID), 'Register Files' (REG), 'Execute' (ALU) and 'Write Back' (WB). The enabled latches are used as registers which store and carry signals in-between stages.
The pipeline timing of these five stages is shown in Figure 16B. For the 'IF', 'REG' and 'WB' stages, CLK1 controls the pre-set phase and CLK2 controls the evaluate phase. For the 'ID' and 'ALU' stages, CLK2 controls the pre-set phase and CLK1 controls the evaluate phase. In each stage, CLK1 (CLK2) is used to control the n-type dynamic logic and its complement $\overline{\text{CLK1}}$ ($\overline{\text{CLK2}}$) is used to control the p-type dynamic logic. CLK1 and CLK2 are universal clock signals used for {PRE, EVA} control in each stage of the Skybridge 3D WISP-4 processor design (see Figure 16). CLK1 and CLK2 are generated from one uniform clock source (CLK) by using inverters placed within the circuit of each stage. In this way, the clock tree design of Skybridge 3D has much reduced complexity.
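The clock assignment described above can be sketched as a small schedule generator. This is an illustrative model rather than the authors' timing diagram: it assumes that CLK1 and CLK2 are asserted in alternating half-periods and that a stage is in a given phase whenever the clock controlling that phase is asserted. The names `PRESET_ON_CLK1`, `phase` and `schedule` are hypothetical.

```python
STAGES = ["IF", "ID", "REG", "ALU", "WB"]

# Stages whose pre-set phase is controlled by CLK1 (evaluate by CLK2).
# The remaining stages, ID and ALU, use the opposite assignment.
PRESET_ON_CLK1 = {"IF", "REG", "WB"}


def phase(stage, active_clock):
    """Phase of a stage while active_clock ('CLK1' or 'CLK2') is asserted."""
    preset_clock = "CLK1" if stage in PRESET_ON_CLK1 else "CLK2"
    return "pre-set" if active_clock == preset_clock else "evaluate"


def schedule(n_half_periods):
    """List of (active clock, {stage: phase}) for successive half-periods.

    Assumption: CLK1 and CLK2 are asserted alternately, starting with CLK1.
    """
    rows = []
    for t in range(n_half_periods):
        clock = "CLK1" if t % 2 == 0 else "CLK2"
        rows.append((clock, {s: phase(s, clock) for s in STAGES}))
    return rows


for clock, phases in schedule(4):
    print(clock, " ".join(f"{s}:{phases[s]}" for s in STAGES))
```

In this model, neighbouring stages are always in opposite phases: while one stage evaluates, the next one pre-sets, which is consistent with the enabled latches holding each stage's result for the following stage's evaluate phase.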
Combinational circuits and layouts based on the NP-Dynamic circuit style were designed to implement each stage. An HSPICE circuit definition of the entire WISP-4 was created, using interconnect RC information extracted from the layouts and GAA junctionless transistor models, to calculate power and performance. The area of WISP-4 was calculated using customized layout blocks with a proper floor plan. The detailed methodology of device characterization, circuit simulation and area calculation is discussed in Section IV.
B. NP-DYNAMIC SKYBRIDGE VERSUS CONVENTIONAL 3D ICS
Conventional 3D ICs use the same device and interconnect mindset as 2D CMOS, which results in only incremental interconnect RC savings over 2D designs. Their typical approaches stack two silicon dies for 3D implementation [4], [5] (our NP-Dynamic Skybridge also uses two silicon layers) and achieve benefits of less than 2x [8] in all aspects against 2D CMOS, while adding new challenges such as thermal management [41] and routing congestion [15]. By contrast, the NP-Dynamic Skybridge approach addresses each architectural aspect with a 3D mindset. It architects compact 3D interconnect from device to circuit to achieve significant interconnect savings, which leads to orders-of-magnitude benefits over 2D CMOS (see Table 1).
In this work, we chose the transistor-level monolithic 3D IC (T-MI) [15] for conventional 3D CMOS benchmarking since it is the state-of-the-art and more fine-grained than any Gate-level simulation for static 2D CMOS, dynamic 2D CMOS, T-MI and NP-Dynamic Skybridge were performed. We simulated a NAND2 gate with an FO4 load, which assumes that each standard cell in the design has an average fan-out of four inverters (INVs). The input signal slew was set to 15 ps. Power and delay were averaged over all switching scenarios. Power was measured assuming the same input signal frequency of 1 GHz. In the 2D dynamic and NP-Dynamic Skybridge 3D simulations, the pre-set and evaluate phases were both performed in each clock period. The design and RC extraction rules of 2D CMOS and T-MI follow the Nangate 15 nm technology [42]. Table 1 compares the interconnect design rules of 2D CMOS, T-MI [15] and NP-Dynamic Skybridge. The layout design and circuit simulation methodology of our Skybridge 3D approach is described in Section IV. In T-MI, the top-tier interconnects use the same pitch and width as 2D CMOS, and the bottom tier has two additional metal layers (M1-M2) for 3D cell implementation [15]. Our Skybridge uses a uniform horizontal metal pitch (92 nm) in each metal layer (similar to M1 in 2D CMOS) to enable high-density design. Additionally, the vertical routing (coaxial routing structure) has improved routing capacity and high routing density, which contributes to a compact 3D layout with intra-gate routing efficiency. The schematics of the gate-level benchmarking circuits are shown in Figure 17. For the 2D dynamic gate, the output node is connected to a static inverter to generate a monotonic signal [28] for the next dynamic gate; in NP-Dynamic Skybridge, the output node is directly connected to the input of the next dynamic gate, as there is no monotonicity issue (as discussed in Section III-C).
Consequently, the Skybridge 3D dynamic gate shows significantly reduced output load capacitance versus the 2D dynamic gate due to fewer loading transistors (see Table 2). Additionally, the compact 3D design achieves interconnect RC savings. These factors lead to both power and performance benefits in the NP-Dynamic Skybridge 3D gate compared with the dynamic gate in 2D CMOS. Due to the high switching activity of a dynamic circuit style, the 2D dynamic gate has higher power than the 2D static gate. Overall, the Skybridge 3D gate has slightly improved power efficiency (by ~10 percent) compared with the 2D static gate (see Table 2). The T-MI 3D gate shows benefits in both power (7 percent) and performance (3 percent) compared with static 2D CMOS (see Table 2). This can be attributed to the 3D splitting of the pull-up and pull-down networks in the cell design, which achieves intra-cell RC reduction [15].
For system-level evaluation, we benchmarked the WISP-4 nanoprocessor using T-MI, NP-Dynamic Skybridge and 2D CMOS at a uniform technology node of 16 nm and made comprehensive comparisons of key metrics. The design methodology in [15] was used to benchmark the T-MI 3D approach. First, we created the technology and design library for T-MI 3D based on a modified Nangate 15 nm PDK [15]. The T-MI technology and design library includes 3D cell LEF file [43] , 3D cell power and [45] . Next, the ASIC flow shown in [15] was used to cover all steps of T-MI benchmarking, from RTL synthesis, cell placement and routing, to system-level density, power and performance evaluation. We created the 2D CMOS based nanoprocessor using the Nangate 15 nm PDK [42] through a standard ASIC flow. Area, power and performance comparisons are shown in Figure 18. Power consumption and throughput were measured for worst-case scenarios. The power consumption of each technology was measured under a uniform clock frequency of 1 GHz. NP-Dynamic Skybridge's WISP-4 shows a 56.7x benefit in density, 3.8x improved power efficiency and 1.6x higher throughput compared with 2D CMOS. Compared with T-MI, the NP-Dynamic implementation shows 31x density, 3x power efficiency and 1.4x higher throughput. Benefits of this magnitude are not observed in the gate-level evaluation, which indicates that the system-level benefits mainly come from significant cell-to-cell interconnect savings in our Skybridge 3D designs. Further, a combined power, performance and area (PPA) metric is used to comprehensively evaluate the efficiency of each technology. PPA can be denoted by the following equation:
A higher PPA value indicates higher total efficiency of the technology. NP-Dynamic Skybridge shows 218.2x higher PPA compared with 2D CMOS, while the T-MI 3D approach shows a 2.5x PPA benefit versus 2D CMOS. There are other 3D IC approaches that achieve considerable benefits by shrinking the standard-cell footprint. In [48] and [49], Gate-All-Around transistors are used to build standard cells with reduced footprint. In the 3D gate approach [50], [51], conventional 2D transistors are employed while the pull-up and pull-down networks of each standard cell are vertically stacked layer-by-layer for cell footprint reduction (of around 50 percent). However, these approaches still rely on conventional via-to-metal routing structures, which add no improvement in routing capacity and lack sufficient routability to address high-density cell-to-cell routing demands [15]. This in turn diminishes their overall benefits. In Skybridge 3D, the vertical GAA transistors and coaxial routing structures are built using vertical nanowires for high-density 3D design with vertical 3D gates requiring an ultra-small footprint. Skybridge achieves significantly improved routing capacity and routability [13], addressing the high-density 3D gate routing needs more effectively. These factors lead to significant interconnect reduction against both conventional 3D approaches and 2D CMOS.
VI. BENEFITS IN LARGER-SCALE CIRCUITS
Due to strict design rules and complex routing, designs in nanoscale 2D CMOS require significantly more interconnect as the design scale increases, which results in severe power and performance degradation. NP-Dynamic Skybridge inherently supports good connectivity [9] and routability [13] for circuit design, which significantly reduces the number of buffers and interconnects for improved efficiency. Therefore, an NP-Dynamic Skybridge based design would avoid the impact of the interconnect bottleneck as the design scale goes up, indicating better scalability than 2D technology. To provide a preliminary indication (using very large circuits is not feasible until a new CAD flow is completed, which is beyond the scope of this paper), we use 4-bit and 8-bit Carry Look-ahead Adders (CLAs) for evaluation. Following the benchmarking methodology discussed in Sections III and VI, extensive simulations were done based on circuit netlists with extracted RC and device models. Circuit latency and power were measured in worst-case scenarios. Figure 19 shows the comparison of key metrics between NP-Dynamic Skybridge and 2D CMOS based CLAs. As the CLA design scales up from 4-bit to 8-bit, the 2D CMOS based 8-bit CLA has doubled latency, while the NP-Dynamic Skybridge based 8-bit CLA has only 24 percent 
VII. CONCLUSION
In this paper, we detailed a new vertical nanowire based 3D integrated circuit fabric called NP-Dynamic Skybridge. The fabric's core components and device-to-circuit evaluations were shown. By using both n- and p-type transistors, NP-Dynamic Skybridge offers a wide range of elementary logics for improved implementation efficiency. Additionally, a new pipelining scheme was proposed that simplifies the operation of pipeline stages and improves throughput. Benchmarking of a 4-bit nanoprocessor showed a 56.7x benefit in density, 3.8x improved power efficiency and 1.6x increased throughput versus 2D CMOS, and 31x density, 3x power efficiency and 1.4x throughput benefits compared to transistor-level monolithic CMOS 3D IC. We also investigated the impact of increasing bit-width on circuit metrics for CLAs, which showed that NP-Dynamic Skybridge was less affected than 2D and 3D CMOS, indicating its potential for scalability. Further work is ongoing, including CAD flow, CAD-based designs and experimental prototyping.
MINGYU LI received the BS degree in automation engineering from Shandong University, Jinan, China, and the MSECE degree from the University of Massachusetts Amherst, in 2012 and 2015, respectively. He is currently working toward the PhD degree in electrical and computer engineering at the University of Massachusetts Amherst. He is a research assistant in the Nanoscale Computing Fabrics lab, UMass. He has published his research in several peer-reviewed IEEE and ACM journals and conferences, where he also contributes as a reviewer. His research interests include 3D integration technology and post-CMOS computing fabrics.
MOSTAFIZUR RAHMAN joined the Computer Science and Electrical Engineering (CSEE) Department at the University of Missouri Kansas City after receiving the PhD degree in electrical and computer engineering from the University of Massachusetts Amherst. He leads the Nanoscale Integrated Circuits (Nano-IC) lab and is a co-lead for the Center for Interdisciplinary Nanoscale Research (CINTR) at CSEE. His group's research focuses on transformative approaches for nanoelectronics to surpass the limitations of today's integrated circuits. He is currently serving as publication chair for NANOARCH and as guest editor for a special issue of the IEEE Transactions on Nanotechnology. He is also a program committee member for the NANOARCH and VLSIDESIGN conferences. In addition, he serves as a reviewer for TNANO, JETC, JPDC, NANO-ARCH and other publications.
SANTOSH KHASANVIS received the PhD degree in computer engineering from the University of Massachusetts Amherst, in 2015. He is a senior research scientist with BlueRISC Inc. This work was performed during his PhD studies. His research interests include unconventional computing architectures, nanoscale computing, and cyber-security. He has been a member of the IEEE since 2015.
CSABA ANDRAS MORITZ received the PhD degree in computer systems from the Royal Institute of Technology, Stockholm, Sweden, in 1998. From 1997 to 2000, he was a research scientist with the Laboratory for Computer Science, Massachusetts Institute of Technology (MIT), Cambridge. He has consulted for several technology companies in Scandinavia and held industrial positions ranging from CEO to CTO and founder. His most recent company, BlueRISC Inc, develops security microprocessors and hardware-assisted security solutions for anti-tamper, cyber defense and system assurance. He is currently a professor with the Department of Electrical and Computer Engineering, University of Massachusetts Amherst. His current research interests include nanoelectronics and nanoscale systems, computer architecture, and security.
