Maintaining benefits of CMOS technology scaling is becoming challenging, primarily due to increased manufacturing complexities and unwanted passive power dissipations. This is particularly challenging in SRAM, where manufacturing precision and leakage power control are critical issues. To alleviate these challenges, we proposed a novel volatile memory alternative to SRAM called nanowire volatile RAM (NWRAM). Due to NWRAM's regular grid-based layout and innovative circuit style, manufacturing complexities are reduced and, at the same time, considerable benefits are attained in terms of performance and leakage power reduction. In this article we elaborate NWRAM's circuit aspects and manufacturability, and quantify benefits at 16nm technology node through simulation against state-of-the-art 6T-SRAM and gridded 8T-SRAM designs. Our results show that when lower bounds in design rules are considered, 10T-NWRAM's read and write time are 1.38x and 2x faster, and the leakage power is 14x better in comparison to high-performance 6T-SRAM. Similarly the 10T-NWRAM achieves 1.3x and 1.9x read and write performance, and 35x leakage power improvements compared to high-performance 8T-SRAM. 10T-NWRAM's density is comparable to 6T-SRAM and 8T-SRAM for lower bounds, but exhibits higher active power in similar comparisons. This article details all benchmarking results and provides thorough analysis of NWRAM's evaluations.
INTRODUCTION
The continuous push for denser, faster, and more power-efficient computing is driving CMOS scaling to its limit. Numerous new challenges are emerging related to power consumption, interconnection, circuit noise, manufacturability, and cost. Some of these challenges are especially critical for CMOS SRAM circuits, where both PMOS and NMOS transistors need to be precisely sized and doped for memory operation and for sufficient noise margin. Due to the complex and compact layout of SRAM circuits, it is becoming difficult to maintain such precision at nanoscale. Moreover, controlling leakage power during the period of inactivity is proving to be a bigger challenge due to ever-increasing leakage current in ultra-scaled transistors.
To overcome these issues in SRAM, we proposed a novel nanowire-based volatile RAM (NWRAM) [Rahman et al. 2011]. Salient features of NWRAM are:
- manufacturing simplification through a very regular grid-based layout, single-type and uniformly sized transistors, and fewer metal layers;
- volatile memory implementation through a dynamic circuit style that conforms to the underlying physical framework;
- noise mitigation by using a non-overlapping clocking control scheme and synchronous data input for memory operations, while requirements for precise sizing or complementary doping of transistors are eliminated; and
- leakage power control by incorporating stacked transistor-based design in the core memory circuit, and by including the restore operation as a fundamental memory operation to allow state preservation during periods of inactivity and periodic state restoration (the restore operation is unique to NWRAM, since it requires neither "read-out" nor "write-back" for restoration, but simply turning ON the control clock signals).
In terms of design, operation, and physical implementation, NWRAM is fundamentally different from its SRAM and DRAM counterparts. In this article, we discuss NWRAM circuit and manufacturing aspects in detail, and contrast it with other kinds of memory cells. We show detailed methodology and benchmarking of a 10-transistor NWRAM (10T-NWRAM) against state-of-the-art 6-transistor-based SRAM (6T-SRAM) and 8-transistor-based gridded SRAM (8T-SRAM) designs at the 16nm technology node. The benchmarking results indicate that considerable benefits can be attained with 10T-NWRAM at significantly lower manufacturing complexity. Our results show that 10T-NWRAM is 1.38x and 2x faster in terms of read and write time, and consumes 14x less leakage power with respect to high-performance 6T-SRAM. The results also indicate 10T-NWRAM performs 1.3x and 1.9x faster for read and write time, and consumes 35x less leakage power in comparison to high-performance 8T-SRAM. Low-power SRAM designs, however, show better leakage power results, at significantly degraded performance. The density of 10T-NWRAM is comparable to that of 6T-SRAM and better than 8T-SRAM for lower bounds in design rules, whereas for the upper bound in design rules, which considers pessimistic assumptions for NWRAM, the density of SRAMs is better than NWRAM. The benchmarking results also indicate slightly higher active power consumption for NWRAM. In this article, we provide extensive details of benchmarking results.
This article is organized as follows: Section 2 presents the underlying nanowire-based physical fabric for 10T-NWRAM, Section 3 discusses 10T-NWRAM circuit and layout details, Section 4 shows benchmarking methodology and results, Section 5 details the stability of 10T-NWRAM, and Section 6 concludes.
UNDERLYING PHYSICAL FABRIC: N³ASIC
N³ASIC [Panchapakeshan et al. 2011a] is a physical fabric where nanoscale devices and interconnects are integrated in a novel manner that simplifies manufacturability, while at the same time allowing efficient logic and memory implementations. In this fabric, manufacturing complexities are reduced through the use of a regular grid-based layout and a novel circuit style that uses uniform transistors without complementary doping or sizing variations. A combination of low-cost unconventional manufacturing (e.g., nano-imprint, block co-polymer-based self-assembly) and conventional lithography approaches (following lithographic design rules) is envisioned for large-scale manufacturing. In our prior work on N³ASIC's manufacturability [Panchapakeshan et al. 2011b; Rahman et al. 2013], we have shown relaxed lithographic requirements for N³ASIC assembly (3σ = ±8nm for N³ASIC versus 3σ = ±3.3nm for CMOS), shown a step-by-step scalable manufacturing pathway, and experimentally demonstrated a logic stage of the N³ASIC fabric.
As shown in Figure 1, N³ASIC building blocks are: arrays of patterned semiconductor nanowires, cross-nanowire FETs (xnwFETs), orthogonal metal gate inputs, control and power rails, standard vias, and the 3D metal stack. All logic and memory functionalities are achieved using semiconductor nanowires. Depending on the logic function, xnwFETs are placed on certain cross-points between the input metal gate and bottom nanowire, vias carry output of each logic stage, and the input/output signals are routed through the 3D metal stack. The xnwFET device structure and TCAD Sentaurus simulated device characteristics are shown in Figure 2.
Two channels are assumed for each device, leading to dual-channel xnwFETs (2C-xnwFET); the precise dimensions and spacing of nanowires were chosen based on demonstrated experimental results [Martensson et al. 2004; Wang et al. 2008; Heath 2008]. Dual-channel xnwFETs were found to have superior characteristics compared to single-channel xnwFETs [Panchapakeshan et al. 2011a]. Furthermore, the use of bundled nanowires for logic and memory implementation does not add to density, performance, or power overhead. On the contrary, the logic implementation of N³ASIC using dual-channel xnwFETs was found to be 3x denser and 5x more power efficient at comparable performance [Panchapakeshan et al. 2011a]. N³ASIC logic circuits are designed using a dynamic circuit style [Panchapakeshan et al. 2011a] that is amenable to implementation in regular nanowire arrays. All the devices are of the same type and uniform size; active devices are all n-type xnwFETs. Our 10T-NWRAM implementation uses this N³ASIC framework and follows a similar circuit style.
10T-NWRAM
The core of a 10T-NWRAM circuit consists of two cross-coupled dynamic NAND gates that store true and complementary data values on their outputs. In order to read out the stored value, a separate read path is used. A set consisting of select clock inputs (W 0 pre 0 , W 0 pre 1 , W 0 eva 0 , W 0 eva 1 ), synchronous data input (bit 0 ), and a read signal (read 0 ) is used for memory operations (write, read, and restore).
A 10T-NWRAM circuit schematic is shown in Figure 3(a). The non-overlapping clock control signals (W 0 pre 0 , W 0 pre 1 , W 0 eva 0 , W 0 eva 1 ) serve as word select and are used to write data input (bit 0 ) in the form of true (out) and complementary values (nout) at the first row (W 0 ) and first column (bit 0 ) position of the memory array (Figure 3(b)). The read signal (read 0 ) is used to read memory output (bit 0 , bit 1 , ..., bit n ) from the first row. Figure 3(b) shows 10T-NWRAM array organization. In the following we discuss the basic memory operations in a single 10T-NWRAM cell.
Memory Operations
Write operation in 10T-NWRAM is performed by synchronizing the bit 0 signal with either (W 0 pre 0 , W 0 eva 0 ) or (W 0 pre 1 , W 0 eva 1 ) clock pair signals. For example, to write "1" (i.e., out = "1", nout = "0") in the memory cell, bit 0 is kept low during the precharge (W 0 pre 0 ) and evaluate (W 0 eva 0 ) phases of the corresponding NAND gate; as a result, out retains its precharged value "1". During subsequent precharge (W 0 pre 1 ) and evaluate (W 0 eva 1 ) phases of the second NAND gate, bit 0 is kept high, and as a result nout becomes "0", storing the complement of out. In order to write "0" to out, the opposite sequence is followed by pulling up nout to "1" first. The bit 0 signal is normally "1", and is pulled to "0" in synchronization with clock inputs (W 0 pre 0 , W 0 eva 0 , or W 0 pre 1 , W 0 eva 1 ) for writing a value in out or nout. An HSPICE simulation result validating this behavior is shown in Figure 3(d). Once the out and nout values are set, they are retained in subsequent clock cycles due to the self-restoring nature of this cross-coupled circuit.
Read operation is achieved by gating the nout signal in a 2-input dynamic NAND gate. The signal bit 0 is used to carry the read output, since it is shared across multiple cells in a column of the NWRAM array (Figure 3(b) ). In order to perform read operation, bit 0 is initially precharged to "1", and then the read 0 signal is turned ON; depending on the value stored in nout, bit 0 is either pulled to Vss or kept high, performing the readout of a stored bit. The read operation is illustrated in Figure 3 (e) through simulated waveforms. Figure 3 (e) shows both read "0" and read "1" operations. When the stored bit is "0" (i.e., out = "0", nout = "1") and read 0 signal is high, the bit 0 signal is pulled to Vss. On the other hand, when the stored bit is "1" (i.e., out = "1", nout = "0") and read 0 signal is high, bit 0 remains high at Vdd.
10T-NWRAM exploits the memory cache usage pattern that, at a certain time, memory activity is centered only on a small portion of the memory block. During memory cell inactivity, all control signals are kept at "0", allowing the cell to be in state-preserving mode; stored values are retained in output capacitance, and in parasitic capacitances of the interconnect and adjacent transistors. However, due to OFF-state leakage current in xnwFETs, the stored charge degrades over time. To restore the charge, the control signals (W 0 pre 0 , W 0 pre 1 , W 0 eva 0 , W 0 eva 1 ) are turned ON sequentially after a predefined period of time. The restore operation is shown in Figure 3 (f) through HSPICE simulations. During the restore operation, depending on the prior stored values in out and nout, either the T6 or T2 transistor (Figure 3(a) ) is turned OFF, resulting in restoration of original values in out and nout. The bit 0 signal is kept at its default state ("1"), which keeps the T3 and T7 (Figure 3(a) ) transistors open during the restoration cycle.
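The write, read, and restore sequences described above can be traced with a small Boolean model. This is only a behavioural sketch, not a circuit simulation: it assumes each dynamic NAND discharges its output when both bit 0 and the opposite storage node are high during evaluate, and that a restore pulses the clocks in the order pre 0, eva 0, pre 1, eva 1. The class and method names are hypothetical.

```python
class NWRAMCell:
    """Boolean sketch of one 10T-NWRAM cell (illustrative names, no timing or leakage)."""

    def __init__(self):
        # Storage nodes: out holds the bit, nout holds its complement.
        self.out = 1
        self.nout = 0

    def _precharge_evaluate(self, gate, bit):
        """Run one precharge phase followed by one evaluate phase.

        gate 0 drives out (clocks pre 0 / eva 0),
        gate 1 drives nout (clocks pre 1 / eva 1).
        """
        if gate == 0:
            self.out = 1                     # precharge pulls the node high
            if bit == 1 and self.nout == 1:  # evaluate: pull-down stack conducts
                self.out = 0
        else:
            self.nout = 1
            if bit == 1 and self.out == 1:
                self.nout = 0

    def write(self, value):
        if value == 1:
            self._precharge_evaluate(0, bit=0)  # out keeps its precharged "1"
            self._precharge_evaluate(1, bit=1)  # nout discharges to "0"
        else:
            self._precharge_evaluate(1, bit=0)  # nout keeps its precharged "1"
            self._precharge_evaluate(0, bit=1)  # out discharges to "0"

    def restore(self):
        # bit 0 stays at its default "1"; only the clock signals are pulsed.
        self._precharge_evaluate(0, bit=1)
        self._precharge_evaluate(1, bit=1)

    def read(self):
        bit_line = 1        # bit 0 is precharged to "1"
        if self.nout == 1:  # read 0 is ON and nout gates the pull-down path
            bit_line = 0
        return bit_line
```

In this model a restore leaves either stored state unchanged without any read-out or write-back, which mirrors the behaviour described for Figure 3(f).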
Since the leakage of a stored charge is dependent on OFF resistance of xnwFETs, the retention time is very long (i.e., 8000 cycles, each cycle = 50ps, out node capacitance = 38aF). The output capacitance in storage nodes can be engineered to have even higher retention time, as shown in Figure 4 . In our previous work [Narayanan et al. 2012] , we have shown capacitance engineering techniques for nanowire fabrics. The restore operation for NWRAM is fundamentally different from 1-T DRAM. In DRAM's refresh operation, the stored value has to be "read out" and "written back" periodically to retain the original value within the retention time [Keeth et al. 2008; Vounckx et al. 2006 ], and thus consumes twice the active power. By contrast, there is no need for read-out and write-back in NWRAM.
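As a quick check on the retention figure quoted above, the cycle count can be converted to wall-clock time. The sketch below uses only the 8000-cycle and 50ps values from the text; the function name is hypothetical, and the restore rate it derives is a lower limit implied by the retention time, not a design value from this work.

```python
def retention_time_s(cycles, cycle_period_s):
    """Retention time in seconds for a given number of clock cycles."""
    return cycles * cycle_period_s


# Values quoted in the text: 8000 cycles at 50 ps per cycle.
t_ret = retention_time_s(8000, 50e-12)

# Each cell must be restored at least once per retention period.
min_restore_rate_hz = 1.0 / t_ret

print(f"retention time: {t_ret * 1e9:.0f} ns")
print(f"minimum restore rate: {min_restore_rate_hz / 1e6:.1f} MHz")
```

With these inputs the retention time works out to 400 ns, so a restore is needed no more often than once every 8000 cycles.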
10T-NWRAM Physical Layout
10T-NWRAM's physical layout follows a regular grid-based layout which conforms to the N³ASIC fabric. As shown in Figure 3(c), all transistors and via placements are on pre-patterned nanowires. Transistor and via shapes are uniform and adhere to underlying nanowire dimensions. Only n-type xnwFETs are used as active devices. Intra-cell routing in 10T-NWRAM is limited to only two layers of metal interconnects (M1 and M2).
This physical representation is in stark contrast to traditional 6T-SRAM, where complementary devices are sized proportionally and organized in a complex layout. At sub-20nm scale, maintaining precision in device sizing for an SRAM layout is a big challenge, since the zigzag shapes (Figure 5(b)) in the SRAM layout require extensive optical proximity corrections and are prone to localized process variations. The number of mask steps required for NWRAM is also reduced. There is no need for p-type device masking; moreover, SRAM-like mask requirements for Metal3 and Metal4 contacts are eliminated as well. Furthermore, the mask overlay requirements for NWRAM-like regular grid-based circuits are less stringent [Panchapakeshan et al. 2011b] in comparison to CMOS-based designs.
BENCHMARKING
In order to quantify the benefits of our 10T-NWRAM design over state-of-the-art SRAM designs, we have done extensive benchmarking. Layout analysis and HSPICE simulations were carried out to compare 10T-NWRAM against high-performance (HP) 6T-SRAM, low-power (LP) 6T-SRAM, high-performance gridded 8T-SRAM, and low-power gridded 8T-SRAM designs. Figure 5 shows the layout of different memory cells used in this work. As shown in Figure 5(a), 10T-NWRAM follows a regular grid-based layout with semiconducting nanowires, uniformly sized transistors, and metal interconnects; therefore, 16nm 1D gridded design rules (Table I) from Bencher et al. [2009] and Burn [2010] were used to calculate 10T-NWRAM cell area and interconnect parasitics. 2C-xnwFET devices used for our 10T-NWRAM design were modeled and simulated using the TCAD Synopsys Sentaurus 3D device simulator [Panchapakeshan et al. 2011a]. To model carrier transport in these devices, a hydrodynamic charge transport model with quantum confinement corrections was used. Major xnwFET device characteristics are highlighted in Table II. Table II also shows a comparison of key device metrics with respect to PTM HP and LP device models during nominal conditions.
In order to do HSPICE circuit simulation of a 10T-NWRAM, an HSPICE-compatible device model was developed from TCAD simulations using the methodology described in Narayanan et al. [2012] . Cell interconnect length and width were derived from cell layout ( Figure 5(a) ), and a PTM interconnect model [PTM 2012] was used for interconnect RC calculations.
To scale the 6T-SRAM to 16nm technology node, we have collected published data about cell area and design rules from industry [Jan et al. 2005; Bai et al. 2004; Mistry et al. 2007; Natarjan et al. 2008; Greene et al. 2009; Chen et al. 2008; Narasimha et al. 2006; Steegen et al. 2005; Leodandung et al. 2005; Cheng et al. 2007; Diaz et al. 2008] for both high-performance and low-power 6T-SRAM designs at 65nm, 45nm, and 32nm technology nodes. From this data, various scaling factors were derived based on cell area, poly, Metal1, Metal2 and via scaling trends. These were used to calculate 6T-SRAM cell areas and corresponding 16nm design rules, as shown in Tables III and IV. Interconnect lengths in 6T-SRAM cells were calculated from the layout (Figure 5(b)) using cell area (Table III) and corresponding design rules (Table IV). Pass transistors and pull-down transistors were considered to be 1.4x and 1.7x larger compared to pull-up transistors; PTM 16nm high-performance and low-power devices as well as interconnect models [PTM 2012] were used to simulate 6T-SRAM cell characteristics in HSPICE. Similar simulations were carried out for manufacturing-friendly gridded 8T-SRAM cells (Figure 5(c)). 1D gridded design rules, as shown in Table I, and 16nm PTM device and interconnect models [PTM 2012] were used for simulations.
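One way such a trend-based extrapolation could be carried out is sketched below. The power-law form (area proportional to node raised to an exponent), the use of the smallest and largest observed exponents as bounds, and the function names are all assumptions made for illustration; the exact fitting procedure behind Tables III and IV is not described here. The input values in the usage lines are synthetic numbers that follow ideal quadratic scaling, not published cell areas.

```python
import math


def scaling_exponents(areas_by_node):
    """Exponent k of area ~ node**k between each pair of consecutive nodes."""
    nodes = sorted(areas_by_node, reverse=True)
    return [
        math.log(areas_by_node[a] / areas_by_node[b]) / math.log(a / b)
        for a, b in zip(nodes, nodes[1:])
    ]


def extrapolate_area(areas_by_node, target_node):
    """Return (lower, upper) area estimates at target_node.

    Each observed exponent is applied from the smallest known node,
    and the spread of the results is reported as the bounds.
    """
    smallest = min(areas_by_node)
    ratio = target_node / smallest
    candidates = [
        areas_by_node[smallest] * ratio ** k
        for k in scaling_exponents(areas_by_node)
    ]
    return min(candidates), max(candidates)


# Synthetic inputs (NOT published data): area = 1e-4 * node**2, arbitrary units.
synthetic = {65: 0.4225, 45: 0.2025, 32: 0.1024}
low, high = extrapolate_area(synthetic, 16)
```

When the historical trend is not perfectly uniform, the two bounds separate, which is one plausible source of the upper- and lower-bound ranges used in the comparisons.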
Benchmarking Results
Results from scaled memory cell area calculations are shown in Figure 6. In this figure, upper bound (colored black) and lower bound (colored yellow) correspond to upper and lower bounds in cell area due to the considered range (Table I) of design rules and scaling factors (Table III). Figure 6 shows that the lower bound of 10T-NWRAM cell area is comparable to, and in some cases better than, that of scaled 6T-SRAM cells, whereas the area comparisons between 8T-SRAM and 10T-NWRAM cells show similar results.
The upper bound in 10T-NWRAM shows a larger area estimation for a single cell in comparison to the upper bounds of SRAMs. This is mainly due to pessimistic assumptions in design rules during area calculations. The design rules for a customized 10T-NWRAM cell are expected to be close to those of the lower bound in Table I , since 10T-NWRAM uses a regular layout, uniformly sized transistors, and only two metal layers of interconnects.
The 10T-NWRAM write operation is significantly faster compared to that of SRAMs. Figure 7(a) shows 10T-NWRAM write time to be almost 2x faster in comparison to HP 6T-SRAM and HP 8T-SRAM, and more than 4.5x faster when compared to LP 6T-SRAM and LP 8T-SRAM. This is primarily due to the fast dynamic NAND logic style, high-performance 2C-xnwFETs, and less load capacitance in the storage node. The load capacitance during bit transition is less in 10T-NWRAM because the bit transitions in true and complementary nodes take place during non-overlapping clock cycles, whereas in SRAM both true and complementary bit values are flipped simultaneously. Figure 7(b) shows the benchmarking results for read time. When lower bounds in design rules are considered, 10T-NWRAM's read time is 1.38x, 3.18x, 1.31x, and 3.14x faster, and for the upper-bound case, 10T-NWRAM's read time is 1.88x, 2.77x, 1.08x, and 2.5x faster compared to HP 6T-SRAM, LP 6T-SRAM, HP 8T-SRAM, and LP 8T-SRAM, respectively. The faster read operation in 10T-NWRAM is due to the faster switching speed of xnwFETs and the read logic scheme with data gating mechanism.
Active power consumption results are shown in Figure 8. The SRAM power consumption results are similar for all designs; the slightly higher active power consumption for LP SRAM designs is due to the higher Vdd used. In comparison to SRAMs, 10T-NWRAM's active power per cell during the read operation is higher (∼2x), primarily due to higher performance and longer bit 0 length in the 10T-NWRAM cell. During the read operation, bit 0 either remains charged or gets discharged depending on the stored value; the calculated read power consumption is the power consumed during bit 0 discharge. The physical layout of the 10T-NWRAM is elongated in one direction: as shown in Figure 5(a), within the cell bit 0 propagation is horizontal through Metal2, whereas in SRAMs (Figures 5(b) and (c)) bit 0 or ∼bit 0 propagation is vertical through Metal1 and the lengths are shorter by almost half compared to 10T-NWRAM bit 0 length. Therefore, the RC effect of longer bit 0 length results in higher active power consumption for the 10T-NWRAM cell.
Through memory array optimization (i.e., more words, fewer bits in a block) the active power consumption of a 10T-NWRAM block can be made similar to that of an SRAM block. Layout optimizations to reduce bit 0 length can reduce active power consumption further. In addition, a power-performance trade-off can be made for specific designs.
The leakage power for the 10T-NWRAM cell is significantly less compared to high-performance SRAM designs, as shown in Figure 9. The reduction in leakage power is primarily due to the stacked transistor design and restoration capability of 10T-NWRAM. During periods of inactivity, all control signals are switched off to save power. However, due to charge leakage, the storage nodes need to be periodically restored; this is done by turning ON word select control clock signals. Therefore the net leakage power consists of both inactive power and restoration power. Retention time of storage determines how frequently the memory needs to be restored, which in turn depends on cell design and the targeted application. For this work, we have assumed 38aF capacitance in the output nodes, which is equivalent to total parasitic and contact capacitance; the data retention characteristics with varying output capacitance are shown in Figure 4. Figure 9 presents average leakage power results without considering restoration power (Figure 9(a)) and with restoration power (Figure 9(b)). As evident from Figure 9, the cost of restoration power in 10T-NWRAM gets amortized in average power results due to ultra-low power consumption during inactive periods. The lower-bound data in Figure 9(b) shows 10T-NWRAM to be 30x and 11x better in terms of average leakage power when compared to HP 8T-SRAM and 6T-SRAM designs, respectively, for both upper- and lower-bound cases. Low-power SRAM designs consume less leakage power due to the LP transistors used in these designs, which have very low OFF current (∼10^-12 A).
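The amortization of restoration cost can be made concrete with a simple time-weighted average. The formula below is an illustrative accounting model assumed for this sketch, not the exact procedure used to produce Figure 9, and all parameter names and the input values in the usage lines are hypothetical placeholders.

```python
def average_leakage_power_w(p_inactive_w, e_restore_j, t_restore_s, t_interval_s):
    """Average power over one restore interval.

    The cell idles at p_inactive_w for (t_interval_s - t_restore_s) and
    spends e_restore_j joules on a single restore operation.
    """
    if t_restore_s > t_interval_s:
        raise ValueError("restore cannot take longer than the restore interval")
    idle_energy_j = p_inactive_w * (t_interval_s - t_restore_s)
    return (idle_energy_j + e_restore_j) / t_interval_s


# Placeholder inputs chosen only to show the trend (not measured values).
p_short = average_leakage_power_w(1e-9, 1e-17, 200e-12, 100e-9)
p_long = average_leakage_power_w(1e-9, 1e-17, 200e-12, 400e-9)
```

Because the restore energy is fixed per operation, lengthening the interval between restores lowers the average, which is why a longer retention time translates into lower net leakage power.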
The leakage power of a 10T-NWRAM cell can be further improved by optimizing the circuit to charge intermediate nodes. A different placement of control transistors (T4 and T8 before T3 and T7, respectively, in Figure 3(a)) and slight modification of the control scheme (turning W 0 eva 0 or W 0 eva 1 ON when bit 0 is "0") can allow nodes adjacent to out or nout to be charged as well during a write operation, which in turn can slow down the leakage of charge from the storage node since both the source and drain of the T1 or T5 transistor (Figure 3(a)) will have the same potential. Additionally, the output capacitance in storage nodes can be engineered to have even higher retention time, and hence can pave the way for lower leakage power by trading off with performance.
The benefits of high-density, high-performance, and low-leakage 10T-NWRAM are also accomplished at a lower manufacturing complexity. In our previous work [Panchapakeshan et al. 2011a], we have shown that N³ASIC-based designs would have significantly less stringent overlay requirements (3σ = ±8nm) for maximum yield in comparison to CMOS-based designs (3σ = ±3.3nm).
STABILITY ANALYSIS
Cell stability is critical for any memory circuit. Unstable bit-cells or low noise margins in write/read operations lead to large overhead for error correction or redundancy. Degrading cell stability with scaling is a severe concern for CMOS SRAM cells due to an increase in local process variation. An SRAM circuit requires precise sizing of complementary transistors for ensuring stability during read and write operations. Mismatch in desired transistor strength due to variations induced by random dopant fluctuations, line edge roughness, etc., can cause erroneous circuit behavior. The 10T-NWRAM circuit presented in this article does not depend on matched device strengths for write/read/restore operations and, as a result, the stability concerns associated with SRAM are not relevant for 10T-NWRAM.
However, since the 10T-NWRAM's memory access operations use non-overlapping clock phases, stability issues may arise from perturbations in clock rise and fall time, or in clock period. These perturbations can result from power supply noise and from local switching activities. In this section, we analyze the impact of clocking imperfection or clock jitter on NWRAM's core operations. The clock jitter can only affect write and restore operations, since the read operation does not depend on clock signals.
Write Stability
The write operation in 10T-NWRAM is performed by sequential precharge and evaluation of nodes out and nout. The value to be written is determined by bl, which is kept low during xeva to write "1" in out. Clock jitter can cause an overlap between the falling xeva and rising bl signals, which in turn can result in incorrect writing of "0" in node out. This effect is illustrated in Figure 10 . Our simulations indicate that overlap of 2ps or more between evaluate clock and bl can cause write errors.
Restore Stability
Similar to the write operation, the restore operation can also be affected by clock jitter. Skew in the falling edge of evaluate signals can cause bit-flips. An example is shown in Figure 11, where node nout was incorrectly pulled to "1" during the restore operation due to jitter in the xeva signal. Our simulations indicate an overlap of 2ps or more can cause bit-flip.
Fig. 11. Effect of clock jitter on restore operation. Delay in falling edge of xeva signal causes bit-flip during restore.
These stability issues due to clock jitter can be addressed by designing for increased skew between clock edges. By separating the falling and rising edges between precharge and evaluate signals, as well as between evaluate and bitline signals, these problems can be avoided.
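The edge-separation requirement can be expressed as a simple timing check. The sketch below takes the 2ps overlap limit from the simulations reported above; the worst-case treatment of jitter (falling edge delayed and rising edge advanced by the same amount) and the function names are assumptions made for illustration.

```python
def overlap_ps(falling_edge_ps, rising_edge_ps):
    """Time during which both signals are high; zero if the edges do not overlap."""
    return max(0.0, falling_edge_ps - rising_edge_ps)


def edges_are_safe(falling_edge_ps, rising_edge_ps, jitter_ps=0.0, limit_ps=2.0):
    """True if the worst-case overlap stays below limit_ps.

    Worst case assumed here: the falling edge (e.g. xeva) arrives late and
    the rising edge (e.g. bl) arrives early, each by jitter_ps.
    """
    worst = overlap_ps(falling_edge_ps + jitter_ps, rising_edge_ps - jitter_ps)
    return worst < limit_ps


# Nominal 10 ps separation tolerates 3 ps of jitter on each edge.
ok = edges_are_safe(100.0, 110.0, jitter_ps=3.0)
# Nominal 2 ps separation with 2 ps of jitter reaches the 2 ps overlap limit.
bad = edges_are_safe(100.0, 102.0, jitter_ps=2.0)
```

A check of this kind could be applied to each precharge/evaluate and evaluate/bitline edge pair when choosing the designed-in skew.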
CONCLUSIONS
In this article, NWRAM was proposed as a volatile memory alternative to SRAM. Comprehensive discussions were provided on all aspects of this memory, including memory operations, physical layout, manufacturability, benchmarking, and stability. In 10T-NWRAM, manufacturing complexities were significantly reduced due to the regular grid-based layout, uniform devices, and fewer metal interconnect layers. In the best case, the benchmarking results showed 10T-NWRAM is 2x faster and consumes 35x less leakage power when compared to gridded HP 8T-SRAM. Significant benefits were also attained in comparison to HP and LP 6T-SRAM cell designs.
