Published by the IEEE Computer Society 0018-9162/11/$26.00 © 2011 IEEE
Optimizing floating-gate devices for retention time trades off both endurance (number of cycles) and operating voltage. Using band-gap-engineered gate stacks, multiple metal floating gates, thinner oxides, and tunneling as the main charge-transfer mechanism can achieve lower operating voltages and, in principle, reduce stresses and increase endurance.
We propose a scalable dual-metal floating-gate device that can be used as the basis for a unified memory: a single device with merged volatile and nonvolatile operation. This device has the potential to enable instant-on computing, solve energy-proportional computing problems, and improve resiliency. Engineered to operate at modest voltages compared to most floating-gate devices, the device is scalable, in bulk form, to at least the 16-nm node and could potentially be packed as an 8F² cell. We accomplish this with a technology set that is more mature and potentially more uniform in operation than resistive memories, the main competing technology. The unified memory can be monolithically 3D stacked using deposited layers of indium-gallium-zinc-oxide (IGZO) amorphous semiconductors. IGZO-based transistors can achieve better performance than amorphous silicon transistors and offer sufficient performance to enable a nonvolatile memory device [7, 8]. A stack of four devices has the potential to realize densities equivalent to an 8-nm node. The proposed device could be further improved by using metal nanocrystals to increase endurance at the expense of reduced charge transfer.
Metal-oxide-semiconductor field-effect transistors (MOSFETs) employing programmable floating gates have enjoyed enormous success as the basis for scalable nonvolatile memory technology. A traditional floating-gate MOSFET is a device in which an additional gate is embedded in the transistor's dielectric and completely isolated from the top gate (control gate) and the channel. This so-called floating gate can be used as a storage node that can be charged or discharged, thereby enhancing or shielding the channel from the voltage applied on the control gate, which modifies the device's threshold voltage. Typically these devices are designed such that the charge on the floating gate, and thus the device's state, remains even if the system power is off, making them useful nonvolatile memory components.
Despite the promise of floating-gate MOSFETs, however, their use has been limited to storage memories [1-6]. One reason for this is that such devices have been optimized primarily to maximize retention time (10 years).
The authors report on the design, operation, and architectural implications of single and double floating-gate devices for nontraditional applications enabling low-power FPGAs and analog-to-digital converters, and propose a unified nonvolatile/volatile memory device.
Figure 1 shows the proposed unified memory device designed for a 16-nm node. We use a back gate to create an optional select line; this function could also be provided using a well contact, similar to flash devices. Developers can build the back gate using 3D techniques or a silicon-on-insulator (SOI) process with backside alignment. In this structure, both floating gates can be programmed via a tunneling current. For dynamic programming, relatively low voltages applied on the control gate transfer free charges between the two floating gates to program or refresh the dual floating-gate device.
Charge redistribution occurs rapidly via a thin high-k oxide, resulting in one positively and one negatively charged floating gate, thus changing the device's threshold voltage. This is considered the volatile or dynamic storage state, as subsequent charge leakage is relatively fast, requiring a refresh. The device utilizes direct tunneling under low electric fields (less than 0.2 V/nm) in this mode, in which endurance and reliability factor into design criteria. The electrons are not exposed to scattering events with the lattice atoms in the oxide [9, 10]. This produces less damage to the oxide than Fowler-Nordheim tunneling and hot carrier injection [2, 9, 10], especially in low-field stress.
Both floating gates can also be programmed more slowly through Fowler-Nordheim tunneling of electrons from the channel by applying a relatively high programming voltage on the control gate, similar to conventional floating-gate devices; this is the nonvolatile state. Thus, the two memory operations are clearly distinguished by the applied voltage envelope, and the device can be switched rapidly between the two modes of operation. Furthermore, it can operate simultaneously as a volatile and nonvolatile device, permitting checkpointing. The device's state is determined by its threshold voltage, which is detected through an appropriate external circuit.
The unified memory device is designed to provide an optimal tradeoff between programming characteristics and charge retention, especially in the volatile state. Some of these aspects can be achieved by selecting appropriate materials and thicknesses to band-gap engineer the device. Band-gap engineering permits a decrease in the barrier's actual height for fast programming, and using high-k dielectrics allows an increase in its actual thickness to maximize retention time. For example, for dynamic memory operation, a low work-function metal for the bottom floating gate FG_BOT, such as tantalum nitride, titanium nitride, aluminum, tantalum, or magnesium, enables rapid charge transfer through the low barrier to the top floating gate. FG_TOP uses a high work-function metal such as palladium, gold, or platinum to reduce charge leakage through the higher barrier and improve charge retention, enabling a longer time between refresh periods.
Figure 2a shows a cross-section of a dual floating-gate test device. To simplify fabrication of this first-generation device, we varied the materials and layer thicknesses from the above-specified targets. We measured capacitance-voltage characteristics as the control-gate voltage swept from negative to positive and then back to negative. Figure 2b shows the resulting shift in the flat-band voltage. The negative shift in the flat-band voltage for low sweep voltages, followed by more substantial positive shifts for larger sweep voltages, is consistent with what would be expected for this device.
We are exploring variations of this dual-layer floating-gate structure as replacements for single-layer nanocrystal floating-gate (NCFG) devices. Isolated storage nodes could increase device endurance, and thus reliability, for both volatile and nonvolatile memory operation due to decreased susceptibility to stress-induced leakage current (SILC) compared to NCFG devices of equivalent dimensions [11].
In recent years, single-layer NCFG devices have been introduced that permit use of a thinner oxide due to the reduced susceptibility to SILC, which results in a more efficient charge transfer. Compared to traditional single-layer continuous floating-gate devices, these devices could deliver lower operational voltage, lower power consumption, higher programming (erasing) speed, and lower programming voltage. They therefore have great potential for use in static applications such as the field-programmable gate array (FPGA) or the flash analog-to-digital converter (ADC). Unlike the proposed dual floating-gate device, they are incapable of both nonvolatile and volatile memory operation, which makes them unsuitable for unified volatile/nonvolatile applications.
STATIC APPLICATIONS
New circuit architectures exploit NCFG devices as both storage elements and active circuit elements for static applications. These circuit architectures include a bidirectional interconnect design for FPGAs [12] and a compact low-power four-bit flash ADC [13]. A physical model of this NCFG device [14] can be used in the Cadence Virtuoso design platform and allows accurate analysis of speed and power consumption in both circuit architectures using HSPICE simulations.
FPGAs
A new FPGA interconnect design employs bidirectional wiring by introducing NCFG devices for the switch box and the connection block, as Figure 3a shows [12]. These devices, which enable an SRAM-free interconnect, retain their state while system power is off and do not need to be configured at each boot-up.
The switch box itself comprises six NCFG devices. To provide a path, the NCFG device must be in erase mode; if it is in program mode, the path is blocked and the signal cannot propagate through. Routing switches inserted between the switch boxes consist of two NCFG devices and two buffers, which are necessary to maintain full swing and full signal strength throughout the entire interconnect.
In the connection block, the logic-block output is first connected to the routing channels (RChs) via NCFG devices, as Figure 3b shows. One NCFG device for each RCh is then either erased to pass the signal through or programmed to block the signal. Figure 3c shows the connection-block design for the target logic-block input. NCFG devices again function as switches for each RCh; their output nodes are tied together and connected to a buffer. This buffer then drives the incoming signal to the logic block's input. By rule, at any time only one NCFG device should be erased to pass the signal through while the others are all programmed. If none of the channels are to be connected to the logic block, such that their corresponding NCFG devices are all programmed, an additional NCFG device is used to pull the buffer's input node down to GND to avoid a floating node.
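The one-erased-device rule for a logic-block input can be sketched as a small configuration check. This is an illustrative model only: the names `NcfgState`, `configure_input_block`, and `is_valid` are hypothetical, and the assumption that the extra pull-down device is erased (conducting) exactly when every channel device is programmed is an inference from the description above, not a detail taken from the design in [12].

```python
from enum import Enum


class NcfgState(Enum):
    ERASED = "erased"          # conducting: the signal passes
    PROGRAMMED = "programmed"  # non-conducting: the signal is blocked


def configure_input_block(num_channels, selected=None):
    """Return (channel_states, pull_down_state) for one logic-block input.

    selected is the index of the routing channel to connect, or None when
    no channel should drive the input.
    """
    if selected is not None and not 0 <= selected < num_channels:
        raise ValueError("selected channel is out of range")
    channels = [
        NcfgState.ERASED if index == selected else NcfgState.PROGRAMMED
        for index in range(num_channels)
    ]
    # Assumption: the extra pull-down device conducts (is erased) only when
    # every channel device is programmed, so the buffer input never floats.
    pull_down = NcfgState.ERASED if selected is None else NcfgState.PROGRAMMED
    return channels, pull_down


def is_valid(channels, pull_down):
    """Check the one-erased-device rule for a connection-block input."""
    erased = sum(state is NcfgState.ERASED for state in channels)
    if erased > 1:
        return False  # two channels would drive the same buffer input
    if erased == 0:
        return pull_down is NcfgState.ERASED  # input must be tied to GND
    return pull_down is NcfgState.PROGRAMMED


channels, pull_down = configure_input_block(4, selected=2)
print(is_valid(channels, pull_down))
```

A configuration tool could run a check of this kind on every logic-block input before programming the devices, since a violation would leave either a contended or a floating buffer input.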
Compared to traditional bidirectional and modern unidirectional SRAM-based FPGA interconnects [15], the new design reduces total gate area by 87 percent and 63 percent, respectively. Simulations indicate it achieves better performance and consumes less power than a standard bidirectional SRAM-based FPGA. Signal propagation delay through an unclocked routing system with 10 switch boxes is 235 picoseconds shorter. Furthermore, the static and dynamic power consumption in a 32-tap finite impulse response filter is reduced by 58 percent and 34 percent, respectively. Power consumption and propagation delay are similar to those of a modern unidirectional SRAM-based FPGA. However, the increased RCh availability from bidirectional wiring results in improved functionality and lower complexity.
In sum, the new interconnect design's potential gains are
• smaller area,
• decreased power consumption,
• higher speed, and
• increased functionality,
which typically trade off and cannot be achieved simultaneously by SRAM-based counterparts.
Flash ADC
Another new design using single-layer NCFG devices for static applications is a low-power four-bit flash ADC with a sampling rate of 1.2 gigasamples per second [13]. As Figure 4a shows, the encoder unit converts a 16-bit thermometer code into a four-bit binary output [16]. The NCFG devices use threshold inversion quantization to shift the inverter's switching threshold voltage. The output of this inverter constitutes the output of the comparator. A programming and calibration unit for the inverter increases the threshold voltage shift at each quantization level in steps of 0.0125 V. A direct current sweep carried out at the input of the inverter stages produces the corresponding output voltages shown in Figure 4b. The new design displays good linearity, with a maximum integral nonlinearity of 0.35 and a maximum differential nonlinearity of 0.095.
Table 1 compares the NCFG design to existing flash ADC designs in which the inverter's switching threshold voltage is determined by varying the number of NMOS transistors, and hence the effective width, in the inverter's pull-down path (Design A) [17] and by varying the width of the NMOS transistor in the inverter's pull-down path (Design B) [18]. The new design provides the highest efficiency in terms of power, area, linearity, and full-scale range; it also can be reprogrammed if desired.
UNIFIED MEMORY ARRAY
Figure 5 shows a unified memory array that utilizes the proposed dual floating-gate device.
Architecture and analysis
The overall structure is similar to a NOR flash. The resistor R_BL and capacitor C_BL for each bitline represent the wire resistance and capacitance, respectively. The sense amplifier is designed such that it can detect the different voltages on the bitlines due to the modified threshold voltages of the new devices, depending on their states. Note that the device can store two bits. The clock signal is attached to the gate terminal of an NMOS transistor whose drain terminal is connected to its corresponding bitline and is in close proximity to the sense amplifier. If the clock signal goes high, the bitlines are pulled down to GND. This occurs very quickly compared to the precharge mode in a conventional DRAM device. If the clock signal goes low, the corresponding NMOS transistor is off, and one of the word lines turns on to read an entire row.
We designed the memory array in Cadence Virtuoso 2010 using the BSIM4.0 MOSFET simulation model with 45-nm gate-length technology. The dual floating-gate device has the parameters given in Figure 1 with an increased gate length of 45 nm and a gate width of 90 nm. We extended the previously described model for single-layer NCFG devices [14] to include the additional floating-gate layer, solid floating-gate layers, and high-k dielectrics. The size of the array is 32 Kbytes (128 rows × 256 columns). The array is cycled row by row with a clock frequency of 1 GHz and with the supply voltage and reference voltage V_ref set to 1 V. The energy consumption in read mode is 10.1 picojoules per bit read. This includes the memory array, the sense amplifiers, the clock network, and the capacitances and resistances of the bitlines as well as the word lines (not shown in Figure 5). Writing the entire memory array into the dynamic state '1' by applying 3 V row by row on the word line and -2 V on all select lines takes 6.4 microseconds and consumes 9.8 nanojoules per bit written.
Table 2 lists the memory array's cell states, along with the voltages and time constants required to change the threshold voltage from the device's current state to the desired state, as obtained from Sentaurus technology computer-aided design and circuit-level HSPICE simulations. Each transition is effected by applying the voltage envelope (amplitude and duration) to the word line and select line. The device's native threshold voltage is V_T0 ≈ 0.4 V.
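The 6.4-microsecond figure for writing the whole array is consistent with writing the 128 rows one after another at the 50-ns dynamic write time reported for this device. The short sketch below reproduces that arithmetic; the constant names are illustrative, and it assumes that rows are written strictly sequentially with no extra overhead between rows.

```python
# Array geometry and dynamic write time as reported in the text.
ROWS = 128
COLUMNS = 256
DYNAMIC_WRITE_NS = 50  # time to write a dynamic '1' into one row

# Number of dual floating-gate cells in the array.
cells = ROWS * COLUMNS

# Assumption: rows are written back to back, one 50-ns pulse per row.
array_write_us = ROWS * DYNAMIC_WRITE_NS / 1000

print(cells, array_write_us)
```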
Correct operation requires multiple voltage supplies. The memory array reads bits by detecting each transistor's threshold voltage (in these simulations, using a biased inverter). The memory cell states are as follows.
Dynamic write. It takes 50 nanoseconds for the memory array to write a dynamic '1', similar to a DRAM. Changing a dynamic state from '1' to '0' takes 10 µs.
Dynamic retain. This is essential for those cells in a row that are to retain their states during a dynamic write or refresh.
Dynamic refresh. This is only required if a '1' is written in the dynamic state. After 300 milliseconds at room temperature, ΔV_T shifts by about 0.2 V due to charge leakage. Therefore, a read-write cycle every 300 ms is required.
Nonvolatile write. Each memory array changes its nonvolatile state through a read-write cycle one row at a time. It takes 30 µs to change from '0' to '1', and 14 µs to change from '1' to '0', which is still faster than flash memory write and erase. Of course, the write data could be externally sourced too.
Dynamic/nonvolatile read. Because the memory array operates more like an SRAM than a DRAM, reads should be very fast: 1-4 ns including array overhead. The read voltage is less than that for current DRAMs if the nonvolatile state is uncharged. Note that the read is nondestructive: no write-back is needed. In addition, the device can store one bit in the volatile state and another bit in the nonvolatile state. Each bit combination can be detected through a quaternary logic sense amplifier or by performing two successive read cycles at different voltages.
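The two detection options for the two-bit cell can be illustrated with the threshold-voltage shifts listed in Table 2 (0 V, -0.33 V, +1.52 V, and +1.21 V). The sketch below is a behavioral model, not the sense-amplifier circuit: the function names are hypothetical, the state ordering (dynamic bit, nonvolatile bit) is inferred from the fact that a dynamic write changes only the first bit, and the midpoint reference levels are assumptions chosen for illustration.

```python
# Threshold-voltage shifts (in volts) for each cell state, from Table 2.
# Assumption: states are written as (dynamic bit, nonvolatile bit).
DELTA_VT = {
    (0, 0): 0.00,
    (1, 0): -0.33,
    (0, 1): 1.52,
    (1, 1): 1.21,
}


def decode_quaternary(delta_vt):
    """Single read: pick the state whose nominal shift is closest."""
    return min(DELTA_VT, key=lambda state: abs(DELTA_VT[state] - delta_vt))


def decode_two_reads(delta_vt):
    """Two successive reads, each comparing against one reference level."""
    # First read: a large positive shift marks a charged nonvolatile state.
    # Assumed reference: midway between the (0, 0) and (1, 1) levels.
    nv_reference = (DELTA_VT[(0, 0)] + DELTA_VT[(1, 1)]) / 2
    nonvolatile = 1 if delta_vt > nv_reference else 0
    # Second read: within that group, a dynamic '1' lowers the threshold.
    dyn_reference = (DELTA_VT[(0, nonvolatile)] + DELTA_VT[(1, nonvolatile)]) / 2
    dynamic = 1 if delta_vt < dyn_reference else 0
    return (dynamic, nonvolatile)


for measured in (0.05, -0.30, 1.50, 1.25):
    print(measured, decode_quaternary(measured), decode_two_reads(measured))
```

The smallest separation, about 0.3 V between the two states that share a nonvolatile bit, sets the resolution that either read scheme would need.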
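Table 2. Cell states, with the voltages and durations required to change from the current state to the desired state.

Operation        Current state  ΔV_T     Word line  Select line  Duration  Desired state  ΔV_T
Dynamic write    0 0            0 V      3 V        -2 V         50 ns     1 0            -0.33 V
Dynamic write    0 1            +1.52 V  3 V        -2 V         50 ns     1 1            +1.21 V
Dynamic write    1 0            -0.33 V  -3 V       2 V          10 µs     0 0            0 V
Dynamic write    1 1            +1.21 V  -3 V       2 V          10 µs     0 1            +1.52 V
Dynamic retain   0 0            0 V      3 V        2 V          50 ns     0 0            0 V
Dynamic retain   0 1            +1.52 V  3 V        2 V          50 ns     0 1            +1.52 V
Dynamic retain   1 0            -0.33 V  -3 V       -2 V         10 µs     1 0            -0.33 V
Dynamic retain   1 1            +1.21 V  -3 V       -2 V         10 µs     1 1            +1.21 V
Dynamic refresh  1 0            -0.11 V  3 V        -2 V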
Comparison to 1T1C DRAM
The proposed unified memory array offers several advantages compared to a conventional DRAM array with 1T1C cells.
First, the array should have higher density. The 1T1C DRAM cell will have difficulty scaling beyond the 22-nm node due to the capacitor required to maintain a capacitance of 20-25 femtofarads for sufficient charge sharing with the bitline. Furthermore, the transistor cannot scale with CMOS technology because its leakage current must be very low to achieve a minimal refresh period of 64 ms.
Second, the array's read mode is very fast. Depending on the array's size (larger arrays have higher capacitance due to transistor overhead and added wire capacitance), read time ranges from 0.31 ns for 64 rows to 2.18 ns for 1,024 rows. The clock rate thus ranges from 1.61 gigabits per second (Gbps) for 64 rows to 0.23 Gbps for 1,024 rows if a synchronous clock signal is used (an asynchronous clock signal can be used with a shorter low signal, enabling faster operations). Another savings is that the read is nondestructive; that is, no subsequent write cycle is needed.
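The quoted rates follow from the read times if the read occupies one half of a symmetric clock period. The sketch below checks this; `clock_rate_gbps` is a hypothetical helper, and the 50 percent duty cycle is an assumption that happens to reproduce the published figures rather than a stated design parameter.

```python
def clock_rate_gbps(read_time_ns, duty_cycle=0.5):
    """Clock rate when the read fills the given fraction of one period.

    Assumption: with a synchronous clock, the read takes place in one
    phase of the clock, so the full period is read_time / duty_cycle.
    """
    period_ns = read_time_ns / duty_cycle
    return 1.0 / period_ns  # 1/ns equals GHz, i.e. Gbps per bitline


for rows, read_time_ns in ((64, 0.31), (1024, 2.18)):
    print(rows, round(clock_rate_gbps(read_time_ns), 2))
```

Passing a larger `duty_cycle` models the asynchronous case mentioned above, in which the non-read phase is shortened and the effective rate rises.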
Third, a lower word line voltage of 1.2 V is required if the memory array is in a nonvolatile uncharged state, compared to 1.3 V in Intel's 32-nm DRAMs. Writes require higher voltages, depending on whether the storage is volatile or nonvolatile. However, the array's refresh period (the dynamic mode's retention time until most of the charge tunnels back to FG_BOT such that the sense amplifier can no longer detect the change in threshold shift) is 300 ms at room temperature, which is longer than in a conventional 1T1C DRAM with a minimal refresh period requirement of 64 ms. This may result in overall improved energy efficiency in the write mode of the memory array compared to its counterpart.
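The effect of the longer refresh period can be quantified with simple arithmetic on the two figures above. The names in this sketch are illustrative, and it compares refresh counts only; it says nothing about the energy of an individual refresh, which differs between the two technologies.

```python
UNIFIED_REFRESH_MS = 300  # dynamic-mode refresh period of the proposed array
DRAM_REFRESH_MS = 64      # minimal refresh period of a conventional 1T1C DRAM


def refreshes_per_second(period_ms):
    """Number of refresh cycles each row needs per second."""
    return 1000.0 / period_ms


# How many times longer the proposed array can go between refreshes.
period_ratio = UNIFIED_REFRESH_MS / DRAM_REFRESH_MS

print(period_ratio)
print(refreshes_per_second(UNIFIED_REFRESH_MS))
print(refreshes_per_second(DRAM_REFRESH_MS))
```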
APPLICATIONS
The unified memory array permits instant-on computing: it can switch to hibernate mode when the computer is idle and then back to active mode. In principle, the array could switch to hibernate mode in about 30 ms, assuming each memory bank has 1,024 rows and all the banks in all the memories can change state in parallel. A switch back to active mode would take about half this time.
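These estimates can be reproduced from the nonvolatile write times reported for the array. The sketch below assumes that each of the 1,024 rows in a bank takes the 30-µs '0'-to-'1' nonvolatile write when hibernating and the 14-µs '1'-to-'0' transition when returning to active mode; the second assumption is one reading that matches the "about half" figure, not something the text states explicitly. The constant names are illustrative.

```python
ROWS_PER_BANK = 1024
NV_WRITE_0_TO_1_US = 30  # nonvolatile write time per row, '0' to '1'
NV_WRITE_1_TO_0_US = 14  # nonvolatile write time per row, '1' to '0'

# Assumption: all banks work in parallel, so one bank sets the total time,
# and rows within a bank are processed one after another.
hibernate_ms = ROWS_PER_BANK * NV_WRITE_0_TO_1_US / 1000
wake_ms = ROWS_PER_BANK * NV_WRITE_1_TO_0_US / 1000

print(hibernate_ms, wake_ms, wake_ms / hibernate_ms)
```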
In addition, the memory array greatly enhances a computer's resiliency, that is, its ability to withstand and recover from faults. A fast save can capture the array's entire state upon detection of a transient fault or power-down, while checkpointing can greatly decrease its performance impact. The computer can carry out a checkpoint operation in about 30 ms by conducting a read/nonvolatile-write cycle on each row. The array can then continue operating as a two-state device, with a backup store in the nonvolatile portion. As well as being faster, checkpointing is more energy efficient.
A subtler use of the memory array's capability would be to actively control partial hibernation as one method to achieve energy-proportional computing. When the computer is only running background tasks, it can "freeze" portions of the memory so that they are not consuming power and "unfreeze" them in only a few cycles when needed. A study of Google servers indicates that processors consume 60 percent of their peak power when operating at only 10 percent of their peak throughput; furthermore, most of the time processors operate at 60 percent or less of peak capacity [19]. Because the CPU already has many fast-switching partial sleep modes, much of this "wasted power" is due to power consumption in the memory, as the DRAM cannot be put into partial sleep quickly and easily. In contrast, a unified memory array can go into a partial sleep mode in a few milliseconds, depending on how that mode is organized. Recovery is similarly fast.
The dual floating-gate device could also be deployed as a switch in the routers of circuit-switched networks to reduce power consumption while retaining ordinary CMOS performance. One possible structure is similar to that shown in Figure 3 but operates differently: the device stores charge in nonvolatile mode to bring the transistors to a "just off" state, then uses a 50-ns pulse to switch the pass transistor to an on-state or back again. This enables operation of the network in a "just in time" mode wherein circuit paths are set up ahead of when needed and then torn down again when not.
Floating-gate devices have great potential for integration within CMOS logic and therefore to function as active circuit elements rather than as pure storage memories. Normally, these devices are not used for logic due to their low write-cycle count, but they can achieve high write-cycle endurance using direct tunneling for programming or metal nanocrystals in the floating gate. Single-gate devices enable low-power switch boxes for FPGAs and low-power ADCs, and have achieved high efficiency in terms of power, area, speed, and functionality compared to traditional architectures. Dual floating-gate devices can be used as a unified memory, permitting combined volatile and nonvolatile storage in the same device. The state can be changed between the two quickly on a row-by-row basis. The device operates very fast in read mode and achieves a refresh period about five times longer than conventional DRAM arrays.
Such a memory could enable instant-on computers as well as dramatically impact power consumption and resiliency. It could be placed into partial hibernate states so that when the computer is in a low activity state, which is most of the time, its power consumption would be more proportional to its activity. Checkpoints could be taken very quickly, either on a regular schedule or when powering down. The device also could be implemented in a circuit-switched network on chip.
