SRAM continues to be the critical technology enabler for a wide range of applications, from low-power to high-performance computing. This session showcases leading-edge SRAM developments from the semiconductor industry. Intel presents the smallest SRAM bitcell for 10nm technology, with design assist techniques to enable low-VMIN operation. Samsung presents the smallest bitcell for 7nm technology and shows a dual write-driver technique to further improve VMIN. TSMC demonstrates a 7nm 5GHz L1 cache for high-performance computing.
In Paper 11.2, Samsung Electronics presents a 7nm FinFET SRAM using EUV lithography. It adopts a 0.026μm² bitcell, and VMIN is improved with a proposed dual write-driver (DWD) scheme in combination with a negative bitline scheme.
9:30 AM 11.3 A 5GHz 7nm L1 Cache Memory Compiler for High-Speed Computing and Mobile Applications
M. Clinton, TSMC, Austin, TX
In Paper 11.3, TSMC presents a 7nm L1 cache memory compiler, which operates at a 5GHz clock frequency. It implements a self-timing scheme with small-signal sensing and a folded architecture to increase performance.
ISSCC 2018 / SESSION 11 / SRAM / 11.1
A 23.6Mb/mm² SRAM in 10nm FinFET Technology with Pulsed PMOS TVC and Stepped-WL for Low-Voltage Applications
Zheng Guo, Daeyeon Kim, Satyanand Nalam, Jami Wiedemer, Xiaofei Wang, Eric Karl
Intel, Hillsboro, OR
The emergence of cloud computing and big data analytics, accompanied by a sustained growth of battery-powered mobile devices, continues to drive the importance of energy- and area-efficient CPU and SoC designs. Low-voltage operation remains one of the primary approaches for active power reduction, but SRAM VMIN can limit the minimum operating voltage. Device size quantization continues to be a challenge for compact 6T SRAM design in FinFET technologies, where careful co-optimization of the technology and assist circuit design is required for high-density low-voltage array implementations. This paper presents two SRAM array designs in a 10nm low-power CMOS technology featuring 3rd-generation FinFET transistors: a high-density 23.6Mb/mm² array and a low-voltage 20.4Mb/mm² array.
Figure 11.1.1 shows the layout diagrams of a 0.0312μm² high-density 6T SRAM cell (HDC) and a 0.0367μm² low-voltage 6T SRAM cell (LVC) in a 10nm FinFET technology. The HDC utilizes minimum-sized devices, with a fin ratio of 1:1:1 (PU:PG:PD), to minimize cell area, while the LVC features a larger PD device (1:1:2) for improved read stability at low voltage. Self-aligned quad patterning (SAQP) is introduced on critical layers to achieve fin pitches down to 34nm and metal pitches down to 36nm with 193nm immersion lithography [1], enabling a 0.62x area scaling of the 6T SRAM cell relative to a 14nm technology [2]. To further maximize density scaling of the 10nm technology, several key architectural features have been added to achieve further array area scaling of the 128kb HDC and LVC macros: 0.58x and 0.57x, respectively, relative to their 14nm equivalents. Figure 11.1.1 also highlights the cell area (μm²) and array area (mm²/Mb) of recently reported 6T SRAM designs from 14nm, 10nm and 7nm technologies [2-4].
Figure 11.1.2 details two architectural features of the 10nm technology for improved density [1]. The first eliminates the need for isolation dummy gates by introducing a minimum isolation step at the source/drain boundary to isolate neighboring transistors by the width of a single gate. The second enables the placement of gate contacts over active transistors, thus eliminating the need for gate extension over isolation to land contacts. The tables in Fig. 11.1.2 summarize the area scaling of critical SRAM periphery circuits, and the array efficiency and density of 128kb SRAM macros in 14nm and 10nm technologies. The combination of single-gate isolation and contacts over active gates, along with the improved pitch scaling of critical interconnect layers, has enabled aggressive area scaling of critical SRAM peripheral logic from 14nm to 10nm with minimum fin depopulation. As a result, a 77.1% array efficiency and a 23.6Mb/mm² density are achieved for a 128kb HDC macro: a 5.4% area efficiency improvement over a comparable 14nm design. A 78.4% array efficiency and a 20.4Mb/mm² density are achieved for a 128kb LVC macro: a 6.8% area efficiency improvement over a comparable 14nm design.
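The reported macro densities can be cross-checked from the bitcell areas and array efficiencies quoted above. The following is a minimal sketch, not the authors' methodology: it assumes that 1Mb equals 2^20 bits and that array efficiency is the fraction of macro area occupied by bitcells. The function name is hypothetical.

```python
# Illustrative cross-check of macro density from cell area and array efficiency.
# Assumptions (not stated in the paper): 1 Mb = 2**20 bits, and array
# efficiency = bitcell array area / total macro area.

BITS_PER_MB = 2 ** 20
UM2_PER_MM2 = 1e6


def macro_density_mb_per_mm2(cell_area_um2: float, array_efficiency: float) -> float:
    """Return macro density in Mb/mm^2 for a given bitcell area and efficiency."""
    # Density of an ideal array made only of bitcells.
    raw_density = UM2_PER_MM2 / (cell_area_um2 * BITS_PER_MB)
    # Periphery overhead scales the usable density by the array efficiency.
    return raw_density * array_efficiency


hdc_density = macro_density_mb_per_mm2(0.0312, 0.771)  # HDC: 0.0312 um^2, 77.1%
lvc_density = macro_density_mb_per_mm2(0.0367, 0.784)  # LVC: 0.0367 um^2, 78.4%

print(f"HDC: {hdc_density:.1f} Mb/mm^2")
print(f"LVC: {lvc_density:.1f} Mb/mm^2")
```

Under these assumptions, the two results round to 23.6Mb/mm² and 20.4Mb/mm², consistent with the densities reported for the HDC and LVC macros.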
Wordline underdrive (WLUD) is used to improve the low-voltage read and half-select stability of an SRAM cell, trading off performance for VMIN [2]. To minimize the impact of interconnect resistance on WL voltage uniformity between different rows of the decoder, WLUD PMOS devices are implemented locally in the WL driver using a matched layout and routing across neighboring rows. To improve the low-voltage write margin, a column-based transient voltage collapse (TVC) scheme is employed to weaken the PU transistor during a write [2]. In this work, a PMOS device (PWR) is used to discharge the memory cell supply (VCS). Compared to an NMOS device, a PMOS device improves VCS control, but at the cost of discharge speed. VCS can be regulated by PMOS bias devices (PB[1:0]), as illustrated in Fig. 11.1.3. To minimize write energy overhead, a pulsed VCS collapse can be applied with no bias current [2]. To avoid half-select instability along the column due to a low VCS, the TVC pulse must be carefully controlled across a range of array configurations when using pulsed TVC. Since the PMOS transistor drive strength degrades super-linearly with a falling VCS in this configuration, wider TVC pulses can be applied without requiring a bias current to avoid half-select instability. VCS sensitivity to the TVC pulse width and/or array configuration is also reduced, and can be further adjusted by tuning the PMOS transistor Vt. If PMOS bias is required, VCS is determined by the voltage division across PWR and PB[1:0]. The resulting voltage level correlates well to the write margin, compared to an NMOS TVC that can produce a higher VCS under process skew with a lower NMOS:PMOS drive ratio, where the SRAM write margin is degraded.
While WLUD is effective for enhancing read stability, it degrades the write margin. To independently improve the 6T cell's read and write VMIN, a stepped-wordline (S-WL) scheme [5, 6] is implemented to complement the pulsed PMOS TVC write assist. Figure 11.1.3 details the design of the S-WL and PMOS TVC in a 128kb SRAM macro. WLUD pulses (WLUDPULSE[2:0]) are generated from static WLUD bias controls (WLBIAS[2:0]) and the read/write clock. WL suppression is first enabled to create a sufficient BL separation that reinforces cell stability, before WLUDPULSE[2:0] are adjusted to restore the WL to a higher voltage level. Since the BL differential needed to improve read stability is higher than the voltage sensing margin, the read performance is not impacted by S-WL operation. To maximize the effectiveness of the TVC write assist, the TVC pulse is delayed to align it with WL restoration. To minimize interconnect delay along the WLCLK# and WLUDPULSE paths, local buffers are implemented to drive across the 32b sections of the 256b decoder. This reduces the distributed WLCLK# and WLUDPULSE gate loading by 8×, while maintaining the same logic depth for WL generation. Control logic for S-WL is implemented in the timer/control region with a negligible area overhead.
Figure 11.1.4 shows the simulation waveforms during a write cycle using static WLUD, no WLUD and S-WL. With static WLUD, WL is suppressed for the duration of the WL pulse to maintain cell stability while degrading the write margin. Turning off WLUD improves write margin, but compromises read stability. When S-WL is enabled, the WL is suppressed during the first phase to maintain cell stability. After a sufficient BL separation is achieved, WLBIAS#[2:0] are adjusted to raise the WL voltage, aligned to the TVC pulse, to improve the write margin.
SRAM plays an integral role in the power, performance, and area of a mobile system-on-a-chip. To achieve low power and high density, extreme ultraviolet (EUV) technology is adopted for the 7nm FinFET technology [3, 4]. Conventional ArF immersion lithography with a single exposure reaches its limit for extremely high-resolution patterning. Multi-patterning is therefore applied to support high-resolution lithography, but the multiple masks introduce additional process variation. Alternatively, EUV offers competitive scaling with a single mask thanks to its shorter wavelength, which yields smaller process variation with fewer additional patterning steps. Figure 11.2.1 shows a 7nm EUV FinFET 6T high-density (HD) SRAM bitcell with an area of 0.026μm². The pull-up, pass-gate, and pull-down ratios are 1:1:1 for high-density and low-power applications. Another benefit of EUV technology is a bi-directional metal layer with a scaled pitch, which provides an extra degree of freedom for signal and power routing.
SRAM assist is a common technique for achieving low power in recent technologies [2-5]. Since the 6T-HD bitcell alone does not cover the low-voltage range for write and read operations, especially in FinFET technology, SRAM assist techniques are selectively applied to write and/or read operations.
Figure 11.2.3 illustrates conventional SRAM assist schemes that control the WL, BL, and bitcell voltage (VDDC) independently or together to affect bitcell characteristics favorably. The WL is controlled to bolster the Access-Disturbance Margin (ADM) by temporarily trading off against the Write-Margin (WRM). VDDC is lowered to skew the WRM within the safe range of bitcell retention. Meanwhile, a negative BL (NBL) scheme is used as a write-assist technique to improve WRM without affecting ADM. However, the NBL technique is limited in application due to BL resistance; Fig. 11.2.3 illustrates the WRM degradation as the number of rows per BL (RPB) increases. The NBL effect diminishes for a large RPB, and is worst for the bitcell farthest from the write driver. Alternatively, the WL can be used as an assist knob that avoids the resistance impact, since the WL connects to the gate, not the source, of the pass transistor. The WL voltage is ramped slowly from low to high to serve both ADM and WRM [5], at the cost of a timing penalty for a safe ADM. Meanwhile, the BL can be designed with a wider metal width through part of the bitcell array to mitigate the BL resistance [2]. Even so, BL resistance falls by at most 50% of the original value, the limit reached by widening half of the BL to infinite width, and the resulting larger BL capacitance degrades performance.
Conventionally, the write driver is located at the bottom of the SRAM macro to drive the whole bitcell array. Therefore, for the same bitcell variation, the top bitcell, which is located farthest from the write driver, suffers from the worst WRM because it sees the largest BL resistance in the array. To minimize BL resistance effectively, the Dual Write-Driver (DWD) is proposed as a write assist, as shown in Fig. 11.2.4. The DWD uses two write drivers, at the top and bottom, which act coherently within a short time. Since each of the two write drivers is half the size of the conventional single write driver, the DWD has a similar area to the conventional one. Moreover, the bitcell farthest from a write driver is now located in the middle of the bitcell array rather than at the top or bottom.
The effective resistance of the DWD can be estimated with simple reasoning: (1) Since the BL length from a write driver to the farthest bitcell is cut in half, each BL resistance is reduced by 2x. (2) The two write drivers also drive the middle bitcell in parallel at the same time, reducing the BL resistance by 2x again. (3) The effective BL resistance for the bitcell farthest from a write driver is therefore 0.25x that of the conventional single write driver. The top write driver features a Global Write BL (GWBL) that is designed to be enabled together with the bottom write driver within a short time. There are other approaches to decrease the BL resistance in conventional SRAM design: (1) The BL can be designed with a 4x width, which proportionately decreases the ADM. There is therefore a limit to how far the BL width can be increased while keeping an optimum bitcell margin. Moreover, there is a PPA trade-off, such as performance degradation due to a large BL width. (2) Alternatively, a multi-bank architecture can be adopted to provide a smaller BL resistance in each chunk of BL. However, a multi-bank SRAM macro requires white-space that tends to increase area at the boundary between the bitcell array and the periphery, even more so in recent cutting-edge technologies [3]. In contrast, the DWD reduces the BL resistance while maintaining the BL capacitance, at the cost of an additional write-driver path. It mitigates the technology challenges that impose design overhead on conventional approaches. The DWD can handle a 4x larger RPB effectively without trading off bitcell stability and scaling, which is not easily accomplished in conventional SRAM design. As shown in Fig. 11.2.5, the top-most bitcell (256th row among 256 RPB) shows the worst WRM, and a lower bitcell (64th row of 256 RPB) has better WRM, with either no assist or NBL.
However, DWD shows smaller VMIN variation over the bitcell array, which provides better controllability of process margin for mass production. Silicon shows that DWD reduces VMIN variation by up to 8x compared with operation without DWD.
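The 0.25x estimate for the DWD can be reproduced with a simple lumped model. The sketch below is illustrative only: it assumes a uniform BL resistance per row, ideal write drivers with zero output resistance, and no capacitive or timing effects. The function names and the normalized total resistance are assumptions, not taken from the paper.

```python
# Illustrative lumped-resistance model of single vs. dual write drivers.
# Assumptions: uniform BL resistance per row, ideal drivers, DC analysis only.

ROWS_PER_BL = 256
R_TOTAL = 1.0  # full-length BL resistance, normalized


def single_driver_resistance(row: int, rows: int = ROWS_PER_BL) -> float:
    """BL resistance from a bottom-only write driver to the given row."""
    return R_TOTAL * row / rows


def dual_driver_resistance(row: int, rows: int = ROWS_PER_BL) -> float:
    """Effective BL resistance when top and bottom drivers act in parallel."""
    from_bottom = R_TOTAL * row / rows
    from_top = R_TOTAL * (rows - row) / rows
    if from_bottom == 0.0 or from_top == 0.0:
        return 0.0  # the row sits directly at one of the drivers
    # Two resistive paths to the same bitcell combine in parallel.
    return from_bottom * from_top / (from_bottom + from_top)


rows = range(1, ROWS_PER_BL + 1)
worst_single = max(single_driver_resistance(r) for r in rows)
worst_dual_row = max(rows, key=dual_driver_resistance)
worst_dual = dual_driver_resistance(worst_dual_row)

print(f"worst single-driver resistance: {worst_single:.2f}")
print(f"worst dual-driver resistance:   {worst_dual:.2f} at row {worst_dual_row}")
```

In this model the worst case moves from the top row to the middle row, and its effective resistance is one quarter of the single-driver worst case, matching the three-step argument above.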
The SRAM macro area overhead is assessed for the different write-assist schemes in Fig. 11.2.5. NBL requires about 5% area overhead, due to the charge pump and additional buffer. In contrast, DWD shows no more than a 0.5% area overhead, due to the additional driver. Alternatively, a multi-bank architecture can be applied to decrease the resistance per BL. For example, a 4-bank architecture is adopted to implement 64 RPB with a similar VMIN, as shown in the silicon result. However, a 4-bank architecture requires four times the white-space at the bitcell array boundary, compared to a 1-bank architecture. The SRAM macro area also increases by up to 30% for a 4-bank architecture with 64 RPB, versus a 1-bank architecture with 256 RPB. Figure 11.2.7 shows the die photo of the 7nm EUV FinFET SRAM test chips. Chip-A is designed using a 256Mb SRAM macro that explores NBL and DWD write-assist schemes. Chip-B is configured using 512Kb SRAM macros using the 0.026μm² 6T-HD bitcell, which shows a VMIN distribution with NBL assist and DWD impact.
In high-performance computing (HPC) applications, the speed of the L1 cache will typically determine the maximum frequency (fMAX) of the processor core.
Companies that mass-produce high-performance microprocessors commonly build the L1 cache from fully custom macros, to ensure that the performance of the L1 cache does not limit the fMAX or throughput of the processor. In addition, it is common for custom L1 cache designs to use a two-port 8T or a large 6T bitcell, along with domino read logic and a very short BL [2, 3]. These designs trade off density and area for high performance. This paper presents a different approach, one which can satisfy a range of different applications: a memory compiler that can generate more than 10,000 different high-speed L1 cache macro configurations. The 7nm L1-cache compiler described in this paper uses a high-current (HC) 6T bitcell, which is more area efficient than an 8T bitcell. The HC bitcell, along with small-signal sensing, allows for a long BL (256b), leading to further area efficiency improvements. Since these L1 macros are just as likely to be used in mobile applications as in HPC applications, they were implemented using the array dual-rail (ADR) architecture [4]. The ADR architecture (Fig. 11.3.1) allows the periphery circuits of the L1 macro to operate at the same voltage as the processor core: a lower VDD results in dynamic power savings. ADR performance is also improved over an interface dual-rail design when the SRAM and logic supplies are equivalent, as the ADR design does not suffer from level-shifter delays on the inputs or outputs.
Quickly activating the WL is critical for a high-speed L1 cache. The L1 macro is built using a standard SRAM butterfly architecture and places the row decoder and WL drivers in the center of the macro, which reduces the WL RC delay by 4×. Due to the increased wiring and via resistance in advanced nodes (e.g., 7nm), careful layout construction is required to guarantee that the upper and lower WLs are activated at exactly the same time. Within our power, performance and area constraints, we found that a four-WL clock-drive scheme resulted in the best address setup, access time and wiring/circuit area optimization (Fig. 11.3.2). Using wider-than-minimum WL clock (wl_clk<3:0>) wires and reducing the gate load by a factor of four helped speed up WL activation. In addition, by controlling the WL pulse width independently for read and write cycles, we are able to shorten the WL pulse during a write and reduce the dynamic power associated with dummy reads.
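The 4× WL RC improvement from the center-placed decoder is consistent with a first-order distributed-RC estimate, in which wire delay grows with the square of the driven length. The sketch below is a rough illustration under that assumption (Elmore delay of a uniform line, 0.5·r·c·L²). The per-micron resistance and capacitance and the WL length are placeholder values, not data from the paper.

```python
# First-order illustration of why driving the WL from the center helps.
# Assumption: Elmore delay of a uniform distributed RC line, 0.5 * r * c * L^2.
# All numeric values below are hypothetical placeholders.

R_PER_UM = 10.0        # ohm per micron (placeholder)
C_PER_UM = 0.2e-15     # farad per micron (placeholder)
WL_LENGTH_UM = 100.0   # full WL length across the macro (placeholder)


def distributed_rc_delay(r_per_um: float, c_per_um: float, length_um: float) -> float:
    """Elmore delay in seconds to the far end of a uniform RC line."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2


# Driver at one edge: it must drive the full WL length.
edge_driven = distributed_rc_delay(R_PER_UM, C_PER_UM, WL_LENGTH_UM)
# Driver in the center (butterfly): each side sees half the length.
center_driven = distributed_rc_delay(R_PER_UM, C_PER_UM, WL_LENGTH_UM / 2)

speedup = edge_driven / center_driven
print(f"edge-driven delay:   {edge_driven * 1e12:.3f} ps")
print(f"center-driven delay: {center_driven * 1e12:.3f} ps")
print(f"RC delay reduction:  {speedup:.1f}x")
```

Halving the driven length halves both the wire resistance and the wire capacitance, so the ratio is 4× regardless of the placeholder values chosen.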
In an ADR design, the WL driver must use the bitcell voltage (VDDM) for proper bitcell operation. The L1 cache performs voltage level translation from the periphery's supply to VDDM using the NAND gate in the WL decode path. The self-timer scheme used in the L1 cache, described in more detail in the next paragraph, depends on matching all of the delays in the normal and self-timed paths. In this design we copy the entire 4-WL decode/driver block and use it to activate the single tracking WL. This allows us to replicate the layout context and layout-dependent effects (LDE) for this critical portion of the access path.
The rising edge of CLK generates the internal clock (iclkz) and starts an access to the L1 cache. The self-timing scheme controls the setting time of the sense amplifier and the timing of the restore sequence. The internal clock has a very high fan-out, but we are able to generate iclkz and drive it with only one gate delay by using a dynamic clock generator circuit (Fig. 11.3.3).
The self-timing scheme consists of tracking bitcells, which are base-layer identical to normal bitcells, and therefore can track the normal bitcell read current (ICELL) closely. The tracking BL has the same wire and diffusion loading as a normal BL, thus tracking the rate of voltage change very closely and proportionately to the rate of differential development on a normal BL. This scheme uses a tracking WL which is tuned to match the rise time of a normal WL across the full range of columns of the L1 compiler. The differential on the BLs at sense time is flat as a function of columns, which allows us to drive the global IO signals with fast edges. The restore operation start is timed from the sense enable trigger signal, which helps to minimize cycle time.
The HC bitcell can meet the performance targets with a 256b long BL, but there is a significant performance improvement when the BL length is cut in half. This is exactly what is done with what we refer to as the folded option. For this option, we fold the L1 macro over its right edge and cut the BL length in half (Fig. 11.3.4). The capacity of the macro remains the same, but the BL length is halved, leading to a 15-20% reduction in access and cycle time. In our current implementation of the folded macro, the area penalty is approximately 15%. The folding option can offer a sufficient performance boost, for example by pushing the maximum operating frequency of the largest macro (72kb) to over 5GHz.
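To relate the quoted cycle-time reduction to clock frequency, a cycle time shortened by a fraction x raises the achievable frequency by a factor of 1/(1 - x). The short sketch below works through that arithmetic for the 15-20% range. It is only a unit conversion, and the function name is hypothetical.

```python
# Convert a fractional cycle-time reduction into a frequency multiplier.
# This is plain arithmetic (f = 1 / t_cycle), not a model of the folded macro.


def frequency_gain(cycle_time_reduction: float) -> float:
    """Frequency multiplier obtained when cycle time shrinks by the given fraction."""
    if not 0.0 <= cycle_time_reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - cycle_time_reduction)


for reduction in (0.15, 0.20):
    gain = frequency_gain(reduction)
    print(f"{reduction:.0%} shorter cycle -> {gain:.3f}x frequency")
```

A 15-20% shorter cycle thus corresponds to roughly a 1.18× to 1.25× higher frequency, with all other limits on cycle time held constant.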
We recognize that the minimum differential, even with a 6σ weak bitcell, increases as the SRAM bitcell voltage is increased. We take advantage of this fact by offering a turbo mode at higher voltages, where the sense enable timing is advanced. Putting the largest L1 macros into turbo mode at high voltage can result in an additional 5% performance boost.
Compared to a 16nm L1 cache [5] that uses the same architecture, the presented 7nm cache is over 60% smaller (Fig. 11.3.5). The high-speed 7nm L1-cache compiler described in this paper has been verified in silicon. Cycle time measurements made at room temperature and -40°C are presented for a 512×36 and a 1024×72 macro. The measurements were performed on a slow-corner lot. The -40°C results show that the 18kb macro is able to run at 5.36GHz at 1.115V, while the largest 72kb macro is able to achieve 4.4GHz operation at this voltage (Fig. 11.3.6).
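The macro sizes and frequencies quoted above can be tied together with two small conversions. The sketch below is illustrative and assumes that 1kb equals 1024 bits and that the 512×36 and 1024×72 configurations are words × bits per word. The function names are hypothetical.

```python
# Relate macro configurations to capacities, and frequencies to cycle times.
# Assumptions: configurations are (words x bits per word) and 1 kb = 1024 bits.


def capacity_kb(words: int, bits_per_word: int) -> float:
    """Macro capacity in kb for a words x bits configuration."""
    return words * bits_per_word / 1024


def cycle_time_ps(frequency_ghz: float) -> float:
    """Cycle time in picoseconds for a clock frequency in GHz."""
    return 1000.0 / frequency_ghz


small_macro_kb = capacity_kb(512, 36)
large_macro_kb = capacity_kb(1024, 72)

print(f"512x36  -> {small_macro_kb:.0f} kb, {cycle_time_ps(5.36):.1f} ps at 5.36 GHz")
print(f"1024x72 -> {large_macro_kb:.0f} kb, {cycle_time_ps(4.4):.1f} ps at 4.4 GHz")
```

Under these assumptions the two configurations correspond to the 18kb and 72kb macros, with cycle times of about 186.6ps and 227.3ps at the measured frequencies.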
