Abstract: There is a growing demand for embedded non-volatile memory (eNVM) in the Internet of Things (IoT). Phase change memory (PCM) is a promising candidate for next-generation eNVM owing to its excellent CMOS process compatibility. The deep trench isolation (DTI) process is one of the extra steps needed to form the diode selectors for PCM cells in a 40 nm CMOS process. In this paper, we propose a 40 nm Non-volatile Standard cell Library (NSL) that reuses the DTI process to minimize the spacing between active areas. By taking advantage of the embedded PCM (emPCM) process, the area of logic circuits can thus be reduced by using the NSL. To verify the validity of the NSL, four ring oscillators are fabricated using both the normal standard cell library and the NSL. The measured results show that the circuits built with the NSL work well with almost no performance sacrifice.
Introduction
Analysts forecast that almost 50 billion smart devices will be connected to the Internet of Things (IoT) by 2020 [1]. Most of these devices will lie at the leaf nodes of the IoT, such as RFID tags and sensor-node System-on-Chips (SoCs). There will therefore be great demand for embedded non-volatile memories (eNVMs) in these devices, where they usually serve as security code storage, device configuration storage and data storage during power-down. At present, the mature eNVMs are mainly EEPROM and eFlash [2], whose write and erase operations are energy intensive. Moreover, they have encountered many technical barriers in scaling below 50 nm. Finding a low-power, CMOS-process-compatible eNVM solution is therefore pressing for the IoT. Phase change memory (PCM), one of the most developed NVM candidates at 45 nm and beyond [3], has the advantages of low power, high density and long endurance. PCM is also a back-end-of-line (BEOL) memory, which makes it quite promising for embedded applications [4, 5, 6]. The work in [7] proposed an epitaxial diode array with dual trench isolation for PCM. The diode array is separated by deep trench isolation (DTI) along the bit-line direction and by shallow trench isolation (STI) along the word-line direction. Although the PCM process is fully compatible with the standard CMOS process, three extra masks must be added, which raises the cost. It is therefore important for designers and foundries to reduce the cost of the embedded PCM (emPCM) process. In this paper, we introduce a non-volatile shrink standard cell library (NSL) based on the 40 nm emPCM process to reduce the logic area and cost. The basic idea of the NSL is to reuse the DTI process in place of some of the STI that separates the active areas in standard cells. The logic area can thus be reduced by taking advantage of the emPCM process.
40 nm emPCM process
First, we briefly review the 40 nm emPCM process with diode selectors. A 3D view of the PCM array is illustrated in Fig. 1. The diode selectors are formed by the P+ diffusion layer and the buried N+ layer (BNL). The PCM cell array is patterned on top of the diode anodes via a standard CMOS contact (called CT1) and the heater. Through another contact, or top electrode (called CT2), the PCM cells are connected to the bit lines (BLs) in the metal 1 layer. The BNL is picked up as the word lines (WLs) in the metal 2 layer. The WLs are separated by DTI across the whole BNL to suppress cross-talk. The DTI is filled with liner oxide and then undoped polysilicon, with a depth of about 1 µm and a width of 72 nm. The diodes for different BLs are separated by the typical STI process with a depth of about 300 nm. The cathodes of the PN-junction diodes along one word line are connected together through the shared BNL. With the dual-trench isolation process, the PCM array achieves very high density.
At present, the DTI process is used only to form the diode array and is unrelated to the digital circuits. The general isolation between active areas is STI, whose width is set to 90 nm in the 40 nm design rule. The standard cell library, which is generally provided by the foundry for digital integrated circuit (IC) design, is a group of cells implementing logic functions (e.g., AND, NAND, buffer) or storage functions (flip-flop or latch). Digital designers use logic synthesis tools to translate the register-transfer level (RTL) description into a gate-level netlist built from the cells in the library. These full-custom cells have a fixed height and a variable width, which allows them to be placed in rows and eases automated digital layout. If we reuse the DTI process to replace some of the STI between active areas in the standard cells, the area of digital circuits is reduced automatically. The detailed implementation is introduced next.
In this section, the NSL is introduced based on the 40 nm emPCM process described in Section 2. The active areas are normally isolated by STI in the standard CMOS process, as shown in Fig. 2-(a). The width of the STI (STIW) is relatively large to avoid latch-up and parasitic leakage current, so the poor scalability of the spacing between active areas limits VLSI CMOS density. The DTI process is one of the promising technologies for scaling the space between active areas while preventing latch-up and leakage current at the same time [8]. The space between two active areas can be reduced to the width of the DTI (DTIW) [9]. However, adding an extra DTI step to a pure CMOS process is cost intensive. In the 40 nm emPCM process with diode selectors, by contrast, the DTI process already exists. As a result, we can reuse the DTI to minimize the space between the active areas, as shown in Fig. 2-(b). The width of the DTI in the 40 nm emPCM process is 72 nm, compared with 90 nm for the STI, so 20% of the spacing is saved.
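The 20% figure follows directly from the two isolation widths quoted above (90 nm for STI and 72 nm for DTI). A minimal Python check is sketched below; the function name `spacing_saving` is only illustrative and does not come from the design flow.

```python
def spacing_saving(sti_width_nm: float, dti_width_nm: float) -> float:
    """Fractional reduction in active-area spacing when DTI replaces STI."""
    return (sti_width_nm - dti_width_nm) / sti_width_nm


STI_WIDTH_NM = 90.0  # STI width in the 40 nm design rule (from the text)
DTI_WIDTH_NM = 72.0  # DTI width in the 40 nm emPCM process (from the text)

saving = spacing_saving(STI_WIDTH_NM, DTI_WIDTH_NM)
print(f"Spacing saved per replaced isolation: {saving:.0%}")
```

This saving applies only to each spacing where DTI actually replaces STI; the reduction of a whole cell or macro is smaller because not every STI position can be replaced.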
The NSL applies this idea to shrink standard cells. It should be pointed out that some standard cells can be shrunk by replacing STI with DTI, while others cannot. To better explain the principle of the NSL, two layouts of a standard cell, one using normal STI and one using DTI to replace STI, are given in Fig. 3. The layout using normal STI is shown in Fig. 3-(a), in which six STI positions are marked. Positions 1 and 4 can hardly use DTI because there would not be enough space left for metal wiring. The other positions can use the DTI process to break the spacing limit between active areas (the minimum STI spacing), as shown in Fig. 3-(b). The cell area is clearly reduced. The isolation between the top active area in the N-well (for PMOS) and the bottom active area (for NMOS) cannot use DTI because other design rule constraints still apply, such as the minimum space from the N-well boundary to the active area. Another concern is that the height of the standard cells must be kept at a fixed value, so there is no scope for using DTI there. Table I gives the area comparison of four standard cells between the normal standard cell library and the NSL. The area of the NSL cells is considerably reduced. If these cells are widely used in the digital design flow, the digital logic area will be reduced.
Regarding the area of synthesized logic circuits, we confirmed a 9.9% reduction for a typical macro design, consisting of an ARM processor and some peripherals, implemented with the NSL in comparison with the same design using the normal standard cell library. The NSL is thus shown to be effective in reducing macro size.
Dummy fill, one of the design-for-manufacturability (DFM) techniques, is an effective way to reduce process variation and improve yield in advanced integrated circuit manufacturing. In the 40 nm emPCM process, the DTI dummy pattern is placed not only in the PCM array but also in the logic area for DFM. Using DTI to shrink the logic area therefore has another benefit: it can serve as the dummy pattern in the logic area and improve the yield of the entire wafer. As described above, only some of the STI can be replaced by DTI, so STI remains dominant in the digital logic area. The inserted DTI therefore has no impact on the STI pattern.
Measurements and data analysis
Cells from the standard cell library or the NSL on the 40 nm platform switch at the picosecond level, so the difference is very hard to capture with test equipment if only a single shrink cell is tested. To verify the validity and performance of the NSL, a test chip was fabricated in the 40 nm emPCM process. The test chip has two blocks with identical circuits and functions. The only difference is that block 1 is built with cells from the NSL, while block 2 is built with cells from the normal standard cell library.
Block function description
Each block consists of a decoder, four ring oscillators (ROs), four frequency dividers, a MUX and a counter, as shown in Fig. 4. Each of the four ROs is composed of an odd number of inverters connected in sequence to form a feedback loop. To facilitate testing, the output of the selected RO is fed to a frequency divider to lower its frequency. The frequency divider has four outputs, CLK2, CLK4, CLK8 and CLK16, which are the original clock divided by 2, 4, 8 and 16, respectively. The frequency selection signals Freq_SEL<1:0> of the MUX then select one clock to drive the 16-bit counter, which counts while the enable signal CNTEN is asserted. Specifically, only the four frequency dividers and the counter in the shrink block are built with cells from the NSL, as marked by the red dotted line in Fig. 4; the other function modules are not, because the cells they use cannot be shrunk.
Fig. 3. Layout of a standard cell using STI (a) and using DTI to replace some of the STI (b).
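The measurement chain described above (RO, frequency divider, MUX, 16-bit counter) can be sketched as an ideal behavioral model in Python. This is only an illustration under stated assumptions: the mapping from the Freq_SEL code to the divide ratio is assumed, since the text does not give the encoding, and the function names and example numbers are hypothetical.

```python
# Assumed Freq_SEL<1:0> encoding; the actual mapping is not given in the text.
DIVIDE_RATIO = {0b00: 2, 0b01: 4, 0b10: 8, 0b11: 16}  # CLK2, CLK4, CLK8, CLK16

COUNTER_BITS = 16  # width of the counter, CNT<15:0>


def counter_value(ro_period_ns: float, freq_sel: int, t_cnten_ns: float) -> int:
    """Ideal count reached while CNTEN is high for t_cnten_ns.

    Counts full periods of the divided clock and wraps like a 16-bit counter.
    """
    divided_period_ns = ro_period_ns * DIVIDE_RATIO[freq_sel]
    full_periods = int(t_cnten_ns // divided_period_ns)
    return full_periods % (1 << COUNTER_BITS)


def max_enable_time_ns(ro_period_ns: float, freq_sel: int) -> float:
    """A CNTEN pulse width up to which the 16-bit counter does not wrap."""
    divided_period_ns = ro_period_ns * DIVIDE_RATIO[freq_sel]
    return ((1 << COUNTER_BITS) - 1) * divided_period_ns


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not measured data.
    print(counter_value(ro_period_ns=1.0, freq_sel=0b00, t_cnten_ns=1000.0))
    print(max_enable_time_ns(ro_period_ns=1.0, freq_sel=0b11))
```

The model makes one practical point explicit: a larger divide ratio lengthens the enable window that can be used before the 16-bit counter wraps around.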
Measured results
No matter which frequency divider ratio (FDR) is selected, $T_C$ (the period of the final clock driving the counter) can be calculated from the enable time for counting and the count value:
$$T_C = \frac{T_{CNTEN}}{N}$$
where $T_{CNTEN}$ is the counting time, i.e. the pulse width of the CNTEN signal, and $N$ is the count value (CNT<15:0>). We use the value of $T_C$ to represent the performance of the shrink block and the normal block. To ensure the accuracy of $T_C$, numerous tests with different, sufficiently long pulse widths $T_{CNTEN}$ were executed, and the average value (µ) and the standard deviation (σ) of $T_C$ were then calculated. A larger X represents a larger internal load capacitance in the RO, leading to a longer clock period. The results show that the shrink block works as well as the normal block. The difference in $T_C$ between the two blocks under the same conditions is small. For further analysis we take CLK2 of Ring Oscillator 1 as an example. From Fig. 6 we can also see that the shrink block has a longer clock period than the normal block. This can be explained by the capacitive coupling of the metal wiring. As the layout comparison in Fig. 3 shows, using DTI to replace STI does not change the shape of the active area, so it has little influence on the parasitic diffusion capacitance. By contrast, as the spacing between active areas decreases, the main effect is more compact metal wiring, which increases the coupling and slows the circuit down slightly. The study in [10] also shows that in advanced process nodes the impact of interconnect on circuit performance is significant in comparison with that of the devices. The NSL causes a difference of 0.453 ns from the normal block, which is only 1.8% of the clock period. For mature CMOS technologies, the σ/µ ratios of most MOSFET parameters typically range from 1% to 10% [11]. Compared with the normal block, the average $T_C$ rises by only 1.59%, 2.30% and 1.49%, which is within the range of permitted variation. It should also be noted that the difference is accumulated over a number of shrink standard cells.
For one specific cell, this difference is smaller than 30 ps, which would have little impact on digital designers. To summarize, the test results show that the cells from the NSL work well and that the DTI process causes no significant performance loss.
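The data reduction described in this section (the clock period obtained from the CNTEN pulse width and the count value, then the mean and standard deviation over repeated tests, then the relative rise of the shrink block over the normal block) can be sketched as follows. The function names and the sample numbers are hypothetical and are not the measured data; the use of the population standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def clock_period_ns(t_cnten_ns: float, count: int) -> float:
    """Period of the counter clock: enable pulse width divided by count value."""
    return t_cnten_ns / count


def summarize(measurements):
    """Return (mu, sigma) of the clock period over (t_cnten_ns, count) pairs."""
    periods = [clock_period_ns(t, n) for t, n in measurements]
    return mean(periods), pstdev(periods)  # population sigma (assumption)


def relative_increase(mu_shrink: float, mu_normal: float) -> float:
    """Fractional rise of the shrink-block period over the normal block."""
    return (mu_shrink - mu_normal) / mu_normal


if __name__ == "__main__":
    # Made-up (pulse width in ns, count value) pairs for illustration only.
    normal_block = [(10000.0, 400), (20000.0, 800)]
    shrink_block = [(10200.0, 400), (20400.0, 800)]

    mu_n, sigma_n = summarize(normal_block)
    mu_s, sigma_s = summarize(shrink_block)
    print(f"normal: mu={mu_n:.3f} ns, sigma={sigma_n:.3f} ns")
    print(f"shrink: mu={mu_s:.3f} ns, sigma={sigma_s:.3f} ns")
    print(f"relative increase: {relative_increase(mu_s, mu_n):.2%}")
```

Averaging over several pulse widths reduces the effect of the one-count quantization error of the counter, which is why sufficiently long enable pulses are used.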
Conclusion
In this paper, a non-volatile standard cell library (NSL) is proposed that reuses the DTI process to separate active areas in the 40 nm emPCM process. The technique brings no extra cost, while its benefits are logic area reduction and yield improvement. The test results validate that circuits built with shrink cells work well with almost no performance sacrifice. In the next stage, we will enrich the set of cells in the NSL and then apply them to all logic implementations in the digital design flow.
