SUMMARY This paper presents a low-power, low-voltage 64-kb 8T three-port image memory using 28-nm FD-SOI process technology. The proposed SRAM accommodates eight-transistor bit cells comprising one-write/two-read ports and a majority logic circuit to save active energy. The test chip operates at a supply voltage of 0.46 V with an access time of 140 ns. The minimum energy point is at a supply voltage of 0.54 V and an access time of 55 ns (= 18.2 MHz), at which 484 fJ/cycle in a write operation and 650 fJ/cycle in a read operation are achieved with the assistance of majority logic. These values are 69% and 47% smaller than those of a conventional 6T SRAM using the 28-nm FD-SOI process technology.
key words: image memory, multi-port SRAM, 8T, majority logic
Introduction
The application of image recognition is being extended to various fields, such as automatic driving systems, robot vision, and augmented reality systems, along with improved image resolution. Image resolution enhancement leads to increased SRAM capacity, area, and power consumption because of the increased amount of image data. SRAM accounts for 43% of the power consumed by a whole image processor in a 65-nm CMOS process [1]. For wearable devices handling image information, energy-efficient SRAM will be expected, as presented in Fig. 1.
28-nm Fully Depleted SOI (FD-SOI) technology is promising for high-speed, low-voltage SRAM [2]. The 28-nm FD-SOI has fully depleted transistors with an ultra-thin silicon body and BOX layer, giving them excellent electrostatic control. Therefore, it provides stable characteristics in low-voltage operation. The BOX layer reduces the leakage current and thereby controls the electrical flow from the source node to the drain node of a transistor. Moreover, the BOX layer reduces the parasitic capacitance between the source node and the drain node. This feature of 28-nm FD-SOI enables the production of ultra-low-power SRAMs [3]-[8].
Manuscript received December 2, 2015. Manuscript revised March 28, 2016.
† The authors are with the Graduate School of System Informatics, Kobe University, Kobe-shi, 657-8501 Japan.
†† The author is with Renesas Electronics Corporation, Kodaira-shi, 187-8588 Japan.
††† The author is with the Graduate School of Natural Science & Technology, Kanazawa University, Kanazawa-shi, 920-1192 Japan.
* This paper is the modified version of the original paper presented at the IEEE Custom Integrated Circuit Conference (CICC) 2015 [18].
a) E-mail: mori.haruki@cs28.cs.kobe-u.ac.jp
DOI: 10.1587/transele.E99.C.901
Input data for image processing are stored temporarily in SRAM. In an image processor, many processing cores access the SRAM for multi-thread processing, as presented in Fig. 1. Demand for multi-port SRAM has increased to accommodate high-speed and low-power image processing. A multi-port SRAM is suitable for parallel operation and improves the total chip performance. To date, multi-port SRAMs that support simultaneous write and read operations have been proposed for use in image processors [9], [10]. The three-port SRAM is reportedly suitable for use in an image processor [11], [12]. When features of two images are compared, simultaneous read operations are requested of the SRAM cells. Furthermore, realizing real-time processing requires a write operation for the next comparison at the same time as the read operations. Therefore, two read operations and one write operation must be performed simultaneously, which requires multi-port SRAMs that have two-read/one-write access ports for the image processor.
The bitcell layout in the conventional three-port SRAM needs a larger area than an 8T dual-port SRAM because of the larger number of transistors [11]. In particular, an image processor requires a larger multi-port memory capacity, which seriously affects its cost. In this paper, we present an 8T three-port SRAM that is smaller than the conventional three-port one; its area is as small as that of the conventional 8T dual-port SRAM.
We designed a 28-nm FD-SOI 8T three-port SRAM for a low-power image processor and compared it to a conventional 28-nm FD-SOI 6T SRAM. Then we demonstrated a high energy efficiency of sub-pJ/cycle in the proposed SRAM. The remainder of this paper is organized as follows. Section 2 presents the proposed 8T three-port SRAM design and its operation. Measurement results are shown in Sect. 3. The final section summarizes the findings.

A circuit schematic of the proposed 8T three-port SRAM is presented in Fig. 2. It has a pair of write bitlines and two single-ended read bitlines (one-write/two-read bitcell structure). The proposed SRAM has two pull-up PMOSs (load-PMOS), two pull-down NMOSs (drive-NMOS), and four transfer NMOSs (access-NMOS). In this circuit, the M7 and M8 transistors form the two single-ended read ports. The source nodes of the M7 and M8 transistors are connected to node QB. The drain nodes are connected to the read bitlines (RBL A, RBL B). The gate nodes are connected to the read wordlines (RWL A, RWL B). This asymmetrical 8T three-port SRAM cell achieves high density. All transistor W/L sizes in the bitcell are shown in Table 1. The W/L size of the pull-down transistor in the bitcell is chosen to retain a sufficient SNM (static noise margin) even when both read ports are activated.

Figure 3(a) presents the FEOL of the proposed 8T three-port SRAM. The read ports, comprising the M7 and M8 transistors, are arranged separately from the 6T SRAM cell and share a common contact located in the middle as the QB node. This layout achieves a smaller cell area than a symmetrical layout in which the additional read ports are arranged at both ends [13].
Figure 3(b) shows the BEOL of the proposed SRAM. The SRAM cell size is determined by the number of horizontal and vertical wires. In our proposed SRAM, two read ports consisting of M7 and M8 transistors are configured as two single-ended read ports having three bitlines and three wordlines. The cell area is 0.56 µm² based on logic design rules, which is as small as the dual-port 8T bitcell [14], although the number of ports is increased.

The operating waveforms in the read operation are depicted in Fig. 4. No read current flows through the read bitlines (RBL A and RBL B) when the internal node QB is "1". Maximizing the number of "1"s at node QB is therefore important to reduce dynamic power in the read operation.
Precharge-Less Energy-Efficient Write Circuit
Figure 5 presents write schemes for the conventional 6T SRAM and the proposed 8T SRAM. Figure 5(a) depicts the conventional write circuit; it is necessary to precharge a bitline pair to maintain the stability of read operations because both read and write operations use the common bitline pair. Figure 5(b) depicts the precharge-less write circuit. Because the proposed 8T SRAM has dedicated read ports for the read operation, it does not need a precharge scheme on the write bitlines, and successive writes of the same data consume less energy. However, it incurs the well-known half-select problem along the write wordline. The divided wordline structure is therefore adopted to avoid the half-select problem [15].

Figure 6 portrays simplified waveforms during write cycles. Figure 6(a) shows the waveform of the write wordline (WWL) commonly used in the conventional SRAM and the proposed SRAM. Figure 6(b) shows waveforms of the write bitlines (WBL and WBLB) in the conventional write scheme. The charge/discharge power is consumed in every cycle by the precharge to the supply voltage. Figure 6(c) portrays waveforms of the write bitlines in the proposed SRAM. By virtue of the precharge-less write scheme, the charge/discharge power on WBL and WBLB is consumed only when the write datum changes, which reduces the write energy.
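The difference between the two write schemes can be illustrated with a toy transition count on one write bitline pair. This is only a sketch under stated assumptions, not the authors' circuit or measurement method: the function name `bitline_pair_transitions`, the initial state with both lines high, and the idea of counting full-swing transitions as a proxy for charge/discharge energy are all hypothetical.

```python
def bitline_pair_transitions(write_bits, precharge):
    """Count full-swing transitions on one write bitline pair (WBL, WBLB).

    Toy model (an assumption for illustration): each write cycle drives
    (WBL, WBLB) to (b, 1 - b). With precharge, both lines are restored
    to "1" after every cycle; without it, they hold the previous datum.
    """
    transitions = 0
    wbl, wblb = 1, 1  # assumed initial state: both lines high
    for b in write_bits:
        target_wbl, target_wblb = b, 1 - b
        # Count each line that has to change level to drive the datum.
        transitions += int(wbl != target_wbl) + int(wblb != target_wblb)
        wbl, wblb = target_wbl, target_wblb
        if precharge:
            # Conventional scheme: restore the discharged line every cycle.
            transitions += int(wbl != 1) + int(wblb != 1)
            wbl, wblb = 1, 1
    return transitions


if __name__ == "__main__":
    for name, pattern in (("ALL0", [0, 0, 0, 0]), ("01-pat.", [0, 1, 0, 1])):
        with_pre = bitline_pair_transitions(pattern, precharge=True)
        without_pre = bitline_pair_transitions(pattern, precharge=False)
        print(name, "precharge:", with_pre, "precharge-less:", without_pre)
```

In this model, repeated writes of the same datum cost almost nothing without precharge, while the alternating pattern costs about as much as the precharged scheme, which is consistent with the alternating pattern being the worst case for the precharge-less circuit.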
Static Noise Margin in Proposed 8T Three-Port SRAM
A multiport SRAM supports simultaneous accesses from plural cores through read and write ports. Particularly in a one-write two-read (1W2R) three-port SRAM cell, the two read ports are both available at the same time, so simultaneous readouts can occur [16]. Figure 7 shows a variety of read situations in the 1W2R three-port SRAM cell when both read ports are enabled simultaneously. Figure 7(a) depicts two SRAM cells on different row addresses and different column addresses, designated independently. No issues emerge with respect to access conflict. However, simultaneous dual-port readouts of a single SRAM cell activate both RWL A and RWL B, as presented in Fig. 7(b), which might worsen the static noise margin (SNM) because of the doubled read current.

Figure 8 presents simulation results of the SNM in the proposed 1W2R 8T three-port SRAM cell at several supply voltages of Vdd = 0.4-1.0 V. Figure 8(a) depicts the standard butterfly curves in the single-port read situation: an SNM of 171 mV is achieved at 1.0 V, which is 85% of the SNM of the conventional 6T SRAM [2]. Figure 8(b) depicts the worst-case butterfly curves in the simultaneous dual-port reads. The SNM is reduced to 101 mV at 1.0 V. An interesting point is that the maximum SNM of 102 mV is observed at 0.8 V.
Combination with Majority Logic
Our earlier study demonstrated that the majority logic circuit can conserve charge/discharge power on the read bitlines [17]. Image data reflect luminance information: bright pixels have many "1" data; dark pixels have many "0" data. For read energy reduction, dark pixels having many "0"s should be inverted by the majority logic. To maximize the number of "1"s, the majority logic circuit counts the "1"s and decides in a write cycle whether the input data should be inverted, so that "1"s are in the majority. The inversion information is stored in an additional flag bit. In a read cycle, the procedure is reversed: output data are inverted if the flag bit is true, so that the original data can be read. The majority logic does not reduce write energy because the "1" write energy and the "0" write energy are the same. In our proposed SRAM, majority logic conserves charge/discharge power effectively on the read bitlines because the number of "1"s in the input data is maximized.
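The write-side inversion and read-side restoration described above can be sketched in software as follows. This is a minimal behavioral sketch, not the authors' circuit: the 8-bit word width, the function names `majority_encode` and `majority_decode`, and the choice to leave a word unchanged on a tie are assumptions made for illustration.

```python
def majority_encode(word, width=8):
    """Return (stored_word, flag) so that "1"s are in the majority.

    Assumptions: `width` is the word size in bits (8 here as an example)
    and a tie between "1"s and "0"s leaves the word unchanged.
    """
    mask = (1 << width) - 1
    word &= mask
    ones = bin(word).count("1")
    if 2 * ones < width:
        # "0"s are in the majority: store the inverted word and set the flag.
        return (~word) & mask, 1
    return word, 0


def majority_decode(stored_word, flag, width=8):
    """Recover the original word by undoing the inversion if the flag is set."""
    mask = (1 << width) - 1
    return (~stored_word) & mask if flag else stored_word & mask


if __name__ == "__main__":
    dark_pixel = 0b00000011  # many "0"s, as in a dark image
    stored, flag = majority_encode(dark_pixel)
    print(format(stored, "08b"), flag)
    print(majority_decode(stored, flag) == dark_pixel)
```

After encoding, every stored word has at least as many "1"s as "0"s, which is the condition under which the read bitlines discharge least often; the cost is one extra flag bit per word.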
Chip Implementation and Measurement Results
We fabricated a 64-kb 8T three-port SRAM macro using 28-nm FD-SOI process technology. Figure 11 shows the Shmoo plot in write operations. The test chip can operate at a write pulse width of 4 ns. Figure 12 portrays a schematic of the proposed 8T three-port SRAM array and its peripheral circuits. Figure 13 shows the measured leakage and active energies.
In the write operation, the "ALL0" test pattern means successive "0" writes to all bitcells in the memory macro. "ALL1" means successive "1" writes. In those cases, bitcell data do not change, and the bitline charge/discharge energy is saved. The "01-pat." write pattern signifies alternately writing "0"s and "1"s to the bitcells, so charge/discharge power is consumed on the WBLs. This is the worst case in the write operation. The worst-case write energy is 484 fJ/cycle, which is 69% smaller than that in the 6T SRAM (see Fig. 13).
Fig. 11 Write Shmoo plot.
Fig. 12 Schematic of proposed 8T three-port SRAM array and its peripheral circuits.
The BL lengths of the proposed three-port SRAM are 1.3 times longer than those of the conventional 6T SRAM because of the three WLs (1 WWL/2 RWLs) drawn through the 8T bitcell. However, the proposed 8T three-port SRAM does not require the WBL precharge scheme used in the 6T SRAM. Furthermore, its WLs are divided every 16 rows. Therefore, the proposed SRAM can reduce needless energy in the half-selected bitcells. As a result, the write energy is lower than that of the conventional 6T SRAM.
It is noteworthy that the read circuit must have the RBL precharge scheme because of the single-ended read ports. In the read operation, the "ALL0" and "ALL1" test patterns mean successive "0" and "1" read operations, respectively. The "01-pat." read pattern results in the average dynamic energy of "ALL0" and "ALL1". The respective "0" and "1" read energies are 1663.2 fJ/cycle (a read dynamic energy of 1449 fJ/cycle + a read leakage energy of 213.2 fJ/cycle) and 361.7 fJ/cycle (a read dynamic energy of 168.5 fJ/cycle + a read leakage energy of 193.2 fJ/cycle). Consequently, the energy saving in the "1" read operation is 77%. The read energy improvement is, however, merely 35% on average with no majority logic.

Figure 14 portrays the impact of the majority logic on the read energy saving. In the bright Image 1, the read energy was reduced by 23%, whereas in the dark Image 6 the saving reaches 47%. As one might expect, the majority logic is more effective for the dark image. In this case, the read energy is 650 fJ/cycle. Table 2 presents the test SRAM characteristics.

Figure 15 shows the estimated power consumption when the proposed 8T three-port SRAM with the majority logic is applied to our prior work, the ME264 motion estimation processor [18]; the values are scaled by the process node, supply voltage, and operating frequency (28-nm process node, 0.54-V supply voltage, and 50-MHz operating frequency). The ME264 processor has a SIMD systolic-array architecture, and a 10T three-port SRAM is used as a search window and a template block. The power consumed by the proposed SRAM is reduced by 290 µW, which signifies a 24% reduction in total over the conventional processor. Therefore, the proposed 8T three-port SRAM is suitable for the image processor.
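The measured "0" and "1" read energies quoted above suggest a simple way to estimate how the read energy depends on the share of "1"s in the stored data. The sketch below assumes a linear interpolation between the two total energies (dynamic plus leakage), in the spirit of the "01-pat." pattern averaging "ALL0" and "ALL1"; the linear model and the function name `mean_read_energy_fj` are assumptions for illustration and do not reproduce the measurements in Fig. 14.

```python
# Total read energies per cycle as stated in the text (dynamic + leakage).
E_READ0_FJ = 1663.2  # successive "0" reads
E_READ1_FJ = 361.7   # successive "1" reads


def mean_read_energy_fj(ones_fraction):
    """Estimate the mean read energy for a given fraction of "1" reads.

    Assumption: the energy scales linearly between the "ALL0" and "ALL1"
    totals; real data-dependent effects are ignored.
    """
    if not 0.0 <= ones_fraction <= 1.0:
        raise ValueError("ones_fraction must be between 0 and 1")
    return ones_fraction * E_READ1_FJ + (1.0 - ones_fraction) * E_READ0_FJ


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 0.75, 1.0):
        print(fraction, round(mean_read_energy_fj(fraction), 2))
```

Under this model, any step that raises the share of "1"s, such as the majority-logic inversion, lowers the estimated read energy proportionally.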
Conclusion
As described in this paper, we presented an 8T three-port SRAM for an image processor. The proposed SRAM comprises one-write/two-read ports and a majority logic circuit to save active energy. We fabricated a 64-kb 8T three-port SRAM using 28-nm FD-SOI process technology. The test chip exhibits 0.46-V operation with an access time of 140 ns. The minimum energy point is at a supply voltage of 0.54 V and a frequency of 18.2 MHz, at which 484 fJ/cycle in a write operation and 650 fJ/cycle in a read operation are achieved with the assistance of the majority logic. These values are 69% and 47% smaller than those of a 28-nm FD-SOI 6T SRAM.
