Abstract-This paper reports a 1-V 128-kb four-way set-associative CMOS cache memory implemented in a 0.18-µm CMOS technology using a wordline-oriented tag-compare (WLOTC) structure with the 10-transistor tag cell usually used for content-addressable memory (CAM), for low-voltage low-power VLSI system applications. Owing to the WLOTC structure with the CAM 10-transistor tag cell for accommodating the one-step hit/miss generation and the dynamic pulse generators for realizing read-enable signals, a small hit access time (3.5 ns), low power consumption (4.1 mW at 50 MHz), and good expansion capability without sacrificing speed have been obtained.
Index Terms-Cache, CMOS, content-addressable memory (CAM), low voltage, set associative, VLSI.
I. INTRODUCTION
LOW POWER and low voltage have become unavoidable requirements for portable computers and wireless communication systems. Cache memory is a key memory device in CPU-related VLSI systems, communication networks, and advanced DRAMs [1], [2] for achieving high data transfer rates. In these VLSI systems, cache memory usually consumes a majority of the power provided, so a low-power cache memory is very important for meeting the low-power requirements. There are three types of cache memory depending on the mapping technique adopted: fully associative mapping [3], direct mapping [4], and set-associative mapping [5]. Among these three, the fully associative mapping technique has the best hit rate; however, its access time is the longest and its power consumption is the largest. The direct mapping technique has the shortest access time, but its hit rate is the worst. The access time and the hit rate of the set-associative mapping technique lie between these two extremes, and this technique has been frequently applied to realize cache memory.
Until now, cache memory chips have usually been implemented using the bitline-oriented tag-compare (BLOTC) structure [5], [6], where a sense amplifier is required for each bitline. When the size of the cache memory is large, the sense amplifier is slow due to the large parasitic capacitance associated with the bitline. In addition, the signal from the output of the sense amplifier needs to be compared with the index to produce the hit/miss signal. This two-step procedure in the generation of the hit/miss signal can be slow when the size of the cache memory is large. In this paper, a 1-V 128-kb four-way set-associative CMOS cache memory implemented in a 0.18-µm CMOS technology using the wordline-oriented tag-compare (WLOTC) structure with the 10-transistor tag cell usually used for content-addressable memory (CAM) is reported for low-voltage low-power VLSI system applications.
It will be shown that, owing to the WLOTC structure with the CAM 10-transistor tag cell and the dynamic pulse generators for realizing read enable signals, a small hit access time (3.5 ns) and low power consumption (4.1 mW at 50 MHz) have been obtained. In the following sections, the cache memory design based on the WLOTC structure with the CAM 10-transistor tag cell for accommodating the one-step hit/miss signal generation is described first, followed by performance, discussion, and conclusion.
II. CACHE MEMORY DESIGN
Fig. 1 shows the block diagram of the 1-V 128-kb four-way set-associative CMOS cache memory using the WLOTC structure with the CAM 10-transistor tag cell. As shown in the figure, this 128-kb cache memory is composed of four tag portions, two memory portions, a predecoder, a timing controller, and a data sense amplifier region.
Fig. 2 shows the schematic of the tag portion in the 128-kb four-way set-associative CMOS cache memory. As shown in the figure, there are 64 tag segments in the tag portion. Each tag segment is made of eight columns of 20 tag cells (TC in the figure). Each tag segment has eight pairs of write wordlines (W-WL) and read wordlines (R-WL). At the top of each column, there is a second-level decoder, which generates the sense wordline (S-WL) and the read wordline (R-WL) based on the read enable signal (RE). In each tag segment, the eight pairs of sense wordlines (S-WL) and read wordlines (R-WL) from the eight columns are connected to a tag sense amplifier at the bottom. In each tag portion, there are 64 tag sense amplifiers. For the four tag portions in the whole chip, there are 256 tag sense amplifiers in total.
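The counts quoted for the tag array can be cross-checked with a few lines of arithmetic. The following is only a bookkeeping sketch in Python; the constant names are illustrative and do not appear in the paper.

```python
# Tag-array organization figures quoted in the text (names are illustrative).
TAG_PORTIONS = 4            # four tag portions on the chip
SEGMENTS_PER_PORTION = 64   # tag segments per tag portion
COLUMNS_PER_SEGMENT = 8     # columns of tag cells per segment
CELLS_PER_COLUMN = 20       # tag cells per column

# Each segment shares one tag sense amplifier among its eight columns.
columns_per_portion = SEGMENTS_PER_PORTION * COLUMNS_PER_SEGMENT
total_sense_amps = TAG_PORTIONS * SEGMENTS_PER_PORTION
total_tag_cells = TAG_PORTIONS * columns_per_portion * CELLS_PER_COLUMN

print(columns_per_portion)  # 512 columns, addressable with 9 bits
print(total_sense_amps)     # 256 tag sense amplifiers
print(total_tag_cells)      # 40960 tag cells, i.e. 40K
```

The 512 columns per tag portion agree with the 9-bit read address, and the 40960 cells agree with the "40K tag cells" quoted later in the discussion.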
A. Tag Portion
Until now, cache memory chips were usually designed using the BLOTC structure [5], [6] with conventional SRAM memory cells. In the BLOTC structure, a sense amplifier is required for each bitline. When the size of the cache memory is large, the sense amplifier is slow due to the large parasitic capacitance associated with the bitline. In addition, the signal from the output of the sense amplifier needs to be compared with the index to produce the hit/miss signal. This two-step procedure in the generation of the hit/miss signal can be slow when the size of the cache memory is large.
Unlike the BLOTC structure usually adopted in cache memory design, which needs a sense amplifier for each bitline and a two-step procedure to generate the hit/miss signal, this cache memory uses the vertical WLOTC structure with the 10-transistor tag cell usually used in CAM [7], [8]. In the following, it will be shown that, owing to the WLOTC structure with the CAM tag cell for accommodating the one-step hit/miss generation, high speed, low power, and convenient expansion capability can be obtained.
1) Tag Cell: For conventional cache memory circuits, the BLOTC structure is usually used, and its tag cell is usually a conventional SRAM cell. Here, in order to facilitate the WLOTC structure, the 10-transistor tag cell circuit generally used in the CAM, as shown in Fig. 3(a), is used in the cache memory circuit. As shown in Fig. 3(a), the two-port 10-transistor tag cell with built-in compare capabilities [6]-[8] is derived from the tag cell usually used in CAM circuits. In the tag cell, in addition to the conventional 6-transistor SRAM cell portion, there are four transistors (MN6-MN9) in the tag-compare portion. In the 6-transistor SRAM cell portion, a complementary pair of write bitlines (W-BL and its complement) is connected to the internal storage nodes BIT0/BIT1 via pass transistors controlled by the write wordline (W-WL). In the tag-compare portion, the source terminals of MN8 and MN9, which are controlled by the read bitlines (R-BL and its complement), are connected to the read wordline (R-WL) instead of ground as in the conventional 10-transistor tag cell used in the CAM [7]. When the tag cell is not selected, R-WL remains high such that this tag cell is not accessed. When the tag cell is selected, R-WL is grounded. If the conventional CAM cell approach were adopted, one more nMOS device controlled by R-WL would be necessary between ground and S-WL, so the nongrounded read wordline approach makes the overall sense structure more compact. At the top of the tag-compare portion, the drain terminals of MN6 and MN7, which are controlled by the internal storage nodes (BIT0, BIT1), are connected to S-WL. Before the read operation, read enable (RE) is low, and S-WL is precharged to high.
When the tag cell is not accessed, R-WL is high and both paths formed by MN6/MN8 and MN7/MN9 are off. Therefore, S-WL remains high. When the tag cell is accessed, RE is high and R-WL is low. When the index signal imposed on the read bitline (R-BL) differs from the data in the internal storage node (BIT0), one of the two paths formed by MN6/MN8 and MN7/MN9 from S-WL to R-WL is on. Therefore, S-WL is pulled low, indicating a "miss." If the index signal imposed on R-BL is the same as the data in the internal storage node (BIT0), both paths formed by MN6/MN8 and MN7/MN9 are off. As a result, S-WL remains high, which indicates a "hit." As shown in Fig. 3(b), using a 0.18-µm CMOS technology, the layout area of this 10-transistor tag cell is 6 µm × 2.92 µm, where three layers of metal lines have been used.
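The compare behavior of a column of tag cells sharing one sense wordline can be summarized at the logic level. The sketch below is a behavioral abstraction in Python, not a transistor-level model: the function names are hypothetical, and it does not distinguish which of the two pull-down paths (MN6/MN8 or MN7/MN9) conducts on a mismatch.

```python
def cell_discharges(stored_bit: int, index_bit: int) -> bool:
    """True if this cell opens a path from S-WL to R-WL (bit mismatch)."""
    return stored_bit != index_bit


def sense_wordline(stored_tag, index, column_selected: bool) -> int:
    """Return the S-WL level (1 = high, 0 = low) after the compare phase.

    S-WL is assumed to have been precharged high beforehand.
    """
    if not column_selected:
        # R-WL stays high, so no pull-down path exists and S-WL keeps its precharge.
        return 1
    # R-WL is low: any mismatching cell pulls the shared S-WL low.
    mismatch = any(cell_discharges(s, i) for s, i in zip(stored_tag, index))
    return 0 if mismatch else 1


stored = [1, 0] * 10                  # a 20-bit stored tag
one_bit_off = [0] + stored[1:]        # same tag with the first bit flipped

print(sense_wordline(stored, stored, True))        # 1: hit, S-WL stays high
print(sense_wordline(stored, one_bit_off, True))   # 0: miss, S-WL pulled low
print(sense_wordline(stored, one_bit_off, False))  # 1: column not selected
```

Because a single mismatching cell is enough to discharge S-WL, the hit/miss decision for all 20 bits is available on one wire, which is what allows the one-step hit/miss generation.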
2) Second-Level Decoder: As previously described, in each column of 20 tag cells, there is a second-level decoder at the top, as shown in Fig. 2. In each segment, eight pairs of sense wordlines (S-WL) and read wordlines (R-WL) are connected to a tag sense amplifier at the bottom for facilitating the WLOTC structure with the one-step hit/miss generation procedure. The second-level decoder in each column of tag cells generates S-WL and R-WL based on the outputs from the predecoder shown in Fig. 1. Note that in the predecoder, the nine bits of the read address (R-ADDR) are divided into three groups. Each group of three read address bits is used to produce eight bits of predecoder outputs (PRE-DECs). Each group contributes one bit of its predecoder outputs to form a set of three inputs, PRE-DEC0, PRE-DEC1, and PRE-DEC2, to the second-level decoder shown in Fig. 4(a) for producing the R-WL for a column in a tag segment. In total, there are eight such sets of inputs PRE-DEC0, PRE-DEC1, and PRE-DEC2 for the eight read wordlines of the eight columns in a tag segment.
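The two-level decoding scheme can be illustrated with a small behavioral sketch in Python. It rests on assumptions the text does not spell out: which address bits form which group, and how columns are numbered, are chosen here purely for illustration, and the function names are hypothetical.

```python
def predecode(addr: int):
    """Split a 9-bit address into three 3-bit groups and one-hot decode each.

    Returns three lists of eight predecoder outputs (assumed bit grouping:
    bits 0-2, 3-5, and 6-8).
    """
    if not 0 <= addr < 512:
        raise ValueError("address must fit in 9 bits")
    groups = []
    for g in range(3):
        field = (addr >> (3 * g)) & 0b111
        groups.append([1 if k == field else 0 for k in range(8)])
    return groups


def second_level_select(groups, column: int) -> int:
    """Return 1 if the three predecoder lines wired to this column are all high."""
    picks = [(column >> (3 * g)) & 0b111 for g in range(3)]
    return int(all(groups[g][picks[g]] for g in range(3)))


groups = predecode(300)
selected = [c for c in range(512) if second_level_select(groups, c)]
print(selected)  # [300]: exactly one column is selected
```

With three groups of eight one-hot lines, each second-level decoder needs only three inputs, yet 8 × 8 × 8 = 512 columns can be selected uniquely.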
As shown in Fig. 4(a), in the second-level decoder, when the precharge signal (PRECHARGE) is low, S-WL is pulled high, which indicates that a column of 20 tag cells is ready for the tag-compare procedure in the next clock cycle. The precharge signal (PRECHARGE), which is generated by the timing control circuit, is low when the tag-compare operation is done, regardless of whether the result is a hit or a miss. As shown in Fig. 4(a), for the column designated by the predecoder output signals PRE-DEC0, PRE-DEC1, and PRE-DEC2, when the internal read enable signal (REI) is high, the read wordline (R-WL) is low, indicating that the specified column of tag cells is ready for the tag-compare operation. Dynamic logic circuit techniques with active pull-up have been adopted in the second-level decoder circuit shown in Fig. 4(a) for enhancing the speed performance.
3) Quasi-Static Pulse Generator: Fig. 4(b) shows the quasi-static pulse generator for producing the REI signal. The REI signal is derived from the read enable signal in the quasi-static pulse generator using dynamic logic circuit techniques. As shown in Fig. 4(b), in a conventional pulse generator, marked by the surrounding dashed line, the REI pulse is derived from RE ANDed with a delayed version of RE, with its pulse width determined by the delay of the inverters. If the parasitic capacitance at the internal node P0 is large, the pulse generated at P0 becomes triangular, which may not be sufficient for the correct function of REI. Note that the large parasitic capacitance at P0 is due to the large driver needed to produce the REI signal with a short delay, considering the large parasitic capacitance that REI has to drive.
In the new quasi-static pulse generator, the positive-edge trigger technique has been adopted. Its operation is as follows. Initially, the RE signal switches from high to low, and the internal node P2 is pulled high by MP0 since P1 is high although MN0 is off. When RE rises to high, P2 is pulled low since both MN0 and MN1 are on. Once P2 is low, P3 is high, and MN3 turns on. Therefore, P4 is pulled low and P1 is pulled low. As a result, MP0 turns on and P2 turns high. The time for which P2 stays low is determined by the propagation delay of the inverters in the front. If RE turns low again in less time than the propagation delay of the inverters in the front, P2 will not turn high since MN0 is off. The width of the low pulse at P2 is therefore stable regardless of the pulse width of RE, which implies that the negative edge of RE cannot have any effect on P2. This is why the generator is called "positive-edge triggered." Therefore, the pulse width of REI is totally determined by the propagation delay of the inverters in the front, regardless of the pulse width of RE.
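The key property of the positive-edge-triggered generator, a fixed REI pulse width that does not depend on the width of RE, can be captured in a discrete-time behavioral sketch. This Python model is an abstraction under stated assumptions: it does not represent the internal nodes P0-P4, REI is taken as an active-high pulse, and the hypothetical parameter `width` (in samples) stands in for the propagation delay of the inverter chain.

```python
def rei_waveform(re_samples, width):
    """Emit a fixed-width pulse on each rising edge of RE.

    Falling edges of RE are ignored, so the output pulse width depends only
    on `width`, which stands in for the inverter-chain delay.
    """
    out = []
    remaining = 0   # samples left in the current output pulse
    prev = 0        # previous RE sample, used for edge detection
    for re in re_samples:
        if re == 1 and prev == 0 and remaining == 0:
            remaining = width          # rising edge starts a new pulse
        out.append(1 if remaining > 0 else 0)
        if remaining > 0:
            remaining -= 1
        prev = re
    return out


short_re = [0, 1, 0, 0, 0, 0, 0]   # RE pulse narrower than the delay
long_re = [0, 1, 1, 1, 1, 1, 0]    # RE pulse wider than the delay

print(rei_waveform(short_re, 3))   # [0, 1, 1, 1, 0, 0, 0]
print(rei_waveform(long_re, 3))    # [0, 1, 1, 1, 0, 0, 0]
```

Both inputs give the same output pulse, which mirrors the statement that the negative edge of RE has no effect on the generated pulse.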
4) Tag Sense Amplifier: In the WLOTC structure with the CAM 10-transistor tag cell for the one-step hit/miss signal generation, eight pairs of sense wordlines and read wordlines are connected to a tag sense amplifier. Fig. 4(c) shows the tag sense amplifier used in this 1-V 128-kb four-way set-associative CMOS cache memory. The core of the tag sense amplifier is made of the latch amplifier MP8, MP9, MN0, MN1 and the current sources MSA1 and MSA0. The sense wordlines S-WL0 to S-WL3 and the read wordlines R-WL0 to R-WL3 are connected to one side of the latch amplifier, and the sense wordlines S-WL4 to S-WL7 and the read wordlines R-WL4 to R-WL7 are connected to the other side. Before the tag-compare operation, the pull-up signal (PULL-UP) is low, thus both sides of the latch amplifier are pulled high. In addition, before the tag-compare operation, the preset signal (PRESET) is also low to precharge the hit and the miss signals (HITA, HITB, MISS) to high. Note that the complementary signals DUMMY and PULL-UP are used to compensate for the clock feedthrough effects.
During the tag-compare operation period, the read wordline signal of the selected column is low. Thus the current sources MSA1 and MSA0 are turned on to activate the latch amplifier. If a sense wordline (for instance, S-WL0) is high, the side of the latch amplifier connected to S-WL0 is high. The other side of the latch amplifier slews from high toward low since its associated parasitic capacitance is much smaller. As a result, the hit signal (HITB) becomes low, indicating a hit. If the sense wordline S-WL0 is low, the side of the latch amplifier connected to S-WL0 is low and the miss signal (MISS) is low, which indicates a miss. Partitioning the sense wordlines S-WL0 to S-WL7 and the read wordlines R-WL0 to R-WL7 into two groups connected to the two opposite sides of the latch amplifier facilitates a compact layout, and thus small parasitic capacitances and high speed.
Using the tag sense amplifier in the WLOTC structure, one-step hit/miss signal generation can be done, which is different from the two-step hit/miss signal generation for the traditional cache memory circuits based on the BLOTC structure. Thus, a higher speed can be expected for the new cache memory with the WLOTC structure. In addition, compared with the BLOTC structure, the number of active sense amplifiers during read access can be substantially reduced, and a much smaller power consumption can be obtained.
B. Two-Port Memory Portion
In the 128-kb four-way set-associative CMOS cache memory, in addition to the tag portion, there are two two-port memory portions. Fig. 5 shows the two-port 8-transistor memory cell and its layout used in the two-port memory portion. As shown in the figure, the two-port memory cell has a read wordline (R-WL), a write wordline (W-WL), a complementary pair of read bitlines (R-BL and its complement), and a complementary pair of write bitlines (W-BL and its complement) for simultaneous read and write access. As shown in Fig. 5(b), based on a 0.18-µm CMOS technology using three-layer metal lines, the layout of the two-port 8-transistor memory cell occupies an area of 2.92 µm × 3.82 µm.
Fig. 6 shows the schematic of a memory portion in the 1-V 128-kb four-way set-associative CMOS cache memory. As shown in the figure, each memory portion contains 32 sections. Each section contains four ways of two-port memory cell arrays. In each way of two-port memory cell array, there are a pair of read bitlines and a pair of write bitlines. In addition, next to each way of two-port memory cell array, there is a latch-type sense amplifier in the data sense amplifier region. There is also a set of multiplexers for the read-out operation to designate a specific way to be connected to data out (DATA), as determined by the way signal (WAY). In each way of two-port memory cell array, there are 512 columns, that is, 512 pairs of read wordlines (R-WL) and write wordlines (W-WL).
When the write enable signal (WE) is high, the write operation is performed. During the write operation, the 32-bit write data (W-DATA) are written into a specific column in the memory portion, as specified by the write address (W-ADDR), and the 20-bit write index (W-INDEX) signals are written into a column of 20 tag cells in the tag portion specified by the write address. When the RE signal is high, the read operation is performed.
During the read operation, the tag-compare operation is executed first. The 20-bit R-INDEX data are compared with a specified column of 20 tag cells designated by the 9-bit R-ADDR data. If the 20-bit R-INDEX data match the column of 20 tag cells specified by the 9-bit R-ADDR data, the hit signal (HIT) turns high to trigger the readout of the data stored in the memory portion, as specified by the R-ADDR data, to the data output (DATA).
C. Timing Chart
Fig. 7 shows the timing chart for the tag-compare operation of this 128-kb cache memory. As shown in the figure, the tag-compare operation is initiated by the RE signal, which is provided externally. Then an REI signal with a fixed pulse width is generated by the quasi-static pulse generator. As controlled by the REI signal, the tag sense amplifier generates a HIT or a MISS signal depending on the tag-compare result, which in turn triggers the reset signals, including PRECHARGE for the second-level decoder, as shown in Fig. 4(a), and PRESET, ENABLE, PULL-UP, and DUMMY for the tag sense amplifier, as shown in Fig. 4(c).
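The write and read operations described in this section can be sketched as a purely functional model. The Python class below is an illustration under several assumptions: the class and method names are hypothetical, the way to be written is passed in explicitly because the text does not describe a replacement policy, and the model stores one 32-bit word per way and address without reproducing how data are split across the two memory portions.

```python
class FourWayCache:
    """Functional sketch of a four-way set-associative lookup (not the circuit)."""

    SETS = 512   # selected by the 9-bit address
    WAYS = 4

    def __init__(self):
        self.tags = [[None] * self.WAYS for _ in range(self.SETS)]
        self.data = [[0] * self.WAYS for _ in range(self.SETS)]

    def write(self, addr: int, way: int, index: int, word: int) -> None:
        """Store a 20-bit index in the tag array and a 32-bit word in the data array."""
        self.tags[addr][way] = index & 0xFFFFF
        self.data[addr][way] = word & 0xFFFFFFFF

    def read(self, addr: int, index: int):
        """Tag compare first; on a hit, return the word from the matching way."""
        index &= 0xFFFFF
        for way in range(self.WAYS):
            if self.tags[addr][way] == index:
                return True, self.data[addr][way]
        return False, None


cache = FourWayCache()
cache.write(addr=5, way=2, index=0xABCDE, word=0x12345678)
print(cache.read(5, 0xABCDE))   # (True, 305419896)
print(cache.read(5, 0x00001))   # (False, None)
```

In the chip, the four tag compares happen in parallel in the four tag portions, and the hit result drives the way multiplexer; the loop over ways here only expresses the same outcome sequentially.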
D. Measured Results and Discussion
In order to verify the performance of this 1-V 128-kb four-way set-associative CMOS cache memory using the WLOTC structure, a test chip has been integrated using a 0.18-µm CMOS technology with one polysilicon layer and six metal layers. As indicated in Table I, the gate oxide is 4.08 nm thick, and the threshold voltage of the nMOS/pMOS device is 0.513 V/0.566 V. The layout areas of the 8-transistor memory cell and the tag cell are 3.82 µm × 2.92 µm and 6 µm × 2.92 µm, respectively. Fig. 8 shows the die photo of this 1-V 128-kb four-way set-associative CMOS cache memory. The active die area is 2.09 mm × 1.99 mm. There are 208 staggered I/O pads surrounding a die area of 3.5 mm × 3.5 mm. A substantial number of pads have been assigned as power-supply pads for different circuits in the test chip to reduce the mutual interference of noise due to power-supply bounce arising from the transient effects of various circuits. Note that the read enable dummy signal (RED) is connected from the input read enable signal (RE). The RED signal is included for measuring the actual access time from RE to the data out, excluding the delay due to the pads.
Fig. 9 shows the SPICE-simulated transient waveforms and the measured results of this 1-V 128-kb four-way set-associative CMOS cache memory operating at a supply voltage of 1 V during the hit access. As shown in the figure, from the transition of the RED signal to that of the data out (DATA) at the output, the hit access time is 3.5 ns. Also shown in Fig. 9 are the transient waveforms for this cache memory during the miss access [Fig. 9(c)] with one out of 20 tag cells in a column having a miss. Note that HIT0 and HIT1 are the two hit signals related to the two opposite sides of the latch amplifier in the tag sense amplifier. After the hit/miss signal, PRESET becomes high as controlled by the timing controller. When PRESET is high, it presets the tag sense amplifiers and the memory sense amplifiers.
The power consumption is 4.1 mW at 1 V operating at 50 MHz. The small power consumption is attributed to the distributed tag sense amplifier structure in the WLOTC structure. Fig. 10 shows the hit access time versus supply voltage of this 1-V 128-kb four-way set-associative CMOS cache memory. As shown in the figure, this CMOS cache memory can also work at a lower supply voltage, at which the hit access time is 8.4 ns. Fig. 11 shows the critical path from RE to DATA during the hit access of this cache memory using the WLOTC structure. Table II shows the power consumption distribution and the active area for this 128-kb cache memory using the WLOTC structure. In addition, Table II shows the power consumption distribution of the cache memory if the BLOTC structure is adopted, assuming the same memory sense amplifier, memory portion, and output peripheral logic.
As shown in Fig. 11, during the hit access, the critical path runs from the RE input (marked 1 in the figure) via the quasi-static pulse generator in the timing controller to generate the REI signal (marked 2). Then, via the second-level decoder (marked 3), it reaches the read wordline (R-WL) (marked 4) and the tag sense amplifier, followed by the hit signal (marked 5) to the sense amplifier in the memory portion (marked 6), and finally the multiplexer and the data out (marked 7). Of the total access time of 3.5 ns, the propagation delay of the tag portion is 2.7 ns and the propagation delay of the memory portion is 0.8 ns. Therefore, the propagation delay of the tag portion dominates the overall read-access time. As shown in Table II, in this 128-kb cache memory using the WLOTC structure, the tag portion and its timing controller consume only 3.45% of the total power. In contrast, if the BLOTC structure is adopted, the tag portion and its timing controller consume 44% of the total power.
In absolute figures, the power consumption of the tag portion using the BLOTC structure is 20 times that using the WLOTC structure. As for the total power consumption, that of the WLOTC structure is around 35% less than that of the BLOTC structure, despite the increase in the power consumption of the decoder and the input logic with the WLOTC structure. As shown in Table II, using the CAM approach for the tag cell, the active area of the WLOTC structure is 2.09 mm × 1.99 mm. Although the layout area of the 10-transistor tag cell using the CAM approach (6 µm × 2.915 µm) is 57% larger than that of the conventional 8-transistor one, with a total of 40K tag cells in the whole 128-kb cache memory the active area of the whole chip with the WLOTC structure is only 8.6% larger than with the BLOTC structure.
Minimizing the propagation delay of the tag portion is the key to a high-speed cache memory. Owing to the adoption of the WLOTC structure with the CAM 10-transistor tag cell for facilitating the one-step hit/miss generation, the propagation delay of the tag portion has been minimized. Thus, this cache memory chip can have a much smaller hit access time than a cache memory using the traditional BLOTC structure, which has a two-step hit/miss signal generation procedure. In addition, in the cache memory using the WLOTC structure, there is only one tag sense amplifier in each tag segment, and only one tag sense amplifier is active during the tag-compare operation. Compared to a cache memory of the same size using the conventional BLOTC structure, which needs 20 active sense amplifiers during the tag-compare operation, the new cache memory using the WLOTC structure with the CAM 10-transistor tag cell has much smaller power consumption.
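The delay, area, and power figures quoted above can be checked for mutual consistency with simple arithmetic. The Python sketch below uses only numbers from the text; the variable names are illustrative, and it assumes that the "conventional 8-transistor" cell has the same area as the two-port 8-transistor memory cell (3.82 µm × 2.92 µm).

```python
# Hit access time: tag portion plus memory portion (ns).
hit_access_ns = 2.7 + 0.8

# Cell areas in square micrometers.
tag_cell_area = 6.0 * 2.915
mem_cell_area = 3.82 * 2.92           # assumed area of the 8-transistor cell
area_increase = tag_cell_area / mem_cell_area - 1.0

# Power shares of the tag portion and its timing controller.
tag_share_wlotc = 0.0345
tag_share_blotc = 0.44
tag_power_ratio = 20.0                # BLOTC tag power / WLOTC tag power

# Normalize the WLOTC total power to 1 and infer the BLOTC total.
blotc_total = tag_power_ratio * tag_share_wlotc / tag_share_blotc
total_saving = 1.0 - 1.0 / blotc_total

print(f"hit access time: {hit_access_ns:.1f} ns")         # 3.5 ns
print(f"tag cell area increase: {area_increase:.1%}")     # about 57%
print(f"implied total power saving: {total_saving:.1%}")  # about 36%
```

The implied saving of roughly 36% is in line with the "around 35%" quoted in the text, given that the factor of 20 is itself a rounded figure.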
This cache memory with the WLOTC structure has a better expansion capability without sacrificing the speed performance. Currently, this cache memory has two memory portions. Each memory portion has a size of 128 × 512 memory cells, thus 512 columns. If the size of the cache memory is doubled such that each memory portion has a size of 128 × 1024 memory cells, it will have 1024 columns. With the WLOTC structure, although the size of the cache memory is doubled, the hit access time remains the same: the cache memory is expanded horizontally, so the horizontal bitlines become longer, but the length of the read and sense wordlines stays unchanged. In contrast, if the size of a cache memory using the conventional BLOTC structure is doubled, the hit access time is degraded due to the longer bitlines involved. Thus, the cache memory with the WLOTC structure has a much better expansion capability without sacrificing the speed performance.
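The capacity figures in this expansion argument follow directly from the array dimensions. A minimal Python check, with a hypothetical helper name, is:

```python
def capacity_bits(portions: int, rows: int, columns: int) -> int:
    """Total number of memory cells (bits) in the data array."""
    return portions * rows * columns


ROWS = 32 * 4   # 32 sections x 4 ways per memory portion = 128

current = capacity_bits(portions=2, rows=ROWS, columns=512)
doubled = capacity_bits(portions=2, rows=ROWS, columns=1024)

print(current // 1024)  # 128 (kb)
print(doubled // 1024)  # 256 (kb)
```

Doubling the number of columns doubles the capacity while leaving the 128 cells along each memory portion's other dimension unchanged.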
III. CONCLUSION
In this paper, a 1-V 128-kb four-way set-associative CMOS cache memory implemented in a 0.18-µm CMOS technology using the WLOTC structure with the 10-transistor tag cell usually used for the CAM has been described for low-voltage low-power VLSI system applications. Owing to the WLOTC structure with the CAM 10-transistor tag cell for accommodating one-step hit/miss generation and the dynamic pulse generators for realizing read enable signals, a small hit access time (3.5 ns), low power consumption (4.1 mW at 50 MHz), and good expansion capability without sacrificing speed performance have been obtained.
