Abstract-Read-only memories (ROMs) are widely used in both digital communication systems and everyday consumer electronics. ROMs mainly store data, programs, firmware, etc. In this paper, a three-dimensional decoding structure for ROMs is proposed. The number of address decoding stages is drastically reduced. Hence, the delay is reduced, as well as the power consumption and area. The overall transistor count and delay are analyzed in detail. A 256 × 8 ROM using the proposed decoder is fabricated in a 0.5-μm two-poly two-metal (2P2M) CMOS technology.
I. INTRODUCTION
READ-ONLY memories (ROMs) are an important part of many digital systems, e.g., digital signal processors (DSPs), microprocessors, digital filters, etc. They are particularly important in portable systems, where they store programs and data. Hence, the chip size and power consumption need to be reduced and the speed improved. Prior ROM designs mainly focused on technology evolution [1], [4], [6], core architecture [5], [7], or special-purpose circuits and logic [2], [3]. The improvement of address decoders and data encoding for ROMs has long been ignored. Most prior decoders for ROMs used multiplexers to decode the row and column addresses. However, the characteristics of implant ROMs [8] cause the Hamming distance of adjacent words to be one. As a result, the order in which the data words are stored depends on the decoder structure, so that they appear in a nonnatural order, e.g., 0, 2, 5, 3, 1, .... The following problems are then introduced.
1) Different ROM users will store the data words in different patterns owing to the different decoder structures.
2) Because the data words are stored in various patterns, programs that call the data stored in the ROMs must be adjusted accordingly.
3) Such types of ROMs are very difficult to test.
In this paper, we propose a three-dimensional address decoding structure associated with a corresponding data cell encoding arrangement for implant ROMs such that the data words are encoded and stored in the ROMs in a natural pattern, i.e., 0, 1, 2, 3, ..., without any conversion circuits. Our method simplifies the procedure to encode the ROMs and decode the address such that application programs need no adjustment and testing becomes much easier. Not only is the size of the entire decoder reduced, but the access time and power dissipation are also greatly reduced.
II. THREE-DIMENSIONAL DECODER FOR ROMs

A. Structures of ROMs
The storage of a bit "0" or "1" in a ROM is determined by the existence of a pass transistor at the intersection of a row and a column. Referring to Fig. 1, two typical CMOS ROMs are shown. The top of Fig. 1 is a NOR gate. When the input of any row turns high, the pass transistors coupled to that row are turned on to ground the corresponding outputs. By contrast, the bottom of Fig. 1 shows a NAND gate whose output is low when the inputs to all of the rows coupled to the column are high. These two examples demonstrate that ROMs can realize any given Boolean function as long as the word length is adequate.
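The two column types described above can be mimicked with a small behavioral model. The following is only a logic-level sketch, not a circuit simulation; the function names `nor_column` and `nand_column` and the list-based representation of rows are assumptions introduced for illustration.

```python
def nor_column(row_inputs, has_transistor):
    """NOR-type column: the precharged output is grounded as soon as any
    row that has a pass transistor on this column is driven high."""
    pulled_low = any(x and t for x, t in zip(row_inputs, has_transistor))
    return 0 if pulled_low else 1


def nand_column(row_inputs, has_transistor):
    """NAND-type column: the transistors form a series chain, so the output
    goes low only when every row coupled to the column is driven high.
    A column with no coupled rows is treated as a conducting chain
    (an assumption of this sketch)."""
    coupled = [x for x, t in zip(row_inputs, has_transistor) if t]
    return 0 if all(coupled) else 1


if __name__ == "__main__":
    # Three rows; transistors present on rows 0 and 1 of this column.
    cells = [1, 1, 0]
    print(nor_column([0, 1, 0], cells))   # row 1 high -> column grounded
    print(nand_column([1, 1, 0], cells))  # both coupled rows high -> low
```

Placing or omitting a transistor at each crossing is what programs the stored bit; the model only changes the `has_transistor` pattern to change the Boolean function realized by a column.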
B. One-Dimensional Decoder Structure of ROMs
Referring to Fig. 2 , a straightforward decoding circuit, i.e., one-dimensional decoder [2] , for a implant ROM is shown. The " implant" NMOS transistor denotes that an NMOS is added to a layer of implant. The gate of this NMOS is then opened such that it cannot be turned on by a voltage asserted on its gate. Meanwhile, PMOS transistors only appear in the designated module in which PMOSs are used as pull-up loads. The operations of such a implant ROM are: 1) precharging: when the clock is low, the outputs -are all precharged to high and 2) evaluation: when the clock turns high, address inputs -will then determine which output line is connected to ground.
For instance, if , in Fig. 2 will be kept high, while the rest of the outputs will be grounded. Thus, the row selection signal is .
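The precharge and evaluation phases can be illustrated with a short behavioral sketch. It assumes that the selected output line is the one whose index equals the binary value of the address; the actual mapping in Fig. 2 depends on where the implants are placed, so this indexing, like the function name `decode_1d`, is an assumption made only for illustration.

```python
def decode_1d(address, n_bits, clk):
    """Behavioral model of a dynamic one-dimensional decoder.

    clk == 0: precharge phase, every output line is pulled high.
    clk == 1: evaluation phase, only the selected line stays high and
              all other lines are discharged to ground.
    """
    n_lines = 1 << n_bits
    if not 0 <= address < n_lines:
        raise ValueError("address out of range")
    if clk == 0:
        return [1] * n_lines
    return [1 if line == address else 0 for line in range(n_lines)]


if __name__ == "__main__":
    print(decode_1d(2, 2, clk=0))  # precharge: all lines high
    print(decode_1d(2, 2, clk=1))  # evaluation: only line 2 stays high
```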
C. Two-Dimensional Decoder Structure
When the address size of a ROM increases linearly, the one-dimensional decoder structure inevitably makes the core of the ROM grow exponentially. Not only does such a decoding scheme lead to a large area, but it also increases the data access time. A two-dimensional decoder structure was proposed to resolve this problem [1], [3]. It was also called an " -" decoding. Fig. 3 shows a block diagram of a 128 × 1 ROM. The operation of such a design is as follows.
1) The address lines are equally divided into two parts. , , and are fed into a implant decoder, which, in fact, is a one-dimensional ROM decoder, to generate a word selection signal , , , or . The generated word selection signal is fed to the ROM data core and determines which row of the core is activated. For instance, if , is activated.
2) The upper decoder is fed with , , and to decode which pair of columns in the data core is selected. Notably, -implant NMOS transistors are utilized in the upper decoder. The -implant NMOS possesses an layer such that the transistor is always turned on regardless of the asserted gate voltage. Hence, if , then the pair of columns 14 and 15 is selected.
3) The lower decoder is fed with , , , and to determine which one of the selected columns is granted to be the output. Similar to the upper decoder circuit, -implant NMOS transistors are utilized in the lower decoder. It is easy to comprehend that is the one to select the column. Assume that is 0. Column 14 is then chosen to be the output.
4) The states of the 128 bits in the 128 × 1 ROM are determined by whether there is a implant NMOS in the corresponding position.
5) In summary, , the state of the corresponding bit (row: , col: 14) in Fig. 4 will be accessed.
The features of the two-dimensional decoder are summarized as follows.
• Referring to enjoys the advantage of column sharing. The intersection of these two selected pairs decides which bit is delivered to the output. From a different point of view, we can also say that is used to select which one of the two selected adjacent cells (bits) is granted.
D. Three-Dimensional Decoder Structure
The major disadvantage of the two-dimensional decoding scheme is that the column-sharing design leads to a nonnatural order of data arrangement, i.e., the order of the data bits is 0, 1, 3, 2, 6, 7, .... This drawback has two side effects: it is hard to program the needed data, and it is difficult to debug. We thus propose a three-dimensional decoder structure to resolve these problems without any format conversion circuit. Fig. 7 shows the block diagram of the three-dimensional decoder. Take the same 128 × 1 ROM as an example.
1) The address lines are divided into three parts: , , and are used to decode the row number of the data core; , , and are fed to an upper decoder; and only one is required in the lower decoder circuit.
2) Two additional modules are needed to resolve the nonnatural encoding problem in prior ROM decoder designs. They are the upper pass block associated with the upper decoder and the lower pass block associated with the lower decoder circuit. The elements of these two blocks are NMOSs.
3) All of the -implant NMOS transistors in prior ROM decoders are dispensable since they provide virtually no function. A binary-tree-like decoder can then be constructed, as shown in Fig. 8.
4) , , and select a row (word) in the data core. , , and then determine which shared column is activated.
5) Two pass transistors are gated with and then decide which side of the activated column is to be the output " ."
Notably, with such a decoder design, the data encoded in the ROM core are stored in natural order, i.e., 0, 1, 2, 3, .... Besides this, the binary-tree-like decoders without any -implant save a large number of transistors such that the area becomes much smaller. Fig. 9 shows the detailed schematic of the three-dimensional decoder circuit.
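The three-way address split for the 128 × 1 example can be sketched as follows. The text states only that three address lines select the row, three feed the upper decoder, and one feeds the lower decoder; which physical address bits play each role is not recoverable here, so the bit assignment below (high three bits for the row, middle three for the column pair, least significant bit for the side) is an assumption, as are all function names.

```python
ROWS, COLS = 8, 16  # 8 x 16 core holds the 128 bits of a 128 x 1 ROM


def split_address(addr):
    """Split a 7-bit address into (row, column pair, side).
    The bit assignment is an assumption of this sketch."""
    if not 0 <= addr < ROWS * COLS:
        raise ValueError("address out of range")
    row = (addr >> 4) & 0b111   # three lines: row of the data core
    pair = (addr >> 1) & 0b111  # three lines: upper decoder, shared column
    side = addr & 0b1           # one line: lower decoder, side of the pair
    return row, pair, side


def program_core(bits):
    """Store 128 bits, given in natural order 0, 1, 2, 3, ..., in the core."""
    core = [[0] * COLS for _ in range(ROWS)]
    for addr, bit in enumerate(bits):
        row, pair, side = split_address(addr)
        core[row][2 * pair + side] = bit
    return core


def read_bit(core, addr):
    """Read back the bit selected by the three decoding dimensions."""
    row, pair, side = split_address(addr)
    return core[row][2 * pair + side]


if __name__ == "__main__":
    data = [(a * a + a // 3) % 2 for a in range(128)]
    core = program_core(data)
    # With this split, the row-major core content equals the natural order.
    print([b for row in core for b in row] == data)
```

Under this assumed assignment no reordering table is needed when programming the core, which is the property the natural-order encoding is meant to provide.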
E. Performance Analysis
1) Transistor Count:
Although the transistor count cannot fully represent the area of the whole chip, more transistors require more wiring and thus increase the chip area. To compare the decoder schemes, we assume address lines to be decoded for a ROM. The following analysis excludes the inverters (buffers) on the address lines and the precharging PMOSs at the output lines, since they are required in any decoder circuit. It also excludes the transistors of the ROM core.
One-dimensional scheme: It is trivial to derive that the number of the NMOSs needed is
(1)
Two-dimensional scheme: Referring to Fig. 3, we conclude the following facts, while not counting the -implant NMOSs, which are always on.
• Number of MOSs in the upper decoder circuit is
• Number of MOSs in the lower decoder circuit is
• Number of MOSs in the implant core decoder circuit is
Thus, the total transistor count in this scheme is the summation of the above three terms as follows:
(2)
Three-dimensional scheme: Referring to Fig. 7, a total of address lines are divided into three parts: for the data core decoding, for the upper decoder, and for the lower decoder, i.e., . However, will be the only address line used in the lower decoder circuit to determine which side of the activated column is the granted output. is then fixed to be one. Thus, we attain .
• Number of MOSs in the upper decoder circuit is
• Number of MOSs in the upper pass block is .
• Number of MOSs in the lower decoder circuit is two.
• Number of MOSs in the lower pass block is .
• Number of MOSs in the implant core decoder circuit is
Thus, the total transistor count in this scheme is the summation of the above five terms. Besides this, we also substitute to derive the following result:
By using (3), we can make a comparison in Table I . Since is definitely much larger than and , there is no need to compare it with the other two. Fig. 10 shows the transistor count of and in a three-dimensional mesh format.
is the most transistor saving.
2) Speed Analysis:
Since a design with a minimal number of MOSs might not be the fastest circuit, the delay cost of the three-dimensional decoder structure should be evaluated before the selection of or . If the rows and columns are not too long, we assume that the "delay count" is proportional to the number of MOSs on the data path from input to output. Hence, the delay measure of the three-dimensional decoding scheme is formulated as follows:
According to Table II, although the scenarios given and appear to possess almost the same transistor-delay product, the delay is generally deemed a parameter with a higher priority. This is why we choose and to implement a 256 × 8 ROM below.
F. Chip Implementation: 256 × 8 ROM
A 4 × 4 integer multiplier can be realized by a 256 × 8 ROM in which four address lines denote the multiplicand, while the other four represent the multiplier. The byte-wide output is the product. The 256 × 8 ROM is composed of eight 256 × 1 ROMs, i.e., ROM0, ROM1, ..., ROM7. Their individual output is , respectively. Fig. 11 is the schematic of one of the ROMs, e.g., ROM4. The NMOS, which is connected to GND, represents the implant NMOS.
III. SIMULATION AND TESTING
A. Speed (Delay) Simulation
Fig. 12 shows the simulation waveforms of the address inputs and the output product of the three-dimensional decoder ROM, obtained from thorough HSPICE simulations. Figs. 13 and 14 reveal the maximal delay of the two-dimensional and proposed decoders, respectively. The maximal access time (delay) of the three-dimensional decoder is 3.3 ns
B. Power Dissipation Simulations
Regarding the power consumption comparison, we also conduct a series of simulations that employ the Monte Carlo method of HSPICE. The number of sweeps is 30 and the clock period is 10 ns. The power dissipation results are tabulated in Table IV. Since there are more PMOSs and inverters in the three-dimensional decoder architecture, it has higher power dissipation than that of the two-dimensional decoder. However, if we consider the power-delay product as a measure, the superiority of our three-dimensional decoder for ROMs is preserved.
C. Physical Chip Testing
The proposed ROM decoder was approved by the Chip Implementation Center (CIC), National Science Council (NSC), and then fabricated by the United Microelectronics Company (UMC) in a 0.5-μm two-poly two-metal (2P2M) CMOS technology. The size of the chip is 1.8 × 1.8 mm². Fig. 15 shows the die photograph of the 256 × 8 ROM and the proposed decoder. The input test patterns are supplied by an HP 1660CP logic analyzer. Fig. 16 shows the snapshots generated by the logic analyzer when the chip is driven with a 20-MHz clock. The multiplier is denoted by - and the multiplicand by -. The product of the output of ROM from to is displayed in a decimal format. The maximum clock rate is 20 MHz, while the maximal access delay is measured to be 16 ns given 256 test patterns. These results justify our design.
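The expected responses for the 256 test patterns can be generated in software for comparison against the logic analyzer capture. The sketch below builds the contents of a ROM-based 4 × 4 multiplier and splits them into eight 256 × 1 bit planes, one per ROM0 to ROM7. It assumes that the high address nibble carries the multiplicand and the low nibble the multiplier; the actual pin assignment on the chip is not stated here, and all function names are hypothetical.

```python
def build_multiplier_rom():
    """Return the 256 byte-wide words of a ROM-based 4 x 4 multiplier.
    Assumption: address bits 7..4 = multiplicand, bits 3..0 = multiplier."""
    rom = []
    for addr in range(256):
        multiplicand = (addr >> 4) & 0xF
        multiplier = addr & 0xF
        rom.append(multiplicand * multiplier)  # at most 15 * 15 = 225, fits in 8 bits
    return rom


def bit_plane(rom, k):
    """Contents of the k-th 256 x 1 ROM: bit k of every stored word."""
    return [(word >> k) & 1 for word in rom]


def read_product(planes, addr):
    """Reassemble the byte-wide product from the eight single-bit ROMs."""
    return sum(planes[k][addr] << k for k in range(8))


if __name__ == "__main__":
    rom = build_multiplier_rom()
    planes = [bit_plane(rom, k) for k in range(8)]
    # Sweep all 256 test patterns, as in the chip test.
    print(all(read_product(planes, a) == rom[a] for a in range(256)))
```

Since the multiplication is symmetric in its operands, swapping the assumed roles of the two nibbles changes the address labeling but not the set of stored products.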
IV. CONCLUSION
We have presented a novel area-saving and high-speed decoder structure for ROMs. The transistor count and the delay have been clearly analyzed and verified. The physical chip implementation using the proposed three-dimensional decoding schemes has also been presented. The simulation results turn out to be very appealing.
