Abstract-BiCMOS standard cell macros, including a 0.5-W 3-ns register file, a 0.6-W 5-ns 32-kilobyte cache, a 0.2-W 3-ns table look-aside buffer (TLB), and a 0.1-W 3-ns adder, are designed with a 0.5-µm BiCMOS technology. A supply voltage of 3.3 V is used to achieve low power consumption.
I. INTRODUCTION
IN ORDER to realize high-performance systems that include processors with 100-MHz clocks [1], large functional macros are important. The high-speed operation and high integration density lead to high chip power consumption and large thermal dissipation, which in turn create expensive package requirements. Consequently, low-power and high-speed macros are crucial to build cost-effective high-performance systems. In this paper, several BiCMOS standard cell macros, including a 0.5-W 3-ns register file, a 0.6-W 5-ns 32-kilobyte cache, a 0.2-W 3-ns table look-aside buffer (TLB), and a 0.1-W 3-ns adder, are presented based on a 0.5-µm BiCMOS technology. Several BiCMOS/CMOS circuits are devised to achieve high-speed operation at a supply voltage of 3.3 V, which is required for low-power operation.
In Section II, a basic logic gate with full-swing capability for use with a 3.3-V supply voltage is proposed and measured results of the logic gates are summarized. Simple function cells, such as NAND's, NOR's, and flip-flops, are prepared by using the proposed basic gate configuration. Section III describes the larger macros, such as a cache and a register file, in detail. In these macros, novel circuit ideas such as an ECL-level circuit for cache hit logic and a self-aligned threshold inverter have been employed to achieve high performance. Finally, Section IV discusses the results for these macros obtained from a fabricated test chip.

Manuscript received April 6, 1992; revised July 19, 1992.
II. BASIC LOGIC GATE
A BiNMOS gate [2] with parallel PMOS transistors, namely a PBiNMOS gate, is adopted for building simple logic gates. The PBiNMOS gate configuration, shown in Fig. 1(a), is suitable for high-speed operation at reduced supply voltage because the pull-down of the output is accomplished through the n-channel MOSFET pair [2]. In CMOS logic gates, the p-channel transistor is two to three times slower than the n-channel transistor. In this sense, the BiNMOS circuit, with a bipolar transistor to increase the pull-up speed by a factor of 2 to 3, is a well-balanced configuration.
The PBiNMOS gate uses weak PMOS devices in parallel with the bipolar pull-up to restore the full logic level of the output. This full-swing capability is necessary to reduce the power dissipation of subsequent gates. In addition, the full rail-to-rail swings reduce the dc current paths that are found by leakage current tests used in the die sorting process [3]. Fig. 1(b) shows the measured fan-out dependence of the propagation delay. The PBiNMOS two-input NAND gate shows a propagation delay time of 200 ps with a fan-out of 7. Both CMOS and PBiNMOS cells are available in this standard cell library. In standard cell designs, the PBiNMOS gates are used to drive highly capacitive loads, whereas the conventional CMOS gates are used to drive less capacitive loads. A CAD tool is used to choose the more appropriate cell for each load condition.
III. MACRO CELLS

A. Register-File Macro
The register-file macro circuit diagram is shown in Fig. 2. The macro enables six reads and four writes at the same time. In order to realize the six read ports with a small memory cell size, a nonpaired (single) bit-line scheme is necessary. In the nonpaired bit-line scheme, however, it ...

0018-9200/92$03.00 © 1992 IEEE. IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 27, NO. 11, NOVEMBER 1992.

The self-aligned threshold inverter (SATI) sense amplifier can be laid out in a narrow pitch because it is constructed with only three MOSFET's. It is essentially an inverter that can receive a bit-line swing of less than 1 V using a replica biasing scheme which provides the best operating point to the sense amplifier. A CMOS inverter gives excellent amplification characteristics for a small input signal if the input signal is near the logic threshold voltage of the CMOS inverter. It has been impossible to make use of this characteristic, however, because the logic threshold voltage varies with process, temperature, and voltage conditions.
The replica biasing scheme is employed to overcome this difficulty. The bias circuit is constructed with a differential amplifier, a dummy cell, and a dummy sense amplifier. The gate voltage of the NMOS transistor N1 is controlled so as to set the logic threshold voltage at the middle of the bit-line swing by using the feedback loop with an inverter replica. The SATI circuit tracks process, voltage, and temperature variations by using the replica biasing circuit. The worst-case access time can also be reduced by using this bias circuit.
Another key circuit technique is a BiCMOS bit-line load for fast bit-line recovery from a ZERO read. When ONE data are read out, the bipolar transistor and PMOS transistor load pulls up the bit line faster than the single NMOS load that is conventionally used as a bit-line load. The NMOS load is slow in pulling up the bit line because it operates near its threshold voltage. The bipolar transistor shifts the operating voltage of the bit line down to $V_{DD} - V_{BE}$ to make the bias point suitable for the SATI sense amplifier. The PMOS transistor acts as a nonlinear load. The NMOS transistor in parallel with the bipolar/PMOS load is used to stabilize the bit-line swing. Without the NMOS load, the bit-line swing can easily become very large and can degrade the speed because of the nonlinear nature of the PMOS device.
For fast write recovery, a new circuit is used in which the write recovery process takes place as soon as the write enable (WE) signal is disabled. When the macro is deselected by the macro select (MS) signal, it consumes no dc power, which enables the use of IDDQ testing (standby current test) for the entire device. This feature has been added to all of the macros with no performance reduction.
B. Cache Macro
If the target system is a processor, a high-density cache memory increases system performance dramatically. A 32-kilobyte BiCMOS cache macro has been designed employing highly resistive polysilicon-load four-transistor cells to achieve high density and high speed. The data bus width is 128 b and a two-way set-associative organization is adopted. The circuit diagram of the BiCMOS cache macro is shown in Fig. 3. The TAG portion of the macro consists of TAG memories and comparators. The comparators are built with CMOS EXCLUSIVE-OR circuits and ECL wired-OR circuits to improve the speed, because the comparators are in the critical path. The TAG data are compared with TAGINs by the CMOS transfer circuits while maintaining a bit-line swing, and the results are ORed by the ECL wired-OR circuits at an ECL level. The comparator output is obtained from the collector of a bipolar transistor.
Complicated HIT logic is also built with an ECL-like circuit to improve the speed. The ECL HIT logic generates a HIT signal from the comparator output that has an ECL level, combining the special bit and the valid bit. The HIT signal, which is converted to a CMOS level, activates way selectors of DATA memory. The sense amplifiers for the DATA memory are placed after the way selectors to reduce the power consumption to about half that of the conventional architecture [6]. Thus only 128 sense amplifiers are activated in spite of the two-way organization. The power-speed trade-off is crucial for large-bandwidth caches. The output of the BiCMOS sense amplifier is applied to the SATI circuit and converted to full CMOS levels.
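The TAG compare and way selection described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of Fig. 3: the 16-byte line size (one line per 128-b bus transfer), the resulting 1024 sets, and the names `Way`, `split`, `fill`, and `lookup` are all hypothetical, and the special bit is omitted because its meaning is not defined in the text.

```python
from dataclasses import dataclass, field

CACHE_BYTES = 32 * 1024                     # from the text: 32-kilobyte cache
WAYS = 2                                    # from the text: two-way set associative
LINE_BYTES = 128 // 8                       # assumption: one line per 128-b bus transfer
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 1024 sets under these assumptions
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 4
INDEX_BITS = SETS.bit_length() - 1          # 10


@dataclass
class Way:
    """One way: TAG memory, valid bits, and DATA memory (hypothetical layout)."""
    tags: list = field(default_factory=lambda: [0] * SETS)
    valid: list = field(default_factory=lambda: [False] * SETS)
    data: list = field(default_factory=lambda: [0] * SETS)


def split(addr):
    """Split an address into (tag, set index); the byte offset is dropped."""
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index


def fill(ways, way_no, addr, line):
    """Store a line in the chosen way and mark it valid."""
    tag, index = split(addr)
    ways[way_no].tags[index] = tag
    ways[way_no].valid[index] = True
    ways[way_no].data[index] = line


def lookup(ways, addr):
    """Return (hit, way number, line) for an address."""
    tag_in, index = split(addr)
    for way_no, way in enumerate(ways):
        mismatch_bits = way.tags[index] ^ tag_in   # per-bit EXCLUSIVE-OR
        any_mismatch = mismatch_bits != 0          # OR of all mismatch bits
        if way.valid[index] and not any_mismatch:  # HIT combines match and valid bit
            # Only the selected way's line is passed on, mirroring the idea of
            # placing the sense amplifiers after the way selectors.
            return True, way_no, way.data[index]
    return False, None, None
```

For example, after `fill(ways, 1, 0x12345670, 0xABCD)`, a lookup of `0x12345678` hits in way 1, while an address with the same index but a different tag misses.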
C. TLB Macro
The TLB macrocell is another key component for constructing a high-performance processor. The TLB consists of a content-addressable memory (CAM) for a directory search and an SRAM for a physical address fetch. Fig. 4 shows a circuit diagram of the CAM. The CAM cell performs the search operation comparing the input DATA and the stored data. The NAND structure of the match line causes the match line of only the selected row to be activated and the corresponding SRAM word line to be enabled. This eliminates complex and marginal timing circuits for driving the SRAM word lines in the conventional NOR-type match circuit [7]. A triple-match-line architecture [8] is used to further speed up the match operation. A bipolar hit generation logic is adopted to realize a high-speed 64-input OR gate. In the SRAM part of the TLB, the same bit-line load and sense-amplifier scheme described in the register-file macro are used.
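The search path described above (CAM match, hit generation, SRAM fetch) can be sketched behaviorally as follows. This is a simplified model, not the circuit of Fig. 4: the 64-entry depth is inferred from the 64-input OR gate, while the 20-b key width and the names `Tlb`, `row_matches`, `write`, and `search` are assumptions made for illustration. The triple-match-line partitioning is not modeled.

```python
ENTRIES = 64     # assumption: one row per input of the 64-input OR gate
KEY_BITS = 20    # assumption: key width is not given in the text


def row_matches(stored, key):
    """NAND-type match line: active only if every bit of the row agrees."""
    return all((((stored ^ key) >> bit) & 1) == 0 for bit in range(KEY_BITS))


class Tlb:
    def __init__(self):
        self.cam = [None] * ENTRIES    # stored keys (directory)
        self.sram = [None] * ENTRIES   # physical addresses

    def write(self, row, key, physical):
        self.cam[row] = key
        self.sram[row] = physical

    def search(self, key):
        """Return (hit, physical address) for an input key."""
        match_lines = [
            stored is not None and row_matches(stored, key)
            for stored in self.cam
        ]
        hit = any(match_lines)          # behavioral stand-in for the 64-input OR
        if not hit:
            return False, None
        # The active match line enables the corresponding SRAM word line.
        # Assumption: at most one row holds a given key.
        row = match_lines.index(True)
        return True, self.sram[row]
```

Because a NAND-type match line is active only for the matching row, the model can use the match line directly as the word-line select without any extra timing step.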
D. Adder Macro
A 32-b asynchronous adder was also designed. Asynchronous operation is important for a standard cell macro, since an appropriate clock may not always be available. Fig. 5 shows a circuit diagram of the adder macro. The most salient feature of the adder is the use of the chained carry lookahead (CCLA) circuit. The outputs of the carry select adders (CSA's) are combined by a carry circuit similar to a Manchester chain which generates the required carries efficiently. The BiNMOS driver is placed at the first stage of the chained CLA. The MOSFET's in the chain are sized in geometric series to minimize delay.
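The carry-select principle behind the adder can be illustrated with a short functional sketch. It shows only the arithmetic idea (each block prepares both candidate sums, and a carry passed along a chain selects one); it does not model the chained CLA circuit, the BiNMOS driver, or transistor sizing. The 4-b block size and the function name `carry_select_add` are assumptions, since the text does not give the block partitioning.

```python
WIDTH = 32   # from the text: 32-b adder
BLOCK = 4    # assumption: block size is not stated in the text


def carry_select_add(a, b, carry_in=0):
    """Add two WIDTH-bit integers block by block; return (sum, carry out)."""
    mask = (1 << BLOCK) - 1
    result = 0
    carry = carry_in
    for pos in range(0, WIDTH, BLOCK):
        x = (a >> pos) & mask
        y = (b >> pos) & mask
        sum_if_0 = x + y        # candidate result for carry-in = 0
        sum_if_1 = x + y + 1    # candidate result for carry-in = 1
        # The incoming carry selects one of the two precomputed candidates.
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & mask) << pos
        carry = chosen >> BLOCK  # carry handed to the next block in the chain
    return result, carry
```

In hardware both candidates of every block are computed in parallel, so the critical path is the carry chain alone, which is why the text emphasizes the speed of the chain.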
IV. EXPERIMENTAL RESULTS ON LARGE MACROS
A microphotograph of a fabricated test chip is shown in Fig. 6. The process is a double-poly, triple-metal BiCMOS process. The 0.5-µm BiCMOS process is based on a 0.5-µm CMOS process with added-on buried p+ and n+ layers beneath p and n wells. The n-epitaxial layer is 1.5 µm thick. The n-p-n bipolar transistor is made with an ion-implanted emitter and the transition frequency of the transistor is 10 GHz. Fabricated polysilicon widths for PMOS and NMOS gates are 0.6 and 0.5 µm, respectively. Fig. 7 shows the measured waveforms of the cache macro. Observed ADDRESS to HIT delay was 5 ns, and the ADDRESS to D OUT delay was 8 ns. This is the fastest cache ever reported for this size. Most of the delay data were obtained by using latches placed at the input and the output and by varying the strobe timing to these latches. The characteristics of the BiCMOS macros are summarized in Table 1.
Measured access time of the register-file macro is 3 ns, while the simulated delay for a full CMOS conventional circuit is 5 ns. As for the cache macro, 5-ns ADD to HIT delay is obtained. The ADD to D OUT delay of 8 ns is faster than the 10-ns delay of a CMOS conventional circuit. The TLB macro shows 3-ns DATA to HIT delay measured under typical conditions, while the simulated delay for a full CMOS implementation without the triple-match-line architecture is 4 ns. The adder macro shows a 3-ns measured addition time, whereas the simulated delay for a full CMOS adder with the chained CLA configuration is 3.5 ns and that without the chained CLA is 4.5 ns. The average speed advantage of PBiNMOS macros over pure CMOS implementation was about 20%.
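As a worked illustration of the comparison above, the per-macro delay reduction relative to the simulated CMOS circuit can be computed directly from the quoted delays. The metric used here, (CMOS delay minus BiCMOS delay) divided by CMOS delay, and the name `delay_reduction_percent` are assumptions; the text does not state how its average figure was derived, so the sketch does not attempt to reproduce it. For the adder, the CMOS delay with the chained CLA (3.5 ns) is used.

```python
# (measured BiCMOS delay in ns, simulated CMOS delay in ns), taken from the text
DELAYS_NS = {
    "register file": (3.0, 5.0),
    "cache (ADD to D OUT)": (8.0, 10.0),
    "TLB (DATA to HIT)": (3.0, 4.0),
    "adder (CMOS with chained CLA)": (3.0, 3.5),
}


def delay_reduction_percent(bicmos_ns, cmos_ns):
    """Delay reduction of the BiCMOS macro relative to the CMOS delay."""
    return 100.0 * (cmos_ns - bicmos_ns) / cmos_ns


for name, (bicmos_ns, cmos_ns) in DELAYS_NS.items():
    reduction = delay_reduction_percent(bicmos_ns, cmos_ns)
    print(f"{name}: {reduction:.1f}% shorter delay than CMOS")
```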
The register-file macro achieves a power consumption of 0.5 W at 100-MHz operation. As for the cache macro, measured power consumption is 0.6 W when the data bus width is 128 b. The area of the register file is 3.6 mm × 3.8 mm, the cache is 6.0 mm × 6.5 mm, the TLB is 2.0 mm × 3.0 mm, and the adder is 1.2 mm × 0.3 mm, excluding I/O pads used for testing.
V. CONCLUSIONS
Several BiCMOS and CMOS circuits are devised to realize high-speed operation at the low supply voltage of 3.3 V. The proposed circuits include PBiNMOS gates, a self-aligned threshold inverter sense amplifier, a bipolar bit-line load, ECL HIT logic, a NAND CAM, and a chained CLA. The advantages of the proposed circuits are verified using a test chip fabricated with a 0.5-µm BiCMOS process. The BiCMOS macros proposed in this paper will enhance the performance of systems on a chip.
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
Hiroyuki Hara was born in Tokyo, Japan, on November 19, 1960. He received the B.S. degree in electronic engineering from Sibaura Institute of Technology, Tokyo, Japan, in 1983. He joined Toshiba Corporation, Kawasaki, Japan, in 1983. Since then he has been engaged in bipolar and BiCMOS LSI development and design. He is now in the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki, Japan, where he has been engaged in the research and development of BiCMOS macrocells for high-performance ASIC's. Mr. Hara is a member of the Institute of Electronics, Information and Communication Engineers of Japan.
Takayasu Sakurai (S'77-M'78) was born in Tokyo, Japan, on January 10, 1954. He received the B.S., M.S., and Ph.D. degrees in electronic engineering from University of Tokyo, Tokyo, Japan, in 1976, 1978, and 1981. His Ph.D. work is on electronic structures of a Si-SiO2 interface.
In 1981 he joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation,
