The CMOS-storage emitter-access (CSEA) memory cell offers faster access than the MOS cells used in conventional BiCMOS SRAM's, but using it in large memory arrays poses several problems. This paper describes novel BiCMOS circuit approaches to address the problems of decoding power, electronic noise, level translation, and write disturbance. It also reports results on a 64-kb CSEA SRAM using these techniques. The device, fabricated in a 0.8-µm BiCMOS technology, achieves read access and write pulse times of less than 4 ns while dissipating 1.7 W at a case temperature of 70°C.
memory consists of only low-swing signals, so no level amplification is required in the decoding path.
A schematic diagram of the CSEA memory cell appears in Fig. 1. As in many CMOS SRAM's, the data are stored in a static latch formed by cross-coupled CMOS inverters (transistors N1, N2 and P1, P2 in the figure); this configuration provides a robust storage element that dissipates no static power. The sources of transistors P1 and P2 are connected to the read word line (rather than the positive supply traditionally used); this signal has ECL-like voltage transitions, always remaining several MOS threshold voltages above the negative supply and therefore giving the latch excellent noise immunity. The read and write ports of the cell are independent. The read path uses small swings on the read word line and the read bit line for sensing, while the write path uses CMOS-like levels on the write word line and the write bit line for storing a new value into the cell.
Fig. 2 depicts a simplified view of the read access path of a CSEA memory. A row address change causes a switch in the differential outputs of at least one push-pull address buffer, pulling current out of the previously selected diode decoder while allowing the newly selected decoder to rise. These decoder changes couple through Darlington-connected pairs to drive the word lines. The previously selected read word line is discharged by a shared current source while the newly selected one is pulled up by the Darlington follower. The small swing on the selected read word line couples to the read bit line through bipolar transistor Q1 if PMOS device P1 is conducting (i.e., if the cell stores a one). The cell state is detected by switching a shared current through the selected bit line and into the differential pair formed by Q1 in the cell and the bit-line reference transistor Q2. The bit-line reference is set to be approximately the midpoint of the word-line swing, so unselected cells on the selected bit line receive essentially none of the bit-line current regardless of their state.
The collector of the reference transistor feeds a cascode amplifier, which in turn feeds the output buffer. This access path uses only small-swing signals and hence is quite fast. Prior work on this type of memory has produced a sub-4-ns 4K SRAM in a 1.5-µm technology [6].
The cell area penalty for the CSEA memory cell is fairly small. Transistor Q1 is wired in an emitter-follower configuration, so its collector is readily shared with collectors in adjacent cells and (in many BiCMOS technologies) with the n-well containing the PMOS devices. The density may often be further improved by merging the source of P1 with the base of Q1. Hence, the primary density penalty is the second word line required by the cell. Because the cell is tolerant of large internal collector resistance (the base of Q1 is never less than 2 V_BE down from V_CC, so the voltage drop across its collector resistor may be this much without saturating Q1), the V_CC wire may be routed on the buried layer of the well/collector and strapped by metal every eight cells or so. For purposes of comparison, a CSEA cell occupying 125 µm² supplies twice the read current of a 117-µm² six-transistor (6T) CMOS cell [3] implemented in the same technology [9].
While the access time and density of the CSEA memory make it an attractive candidate for large, fast, on-chip caches, this memory organization is not without its limitations. The use of an ECL decoder leads to fast access, but it also greatly increases the power requirements as the memory size grows. The use of a single bit line for sensing the cell is also troubling, since this cell will not have the common-mode noise cancellation that is found in standard differential bit-line designs. Finally, writing the CSEA memory is a little tricky, because the cell needs CMOS levels on the write word line and the write bit line.
This paper looks at these issues and provides some methods for overcoming them, with the specific goal of producing a 256-kb CSEA SRAM in a 0.8-µm BiCMOS technology. Section II focuses on reducing the power required in the high-speed bipolar decoders used by this memory. Section III discusses the problems associated with the single-ended sensing of the cell, and methods to improve the sense noise margin. The next section covers circuit techniques for both simplifying and speeding up writes. Finally, Section V describes an experimental 64-kb SRAM built to verify these techniques.
II. READ DECODING CIRCUITS
A CSEA memory uses a diode decoder to achieve high-speed operation. A traditional diode decoder [7] is depicted in Fig. 3; it is simply an AND gate implemented in diode logic. To make this decoder structure work for large memories, a number of problems must be addressed. Like most ECL circuits, the diode decoder uses a resistor to passively pull the output to the high state and therefore requires static current to keep its output low. Since all but one of the decoder's outputs are zero, the amount of current required for such a decoder array is simply

$$I_{dec} = (2^N - 1)\,\frac{\Delta V_{rwl}}{R} \tag{1}$$

where N is the number of bits being decoded, ΔV_rwl is the decoder swing (essentially the difference between the selected and unselected read word-line potentials), and R is the pull-up resistance. The diode decoder power therefore scales as the square root of the array size (assuming square arrays, i.e., for an M-bit array, N = 1/2 log2 M). Moreover, the decoder delay increases as the array grows, since the decoder rise time is the pull-up resistance times the node capacitance, which is strongly dependent on the (increasing) number of diodes in each decoder.
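The scaling argument above can be checked numerically. The following is a minimal sketch, assuming the static-current relation I = (2^N - 1) ΔV / R described in the text; the function names and the 2.75-kΩ pull-up value are illustrative assumptions, chosen so that each low output draws 0.4 mA at a 1.1-V swing.

```python
import math

def decoder_array_current(n_bits, swing_v, pullup_ohms):
    """Static current (A) drawn by a 1-of-2**n_bits diode decoder array.

    All outputs but the selected one sit low, and each low output
    sinks swing_v / pullup_ohms through its pull-up resistor.
    """
    return (2 ** n_bits - 1) * swing_v / pullup_ohms

def row_bits_for_square_array(m_bits):
    """Row-address bits N = (1/2) log2 M for a square M-bit array."""
    return int(math.log2(m_bits)) // 2

# Illustrative values only: 1.1 V swing, 2.75 kilohm pull-up (assumed).
for m in (2 ** 12, 2 ** 14, 2 ** 16):
    n = row_bits_for_square_array(m)
    i_ma = 1e3 * decoder_array_current(n, 1.1, 2750.0)
    print(f"M = {m:6d} bits, N = {n}, I = {i_ma:.1f} mA")
```

Quadrupling the array size adds one row-address bit and roughly doubles the static current, which is the square-root dependence noted in the text.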
Another factor affecting the row decoder speed and power is that as the array grows, the number of cells per word line increases, so the pull-down current on this line also needs to increase to maintain the delay. Unless the decoder resistance is decreased by the same proportion, the current gain of the Darlington-connected word-line drivers will need to increase, which can lead to ringing in their transient response. While word-line ringing will not disturb the cell value (as in a bipolar RAM), it is undesirable because it can lead to increased sensing delays. Thus, both the delay and power of these decoders must be reduced to build fast 256-kb and larger CSEA memories. 
A. Predecoding Address Buffer
In a conventional diode decoder each address line is driven by a push-pull buffer (Fig. 4(a)) which steers a current I into the diode decoders [8]. To reduce the capacitance of the critical resistor node, the number of diodes in each decoder can be reduced by predecoding the inputs. A predecoding address buffer is shown in Fig. 4(b). In this buffer, one of the two inputs is level-shifted and then these two inputs are used to feed four different two-level series stacks. The outputs of these four gates each control a current of 2I/3. The four outputs represent each combination of two inputs (AB, etc.), so the number of diodes in each decoder may be halved. Furthermore, these outputs have a somewhat faster fall time than the traditional ones, since each line uses two thirds as much pull-down current to drive half as many diodes and two thirds as much interconnect width (since the currents are much too high for minimum-width wires).
Predecoding improves the overall decoding speed at the expense of some stacked gates (with a small associated access penalty) and about 25% wasted current, since the high output now sources pull-down current which was previously steered away. However, this lack of current steering in the pull-down current provides an opportunity for further performance enhancements, which are described in the next two sections.
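As a logic-level illustration of the predecoding idea, the sketch below models each pair of address bits as a 1-of-4 group and each row decoder as a diode AND with one diode per group. It is a behavioral model only, with hypothetical function names; it says nothing about the circuit's timing or the actual buffer implementation.

```python
from fractions import Fraction

def predecode_pair(a, b):
    """1-of-4 predecode of two address bits: only line 2*a + b is high."""
    return [1 if k == 2 * a + b else 0 for k in range(4)]

def row_decoder_output(row, address, n_bits):
    """Diode-AND of the predecoded lines wired to decoder `row`.

    Returns (output, diode_count). The output stays high only when every
    connected line is high, i.e. when no diode is pulling current.
    n_bits is assumed to be even.
    """
    connected = []
    for shift in range(n_bits - 2, -1, -2):      # walk the bit pairs, MSB first
        a, b = (address >> (shift + 1)) & 1, (address >> shift) & 1
        ra, rb = (row >> (shift + 1)) & 1, (row >> shift) & 1
        lines = predecode_pair(a, b)
        connected.append(lines[2 * ra + rb])      # one diode per bit pair
    return int(all(connected)), len(connected)

def wasted_current_fraction(i_buffer=Fraction(1)):
    """Share of predecoder pull-down current sourced by the single high line."""
    per_line = Fraction(2, 3) * i_buffer          # each of four lines controls 2I/3
    return per_line / (4 * per_line)

selected = [r for r in range(64) if row_decoder_output(r, 37, 6)[0]]
print(selected, row_decoder_output(37, 37, 6)[1], wasted_current_fraction())
```

A 6-bit decoder built this way needs three diodes per row instead of six, and one of the four equal line currents in each group is sourced by the high output, which matches the roughly 25% overhead quoted above.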
B. PMOS Load Diode Decoder
By partitioning the memory into multiple arrays, each with its own set of decoders, current could conceivably be steered into only the selected array, saving the power that would otherwise go into unselected arrays. Unfortunately, the unselected diode decoders would then all float high (since the current required to keep them low has been removed). All the associated read word lines would therefore float high as well, and the current required to rapidly discharge these lines upon reselection of the array would dwarf the power savings of partial array activation. Such a scheme is effective only with a diode decoder which can be powered down without letting its output float high.
Fig. 5 depicts a diode decoder with a power-down input. The standard resistor load in the diode decoder is replaced by a PMOS transistor (P3), allowing the resistance of the load to be adjusted. In the selected (powered up) state, each decoder in the bank will have a high level on BSQ and a low level on BSP; BSP is held by Q3 to set the resistance of P3 to the desired (low) value for normal decoder operation. Replica techniques are utilized to generate the base voltage for Q3.
A decoder is deselected by pulling BSP high (using Q4) to (V_CC - V_BE), which greatly increases the resistance of P3. Very little current is then required to keep the decoder outputs low; this is readily provided by BSQ, which transitions downward upon reselection. BSQ also guarantees that no word lines are high in unselected banks and provides some safety margin during selection transitions (i.e., in case P3 turns on slightly before the decoder inputs begin pulling current). Transistor P4 is a very weak device to provide a path to V_CC in case BSP ever turns P3 completely off.
The PMOS load (P3) is sized to minimize the required gate voltage swing for selection without greatly increasing the decoder's parasitic capacitance. Simulations indicate that a 1-V swing is achievable considering process variations at room temperature, and that the increase in this swing with temperature is no worse than that of the other ECL signals in such a design, thereby maintaining the requirement of only low-swing signals in the read access path.
C. Address Line Sharing
Reducing the power required by the diode decoder creates the new problem of driving the high-capacitance address lines. In a conventional diode decoder these lines have small delays in spite of substantial capacitive loading, since they must supply the large currents required to keep the decoders low. In a RAM with multiple banks of decoders and predecoding address buffers, the address lines will slow down due to the loading as the number of banks grows, since the current drive remains fixed.
To reduce this problem, each bank may have its own set of address lines, thereby minimizing the loading on the (segmented) address lines. In this scheme the bank selection circuitry would steer the pull-down current into the selected bank's address lines; since the predecoding address buffers do not themselves use any current steering in the pull-down path, there is room for at least one level of steering here. However, if the current is to be steered among M banks, there may not be enough room for a stacked (i.e., log2 M tall) current tree, so one stage of bank address decoding gates may be needed to provide select signals for a one-level (i.e., 1-of-M) current-steering gate; this additional level of gates will slow the access.
In practice a combination of these approaches usually produces the best results. Several adjacent banks may share one set of address lines (the number of banks per set is determined by the allowed delay) with current steering between this and other sets of address lines.
D. Array Design for a 256-kb SRAM
These techniques are utilized to enhance the performance of a hypothetical 64K × 4 CSEA SRAM which could be integrated in a 0.8-µm (7-GHz f_T) BiCMOS technology [9]. Six-bit (1-of-64) diode decoders are chosen for both rows and columns because the resulting 4K × 4 banks provide a good trade-off between individual decoder power and bank selection overhead.
A block diagram of such a design is shown as Fig. 6 . This design uses four segmented sets of predecoded address lines for both row and column decoding. Each set of address lines is loaded by four banks of decoders; thus there are a total of 16 decoder banks of each type.
Such a design would have a read word-line capacitance of approximately 2.5 pF. With 1.1-V diode decoder swings and 400-µA pull-down current per (low) decoder output, circuit simulations indicate that a 1.5-ns address input to read word-line driver output is readily achieved using a total of 850 mW in the row decoders (including bank selection but excluding the read word-line discharge current). In comparison, a single 64K × 4 array using 8-b diode decoders would require 1800 mW to achieve the same delay. Therefore the techniques of this section have cut the row decoder power (in this case) by 53%.
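The organization and the power comparison quoted above reduce to simple arithmetic, sketched below. The helper names are hypothetical; the inputs (64K words, 6-bit row and column decoders, 850 mW versus 1800 mW) are the figures stated in the text.

```python
def bank_organization(words, row_bits, col_bits):
    """Words per bank and number of banks for given decoder widths."""
    words_per_bank = (2 ** row_bits) * (2 ** col_bits)
    return words_per_bank, words // words_per_bank

def power_saving_percent(baseline_mw, improved_mw):
    """Relative power reduction, in percent."""
    return 100.0 * (baseline_mw - improved_mw) / baseline_mw

words_per_bank, banks = bank_organization(64 * 1024, 6, 6)
print(words_per_bank, banks)                      # 4K words per bank, 16 banks
print(f"{power_saving_percent(1800, 850):.1f}%")  # row-decoder power reduction
```

Sixteen banks grouped four to a set of address lines gives the four segmented sets described for Fig. 6.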
III. SENSING CIRCUITRY
Another problem with the CSEA memory is its use of single-ended sensing, which raises concerns about noise problems leading to access-time pushout. This section explains how noise affects the memory cell, and why a single-ended design is appropriate for a CSEA memory. Partially because of the large currents used to sense the cell, the noise immunity is quite good.
A. Static Bit-Line Sensing
The Darlington-driven outputs of the row decoders provide both the positive supply for the CSEA latch and the read access selection mechanism. The high (selected) potential on these read word lines is simply V_CC - 2V_BE; the low (unselected) value is determined by bit-line sensing considerations.
Because the CSEA cell follower (Q1 from Fig. 2) forms a differential pair with the bit-line reference device (Q2), the read operation simply compares the internal voltage of the selected cell to the bit-line reference potential (V_blref). In designing the read word-line swing and the bit-line reference potential, the quantity of interest is the collector current through Q2, since this current is the input to the sense amplifier. Simple expressions relating these values are readily derived by considering the voltage differences required to achieve desired ratios of the sensed current to the total bit-line current I_rbl.
The worst-case reading of a one occurs when all unselected cells on the bit line store a zero, because any current which enters unselected cells subtracts from I_one (the current through Q2 when reading a one). The maximum value of I_one is therefore simply determined by considering the differential pair formed by the cell follower and Q2. Neglecting parasitic resistances for a moment, and assuming that the sense device has an emitter W times as large as the cell follower, the required voltage difference between the selected word line and the bit-line reference is
where k is the Boltzmann constant, T is absolute temperature, and q is the electron charge.
When reading a zero, any bit-line current entering unselected cells decreases I_zero, the current through Q2 when reading a zero. Therefore, the worst-case condition for reading a zero occurs when all unselected cells store a one. This case is depicted in Fig. 7. The required potential difference between the bit-line reference and the unselected word-line level, given N cells per bit line, neglecting any bit-line resistance, is given by
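The two ideal static conditions described above can be explored numerically. The sketch below assumes the standard exponential junction law for a differential pair whose sense device has W times the emitter area of the cell follower, which gives V_rwl(high) - V_blref = (kT/q) ln[W (I_rbl - I_one) / I_one] for reading a one and V_blref - V_rwl(low) = (kT/q) ln[(N - 1) I_zero / (W (I_rbl - I_zero))] for reading a zero. These closed forms are an assumption of this sketch, not a quotation from the paper, and parasitic resistances are ignored.

```python
import math

K_BOLTZMANN = 1.380649e-23     # J/K
Q_ELECTRON = 1.602176634e-19   # C

def thermal_voltage(temp_k):
    """kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON

def swing_read_one(i_one_frac, w, temp_k):
    """Assumed ideal V_rwl(high) - V_blref for a target I_one / I_rbl."""
    return thermal_voltage(temp_k) * math.log(w * (1.0 - i_one_frac) / i_one_frac)

def swing_read_zero(i_zero_frac, w, n_cells, temp_k):
    """Assumed ideal V_blref - V_rwl(low) for a target I_zero / I_rbl,
    with all n_cells - 1 unselected cells storing a one."""
    ratio = (n_cells - 1) * i_zero_frac / (w * (1.0 - i_zero_frac))
    return thermal_voltage(temp_k) * math.log(ratio)

# Illustrative inputs: W = 1 is an assumption; 64 cells and 70 degrees C follow the text.
t = 273.15 + 70.0
one = swing_read_one(0.10, 1.0, t)
zero = swing_read_zero(0.90, 1.0, 64, t)
print(f"read-one margin {1e3 * one:.0f} mV, read-zero margin {1e3 * zero:.0f} mV")
```

Because parasitic IR drops and supply noise are left out, the ideal total is expected to fall below the swings the text arrives at once those effects are included.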
A major speed advantage of the CSEA memory cell is the high read current densities supported by the bipolar transistor in the cell, but these same currents expose resistive parasitics which may substantially affect the above calculations. Specifically, bit-line wire resistance and series emitter resistance both reduce the current ratios, and the memory designer has little control over either; density concerns will almost invariably constrain bit-line widths and emitter sizes to be at or near the minimum supported by the technology, which makes these parasitic resistances as large as possible.
Proper placement of the sense device on the bit line can mitigate the effect of the bit-line resistance R_w. If the sense device is at the same end of the bit line as the current source, the bit-line resistance is in series with the cell follower at the opposite end of the bit line. In this case the voltage difference between the selected word line and the bit-line reference must be increased by I_rbl · R_w to maintain the desired value of I_one. The bit-line resistance actually helps when reading a zero, since there will be extra resistance to emitters of cells storing a one but on unselected word lines; however, for values of I_zero near I_rbl the current not steered into the sense device will be small, so the effect of this resistance will be minor.
Alternatively, if the sense device is at the opposite end of the bit line (from the current source) then the bit-line resistance does not affect reading a one, since in all but the worst case the extra resistance shows up in the emitter of the sense device (and hence actually reduces I_one). Therefore, the swing only increases by parasitic IR drops associated with the cell follower, so (2) becomes
where R_e is the emitter resistance of the cell follower, R_P1 is the equivalent on resistance of transistor P1 in the CSEA cell, and β is the current gain of the cell follower. When reading a zero in the worst case, each of the many unselected cell followers will steal slightly different amounts of bit-line current due to the distributed nature of the bit-line resistance. Fig. 7 attempts to make this effect more clear. The bit-line resistance can be divided into N - 1 equal pieces and therefore the potential difference between the emitters of adjacent cell followers is simply I_rbl R_w/(N - 1), assuming I_zero = I_rbl. It follows that the current ratio between adjacent devices is exp[-q I_rbl R_w/(kT(N - 1))] and thus the total current through unselected devices, I_unsel, is given by

$$I_{unsel} = I_0 \sum_{i=0}^{N-2} \exp\!\left(\frac{-i\,q\,I_{rbl}R_w}{kT\,(N-1)}\right) \tag{5}$$

where I_0 is the current through the bottom unselected cell. By solving the geometric series, and folding the result back into (3), while noticing that both R_w and the emitter resistance of the sense device R_e/W simply add to the required swing, we obtain
While it appears that the swing has grown by I_rbl R_w, the solution of the geometric series (which appears in square brackets above) is a smaller quantity than the value it replaces (N - 1, from (3)), so the required swing does not increase as much as when the sense device is nearest the bit-line current source. For the hypothetical 256-kb CSEA SRAM of Section II, the 64-cell bit lines would be approximately 1500 µm long. Assuming I_one is 10% and I_zero is 90% of I_rbl and reasonable bit-line resistivity, a required word-line swing of 400 mV is predicted, with the bit-line reference 230 mV down from V_rwl(high). As the following sections show, a 550-mV swing is needed in the presence of supply noise.
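The claim that the geometric-series term is smaller than N - 1 is easy to confirm numerically. The sketch below assumes each successive unselected cell carries exp(-I_rbl R_w / ((N - 1) kT/q)) times the current of its neighbor, as described for the distributed bit-line resistance; the 100-Ω bit-line resistance is a hypothetical value, and the function names are illustrative.

```python
import math

def unselected_sum(n_cells, i_rbl, r_w, v_t):
    """Explicit sum of per-cell current ratios relative to the bottom cell."""
    step = i_rbl * r_w / ((n_cells - 1) * v_t)
    return sum(math.exp(-i * step) for i in range(n_cells - 1))

def unselected_sum_closed(n_cells, i_rbl, r_w, v_t):
    """Closed form of the same geometric series."""
    step = i_rbl * r_w / ((n_cells - 1) * v_t)
    if step == 0.0:
        return float(n_cells - 1)      # no bit-line resistance: all cells share equally
    return (1.0 - math.exp(-(n_cells - 1) * step)) / (1.0 - math.exp(-step))

# 64 cells and 750 uA follow the text; R_w = 100 ohm and kT/q = 29.6 mV are assumed.
explicit = unselected_sum(64, 750e-6, 100.0, 0.0296)
closed = unselected_sum_closed(64, 750e-6, 100.0, 0.0296)
print(f"series = {closed:.2f} (explicit {explicit:.2f}), versus N - 1 = 63")
```

With any nonzero bit-line resistance the series evaluates to less than N - 1, so the logarithmic term in the required swing shrinks even as the explicit IR term grows.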
B. Supply Noise Sensitivity of the Bit-Line Sensing
Electronic noise coupled onto the bit lines from the power supplies will degrade the static sense ratios. Most static RAM's reduce the effects of supply noise by using differential sensing techniques while attempting to make the noise look common mode. A fully differential CSEA cell does not get much benefit from this technique, since the capacitive power supply coupling of the bit line is primarily through the base-emitter capacitance of the cell followers, whose bases are tightly coupled to either V_EE or V_CC (through the read word line). Hence, the noise coupling is strongly dependent on the data stored in the unselected cells and this coupling will not appear common mode, even with differential bit lines.
It is therefore important to reduce this data-dependent supply noise coupling as much as is feasible. Since the read word-line voltage is strongly tied to the V_CC voltage (through the diode decoder and the Darlington driver) and the bit-line reference voltage may be constructed to track V_CC as well, noise suppression is best obtained by making the bit-line voltage as strongly coupled to V_CC as possible (and therefore as weakly coupled to V_EE as possible). One approach is to add capacitance from the bit line to V_CC, thereby increasing the coupling; unfortunately, the amount of capacitance required to have a substantial effect on the coupling is large (because the base-emitter capacitance is such a high fraction of the total bit-line capacitance), so this approach will substantially slow the bit lines.
In any case, some data-dependent V_EE supply noise will make it onto the bit lines, so the size and effects of this noise should be predicted. Assuming the bit-line reference voltage does not respond substantially to V_EE noise, it is reasonable to amend the static sensing equations ((4) and (6)) to include noise-related terms.
Since the current out of the collector of the sense device is the actual input to the rest of the sense network, the noise effects are best monitored for their effect on this collector current. If V_EE bounces downward (away from V_CC) while the selected cell stores a zero, the bit line will follow downward as well, increasing the base-emitter voltage on the sense device and thereby increasing the sensed current; in this case there is no margin degradation whatsoever. Similarly, if V_EE bounces up while the selected cell stores a one, the sensed current decreases so the margin is not affected. The two complementary situations cause sensing problems.
When V_EE bounces up and the selected cell stores a zero, the bit line tries to rise. The injected charge may be simply modeled as a constant current source equal to dV_EE/dt times the coupling capacitance. This injected current subtracts directly from the bit-line current, and hence decreases the amount of current available to sense; the magnitude of this decrease is independent of the V_blref - V_rwl(low) value, so increasing the word-line swing does not help. The simplest option is to ensure that the injected current is fairly small compared with the bit-line current to minimize the effects on the sensed current; fortunately, this is not difficult because the cell follower in the CSEA cell can supply much more read current than a traditional CMOS cell of the same size. These high bit-line currents are otherwise needed to support the relatively large bit-line voltage swings of single-ended sensing. The effects of this injected current are also somewhat mitigated by the fact that the case with the lowest static sense ratios (all unselected cells store a one) has the least capacitive coupling to V_EE and hence the lowest injected current. Similarly, when unselected cells store zero, the static sense ratio is largest while the injected current is largest. As a result, circuit simulations indicate that the minimum sensed current when reading a zero in the presence of noise is, in the end, not very data dependent.
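The constant-current model of the injected charge can be written down directly. This is a first-order sketch only: it treats the coupling as a single lumped capacitance and the supply edge as a constant ramp, and the numeric inputs below are hypothetical values chosen for illustration rather than design figures.

```python
def injected_current(coupling_f, edge_rate_v_per_s):
    """Injected current (A) modeled as C * dV/dt."""
    return coupling_f * edge_rate_v_per_s

def fraction_of_bitline_current(coupling_f, edge_rate_v_per_s, i_rbl):
    """Share of the bit-line current removed by the injected current."""
    return injected_current(coupling_f, edge_rate_v_per_s) / i_rbl

# Hypothetical example: 400 fF of coupling, 0.25 V/ns edge, 1 mA bit-line current.
c_couple = 400e-15
edge = 0.25e9            # V/s, i.e. 0.25 V/ns
i_inj = injected_current(c_couple, edge)
print(f"injected {1e6 * i_inj:.0f} uA, "
      f"{100 * fraction_of_bitline_current(c_couple, edge, 1e-3):.0f}% of I_rbl")
```

The loss scales linearly with both the coupling capacitance and the edge rate, which is why keeping the bit-line current large relative to C · dV/dt is the simplest remedy.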
For the hypothetical 256-kb design, the worst-case bit-line coupling to V_EE is about 700 fF, so no more than 20% of the 750-µA bit-line current is lost as long as the supply noise is slower than 300 mV/ns. This current loss is easily mirrored in the sense-amplifier reference circuitry (as will be shown later), so the major effect of such a loss is simply the increased time required to discharge the bit-line capacitance.
When V_EE bounces down and the selected cell stores a one, the bit line attempts to follow. In this case the injected charge (which may be converted into a current, as above) adds to the bit-line current; any of this current which makes it into the sense device will affect the sense ratio. Because the cell follower has a high equivalent base resistance (due to the ON resistance of P1), the follower cannot instantly supply the extra current needed to keep the bit line from falling. As the bit line falls, bringing the follower's base with it (coupling through the base-emitter capacitance), increasing current will flow into the base from the read word line until the base begins to rise back to the level required to statically supply the extra current. In other words, the impedance looking into the emitter of the cell follower (Q1) has an inductive component, so the bit line will temporarily drop more than a static analysis would suggest. The V_rwl(high) - V_blref value may be increased by the maximum drop in the bit line to reestablish the sense current ratio from (4). Using an equivalent small-signal model for the cell follower, a simple RLC circuit may be constructed to compute this drop as
where ΔI_rbl is the worst-case injected current (all unselected cells store a zero), R_eq is the resistance seen looking into the emitter of the cell follower ((kT/(q I_rbl)) + (R_P1/β) + R_e), L_eq is τ_F R_P1, τ_F is the forward transit time of the cell follower, and C_rbl is the total bit-line capacitance. For the 256-kb design, a read word-line swing increase of 120 mV is sufficient to maintain the static current sense ratio with 300 mV/ns of V_EE supply noise. Circuit simulations indicate that these equations predict the bit-line sense behavior very accurately given a fixed V_blref. Design considerations for this reference will be discussed later.
Because the only paths to V_EE in a traditional ECL circuit are through static current sources, the amount of V_EE noise generated on chip in an ECL system is very small. In order to quantify the amount of externally generated V_EE noise, a model may be readily built for the power supply networks which takes into account package lead inductance, supply network resistance, and the array capacitance between the supplies. This network low-pass filters incoming noise edges, limiting the edge rate (i.e., the maximum dV/dt) which the internal V_EE supply will experience given an external step input. For large CSEA arrays (larger than 64 kb), practical maximum edge rates are a few tenths of a volt per nanosecond.
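To illustrate how such a supply network limits the internal edge rate, the sketch below treats it as a single lumped series RLC circuit (lead inductance, network resistance, array capacitance) driven by an external voltage step, and evaluates the peak dV/dt on the capacitor from the standard underdamped step response. The lumped topology and every component value are assumptions made for illustration; the paper's actual supply model is not reproduced here.

```python
import math

def max_edge_rate_series_rlc(step_v, l_h, r_ohm, c_f):
    """Peak dV/dt (V/s) across C for an underdamped series RLC hit by a step.

    The capacitor current is i(t) = step_v / (wd * L) * exp(-alpha t) * sin(wd t),
    and dV/dt on the capacitor is i(t) / C.
    """
    alpha = r_ohm / (2.0 * l_h)
    w0_sq = 1.0 / (l_h * c_f)
    if alpha ** 2 >= w0_sq:
        raise ValueError("this sketch covers only the underdamped case")
    wd = math.sqrt(w0_sq - alpha ** 2)
    t_peak = math.atan2(wd, alpha) / wd          # first maximum of i(t)
    i_peak = step_v / (wd * l_h) * math.exp(-alpha * t_peak) * math.sin(wd * t_peak)
    return i_peak / c_f

# Hypothetical network: 0.5 V external step, 2 nH, 0.5 ohm, 1 nF.
rate = max_edge_rate_series_rlc(0.5, 2e-9, 0.5, 1e-9)
print(f"peak internal edge rate = {rate / 1e9:.2f} V/ns")
```

Larger array capacitance or lead inductance lowers the peak slope, which is consistent with the observation that large arrays see edge rates of only a few tenths of a volt per nanosecond.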
C. Bit-Line Reference Design
The desired bit-line reference voltage is specified by (4), (6), and (7). Assuming the current sources are designed to provide voltage swings that are proportional to absolute temperature (PTAT), this bit-line reference may be simply generated as a PTAT drop from the selected read word-line potential (as in [6]) if the components of these equations are at least nearly PTAT. This is certainly true of the logarithmic terms (with kT/q multipliers) and is roughly true of the I_rbl terms, assuming the process' resistors do not force currents to decrease with increasing temperature.
Subsection III-B stipulates that the bit-line reference should not respond to V_EE noise; this is extremely difficult to accomplish due to the large base-emitter capacitance of this reference to the many unselected and noise-sensitive bit lines. However, as long as the bit-line reference moves less than (and in the same direction as) the selected bit line, the sensed current will not be adversely affected. Because the unselected bit lines have strongly data-dependent responses to V_EE noise (since there is no substantial bit-line current to fight such bounces), the V_EE coupling onto the bit-line reference is also data-dependent. Without special care this reference may thus bounce more than the selected bit line. This problem is handled by greatly increasing the reference's coupling to V_CC by adding PMOS capacitance until the worst-case bounce is within the desired range. In practice this is not very expensive in terms of area, since PMOS gate capacitances tend to be higher than bipolar base-collector and base-emitter capacitances per unit area, so the additional capacitors may be laid out with the sense devices. For the 256-kb design these PMOS devices occupy 6-µm-tall patches of otherwise empty area under the bit-line reference wire and between the sense devices across the full width of the die.
D. Two-Level Cascode Sense Amp
The collectors of the bit-line reference devices are the input to the sense amplifiers. As in most SRAM designs, it is desirable to share the sense amplifiers among multiple bit lines. If unselected bit lines are pulled sufficiently high (by their column decoder) that their sense devices have no collector current, the collectors of these sense devices may be connected to each other; the current out of this node should then be simply the sensed current from the selected bit line. This current may then be converted to a voltage by a resistor connected to V_CC, and the result may be compared with an appropriate reference to determine the state of the selected cell. Many bipolar RAM's insert a cascode device between the sense resistor and the highly capacitive shared collector node to improve the sensing speed by reducing the loading on the resistor while reducing the required voltage swing on the shared node [10]. As the number of bit lines sharing a sense amplifier increases, the delay at the shared collector node increases as well; not only does the number of collectors (and hence the parasitic capacitance) increase but also parasitic resistance in the wire connecting the collectors increases the voltage swing on the node. A second level of cascoding may be added to reduce this effect. As depicted in Fig. 8, the bit-line sense currents are summed in a tree fashion, first locally through one cascode device and then globally through a second. This arrangement greatly reduces the capacitance on the shared global sense wire, which has substantial series resistance (approximately 200 Ω for the 256-kb design), while reducing both the capacitance and the resistance of the local sense wires. Hence, the overall sensing delay is substantially reduced. Transistor Q3 pulls unselected bit lines to about one V_BE below the selected word-line voltage to ensure that these bit lines do not contribute current to the cascode network.
Fig. 8. Two-level cascode sense amplifier.
This circuit requires careful placement of the voltages at the bases of the cascode devices to avoid their saturation. Since the V_rwl(high) potential is two full V_BE levels below V_CC and the V_blref potential is about one half V_BE below that, Clamp1 and Clamp2 may be set at (V_CC - 1.5 V_BE) and (V_CC - 0.5 V_BE), respectively, to keep them and the bit-line sense devices out of saturation. This arrangement, however, limits the swing on the sense resistor to about 0.5 V_BE, which may not provide enough noise margin nor allow enough room for increasing swings with temperature. Alternatively, the cascode potentials may be lowered somewhat; the bit-line sense and both cascode devices will then operate in "soft" saturation (i.e., with their base-collector junctions slightly forward biased), increasing the available swing for the sense resistor. This latter solution requires careful worst-case design to limit the maximum base-collector biases to prevent access time degradation from charge storage in this junction and latchup from excess current injected into the substrate.
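The headroom bookkeeping in this paragraph can be tabulated with a few lines of arithmetic. The sketch below assumes every conducting junction drops exactly one V_BE, which is a simplification, and the function name and the 0.8-V value are illustrative; a positive base-collector value would indicate a forward-biased (saturating) junction.

```python
def cascode_bias(v_cc, v_be):
    """Base-collector biases and sense-resistor headroom, one ideal V_BE per junction."""
    v_rwl_high = v_cc - 2.0 * v_be          # selected read word line
    v_blref = v_rwl_high - 0.5 * v_be       # base of the bit-line sense device
    clamp1 = v_cc - 1.5 * v_be              # base of the first-level cascode
    clamp2 = v_cc - 0.5 * v_be              # base of the second-level cascode
    sense_collector = clamp1 - v_be         # emitter of the first-level cascode
    cascode1_collector = clamp2 - v_be      # emitter of the second-level cascode
    return {
        "vbc_sense": v_blref - sense_collector,
        "vbc_cascode1": clamp1 - cascode1_collector,
        # The second cascode's collector may fall to its base voltage
        # before its base-collector junction starts to forward bias.
        "max_resistor_swing": v_cc - clamp2,
    }

bias = cascode_bias(v_cc=0.0, v_be=0.8)     # V_CC taken as the 0 V reference
for name, value in bias.items():
    print(f"{name}: {value:+.2f} V")
```

With these clamp settings both lower devices sit right at zero base-collector bias, and the only remaining headroom is the half V_BE across the sense resistor, which is the limitation the text goes on to relax with "soft" saturation.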
Design of the cascode references must also consider electronic supply noise. With the high current gains provided by cascode sensing, a downward bump in VEE, coupled onto the emitters of the cascodes, could overwhelm the current being sensed. Reducing the percentage coupling to V~~is clearly desirable, but only the wire capacitances (which may be routed over Vcc -connected material rather than the VE~-connected substrate) may be coupled to Vcc without penalty; this is sufficient for the second-level cascode device, whose emitter capacitance is largely wire. Explicit Vcc capacitance may be added to the emitter nodes of the first-level cascodes, but is undesirable because it slows the sensing speed. Simulations indicate that making the first-level cascode's base (i. e., Clampl) coarsely track the voltage response of an ''unselected" first-level cascode's emitter (via a replica network) will greatly improve the noise response.
The weak current source (Ileak) in the figure prevents first-level cascode devices attached to only unselected bit lines from turning off and thereby increasing the voltage swing required to reselect them. As long as the sum of all of these current sources is small compared with the maximum sensed current (approximately Ibl), the required swing across the sense resistor is not substantially increased. Furthermore, the effects of these current sources are readily included in the sense amplifier's reference generation circuit.
The sense-amplifier reference is shown in Fig. 9. The goal is to place the reference halfway between the sense resistor's high and low levels, in the presence of electronic noise. A replica circuit is used to duplicate the noise behavior of the sense network. Two sense networks mimic the two read cases, summing their currents across parallel sense resistors. Currents that are the same between the two cases (such as the Ileak currents) generate the same drop on the reference as on the actual sense resistor, while those that differ (like the bit-line current when reading a zero) show up as half their normal effect. This gives the reference the desired behavior. This circuit will also help cancel cascode noise effects, since the reference circuit's cascode networks will behave like the read data's network.
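The midpoint property of this reference can be illustrated numerically. The sketch below is a simplified model under stated assumptions: the sense resistor is ideal, the only case-dependent current is the bit-line current that reaches the sense resistor when a zero is read, and the component values are arbitrary illustrative numbers rather than values from the design. All names are hypothetical.

```python
import math


def sense_drop(i_common, i_bl, reading_zero, r_sense):
    """Voltage drop on the data sense resistor for one read case."""
    i_signal = i_bl if reading_zero else 0.0
    return (i_common + i_signal) * r_sense


def reference_drop(i_common, i_bl, r_sense):
    """Drop on the reference node.

    Two replica networks, one per read case, sum their currents into two
    sense resistors in parallel (an effective resistance of r_sense / 2).
    """
    i_total = (i_common + i_bl) + (i_common + 0.0)
    return i_total * (r_sense / 2.0)


# Illustrative values only (assumptions, not taken from the paper).
R_SENSE = 500.0     # ohms
I_BL = 1.0e-3       # amperes, bit-line current
I_COMMON = 0.1e-3   # amperes, currents common to both read cases

low = sense_drop(I_COMMON, I_BL, True, R_SENSE)
high = sense_drop(I_COMMON, I_BL, False, R_SENSE)
ref = reference_drop(I_COMMON, I_BL, R_SENSE)
print(low, high, ref)
```

In this model the reference lands halfway between the two data levels, and any current common to both read cases moves the reference and the data levels by the same amount, so the comparison at the output buffer is unaffected.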
Since these reference sense networks should run the full width of the die for accurate tracking, the replica bit lines may be readily reproduced on a per-bank basis, with only the selected bank pulling Ibl from its replicas. In this way the reference may also compensate for local supply and Vwl(high) variations. The net effect is to produce a well-centered reference, which may be compared with the associated outputs from the data sense networks using a simple differential pair at the final output buffer.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 27, NO. 6, JUNE 1992
E. 256K Performance Summary
Fig. 10 shows a circuit simulation of a worst-case bank-switching read access for the 64K x 4 CSEA design discussed in the previous sections; a -300-mV/ns VEE noise pulse is placed 1.35 ns after the input transition so as to maximize the access penalty (note the similar drops in sense out and sense reference). This access requires 0.2 ns for the input pad, level shifting, and buffering; 0.65 ns for selecting and driving the gates of the PMOS resistors; 0.6 ns for pulling up the decoder, the Darlington pair, and the read word line (to the bit-line reference potential); 0.85 ns for accessing the CSEA cell and reducing the reference device's current to 50% of the bit-line current; 0.6 ns for the two-level cascoded sense amp; and 0.5 ns for the output buffer and 50-Ω output pad drive. This simulated design has a read access time of 3.4 ns at a junction temperature of 70°C, using 550-mV read word-line swings and 2.3 W of power. Building a high-performance, robust write path for such a RAM is the topic of the next section.
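The stage delays quoted for the simulated 64K x 4 read access can be tallied to confirm that they account for the full 3.4-ns access time. The short script below uses only the figures stated in the text; the stage labels are abbreviated and the variable names are hypothetical.

```python
# Stage delays of the simulated read access, in nanoseconds (from the text).
stages_ns = [
    ("input pad, level shifting, buffering", 0.2),
    ("PMOS resistor selection and gate drive", 0.65),
    ("decoder, Darlington pair, read word line", 0.6),
    ("CSEA cell access and bit-line switching", 0.85),
    ("two-level cascoded sense amplifier", 0.6),
    ("output buffer and output pad drive", 0.5),
]

total_ns = sum(delay for _, delay in stages_ns)
slowest_stage = max(stages_ns, key=lambda stage: stage[1])

for name, delay in stages_ns:
    print(f"{name:45s} {delay:4.2f} ns  {100.0 * delay / total_ns:5.1f}%")
print(f"total: {total_ns:.2f} ns")
```

The tally shows that the cell access itself is the largest single contributor, at one quarter of the total.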
IV. WRITE CIRCUITS
A. New Level Converter
The first BiCMOS SRAM's converted their inputs to CMOS levels as early as feasible, using traditional CMOS decoding techniques augmented with BiCMOS drivers [1]. More recent designs move this translation later in the decoding path. In an attempt to more closely match read and write times while saving the power required for a second set of decoders, level converters may be placed at the outputs of the read decoders to create the signals needed for writing. However, since this requires one converter per decoder output, and all but one decoder output will be inactive, the converter should dissipate no power when the output is inactive [4].
Fig. 11 depicts such a converter, configured to translate the read word-line value into a full-swing write word line. The first stage is a signal-amplifying ECL inverter, which may share its current source with all other converters in a bank because no more than one word line is ever high. No converter will receive this current except during the write pulse, and then only when the converter is in the selected bank. Transistor P5 is ratioed to easily overpower N4 and hence provide a rapid rising edge; static power is dissipated only in the one converter with P5 turned on. Two stages of CMOS inverters (or a single BiCMOS driver) amplify and buffer the output onto the heavily loaded write word line. Transistor N5 is a feedback-driven NMOS device that reduces the delay of both transitions; on the rising edge it is off and hence does not interfere with P5. Once the word line is high, N5 turns on to provide more drive for the falling edge. The reference voltage on the gate of transistor N6 prevents the feedback signal from overpowering P5 until P5 begins to turn off.
By using similar converters for both the column decoders and the write data, the write operation becomes effectively pipelined, since no input pins will have significantly shorter paths to the memory cells than other input pins. The write pulse and write data signals do, however, need to be delayed relative to the address inputs to make use of this feature (since the address inputs are delayed by decoders). As long as the write operation finishes before the next read is decoded, read-after-write performance will not suffer.
B. Write Qualification
The single NMOS write access device of the CSEA cell presents another challenge; in order to avoid writing unselected cells on selected word lines, a third write bit-line voltage level is required. Skewing the cell device ratios to provide adequate noise margins and write speed with the three-level write bit lines (as was done at the 4K level) is expensive in terms of cell area. An alternative would be to add an additional NMOS device to the CSEA cell, which would allow differential writing. This solution, however, increases cell area primarily by requiring an extra write bit line which must contact each cell. Since most cache RAM's (and certainly all on-board caches for microprocessors) access multiple bits at a time, the write disturbance problem may be avoided entirely by doing a full X-Y select on the cells to be written; in other words, the word lines may be divided so that only cells to be written have a high write word line. A simple circuit to perform this write qualification is shown in Fig. 12 . It is essentially identical to the local word-line driver used in many CMOS RAM's which employ divided word-line techniques [11] . The write word line must be inverted; this is not likely to add much delay since the existing buffer stage(s) are already heavily loaded. The other input to this circuit is a full-swing column select signal, which may be produced much like the write word line. The circuit consists of an AND gate which looks like a CMOS inverter with the global word line as its input and the column select line as its positive supply. The loading of these gates on the column select line is comparable to the loading of the write bit lines, which are also driven by the column decoder, so the delay times should be similar.
Since each such gate drives multiple cells (i.e., the access width), the area penalty is reduced. The addition of the global write word line does not increase the cell area, since it replaces the metal strap for the local poly word lines in a conventional design.
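At the logic level, the write qualification gate described above behaves as an AND of the column select with the inverted global write word line. The following sketch models only that Boolean behavior and ignores all electrical detail (device sizing, output levels, timing). The function names and the small array dimensions are hypothetical.

```python
def local_write_word_line(global_wwl_bar: bool, column_select: bool) -> bool:
    """Logic model of the qualification gate.

    The gate resembles a CMOS inverter whose input is the inverted global
    write word line and whose positive supply is the column select line,
    so its output can be high only when the column select is high.
    """
    return column_select and not global_wwl_bar


def groups_written(selected_row, selected_group, n_rows=4, n_groups=4):
    """Return the (row, group) pairs whose local write word line goes high."""
    written = set()
    for row in range(n_rows):
        # Active-low global word line: low only on the selected row.
        global_wwl_bar = row != selected_row
        for group in range(n_groups):
            column_select = group == selected_group
            if local_write_word_line(global_wwl_bar, column_select):
                written.add((row, group))
    return written


print(groups_written(selected_row=2, selected_group=1))  # prints {(2, 1)}
```

Only the group of cells at the intersection of the selected row and the selected column group sees a high write word line, which is why unselected cells on the selected row are not disturbed.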
V. EXPERIMENTAL 64K SRAM
A. Organization
To test some of the circuit ideas described here, an experimental 64-kb CSEA SRAM was designed in a two-month period and fabricated in the 0.8-μm BiCMOS technology referred to earlier. While the technology does provide three levels of metallization, the design made use of only two. The array and decoder design was optimized for a 256K design, but only one quarter of the banks were actually implemented.
The SRAM is externally organized as 16K x 4, and internally as four banks each containing 64 rows and 256 columns, with adjacent bits in a nibble laid out in adjacent columns to allow for write qualification. The chip micrograph in Fig. 13 shows that the word-line decoders for each bank are placed in the center to reduce the word-line RC time constant. The chip contains four sets of row (word line) decoders, but only two sets of column decoders, since each side may share column decoders between top and bottom due to the presence of parallel sensing paths. The row decoders on each side share predecoded address lines; the pull-down current for both row and column decoders is switched between separate address lines for each side based on the most significant address bit. The second most significant address bit is used in up/down row decoder bank selection and at the final output buffer to choose between the parallel sense paths; this late gate is set up well before the read data arrive, so it does not slow the access.
Fig. 13. Chip micrograph of 16K x 4 CSEA SRAM.
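The organization above implies a 14-bit word address. The sketch below shows one possible decomposition. The text states only that the most significant bit selects the side and the second most significant bit selects the upper or lower bank; the assignment of the remaining twelve bits to rows and nibble columns, and the encoding of each field, are assumptions made for illustration. All names are hypothetical.

```python
ROWS = 64        # rows per bank (from the text)
COLUMNS = 256    # columns per bank (from the text)
BANKS = 4        # banks on the die (from the text)
WORD_BITS = 4    # bits per word; adjacent bits sit in adjacent columns
NIBBLES_PER_ROW = COLUMNS // WORD_BITS


def decode_address(addr):
    """Split a word address into (side, up_down, row, columns)."""
    if not 0 <= addr < BANKS * ROWS * NIBBLES_PER_ROW:
        raise ValueError("address out of range for a 16K x 4 array")
    side = (addr >> 13) & 1       # most significant bit: side select
    up_down = (addr >> 12) & 1    # second MSB: upper or lower bank
    row = (addr >> 6) & 0x3F      # assumed: next six bits select the row
    nibble = addr & 0x3F          # assumed: low six bits select the nibble
    columns = [nibble * WORD_BITS + bit for bit in range(WORD_BITS)]
    return side, up_down, row, columns


total_bits = BANKS * ROWS * COLUMNS
print(total_bits)
print(decode_address(0))
print(decode_address(16383))
```

The capacity check confirms that four banks of 64 rows by 256 columns hold 65 536 bits, matching the 16K x 4 external organization.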
The CSEA memory cell has an area of 157 μm²; this is larger than the area mentioned in Section I because of the use of a 0.4-μm design grid. The die measures 6.0 x 4.4 mm and is packaged in a 48-pin ceramic DIP.
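As a rough consistency check on these figures, the total cell area can be compared with the die area. The estimate below uses the stated cell area of 157 μm², the 64-kb capacity, and the stated die dimensions; it counts only the storage cells and ignores decoders, sense circuitry, and pads. The variable names are hypothetical.

```python
CELL_AREA_UM2 = 157.0        # cell area in square micrometers (from the text)
CELL_COUNT = 64 * 1024       # 64-kb array
DIE_MM = (6.0, 4.4)          # die dimensions in millimeters (from the text)

UM2_PER_MM2 = 1.0e6          # one square millimeter in square micrometers

cell_area_mm2 = CELL_AREA_UM2 * CELL_COUNT / UM2_PER_MM2
die_area_mm2 = DIE_MM[0] * DIE_MM[1]
cell_fraction = cell_area_mm2 / die_area_mm2

print(f"cells: {cell_area_mm2:.2f} mm^2")
print(f"die:   {die_area_mm2:.2f} mm^2")
print(f"cells occupy about {100.0 * cell_fraction:.0f}% of the die")
```

Under this estimate the memory cells account for somewhat more than a third of the die, with the remainder taken up by peripheral circuitry and pads.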
B. Results
A left-half to right-half address transition is shown in Fig. 14. The device has an address access time and write pulse width of less than 4 ns and requires 320 mA of current. All measurements were made at a package temperature of 70°C. Key performance parameters are summarized in Table 1.
The measured access time is higher than the 3.4 ns predicted from circuit simulation of the more recent 256K design. Several factors contribute to this discrepancy. 
First, the 64K die has a needless extra buffer gate in the PMOS resistor selection circuitry; simulations using extracted parasitics from the fabricated die predict a penalty of 300 ps. Additional differences are due to long lead lengths in the package and increased wire delay due to the thinner-than-anticipated bit lines. Because the two designs use the same memory cell and decoder layouts, the logical place to expect performance degradation is in the peripheral circuitry (due to longer wires and larger fanouts). However, the 256K is a more mature design; considerable effort has been invested in speeding these paths, using parasitic data gathered from the 64K design.
VI. SUMMARY
The CSEA memory cell is an attractive choice for high-speed, high-density SRAM applications because of its entirely low-swing bipolar read access path coupled with a zero dc-power CMOS latch. This paper has described new circuit techniques that should enable the cell to realize these goals. A variable-resistance PMOS device is shown to reduce the power requirements of the bipolar decoders. A thorough analysis of the bit-line sensing circuitry has shown that the single-ended reading of the cell, in conjunction with the two-level cascoded sense amplifier, need not be very sensitive to supply noise. A new BiCMOS level converter allows one converter per word line without requiring much power or slowing the cycle times. Also, write qualification provides substantial improvements in cell area and write speed by eliminating three-level write bit lines for arrays with word widths of four or more.
Using these techniques, a sub-4-ns 64K CSEA SRAM has been demonstrated. Circuit simulations indicate that a properly designed 256K (64K x 4) SRAM would have an access time of less than 3.5 ns.
