We present a solution to one of the fundamental problems in computer graphics: hidden surface removal. In most 3D graphics systems, hidden surface removal is done using the Z-Buffer algorithm. This method, however, requires a read-modify-write memory access for each visible pixel, which represents a severe performance limit. We introduce a novel SRAM cell that incorporates the logic needed to perform the Z-Buffer algorithm on its own. With these cells placed into the page register of conventional DRAMs, almost any pixel rate can be achieved.
Introduction
A common representation of 3-dimensional objects in computer graphics applications is an approximation of their surface by a sufficiently large number of polygons, mostly triangles. Graphical output devices nowadays are almost exclusively raster devices, that is, the display area is composed of a set of discrete pixels. Each pixel on the screen is associated with one entry in the display buffer (or Frame Buffer), which holds the color of the pixel. For display purposes, each triangle must therefore be decomposed into the set of screen pixels it covers, a process called rasterization. Current state-of-the-art rasterizer units are capable of producing in excess of 100M pixels/s if fed with triangles at an appropriate rate.
The non-trivial task of determining the visible parts of an object can be solved by computing for each pixel a measure for the distance to the observer, called Z-Value, and by comparing that Z-Value to the Z-Value of the pixel previously generated for this screen address. This is the so-called Z-Buffer Algorithm [1], [2], which requires additional memory capacity to hold the Z-Values.
Thus, before a pixel at screen address (x,y) can be written, the value in the Z-Buffer at address (x,y) must be read and compared to the newly generated one. If the new pixel is nearer to the observer, its Z-Value must be written into the Z-Buffer, and its color is stored in the Frame Buffer to be displayed on the screen. The new pixel is discarded if it lies behind the old one.
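The read-compare-write sequence just described can be sketched in software as follows. This is a minimal illustration, not the hardware proposed in this paper; it assumes that a smaller Z-Value means nearer to the observer and that a pixel at equal depth is discarded. The names `make_buffers` and `write_pixel` are hypothetical.

```python
# Minimal software model of the Z-Buffer test for a single pixel.
# Assumptions (not stated in this form in the text): smaller Z = nearer,
# and a new pixel at exactly the same depth is discarded.

def make_buffers(width, height, z_far, background):
    """Create a Z-Buffer initialised to the farthest value and a Frame Buffer."""
    z_buffer = [[z_far] * width for _ in range(height)]
    frame_buffer = [[background] * width for _ in range(height)]
    return z_buffer, frame_buffer


def write_pixel(z_buffer, frame_buffer, x, y, z_new, color):
    """Return True if the new pixel was nearer and has been stored."""
    z_old = z_buffer[y][x]            # read
    if z_new < z_old:                 # compare (modify decision)
        z_buffer[y][x] = z_new        # write Z-Value
        frame_buffer[y][x] = color    # write color
        return True
    return False                      # new pixel lies behind: discard
```

Every call performs one read of the Z-Buffer, and every visible pixel additionally performs one write, which is the read-modify-write access pattern discussed above.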
Obviously, the algorithm bears an inherent memory bandwidth problem. A typical screen resolution of about 1M pixels and 16 to 32 bits per Z-Value prohibit the use of fast SRAM devices due to the high costs. Slow DRAM devices force the system designer to build highly-interleaved and expensive memory systems to keep up with the fast rasterizer units.
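To give a feeling for the numbers involved, the following back-of-the-envelope sketch computes the Z-Buffer size and the worst-case external bandwidth. It assumes exactly 1024 x 1024 pixels, a rasterizer rate of 100M pixels/s, and the worst case in which every pixel is visible (one read plus one write per pixel); the function names are hypothetical.

```python
# Rough estimate of Z-Buffer capacity and worst-case bandwidth.
# Assumptions: 1024 x 1024 screen, 100M pixels/s, every pixel visible.

PIXELS = 1024 * 1024         # "about 1M pixels"
PIXEL_RATE = 100_000_000     # pixels per second


def z_buffer_bytes(bits_per_z):
    """Memory needed to hold one Z-Value per screen pixel."""
    return PIXELS * bits_per_z // 8


def worst_case_bandwidth(bits_per_z):
    """Bytes per second if each pixel needs one read and one write."""
    return 2 * PIXEL_RATE * bits_per_z // 8


for bits in (16, 32):
    print(bits, "bit Z:",
          z_buffer_bytes(bits) // 2**20, "MiB,",
          worst_case_bandwidth(bits) // 10**6, "MB/s")
```

Under these assumptions a 32-bit Z-Buffer occupies 4 MiB and would need up to 800 MB/s of external bandwidth.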
As a solution, we propose to integrate the compare logic into the Z-Buffer memory devices, and to perform the complete Z-Buffer algorithm on chip. In this way, the enormous internal bandwidth is available, and the read-modify-write cycles are turned into mere write cycles from an external point of view. In its basic configuration, the Z-Buffer only outputs the farther/nearer flag as a control signal for the Frame Buffer.
However, the compare logic must be fast, compact and compatible with DRAM technology to be easily integrated. In this paper we describe a logic-embedded SRAM cell, called CBit cell, capable of performing the Z-compare operation, and a memory architecture, called ZRAM, which incorporates this cell to achieve extremely high pixel rates.
CBit cells
For explanation purposes, we assume a Z-resolution of 32 bits. Let us consider the logic-embedded SRAM cell in Figure 1, which holds the MSB ZO31 of the old Z-Value. It must be compared to the newly generated MSB ZN31, which is put on the true and inverted bit lines. The upper half of the schematic consists of a common 6-transistor CMOS static RAM cell. The remaining seven transistors perform the compare operation. They are mainly N-type, and therefore the cell is called an N-type CBit cell (there is a corresponding P-type cell, which will be introduced later). The operation is as follows:
Prior to any access, the write signal WR and the select signal S31 are held low, and the nearer flag NN is precharged high. Thus, S30 = 1. An access starts by placing the incoming ZN31 bit and its inverted value on the corresponding bit lines. Then, S31 is activated and NN is left floating.
Figure 1: N-type CBit cell
This will produce logical values on the output lines as given in Table 1. Only in the case ZN31 = ZO31 must the next lower bit be tested. Thus, the S30 line can be used to activate the CBit cell holding ZO30. However, we have to consider that the active level of the select signal has changed from the input to the output of the cell. This seems to be very impractical, and one could think about various arrangements of P-type and N-type transistors at the places T7-T11. However, whatever their type, pass transistors are good at passing only the voltage level that does not switch on other pass transistors of the same type. An inverter would solve the problem, but it would also increase the propagation delay and the transistor count for each cell. Therefore, we construct a P-type CBit cell as shown in Figure 2, which is activated directly by the active-low S30 signal. If selected, the internal state of the P-type CBit cell is as shown in Table 2.
In this way, we can construct a complete 32-bit memory word by placing N-type and P-type CBit cells alternately into a chain. Each N-type CBit cell is connected to NN, and each P-type cell is connected to NP accordingly. The interconnection scheme is shown in Figure 3.
In this way, the select signal ripples from one cell to the next down to the first cell holding a ZO bit which differs from the incoming ZN bit. The select signal S-1 (denoted EQ in Figure 3) of CBit cell 0 is activated if ZN = ZO.
After the worst-case propagation delay time (the time it takes for the select-signal to arrive at CBit cell 0 plus the time needed by this cell to do the compare operation) the S31-signal is deasserted and both NP and NN are sampled.
If one of them is found active, WR is asserted to write the new Z-Value simultaneously into all cells.
WR is passed to the outside world (e.g. the Frame Buffer Controller), indicating that the color of that pixel must be written as well.
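The behaviour of the complete 32-bit chain can be summarised by the following behavioural sketch. It models only the logical outcome, not the circuit (no precharging, signal polarities or timing), and it assumes that the cell holding bit 31 is N-type, that cell types alternate down the chain, and that "nearer" means the incoming bit is 0 where the stored bit is 1. The function names are hypothetical.

```python
# Behavioural model of a chain of CBit cells (logic only, no electrical detail).
# Assumptions: MSB cell is N-type, types alternate, smaller Z = nearer.

def cbit_compare(zn, zo, bits=32):
    """Return (nn_active, np_active, eq) for new value zn and old value zo."""
    for k in range(bits - 1, -1, -1):          # select ripples from MSB down
        zn_k = (zn >> k) & 1
        zo_k = (zo >> k) & 1
        if zn_k == zo_k:
            continue                           # bits equal: select next cell
        nearer = (zn_k == 0 and zo_k == 1)     # first differing bit decides
        n_type = ((bits - 1 - k) % 2 == 0)     # MSB cell is N-type
        return (nearer and n_type, nearer and not n_type, False)
    return (False, False, True)                # all bits equal: EQ active


def zram_word_access(zo, zn, bits=32):
    """Return (stored Z-Value after the access, WR flag)."""
    nn, np_, _eq = cbit_compare(zn, zo, bits)
    wr = nn or np_                             # write only if nearer
    return (zn if wr else zo), wr
```

For example, `zram_word_access(zo=7, zn=5)` returns `(5, True)`, whereas `zram_word_access(zo=7, zn=9)` leaves the stored value unchanged and returns `(7, False)`.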
A closer look at the CBit cells
First, we will explain the reason for the presence of T13 in the CBit cells. In Figure 3, consider the case ZN31 = ZO31 = 1, ZN30 = ZO30 = 1, ZN29 = 0 and ZO29 = 1, that is, the nearer condition is detected in CBit cell 29. Without T13, the ZN31 line, carrying a high level, would be tied to ground via the NN line in CBit cell 29. This potential contention must be avoided for the P-type cells as well.
Except for write operations, the level at the gates of T7, T8 and T12 is static. During a compare operation, one of the two branches T7-T10 and T8-T9 always remains in the OFF state in each selected cell. To determine the speed of operation, we must examine the case ZNk = ZOk. The parts performing the logic operation in two adjacent CBit cells are shown in Figure 4 for the case (ZNk = ZOk) AND (ZNk-1 = ZOk-1). The propagation delay of the select signals is comparable to the delay of two-input NAND and NOR gates.
A sample implementation of the circuitry was done at the IBM Development Laboratory at Böblingen using IBM's CMOS5L technology (3.3 V, 0.5 µm effective channel length). The following simulation results were obtained: the select signal ripples through N-type and P-type cells in about 0.1 ns and 0.15 ns, respectively. After being selected, an N-type cell takes 0.24 ns to activate the NN line, and a P-type cell takes 0.56 ns to pull up the NP line. Figure 5 shows the timing diagram of one combination of an N-type and a P-type CBit cell. The markers indicate delays as explained below. In this implementation, a worst-case 32-bit compare operation takes about 4.4 ns.
However, the compare time can be brought well into the sub-nanosecond range without increasing the hardware expenses significantly. This is accomplished by breaking the select chain into a number of shorter sub-chains, activating the select signals of the CBit cells holding the MSB of each sub-chain simultaneously, and combining the results. An example is shown in Figure 6, where the 32-bit chain is divided into 4 sub-chains, thereby reducing the overall delay time to approximately one fourth. The additional hardware expenses are as low as 12 transistors.
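The quoted worst-case figure can be roughly reproduced from the per-cell delays with a simple additive model. The sketch below is an estimate under stated assumptions, not the simulation used for the results above: it assumes that delays simply add up, that each chain starts with an N-type cell and alternates, and that the worst case is the select signal rippling through every cell but the last, followed by the flag activation time of the last cell. The delay of the logic that combines sub-chain results is not modelled. All names are hypothetical.

```python
# Additive delay estimate for a chain of alternating N-/P-type CBit cells.
# Assumptions: delays add linearly; the chain starts with an N-type cell;
# combining logic for sub-chains is ignored.

T_RIPPLE_N = 0.10   # ns, select ripple through an N-type cell
T_RIPPLE_P = 0.15   # ns, select ripple through a P-type cell
T_FLAG_N = 0.24     # ns, N-type cell activates the NN line
T_FLAG_P = 0.56     # ns, P-type cell pulls up the NP line


def worst_case_delay(chain_bits):
    """Estimated worst-case compare time in ns for one chain."""
    delay = 0.0
    for i in range(chain_bits - 1):            # cells the select ripples through
        delay += T_RIPPLE_N if i % 2 == 0 else T_RIPPLE_P
    last_is_n = (chain_bits - 1) % 2 == 0      # type of the last cell
    delay += T_FLAG_N if last_is_n else T_FLAG_P
    return delay


print(round(worst_case_delay(32), 2))   # full 32-bit chain
print(round(worst_case_delay(8), 2))    # one of four 8-bit sub-chains
```

With these assumptions the full chain comes out at about 4.41 ns, in line with the simulated value, and an 8-bit sub-chain at about 1.41 ns. The latter is somewhat more than a quarter because, in this simple model, the flag activation time of the last cell does not shrink with the chain length.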
Architecture of the ZRAM-Chip
For the Z-Buffer, we propose to modify standard DRAM devices such that the page register is constructed from CBit cells. In the simplest configuration, the ZRAM chip has a 32-bit data interface for the Z-Values, and therefore accepts pixel addresses. The page register is organized such that 32 adjacent CBit cells hold the Z-Value of one pixel. Supposing the DRAM block is organized as 1K x 1K memory cells, the page register stores the Z-Values of 32 pixels.
Most rasterizers operate in a screen-line-oriented fashion, so assigning screen line fragments to DRAM pages will give a high percentage of page hits. Nevertheless, any page fault will result in a severe performance loss, because the processing time of one Z-compare operation is so short. Thus we propose to install multiple page registers and to use a page prefetch mechanism. The operation is as follows: upon receipt, Z-Value and pixel address are stored in FIFO memories, which are placed on the ZRAM as well. While the appropriate page access is performed, new data is allowed to enter the chip, and succeeding page faults are detected. As soon as operation starts with the just-loaded page register, a new page access is initiated, and so forth. A block diagram of the proposed architecture is shown in Figure 7. For simplicity, we assume that the DRAM device is built as one single memory block. Due to the pipelined operation, the nearer flags are handed out to the Frame Buffer Controller with a certain latency.
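The page organisation can be illustrated with a small sketch that maps pixel addresses to DRAM pages and counts page faults for an address stream. It assumes a linear mapping of consecutive pixel addresses onto pages of 1024 bits and models only a single page register without prefetch, so it shows why page locality matters rather than how the proposed prefetch mechanism behaves. The names `locate` and `count_page_faults` are hypothetical.

```python
# Pixel-address-to-page mapping and page-fault counting (single page register,
# no prefetch). Assumption: consecutive pixel addresses map linearly to pages.

PAGE_BITS = 1024                        # one row of a 1K x 1K DRAM block
Z_BITS = 32                             # bits per Z-Value
PIXELS_PER_PAGE = PAGE_BITS // Z_BITS   # 32 pixels per page register


def locate(pixel_address):
    """Return (page number, slot within the page register)."""
    return divmod(pixel_address, PIXELS_PER_PAGE)


def count_page_faults(addresses):
    """Count how often the page register must be reloaded."""
    faults = 0
    current_page = None
    for address in addresses:
        page, _slot = locate(address)
        if page != current_page:        # page fault: load a new DRAM page
            faults += 1
            current_page = page
    return faults
```

A scan through 64 consecutive pixel addresses causes only two page faults in this model, whereas an access pattern alternating between two pages faults on every access.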
Conclusion
We have presented a memory design which represents the ultimate solution to the notorious Z-Buffer bottleneck. No new technology is required, and additional hardware expenses are small. Not described in this paper (since they lie beyond the technical focus of the conference) are further possibilities of this design, such as supporting anti-aliasing at sub-pixel resolution and further increasing the pixel rate by placing parts of the rasterizer unit onto the ZRAM as well.

