Abstract-Recently, the level of realism in PC graphics applications has been approaching that of high-end graphics workstations, necessitating a more sophisticated texture data cache memory to overcome the finite bandwidth of the AGP or PCI bus. This paper proposes a multilevel parallel texture cache memory to reduce the required data bandwidth on the AGP or PCI bus and to accelerate the operations of parallel graphics pipelines in PC graphics cards. The proposed cache memory is fabricated by 0.16-μm DRAM-based SOC technology. It is composed of four components: an 8-MB DRAM L2 cache, 8-way parallel SRAM L1 caches, pipelined texture data filters, and a serial-to-parallel loader. For high-speed parallel L1 cache data replacement, the internal bus bandwidth has been maximized up to 75 GB/s with a newly proposed hidden double data transfer scheme. In addition, the cache memory has a reconfigurable architecture in its line size for optimal caching performance in various graphics applications from three-dimensional (3-D) games to high-quality 3-D movies. This architecture also leads to optimal power consumption with an adaptive sub-wordline activation scheme. The pipelined texture data filters and the dedicated structure of the L1 caches implemented by the DRAM peripheral transistors show the potential of DRAM-based SOC design with better performance-to-cost ratio.
be represented by simply attaching the original scanned surface images to the 3-D graphics model surfaces, as shown in Fig. 1.
However, the texture mapping operation requires intensive system memory access because texture data is usually stored in a PC's system memory, and then loaded into a graphics card through the AGP or PCI bus on demand. This is called pull architecture, which is better in terms of memory utilization as compared to push architecture [9]. In push architecture, all the texture data is loaded into the PC graphics memory before starting rendering. This limits the size of the texture data to the graphics memory size and leads to low memory utilization because the graphics memory can only be used when 3-D graphics applications are running. Therefore, in most PC graphics cards, pull architecture is preferred from the viewpoint of memory utilization. Although Intel's AGP bus has been developed for more efficient texture data loading with a small L1 texture cache within a pull architecture [10], more efficient use of the finite bus bandwidth by a more sophisticated texture cache memory is still required to increase the graphics realism with interactive frame rate. Furthermore, due to the widespread use of parallel rendering architectures even in PC graphics cards, it is important to support parallel graphics pipelines without texture cache memory access conflicts among the parallel graphics pipelines [11].
With consideration of the aforementioned requirements, the new cache memory design has two goals. The first is to reduce the required bandwidth on the AGP or PCI bus for loading texture image data, and the second is to support parallel graphics pipelines for maximum speed operations. Fig. 2 shows a block diagram of the proposed cache memory architecture. It is composed of four components: an 8-MB DRAM L2 cache memory, 8-way parallel SRAM L1 cache memories, eight pipelined texture filter modules, and a serial-to-parallel loader. All of these components are integrated on a single chip and fabricated using 0.16-μm DRAM-based SOC technology.
The large DRAM L2 cache memory reduces the required data bandwidth on the AGP or PCI bus by exploiting the interframe texture data coherency, which is similar to that found in MPEG algorithms [9]. Since most of the texture data for rendering the current graphics frame is reused for rendering the next frame, the large DRAM L2 cache memory can reduce the required data bandwidth on the AGP or PCI bus by 20 times for 1024 × 768 screen resolution. This will be explained in detail in the performance analysis section of this paper. pipelines with dedicated texture data filter modules. The independent L1 cache memories remove the cache access conflicts by parallel graphics pipelines, and enable each graphics pipeline to run at its maximum speed.
To maximize the advantages of the proposed architecture, wide data bandwidth between the L2 and L1 cache memories is crucial for smoothing parallel L1 cache refill operations. To this end, a wide internal bus (IBUS) has been adopted, and a newly proposed hidden double data transfer scheme maximizes the IBUS bandwidth up to 75 GB/s. This wide IBUS bandwidth enables eight L1 caches to be serviced by a large DRAM L2 cache memory without starvation, which is infeasible in PCB-level design. In addition, the cache line sizes of the L2 and L1 caches can be reconfigured in the range of 4 × 4, 8 × 8, and 16 × 16 pixel areas to keep optimal caching performance for various graphics applications from 3-D games to high-quality 3-D movies [12]. Furthermore, the dedicated SRAM L1 cache and the pipelined filter structure based on the texture mapping algorithm show good performance in spite of using low-speed DRAM peripheral transistors in the DRAM-based SOC design. This results in a more cost-effective design than that with expensive merged DRAM logic (MDL) technology [13]. Although the use of embedded DRAM to store texture image data has been studied by other researchers [14], their architecture was based on push architecture, which has limits in texture image size, and there was no actual VLSI implementation.
In Section II, the texture mapping and filtering algorithm by trilinear interpolation is introduced. In Section III, details of the cache architecture are described. In Sections IV-VII, details of the sub-blocks and the adopted circuit techniques are explained. In Section VIII, the performance improvement is demonstrated by running real 3-D graphics applications, and the chip implementation results are shown. Finally, conclusions are presented in Section IX.
II. TEXTURE MAPPING AND TRILINEAR INTERPOLATION
The conceptual texture mapping operation is composed of two steps. The first is wrapping 3-D graphics object surfaces with 2-D scanned texture images, as shown in Fig. 1. The second step is projecting the textured 3-D graphics objects onto a 2-D screen. Thus, it incorporates two-stage floating-point matrix calculations (2-D texture image space → 3-D object space → 2-D screen space). However, the actual texture mapping process takes reverse calculation steps to remove unnecessary calculations for hidden objects in 3-D space [8]. Therefore, inverse transform matrix calculations are performed on each pixel of the 2-D screen. Although this inverse mapping process reduces the required processing power, it brings another problem of aliasing artifacts in the generated graphics scenes [8]. Since the screen pixel is not a mathematically defined point, but rather an area, the corresponding portion in the texture image is also an area. Furthermore, although the size of the screen pixel is fixed, the size of the corresponding area in the texture image space varies according to the geometrical relationships between the 3-D object and the 2-D screen in 3-D space [8]. Thus, to find a representative pixel value in the corresponding texture area, another calculation step, called filtering, is required. Although a pixel in the texture image is called a "texel" in 3-D graphics, in this paper we simply call it a "pixel" for nonspecialists in 3-D graphics.
The proposed cache architecture is specially designed for the texture filtering method based on trilinear interpolation with mipmap texture images [8], which is adopted by most of today's 3-D graphics hardware. Trilinear interpolation is a kind of 2-D image filtering operation to reduce the aliasing artifacts in 3-D graphics scenes. This filtering algorithm first makes the mipmap. It is made by prefiltering an original texture image and resampling it to make a half-size image, and repeating these operations until the texture image size becomes 2 × 2, as shown in Fig. 3. The original texture image is named the level-of-detail 0 (LOD 0) image, and the smallest image is named the LOD image. The trilinear interpolation selects two neighboring LOD images which have the closest 1:1 area relationship with the screen pixel area [15], [16]. Then, this method reads eight pixel values, four from even LOD levels and four from odd LOD levels, and weights them to evaluate a final pixel value to be mapped onto the screen pixel. This trilinear interpolation method using the mipmap can evaluate the final filtered pixel value in a fixed cycle time regardless of the size variation of the texture image portion to be filtered in the original texture image (LOD 0 image).
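The mipmap construction and the selection of two neighboring LOD levels described above can be illustrated with a short Python sketch. It is only a software model under stated assumptions: the function names, the restriction to square power-of-two textures, and the 256 × 256 example are introduced here for illustration and are not part of the proposed hardware.

```python
import math

def mipmap_sizes(size, smallest=2):
    """Return the edge lengths of a square mipmap chain, LOD 0 first.

    Assumption: the texture is square with a power-of-two edge. Halving
    stops when the image reaches `smallest` (2 x 2 in the text).
    """
    if size < smallest or size & (size - 1):
        raise ValueError("size must be a power of two >= smallest")
    sizes = []
    while size >= smallest:
        sizes.append(size)
        size //= 2
    return sizes

def select_lod_pair(lod, max_level):
    """Pick the two neighboring integer LOD levels around a float LOD.

    Returns (lower_level, upper_level, fraction). The LOD is clamped to
    the available chain; one level of the pair is always even and the
    other odd.
    """
    if max_level < 1:
        raise ValueError("need at least two LOD levels")
    lod = min(max(lod, 0.0), float(max_level))
    lower = min(int(math.floor(lod)), max_level - 1)
    upper = lower + 1
    return lower, upper, lod - lower

sizes = mipmap_sizes(256)
print(sizes)                                   # edge length per LOD level
print(select_lod_pair(2.25, len(sizes) - 1))   # levels 2 and 3, fraction 0.25
```

Because the two selected levels are always adjacent, one is even and one is odd, which is consistent with reading four pixel values from an even LOD level and four from an odd LOD level.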
In each of the two selected LOD level images (one even and one odd), the four pixel values at the four neighboring integer coordinates are read. These coordinates surround the floating-point coordinate obtained by applying the transform matrix to a pixel coordinate in the screen space. The four pixel values are interpolated into one representative pixel value by weighting the distances between the four integer coordinates and the transformed floating-point coordinate. This process is illustrated in Fig. 3. It is called bilinear interpolation. Since the calculated LOD value is also a floating-point value, and is between the two selected integer LOD levels, the two bilinear interpolated pixel values from the two selected integer LOD level images are interpolated again by weighting the fractional value of the floating-point LOD value. The final value becomes a trilinear interpolated pixel value to be mapped onto a screen pixel. Therefore, the texture cache based on the trilinear interpolation method using mipmaps stores portions of the mipmap texture images in different LOD levels and in different texture images. The DRAM L2 cache size is 8 MB, which is optimal for 1024 × 768 screen resolution in most graphics applications [9], and the optimal size of each SRAM L1 cache is 16 kB [12]. Each SRAM L1 cache has parallel output data paths for transferring eight pixel values simultaneously to its trilinear interpolator. The pipelined trilinear interpolator (texture filter) generates a final aliasing-free pixel value in each clock cycle with a latency corresponding to the number of pipeline stages. The parallel data path of the SRAM L1 cache and the pipelined trilinear interpolator allow cost-effective system performance in spite of using DRAM-based SOC technology, which is lower in speed than the expensive MDL technology. The cache lines of the L2 and L1 caches are mapped onto 2-D texture image blocks, not on a one-dimensional line, for a lower cache miss rate [12].
Furthermore, the cache line size is reconfigurable in the range of 4 × 4, 8 × 8, and 16 × 16 pixel areas to maintain optimal caching performance for various graphics applications [12].
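To make the 2-D block mapping concrete, the following Python sketch splits integer texture coordinates into a block index and an in-block offset for each of the three line sizes, and counts how many cache lines the 2 × 2 bilinear footprint touches. The names `block_address` and `lines_touched` are hypothetical, and the sketch says nothing about how the actual chip forms tags or indices.

```python
def block_address(u, v, line_size):
    """Split integer pixel coordinates into a 2-D block index and offset.

    line_size is the edge of the square cache line: 4, 8, or 16 pixels.
    (Hypothetical helper; the real tag/index layout is not modeled.)
    """
    if line_size not in (4, 8, 16):
        raise ValueError("line size must be 4, 8, or 16")
    block = (u // line_size, v // line_size)
    offset = (u % line_size, v % line_size)
    return block, offset

def lines_touched(u, v, line_size):
    """Count distinct cache lines hit by the 2 x 2 footprint at (u, v)."""
    footprint = [(u + du, v + dv) for dv in (0, 1) for du in (0, 1)]
    return len({block_address(x, y, line_size)[0] for x, y in footprint})

for size in (4, 8, 16):
    print(size, block_address(21, 6, size), lines_touched(3, 3, size))
```

With a footprint whose upper-left pixel is at (3, 3), a 4 × 4 line size spreads the four pixels over four lines, whereas the 8 × 8 and 16 × 16 line sizes keep them in a single line. This is the kind of trade-off that makes the best line size depend on the application.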
IV. HIDDEN DOUBLE DATA TRANSFER
For the parallel SRAM L1 caches to operate with sufficient cache refill bandwidth, the IBUS bandwidth has been maximized by a hidden double data transfer scheme. This scheme is similar to the page mode operation in conventional DRAMs. However, the application is different in that it is for maximizing the bus bandwidth on the wide data bus where the bus width normally cannot be maximized due to the size difference between the small DRAM cell and the large logic I/O pitch in SOC designs. This scheme transfers 2-bit data through a single-bit IBUS pair during a single DRAM read/write cycle. During a DRAM write cycle, the 2-bit data in SP_LATCH_L and SP_LATCH_R, for example, are written into the two DRAM cells having the logically same row address 0 (0L, 0R). These 2-bit data correspond to the 2-bit color information of a pixel. The detailed operations of the write cycle are explained in the second write cycle (Write Cycle 2) of Fig. 6. During the W4 clock cycle, a sub-wordline (SWL) is activated, and two bitline signals (BL_D0, BL_U0) are developed by the lower and upper sense amplifiers (SA_D0, SA_U0). In the next write clock cycle (W5), the cell data in SP_LATCH_L is written into the left DRAM cell (0L) through the lower sense amplifier (SA_D0), assuming that the data in SP_LATCH_L is 1. In the W6 and W7 clock cycles, the cell data in SP_LATCH_R is written into the right DRAM cell (0R) through the upper sense amplifier (SA_U0), assuming that the data in SP_LATCH_R is 0. Since the DRAM cell write operation takes at least two clock cycles, the left DRAM cell (0L) write operation is performed during the W5 and W6 clock cycles, and the write operation of the right DRAM cell (0R) during the W6 and W7 clock cycles. Therefore, the W6 clock cycle hides part of the other memory cell's write operation, saving one clock cycle in the writing operation.
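The one-cycle saving in the write cycle can be checked with a very simplified timing model. The Python sketch below only counts clock cycles; it assumes that each cell write occupies exactly two cycles and that the two writes may overlap by one cycle because they use different sense amplifiers. The function name and the cycle numbering are illustrative.

```python
WRITE_CYCLES = 2  # per the text, a DRAM cell write takes at least two cycles

def write_schedule(start, overlap):
    """Return the clock cycles used by the left and right cell writes.

    With overlap=True the right-cell write starts in the last cycle of
    the left-cell write (the hidden cycle); otherwise it starts after it.
    Simplified model: sense-amplifier and latch behavior is not simulated.
    """
    left = list(range(start, start + WRITE_CYCLES))
    right_start = left[-1] if overlap else left[-1] + 1
    right = list(range(right_start, right_start + WRITE_CYCLES))
    return left, right

hidden_left, hidden_right = write_schedule(5, overlap=True)
serial_left, serial_right = write_schedule(5, overlap=False)
cycles_saved = serial_right[-1] - hidden_right[-1]
print(hidden_left, hidden_right, cycles_saved)
```

Starting at W5, the overlapped schedule occupies W5 and W6 for the left cell and W6 and W7 for the right cell, one cycle fewer than writing the two cells back to back.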
More clock cycles can be saved in the DRAM read cycle (Read Cycle 1), as shown in Fig. 6. The data in the two DRAM cells (0L, 0R) are read in a manner similar to the DRAM write cycle. However, data transfer from the two DRAM cells occurs during the DRAM cell data restoring cycles (R1, R2) for both of the DRAM cells through a single-bit IBUS pair, resulting in double data transfers during a single cell data restoration time. Therefore, the peak bandwidth from the L2 cache to the L1 caches is 75 GB/s when the cache line sizes of the L2 and L1 caches are configured for 16 architecture in its cache line size. Since the optimal cache line size varies according to the characteristics of incoming graphics applications [12], changing the cache line size to its optimal value results in a lower miss rate. It also reduces power consumption by removing unnecessary sub-wordline activation [17].
The L2 and L1 cache memory cells are divided into 16 sub-groups on a main wordline in each LOD part, as shown in Fig. 7. Each sub-group corresponds to one sub-wordline for partial activation. In the L1 cache, each sub-wordline has 32 SRAM cells, which cover a 2-D-mapped 4 × 4 pixel area in the texture image space with 2-bit color data for each pixel. The L1 and L2 block selectors adaptively activate 1, 4, or 16 sub-groups simultaneously to change cache line sizes in both L1 and L2 caches. The cache line sizes can be configured to 4 × 4, 8 × 8, or 16 × 16 pixel blocks. One sub-wordline of the L2 cache contains 128 DRAM cells, which are divided into four logically different row address groups. The 32 DRAM cells that make a logical group have the same logical address, and only one logical group is involved in an L2 cache read/write cycle. This logically divided row addressing scheme can increase the DRAM cell efficiency by assigning more DRAM cells on a sub-wordline.
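The relationship between cache line size and the number of activated sub-groups follows directly from the figures above, and can be written out as a small Python sketch. The constant and function names are assumptions made for illustration; only the numbers (4 × 4 pixels per sub-wordline, 2 bits per pixel, 16 sub-groups per main wordline) come from the text.

```python
UNIT_EDGE = 4                  # one sub-wordline covers a 4 x 4 pixel area
BITS_PER_PIXEL = 2             # 2-bit color data per pixel in one cell matrix
SUBGROUPS_PER_WORDLINE = 16    # sub-groups on a main wordline per LOD part

def active_subgroups(line_size):
    """Number of sub-wordlines activated for a square cache line."""
    if line_size not in (4, 8, 16):
        raise ValueError("line size must be 4, 8, or 16")
    return (line_size // UNIT_EDGE) ** 2

def active_l1_cells(line_size):
    """SRAM cells driven in the L1 cache for one line access."""
    cells_per_subwordline = UNIT_EDGE * UNIT_EDGE * BITS_PER_PIXEL
    return active_subgroups(line_size) * cells_per_subwordline

for size in (4, 8, 16):
    groups = active_subgroups(size)
    print(size, groups, active_l1_cells(size), groups / SUBGROUPS_PER_WORDLINE)
```

For a 4 × 4 line only one of the 16 sub-groups (32 cells) is activated, which is where the power saving of adaptive sub-wordline activation comes from.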
VI. SRAM L1 CACHE WITH SCALABLE PARALLEL 2-D COLUMN DECODER
To achieve sufficient L1 cache access speed in spite of using low-speed DRAM peripheral transistors in DRAM-based SOC technology, each SRAM L1 cache has parallel output data paths to its trilinear interpolator, supplying eight pixel data simultaneously in a single clock cycle. With this parallel L1 cache data path, eight times wider L1 cache access bandwidth has been achieved at a clock speed of 150 MHz. This simultaneous pixel data access enables a filtered pixel value to be generated in every clock cycle, after an initial latency in the trilinear interpolator. Fig. 8 shows the cell matrix of the SRAM L1 cache containing 2-bit color information with a parallel output data path. There are four SRAM cell matrices for 8-bit color information in the four MPTC banks (MPTC_Bank A, B, C, D), as shown in the layout photograph of Fig. 17. In an SRAM cell matrix, one sub-wordline contains 16 pixel data for a 4 × 4 pixel area with 2-bit color information per pixel. The column decoder only receives the address of the upper-left pixel among the four neighboring target pixels, which are necessary for texture filtering in each LOD part, to reduce the pin count of the input address, as shown in Fig. 4. The other three neighboring pixels are automatically selected by a new column decoder. Fig. 8 shows the unit block of the column decoder, which simultaneously generates four selection signals from the single input address in each LOD part. The column decoder can also change its decoding range from 4 × 4 to 16 × 16 for the reconfigurable cache block size.
To meet the functional requirements of the column decoder, a scalable parallel 2-D column decoder has been newly designed. It has the ability to simultaneously select four neighboring target pixel data, and it also possesses a scalable architecture for variable decoding range by merging multiple unit column decoders. The scalable parallel 2-D column decoder is composed of two blocks: a unit column decoder (Unit CDEC) array and a propagation channel, as shown in Fig. 9. The unit CDEC covers a 4 × 4 pixel area, and acts as an independent 2-D column decoder when the L1 cache line size is configured to a 4 × 4 pixel block. However, when the L1 cache line size is configured to an 8 × 8 or 16 × 16 pixel block, the propagation channel bridges the unit CDECs to provide a wider decoding range.
For the multiple unit CDECs to be merged as a single CDEC, boundary problems should be solved. These occur when the four neighboring target pixels reside on the edges of different 4 × 4 pixel blocks. The propagation channel solves this problem by transferring propagation signals to neighboring unit CDECs in the texture image space. Fig. 10 shows cases when the propagation signals are generated assuming that the cache block size is increased from a 4 × 4 to 8 × 8 pixel block. Fig. 10(a) shows the memory cell mapping relationship between a 4 × 4 pixel block and 16 memory cells. Fig. 10(b) shows a case when the decoding range is enlarged to an 8 × 8 pixel block and four unit CDECs cooperate as a single large CDEC by exchanging propagation signals. The propagation signals can be classified into four cases as follows.
Case 1) (no propagation): the propagation channel acts as a simple block decoder.
Case 2) ( propagation): two pixels reside on block and two pixels reside on block.
Case 3) ( propagation): two pixels reside on block and two pixels reside on block.
Case 4) ( propagation): each pixel resides on , , , blocks.
Case 1 means that no propagation signal is generated. In Case 2, the propagation signal is transferred from the unit CDEC for the block to the unit CDEC for the block, and notifies the unit CDEC for the block to generate the output signals for the right two pixels. In Case 3, the propagation signal is generated for the bottom two pixels. In Case 4, and propagation signals are generated for the upper-right and lower-left pixels, respectively, and then, for the lower-right pixel, and propagation signals from the unit CDECs for the and the blocks are generated. Although the propagation channel delay can lower the operation speed, in this design it was not critical because of the low chip-level clock speed, 150 MHz.
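The four cases can be reproduced in software by checking whether the 2 × 2 footprint crosses a 4 × 4 unit-block boundary horizontally, vertically, or both. The Python sketch below is a behavioral illustration only, not the decoder circuit. It assumes that the horizontal coordinate grows to the right and the vertical coordinate grows downward, and that the footprint stays inside the decoding range; the function names are hypothetical.

```python
UNIT = 4  # edge of the area covered by one unit CDEC

def footprint(x, y):
    """Four target pixels selected from the upper-left address (x, y)."""
    return [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]

def unit_blocks(x, y):
    """Unit blocks (column, row) that hold the four target pixels."""
    return sorted({(px // UNIT, py // UNIT) for px, py in footprint(x, y)})

def propagation_case(x, y, decode_range=8):
    """Return the case number (1 to 4) for an upper-left address.

    Assumption: footprints that leave the decoding range are rejected,
    since their handling is not described in the text.
    """
    if not (0 <= x < decode_range - 1 and 0 <= y < decode_range - 1):
        raise ValueError("footprint leaves the decoding range")
    crosses_right = x % UNIT == UNIT - 1    # right two pixels in next block
    crosses_down = y % UNIT == UNIT - 1     # bottom two pixels in next block
    if crosses_right and crosses_down:
        return 4
    if crosses_down:
        return 3
    if crosses_right:
        return 2
    return 1

for addr in [(1, 1), (3, 1), (1, 3), (3, 3)]:
    print(addr, propagation_case(*addr), unit_blocks(*addr))
```

In an 8 × 8 decoding range, the address (3, 3) is the only situation in which each of the four pixels falls in a different unit block, which corresponds to Case 4.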
VII. PIPELINED TRILINEAR INTERPOLATOR
The eight pixel data simultaneously transferred from the L1 cache are processed in the three-stage pipelined interpolator, which generates a filtered output value in each clock cycle after an initial latency. The trilinear interpolator uses the 4-bit fractional parts of the physical L1 cache input addresses as the weighting factors for the four neighboring pixel values at the integer coordinates. These 4-bit addresses divide a 1 × 1 integer pixel block into 16 sub-blocks, as shown in Fig. 13(a). The target pixel coordinates in the two LOD parts are different because their image sizes are different in the mipmap. Therefore, the physical L1 cache address is used in its original form in one LOD part, and the 1-bit right-shifted physical L1 cache address is used as the target pixel coordinate in the other LOD part. Fig. 13(b) shows the interpolation steps in the three-stage pipeline. In the first stage, the eight pixel values are interpolated along one coordinate direction, using the 4-bit fractional part of the corresponding input address in each LOD part, which gives four interpolated values. In the second stage, interpolations are performed along the other coordinate direction, using the 4-bit fractional part of the other address, resulting in two interpolated pixel values. Finally, in the third stage, the two interpolated pixel values from the even and odd LOD parts are interpolated by the 4-bit fractional value of the LOD value, which also divides the distance between two integer LOD values into 16 steps. Fig. 14 shows the operations of the L1 cache and the three-stage pipelined trilinear interpolator. Three input addresses are sampled at the first three rising clock edges, and proceed into the four-stage pipeline including the L1 cache access stage. Trilinear interpolated pixel values are obtained from the sixth clock cycle, assuming that the stored pixel values in the L1 cache are as shown in Fig. 14(b).
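The three interpolation stages can be sketched in integer arithmetic with 4-bit weights. The following Python code is a functional model under several assumptions that are not stated in the text: results are truncated rather than rounded, addresses are fixed-point numbers whose lowest four bits are the fraction, and the coarser LOD part takes its fraction from the 1-bit right-shifted address. Pipeline registers and timing are not modeled, and all names are illustrative.

```python
FRAC_BITS = 4
ONE = 1 << FRAC_BITS          # 16 weighting steps
FRAC_MASK = ONE - 1

def lerp(a, b, f):
    """Blend a toward b by f/16 (assumption: truncating integer math)."""
    return (a * (ONE - f) + b * f) >> FRAC_BITS

def bilinear(quad, fx, fy):
    """Stages 1 and 2 for one LOD part.

    quad = (upper_left, upper_right, lower_left, lower_right).
    """
    ul, ur, ll, lr = quad
    top = lerp(ul, ur, fx)        # stage 1: first direction, upper pair
    bottom = lerp(ll, lr, fx)     # stage 1: first direction, lower pair
    return lerp(top, bottom, fy)  # stage 2: second direction

def trilinear(quad_fine, quad_coarse, x_addr, y_addr, lod_frac):
    """Three-stage trilinear filter on eight pixel values."""
    fine = bilinear(quad_fine, x_addr & FRAC_MASK, y_addr & FRAC_MASK)
    # Assumption: the coarser LOD part uses the address shifted right by 1.
    coarse = bilinear(quad_coarse,
                      (x_addr >> 1) & FRAC_MASK,
                      (y_addr >> 1) & FRAC_MASK)
    return lerp(fine, coarse, lod_frac)   # stage 3: between LOD levels

value = trilinear((0, 160, 0, 160), (32, 32, 32, 32),
                  x_addr=0x28, y_addr=0x13, lod_frac=4)
print(value)
```

In this example the finer LOD part yields 80 (halfway between 0 and 160), the coarser part yields 32, and a LOD fraction of 4/16 blends them to 68.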
VIII. PERFORMANCE GAIN AND CHIP IMPLEMENTATION
For the architectural analysis, a software graphics pipeline and a multilevel parallel texture cache model have been implemented using C++. This architecture analysis environment allows clock-level performance analysis. As a test model, the Quake III Arena computer game has been used [11]. Fig. 15 shows a sample graphics frame and its model data characteristics. An external SRAM tag memory was assumed to be used, and a prefetching technique with a latency first-in-first-out (FIFO) was used to hide the tag memory access latency [18]. Fig. 16(a) shows the required bandwidth for the L2 and L1 cache replacement when rendering 50 consecutive frames in the Quake III Arena game. The upper graph shows the required bandwidth on the IBUS for replacing the 8-way parallel SRAM L1 caches, and the lower graph shows that for the DRAM L2 cache replacement. The required average bandwidths for the L2 and L1 caches are 210 kB/frame and 4.7 MB/frame, respectively. Thus, the L2 cache has reduced the required bandwidth on the AGP or PCI bus by a factor of about 20 compared with a design without it. Fig. 16(b) shows the parallel speedup by the parallel L1 caches with wide IBUS bandwidth. Compared with the PCB-level parallel cache design (assuming L1 cache block transfer time for 16 × 16 block size to be 200 ns), this single-chip architecture can achieve parallel speedup of eight without parallel speedup saturation. This results in a sustained texture data access speed of 6.6 Gpixels/s, and a trilinear interpolated pixel rate of 825 Mpixels/s.
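The headline figures above are related by simple arithmetic, which the following Python sketch spells out. It assumes decimal units (1 MB = 1000 kB) and interprets the texture data access speed as eight pixel reads per trilinear interpolated output pixel; both are assumptions made here for the check, and the variable names are illustrative.

```python
# Measured averages quoted in the text.
l1_refill_mb_per_frame = 4.7      # IBUS traffic for L1 cache replacement
l2_refill_kb_per_frame = 210.0    # AGP/PCI traffic for L2 cache replacement

# Without the L2 cache, the L1 refill traffic would appear on the AGP/PCI bus.
# Assumption: 1 MB = 1000 kB.
bus_reduction = l1_refill_mb_per_frame * 1000.0 / l2_refill_kb_per_frame

pipelines = 8                     # parallel SRAM L1 caches and filters
reads_per_output = 8              # pixel values read per trilinear output
filtered_rate = 825e6             # trilinear interpolated pixels per second

texture_access_rate = filtered_rate * reads_per_output
per_pipeline_rate = filtered_rate / pipelines

print(round(bus_reduction, 1))    # roughly 22, i.e. "about 20 times"
print(texture_access_rate)        # pixels read per second
print(per_pipeline_rate)          # filtered pixels per second per pipeline
```

Under these assumptions the reduction factor is about 22, and 825 Mpixels/s of filtered output corresponds to 6.6 Gpixels/s of texture data access, or roughly 103 Mpixels/s sustained per pipeline.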
A prototype chip has been fabricated by 0.16-μm DRAM-based SOC technology using 1-poly and three-metal layers (1 W, 2 Al). Fig. 17 shows its die photograph. This chip is for only one color component (R, G, B, or A), and has four parallel SRAM L1 caches as an experimental version. Increasing the number of the L1 caches leads to long IBUS lines, which can reduce the operating frequency of the chip. Since this chip has large line drivers for the IBUS lines, up to eight SRAM L1 caches can be attached on the IBUS for peak performance without lowering operation frequency. The chip is vertically divided into two parts, one for even LOD level images and another for odd LOD level images. There are four memory banks (MPTC_Bank A, B, C, D), each containing 2-bit color information. The die size of the prototype chip is 15.6 mm × 7.5 mm, and the operation frequency of the filter and the SRAM is 150 MHz. The die is large because the prototype was designed with a large operation margin and uses DRAM peripheral transistors for the logic circuits, such as SRAMs and texture filters. The die size is anticipated to shrink by 60% relative to the prototype chip with a more optimal design using MDL technology in which logic circuits can be implemented with smaller die area. Using the DRAM peripheral transistors in the SRAM and the texture filters lowers the operation frequency; however, the parallel data path between the SRAM and the texture filters compensates for the low operation frequency, resulting in pixel throughput comparable to that obtained with a logic-optimized process.
The voltages for DRAM cores, logic circuits including SRAMs, and I/O logic are 2.0, 2.3, and 3.3 V, respectively. Average power consumption is 89 mW when the line sizes of the L2 and L1 cache are configured to 16 × 16 and 4 × 4, respectively, which are known as the optimal cache line sizes in most PC graphics game applications. Fig. 18 shows the shmoo plot of the prototype chip. Fig. 19 shows the experimental graphics board for validating the operation of the prototype chip. There are four prototype chips on the graphics board, each one for one color component, R, G, B, or A. The graphics board has four DSP processors, each having 1-Gflops processing capability, for satisfying the high throughput of the prototype chip. The graphics board with the prototype chip showed the same performance as measured in the architecture simulation, except for the reduction of total pixel rate due to the reduced number of SRAM L1 caches. This is because the simulation results are based on a clock-level simulation, which exactly models the operations of the prototype chip.
IX. CONCLUSION
For greater realism of 3-D graphics scenes in PCs with interactive frame rate, a dedicated single-chip multilevel parallel texture cache memory has been proposed and fabricated by 0.16-μm DRAM-based SOC process technology. The integrated large DRAM L2 cache has solved the bandwidth bottleneck problem on the AGP or PCI bus, and the eight independent SRAM L1 caches accelerate the operations of the parallel graphics pipelines without L1 texture cache access conflicts. The IBUS bandwidth maximized by the hidden double data transfer scheme smooths parallel L1 cache replacement operations, even in 8-way parallel SRAM L1 caches. Furthermore, by the use of reconfigurable cache line architecture, optimal cache miss rate and lower power consumption have been achieved in compliance with various graphics application characteristics. The SRAM L1 caches and the pipelined texture filter architecture implemented using DRAM peripheral transistors allowed a more cost-effective design than that using expensive MDL process technology.
