The caching behavior of multimedia applications has been described as having high instruction reference locality within small loops, very large working sets, and poor data cache performance due to non-locality of data references. Despite this, there is no published research deriving or measuring these qualities. Utilizing the previously developed Berkeley Multimedia Workload, we present the results of execution driven cache simulations with the goal of aiding future media processing architecture design. Our analysis examines the differences between multimedia and traditional applications in cache behavior. We find that multimedia applications actually exhibit lower instruction miss ratios and comparable data miss ratios when contrasted with other widely studied workloads. In addition, we find that longer data cache line sizes than are currently used would benefit multimedia processing.
INTRODUCTION
Multimedia is an amalgamation of various data types such as audio, 2D and 3D graphics, animation, images and video within a computing system or within a user application [4] . Put simply, a multimedia application is one which operates on data to be presented visually or aurally. The purpose of this work is to explore the cache behavior of real world multimedia applications. An important motivation is the widespread belief (seemingly without any actual basis in research) that data caches are not useful for multimedia applications because of the streaming nature of the data upon which they operate [9] , [13] , [21] , [22] , [25] . The results presented in this paper strongly suggest that contemporary media processing applications perform no worse than traditional integer and floating point workloads.
Further motivating our study is the large role memory latency plays in limiting performance. Consider Table 1 , which compares the performance with caching against the same system with all cache levels (L1 and L2) disabled. This was done by setting the appropriate BIOS parameters on our test system at boot time and then measuring the performance on real hardware. From this experiment we can see how highly dependent modern microprocessor performance is on an efficient memory hierarchy. The difference in latency between levels of contemporary memory hierarchies is substantial, explaining the enormous slowdown we observe when the caches are disabled on our test system. Note that the system time (time spent in the operating system) slowdown is considerably less than that of the user time. This corroborates the generally held belief that the memory locality within operating system code is very poor, as it exhibits less of a performance degradation when caching is disabled.
RELATED WORK
There have been a limited number of multimedia caching studies. In [34] the data cache behavior of MPEG-2 video decoding is studied with the goal of optimizing playback performance through the cache sensitive handling of the data types used. It was found that although it has been suggested that caches are critically inefficient for video data (several media processor chips dispense with data caches entirely), there was sufficient reuse of values for caching to significantly reduce the raw required memory bandwidth. [17] , [10] , and [39] study the usefulness of caching the textures used in 3D rendering. A texture cache with a capacity as small as 16 KB has been found to reduce the required memory bandwidth three to fifteen times over a non-cached design and exhibit miss ratios around 1% [17] . The addition of a larger second level of texture cache (2 MB) to a small first level cache (2 KB) can reduce the memory bandwidth from 475 MB/s to around 92 MB/s [10] .
There have been several studies of prefetching for multimedia. [41] examines different hardware data prefetching techniques for MPEG-1 (encoding and decoding) and MPEG-2 (decoding). Three hardware prefetching techniques were considered, with the most successful found to reduce the miss count by 70% to 90%. [35] presents a combined hardware/software solution to prefetching for multimedia. Based on cycle accurate simulation of the Trimedia VLIW processor running a highly optimized video de-interlacing application, it was found that such a prefetching scheme was able to eliminate most data cache misses, with the effectiveness dependent on the timing parameters involved. [11] suggests a two-dimensional prefetching strategy for image data, due to the two separate degrees of spatial locality inherent in image processing (horizontal and vertical). When their 2D prefetching technique was applied to MPEG-2 decoding as well as two imaging applications (convolution and edge tracing), 2D prefetch was found to reduce the miss ratio more than one block look-ahead. Hardware implementation aspects of prefetching are discussed in [37] .
WORKLOADS
Berkeley Multimedia Workload
For our study of the cache behavior of multimedia applications, we employ the Berkeley Multimedia Workload, which we develop and characterize in [28] . A description of the component applications and data sets is given in Table 2 . The main driving force behind application selection was to strive for completeness in covering as many types of media processing as possible. Open source software was used both for its portability (allowing for cross platform comparisons) as well as the fact that we could directly examine the source code.
The Berkeley workload represents the domains of 3D graphics (Doom, Mesa, POVray), document and image rendering (Ghostscript, DjVu, JPEG), broadband audio (ADPCM, LAME, mpg123, Timidity), speech (Rsynth, GSM, Rasta) and video (MPEG-2). Three MPEG-2 data sets are included to cover Digital Video Disc (DVD) and High Definition Television or HDTV (720P, 1080I) resolutions. The parameters of the DVD and HDTV data sets are listed in Table 3 . "Frames" is the number of frames in the data set.
Other Workloads
For comparison purposes, we have included the results of several previous studies of the cache behavior of more traditional workloads.
SPEC92/SPEC95
SPEC CPU benchmarks are taken to be generally representative of traditional workstation applications, with the integer component reflecting system or commercial applications, and the floating point component representing numeric and scientific applications. In [16] Gee analyzed the cache behavior of the SPEC92 benchmark suite running on DECstations with MIPS R2000 or R3000 processors and version 4.1 of the DEC Ultrix operating system. Because the SPEC benchmarks are typically run in a uniprogrammed environment, no cache flushing or other method was used to simulate multiprogramming. Gee also found that for the SPEC92 benchmark suite, system time is insignificant compared to user time, and so operating system memory behavior was unimportant for that study. SPEC95 is an upgraded version of the SPEC92 benchmark suite. It consists of eight integer intensive and ten floating-point intensive applications, several of which are shared with SPEC92. In general, the applications were designed to have larger code size and greater memory activity than those of SPEC92. 
Multiprogramming Workload (Mult)
The authors of [5] generated miss ratios for very long address traces (up to 12 billion memory references in length) on the Titan RISC architecture in order to evaluate the performance of a variety of cache designs. Three individual traces were used in addition to another which was a multiprogrammed workload consisting of several jobs. Our comparison includes their miss ratio results for their 7.6 billion reference (68.5% instruction, 30.6% load, 15.4% store) multiprogramming workload (referred to as "Mult" by the authors of [5] ).
Design Target Miss Ratios (DTMR)
[31] introduced the concept of design target miss ratios (DTMRs), intended to represent typical levels of performance across a wide class of workloads and machines, to be used for hardware design. The DTMRs were synthesized from real (hardware monitor) measurements that existed in the literature and from trace driven simulations using a large number of traces taken from several architectures, and originally coded in several different languages.
VAX 11/780, VAX 8800
Two studies done at Digital Equipment Corporation (DEC) supply miss ratios for a time-shared engineering workload taken with a hardware monitor on VAX 11/780 and VAX 8800 machines [7] , [8] . The 11/780 has an 8-KB, write through, unified cache with an 8-byte block size and associativity of two. The 8800 has a 64-KB, write-through, direct mapped, unified cache with a 64-byte block size. On the VAX 11/780 it is possible to disable half of the two-way associative cache through special control bits; a technique which allowed for the measurement of a 4-KB, direct mapped, unified cache configuration as well.
Agarwal Mul3
In [1] an analysis of the effect of operating system references and multiprogramming was presented for a workload of eleven application programs (30 traces in all). The platform used to gather the traces was a VAX 11/780 running either the Ultrix or VMS operating system. All of the traces were gathered through the ATUM scheme of microcode modification, and were roughly 400,000 
Amdahl 470
In [30] , hardware monitor measurements taken at Amdahl Corporation on Amdahl 470V machines are presented. A standard internal benchmark was run containing supervisor, commercial and scientific code. Supervisor state miss ratios were found to be much higher than problem state miss ratios.
METHODOLOGY
In order to measure cache miss ratios, we modified the LibCheetah v2.1 implementation [3] of the trace driven Cheetah cache simulator [36] to operate in an execution driven mode. It was also extended to allow for traces longer than $2^{31}$ references. Cheetah simultaneously evaluates many alternative uniprocessor caches, but restricts the design options that can be varied. For each pass through an address trace, all of the caches evaluated must have the same block size, do no prefetching, and use the LRU or MIN replacement algorithms. Other cache simulators were also considered for this study (TychoII [40] , Dinero IV [14] ), but were found to be considerably slower than Cheetah or otherwise unsuitable for use in execution driven simulation due to dynamic memory allocation issues. DEC's ATOM [12] was used to instrument target applications with the modified Cheetah simulator, allowing for execution driven cache simulation. See [38] and [33] for overviews of trace driven simulation in general, and [27] for a comparison of the performance of a variety of execution and trace driven solutions.
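To make the kind of measurement involved concrete, the following is a minimal sketch of a trace driven, set-associative LRU cache model. It is not the Cheetah algorithm (which evaluates many configurations in a single pass); the function name `simulate_lru` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def simulate_lru(addresses, capacity_bytes, line_bytes, ways):
    """Return the miss ratio of one set-associative LRU cache over a trace.

    Illustrative only: one configuration per pass, no prefetching,
    and every reference is treated identically (no read/write split).
    """
    num_sets = capacity_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        block = addr // line_bytes          # line-aligned block number
        lru = sets[block % num_sets]        # set selected by low block bits
        if block in lru:
            lru.move_to_end(block)          # hit: mark most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)     # evict least recently used line
            lru[block] = True
    return misses / total if total else 0.0
```

For example, a linear pass over 256 bytes in 4-byte steps with 32-byte lines touches 8 lines in 64 references, so `simulate_lru(range(0, 256, 4), 1024, 32, 2)` gives a miss ratio of 0.125.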
Trace Length
Many cache studies utilize trace lengths that are a fraction of an application's total run time due to the enormous simulation times required. The graphs depict the number of cache misses per 1,000,000 instructions executed for two sample applications. In order to be able to simulate the effects of a program's behavior, it is necessary to have a trace which captures all of its behavior. We found that although there are some applications (notably many of the SPEC92/95 benchmarks) that exhibit uniform cache behavior over their entire run times, our multimedia workload applications did not share this property. The result of this is that full application traces are the only way to completely characterize average cache behavior.
A second difficulty with short trace lengths specific to cache simulations is the cold start problem. Cache simulation programs typically start with an empty cache which becomes filled as the simulation progresses. All initial memory accesses will miss the cache (compulsory misses), so cold start effects can potentially dominate if traces are too short to mitigate these effects. Traces of a billion or more references may be needed to fully initialize multi-megabyte cache configurations [20] . Our work traces application programs with realistic data sets for full execution runs. The trace lengths for the component Berkeley Multimedia Workload applications are given in Table 4 . Table 4 lists the amount of time the CPU spends either in user space (user time) doing actual work for the application, or in system space (system time) serving I/O requests and dealing with other overhead on behalf of the application. Both user time and system time are machine dependent, and vary based on the instruction set, clock cycle length and other architectural parameters. Data time is machine independent, and is the inherent time length of the data set. For example, 24 frames of a DVD movie might represent one second of data time, even though decoding requires only 0.5 seconds of computation (the sum of system and user time). 
Operating System Behavior
In general, studies including operating system behavior are rare because of the difficulty involved in obtaining this information. User space is freely manipulated, but tracing system space usually requires that modifications be made to the operating system. Although our traces only include user state references, we can assume that this represents almost all of the memory behavior of the programs under study; less than 1% of our multimedia workload was system time. To some degree this may be an artifact of the nature of the Berkeley Multimedia Workload, which requires system time only for file I/O. In an actual multimedia application where data must be transferred to and from I/O devices such as network, disk, or sound and video controller cards, a larger amount of OS activity could be present.
Multiprogramming
Despite the fact that the Berkeley Multimedia Workload is dominated by user time computation, it is because of multiprogramming that we cannot entirely ignore operating system behavior. When a context switch occurs, the instructions and data of the newly scheduled process may no longer be in the cache from the last time it was run due to the memory use of programs scheduled in the interim. The number of cycles in this interval (limited by the quantum) affects the cache miss ratio. Although a quantum length that depends on clock time or external events remains constant with architectural change (typically 10 to 100 ms), the number of cycles $Q$ in each quantum increases over time for various reasons, including less efficient software and a speedup of the processor relative to the speed of real time events.
Multimedia
Although the level of multiprogramming on a desktop workstation is typically low, multimedia applications are often multi-threaded. For example, in the case of on screen DVD movie playback, there are typically several concurrent threads of execution, each dealing with a particular aspect of MPEG-2 decoding (e.g. audio, video, bitstream parsing/demuxing). Acceptable playback requires that decoding be fast enough to leave time for computing the other components in that unit of time (otherwise video frames may need to be dropped) and to prevent latency effects from disrupting the perceived synchronization between audio and video. These requirements affect scheduling, and are not taken into account in an application which operates in a batch or offline mode. The effect of multiprogramming can be roughly approximated by periodically flushing (clearing) a simulated cache. The context switch intervals of the actual applications from the Berkeley Multimedia Workload were not measured and used for this because they are primarily file based applications, typically converting between compressed and uncompressed formats without presenting the resulting data to the user. So, although the algorithms they employ (and therefore their memory access patterns) should for the most part be similar to their "real world" counterparts, their scheduling behavior is vastly different. In order to correctly simulate the effect of multiprogramming for our multimedia workload, the average context switching interval for commercial (closed source) Microsoft Windows applications was measured on real hardware. The applications were chosen to correspond as closely as possible to those comprising the Berkeley Multimedia Workload, such that, for example, the context interval measured for actual DVD video playback was used in our simulations of MPEG-2 video decoding at DVD resolutions.
Microsoft Windows NT and Windows 2000 both maintain a large amount of performance information for a large number of system objects including context switch count, user time and system time per thread. By dividing the sum of system time and user time by the measured context switch count it was possible to compute the average context switch interval for each type (domain) of multimedia application. accelerator card (AGP Nvidia Riva TNT) were installed. Table 5 lists these intervals as measured by the Windows 2000 performance counters, which return results in terms of time (CPU cycles).
In our cache simulations, we simulate normal task switching by flushing the cache every $Q_{context}$ instructions. Because our cache simulation is instruction, rather than cycle based, we require cache purge intervals measured in terms of instructions executed between cache flushes. In order to convert our context switch interval data from cycles to instructions we need to know the corresponding cycles per instruction (CPI) ratio. However, we cannot simply treat x86 CISC instructions as being equivalent to the RISC Alpha instructions of our simulation platform, due to the inherently different amounts of work done by each class of instructions. In order to approximate the equivalent number of Alpha RISC-like instructions in each context switch interval, we divide the number of x86 Athlon cycles by the typical number of cycles per micro-op (CPµOp) (the details of our CPI and CPµOp measurements are given in [29] ). Note that in a real system the interval between task switches is variable, not fixed; since we don't have the distribution of inter-interrupt times, we chose to use a constant interval. Alternatively, we could have chosen some other distribution, such as exponential, normal or uniform. The simulation quanta (cache flush intervals) applied to each application are listed in Table 4 .
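The conversion and the constant-interval flushing described above can be sketched as follows. This is an illustrative model only: the fully associative LRU cache, the function names, and the representation of the trace as one list of addresses per instruction are all assumptions made for the example, not details of our modified simulator.

```python
from collections import OrderedDict


def cycles_to_instructions(interval_cycles, cycles_per_uop):
    """Approximate the RISC-like instruction count in one context
    switch interval by dividing cycles by cycles per micro-op."""
    return int(interval_cycles / cycles_per_uop)


def miss_ratio_with_flush(refs_per_instruction, num_lines, line_bytes, q_context):
    """Miss ratio of a fully associative LRU cache that is flushed
    every q_context instructions (a constant interval).

    refs_per_instruction: iterable of lists, each holding the
    addresses referenced by one instruction (hypothetical format).
    """
    cache = OrderedDict()
    misses = 0
    total = 0
    for i, refs in enumerate(refs_per_instruction):
        if i > 0 and i % q_context == 0:
            cache.clear()                    # simulated context switch
        for addr in refs:
            total += 1
            block = addr // line_bytes
            if block in cache:
                cache.move_to_end(block)     # hit: refresh LRU position
            else:
                misses += 1
                if len(cache) >= num_lines:
                    cache.popitem(last=False)
                cache[block] = True
    return misses / total if total else 0.0
```

With eight instructions that each reference address 0, a flush every four instructions produces two misses instead of one, doubling the miss ratio from 0.125 to 0.25.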
SPEC95
SPEC95 was simulated without multiprogramming (cache flushing) for several reasons. First, it is normally run in a uniprogrammed mode in order to extract the highest benchmark performance [16] . Many UNIX-type operating systems maintain context switch counts on a per process basis, which are accessible through the getrusage() system call. The average context switch interval was computed in the same manner as for the Windows multimedia applications. Table 6 lists context switch intervals for SPEC95 measured for Compaq Tru64 Unix v5.0 running on a DEC DS20 workstation (dual 21264 processors, each running at 500 MHz), with 2 GB of RAM, again running in a system with a single active task.
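As a sketch of this computation on a UNIX-type system, Python's standard `resource` module exposes the `getrusage()` counters. Summing voluntary and involuntary switches is an assumption of this example (the text does not say which counts were used), and the function names are hypothetical.

```python
import resource


def average_switch_interval(user_s, system_s, switches):
    """Average CPU time (seconds) between context switches:
    (user time + system time) / context switch count."""
    if switches == 0:
        return None                          # no switches recorded yet
    return (user_s + system_s) / switches


def current_process_interval():
    """Apply the same computation to the calling process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    # Assumption: count both voluntary and involuntary switches.
    switches = ru.ru_nvcsw + ru.ru_nivcsw
    return average_switch_interval(ru.ru_utime, ru.ru_stime, switches)
```

For instance, 9 seconds of user time and 1 second of system time over 100 context switches gives an average interval of 0.1 seconds, which must then be converted to cycles using the clock rate of the measured machine.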
Simulation Details
The component applications for both the multimedia workload and SPEC95 were compiled for the Alpha AXP architecture running Digital UNIX v4.0E with the default optimization levels in the case of the multimedia workload, and the base optimization level for SPEC95 (the same compiler optimization flags on all applications: -fast -O5 -non_shared). The resulting binaries were then instrumented with the Cheetah cache simulator using ATOM and run on 300 MHz DEC Alpha AXP machines with 128 MB of RAM.
All of the applications in the Berkeley Multimedia Workload are written in C with the exception of DjVu which is coded in C++. Data sets were chosen to be on the order of real workloads, with long enough traces (instruction and data) to exercise very large caches, or to at least touch as much address space as the corresponding real applications. The trace lengths and other relevant simulation characteristics are listed in Table 4 . Total simulation time for our work, not including false starts, machine down time and other simulation problems, was 24.4 days of CPU time for the multimedia workload, and 147.2 days of CPU time for SPEC95 simulations, for a grand total of 171 days of CPU time. The machine type used for simulation was a DEC AlphaStation 255 workstation with a single 300 MHz Alpha 21064a processor.
RESULTS
The two major determinants of cache performance are access time (the latency from the beginning of an access until the time the requested data is retrieved) and miss ratio (the fraction of cache references which are not found in the cache) [30] . Based on the latencies of a particular cache memory candidate design, in combination with the simulated or measured miss ratio, it is possible to select the design with the highest overall performance (lowest average memory access time) at some level of implementation cost.
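The selection criterion can be written in the standard textbook form below. The symbols $t_{avg}$, $t_{hit}$ and $t_{miss}$ are introduced here for illustration and are not notation used elsewhere in this paper; $m$ denotes the miss ratio.

```latex
% Average memory access time for a single-level cache:
% every reference pays the hit (access) time, and the fraction m
% of references that miss additionally pays the miss penalty.
t_{avg} = t_{hit} + m \cdot t_{miss}
```

Under this form, a larger cache that lowers $m$ is only worthwhile if the accompanying increase in $t_{hit}$ does not outweigh the reduction in $m \cdot t_{miss}$.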
Complete tables of the results from all of our simulations are available on the world wide web at http://www.cs.berkeley.edu/~slingn/research/, from which the cache performance of any application set of interest can be computed.
As it is necessary to reduce the large volume of our simulation results into a more easily digestible form, we use averaging where necessary to compress results. Because the number of applications representing a particular application domain (audio, speech, document, video, 3D) is arbitrary, we will let each of the five application domains comprise a total of 20% of the averaged workload result, with the component applications of each domain being weighted equally.
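The weighting scheme just described amounts to a mean of per-domain means. A minimal sketch follows; the function name and the nested-dictionary layout are assumptions made for the example, and the numbers in the usage note are invented.

```python
def domain_weighted_average(miss_ratios_by_domain):
    """Average miss ratio with every domain weighted equally and
    every application weighted equally within its domain.

    miss_ratios_by_domain: {domain: {application: miss_ratio}}
    """
    domain_means = [
        sum(apps.values()) / len(apps)       # equal weight per application
        for apps in miss_ratios_by_domain.values()
        if apps                              # ignore empty domains
    ]
    # Equal weight per domain (20% each when there are five domains).
    return sum(domain_means) / len(domain_means)
```

For example, a domain with two applications at 0.02 and 0.04 and a second domain with one application at 0.01 average to 0.02, rather than the 0.0233 that a flat per-application mean would give.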
Capacity
Capacity, or total cache size, has the greatest effect on miss ratio, and so it is one of the most important cache design parameters. Capacity, especially for L1 caches which are typically on the same die as the CPU, is limited by physical die size and implementation cost. In addition, the larger the capacity of a cache, the slower it is due to increased loading of critical address and data lines, thus requiring additional buffering [24] . In order to study the effect of cache capacity on miss ratio, caches were simulated ranging in size from 1K to 2M bytes.
Other Workloads
The results of other studies on the effect of cache size on the miss ratio for a variety of other workloads are presented alongside our simulation results for the Berkeley Multimedia Workload. All of the miss ratios presented in Figures 2, 3 , and 4 are for caches with a line size of 32 bytes and two-way associativity, which represent common values for these parameters. Because the results shown have been gathered from a motley assortment of studies of disparate ages and architectures, many of which did not analyze configurations precisely identical to ours in terms of line size and associativity, we use adjusted results taken from [16] . These adjustments modify the original results of the studies according to the ratios of miss ratios found in [18] for differences in associativity, and [32] for variations in line size. Extensions to larger cache sizes were made for the DTMR results using the $\sqrt{2}$ rule from [30] . It is important to note that many of the other studies included for comparison purposes also measured or simulated multiprogramming behavior, but because they are based on older machine architectures, their $Q$ (quantum) lengths and therefore their context switch intervals are significantly shorter than those used in our simulations.
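As a rough illustration of how results can be extended to larger cache sizes, the sketch below assumes the rule from [30] is read as "each doubling of capacity divides the miss ratio by the square root of two". That reading, the function name, and the sample values are assumptions of this example.

```python
import math


def extrapolate_miss_ratio(miss_ratio, capacity, target_capacity):
    """Extend a known miss ratio to another capacity, assuming each
    doubling of capacity divides the miss ratio by sqrt(2)."""
    doublings = math.log2(target_capacity / capacity)
    return miss_ratio / (math.sqrt(2) ** doublings)
```

Under this assumption, a miss ratio of 0.04 at 64 KB extends to approximately 0.02 at 256 KB, since two doublings divide the miss ratio by two.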
The most significant result of Figures 2, 3 , and 4 is that far from multimedia applications exhibiting degenerate cache behavior in comparison to more traditional workloads, our results demonstrate that they actually perform better for nearly all cache configurations. We believe that this is attributable to several factors. First, most of the comparison workloads are for timeshared machines on which task switching between users occurred very frequently. Further, the comparison studies are of architectures with much lower clock speeds than modern processors, and so exhibit higher miss ratios due to shorter context switch intervals based on real time periods. Even so, the uniprogrammed SPEC92 and SPEC95 benchmarks still demonstrate higher miss ratios than our multimedia workload. We believe that this is because many multimedia algorithm building blocks (such as the discrete cosine transform and fast Fourier transform) internally reference the same data locations repeatedly. In the case of streaming multimedia applications, data is typically copied into a fixed region of memory (buffer) from the source file or network interface device. Even algorithms which simply traverse enormous arrays of data without re-referencing (such as color space conversion, subsampling) typically do so in linear memory order, and so benefit greatly from the "prefetching" effect of long cache lines. In addition, multimedia data types are typically small (8 bits for video and speech, 16 bits for audio, single precision (32-bit) floating point for 3D geometry calculations). This means that in comparison to the other workloads which utilize full 32-bit integers or 64-bit (double precision) floating point, more multimedia data elements fit in a single cache line, thus improving the relative hit ratio.
Multimedia Domains
When broken down into the five application domains (audio, speech, document, video and 3D graphics), some important trends become apparent (Figures 5, 6 , and 7). Instruction cache miss ratios are quite similar across the various application domains, with a 16 KB or 32 KB cache being sufficient. This supports the idea that multimedia applications are dominated by small kernel loops, rather than large code sizes. Data cache miss ratios show significant variation between domains. Speech, video, and audio domains exhibit similar (low miss ratio) cache performance, while the document and 3D applications have higher miss ratios. This is attributable to the non-linear way in which data sets are traversed during processing for these applications.
SIMD Effects
The motivation behind the SIMD within a register approach taken by multimedia extensions such as Intel's MMX or Motorola's AltiVec is the fact that on general purpose microprocessors, data paths are typically 32 or 64 bits wide, while multimedia applications typically deal with narrower width data. By packing multiple narrow operations into the wider native processor data path, it is possible to improve performance. Although it might be expected that current scalar compilers would place multiple short values into a register and then extract them with register to register operations in order to minimize memory access overhead, we found that this was not the case for the two compilers available on our DEC Alpha test platform. Instead, multiple independent short loads are issued. Because of this, the use of SIMD instruction set extensions for multimedia will result in higher cache miss ratios, although the total number of memory references will decrease, due to the folding of several scalar load operations into a single parallel operation for sub-word data types which are adjacent in memory: the number of misses is essentially unchanged while the number of references falls. Note that programs employing multimedia (SIMD) instruction sets are likely to be hand-coded, as no currently available commercial compilers are able to generate SIMD instructions automatically; this will also affect their memory behavior.
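The arithmetic behind this effect can be illustrated with a single linear pass over an array in a cold cache. The sketch assumes an aligned array, aligned accesses no wider than a cache line, and no reuse; the function name and the sizes in the usage note are hypothetical.

```python
def sequential_miss_stats(total_bytes, access_bytes, line_bytes):
    """References, misses and miss ratio for one linear pass over
    total_bytes of data starting with an empty cache.

    Assumes aligned accesses with access_bytes <= line_bytes, so the
    only misses are one compulsory miss per cache line touched.
    """
    refs = total_bytes // access_bytes
    misses = -(-total_bytes // line_bytes)   # ceiling division
    return refs, misses, misses / refs
```

For a 4096-byte array and 32-byte lines, byte-wide scalar loads give 4096 references, 128 misses and a miss ratio of 0.03125, while 8-byte packed loads give 512 references, the same 128 misses, and a miss ratio of 0.25.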
Line Size
The block or line size of a cache memory is another cache design parameter that strongly affects cache performance [32] . Generally, increasing the line size decreases the miss ratio, since each fetch from memory retrieves more data, and thus fewer accesses outside the cache are required. When the line size is made too large, memory pollution can adversely affect cache performance, either by loading material that is never referenced or by evicting information that would have been referenced before being replaced. Large lines also decrease the likelihood of "line crossers", multibyte memory accesses across the boundary between two cache lines, such as occur with many CISC architectures. This type of unaligned access incurs a performance penalty since it usually requires two cache accesses; string operations can induce multiple cache data misses. Additionally, small line sizes require a greater number of bits be dedicated to tag space than larger lines, although a sector or sub-block cache is one way to avoid this problem. (See [26] for an investigation into sub-sector cache design issues.)
In addition to affecting the performance metric of miss ratio, large line sizes can have long transfer times and create excessively high levels of memory traffic [32] . It is possible to model the time to fetch a cache line, $t_{line}$, assuming no prefetching and that all loads load a full cache line. For every cache capacity there is an optimal line size that minimizes the average memory reference delay. In order to select an optimal line size, it is necessary to minimize $t_{line} \cdot m(L)$, where $m(L)$ is the miss ratio as a function of line size $L$. To investigate the effect of line size choice on miss ratio, instruction and data caches were simulated with line sizes ranging from 16 bytes to 256 bytes and total capacities ranging from 1 KB to 2 MB. For the sake of example, we use the parameters measured for the memory hierarchy on a 500 MHz AMD Athlon system, listed in Table 7 (the methodology used to obtain these parameters is detailed in [29] ). Because we are only considering one level caches in this work, we use the measured L2 parameters for the memory miss latency and bandwidth.
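A minimal sketch of this selection follows. It assumes the line fetch time takes the common form of a fixed miss latency plus the line size divided by the transfer bandwidth; that form, the function names, and all numbers in the usage note are illustrative assumptions rather than the parameters of Table 7.

```python
def line_fetch_time(line_bytes, latency_ns, bandwidth_bytes_per_ns):
    """Assumed model: fixed miss latency plus transfer time for a
    full line (no prefetching, every load fetches a whole line)."""
    return latency_ns + line_bytes / bandwidth_bytes_per_ns


def best_line_size(miss_ratio_by_line, latency_ns, bandwidth_bytes_per_ns):
    """Return (line size minimizing fetch time * miss ratio, all delays).

    miss_ratio_by_line: {line size in bytes: miss ratio} for one
    fixed cache capacity.
    """
    delays = {
        size: line_fetch_time(size, latency_ns, bandwidth_bytes_per_ns) * m
        for size, m in miss_ratio_by_line.items()
    }
    return min(delays, key=delays.get), delays
```

With invented miss ratios of 0.08, 0.05, 0.035, 0.03 and 0.028 for line sizes of 16 to 256 bytes, a 100 ns latency and 1 byte/ns of bandwidth, the minimum falls at 64 bytes: beyond that point the extra transfer time outweighs the shrinking miss ratio.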
In the case of the largest caches simulated (1M and 2M capacity), the largest line size of 256 bytes produced minimal average delay for instruction caches. Table 8 summarizes the mean memory reference delay for SPEC92 and SPEC95, in addition to the Berkeley Multimedia Workload. The best values are highlighted in bold text. Some of the instruction cache results exhibit anomalies for extremely small miss ratios due to the limited precision of our results in those instances (only a few misses for many millions of instruction references). Our results indicate that for the Berkeley Multimedia Workload (as well as SPEC95), instruction cache line sizes should be as large as possible, due to the extremely low miss ratios exhibited for even moderate capacities. Instructions are likely to be accessed sequentially, so the fetching of large line sizes pays off. Data caches, on the other hand, have clearly optimal line sizes, depending on the total cache capacity. In the selection of an optimal line size, it should be kept in mind that large line sizes can be problematic in multiprocessor systems where system bus bandwidth must be shared. Very long line sizes may also cause real-time problems, as when I/O operations cause buffer overruns due to an inability to get on the memory bus. With many desktop computer manufacturers already offering 2 and even 4-way multiprocessor support, this may have a limiting effect on the usefulness of long cache lines.
Associativity
Determining optimal associativity is important because changing associativity has a significant impact on cache performance (latency) and cost. Increasing set associativity may require additional multiplexing in the data path as well as increasing the complexity of timing and control [24] . [18] develops a rule of thumb for how associativity affects miss ratio: reducing associativity from eight-way to four-way, from four-way to two-way, and from two-way to direct mapped was found to cause relative miss ratio increases of approximately 5, 10, and 30 percent, respectively. In order to see how associativity affects miss ratios for our multimedia workload, miss ratio spreads were calculated for unified, data and instruction caches for our suite of multimedia applications. Miss ratio spread computes the benefit of increasing associativity, and is defined in [18] :
$$\frac{m(A=n) - m(A=2n)}{m(A=2n)}$$
where $m(A=n)$ is the miss ratio for $n$-way set associativity.
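As a sketch of the calculation, the function below computes spreads from a table of miss ratios indexed by associativity, taking the spread for n ways to be the relative increase in miss ratio of an n-way cache over a 2n-way cache of the same capacity. The function name and the sample miss ratios are hypothetical.

```python
def miss_ratio_spread(miss_ratio_by_ways):
    """Miss ratio spread for each n where both n-way and 2n-way
    results exist: (m(n) - m(2n)) / m(2n).

    miss_ratio_by_ways: {associativity: miss ratio}, all for the
    same capacity, line size and replacement policy.
    """
    spreads = {}
    for n, m_n in miss_ratio_by_ways.items():
        m_2n = miss_ratio_by_ways.get(2 * n)
        if m_2n:                             # skip missing or zero entries
            spreads[n] = (m_n - m_2n) / m_2n
    return spreads
```

With invented miss ratios of 0.026 (direct mapped), 0.020 (two-way) and 0.016 (four-way), the spreads are about 0.30 for one-way to two-way and 0.25 for two-way to four-way.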
As in [18] , a block size of 32 bytes was chosen, with all simulated caches utilizing LRU replacement. The miss ratio spreads of the Berkeley Multimedia Workload as well as SPEC92 and SPEC95 are shown in Figure 8 . Please note that in order to preserve visual detail across the wide range of workload behaviors observed, the subfigures use different vertical scales. Unlike the original [18] study, our curves are not smoothed or averaged.
From the miss ratio spread results in Figure 8, we can see that instruction caches for multimedia applications (and generally for SPEC92 and SPEC95) benefit from 2- or 4-way associativity for moderate size caches (16 KB to 256 KB). For the multimedia workload, most of the benefit from associativity seems to be obtained with two-way set associativity; additional associativity does not improve performance significantly, except for small cache sizes. Increasing associativity can also be a useful way to increase overall cache capacity when limited by virtual memory constraints (a limited number of page offset bits to index the cache). This was the approach taken both by the designers of Motorola's G4 processor (which includes 8-way associative L1 caches) and by the designers of the IBM 3033, which has a 16-way associative 64 KB cache.
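The page offset constraint mentioned above can be made concrete with a short sketch. If a cache is indexed using only page offset bits, each way can span at most one page, so the maximum capacity is the page size times the associativity. The 4 KB page size and the function names below are assumptions for illustration; the text does not state the page size of either machine.

```python
def way_size(capacity_bytes, ways):
    """Bytes covered by one way (the span of the index plus line offset bits)."""
    return capacity_bytes // ways


def indexable_with_page_offset(capacity_bytes, ways, page_size_bytes):
    """True if the cache can be indexed using page offset bits alone."""
    return way_size(capacity_bytes, ways) <= page_size_bytes


def max_capacity(page_size_bytes, ways):
    """Largest capacity indexable with page offset bits at this associativity."""
    return page_size_bytes * ways


PAGE_SIZE = 4096  # assumed 4 KB pages (not stated in the text)

for ways in (1, 2, 4, 8, 16):
    kb = max_capacity(PAGE_SIZE, ways) // 1024
    print(f"{ways:2d}-way: up to {kb} KB")
```

Under this assumed page size, 16 ways allow a capacity of up to 64 KB, which is consistent with the 16-way 64 KB organization cited for the IBM 3033.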
MULTIMEDIA TRENDS
The final determination we would like to make is what cache designers should plan for to support future multimedia applications. This can be thought of in terms of the potential for data set expansion within each multimedia application domain. We expect that audio and speech application data sets will not change significantly in size, as current data sets are already at the limit of human audio fidelity. Document processing should also not change as current documents are sufficient for either printing or previewing at laser printer resolutions.
Video resolutions are not yet close to the limits of the human eye. This can be seen in the high resolution digital formats currently in the pipeline for consumer level products: DVD (720x480), HDTV 720P (1280x720), and HDTV 1080I (1920x1080). To determine whether the working set size of video applications grows with resolution, and therefore whether larger cache capacities are necessary to support these new formats, we ran the same MPEG-2 decoding and encoding applications with data sets at DVD, HDTV 720P and HDTV 1080I resolutions. Figures 9 and 10 show the effect of cache capacity on the ratio of miss ratios for increasing resolution. As an example of how to interpret the figures, DVD→720P refers to the ratio of the miss ratio at 720P resolution to the miss ratio at DVD resolution. This metric shows the relative change in miss ratio for the higher resolution compared to the preceding lower resolution.
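The ratio-of-miss-ratios metric is straightforward to compute once miss ratios are available for each cache capacity at two resolutions. The sketch below uses hypothetical function names and made-up miss ratios purely to show the calculation. Only the frame dimensions come from the text.

```python
def frame_pixels(width, height):
    """Number of pixels in one frame."""
    return width * height


def ratio_of_miss_ratios(lower_res, higher_res):
    """Per-capacity ratio: miss ratio at higher resolution / at lower resolution.

    Both arguments map cache capacity (in KB) to miss ratio. Capacities that
    are missing from either input, or that have a zero miss ratio at the
    lower resolution, are skipped.
    """
    return {
        cap: higher_res[cap] / lower_res[cap]
        for cap in sorted(lower_res)
        if cap in higher_res and lower_res[cap] > 0
    }


# Frame sizes quoted in the text.
dvd = frame_pixels(720, 480)
hd720 = frame_pixels(1280, 720)
hd1080 = frame_pixels(1920, 1080)
print(f"720P/DVD pixel ratio:   {hd720 / dvd:.2f}")
print(f"1080I/720P pixel ratio: {hd1080 / hd720:.2f}")

# Hypothetical miss ratios by capacity in KB (illustrative, not measured).
dvd_miss = {8: 0.030, 32: 0.010, 128: 0.005}
hd720_miss = {8: 0.045, 32: 0.011, 128: 0.005}

for cap, ratio in ratio_of_miss_ratios(dvd_miss, hd720_miss).items():
    print(f"{cap:4d} KB: DVD->720P ratio = {ratio:.2f}")
```

A ratio close to 1.0 at a given capacity means that the higher resolution does not increase the miss ratio at that cache size, even though each frame contains more than twice as many pixels.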
From Figure 9, we can see that instruction miss ratios are hardly affected by changes in resolution; although there are some minor fluctuations, the ratios are generally quite close to 1.0. Data miss ratios (Figure 10) show a stronger influence for small caches (capacities less than 32 KB), but level off for larger caches. The type of data locality being exploited by data caches for digital video is presumably at the block or macroblock level (which are the same size in all formats) rather than the frame level, since caches are equally effective on all resolutions above a minimum working set size.
Figure 9: Instruction Cache Trend - ratio of miss ratios for increasing resolution
Previous research ([17], [10], [39]) has found that even a small texture cache located on a 3D accelerator board significantly reduces the required bandwidth to main memory. Past architectural trends suggest that all 3D rendering functionality will eventually be folded into the main processor, once there is adequate silicon (and perhaps pins) to devote to it. We found that 3D applications exhibited the poorest locality of the multimedia domains. Moving 3D functionality entirely onto the CPU (and therefore sharing the cache with other applications) may require either the reorganization of program structures to render vertices in an order amenable to LRU caching ([17] examines several approaches for doing this) or larger caches. Texture size depends more on the quality of rendered output than on display resolution, and is therefore subject to great pressure for growth [19].
SUMMARY
Cache Design Parameters
In this paper we have provided a thorough analysis of three cache parameters that are important for supporting multimedia applications: cache capacity, line size and set associativity. Using execution driven simulation, we simulated a large design space incorporating multiprogramming effects. As can be seen from Table 9, currently available processors are very similar in their cache design choices and, based on our derived design parameters, are for the most part well suited for multimedia.
Capacity
A moderate instruction cache capacity of 16 KB or 32 KB was found to be sufficient for all of the applications in our multimedia workload. Despite the widespread misconception that multimedia applications exhibit poor data cache performance, the Berkeley Multimedia Workload was found to exhibit quite low miss ratios. Optimal data cache size depends on the type of multimedia applications that are of interest. For the most common audio, speech and video multimedia applications, a data cache of 32 KB in capacity is large enough to exhibit low (<1%) miss ratios. Document and 3D processing exhibit less locality, and in fact even the largest cache sizes simulated (2 MB) still suffered significant misses for 3D graphics. As mentioned, this is due in large part to the fact that 3D graphics primitives (vertices) are processed in object order rather than memory order, leading to poor memory referencing behavior.
