Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match the access patterns and loop structures of media programs very well. We find that 75-85% of the dynamic instructions in the processor instruction stream are supporting instructions necessary to feed the SIMD execution units rather than true/useful computations, resulting in the underutilization of the SIMD execution units (only 1-12% of their peak throughput is achieved). Rather than focusing on exploiting more data level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides better performance than a 16-way processor with current SIMD extensions. In the case of application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at a 10% increase in the area required by MMX and SSE extensions (0.3% increase in overall chip area) and 1% of total processor power consumption.
Introduction
Contemporary computer applications are multimedia-rich, involving significant amounts of audio and video compression, 2D image processing, 3D graphics, speech and character recognition, communications, and signal processing. While dedicated media-processors and media-tailored ASICs are used in small, low-power embedded devices such as PDAs, cell-phones, and set-top boxes, augmenting the general-purpose processor with media-tailored enhancements has been the course of action in general-purpose computing. SIMD extensions are found not only in commercial general-purpose processors, but also in DSP processors such as the TMS320C64 processor from Texas Instruments [3] and the TigerSharc processor from Analog Devices [4].
Obviously the microprocessor design community has embraced the SIMD paradigm for media extensions. Although compiler capabilities to automatically exploit the SIMD extensions have been meager [50] [51] [52] [53], media-rich applications have exploited the paradigm through the use of assembly libraries and compiler intrinsics and have shown significant performance benefits [5] [6] [7] [8] .
While the improvement in performance has been encouraging and exciting, we notice that performance does not scale with increasing the SIMD execution resources. Hence, we embark on a study to understand the behavior of multimedia applications on SIMD extensions and the nature of the data level parallelism (DLP) in multimedia applications. More specifically, we attempt to answer the following:
• SIMD enhanced general-purpose processors (GPPs) typically exploit the sub-word parallelism between independent loop iterations in the inner loops of multimedia programs. Where does DLP in media applications reside? Does most of the DLP reside in the inner loops, or is there significant DLP in the outer loops?
• Nested loops are required for processing multimedia data streams and this necessitates the use of multiple indices while generating addresses. GPPs contain limited support to compute addresses of elements with multiple indices. How many levels of nesting are required in common media algorithms? Are the addressing sequences primarily sequential?
• While SIMD extensions are capable of performing multiple computations in the same cycle, it is essential to provide data to the SIMD computation units in a timely fashion in order to make efficient use of the sub-word parallelism. Providing data in a timely fashion requires supporting instructions for address generation, address transformation (data reorganization such as packing, unpacking, and permute), processing multiple nested loop branches, and loads/stores. Are these supporting instructions a dominant part of the instruction stream?
• What percentage of the peak computation rate is achieved for the SIMD execution units in GPPs? If the computation rate is low, what are the reasons that prevent the SIMD execution units from achieving a good computation rate?
• What are effective techniques to further enhance the performance of media applications on SIMD enhanced GPPs?

This paper has two major contributions. The first contribution of the paper is the characterization of media workloads from the perspective of support required for efficient SIMD processing. Typically, studies have focused on the true/core computation part of the algorithms, whereas we show that significant additional performance enhancements can be achieved by focusing on the supporting instructions. The second contribution of the paper is the MediaBreeze architecture, which illustrates how characterization studies can be used to design cost-effective architectural enhancements. The major focus of the proposed architecture is on the instructions that support the true/core computations, rather than on the true/core computations themselves.
The rest of the paper is organized as follows. In section 2 we describe the benchmarks used in the study. Section 3 performs sensitivity experiments on the scalability of conventional instruction level parallelism (ILP) and DLP techniques. In section 4 we describe studies to detect bottlenecks in the execution of SIMD programs on GPPs. In section 4.1, we describe the loop nesting and access patterns in multimedia applications and their mapping onto GPPs with SIMD extensions. In section 4.2, we classify dynamic instructions into two fundamental categories, the true/core computation instructions and the overhead/supporting instructions, and analyze their mix in media benchmarks. In section 4.3, we measure the percentage of the peak computation rate achieved for the SIMD execution units in GPPs by conducting experiments on two different superscalar processors. Section 4.4 identifies additional bottlenecks in conventional ILP processors that limit the computation rate of the SIMD execution units. Based on the understanding of the behavior of multimedia applications and the bottlenecks in GPPs with SIMD extensions, in section 5 we propose the MediaBreeze architecture that incorporates explicit hardware support for processing the overhead/supporting instructions efficiently. The cost of incorporating the MediaBreeze hardware support into a SIMD enhanced GPP is evaluated in section 6. Section 7 discusses related work, and the paper is summarized in section 8.
Description of Benchmarks
We use nine multimedia benchmarks to study the architectural implications of GPPs with SIMD extensions. Table 1 lists the benchmarks along with a short description and the dynamic instruction count. Sample source code for each of the benchmarks is provided in [9]. SIMD versions of the benchmarks were created for two processors, namely, the Pentium III and a Simplescalar-based superscalar processor. The Pentium III MMX code was generated using assembly and compiler intrinsics, while the Simplescalar SIMD code was generated using instruction annotations and assembly code. The code was compiled by Intel C/C++ compiler version 4.5 and gcc version 2.6 respectively, using maximum compiler optimizations (including loop unrolling). Our suite includes applications (g711, aud, jpeg, ijpeg, and decrypt) and kernels (cfa, dct, motest, and scale). The kernels used are major components in image and video processing standards such as JPEG, MPEG, H.263, etc. Many of these benchmarks are also part of media benchmark suites such as MediaBench [10].
3 A Scalability test
Media applications are known to contain a significant amount of DLP, and a logical approach to improving performance is to scale the processor resources to extract more parallelism. To understand the ability of wide out-of-order superscalar processors to increase the performance of multimedia programs, we performed experiments scaling the various resources of the processor (using a modified Simplescalar-3.0 simulator [11] enhanced with 64-bit SIMD execution units). All components are scaled as in Table 2.
Each of the nine benchmarks was modified to incorporate SIMD code using assembly and instruction annotations of the modified Simplescalar simulator. Fig. 1(a) shows the instructions per cycle (IPC) for different processor configurations for each of the benchmarks. We, incidentally, also note that almost the same performance can be achieved even if the SIMD execution units were not scaled; i.e. the non-SIMD components are scaled up to the 16-way processor keeping the SIMD component constant as a 2-way processor (i.e. 2 SIMD ALUs and 1 SIMD multiplier). The IPC for this case is depicted in Fig. 1(b) . The percentage increase in IPC when scaling both the SIMD and non-SIMD resources over the case of scaling only the non-SIMD resources is shown in Fig. 1 (c). The observation suggests that SIMD execution units are already underutilized and bottlenecks are concealed elsewhere in the non-SIMD portion of the application.
Identification of Bottlenecks
It is evident that there are bottlenecks in SIMD style media processing and that it is not possible to get significant amounts of additional performance improvements by merely increasing the SIMD resources. We investigate characteristics of media programs that point towards the bottlenecks in current SIMD architectures.
Nested loops in multimedia applications
In this section we investigate the nature of multimedia loops to understand the levels of nesting, stride patterns, and the location of the parallelism. Desktop/workstation multimedia applications such as streaming video encoding/decoding (MPEG 1/2/4 and Motion JPEG), audio encoding/decoding (ADPCM, G.7xx, MP3, etc.), video conferencing (H.323, H.261, etc.), 3D games, and image processing (JPEG, filtering) typically operate on sub-blocks in a large 1- or 2-dimensional block of data. Audio applications operate on chunks of one-dimensional data samples at a time (for example, the MP3 codec operates on "frames", which are smaller components of the complete audio signal that last a fraction of a second). Image and video applications operate on sub-blocks of two-dimensional data at a time (for example, the DCT algorithm operates on 8x8 pieces of data in a large image such as 1600x1200 pixels).
Such a division of data into sub-blocks results in the data being accessed with different strides at various instances in the algorithm. Fig. 2 depicts a 2-dimensional block of data that is accessed with four different strides: two in the vertical direction and two in the horizontal direction.
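The four strides can be made concrete with a small address-generation sketch. It assumes a row-major image, and the names `subblock_addresses`, `width`, `height`, and `block` are hypothetical; the layout in Fig. 2 may differ in detail.

```python
def subblock_addresses(width, height, block):
    """Yield element offsets of a row-major width x height image,
    visiting it one block x block sub-block at a time."""
    # Four strides (in elements), two per direction. These are assumptions
    # made for illustration, not values taken from the paper.
    col_stride = 1                    # horizontal: next element in a sub-block row
    block_col_stride = block          # horizontal: next sub-block to the right
    row_stride = width                # vertical: next row inside a sub-block
    block_row_stride = block * width  # vertical: next row of sub-blocks
    for by in range(height // block):
        for bx in range(width // block):
            base = by * block_row_stride + bx * block_col_stride
            for y in range(block):
                for x in range(block):
                    yield base + y * row_stride + x * col_stride


addrs = list(subblock_addresses(width=4, height=4, block=2))
print(addrs[:8])  # first two sub-blocks: [0, 1, 4, 5, 2, 3, 6, 7]
```

Even for this tiny 4x4 case the address sequence is not sequential, which is why a single running pointer with one stride is not enough to walk sub-blocked data.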
Source code for the aforementioned algorithms involves the usage of multiple nested loops (commonly 'for' loops in the C language) to process the data streams. Much of the available parallelism in multimedia applications is seen to be DLP that resides at the various levels of nesting. The dimensions of each sub-block for most multimedia algorithms are small (filtering typically uses 3x3, 5x5, or 7x7 sub-blocks, DCT operates on 8x8 sub-blocks, and motion estimation operates on 16x16 sub-blocks), resulting in limited parallelism in the innermost loop [12]. However, the number of sub-blocks themselves is large since the size of the data stream can be on the order of several MB. Consequently, a significant part of the DLP in multimedia applications, as they are currently coded, resides outside the innermost loop.
Existing GPPs with SIMD extensions exploit DLP between independent loop iterations in the innermost loops leading to significant untapped available DLP in multimedia applications. Fig. 3 shows the SIMD C-code implementation of the discrete cosine transform (the DCT is a major component in JPEG image and MPEG video coding) which operates on 8x8 sub-blocks in an image of a given height and width. The second matrix is transposed before doing the computation because accessing the second matrix in column-major order results in a significant amount of overhead. This is particularly true when using SIMD instructions because a SIMD register needs to be packed with an element from different rows (and hence not contiguous). If a SIMD register holds eight elements, then all eight rows of a matrix need to be loaded into the cache and then elements belonging to the same column are packed into the register.
It is possible to eliminate one of the transpose operations (either from the row or the column 1D-DCT) if a transposed version of the DCT coefficients is available. In Fig. 3, there are a total of five nested for-loops for the DCT routine. Current SIMD instructions exploit data level parallelism (DLP) in the innermost for-loop (variable 'm'). The number of iterations would be scaled down according to the width of the available SIMD datapath (currently 64 or 128 bits wide) and the size of each element (8-bit, 16-bit, or 32-bit).

Media and digital signal processing (DSP) applications invoke several address patterns, often as multiple simultaneous sequences [13]. Fig. 4 shows the typical access patterns in media and DSP kernels.
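Returning to the DCT routine, the five-level loop nest described above can be sketched in scalar form as follows. This is an illustration and not the code of Fig. 3: the function name `blockwise_multiply` and the index names `i`, `j`, `k`, and `l` are assumptions (only the innermost variable `m` is named in the text), and the second operand is assumed to be already transposed so that both operands are read along rows.

```python
def blockwise_multiply(image, coef_t, n):
    """Multiply every n x n sub-block of `image` by a coefficient matrix.

    `coef_t` is the coefficient matrix already transposed, so both operands
    are read along rows in the innermost loop.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for i in range(0, height, n):        # loop 1: rows of sub-blocks
        for j in range(0, width, n):     # loop 2: columns of sub-blocks
            for k in range(n):           # loop 3: row inside the sub-block
                for l in range(n):       # loop 4: column inside the sub-block
                    acc = 0
                    for m in range(n):   # loop 5: inner product, the loop SIMD targets
                        acc += image[i + k][j + m] * coef_t[l][m]  # core computation
                    out[i + k][j + l] = acc
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
coef_t = [[1, 1],
          [1, -1]]  # symmetric, so it equals its own transpose
print(blockwise_multiply(image, coef_t, 2)[0])  # [3, -1, 7, -1]
```

Only the multiply and the accumulate on the innermost line are core computation; the index arithmetic, loop tests, and loads and stores around it are the supporting work.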
Overhead/Supporting instructions
The discussion in the previous section points to the need of several instructions to compute addresses and otherwise support the core SIMD computations.
In this section, we analyze the media instruction stream by focusing on the two distinct sets of operations: the true/core computations as required by the algorithm and the overhead/supporting instructions such as address generation, address transformation (data movement and data reorganization such as packing and unpacking), loads/stores, and loop branches. Consider the DCT code in Fig. 3. The true/core computation instructions for the DCT routine are the multiply (of DCT coefficients and data) and the accumulate operations (addition of multiplied values). This is shown in bold in Fig. 3. All the other instructions are denoted as overhead; their sole purpose is to aid in the execution of the true/core computation instructions. Many of them arise due to the programming conventions of general-purpose processors, abstractions and control flow structures used in programming, and the mismatch between how data is used in computations versus the sequence in which data is stored in memory. A similar kind of classification of instructions into access and execute instructions was performed in decoupled access-execute (DAE) processors [14] [15]. In our classification, the overhead component includes loop branches and reduction operations [16] that are specific to multimedia applications (e.g. packing/unpacking, and permute) in addition to the memory access task. The instructions contributing to the overhead are:
• Address generation - considerable processing time is dedicated to performing the address calculations required to access the components of the data structures/arrays, which is sometimes called address arithmetic overhead.
• Address transformation - transforming the physical pattern of data into the logical access sequence (transposing the matrix in Fig. 3, packing/unpacking data elements in SIMD computations, and reorganizing data in other ways).
• Loads and Stores - data is not always available in registers and has to be fetched from memory or stored to memory, the so-called access overhead.
• Branches - performing control transfer (for each of the 5 nested for-loops in the example).

In order to quantify the amount of overhead/supporting instructions in multimedia programs, we evaluated the performance of six of the nine benchmarks listed in Table 1. Jpeg, ijpeg, and decrypt are not used in this experiment because the source code for these three benchmarks includes initialization routines and file I/O. Five of the six benchmarks (all except g711) were mapped in such a way that the SIMD execution units perform every true/core computation. Fig. 6 shows the breakdown of dynamic instructions into various classes (memory, branch, integer, SIMD overhead, and SIMD/true computation). It is seen that the overhead/supporting instructions that are required to assist the SIMD computation (true/core computation) instructions dominate the dynamic instruction stream (75-85%). A significant number of instructions are required for processing the loop branches and computing the strides for accessing the data organized in sub-blocks. The Pentium III processor has more memory references than the Simplescalar based processor because the x86 ISA has fewer logical registers (8 versus 32 in conventional RISC processors).
SIMD throughput and efficiency
In this section, we evaluate the throughput of the SIMD units to understand the impact of the overwhelming number of instructions needed to support the SIMD computations. We define SIMD efficiency as the ratio of the execution cycles ideally necessary for the true/core computation instructions to the overall execution cycles actually consumed. In other words, SIMD efficiency indicates what fraction of the peak throughput of the SIMD units is actually achieved. The actual execution cycles are . This is assuming that there is one multiplier and it is pipelined and the addition/accumulation can take place in parallel. Thus, an 8x8 matrix multiply should take 512 cycles on a machine with one multiplier (in the pure dataflow model), and take 128 cycles on a machine with 4 multipliers (assuming that there are at least 4 adders for the accumulation). If this algorithm were to take 2500 cycles on a real machine with one multiplier, then the efficiency of computation is 20% (512/2500). The important thing to note here is that if efficiency achieved is low, it suggests opportunities for further enhancement.
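The arithmetic in the matrix multiply example can be reproduced with a short sketch. The helper names `ideal_cycles` and `simd_efficiency` are hypothetical, and the model assumes fully pipelined multipliers with accumulation overlapped, as stated above.

```python
def ideal_cycles(num_ops, num_units):
    """Cycles needed if every unit is pipelined and busy each cycle
    (ceiling division, so a partially filled last cycle still counts)."""
    return -(-num_ops // num_units)


def simd_efficiency(ideal, actual):
    """Fraction of peak throughput achieved: ideal cycles over actual cycles."""
    return ideal / actual


n = 8
multiplies = n * n * n                  # 8x8 matrix multiply: 512 multiplications
print(ideal_cycles(multiplies, 1))      # 512 cycles with one multiplier
print(ideal_cycles(multiplies, 4))      # 128 cycles with four multipliers
print(round(100 * simd_efficiency(512, 2500), 1))  # 20.5 (about 20%)
```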
We measure the SIMD efficiency on two platforms, a Pentium III machine and a 2-way Simplescalar simulator, for each of the first six benchmarks described in Table 1. The MMX extensions in the Pentium III processor provide fixed-point SIMD capability with two 64-bit MMX ALUs and one 64-bit MMX multiplier. The SSE extensions provide floating-point SIMD capability. The Simplescalar processor execution core is similarly configured to contain two 64-bit SIMD ALUs and one 64-bit SIMD multiplier. Table 4 shows the execution statistics and SIMD efficiency for each of the benchmarks. The ideal number of execution cycles is computed by identifying the number of required true/core computation operations and the available SIMD execution units (2 ALUs and 1 multiplier in both the processors). processor is 3 cycles, while that of the Simplescalar configuration is 1 cycle. Hence two memory-intensive benchmarks (scale and g711) achieve a better efficiency for the Simplescalar configuration. We also measured similar statistics for the Pentium III and the Simplescalar based processor without SIMD extensions. We found that the execution time is worse than on SIMD enhanced processors, but the efficiency is higher for non-SIMD processors (2.5% - 16.5%). This is because a 64-bit SIMD execution unit counts towards a peak rate of either 4 or 8 computations per cycle (16-bit or 8-bit data), whereas the scalar execution unit counts toward a single computation per cycle. While it is true that SIMD enhancements were not added to improve efficiency of processing but to speed up multimedia programs, our characterization highlights the gap between peak rate and achieved rate for SIMD programs and points to ample opportunities for performance improvement.
Memory access and branch bottlenecks
Memory latency prevents processors from fetching data in a timely fashion to achieve peak throughput. Also, supporting wide issue processors requires the ability to fetch across multiple branches.
In this section, we investigate how memory latency and branch prediction impact the performance of these media kernels and applications. Table 5 shows the IPC with unit cycle memory access (i.e. a perfect L1 cache) and perfect branch prediction for the 2-, 4-, 8-, and 16-way processors with SIMD extensions. It is seen that different programs vary in their sensitivity to memory latency and branch prediction. Scale and g711 benchmarks are memory bound programs and improve significantly due to a unit cycle memory access but show negligible increase in IPC due to perfect branch prediction. Cfa, dct, and mot are benchmarks that operate on sub-blocks in a 2-D structure requiring five levels of loop nesting and benefit the most from perfect branch prediction and the ability to fetch across multiple branches in a single cycle. A unit cycle memory access has negligible performance impact on these three benchmarks.
The remaining four benchmarks (aud, jpeg, ijpeg, and decrypt) benefit equally from both perfect branch prediction and unit cycle memory access. It is evident from this experiment that it is extremely important to provide low latency memory access and excellent branch prediction extending over multiple branches in order to achieve good performance.
5 Hardware Support for Efficient SIMD Processing
Decoupling Computation and Overhead
The characterization of media applications presented in the previous sections showed that supporting or overhead related instructions dominate the instruction stream. Overhead/supporting instructions therefore need to be eliminated, alleviated, or overlapped with the true/core computations for better performance, i.e. the higher the overlap of overhead/supporting instructions, the higher the SIMD efficiency. We exploit the observed characteristics of the media programs and propose to augment GPPs (having SIMD execution units) with specialized hardware to efficiently overlap the overhead/supporting instructions. We refer to this as the MediaBreeze architecture.
• Loads and stores: The same load/store units present in conventional ILP processors are used for this purpose.
• Branch processing: To eliminate branch instruction overhead, MediaBreeze employs zero-overhead branch processing using dedicated hardware loop control and supports up to five levels of loop nesting. All branches related to loop increments (based on indices used for referencing data) are handled by this technique. This is done in many conventional DSP processors such as the Motorola 56000 and TMS320C5x from Texas Instruments [18] .
• Data Station: This is the register-file for the SIMD computation and is implemented as a queue. Dedicated register-files are present in conventional machines for SIMD either as a separate register file (as in AltiVec) or aliased to the floating-point register file (as in MMX).
• Breeze instruction memory and decoder: In order to program/control the hardware units in the MediaBreeze architecture, a special instruction called the Breeze instruction is formulated. The Breeze instruction is a multidimensional vector instruction. The Breeze instruction memory stores these instructions once they enter the processor. Fig. 8 illustrates the structure of the Breeze instruction.
Five loop index counts (bounds) are indicated in the Breeze instruction to support five-level nested loops (in hardware) [18] [42]. None of our benchmarks required more than five nested loops. The MediaBreeze architecture allows for three input data structures/streams and produces one output structure. This was chosen because some media algorithms can benefit from this capability (current SIMD execution units sometimes operate on three input registers to produce one output value). Each data structure/stream has its own dedicated address generation unit to compute the address every clock cycle with the base address specified in the Breeze instruction. Due to the sub-block access pattern in media programs, data is accessed with different strides at various points in the algorithm (as described in section 4.1). With the support for multiple levels of looping and multiple strides, the Breeze instruction is a complex instruction, and decoding such an instruction is a complex process in current RISC processors.
MediaBreeze instead handles the task of decoding of the Breeze Instruction. MediaBreeze has its own instruction memory to hold a Breeze instruction. Two additional 32-bit instructions are also added to the ISA of the general-purpose processor for starting and interrupting the MediaBreeze. These 32-bit instructions (fetched and decoded by the traditional instruction issue logic) indicate the start and the Fig. 8 architecture; however, the effort will be slightly higher to that of compiling for SIMD extensions. In spite of the lack of adequate compiler support for SIMD extensions, it has been clear that SIMD extensions still enhance media application performance.
Multicast: A technique to aid in data transformation
The MediaBreeze uses a technique called Multicast to eliminate the need for transposing data structures, to allow for reordering of the computations, and to increase reuse of data items soon after fetch. Spatial locality can be exploited in the first matrix due to multiple data elements in each cache block, while the second matrix incurs a compulsory miss on each column the first time, assuming that two consecutive rows do not fit in a cache block. In a machine with no SIMD execution units, during each iteration for the second matrix, a new cache line has to be loaded because the data belongs to the same column but to a different cache line. However, for the case of SIMD processing, multiple cache lines need to be loaded and data belonging to the required column needs to be reorganized from a vertical to a horizontal direction (packing). This involves substantial overhead and usually, the second matrix is transposed prior to the computation to eliminate the column-access pattern.
The transposing overhead can be eliminated using the Multicast technique. Instead of using a column-access pattern, a row-order access pattern is used for matrix B, while for matrix A, a single element is multicast to all eight sub-element locations in the SIMD register. Then, instead of doing the eight multiplications to generate the first element C1,1 of the result matrix, all eight multiplications using A1,1 (i.e. the first partial product of each of the result terms in the first row) are performed. The sequences of multiplications in a normal SIMD matrix multiply and a multicast matrix multiply are illustrated in Fig. 9.
After 64 multiplications, all eight result terms of the first row of the result matrix will be simultaneously generated. The algorithm using the multicast technique is always operating on multiple independent output values, while traditional techniques compute one result term at a time. This eliminates the need for transposing the second matrix. It also increases the reuse of items that were loaded, thus improving the cache behavior of the code. The MediaBreeze architecture provides hardware support for multicasting.
This allows the use of cache-friendly algorithms to perform many media algorithms. In this example, broadcast rather than multicast was employed, because one element is transmitted to all eight registers.
However, in several applications such as horizontal/vertical downsampling/upsampling, and filtering, several elements are multicast into the sub-element locations. This is a many-to-many mapping as opposed to a one-to-many mapping, hence the name multicast. The multicast technique is a superset of existing data reorganization instructions in current SIMD extensions such as AltiVec's splat [2] and MDMX's packed accumulators [6] [16].
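The reordering behind the matrix multiply example can be sketched in scalar Python, with a list standing in for a SIMD register. The function names and the list-based "register" are illustrative assumptions; the sketch models only the order of the multiplications, not the MediaBreeze hardware.

```python
def matmul_inner_product(a, b):
    """Conventional order: one result element at a time (row of A times column of B)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]  # B is walked down a column
    return c


def matmul_multicast(a, b):
    """Multicast order: replicate one element of A and multiply it with a whole row of B."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            splat = [a[i][k]] * n  # a[i][k] copied into every sub-element slot
            row_b = b[k]           # contiguous row of B, so no transpose is needed
            # Models one SIMD multiply-accumulate: n partial products for row i of C.
            c[i] = [acc + s * x for acc, s, x in zip(c[i], splat, row_b)]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_multicast(a, b))  # [[19, 22], [43, 50]]
```

Both orderings perform the same multiplications and produce the same result; the multicast order simply reads B along rows and updates a whole row of C at once.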
If the dimension of the matrices to be multiplied is large, then the multicast method needs temporary registers or an accumulator to store the accumulated results. However, multimedia applications operate on sub-blocks in huge matrices as opposed to processing the entire matrix as a whole. A SIMD parallelism of 8 or 16 is quite adequate to capture most media sub-block rows/columns. Other common operations where multicast is extremely useful include 1-D and 2-D filtering, and convolution. For example, when using MMX for implementing a finite impulse response (FIR) filter, multiple copies of the filter coefficients are needed (equal to the SIMD parallelism) to reduce considerable overhead due to misalignment of coefficient data.
Example encoding using the Breeze instruction
The Breeze instruction is a densely encoded instruction and hence most media algorithms can be processed in just a few Breeze instructions. Fig. 10 shows the pseudo-code for the implementation of the Breeze instruction. Given a start address for each of the data streams, each address is incremented based on the stride and the loop level during each cycle.

[Figure residue: SIMD operand layout with A1,1 replicated in every sub-element location, paired with the row B1,1, B1,2, ... of matrix B.]

• one accumulation of SIMD result and one SIMD reduction operation
• four SIMD data reorganization (pack/unpack, permute, etc) operations
• shifting & saturation of SIMD results
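The per-cycle address update mentioned above (a start address advanced according to the stride and the loop level) can be modeled in software as follows. This is a sketch under assumed semantics, not the pseudo-code of Fig. 10: the function `stream_addresses`, its parameters, and the convention that each stride is the delta applied when the loop at that level advances are all assumptions made for illustration.

```python
def stream_addresses(start, bounds, strides):
    """Yield one address per cycle for a loop nest driven by hardware-style counters.

    bounds[0] is the outermost loop count and bounds[-1] the innermost.
    strides[level] is added to the address when the counter at `level`
    advances (all deeper counters have just wrapped to zero).
    Assumption: this stride convention is illustrative only.
    """
    counters = [0] * len(bounds)
    addr = start
    total = 1
    for b in bounds:
        total *= b
    for _ in range(total):
        yield addr
        # Find the deepest counter that can still advance; deeper ones wrap.
        level = len(bounds) - 1
        while level >= 0 and counters[level] == bounds[level] - 1:
            counters[level] = 0
            level -= 1
        if level < 0:
            break  # every counter wrapped: the nest is finished
        counters[level] += 1
        addr += strides[level]


# 4x4 row-major image walked as 2x2 sub-blocks: four loop levels, no branches.
seq = list(stream_addresses(start=0, bounds=[2, 2, 2, 2], strides=[1, -3, 3, 1]))
print(seq)  # [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

The same function accepts five bounds and five strides, matching the five loop levels supported by the Breeze instruction; one such generator per data stream would model the dedicated address generation units.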
Performance Evaluation and Results
To measure the impact of the MediaBreeze architecture, we modified the PISA version of Simplescalar-3.0 (sim-outorder) to simulate Breeze instructions using instruction annotations. We use the same SIMD execution units' configuration as in a Pentium III processor (two 64-bit SIMD ALUs and one 64-bit SIMD multiplier). The memory system for the MediaBreeze architecture is modified to allow for cache miss stalls and memory conflicts (i.e., the SIMD pipeline stalls in the event of a cache miss) since applications that have the least amount of SIMD instructions (i.e., it is the superscalar pipeline that accounts for a bulk of the execution time rather than the MediaBreeze pipeline) and a 2-way MediaBreeze architecture is only slightly faster than a 2-way SIMD processor. On the other hand for the remaining three benchmarks (aud, jpeg, and ijpeg), a 2-way MediaBreeze architecture is significantly faster than a 2-way SIMD processor.
The MediaBreeze pipeline is susceptible to memory latencies because it operates in-order. Thus MediaBreeze is unable to achieve maximum SIMD efficiency on three of the four kernels (cfa, dct, and scale) in spite of them being mapped completely to one or two Breeze instructions. To reduce the impact of memory latencies on the MediaBreeze architecture, we introduced a prefetch engine to load future data into the L1 cache. Since the access pattern of each data stream is known in advance based on the strides, the prefetch engine does not load any data that is not going to be used. The regularity of the media access patterns avoids the superfluous fetches commonly encountered in many prefetching environments. The prefetch engine 'slips' ahead of the loads for computation and the computation itself to gather data into the L1 cache. Table 6 shows the speedup of the MediaBreeze architecture with prefetching for the 2-way and 4-way configurations (prefetching was also incorporated into the baseline

The geometric mean of the speedup of the 2-way MediaBreeze processor over a 2-way SIMD processor for the five applications (not including the kernels cfa, dct, mot, and scale) is 1.73, while that of a 4-way SIMD processor over a 2-way SIMD processor is 1.59. Therefore, on average, a 2-way GPP with SIMD extensions augmented with the MediaBreeze hardware achieves a performance slightly better than a 4-way superscalar SIMD processor on media applications. A similar trend is observed for the case of a 4-way GPP with SIMD extensions augmented with the MediaBreeze hardware being slightly superior to an 8-way superscalar SIMD processor.
Since the Breeze instruction is densely encoded, few Breeze instructions are needed for any media-processing algorithm. The number of dynamic instructions that need to be fetched and decoded is shrunk tremendously (as shown in Fig. 13 ), leading to a reduced use of the instruction fetch, decode, and issue logic in a superscalar processor. The instruction fetch and issue logic are a significant consumer of power in speculative out-of-order processors. Once a Breeze instruction is interpreted, the instruction 
Hardware cost of the MediaBreeze Architecture
Implementation methodology
To estimate the area, power, and timing requirements of the MediaBreeze architecture, we developed VHDL models for the various components. Using Synopsys synthesis tools [21], we used a cell-based methodology to target the VHDL models to two ASIC cell libraries from LSI Logic [22] [23]. Table 7 lists the libraries and technologies used for evaluating the implementation cost.

Table 7. Cell-based libraries (LSI Logic) used in synthesis

| Library name | Description |
| --- | --- |
| lcbg12-p (G12-p) | A 0.18-micron L-drawn (0.13-micron L-effective) CMOS process. Highest performance solution at 1.8 V with high drive cells optimized for long interconnects associated with large designs. |
| lcbg11-p (G11-p) | A 0.25-micron L-drawn (0.18-micron L-effective) CMOS process. Highest performance solution at 2.5 V. |

The Synopsys synthesis tools estimate area, power, and timing of circuits based on information provided in the ASIC technology library. The ASIC technology library provides four kinds of information.
• Structural information. This describes each cell's connectivity to the outside world, including cell, bus, and pin descriptions.
• Functional information. This describes the logical function of every output pin of every cell so that the synthesis tool can map the logic of a design to the actual ASIC technology.
• Timing information. This describes the parameters for pin-to-pin timing relationships and delay calculation for each cell in the library.
• Environmental information. This describes the manufacturing process, operating temperature, supply voltage variations, and design layout. The design layout includes wire load models that estimate the effect of wire length on design performance. Wire load modeling estimates the effect of wire length and fanout on resistance, capacitance, and area of nets.
We use the default wire load models provided by LSI Logic's ASIC libraries. The Synopsys synthesis tools compute timing information based on the cells in the design and their corresponding parameters defined in the ASIC technology library. The area information provided by the synthesis tools is prior to layout and is computed based on the wire load models of the associated cells in the design.
Average power consumption is measured based on the switching activity of the nets in the design. In our experiments, the switching activity factor originates from the RTL models, as the tool gathers this information from simulation. The area, power, and timing estimates are obtained after performing maximum optimizations for performance in the synthesis tools. The hardware cost results obtained by this technique are only a first-order approximation, limited by the accuracy of the synthesis tools and cell-based libraries. The interested reader is referred to [21] for further information regarding the capabilities and limitations of the synthesis tools.
Hardware implementation of MediaBreeze units
Address generation
The MediaBreeze architecture supports three input and one output data structures/streams. Each of the four data streams has a dedicated address generation hardware unit. 
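As an illustration of what one such address generation unit computes, the following Python sketch models a single stream as a base address plus one stride per loop level, in the spirit of the four-stride sub-block access of Fig. 2. The function name, the element-granularity strides, and the 8x8 example layout are assumptions made for illustration and are not the hardware interface of the paper.

```python
from itertools import product


def stream_addresses(base, counts, strides):
    """Yield the addresses touched by one data stream.

    counts[i]  : trip count of loop level i (outermost first)
    strides[i] : address step per iteration of loop level i
    Both are illustrative parameters, not fields of the Breeze instruction.
    """
    if len(counts) != len(strides):
        raise ValueError("one stride is needed per loop level")
    for idx in product(*(range(c) for c in counts)):
        yield base + sum(i * s for i, s in zip(idx, strides))


# Hypothetical example: walk an 8x8 row-major array as 4x4 sub-blocks.
WIDTH = 8
counts = (2, 2, 4, 4)               # block row, block column, row in block, column in block
strides = (4 * WIDTH, 4, WIDTH, 1)  # two strides across sub-blocks, two within a sub-block
addrs = list(stream_addresses(0, counts, strides))
print(addrs[:16])  # addresses of the first sub-block
```

Because the whole sequence follows from the counts and strides alone, no address arithmetic instructions need to appear in the instruction stream for this access pattern.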
Looping
The MediaBreeze architecture incorporates five levels of loop nesting in hardware to eliminate branch instruction overhead for loop increments. A similar mechanism was commercially implemented in the TI ASC [24] (two levels of do-loop nesting in addition to a self-increment loop).
Conventional DSP processors such as the Motorola 56000 and the TMS320C5x from TI also use such a technique for one or more levels of loop nesting. 
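A rough software model of hardware loop nesting is a set of counters that advance like an odometer, so that no compare-and-branch instruction is needed per iteration. The sketch below is only such a model: the class name, method names, and trip counts are hypothetical, and it does not describe the actual circuit.

```python
class LoopCounters:
    """Software model of nested loop counters (outermost level first)."""

    def __init__(self, bounds):
        if not bounds or any(b <= 0 for b in bounds):
            raise ValueError("every loop level needs a positive trip count")
        self.bounds = list(bounds)
        self.index = [0] * len(bounds)
        self.done = False

    def step(self):
        """Advance by one innermost iteration, carrying into outer levels."""
        for level in reversed(range(len(self.bounds))):
            self.index[level] += 1
            if self.index[level] < self.bounds[level]:
                return
            self.index[level] = 0  # this level wrapped; carry outward
        self.done = True           # the outermost level wrapped


def iterate(bounds):
    """Yield every index tuple of the loop nest in execution order."""
    counters = LoopCounters(bounds)
    while not counters.done:
        yield tuple(counters.index)
        counters.step()


# Five levels, matching the nesting depth described above; the bounds are made up.
trace = list(iterate((2, 1, 3, 1, 2)))
print(len(trace), trace[0], trace[-1])
```

Levels that an algorithm does not need can be given a trip count of 1, which is how the example uses only three of its five levels.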
Breeze instruction memory
The Breeze instruction memory stores the Breeze instruction once it enters the processor. We do not estimate the cost of this storage because the ASIC libraries are not targeted for memory cells. However, the area, power, and timing estimates of the Breeze instruction memory are similar to an SRAM structure. One Breeze instruction occupies 120 bytes. The Breeze instruction memory holds one or more Breeze instructions.
Existing hardware units
The remaining hardware units that are required for the operation of the MediaBreeze architecture are the SIMD computation unit, data reorganization, load/store units, and data station. These hardware units are already present in commodity SIMD GPPs. However, the Breeze instruction decoder controls the operation of these units as opposed to the conventional control path. This mandates an extra multiplexer to select between control from the conventional control path and control from the Breeze instruction decoder. We do not model any of the existing hardware units.
Area, power, and timing results
Table 8 shows the composite estimates of timing, area, and power consumption for the hardware looping and address generation circuitry when implemented using the cell-based methodology. The power estimates in Table 8 correspond to a clock frequency of 1 GHz. The hardware cost of commercial SIMD implementations [25] [26] is also shown in Table 8.
[26] . Thus, the increase in area due to the MediaBreeze units for SIMD-related hardware is less than 10% and the overall increase in chip area is less than 0.3%.
with speeds over 1 GHz typically consume a power ranging from 50 W to 150 W and MediaBreeze hardware increases power by less than 1%. We believe that the overall energy consumption of the MediaBreeze architecture would be less than that of a superscalar processor with SIMD extensions because the Breeze instruction reduces the total dynamic instruction count (by 0.2% to 40% in our media applications, not including kernels). The instruction fetch and issue logic are expected to consume more than 50% of the total execution power (not including the clock power) in future speculative processors [27]. Once a Breeze instruction is interpreted, the instruction fetch, decode, and issue logic in the superscalar processor can be shut down to save power.
Timing
Pipelining the hardware looping logic into two stages (in a 0.18-micron technology) would allow for incorporating it into current high-speed superscalar out-of-order processors with over 1 GHz clock frequency. Similarly the address generation stage needs to be divided into three pipe stages to achieve frequencies greater than 1 GHz. The timing results show that incorporating the MediaBreeze hardware into a high-speed processor does not elongate the critical path of the processor (after appropriate pipelining). The Breeze instruction decoder multiplexers that control the hardware units introduce an extra gate delay in the pipeline. However, using a cell-based methodology gives a conservative estimate while custom design (typically used in commercial GPPs) would allow for greater clock frequencies for the added MediaBreeze hardware. In spite of adding five pipeline stages, the overall pipeline depth of a processor is not affected because the looping and address generation stages bypass the conventional fetch, decode and issue pipeline stages.
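The pipelining argument comes down to dividing the combinational delay of a block by the usable part of the clock period. The sketch below shows that arithmetic only; the function name, the delay values, and the register overhead parameter are hypothetical placeholders and are not the Table 8 estimates.

```python
import math


def stages_needed(logic_delay_ns, clock_ghz, register_overhead_ns=0.0):
    """Minimum number of pipeline stages so that each stage fits in one clock period."""
    period_ns = 1.0 / clock_ghz
    usable_ns = period_ns - register_overhead_ns
    if usable_ns <= 0:
        raise ValueError("register overhead leaves no time for logic")
    return max(1, math.ceil(logic_delay_ns / usable_ns))


# Hypothetical delays, chosen only to illustrate the calculation at 1 GHz.
looping_stages = stages_needed(1.8, clock_ghz=1.0)
address_stages = stages_needed(2.7, clock_ghz=1.0)
print(looping_stages, address_stages, looping_stages + address_stages)
```

A faster target clock or a larger per-stage register overhead shrinks the usable period and therefore raises the stage count for the same logic.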
Related Work
The proposed solution combines the advantages of SIMD, vector, DAE, and DSP processors. The DAE concept present in the IBM System 360/370, CDC 6600 [30], CDC 7600, CRAY-1, CSPI MAP-200, SDP [31], PIPE [32], SMA [19], WM [33], DS [34], etc., demonstrated the potential of decoupling memory accesses and computations [14] [15]. There has also been research in specialized access processors and address generation coprocessors [13] [35]. The concept of embedding loops in hardware was implemented commercially in the TI ASC [24] (do-loop in this case). The SMA architecture [19] provided similar flexibility in accessing matrices. This concept was seen to be successful in all these machines as well as in many DSP processors [18]. Typically, all these techniques were successful only for a limited class of applications. This work extends beyond past work to create an integrated environment in which both media and general-purpose workloads can excel.
Previous media characterizations have concentrated on measuring the performance benefits of
There are a few research efforts in identifying the bottlenecks in exploiting sub-word parallelism using SIMD extensions. Fridman discusses approaches to data alignment for subword parallelism in the TigerSharc processor using four sub-word MAC units in [28] . Thakkar and Huff discuss the need for data alignment for SSE extensions in [29] . We perform a comprehensive detection of bottlenecks in SIMD-style extensions.
The proposed Breeze instruction captures all the overhead/supporting operations in addition to capturing the DLP in the true/core computation and has some similarities to the vector parameter file in the TI ASC machine. Compiling for SIMD extensions is still in its infancy [50]. Corbal et al. [36] proposed to exploit DLP in two dimensions instead of one dimension as in current SIMD extensions. A 20% performance improvement was achieved using their matrix-oriented architecture named MOM. However, the overhead factor is not significantly reduced. Vassiliadis et al. [37] [38] have concurrently proposed the Complex Streamed Instruction set (CSI) that can exploit two levels of looping. Though they are able to eliminate some overhead because each of their complex instructions can eliminate two loops, our solution is more comprehensive. Lee and Stoodley [39] proposed simple vector microprocessors for media applications, but they used simple in-order processors for scalar processing and vectors for media processing. While we commend the approach, such an architecture cannot achieve good performance over several application domains because the scalar processor is in-order. Ranganathan et al. [5] observe that out-of-order execution is beneficial to media applications. There are several components in many multimedia applications that cannot exploit DLP, but require good branch prediction and speculation to exploit ILP, and hence we also favor the use of the out-of-order processor. It is important to have a general-purpose processor achieve sustained performance on different domains of workloads.
Rixner et al. [40] developed the Imagine architecture for bandwidth-efficient media processing. This architecture is based on clusters of ALUs processing large data streams and is built as a co-processor for a high-end multimedia system. The methodology adopted is to add computation units, while our approach is to improve the utilization of the existing computation units by reducing the overhead.
Another related effort is the reconfigurable PipeRench coprocessor [41]. The Burroughs Scientific Processor (BSP) [42] was a pure-SIMD array processor that had special-purpose hardware (called alignment networks) for packing and unpacking data. In addition, it had powerful SIMD instructions, many of which are used in current SIMD extensions. Vermuelen et al. [20] described how DCT, Reed-Solomon coding, and other similar media-oriented operations could be enhanced with a hardware accelerator that works in conjunction with a GPP. However, the accelerator has to be designed for each algorithm. Retargeting the accelerator to another algorithm incurs significant effort, while, in our case, only Breeze instruction encoding needs to be performed.
Conclusion
This paper analyzes multimedia workloads and proposes architectural enhancements for improving their performance on general-purpose processors. Based on an investigation of loop structures and access patterns in multimedia algorithms, we find that a significant amount of parallelism lies outside the innermost loops (between loop levels 3 and 6, as indicated in Table 3), and it is difficult for SIMD units to exploit this parallelism. The characteristics preventing SIMD computation units from computing at their peak rate are analyzed. The major findings of the bottleneck analysis are:
• Approximately 75-85% of instructions in the dynamic instruction stream of media workloads are not performing true/core computations. They are performing address generation, data rearrangement, loop branches, and loads/stores.
• The efficiency of the SIMD computation units is very low because of the overhead/supporting instructions. Our measurements on a Pentium III processor with a variety of media kernels and applications illustrate SIMD efficiency ranging only from 1% to 12%.
• Increasing the number of SIMD execution units does not improve performance, leading us to conclude that resources for overhead/supporting instructions need to be scaled. We observe that a significant increase in scalar resources is required to increase the SIMD efficiency using conventional ILP techniques. An 8-way or 16-way integer processor is necessary to process the overhead instructions for the SIMD width in current processors.
The paper then addresses the issue of executing the overhead instructions efficiently. Many recent enhancements such as increasing the SIMD width have targeted exploiting additional parallelism in the true/core computation while the MediaBreeze architecture proposed in the paper focuses on the overhead instructions and the ability of the hardware to eliminate, alleviate, and overlap the overhead.
MediaBreeze exploits the nature of the overhead instructions to devise simple hardware by combining the advantages of SIMD, vector, DAE, and DSP processors. The major findings are:
• Eliminating and reducing the overhead using specialized hardware that works in conjunction with a state-of-the-art superscalar processor and SIMD extensions can dramatically improve the performance of media workloads without deteriorating the performance of general-purpose workloads. On multimedia kernels, we find that a 2-way processor with SIMD extensions augmented with hardware support significantly outperforms a 16-way processor with SIMD extensions.
• On multimedia applications, a 2-way processor with SIMD extensions with the supporting MediaBreeze hardware outperforms a 4-way superscalar processor with SIMD extensions. Similarly a 4-way processor with SIMD extensions added with MediaBreeze hardware is superior to an 8-way superscalar with SIMD extensions.
• The cost of adding the MediaBreeze hardware to a SIMD GPP is negligible compared to the performance improvements. Using ASIC synthesis tools and libraries, we find that the MediaBreeze hardware units occupy less than 0.3% of the overall processor area, consume less than 1% of the total processor power, and, with appropriate pipelining, do not elongate the critical path of a GPP.
Our analysis shows that increasing the number of SIMD execution units to get more parallelism is not the right approach. But if any media processor designer decides to exploit more parallelism just by scaling the current architectures, they should scale the non-SIMD part much more aggressively than the SIMD part.
Figures
Fig. 1. (a) IPC with both the SIMD and non-SIMD resources scaled, (b) IPC with non-SIMD resources scaled, but SIMD resources are constant (same as 2-way processor configuration), and (c) performance improvement of (a) over (b)
Fig. 2. A 2-D data structure in which sub-blocks of data are processed. The data elements surrounded by the dotted ellipse form one sub-block. Each sub-block requires two strides (one each along the rows and columns of the sub-block, namely stride-4 and stride-3). Two additional strides (stride-2 and stride-1) are required for accessing different sub-blocks in the horizontal and vertical directions
Table 1. Description of the multimedia benchmarks
Table 2. Processor and memory configurations
Table 3. Summary of key media algorithms and the required nested loops along with their primitive addressing sequences
Table 4. Execution statistics and efficiency of media programs
Table 5. Performance (IPC) with unit cycle memory accesses and perfect branch prediction
Table 6. Performance of the MediaBreeze architecture with prefetching
Table 7. Cell-based libraries (LSI Logic) used in synthesis
Table 8. Timing, area, and power estimates for hardware looping and address generation (the Breeze instruction decoder was merged into the looping and address generation)
