Abstract
Introduction
Media applications will become one of the most demanding types of workloads in the near future. Standards such as MPEG-4 or MPEG-7 will be eminently multithreaded, and a commodity PC will face the need to execute several media streams (encoders, decoders, 3D processing) concurrently. Therefore, the design of future media architectures will have to take into account both the tight real-time requirements of mono-threaded applications and the need to provide high throughput to run multiple tasks simultaneously.
It seems unlikely that current generation microprocessors will be able to meet these requirements in the future if we only scale them in the traditional way (i.e., by adding more functional units and by increasing issue width). We believe that future media workloads intrinsically require some form of on-chip parallel processing if we want to succeed in delivering the required performance. Therefore, architectures able to exploit thread-level parallelism, such as Simultaneous Multithreaded Processors (SMT) or Chip Multiprocessors (CMP), appear as the architectures of choice to provide the desired throughput. Of course, in order to deal with the high performance requirements of the kernels of such media workloads, we will still need special-purpose instructions that can exploit the available data-level parallelism. Thus, μ-SIMD processing is a suitable way of improving uni-threaded performance for media kernels with a modest investment in hardware.
This work has been supported by the Ministry of Education of Spain under contract CICYT TIC98-0511OCO2-01 and by the CEPBA. We would like to thank the reviewers of this paper for their useful comments.
This paper advocates the combination of SMT and μ-SIMD extensions to achieve the performance required for future media workloads. We will evaluate the performance of two different aggressive SMT processors: one with conventional μ-SIMD extensions (such as MMX) and one with longer streaming vector μ-SIMD extensions (our MOM ISA extension).
We will demonstrate that SMT execution and streaming vector μ-SIMD instructions combine very well since:
The SMT execution allows mixing scalar and streaming μ-SIMD instructions in an efficient way
The latency tolerance properties of the streaming μ-SIMD ISA enable the use of decoupled cache hierarchies that avoid the typical cache degradation experienced by multiple threads when running on an SMT
Compared to a baseline consisting of a plain out-of-order superscalar with multimedia extensions, SMT+MMX yields a 2.1X speedup and SMT+MOM achieves a 3.3X speedup.
Multimedia workload trends
In this section we describe the evolution of media codes and analyze their main characteristics. We discuss some common misconceptions about media workloads, which justify why a combined SMT plus SIMD architecture appears as a suitable alternative for running the next generation of media workloads.
From kernels to real programs
When studying multimedia algorithms, or, better, multimedia kernels, most authors agree that their most relevant characteristics are the following [
However, studying kernels in isolation can be very misleading, since what is true for basic media kernels does not necessarily apply to complete programs.
A typical media program is composed of a set of kernels that process data in a stream-like fashion, plus protocol-related overhead (table look-ups, header processing, non-vectorizable coding) very similar to what we can find in a typical SPECint benchmark. Since the kernels are repeatedly invoked on sets of related data, their behavior as a group can be very different from their isolated behavior. For example, while at the kernel level one usually encounters low-locality, stream-like memory patterns, there is usually high locality at the algorithm level.
Researchers creating benchmarks from raw media kernels usually wrap them in long-running loops so that measurements can be easily taken. However, repeating a kernel many times on different data exacerbates its stream-like behavior, which may not be a realistic scenario. As reported in [11], the characteristics of complete media programs fall somewhere in between raw DLP media kernels and conventional non-numerical applications.
As a result, media-oriented architecture designers should be aware of the following traits that characterize full multimedia programs:
It follows from this set of characteristics that architectures strictly focused on exploiting the data-level parallelism available at kernel level will fail to deliver the expected performance due to Amdahl's law.
From real programs to future media applications
Future media workloads are not expected to change radically the basic multimedia algorithms. Rather, given the large number of different media sources that can be sent over the Internet, the tendency is to focus on joining heterogeneous media streams into unified protocols. These new applications will pay special attention to managing and maximizing the efficiency of compression and/or encryption by extracting uncorrelated media contents from the same source and applying the best media processing for each of them. Additionally, media programs will no longer be monolithic applications. Interactivity will be applied across the board so that different input/output media streams will execute and communicate among themselves concurrently.
The best example of this tendency is the MPEG-4 standard [12]. MPEG-4 is a new ISO/IEC protocol that uses an object-based approach to describe and compose interactive audiovisual scenes. Uncorrelated objects are coded, encrypted and transmitted separately in order to be composed again at reception. These objects may include digital video (MPEG-2), still image, audio, speech and even audio synthesis or 3D graphics. Several powerful transformations can be performed over every object in order to compose each of them into the same audiovisual scene thanks to a higher layer of the protocol. Therefore, future media applications will add several characteristics not contemplated by the current research literature. Dealing with multiple concurrent media streams means that we have high levels of coarse-grain thread-level parallelism (TLP) and that throughput is now also an issue (together with the uni-threaded real-time requirement of each source). Additionally, an extra layer of protocol means more hard-to-vectorize overhead that may further counterbalance the DLP-only nature of multimedia kernels.
Proposed Architecture
The need for coarse-grain thread-level parallelism comes at a very appropriate moment. Coincidentally, several vendors targeting commercial workloads (OLTP, Web serving and databases in general) have started to focus on new parallel architectures prepared to deal with the abundant, explicit, heterogeneous thread-level parallelism that characterizes this kind of program.
Looking at recent announcements, the two most important architectural alternatives are CMPs, or Chip MultiProcessors (Power-4 [13], Piranha [14]), and SMT processors (Alpha 21464 [15]). The first alternative, a CMP, is based on joining together several simple processors on a single die, communicating through a conventional cache hierarchy. The second alternative is based on executing independent flows of execution (threads) concurrently on a (typically) highly aggressive superscalar processor.
Our claim is that this kind of architecture is also appropriate for future media applications. As with OLTP-like applications, throughput is the major concern in workloads involving several concurrent media sources (once real-time requirements have been met, of course), and both CMP and SMT architectures are good alternatives to meet these throughput requirements. Whether one alternative is better than the other is still a matter of controversy: SMT allows a better usage of the available resources, while CMP does not have the traditional implementation problems of aggressive out-of-order architectures. From the point of view of overall performance, we believe that SMT processors are especially well suited to the characteristics of media workloads due to their ability to provide moderate performance even in serial fragments of code or with a low number of threads (minimizing the impact of Amdahl's Law). In this section we propose an SMT processor with smart media ISA extensions as the architecture of choice for future media desktop systems.
Baseline Processor
Our SMT processor is built around a common out-of-order superscalar processor, as proposed in [16]. As shown in figure 2, our proposed architecture closely resembles an 8-way version of a MIPS R10000, able to fetch up to 8 instructions per cycle. Decoded and renamed instructions are distributed by the dispatch logic to the appropriate instruction queue, which can read from its own dedicated register file. Instructions within every queue may issue out of order, and a graduation window is in charge of retiring instructions in order to maintain the appearance of sequential execution.
SMT extension
The basic superscalar architecture has been enhanced following [16, 17] to support Simultaneous Multithreading (SMT). In order for the processor to be able to execute multiple threads concurrently, minor changes are needed in three stages of the pipeline: fetch, decode and commit. The fetch engine is able to select up to two groups of 4 instructions per cycle out of the pool of available threads (provided they are not stalled on an I-cache miss or a branch misprediction). For the initial evaluations, the fetch selection strategy is a classic round-robin policy.
As proposed in [16], all threads share a common register pool. The decode engine is able to rename instructions from different threads using a per-thread renaming table and a shared common free register pool. Inside the execution queues no additional logic is required to handle instructions from different threads, as renaming provides an easy mechanism to avoid false dependences. Some additional logic is required in the graduation window in order to allow per-thread retirement, as well as a mechanism to perform a per-thread instruction flush in case of misspeculation.
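The round-robin fetch selection described above can be sketched as follows. This is a minimal illustration, not the simulator's code: the names `ThreadState` and `round_robin_fetch` are hypothetical, and the sketch assumes that each selected thread supplies exactly one group of 4 instructions (the text does not say whether both groups may come from the same thread).

```python
from dataclasses import dataclass

GROUPS_PER_CYCLE = 2  # up to two fetch groups per cycle (from the text)
GROUP_SIZE = 4        # each group holds 4 instructions (from the text)


@dataclass
class ThreadState:
    tid: int
    stalled: bool = False  # waiting on an I-cache miss or a branch misprediction


def round_robin_fetch(threads, start):
    """Pick up to GROUPS_PER_CYCLE non-stalled threads, scanning from `start`.

    Returns (selected thread ids, index where the next cycle should start).
    Assumption: one group per selected thread.
    """
    n = len(threads)
    selected = []
    next_start = start
    for offset in range(n):
        idx = (start + offset) % n
        if threads[idx].stalled:
            continue  # stalled threads are skipped this cycle
        selected.append(threads[idx].tid)
        next_start = (idx + 1) % n  # resume after the last thread served
        if len(selected) == GROUPS_PER_CYCLE:
            break
    return selected, next_start


if __name__ == "__main__":
    pool = [ThreadState(0), ThreadState(1, stalled=True), ThreadState(2), ThreadState(3)]
    chosen, nxt = round_robin_fetch(pool, start=0)
    print(chosen, nxt)
```

Rotating the starting index after the last thread served keeps the policy fair without any per-thread bookkeeping, which is what makes round-robin a natural baseline.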
SIMD Extensions
In spite of the explicit thread-level parallelism available in future media applications, specific architecture innovations are still needed in order to fulfill the real-time requirements of single media streams. Simultaneous multithreading might provide good overall computational throughput but, unfortunately, cannot guarantee that, for instance, the frame rate constraints of an MPEG-2 video stream are met. As discussed in previous sections, we believe that new μ-SIMD extensions are necessary to meet the single-thread performance requirements of highly demanding multimedia kernels.
We have enhanced our basic SMT core with a multimedia instruction queue, its corresponding SIMD register file and two independent media functional units. Two different sets of multimedia extensions will be evaluated: a μ-SIMD instruction set that resembles the Intel SSE [9] extension and our own streaming SIMD instruction set, named MOM [11]. Despite differences in their instruction semantics, both use a similar overall architecture.
For the MMX-like instruction set, we have implemented an approximation of the SSE [9] integer opcodes with 67 instructions and 32 logical registers (as opposed to 8). We have added some extra features, such as new reduction operations.
In order to improve efficiency in reduction operations, we have also included 2 logical packed accumulators of 192 bits. These accumulators allow performing reduction operations over a whole μ-SIMD stream using a single packed accumulator with high efficiency. Finally, our streaming μ-SIMD architecture has one stream length register (renamed through the integer register pool) that determines the actual length of each stream register (out of 16). The stride feature is very powerful for multimedia (especially for image/video processing), as it allows working over small sparse matrices of data.
Architectural Parameters
Our basic processor configuration is able to issue up to 4 integer instructions, up to 4 memory instructions (either loads or stores) and up to 4 floating-point instructions per cycle. Additionally, the SMT+MMX processor is able to issue up to two different MMX-like instructions per cycle. On the other hand, our SMT+MOM processor has a single media functional unit of width 2 (that is, we have two parallel vector pipes so that up to two μ-SIMD sub-instructions from the same stream can be executed every cycle). As a result, in contrast with the conventional MMX-like version, the SMT+MOM processor only requires an issue width of 1 for the SIMD queue.
Our SMT processor simulator contains a highly detailed memory hierarchy model, where both L1 and L2 cache levels are located on-chip (as in the Alpha 21364 [18]). The L1 cache is a 32 KB, direct-mapped, write-through cache with 32-byte lines, interleaved among 8 memory banks. The I-cache is a 64 KB, 2-way set-associative cache with 32-byte lines, interleaved among 4 memory banks. The L2 cache is a 1 MB, 2-way set-associative, write-back cache with 128-byte lines. Both L1 and L2 levels of cache have 8 MSHRs and an 8-deep coalescing write buffer with a selective flush policy. L1 and I1 have one cycle of latency, while the L2 cache latency is 12 cycles. We have modeled a 128 MB Direct Rambus main memory system which contains a DR-DRAM controller driving 8 Rambus chips and delivering up to 3.2 GB/s over a 128-bit wide, bi-directional 200 MHz main bus (feeding an 800 MHz processor).
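As a quick sanity check of the main memory figures quoted above, the peak bus bandwidth can be recomputed from the bus width and clock. The sketch below is only back-of-the-envelope arithmetic: the function name is hypothetical, it assumes one transfer per bus clock, and it ignores DRAM access latency and controller overhead.

```python
BUS_WIDTH_BITS = 128          # main bus width (from the text)
BUS_CLOCK_HZ = 200_000_000    # 200 MHz main bus (from the text)
CPU_CLOCK_HZ = 800_000_000    # 800 MHz processor (from the text)
L2_LINE_BYTES = 128           # L2 line size (from the text)


def peak_bandwidth_bytes_per_s(width_bits, clock_hz, transfers_per_clock=1):
    """Peak bus bandwidth; one transfer per clock is an assumption."""
    return (width_bits // 8) * clock_hz * transfers_per_clock


peak = peak_bandwidth_bytes_per_s(BUS_WIDTH_BITS, BUS_CLOCK_HZ)
bytes_per_cpu_cycle = peak // CPU_CLOCK_HZ
# CPU cycles needed to stream one L2 line at peak rate (latency not included).
l2_line_transfer_cycles = L2_LINE_BYTES // bytes_per_cpu_cycle

if __name__ == "__main__":
    print(f"peak bandwidth: {peak / 1e9:.1f} GB/s")
    print(f"bytes per CPU cycle: {bytes_per_cpu_cycle}")
    print(f"CPU cycles per L2 line transfer: {l2_line_transfer_cycles}")
```

Under these assumptions the 128-bit, 200 MHz bus matches the quoted 3.2 GB/s, which corresponds to 4 bytes per processor cycle at 800 MHz.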
We have run preliminary simulations in order to determine the number of physical registers and the window sizes necessary to achieve reasonable (near saturation) processor performance for 1, 2, 4 and 8 threads. The results can be seen in table 1. Note that the size of the stream μ-SIMD register file can be up to 8 times the size of the MMX register file. However, as already seen in [11], organization in lanes and interleaving of the different elements of the vector register into banks help to radically decrease the overall area without any impact on performance. Further discussion of the impact of the large number of registers on cycle time is beyond the scope of this paper.
Workload characterization
In this section we start by defining our workload, inspired by the MPEG-4 media profiles. Then, using two different ISAs, (Compaq's) Alpha extended with μ-SIMD instructions (MMX-style) and Alpha extended with our own streaming μ-SIMD instructions, we present an instruction breakdown of the workload.
Modeled Workload
Predicting what a future media workload will look like is not easy. Parameters that influence overall application behavior, such as the predominance of each media source, the size of its working set, or the level of protocol overhead, are hard to determine a priori. Even already standardized protocols such as MPEG-4 are still somewhat ambiguously defined (MPEG-4 was promoted to an ISO standard only in 1999), and it is difficult to obtain reliable, non-research-oriented source code. Therefore, our methodology is based on selecting a set of real multimedia programs that approximate the multiprogrammed contents of a full MPEG-4 application. According to the standard, MPEG-4 is composed of four different profiles (or heterogeneous contents):
Table 2 describes the set of benchmarks selected for our multiprogrammed workload.
For each program, we identified the most important functions using profiling and manually rewrote the vectorizable ones using both MMX-like instructions and our stream μ-SIMD instructions, by means of our own emulation libraries [11]. We should note that the 3D graphics benchmark (mesa) has not been vectorized because our emulation libraries do not have floating-point μ-SIMD instructions. Table 3 shows an instruction breakdown for each of the benchmarks for the two μ-SIMD instruction sets under consideration, MMX and MOM. The first four rows present the percentage of each type of instruction in each benchmark: integer arithmetic, floating point, SIMD arithmetic and memory (both scalar and vector memory instructions). The last row in the table gives the total number of instructions per benchmark (in millions). Note that, to allow for a meaningful comparison, a MOM μ-SIMD instruction that operates with, say, a stream length of 11 counts as eleven instructions (i.e., each MOM instruction is multiplied by its stream length).
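The counting convention used for the breakdown (each MOM instruction weighted by its stream length) can be illustrated with a small sketch. The trace format and the function names below are hypothetical, and the sketch assumes that vector memory instructions are weighted by stream length in the same way as SIMD arithmetic ones.

```python
from collections import Counter


def equivalent_instruction_counts(trace):
    """Count instructions per class, weighting each by its stream length.

    `trace` is an iterable of (kind, stream_length) pairs; scalar and
    MMX-like instructions use a stream length of 1 (hypothetical format).
    """
    counts = Counter()
    for kind, stream_length in trace:
        counts[kind] += stream_length
    return counts


def breakdown_percent(counts):
    """Convert absolute counts into the percentage breakdown of a table row."""
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


if __name__ == "__main__":
    # Three scalar integer ops, one MOM arithmetic op over an 11-element
    # stream, and one scalar load.
    trace = [("int", 1)] * 3 + [("simd", 11)] + [("mem", 1)]
    counts = equivalent_instruction_counts(trace)
    print(dict(counts))
    print(breakdown_percent(counts))
```

Weighting by stream length keeps the MMX and MOM columns comparable: a single stream instruction is charged for all the MMX-like operations it replaces.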
Instruction breakdown
In sharp contrast with common belief, table 3 shows that under MMX our multimedia workload is dominated by the integer pipeline (62% on average). SIMD arithmetic instructions only account for 16% of the overall number of instructions. On top of that, the workload is characterized by a very unbalanced distribution of instructions at run-time. Media programs typically execute regions of code with a high percentage of vector instructions and few scalar instructions, and other regions of code with no SIMD instructions at all (thus causing severe resource balancing problems).
The stream μ-SIMD paradigm substantially reduces the number of integer instructions (around 20%) and memory instructions (around 7%) when compared to MMX. The reason is a phenomenon commonly found in any conventional vector architecture. As every stream μ-SIMD instruction can pack several MMX-like instructions (thus replacing multiple iterations of a loop), the scalar instructions related to loop control (that is, backward branches, loop indexes or address calculations) are eliminated. Our stream instructions generate all this information automatically, thanks to the Stream Length register and the stride. There is an even higher reduction in the overall number of vector instructions (62%). The reason is that our stream ISA can take great advantage of MDMX-like packed accumulators (as seen in [11]). These accumulators are very useful in reduction sequences, eliminating a large amount of logic overhead.
However, in terms of relative percentage, the MOM paradigm does not alleviate the predominance of integer instructions in the instruction mix. Quite the contrary, it slightly increases the percentage of integer instructions. Thus, independently of using MMX or MOM, table 3 clearly shows that the integer pipeline will be the main performance bottleneck within the CPU when executing our approximation of a next generation media workload (ignoring, of course, memory behavior). Therefore, the best we can expect our SMT architecture to do is, at most, hide the execution of all memory and SIMD instructions underneath the execution of the integer instructions of the program.
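To make the integer bottleneck argument concrete, one can derive a rough upper bound on IPC when every non-integer instruction is perfectly hidden behind integer execution. The following is a back-of-the-envelope sketch, not a result from the paper: it uses the 62% integer share and the issue and fetch widths quoted in the text, and it assumes no dependences, no memory stalls and a perfectly balanced instruction stream. The function name is hypothetical.

```python
INT_ISSUE_WIDTH = 4      # integer instructions issued per cycle (from the text)
FETCH_WIDTH = 8          # instructions fetched per cycle (from the text)
INT_FRACTION_MMX = 0.62  # average integer share under MMX (from the text)


def integer_bound_ipc(int_fraction,
                      int_issue_width=INT_ISSUE_WIDTH,
                      fetch_width=FETCH_WIDTH):
    """Upper bound on IPC if only the integer pipeline limits throughput.

    Each cycle retires at most `int_issue_width` integer instructions, which
    make up `int_fraction` of the stream, so total IPC cannot exceed
    int_issue_width / int_fraction (nor the fetch width).
    """
    return min(fetch_width, int_issue_width / int_fraction)


if __name__ == "__main__":
    print(f"integer-bound IPC (MMX mix): {integer_bound_ipc(INT_FRACTION_MMX):.2f}")
```

Under these idealized assumptions the bound for the MMX instruction mix is about 6.45 instructions per cycle, which is the ceiling any fetch policy or memory optimization is working toward.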
Performance evaluation
In this section, we evaluate the performance of the SMT architecture with the two SIMD ISAs under study. We first present performance under an ideal memory system and then evaluate the impact of a realistic memory model. Finally, we study the performance improvements obtained with smart fetch policies and analyze alternative cache hierarchies that alleviate the memory problem.
Simulation methodology and performance metrics
We have evaluated the performance of the modeled workload using an SMT version of the Jinks simulator [11] with 1, 2, 4 and 8 threads. In order to do so, we selected a random order of the 8 programs: MPEG-2 encoder, GSM decoder, MPEG-2 decoder, GSM encoder, JPEG decoder, JPEG encoder, mesa and MPEG-2 decoder (2nd time). In order to round the list up to 8 programs, the most significant program (MPEG-2 decoder) is included twice. Simulation starts with as many concurrent programs as the number of contexts allowed by the machine. When a program completes, the next program from the list is started. If no further programs are available, we start again, selecting programs from the beginning of the same list. This process is repeated until the end of the 8th context. This avoids having fractions of time with fewer threads than those allowed by the machine. We should note that this methodology gives us a measure of throughput rather than real execution time. We believe that this is the most suitable metric, as future media workloads will be characterized by continuous media streams being executed concurrently over time.
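The program scheduling used in this methodology can be sketched as a small event loop. This is a hedged illustration only: the program names, the `lengths` values and the function `schedule` are hypothetical, each program is given a fixed duration independent of its co-runners (which a real SMT simulation would not assume), and the stopping rule is interpreted here as "stop when the first run of the last program in the list completes".

```python
import heapq
from itertools import cycle

PROGRAMS = ["mpeg2enc", "gsmdec", "mpeg2dec", "gsmenc",
            "jpegdec", "jpegenc", "mesa", "mpeg2dec"]


def schedule(programs, lengths, contexts):
    """Keep `contexts` hardware contexts busy, refilling from a wrapping list.

    Returns a log of (start, end, context, list_index) tuples, ending with
    the first completed run of the last program in the list.
    """
    order = cycle(range(len(programs)))  # wraps around when the list ends
    running = []                         # min-heap of (end, context, index, start)
    for ctx in range(contexts):
        i = next(order)
        heapq.heappush(running, (lengths[programs[i]], ctx, i, 0))
    log = []
    last = len(programs) - 1
    while running:
        end, ctx, i, start = heapq.heappop(running)
        log.append((start, end, ctx, i))
        if i == last:
            break  # assumed stopping rule: last listed program has finished
        j = next(order)  # the freed context immediately takes the next program
        heapq.heappush(running, (end + lengths[programs[j]], ctx, j, end))
    return log


if __name__ == "__main__":
    unit_lengths = {name: 1 for name in PROGRAMS}  # hypothetical durations
    for entry in schedule(PROGRAMS, unit_lengths, contexts=2):
        print(entry)
```

Because a freed context is refilled immediately, the machine never runs with fewer threads than contexts until the stopping point, which is why the result is a throughput measure rather than the execution time of any single program.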
When evaluating the performance of a SMT architecture, we typically use IPC (instructions committed per cycle) as a good indicator of throughput. However, the IPC is not a good measure of performance when comparing different ISAs, as every ISA needs a different number of instructions to execute a given benchmark. Therefore, in this paper we will use a new indicator of performance for the streaming p-SIMD architecture:
$$EIPC = \frac{instructions_{MMX}}{instructions_{MOM}} \times IPC_{MOM}$$
EIPC stands for Equivalent IPC, and intuitively indicates the IPC an SMT+MMX processor should reach in order to match the performance of the SMT+MOM processor. The ratio between the EIPC of the SMT+MOM architecture and the IPC of the SMT+MMX architecture gives a measure of the performance speed-up.
Performance with Ideal Memory Systems
Figure 4 shows performance for the two architectures under study with an ideal memory system (neither cache misses nor bank conflicts). The horizontal dotted line represents the baseline performance of a single thread with MMX instructions. From the results of the figure, we can see that, as we increase the number of threads, SMT+MMX goes from the baseline EIPC of 2.47 up to 5.0 (a speedup of 2.02X). Even better, SMT+MOM goes from an EIPC of 2.98 (20% better than MMX) for a single thread up to 6.19 (a speedup of 2.08X). Overall, SMT+MOM is 2.5 times better than an 8-way superscalar with MMX instructions.
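The EIPC metric and the ideal-memory speedups can be checked with a few lines of arithmetic. In this sketch, EIPC is taken to be the MOM IPC scaled by the ratio of MMX to MOM instruction counts, which is the IPC an SMT+MMX processor would need to finish in the same number of cycles. The function names and the instruction counts in the usage example are hypothetical; the EIPC values are the ones quoted in the text for ideal memory.

```python
def eipc(instructions_mmx, instructions_mom, ipc_mom):
    """Equivalent IPC: the IPC an MMX machine needs to match the MOM machine."""
    return instructions_mmx / instructions_mom * ipc_mom


def speedup(performance, baseline):
    """Ratio of two (E)IPC values measured on the same workload."""
    return performance / baseline


# Values quoted in the text for the ideal memory system.
MMX_1_THREAD = 2.47   # baseline: single thread with MMX
MMX_8_THREADS = 5.0
MOM_1_THREAD = 2.98
MOM_8_THREADS = 6.19

if __name__ == "__main__":
    print(f"SMT+MMX scaling: {speedup(MMX_8_THREADS, MMX_1_THREAD):.2f}X")
    print(f"SMT+MOM scaling: {speedup(MOM_8_THREADS, MOM_1_THREAD):.2f}X")
    print(f"MOM vs MMX, 1 thread: {speedup(MOM_1_THREAD, MMX_1_THREAD):.2f}X")
    print(f"SMT+MOM vs baseline: {speedup(MOM_8_THREADS, MMX_1_THREAD):.2f}X")
    # Hypothetical counts: MOM needs 800 instructions where MMX needs 1000.
    print(f"example EIPC: {eipc(1000, 800, 2.0):.2f}")
```

Normalizing by instruction count is what lets two ISAs with different code sizes share one axis: a MOM run with a lower raw IPC can still correspond to a higher EIPC if it executes proportionally fewer instructions.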
As we will see later, the MOM model achieves even higher relative performance under realistic memory assumptions.
Performance under Real Memory
Figure 5 shows performance for the two architectures under study taking into account the effect of the memory system described in section 3. Performance with ideal memory is also presented for comparison purposes.
From the results of the figure we can observe two main phenomena: (a) increasing the number of threads may provide diminishing returns (performance with 4 threads is higher than with 8 threads), and (b) MOM is more robust to the impact of the memory system (MOM exhibits an average performance degradation of 12%, in comparison to 30% for MMX).
In order to understand these two effects, we may look at table 4, where the instruction cache hit rate, the L1 cache hit rate and the average memory latency on L1 are shown. We may observe how, as we increase the number of threads that can co-exist in the processor, the hit rates of both the instruction and L1 caches decrease, due to the mutual interference between threads. This ends up increasing the memory latency, thus reducing performance.
The higher robustness of MOM against thread interference is due to two main reasons: a lower hit rate degradation and a higher memory latency tolerance. The lower hit rate degradation is due to the nature of the stream memory accesses. Since a stream memory reference generates several memory accesses from the same thread, thread interference is reduced. Additionally, stream memory references exhibit a high memory latency tolerance since, as usual with vector memory references, they can amortize the memory latency across the different elements accessed.
Performance of Advanced Fetch Mechanisms
As described in [20], the way we select instructions to be fetched may have a significant impact on final performance. This is especially true for the stream μ-SIMD architecture, where we found that the conventional round-robin policy was not able to optimally mix scalar and vector instructions (with 8 threads, only 1% of the MMX execution cycles perform only vector instructions, while in MOM 4% of the execution cycles perform only vector instructions).
We have evaluated the performance of the following fetch policies:
ROUND-ROBIN (RR)
The basic round-robin policy used until now.
For all the fetch policies under study, on both the SMT+MMX and SMT+MOM architectures, we can observe that the alternative fetch policies are only effective with a high number of threads (compared to round-robin), delivering up to a 9% performance improvement. Note, however, that although the performance degradation is smoothed, performance for 4 threads is still higher than performance for 8 threads.
ICOUNT is the policy that delivers the highest performance for the SMT+MMX model, while OCOUNT is the policy that exhibits the best performance for the SMT+MOM model. BALANCE stands as a cost-effective alternative, given the simplicity of its implementation compared, for instance, with the OCOUNT policy.
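For readers unfamiliar with ICOUNT, the sketch below illustrates the idea under the assumption that it follows its usual definition in the SMT literature: priority goes to the threads with the fewest instructions in the front-end stages and instruction queues. The function name and the data layout are hypothetical, and the OCOUNT and BALANCE policies are not sketched because their definitions do not appear in the surviving text.

```python
def icount_fetch(in_flight, stalled, groups_per_cycle=2):
    """Select the threads to fetch from this cycle, ICOUNT style.

    `in_flight` maps thread id -> number of instructions currently held in
    decode, rename and the instruction queues (assumed definition).
    `stalled` is the set of thread ids that cannot fetch this cycle.
    Ties are broken by thread id to keep the choice deterministic.
    """
    candidates = [tid for tid in in_flight if tid not in stalled]
    candidates.sort(key=lambda tid: (in_flight[tid], tid))
    return candidates[:groups_per_cycle]


if __name__ == "__main__":
    occupancy = {0: 12, 1: 3, 2: 7, 3: 3}
    print(icount_fetch(occupancy, stalled={1}))
```

Compared with round-robin, this kind of policy favors threads that are draining their instructions quickly, which tends to keep the queues from filling up with instructions from a single slow thread.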
Decoupling the Cache Hierarchy
As seen in section 5.3, if we increase the number of threads of the processor, data locality is reduced due to inter-thread interference. As a result, we incur higher latency penalties and a loss of bandwidth efficiency. Moreover, as we increase the number of threads, the number of memory instructions available to execute per cycle rises, thus increasing the likelihood of bank collisions (reducing the bandwidth efficiency even more). While the latency penalty can be partially hidden by the latency tolerance properties of SMT execution, the loss of effective bandwidth is a problem that directly affects final performance.
In [21] we proposed to bypass vector memory accesses to a higher-level cache and to decouple general-purpose memory ports into scalar memory ports and vector memory ports (as seen in figure 7). Decoupling scalar and vector memory ports into different levels of the cache hierarchy achieves two goals: (a) we effectively decouple the vector working set from the scalar working set, and (b) we reduce the number of memory ports per level of cache, thus reducing bank contention. Figure 7 compares the original 4-port cache organization we have just evaluated with a decoupled organization. In the latter configuration, we have 2 memory ports to access scalar elements from the L1 (single-banked and double-pumped as in the Alpha 21264), and 2 memory ports directly connected to L2 used for stream SIMD memory accesses (of course, the L2 still has to talk to the L1 and I1 caches). The L2 has two banks connected to the vector memory ports via a crossbar. Naturally, bypassing the L1 when doing stream accesses can cause coherence problems due to interference between vector and scalar data. Consequently, a coherence protocol based on an exclusive-bit policy is used to deal with this situation [21]. Figure 8 shows the performance of the decoupled cache hierarchy for all the different fetch policies studied so far.
The first main conclusion is that the new cache decoupling strategy solves the cache degradation problem: contrary to the data in figure 5, now the 8-thread configuration is better than the 4-thread configuration. Another observation is that the different fetch policies barely provide any performance benefit for the SMT+MMX architecture, while they provide up to a 7% performance improvement for the SMT+MOM processor model. In order to compare the efficiency of the different cache hierarchy strategies, we may look at figure 9, which shows performance results for the three different cases: ideal memory system, conventional memory hierarchy, and decoupled memory hierarchy. From the results of the figure, we can observe that bypassing the L1 cache is very useful, but only if we have a large number of threads, since in that case we are able to tolerate the 12 cycles of L2 latency while taking advantage of the higher effective bandwidth. Our MOM instructions benefit even more from the decoupled hierarchy due to their own additional latency tolerance capabilities. As a result, the SMT+MOM architecture exhibits only a 15% performance degradation compared with the ideal memory system, while the SMT+MMX architecture exhibits a 30% performance degradation (for the 8-thread configuration).
Related Work
SIMD execution appears as a natural choice to provide high uni-threaded performance.
There has been a large body of research regarding SMT architectures [16, 20, 26, 27], most of it especially focused on pure TLP to ILP exploitation. Phenomena like branch misprediction and memory latency tolerance, cache degradation and the impact of fetch policies (all of them typical of the SMT paradigm) have been studied in depth.
The contribution of this paper is the claim that future media applications will provide explicit TLP that can be exploited with great benefit by SMT processors with μ-SIMD media extensions. A previous proposal [17] evaluated the performance of an SMT processor with conventional vector enhancements for numerical applications. Nonetheless, we believe that the combination of the two paradigms is more promising for media processing, as numerical codes are dominated by the vector component rather than the scalar one. This paper has shown that ILP exploitation together with an effective mixing of scalar and vector instructions is extremely relevant for media performance.
In [28], an SMT processor with MMX-like extensions is used to evaluate the performance of a parallel version of an MPEG-2 decoder. The authors identified the large serial fragment of non-vectorizable code as one of the major performance bottlenecks. Our proposal differs in two main points: (a) the convenience of exploiting heterogeneous explicit TLP rather than exploiting TLP over the already vectorized (thus, with high DLP) fragments of code; and (b) the convenience of using streaming μ-SIMD instructions to increase latency tolerance (thus allowing smarter memory hierarchies).
Finally, there have been other kinds of architecture proposals targeted at future media applications. The M-PIRE processor [29] is an architecture explicitly focused on executing MPEG-4 applications. Its implementation is based on partitioning the processor into independent programmable units, each of them optimized for a certain class of MPEG-4 algorithm. With a similar philosophy, Sony's PSX2 Emotion Engine [30] uses independent vector/micro-coded units able to exploit SIMD parallelism concurrently with the execution of the main integer core.
Summary
In this paper we have studied and evaluated the performance of an efficient architecture for the next generation of media workloads. We have shown that in order to match the requirements of future standards such as MPEG-4, both DLP and TLP can be exploited efficiently, and we have proposed SMT processors enhanced with μ-SIMD extensions as a suitable alternative.
We have evaluated two different μ-SIMD alternatives (an MMX-like extension and a stream, vector-like μ-SIMD ISA) and have shown the advantages of stream-oriented μ-SIMD alternatives such as MOM.
We have seen that, while the SMT capabilities allow hiding vector execution behind integer execution (thus minimizing the impact of Amdahl's Law), the latency tolerance properties of MOM memory streams allow introducing smarter cache hierarchies that help alleviate the cache performance degradation associated with inter-thread interference. As a result, while SMT provides a maximum speed-up of 2.1X for the MMX processor (with a 30% performance degradation compared with ideal memory performance), the MOM processor achieves a performance improvement of 3.3X (compared with the performance of a uni-threaded MMX model), suffering only a 15% performance degradation from the impact of a realistic memory model.
