DLP+TLP processors for the next generation of media workloads by Corbal San Adrián, Jesús et al.
DLP + TLP Processors for the Next Generation of Media Workloads 
Jesus Corbal, Roger Espasa and Mateo Valero 
Departament d’Arquitectura de Computadors,
Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract 
Future media workloads will require about two orders of magnitude more performance than that achieved by current general-purpose processors. High uni-threaded performance will be needed to meet real-time constraints, together with huge computational throughput, as the next generation of media workloads will be eminently multithreaded (MPEG-4/MPEG-7). In order to meet the challenge of providing both good uni-threaded performance and throughput, we propose to combine the simultaneous multithreading (SMT) execution paradigm with the ability to execute media-oriented streaming μ-SIMD instructions.
This paper evaluates the performance of two different aggressive SMT processors: one with conventional μ-SIMD extensions (such as MMX) and one with longer streaming vector μ-SIMD extensions. We will show that future media workloads are, in fact, dominated by scalar performance. The combination of SMT plus streaming vector μ-SIMD helps alleviate the performance bottleneck of the integer unit. SMT allows "hiding" vector execution underneath integer execution by overlapping the two types of computation, while the streaming vector μ-SIMD reduces the pressure on issue width and fetch bandwidth, and provides a powerful latency tolerance mechanism that enables smart decoupled cache hierarchies.
1 Introduction 
Media applications will become one of the most demand- 
ing types of workloads in the near future. Standards such as 
MPEG-4 or MPEG-7 will be eminently multithreaded, and 
a commodity PC will face the need to execute several media 
streams (encoders, decoders, 3D processing) concurrently. 
Therefore, the design of future media architectures will have 
to take into account both the tight real-time requirements 
of mono-threaded applications and the need to provide high 
throughput to run multiple tasks simultaneously. 
This work has been supported by the Ministry of Education of Spain under contract CICYT TIC98-0511-C02-01 and by the CEPBA. We would like to thank the reviewers of this paper for their useful comments.
0-7695-1019-1/01 $10.00 © 2001 IEEE
It seems unlikely that current generation microprocessors will be able to meet these requirements in the future if we only scale them in the traditional way (i.e., by adding more
functional units and by increasing issue width). We believe 
that future media workloads intrinsically require some form 
of on-chip parallel processing if we want to succeed in de- 
livering the required performance. Therefore, architectures 
able to exploit thread level parallelism, such as Simultane- 
ous Multithreaded Processors (SMT) or Chip Multiproces- 
sors (CMP), appear as the architectures of choice to pro- 
vide the desired throughput. Of course, in order to deal with 
the high performance requirements of the kernels of such 
media workloads we will still need special purpose instruc- 
tions that can exploit the data-level parallelism available. 
Thus, μ-SIMD processing is a suitable way of improving uni-threaded performance for media kernels with a modest investment in hardware.
This paper advocates the combination of SMT and μ-SIMD extensions to achieve the performance required for future media workloads. We will evaluate the performance of two different aggressive SMT processors: one with conventional μ-SIMD extensions (such as MMX) and one with longer streaming vector μ-SIMD extensions (our MOM ISA extension).
We will demonstrate that SMT execution and streaming vector μ-SIMD instructions combine very well, since:
• SMT execution allows mixing scalar and streaming μ-SIMD instructions in an efficient way
• The latency tolerance properties of the streaming μ-SIMD ISA enable the use of decoupled cache hierarchies that avoid the typical cache degradation experienced by multiple threads when running on an SMT
Compared with a baseline consisting of a plain out-of-order superscalar with multimedia extensions, SMT+MMX yields a 2.1X speedup and SMT+MOM achieves a 3.3X speedup.
2 Multimedia workload trends 
In this section we describe the evolution of media codes and analyze their main characteristics. We also discuss some common misconceptions about media workloads, which justify why a combined SMT plus SIMD architecture appears to be a suitable alternative for running the next generation of media workloads.
From kernels to real programs 
When studying multimedia algorithms, or, better, multimedia kernels, most authors agree that their most relevant characteristics are the following [1, 2]:
• Small data type sizes
• Computationally intensive small tight loops
• Large amounts of Data Level Parallelism
• Stream-like patterns, low data locality
• Large memory bandwidth requirements
Most major vendors of high performance processors have realized the importance of media workloads and have extended their ISAs with new instructions targeted at exploiting data-level parallelism over small data types (μ-SIMD parallelism) [3, 4, 5, 6, 7, 8]. These new instructions are generally complemented with stream prefetching instructions in an attempt to alleviate the memory latency difficulties exposed by low-data-locality, streaming kernels [8, 9, 10].
However, studying kernels in isolation can be very mis- 
leading, since what is true for basic media kernels, does not 
necessarily apply to complete programs. 
A typical media program is composed of a set of ker- 
nels that process data in a stream-like fashion and protocol 
related overhead (table look-ups, header processing, non- 
vectorizable coding) very similar to what we can find in a 
typical SPECint benchmark. Since the kernels are repeat- 
edly invoked on sets of related data, their behavior as a group 
can be very different from their isolated behavior. For ex- 
ample, while at the kernel level one usually encounters low- 
locality stream-like memory patterns, there is usually high 
locality at the algorithm level. 
Researchers creating benchmarks from raw media kernels usually wrap them in long running loops so that measurements can be easily taken. However, repeating a kernel many times on different data exacerbates its stream-like behavior, which may not be a realistic scenario. As reported in [11], the characteristics of complete media programs fall somewhere in between raw DLP media kernels and conventional non-numerical applications.
As a result, media-oriented architecture designers should be aware of the following traits that characterize full multimedia programs:
• Computationally intensive small tight loops + integer-like code (protocol overhead)
• Large amounts of DLP in a restricted portion of the execution + restricted ILP in the rest
• Stream-like patterns at kernel level but high locality at algorithm level
• Memory bandwidth is the main bottleneck but ILP is also a concern
Figure 1. From kernels to media programs, from programs to real applications.
It follows from this set of characteristics that architec- 
tures strictly focused on exploiting the data level parallelism 
available at kernel level will fail to deliver the expected per- 
formance due to Amdahl’s law. 
From real programs to future media applications 
Future media workloads are not expected to change radically 
the basic multimedia algorithms. Rather, given the large 
number of different media sources that can be sent over the 
Internet, the tendency is to focus on joining heterogeneous 
media streams into unified protocols. These new applica- 
tions will pay special attention to managing and maximizing 
the efficiency of compression and/or encryption by extract- 
ing uncorrelated media contents from the same source and 
applying the best media processing for each of them. Ad- 
ditionally, media programs will no longer be monolithic ap- 
plications. Interactivity will be applied across the board so 
that different input/output media streams will execute and 
communicate among themselves concurrently. 
The best example of this tendency is the MPEG-4 standard [12]. MPEG-4 is a new ISO/IEC protocol that uses an object-based approach to describe and compose interactive audiovisual scenes. Uncorrelated objects are coded, encrypted and transmitted separately in order to be composed again at reception. These objects may include digital video (MPEG-2), still image, audio, speech and even audio synthesis or 3D graphics. Several powerful transformations can be performed over every object in order to compose each of them into the same audiovisual scene thanks to a higher layer of the protocol.
Therefore, future media applications will add several characteristics not contemplated by the current research literature. Dealing with multiple concurrent media streams means that we have high levels of coarse-grain parallelism (TLP) and that throughput is now also an issue (together with the uni-threaded real-time requirements of each source). Additionally, an extra layer of protocol means more hard-to-vectorize overhead that may further counterbalance the DLP-only nature of multimedia kernels.
3 Proposed Architecture 
The need for coarse-grain thread-level parallelism comes at a very appropriate moment. Coincidentally, several vendors targeting commercial workloads (OLTP, Web serving and databases in general) have started to focus on new parallel architectures prepared to deal with the abundant, explicit, heterogeneous thread-level parallelism that characterizes this kind of program.
Looking at recent announcements, there are two main architectural alternatives: CMPs, or Chip Multiprocessors (Power-4 [13], Piranha [14]), and SMT processors (Alpha 21464 [15]). The first alternative, a CMP, is based on joining together several simple processors on a single die, communicating through a conventional cache hierarchy. The second alternative is based on executing independent flows of execution (threads) concurrently on a (typically) highly aggressive superscalar processor.
Our claim is that this kind of architecture is also appropriate for future media applications. As with OLTP-like applications, throughput is the major concern in workloads involving several concurrent media sources (once real-time requirements have been met, of course), and both CMP and SMT architectures are good alternatives to meet these throughput requirements. Which alternative is better is still a matter of controversy: SMT allows a better usage of the available resources, while CMP does not have the traditional implementation problems of aggressive out-of-order architectures.
From the point of view of overall performance, we believe that SMT processors are especially well suited to the characteristics of media workloads because they are able to provide moderate performance even in serial fragments of code or with a low number of threads (minimizing the impact of Amdahl's Law). In this section we propose an SMT processor with smart media ISA extensions as the architecture of choice for future media desktop systems.
Baseline Processor 
Our SMT processor is built around a common out-of-order superscalar processor, as proposed in [16]. As shown in figure 2, our proposed architecture closely resembles an 8-way version of a MIPS R10000, able to fetch up to 8 instructions per cycle. Instructions decoded and renamed are distributed by the dispatch logic to the appropriate instruction queue, each of which reads from its own dedicated register file. Instructions within every queue may issue out-of-order and a graduation window is in charge of retiring instructions in-order to maintain the appearance of sequential execution.
SMT extension 
The basic superscalar architecture has been enhanced following [16, 17] to support Simultaneous Multithreading (SMT). In order for the processor to be able to execute multiple threads concurrently, minor changes are needed for three of the stages of the pipeline: fetch, decode and commit. The fetch engine is able to select up to two groups of 4 instructions per cycle out of the pool of available threads (provided they are not stalled under an I-cache miss or a branch misprediction). For the initial evaluations, the fetch selection strategy is a classic round-robin policy.
As proposed in [16], all threads share a common register pool. The decode engine is able to rename instructions from different threads using a per-thread renaming table and a shared common free register pool. Inside the execution queues no additional logic is required to handle instructions from different threads, as renaming provides an easy mechanism to avoid false dependences. Some additional logic is required in the graduation window in order to allow per-thread retirement, as well as a mechanism to perform a per-thread instruction flush in case of misspeculation.
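The renaming scheme described above can be illustrated with a minimal sketch, assuming one rename table per thread and a single free list shared by all threads. The class name `SharedRenamer`, its methods and the register counts are hypothetical and chosen only for illustration; they are not taken from the simulator used in the paper.

```python
from collections import deque


class SharedRenamer:
    """Per-thread rename tables backed by one shared pool of physical registers.

    Illustrative model only: names and sizes are assumptions, not the paper's design.
    """

    def __init__(self, num_threads, num_logical, num_physical):
        if num_physical < num_threads * num_logical:
            raise ValueError("not enough physical registers for the architectural state")
        # Each thread's logical registers start mapped to distinct physical registers.
        self.tables = [
            [t * num_logical + r for r in range(num_logical)]
            for t in range(num_threads)
        ]
        # Everything left over forms the free pool shared by all threads.
        self.free = deque(range(num_threads * num_logical, num_physical))

    def rename(self, thread, dest, srcs):
        """Rename one instruction; returns (phys_dest, phys_srcs, old_phys_dest)."""
        table = self.tables[thread]
        phys_srcs = [table[s] for s in srcs]  # sources use the current mapping
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        old = table[dest]
        new = self.free.popleft()
        table[dest] = new
        return new, phys_srcs, old

    def retire(self, old_phys):
        """At graduation, the previous mapping of the destination is recycled."""
        self.free.append(old_phys)


# Two threads writing the same logical register r1 get different physical registers,
# so the execution queues never see a false dependence between them.
renamer = SharedRenamer(num_threads=2, num_logical=4, num_physical=12)
d0, s0, old0 = renamer.rename(thread=0, dest=1, srcs=[2, 3])
d1, s1, old1 = renamer.rename(thread=1, dest=1, srcs=[2, 3])
```

Because both threads draw from the same free list, a thread with little register pressure leaves more physical registers available to the others, which is the resource-sharing argument usually made for SMT.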
SIMD Extensions 
In spite of the explicit thread-level parallelism available in future media applications, specific architectural innovations are still needed in order to fulfill the real-time requirements of single media streams. Simultaneous multithreading might provide good overall computational throughput but, unfortunately, cannot guarantee that, for instance, the frame rate constraints of an MPEG-2 video stream are met. As discussed in previous sections, we believe that new μ-SIMD extensions are necessary to meet the single-thread performance requirements of highly demanding multimedia kernels.
We have enhanced our basic SMT core with a multimedia instruction queue, its corresponding SIMD register file and two independent media functional units. Two different sets of multimedia extensions will be evaluated: a μ-SIMD instruction set that resembles the Intel SSE [9] extension and our own streaming-SIMD instruction set, named MOM [11]. Despite differences in their instruction semantics, both use a similar overall architecture.
For the MMX-like instruction set, we have implemented an approximation of the SSE [9] integer opcodes with 67 instructions and 32 logical registers (as opposed to 8). We have added some extra features not present in the original SSE, such as new reduction operations and multiple source registers.
Figure 2. Our basic model of an SMT processor.
The MOM instruction set was introduced in [11] and combines the advantages of typical μ-SIMD instructions with the parallelism offered by conventional vector ISAs. By exploiting two different dimensions of parallelism (parallel loops), MOM is able to generate stream instructions that work on up to 16 conventional MMX-like registers and fuse into a single opcode the equivalent of 16 MMX-like instructions. Results presented in [11] demonstrated that this kind of μ-SIMD stream is very well adapted to the characteristics of most media kernels. MOM is loosely based upon the MDMX multimedia extension set from MIPS [4]. MOM has 121 different opcodes and 16 logical stream μ-SIMD registers (each composed of 16 MMX-like registers). In order to improve efficiency in reduction operations, we have also included 2 logical packed accumulators of 192 bits. These accumulators allow reduction operations over a whole μ-SIMD stream to be performed efficiently using a single packed accumulator. Finally, our streaming μ-SIMD architecture has one stream length register (renamed through the integer register pool) that determines the actual length of each stream register (out of 16).
Figure 3 presents a comparison between conventional MMX-like instructions and our MOM stream instructions. An MMX-like instruction is characterized by two parameters: the size of the μ-SIMD register (typically 64 bits) and the number of packed sub-word elements (which depends on the size of the sub-word elements). A stream μ-SIMD instruction adds two more parameters: the Stream Length and the Stride. The Stream Length is the number of MMX-like μ-SIMD registers over which the same instruction is going to be executed (up to 16). The Stride determines, for stream memory instructions, the distance in memory positions between consecutive μ-SIMD registers inside the same stream. The stride feature is very powerful for multimedia (especially for image/video processing), as it allows working over small sparse matrices of data.
Figure 3. Comparison between (a) a conventional MMX-like μ-SIMD instruction and (b) a stream μ-SIMD instruction.
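The roles of the Stream Length and the Stride can be illustrated with a small functional model. This is a sketch under stated assumptions, not the MOM ISA itself: registers are taken to be 64 bits holding four 16-bit sub-words, the stride is expressed in bytes, packed addition wraps around (saturation is not modeled), and the function names `stream_load` and `stream_add` are hypothetical.

```python
import struct

REG_BYTES = 8            # one 64-bit MMX-like register (assumed width)
MAX_STREAM_LENGTH = 16   # MMX-like registers per stream register


def stream_load(mem, base, stride, length):
    """Load `length` 64-bit registers, `stride` bytes apart, starting at `base`.

    Each register is returned as four little-endian 16-bit sub-words.
    """
    if not 1 <= length <= MAX_STREAM_LENGTH:
        raise ValueError("stream length must be between 1 and 16")
    return [
        list(struct.unpack_from("<4H", mem, base + i * stride))
        for i in range(length)
    ]


def stream_add(a, b):
    """Packed 16-bit add applied to every register of two streams (wraparound)."""
    return [
        [(x + y) & 0xFFFF for x, y in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# An 8x8 matrix of 16-bit values stored row-major: each row occupies 16 bytes.
mem = bytearray(struct.pack("<64H", *range(64)))

# One stream load with stride = row pitch fetches a 4x4 sub-block:
# four registers, each holding the first four elements of consecutive rows.
block = stream_load(mem, base=0, stride=16, length=4)
doubled = stream_add(block, block)
```

In this model a single stream load replaces four separate register loads plus the address arithmetic between them, which is the effect the text attributes to the stride when working over small sub-matrices.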
Architectural Parameters 
Our basic processor configuration is able to issue up to 4 integer instructions, up to 4 memory instructions (either loads or stores) and up to 4 floating point instructions per cycle. Additionally, the SMT+MMX processor is able to issue up to two different MMX-like instructions per cycle. On the other hand, our SMT+MOM processor has a single media functional unit of width 2 (that is, we have two parallel vector pipes so that up to two μ-SIMD sub-instructions from the same stream can be executed every cycle). As a result, in contrast with the conventional MMX-like version, the SMT+MOM processor only requires an issue width of 1 for the SIMD queue.
Table 1. Architectural parameters based on number 
of threads. 
Our SMT processor simulator contains a highly detailed memory hierarchy model, where both L1 and L2 cache levels are located on-chip (as in the Alpha 21364 [18]). The L1 cache is a 32 KB, direct-mapped, write-through cache with 32-byte lines, interleaved among 8 memory banks. The I-cache is a 64 KB, 2-way set-associative cache with 32-byte lines, interleaved among 4 memory banks. The L2 cache is a 1 MB, 2-way set-associative, write-back cache with 128-byte lines. Both the L1 and L2 cache levels have 8 MSHRs and an 8-entry coalescing write buffer with a selective flush policy. The L1 cache and the I-cache (I1) have a latency of one cycle, while the L2 cache latency is 12 cycles. We have modeled a 128 MB Direct Rambus main memory system which contains a DRDRAM controller driving 8 Rambus chips and delivering up to 3.2 GB/s through a 128-bit wide, bidirectional 200 MHz main bus (feeding an 800 MHz processor).
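The 3.2 GB/s figure can be cross-checked from the stated bus parameters. The short calculation below assumes one 128-bit transfer per 200 MHz bus cycle and decimal gigabytes; neither assumption is stated in the text, and the function name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(bus_bits, bus_hz):
    """Peak bandwidth assuming one full-width transfer per bus cycle."""
    return (bus_bits // 8) * bus_hz


BUS_BITS = 128
BUS_HZ = 200_000_000
CPU_HZ = 800_000_000
L2_LINE_BYTES = 128

peak = peak_bandwidth_bytes_per_s(BUS_BITS, BUS_HZ)   # bytes per second
bytes_per_cpu_cycle = peak / CPU_HZ                   # bytes the bus delivers per processor cycle
cycles_per_l2_line = L2_LINE_BYTES / bytes_per_cpu_cycle  # best-case transfer time of one L2 line

print(peak / 1e9, bytes_per_cpu_cycle, cycles_per_l2_line)
```

Under these assumptions the bus delivers 4 bytes per processor cycle at best, so streaming a single 128-byte L2 line occupies the bus for at least 32 processor cycles, before any DRAM access latency is counted.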
We have done preliminary simulations in order to determine the number of physical registers and the window sizes necessary to achieve reasonable (near saturation) processor performance for 1, 2, 4 and 8 threads. The results can be seen in table 1. Note that the size of the stream μ-SIMD register file can be up to 8 times the size of the MMX register file. However, as already seen in [11], organizing the register file in lanes and interleaving the elements of each vector register across banks helps to radically decrease the overall area without any impact on performance. Further discussion of the impact of the large number of registers on cycle time is beyond the scope of this paper.
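One way to see where a factor of this magnitude can come from is to compare the logical register state of the two ISAs, using the counts given earlier (32 logical MMX-like registers; 16 logical stream registers of 16 MMX-like registers each) and assuming 64-bit MMX-like registers. This is only an illustration: the physical register counts of table 1 are not used here and may yield a different ratio.

```python
MMX_REG_BITS = 64  # assumed register width ("64 bits typically" in the text)


def mmx_state_bits(logical_regs=32):
    """Logical state of the MMX-like register file, in bits."""
    return logical_regs * MMX_REG_BITS


def mom_state_bits(stream_regs=16, regs_per_stream=16):
    """Logical state of the stream register file, in bits."""
    return stream_regs * regs_per_stream * MMX_REG_BITS


ratio = mom_state_bits() // mmx_state_bits()
print(mmx_state_bits() // 8, mom_state_bits() // 8, ratio)  # sizes in bytes, then the ratio
```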
4 Workload characterization 
In this section we start by defining our workload, inspired by the MPEG-4 media profiles. Then, using two different ISAs, (Compaq's) Alpha extended with μ-SIMD instructions (MMX-style) and Alpha extended with our own streaming μ-SIMD instructions, we present an instruction breakdown of the workload.
4.1 Modeled Workload 
Predicting what a future media workload will look like is not easy. Parameters that influence overall application behavior, such as the predominance of each media source, the size of its working set, or the level of protocol overhead, are hard to determine a priori. Even already standardized protocols such as MPEG-4 are still somewhat ambiguously defined (MPEG-4 was promoted to an ISO standard only in 1999) and it is difficult to obtain reliable, non-research-oriented source code.
Therefore, our methodology is based on selecting a set of real multimedia programs that approximate the multiprogrammed contents of a full MPEG-4 application. According to the standard, MPEG-4 is composed of four different profiles (or heterogeneous contents):
• MPEG-4 control (BIFS-based scene descriptors)
• MPEG-4 video (MPEG-2)
• MPEG-4 still image (2D and 3D graphics)
• MPEG-4 audio (speech codec, audio synthesis)
We have used programs from the Mediabench suite [19] that represent each one of the aforementioned profiles. MPEG-2 encode and decode are examples of the MPEG-4 video profile. The public domain implementation of OpenGL, mesa, and the JPEG encoder/decoder match the 3D and 2D MPEG-4 still image profiles. Finally, gsm encode and decode represent the MPEG-4 audio speech profile. The only profile not included in our workload is MPEG-4 control (which deals with the VRML-like composition of the sources into the same scene). Note that this last profile would likely have a certain impact on performance, as it would create dependences and synchronization needs between threads. Table 2 describes the set of benchmarks selected for our multiprogrammed workload.
For each program, we identified the most important functions using profiling and manually rewrote the vectorizable ones using both MMX-like instructions and our stream μ-SIMD instructions, by means of our own emulation libraries [11]. We should note that the 3D graphics benchmark (mesa) has not been vectorized because our emulation libraries do not have floating-point μ-SIMD instructions.
4.2 Instruction breakdown 
Table 3 shows an instruction breakdown for each of the benchmarks for the two μ-SIMD instruction sets under consideration, MMX and MOM. The last row in the table gives the total number of instructions per benchmark (in millions). Note that, to allow for a meaningful comparison, a MOM μ-SIMD instruction that operates with, say, a stream length of 11, counts as eleven instructions (i.e., each MOM instruction is multiplied by its stream length). The first four rows present the percentage of each type of instruction (integer arithmetic, floating point, SIMD arithmetic and memory, the latter including both scalar and vector memory instructions) in each benchmark.
Table 2. Multi-programmed workload description (columns: instances, description, data set, characteristics).
Table 3. Instruction breakdown (%) and instruction count (in millions of instructions).
#ins (millions)      MMX      MOM      [percentage rows missing]
MPEG2enc             642.7    364.9
MPEG2dec             69.8     59.8
[label missing]      160.3    135.8
[label missing]      109.4    106.4
[label missing]      177.9    161.3
[label missing]      105.2    105.0
[label missing]      93.8     93.8
Total                1429     1087
In sharp contrast with common belief, table 3 shows that under MMX, our multimedia workload is dominated by the integer pipeline (62% on average). SIMD arithmetic instructions only account for 16% of the overall number of instructions. On top of that, the workload is characterized by a very unbalanced distribution of instructions at run-time. Media programs typically execute regions of code with a high percentage of vector instructions and few scalar instructions, and other regions of code with no SIMD instructions at all (thus causing severe resource balancing problems).
The stream μ-SIMD paradigm substantially reduces the number of integer instructions (by around 20%) and memory instructions (by around 7%) when compared to MMX. The reason is a phenomenon commonly found in any conventional vector architecture. As every stream μ-SIMD instruction can pack several MMX-like instructions (thus replacing multiple iterations of a loop), the scalar instructions related to loop control (that is, backward branches, loop indexes and address calculations) are eliminated. Our stream instructions generate all this information automatically, using the contents of the Stream Length register and the stride. There is an even higher reduction in the overall number of vector instructions (62%). The reason is that our stream ISA can take great advantage of MDMX-like Packed Accumulators (as seen in [11]). These accumulators are very useful in reduction sequences, eliminating a large amount of logic overhead.
However, in terms of relative percentage, the MOM paradigm does not alleviate the predominance of integer instructions in the instruction mix. Quite the contrary, it slightly increases the percentage of integer instructions. Thus, regardless of whether MMX or MOM is used, table 3 clearly shows that the integer pipeline will be the main performance bottleneck within the CPU when executing our approximation of a next generation media workload (ignoring, of course, memory behavior). Therefore, the best we can expect our SMT architecture to do is, at most, hide the execution of all memory and SIMD instructions underneath the execution of the integer instructions of the program.
5 Performance evaluation 
In this section, we evaluate the performance of the SMT architecture with the two SIMD ISAs under study. We first present performance under an ideal memory system and then evaluate the impact of a realistic memory model. Finally, we study the performance improvements obtained with smart fetch policies and analyze alternative cache hierarchies in order to alleviate the memory problem.
5.1 Simulation methodology and performance metrics
We have evaluated the performance of the modeled workload using an SMT version of the Jinks simulator [11] with 1, 2, 4 and 8 threads. To do so, we selected a random order of the 8 programs: MPEG-2 encoder, GSM decoder, MPEG-2 decoder, GSM encoder, JPEG decoder, JPEG encoder, mesa and MPEG-2 decoder (2nd time). Simulation starts with as many programs running concurrently as the number of contexts allowed by the machine. When a program completes, the next program from the list is initiated. If no further programs are available, we start again, selecting programs from the beginning of the same list. This process is repeated until the end of the 8th context. This avoids having fractions of time with fewer threads than those allowed by the machine. In order to round to 8 programs, the most significant program (MPEG-2 decode) is included twice.
Figure 4. Performance with perfect cache (SMT+MMX IPC and SMT+MOM EIPC versus number of threads).
Table 4 rows (1, 2, 4 and 8 threads):
I1 hit rate        MOM   98.7%   98.2%   96.6%   93.9%
L1 hit rate        MMX   98.7%   97.6%   94.2%   86.8%
L1 hit rate        MOM   98.4%   98.1%   96.9%   93.7%
L1 latency         MMX   1.39    1.59    2.38    6.81
L1 latency         MOM   1.74    1.86    2.43    4.51
We should note that this methodology gives us a measure of throughput rather than real execution time. We believe that this is the most suitable metric, as future media workloads will be characterized by continuous media streams being executed concurrently over time.
When evaluating the performance of an SMT architecture, IPC (instructions committed per cycle) is typically used as a good indicator of throughput. However, IPC is not a good measure of performance when comparing different ISAs, as every ISA needs a different number of instructions to execute a given benchmark. Therefore, in this paper we will use a new indicator of performance for the streaming μ-SIMD architecture:
$$\mathrm{EIPC} = \frac{\mathit{instructions}_{\mathrm{MMX}}}{\mathit{instructions}_{\mathrm{MOM}}} \times \mathrm{IPC}_{\mathrm{MOM}}$$
EIPC stands for Equivalent IPC, and intuitively indicates the IPC an SMT+MMX processor should reach in order to match the performance of the SMT+MOM processor. The ratio between the EIPC of the SMT+MOM architecture and the IPC of the SMT+MMX architecture gives a measure of the performance speedup.
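The EIPC metric can be written down as a short calculation. The sketch below assumes that both instruction counts refer to the same benchmark run and uses made-up numbers purely for illustration; the function names are hypothetical.

```python
def eipc(instructions_mmx, instructions_mom, ipc_mom):
    """Equivalent IPC: the IPC an MMX machine would need to finish in the same cycles."""
    return instructions_mmx / instructions_mom * ipc_mom


def speedup(eipc_mom, ipc_mmx):
    """Speedup of the MOM machine over the MMX machine on the same work."""
    return eipc_mom / ipc_mmx


# Hypothetical example: the MOM binary needs 800 instructions at IPC 2.0, i.e. 400 cycles.
# The MMX binary needs 1000 instructions, so it must reach IPC 2.5 to also take 400 cycles.
example_eipc = eipc(instructions_mmx=1000, instructions_mom=800, ipc_mom=2.0)
example_speedup = speedup(example_eipc, ipc_mmx=2.0)
```

The design point is that a raw IPC comparison would rate both machines equally at 2.0 in this example, even though the MOM machine finishes the same work in fewer cycles.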
5.2 Performance with Ideal Memory Systems 
Figure 4 shows performance for the two architectures under study with an idealistic memory system (neither cache misses nor bank conflicts). The horizontal dotted line represents the baseline performance of a single thread with MMX instructions. From the results of the figure, we can see that, as we increase the number of threads, SMT+MMX goes from the baseline IPC of 2.47 up to 5.0 (a speedup of 2.02X). Even better, SMT+MOM goes from an EIPC of 2.98 (20% better than MMX) for a single thread up to 6.19 (a speedup of 2.08X). Overall, SMT+MOM is 2.5 times better than an 8-way superscalar with MMX instructions.
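The ratios quoted above follow directly from the four reported values, as the short check below shows. The variable names are hypothetical, and the values are simply those quoted in the text for the ideal memory system.

```python
ipc_mmx_single = 2.47   # baseline: one thread with MMX
ipc_mmx_peak = 5.0      # SMT+MMX at its highest reported value
eipc_mom_single = 2.98  # one thread with MOM
eipc_mom_peak = 6.19    # SMT+MOM at its highest reported value


def ratio(a, b):
    """Ratio rounded to two decimals, as in the text."""
    return round(a / b, 2)


smt_mmx_gain = ratio(ipc_mmx_peak, ipc_mmx_single)       # SMT scaling with MMX
mom_single_gain = ratio(eipc_mom_single, ipc_mmx_single)  # MOM vs MMX, one thread
smt_mom_gain = ratio(eipc_mom_peak, eipc_mom_single)      # SMT scaling with MOM
overall_gain = ratio(eipc_mom_peak, ipc_mmx_single)       # SMT+MOM vs baseline

print(smt_mmx_gain, mom_single_gain, smt_mom_gain, overall_gain)
```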
As we will see later, the MOM model achieves even higher relative performance under realistic memory assumptions and when a better mix of scalar and vector instructions is achieved.
Figure 5. Performance under real memory system (SMT+MMX IPC and SMT+MOM EIPC, with ideal and real memory).
Table 4, first row (1, 2, 4 and 8 threads):
I1 hit rate        MMX   99.0%   97.8%   96.9%   93.7%
5.3 Performance under Real Memory 
Figure 5 shows performance for the two architectures under study taking into account the effect of the memory system described in section 3. Performance with ideal memory is also presented for comparison purposes.
From the results of the figure we can observe two main phenomena: (a) increasing the number of threads may provide diminishing returns (performance with 4 threads is higher than with 8 threads), and (b) MOM is more robust to the impact of the memory system (MOM exhibits an average performance degradation of 12% in comparison to the 30% of MMX).
In order to understand these two effects, we may look at table 4, where the instruction cache hit rate, the L1 cache hit rate and the average memory latency on L1 are shown. We may observe how, as we increase the number of threads that can co-exist in the processor, the hit rate of both the instruction and L1 caches decreases, due to the mutual interference between threads. This ends up increasing the memory latency, thus reducing performance.
The higher robustness of MOM against thread interference is due to two main reasons: a lower hit rate degradation and a higher memory latency tolerance. The lower hit rate degradation is due to the nature of the stream memory accesses. Since a stream memory reference generates several memory accesses from the same thread, thread interference is reduced. Additionally, stream memory references exhibit a high memory latency tolerance since, as usual with vector memory references, they can amortize the memory latency across the different elements accessed.
Figure 6. Impact of the different fetch policies (MMX with RR, IC and BL; MOM with RR, IC, OC and BL; 1, 2, 4 and 8 threads).
Performance of Advanced Fetch Mechanisms 
As described in [20], the way we select instructions to be fetched may have a relevant impact on final performance. This is especially true for the stream μ-SIMD architecture, where we found that the conventional round-robin policy was not able to optimally mix scalar and vector instructions (with 8 threads, only 1% of the MMX execution cycles perform only vector instructions, while in MOM 4% of the execution cycles perform only vector instructions).
We have evaluated the performance of the following fetch 
policies: 
ROUND-ROBIN (RR) - The basic round-robin policy used until now.
ICOUNT (IC) - Based on the fetch policy proposed in [20]; priority is
given to those threads with the lower number of instructions decoded
but not issued.
OCOUNT (OC) - Similar to ICOUNT, but taking into account the
information of the Stream Length register from the MOM architecture to
give lower priority to those threads with the higher number of
not-issued operations.
BALANCE (BL) - Focused on mixing scalar and vector instructions. If
there are no instructions in the vector pipeline, threads that fetched
vector instructions the last time are given priority. Otherwise,
threads that did not fetch any vector instruction the last time are given
priority. A round-robin policy is used to choose between threads with
the same priority.
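The four policies above can be sketched as a single thread-selection function. This is a minimal illustration, not the simulator used in this paper: the `ThreadState` fields, the `pick_thread` name and the way pending operations are counted (a stream instruction counted by its stream length) are assumptions made for the example, and thread ids are assumed to be 0..n-1.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int                   # thread id, assumed to be 0..n-1
    decoded_not_issued: int    # instructions decoded but not issued (ICOUNT)
    pending_ops: int           # not-issued operations, streams weighted by
                               # their stream length (assumed OCOUNT metric)
    fetched_vector_last: bool  # last fetch contained a vector instruction


def pick_thread(policy, threads, rr_start=0, vector_pipe_empty=False):
    """Return the tid of the thread to fetch from next (hypothetical helper)."""
    n = len(threads)

    def rr_rank(t):
        # Distance from the round-robin pointer; also used as tie-breaker.
        return (t.tid - rr_start) % n

    if policy == "RR":
        key = rr_rank
    elif policy == "IC":
        key = lambda t: (t.decoded_not_issued, rr_rank(t))
    elif policy == "OC":
        key = lambda t: (t.pending_ops, rr_rank(t))
    elif policy == "BL":
        # Empty vector pipeline: prefer threads that fetched vector code last
        # time. Otherwise: prefer threads that did not.
        key = lambda t: (t.fetched_vector_last != vector_pipe_empty, rr_rank(t))
    else:
        raise ValueError(f"unknown policy: {policy}")
    return min(threads, key=key).tid


if __name__ == "__main__":
    threads = [
        ThreadState(0, decoded_not_issued=5, pending_ops=5, fetched_vector_last=False),
        ThreadState(1, decoded_not_issued=2, pending_ops=34, fetched_vector_last=True),
        ThreadState(2, decoded_not_issued=3, pending_ops=3, fetched_vector_last=False),
    ]
    for policy in ("RR", "IC", "OC", "BL"):
        print(policy, pick_thread(policy, threads, rr_start=1, vector_pipe_empty=True))
```

In this made-up state, thread 1 has few decoded instructions but a long pending stream, so ICOUNT selects it while OCOUNT does not, which is the distinction between the two policies.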
Figure 6 shows the performance of all the fetch policies
under study for both the SMT+MMX and SMT+MOM
architectures. We can observe that the alternative fetch
policies are only effective (compared to round-robin) with
a high number of threads, delivering up to a 9% performance
improvement. Note, however, that although the performance
degradation is smoothed, performance with 4 threads is
still higher than performance with 8 threads.
ICOUNT is the policy that delivers the highest performance
for the SMT+MMX model, while OCOUNT is the policy that
exhibits the best performance for the SMT+MOM model.
BALANCE stands as a cost-effective alternative, given the
simplicity of its implementation compared, for instance,
with the OCOUNT policy.
5.4 Decoupling the Cache Hierarchy 
As seen in section 5.3, if we increase the number of
threads of the processor, the data locality is reduced due to
inter-thread interference. As a result, we incur higher
latency penalties and a loss of bandwidth efficiency.
Moreover, as we increase the number of threads, the number of
memory instructions available to execute per cycle rises,
thus increasing the likelihood of bank collisions (reducing
the bandwidth efficiency even further). While the impact of
the latency penalty can be partially hidden by the latency
tolerance properties of SMT execution, the loss of effective
bandwidth is a problem that directly affects final
performance.
In [21] we proposed to bypass vector memory accesses to 
a higher level cache and to decouple general purpose mem- 
ory ports into scalar memory ports and vector memory ports 
(as seen in figure 7). Decoupling scalar and vector memory 
ports into different levels of the cache hierarchy achieves 
two goals: (a) we effectively decouple the vector working 
set from the scalar working set, and (b) we reduce the num- 
ber of memory ports per level of cache, thus reducing bank 
contention. 
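Goal (b) can be made concrete with a back-of-the-envelope model. The sketch below assumes that every simultaneous access selects a bank uniformly and independently, which is a simplification since real access streams are correlated; the bank count of 8 and the function name are hypothetical and do not describe the simulated configuration.

```python
def bank_conflict_probability(accesses: int, banks: int) -> float:
    """Probability that at least two simultaneous accesses hit the same bank.

    Assumed model: each access chooses one of `banks` banks uniformly and
    independently of the others.
    """
    if accesses < 0 or banks < 1:
        raise ValueError("need accesses >= 0 and banks >= 1")
    if accesses > banks:
        return 1.0  # more accesses than banks: a collision is unavoidable
    p_conflict_free = 1.0
    for i in range(accesses):
        # The i-th access must avoid the i banks that are already in use.
        p_conflict_free *= (banks - i) / banks
    return 1.0 - p_conflict_free


if __name__ == "__main__":
    # Hypothetical 8-bank cache: four ports versus two ports per level.
    print(bank_conflict_probability(4, 8))
    print(bank_conflict_probability(2, 8))
```

Under this model, halving the number of ports that share a cache level lowers the collision probability far more than proportionally, which is consistent with the motivation for splitting the ports across levels.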
Figure 7 compares the original 4-port cache organization
we have just evaluated with a decoupled organization. In the
latter configuration, we have 2 memory ports to access scalar
elements from the L1 (single-banked and double-pumped as
in the Alpha 21264), and 2 memory ports directly connected
to L2 used for stream SIMD memory accesses (of course,
the L2 still has to talk to the L1 and I1 caches). The L2
has two banks connected to the vector memory ports via a
crossbar. Naturally, bypassing the L1 when doing stream
accesses can cause coherence problems due to interference
between vector and scalar data. Consequently, a coherency
protocol based on an exclusive-bit policy is used to deal with
this situation [21].
Figure 8 shows the performance of the decoupled cache
hierarchy for all the different fetch policies studied so far.
The first main conclusion is that the new cache decoupling
strategy solves the cache degradation problem: contrary to
the data in figure 5, the 8-thread configuration is now
better than the 4-thread configuration. Another observation is
that the different fetch policies barely provide any
performance benefit for the SMT+MMX architecture, while they
provide up to a 7% performance improvement for the
SMT+MOM processor model.
In order to compare the efficiency of the different cache
hierarchy strategies, we may look at figure 9, which shows
performance results for three different cases: ideal memory
system, conventional memory hierarchy, and decoupled memory
hierarchy. For the MMX-like architecture, results with the
ICOUNT policy are given, while MOM performance results are
presented with the OCOUNT policy.
Figure 7. Proposed cache hierarchies: (a) conventional memory
ports, (b) decoupled scalar and vector memory ports.
Figure 8. Impact of the different fetch policies under the
decoupled cache hierarchy. (Series: MMX RR, MMX IC, MMX BL,
MOM RR, MOM IC, MOM OC, MOM BL; x-axis: number of threads.)
From the results in the figure, we can observe that
bypassing the L1 cache is very useful, but only if we have a
large number of threads, since in that case we are able to
tolerate the 12-cycle L2 latency while taking advantage
of the higher effective bandwidth. Our MOM instructions
benefit even more from the decoupled hierarchy due to their
own additional latency tolerance capabilities. As a result,
the SMT+MOM architecture exhibits only a 15% performance
degradation compared with the idealistic memory system,
while the SMT+MMX architecture exhibits a 30% performance
degradation (for the 8-thread configuration).
6 Related Work 
SIMD execution appears as a natural choice to provide
high uni-threaded performance. In recent years,
general-purpose designers have been including p-SIMD
(sub-word level SIMD parallelism) extensions in
their instruction-set architectures. Intel's media extension
family (MMX [3], SSE [9] and SSE2 [10]) is perhaps the most
widely known set of media-enhanced instructions, but almost
all of the main general-purpose vendors have included one
of their own [5, 6, 7, 4, 8]. Authors have also proposed
using more classical vector designs or specialized streaming
architectures to target multimedia programs [22, 23, 24, 25].
Figure 9. Performance benefits of bypassing L1 on vector memory
accesses. (Series: SMT+MMX IPC and SMT+MOM EIPC, each with ideal
memory, conventional L1, and decoupled L1-L2.)
There has been a large body of research regarding SMT
architectures [16, 20, 26, 27], most of it especially
focused on pure TLP to ILP exploitation. Phenomena like
branch misprediction and memory latency tolerance, cache
degradation and the impact of fetch policies (all of them
typical of the SMT paradigm) have been studied in depth.
The contribution of this paper is the claim that future
media applications will provide explicit TLP that can be
exploited with great benefit by SMT processors with p-SIMD
media extensions. A previous proposal [17] evaluated the
performance of an SMT processor with conventional vector
enhancements for numerical applications. Nonetheless, we
believe that the combination of the two paradigms is more
promising for media processing, as numerical codes are
dominated by the vector component rather than the scalar
one. This paper has shown that ILP exploitation together
with an effective mixing of scalar and vector instructions is
extremely relevant for media performance.
In [28], an SMT processor with MMX-like extensions is
used to evaluate the performance of a parallel version of an
MPEG-2 decoder. The authors identified the large serial
fragment of non-vectorizable code as one of the major
performance bottlenecks. Our proposal differs in two main points:
(a) on the convenience of exploiting heterogeneous explicit
TLP rather than exploiting TLP over the already vectorized
(thus, with high DLP) fragments of code; and (b) on the
convenience of using streaming p-SIMD instructions to
increase the latency tolerance (thus allowing smarter memory
hierarchies).
Finally, there have been other kinds of architecture
proposals targeted at future media applications. The M-PIRE
processor [29] is an architecture explicitly focused on
executing MPEG-4 applications. Its implementation is based
on partitioning the processor into independent
programmable units, each of them optimized for a certain
class of MPEG-4 algorithm. With a similar philosophy,
Sony's PSX2 Emotion Engine [30] uses independent
vector/micro-coded units able to exploit SIMD parallelism
concurrently with the execution of the main integer core.
7 Summary 
In this paper we have studied and evaluated the perfor- 
mance of an efficient architecture for the next generation of 
media workloads. We have shown that in order to match 
the requirements of future standards such as MPEG-4, both 
DLP and TLP can be exploited efficiently, and we have pro- 
posed SMT processors enhanced with p-SIMD extensions 
as a suitable alternative. 
We have evaluated two different p-SIMD alternatives
(an MMX-like extension and a stream, vector-like p-SIMD
ISA) and have shown the advantages of stream-oriented
p-SIMD alternatives such as MOM.
We have seen that, while the SMT capabilities allow
vector execution to be hidden behind integer execution (thus
minimizing the impact of Amdahl's Law), the latency
tolerance properties of MOM memory streams allow the
introduction of smarter cache hierarchies that help alleviate
the cache performance degradation associated with
inter-thread interference. As a result, while SMT provides a
maximum speed-up of 2.1X for the MMX processor (with a 30%
performance degradation compared with the idealistic memory
system), the MOM processor achieves a performance
improvement of 3.3X (compared with the performance of
a uni-threaded MMX model), suffering only a 15%
performance degradation from the impact of a realistic
memory model.
References 
[1] K. Diefendorff and P.K. Dubey. How multimedia workloads will
change processor design. IEEE Micro, Sep 1997.
[2] T.M. Conte et al. Challenges to combine general-purpose and
multimedia processors. IEEE Computer, Dec 1997.
[3] A. Peleg and U. Weiser. MMX technology extension to the INTEL
architecture. IEEE Micro, August 1996.
[4] Mips extension for digital media with 3D. Technical Report
http://www.mips.com, MIPS technologies, Inc., 1997.
[5] K. Diefendorff, P.K. Dubey, et al. Altivec extension to powerPC
accelerates media processing. IEEE Micro, March-April 2000.
[6] 3DNow! technology manual. http://www.amd.com, Advanced Micro
Devices, Inc., 1999.
[7] M. Tremblay, J.M. O'Connor, V. Narayanan, and L. He. VIS speeds
new media processing. IEEE Micro, August 1996.
[8] R. Weiss, J. Hicks. Motion video
instructions (MVI). http://www.alphalinux.org/docs/MVI-full.html,
Compaq, 1999.
[9] Pentium III processor: Developer's manual.
http://developer.intel.com/design/PentiumIII, INTEL, 1999.
[10] http://developer.intel.com/design/processor/index.htm. Willamette
Architecture Software Developer Manuals. Intel, 2000.
[11] J. Corbal, R. Espasa, and M. Valero. Exploiting a new level of DLP
in multimedia applications. MICRO, 1999.
[12] R. Koenen. MPEG-4, multimedia for our time. IEEE Spectrum,
February 1999.
[13] K. Diefendorff. Power4 Focuses on Memory Bandwidth.
Microprocessor Report, October 1999.
[14] L.A. Barroso, K. Gharachorloo, et al. S. Smith, R. Stets, and
B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip
Multiprocessing. In ISCA'00, June 2000.
[15] J. Emer. Simultaneous Multithreading: Multiplying Alpha's
Performance. In Presentation at the Microprocessor Forum '99, October
1999.
[16] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous
multithreading: Maximizing on-chip parallelism. ISCA-22, June 1995.
[17] R. Espasa and M. Valero. Exploiting Instruction- and Data-Level
Parallelism. IEEE Micro, Sept/Oct 1997.
[18] Peter Bannon. Alpha 21364: A Scalable Single-chip SMP.
http://www.digital.com/alphaoem/microprocessorforum.htm,
Compaq Computer Corporation, 1998.
[19] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. Mediabench: A
tool for evaluating and synthesizing multimedia and communication
systems. MICRO 30, 1997.
[20] D.M. Tullsen, S.J. Eggers, J.S. Emer, et al. Exploiting choice:
Instruction fetch and issue on an implementable simultaneous
multithreading processor. ISCA-23, May 1996.
[21] F. Quintana, J. Corbal, R. Espasa, and M. Valero. Adding a vector
unit on a superscalar processor. ICS, June 1999.
[22] W. J. Dally. Tomorrow's computing engines (Keynote Speech). In
HPCA-4, February 1998.
[23] K. Asanovic. Vector microprocessors. PhD thesis, University of
California at Berkeley, 1998.
[24] C. G. Lee and M. G. Stoodley. Simple Vector Microprocessors for
Multimedia Applications. In MICRO 31, December 1998.
[25] C. Kozyrakis and D. Patterson. A new direction for computer
architecture research. IEEE Computer, November 1998.
[26] W. Yamamoto, M.J. Serrano, A.R. Talcott, R.C. Wood, and
M. Nemirovsky. Performance estimation of multistreamed, superscalar
processors. 27th Hawaii International Conference on System Sciences,
January 1994.
[27] S. Hily and A. Seznec. Out-of-order execution may not be cost
effective on processors featuring simultaneous multithreading. HPCA, Jan
1999.
[28] H. Oehring, U. Sigmund, and T. Ungerer. MPEG-2 video
decompression on simultaneous multithreaded multimedia processors.
PACT'99, October 1999.
[29] J. Kneip, B. Schmale, and H. Moller. Applying and implementing the
MPEG-4 multimedia standard. IEEE Micro, Nov-Dec 1999.
[30] A. Kunimatsu, N. Ide, T. Sato, et al. Vector unit architecture for
emotion synthesis. IEEE Micro, March-April 2000.
