A vector-µSIMD-VLIW architecture for multimedia applications by Salamí San Juan, Esther & Valero Cortés, Mateo
A Vector-µSIMD-VLIW Architecture for Multimedia Applications
Esther Salamí and Mateo Valero ∗
Computer Architecture Department
Universitat Politècnica de Catalunya, Barcelona, Spain
{esalami,mateo}@ac.upc.es
Abstract
Media processing has motivated strong changes in the focus and design
of processors. These applications are composed of heterogeneous regions
of code, some of them with high levels of DLP and others with only
modest amounts of ILP. µSIMD-VLIW processors are a common approach to
deal with these applications. However, the ILP regions fail to scale
when we increase the width of the machine, which, on the other hand, is
desirable to achieve high performance in the DLP regions. In this paper,
we propose and evaluate adding vector capabilities to a µSIMD-VLIW core
to speed up the execution of the DLP regions while, at the same time,
reducing the fetch bandwidth requirements. Results show that, in the DLP
regions, both 2- and 4-issue width Vector-µSIMD-VLIW architectures
outperform an 8-issue width µSIMD-VLIW by factors of up to 2.7X and 4.2X
(1.6X and 2.1X on average), respectively. As a result, the DLP regions
become less than 10% of the total execution time and performance is
dominated by the ILP regions.
1 Introduction
As technology evolves, the number of transistors that can be included on
a single chip will continue to increase [40]. To take advantage of these
additional resources, most of the traditional techniques focus on
exploiting more Instruction Level Parallelism (ILP) [28].
Superscalar processors are the most traditional ILP implementation for
the general purpose domain. However, it is widely assumed that current
superscalar processors cannot be scaled by simply fetching, decoding and
issuing more instructions per cycle [17]. Branches, the instruction cache
bandwidth, the instruction window size, the register file and the memory
wall are some of the aspects that currently limit the scalability of
superscalar processors. And, even if these problems could be overcome
with future technology, the performance gains would not justify the chip
area, power and design effort required [27].
∗ This work has been supported by the Ministry of Science and Technology
of Spain and the European Union (FEDER funds) under contracts
TIC2001-0995-C02-01 and TIN2004-07739-C02-01, and by the European
HiPEAC Network of Excellence. We also acknowledge the Supercomputing
Center of Catalonia (CESCA) for supplying the computing resources for
our research.
Very Long Instruction Word (VLIW) processors are another way of
exploiting ILP that requires less hardware complexity. The compiler, and
not the hardware, is responsible for identifying groups of independent
operations and packaging them together into a single VLIW instruction
[9]. The first generation of VLIW processors was successful in the
scientific domain [4, 29], and VLIW has also been the architecture of
choice for most embedded media processors [26, 38, 37]. However, some
issues, such as code compatibility and non-deterministic latencies, have
contributed to the belief that VLIW processors are not appropriate for
the general-purpose domain. At present, a revival of the VLIW execution
paradigm is observed. HP and Intel have recently introduced a new style
of architecture known as Explicitly Parallel Instruction Computing
(EPIC) [35] and a specific architecture implementation: the Itanium
Processor Family (IPF) [36]. EPIC retains compatibility across different
implementations without the complexity of superscalar control logic.
Another kind of parallelism that can be found in programs is Data Level
Parallelism (DLP) (or Single Instruction Multiple Data (SIMD)) [10]. The
DLP paradigm tries to specify with a single instruction a large number
of operations to be performed on independent data words. Traditionally,
this kind of parallelism has been successfully exploited in the
supercomputing domain by vector [31, 3, 39] and array [13, 30]
processors. However, during the last decade, the increasing significance
of media processing has motivated a great interest in exploiting
sub-word level parallelism (also called µSIMD parallelism) [25]. In the
general purpose domain, this shift has been accommodated in a
straightforward way through multimedia extensions such as SSE [15] or
Altivec [24]. This is also a form of DLP in which short data are packed
into a single register and operations are carried out simultaneously on
the different register elements. A third way of exploiting DLP comes
from the combination of the previous two [6, 18, 20]. These
architectures adapt to typical multimedia patterns by extending the
scope of vectorization to two dimensions.
Proceedings of the 2005 International Conference on Parallel Processing (ICPP’05)
0190-3918/05 $20.00 © 2005 IEEE
In the media domain, µSIMD-VLIW processors have been widely proposed
[12, 26, 38, 8], as they are able to exploit DLP by means of the µSIMD
operations and ILP by the use of wide-issue static scheduling. But,
although media applications are usually characterized by high amounts of
DLP, there is also a significant part of the code that exhibits only
modest amounts of ILP and thus benefits little from additional processor
resources. And, even though VLIW processors are simpler than superscalar
designs, very high issue rates require decoding more operations in
parallel and complicate the register files, which clearly increases
power consumption.
In this paper, we propose and evaluate a new architecture that includes
vector operations in a µSIMD-VLIW processor to exploit the DLP typical
of multimedia kernels more efficiently and with lower fetch bandwidth
requirements. Although a quantitative analysis of power consumption is
out of the scope of this paper, it is widely assumed that vector
architectures contribute to increasing power efficiency [2, 20]. Initial
results for a reduced number of benchmarks were first presented in [34].
Apart from a more extensive evaluation, additional contributions include
a thorough description of the architecture and of the static scheduling
of vector operations.
The rest of the paper is organized as follows. Section 2 defines the
concept of scalar and vector regions and evaluates their scalability
separately. Section 3 overviews the Vector-µSIMD-VLIW architecture and
discusses the main compilation issues. Section 4 describes the modeled
architectures and the simulation framework. Next, Section 5 presents
quantitative data such as speed-up and operations per cycle rates.
Finally, the last section summarizes the main conclusions.
2 Scalar and Vector Regions
Most media applications consist of a set of algorithms that process
streams of data in a pipelined fashion. Furthermore, the same set of
operations is performed over the elements inside the stream. Therefore,
media kernels exhibit high amounts of DLP [7]. On the other hand, there
is also a significant portion of code that is difficult to vectorize:
protocol-related processing overhead such as first-order recurrences,
table look-ups and non-streaming memory patterns with large amounts of
indirection. Therefore, a real media program is composed of
heterogeneous regions of code with highly variable levels of
parallelism: some of them with high amounts of DLP and others with only
modest amounts of ILP. We will refer to the regions that can be
vectorized as Vector Regions and to the remaining non-DLP regions of
code as Scalar Regions.

Table 1. Vector regions

Benchmark    %Vect     Vector Regions
JPEG_ENC     29.56 %   RGB to YCC color conversion
                       Forward DCT
                       Quantization
JPEG_DEC     18.46 %   YCC to RGB color conversion
                       H2v2 up-sample
MPEG2_ENC    52.29 %   Motion estimation
                       Forward DCT
                       Inverse DCT
MPEG2_DEC    23.11 %   Form component prediction
                       Inverse DCT
                       Add block
GSM_ENC      18.66 %   LTP parameters
                       Autocorrelation
GSM_DEC       0.91 %   Long term filtering
In order to evaluate the scalar and vector regions separately, we have
marked the start and end points of the most computationally intensive
vector regions in the source code. These regions generally correspond to
one or two levels of nested loops plus some previous initializations.
Table 1 lists the selected benchmarks, the parts of each program that
have been considered as vector regions, and the percentage of the
execution time they represent in a 2-issue width µSIMD-VLIW
architecture. These benchmarks are representative image, audio and video
programs, all from the UCLA Mediabench suite [22].
Figure 1 shows the speed-up of 2, 4 and 8-issue width µSIMD-VLIW
architectures over the 2-issue width µSIMD-VLIW (see Section 4 for
details about methodology and processor configurations). The dashed
lines represent the speed-up in the vector/scalar regions over the
vector/scalar regions of the 2-issue width architecture. The solid lines
refer to the speed-up in the full application.
[Figure 1. Scalability of scalar and vector regions in µSIMD-VLIW
architectures. One panel per benchmark (JPEG_ENC, JPEG_DEC, MPEG2_ENC,
MPEG2_DEC, GSM_ENC, GSM_DEC); x-axis: issue width (2w to 8w); y-axis:
speed-up (1 to 3); curves: speed-up of the application, of the scalar
regions and of the vector regions.]

From the graphs, it can be observed that, except for gsm_enc, the scalar
regions fail to scale above 4-issue width. While increasing the width of
the architecture from 2 to 4 provides an average speed-up of 1.24X in
the scalar regions, moving from 4 to 8-issue only introduces a small
1.03X performance improvement. As far as the vector regions are
concerned, they exhibit potential to benefit from wider issue
scheduling, but this parallelism could be exploited more efficiently by
conventional DLP-oriented techniques. Furthermore, even though the
vector regions scale up to 3.19X for the jpeg_dec application (2.49X on
average), the vectorization percentage is low (24% on average) and the
lack of scalability in the scalar regions (1.28X on average) limits the
performance of the complete application.
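The interplay between these averages can be checked with a
back-of-the-envelope Amdahl-style estimate. The sketch below is not part
of the original evaluation: the function name `combined_speedup` is
hypothetical, and it assumes that the average vectorization percentage
(24%) and the average region speed-ups (2.49X and 1.28X) can be combined
directly, which is only an approximation because averages over
benchmarks do not compose exactly.

```python
def combined_speedup(vector_fraction, vector_speedup, scalar_speedup):
    """Overall speed-up when two regions are accelerated by different factors.

    vector_fraction is the share of baseline execution time spent in the
    vector regions; the remainder is attributed to the scalar regions.
    """
    scalar_fraction = 1.0 - vector_fraction
    new_time = scalar_fraction / scalar_speedup + vector_fraction / vector_speedup
    return 1.0 / new_time


# Average figures quoted in the text for the 8-issue versus the 2-issue machine
# (combining averages this way is an assumption of this sketch).
estimate = combined_speedup(vector_fraction=0.24,
                            vector_speedup=2.49,
                            scalar_speedup=1.28)
print(f"estimated application speed-up: {estimate:.2f}X")
```

Under these assumptions the estimate is roughly 1.45X, far below the
2.49X obtained in the vector regions alone, which illustrates how the
scalar regions cap the gain of the complete application.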
In any case, the actual performance achieved is very far from the
theoretical peak performance and does not justify the hardware
complexity inherent in very aggressive architectures. We claim that
Vector-µSIMD extensions are a better candidate to invest in, as they
clearly reduce the fetch pressure, simplify the control flow and memory
access, and speed up the vector regions without detrimental effects on
the scalar part.
3 Adding Vector Units to a VLIW processor
3.1 Vector-µSIMD ISA Overview
Our Vector-µSIMD ISA is based on the Matrix Oriented Multimedia (MOM)
extension [6]. It can be viewed as a conventional vector ISA where each
operation is an MMX-like operation. However, it does not include costly
vector operations such as conditional execution, gathers or scatters.
It provides vector registers of 16 64-bit words each, vector load and
vector store operations to move data between memory and the vector
registers, and a set of computation operations that operate on vector
registers. Since each word can pack either eight 8-bit, four 16-bit or
two 32-bit items, each vector register can hold a matrix of up to 16x8
elements. The architecture also provides 192-bit packed accumulators
similar to those proposed in the MDMX multimedia extension.
Additionally, two special registers are required to control the
execution of vector operations: the vector length register and the
vector stride register.
As far as terminology is concerned, we reserve the term operation to
refer to each independent machine operation encoded into a VLIW
instruction. Each vector operation executes as many sub-operations as
the vector length dictates. Finally, as the maximum vector length is 16
and each sub-operation can operate on either eight 8-bit, four 16-bit or
two 32-bit items, a vector operation can perform up to 16x8
micro-operations.
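The relation between operations, sub-operations and micro-operations can
be made concrete with a small calculation. This is only an illustrative
sketch; the function and constant names are hypothetical and not part of
the ISA definition.

```python
MAX_VECTOR_LENGTH = 16  # 64-bit words per vector register
WORD_BITS = 64


def micro_operations(vector_length, item_bits):
    """Micro-operations performed by one vector operation.

    One sub-operation is executed per vector element (a 64-bit word), and
    each sub-operation works on all the items packed in that word.
    """
    if not 1 <= vector_length <= MAX_VECTOR_LENGTH:
        raise ValueError("vector length must be between 1 and 16")
    if item_bits not in (8, 16, 32):
        raise ValueError("items are 8, 16 or 32 bits wide")
    items_per_word = WORD_BITS // item_bits  # 8, 4 or 2 packed items
    return vector_length * items_per_word


for bits in (8, 16, 32):
    print(f"{bits}-bit items: up to {micro_operations(MAX_VECTOR_LENGTH, bits)} micro-operations")
```

With 8-bit items and the maximum vector length this gives the 16x8 = 128
micro-operations mentioned above.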
3.2 The Vector-µSIMD-VLIW Architecture
Figure 2 shows the main components of the proposed architecture.
Essentially, it is a VLIW processor with the addition of a vector
register file, one or more vector functional units, and a modified cache
hierarchy specially targeted to serve vector accesses. Both the vector
register file and the vector functional units can be clustered in
independent vector lanes. This can be achieved with relatively simple
logic by replicating the functional units, splitting each vector
register across the lanes and assigning each functional unit to a
certain lane. The different elements of a vector register are
interleaved across lanes, allowing all lanes to work independently. From
the point of view of implementation, a vector register file scales
better than a µSIMD one, due to the organization in lanes, which reduces
the number of ports per cluster. For aggressive configurations, a vector
register file can provide larger storage capacity with similar area cost
and shorter access time [5]. In this work, we use four independent
vector lanes. As our vector lengths are relatively short, a larger
number of lanes would not pay off.
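The lane organization can be pictured with a short sketch. It assumes a
round-robin interleaving in which element i of a vector register lives
in lane i mod LN; the text only states that elements are interleaved
across lanes, so the exact mapping and all names below are assumptions
made for illustration.

```python
NUM_LANES = 4  # number of independent vector lanes used in this work


def lane_of(element_index, num_lanes=NUM_LANES):
    """Lane holding a given vector element under round-robin interleaving (assumed)."""
    return element_index % num_lanes


def elements_per_lane(vector_length, num_lanes=NUM_LANES):
    """Group element indices 0..vector_length-1 by the lane that holds them."""
    lanes = [[] for _ in range(num_lanes)]
    for index in range(vector_length):
        lanes[lane_of(index, num_lanes)].append(index)
    return lanes


def initiation_cycles(vector_length, num_lanes=NUM_LANES):
    """Cycles needed to start all sub-operations when each lane starts one per cycle."""
    return -(-vector_length // num_lanes)  # ceiling division


print(elements_per_lane(16))
for lanes in (4, 8):
    print(lanes, "lanes:", initiation_cycles(8, lanes), "cycle(s) to start VL=8")
```

For a vector length of 8, going from four to eight lanes would save a
single cycle in this model, which is consistent with the observation
that more lanes would not pay off for short vectors.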
The Vector-µSIMD-VLIW architecture also includes a simple accumulator
register file and adds limited connection between the lanes to be able
to perform the last series of accumulations in a reduction operation.
Only one of the lanes needs to read and write the source and destination
packed accumulator. This lane is responsible for performing the last
reduction.
[Figure 2. Vector-µSIMD-VLIW architecture]

We use a vector cache [27] in the second level of the memory hierarchy.
The vector cache is a two-bank interleaved cache targeted at accessing
stride-one vector requests by loading two whole cache lines (one per
bank) instead of individually loading the vector elements. Then, an
interchange switch, a shifter, and mask logic correctly align the data.
Scalar accesses are made to the L1 data cache, while vector accesses
bypass the L1 to access the L2 vector cache directly. If the L2 port is
B×64-bit wide, these accesses are performed at a maximum rate of B
elements per cycle when the stride is one, and at 1 element per cycle
for any other stride. A coherency protocol based on an exclusive-bit
policy plus inclusion is used to guarantee coherency.
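The access rates just described translate into a simple transfer-time
model. The sketch below is an approximation under stated assumptions:
the stride is expressed in 64-bit elements, alignment and bank conflicts
are ignored, the fixed L2 access latency is left out, and the function
name is hypothetical.

```python
def transfer_cycles(vector_length, stride, port_width):
    """Cycles needed to move the elements of one vector access over the L2 port.

    port_width is B, the number of 64-bit elements the port delivers per
    cycle. The fixed L2 access latency is not included.
    """
    if vector_length < 1 or port_width < 1:
        raise ValueError("vector_length and port_width must be positive")
    if stride == 1:
        return -(-vector_length // port_width)  # B elements per cycle
    return vector_length                        # one element per cycle


print(transfer_cycles(16, stride=1, port_width=4))
print(transfer_cycles(16, stride=45, port_width=4))
```

With a 4x64-bit port, a stride-one access of 16 elements needs 4
transfer cycles in this model, whereas any other stride needs 16.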
3.3 Compilation Issues
The Achilles’ heel of the proposed architecture is, obviously, the
compiler; but nowadays there are compilers that allow basic
autovectorization for µSIMD architectures, and the same compiler
techniques could be used to generate Vector-µSIMD code. As we do not
have a reliable vectorizing compiler at our disposal yet, we have used
emulation libraries to hand-write µSIMD and Vector-µSIMD code to
evaluate the approach, and the compiler replaces the emulation function
calls by the corresponding operations.
From the VLIW point of view, new register files and functional units
have been added, and some extra considerations must be taken into
account by the scheduler. The scheduler is the module that needs the
most detailed information about the target architecture, as it is
responsible for assigning a schedule time to each operation, subject to
the constraints of data dependence and resource availability. For every
input operand, an earliest and a latest read latency must be specified
and, for every output operand, an earliest and a latest write latency
[1]. Figure 3.a depicts the execution of a 3-cycle fully pipelined
scalar operation. In this example, the source registers are read
sometime during the first cycle after the initiation of the operation,
and the result is written at the end of three cycles.
[Figure 3. Latency descriptors (Ter = earliest read, Tlr = latest read,
Tew = earliest write, Tlw = latest write, L = flow latency, VL = vector
length, LN = vector lanes).
(a) Scalar operation: Ter = 0, Tlr = 0, Tew = 0, Tlw = L.
(b) Vector operation: Ter = 0, Tlr = $\lfloor (VL - 1)/LN \rfloor$,
Tew = 0, Tlw = $L + \lfloor (VL - 1)/LN \rfloor$.]

In the case of a vector operation, these values also depend on the
vector length (VL) and on the number of parallel vector lanes (LN). As
up to LN sub-operations are initiated per cycle, the last input operand
will be read at cycle $\lfloor (VL - 1)/LN \rfloor$, and the last output
will be written at cycle $L + \lfloor (VL - 1)/LN \rfloor$, where L is
the latency of one sub-operation (see Figure 3.b). The number of
parallel vector lanes is a fixed parameter of the architecture and is
known at compile time, but the vector length is variable for each
operation and is set dynamically. Fortunately, the vector length
register is usually initialized with an immediate value, and a simple
data flow analysis is able to provide the right value to the compiler.
In the few cases in which the vector length is not known at compile
time, the compiler must assume the maximum vector length (16) in order
to ensure correctness. Note that, for a vector unit with four parallel
lanes, the penalty to pay would be three extra cycles at worst (that is,
if the vector length turns out to be four or less).
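The latency descriptors of Figure 3 can be written down as a small
helper. This is a sketch rather than the scheduler's actual code: the
class and function names are hypothetical, and the division is taken as
an integer (floor) division, which is the reading consistent with the
three-cycle worst-case penalty quoted above.

```python
from dataclasses import dataclass

MAX_VECTOR_LENGTH = 16  # assumed by the compiler when VL is unknown


@dataclass(frozen=True)
class LatencyDescriptor:
    earliest_read: int   # Ter
    latest_read: int     # Tlr
    earliest_write: int  # Tew
    latest_write: int    # Tlw


def vector_latency(flow_latency, num_lanes, vector_length=None):
    """Descriptor of a vector operation; an unknown VL falls back to the maximum."""
    vl = MAX_VECTOR_LENGTH if vector_length is None else vector_length
    extra = (vl - 1) // num_lanes  # cycles until the last sub-operation starts
    return LatencyDescriptor(0, extra, 0, flow_latency + extra)


# Worst-case over-estimation with four lanes and a 2-cycle flow latency:
assumed = vector_latency(flow_latency=2, num_lanes=4)                   # VL unknown
actual = vector_latency(flow_latency=2, num_lanes=4, vector_length=4)   # VL = 4
print(assumed.latest_write - actual.latest_write)
```

A scalar operation corresponds to the special case of a vector length of
one, for which the extra term vanishes and the latest write equals the
flow latency.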
The same latency descriptors are used for vector memory operations, but
replacing the number of vector lanes by the width of the L2 port (in
elements). As mentioned in Section 3.2, in the proposed architecture the
execution time of a vector memory operation also depends on the stride.
For simplicity, our compiler schedules all vector memory operations as
having a stride of one and hitting in the L2 vector cache, and the
processor stalls at run-time if either of the two assumptions does not
hold.
On the other hand, provided that the register file supports concurrent
accesses to the same vector register, the compiler can chain [31] two
vector operations with a dependence on a vector register operand by
simply scheduling the second one before the first operation has
completed execution.
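The effect of chaining on the schedule can be sketched as follows. The
model is an assumption made for illustration: the consumer reads
elements at the same rate at which the producer writes them, a value can
be read in the cycle in which its write latency elapses, and the
function name is hypothetical.

```python
def earliest_dependent_start(producer_start, flow_latency, vector_length,
                             num_lanes, chaining):
    """Earliest cycle at which an operation consuming a vector register may start."""
    first_write = producer_start + flow_latency
    if chaining:
        # The consumer follows the producer element by element.
        return first_write
    # Without chaining, wait until the producer has written its last element.
    return first_write + (vector_length - 1) // num_lanes


for chaining in (False, True):
    start = earliest_dependent_start(producer_start=0, flow_latency=2,
                                     vector_length=16, num_lanes=4,
                                     chaining=chaining)
    print("chaining" if chaining else "no chaining", "->", start)
```

In this model chaining hides the extra cycles contributed by the vector
length, so the gain grows with longer vectors and fewer lanes.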
3.3.1 Code Example
Figure 4 shows the scheduling of a vector code generated by the compiler
for a 2-issue width VLIW architecture with two vector units and a wide
4x64-bit port to the vector cache. Latencies are 2 cycles for the vector
units and 5 cycles for the vector cache. This code is taken from the
dist1 function in the mpeg2_enc application, and computes the sum of
absolute differences (SAD) between two blocks of 8x16 pixels. It is
assumed that registers R1 and R2 keep the initial addresses of the
blocks, and lx is the stride between consecutive rows. As the registers
are 64 bits wide, and the stride between rows is not one, we need two
vector registers to keep each block. The SAD operation is implemented
using a packed accumulator that allows parallel execution over the
vector elements. Finally, the values packed in the accumulators are
reduced and the final result is stored.
[Figure 4. Scheduling of motion estimation for a 2-issue
Vector-µSIMD-VLIW processor. The schedule spans cycles 0 to 18 and
places 16 operations, labeled (a) to (p), on the IALU0, IALU1, VALU0,
VALU1, pL1 and pL2 resources. Operations shown: VS=lx; VL=8; A1=0; A2=0;
R3=R1+8; R4=R2+8; V1=[R1]; V2=[R2]; V3=[R3]; V4=[R4]; A1=SAD(V1,V2);
A2=SAD(V3,V4); R5=SUM(A1); R6=SUM(A2); R5=R5+R6; [R7]=R5.]
It can be observed that this kernel is memory bound and, in fact, the
second vector unit is not used at all, as the second SAD operation (m)
must wait for the data being loaded from memory and cannot be scheduled
earlier. Chaining is performed between the vector loads (g) and (j) and
the vector SAD operations (k) and (m), respectively. Note also that the
vector loads are scheduled as having a stride of one, that is, as if
they produced four elements per cycle. As this assumption is not true,
the processor will stall at run-time, thus incurring a large performance
penalty, as we will see in the evaluation section.
Note that the two inner loops of the scalar version have been totally
eliminated, and the Vector-µSIMD architecture only needs to decode 16
operations to process one complete block, compared with the 172
operations required by the µSIMD version of the code.
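The behaviour of the operations in Figure 4 can be emulated in a few
lines. This is a functional sketch only, not the emulation library used
in the paper: the function names are hypothetical, the stride is assumed
to be expressed in bytes, and the packed accumulator is assumed to keep
one partial sum per pixel position within the 64-bit word.

```python
WORD_BYTES = 8  # one 64-bit word packs eight 8-bit pixels


def vector_load(memory, address, stride, vector_length):
    """Emulate V=[R]: load vector_length words whose addresses are stride bytes apart."""
    return [memory[address + i * stride: address + i * stride + WORD_BYTES]
            for i in range(vector_length)]


def packed_sad(va, vb):
    """Emulate A=SAD(Va,Vb): accumulate |a-b| per pixel position (assumed layout)."""
    accumulator = [0] * WORD_BYTES
    for word_a, word_b in zip(va, vb):
        for k in range(WORD_BYTES):
            accumulator[k] += abs(word_a[k] - word_b[k])
    return accumulator


def sad_8x16(memory, r1, r2, lx):
    """SAD between two blocks of 8 rows by 16 pixels stored with row stride lx."""
    vl = 8                                             # VL=8, VS=lx
    v1 = vector_load(memory, r1, lx, vl)               # V1=[R1]
    v2 = vector_load(memory, r2, lx, vl)               # V2=[R2]
    v3 = vector_load(memory, r1 + WORD_BYTES, lx, vl)  # R3=R1+8; V3=[R3]
    v4 = vector_load(memory, r2 + WORD_BYTES, lx, vl)  # R4=R2+8; V4=[R4]
    a1 = packed_sad(v1, v2)                            # A1=0; A1=SAD(V1,V2)
    a2 = packed_sad(v3, v4)                            # A2=0; A2=SAD(V3,V4)
    return sum(a1) + sum(a2)                           # R5=SUM(A1)+SUM(A2)


# Synthetic image: 20 rows of 32 pixels with arbitrary values.
lx = 32
image = bytes((7 * i + 3) % 256 for i in range(lx * 20))
print(sad_8x16(image, r1=0, r2=5 * lx + 3, lx=lx))
```

Each block needs two vector registers because a row of 16 pixels spans
two 64-bit words, while consecutive elements of a register are lx bytes
apart.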
4 Methodology
4.1 Compilation and Simulation Framework
For our experiments we have used Trimaran [21]. Trimaran is a compiler
infrastructure for supporting state-of-the-art research in compiling for
ILP architectures. The system is currently oriented towards EPIC
architectures. To expose sufficient ILP it makes use of advanced
techniques such as Superblock [14] or Hyperblock [23] formation. The
architecture space is characterized by HPL-PD [19], a parameterized
processor architecture.
Our internal release of the compiler also includes Pcode Interprocedural
Pointer Analysis [11] and Cost Effective Memory Disambiguation [32].
Therefore, our scalar versions of code include memory disambiguation
(inherent in the vector versions), which introduces an average
performance speed-up of 1.32X (for an 8-issue width architecture) over
the same codes compiled with the public release of Trimaran. We have
used emulation libraries to hand-write the applications with µSIMD and
Vector-µSIMD extensions. The compiler has been modified to detect the
emulation function calls and replace them by the related low-level
operations. Both the compiler and the HPL-PD machine description have
been enhanced with the new operations, register files and functional
units. The simulator has also been extended to include the new ISAs and
a detailed memory hierarchy.
4.2 Modeled Architectures
We have evaluated 2, 4 and 8-issue width VLIW and µSIMD-VLIW
architectures and two different 2 and 4-issue width Vector-µSIMD-VLIW
configurations. Table 2 summarizes the general parameters of the ten
architectures under study. In order to support the high computational
demand of multimedia applications, our configurations are quite
aggressive in the number of arithmetic functional units. Latencies are
based on those of the Itanium2 processor [16].
The µSIMD-VLIW architecture includes 64-bit registers together with
functional units able to operate on up to eight 8-bit items in parallel.
This extension provides 67 opcodes fairly similar to Intel’s SSE [15]
integer opcodes. Note that the vector architectures are not balanced
against the same issue width VLIW or µSIMD-VLIW architectures because we
consider them as an alternative to wider issue processors. For example,
the arithmetic capability of the 2-issue Vector2 and the 4-issue Vector1
configurations is comparable to that of the 8-issue µSIMD configuration,
not to the 2 or 4-issue µSIMD.
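The comparison of arithmetic capability can be reproduced from the
"SIMD units" row of Table 2. The sketch below assumes that "x4" in that
row denotes four lanes per vector unit and that peak capability can be
approximated as units times lanes; both the reading and the names are
assumptions made for illustration.

```python
# (SIMD functional units, lanes per unit), read from the "SIMD units" row of Table 2.
SIMD_UNITS = {
    ("uSIMD", 2): (2, 1), ("uSIMD", 4): (4, 1), ("uSIMD", 8): (8, 1),
    ("Vector1", 2): (1, 4), ("Vector1", 4): (2, 4),
    ("Vector2", 2): (2, 4), ("Vector2", 4): (4, 4),
}


def peak_sub_operations(architecture, issue_width):
    """Peak number of 64-bit SIMD sub-operations started per cycle (approximation)."""
    units, lanes = SIMD_UNITS[(architecture, issue_width)]
    return units * lanes


for key in (("Vector2", 2), ("Vector1", 4), ("uSIMD", 8)):
    print(key, peak_sub_operations(*key))
```

Under this reading the 2-issue Vector2, the 4-issue Vector1 and the
8-issue µSIMD configurations all peak at eight 64-bit sub-operations per
cycle.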
The first level data cache is a 16 KB, 4-way set associative cache with
one port for the reference 2-issue width architecture. We consider
pseudo-multi-ported caches for the configurations with a greater number
of ports. There is a 256 KB vector cache in the second level and a 1 MB
cache in the third level. Latencies are 1 cycle to the L1, 5 cycles to
the L2, 12 cycles to the L3 and 500 cycles to main memory. We have not
simulated the instruction cache since our benchmarks have a small
instruction working set. The compiler schedules all memory operations
assuming they hit in the cache, and the processor is stalled at run-time
in case of a cache miss or bank conflict.

Table 2. Processor configurations

Resource     VLIW            +µSIMD          +Vector1       +Vector2
             2 / 4 / 8 w     2 / 4 / 8 w     2 / 4 w        2 / 4 w
Int regs     64 / 96 / 128   64 / 96 / 128   64 / 96        64 / 96
SIMD regs    –               64 / 96 / 128   20 / 32 x16    20 / 32 x16
Acc regs     –               –               4 / 6          4 / 6
Int units    2 / 4 / 8       2 / 4 / 8       2 / 4          2 / 4
SIMD units   –               2 / 4 / 8       1 / 2 x4       2 / 4 x4
L1 ports     1 / 2 / 3       1 / 2 / 3       1              1 / 2
L2 ports     –               –               1 x4           1 x4
5 Evaluation
5.1 Speed-up in Vector Regions
Figure 5.a shows the performance speed-up obtained in the vector regions
with perfect memory simulation. By perfect memory we mean that all
accesses hit in the cache, but with the corresponding latency. That is,
all scalar accesses are served after 1 cycle of latency and all vector
accesses in the Vector configurations go to the L2 and take 5 cycles
plus the additional cycles to serve all vector data elements (which
slightly favours the VLIW and µSIMD-VLIW configurations). For each
architecture, the graph shows the speed-up of the vector regions over
the execution time of the vector regions in the 2-issue width VLIW
architecture.
As expected, both the µSIMD and Vector architectures clearly outperform
the VLIW architecture of the same issue width. The 2-issue width Vector2
architecture outperforms the µSIMD architecture of the same width by a
factor ranging from 3.0X to 6.2X (4.4X on average). Furthermore, the
8-issue µSIMD is outperformed by both the 2-issue Vector2, by a factor
of up to 2.6X (1.7X on average), and the 4-issue Vector2, by a factor of
up to 4.0X (2.3X on average).
We also observe that half of the benchmarks do not benefit much from
increasing the number of vector units (that is, when going from Vector1
to Vector2). This is because they have vector regions similar to the
motion estimation example explained in Section 3.3, with very short
vector lengths and small loops. Examples of this include the form
component prediction and the add block regions in the MPEG2 decoder and
the calculation of the long term parameters in the GSM encoder. On the
contrary, other benchmarks such as the JPEG encoder and decoder, whose
vector regions are characterized by larger vector lengths (e.g. color
conversions or upsampling) and/or larger loop sizes (e.g. DCTs), exhibit
a significant improvement in performance when the number of vector units
is doubled.
Figure 5.b shows the speed-up of the vector regions again, but with the
simulation of the memory hierarchy. We observe that the Vector
architectures exhibit the highest performance degradation when
considering a realistic memory system. This fact may seem
counterintuitive, since vector architectures are well known for their
capability to tolerate memory latency. Two reasons explain this
behavior. First, the vector lengths are not long enough to benefit from
this characteristic. Second, VLIW architectures are very sensitive to
non-deterministic latencies.
As explained before, during scheduling the compiler assumes that all
vector accesses have a stride of one, and the processor stalls at
run-time if this assumption does not hold. That is what happens in the
mpeg2_enc benchmark, in which the stride of the main region (the motion
estimation) is the image width. Moreover, in this kernel, these memory
operations represent an important fraction of the overall code,
resulting in a high performance degradation (close to 200%). Apart from
this, all benchmarks exhibit high hit ratios and very low performance
degradation when considering realistic memory.
5.2 Speed-up in Complete Applications
Figure 6 shows the speed-up for complete applications. As expected, the
benchmark that exhibits the highest performance improvement is mpeg2_enc
(up to 4.74X speed-up for the 4-issue Vector2). Even though there are
other benchmarks (such as gsm_enc) with similar (or even greater)
speed-ups in the vector regions, the impact on the overall performance
is not so significant, due to the low vectorization percentage. The
4-issue Vector2 architecture slightly outperforms the 8-issue µSIMD in
all the applications (1.03X on average). Note also that the 4-issue
Vector1 configuration achieves, on average, the same performance as the
8-issue µSIMD, with only one port to the first level cache and two
vector units.
The gap between the different architectures decreases as the issue width
of the machine increases. For example, while the 2-issue Vector2
exhibits a performance improvement of 1.22X (on average) over the
2-issue µSIMD, the 4-issue Vector2 only outperforms the 4-issue µSIMD by
1.14X. That makes sense, as a wide enough µSIMD-VLIW architecture is
able to exploit as ILP the parallelism that the Vector-µSIMD-VLIW
exploits as DLP.

[Figure 5. Speed-up in vector regions: (a) perfect memory, (b) realistic
memory. For each benchmark (JPEG_ENC, JPEG_DEC, MPEG2_ENC, MPEG2_DEC,
GSM_ENC, GSM_DEC), bars show the speed-up (y-axis, 0 to 25) of the VLIW,
+µSIMD, +Vector1 and +Vector2 configurations at 2w, 4w and 8w issue
widths.]
On the other hand, the vector regions represent only 40% of the total
execution time in the 2-issue VLIW architecture. When most of the
available DLP is exploited via multimedia extensions, the remaining
scalar part becomes the bottleneck. In the 4-issue Vector2 architecture,
the vector cycles represent less than 10% of the overall execution time
(except for mpeg2_enc). By Amdahl's Law, further improvements in the
execution of the vector regions would be imperceptible in the complete
application.
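This bound can be made explicit. As a rough illustration, let $f$ be the
fraction of execution time spent in the vector regions on the 4-issue
Vector2 architecture and $s$ a hypothetical further speed-up of those
regions (both symbols are introduced here for illustration only):

```latex
S(s) = \frac{1}{(1 - f) + f/s},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - f},
\qquad
f < 0.10 \;\Longrightarrow\; \frac{1}{1 - f} < \frac{1}{0.90} \approx 1.11 .
```

Even an infinitely fast vector unit would therefore improve such an
application by less than about 11%.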
5.3 Operations per Cycle
Figure 7 shows the dynamic operation count normalized by the dynamic
operation count of the base VLIW architecture. We have distinguished the
contribution of each region. Regions R1 to R3 are the fractions of code
that have been vectorized in the µSIMD and Vector versions, in the same
order they are listed in Table 1 (for example, in mpeg2_enc, R1 accounts
for the motion estimation and R2 and R3 for the forward and inverse
two-dimensional DCT). Region R0 always refers to the remaining scalar
part.
The results confirm that the µSIMD and Vector-µSIMD versions of the code
execute far fewer operations than the scalar versions. This may not seem
so obvious if we take into account that these versions are sometimes
based on algorithms that require many more operations [33].
As can be observed, the Vector architecture executes an average of 84%
fewer operations in the vector regions than the µSIMD (19% fewer in the
complete application). The obvious reason is that Vector architectures
can pack more micro-operations into a single operation (up to 81.10 for
the jpeg_dec application and 38.78 on average). Moreover, there is an
additional reduction in the number of operations involved in the
loop-related control. This reduction in the number of operations to
fetch and decode also translates into a decrease in power consumption.

[Figure 7. Normalized operation count. For each benchmark (JPEG_ENC,
JPEG_DEC, MPEG2_ENC, MPEG2_DEC, GSM_ENC, GSM_DEC), stacked bars for the
VLIW, +µSIMD and +Vector versions show the contribution of regions R0 to
R3; y-axis: normalized operation count (0.0 to 1.0).]
Table 3 shows the average number of operations per cycle for the scalar
and vector regions of code separately. It confirms our belief that the
non-vector regions of code do not benefit from scaling the width of the
machine above 4-issue width. Fetching 1.84 operations per cycle does not
justify the hardware complexity of an 8-issue width architecture. For
the µSIMD and Vector-µSIMD versions we also show the average number of
micro-operations executed per cycle. The Vector-µSIMD ISA obtains the
highest speed-ups by exploiting more data parallelism in the vector
regions (up to 14.00 micro-operations per cycle) and with the lowest
fetch bandwidth requirements (just 1.37 operations per cycle), making it
an ideal candidate for embedded systems, where high issue rates are not
an option. However, for wide issue widths, the µSIMD ISA exhibits more
flexibility to benefit from wide static scheduling and also reaches
significant micro-operations per cycle rates, but at a higher cost.

Figure 6. Speed-up in complete applications (bar labels)

Benchmark    VLIW                +µSIMD              +Vector1     +Vector2
             2w    4w    8w      2w    4w    8w      2w    4w     2w    4w
JPEG_ENC     1.00  1.44  1.70    1.29  1.71  1.94    1.56  1.95   1.60  2.01
JPEG_DEC     1.00  1.28  1.38    1.07  1.37  1.46    1.19  1.42   1.23  1.48
MPEG2_ENC    1.00  1.43  1.77    2.81  3.86  4.47    3.93  4.54   3.90  4.74
MPEG2_DEC    1.00  1.23  1.24    1.26  1.64  1.74    1.45  1.69   1.45  1.82
GSM_ENC      1.00  1.53  1.79    1.33  1.94  2.17    1.58  2.21   1.58  2.21
GSM_DEC      1.00  1.10  1.12    1.03  1.12  1.13    1.04  1.12   1.04  1.13
AVERAGE      1.00  1.34  1.50    1.47  1.94  2.15    1.79  2.15   1.80  2.22
Table 3. OPC = operations per cycle, µOPC =
micro-operations per cycle, SP = speed-up

               Scalar regions   Vector regions          Application
Width ISA         OPC    SP      OPC   µOPC     SP      OPC   µOPC    SP
2w    VLIW       1.44  1.00     1.80   1.80   1.00     1.59   1.59  1.00
      +µSIMD     1.44  1.00     1.78   4.68   2.88     1.52   2.32  1.47
      +Vector1   1.44  1.00     0.87   7.91   9.33     1.36   2.12  1.79
      +Vector2   1.44  1.00     0.98  10.10  10.61     1.37   2.15  1.80
4w    VLIW       1.77  1.24     3.03   3.03   1.66     2.14   2.14  1.34
      +µSIMD     1.78  1.24     2.95   7.80   4.62     1.98   3.05  1.94
      +Vector1   1.71  1.20     1.24  11.64  12.87     1.63   2.55  2.15
      +Vector2   1.76  1.23     1.37  14.00  14.09     1.69   2.64  2.22
8w    VLIW       1.84  1.28     4.54   4.54   2.47     2.42   2.42  1.50
      +µSIMD     1.84  1.29     4.47  12.07   6.76     2.18   3.38  2.15
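One way to read Table 3 is to divide µOPC by OPC in the vector regions, which gives an approximate number of micro-operations carried by each fetched operation. The sketch below does this with the values copied from the table. The ratio is a derived quantity, not one reported in the paper. It is also a ratio of averages taken over all operations in the vector regions, so it need not match per-operation packing figures measured in other ways.

```python
# (OPC, uOPC) in the vector regions, copied from Table 3.
vector_regions = {
    ("2w", "+uSIMD"):   (1.78, 4.68),
    ("2w", "+Vector1"): (0.87, 7.91),
    ("2w", "+Vector2"): (0.98, 10.10),
    ("4w", "+uSIMD"):   (2.95, 7.80),
    ("4w", "+Vector1"): (1.24, 11.64),
    ("4w", "+Vector2"): (1.37, 14.00),
    ("8w", "+uSIMD"):   (4.47, 12.07),
}

def packing_ratio(opc, uopc):
    """Approximate micro-operations per fetched operation (uOPC / OPC)."""
    return uopc / opc

ratios = {cfg: packing_ratio(opc, uopc)
          for cfg, (opc, uopc) in vector_regions.items()}

for (width, isa), ratio in ratios.items():
    print(f"{width} {isa:<9} {ratio:5.2f} micro-operations per operation")
```

With these inputs the µSIMD configurations stay below 3 micro-operations per operation, while the Vector configurations reach about 9 to 10. This is consistent with the lower fetch bandwidth of the Vector-µSIMD ISA.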
6 Conclusions
The actual performance achieved by very wide issue
VLIW architectures is very far from the theoretical peak
performance and does not justify the related hardware
complexity. By analyzing the scalability of the scalar and
vector regions of code separately, we have shown that the
scalar regions do not benefit from increasing the width of
the machine beyond 4-issue. On the other hand, the kind of
parallelism found in the vector regions can be exploited
more efficiently by means of SIMD execution.
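The weight of the scalar regions can be estimated with Amdahl's law from the averages in Table 3. The sketch below takes the 2-issue +Vector2 row (vector-region speed-up 10.61, application speed-up 1.80, scalar regions unchanged at 1.00) and solves for the fraction of baseline time spent in the vector regions. Applying Amdahl's law to speed-ups averaged over several applications is an approximation made here for illustration, and the function names are hypothetical.

```python
def vector_fraction(region_speedup, app_speedup):
    """Fraction f of baseline time spent in the vector regions.

    Solves Amdahl's law, app_speedup = 1 / ((1 - f) + f / region_speedup),
    assuming the scalar regions run at their baseline speed.
    """
    return (1 - 1 / app_speedup) / (1 - 1 / region_speedup)

def remaining_vector_share(region_speedup, app_speedup):
    """Share of the accelerated execution time still spent in vector regions."""
    f = vector_fraction(region_speedup, app_speedup)
    return (f / region_speedup) * app_speedup

# 2-issue +Vector2 averages from Table 3, relative to 2-issue VLIW.
REGION_SP = 10.61
APP_SP = 1.80

f = vector_fraction(REGION_SP, APP_SP)
share = remaining_vector_share(REGION_SP, APP_SP)
print(f"baseline vector fraction: {f:.2f}, share after speed-up: {share:.2f}")
```

Under this approximation the vector regions account for roughly half of the baseline execution time but for less than a tenth of the time once vectorized, so further gains must come from the scalar regions.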
To exploit the data parallelism inherent in the vector
regions, we have proposed the addition of one or more vector
units together with a vector register file and a wide port
to the L2 that provides the bandwidth required by the
vector regions. This extension can be viewed as a
conventional short vector ISA where each element is
operated on in an MMX-like fashion. This enhancement has a
minimal impact on the VLIW core and provides high
performance in the vector regions at low issue rates.
We have evaluated the proposed architecture on complete
audio, video and image processing applications and
compared it against a VLIW architecture with and without
µSIMD extensions. In the vector regions, a 4-issue
Vector-µSIMD-VLIW architecture outperforms the
8-issue µSIMD-VLIW architecture by a factor of up to 4.2X
(2.1X on average). Furthermore, a 4-issue architecture with
only one port to the first level cache and two vector units
achieves, in complete applications, performance similar to
that of the 8-issue µSIMD-VLIW.
On the other hand, we have seen that Vector-µSIMD-VLIW
architectures do not perform well on non stride-one
memory references and exhibit the highest performance
degradation under a realistic memory system, mainly
due to the high sensitivity of VLIW architectures to
non-deterministic latencies. Future research is needed
to improve the memory hierarchy and to evaluate more
flexible scheduling techniques.
References
[1] S. Aditya, V. Kathail, and B. R. Rau. Elcor’s machine de-
scription system: Version 3.0. Technical Report HPL-98-
128, Information Technology Center, 1998.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, N. Morgan,
and J. Wawrzynek. The t0 vector microprocessor. In Hot
Chips VII, pages 187–196, August 1995.
[3] V. Bongiorno and G. Shorrel. Cray sv1, sv1e, sv1ex –
overview, 2000. http://www.cray.com/products/systems/.
[4] R. P. Colwell, R. P. Nix, J. J. O’Donnell, D. B. Papworth,
and P. K. Rodman. A vliw architecture for a trace schedul-
ing compiler. IEEE Trans. on Computers, C-37(8):967–979,
August 1988.
[5] J. Corbal. N-Dimensional Vector Instruction Set Architec-
tures for Multimedia Applications. PhD thesis, UPC, Depar-
tament d’Arquitectura de Computadors, 2002.
[6] J. Corbal, R. Espasa, and M. Valero. Exploiting a new level
of dlp in multimedia applications. In Proceedings of the
32nd Int. Symp. on Microarchitecture, pages 72–79, 1999.
[7] K. Diefendorff and P. Dubey. How multimedia workloads
will change processor design. IEEE Computer, 30(9):43–
45, Sept 1997.
[8] P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and
F. Homewood. Lx: a technology platform for customizable
VLIW embedded processing. In Proc. of the 27th Int. Symp.
on Computer Architecture 2000, pages 203–213, June 2000.
[9] J. A. Fisher. Trace scheduling: A technique for global mi-
crocode compaction. IEEE Trans. on Computers, C-30:478–
490, July 1981.
[10] M. Flynn. Some computer organizations and their
effectiveness. IEEE Trans. on Computers, C-21(9):948–960, 1972.
[11] D. M. Gallagher. Memory Disambiguation to Facilitate
Instruction-Level Parallelism Compilation. PhD thesis, Uni-
versity of Illinois, 1995.
[12] L. Gwennap. Majc gives vliw a new twist. Microprocessor
Report, 13(12):12–15, September 1999.
[13] R. M. Hord. The Illiac IV, the ﬁrst supercomputer. Computer
Science Press, 1982.
[14] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J.
Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank,
T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The
superblock: An effective technique for vliw and superscalar
compilation. Supercomputing, 7:229–248, 1993.
[15] Pentium iii processor: Developer’s manual. Technical re-
port, Intel, 1999.
[16] Intel. Intel itanium2 processor reference manual
for software development and optimization, 2004.
http://developer.intel.com/design/itanium2/manuals/.
[17] M. Johnson. Superscalar Microprocessor Design. Prentice-
Hall, Englewood Cliffs, New Jersey, 1991.
[18] B. Juurlink, S. Vassiliadis, D. Tcheressiz, and H. A. Wi-
jshoff. Implementation and evaluation of the complex
streamed instruction set. In Proceedings of the International
Conference on Parallel Architectures and Compilation Tech-
niques, pages 73–82, September 2001.
[19] V. Kathail, M. Schlansker, and B. R. Rau. Hpl-pd architec-
ture speciﬁcation: Version 1.1. Technical Report HPL-93-
80(R.1), Hewlett–Packard Lab., 2000.
[20] C. Kozyrakis. A media-enhanced vector architecture for em-
bedded memory systems. Technical Report CSD-99-1059,
UCB, 27, 1999.
[21] H. P. Lab., R.-I. Group, and I. Group. Trimaran user manual,
1998. http://www.trimaran.org/docs.html.
[22] C. Lee, M. Potkonjak, and W. H. Mangione-Smith.
Mediabench: A tool for evaluating and synthesizing multimedia
and communications systems. In Proceedings of the 30th Int.
Symp. on Microarchitecture, pages 330–335, 1997.
[23] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A.
Bringmann. Effective compiler support for predicated exe-
cution using the hyperblock. In Proceedings of the 25th Int.
Symp. on Microarchitecture, pages 45–54, Dec. 1992.
[24] H. Nguyen and L. K. John. Exploiting SIMD parallelism
in DSP and multimedia algorithms using the altivec tech-
nology. In Proceedings of the International Conference on
Supercomputing, pages 11–20, 1999.
[25] A. Peleg and U. Weiser. Mmx technology extension to the
intel architecture. IEEE Micro, 16(4):42–50, 1996.
[26] Trimedia tm-1300. http://www-us3.semiconductors.com/.
[27] F. Quintana, J. Corbal, R. Espasa, and M. Valero. Adding a
vector unit on a superscalar processor. In Proc. of the Inter-
national Conference on Supercomputing, pages 1–10, 1999.
[28] B. R. Rau and J. A. Fisher. Instruction-level parallel pro-
cessing: history, overview, and perspective. The Journal of
Supercomputing, 7(1-2):9–50, 1993.
[29] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The cydra
5 departmental supercomputer. IEEE Computer, 22(1):12–
35, January 1989.
[30] S. F. Reddaway. Dap-a distributed array processor. In Pro-
ceedings of the 1st annual symposium on Computer archi-
tecture, pages 61–65. ACM Press, 1973.
[31] R. M. Russell. The cray-1 computer system. Communications
of the ACM, 21(1):63–72, January 1978.
[32] E. Salamı´, J. Corbal, C. Alvarez, and M. Valero. Cost effec-
tive memory disambiguation for multimedia codes. In Proc.
of the Int. Conf. on Compilers, Architecture, and Synthesis
for Embedded Systems, pages 117–126, 2002.
[33] E. Salamı´, J. Corbal, R. Espasa, and M. Valero. An eval-
uation of different dlp alternatives for the embedded media
domain. In Proceedings of the 1st Workshop on Media Pro-
cessors and DSPs, pages 100–109, November 1999.
[34] E. Salamı´ and M. Valero. Initial evaluation of multimedia
extensions on vliw architectures. In Proceedings of the 4th
international workshop on Systems, Architectures, Model-
ing, and Simulation, pages 403–412, July 2004.
[35] M. S. Schlansker and B. R. Rau. Epic: Explicitly parallel
instruction computing. In IEEE Computer, pages 37–45,
February 2000.
[36] H. Sharangpani and K. Arora. Itanium processor
microarchitecture. IEEE Micro, 20(5):24–43, September 2000.
[37] TI. TMS320C62XX family, 1999. http://www.ti.com/sc/-
docs/products/dsp/tms320c6201.html.
[38] Introducing tigersharc, 1999. http://www.analog.com/new/-
ads/html/SHARC2.
[39] A. van der Steen and J. Dongarra. The nec sx-5, 2001.
http://www.top500.org/ORSC/2001.
[40] A. Yu. The future of microprocessors. IEEE Micro,
16(6):46–53, 1996.
