PLX is a small, fully subword-parallel instruction set architecture designed for very fast multimedia processing, especially in constrained environments requiring low cost and power such as handheld multimedia information appliances. In PLX, we select the most useful multimedia instructions added previously to microprocessors. We also introduce a few novel features: a new definition of predication requiring very few bits in each predicated instruction, and datapath scalability from 32-bit to 128-bit words, which allows different degrees of subword parallelism without any changes to the ISA. Performance results from basic multimedia kernels testify to PLX's superiority for multimedia processing.
INTRODUCTION
Multimedia processing involves compute-intensive operations and constitutes an increasingly greater fraction of the general-purpose processor's workload [1]. To achieve better multimedia performance, instruction set architectures (ISAs) have added multimedia extensions [2, 3], such as MAX-2 [4] to PA-RISC processors [5], MMX [6] to IA-32 processors, and a superset of these to IA-64 [7] processors. These ISAs exploit the following two properties of multimedia applications:
1) Huge amounts of data parallelism
2) Extensive use of low-precision data
These two properties are exploited well by the use of subword parallelism, also called microSIMD parallelism [2, 8]. In a subword-parallel architecture, the processor's datapath is partitioned into multiple lower-precision segments called subwords, and the instructions operate in parallel on these subwords (Figure 1).
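As an illustration of the subword-parallel operation described above, the following minimal Python sketch models a 64-bit register as four 16-bit (or eight 8-bit) subwords and adds two registers subword by subword. The helper names (`split`, `join`, `parallel_add`) are hypothetical and are not PLX mnemonics; modular (wrap-around) arithmetic on unsigned subwords is assumed.

```python
def split(word, subword_bits, word_bits=64):
    """Split a word into unsigned subwords, most significant subword first."""
    mask = (1 << subword_bits) - 1
    count = word_bits // subword_bits
    return [(word >> (subword_bits * (count - 1 - i))) & mask for i in range(count)]


def join(subwords, subword_bits):
    """Pack a list of subwords back into one word."""
    word = 0
    for s in subwords:
        word = (word << subword_bits) | s
    return word


def parallel_add(rs1, rs2, subword_bits=16, word_bits=64):
    """Modular add of each subword pair; carries never cross subword boundaries."""
    mask = (1 << subword_bits) - 1
    sums = [(a + b) & mask
            for a, b in zip(split(rs1, subword_bits, word_bits),
                            split(rs2, subword_bits, word_bits))]
    return join(sums, subword_bits)


rs1 = 0xFFFF_0001_7FFF_0010
rs2 = 0x0001_0002_8001_0020
as_halfwords = parallel_add(rs1, rs2, subword_bits=16)  # four 16-bit adds at once
as_bytes = parallel_add(rs1, rs2, subword_bits=8)       # eight 8-bit adds at once
```

The same two operands give different results for the two subword sizes because the overflow of each subword is discarded at its own boundary instead of propagating into its neighbor.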
PLX is a fully subword-parallel ISA designed for very fast media processing [9]. We introduce the PLX architecture along with examples that highlight some of its features, such as low-cost multiplication, a new definition of predication, and datapath scalability.
[Figure 1: parallel operation on the subwords of source registers Rs1 and Rs2]
PLX INSTRUCTIONS
PLX instructions can be classified into three major groups based on the functional unit responsible for their execution: ALU instructions, shift and permute instructions, and multiply instructions (Figure 2). All instructions are 32 bits long and subword sizes are 1, 2, 4 and 8 bytes.
Basic ALU instructions shown in Table 1 include parallel add and subtract (with modular or saturation arithmetic), parallel shift and add, parallel average, parallel maximum and minimum, logical and compare instructions (Section 3).
Low-cost multiplication
Pshift [left | right] add instructions allow low-cost integer and fixed-point multiplication in the ALU without requiring a separate multiplier. Since the shift amounts are limited to 1, 2 or 3 bits to the right or left, they are realized by a small pre-shifter added to the ALU [8, 10]. Because multiplications can be performed efficiently and inexpensively in the ALU, a separate integer multiplier becomes optional for very low-cost and low-power PLX implementations (as indicated by the dotted lines in Figure 2).
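To make the idea of multiplying by a constant with shift-and-add steps concrete, here is a small Python sketch. It assumes that a parallel shift left and add computes (Rs1 << n) + Rs2 in every subword with modular 16-bit arithmetic; the function name `pshift_left_add` and the operand order are illustrative assumptions, not the PLX encoding.

```python
MASK16 = 0xFFFF  # assumed subword size: 16 bits


def pshift_left_add(rs1, rs2, n):
    """Per subword: (rs1 << n) + rs2, with the shift limited to 1, 2 or 3 bits."""
    if n not in (1, 2, 3):
        raise ValueError("the pre-shifter only supports shifts of 1, 2 or 3 bits")
    return [((a << n) + b) & MASK16 for a, b in zip(rs1, rs2)]


x = [1, 2, 3, 1000]                       # four 16-bit subwords of one register

times5 = pshift_left_add(x, x, 2)         # 4x + x  = 5x, one instruction
times11 = pshift_left_add(times5, x, 1)   # 10x + x = 11x, two instructions in total
```

Each call stands for one ALU instruction applied to all subwords at once, so in this sketch four multiplications by 11 cost two instructions.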

Full multiplication
While they are low-cost and effective, the pshift [left | right] add instructions only allow multiplication by constants. Therefore PLX also includes instructions to multiply two registers (Table 2). These instructions are handled by a separate optional multiplier unit.
Pmultiply shift right right-shifts the products before writing the lower-order half of the bits to the destination register. This allows selection of the desired 16 bits of each product. Pmultiply odd and pmultiply even only multiply the odd or even indexed subwords of the source registers, producing full-length products.
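The behavior of these multiply instructions can be sketched in Python as follows. This is only a model under stated assumptions: subwords are 16 bits and unsigned, index 0 is the first subword in the list, and the function names are hypothetical. The text above does not specify signedness or index ordering.

```python
def pmultiply_shift_right(rs1, rs2, shift):
    """Per subword: shift the 32-bit product right, then keep its low 16 bits."""
    return [((a * b) >> shift) & 0xFFFF for a, b in zip(rs1, rs2)]


def pmultiply_even(rs1, rs2):
    """Multiply only subwords 0, 2, ...; each result is a full 32-bit product."""
    return [a * b for a, b in list(zip(rs1, rs2))[0::2]]


def pmultiply_odd(rs1, rs2):
    """Multiply only subwords 1, 3, ...; each result is a full 32-bit product."""
    return [a * b for a, b in list(zip(rs1, rs2))[1::2]]


a = [0x4000, 3, 0x8000, 7]
b = [0x4000, 5, 2, 9]

upper_halves = pmultiply_shift_right(a, b, 16)  # selects bits 31..16 of each product
even_products = pmultiply_even(a, b)
odd_products = pmultiply_odd(a, b)
```

Because the even and odd variants each produce half as many products at twice the width, the two results together hold every full-length product of the source subwords.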
PREDICATION
All PLX instructions are predicated. PLX has 128 1-bit predicate registers organized into 16 predicate register sets of 8 predicate registers each. At any given time, only one of these predicate register sets is active and the registers in this set are numbered P0 through P7. The active predicate register set is changed in software.
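The predicate register organization described in this section can be modeled with a few lines of Python. The class and method names are hypothetical, and the choice to silently ignore writes to P0 is an assumption made for the sketch; the text only states that P0 is always true.

```python
class PredicateFile:
    """Model of 128 one-bit predicate registers: 16 sets of 8, one set active."""

    SETS = 16
    REGS_PER_SET = 8

    def __init__(self):
        self.bits = [[False] * self.REGS_PER_SET for _ in range(self.SETS)]
        self.active = 0

    def select_set(self, index):
        """Change the active set (done in software in PLX)."""
        if not 0 <= index < self.SETS:
            raise ValueError("predicate set index out of range")
        self.active = index

    def read(self, p):
        """Read P0..P7 of the active set; P0 always reads as true."""
        if not 0 <= p < self.REGS_PER_SET:
            raise ValueError("predicate register index out of range")
        return True if p == 0 else self.bits[self.active][p]

    def write(self, p, value):
        """Write P1..P7 of the active set (assumption: writes to P0 are ignored)."""
        if not 0 <= p < self.REGS_PER_SET:
            raise ValueError("predicate register index out of range")
        if p != 0:
            self.bits[self.active][p] = bool(value)


# Bits needed in an instruction to name a predicate register.
index_bits = (PredicateFile.REGS_PER_SET - 1).bit_length()                        # within the active set
direct_bits = (PredicateFile.SETS * PredicateFile.REGS_PER_SET - 1).bit_length()  # all 128 directly
```

An instruction only names one of the eight registers in the active set, which is why the per-instruction cost stays at three bits even though 128 predicate registers exist.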
The predicate registers P1 to P7 can be set and cleared using compare instructions (P0 is always true). This definition of predication requires only three bits in each instruction to specify a predicate register, compared to the seven bits that would be required if the 128 predicate registers were addressed directly.
Two types of compare instructions set the predicate registers in PLX. They are illustrated below, comparing two registers, R1 and R2, for equality.
Example: cmp.eq R1, R2, P1, P2
Operation: If R1 = R2, P1 ← 1 and P2 ← 0, else P1 ← 0 and P2 ← 1.
The first type of compare is useful for implementing if-then-else statements without conditional branch instructions. The second type differs from the first because it writes the predicate registers only if the relation specified in the rel field is true. This allows multiple cmp.pw1.rel instructions to be executed in the same cycle, targeting the same predicate registers, to speed up complex conditional expressions. The values in the predicate registers must be initialized before using cmp.pw1.rel instructions.
PLX also has load, store and jump instructions, as needed for a stand-alone processor.
DATAPATH SCALABILITY
PLX can be implemented as a 32-bit, 64-bit or 128-bit architecture without any changes to the ISA. To allow this, PLX instructions are designed to work for these different word sizes. All subword sizes of 1, 2, 4 and 8 bytes are supported, up to the smaller of the word size or 8 bytes: a 32-bit PLX does not support 8-byte subwords and a 128-bit PLX does not support 16-byte subwords.
Compared to a 64-bit PLX, a 32-bit implementation has a lower performance, but also a lower cost. On the other hand, doubling the datapath width to 128 bits effectively doubles the subword parallelism, but at a lower cost compared to a superscalar implementation with an equivalent degree of operation parallelism.
EXAMPLES AND PERFORMANCE
Performance of PLX is verified by simulating three commonly used multimedia algorithms in four different setups: 1) using a basic RISC-like 64-bit ISA without subword parallelism or predication; 2) using 64-bit MMX instructions; 3) using 64-bit PLX; and 4) using 128-bit PLX to demonstrate datapath scalability.
In all cases, the algorithms are hand-coded and optimized in their respective assembly languages. To emphasize the effects of ISA features, we keep the microarchitecture as simple as possible by using a single-issue pipeline and assuming a perfect memory system, where all loads and stores take a single cycle. Execution latencies are properly accounted for, with single-cycle ALU and SPU instructions and three-cycle multiply instructions. Whenever possible, instructions are scheduled to eliminate pipeline stalls caused by data dependencies.
The simulation software used is part of a comprehensive ISA research toolbox developed under ... compiler as well as other auxiliary tools for workload characterization and performance analysis.
Performance results are shown in Table 4, as speedups of the second, third and fourth setups over the first one.
Digital filtering
The most common DSP kernel is the digital filter. Applications include frequency-domain alterations of signals; low-pass, high-pass and band-pass filtering; audio equalization, adaptive filtering and speech compression. We simulate a 4-tap finite impulse response (FIR) filter that uses fixed-point numbers for both input data and coefficients. This algorithm benefits most from subword parallelism and the low-cost multiplication that is offered by the pshift [left | right] add instructions.
The 64-bit PLX is 4.48 times faster than the basic ISA, and also 4.07 times faster than MMX. A 128-bit PLX doubles the performance of a 64-bit PLX.
Discrete cosine transform
The Discrete Cosine Transform (DCT) and its inverse (IDCT) are commonly used code kernels in image and video compression such as JPEG, MPEG and H.261. We run simulations for an 8x8 2-dimensional DCT using the ...
The most time-critical operations in the IDCT algorithm are matrix transposition and multiplication by fractional constants. Using mix instructions, the transposition of an 8x8 matrix of 16-bit IDCT coefficients is achieved very efficiently. The average number of parallel shift and add (and other) instructions required per multiplication in AAN DCT is only 3.5. Since four 16-bit multiplications are done in parallel in a single-issue 64-bit PLX, this is more than one multiplication per cycle, using just a single ALU (and no multiplier). A superscalar implementation with multiple ALUs can achieve even higher multiplication performance using the parallel shift and add instructions.
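As a rough illustration of transposition with mix instructions, the Python sketch below transposes a 4x4 block of 16-bit values held in four 64-bit registers, each modeled as a list of four subwords. It assumes MAX-2-style mix semantics (interleaving alternate subwords, or pairs of subwords, from two source registers); the function names and the `group` parameter are hypothetical and the exact PLX mix variants are not given in the text. An 8x8 matrix would be handled by applying the same idea to its 4x4 blocks.

```python
def mix_left(a, b, group=1):
    """Interleave the even-positioned groups of a and b (group = subwords per group)."""
    out = []
    for i in range(0, len(a), 2 * group):
        out += a[i:i + group] + b[i:i + group]
    return out


def mix_right(a, b, group=1):
    """Interleave the odd-positioned groups of a and b."""
    out = []
    for i in range(group, len(a), 2 * group):
        out += a[i:i + group] + b[i:i + group]
    return out


def transpose4x4(rows):
    """Transpose four 4-subword registers using eight mix operations."""
    r0, r1, r2, r3 = rows
    # Stage 1: interleave single 16-bit subwords of adjacent rows.
    t0, t1 = mix_left(r0, r1), mix_right(r0, r1)
    t2, t3 = mix_left(r2, r3), mix_right(r2, r3)
    # Stage 2: interleave 32-bit pairs to complete each column.
    return [mix_left(t0, t2, 2), mix_left(t1, t3, 2),
            mix_right(t0, t2, 2), mix_right(t1, t3, 2)]


rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
columns = transpose4x4(rows)
```

No data passes through memory in this sketch; every step is a register-to-register permutation, which is what makes this style of transposition cheap.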
[Table 4: speedups over the basic 64-bit ISA for the FIR Filter, AAN DCT and Median Filter kernels]
Median filter
Median filter is an image-processing algorithm used for noise reduction. Its most compute-intensive step is to find the median of nine 8-bit pixels enclosed within a 3x3 box that is stepped across the whole image. To illustrate the PLX compare instructions and predication feature, we use a bubble-sort algorithm to sort the nine pixels, and then take the center value in the sorted list as the desired median.
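The bubble-sort approach described above can be sketched in Python with the conditional branch of each compare-and-swap replaced by a compare that produces predicates and two predicated moves. The helper names and the greater-than relation are assumptions for illustration; Python's conditional expression merely stands in for a predicated instruction that takes effect only when its predicate is true.

```python
def cmp_gt(a, b):
    """Model of the first compare type: returns complementary predicates (P1, P2)."""
    p1 = a > b
    return p1, not p1


def predicated_move(pred, dest_value, src_value):
    """The move takes effect only if its predicate is true; otherwise dest is kept."""
    return src_value if pred else dest_value


def median_of_nine(pixels):
    """Bubble-sort nine pixel values with predicated swaps and return the middle one."""
    v = list(pixels)
    if len(v) != 9:
        raise ValueError("expected the nine pixels of a 3x3 window")
    for end in range(8, 0, -1):
        for i in range(end):
            p1, _ = cmp_gt(v[i], v[i + 1])
            low = predicated_move(p1, v[i], v[i + 1])    # (P1) v[i]   <- v[i+1]
            high = predicated_move(p1, v[i + 1], v[i])   # (P1) v[i+1] <- v[i]
            v[i], v[i + 1] = low, high
    return v[4]


median = median_of_nine([9, 1, 8, 2, 7, 3, 6, 4, 5])
```

The instruction sequence is the same whether or not a swap happens, so there is no data-dependent branch in the inner loop to mispredict.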
We ...
The double semicolons are used to separate instruction groups that must be executed in different cycles. The theoretical limit for the execution of this sequence is three cycles. For a two-way superscalar implementation, six cycles are required.
The median filter was further optimized by the use of shift right pair, pmaximum and pminimum instructions, for an overall speedup of about 10x over the non-subword-parallel implementation. A pair of pmaximum and pminimum instructions can sort 8 bytes in parallel on a 64-bit PLX, and 16 bytes in parallel on a 128-bit PLX.
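The pmaximum/pminimum optimization can be illustrated with the Python sketch below. Each register is modeled as a list of byte lanes, one lane per output pixel, and a pminimum/pmaximum pair acts as a compare-exchange on every lane at once. The function names and the data layout (nine registers, one per position in the 3x3 window) are assumptions for the sketch, and the shift right pair instruction used to build these registers is not modeled.

```python
def pminimum(a, b):
    """Per-lane minimum of two registers."""
    return [min(x, y) for x, y in zip(a, b)]


def pmaximum(a, b):
    """Per-lane maximum of two registers."""
    return [max(x, y) for x, y in zip(a, b)]


def parallel_median_of_nine(regs):
    """Sort nine registers lane by lane; the fifth register then holds the medians."""
    r = [list(x) for x in regs]
    if len(r) != 9:
        raise ValueError("expected nine registers, one per window position")
    for end in range(8, 0, -1):
        for i in range(end):
            # One pminimum/pmaximum pair orders every lane of r[i] and r[i+1].
            r[i], r[i + 1] = pminimum(r[i], r[i + 1]), pmaximum(r[i], r[i + 1])
    return r[4]


# Nine registers of 8 byte lanes each (a 64-bit datapath); values are arbitrary.
regs = [[(37 * k + 11 * lane) % 256 for lane in range(8)] for k in range(9)]
medians = parallel_median_of_nine(regs)
```

With 16 lanes per list the same code models a 128-bit datapath, which matches the doubling of work per instruction pair described above.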
In each case, 64-bit PLX is much faster than MMX (which has the same degree of subword parallelism), and 128-bit PLX provides a further 2x speedup.
CONCLUSIONS
The PLX architecture is capable of delivering very high multimedia performance at only a fraction of the complexity of existing microprocessors with multimedia extensions. The 32-bit instructions of PLX result in a higher code density compared to architectures with longer instructions such as IA-64. In addition, PLX has a novel definition of predication that allows all instructions to be predicated with 128 predicate registers, while consuming only three bits in each instruction. We plan to investigate more thoroughly the usefulness of predication in reducing branch penalties in media programs. Another novel property of PLX is datapath scalability, which allows processor implementations with different datapath sizes using the same ISA. This gives extra flexibility in balancing complexity versus cost.
Our results show that very high multimedia performance can be achieved with a simple and low-cost ISA like PLX. This makes PLX especially suitable for constrained environments such as wireless multimedia information appliances, where high multimedia performance and low cost and power are required.
