Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block.
Introduction
The recent shift toward computation-intensive multimedia workloads has resulted in a variety of new multimedia extensions to current microprocessors [6, 10, 16, 18, 20]. Many new designs are targeted specifically at the multimedia domain [3, 7, 11]. This trend is likely to continue as it has been projected that multimedia processing will soon become the main focus of microprocessor design [8].
While different processors vary in the type and number of multimedia instructions offered, at the core of each is a set of short SIMD or superword operations. These instructions operate concurrently on data that are packed in a single register or memory location. In the past, such systems could accommodate only small data types of 8 or 16 bits, making them suitable for a limited set of applications. With the emergence of 128-bit superwords, new architectures are capable of performing four 32-bit operations with a single instruction. By adding floating point support as well, these extensions can now be used to perform more general purpose computation.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI
It is not surprising that SIMD execution units have appeared in desktop microprocessors. Their simple control, replicated functional units, and absence of heavily-ported register files make them inherently simple and extremely amenable to scaling. As the number of available transistors increases with advances in semiconductor technology, datapaths are likely to grow even larger.
Today, use of multimedia extensions is difficult since application writers are largely restricted to using in-line assembly routines or specialized library calls. The problem is exacerbated by inconsistencies among different instruction sets. One solution to this inconvenience is to employ vectorization techniques that have been used to parallelize scientific code for vector machines [5, 14, 15]. Since a number of multimedia applications are vectorizable, this approach promises good results. However, many important multimedia applications are difficult to vectorize. Complicated loop transformation techniques such as loop fission and scalar expansion are required to parallelize loops that are only partially vectorizable [2, 4, 17]. Consequently, no commercial compiler currently implements this functionality. This paper presents a method for extracting SIMD parallelism beyond vectorizable loops.
We believe that short SIMD operations are well suited to exploit a fundamentally different type of parallelism than the vector parallelism associated with traditional vector and SIMD supercomputers. We denote this parallelism Superword Level Parallelism (SLP) since it comes in the form of superwords containing packed data. Vector supercomputers require large amounts of parallelism in order to achieve speedups, whereas SLP can be profitable when parallelism is scarce. From this perspective, we have developed a general algorithm for detecting SLP that targets basic blocks rather than loop nests.
In some respects, superword level parallelism is a restricted form of ILP. ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
The remainder of this paper is organized as follows: Section 2 defines superword level parallelism and compares it to other forms of parallelism. Section 3 describes the compiler algorithm used to extract superword level parallelism. A variation of this algorithm targeting vector parallelism is discussed in Section 4. Section 5 presents results on multimedia and scientific benchmarks. Section 6 discusses architectural features that complement SLP compilation. Section 7 outlines reasons why we believe SLP algorithms will be successful, and Section 8 concludes.
Superword Level Parallelism
This section begins by elaborating on the notion of SLP and the means by which it is detected. Terminology is introduced that facilitates the discussion of our algorithms in Sections 3 and 4. We then contrast SLP to other forms of parallelism and discuss their interactions. This helps motivate the need for a new compilation technique.
Description of Superword Level Parallelism
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location. Detection is done through a short, simple analysis in which independent isomorphic statements are identified within a basic block. Isomorphic statements are those that contain the same operations in the same order. Such statements can be executed in parallel by a technique we call statement packing, an example of which is shown in Figure 1. Here, source operands in corresponding positions have been packed into registers and the addition and multiplication operators have been replaced by their SIMD counterparts. Since the result of the computation is also packed, unpacking may be required depending on how the data are used in later computations. The performance benefit of statement packing is determined by the speedup gained from parallelization minus the cost of packing and unpacking.
Depending on what operations an architecture provides to facilitate general packing and unpacking, this technique can actually result in a performance degradation if packing and unpacking costs are high relative to ALU operations. One of the main objectives of our SLP detection technique is to minimize packing and unpacking by locating cases in which packed data produced as a result of one computation can be used directly as a source in another computation.
Packed statements that contain adjacent memory references among corresponding operands are particularly well suited for SLP execution. This is because operands are effectively pre-packed in memory and require no reshuffling within a register. In addition, an address calculation followed by a load or store need only be executed once instead of individually for each element. The combined effect can lead to a significant performance increase. This is not surprising since vector machines have been successful at exploiting the same phenomenon. In our experiments, instructions eliminated from operating on adjacent memory locations had the greatest impact on speedup. For this reason, locating adjacent memory references forms the basis of our algorithm, discussed in Section 3.
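To make statement packing concrete, the following minimal Python sketch models four hypothetical isomorphic statements and their packed equivalent. Python lists stand in for superword registers; the helpers `simd_add` and `simd_mul` and the operand layout are assumptions made for illustration, not a reproduction of Figure 1.

```python
# Illustrative model only: Python lists stand in for superword registers.

def scalar_version(b, c, z, i):
    """Four independent, isomorphic statements: same operations, same order."""
    r0 = b[0] + c[0] * z[i + 0]
    r1 = b[1] + c[1] * z[i + 1]
    r2 = b[2] + c[2] * z[i + 2]
    r3 = b[3] + c[3] * z[i + 3]
    return [r0, r1, r2, r3]


def simd_add(x, y):
    """Hypothetical SIMD add: one operation applied to every lane."""
    return [p + q for p, q in zip(x, y)]


def simd_mul(x, y):
    """Hypothetical SIMD multiply: one operation applied to every lane."""
    return [p * q for p, q in zip(x, y)]


def packed_version(b, c, z, i):
    """Operands in corresponding positions are packed, then combined lane-wise."""
    # z[i+0] .. z[i+3] are adjacent in memory, so a single wide load
    # packs them with no reshuffling.
    z_packed = z[i:i + 4]
    return simd_add(b, simd_mul(c, z_packed))
```

In this model the packed version issues one multiply and one add instead of four of each; whether that is a net win on real hardware depends on the packing and unpacking costs discussed above.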
Vector Parallelism
Vector parallelism is a subset of superword level parallelism. Our results in Section 5 show that 20% of dynamic instruction savings on the SPEC95fp benchmark suite are from non-vectorizable code sequences.
To better explain the differences between superword level parallelism and vector parallelism, we present two short examples, shown in Figures 2 and 3. In Figure 2(b), the first loop is vectorizable, but the second must be executed sequentially.
Figure 2(c) shows the loop from the perspective of SLP. After unrolling, the four statements corresponding to the first statement in the original loop can be packed together. The packing process effectively moves packable statements to contiguous positions, as shown in part (d). The code motion is legal because it does not violate any dependences (once scalar renaming is performed). The first four statements in the resulting loop body can be packed and executed in parallel. Their results are then unpacked so they can be used in the sequential computation of the final statements. In the end, this method has the same effect as the transformations used for vector compilation, while only requiring loop unrolling and scalar renaming.
Figure 3 shows a code segment that averages the elements of two 16x16 matrices. As is the case with many multimedia kernels, our example has been hand-optimized for a sequential machine. In order to vectorize this loop, a vector compiler would need to reverse the programmer-applied optimizations. Were such methods available, they would involve constructing a for loop, restoring the induction variable, and re-rolling the loop. In contrast, locating SLP within the loop body is simple. Since the optimized code is amenable to SLP analysis, hand-optimization has had no detrimental effects on our ability to detect the available parallelism.
Loop Level Parallelism
Vector parallelism, exploited by vector computers, is a subset of loop level parallelism. General loop level parallelism is typically exploited by a multiprocessor or MIMD machine. In many cases, parallel loops may not yield performance gains because of fine-grain synchronization or loop-carried communication. It is therefore necessary to find coarse-grain parallel loops when compiling for MIMD machines. Traditionally, a MIMD machine is composed of multiple microprocessors. It is conceivable that loop level parallelism could be exploited orthogonally to superword level parallelism within each processor. Since coarse-grain parallelism is required to get good MIMD performance, extracting SLP should not detract from existing MIMD parallel performance.
SIMD Parallelism
SIMD parallelism came into prominence with the advent of massively parallel supercomputers such as the Illiac IV [9]. The association of the term "SIMD" with this type of computer is what led us to utilize the term Superword Level Parallelism when discussing short SIMD operations.
SIMD supercomputers were implemented using thousands of small processors that worked synchronously on a single instruction stream. While the cost of massive SIMD parallel execution and near-neighbor communication was low, distribution of data to these processors was expensive. For this reason, automatic SIMD parallelization centered on solving the data distribution problem [1]. In the end, the class of applications for which SIMD compilers were successful was even more restrictive than that of vector and MIMD machines.
Instruction Level Parallelism
Superword level parallelism is closely related to ILP. In fact, SLP can be viewed as a subset of instruction level parallelism. Most processors that support SLP also support ILP in the form of superscalar execution. Because of their similarities, methods for locating SLP and ILP may extract the same information. Under circumstances where these types of parallelism completely overlap, SLP execution is preferred because it provides a less expensive and more energy efficient solution.
In practice, the majority of ILP is found in the presence of loops. Therefore, unrolling the loop multiple times may provide enough parallelism to satisfy both ILP and SLP processor utilization. In this situation, ILP performance would not noticeably degrade after SLP is extracted from a program.
SLP Compiler Algorithm
Our SLP compiler algorithm can be divided into several distinct phases. First, loop unrolling is used to transform vector parallelism into SLP. Alignment analysis then attempts to determine the address alignment of each load and store instruction. This is needed for compiling to architectures that do not support unaligned memory accesses. Next, the intermediate representation is transformed into a low level form and a series of standard compiler optimizations is applied.
The core of our algorithm begins by locating statements with adjacent memory references and packing them into groups of size two. From this initial seed, more groups are discovered based on the active set of packed data. All groups are then merged into larger clusters of a size consistent with the superword datapath width. Finally, a new schedule is produced for each basic block, where groups of packed statements are replaced with SIMD instructions.
The following subsections describe each of these phases in detail. Figure 4 presents a simple example to highlight the core routines and Figure 5 lists the pseudo code. Both will be referenced throughout this section.
Loop Unrolling
Loop unrolling is performed early since it is most easily done at a high level. As discussed, it is used to transform vector parallelism into basic blocks with superword level parallelism. In order to ensure full utilization of the superword datapath in the presence of a vectorizable loop, the unroll factor must be customized to the data sizes used within the loop. For example, a vectorizable loop containing 16-bit values should be unrolled 8 times for a 128-bit datapath. Our system currently unrolls loops based on the smallest data type present.
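The unroll-factor rule can be written as a one-line calculation. The sketch below assumes the factor is simply the datapath width divided by the smallest data type in the loop; the function name `unroll_factor` and its parameters are hypothetical.

```python
def unroll_factor(datapath_bits, type_bits_in_loop):
    """Unroll enough copies to fill the datapath with the smallest type.

    datapath_bits:      superword width in bits, e.g. 128
    type_bits_in_loop:  widths in bits of the data types used in the loop
    """
    smallest = min(type_bits_in_loop)
    # Never return less than 1: a type wider than the datapath means no unrolling.
    return max(1, datapath_bits // smallest)
```

For a 128-bit datapath this yields 8 for a loop of 16-bit values, matching the example in the text, and also 8 for a loop that mixes 32-bit and 16-bit values, since the smallest type decides.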
Alignment Analysis
Alignment analysis determines the alignment of memory accesses with respect to a certain superword datapath width. For architectures that do not support unaligned memory accesses, alignment analysis can greatly improve the performance of our system. Without it, memory accesses are assumed to be unaligned and the proper merging code must be emitted for every wide load and store.
One situation in which merging overhead can be amortized is when a contiguous block of memory is accessed within a loop. In this situation, overhead can be reduced to one additional merge operation per load or store by using data from previous iterations.
Alignment analysis, however, can completely remove this overhead. For FORTRAN sources, a simple interprocedural analysis can determine alignment information in a single pass. This analysis is flow-insensitive, context-insensitive, and visits the call graph in breadth-first order. For C sources, we use an enhanced pointer analysis package developed by Rugina and Rinard [21]. Since this pass also provides location set information, we can consider dependences more carefully when combining packing candidates. A full discussion of alignment analysis is beyond the scope of this paper. A complete description will be given in [13].
Our compilation system is capable of operating both with and without alignment constraints. For simplicity, we describe subsequent phases of the algorithm assuming no architectural support for unaligned accesses. As such, later phases assume alignment information has been annotated to each load and store instruction where possible.
Pre-optimization
SLP analysis is most useful when performed on a three address representation. This way, the algorithm has full flexibility in choosing which operations to pack. If isomorphic statements are instead matched by the tree structure inherited from the source code, long expressions must be identical in order to parallelize. On the other hand, identifying adjacent memory references is much easier if address calculations maintain their original form. We therefore annotate each load and store instruction with this information before flattening.
After flattening, several standard optimizations are applied to an input program. This ensures that parallelism is not extracted from computation that would otherwise be eliminated. Optimizations include constant propagation, copy propagation, dead code elimination, common subexpression elimination, loop-invariant code motion, and redundant load/store elimination. As a final step, scalar renaming is performed to remove output and anti-dependences since they can inhibit parallelization.
Identifying Adjacent Memory References
Because of their obvious impact, statements containing adjacent memory references are the first candidates for packing. We therefore begin the core of our analysis by scanning each basic block to find independent pairs of such statements. Adjacency is determined using both alignment information and array analysis.
In general, duplicate memory operations can introduce several different packing possibilities. Dependences will eliminate many of these possibilities and redundant load/store elimination will usually remove the rest. In practice, nearly every memory reference is directly adjacent to at most two other references. These correspond to the references that access memory on either side of the reference in question. When located, the first occurrence of each pair is added to the PackSet.
Definition 3.1 A Pack is an n-tuple, $\langle s_1, \ldots, s_n \rangle$, where $s_1, \ldots, s_n$ are independent isomorphic statements in a basic block.
Definition 3.2 A PackSet is a set of Packs.
In this phase of the algorithm, only groups of two statements are constructed. We refer to these as pairs with a left and right element.
Definition 3.3 A Pair is a Pack of size two, where the first statement is considered the left element, and the second statement is considered the right element.
As an intermediate step, statements are allowed to belong to two groups as long as they occupy a left position in one of the groups and a right position in the other. Enforcing this discipline here allows the Combination phase to easily merge groups into larger clusters. These details are discussed in Section 3.6. 
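The seeding step can be sketched as follows. This is a toy model under stated assumptions: each statement is a single load or store of one array element, the array base is aligned to the superword, and independence is taken for granted because the statements touch distinct elements. The names `MemStmt`, `adjacent`, and `seed_pack_set` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemStmt:
    name: str
    array: str      # base array being accessed
    index: int      # element offset from the (assumed aligned) base
    is_store: bool  # loads only pair with loads, stores with stores


def adjacent(left, right):
    """True when `right` accesses the element immediately after `left`."""
    return (left.array == right.array
            and left.is_store == right.is_store
            and right.index == left.index + 1)


def seed_pack_set(stmts, lanes):
    """Build Pairs of adjacent references; `lanes` is elements per superword."""
    pack_set = []
    for left in stmts:
        for right in stmts:
            if not adjacent(left, right):
                continue
            # Do not pair across an alignment boundary.
            if left.index // lanes != right.index // lanes:
                continue
            # A statement may be a left element once and a right element once.
            if any(p[0] == left for p in pack_set):
                continue
            if any(p[1] == right for p in pack_set):
                continue
            pack_set.append((left, right))
    return pack_set
```

With five loads of `x[0]` through `x[4]` and four lanes, the sketch produces the chain of Pairs (0,1), (1,2), (2,3) and leaves `x[4]` alone, because pairing it with `x[3]` would span two superwords.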
Extending the PackSet
Once the PackSet has been seeded with an initial set of packed statements, more groups can be added. This is done by finding new candidates that can either:
• Produce needed source operands in packed form, or
• Use existing packed data as source operands.
This is accomplished by following def-use and use-def chains of existing PackSet entries. If these chains lead to fresh packable statements, a new group is created and added to the PackSet. For two statements to be packable, they must meet the following criteria:
• The statements are isomorphic.
• The statements are independent.
• The left statement is not already packed in a left position.
• The right statement is not already packed in a right position.
• Alignment information is consistent.
• Execution time of the new parallel operation is estimated to be less than the sequential version.
The analysis computes an estimated speedup of each potential SIMD instruction based on a cost model for each instruction added and removed. This includes any packing or unpacking that must be performed in conjunction with the new instruction. If the proper packed operand data already exist in the PackSet, then packing cost is set to zero.
Figure 5 (caption fragment): helper functions include 1) has_mem_ref, which returns true if a statement accesses memory, 2) adjacent, which checks adjacency between two memory references, 3) get_alignment, which retrieves alignment information, 4) set_alignment, which sets alignment information when it is not already set, 5) deps_scheduled, which returns true when, for a given statement, all statements upon which it is dependent have been scheduled, 6) first, which returns the PackSet member containing the earliest unscheduled statement, 7) est_savings, which estimates the savings of a potential group, 8) isomorphic, which checks for statement isomorphism, and 9) independent, which returns true when two statements are independent.
As new groups are added to the PackSet, alignment information is propagated from existing groups via use-def or def-use chains. Once set, a statement's alignment determines which position it will occupy in the datapath during its computation. For this reason, a statement can have only one alignment. New groups are created only if their alignment requirements are consistent with those already in place.
When a single definition has multiple uses, there is the potential for many different packing possibilities. If this occurs, the cost model is used to estimate the most profitable possibilities based on what is currently packed. These groups are added to the PackSet in order of their estimated profitability as long as there are no conflicts with existing PackSet entries.
In the example, part (c) shows new groups that are added after following def-use chains of the two existing PackSet entries. Part (d) introduces new groups discovered by following use-def chains. The pseudo code for this phase is listed as extend_packset in Figure 5.
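The packability criteria can be collected into a single predicate. The sketch below is an assumption-laden illustration, not the pseudo code of Figure 5: statements are three-address tuples, the cost model charges one instruction per scalar operation, one per SIMD operation, and one per operand pair that still has to be packed, and the alignment-consistency and unpacking-cost checks are omitted. The helper names follow those in the text, but their bodies and the name `stmts_can_pack` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    dest: str     # variable defined by the statement
    op: str       # operation name, e.g. "load", "mul"
    srcs: tuple   # source operand names, in order


def isomorphic(a, b):
    """Same operation with the same number of operands."""
    return a.op == b.op and len(a.srcs) == len(b.srcs)


def independent(a, b):
    """Neither statement reads or overwrites the other's result."""
    return (a.dest != b.dest
            and a.dest not in b.srcs
            and b.dest not in a.srcs)


def est_savings(left, right, packed_operands):
    """Instructions saved by replacing two scalar ops with one SIMD op."""
    sequential_cost = 2
    simd_cost = 1
    # Packing is free when the operand pair is already available in packed form.
    pack_cost = sum(1 for pair in zip(left.srcs, right.srcs)
                    if pair not in packed_operands)
    return sequential_cost - (simd_cost + pack_cost)


def stmts_can_pack(pack_set, left, right, packed_operands):
    """Check every criterion except alignment consistency (omitted here)."""
    return (isomorphic(left, right)
            and independent(left, right)
            and all(p[0] != left for p in pack_set)
            and all(p[1] != right for p in pack_set)
            and est_savings(left, right, packed_operands) > 0)
```

Under this cost model a candidate pair is accepted only when its operands are already packed, which mirrors the stated goal of reusing packed data produced by earlier groups.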
Combination
Once all profitable pairs have been chosen, they can be combined into larger groups. Two groups can be combined when the left statement of one is the same as the right statement of the other. In fact, groups must be combined in this fashion in order to prevent a statement from appearing in more than one group in the final PackSet. This process, provided by the combine_packs routine, checks all groups against one another and repeats until all possible combinations have been made. Figure 4(e) shows the result of our example after combination.
Since the adjacent memory identification phase uses alignment information, it will never create pairs of memory accesses that cross an alignment boundary. All packed statements are aligned based on this initial seed. As a result, the combination phase will never produce a group that spans an alignment boundary. Combined groups are therefore guaranteed to be less than or equal to the superword datapath size.
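The combination step amounts to chaining groups end to end until nothing more can be merged. The sketch below is one possible rendering, assuming groups are tuples of statement names and that the pairs form chains rather than cycles; the function body is an illustration and not the routine from Figure 5.

```python
def combine_packs(packs):
    """Merge groups whose last statement equals another group's first."""
    packs = [tuple(p) for p in packs]
    changed = True
    while changed:
        changed = False
        for a in packs:
            for b in packs:
                if a is not b and a[-1] == b[0]:
                    # The shared statement appears once in the merged group.
                    merged = a + b[1:]
                    packs = [p for p in packs if p is not a and p is not b]
                    packs.append(merged)
                    changed = True
                    break
            if changed:
                break  # restart the scan over the updated list
    return packs
```

Given the pairs (s1, s2), (s2, s3), (s3, s4), the loop first forms (s1, s2, s3) and then (s1, s2, s3, s4), so each statement ends up in exactly one group.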
Scheduling
Dependence analysis before packing ensures that statements within a group can be executed safely in parallel. However, it may be the case that executing two groups produces a dependence violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between groups if a statement in one group is dependent on a statement in the other. As long as there are no cycles in this dependence graph, all groups can be scheduled such that no violations occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group will need to be eliminated. Although experimental data has shown this case to be extremely rare, care must be taken to ensure correctness.
The scheduling phase begins by scheduling statements based on their order in the original basic block. Each statement is scheduled as soon as all statements on which it is dependent have been scheduled. For groups of packed statements, this property must be satisfied for each statement in the group. If scheduling is ever inhibited by the presence of a cycle, the group containing the earliest unscheduled statement is split apart. Scheduling continues until all statements have been scheduled.
Whenever a group of packed statements is scheduled, a new SIMD operation is emitted instead. If this new operation requires operand packing or reshuffling, the necessary operations are scheduled first. Similarly, if any statements require unpacking of their source data, the required steps are taken. Since our analysis operates at the level of basic blocks, each basic block assumes all data are in an unpacked configuration upon entry to the block. For this reason, all variables that are live on exit are unpacked at the end of the block.
Scheduling is provided by the schedule routine in Figure 5. In the example of Figure 4, the result of scheduling is shown in part (f). At the completion of this phase, a new basic block has been constructed wherever parallelization was successful. These blocks contain SIMD instructions in place of packed isomorphic statements. As we will show in Section 5, the algorithm can be used to achieve speedups on a microprocessor with multimedia extensions.
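The scheduling rule, including the cycle-breaking step, can be sketched as a small list scheduler. This is a minimal illustration under assumptions: statements are names, `deps` maps a statement to the set of statements it depends on (all earlier in the original order), and emitting a tuple of several names stands for emitting one SIMD operation. Packing and unpacking code is not modeled, and the function is not the routine from Figure 5.

```python
def schedule(order, deps, packs):
    """Return the emitted units: 1-tuples are scalar, longer tuples are SIMD."""
    group_of = {s: tuple(p) for p in packs for s in p}
    scheduled = set()
    emitted = []
    while len(scheduled) != len(order):
        progress = False
        for s in order:
            if s in scheduled:
                continue
            unit = group_of.get(s, (s,))
            # Every member of a group must have its dependences satisfied.
            if all(deps.get(m, set()).issubset(scheduled) for m in unit):
                emitted.append(unit)
                scheduled.update(unit)
                progress = True
                break
        if not progress:
            # A cycle between groups: split the group holding the earliest
            # unscheduled statement and fall back to scalar code for it.
            earliest = next(s for s in order if s not in scheduled)
            group = group_of.get(earliest)
            if group is None:
                raise ValueError("dependences must point to earlier statements")
            for m in group:
                del group_of[m]
    return emitted
```

For example, with groups (s1, s4) and (s2, s3), where s2 depends on s1 and s4 depends on s3, neither group can go first. The sketch splits (s1, s4), after which s1, the group (s2, s3), and s4 are emitted in that order.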
A Simple Vectorizing Compiler
The SLP concepts presented in the previous section lead to an elegant implementation of a vectorizing compiler. Vector parallelism is characterized by the execution of multiple iterations of an instruction using a single vector operation. This same computation can be uncovered with unrolling by limiting packing to unrolled versions of the same statement. With this technique, each statement has only one possible grouping, which means that no searching is required. Instead, every statement can be packed automatically with its siblings if they are found to be independent. The profitability of each group can then be evaluated in the context of the entire set of packed data. Any groups that are deemed unprofitable can be dropped in favor of their sequential counterparts. The pseudo code for this algorithm is shown in Figure 7.
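The grouping rule described above needs no search, which a short sketch makes visible. The code below assumes each unrolled statement is tagged with the identifier of the original statement it was copied from, and takes the independence test as a caller-supplied function; the later profitability pass is left out. The name `pack_unrolled_copies` is hypothetical and the code is not the pseudo code of Figure 7.

```python
from itertools import combinations


def pack_unrolled_copies(unrolled, independent):
    """Group unrolled copies of the same original statement.

    unrolled:     list of (original_id, copy_name) in program order
    independent:  function(copy_a, copy_b) -> bool
    Returns (packs, sequential): groups to vectorize and copies left scalar.
    """
    copies_of = {}
    for original_id, copy in unrolled:
        copies_of.setdefault(original_id, []).append(copy)

    packs, sequential = [], []
    for copies in copies_of.values():
        siblings_independent = all(
            independent(a, b) for a, b in combinations(copies, 2))
        if len(copies) > 1 and siblings_independent:
            packs.append(tuple(copies))
        else:
            sequential.extend(copies)
    return packs, sequential
```

A loop body with one vectorizable statement and one statement carrying a loop dependence would yield a single pack for the former and leave the copies of the latter in scalar form, which is the partially vectorizable case discussed in the text.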
While not as general as the algorithm described in the previous section, this technique shares many of the same desirable properties. First, the analysis itself is extremely simple and robust. Second, partially vectorizable loops can be parallelized without complicated loop transformations. Most importantly, this analysis is able to achieve good results on scientific and multimedia benchmarks. The drawback to this method is that it may not be applicable to long vector architectures. Since the unroll factor must be consistent with the vector size, unrolling may produce basic blocks that overwhelm the analysis and the code generator. As such, this method is mainly applicable to architectures with short vectors.
In Section 5, we will provide data that compare this approach to the algorithm described in Section 3.
Results
This section presents potential performance gains for SLP compiler techniques and substantiates them using a Motorola MPC7400 microprocessor with the AltiVec instruction set. All results were gathered using the compiler algorithms described in Sections 3 and 4. Both were implemented within the SUIF compiler infrastructure [23].
Benchmarks
We measure the success of our SLP algorithm on both scientific and multimedia applications. For scientific codes, we use the SPEC95fp benchmark suite. Our multimedia benchmarks are provided by the kernels listed in Table 1.
SLP Availability
To evaluate the availability of superword level parallelism in our benchmarks, we calculated the percentage of dynamic instructions eliminated from a sequential program after parallelization. All instructions were counted equally, including SIMD operations. When packing was required, we assumed that n-1 instructions were required to pack n values into a single SIMD register. These values were also used for unpacking costs. Measurements were obtained by instrumenting source code with counters in order to determine the number of times each basic block was executed. These numbers were then multiplied by the number of static SUIF instructions in each basic block.
Results for both sets of benchmarks are listed in Table 2 and illustrated in Figure 8. The performance of each benchmark is shown for a variety of hypothetical datapath widths. It is assumed that each datapath can accommodate SIMD versions of any standard data type. For example, a datapath of 512 bits can perform eight 64-bit floating point operations in parallel. To uncover the maximum amount of superword level parallelism available, we compiled each benchmark without alignment constraints. This allowed for a maximum degree of freedom when making packing decisions.
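The measurement described above reduces to a weighted sum per program version. The sketch below assumes each basic block is summarized as a pair of execution count and static instruction count; the function names and the sample numbers in the usage note are hypothetical and are not taken from Table 2.

```python
def pack_cost(n):
    """Instructions assumed to pack (or unpack) n values: n - 1."""
    return n - 1


def dynamic_instruction_count(blocks):
    """blocks: iterable of (times_executed, static_instruction_count)."""
    return sum(executions * static for executions, static in blocks)


def percent_eliminated(sequential_blocks, parallel_blocks):
    """Percentage of dynamic instructions removed by parallelization."""
    before = dynamic_instruction_count(sequential_blocks)
    after = dynamic_instruction_count(parallel_blocks)
    return 100.0 * (before - after) / before
```

As a made-up illustration, a block executed 100 times that shrinks from 8 scalar instructions to 2 SIMD instructions plus `pack_cost(4)` packing instructions goes from 800 to 500 dynamic instructions, a 37.5% reduction.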
For the multimedia benchmarks, YUV greatly outperforms the other kernels. This is because it operates on 16-bit values and is entirely vectorizable. The remaining kernels are partially vectorizable and still exhibit large performance gains.
For the SPEC95fp benchmark suite, some of the applications exhibit a performance degradation as the datapath width is increased. This is due to the large unroll factor required to fill a wide datapath. If the dynamic iteration counts for these loops are smaller than the unroll factor, the unrolled loop is never executed. For turb3d and applu, the optimal unroll factor is four. A 256-bit datapath is therefore sufficient since it can accommodate four 64-bit operations.
In fpppp, the most time-intensive loop is already unrolled by a factor of three. A 192-bit datapath can support the available parallelism in this situation.
In Figure 9 and Table 3 we compare the SLP algorithm to the vectorization technique described in Section 4. For the multimedia benchmarks, both methods perform identically. However, there are many cases in the scientific applications for which the SLP algorithm is able to find additional packing opportunities. In Figure 10, we show the available vector parallelism as a subset of the available superword level parallelism.
SLP Performance
To test the performance of our SLP algorithm in a real environment, we targeted our compilation system to the AltiVec [19] instruction set. Of the popular multimedia extensions available in commercial microprocessors, we believe AltiVec best matches the compilation technique described in this paper. AltiVec defines 128-bit floating point and integer SIMD operations and provides a complementary set of 32 general purpose registers. It also defines load and store instructions capable of moving a full 128 bits of data.
Our compiler automatically generates C code with AltiVec macros inserted where parallelization is successful. We then use an extended gcc compiler to generate machine code. This compiler was provided by Motorola and supports the AltiVec ABI (application binary interface). Due to the experimental nature of the AltiVec compiler extensions, it was necessary to compile all benchmarks without optimization. Base measurements were made by compiling the unparallelized version for execution on the MPC7400 superscalar unit. In both cases, the same set of SUIF optimizations and the same gcc backend were used. Since AltiVec does not support unaligned memory accesses, all benchmarks were compiled with alignment constraints in place [13].
Table 4 and Figure 11 present performance comparisons on a 450 MHz G4 PowerMac workstation. Most of the SPEC95fp benchmarks require double precision floating point support to operate correctly. Since this is not supported by AltiVec, we were unable to compile vectorized versions for all but two of the benchmarks. swim utilizes single precision floating point operations, and the SPEC92fp version of tomcatv provides a result similar to the 64-bit version.
Table 4: Speedup on an MPC7400 processor using SLP compilation.
Our compiler currently assumes that all packed operations are executed on the AltiVec unit and all sequential operations are performed on the superscalar unit. Operations to pack and unpack data are therefore required to go through memory since AltiVec provides no instructions to move data between register files. Despite this high cost, our compiler is still able to exploit superword level parallelism and provide speedups.
Architectural Support for SLP
The compiler algorithm presented in Section 3 was inspired by the multimedia extensions in modern processors. However, several limitations make it difficult to fully realize the potential provided by SLP analysis. We list some of these limitations below:
• Many multimedia instructions are designed for a specific high-level operation.
• Although our system is capable of compiling for machines that do not support unaligned memory accesses, the algorithm is potentially more effective without this constraint. Architectures supplying efficient unaligned load and store instructions might improve the performance of SLP analysis.
The first three points discuss simple processor modifications that we hope will be incorporated into future multimedia instruction sets as they mature. The last two points address difficult issues. Solving them in either hardware or software is not trivial. More research is required to determine the best approach.
Figure 11: Percentage improvement of execution time on an MPC7400 processor using SLP compilation.
7 Keys to General Acceptance of SLP
Many of the techniques developed by the academic compiler community are not accepted in mainstream computing. A good example is the work on loop level parallelization that has continued for over three decades. However, in a very short period of time, ILP compilers have become universal. We believe the following characteristics are critical to the general acceptance of a compiler optimization:
Robustness: If simple source code modifications drastically alter program performance, success becomes dependent upon the user's understanding of compiler intricacies. For example, techniques to uncover loop level parallelism are prone to wide fluctuations in performance. A change in one statement of the loop body may result in a vector compiler's sequentialization of the entire loop. In the case of ILP and SLP, failure to parallelize a few statements will not significantly impact aggregate performance. This makes methods for their extraction much more robust.
Scalability: Compiler techniques must be able to handle large programs if they are to gain acceptance for real applications. Some analyses required by loop optimizations do not scale well to large code sizes because of dependence on global program analysis. Although global analysis can improve the effectiveness of ILP and SLP, it is not required. Therefore, complexity grows linearly with program size. This results in smooth scaling to larger applications.
Simplicity: Complex compiler transformations are more prone to bugs than simple analyses. Problems are likely to appear only under very specific conditions, making them difficult to detect. Many time-critical projects are compiled without optimizations in order to avoid possible compiler errors. Coarse-grain parallelization and vectorization require involved analyses that are more likely to exhibit this behavior. However, most ILP techniques, as well as the SLP techniques presented in Section 3, are extremely simple to understand, implement and validate. In addition, it is often the case that simplicity leads to faster compilation.
Portability: Optimizations that are dependent on particular features of a source language or programming style will not become universal. Techniques for extracting loop level parallelism are limited because they only apply to programs written with loops and arrays. Alternatively, ILP and SLP techniques are applied at the level of basic blocks, making them less dependent on source code characteristics.
Effectiveness: No compiler technique will be used if it does not substantially improve program performance. In Section 5, we showed that our algorithm for detecting SLP can provide remarkable performance gains.
We believe SLP compiler techniques have the potential to become universally accepted as viable and effective methods of extracting SIMD parallelism. As a result, we expect future architectures to place increasing importance on SLP operations.
Conclusion
In this paper we introduced superword level parallelism, the notion of viewing parallelism from the perspective of partitioned operations on packed superwords. We showed that SLP can be exploited with a simple and robust compiler implementation that exhibits speedups ranging from 1.24 to 6.70 on a set of scientific and multimedia benchmarks.
We also showed that SLP concepts lead to an elegant implementation of a vectorizing compiler. By comparing the performance of this compiler to the more general SLP algorithm, we demonstrated that vector parallelism is a subset of superword level parallelism.
Our current compiler implementation is still in its infancy. While successful, we believe its effectiveness can be improved. By extending the SLP analysis beyond basic blocks, more packing opportunities could be found. Furthermore, SLP could offer a form of predication, in which unfilled slots of a wide operation could be filled with speculative computation. If data are invalidated due to control flow, they could simply be discarded. Recent research has shown that compiler analysis can significantly reduce the size of data types needed to store program variables [22]. Incorporating this analysis into our own has the potential of drastically improving performance by increasing the number of operands that can be packed and executed in parallel.
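The arithmetic behind the data-type narrowing argument can be sketched in C as follows. The sketch assumes a 128-bit superword and lane widths of 8, 16, and 32 bits. The function names are hypothetical, and the rounding rule is a simplifying assumption rather than the bitwidth analysis of [22].

```c
#include <assert.h>

/* Assumed superword width in bits. */
static const unsigned SUPERWORD_BITS = 128u;

/* Round the number of bits a variable actually needs up to the next
   assumed lane width (8, 16, or 32). Returns 0 if no lane width fits. */
static unsigned lane_width_for(unsigned needed_bits)
{
    if (needed_bits == 0 || needed_bits > 32)
        return 0;
    if (needed_bits <= 8)
        return 8;
    if (needed_bits <= 16)
        return 16;
    return 32;
}

/* Number of operands of the given required size that fit in one superword. */
static unsigned lanes_per_superword(unsigned needed_bits)
{
    unsigned width = lane_width_for(needed_bits);

    /* Every assumed lane width divides the superword evenly. */
    assert(width == 0 || SUPERWORD_BITS % width == 0);
    return width ? SUPERWORD_BITS / width : 0;
}
```

Under these assumptions, a variable declared as a 32-bit integer occupies one of four lanes, but if analysis proves that it needs only 12 bits it can occupy one of eight 16-bit lanes, and if it needs only 7 bits it can occupy one of sixteen 8-bit lanes.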
Today, most desktop processors are equipped with multimedia extensions. Nonuniformities in the different instruction sets, exacerbated by a lack of compiler support, have left these extensions underutilized. We have shown that SLP compilation is not only possible, but also applicable to a wider class of application domains. As such, we believe SLP compilation techniques have the potential to become an integral part of general purpose computing in the near future.
