Recovery Methodology to Avoid Loss for SLP  by shuai, Wei et al.
Procedia Engineering 15 (2011) 204 – 208
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2011.08.041
Available online at www.sciencedirect.com




Advanced in Control Engineering and Information Science
Recovery Methodology to Avoid Loss for SLP 
Wei Shuai a, Zhao Rong-cai a, Yao Yuan a, *
a Zhengzhou Information Science and Technology Institute, Zhengzhou 450002, China
Abstract 
Nowadays more and more processors are integrated with SIMD extensions, and many compilers apply
auto-vectorization. SLP is a vectorization algorithm that can vectorize scientific applications more
effectively than traditional algorithms. However, if basic blocks are not vectorized efficiently by SLP,
performance will degrade. To solve this problem, this paper presents an SLP algorithm that applies a
recovery methodology. The algorithm first uses SLP to vectorize the program, then estimates the
vectorization benefit with a cost model, and finally restores the basic blocks that were not vectorized
efficiently to their original states. Experimental results indicate that with the new policy, the speedup
gain for some applications can reach 29.4%.
Selection and/or peer-review under responsibility of [CEIS 2011] 
Keywords: SIMD; vectorization; data dependence analysis; SLP algorithm
1. Introduction
Exploiting subword parallelism is becoming more and more common on modern computational platforms.
Intel released Pentium chips featuring MMX in 1996; in 1997 Motorola introduced the AltiVec instruction
set; and in 2000 AMD released Athlon chips containing the 3DNow! instruction set, which is similar to the
SSE instruction set [1]. Some production compilers can automatically vectorize for these instruction
sets. SIMD technology improves application performance by executing a single
* Corresponding author. Tel.: 8637181631324; fax: 8637181631324. This paper is supported by the CHB National Major
Science and Technology Project Foundation of China under Grant No. 2009ZX01036. 
E-mail address: weis0906@163.com.  
Open access under CC BY-NC-ND license.
instruction on multiple data elements [2]. As SIMD extensions added support for floating-point types,
they began to be used to improve the performance of scientific programs.
The traditional vectorization approach proposed by Randy Allen and Ken Kennedy [3] is based on
dependence analysis; it vectorizes the statements that are not part of a strongly connected component
(SCC) of the dependence graph, working from the outer loop to the inner loop. Because it initially
targeted vector machines, the dependence distance is not considered: once statements form a dependence
cycle, they cannot be vectorized. Current SIMD extensions have short vector registers, so a dependence
can be ignored when its distance is larger than the vector factor (VF). Larsen and Amarasinghe proposed
the SLP approach [4] in 2000, which can vectorize scientific applications more effectively. However, if
the dependence distance is shorter than VF and the statements form a dependence cycle, they still cannot
be vectorized unless some pre-transformation is applied.
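The role of the dependence distance can be illustrated with a small simulation. The sketch below is not taken from the paper: the loop `a[i] = a[i - dist] + b[i]`, the function names, and the list-based model of a vector register are all assumptions chosen for illustration. It compares ordinary scalar execution with execution in chunks of VF elements, where each chunk loads all of its operands before storing any result.

```python
def scalar_loop(a, b, dist):
    """Reference semantics: a[i] = a[i - dist] + b[i], one element at a time."""
    a = list(a)
    for i in range(dist, len(a)):
        a[i] = a[i - dist] + b[i]
    return a


def chunked_loop(a, b, dist, vf):
    """Same loop executed vf elements at a time (a model of SIMD execution)."""
    a = list(a)
    i = dist
    while i + vf <= len(a):
        # "Vector load": every source element is read before any store.
        loaded = [a[i + k - dist] for k in range(vf)]
        # "Vector add and store".
        for k in range(vf):
            a[i + k] = loaded[k] + b[i + k]
        i += vf
    # Scalar epilogue for the remaining elements.
    while i < len(a):
        a[i] = a[i - dist] + b[i]
        i += 1
    return a


a = [1] * 16
b = [1] * 16
# Distance 5 exceeds VF = 4: chunked execution matches the scalar result.
print(chunked_loop(a, b, 5, 4) == scalar_loop(a, b, 5))
# Distance 1 is shorter than VF = 4: chunked execution gives a different result.
print(chunked_loop(a, b, 1, 4) == scalar_loop(a, b, 1))
```

When the distance exceeds VF, no element read by a chunk is written by that same chunk, so the cycle in the dependence graph does not prevent vectorization.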
This paper presents a new approach based on SLP that applies a recovery methodology. It first uses the
SLP algorithm to vectorize the program, then estimates the vectorization benefit with a cost model, and
finally restores the basic blocks that were not vectorized efficiently to their original states. The
paper is organized as follows: Section 2 briefly describes the compiler framework. Section 3 presents the
vectorization algorithm and its cost model, and Section 4 describes how insufficiently vectorized basic
blocks are recovered. Section 5 presents results on scientific benchmarks and analyzes them.
2. Compiler Framework
The compiler framework, which is based on Open64, is shown in Figure 1. The front end comes from GCC and
translates the source program into a WHIRL tree. The compiler then performs interprocedural analysis
(IPA), loop-level analysis and optimization (LNO), and WHIRL optimization (WOPT). The vectorizing
compiler does not include a code generation module; instead, the WHIRL trees are finally translated back
into C source code by the whirl2c module, with vectorizable statements replaced by the corresponding SIMD
intrinsic functions. Vectorization is mainly implemented in the LNO phase: it first uses the SLP
algorithm to vectorize the program, then estimates the vectorization benefit with a cost model, and
finally restores the basic blocks that were not vectorized efficiently to their original states.
Fig. 1. Vectorization compiler framework based on Open64
3. The Vectorization Algorithm with Recovery Strategy
This section describes in detail the algorithm that vectorizes loops with a recovery strategy. The
algorithm is based on SLP: it first uses the SLP algorithm to vectorize the program, then estimates the
vectorization benefit with a cost model, and finally restores the basic blocks that were not vectorized
efficiently to their original states.
3.1. SLP Algorithm 
To vectorize scientific applications more effectively, Larsen and Amarasinghe proposed Superword Level
Parallelism in 2000. We refer to the algorithm as the SLP algorithm in this paper. SLP targets the basic
block: it identifies isomorphic statements whose operations are vectorizable and groups them together, a
process called packing.
The main phases of the SLP algorithm are as follows:
1. Loop unrolling: unroll the loop by the unroll factor to expose more statements that can be packed.
Normally the unroll factor is equal to VF.
2. Alignment analysis: determine the alignment of memory accesses.
3. Pre-optimization, such as three-address conversion, copy propagation, redundant load/store
elimination, dead code elimination and so on.
4. Initialize the packset by packing the statements that have adjacent memory accesses, as illustrated
in Figure 2(a) and Figure 2(b).
5. Extend the packset by following DU and UD chains, as illustrated in Figure 2(c) and Figure 2(d).
6. Combine packs that have common statements, as illustrated in Figure 2(e).
7. Schedule packs according to the dependence graph, as illustrated in Figure 2(f). If a dependence
violation occurs at the scheduling stage, at least one pack needs to be split apart.
Fig. 2. SLP algorithm chart [4]
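Phases 4 and 6 can be sketched in a few lines. The following is a deliberately simplified illustration, not the implementation used in the paper: the `Stmt` record, the function names, and the adjacency test (same operation, same array, consecutive index) are assumptions, and a real SLP implementation would also check alignment, statement independence, and full isomorphism before forming a pack.

```python
from typing import NamedTuple


class Stmt(NamedTuple):
    sid: int     # statement id within the basic block
    op: str      # operation performed by the statement
    array: str   # array written by the statement
    index: int   # constant element index after unrolling


def seed_packs(stmts):
    """Phase 4 (simplified): pair statements that store to adjacent elements."""
    pairs = []
    for left in stmts:
        for right in stmts:
            if (left.op == right.op and left.array == right.array
                    and right.index == left.index + 1):
                pairs.append((left.sid, right.sid))
    return pairs


def combine_packs(pairs):
    """Phase 6 (simplified): merge packs where one ends with the statement
    that another begins with."""
    groups = [list(p) for p in pairs]
    merged = True
    while merged:
        merged = False
        for g in groups:
            for h in groups:
                if g is not h and g[-1] == h[0]:
                    g.extend(h[1:])
                    groups.remove(h)
                    merged = True
                    break
            if merged:
                break
    return [tuple(g) for g in groups]


# A loop body unrolled four times (a[i+0..3] = ...) plus one unrelated statement.
stmts = [Stmt(0, "mul", "a", 0), Stmt(1, "mul", "a", 1),
         Stmt(2, "mul", "a", 2), Stmt(3, "mul", "a", 3),
         Stmt(4, "add", "x", 7)]
pairs = seed_packs(stmts)
print(pairs)                 # [(0, 1), (1, 2), (2, 3)]
print(combine_packs(pairs))  # [(0, 1, 2, 3)]
```

Statement 4 joins no pack and stays scalar, which is the partial vectorization that SLP permits within a basic block.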
3.2. Estimate the Vectorization Cost 
The total benefit is the sum of the vectorization benefits of the vectorized statements in the basic
block. Suppose there are n vectorized statements in the basic block, Pi is the benefit of vectorized
statement i, and Di is the iteration count of vectorized statement i. Then the benefit of the basic block
can be expressed as below:
$\mathrm{Benefit}_{BB} = \sum_{i=1}^{n} P_i \cdot D_i$    (1)
3.2.1. Calculate iteration count
If the block lies in an iteration space of d levels, and for each loop level j the initial value is Sj,
the terminating value is Ej, and the step is Tj, then the total iteration count of the block over the
iteration space is:
$D = \prod_{j=1}^{d} \left( \left\lfloor \frac{E_j - S_j}{T_j} \right\rfloor + 1 \right)$    (2)
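As a rough illustration, the iteration count of a block nested in d loops can be computed as the product of the per-level trip counts. The sketch below assumes that each level runs from Sj to Ej inclusive with a positive step Tj; the inclusive bound, the zero-trip handling, and the function name are assumptions made for this example rather than details given in the text.

```python
def iteration_count(levels):
    """Total iterations of a block nested in the given loop levels.

    levels: list of (start, end, step) tuples, one per loop level,
            with `end` treated as inclusive (an assumption of this sketch).
    """
    total = 1
    for start, end, step in levels:
        if step <= 0:
            raise ValueError("this sketch only handles positive steps")
        # A level whose end lies before its start executes zero times.
        trips = 0 if end < start else (end - start) // step + 1
        total *= trips
    return total


# Outer loop 1..100 step 1, inner loop 0..9 step 2 (values 0, 2, 4, 6, 8).
print(iteration_count([(1, 100, 1), (0, 9, 2)]))  # 500
```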
3.2.2. Calculate the benefit of each statement
The benefit and cost analysis works on the abstract syntax tree (AST). The benefit of each statement is
computed by traversing its AST from top to bottom in breadth-first order and accumulating the scalar and
vector latencies of each operator. The latencies are taken from a static CPU instruction latency table.
Suppose a statement in a basic block (BB) and its corresponding AST are as follows:
Fig. 3. A statement and its corresponding AST
For a statement i in the basic block, let VCi be its accumulated vector instruction latency, SCi its
accumulated scalar instruction latency, and VF the vector factor (e.g. if the vector length is 32 bytes
and a double is 8 bytes, the vector factor is 4). The benefit Pi of this statement is:
$P_i = SC_i - VC_i / VF$    (3)
Take the statement shown in Figure 3 as an example. If the data type of the statement is double, its
cost is the accumulation of the costs of two double multiplications and one double addition. The vector
factor for double is 4, so a vector register can operate on four elements simultaneously. If the costs of
the scalar multiply and add operations are 6 and 2, and the costs of the vector multiply and add
operations are also 6 and 2, then the benefit is 2*6+2-(2*6+2)/4=10.5. However, when the vector required
by an operation cannot be obtained directly from the result of another operation, a regrouping operation
is needed, and its cost must also be taken into account.
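The cost model can be checked numerically with a short sketch. The function names and the list-of-latencies representation are assumptions for illustration; the per-statement formula follows Eq. (3), and the block total weights each statement's benefit by its iteration count, as the definitions of Pi and Di suggest. Evaluating Eq. (3) with the latencies stated above (two multiplies at 6, one add at 2, VF = 4) gives 14 - 14/4 = 10.5.

```python
def statement_benefit(scalar_latencies, vector_latencies, vf):
    """Eq. (3): P_i = SC_i - VC_i / VF for a single statement."""
    sc = sum(scalar_latencies)  # accumulated scalar latency SC_i
    vc = sum(vector_latencies)  # accumulated vector latency VC_i
    return sc - vc / vf


def block_benefit(statements):
    """Sum of P_i * D_i over (benefit, iteration_count) pairs.

    The weighting by iteration count is this sketch's reading of the text.
    """
    return sum(p * d for p, d in statements)


# Two double multiplies (latency 6) and one double add (latency 2), VF = 4.
p = statement_benefit([6, 6, 2], [6, 6, 2], 4)
print(p)  # 10.5

# A block with one profitable statement and one that loses 2 cycles per
# iteration (for example because of regrouping), both executed 100 times.
print(block_benefit([(p, 100), (-2.0, 100)]))  # 850.0
```

A block whose total comes out non-positive under such a model is a candidate for being restored to its scalar form.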
4. Recover the basic blocks that have not been sufficiently vectorized
Several factors influence vectorization performance, such as how much SIMDization reduces the total
instruction count compared with serial execution, the memory access pattern, and instruction merging; in
particular, code that has undergone three-address transformation can become unsuitable for vectorization.
All these factors may make vectorization unprofitable, and if we vectorize loops that are not suitable
for vectorization, the performance of the program will degrade.
For these reasons, we make a copy of the basic block before vectorization; if the statements have not
been adequately vectorized or vectorization is not profitable, the statements are restored to their
original states. Some global data structures related to these statements, such as the def-use graph and
the array dependence graph, also need to be restored. So before vectorization we also copy the data
structures related to the statements in the basic block, and restore them if the statements are
restored.
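The snapshot-and-restore policy described above can be sketched as follows. This is a schematic illustration only: the basic block is modeled as a list of statement strings, the related structures as a dictionary, and `vectorize`, `estimate_benefit`, and the "benefit must be positive" threshold are hypothetical stand-ins for the compiler's actual passes and profitability test.

```python
import copy


def vectorize_with_recovery(block, side_tables, vectorize, estimate_benefit):
    """Vectorize `block` in place; undo everything if it does not pay off.

    block            -- list of statements (mutated in place)
    side_tables      -- dict of related structures, e.g. a def-use graph
    vectorize        -- callable(block, side_tables) that rewrites both
    estimate_benefit -- callable(block) returning the cost-model benefit
    Returns True if the vectorized form was kept, False if it was recovered.
    """
    # Snapshot the block and its related data structures before touching them.
    saved_block = copy.deepcopy(block)
    saved_tables = copy.deepcopy(side_tables)

    vectorize(block, side_tables)

    if estimate_benefit(block) > 0:
        return True

    # Not profitable: restore statements and related structures together.
    block[:] = saved_block
    side_tables.clear()
    side_tables.update(saved_tables)
    return False


def fake_vectorize(block, tables):
    # Hypothetical stand-in for the SLP pass.
    block[:] = ["v0 = vload(b)", "vstore(a, v0)"]
    tables["def_use"] = {"v0": [1]}


block = ["a[0] = b[0]", "a[1] = b[1]"]
tables = {"def_use": {"b": [0, 1]}}
kept = vectorize_with_recovery(block, tables, fake_vectorize, lambda blk: -3.0)
print(kept, block)  # False ['a[0] = b[0]', 'a[1] = b[1]']
```

Restoring the related structures in the same step as the statements keeps later passes from seeing a def-use or dependence graph that describes code which no longer exists.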
5. Experimental Results and Analysis
The following experiments demonstrate the performance improvement achievable with the NSLP algorithm
(SLP with the recovery policy) on the Alpha architecture. Compilation consists of two steps. In the
first step, our compiler translates the source code to C or FORTRAN code in which SIMD instructions are
expressed as intrinsic functions. In the second step, the SIMDized code is compiled by the base compiler
into an object program. Our compiler is built in three versions, adopting SLP, SLP with the recovery
policy, and the traditional vectorization algorithm respectively. Each version is run several times with
the same compilation options and the run times are averaged. As benchmarks we collected applications
suitable for vectorization from NPB3.2 and SPEC CPU2000: BT, MG and UA from NPB3.2, and swim, mgrid and
galgel from SPEC CPU2000.
5.1. Analysis of Experimental Results
Figure 4 shows the speedups achieved by SLP and by SLP with the recovery policy. Overall, SLP with the
recovery policy performs better than SLP: it not only inherits the advantages of SLP, which can
vectorize partially and can vectorize both loops and basic blocks, but also avoids the performance loss
caused by insufficient vectorization. The average speedup achieved by SLP with the recovery policy is
12.7% better than that of SLP, and the performance gain reaches 29.4% for the application mgrid.
Fig. 4. Speedups for benchmarks
References 
[1] J. Stewart. An Investigation of SIMD Instruction Sets. University of Ballarat, School of Information Technology and
Mathematical Sciences, 2005.
[2] D. Nuzman, I. Rosen, A. Zaks. Auto-Vectorization of Interleaved Data for SIMD. In PLDI'06, Ottawa, Canada, June 2006,
pages 132-143.
[3] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. US:
Morgan Kaufmann Publishers, 2001.
[4] S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proc. of the
ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2000, pages 145-156.
[5] D. A. Padua. Multiprocessors: Discussion of Some Theoretical and Practical Problems. Ph.D. thesis, Department of Computer
Science, University of Illinois at Urbana-Champaign, November 1979. Report 79-990.
