Meta-State conversion by Dietz, H. G.
Purdue University
Purdue e-Pubs




Purdue University School of Electrical Engineering
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.





Parallel Processing Laboratory
School of Electrical Engineering 
Purdue University 
West Lafayette, IN 47907-1285
hankd@ecn.purdue.edu 
Abstract 
In MIMD (Multiple Instruction stream, Multiple Data stream) execution, each processor has 
its own state. Although these states are generally considered to be independent entities, it is also 
possible to view the set of processor states at a particular time as a single, aggregate, "Meta
State." Once a program has been converted into a single finite automaton based on Meta States,
only a single program counter is needed. Hence, it is possible to duplicate the MIMD execution
using SIMD (Single Instruction stream, Multiple Data stream) hardware without the overhead of
interpretation or even of having each processing element keep a copy of the MIMD code. In this
paper, we present an algorithm for Meta-State Conversion (MSC) and explore some properties of 
the technique. 
Keywords: Meta-State Conversion (MSC), compiler optimization, compiler transformation,
MIMD, SIMD. 
This work was supported in part by the Office of Naval Research (ONR) under grant number
N00014-91-J-4013 and by the National Science Foundation (NSF) under award number
9015696-CDA.
Meta-State Conversion 
Table of Contents
1. Introduction
    1.1. MIMD Emulation
    1.2. Meta-State Conversion
2. Meta-State Conversion
    2.1. Construction of the MIMD Control-Flow Graph
    2.2. Handling Of Function Calls
    2.3. Base Conversion Algorithm
    2.4. MIMD State Time Splitting Algorithm
    2.5. Meta State Compression Algorithm
    2.6. Barrier Synchronization Algorithm
3. SIMD Coding of the Meta-State Automaton
    3.1. Common Subexpression Induction
    3.2. Multiway Branch Encoding
        3.2.1. No Exit Arc
        3.2.2. Single Exit Arc
        3.2.3. Multiple Exit Arcs
        3.2.4. Multiple Exit Arcs Involving Barriers
        3.2.5. Restricted Dynamic Process Creation
4. Implementation
    4.1. The Input Language
    4.2. The Conversion Process
    4.3. An Example
5. Conclusions
1. Introduction 
The differences between data parallelism (SIMD execution) and control parallelism (MIMD 
execution) are at least superficially quite large. In a data parallel program, parallelism is specified 
in terms of performing the same operation simultaneously on all elements of a data structure; this 
naturally fits the SIMD execution model. It is also easy to see that, because the abilities of a 
MIMD are a superset of the abilities of a SIMD, the data parallel model can be extended to
MIMD targets [Phi89] [LiM90]. However, the control parallel model suggests that each processor
can take its own path independent of all others, and this characteristic seems to require the
multiple instruction streams possible only in MIMD execution. Control parallelism is impossible
on a SIMD with only one instruction stream... or is it?
There are two basic approaches that might allow SIMD hardware to efficiently support a
control parallel programming model: "MIMD emulation" and "meta-state conversion."
1.1. MIMD Emulation
Perhaps the most obvious way to make SIMD hardware mimic MIMD execution is to write
a SIMD program that will interpretively execute a MIMD instruction set. In the simplest terms,
such an interpreter has a data structure, replicated in each SIMD PE, that corresponds to the internal
registers of each MIMD processor. Likewise, each PE's memory holds a copy of the MIMD
code to be executed. Hence, the interpreter structure can be as simple as:
Basic MIMD Interpreter Algorithm 
1. Each PE fetches an "instruction" into its "instruction register" (IR) and updates its
"program counter" (PC).
2. Each PE decodes the "instruction" from its IR.
3. Repeat steps 3a-3c for each "instruction" type: 
a) Disable all PEs where the IR holds an "instruction" of a different type. 
b) Simulate execution of the "instruction" on the enabled PEs. 
c) Enable all PEs. 
4. Go to step 1. 
The only difficulty in implementing an interpreter with the above structure is that the simulated 
machine will be very inefficient. 
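The interpreter structure above can be sketched in C as follows. This is only an illustration under stated assumptions: the three-instruction set (OP_INC, OP_DEC, OP_HALT), the struct pe layout, and the function name interpret are hypothetical, and an ordinary loop over an array of PEs stands in for each SIMD broadcast step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-instruction MIMD instruction set (illustration only). */
enum op { OP_INC, OP_DEC, OP_HALT };

/* Per-PE copy of the simulated MIMD processor's registers. */
struct pe {
    int pc;       /* simulated program counter */
    enum op ir;   /* simulated instruction register */
    int acc;      /* simulated accumulator */
    int enabled;  /* SIMD enable bit */
};

/* Run the interpreter until every PE has fetched OP_HALT.
   code[i] is PE i's private copy of its program, which must end in OP_HALT.
   Each loop over the PEs stands in for one SIMD step.
   Returns the number of serialized instruction-type passes issued. */
static long interpret(struct pe *pes, size_t npes, const enum op *const *code)
{
    long passes = 0;
    assert(pes != NULL && code != NULL);
    for (;;) {
        size_t running = 0;
        /* Steps 1 and 2: fetch into IR, update PC, decode. */
        for (size_t i = 0; i < npes; i++) {
            pes[i].ir = code[i][pes[i].pc];
            if (pes[i].ir != OP_HALT) {
                pes[i].pc++;
                running++;
            }
        }
        if (running == 0)
            return passes;
        /* Step 3: one serialized pass per instruction type. */
        for (int type = 0; type < OP_HALT; type++) {
            for (size_t i = 0; i < npes; i++)       /* 3a: disable other types */
                pes[i].enabled = (pes[i].ir == (enum op)type);
            for (size_t i = 0; i < npes; i++) {     /* 3b: simulate on enabled PEs */
                if (!pes[i].enabled)
                    continue;
                if (type == OP_INC)
                    pes[i].acc++;
                else
                    pes[i].acc--;
            }
            for (size_t i = 0; i < npes; i++)       /* 3c: enable all PEs */
                pes[i].enabled = 1;
            passes++;
        }
        /* Step 4: go to step 1. */
    }
}
```

Even in this toy form, every fetch cycle pays for one pass per instruction type whether or not any PE holds an instruction of that type, which is where the inefficiency noted above comes from.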
A number of researchers have used a wide range of "tricks" to produce more efficient
MIMD interpreters [NiT90], [WiH91], and [DiC92]. However, some overhead cannot be
removed: 
1. Instructions must be fetched and decoded. 
2. Instructions must be accessible to all PEs, hence, each PE typically will have a copy 
of the entire MIMD program's instructions. In a massively-parallel machine, this
wastes a huge amount of memory.
3. There will be some overhead associated with the interpreter itself, e.g., the cost of
jumping back to the start of the interpreter loop.
Although problems 1 and 3 merely slow the execution, the second severely restricts the size of
MIMD programs. For example, the Purdue University School of Electrical Engineering has a
16K processing element MasPar MP-1 [Bla90] with only 16K bytes of local memory for each PE.
Even with very careful encoding, 16K bytes cannot hold a very large MIMD program.
Although meta-state conversion is more difficult to implement and more restrictive in its
abilities, it can eliminate even these three overhead problems.
1.2. Meta-State Conversion
In MIMD execution, each processor has its own state. Although these states are generally
considered to be independent entities, it is also possible to view the set of processor states at a
particular time as a single, aggregate, "Meta State." Using static analysis based on the timing
described in [DiO90], a compiler can convert the MIMD program into an automaton based on
meta states.
Once a program has been converted into the form of a meta-state automaton, it is no longer
necessary for each PE to fetch and decode instructions, nor is it necessary that each PE have a
copy of the program in local memory. Only the SIMD control unit needs to have a copy of the
meta-state automaton; PEs merely hold data. Further, because there is no interpreter, there is no
interpretation overhead. Literally, the meta-state automaton is a SIMD program that preserves
the relative timing properties of MIMD execution.
However, just as interpretation has drawbacks, so too does meta-state conversion:
1. If there are N processors each of which can be in any of S states, then it is possible that
there may be as many as S!/(S-N)! states in the meta-state automaton. Without some
means to ensure that the state space is kept manageable, the technique is not practical.
2. In execution, meta-state transitions are based on examining the aggregate of the
MIMD state transitions for all processors.
3. Meta-state transitions are N-way branches keyed by the aggregate of the MIMD state
transitions.
4. Dynamic creation of new processes is difficult to accommodate, since construction of 
the meta-state automaton requires that all possible MIMD states can be predicted at 
compile time. 
Fortunately, we have developed a number of techniques that can control the state space explosion
suggested above. Making meta-state transitions based on aggregate information is conceptually
simple, but requires some hardware support, e.g., the "global or" of the MasPar MP-1 [Bla90].
The efficient implementation of N-way branches is a difficult problem, but can be accomplished
using customized hash functions indexing jump tables [Die92a]. Unfortunately, the fully
dynamic creation of processes seems to be impractical, but that is exactly the case in which the
interpretation scheme works best. Consequently, this paper focuses on techniques to control the
state explosion, and restricts the input MIMD code to be formulated as an SPMD program.
The second section of this paper presents the meta-state conversion algorithm, using an
example to clarify the process. Section 3 discusses issues involving how the resulting meta-state
automaton can be efficiently encoded for SIMD execution. In section 4, we discuss how the prototype
implementation was constructed, and give a simple example of the output generated.
Finally, section 5 summarizes the contributions of this work and directions for future study.
2. Meta-State Conversion 
The meta-state conversion algorithm is surprisingly straightforward; perhaps it would be 
more accurate to say that it is familiar. The process of converting a set of MIMD states that exist 
at a particular point in time into a single meta state is strikingly similar to the process of converting
an NFA into a DFA, as used in constructing lexical analyzers.
To begin, the code for the MIMD processes is converted into a set of control flow graphs in
which each node (MIMD state) represents a basic block [CoS70]. Each of these MIMD states has
zero, one, or two exit arcs. A MIMD state with no exit arcs marks the end of that process. A
single exit arc represents unconditional sequencing (e.g., an unconditional branch), whereas two
exit arcs respectively represent the "TRUE" and "FALSE" successors of that MIMD state (e.g.,
targets of a conditional branch). In addition, it is assumed that we know in which particular
MIMD state each process begins execution; these states are called MIMD start states.
The set of MIMD start states forms the start state of the meta-state automaton. Since each
MIMD start state may have up to two successors, each process may pick either of its two possible
successors. If we further assume that there may be multiple processes in each MIMD state, it is
further possible that both successors might be chosen. Hence, for a meta state that consists of one
MIMD start state, there may be as many as three meta-state successors. In general, from n
MIMD start states, there could be as many as 3^n meta-state successors.
To clarify the operation of the algorithm, we will trace the algorithm's actions on a simple 
example. The framework for the example is the following SPMD code: 
if (A) {
    do { B } while (C);
} else {
    do { D } while (E);
}
F
Listing 1: Example MIMD (SPMD) Code 
It is assumed that all processors begin executing this code simultaneously and that processors 
computing different values for the parallel expressions A, C, and E are the only sources of asynchrony
(i.e., there are no external interrupts).
2.1. Construction of the MIMD Control-Flow Graph 
Before meta-state conversion can be applied, the program must be converted into a form 
that facilitates the analysis. The most convenient form is that of a traditional control-flow graph
in which each node represents a maximal basic block. Constructing the control-flow graph in the
usual way, code straightening [CoS70] and removal of empty nodes are applied to obtain the simplest
possible graph. The result of this is figure 1. State 0 corresponds to block A, state 2
corresponds to B followed by C, state 6 corresponds to D followed by E, and state 9 corresponds
to F.
Figure 1: MIMD State Graph for Listing 1 
2.2. Handling Of Function Calls 
Although our example case does not contain any function calls, it is important that meta- 
state conversion be applicable to codes that contain arbitrary function calls - perhaps including 
recursive function invocations. Thus, we need some way to represent function call/return directly
using control flow arcs in the MIMD state graph.
In the case of non-recursive function calls, it is sufficient to use the traditional solution of
in-line expansion of the function code (i.e., of the MIMD state graph for the function body).
Surprisingly, recursive function calls also can be treated using in-line expansion - and an additional
"trick" that converts return statements into ordinary multiway branches.
Consider the following C-like code fragment in which the main program invokes the
recursive function g:
main()
b: ...
c: g();
d: ...
...
g();
e: ...
}
Listing 2: Example Recursive Function Call 
The only difficulty in in-line expanding g is that the target of any return statements in g is
not known until runtime. However, at compile time we can compute the set of all possible
return targets given that g was initially invoked from a particular position.
When in-line expanding the call to g from position a, we know that any return statements
within g must return to either position b or e, and can replace the return statements
with the appropriate multiway branch. Likewise, when in-line expanding g called from position
c, return statements are translated into multiway branches targeting d or e. The result is a
call-free control flow graph for the entire program; thus, the meta-state conversion algorithm can
ignore the direct handling of function calls without loss of generality.
2.3. Base Conversion Algorithm
The following C-based pseudo code gives the base algorithm for meta-state conversion. 





    /* Given the meta-state automaton start state x,
       generate the rest of the automaton
    */
    do {
        /* Mark this meta state as done */
        mark_meta_state_done(x);
        /* Add arcs to any meta states y, x→y */
        reach(x, x, ∅);
        /* Get another meta state to process */
        x = get_unmarked_meta_state();
        /* Repeat while there is a meta state to do */
    } while (x != ∅);
}

int
reach(start, s, t)
set start, s, t;
{
    /* Make entries for all meta states t, start→t */
    if (s == ∅) {
        /* All MIMD state transitions from within start have been
           considered, hence, t must be a meta state
        */
        make_meta_state_transition(start, t);
    } else {
        /* Select a MIMD state and process its transition(s),
           recursing to complete the meta state
        */
        element e, next, fnext;
        e = [e | e ∈ s];
        s = s - {e};
        next = next_MIMD_state(e);
        fnext = next_MIMD_state_if_false(e);
        /* Take each possible path and both paths */
        if (next) {
            reach(start, s, t ∪ next);
            if (fnext) {
                reach(start, s, t ∪ fnext);
                reach(start, s, t ∪ next ∪ fnext);
            }
        } else {




Applying the above algorithm to our simple example, the resulting meta-state graph is: 
Figure 2: Meta-State Graph for Listing 1 
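The base algorithm can be tried out on the example with a small C sketch. Several details here are assumptions rather than part of the pseudocode: sets are represented as bit masks, the names metas, convert, and load_listing1 are hypothetical, a missing exit arc is marked with NONE instead of 0 (since 0 is itself a MIMD state number in figure 1), and a MIMD state with no exit arc is taken to contribute nothing to the successor meta state.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MIMD 16          /* MIMD states are numbered 0..15 in this sketch */
#define MAX_META 256         /* capacity of the meta-state table */
#define NONE (-1)            /* marks a missing exit arc */

typedef uint32_t set;        /* bit i set <=> MIMD state i is a member */

static int next_true[MAX_MIMD];    /* TRUE (or only) successor of each state */
static int next_false[MAX_MIMD];   /* FALSE successor of each state */
static set metas[MAX_META];        /* meta states discovered so far */
static int nmetas;

/* Record the arc start -> t; this sketch keeps only the set of targets. */
static void make_meta_state_transition(set start, set t)
{
    (void)start;
    if (t == 0)
        return;              /* assumption: empty target means the program ends */
    for (int i = 0; i < nmetas; i++)
        if (metas[i] == t)
            return;          /* already known */
    assert(nmetas < MAX_META);
    metas[nmetas++] = t;
}

static void reach(set start, set s, set t)
{
    if (s == 0) {
        make_meta_state_transition(start, t);
        return;
    }
    /* Select the lowest-numbered MIMD state e in s and remove it. */
    int e = 0;
    while (!(s & ((set)1 << e)))
        e++;
    s &= ~((set)1 << e);
    int n = next_true[e], f = next_false[e];
    if (n != NONE) {
        reach(start, s, t | ((set)1 << n));
        if (f != NONE) {
            reach(start, s, t | ((set)1 << f));
            reach(start, s, t | ((set)1 << n) | ((set)1 << f));
        }
    } else {
        /* Assumption: a state with no exit arc adds nothing to t. */
        reach(start, s, t);
    }
}

/* Build the automaton from the start meta state; returns the meta-state count. */
static int convert(set start)
{
    nmetas = 0;
    metas[nmetas++] = start;
    for (int i = 0; i < nmetas; i++)   /* entries before index i are "done" */
        reach(metas[i], metas[i], 0);
    return nmetas;
}

/* Figure 1 graph: 0 -> 2 or 6, 2 -> 2 or 9, 6 -> 6 or 9, 9 has no exit. */
static void load_listing1(void)
{
    for (int i = 0; i < MAX_MIMD; i++)
        next_true[i] = next_false[i] = NONE;
    next_true[0] = 2; next_false[0] = 6;
    next_true[2] = 2; next_false[2] = 9;
    next_true[6] = 6; next_false[6] = 9;
}
```

Calling load_listing1() and then convert((set)1 << 0) makes the sketch enumerate the meta states {0}, {2}, {6}, {2,6}, {9}, {2,9}, {6,9}, and {2,6,9}.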
2.4. MIMD State Time Splitting Algorithm
In the base conversion algorithm, we made the assumption that each MIMD state took 
exactly the same amount of time to execute. However, such an assumption is unrealistic: 
• If each instruction is treated as a separate MIMD state, then reasonable size programs will
generate unreasonably large automata. This makes the analysis for meta-state conversion
much slower and also can result in an impractically large meta-state automaton. In addition,
some computers have instruction sets in which even the execution time of different types of
instruction varies widely.
• If instead we simply treat each maximal basic block as a MIMD state and ignore the differences
in execution time between these blocks, this can result in very poor processor utilization.
For example, if a block that takes 5 clock cycles to execute is placed in the same
meta-state as one that takes 100 cycles, then the parallel machine may spend up to 95% of
its processor cycles simply waiting for the transition to the next meta state.
In other words, the meta-state automaton embodies an execution time schedule for the code, and
it is necessary that the execution time of each block be taken into account if a good schedule is to
be produced.
There are many possible ways in which timing information could be incorporated, but our
overriding concern must be keeping the state space manageable, and this greatly restricts the
choice. Clearly, the smallest MIMD state automaton results from treating each maximal basic
block as a MIMD state; hence, this will be our initial assumption. As the conversion is being
performed, we may be fortunate enough to have all the MIMD states merged into each meta state
happen to have the same cost. If the costs differ, but not by a significant
amount, we can ignore the difference.
This leaves only the case of a meta state that contains MIMD states of widely varying cost,
for example, the 5 and 100 cycle MIMD states mentioned above. The solution we propose is a
simple heuristic that will break the 100 cycle MIMD state into an approximately 5 cycle MIMD
state which is unconditionally followed by the remaining portion of the original 100 cycle state.
Since this change might also affect the construction of other meta states that had incorporated the
original 100 cycle MIMD state, the construction of the meta-state automaton is restarted to ensure
that the final meta-state automaton is consistent.
The following pseudocode gives the algorithm for performing MIMD state splitting based 




flag
time_split_state(s)
set s;
{
    /* Determine if time imbalance between MIMD states within the
       meta state s is sufficient to warrant time splitting the
       more expensive MIMD states to get a better balance; this
       assumes that each MIMD state already has an execution time
       associated with it
    */
    flag didsplit;
    /* Ignore zero execution time components because you can't
       do anything about them anyway
    */
    s = s - {e | e ∈ s, time(e) == 0};
    /* Get minimum and maximum MIMD state times... */
    min = min_MIMD_state_time(s);
    max = max_MIMD_state_time(s);
    /* Is enough time wasted to be worth splitting?  Not if the
       difference between times is already at noise level
       (split_delta) or if the utilization is already sure to be
       greater than an acceptable percentage (split_percent)
    */
    if ((min + split_delta) > max) return(FALSE);
    if (min > ((split_percent * max) / 100)) return(FALSE);
    /* Splitting seems useful... do it, if possible */
    didsplit = FALSE;
    while (s != ∅) {
        element e;
        e = [e | e ∈ s];
        s = s - {e};
        if (time(e) > min) {
            /* If possible, split this node into two
               nodes, the first with time min, the
               second with the remaining time...
            */
            ...
            didsplit = TRUE;
        }
    }
    return(didsplit);
}
The splitting of a state is illustrated in the next two figures. The relevant portion of the initial
MIMD state graph is:
Figure 3: MIMD States Before Time Splitting
Suppose that meta-state conversion would combine states α and β and that β takes much longer
to execute than α, i.e., t_α < t_β. The state splitting algorithm would attempt to convert this portion
of the state graph into:
Figure 4: MIMD States After Time Splitting
Thus, states α and β' would be merged - without any idle time being introduced for either
thread of execution.
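The two tests that decide whether splitting is worthwhile can be sketched in C. This is a minimal illustration, assuming the execution times are given as an array of cycle counts; the function name worth_splitting and the min_out parameter are hypothetical, while split_delta and split_percent play the roles they have in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>

/* Decide whether the MIMD states merged into one meta state are unbalanced
   enough to split. times[] holds one estimated cycle count per MIMD state.
   On a "yes" answer, *min_out receives the length of the first piece that
   each longer state would be cut down to. */
static int worth_splitting(const int *times, size_t n,
                           int split_delta, int split_percent, int *min_out)
{
    int min = 0, max = 0, seen = 0;
    assert(times != NULL && min_out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (times[i] == 0)
            continue;                       /* ignore zero-time components */
        if (!seen || times[i] < min)
            min = times[i];
        if (!seen || times[i] > max)
            max = times[i];
        seen = 1;
    }
    if (!seen)
        return 0;                           /* nothing to balance */
    if (min + split_delta > max)
        return 0;                           /* difference is at noise level */
    if (min > (split_percent * max) / 100)
        return 0;                           /* utilization already acceptable */
    *min_out = min;
    return 1;
}
```

With the 5 and 100 cycle states from the text, a split_delta of 2, and a split_percent of 75, the sketch reports that splitting is worthwhile and that the first piece should take 5 cycles; with 90 and 100 cycle states it declines.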
2.5. Meta State Compression Algorithm
Despite the reduction in state space possible using maximal basic blocks and time splitting, 
the automata created can be very large. Hence, it is useful to find a way to reduce the upper 
bound on the number of meta states created.
Because MIMD nodes with zero or one exit arc can only increase the state space linearly,
the explosion in meta state space is related to the occurrence of MIMD states that have two exit
arcs. Each such MIMD state could contribute three meta states: the TRUE successor, FALSE
successor, and both successors. However, if there are many processes in any given MIMD state,
it is easy to see that the most probable case is that of both successors. Further, the case of both
successors can always emulate either successor, since it has the code for both. Thus, a very
dramatic reduction in meta state space can be obtained by simply assuming that both successors
are always taken.
int
reach(start, s, t)
set start, s, t;
{
    /* Make entries for all meta states t, start→t */
    if (s == ∅) {
        /* All MIMD state transitions from within start have been
           considered, hence, t must be a meta state
        */
        make_meta_state_transition(start, t);
    } else {
        /* Select a MIMD state and process its transition(s),
           recursing to complete the meta state
        */
        element e, next, fnext;
        e = [e | e ∈ s];
        s = s - {e};
        next = next_MIMD_state(e);
        fnext = next_MIMD_state_if_false(e);
        /* Always take all possible paths... */
        if (next) {
            if (fnext) {
                reach(start, s, t ∪ next ∪ fnext);
            } else {
                reach(start, s, t ∪ next);
            }
        } else {




Returning to our example code, the meta-state compression algorithm results in a graph
with only two meta-states, compared to eight for the uncompressed graph:
Figure 5: Compressed Meta-State Graph for Listing 1
Notice that meta-state transitions into compressed portions of the graph are unconditional; i.e.,
there is no need to use a globalor to determine what states are present. The disadvantage is
that the average meta-state is wider, which implies that the SIMD implementation will be less
efficient.
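The "both successors are always taken" rule means that each meta state has exactly one successor, which a short C sketch makes concrete. The bit-mask set representation, the NONE marker, the table encoding of the figure 1 graph, and the name compressed_successor are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NSTATES 10           /* MIMD states 0..9, as numbered in figure 1 */
#define NONE (-1)            /* marks a missing exit arc */

typedef uint32_t set;        /* bit i set <=> MIMD state i is a member */

/* Hypothetical table encoding of the figure 1 graph. */
static const int next_true[NSTATES] =
    { 2, NONE, 2, NONE, NONE, NONE, 6, NONE, NONE, NONE };
static const int next_false[NSTATES] =
    { 6, NONE, 9, NONE, NONE, NONE, 9, NONE, NONE, NONE };

/* Under compression every exit arc of every member is followed, so the
   successor of meta state s is simply the union of all successors. */
static set compressed_successor(set s)
{
    set t = 0;
    assert(s < ((set)1 << NSTATES));   /* only states 0..9 exist here */
    for (int e = 0; e < NSTATES; e++) {
        if (!(s & ((set)1 << e)))
            continue;
        if (next_true[e] != NONE)
            t |= (set)1 << next_true[e];
        if (next_false[e] != NONE)
            t |= (set)1 << next_false[e];
    }
    return t;
}
```

Because the function returns a single set rather than a choice among several, no runtime test of which states are actually occupied is needed to pick the next meta state; starting from {0}, for instance, the sketch always yields {2,6}.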
2.6. Barrier Synchronization Algorithm
While the above compression scheme produces very small automata, it does increase overhead
somewhat in that each meta state becomes much more complex. Hence, it is useful to seek
yet another method to reduce the state space - without adding to the complexity of each meta
state. Careful use of barrier synchronization provides such a mechanism.
set
barrier_sync(s)
set s;
{
    /* If s is a meta state that contains a MIMD state
       which is a barrier synchronization point, then
       the barrier should prevent any transitions past
       that MIMD state.  Hence, unless all processors
       have reached the barrier (i.e., every MIMD state
       within s is a barrier state), simply remove the
       barrier states from s
    */
    set waits;
    /* Construct the set of MIMD barrier wait states */
    waits = {e | e ∈ s, is_barrier_wait(e) == TRUE};
    /* Has everyone reached the barrier? */
    if (waits == s) {
        /* Yes; go into all barrier state */
        return(waits);
    } else {
        /* No; remove barriers from meta state */
        return(s - waits);
    }
}
For example, consider modifying the code framework of listing 1 to contain a barrier sync
at the end of the if:
if (A) {
    do { B } while (C);
} else {
    do { D } while (E);
}
wait; /* barrier sync. of all threads */
F
Listing 3: Listing 1 + Barrier Synchronization
The barrier synchronization does not result in a runtime operation, but rather constrains the asynchrony
as defined by the above algorithm. The result is a meta-state graph of the form:
Figure 6: Meta-State Graph for Listing 3 
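With sets held as bit masks, the barrier_sync filter reduces to a few mask operations. The sketch below is an illustration under assumptions: the uint32_t set representation and the extra barriers parameter (a mask of the MIMD states that are barrier waits, replacing the is_barrier_wait predicate) are not part of the pseudocode.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t set;        /* bit i set <=> MIMD state i is a member */

/* s: candidate meta state produced during conversion.
   barriers: mask of the MIMD states that are barrier wait states. */
static set barrier_sync(set s, set barriers)
{
    assert(s != 0);                  /* a meta state is never empty */
    set waits = s & barriers;        /* barrier wait states present in s */
    if (waits == s)
        return waits;                /* everyone has arrived: enter the barrier state */
    return s & ~waits;               /* otherwise hold the barrier states back */
}
```

Taking state 9 as the barrier wait state, the candidate {2,9} is reduced to {2} and {2,6,9} to {2,6}, while {9} alone passes through unchanged.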
3. SIMD Coding of the Meta-State Automaton
Given a MIMD program that has been converted into a meta-state graph, it is not trivial to 
find an efficient coding of the meta-state automaton for a SIMD architecture. The meta-state 
graph does reduce control flow to a single instruction stream, but that instruction stream would
appear to execute different types of instructions in parallel - the meta-state graph employs a
variation on VLIW semantics.
There are two aspects of the graph that mirror VLIW constructions¹: the apparently simultaneous
execution of different types of instructions and the use of multiway branches generated
by merging multiple (binary) branches. Thus, we must efficiently implement these VLIW-like
execution structures on SIMD hardware.
3.1. Common Subexpression Induction
Any meta state that merged two or more MIMD states effectively contains multiple instruction
sequences that are supposed to execute simultaneously. Given that it is impossible for a traditional
SIMD machine to simultaneously execute different types of instructions on different processing
elements, it would appear that these operations will have to be serialized. However, it is
quite possible and practical that any operations that would be performed by more than one
sequence can be executed in parallel by all processors. Common subexpression induction (CSI)
[Die92] is an optimization technique that identifies these operations and "factors" them out.
¹ The meta-state graph is not suitable for execution on a traditional VLIW because which processing
elements execute which instructions is determined statically for VLIW, but dynamically in the graph. I.e.,
the graph would be appropriate for a VLIW in which each processing element could select at runtime which
instruction field it would execute, rather than having each processing element statically associated with a
particular instruction field.
The CSI algorithm analyzes a segment of code containing operations executed by any of 
multiple threads (enabled sets of SIMD PEs). From this analysis, it determines where threads can
share the same code and what cost is associated with inducing that sharing. Finally, it generates a
code schedule that uses this sharing, where appropriate, to achieve the minimum execution time. 
Unfortunately, this implies that the CSI algorithm is not simple. 
The algorithm can be summarized as follows. First, a guarded DAG is constructed for the
input, then this DAG is improved using inter-thread CSE. The improved DAG is then used to
compute information for pruning the search: earliest and latest, operation classes, and theoretical
lower bound on execution time. Next, this information is used to create a linear schedule (SIMD
execution sequence), which is improved using a cheap approximate search and then used as the
initial schedule for the permutation-in-range search that is the core of the CSI optimization.
3.2. Multiway Branch Encoding
At the end of each meta-state's execution, a particular type of multiway branch must be
executed to move the SIMD machine into the correct next meta state. Before discussing the
encoding of these multiway branches, it is useful to specify the precise semantics of meta-state
transitions, so that an optimal coding can be achieved. The following defines the possible types
of meta-state transitions.
3.2.1. No Exit Arc
A meta state without an exit arc is a terminal node, i.e., it represents the end of the
program's execution. Thus, it is implicitly followed by a return to the operating system. There is
no difficulty in generating code to implement this.
3.2.2. Single Exit Arc
If there is a single exit arc from a meta state, the code for that meta state is followed by a
goto (aka, jump) to the code for the target meta state. Again, it is simple to generate an
efficient coding.
Notice that all entries to compressed meta states fall into this category. 
3.2.3. Multiple Exit Arcs
If there are multiple exit arcs from a meta state, then the aggregate of the "pc" values for
each of the processing elements must be used to determine the next state. For example, when, at
the end of executing a meta state, some processing elements have "pc" value 2 and others have
"pc" value 6, meta state {2,6} is the next state. In order to efficiently collect this aggregate,
each possible "pc" value is assigned a bit; thus, a globalor of the "pc" values from all processors
determines the aggregate.
3.2.4. Multiple Exit Arcs Involving Barriers
The treatment of multiple exit arcs must be slightly adjusted if some, but not all, of the processing
elements have reached a barrier at the time a meta state's execution completes. For
example, in figure 6 the transitions from meta states 2, {2,6}, and 6 into 2, {2,6}, and 6 would not
be sufficient if even one processing element had reached the barrier (i.e., meta state 9). Consequently,
the processing elements are allowed to set their "pc" value to 9, but they are not permitted
to enter meta state 9 unless all "pc"'s are 9.
This is accomplished by a simple check to see if (globalor pc) is contained within the set
of all barrier states. If it is, then the state transition proceeds normally. Otherwise, the next meta
state is determined by subtracting the set of all barrier states from the result of the globalor.
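The runtime selection of the next meta state, including the barrier adjustment, can be sketched in C. This is an illustration only: a serial loop stands in for the hardware global-OR reduction, each PE's "pc" is assumed to be stored already in one-bit-per-value form, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t set;        /* bit i set <=> some PE has "pc" value i */

/* OR together the one-hot "pc" words of all PEs; on real SIMD hardware this
   would be a single global-OR reduction rather than a loop. */
static set globalor(const set *pc_bits, size_t npes)
{
    set aggregate = 0;
    assert(pc_bits != NULL || npes == 0);
    for (size_t i = 0; i < npes; i++)
        aggregate |= pc_bits[i];
    return aggregate;
}

/* Pick the next meta state. barriers is the mask of all barrier states. */
static set next_meta_state(const set *pc_bits, size_t npes, set barriers)
{
    set aggregate = globalor(pc_bits, npes);
    if ((aggregate & ~barriers) == 0)
        return aggregate;            /* contained in the barrier set: proceed */
    return aggregate & ~barriers;    /* otherwise subtract the barrier states */
}
```

With state 9 as the only barrier state, PEs at "pc" values 2 and 6 select meta state {2,6}, PEs at 2 and 9 select meta state 2 while the PE at 9 waits, and only when every PE is at 9 is meta state 9 entered. The returned bit set would then be mapped to a code address, for example through the hashed jump tables of [Die92a].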
3.2.5. Restricted Dynamic Process Creation
Although the completely static nature of meta-state conversion makes it impossible to
efficiently support forking of new processes to execute different programs, a minor encoding trick
can be used to implement a restricted form of dynamic process creation. This restricted type of
spawn instruction looks just like a conditional jump, except the semantics are that both paths
must be taken (i.e., the compressed meta state transition rule). One exit is taken by the original
processes, the other by the newly created processes.
Initially. processing elements that are not in use would be given a "pc" value indicating 
that they ;are not in any meta state. When a spawn ( x )  instruction is reached by N processing 
elements, the original N processing elements do not change their pc values. but N currently- 
disabled processing elements are selected and their pc values are set to x.  No other changes are 
needed, PI-ovided that the number of processes requested does not exceed the number of proces- 
sors available. 
Note: further that processors that complete their processes early can be returned to the pool 
of free prr)cessors by simply executing a h a l t  instruction to set their pc value to indicate that 
they are mlt in any meta state. 
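The spawn and halt bookkeeping can be sketched sequentially in C. Everything named here is an assumption made for illustration: PC_FREE (a pc word with no bit set) stands for "not in any meta state", and spawn_pes and halt_pe are hypothetical helpers, not instructions of the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n)  ((uint32_t)1 << (n))
#define PC_FREE ((uint32_t)0)   /* assumed: no bit set means "in no meta state" */

/* For every element whose pc equals `at` (the state holding the spawn),
   claim one free element and set its pc to `x`. The spawning elements
   keep their pc values. Returns the number of processes created. */
static size_t spawn_pes(uint32_t *pc, size_t npe, uint32_t at, uint32_t x)
{
    size_t requested = 0, created = 0;
    for (size_t i = 0; i < npe; i++)
        if (pc[i] == at)
            requested++;
    for (size_t i = 0; i < npe && created < requested; i++)
        if (pc[i] == PC_FREE) {
            pc[i] = x;
            created++;
        }
    assert(created == requested);   /* requests must not exceed free elements */
    return created;
}

/* Return an element to the free pool. */
static void halt_pe(uint32_t *pc, size_t pe)
{
    pc[pe] = PC_FREE;
}
```

The assertion mirrors the proviso in the text: the scheme needs no other changes only as long as enough disabled processing elements are available.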
4. Implementation 
The current prototype meta-state converter does not directly generate executable SIMD
code from a MIMD-oriented language. Instead, it simply outputs a set of meta-state definitions.
Each of these meta states must then be common subexpression inducted and the meta-state
transitions (multiway branches) must be encoded using hash functions. However, these last two steps
are implemented by two software tools developed earlier:
• A common subexpression inductor, described in [Die92].
• A hash function generator, described in [Die92a].
Thus, in this paper we will confine the discussion to the implementation of the prototype
meta-state converter. The meta-state converter was written in C using PCCTS [PaD92] and is actually
a modified version of the mimdc compiler described in [DiC92].
4.1. The Input Language
The language accepted by the meta-state converter is a parallel dialect of C called MIMDC.
It supports most of the basic C constructs. Data values can be either int or float, and
variables can be declared as mono (shared) or poly (private) [Phi89].
There are two kinds of shared memory reference supported. The mono variables are
replicated in each processor's local memory so that loads execute quickly, but stores involve a
broadcast to update all copies. It is also possible to directly access poly values from other processors
using "parallel subscripting": for example, an assignment can use the values of i, j, and z on
this processor to fetch the value of y from processor j, add z, and store the result into the x
on processor i. In addition to allowing use of shared memory for synchronization, MIMDC supports
barrier synchronization [DiO90] using a wait statement.
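The two kinds of shared reference can be modeled in ordinary C by giving each processor its own memory record. This is a sequential sketch under assumptions, not MIMDC syntax: the struct layout, the constant NPE, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NPE 4   /* assumed number of processors for the illustration */

/* One processor's local memory. */
struct pe_memory {
    int x, y, z;   /* poly: private to each processor */
    int i, j;      /* poly: processor numbers used as parallel subscripts */
    int m;         /* this processor's replica of a mono variable */
};

/* A mono load reads only the local replica, so it is cheap. */
static int mono_load(const struct pe_memory *mem, size_t self)
{
    return mem[self].m;
}

/* A mono store is broadcast so that every replica is updated. */
static void mono_store(struct pe_memory *mem, int value)
{
    for (size_t p = 0; p < NPE; p++)
        mem[p].m = value;
}

/* The parallel-subscripted assignment described above: processor `self`
   uses its own i, j, and z to fetch y from processor j, add z, and
   store the result into x on processor i. */
static void remote_assign(struct pe_memory *mem, size_t self)
{
    int i = mem[self].i, j = mem[self].j;
    assert(i >= 0 && i < NPE && j >= 0 && j < NPE);
    mem[i].x = mem[j].y + mem[self].z;
}
```

The asymmetry is the point of the design: reads of mono data never leave the processor, while writes pay for a broadcast.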
4.2. The Conversion Process
A brief outline of the prototype implementation is: 
1. As the PCCTS-generated parser reads the source code, a traditional control-flow graph
whose nodes are expression trees is built. This control-flow graph is constructed in a
"normalized" form that ensures, for example, that loops are all of the type that
execute the body one or more times, rather than zero or more (e.g., by replicating some
code and inserting an additional if statement).
2. The control-flow graph is straightened and empty nodes are removed. This maximizes 
the size of the nodes. 
3. The meta-state conversion algorithm is applied. Except for the handling of function 
calls, the prototype implements the full algorithm. 
4. The resulting meta-state graph is straightened and output. 
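The loop normalization mentioned in step 1 can be illustrated in ordinary C. The two functions below are hypothetical examples, not converter output: the first uses a zero-or-more loop, and the second is the same computation rewritten as an if guarding a one-or-more loop.

```c
#include <assert.h>

/* Original form: the body may execute zero or more times. */
static int sum_while(int n)
{
    int s = 0;
    while (n > 0) {
        s += n;
        n--;
    }
    return s;
}

/* Normalized form: the loop test is replicated into an if statement,
   and the loop itself always executes its body at least once. */
static int sum_normalized(int n)
{
    int s = 0;
    assert(n < 65536);   /* keep the illustration free of overflow */
    if (n > 0) {
        do {
            s += n;
            n--;
        } while (n > 0);
    }
    return s;
}
```

Both forms compute the same result for every input; only the shape of the control-flow graph differs.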
The current prototype implementation does not perform the final encoding of the meta-state
automaton. Hence, a CSI tool [Die92] and a tool for finding hash functions [Die92a] are applied by
hand to produce the final SIMD code in MPL.
4.3. An Example
To illustrate how the prototype meta-state converter works, consider the MIMDC program 
presented in listing 4. This example has the same control structure given in listing 1, but is a 
complete program, so that the actual code generated can be given. 
main()
{
    poly int x;

    if (x) {
        do { x = 1; } while (x);
    } else {
        do { x = 2; } while (x);
    }
    return(x);
}
Listing 4: Example MIMDC Program
Without compression or time cracking, the resulting meta-state SIMD automaton, written in
MPL [Mas91] for the MasPar MP-1 [Bla90], is given in listing 5. The code within each meta
state is simple SIMD stack code using MPL macros for each operation. The only surprising stack
operation is JumpF(x,y), which simply sets each processing element's pc equal to 2^x if the
top-of-stack value is "FALSE" or to 2^y if it is "TRUE." The apc is simply the aggregate
obtained by oring the values of all the individual pcs; the switch at the end of each meta state
simply employs a customized hash function to ensure that the multiway branch is implemented
efficiently. For example, at the end of meta state 0 (i.e., ms_0), instead of a switch on apc
with cases for BIT(2)|BIT(6), BIT(6), and BIT(2), a hash function is applied to make
the case values contiguous so that the MPL compiler will use a jump table to implement the
switch.
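How such a hash might be found can be sketched with a brute-force search. This is not the generator of [Die92a]; it is a minimal illustration that assumes one particular hash family (negate the aggregate, shift right, mask) and hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n) ((uint32_t)1 << (n))

/* Assumed hash family for this sketch: negate the aggregate (modulo
   2^32), shift right by s, and keep the bits selected by mask. */
static uint32_t branch_hash(uint32_t apc, unsigned s, uint32_t mask)
{
    assert(s < 32);
    return ((uint32_t)(0u - apc) >> s) & mask;
}

/* Returns the smallest shift in [0,31] for which branch_hash() maps
   all nkeys aggregates to distinct values, or -1 if no shift works. */
static int find_shift(const uint32_t *keys, size_t nkeys, uint32_t mask)
{
    for (unsigned s = 0; s < 32; s++) {
        int distinct = 1;
        for (size_t a = 0; a < nkeys && distinct; a++)
            for (size_t b = a + 1; b < nkeys; b++)
                if (branch_hash(keys[a], s, mask) ==
                    branch_hash(keys[b], s, mask)) {
                    distinct = 0;
                    break;
                }
        if (distinct)
            return (int)s;
    }
    return -1;
}
```

For the three aggregates that can follow meta state 0, namely BIT(2)|BIT(6), BIT(6), and BIT(2), a mask of 3 gives a smallest usable shift of 5, and the hashed case values are 1, 2, and 3, which are contiguous and therefore suitable for a jump table.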
5. Conclusions 
Although meta-state conversion is a complex and slow process, it does provide a mechani- 
cal way to transform control-parallel (MIMD) programs into pure SIMD code. Further, the exe- 
cution of the meta-state program can be very efficient. In particular, fine-grain MIMD code is 
generally inefficient on most MIMD machines due to the cost of runtime synchronization, but 
synchronization is implicit in the meta-state converted SIMD code, and hence has no runtime 
cost. 
While the prototype implementation demonstrates the feasibility and correctness of the
meta-state conversion algorithm, it does not yet automate the process of generating the final
SIMD code. Future work will integrate the code generation process and will benchmark performance.








    apc = globalor(pc);
    switch (((-apc) >> 5) & 3) {
    case 1: goto ms_2_6;
    case 2: goto ms_6;
    case 3: goto ms_2;
    }

ms_2:
    if (pc & BIT(2)) {
        Push(1) Push(0) LdL Push(12)
        StL Pop(2) Push(4) LdL
        JumpF(9,2)
    }
    apc = globalor(pc);
    switch (((-apc) >> 8) & 3) {
    case 1: goto ms_2_9;
    case 2: goto ms_9;
    case 3: goto ms_2;
    }

ms_9:
    if (pc & BIT(9)) {
        Push(4) LdL
        Ret(3)
    }
    /* no next meta state */
    exit(0);

ms_2_9:
    if (pc & BIT(2)) {
        Push(1) Push(0) LdL
        Push(12) StL Pop(2)
    }
    if (pc & (BIT(2) | BIT(9))) {
        Push(4) LdL
    }
    if (pc & BIT(2)) {
        JumpF(9,2)
    }
    if (pc & BIT(9)) {
        Ret(3)
    }
    apc = globalor(pc);
    switch (((-apc) >> 8) & 3) {
    case 1: goto ms_2_9;
    case 2: goto ms_9;
    case 3: goto ms_2;
    }

ms_6:
    if (pc & BIT(6)) {
        Push(2) Push(0) LdL Push(12)
        StL Pop(2) Push(4) LdL
        JumpF(9,6)
    }
    apc = globalor(pc);
    switch (((-apc) >> 8) & 3) {
    case 1: goto ms_6_9;
    case 2: goto ms_9;
    case 3: goto ms_6;
    }

ms_6_9:
    if (pc & BIT(6)) {
        Push(2) Push(0) LdL
        Push(12) StL Pop(2)
    }
    if (pc & (BIT(6) | BIT(9))) {
        Push(4) LdL
    }
    if (pc & BIT(6)) {
        JumpF(9,6)
    }
    if (pc & BIT(9)) {
        Ret(3)
    }
    apc = globalor(pc);
    switch (((-apc) >> 8) & 3) {
    case 1: goto ms_6_9;
    case 2: goto ms_9;
    case 3: goto ms_6;
    }

ms_2_6:
    if (pc & BIT(2)) {
        Push(1)
    }
    if (pc & BIT(6)) {
        Push(2)
    }
    if (pc & (BIT(2) | BIT(6))) {
        Push(0) LdL Push(12) StL
        Pop(2) Push(4) LdL
    }
    if (pc & BIT(2)) {
        JumpF(9,2)
    }
    if (pc & BIT(6)) {
        JumpF(9,6)
    }
    apc = globalor(pc);
    switch (((apc >> 6) ^ apc) & 15) {
    case 5: goto ms_2_6;
    case 8: goto ms_9;
    case 9: goto ms_6_9;
    case 12: goto ms_2_9;
    case 13: goto ms_2_6_9;
    }

ms_2_6_9:
    if (pc & BIT(2)) {
        Push(1)
    }
    if (pc & BIT(6)) {
        Push(2)
    }
    if (pc & (BIT(2) | BIT(6))) {
        Push(0) LdL Push(12)
        StL Pop(2)
    }
    if (pc & (BIT(2) | BIT(6) | BIT(9))) {
        Push(4) LdL
    }
    if (pc & BIT(2)) {
        JumpF(9,2)
    }
    if (pc & BIT(6)) {
        JumpF(9,6)
    }
    if (pc & BIT(9)) {
        Ret(3)
    }
    apc = globalor(pc);
    switch (((apc >> 6) ^ apc) & 15) {
    case 5: goto ms_2_6;
    case 8: goto ms_9;
    case 9: goto ms_6_9;
    case 12: goto ms_2_9;
    case 13: goto ms_2_6_9;
    }




References
[Bla90] T. Blank, "The MasPar MP-1 Architecture," 35th IEEE Computer Society International
Conference (COMPCON), February 1990, pp. 20-24.
[CoS70] J. Cocke and J.T. Schwartz, Programming Languages and Their Compilers, Courant
Institute of Mathematical Sciences, New York University, April 1970.
[DiC92] H.G. Dietz and W.E. Cohen, "A Control-Parallel Programming Model Implemented
On SIMD Hardware," in Proceedings of the Fifth Workshop on Programming
Languages and Compilers for Parallel Computing, August 1992.
[Die92] H.G. Dietz, "Common Subexpression Induction," to appear in Proceedings of the
1992 International Conference on Parallel Processing, Saint Charles, Illinois,
August 1992.
[Die92a] H.G. Dietz, "Coding Multiway Branches Using Customized Hash Functions,"
Technical Report TR-EE 92-31, School of Electrical Engineering, Purdue University,
July 1992.
[DiO90] H.G. Dietz, M.T. O'Keefe, and A. Zaafrani, "An Introduction to Static Scheduling
for MIMD Architectures," Advances in Languages and Compilers for Parallel
Processing, edited by A. Nicolau, D. Gelernter, T. Gross, and D. Padua, The MIT Press,
Cambridge, Massachusetts, 1991, pp. 425-444.
[LiM90] M.S. Littman and C.D. Metcalf, An Exploration of Asynchronous Data-Parallelism,
Technical Report, Yale University, July 1990.
[Mas91] MasPar Computer Corporation, MasPar Programming Language (ANSI C compatible
MPL) Reference Manual, Software Version 2.2, Document Number 9302-0001,
Sunnyvale, California, November 1991.
[NiT90] M. Nilsson and H. Tanaka, "MIMD Execution by SIMD Computers," Journal of
Information Processing, Information Processing Society of Japan, vol. 13, no. 1,
1990, pp. 58-61.
[PaD92] T.J. Parr, H.G. Dietz, and W.E. Cohen, "PCCTS Reference Manual (version 1.00),"
ACM SIGPLAN Notices, Feb. 1992, pp. 88-165.
[Phi89] M.J. Phillip, "Unification of Synchronous and Asynchronous Models for Parallel
Programming Languages," Master's Thesis, School of Electrical Engineering,
Purdue University, West Lafayette, Indiana, June 1989.
[WiH91] P.A. Wilsey, D.A. Hensgen, C.E. Slusher, N.B. Abu-Ghazaleh, and D.Y. Hollinden,
"Exploiting SIMD Computers for Mutant Program Execution," Technical Report
No. TR 133-11-91, Department of Electrical and Computer Engineering, University
of Cincinnati, Cincinnati, Ohio, November 1991.
