Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling by Bachir, Mounira et al.
HAL Id: inria-00637938
https://hal.inria.fr/inria-00637938
Submitted on 3 Nov 2011
Decomposing Meeting Graph Circuits to Minimise
Kernel Loop Unrolling
Mounira Bachir, Sid Touati, Albert Cohen
To cite this version:
Mounira Bachir, Sid Touati, Albert Cohen. Decomposing Meeting Graph Circuits to Minimise Kernel
Loop Unrolling. 9th Workshop on Optimizations for DSP and Embedded Systems (ODES-9), In
conjunction with: International Symposium on Code Generation and Optimization (CGO), Apr 2011,
Chamonix, France. pp.8. ￿inria-00637938￿
Mounira Bachir






INRIA Saclay – Ile-de-France
Albert.Cohen@inria.fr
Abstract
This article studies an important open problem in backend compilation regarding loop unrolling after periodic register allocation. Although software pipelining is a powerful technique to extract fine-grain parallelism, variables can stay alive across more than one kernel iteration, which is challenging for code generation. The classical software solution that does not alter the computation throughput consists in unrolling the loop a posteriori (13; 12). However, the resulting unrolling degree is often unacceptable and may reach absurd levels. Alternatively, loop unrolling can be avoided thanks to software register renaming. This is achieved through the insertion of move operations. However, inserting those operations may increase the initiation interval (II) and nullify the benefits of software pipelining itself.
We propose in this article a new technique to minimise the loop unrolling degree generated after periodic register allocation. This technique consists in decomposing the generated meeting graph circuits by inserting move instructions without compromising the throughput benefits of software pipelining.
Our experiments show that the execution time of the method is acceptable and that good results are obtained when several functional units can execute move operations.
Categories and Subject Descriptors D.3.4 [Processors]: Code generation, Compilers, Optimization
General Terms Algorithms, Performance
Keywords Periodic Register Allocation, Software Pipelining, Loop Unrolling, Register Move Instructions, Code Optimisation
1. Introduction
Our focus is on the exploitation of instruction-level parallelism (ILP) in embedded VLIW processors (12). Increased ILP translates into higher register pressure and stresses the register allocation phase(s) and the design of the register files. In the case of software-pipelined loops, variables can stay alive across more than one kernel iteration, which is challenging for code generation and generally addressed through: (1) hardware support (rotating register files), deemed too expensive for most embedded processors; (2) insertion of register moves, with a high risk of reducing the computation throughput (i.e. increasing the initiation interval, II) of software pipelining; and (3) post-pass loop unrolling, which does not compromise throughput but often leads to impractical code growth.
We investigate ways to keep the size of the generated code compatible with embedded system constraints without compromising the throughput benefits of software pipelining. Namely, we want to minimise the unrolling degree resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II).
Having a minimal unroll factor reduces code size, which is an important performance measure for embedded systems because they have a limited memory size. Regarding high performance computing (desktop and supercomputers), loop code size may not be important for memory size, but may be so for I-cache performance. In addition to minimal unroll factors, it is necessary that the code generation scheme for periodic register allocation does not generate additional spill; the number of required registers must not exceed MAXLIVE (11) (the maximum number of values simultaneously alive). Prohibiting spill code aims to maintain II and to preserve performance.
When the instruction schedule is fixed, the circular lifetime intervals (CLI) and MAXLIVE are known. In this situation, known methods exist for computing unroll factors. These are:
• modulo variable expansion (MVE) (12; 13), which computes a minimal unroll factor but may introduce spill (since MVE may need more than MAXLIVE registers without proving an appropriate upper bound);
• Hendren's heuristic (10), which computes a sufficient unroll factor without introducing spill, but with no guarantee in terms of minimal register usage or unrolling degree; and
• the meeting graph framework (6), which is based on mathematical proofs which guarantee that the unroll degree will be sufficient to reach register minimality (i.e. MAXLIVE), but not that the unroll degree itself will be minimal.
Listing 1. Loop Program Example
Bachir et al. (2; 3) claim that loop unrolling minimisation (LUM) using extra remaining registers is an efficient method to bring loop unrolling as low as possible, with no increase of the II. However, some kernel loops may still require high unrolling degrees. These occasional high unrolling degrees suggest that it may be worthwhile to consider combining the insertion of move operations with kernel loop unrolling.
In this work, we study the loop unrolling problem after a given periodic register allocation. We decompose the different generated meeting graph circuits (MGC) by inserting move instructions, without compromising the throughput benefits of software pipelining and while still using a minimal number of registers, equal to MAXLIVE, to allocate the different variables. We argue that the approach based on decomposing meeting graph circuits can overcome the shortcomings of loop unrolling minimisation.
The rest of this article is organised as follows. Section 2 presents the most relevant related work for code generation. In Section 3, we present the problem we are dealing with through a motivating example. In Section 4, we define the main notions and present our method and the algorithm we designed to decompose the different meeting graph circuits in order to minimise the loop unrolling degree. In Section 5, we present our experimental results, which demonstrate that our method is efficient in practice. We conclude in Section 6.
Throughout the paper, the different loops are already software pipelined and we rely on the meeting graph framework as the periodic register allocator.
2. Code Generation: Background and
Challenges
We review the main issues and approaches to code generation for periodic register allocation using the loop example described in Listing 1.
There are two ways to deal with periodic register allocation: using special architecture support such as rotating register files, or without using such support. The latter may require the insertion of move operations or loop unrolling.
Listing 2. Example of Inserting Move Operations (Register Renaming)
2.1 Rotating Register File
A rotating register file (RRF) (8) is a hardware mechanism to prevent successive lifetime intervals from being assigned to the same physical registers.
In Listing 1, variable a[i] spans three iterations (defined in iteration i − 2 and used in iteration i). Hence, at least 3 physical registers are needed to carry simultaneously a[i], a[i + 1] and a[i + 2]. A rotating register file R automatically performs the move operation at each iteration. R acts as a FIFO buffer. The major advantage is that instructions in the generated code see all live values of a given variable through a single operand, avoiding explicit register copying. Below, R[k] denotes a register with offset k from R.




Using an RRF avoids both the code size increase due to loop unrolling and the loss of computation throughput due to the insertion of move operations.
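The FIFO behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name `RotatingRegisterFile`, its `rotate` method and the direction of rotation are hypothetical modelling choices made here for illustration, and do not describe the actual hardware of (8).

```python
class RotatingRegisterFile:
    """Toy model (assumption): logical R[k] maps to slot (base + k) mod size."""

    def __init__(self, size):
        self.slots = [None] * size
        self.base = 0

    def __getitem__(self, k):
        return self.slots[(self.base + k) % len(self.slots)]

    def __setitem__(self, k, value):
        self.slots[(self.base + k) % len(self.slots)] = value

    def rotate(self):
        # After rotation, the value written as R[0] is visible as R[1],
        # so no explicit register copy is needed.
        self.base = (self.base - 1) % len(self.slots)


# A value defined in one iteration is read two iterations later
# through the single operand R[2].
rrf = RotatingRegisterFile(3)
seen = []
for i in range(5):
    rrf[0] = f"a[{i}]"      # definition in the current iteration
    seen.append(rrf[2])     # use of the value defined two iterations earlier
    rrf.rotate()
```

With three registers, the value written in iteration i is still available when iteration i + 2 reads it, and the loop body itself never changes.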
2.2 Move Operations
This method is also called register renaming. Considering the example of Listing 1 for allocating a[i], we use 3 registers and perform move operations at the end of each iteration (5; 14): a[i] in register R1, a[i + 1] in register R2 and a[i + 2] in register R3. Then we use move operations to shift registers across the register file at every iteration, as shown in Listing 2. However, it is easy to see that if variable v spans d iterations, we have to insert d − 1 extra move operations at each iteration. In addition, this may increase the II and may require rescheduling the code if these move operations do not fit into the kernel. This is generally unacceptable as it negates most of the benefits of software pipelining.
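The cost of d − 1 moves per iteration can be made concrete with a short sketch. The helper `shift_registers` is a hypothetical name introduced here; it only models the shifting of values and is not the code of Listing 2.

```python
def shift_registers(regs, new_value):
    """Shift every value one register down, then store the new value.

    Returns the number of move operations executed, which is d - 1
    for a variable kept in d registers.
    """
    moves = 0
    for k in range(len(regs) - 1):
        regs[k] = regs[k + 1]   # move: R(k+1) <- R(k+2) in the 1-based names of the text
        moves += 1
    regs[-1] = new_value        # value defined by the current iteration
    return moves


regs = ["a[i]", "a[i+1]", "a[i+2]"]        # contents of R1, R2, R3
moves = shift_registers(regs, "a[i+3]")    # end of one iteration
```

For the three-register case of the text, two moves are executed at every iteration; these are the operations that may not fit into the kernel.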
2.3 Loop Unrolling
Another method, loop unrolling, is more suitable to maintain II without requiring hardware support such as RRF. The resulting loop body is bigger, but no extra operations are executed in comparison with the original code. Here, different registers are used for different instances of the variable a of Listing 1. In Listing 3, the loop is unrolled three times: a[i + 2] is stored in R1, a[i + 3] in R2, a[i + 4] in R3, a[i + 5] in R1, and so on.
Listing 3. Example of Loop Unrolling
By unrolling the loop, we avoid inserting extra move operations. The drawback is that the code size is multiplied by 3 in this case, and by the unrolling degree in the general case. This can have a dramatic impact by causing unnecessary instruction cache misses when the code size of the loop happens to be larger than the size of the instruction cache. For simplicity, we did not expand the code to assign registers for b and c. In addition, brute force searching for the best solution using loop unrolling has a prohibitive cost, and existing solutions may either sacrifice register optimality (10; 13; 15) or incur large unrolling overhead (6; 16).
2.3.1 Modulo Variable Expansion
Lam designed a general loop unrolling scheme called modulo variable expansion (MVE) (13). The major criterion of this method is to minimize the loop unrolling degree, because the memory size of the i-WARP processor is low (13). The MVE method defines a minimal unrolling degree to enable code generation after a given periodic register allocation. This unrolling degree is obtained by dividing the length of the longest live range ($\max_v LT_v$) by the number of cycles of the kernel: $\alpha = \lceil \max_v LT_v / II \rceil$. Once the loop is unrolled, MVE uses the interference graph for allocation. With MAXLIVE the maximum number of values simultaneously alive, the problem with MVE is that it does not guarantee a register allocation with a minimal number of registers equal to MAXLIVE (6; 10), and in general it may lead to unnecessary spills that break the benefits of software pipelining. A concrete example of this limitation can be found in (1).
In Listing 1, the longest live range lasts 8 cycles and the loop kernel takes 3 cycles, so $\alpha = \lceil 8/3 \rceil = 3$, and we should unroll the loop 3 times. Then we can assign to each variable a number of registers equal to the least integer greater than the span of the variable that divides II. In Listing 4, each variable a, b, c is assigned 3 registers using MVE: R1, R2, R3 for a; R4, R5, R6 for b; R7, R8, R9 for c; and the loop is unrolled 3 times.
Listing 4. Example of MVE
One can verify that it is not possible to allocate the different variables on fewer than 9 registers when unrolling the loop 3 times. But MVE does not ensure a register allocation with a minimal number of registers, and hence is not optimal. As we will see in the next section, only 8 registers are needed to allocate the different variables. In MVE, rounding up to the nearest integer when choosing the unrolling degree may miss opportunities for achieving an optimal register allocation.
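The MVE unrolling degree is a single ceiling division, which the following sketch spells out. The function name `mve_unroll_degree` is hypothetical, and only the longest lifetime quoted in the text (8 cycles, with a 3-cycle kernel) is used as input.

```python
def mve_unroll_degree(lifetimes, ii):
    """Unrolling degree of MVE: ceil(longest lifetime / II)."""
    longest = max(lifetimes)
    return -(-longest // ii)    # integer ceiling without floating point


# Longest live range of 8 cycles, kernel of 3 cycles.
alpha = mve_unroll_degree([8], ii=3)
```

Integer arithmetic is used for the ceiling so that large lifetimes are not affected by floating-point rounding.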
2.3.2 Meeting Graphs
The algorithm of Eisenbeis et al. (7; 9) can generate a periodic register allocation using a minimal number of registers equal to MAXLIVE if the kernel is unrolled, thanks to a dedicated graph representation called the meeting graph. It is a more accurate graph than the usual interference graph, as it has information on the number of clock cycles of each variable lifetime and on the succession of the lifetimes all along the loop. It is based on circular lifetime intervals (CLI). A preliminary remark is that, without loss of generality, we can consider that the width of the interval representation is constant at each clock cycle. If not, it is always possible to add unit-time intervals in each clock cycle where the width is less than MAXLIVE (9).
The formal definition of the meeting graph is as follows.
Definition 1 (Meeting Graph). Let F be a set of circular lifetime intervals with constant width MAXLIVE. The meeting graph related to F is the directed weighted graph G = (V, E). V is the set of circular intervals. Each edge e ∈ E represents the relation of meeting: there is an edge between two nodes vi and vj iff the interval vi ends when the interval vj begins. Each v ∈ V is weighted by its lifetime length in terms of processor clock cycles.
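Definition 1 translates directly into a few lines of code. The sketch below is illustrative only: the function `build_meeting_graph`, the convention that an interval is a pair (start, end) with an exclusive end, and the four intervals themselves are assumptions made here (they form a family of constant width 3 for II = 4, but they are not taken from any figure of the paper).

```python
def build_meeting_graph(intervals, ii):
    """Build the meeting graph of a family of circular lifetime intervals.

    intervals maps a variable name to (start, end), in clock cycles,
    where end may exceed ii for lifetimes crossing the kernel boundary.
    Returns (weights, edges): node weights are lifetime lengths, and
    (u, v) is an edge iff u ends at the cycle (modulo ii) where v begins.
    """
    weights = {v: end - start for v, (start, end) in intervals.items()}
    edges = {
        (u, v)
        for u, (_, end_u) in intervals.items()
        for v, (start_v, _) in intervals.items()
        if end_u % ii == start_v % ii
    }
    return weights, edges


# Hypothetical family of intervals with II = 4.
intervals = {"v1": (0, 1), "v2": (2, 4), "v3": (0, 4), "v4": (1, 6)}
weights, edges = build_meeting_graph(intervals, ii=4)
maxlive = sum(weights.values()) // 4   # constant width = total lifetime / II
```

In this family, v3 meets itself (its lifetime is exactly II), which is why a circuit reduced to a single node is possible.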
The meeting graph (MG) allows us to compute an unrolling degree which enables an allocation of the loops with RC = MAXLIVE registers. It can have several connected components of weight $\mu_1, \ldots, \mu_k$ (if there is only one connected component, its weight is $\mu_1 = RC$). This leads to the upper bound of unrolling $\alpha = \mathrm{lcm}(\mu_1, \ldots, \mu_k)$ (RC if there is only one connected component).
Listing 5. Example of Meeting Graph
Moreover, a possible lower bound of loop unrolling is computed by decomposing the graph into as many circuits as possible and then computing the least common multiple (lcm) of their weights (7; 9). The circuits are then used to compute the final allocation. This method can handle variables that are alive during several iterations, and it always finds an allocation with a minimal number of registers (MAXLIVE).
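Both bounds above are least common multiples of component or circuit weights. A minimal sketch follows; the helper name `unroll_degree` and the sample weight lists are hypothetical, and weights are expressed in registers (total lifetime divided by II).

```python
from functools import reduce
from math import gcd


def unroll_degree(weights):
    """Least common multiple of the circuit (or component) weights."""
    return reduce(lambda a, b: a * b // gcd(a, b), weights, 1)


single_component = unroll_degree([8])     # one component of weight RC = 8
two_circuits = unroll_degree([2, 1])      # circuits of weights 2 and 1
three_circuits = unroll_degree([2, 3, 4])
```

The last case shows why the unrolling degree can grow quickly: three small circuits with pairwise different weights already require 12 copies of the kernel.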
Figure 1(a) displays the circular lifetime intervals representing the different variables (a, b, c) of the loop example described in Listing 1; the maximum number of variables simultaneously alive is MAXLIVE = 8. As shown in Figure 1(b), the meeting graph is able to use 8 registers to allocate the different variables, instead of 9 with modulo variable expansion, by unrolling the loop 8 times. For the loop described in Listing 1, the meeting graph generates the code shown in Listing 5.
Figure 1. Meeting graph of the loop example in Listing 1: (a) Circular Lifetime Intervals, (b) Meeting Graph
The main drawback of the meeting graph method is that the loop unrolling degree can be high in practice although the number of registers used is minimal. That may cause spurious instruction cache misses or even be impracticable due to memory constraints, as in embedded processors. In order to minimise loop unrolling, Bachir et al. (2; 3) propose a method called loop unrolling minimisation (LUM method) that minimises the loop unrolling degree by using extra remaining registers for a given periodic register allocation. The LUM method brings loop unrolling as low as possible, with no increase of the II. However, some kernel loops may still require high unrolling degrees. These occasional cases suggest that it may be worthwhile to consider combining the insertion of move operations with kernel loop unrolling.
3. Circuit Decomposition: a Motivating
Example
In this section, we introduce our method, which consists in decomposing the different circuits generated after the meeting graph periodic register allocation in order to minimise the loop unrolling degree, and we show how it works on an example. Our study uses the meeting graph framework since minimal register allocation is most challenging when parallelism between loop iterations is exploited.
Our goal is to avoid a large unrolling degree while still using a minimal number of registers. Our approach is based on the following observation: in previous works that propose a periodic register allocation with a minimal number of registers (7; 16), the high loop unrolling degree comes from the least common multiple of the different weights of the generated circuits.
Let us study the loop described in Listing 6. This loop is software pipelined by DESP (4). The initiation interval is II = 4 and the circular lifetime intervals family is described in Figure 2(a). This family has maximum width MAXLIVE = 3. As shown in Figure 2(b), the meeting graph is decomposed into two strongly connected components C1 = {v1, v2, v4} and C2 = {v3} with the weights µ(C1) = 2 and µ(C2) = 1. Moreover, the meeting graph achieves a periodic register allocation with a minimal number of registers Rmin = MAXLIVE = 3 if we unroll the loop twice (lcm(2, 1) = 2).
Listing 6. Whetstone cycle4-2 Loop example






Our objective in decomposing the meeting graph circuits is to find a periodic register allocation with small circuit weights, which leads to a small lcm. Our exploratory method consists in decomposing the different circuits by adding m extra move instructions which can be performed in parallel with the other operations.
Let us assume that we can add one move operation at each clock cycle without increasing II. If we split the variable v4 at clock cycle II = 4 into two variables v41 and v42 as shown in Figure 3(a), then the circuit C1 is decomposed into two circuits C11 = {v1, v41} and C12 = {v2, v42} with the weights µ(C11) = 1 and µ(C12) = 1, as shown in Figure 3(b). This yields a periodic register allocation with a minimal number of registers Rmin = MAXLIVE = 3 without unrolling the loop, at the cost of one added move operation.





















Figure 2. Circular lifetime intervals and meeting graph for
the loop of Listing 6
We will now describe the circuit decomposition algorithm more precisely.
4. Circuit Decomposition Method
In this section, we describe how to decompose the different circuits in order to minimise loop unrolling. Given the different meeting graph circuits $C_1, \ldots, C_k$, we look for their best decomposition into new circuits $C_{1_1}, \ldots, C_{k_n}$ such that the least common multiple of their new weights is the minimal loop unrolling degree. Decomposing a given circuit C into two circuits C1 and C2 means that µ(C) = µ(C1) + µ(C2). Lemma 1 demonstrates that each new circuit respects the relation of meeting (9).
Lemma 1. Let C be a meeting graph circuit with weight µ(C). Let v1, . . . , vn be the different variables composing the circuit C. If µ(C) > 1, then there exists at least one decomposition of C into two circuits C1 and C2 which respect the relation of meeting and such that µ(C) = µ(C1) + µ(C2).
Figure 3. Circuit decomposition method
Proof. Let C = {v1, . . . , vn} be a meeting graph circuit. Let S and E be two functions such that S(v) is the clock cycle when the variable v starts and E(v) is the clock cycle when the variable v ends.
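C respects the relation of meeting because, for every i, vi ends when vi+1 begins (with vn followed by v1, modulo II). In addition, the sum of the variable lifetimes is a multiple of II:
$\sum_{i=1}^{n} \left( E(v_i) - S(v_i) \right) = \mu(C) \times II$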
Furthermore, if µ(C) > 1 then we can decompose the circuit C into two circuits C1 and C2 such that µ(C1) + µ(C2) = µ(C) and µ(C1), µ(C2) ∈ N∗. For instance, µ(C1) = 1 > 0 and µ(C2) = µ(C) − 1 > 0.
In order to prove the relation of meeting for both circuits C1 and C2, we look for the variables composing each circuit. We look for a variable v ∈ C which is alive at a chosen clock cycle, denoted cycle, such that cycle = S(v1) + µ(C1) × II. Let vi be this variable. Two cases may then arise:
1. if S(vi) = cycle, then the circuit C is decomposed into two circuits C1, C2 without adding any move operation. The decomposition of the initial circuit C is as follows: C1 = {v1, . . . , vi−1} and C2 = {vi, . . . , vn};
2. if S(vi) < cycle < E(vi), then the variable vi is split at time cycle into two variables vi1, vi2, which decomposes the circuit C into two circuits by adding one move operation. The decomposition of the circuit C is as follows: C1 = {v1, . . . , vi1} and C2 = {vi2, . . . , vn}.
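The two cases of the proof can be sketched as a small function. Everything named here is an assumption made for illustration: `split_circuit`, `weight`, the representation of a circuit as a list of (name, start, end) triples laid out on a continuous time axis (each interval starts where the previous one ends), and the sample intervals, which are hypothetical and chosen with II = 4.

```python
def weight(circuit, ii):
    """Weight of a circuit: total lifetime divided by II."""
    return (circuit[-1][2] - circuit[0][1]) // ii


def split_circuit(circuit, ii, mu1):
    """Split a circuit into C1 of weight mu1 and C2 of weight mu(C) - mu1.

    Returns (c1, c2, moves), where moves is 0 in case 1 (the split cycle
    is the start of a variable) and 1 in case 2 (a variable is cut in two).
    """
    if not 0 < mu1 < weight(circuit, ii):
        raise ValueError("mu1 must satisfy 0 < mu1 < weight of the circuit")
    cycle = circuit[0][1] + mu1 * ii
    for idx, (name, start, end) in enumerate(circuit):
        if start == cycle:                       # case 1: no move needed
            return circuit[:idx], circuit[idx:], 0
        if start < cycle < end:                  # case 2: one move needed
            head = (name + "_1", start, cycle)
            tail = (name + "_2", cycle, end)
            return circuit[:idx] + [head], [tail] + circuit[idx + 1:], 1
    raise AssertionError("unreachable for a contiguous circuit")


# Hypothetical circuit of weight 2 with II = 4: v4 is alive at cycle 4.
c = [("v1", 0, 1), ("v4", 1, 6), ("v2", 6, 8)]
c1, c2, moves = split_circuit(c, ii=4, mu1=1)
```

Here v4 is cut at cycle 4, which gives two circuits of weight 1 at the price of one move, the same shape as the decomposition of the motivating example.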
Circuit decomposition without increasing II proceeds as follows.
1. First of all, we collect the following information: the circular lifetime intervals family (CLI), the different meeting graph circuits, and the number $m_{cycle}$ of free move instructions at each clock cycle (cycle = 0, . . . , II − 1). These move instructions, if added, can be executed in parallel with other operations without altering the II.
2. Secondly, we give each circuit C an arbitrary start point, denoted S′(C), and an arbitrary end point, denoted E′(C). For example, if C = {v1, . . . , vn} then S′(C) = S(v1) and E′(C) = E(vn). In addition, C is decomposed into two circuits only if its weight satisfies µ(C) > 1. The circuit C can be split only at the clock cycles described in the following set:
$SPLIT_C = \{\, cycle \mid \exists m \in \mathbb{N}^* : cycle = S'(C) + m \times II,\ S'(C) < cycle < E'(C),\ m_{(cycle \bmod II)} > 0 \,\}$
We need this information to know exactly at which clock cycle we split the circuit. For instance, if we want to split C at II, then the exact split cycle is cycle = S′(C) + II. From the definition of the set $SPLIT_C$, we have the following inequalities:
$S'(C) < cycle < E'(C) \Rightarrow S'(C) < S'(C) + m \times II < E'(C) \Rightarrow 0 < m < \mu(C)$
Furthermore, decomposing the circuit C means looking for a variable v in this circuit which is alive at the clock cycle S′(C) + m × II.
3. In order to know the total number of possible circuit decompositions at clock cycle cycle, we look for the set $CIR_{cycle}$ which contains the different circuits starting at this clock cycle. Each circuit C ∈ $CIR_{cycle}$ can be decomposed at a clock cycle S′(C) + m × II with m = 1, . . . , µ(C) − 1. That is, we can deduce the number $wc_{cycle}$ of possible





Hence, the total number of circuit decompositions at




Consequently, the total number of generated circuits at






4. Finally, compute the least common multiple of the gener-
ated circuits and choose the best decomposition with the
minimal loop unrolling degree.
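The set of admissible split cycles introduced in step 2 can be computed directly from its definition. In the sketch below, `split_cycles` is a hypothetical helper, and `free_moves` is an assumed list giving, for each kernel cycle 0, . . . , II − 1, the number of move instructions that can still be added there.

```python
def split_cycles(start, end, ii, free_moves):
    """Clock cycles where a circuit spanning (start, end) may be split.

    A cycle qualifies if it equals start + m * ii for some m >= 1, lies
    strictly between start and end, and has a free move slot at the
    corresponding kernel cycle (cycle mod ii).
    """
    return [
        cycle
        for cycle in range(start + ii, end, ii)
        if free_moves[cycle % ii] > 0
    ]


# Circuit of weight 2 starting at cycle 0, with II = 4.
with_slot = split_cycles(0, 8, 4, free_moves=[1, 0, 0, 0])
without_slot = split_cycles(0, 8, 4, free_moves=[0, 1, 1, 1])
# Circuit of weight 3 starting at cycle 1.
weight_three = split_cycles(1, 13, 4, free_moves=[1, 1, 1, 1])
```

All candidate cycles of a given circuit are congruent to its start point modulo II, so they all compete for the free move slots of the same kernel cycle.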
Algorithm 1 implements our solution for decomposing the different circuits according to the move instructions that can be added without altering II. This algorithm requires the initial meeting graph circuits, denoted MGC, the initiation interval II, and the information about free units at each clock cycle cycle, denoted $m_{cycle}$. It returns the best circuit decomposition Circuits, which provides the minimal loop unrolling degree. In addition, because we are looking for all possible circuit decompositions which do not increase II, we use three sets G, Gt and Gc to store the different circuits after decomposition.
As we can see in Algorithm 1, the principal algorithm calls the sub-algorithm Decompose-Circuit with the following parameters: the initial meeting graph circuits Circuits, the circuit circuit which we want to decompose into two circuits, and the cycle window at which we want to decompose the circuit circuit. This sub-algorithm returns the new circuit decomposition Result and a boolean equal to TRUE if the circuit is decomposed, FALSE otherwise. We also use the algorithm MINIMIMAL-LCM-CIRCUITS, which provides the best circuit decomposition, i.e. the one whose least common multiple (loop unrolling degree) is minimal.
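The idea of enumerating decompositions and keeping the one with the smallest lcm can be sketched on a single circuit. This is a deliberately simplified stand-in, not Algorithm 1 itself: the names `lcm_of`, `decompositions` and `minimal_lcm` are hypothetical, only one circuit is decomposed while the other weights stay fixed, and every cut is pessimistically charged one move against a single budget.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def lcm_of(weights):
    """Least common multiple of a list of circuit weights."""
    return reduce(lambda a, b: a * b // gcd(a, b), weights, 1)


def decompositions(mu, budget):
    """Yield every way to cut a circuit of weight mu with at most budget cuts."""
    for k in range(0, min(budget, mu - 1) + 1):
        for cuts in combinations(range(1, mu), k):
            bounds = (0,) + cuts + (mu,)
            yield [b - a for a, b in zip(bounds, bounds[1:])]


def minimal_lcm(mu, budget, other_weights=()):
    """Decomposition of the circuit giving the smallest overall lcm."""
    return min(
        decompositions(mu, budget),
        key=lambda parts: lcm_of(list(parts) + list(other_weights)),
    )


# Circuit of weight 2 next to a circuit of weight 1, one free move.
best = minimal_lcm(2, budget=1, other_weights=(1,))
```

With one free move, the circuit of weight 2 is cut into two circuits of weight 1 and no unrolling is needed; with no free move, the only candidate is the original circuit and the loop is unrolled twice.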
Algorithm 1 Circuits decomposition with free move instructions
Require: MGC, the set of the meeting graph circuits; II, the initiation interval; and, at each clock cycle cycle, the number m_cycle of move instructions that can be added without altering II
Ensure: Circuits, the best circuit decomposition, which provides the minimal loop unrolling degree
G ← {MGC} {at the beginning, the set G contains the initial meeting graph circuits}
for cycle = 0 to II − 1 do
    Gt ← G {initially, the set Gt contains all the circuit decompositions generated at clock cycle cycle − 1}
    while m_cycle ≠ 0 do
        Gc ← ∅ {initialisation of the set Gc, which contains the generated combinations of circuit decompositions}
        for each Circuits ∈ Gt do
            for each circuit ∈ Circuits do
                if S′(circuit) = cycle then
                    window ← cycle + II
                    while window < E′(circuit) do
                        if Decompose-Circuit(Circuits, circuit, window, Result) then














Algorithm 1 delivers all possible decomposing circuits
5. Experimental Results









Measurements are taken to study the effectiveness of minimising the loop unrolling degree by decomposing the initial meeting graph circuits with free move operations, without increasing II and without increasing the number of allocated registers (the allocation is done with a minimal number of registers). We implemented Algorithm 1, which generates all the possible new meeting graph circuit decompositions (MGCs) according to the number of free move operations at each clock cycle, and then chooses the best circuit decomposition, with a minimal loop unrolling degree.
We did extensive experiments on more than 1900 DDGs extracted from various known benchmarks, namely Spec92fp, Spec92int, Livermore loops, Linpack and Nas. In addition, we studied several theoretical cases depending on the number of units which can execute move operations per clock cycle: we increased the number of functional units which can execute the move instruction from one unit to four units (only for theoretical study). Note that the number of units which execute move operations per clock cycle is a parameter of our program. For each case, we varied the number of architectural registers (Rarch) from 16 to 128 registers.
The different experiments show that the technique of decomposing the meeting graph circuits is fast in practice. On average, the execution time is less than 1 second in all cases. For instance, the average runtime is about 3 milliseconds when we add at most 2 move operations per cycle.
In order to display the main statistics of the initial loop unrolling degree and of the final loop unrolling degree produced by the circuits decomposition method, we show in Table 1, for each machine configuration, the number of DDGs where using the circuits decomposition method (CDM) improves loop unrolling. As we can see, the number of DDGs improved by decomposition increases with the number of units which execute move operations.
Rarch  One move  Two moves  Three moves  Four moves
16     16        75         143          646
32     22        99         179          717
64     29        113        201          759
128    29        113        201          763
Table 1. Number of DDGs where loop unrolling improves thanks to circuit decomposition
In order to highlight the main statistics for the initial and final loop unrolling degrees, we display the following numbers: the smallest loop unrolling degree (MIN), the lower quartile (Q1 = 25%), the median (Q2 = 50%), the arithmetic mean of the loop unrolling degree (average), the upper quartile (Q3 = 75%), and the largest loop unrolling degree (MAX). The different statistics are shown in the following tables:
• Table 2 shows observations when at most one move in-
struction can be added per clock cycle.
• Table 3 shows observations when at most two move in-
structions can be added per clock cycle.
• Table 4 shows observations when at most three move
instructions can be added per clock cycle.
• Table 5 shows observations when at most four move
instructions can be added per clock cycle.
As we can see, thanks to the circuits decomposition method (CDM), we do not unroll 25% of the DDGs in all configurations. In addition, the improvement is better when four functional units can execute move operations: in that case, we do not unroll 50% of the DDGs and we unroll 75% of the DDGs at most twice. However, the gain for the maximum loop unrolling degree is small in each configuration. Notice that the meeting graph proposes to perform a periodic register allocation by unrolling the loop MAXLIVE or MAXLIVE + 1 times, as described in (9; 2), if the final loop unrolling degree is greater than MAXLIVE.
Rarch  Loop unrolling     MIN  Q1  Median  Q3    MAX
16     Initial unrolling  2    2   2       5     6
16     CDM                1    1   1       2     5
32     Initial unrolling  2    2   4       12    210
32     CDM                1    1   2       5     30
64     Initial unrolling  2    2   6       66    2040
64     CDM                1    1   3       30    1008
128    Initial unrolling  2    2   6       66    2040
128    CDM                1    1   3       30    1008
Table 2. Configuration where at most one move operation can be added per clock cycle
Rarch  Loop unrolling     MIN  Q1  Median  Q3    MAX
16     Initial unrolling  2    2   3       6     12
16     CDM                1    1   2       3     7
32     Initial unrolling  2    2   4.5     6.75  210
32     CDM                1    1   2       4     60
64     Initial unrolling  2    2   6       12    420
64     CDM                1    1   2       6     210
128    Initial unrolling  2    2   6       12    420
128    CDM                1    1   2       6     210
Table 3. Configuration where at most two move operations can be added per clock cycle
Rarch  Loop unrolling     MIN  Q1  Median  Q3    MAX
16     Initial unrolling  2    2   3       5     20
16     CDM                1    1   1       2     10
32     Initial unrolling  2    2   4       6     210
32     CDM                1    1   2       3     30
64     Initial unrolling  2    2   4       12    840
64     CDM                1    1   2       4.25  420
128    Initial unrolling  2    2   4       12    840
128    CDM                1    1   2       4.25  420
Table 4. Configuration where at most three move operations can be added per clock cycle
Rarch  Loop unrolling     MIN  Q1  Median  Q3    MAX
16     Initial unrolling  2    2   2       6     30
16     CDM                1    1   1       2     15
32     Initial unrolling  2    2   2       6     210
32     CDM                1    1   1       2     30
64     Initial unrolling  2    2   2       6     3696
64     CDM                1    1   1       2     2310
128    Initial unrolling  2    2   2       6     6864
128    CDM                1    1   1       2     4290
Table 5. Configuration where at most four move operations can be added per clock cycle
6. Conclusion
We presented a new technique to minimise the degree of loop unrolling after periodic register allocation. This technique is based on the meeting graph. It searches for the best decomposition of the meeting graph circuits, through the insertion of register move instructions in free scheduling slots left after software pipelining. The throughput of the software pipeline is guaranteed to be preserved through the decomposition.
Our experiments showed that the running time of the algorithm is acceptable, and that good results can be produced when multiple functional units are capable of executing move instructions. When the architecture has four functional units which can perform move operations, we do not unroll 50% of the DDGs and we unroll 75% of the DDGs at most twice.
References
[1] Mounira Bachir. Loop Unrolling Minimisation for Periodic Register Allocation. PhD thesis, Université de Versailles Saint-Quentin-En-Yvelines, France, 2010.
[2] Mounira Bachir, David Gregg, and Sid-Ahmed-Ali Touati. Using the meeting graph framework to minimise kernel loop unrolling for scheduled loops. In Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing, Delaware, USA, 2009.
[3] Mounira Bachir, Sid-Ahmed-Ali Touati, and Albert Cohen. Post-pass periodic register allocation to minimise loop unrolling degree. In LCTES '08: Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems, pages 141–150, New York, NY, USA, 2008. ACM.
[4] Antoine Sawaya and Christine Eisenbeis. Optimal loop parallelization under register constraints. Technical report, INRIA, France, January 1996.
[5] Ron Cytron and Jeanne Ferrante. What's in a name? -or- the value of renaming for parallelism detection and storage allocation. In Proceedings of the 1987 International Conference on Parallel Processing (ICPP), pages 19–27, 1987.
[6] D. de Werra, Ch. Eisenbeis, S. Lelait, and B. Marmol. On a graph-theoretical model for cyclic register allocation. Discrete Applied Mathematics, 93(2-3):191–203, 1999.
[7] Dominique de Werra, Christine Eisenbeis, Sylvain Lelait, and Elena Stohr. Circular-arc graph coloring: On chords and circuits in the meeting graph. European Journal of Operational Research, 136(3):483–500, February 2002.
[8] James C. Dehnert, Peter Y.-T. Hsu, and Joseph P. Bratt. Overlapped loop support in the Cydra 5. In ASPLOS-III: Proceedings of the third international conference on Architectural support for programming languages and operating systems, pages 26–38, New York, NY, USA, 1989. ACM.
[9] Christine Eisenbeis, Sylvain Lelait, and Bruno Marmol. The meeting graph: a new model for loop cyclic register allocation. In PACT '95: Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques, pages 264–267, Manchester, UK, 1995. IFIP Working Group on Algol.
[10] Laurie J. Hendren, Guang R. Gao, Erik R. Altman, and Chandrika Mukerji. A register allocation framework based on hierarchical cyclic interval graphs. In CC '92: Proceedings of the 4th International Conference on Compiler Construction, pages 176–191, London, UK, 1992. Springer-Verlag.
[11] Richard A. Huff. Lifetime-sensitive modulo scheduling. SIGPLAN Not., 28(6):258–267, 1993.
[12] J. A. Fisher, P. Faraboschi, and C. Young. Embedded Computing: a VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann Publishers, 2005.
[13] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. SIGPLAN Not., 23(7):318–328, 1988.
[14] Alexandru Nicolau, Roni Potasman, and Haigeng Wang. Register allocation, renaming and their impact on fine-grain parallelism. In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages 218–235, London, UK, 1992. Springer-Verlag.
[15] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. SIGPLAN Not., 27(7):283–299, 1992.
[16] Sid-Ahmed-Ali Touati and C. Eisenbeis. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters, 14(2):287–313, 2004.
