Software Pipelining and Register Pressure in VLIW Architectures: Preconditioning Data Dependence Graphs is Experimentally Better Than Lifetime-Sensitive Scheduling by Brault, Frédéric et al.
Frédéric Brault, Benoît Dupont-de-Dinechin,




An old debate about an open question
Phase ordering problem:
instruction scheduling before/after register allocation?
Highlighted in the 1980s for sequential code, with register minimisation
Wealth of heuristics for acyclic scheduling




Software pipelining under resource constraints only
→ register pressure often goes out of control
Software pipelining under resource and register constraints
→ to spill or to increase the II – that is the question
Post-pass cyclic register allocation
→ necessary: modulo expansion (unrolling) and register assignment
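The tension between II and register pressure can be illustrated with a minimal sketch. It assumes each value is described by a hypothetical (start, end) lifetime within one iteration and counts how many values are live in each slot of the steady-state kernel; the function name `max_live` and the sample lifetimes are illustrative and not taken from the talk.

```python
def max_live(lifetimes, ii):
    """Peak number of simultaneously live values in the steady state.

    lifetimes: list of (start, end) cycles; a value is live on [start, end).
    ii: initiation interval (a new iteration starts every ii cycles).
    """
    pressure = [0] * ii
    for start, end in lifetimes:
        # Each live cycle lands in kernel slot (cycle mod ii); overlapping
        # iterations pile up in the same slot.
        for cycle in range(start, end):
            pressure[cycle % ii] += 1
    return max(pressure)


# Hypothetical loop body: three values with these lifetimes in one iteration.
lifetimes = [(0, 5), (1, 3), (2, 8)]
for ii in (2, 4, 8):
    print(ii, max_live(lifetimes, ii))
```

With the same lifetimes, a smaller II overlaps more iterations and so raises the peak pressure, which is the effect the slide describes for schedulers that only consider resource constraints.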
1. Introduction
Our strategy for VLIW
1 Decoupling register pressure control from instruction scheduling
→ better compiler engineering
→ focus scheduling on the core objectives (II, hiding memory latency)
2 Handling register constraints before scheduling under resource constraints
→ memory operations have statically unknown latencies → imprecise
scheduling and WCET analysis
3 Avoid spilling instead of scheduling spill code while taking care of II
→ Memory operations consume more power
2. Validating SIRA on real-life benchmarks and architectures
The target platform
ST231 processor
4-issue VLIW processor at 400 MHz
64 general purpose 32-bit registers (GR)
8 1-bit condition registers (BR)
1 LSU, 1 BCU, 4 ALU and 1 MAU functional units
32 KB 4-way Dcache, 32 KB direct-mapped Icache
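As a rough illustration of how these resources constrain the II, the sketch below computes the usual resource-based lower bound on the II. It assumes every operation occupies exactly one unit of one class for a single cycle, which is a simplification and not a statement about the real ST231 reservation tables; the unit counts are the ones listed above, and the operation mix is hypothetical.

```python
from math import ceil

# Unit counts as listed above; ISSUE_WIDTH is the 4-issue limit.
UNITS = {"LSU": 1, "BCU": 1, "ALU": 4, "MAU": 1}
ISSUE_WIDTH = 4


def res_mii(op_counts):
    """Resource-constrained lower bound on the II.

    op_counts: dict mapping a unit class to the number of operations of the
    loop body that need it (assumed: one unit, one cycle per operation).
    """
    total = sum(op_counts.values())
    bounds = [ceil(total / ISSUE_WIDTH)]
    bounds += [ceil(n / UNITS[unit]) for unit, n in op_counts.items()]
    return max(bounds)


# Hypothetical loop body: 3 memory ops, 6 ALU ops, 1 multiply, 1 branch.
print(res_mii({"LSU": 3, "ALU": 6, "MAU": 1, "BCU": 1}))  # 3
```

In this example the single LSU and the issue width both give a bound of 3 cycles, so added spill loads and stores would push the II up directly.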
Toolchain: ST200cc with LAO
Front-end compiler based on Open64
At -O3 optimization level, the LAO backend component performs
VLIW software pipelining
Post-pass register allocation in ST200cc












[Figure: (b) Reuse Graphs for Register Types t1 and t2]

$V^k = \{ k_{u^t} \mid u \in V^{R,t} \}$
$E^k = \{ (v, k_{u^t}) \mid v \in Cons(u^t) \}$
$E^\mu = \{ (k_{u^t}, v) \mid (u, v) \in E^{reuse,t} \}$
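The three set definitions can be read as a small graph construction. The sketch below is one possible encoding, not SIRAlib's API: the function name, the dictionary-based inputs and the `k_` naming of the added nodes are all assumptions, and edge latencies and distances are left out because they are not given here. The nodes k are interpreted as nodes marking the last use of each value, which is a reading of the notation rather than something stated on the slide.

```python
def extend_with_k_nodes(values, consumers, reuse_edges):
    """Build V^k, E^k and E^mu from a reuse graph (hypothetical encoding).

    values: iterable of values u of register type t (the set V^{R,t})
    consumers: dict u -> set of operations reading u (Cons(u^t))
    reuse_edges: dict (u, v) -> reuse distance (the set E^{reuse,t})
    """
    k = {u: f"k_{u}" for u in values}
    v_k = set(k.values())
    # Every consumer of u must precede k_u.
    e_k = {(v, k[u]) for u in values for v in consumers.get(u, ())}
    # A reuse edge (u, v) becomes an edge from k_u to v.
    e_mu = {(k[u], v) for (u, v) in reuse_edges}
    return v_k, e_k, e_mu


# Toy example: value a is read by b and c, value b is read by c.
v_k, e_k, e_mu = extend_with_k_nodes(
    ["a", "b"],
    {"a": {"b", "c"}, "b": {"c"}},
    {("a", "b"): 1, ("b", "a"): 2},
)
print(sorted(v_k), sorted(e_k), sorted(e_mu))
```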
Comparing SIRA vs. existing work
Unique features of SIRA
- Optimise for multiple register types simultaneously or one after another
- Model (read and write) delays in accessing registers
- Model register banks, buffers or rotating register files
- Register pressure guarantee independent of the scheduling algorithm
- Correctness proofs for the model and algorithms
- Reproducible results: standalone C library (SIRAlib), distributed with
experimental data
Validation of the effectiveness of SIRA in a production compiler
- Compiler construction: simplifies scheduling/allocation ordering
- Software engineering: SIRA as an independent C library pluggable into any
compiler
- Reproducibility: the source code is publicly released (LGPL)
- Effectiveness: already published for standalone DDGs; this talk reports
experimental results in an integrated context
SIRA: schedule independent register allocation
Fundamental principle: Theorem [Touati2001]
Let G be a loop DDG. Let G′ be the extended DDG of G associated with the
valid reuse graph $G^{reuse,t}$ for the register type t. Then any software
pipelining σ of G′ does not require more registers of type t than the sum
of the reuse distances. Formally, the bound is
$\sum_{(u,v) \in E^{reuse,t}} \mu^t_{u,v}$,
where $\mu^t_{u,v}$ is the reuse distance between u and v in $G^{reuse,t}$.
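A small sketch of how this bound can be evaluated per register type. The reuse graphs, value names and distances below are made up for illustration, the register budgets are the 32 GR and 4 BR used in the experiments of this talk, and the function names are hypothetical rather than part of SIRAlib.

```python
def register_bound(reuse_edges):
    """Sum of reuse distances: upper bound on registers of one type.

    reuse_edges: dict mapping (u, v) -> reuse distance (hypothetical encoding).
    """
    return sum(reuse_edges.values())


def fits(reuse_graphs, budget):
    """Check, for each register type, that the bound stays within the budget."""
    return {t: register_bound(edges) <= budget[t]
            for t, edges in reuse_graphs.items()}


# Toy reuse graphs for two register types (values and distances are made up).
reuse_graphs = {
    "GR": {("a", "b"): 2, ("b", "a"): 1, ("c", "c"): 3},
    "BR": {("p", "p"): 1},
}
print({t: register_bound(e) for t, e in reuse_graphs.items()})  # {'GR': 6, 'BR': 1}
print(fits(reuse_graphs, {"GR": 32, "BR": 4}))
```

Because the bound depends only on the reuse graph, it can be checked before any scheduler runs.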




Plugging SIRA into the ST231 toolchain
Experiments
Setup
FFMPEG, MEDIABENCH and SPEC CPU2000 benchmarks
ST231 register count lowered to 32 GR and 4 BR; both register types are
optimized simultaneously
Instruction schedulers
SIRA frees aggressive scheduling from register pressure worries
1 Optimal: Integer Linear Programming, minimize II and schedule length
2 Unwinding heuristic: unrolling-based method to build modulo schedules
3 Lifetime-sensitive heuristic: minimizes the sum of life-ranges
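To make the objective of the third scheduler concrete, the sketch below computes a sum of life-ranges for a given modulo schedule. It is a simplified model under stated assumptions: a value is taken to be live from the issue cycle of its producer to the issue cycle of its last consumer, latencies and register access delays are ignored, and the operation names and schedules are hypothetical.

```python
def lifetime_sum(sigma, edges, ii):
    """Sum over values of (last read time - definition time).

    sigma: dict op -> issue cycle within its own iteration
    edges: list of (producer, consumer, distance) flow dependences, where
           distance is the number of iterations separating the two ops
    ii: initiation interval
    """
    last_read = {}
    for u, v, dist in edges:
        # A consumer `dist` iterations later reads at sigma[v] + dist * ii.
        read = sigma[v] + dist * ii
        last_read[u] = max(last_read.get(u, sigma[u]), read)
    return sum(last_read[u] - sigma[u] for u in last_read)


edges = [("ld", "mul", 0), ("mul", "add", 0), ("add", "add", 1), ("add", "st", 0)]
stretched = {"ld": 0, "mul": 3, "add": 5, "st": 6}
compact = {"ld": 0, "mul": 1, "add": 2, "st": 3}
print(lifetime_sum(stretched, edges, ii=2))  # 7
print(lifetime_sum(compact, edges, ii=2))    # 4
```

A lifetime-sensitive heuristic would prefer the second schedule, since shorter life-ranges tend to lower register pressure at the same II.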
Questions
Does SIRA improve performance? For which scheduler?
How does a lifetime-sensitive heuristic compare with the combination
of SIRA with a pressure-unaware algorithm?
Experiments
Setup
Instrumentation of the toolchain yields static numbers about spills
and II
For each benchmark and each scheduler, we compare the numbers
obtained with the scheduler alone to those obtained with both SIRA
and the scheduler
Experiments









Experiments: cross-comparison
Question
How does a lifetime-sensitive heuristic compare with the combination of
SIRA with a pressure-unaware algorithm?
Setup
SIRA + unwinding scheduler vs. lifetime-sensitive scheduler alone
SIRA + optimal scheduler vs. lifetime-sensitive scheduler alone
Experiments: cross-comparisons
Experiments: spill code in post-pass
Does SIRA reduce spill or prevent it altogether?
Answer: evaluate the ratio
$\frac{\text{loops that no longer have spill once SIRA is used}}{\text{loops that had spill without SIRA}}$
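A minimal sketch of how such a ratio could be computed from per-loop static spill counts. The loop names and counts below are toy values, not results from the talk, and the function name is hypothetical.

```python
def spill_elimination_ratio(spills_without, spills_with):
    """Fraction of previously spilling loops that no longer spill with SIRA.

    spills_without / spills_with: dict loop name -> static spill count,
    measured without and with SIRA (both must cover the same loops).
    Returns None when no loop spilled without SIRA.
    """
    had_spill = [loop for loop, n in spills_without.items() if n > 0]
    if not had_spill:
        return None
    fixed = [loop for loop in had_spill if spills_with[loop] == 0]
    return len(fixed) / len(had_spill)


# Toy data: three loops spill without SIRA, two of them stop spilling with it.
without_sira = {"l1": 4, "l2": 0, "l3": 2, "l4": 7}
with_sira = {"l1": 0, "l2": 0, "l3": 1, "l4": 0}
print(spill_elimination_ratio(without_sira, with_sira))
```

A ratio close to 1 would mean that SIRA mostly prevents spilling altogether, and a low ratio that it only reduces it.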
Conclusions
Using SIRA significantly decreases both II and spills, for all schedulers
Not surprisingly, results are less impressive on the lifetime-sensitive
scheduler, since the heuristic already reduces register pressure
The combination of SIRA with an aggressive scheduler outperforms
the lifetime-sensitive approach
The speedup debate
Speedups depend on the data input and on the fraction of time spent in
the SWP loops.
The compiler optimises for an architectural objective, while speedup
comes from a complex interaction with the micro-architecture and the
experimental environment
If you get a speedup, who guarantees that it comes as a direct
consequence of the plugged-in optimisation? Phase ordering, hidden
side effects, etc.
In our case: SWP loops account for 0% to 5% of the whole
applications' execution times. Most of the speedups are equal to 1.
The other speedups vary from 0.85 to 2.4. Except in one case
(FFMPEG), all the observed speedups and slowdowns come from
I-cache effects!
Do not trust speedups when you work on code optimisation! Trust
what you can prove or demonstrate, not what you observe. Code
quality is a matter of many metrics; speedup is a single metric among
many others.
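The figures above can be put in perspective with Amdahl's law, which is an added framing and not part of the talk. If the SWP loops account for at most 5% of the execution time, then even an arbitrarily large speedup of those loops cannot improve the whole application by much, as the sketch below shows.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: whole-program speedup when only the loops get faster.

    loop_fraction: fraction of the execution time spent in the SWP loops
    loop_speedup: speedup factor applied to that fraction only
    """
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


# With 5% of the time in SWP loops, doubling loop speed gains about 2.6%.
print(round(overall_speedup(0.05, 2.0), 4))           # 1.0256
# Even infinitely fast loops cap the gain at about 5.3%.
print(round(overall_speedup(0.05, float("inf")), 4))  # 1.0526
```

Under this model, whole-program speedups such as 0.85 or 2.4 cannot be explained by the loop optimisation alone, which is consistent with the attribution to I-cache effects.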
