Novel Code Optimization Techniques for DSPs by Leupers, Rainer
Novel Code Optimization Techniques for DSPs
Rainer Leupers
University of Dortmund
Department of Computer Science 12
44221 Dortmund, Germany
e-mail: leupers@ls12.cs.uni-dortmund.de
ABSTRACT
Software development for DSPs is frequently a bottleneck in the system design
process, due to the poor code quality delivered by many current C compilers. As a
consequence, most of the DSP software still has to be written manually in assembly
language. In order to overcome this problem, new DSP-specific code optimization
techniques are required, which, in contrast to classical compiler technology, take
the detailed processor architecture sufficiently into account. This paper describes
several new DSP code optimization techniques: maximum utilization of parallel
address generation units, exploitation of instruction-level parallelism through
exact code compaction, and optimized code generation for IF-statements by means of
conditional instructions. Experimental results indicate significant improvements in
code quality as compared to existing compilers.
1. INTRODUCTION
More and more DSP system designs are based on software running on programmable
processors rather than on dedicated hardware [1]. This trend towards software-based
implementation arises because software provides higher flexibility and better
opportunities for reuse than hardware.
Today, however, software development for DSPs frequently is a bottleneck in the
system design process. It is well known that many of the currently available C
compilers for DSPs cause a significant overhead in code size and performance as
compared to hand-written assembly code. This is confirmed by numerous software
developers and recent empirical studies from academia and industry. According to
[2], the compiler overhead may be on the order of several hundred percent. Such an
overhead can hardly be tolerated in the presence of real-time constraints and
limited program memory size. Therefore, time-consuming assembly-level programming is
still predominant in the area of DSP, and better compilers are among the development
tools most urgently demanded by embedded system designers [1]. As a consequence,
efficient code generation techniques for DSPs have received considerable attention
in recent years (cf. [3, 4, 5] for overviews).
The overhead of compiler-generated code is mainly due to the special architectural
features of DSPs, to which classical code optimization techniques can hardly be
applied. These features include special-purpose registers, special addressing modes,
and instruction-level parallelism. In order to make the use of high-level language
compilers feasible for more DSP applications, new DSP-specific code optimization
techniques are required, which take into account the detailed processor
architecture. An important observation in this context is that high compilation
speed is not necessarily an issue for DSP compilers. Instead, many compiler users
are willing to trade higher compilation times for better code quality. This makes it
possible to explore code optimization algorithms of comparatively high computational
complexity.
The purpose of this paper is to present several new DSP-specific code optimization
techniques. Experimental results indicate that the use of such techniques may
significantly reduce the overhead of compiler-generated code. The organization of
the paper is as follows. Section 2 describes techniques for utilization of special
addressing modes in DSPs. Section 3 focuses on exploitation of instruction-level
parallelism through code compaction. Section 4 deals with optimized code generation
for if-statements by using conditional instructions. Finally, section 5 provides
experimental results obtained by applying the proposed techniques to different DSPs.
2. ADDRESS GENERATION
As compared to CISC processors, DSPs offer very restricted memory addressing modes.
Frequently, only direct addressing (via the instruction word) and indirect
addressing (via special address registers) are supported. However, the address
generation units (AGUs) of DSPs usually provide support for auto-increment and
auto-decrement of address registers in parallel to operations of the central data
path. Examples are the TI TMS320C25, the Motorola 56k, and the Analog Devices
ADSP-210x. This feature allows for parallel next-address computation, whenever the
accessed program variables are appropriately mapped to memory locations. In
addition, many AGUs comprise modify (or index) registers intended to store
frequently required address modification constants. Fig. 1 shows the general
architecture of such an AGU, which contains a file of k address registers (ARs) and
m modify registers (MRs).
In the following, we outline optimization techniques that aim at allocating ARs and
arranging variables in memory such that the use of auto-increment address
computations is maximized.
[Figure 1: AGU block diagram. Labels: address register file, modify register file,
AR pointer, MR pointer, immediate value, constant "1", +/- unit, effective address.]
Fig. 1. Address generation unit (AGU) model
2.1 Scalar variables
After code generation, the exact order of accesses to scalar program variables is
known. Usually, source languages, such as C, do not prescribe any specific order of
local variables in memory. Therefore, a compiler may compute a good layout of
variables in memory, tailored towards the variable access sequence.
a) Lexicographic layout (a = 0, b = 1, c = 2, d = 3), cost: 9

access  AGU operation
b       LOAD AR, 1
d       AR += 2
a       AR -= 3
c       AR += 2
d       AR ++
a       AR -= 3
c       AR += 2
b       AR --
a       AR --
d       AR += 3
a       AR -= 3
c       AR += 2
d       AR ++

b) Improved layout (c = 0, a = 1, d = 2, b = 3), cost: 3

access  AGU operation
b       LOAD AR, 3
d       AR --
a       AR --
c       AR --
        LOAD MR, 2
d       AR += MR
a       AR --
c       AR --
b       AR += 3
a       AR -= MR
d       AR ++
a       AR --
c       AR --
d       AR += MR
Fig. 2. Addressing of scalar variables
As an example, consider a variable set $V = \{a, b, c, d\}$ and a variable access
sequence $S = (b, d, a, c, d, a, c, b, a, d, a, c, d)$. Suppose one AR and one MR are
available for generating the required memory addresses for S. Fig. 2 a) shows (in
C-like notation) the corresponding sequence of AGU operations for a "naive" memory
layout, where variables are mapped to memory cells in lexicographic order. Since
only 4 out of 13 address computations are implemented by parallel
auto-increment/decrement operations, there is a cost value of 9, i.e., 9 extra
instructions are required for explicit address computations. A better memory layout
is shown in Fig. 2 b). Additionally, modify register MR is used in the AGU operation
sequence to store the address modification value 2, which is required several times.
Since the use of MR values as address modifiers incurs no overhead either, this
address generation scheme requires only 3 "costly" AGU operations in total, while
the others can be executed in parallel to other machine instructions.
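The cost values quoted above can be reproduced with a short script. The following is a minimal sketch, not the paper's implementation: the function name `agu_cost` and its parameters are hypothetical, and it assumes one AR, at most one MR constant, that steps of distance 0 or 1 are free, and that loading AR or MR each costs one instruction. The improved layout (c, a, d, b at addresses 0 to 3) is the one implied by the operation sequence of Fig. 2 b).

```python
def agu_cost(sequence, layout, modify=None):
    """Count AGU operations that need an extra instruction.

    sequence: variable names in access order
    layout:   dict mapping variable name -> memory address
    modify:   constant held in the single modify register, or None
    """
    addresses = [layout[v] for v in sequence]
    cost = 1                # initial LOAD AR
    mr_loaded = False
    for prev, cur in zip(addresses, addresses[1:]):
        distance = abs(cur - prev)
        if distance <= 1:
            continue        # auto-increment/decrement, runs in parallel
        if modify is not None and distance == modify:
            if not mr_loaded:
                cost += 1   # one-time LOAD MR
                mr_loaded = True
            continue        # AR += MR or AR -= MR, runs in parallel
        cost += 1           # explicit AR += k or AR -= k
    return cost


S = list("bdacdacbadacd")
naive = {"a": 0, "b": 1, "c": 2, "d": 3}
improved = {"c": 0, "a": 1, "d": 2, "b": 3}

print(agu_cost(S, naive))                # 9, as in Fig. 2 a)
print(agu_cost(S, improved, modify=2))   # 3, as in Fig. 2 b)
```

Under the same assumptions, giving the naive layout an MR constant of 2 yields a cost of 6, which suggests that the layout and the modify register each contribute to the saving.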
We have designed both heuristic graph-based and genetic algorithm based techniques,
which construct good memory layouts. These techniques are capable of constructing
close-to-optimum scalar address generation schemes for arbitrary numbers of ARs and
MRs in the AGU.
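For very small variable sets, a layout can also be found by plain enumeration, which gives a reference point for judging heuristics. The sketch below is such a brute-force baseline and is explicitly not the graph-based or genetic technique mentioned above. The names `transition_cost` and `best_layout` are hypothetical, and the cost model (one AR, one MR constant, one instruction per register load) is an assumption.

```python
from itertools import permutations


def transition_cost(addresses, modify):
    """Extra instructions for one address sequence, one AR, one MR constant."""
    cost = 1                            # initial LOAD AR
    mr_loaded = False
    for prev, cur in zip(addresses, addresses[1:]):
        distance = abs(cur - prev)
        if distance <= 1:
            continue                    # free auto-increment/decrement
        if distance == modify:
            if not mr_loaded:
                cost += 1               # one-time LOAD MR
                mr_loaded = True
            continue                    # free AR += MR or AR -= MR
        cost += 1                       # explicit address computation
    return cost


def best_layout(sequence):
    """Try every layout and every MR constant; return (cost, layout, modify)."""
    variables = sorted(set(sequence))
    best = None
    for perm in permutations(range(len(variables))):
        layout = dict(zip(variables, perm))
        addresses = [layout[v] for v in sequence]
        # modify=None means that the modify register is left unused
        for modify in [None] + list(range(2, len(variables))):
            cost = transition_cost(addresses, modify)
            if best is None or cost < best[0]:
                best = (cost, layout, modify)
    return best


cost, layout, modify = best_layout("bdacdacbadacd")
print(cost, layout, modify)
```

For the access sequence of Fig. 2, the enumeration reaches cost 3 with an MR constant of 2, so the layout of Fig. 2 b) is optimal under this cost model. The search visits n! layouts and is therefore only usable for a handful of variables.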
2.2 Arrays
In contrast to scalar variables, the memory layout for arrays is typically fixed, so
that only the allocation of ARs for accesses to array elements can be optimized.
Again, the goal is maximum utilization of auto-increment addressing, so as to
optimize code size and performance. Consider the following array access pattern
within a for loop:
for (i = 2; i <= N; i++)
{ /* a_1 */ A[i+1]
/* a_2 */ A[i]
/* a_3 */ A[i+2]
/* a_4 */ A[i-1]
/* a_5 */ A[i+1]
/* a_6 */ A[i]
/* a_7 */ A[i-2]
}
If only a single AR were used, the AGU operation
sequence for computing the addresses in the loop body
would look as follows:
AR1 = &A[3] /* initialize AR1 with &A[2+1] */
for (i = 2; i <= N; i++)
{ /* a_1 */ AR1 -- /* access A[i+1] */
/* a_2 */ AR1 += 2 /* access A[i] */
/* a_3 */ AR1 -= 3 /* access A[i+2] */
/* a_4 */ AR1 += 2 /* access A[i-1] */
/* a_5 */ AR1 -- /* access A[i+1] */
/* a_6 */ AR1 -= 2 /* access A[i] */
/* a_7 */ AR1 += 4 /* access A[i-2] */
}
However, this scheme would involve 5 costly address computations per loop iteration.
Another naive approach would be to allocate a separate AR for each of the 7
accesses. In this case, all address computations could obviously be covered by
auto-increment, but this would waste ARs. Our optimization technique minimizes the
number of allocated ARs while avoiding address computation overhead. For instance,
consider the array access pairs (a_1, a_2) and (a_1, a_3). Since in both cases the
absolute address distance is 1, either pair could share an AR without introducing
costly address computations. This relation can be modeled by a "distance graph"
(Fig. 3), which contains a node for each array access and an edge between nodes v
and w if the address for w can be computed from the address of v by auto-increment
or auto-decrement.
[Figure 3: distance graph with nodes a_1 (A[i+1]), a_2 (A[i]), a_3 (A[i+2]),
a_4 (A[i-1]), a_5 (A[i+1]), a_6 (A[i]), a_7 (A[i-2])]
Fig. 3. Distance graph for array accesses in loops
One can show that the problem of optimal AR allocation is equivalent to a path
covering problem on the distance graph. We have designed a branch-and-bound
technique to compute optimal path covers. In case of the above example, the
following addressing scheme with 3 ARs is optimal, which requires only
auto-increment/decrement operations:
AR1 = &A[3] /* initialize AR1 with &A[2+1] */
AR2 = &A[2] /* initialize AR2 with &A[2+0] */
AR3 = &A[0] /* initialize AR3 with &A[2-2] */
for (i = 2; i <= N; i++)
{ /* a_1 */ AR1 ++ /* access A[i+1] */
/* a_2 */ AR2 -- /* access A[i] */
/* a_3 */ AR1 -- /* access A[i+2] */
/* a_4 */ AR2 ++ /* access A[i-1] */
/* a_5 */ AR1 ++ /* access A[i+1] */
/* a_6 */ AR2 ++ /* access A[i] */
/* a_7 */ AR3 ++ /* access A[i-2] */
}
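The path-cover view can be checked on this example by exhaustive search. The sketch below is a simple enumeration, not the branch-and-bound technique of the paper, and the names `valid_path` and `min_ar_allocation` are hypothetical. It assumes that two consecutive accesses may share an AR if their index offsets differ by at most 1, and that the step from the last access of an AR to its first access in the next iteration (offset plus the loop step of 1) must satisfy the same condition.

```python
from itertools import product


def valid_path(offsets, step=1):
    """True if one AR can serve these offsets with auto-increment/decrement only."""
    inner = all(abs(b - a) <= 1 for a, b in zip(offsets, offsets[1:]))
    # Wrap-around: from the last access to the first access of the next iteration.
    wrap = abs(offsets[0] + step - offsets[-1]) <= 1
    return inner and wrap


def min_ar_allocation(offsets, step=1):
    """Smallest grouping of accesses (by position) into valid AR paths."""
    n = len(offsets)
    for k in range(1, n + 1):                       # try 1 AR, then 2, ...
        for labels in product(range(k), repeat=n):  # AR label for each access
            groups = [[i for i in range(n) if labels[i] == r] for r in range(k)]
            if all(g and valid_path([offsets[i] for i in g], step) for g in groups):
                return groups
    return None


# Index offsets of a_1 ... a_7 relative to i: A[i+1], A[i], A[i+2], ...
offsets = [1, 0, 2, -1, 1, 0, -2]
groups = min_ar_allocation(offsets)
names = [[f"a_{i + 1}" for i in g] for g in groups]
print(names)  # [['a_1', 'a_3', 'a_5'], ['a_2', 'a_4', 'a_6'], ['a_7']]
```

The result agrees with the 3-AR scheme above. Without the wrap-around condition, two paths (a_1, a_3, a_5, a_6 and a_2, a_4, a_7) would cover all accesses, which suggests that the inter-iteration step has to be part of the model.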
3. CODE COMPACTION
Most DSPs show a certain degree of instruction-level parallelism (ILP). A TI 'C25,
for instance, can execute a multiply-accumulate operation in parallel to an address
computation within a single instruction cycle. Obviously, exploitation of ILP is a
major source of code optimization. A popular technique for this purpose is code
compaction. Code compaction reads a piece of sequential machine code and assigns
instructions to a minimum number of control steps, such that all inter-instruction
dependencies and restrictions imposed by the instruction format are obeyed.
As an example, consider the expression tree shown in Fig. 4. Sequential assembly
code implementing this tree on a TI 'C25 DSP is shown in Fig. 5 a). The 'C25
instruction set allows different instruction pairs to be combined into single
instructions. For instance, an APAC (add P register to accumulator) and an LT (load
T register) can be compacted to an LTA instruction, and an APAC and a MPY (multiply)
can be compacted to MPYA, whenever not prevented by data dependencies. However, the
most efficient compaction scheme is usually far from obvious.
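[Figure 4: expression tree over the operands m1 to m9 with the operators +, - and *;
the code in Fig. 5 computes (m1*m2 + m3*m4) - (m5*m6 + m7*m8)*m9]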
Fig. 4. Expression tree
a) sequential code       b) compacted code
LT m7                    LT m7
MPY m8                   MPY m8
PAC                      LTP m5
LT m5                    MPY m6
MPY m6                   LT m3
APAC                     MPYA m4
SACL tmp                 SACL tmp
LT m3                    LTP m1
MPY m4                   MPY m2
PAC                      LTA tmp
LT m1                    MPY m9
MPY m2                   SPAC
APAC
LT tmp
MPY m9
SPAC
Fig. 5. Sequential and compacted 'C25 assembly code
Fig. 5 b) shows an optimal compaction for the sequential
assembly code. In this case, a reduction from 16 to 12
instructions (25 %) is achieved.
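To see why the best compaction is not obvious, it helps to compare against a naive baseline. The sketch below is a greedy peephole pass that fuses only adjacent instruction pairs and never reorders code. It is a stand-in for illustration, not the compaction technique of this paper, and the names `FUSIONS` and `greedy_compact` are hypothetical. The fusion rules APAC+LT and APAC+MPY are taken from the text, while PAC+LT to LTP is inferred from Fig. 5 b).

```python
# Adjacent 'C25 instruction pairs that can be fused into one instruction.
FUSIONS = {
    ("PAC", "LT"): "LTP",     # inferred from Fig. 5 b)
    ("APAC", "LT"): "LTA",    # named in the text
    ("APAC", "MPY"): "MPYA",  # named in the text
}


def greedy_compact(code):
    """Fuse adjacent pairs left to right; instructions are (opcode, operand)."""
    out = []
    i = 0
    while i < len(code):
        op, arg = code[i]
        if i + 1 < len(code):
            next_op, next_arg = code[i + 1]
            fused = FUSIONS.get((op, next_op))
            if fused is not None:
                # The first instruction of each pair has no operand,
                # so the fused instruction keeps the second one's operand.
                out.append((fused, next_arg))
                i += 2
                continue
        out.append((op, arg))
        i += 1
    return out


sequential = [
    ("LT", "m7"), ("MPY", "m8"), ("PAC", None), ("LT", "m5"),
    ("MPY", "m6"), ("APAC", None), ("SACL", "tmp"), ("LT", "m3"),
    ("MPY", "m4"), ("PAC", None), ("LT", "m1"), ("MPY", "m2"),
    ("APAC", None), ("LT", "tmp"), ("MPY", "m9"), ("SPAC", None),
]

compacted = greedy_compact(sequential)
print(len(sequential), len(compacted))  # 16 13
```

The greedy pass stops at 13 instructions. The 12-instruction code of Fig. 5 b) additionally moves SACL tmp behind the LT m3 / MPY m4 pair, so that the first APAC can be fused with that MPY into MPYA, which requires reordering under dependency constraints.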
One problem in code compaction for DSPs is that classical heuristic code compaction
techniques [6], mainly developed for VLIW machines, can hardly be applied directly.
The instruction format of DSPs sometimes permits alternative encodings for the same
instruction, and also undesired side effects in the compacted code have to be
avoided. In order to overcome the limitations of earlier heuristic compaction
algorithms, we are using a technique based on Integer Linear Programming. In this
approach, only the problem constraints (such as inter-instruction conflicts and
dependencies) are specified in the form of linear (in)equations. Then, the equation
system is solved by a standard tool. This guarantees (locally) optimally compacted
code. Alternatively, a cycle constraint can be imposed on the compacted code. The
technique is flexible enough to cope with alternative encodings and possible side
effects. Even though Integer Linear Programming is of exponential complexity, we
empirically found that it is still fast enough to solve many compaction problems of
small to medium size. For a 'C25, code blocks of a length up to 50 instructions can
typically be compacted within one minute of SPARC-20 CPU time.
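As an illustration of how such constraints can be written down, the following is a generic time-indexed formulation. It is a textbook-style sketch under stated assumptions and not necessarily the formulation used in this work: the binary variable $x_{i,t}$ is 1 if instruction $i$ is placed in control step $t$, $T$ is an upper bound on the schedule length (for example, the length of the sequential code), and $L$ is the number of control steps actually used. Alternative encodings and side effects are not modeled here.

```latex
\begin{align*}
\text{minimize}\quad & L \\
\text{subject to}\quad
 & \sum_{t=1}^{T} x_{i,t} = 1
   && \text{for every instruction } i \\
 & \sum_{t=1}^{T} t\,x_{j,t} \;\ge\; \sum_{t=1}^{T} t\,x_{i,t} + 1
   && \text{if } j \text{ depends on } i \\
 & x_{i,t} + x_{j,t} \;\le\; 1
   && \text{for all } t, \text{ if } i \text{ and } j \text{ conflict} \\
 & L \;\ge\; \sum_{t=1}^{T} t\,x_{i,t}
   && \text{for every instruction } i \\
 & x_{i,t} \in \{0, 1\}
\end{align*}
```

In this sketch, a cycle constraint corresponds to fixing $T$ to the permitted number of control steps and asking only whether a feasible assignment exists.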
4. CONDITIONAL INSTRUCTIONS
The source code of control-dominated applications typically contains a large number
of if-then-else (ITE) statements. Classical compiler technology uses conditional
jumps for implementation of ITE statements. However, a frequent change of control
flow due to many conditional jumps in the machine code has a strongly negative
impact on performance, in particular for deeply pipelined and highly parallel
VLIW-like processors. On a TI C62xx, for instance, any jump incurs up to 5 stall
cycles, resulting in a performance waste of up to 40 instructions (8 per cycle).
Therefore, recent VLIW DSPs permit replacing conditional jumps by conditional (or
predicated) instructions.
A conditional instruction is a term [C] I, where the condition C is a Boolean
variable stored in a register and I is any "regular" machine instruction, e.g., an
arithmetic operation, a register move, or a jump. The semantics of a conditional
instruction is that instruction I is effectively executed if and only if the
condition C evaluates to true at the point of time when the control flow in a
machine program reaches instruction I. Otherwise, instruction I behaves like a
no-operation.
The availability of conditional instructions gives a compiler two alternative ITE
implementation schemes. We denote the schemes with conditional instructions and
conditional jumps by c-exec and c-jump, respectively.
4.1 The c-jump scheme
Consider an ITE statement of the form
if <cond> then <B_T> else <B_E>
where <cond> denotes a condition, and B_T and B_E are the then and else blocks of
the statement. The standard replacement scheme using conditional jumps looks as
follows:
c := evaluate(cond)
[c] goto then_label
B_E
goto join_label
then_label: B_T
join_label: ...
The condition is evaluated into a register c, and depending on the value of c,
either $B_T$ or $B_E$ is executed. Then, control flow joins at the next instruction.
Let $T(B)$ denote the time to execute a basic block $B$, and let $J$ denote the
(machine-dependent) jump penalty, including the time for executing the jump
instruction itself. If the conditional jump is taken (i.e., condition c is true),
then the execution time for the ITE statement $S$ is $T_T(S) = J + T(B_T)$. If the
jump is not taken, then $T_E(S) = 2 \cdot J + T(B_E)$. The worst-case execution time
is $T(S) = \max(T_T(S), T_E(S))$.
4.2 The c-exec scheme
A semantically equivalent implementation using condi-
tional instructions is:
c := evaluate(cond)
[c] B_T
[!c] B_E
The notation "[c] B_T" denotes the conditional execution of all instructions in
block $B_T$. The worst-case execution time when using c-exec is
$T(S) = T(B_T \circ B_E)$, where "$\circ$" denotes the concatenation of basic
blocks. In total, c-exec leads to a shorter worst-case execution time than c-jump
exactly if

$$T(B_T \circ B_E) < \max(J + T(B_T),\; 2 \cdot J + T(B_E))$$

A potential advantage of c-exec lies in the fact that in VLIW processors,
$T(B_T \circ B_E)$ is frequently much less than $T(B_T) + T(B_E)$, because the
instructions in $B_T$ and $B_E$ may be partially executed in parallel. On the other
hand, c-exec is clearly not guaranteed to be the faster alternative in every case.
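The comparison between the two schemes can be written as a small decision function. This is a minimal sketch of the inequality from sections 4.1 and 4.2 only: the function names are hypothetical, all times are in cycles, and the numbers in the example calls are made up for illustration rather than measured on any processor.

```python
def wcet_c_jump(t_then, t_else, jump_penalty):
    """Worst-case time of the c-jump scheme: max(J + T(B_T), 2*J + T(B_E))."""
    taken = jump_penalty + t_then          # jump to then_label, run B_T
    not_taken = 2 * jump_penalty + t_else  # fall through, run B_E, jump to join
    return max(taken, not_taken)


def prefer_c_exec(t_then, t_else, t_merged, jump_penalty):
    """True if c-exec has the shorter worst-case time.

    t_merged is the time of the concatenated, conditionally executed
    blocks; it can be below t_then + t_else on a VLIW machine.
    """
    return t_merged < wcet_c_jump(t_then, t_else, jump_penalty)


# Short blocks, expensive jumps: c-exec wins (5 < 16).
print(wcet_c_jump(3, 4, 6))           # 16
print(prefer_c_exec(3, 4, 5, 6))      # True

# Long blocks, cheap jumps, no overlap between the blocks: c-jump wins.
print(prefer_c_exec(10, 10, 20, 1))   # False
```

The two calls illustrate both sides of the trade-off: a high jump penalty with short blocks favors conditional execution, whereas long blocks that cannot be overlapped favor jumps.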
4.3 Implementation selection
We select the fastest implementation (w.r.t. the worst-case execution time) for an
ITE statement by means of estimations and a dynamic programming algorithm. The
estimation functions essentially count the number of statements in the then and else
block of an ITE statement. In case of nested ITE statements, some additional
instructions have to be inserted into the c-jump and c-exec schemes shown above,
which ensure the correct propagation of preconditions to lower-level ITE statements.
Preconditions reflect the fact that a nested ITE statement must only be executed if
the condition of the "surrounding" ITE statement has been evaluated to true. This
additional code is also taken into account during estimation.
The main problem is to select the fastest ITE implementation schemes across all
nesting levels, because there is a cyclic dependence of the execution speed of ITE
statements at different nesting levels. The dynamic programming algorithm breaks
this cyclic dependence while exploiting the estimations as subroutines. In a
bottom-up fashion, four cost estimation values are computed for each ITE statement,
which depend on whether the statement is implemented by c-jump or c-exec, and
whether or not a precondition has to be passed to the next nesting level.
Afterwards, a top-down pass actually selects the fastest implementations for the ITE
statements at all levels.
5. EXPERIMENTAL RESULTS
We have experimentally evaluated the techniques outlined in the previous sections
for different DSPs. The address generation and code compaction techniques described
in sections 2 and 3 have been implemented in Record, a retargetable compiler for a
class of fixed-point DSPs [5]. We have used Record to compile the DSPStone
benchmarks [2] into machine code for a TI 'C25 DSP and compared the code size with
the machine code generated by the TI 'C25 ANSI C Compiler. The results are shown in
Fig. 6.
[Figure 6: bar chart, vertical axis from 0 to 700; benchmarks: real update, complex
mult, complex update, N real updates, N complex updates, fir, biquad_one, biquad_N,
dot product, convolution]
Fig. 6. Experimental results: relative code size for DSPStone benchmarks and TI 'C25 DSP
The left columns show the overhead (in percent compared to hand-written assembly
code) produced by the TI compiler, while the right columns show the corresponding
results produced by Record. On average, Record was able to halve the overhead as
compared to the TI compiler. However, this achievement comes at the price of an
increase in compilation time. Due to the use of comparatively time-intensive
optimization techniques, the compilation speed is on the order of 2-5 generated
instructions per CPU second. As mentioned above, however, compilation speed is
frequently not the most critical resource in the area of DSP, and better code
quality justifies lower compilation speed.
The optimization of IF-statements described in section 4 has been evaluated for a TI
C62xx VLIW DSP (table I). We have extracted 10 control-intensive pieces of C source
code from an ADPCM transcoder and an MPEG package. These program fragments have been
compiled by means of the ITE implementation selection algorithm and the TI assembly
optimizer (column "opt"). Again, the results have been compared to those directly
produced by the TI C6x ANSI C compiler (column "TI"). Even though we are currently
using rather simple estimation functions, faster code has been generated in six of
the ten cases, with one tie and three cases in which the TI compiler's code is
faster. This is because the proposed technique makes more intensive use of
conditional instructions (across several nesting levels) than the TI compiler.
However, an increase in code size has also been measured. So, the applicability of
the optimization technique from section 4 depends on the code optimization goal
(size or speed).
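The per-fragment comparison in Table I can be tallied with a few lines of code. The sketch below only restates the table's numbers; the dictionary name `results` and the restored spellings "diff comp" and "find mv" are assumptions about the original labels.

```python
# Worst-case execution time in instruction cycles, copied from Table I:
# fragment -> (opt, TI)
results = {
    "adapt quant": (11, 15),
    "adapt predict1": (13, 13),
    "adapt predict2": (22, 27),
    "diff comp": (12, 10),
    "outp conv": (24, 21),
    "code adj1": (23, 30),
    "code adj2": (49, 51),
    "code adj3": (30, 41),
    "detect pos": (27, 29),
    "find mv": (30, 28),
}

faster = sum(1 for opt, ti in results.values() if opt < ti)
equal = sum(1 for opt, ti in results.values() if opt == ti)
slower = sum(1 for opt, ti in results.values() if opt > ti)
total_opt = sum(opt for opt, _ in results.values())
total_ti = sum(ti for _, ti in results.values())

print(faster, equal, slower)   # 6 1 3
print(total_opt, total_ti)     # 241 265
```

Summed over all ten fragments, the optimized code needs 241 cycles against 265 for the TI compiler. Since the fragments are of different size and importance, this sum is only a rough indicator.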
source            opt   TI
adapt quant        11   15
adapt predict1     13   13
adapt predict2     22   27
diff comp          12   10
outp conv          24   21
code adj1          23   30
code adj2          49   51
code adj3          30   41
detect pos         27   29
find mv            30   28
TABLE I
Experimental results: worst-case execution time (instruction cycles) for TI C62xx
DSP with and without optimization of IF-statements
6. CONCLUSIONS
In this paper, we have proposed several new DSP code optimization techniques beyond
the scope of classical compilers, and we have experimentally shown their practical
applicability. Many of the techniques are efficient and easy to implement, so that
they could be integrated into commercial compilers.
Future work will concentrate on further optimization techniques, with emphasis on
VLIW DSPs. The main goal is to provide compiler technology that is capable of
replacing assembly-level programming of DSPs by the use of high-level languages and
compilers, so as to enable higher productivity in DSP software development.
References
[1] P. Paulin, M. Cornero, C. Liem, et al.: Trends in Embedded Systems Technology,
in: M.G. Sami, G. De Micheli (eds.): Hardware/Software Codesign, Kluwer Academic
Publishers, 1996
[2] V. Zivojnovic, J.M. Velarde, C. Schlager, H. Meyr: DSPStone: A DSP-oriented
Benchmarking Methodology, Int. Conf. on Signal Processing Applications and
Technology (ICSPAT), 1994
[3] P. Marwedel, G. Goossens (eds.): Code Generation for Embedded Processors, Kluwer
Academic Publishers, 1995
[4] C. Liem: Retargetable Compilers for Embedded Core Processors, Kluwer Academic
Publishers, 1997
[5] R. Leupers: Retargetable Code Generation for Digital Signal Processors, Kluwer
Academic Publishers, 1997
[6] S. Davidson, D. Landskov, B.D. Shriver, P.W. Mallett: Some Experiments in Local
Microcode Compaction for Horizontal Machines, IEEE Trans. on Computers, vol. 30,
no. 7, 1981, pp. 460-477
