Time-Constrained Code Compaction for DSPs by Leupers, Rainer & Marwedel, Peter
c IEEE TRANSACTIONS ON VLSI SYSTEMS, VOL. 5, NO. 1, 1997 1
Time-Constrained Code Compaction for DSPs
Rainer Leupers, Peter Marwedel
Abstract: This paper addresses instruction-level parallelism in code
generation for DSPs. In presence of potential parallelism, the task of
code generation includes code compaction, which parallelizes primitive
processor operations under given dependency and resource constraints.
Furthermore, DSP algorithms in most cases are required to guarantee
real-time response. Since the exact execution speed of a DSP program is
only known after compaction, real-time constraints should be taken into
account during the compaction phase. While previous DSP code generators
rely on rigid heuristics for compaction, we propose a novel approach to
exact local code compaction based on an Integer Programming model, which
handles time constraints. Due to a general problem formulation, the IP
model also captures encoding restrictions and handles instructions
having alternative encodings and side effects, and therefore applies to
a large class of instruction formats. Capabilities and limitations of
our approach are discussed for different DSPs.
Keywords: Retargetable compilation, embedded DSPs, code compaction
I. Introduction
Research on electronic CAD is currently taking the step towards
system-level design automation. For economical reasons, contemporary
embedded VLSI systems are of heterogeneous nature, comprising both
hardware and software components in the form of ASICs and embedded
programmable processors. Consequently, system-level CAD tools need to
provide support for integrated hardware and software synthesis. Software
synthesis is the task of extracting those pieces of functionality from a
system specification, which should be assigned to programmable
processors, and mapping these pieces into executable, processor-specific
machine code.
The general optimization goal in hardware/software co-synthesis of
embedded VLSI systems is to minimize the amount of custom hardware
needed to implement a system under given performance constraints. This
is because implementation by software provides more flexibility, lower
implementation effort, and better opportunities for reuse. On the other
hand, software synthesis turns out to be a bottleneck in the design of
systems comprising programmable digital signal processors (DSPs): Most
DSP software is still coded at the assembly-language level [1], in spite
of the well-known drawbacks of low-level programming. Although
high-level language compilers for off-the-shelf DSPs are available, the
execution speed overhead of compiler-generated code (up to several
hundred percent compared to hand-crafted code [2]) is mostly
unacceptable.
The reason for this overhead is that compilers are hardly capable of
exploiting the highly dedicated and irregular architectures of DSPs.
Furthermore, there is still no designated standard programming language
for DSPs. The situation is even worse for application-specific DSPs
(ASIPs). Since these are typically low-volume and product-specific
designs, high-level language compilers for ASIPs hardly exist.
Nevertheless, ASIPs are expected to gain increasing market shares in
relation to standard DSPs [1].
Authors' affiliation: University of Dortmund, Department of Computer
Science 12, 44221 Dortmund, Germany, E-mail:
leupers|marwedel@ls12.informatik.uni-dortmund.de
Current research efforts to overcome the productivity bottleneck in DSP
code generation concentrate on two central issues [3]:
Code quality: In order to enable utilization of high-level language
compilers, the code overhead must be reduced by an order of magnitude.
This can only be achieved by means of new DSP-specific code optimization
techniques, reaching beyond the scope of classical compiler
construction. Classical optimization techniques, intended for large
programs on general-purpose processors, primarily focus on high
compilation speed, and thus have a limited effect. In contrast,
generation of efficient DSP code justifies much higher compilation
times. Therefore, there are large opportunities for new optimization
techniques, aiming at very high quality code generation within any
reasonable amount of compilation time.
Retargetability: In order to introduce high-level language compilers
into code generation for ASIPs, it is necessary to ensure flexibility of
code generation techniques. If a compiler can quickly be retargeted to a
new processor architecture, then compiler development will become
economically feasible even for low-volume DSPs. In an ideal case, a
compiler supports retargeting by reading and analyzing an external,
user-editable model of the target processor, for which code is to be
generated. Such a way of retargetability would permit ASIP designers to
quickly study the mutual dependence between hardware architectures and
program execution speed already at the processor level.
The purpose of this paper is to present a code optimization technique,
which aims at thoroughly exploiting potential parallelism in DSP machine
programs by exact local code compaction. Although code compaction is a
well-tried concept for VLIW machines, effective compaction techniques
for DSPs, which typically show a rather restrictive type of
instruction-level parallelism, have hardly been reported. The compaction
technique proposed in this paper takes into account the peculiarities of
DSP instruction formats as well as time constraints imposed on machine
programs. Furthermore, it is completely retargetable within a class of
instruction formats defined later in this paper.
Since we perform exact code compaction, the runtimes are significantly
higher than for heuristic approaches. Nevertheless, as will be
demonstrated, our compaction technique is capable of solving problems of
relevant size within
acceptable amounts of computation time. Whenever tight time constraints
demand extremely dense code, exact code compaction is thus a feasible
alternative to heuristic compaction.
The remainder of this contribution is organized as follows: Section II
gives an overview of the Record compiler system, which employs the
proposed code compaction technique in order to compile high-quality DSP
code in a retargetable manner. In Section III, we provide the background
for local code compaction and outline existing techniques. Then, we
discuss limitations of previous work, which motivates an extended,
DSP-specific definition of the code compaction problem, presented in
Section IV. Section V specifies a formal model of code compaction by
means of Integer Programming. Real-life examples and experimental
results are given in Section VI, and Section VII concludes with a
summary of results and hints for further research.
II. System overview
The Record compiler project, currently being carried out at the
University of Dortmund, is based on experiences gained with earlier
retargetable compilers integrated in the MIMOLA hardware design system
[4], [5]. Record is a retargetable compiler for DSPs, for which the main
objective is to find a reasonable compromise between the antagonistic
goals of retargetability and code quality. In the current version,
Record addresses fixed-point DSPs with single-cycle instructions, and
compiles DSP algorithms written in the DFL language [6] into machine
instructions for an externally specified target processor model. The
coarse compiler architecture is depicted in figure 1. The code
generation process is subdivided into the following phases:
Intermediate code generation: The DFL source program is analyzed and is
compiled into a control/dataflow graph (CDFG) representation. The basic
CDFG entities are expression trees (ETs) of maximum size, which are
obtained by data-flow analysis. Common subexpressions in ETs are
resolved heuristically by node duplication [7].
Instruction-set extraction: A hardware description language (HDL) model
of the target processor is analyzed and is converted into an internal
graph representation. Currently, the MIMOLA 4.1 HDL [8] is used for
processor modelling, but adaptation towards a VHDL subset would be
straightforward. On the internal processor model, instruction-set
extraction (ISE) is performed in order to determine the complete set of
register transfer (RT) patterns available on the target processor [9],
[10]. Additionally, extracted RT patterns are annotated with (possibly
multiple) binary encodings (partial instructions). A partial instruction
is a bitstring $I \in \{0, 1, x\}^L$, where $L$ denotes the instruction
word-length, and $x$ is a don't care value.
Compared to related retargetable DSP compiler systems, such as MSSQ [4],
CHESS [11], and CodeSyn [12], the concept of ISE is a unique feature of
Record: It accepts target processor models in a real HDL, which provides
a convenient interface to CAD frameworks. Furthermore, processors may be
modelled at different abstraction levels, ranging from purely behavioral
(instruction-set) descriptions down to RT-level netlists, consisting of
functional units, registers, busses, and logic gates. Due to usage of
binary decision diagrams (BDDs) [13] for control signal analysis, ISE
can be performed efficiently and also eliminates undesired effects
resulting from syntactic variances in processor models. In this way, ISE
provides the necessary link between hardware-oriented processor models
and advanced code generation techniques. The extracted RT patterns form
a set of tree templates, each of which represents a primitive,
single-cycle processor operation. Such an operation reads values from
registers, storages, or input ports, performs a computation, and writes
the result into a register, storage cell, or output port.
[Fig. 1. Architecture of the Record compiler. Inputs: DSP source program
(DFL language), target processor model (MIMOLA language), transformation
rule library. Phases: CDFG generation, instruction-set extraction, tree
parser generation (iburg + C compiler), code selection and register
allocation, integrated scheduling + spilling, mode register setting,
address assignment, code compaction.]
Tree parser generation: From the extracted RT patterns, a fast
processor-specific tree parser is automatically generated by means of
the iburg code generator generator [14]. The generated tree parser
computes optimal implementations of expression trees with respect to the
number of selected RT patterns. This includes binding of values to
special-purpose registers as well as exploitation of chained operations,
such as multiply-accumulates. The advantages of tree parsing as a means
of integrated code selection and register allocation for irregular
processor architectures have already been pointed out in [15] and [16].
ISE and tree parser generation have to be performed only once for each
new target processor and can be reused for different source programs.
Code selection and register allocation: ETs in the intermediate format
are consecutively mapped into processor-specific RTs by the tree parser.
The high speed of tree parsing permits consideration of different,
semantically equivalent alternatives for each ET. Alternative ETs are
generated based on a user-editable library of transformation rules.
Transformation rules are rewrite rules, which are necessary to cope with
unforeseen idiosyncrasies in the target processor, and can also increase
code quality by exploitation of algebraic rules. Simple algebraic rules,
such as commutativity, can already be incorporated into the tree parser
at virtually no extra cost in compilation speed (see also [17], [18]).
From each set of alternative ETs, the one with the smallest number of
required RTs is selected.
Vertical code generation: Code selection and register allocation by tree
parsing yield a register-transfer tree, which represents the covering of
an ET by processor-specific RT patterns. During vertical code
generation, register-transfer trees are heuristically sequentialized, so
as to minimize the spill requirements for register files with limited
capacity. Necessary spill and reload code is inserted, as well as
additional RTs adjusting possibly required mode register states
(arithmetic modes, shift modes). Mode registers store control signals,
which need to be changed only rarely. In the area of microprogramming,
mode registers correspond to the concept of residual control. Mode
register requirements of RTs are determined together with partial
instructions during instruction-set extraction. After vertical code
generation, the machine program consists of a set of RT-level basic
blocks.
Address assignment: Generated RTL basic blocks are augmented with RTs,
which ensure effective utilization of parallel address generation units
(AGUs) for memory addressing. This is accomplished by computing an
appropriate memory layout for program variables [19], [20].
Code compaction: After address assignment, all RTs necessary to
implement the desired program behavior have been generated, and are
passed one block at a time to the code compaction phase, which is the
subject of this paper. During code compaction, potential parallelism at
the instruction level is exploited, while obeying inter-RT dependencies
and conflicts imposed by binary encodings of RT patterns. The result is
executable processor-specific machine code.
III. Local code compaction
Code compaction deals with parallelizing a set of RTs under given
dependency relations and constraints by assigning RTs to control steps.
The set of all RTs assigned to the same control step forms a machine
instruction. The general optimization goal is to minimize the number of
control steps. Local compaction starts from an RTL basic block
$BB = (r_1, \dots, r_n)$, i.e. a sequence of RTs generated by previous
compilation phases. In contrast, global code compaction permits RTs to
be moved across basic block boundaries. In this paper, we consider local
compaction, for the following reasons:
1. Effective global compaction techniques need to employ local
techniques as subroutines. However, as indicated by experimental surveys
[2], even local compaction is not a well-solved problem for DSPs.
Therefore, it seems reasonable to first study local techniques in more
detail.
2. Popular global techniques, such as Trace Scheduling [21], Percolation
Scheduling [22], and Mutation Scheduling [23], have been shown to be
effective mainly for highly parallel and regular architectures, in
particular VLIWs. Contemporary DSPs, however, are not highly parallel
and tend to have an irregular architecture.
3. In order to preserve semantical correctness of compacted programs,
global techniques need to insert compensation code when moving RTs
across basic blocks. Compensation code may lead to a significant
increase in code size, which contradicts the goal of minimizing on-chip
program ROM area.
For local code compaction, it is sufficient to represent an RT by a pair
$r_i = (W_i, R_i)$, where $W_i$ is a write location, and $R_i$ is a set
of read locations. Write and read locations are registers, memory cells,
and processor I/O ports. Between any pair $r_i, r_j$ of RTs in a basic
block $BB = (r_1, \dots, r_n)$, the following dependency relations need
to be obeyed in order to preserve semantical correctness:
• $r_j$ is data-dependent on $r_i$ ("$r_i \stackrel{DD}{\hookrightarrow} r_j$"),
if $W_i \in R_j$, and $r_i$ produces a program value consumed by $r_j$.
• $r_j$ is output-dependent on $r_i$ ("$r_i \stackrel{OD}{\hookrightarrow} r_j$"),
if $W_i = W_j$ and $j > i$.
• $r_j$ is data-anti-dependent on $r_i$ ("$r_i \stackrel{DAD}{\hookrightarrow} r_j$"),
if there exists an RT $r_k$, such that
$r_k \stackrel{DD}{\hookrightarrow} r_i$ and $r_k \stackrel{OD}{\hookrightarrow} r_j$.
The dependency relations impose a partial ordering on
$\{r_1, \dots, r_n\}$, which can be represented by a DAG:
Definition: For an RTL basic block $BB = (r_1, \dots, r_n)$, the RT
dependency graph (RDG) is an edge-labelled directed acyclic graph
$G = (V, E, w)$, with $V = \{r_1, \dots, r_n\}$, $E \subseteq V \times V$,
and $w : E \to \{DD, DAD, OD\}$.
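The three dependency relations and the RDG definition can be illustrated with a minimal Python sketch. The function name `build_rdg`, the tuple encoding of an RT, and the use of a set of labels per edge (a pair of RTs may satisfy more than one relation) are assumptions made for illustration only; they are not part of Record. "Produces a program value consumed by" is approximated here by "is the most recent writer of a location that is read".

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding: an RT is (write location W_i, set of read locations R_i).
RT = Tuple[str, Set[str]]


def build_rdg(block: List[RT]) -> Dict[Tuple[int, int], Set[str]]:
    """Return labelled edges {(i, j): labels} using 0-based RT indices."""
    edges: Dict[Tuple[int, int], Set[str]] = {}

    def add(i: int, j: int, label: str) -> None:
        edges.setdefault((i, j), set()).add(label)

    # DD: r_j reads a location whose most recent writer in the block is r_i.
    last_writer: Dict[str, int] = {}
    for j, (write_loc, read_locs) in enumerate(block):
        for loc in read_locs:
            if loc in last_writer:
                add(last_writer[loc], j, "DD")
        last_writer[write_loc] = j

    # OD: W_i = W_j and j > i.
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if block[i][0] == block[j][0]:
                add(i, j, "OD")

    # DAD: some r_k exists with r_k -DD-> r_i and r_k -OD-> r_j.
    # Self-loops (i == j) are skipped; this is an assumption of the sketch.
    dd_pairs = [e for e, labels in edges.items() if "DD" in labels]
    od_pairs = [e for e, labels in edges.items() if "OD" in labels]
    for k, i in dd_pairs:
        for k2, j in od_pairs:
            if k == k2 and i != j:
                add(i, j, "DAD")
    return edges


block = [
    ("A", {"X", "Y"}),  # r_1: A := X op Y
    ("B", {"A"}),       # r_2: B := A
    ("A", {"Z"}),       # r_3: A := Z
]
rdg = build_rdg(block)
```

In this three-RT block, the second RT reads the value written by the first (DD), the third RT overwrites the first one's destination (OD), and therefore the third RT must not be scheduled before the second (DAD).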
In the hardware model used by Record, all RTs are single-cycle
operations, all registers permit at most one write access per cycle, and
all registers can be written and read within the same cycle. This leads
to the following "basic" definition of the code compaction problem:
Definition: A parallel schedule for an RDG $G = (V, E, w)$ is a mapping
$CS : \{r_1, \dots, r_n\} \to \mathbb{N}$, from RTs to control steps, so
that for all $r_i, r_j \in V$:
$r_i \stackrel{DD}{\hookrightarrow} r_j \Rightarrow CS(r_i) < CS(r_j)$
$r_i \stackrel{OD}{\hookrightarrow} r_j \Rightarrow CS(r_i) < CS(r_j)$
$r_i \stackrel{DAD}{\hookrightarrow} r_j \Rightarrow CS(r_i) \le CS(r_j)$
Code compaction is the problem of constructing a schedule $CS$ such that
$\max\{CS(r_1), \dots, CS(r_n)\} \to \min$
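The as-soon-as-possible time $ASAP(r_i)$ of an RT $r_i$ is defined as:
$ASAP(r_i) = \max\{\ \max\{ASAP(r_j) + 1 \mid (r_j \stackrel{DD}{\hookrightarrow} r_i) \lor (r_j \stackrel{OD}{\hookrightarrow} r_i)\},$
$\qquad\qquad \max\{ASAP(r_j) \mid r_j \stackrel{DAD}{\hookrightarrow} r_i\}\ \}$
with $\max \emptyset := 1$.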
The critical path length $L_c$ of an RDG is
$\max\{ASAP(r_i) \mid i \in \{1, \dots, n\}\}$
which provides a lower bound on the minimum schedule length.
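The as-late-as-possible time $ALAP(r_i)$ of an RT $r_i$ is defined as:
$ALAP(r_i) = \min\{\ \min\{ALAP(r_j) - 1 \mid (r_i \stackrel{DD}{\hookrightarrow} r_j) \lor (r_i \stackrel{OD}{\hookrightarrow} r_j)\},$
$\qquad\qquad \min\{ALAP(r_j) \mid r_i \stackrel{DAD}{\hookrightarrow} r_j\}\ \}$
with $\min \emptyset := L_c$.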
An RT $r_i$ lies on a critical path in an RDG, if
$ASAP(r_i) = ALAP(r_i)$.
In case of unlimited hardware resources, code compaction can be
efficiently solved by topological sorting. Real target architectures
however impose resource limitations, which may inhibit parallel
execution of pairwise independent RTs. These limitations can be captured
by an incompatibility relation
$\not\sim\ \subseteq V \times V$
comprising all pairs of RTs that cannot be executed in parallel due to a
resource conflict. Incompatibilities impose the additional constraint
$\forall r_i, r_j \in V : r_i \not\sim r_j \Rightarrow CS(r_i) \ne CS(r_j)$
on code compaction, in which case compaction becomes a
resource-constrained scheduling problem, known to be NP-hard [24].
Heuristic code compaction techniques became important with appearance of
VLIW machines in the early eighties. Popular heuristics include
first-come first-served, critical path, and list scheduling. These three
$O(n^2)$ algorithms have been empirically evaluated by Mallett et al.
[25]. It was concluded that each algorithm is capable of producing
close-to-optimum results in most cases, while differing in speed,
simplicity, and quality in relation to the basic block length $n$.
Nevertheless, the above techniques were essentially developed for
horizontal machines with few restrictions imposed by the instruction
format, i.e., resource conflicts are mainly caused by restricted
datapath resources.
IV. Compaction requirements for DSPs
Many DSPs, in particular standard components, do not show horizontal,
but strongly encoded instruction formats in order to limit the silicon
area requirements of on-chip program ROMs. An instruction format is
strongly encoded if the instruction word-length is small compared to the
total number of control lines for RT-level processor components. As a
consequence, instruction encoding prevents much of the potential
parallelism that is exposed if the pure datapath is considered, and most
inter-RT conflicts actually arise from encoding conflicts.
Instruction-level parallelism is restricted to a few special cases,
which are assumed to provide the highest performance gains with respect
to DSP requirements. A machine instruction of maximum parallelism
typically comprises one or two arithmetic operations, data moves, and
address register updates. However, there is no full orthogonality
between these operation types: Certain arithmetic operations can only be
executed in parallel to a data move to a certain special-purpose
register, an address register update cannot be executed in parallel to
all data moves, and so forth. Thus, compaction algorithms for DSPs have
to scan a relatively large search space in order to detect sets of RTs
qualified for parallelization. The special demands on code compaction
techniques for DSPs are discussed in the following.
A. Conflict representation
In presence of datapath resource and encoding conflicts, it is desirable
to have a uniform conflict representation. As already observed for
earlier MIMOLA-based compilers [5], checking for inter-RT conflicts in
case of single-cycle RTs can be performed in a uniform way by checking
for conflicts in the partial instructions associated with RTs. Two
partial instructions
$I_1 = (a_1, \dots, a_L),\ I_2 = (b_1, \dots, b_L)$
with $a_i, b_i \in \{0, 1, x\}$ are conflicting, if there exists an
$i \in \{1, \dots, L\}$, such that
$(a_i = 1 \land b_i = 0)$ or $(a_i = 0 \land b_i = 1)$
In our approach, partial instructions are automatically derived from the
external processor model by instruction-set extraction. Encoding
conflicts are obviously represented in the partial instructions. The
same holds for datapath resource conflicts, if control code for datapath
resources is assumed to be adjusted by the instruction word. This
assumption does not hold in two special cases: Firstly, there may be
conflicts with respect to the required mode register states of RTs. In
Record, mode register states are adjusted before compaction by inserting
additional RTs. Therefore, parallel scheduling of RTs with conflicting
mode requirements is prevented by additional inter-RT dependencies.
The second special case occurs in presence of tristate busses in the
processor model. Unused bus drivers need to be deactivated in each
control step, in order to avoid unpredictable machine program behavior
due to bus contentions. By deriving the necessary control code settings
for all bus drivers already during instruction-set extraction, it is
possible to map bus contentions to usual encoding conflicts.
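The conflict test on partial instructions amounts to a position-wise comparison of two strings over the alphabet 0, 1, x. A minimal Python sketch follows; the function names `conflicting` and `merge`, the string representation, and the 4-bit example patterns are assumptions for illustration and do not correspond to any real instruction set.

```python
def conflicting(a: str, b: str) -> bool:
    """True if some bit position holds 0 in one instruction and 1 in the other."""
    if len(a) != len(b):
        raise ValueError("partial instructions must have equal word-length")
    return any({x, y} == {"0", "1"} for x, y in zip(a, b))


def merge(a: str, b: str) -> str:
    """Combine two compatible partial instructions into one bit pattern."""
    if conflicting(a, b):
        raise ValueError("cannot merge conflicting partial instructions")
    # A don't care bit takes the value demanded by the other instruction.
    return "".join(y if x == "x" else x for x, y in zip(a, b))


compatible_pair = conflicting("01xx", "0x1x")   # no position disagrees
merged = merge("01xx", "0x1x")
conflict_pair = conflicting("01xx", "00xx")     # second bit: 1 versus 0
```

Because a don't care matches anything, two RTs are compatible exactly when a single instruction word exists that satisfies both partial instructions, which is what `merge` constructs.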
B. Alternative encoding versions
In general, each RT $r_i$ is not associated with a unique partial
instruction, but with a set of alternative encodings
$\{e_{i1}, \dots, e_{in_i}\}$. Alternative encodings may arise from
alternative routes for moving a value through the datapath. In other
cases, alternatives are due to instruction format: The TMS320C2x DSP
[26], for instance, permits execution of address register updates in
parallel to different arithmetic or data move instructions. Each address
register update is represented by a different opcode, resulting in a
number of alternative encodings to be considered. The same also holds
for other operations, for instance, the partial instructions
1) 00111100xxxxxxxx (LT)
2) 00111101xxxxxxxx (LTA)
3) 00111111xxxxxxxx (LTD)
4) 00111110xxxxxxxx (LTS)
are alternative encodings for the same RT, namely loading the "T"
register from memory. Compatibility of RTs strongly depends on the
selected encoding versions. Three RTs $r_i, r_j, r_k$ may have pairwise
compatible versions, but scheduling $r_i, r_j, r_k$ in parallel may be
impossible. Therefore, careful selection of encoding versions during
compaction is of outstanding importance for DSPs. In [25], version
shuffling was proposed as a technique for version selection, which can
be integrated into heuristic algorithms: Whenever some RT $r_i$ is to be
assigned to a control step $t$, the cross product of all versions for
$r_i$ and all versions of RTs already assigned to $t$ is checked for a
combination of compatible versions. However, version shuffling does not
permit removal of an "obstructive" RT from a control step $t$ once it
has been bound to $t$, and therefore has a limited optimization effect.
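The cross-product check described for version shuffling, and the remark that pairwise compatibility does not imply joint compatibility, can be sketched as follows. The function names and the 2-bit encodings are hypothetical and chosen only so that the effect is visible; they are not taken from the TMS320C2x.

```python
from itertools import combinations, product
from typing import List, Optional, Tuple


def conflicting(a: str, b: str) -> bool:
    """True if some bit position holds 0 in one instruction and 1 in the other."""
    return any({x, y} == {"0", "1"} for x, y in zip(a, b))


def find_compatible(version_sets: List[List[str]]) -> Optional[Tuple[str, ...]]:
    """Search the cross product of versions for a pairwise compatible choice."""
    for combo in product(*version_sets):
        if all(not conflicting(a, b) for a, b in combinations(combo, 2)):
            return combo
    return None


# Hypothetical encodings: r_i has two alternative versions, r_j and r_k one each.
r_i = ["0x", "x0"]
r_j = ["1x"]
r_k = ["x1"]

pair_ij = find_compatible([r_i, r_j])
pair_ik = find_compatible([r_i, r_k])
pair_jk = find_compatible([r_j, r_k])
triple = find_compatible([r_i, r_j, r_k])
```

Each pair admits a compatible version choice, but the pairs (r_i, r_j) and (r_i, r_k) need different versions of r_i, so no single control step can hold all three RTs.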
C. Side effects
A side effect is an undesired RT, which may cause incorrect behavior of
a machine program. Most compaction approaches assume that the
instruction format is such that side effects are excluded in advance.
However, if arbitrary instruction formats are to be handled, two kinds
of side effects must be considered during code compaction.
Horizontal side effects occur in weakly encoded instruction formats,
where several instruction bits remain don't care for each control step
$t$. Whenever such a don't care bit steers a register or memory, which
may contain a live value in $CS_t$, the register must be explicitly
deactivated. This can be accomplished by scheduling of no-operations
(NOPs) for unused registers. NOPs are special partial instructions,
which ensure that certain registers are disabled from loading a new
value during a certain control step. Partial instructions for NOPs can
be computed as a "by-product" of instruction-set extraction. As for RTs,
alternative NOP encoding versions can be present for the same register.
However, NOPs do not necessarily exist for all registers, e.g. in
architectures with extensive datapath pipelining. In this case,
compaction must tolerate side effects on registers not containing live
values. If this is not taken into account already during compaction,
code generation is likely to fail, although a solution might exist.
The second type of side effects, which we call vertical side effects,
occurs in presence of strongly encoded instruction formats. A vertical
side effect is exposed, if an encoding version $e_{ik}$ for an RT $r_i$
is "covered" by a version $e_{jk'}$ of another RT $r_j$. That is,
selection of $e_{ik}$ for $r_i$ implies that $r_j$ will be executed in
the same control step. If $r_j$ happens to be an RT ready to be
scheduled, this side effect can be exploited. On the other hand, version
selection must discard version $e_{ik}$, whenever this is not the case,
and $r_j$ might destroy a live value. Vertical side effects are
exemplified in the TMS320C2x instruction set: The partial instructions
LTA, LTD, LTS shown above have a side effect on the accumulator
register. If version selection is completed before NOPs are packed, then
vertical side effects can be prevented at most by coincidence.
A special aspect of vertical side effects are multiply-accumulates
(MACs) on DSPs. A MAC executes two operations P = Y * Z and A = A + P
within a single cycle. On some DSPs, for instance Motorola DSP56xxx
[27], MACs are data-stationary, i.e. multiplication and addition are
executed in chained mode. In contrast, the TMS320C2x incorporates
time-stationary MACs, in which case value P is buffered in a register.
From a code generation point of view, there is a strong difference
between these MAC types: Data-stationary MACs can already be captured
during parsing of expression trees, while generation of time-stationary
MACs must be postponed to code compaction. In turn, this calls for
compaction methods capable of handling vertical side effects.
D. Time constraints
In most cases, DSP algorithms are subject to real-time constraints.
While techniques for hardware synthesis under time constraints are
available, incorporation of time constraints into code generation has
hardly been treated so far. Unfortunate decisions during code selection
and register allocation may imply that a given maximum time constraint
cannot be met, so that backtracking may be necessary. However, a
prerequisite of time-constrained code selection and register allocation
is availability of time-constrained code compaction techniques. This is
because only compaction makes the critical path and thus the worst-case
execution speed of a machine program exactly known. Therefore,
compaction techniques are desirable which parallelize RTs with respect
to a given (maximum) time constraint of $T_{max}$ machine cycles. It
might be the case that a locally suboptimal scheduling decision leads to
satisfaction of $T_{max}$, while a rigid optimization heuristic fails.
E. Approaches to DSP code compaction
Heuristic compaction algorithms have been adopted for several recent DSP
code generators. Wess' compiler [15] uses the critical path algorithm,
which achieves code size reductions between 30 % and 50 % compared to
vertical code for an ADSP210x DSP. The range of possible instruction
formats that can be handled is however not reported. For the CodeSyn
compiler [12], only compaction for horizontal machines has been
described. The CHESS compiler [11] uses a list scheduling algorithm
which takes into account encoding conflicts, alternative versions, and
vertical side effects. Horizontal side effects and bus conflicts are a
priori excluded due to limitations of the processor modelling language.
In [28], a Motorola 56xxx specific compaction method is described,
however excluding out-of-order execution, i.e. the schedule satisfies
$i < j \Rightarrow CS(r_i) \le CS(r_j)$ for any two RTs $r_i, r_j$,
independent of dependency relations.
An exact (non-heuristic) compaction method, which does take into account
time constraints, has been reported by Wilson et al. [29]. The proposed
Integer Programming (IP) approach integrates code selection, register
allocation, and compaction. The IP model comprises alternative versions
and vertical side effects, but no bus conflicts and horizontal side
effects. Furthermore, the IP model (at least in its entirety) turned out
to be too complex for realistic target processors, and requires a large
amount of manual description effort.
The graph-based compaction technique presented by Timmer [30] achieves
comparatively low runtimes for exact code compaction under time
constraints by pruning the search space in advance. The pruning
procedure is based on the assumption that inter-RT conflicts are fixed
before compaction. In this case, the RTs can be partitioned into maximum
sets of pairwise conflicting RTs, for which separate sequential
schedules can be constructed efficiently. Timmer's technique produced
very good results for a family of real-life ASIPs, but has restricted
capabilities with respect to alternative versions and side effects. The
abovementioned assumption implies that incompatibility of versions
$e_{ik}$ for RT $r_i$ and $e_{jk'}$ for RT $r_j$ implies pairwise
incompatibility of all versions for $r_i$ and $r_j$.
The limitations of existing DSP compaction techniques motivate an
extended definition of the code compaction problem, which captures
alternative versions, side effects, and time constraints:
Definition: Let $BB = (r_1, \dots, r_n)$ be an RTL basic block, where
each $r_i$ has a set $E_i = \{e_{i1}, \dots, e_{in_i}\}$ of alternative
encodings. Furthermore, let $NOP = \{NOP_1, \dots, NOP_r\}$ denote the
set of no-operations for all registers $\{X_1, \dots, X_r\}$ that appear
as destinations of RTs in $BB$, and let $\{nop_{j1}, \dots, nop_{jn_j}\}$
be the set of alternative versions for all $NOP_j \in NOP$.
A parallel schedule for $BB$ is a sequence $CS = (CS_1, \dots, CS_n)$, so
that for any $r_i, r_j$ in $BB$ the following conditions hold:
• Each $CS_t \in CS$ is a subset of
$\bigcup_{j=1}^{n} E_j \cup \bigcup_{j=1}^{r} NOP_j$
• There exists exactly one $CS_t \in CS$, which contains an encoding
version of $r_i$ (notation: $cs(r_i) = t$).
• If $r_i \stackrel{DD}{\hookrightarrow} r_j$ or
$r_i \stackrel{OD}{\hookrightarrow} r_j$ then $cs(r_i) < cs(r_j)$.
• If $r_i \stackrel{DAD}{\hookrightarrow} r_j$, then $cs(r_i) \le cs(r_j)$.
• If $r_i \stackrel{DD}{\hookrightarrow} r_j$, then all control steps
$CS_t \in \{CS_{cs(r_i)+1}, \dots, CS_{cs(r_j)-1}\}$
contain a NOP version for the destination of $r_i$.
• For any two encoding and NOP versions
$e_1, e_2 \in \left[ \bigcup_{j=1}^{n} E_j \cup \bigcup_{j=1}^{r} NOP_j \right]$
there is no control step $CS_t \in CS$, for which
$e_1, e_2 \in CS_t \land e_1 \not\sim e_2$
For an RTL basic block $BB$ whose RT dependency graph has critical path
length $L_c$, time-constrained code compaction (TCC) is the problem of
computing a schedule $CS$, such that, for a given
$T_{max} \in \{L_c, \dots, n\}$, $CS$ satisfies
$CS_t = \emptyset$ for all $t \in \{T_{max} + 1, \dots, n\}$.
TCC is the decision variant of optimal code compaction, extended by
alternative encodings and side effects, and is thus NP-complete. This
raises the question of which problem sizes can be treated within an
acceptable amount of computation time. In the next section, we present
a solution technique which permits compaction of basic blocks of
relevant size in a retargetable manner.
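The ordering and time conditions of the definition above can be
checked mechanically. The following Python sketch is an illustration
only: the function name respects_dependencies and the data layout (a
dictionary mapping each RT to its control step, plus sets of dependency
pairs) are assumptions made here, and encoding versions, NOPs, and
compatibility are deliberately left out.

```python
def respects_dependencies(cs, dd, od, dad, t_max):
    """Check the ordering and time conditions of a candidate schedule.

    cs    : dict mapping RT name -> control step number (1-based)
    dd/od : sets of (ri, rj) pairs with a data/output dependency ri -> rj
    dad   : set of (ri, rj) pairs with a data-anti-dependency ri -> rj
    t_max : time constraint; no RT may be placed after this step
    """
    strict_ok = all(cs[i] < cs[j] for (i, j) in dd | od)
    weak_ok = all(cs[i] <= cs[j] for (i, j) in dad)
    in_time = all(1 <= t <= t_max for t in cs.values())
    return strict_ok and weak_ok and in_time


# Hypothetical three-RT block: a -> b is a data dependency,
# b -> c is a data-anti-dependency (so b and c may share a step).
example = {"a": 1, "b": 2, "c": 2}
print(respects_dependencies(example, {("a", "b")}, set(), {("b", "c")}, t_max=2))
```

Placing a and b in the same step, or c in a step beyond t_max, makes
the check fail.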
V. Integer Programming formulation
Recently, several approaches have been published which map
NP-complete VLSI-design related problems into an Integer (Linear)
Programming model (e.g. [31], [32]), in order to study the potential
gains of optimal solution methods compared to heuristics. IP is the
problem of computing a setting of $n$ integer solution variables
$(z_1, \ldots, z_n)$, such that an objective function
$f(z_1, \ldots, z_n)$ is minimized under the constraint
$$A \cdot (z_1, \ldots, z_n)^T \le B$$
for an integer matrix $A$ and an integer vector $B$. Although IP is
NP-hard, thus excluding exact solution of large problems, modelling
intractable problems by IP can be a promising approach, for the
following reasons:
• Since IP is based on a relatively simple mathematical notation, it
  is easily verified that the IP formulation of some problem meets
  the problem specification.
• IP is a suitable method for formally describing heterogeneously
  constrained problems, because these constraints often have a
  straightforward representation in the form of linear inequations.
  Solving the IP means that all constraints are simultaneously taken
  into account, which is not easily achieved in a problem-specific
  solution algorithm.
• Since IP is among the most important optimization problems,
  commercial tools are available for IP solving. These IP solvers
  rely on theoretical results from operations research, and are
  therefore reasonably fast even for relatively large Integer
  Programs. Using an appropriate IP formulation thus often permits
  optimal solution of NP-hard problems of practical relevance.
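To make the IP definition concrete, the following Python sketch solves
a tiny instance by plain enumeration. It is an illustration only and
not the method of a real IP solver: the function name
solve_ip_bruteforce, the restriction to 0/1 variables, and the
convention that constraints are written as $A z \le B$ are assumptions
made for this example.

```python
from itertools import product


def solve_ip_bruteforce(A, B, f, lo=0, hi=1):
    """Return a minimizer of f(z) subject to A z <= B, or None.

    Every variable ranges over the integers lo..hi, so the search
    space has (hi - lo + 1) ** n points and is only usable for toys.
    """
    n = len(A[0])
    best = None
    for z in product(range(lo, hi + 1), repeat=n):
        feasible = all(
            sum(a * x for a, x in zip(row, z)) <= b for row, b in zip(A, B)
        )
        if feasible and (best is None or f(z) < f(best)):
            best = z
    return best


# Minimize z1 + z2 such that at least one variable is 1,
# written as -z1 - z2 <= -1.
result = solve_ip_bruteforce(A=[[-1, -1]], B=[-1], f=sum)
print(result)
```

An infeasible system, for example one that additionally demands
$z_1 + z_2 \le 0$, yields None, which corresponds to the non-existence
of a solution.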
Our approach to TCC is therefore largely based on IP. In contrast to
Wilson's approach [29], the IP instances are not created manually,
but are automatically derived from the given compaction problem, the
target processor model, and an externally specified time constraint.
Furthermore, our approach focuses only on the problem of code
compaction, which extends the size of basic blocks that can be handled
in practice.
Given an instance of TCC, first the mobility range
$$rng(r_i) := [ASAP(r_i), ALAP(r_i)]$$
is determined for each RT $r_i$, with $T_{max}$ being the upper bound
of ALAP times for all RTs. The solution variables of the IP model
encode the RT versions selected for each control step. Dependencies
and incompatibility constraints are represented by linear inequations.
Solution variables are only defined for control step numbers up to
$T_{max}$, so that only constraint satisfaction is required. Any IP
solution represents a parallel schedule with $T_{max}$ control steps,
possibly padded with NOPs. In turn, non-existence of an IP solution
implies non-existence of a schedule meeting the maximum time
constraint. The setting of solution variables also accounts for NOPs,
which have to be scheduled in order to prevent undesired side effects.
We permit arbitrary, multiple-version instruction formats which meet
the following assumptions:
A1: There exists at least one NOP version for all addressable storage
    elements (register files, memories). However, the NOP sets for
    single registers may be empty.
A2: For each storage element not written in a certain control step
    $CS_t$, a NOP version can be scheduled, independently of the RT
    versions assigned to $CS_t$.
These assumptions, which are satisfied for realistic processors,
permit insertion of NOP versions only after compaction by means of a
version shuffling mechanism, such that the solution space for
compaction is not affected.
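The computation of ASAP and ALAP times is not spelled out in the text.
A common way to obtain the mobility ranges, assumed in the following
Python sketch, is one forward and one backward pass over the
dependency graph. The function name mobility_ranges, the requirement
that the RTs are passed in topological order, and the four-RT example
block are assumptions made for illustration.

```python
def mobility_ranges(rts, strict, weak, t_max):
    """Return {rt: (asap, alap)} for a basic block.

    rts    : RT names in topological order of the dependency graph
    strict : set of (ri, rj) data/output dependencies (ri strictly earlier)
    weak   : set of (ri, rj) data-anti-dependencies (ri not later than rj)
    t_max  : upper bound for all ALAP times
    """
    asap = {r: 1 for r in rts}
    for r in rts:  # forward pass: predecessors are already final
        for i, j in strict:
            if j == r:
                asap[r] = max(asap[r], asap[i] + 1)
        for i, j in weak:
            if j == r:
                asap[r] = max(asap[r], asap[i])
    alap = {r: t_max for r in rts}
    for r in reversed(rts):  # backward pass: successors are already final
        for i, j in strict:
            if i == r:
                alap[r] = min(alap[r], alap[j] - 1)
        for i, j in weak:
            if i == r:
                alap[r] = min(alap[r], alap[j])
    return {r: (asap[r], alap[r]) for r in rts}


# Hypothetical block: r1 -> r2 -> r3 is a chain of data dependencies,
# r1 -> r4 is an output dependency, r2 -> r4 a data-anti-dependency.
ranges = mobility_ranges(
    ["r1", "r2", "r3", "r4"],
    strict={("r1", "r2"), ("r2", "r3"), ("r1", "r4")},
    weak={("r2", "r4")},
    t_max=3,
)
print(ranges)
```

With t_max equal to the critical path length, the RTs on the critical
path have a single admissible control step, while r4 may be placed in
step 2 or 3.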
A. Solution variables
The solution variables are subdivided into two classes of indexed
decision (0/1) variables:
V-variables (triple-indexed): For each $r_i$ with encoding version set
$E_i$ the following V-variables (version variables) are used:
$$\{ v_{i,m,t} \mid m \in \{1, \ldots, |E_i|\} \wedge t \in rng(r_i) \}$$
The interpretation of V-variables is
$$v_{i,m,t} = 1 \;:\Leftrightarrow\; \text{RT } r_i \text{ is scheduled in control step number } t \text{ with version } e_{im} \in E_i.$$
N-variables (double-indexed): For the set
$NOP = \{NOP_1, \ldots, NOP_r\}$ of no-operations, the following
N-variables (NOP variables) are used:
$$\{ n_{s,t} \mid s \in \{1, \ldots, r\} \wedge t \in [1, T_{max}] \}$$
The interpretation of N-variables is
$$n_{s,t} = 1 \;:\Leftrightarrow\; \text{control step number } t \text{ contains a NOP for destination register } X_s.$$
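The two index sets can be enumerated directly from the mobility
ranges. The following Python sketch only illustrates the counting; the
function name make_variables and the dictionary-based inputs are
assumptions made here.

```python
def make_variables(rng, n_versions, n_registers, t_max):
    """Enumerate the index tuples of all V- and N-variables.

    rng         : {i: (asap, alap)} mobility range of each RT i
    n_versions  : {i: |E_i|} number of encoding versions of RT i
    n_registers : r, the number of destination registers with NOPs
    t_max       : time constraint
    """
    v_vars = [
        (i, m, t)
        for i in sorted(rng)
        for m in range(1, n_versions[i] + 1)
        for t in range(rng[i][0], rng[i][1] + 1)
    ]
    n_vars = [
        (s, t)
        for s in range(1, n_registers + 1)
        for t in range(1, t_max + 1)
    ]
    return v_vars, n_vars


# RT 1 has two versions and one admissible step,
# RT 2 has one version and two admissible steps.
v_vars, n_vars = make_variables({1: (1, 1), 2: (2, 3)}, {1: 2, 2: 1}, 1, 3)
print(len(v_vars), len(n_vars))
```

The number of V-variables thus grows with both the number of versions
and the width of the mobility ranges, which is why loose time
constraints enlarge the IP.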
B. Constraints
The correctness conditions are encoded into IP constraints as follows:
Each RT is scheduled exactly once: This is ensured if the sum over all
   V-variables for each RT $r_i$ equals 1.
   $$\forall r_i: \quad \sum_{t \in rng(r_i)} \sum_{m=1}^{|E_i|} v_{i,m,t} = 1$$
Data- and output-dependencies are not violated: If $r_i, r_j$ are
   data- or output-dependent, and $r_j$ is scheduled in control step
   $CS_t$, then $r_i$ must be scheduled in an earlier control step,
   i.e., in the interval $[ASAP(r_i), t-1]$. This is captured as
   follows:
   $$r_i \overset{DD}{\hookrightarrow} r_j \vee r_i \overset{OD}{\hookrightarrow} r_j \;\Rightarrow\; \forall t \in rng(r_i) \cap rng(r_j): \quad \sum_{m=1}^{|E_j|} v_{j,m,t} \le \sum_{t' \in [ASAP(r_i),\, t-1]} \sum_{m=1}^{|E_i|} v_{i,m,t'}$$
Data-anti-dependencies are not violated: Data-anti-dependencies are
   treated similarly to the previous case, except that $r_i$ may also
   be scheduled in parallel to $r_j$.
   $$r_i \overset{DAD}{\hookrightarrow} r_j \;\Rightarrow\; \forall t \in rng(r_i) \cap rng(r_j): \quad \sum_{m=1}^{|E_j|} v_{j,m,t} \le \sum_{t' \in [ASAP(r_i),\, t]} \sum_{m=1}^{|E_i|} v_{i,m,t'}$$
Live values are not destroyed by side effects: A value in a single
   register $X_s$ is live in all control steps between its production
   and consumption, so that a NOP must be activated for $X_s$ in
   these control steps. In contrast to the more rigid handling of
   side effects in previous work [5], we tolerate side effects, i.e.,
   NOPs for registers are activated only if two data-dependent RTs
   are not scheduled in consecutive control steps. Conversely, we
   force these RTs to be scheduled consecutively if no NOP for the
   corresponding destination register exists. This is modelled by the
   following constraints:
   $$W_i = X_s \wedge r_i \overset{DD}{\hookrightarrow} r_j \;\Rightarrow\; \forall t \in \underbrace{[ASAP(r_i)+1,\, ALAP(r_j)-1]}_{=:R(i,j)}:$$
   $$\sum_{t' \in R(i,j) \mid t' < t} \sum_{m=1}^{|E_i|} v_{i,m,t'} + \sum_{t' \in R(i,j) \mid t' > t} \sum_{m=1}^{|E_j|} v_{j,m,t'} - 1 \le n_{s,t}$$
   The left hand side of the inequation becomes 1 exactly if $r_i$ is
   scheduled before $t$ and $r_j$ is scheduled after $t$. In this
   case, a NOP version for $X_s$ must be activated in $CS_t$. If no
   NOP is present for register $X_s$, then $n_{s,t}$ is replaced by
   zero. This mechanism is useful for single registers. Tolerating
   side effects for (addressable) storage elements is only possible
   if N-variables are introduced for each element of the file,
   because the different elements must be distinguished. However,
   this would imply an intolerable explosion of the number of IP
   solution variables. Instead, as mentioned earlier, we assume that
   a NOP is present for each addressable storage element.
Compatibility restrictions are not violated: Two RTs $r_i, r_j$ have a
   potential conflict if they have at least one pair of conflicting
   versions, they have non-disjoint mobility ranges, and they are
   neither data-dependent nor output-dependent. The following
   constraints ensure that at most one of two conflicting versions is
   scheduled in each control step $CS_t$.
   $$\forall r_i, r_j,\ (r_i, r_j) \notin DD \cup OD: \quad \forall t \in rng(r_i) \cap rng(r_j): \quad \forall e_{im} \in E_i: \forall e_{jm'} \in E_j:$$
   $$e_{im} \not\sim e_{jm'} \;\Rightarrow\; v_{i,m,t} + v_{j,m',t} \le 1$$
C. Search space reduction
The IP model of a given compaction problem can be easily constructed
from the corresponding RT dependency graph and the set of partial
instructions. If the IP has a solution, then the actual schedule can
be immediately derived from the V-variables which are set to 1. These
settings account for the detailed control step assignment and the
selected encoding version for each RT. Based on this scheduling
information, NOP versions are packed into each control step by means
of version shuffling: If a control step $CS_t$ requires a NOP on
register $X_s$, as indicated by the setting of N-variables, then a NOP
version $nop_{sj} \in NOP_s$ is determined which is compatible with
all RTs assigned to $CS_t$. Existence of this version is guaranteed by
the above assumptions A1 and A2. If $X_s$ is an addressable storage,
then a NOP version is scheduled in each control step in which $X_s$ is
not written. This is done independently of the setting of N-variables.
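To illustrate how the constraints of section B act on a concrete 0/1
assignment, and how the schedule is read off the V-variables, consider
the following Python sketch. It is a checker, not an IP solver. All
names (total, satisfies, schedule_from) and the data layout are
assumptions made here. For the live-value constraint, the sketch counts
$r_i$ anywhere before $t$ and $r_j$ anywhere after $t$ within their
mobility ranges, following the verbal description of the left hand
side; this reading is likewise an assumption.

```python
def total(v, i, n_ver, steps):
    """Sum of v[i, m, t] over all versions m of RT i and the given steps."""
    return sum(v.get((i, m, t), 0)
               for m in range(1, n_ver[i] + 1) for t in steps)


def satisfies(v, n, rng, n_ver, strict, anti, live, conflicts):
    """Check a 0/1 assignment against the five constraint families.

    v: {(i, m, t): 0/1}, n: {(s, t): 0/1}, rng: {i: (asap, alap)},
    strict: DD/OD pairs, anti: DAD pairs, live: (i, j, s) triples with
    r_i writing X_s for r_j, conflicts: pairs ((i, m), (j, m2)) of
    versions that must not share a control step.
    """
    def span(i):
        return range(rng[i][0], rng[i][1] + 1)

    def common(i, j):
        return sorted(set(span(i)) & set(span(j)))

    # Each RT is scheduled exactly once.
    if any(total(v, i, n_ver, span(i)) != 1 for i in rng):
        return False
    # DD/OD: r_j in step t requires r_i in [ASAP(r_i), t-1].
    for i, j in strict:
        for t in common(i, j):
            if total(v, j, n_ver, [t]) > total(v, i, n_ver, range(rng[i][0], t)):
                return False
    # DAD: as above, but r_i may share step t with r_j.
    for i, j in anti:
        for t in common(i, j):
            if total(v, j, n_ver, [t]) > total(v, i, n_ver, range(rng[i][0], t + 1)):
                return False
    # Live values: r_i before t and r_j after t forces a NOP for X_s in t.
    for i, j, s in live:
        for t in range(rng[i][0] + 1, rng[j][1]):
            before = total(v, i, n_ver, [u for u in span(i) if u < t])
            after = total(v, j, n_ver, [u for u in span(j) if u > t])
            if before + after - 1 > n.get((s, t), 0):
                return False
    # Compatibility: conflicting versions never share a control step.
    for (i, m), (j, m2) in conflicts:
        if (i, j) in strict or (j, i) in strict:
            continue
        for t in common(i, j):
            if v.get((i, m, t), 0) + v.get((j, m2, t), 0) > 1:
                return False
    return True


def schedule_from(v):
    """Read the schedule {step: [(rt, version), ...]} off the V-variables."""
    steps = {}
    for (i, m, t), value in v.items():
        if value == 1:
            steps.setdefault(t, []).append((i, m))
    return steps


# Hypothetical instance: RT 1 feeds RT 2 through register X_1,
# RT 3 is independent but conflicts with RT 1.
rng = {1: (1, 2), 2: (2, 3), 3: (1, 3)}
n_ver = {1: 1, 2: 1, 3: 1}
v = {(1, 1, 1): 1, (2, 1, 3): 1, (3, 1, 2): 1}
n = {(1, 2): 1}
ok = satisfies(v, n, rng, n_ver, {(1, 2)}, set(), {(1, 2, 1)},
               {((1, 1), (3, 1))})
print(ok, schedule_from(v))
```

In the example, RT 1 and RT 2 are separated by one control step, so the
assignment is only accepted because the N-variable for $X_1$ in step 2
is set.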
The computation times both for IP generation and NOP version shuffling
are negligible. However, it is important to keep the number of
solution variables as low as possible in order to reduce the runtime
requirements for IP solving. The number of solution variables can be
reduced by discarding redundant variables, which do not contribute to
the solution space. Obviously, N-variables not occurring in live value
constraints are superfluous and can be eliminated. V-variables are
redundant if they do not potentially increase parallelism, which can
be efficiently checked in advance: Selecting encoding version $e_{im}$
of some RT $r_i$ for control step $CS_t$ is useful only if there
exists an RT $r_j$ which could be scheduled in parallel, i.e., $r_j$
meets the following conditions:
1. $t \in rng(r_j)$
2. $(r_i, r_j) \notin DD \cup OD$
3. $r_j$ has a version $e_{jm'}$ compatible with $e_{im}$
If no such $r_j$ exists, then all variables $v_{i,m,t}$ are, for all
$m$, equivalent in terms of parallelism, and it is sufficient to keep
only a single, arbitrary representative. Further advance pruning of
the search space can be achieved by computing tighter mobility ranges
through application of some ad hoc rules. For instance, two RTs
$r_i, r_j$ with $r_i \overset{DAD}{\hookrightarrow} r_j$ cannot be
scheduled in parallel if all encoding versions for $r_i$ and $r_j$ are
pairwise conflicting. The efficacy of such ad hoc rules is, however,
strongly processor-dependent.
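The redundancy check for V-variables can be sketched as follows. The
Python code is one possible reading of the three conditions above, not
a definitive implementation: the name prune_v_variables, the callback
compatible, the choice of version 1 as the representative, and testing
the dependency pair in both orders are assumptions made here.

```python
def prune_v_variables(rng, n_ver, dependent, compatible):
    """Return the (i, m, t) index triples kept after pruning.

    rng        : {i: (asap, alap)} mobility ranges
    n_ver      : {i: |E_i|} number of versions per RT
    dependent  : set of (i, j) pairs in DD or OD
    compatible : function (i, m, j, m2) -> bool for versions e_im, e_jm2
    """
    kept = []
    for i, (lo, hi) in rng.items():
        for t in range(lo, hi + 1):
            def has_partner(m):
                # Conditions 1 to 3: some other RT could share step t.
                return any(
                    j != i
                    and rng[j][0] <= t <= rng[j][1]
                    and (i, j) not in dependent
                    and (j, i) not in dependent
                    and any(compatible(i, m, j, m2)
                            for m2 in range(1, n_ver[j] + 1))
                    for j in rng
                )

            versions = range(1, n_ver[i] + 1)
            if not any(has_partner(m) for m in versions):
                versions = [1]  # all versions equivalent: keep one
            kept.extend((i, m, t) for m in versions)
    return kept


# RT 1 has two versions; RT 2 can only be placed in step 2.
kept = prune_v_variables(
    {1: (1, 2), 2: (2, 2)}, {1: 2, 2: 1}, set(),
    compatible=lambda i, m, j, m2: True,
)
print(kept)
```

In step 1, RT 1 has no potential partner, so only one of its two
version variables survives; in step 2 both versions are kept.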
VI. Examples and results
A. TMS320C25
As a first example, we consider code generation for the TMS320C25 DSP,
while also focusing on the interaction of code compaction and
preceding code generation phases. The TMS320C25 shows a very
restrictive type of instruction-level parallelism, making compaction a
non-trivial task even for small programs. We demonstrate exploitation
of potential parallelism for the complex multiply program taken from
the DSPStone benchmark suite [2], which computes the product of two
complex numbers and consists of two lines of code:
cr = ar * br - ai * bi ;
ci = ar * bi + ai * br ;
The vertical code generated by code selection, register allocation,
and scheduling is shown in figure 2. The real and imaginary parts are
computed sequentially, employing registers TR, PR, and ACCU, and the
results are stored in memory. The next step is insertion of RTs for
memory addressing. Record makes use of indirect addressing
capabilities of DSPs, based on a generic model of address generation
units (AGUs). Based on the variable access sequence in the basic
block, a permutation of variables to memory cells is computed, which
maximizes AGU utilization in the form of auto-increment/decrement
operations on address registers [20]. For complex multiply, the
computed address assignment is
MEM[0] ↔ ci; MEM[1] ↔ br; MEM[2] ↔ ai;
MEM[3] ↔ bi; MEM[4] ↔ cr; MEM[5] ↔ ar

(1)  TR = MEM[ar]          // TR = ar
(2)  PR = TR * MEM[br]     // PR = ar * br
(3)  ACCU = PR             // ACCU = ar * br
(4)  TR = MEM[ai]          // TR = ai
(5)  PR = TR * MEM[bi]     // PR = ai * bi
(6)  ACCU = ACCU - PR      // ACCU = ar * br - ai * bi
(7)  MEM[cr] = ACCU        // cr = ar * br - ai * bi
(8)  TR = MEM[ar]          // TR = ar
(9)  PR = TR * MEM[bi]     // PR = ar * bi
(10) ACCU = PR             // ACCU = ar * bi
(11) TR = MEM[ai]          // TR = ai
(12) PR = TR * MEM[br]     // PR = ai * br
(13) ACCU = ACCU + PR      // ACCU = ar * bi + ai * br
(14) MEM[ci] = ACCU        // ci = ar * bi + ai * br
Fig. 2. Vertical code for complex multiply program

After insertion of AGU operations, the vertical code consists of 25
RTs, as shown in figure 3. The TMS320C25 has eight address registers,
which in turn are addressed by address register pointer ARP. In this
case, only address register AR[0] is used. The optimized address
assignment ensures that most address register updates are realized by
auto-increment/decrement operations on AR[0].
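The effect of this address assignment can be retraced with a few lines
of Python. The sketch below takes the memory access sequence of the
vertical code in figure 2 and the address assignment given above, and
computes the distance between consecutive accesses; the names address,
accesses, and address_updates are assumptions made for illustration.

```python
# Address assignment as given in the text.
address = {"ci": 0, "br": 1, "ai": 2, "bi": 3, "cr": 4, "ar": 5}

# Memory operands of RTs (1), (2), (4), (5), (7), (8), (9), (11), (12), (14)
# of the vertical code, in program order.
accesses = ["ar", "br", "ai", "bi", "cr", "ar", "bi", "ai", "br", "ci"]


def address_updates(accesses, address):
    """Return the address register change needed before each next access."""
    cells = [address[name] for name in accesses]
    return [b - a for a, b in zip(cells, cells[1:])]


deltas = address_updates(accesses, address)
unit_steps = sum(1 for d in deltas if abs(d) == 1)
print(deltas, unit_steps)
```

Seven of the nine address register updates have distance 1 and can be
realized by auto-increment/decrement; the remaining two (distances -4
and -2) require explicit address register modifications.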
(1)  ARP = 0                     // init AR pointer
(2)  AR[0] = 5                   // point to 5 (ar)
(3)  TR = MEM[AR[ARP]]
(4)  AR[ARP] -= 4                // point to 1 (br)
(5)  PR = TR * MEM[AR[ARP]]
(6)  ACCU = PR
(7)  AR[ARP] ++                  // point to 2 (ai)
(8)  TR = MEM[AR[ARP]]
(9)  AR[ARP] ++                  // point to 3 (bi)
(10) PR = TR * MEM[AR[ARP]]
(11) ACCU = ACCU - PR
(12) AR[ARP] ++                  // point to 4 (cr)
(13) MEM[AR[ARP]] = ACCU
(14) AR[ARP] ++                  // point to 5 (ar)
(15) TR = MEM[AR[ARP]]
(16) AR[ARP] -= 2                // point to 3 (bi)
(17) PR = TR * MEM[AR[ARP]]
(18) ACCU = PR
(19) AR[ARP] --                  // point to 2 (ai)
(20) TR = MEM[AR[ARP]]
(21) AR[ARP] --                  // point to 1 (br)
(22) PR = TR * MEM[AR[ARP]]
(23) ACCU = ACCU + PR
(24) AR[ARP] --                  // point to 0 (ci)
(25) MEM[AR[ARP]] = ACCU
Fig. 3. Vertical code for complex multiply program after address
assignment

The critical path length $L_c$ imposed by inter-RT dependencies is 15.
Table I shows experimental data (CPU seconds¹, number of V- and
N-variables) for IP-based compaction of the complex multiply code for
$T_{max}$ values in [15, 21]. For the theoretical lower bound
$T_{max} = 15$, no solution exists, while for $T_{max} = 16$ (the
actual lower bound) a schedule is constructed in less than 1 CPU
second. Beyond $T_{max} = 18$, the CPU time rises to minutes, due to
the large search space that has to be investigated by the IP solver.
This unfavorable effect is inherent to any IP-based formulation of a
time-constrained scheduling problem: The computation time may grow
dramatically with the number of control steps, even though the
scheduling problem intuitively gets easier. Therefore, it is favorable
to choose relatively tight time constraints, i.e. close to the actual
lower bound. For tight time constraints, IP-based compaction produces
extremely compact code within acceptable amounts of computation time:
Figure 4 shows the parallel schedule constructed for $T_{max} = 16$.
Neither compilation by the TMS320C25-specific C compiler nor manual
assembly programming yielded higher code quality in the DSPStone
project [2].

¹ Integer Programs have been solved with IBM's Optimization Subroutine
Library (OSL) V1.2 on an IBM RISC System 6000.

TABLE I
Experimental results for IP-based compaction of
complex multiply TMS320C25 code
T_max   CPU (s)   solution   # V-vars   # N-vars
15         8.39   no               71         23
16         0.75   yes             141         44
17         1.26   yes             211         56
18        22      yes             281         64
19       119      yes             351         72
20       164      yes             421         79
21       417      yes             491         84
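The parallel schedule of figure 4 can be cross-checked against the
numbers quoted above. The following Python sketch lists, for each
control step of figure 4, the RT numbers of figure 3 that it contains,
and verifies that all 25 RTs are covered exactly once in 16 control
steps; the variable names are assumptions made for illustration.

```python
# One tuple per control step of figure 4; entries are RT numbers of figure 3.
schedule = [
    (1,), (2,), (3,), (4,), (5, 7), (6, 8, 9), (10, 12), (11,),
    (13, 14), (15,), (16,), (17, 19), (18, 20, 21), (22, 24), (23,), (25,),
]

covered = sorted(rt for step in schedule for rt in step)
every_rt_once = covered == list(range(1, 26))
steps_used = len(schedule)
saved = 1 - steps_used / len(covered)
print(every_rt_once, steps_used, round(100 * saved))
```

The schedule uses 16 instructions for 25 RTs, i.e., compaction removes
36% of the instructions of the vertical code.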
(1)        ARP = 0                       // LARP 0
(2)        AR[0] = 5                     // LARK AR0,5
(3)        TR = MEM[AR[ARP]]             // LT *
(4)        AR[ARP] -= 4                  // SBRK 4
(5,7)      PR = TR * MEM[AR[ARP]]        // MPY *+
           || AR[ARP] ++
(6,8,9)    ACCU = PR                     // LTP *+
           || TR = MEM[AR[ARP]]
           || AR[ARP] ++
(10,12)    PR = TR * MEM[AR[ARP]]        // MPY *+
           || AR[ARP] ++
(11)       ACCU = ACCU - PR              // SPAC
(13,14)    MEM[AR[ARP]] = ACCU           // SACL *+
           || AR[ARP] ++
(15)       TR = MEM[AR[ARP]]             // LT *
(16)       AR[ARP] -= 2                  // SBRK 2
(17,19)    PR = TR * MEM[AR[ARP]]        // MPY *-
           || AR[ARP] --
(18,20,21) ACCU = PR                     // LTP *-
           || TR = MEM[AR[ARP]]
           || AR[ARP] --
(22,24)    PR = TR * MEM[AR[ARP]]        // MPY *-
           || AR[ARP] --
(23)       ACCU = ACCU + PR              // APAC
(25)       MEM[AR[ARP]] = ACCU           // SACL *
Fig. 4. Parallel TMS320C25 code for complex multiply

B. Motorola DSP56k
IP-based code compaction is not specific to the code generation
techniques used in Record, but can essentially be applied to any piece
of vertical machine code for processors satisfying our instruction-set
model. As a second example, we consider compaction of Motorola 56000
code generated by the GNU C compiler gcc [33]. Compared to the
TMS320C2x, the M56000 has a more regular instruction set, and
parallelization of arithmetic operations and data moves is hardly
restricted by encoding conflicts. As a consequence, the number of IP
solution variables and the required computation time grow less
strongly with the number of control steps. In table II, this is
exemplified for an RTL basic block of length 23, extracted from an
MPEG audio decoder program. The critical path length in this case is
14, with an actual lower bound of 19 control steps. Experimental
results for three further blocks are given in table III, which
indicate that exact compaction may save a significant percentage of
instructions compared to purely vertical compiler-generated code
(between 12.5% and about 32% for the blocks in table III). Note that
in case of the M56000 a higher exploitation of parallelism can be
achieved by late assignment of variables to different memory banks
during compaction [28], which is not yet included in our approach.
TABLE II
Compaction runtimes for gcc-generated M56000 machine code
in relation to T_max
T_max   CPU (s)   solution   # V-vars   # N-vars
14         0.28   no               84         38
15         0.36   no              107         56
16         0.50   no              130         68
17         1.40   no              153         74
18         3.75   no              176         80
19        20      yes             199         86
20        22      yes             222         92
21        33      yes             245         98
22        67      yes             268        104
23       107      yes             291        110

TABLE III
Experimental results for compaction of M56000 machine code
BB   # RTs   L_c   opt.   # V-vars   # N-vars   CPU (s)
1        8     5      7         32          9      0.33
2       23    14     17        131         41      6.63
3       37    20     25        412        110    130

C. Horizontal instruction formats
The permissible basic block length for IP-based compaction is
inversely related to the degree of instruction encoding: For weakly
encoded formats, compaction constraints are mainly induced by inter-RT
dependencies, so that the critical path length is close to the actual
lower bound for the schedule length. In this case, runtime
requirements are low even for larger blocks, if tight time constraints
are chosen. On the other hand, weakly encoded instruction formats
represent the worst case in presence of loose time constraints. This
is due to the fact that encoding constraints permit early pruning of
the search space explored by the IP solver whenever RTs have large
mobility ranges. In table IV, this is demonstrated for an audio signal
processing ASIP with a purely horizontal instruction format.
Experimental results are given for a sum-of-products computation
consisting of 45 RTs, including AGU operations, with a critical path
length of 14. The ASIP executes up to 4 RTs per machine cycle, so that
the actual lower bound meets the theoretical limit. For
$T_{max} \in [14, 17]$, schedules are computed in at most 2.15 CPU
seconds. Beyond $T_{max} = 17$, runtimes are much higher and also less
predictable than in the previous experiments.
TABLE IV
Compaction runtimes for an ASIP with horizontal
instruction format in relation to T_max
T_max   CPU (s)   # V-variables   # N-variables
14         0.14              46               1
15         0.32              89              44
16         0.53             132              48
17         2.15             175              52
18       120                218              56
19        11                261              60
20        75                304              64
21        26                347              68
22        85                390              72
23        76                433              76
24        20                476              80

In summary, our results indicate that complex standard DSPs, such as
TMS320C2x and M56000, represent the upper bound of processor
complexity for which IP-based compaction is reasonable. For these
processors, blocks of small to medium size can be compacted within
amounts of computation time which may often be acceptable in the area
of code generation for embedded DSPs. For ASIPs, which (because of
lower combinational delay and design effort) tend to have a weakly
encoded instruction format, larger blocks can also be compacted,
however with higher sensitivity to the specified time constraints. In
the context of retargetable code generation, these limitations are
compensated by the high flexibility of our approach: Due to a general
definition of the compaction problem, our IP formulation immediately
applies to complex DSP instruction sets, for which exact compaction
techniques have not been reported so far. As indicated by previous
work [30], significant runtime reductions can be expected for more
restricted classes of instruction formats.
VII. Conclusions
Future system-level CAD environments will need to incorporate code
generation tools for embedded processors, including DSPs, in order to
support hardware/software codesign of VLSI systems. While satisfactory
compilers are available for general-purpose processors, this is not
yet the case for DSPs. Partially, this is due to missing dedicated DSP
programming languages, which causes a mismatch between high-level
language programs and DSP architectures. In contrast to
general-purpose computing, compilation speed is no longer a primary
goal in DSP code generation. Therefore, the largest boost in DSP
compiler technology can be expected from new code optimization
techniques which, at the expense of high compilation times, explore
comparatively vast search spaces during code generation.
Retargetability will also become an increasingly important issue,
because the diversity of application-specific processors creates a
strong demand for flexible code generation techniques which can be
quickly adapted to new target architectures.
In this contribution, we have motivated and described a novel approach
to thorough exploitation of potential parallelism in DSP programs. The
proposed IP formulation of local code compaction as a time-constrained
problem is based on a problem definition tailored to DSPs, which
removes several limitations of previous work. Our approach applies to
very high quality code generation for standard DSPs and ASIPs, and is
capable of optimally compacting basic blocks of relevant size. Due to
a general problem definition, peculiarities such as alternative
encoding versions and side effects are captured, which provides
retargetability within a large class of instruction formats. Since
existing solutions are guaranteed to be found, we believe that exact
code compaction is a feasible alternative to heuristic techniques in
presence of very high code quality requirements.
Further research is necessary to extend time-constrained code
generation towards global constraints, for which local techniques may
serve as subroutines. In turn, this demands closer coupling of code
compaction and the preceding code selection, register allocation, and
scheduling phases. The mutual dependence between retargetability, code
quality, and compilation speed should also be studied in more detail
in order to identify feasible compromises.
Acknowledgments
The authors would like to thank Birger Landwehr for
helpful comments on Integer Programming issues and
Steven Bashford for careful reading of the manuscript. This
work has been partially supported by the European Union
through ESPRIT project 9138 (CHIPS).
References
[1] P. Paulin, M. Cornero, C. Liem, et al., Trends in Embedded
Systems Technology, in: M.G. Sami, G. De Micheli (eds.): Hard-
ware/Software Codesign, Kluwer Academic Publishers, 1996.
[2] V. Zivojnovic, J.M. Velarde, C. Schlager, DSPStone - A DSP-
oriented Benchmarking Methodology, Technical Report, Dept. of
Electrical Engineering, Institute for Integrated Systems for Signal
Processing, University of Aachen, Germany, 1994.
[3] P. Marwedel, G. Goossens (eds.), Code Generation for Embedded
Processors, Kluwer Academic Publishers, 1995.
[4] L. Nowak, P. Marwedel, Verification of Hardware Descriptions
by Retargetable Code Generation, 26th Design Automation Con-
ference (DAC), 1989, pp. 441-447.
[5] P. Marwedel, Tree-based Mapping of Algorithms to Predefined
Structures, Int. Conf. on Computer-Aided Design (ICCAD), 1993,
pp. 586-593.
[6] Mentor Graphics Corporation, DSP Architect DFL User's and
Reference Manual, V 8.2 6, 1993.
[7] A.V. Aho, R. Sethi, J.D. Ullman, Compilers - Principles, Tech-
niques, and Tools, Addison-Wesley, 1986.
[8] S. Bashford, U. Bieker, et al., The MIMOLA Language V 4.1,
Technical Report, University of Dortmund, Dept. of Computer
Science, September 1994.
[9] R. Leupers, P. Marwedel, Instruction Set Extraction from Pro-
grammable Structures, European Design Automation Conference
(EURO-DAC), 1994, pp. 156-161.
[10] R. Leupers, P. Marwedel, A BDD-based Frontend for Retar-
getable Compilers, European Design & Test Conference (ED &
TC), 1995, pp. 239-243.
[11] D. Lanneer, J. Van Praet, et al., CHESS: Retargetable Code
Generation for Embedded DSP Processors, chapter 5 in [3].
[12] P. Paulin, C. Liem, et al., FlexWare: A Flexible Firmware
Development Environment for Embedded Systems, chapter 4 in
[3].
[13] R.E. Bryant, Symbolic Manipulation of Boolean Functions Us-
ing a Graphical Representation, 22nd Design Automation Con-
ference (DAC), 1985, pp. 688-694.
[14] C.W. Fraser, D.R. Hanson, T.A. Proebsting, Engineering a
Simple, Efficient Code Generator Generator, ACM Letters on
Programming Languages and Systems, vol. 1, no. 3, 1992, pp.
213-226.
[15] B. Wess, Automatic Instruction Code Generation based on Trel-
lis Diagrams, IEEE Int. Symp. on Circuits and Systems (ISCAS),
1992, pp. 645-648.
[16] G. Araujo, S. Malik, Optimal Code Generation for Embed-
ded Memory Non-Homogeneous Register Architectures, 8th Int.
Symp. on System Synthesis (ISSS), 1995, pp. 36-41.
[17] H. Emmelmann, Code Selection by Regular Controlled Term
Rewriting, in: R. Giegerich, S. Graham, Code Generation: Con-
cepts, Tools, Techniques, Springer, 1992, pp. 3-29.
[18] E. Pelegri-Llopart, S. Graham, Optimal Code Generation for
Expression Trees, 15th Ann. ACM Symp. on Principles of Pro-
gramming Languages, 1988, pp. 294-308.
[19] S. Liao, S. Devadas, K. Keutzer, et al., Storage Assignment to
Decrease Code Size, ACM SIGPLAN Conference on Program-
ming Language Design and Implementation (PLDI), 1995.
[20] R. Leupers, P. Marwedel, Algorithms for Address Assignment
in DSP Code Generation, Int. Conf. on Computer-Aided Design
(ICCAD), 1996.
[21] J.A. Fisher, Trace Scheduling: A Technique for Global Mi-
crocode Compaction, IEEE Trans. on Computers, vol. 30, no.
7, 1981, pp. 478-490.
[22] A. Aiken, A. Nicolau, A Development Environment for Hori-
zontal Microcode, IEEE Trans. on Software Engineering, no. 14,
1988, pp. 584-594.
[23] S. Novack, A. Nicolau, N. Dutt, A Unified Code Generation
Approach using Mutation Scheduling, chapter 12 in [3].
[24] M.R. Garey, D.S. Johnson, Computers and Intractability - A
Guide to the Theory of NP-Completeness, Freeman, 1979.
[25] S. Mallett, D. Landskov, B.D. Shriver, P.W. Mallett, Some
Experiments in Local Microcode Compaction for Horizontal Ma-
chines, IEEE Trans. on Computers, vol. 30, no. 7, 1981, pp.
460-477.
[26] Texas Instruments, TMS320C2x User's Guide, rev. B, Texas
Instruments, 1990.
[27] Motorola Inc., DSP 56156 Digital Signal Processor User's Man-
ual, Motorola, 1992.
[28] A. Sudarsanam, S. Malik, Memory Bank and Register Allocation
in Software Synthesis for ASIPs, Int. Conf. on Computer-Aided
Design (ICCAD), 1995, pp. 388-392.
[29] T. Wilson, G. Grewal, B. Halley, D. Banerji, An Integrated
Approach to Retargetable Code Generation, 7th Int. Symp. on
High-Level Synthesis (HLSS), 1994, pp. 70-75.
[30] A. Timmer, M. Strik, J. van Meerbergen, J. Jess, Conflict
Modelling and Instruction Scheduling in Code Generation for In-
House DSP Cores, 32nd Design Automation Conference (DAC),
1995, pp. 593-598.
[31] C. Gebotys, M. Elmasry, Optimal VLSI Architectural Synthesis,
Kluwer Academic Publishers, 1992.
[32] B. Landwehr, P. Marwedel, R. Domer, OSCAR: Optimum Si-
multaneous Scheduling, Allocation, and Resource Binding based
on Integer Programming, European Design Automation Confer-
ence (EURO-DAC), 1994.
[33] R.M. Stallman, Using and Porting GNU CC V2.4, Free Soft-
ware Foundation, Cambridge/Massachusetts, 1993.
Rainer Leupers, born in 1967, holds a Diploma degree (with
distinction) in Computer Science from the University of Dortmund,
Germany. He received a scholarship from Siemens AG and the Hans Uhde
award for an outstanding Diploma thesis. Since 1993, he has been with
Prof. Peter Marwedel's VLSI CAD group at the Computer Science
Department of the University of Dortmund, where he is currently
working towards a Ph.D. degree. His research focuses on system-level
design automation for embedded systems, in particular modelling of
embedded programmable components and code generation for digital
signal processors.
Peter Marwedel (M'79) received his Ph.D. in Physics from the
University of Kiel (Germany) in 1974. He worked at the Computer
Science Department of that University from 1974 until 1989. In 1987,
he received the Dr. habil. degree (a degree required for becoming a
professor) for his work on high-level synthesis and retargetable code
generation based on the hardware description language MIMOLA. Since
1989 he has been a professor at the Computer Science Department of the
University of Dortmund (Germany). He served as the Dean of that
Department between 1992 and 1995. His current research areas include
hardware/software codesign, high-level test generation, high-level
synthesis and code generation for embedded processors. Dr. Marwedel is
a member of the IEEE Computer Society, the ACM, and the Gesellschaft
für Informatik (GI).