Efﬁcient Code Generation for In-House DSP-Cores
Marino Strik, Jef van Meerbergen
Adwin Timmer*, Jochen Jess*, Stefan Note**
Philips Research Laboratories, WAY 4.47, Prof. Holstlaan 4, 5656 AA, Eindhoven, The Netherlands
 * Eindhoven University of Technology, ** Philips ITCL, Leuven, Belgium
Abstract
A balance between efﬁciency and ﬂexibility is obtained by
developing a relatively large number of in-house DSP-cores
each for a relatively small application area. These cores
are programmed using existing ASIC synthesis tools which
are modiﬁed for this purpose. The key problem is to model
conﬂicts arising from the instruction set. A class of
instruction sets is deﬁned for which conﬂicts can be
modelled statically before scheduling. The approach is
illustrated with a real life example.
1. Introduction
Depending on the desired flexibility and on the
importance of area (cost) and power dissipation, different
options exist for the implementation of signal processing
algorithms. At one end of the design space, general
purpose processors offer flexibility: many applications can
be programmed on the same processor, but often at a high
cost (area and dissipation). At the other end of the design
space, ASICs offer cost-effective solutions because they
are tailored towards a specific application [9].
In an attempt to combine the advantages of both
alternatives, designers have recently started to look for
solutions in between. This can be done in two ways. First,
general purpose and ASIC components can be combined in one
design where the ASIC is used as a co-processor. This
approach is very popular with IC vendors of general
purpose DSPs, because it lets them increase the efficiency
of the total solution for their customers. They make the
processor available as a (fixed) core which can be used as
a qualified and verified building block on a chip. This
approach is attractive when the programmable parts can
be grouped together such that the communication with the
other parts is limited.
If this is not the case, another solution can be found by
approaching the problem from the other side, i.e. from the
side of the systems industry and the ASIC vendor. In this
case the problem is to design an application domain
specific processor, i.e. an in-house core which is tuned
towards a particular application domain. There is a
relation between the efficiency and the size of these
domains: the higher the required efficiency, the smaller
the application domain that is chosen. In this paper the
application domains are rather small. Typical examples are
digital audio, DECT, GSM etc. For each domain an
in-house core is designed in two phases:
1. Deﬁnition and implementation of the core (datapath,
controller and the instruction set).
2. Code generation.
During phase 1 a representative set of applications
within the target application domain is implemented using
existing ASIC synthesis tools for the design space
exploration. Based on this quantitative feedback a core
architecture including the instruction set is deﬁned. This
core architecture is implemented in VLSI using existing
libraries and methods. In phase 2 any application within
the same domain can be programmed on the core. This is a
feasibility problem since the core, the application and the
timing constraints are given. The goal of this paper is to
show that existing high level synthesis tools can be
adapted for the code generation purpose.
2. Related work
In recent literature a number of papers on the design of
ASIPs can be found [4][10]. The processor is speciﬁed in
terms of the instruction set using for example special
languages like nML [2][3] or via a structural description
of the processor [8]. Code generation is done in 2 major
phases. First instruction set matching and selection is done
[6]. Next the covered graph is scheduled and variables are
assigned to registers including data routing [5]. The target
architectures include commercially available general
purpose DSP processors which are designed to span a
large application domain.
In general purpose architectures, special architectural
constructs are often used which greatly complicate code
generation, which often comes as an afterthought. In the case
of in-house cores we can control (to some extent) the
architecture and the instruction set. Therefore we define a
target architectural style such that retargetable code
generation becomes possible. This means that we define a
set of rules for the datapath, the controller and the
instruction set. On the one hand the rules are a limitation,
but on the other hand a large range of architectures is still
accepted. This paper concentrates on the rules for the
instruction set.
Existing compilers generate code that is not sufficiently
efficient. The quality of the generated
code is measured by comparing it with a hand-coded
implementation. For our application domains the cycle
budget is specified by the user (see the example in section 7)
and is often taken from existing manual implementations.
This means that efficiency is very important. To obtain this
efficiency, user interaction with the specification and with
the synthesis tools is more important than automation.
This paper is organized as follows. First the starting
point is explained. Some characteristics of the high level
synthesis tools for ASICs are discussed since they are the
basis for the rest of the paper. Then an overview of the new
approach is presented followed by the class of
architectures for which code generation is possible. Next
the modelling of conﬂicts originating from the instruction
set is discussed. Finally an example will show the
possibilities.
3. High level synthesis for ASICs
Since parts of the existing high level synthesis tools for
ASICs are reused, a short introduction explains the most
important concepts. In systems like Piramid and
Cathedral2 [7][12] the overall system (figure 1a) consists
of two major steps: RT generation, and scheduling and
controller generation.
Figure 1 Abstract compiler overview.
Step 1 translates the input source into register transfers
(RTs). The scheduler (step 2) performs the ordering of the
RTs and combines RTs into VLIW instructions.
RTs correspond to paths in the architecture (ﬁgure 2).
The characteristic property of RTs is that they start with
one or more operands originating from register ﬁles as
input for an operation executed on an operation unit (OPU)
which is possibly pipelined. The result is transferred
through a buffer onto a bus and optionally through a
multiplexer into a destination register.
[Figure 1 panels: (a) ASIC synthesis: application source,
RT generation, scheduler, controller generation. (b) New code
generation: application source, RT generation (intermediate
architecture), RT modification with inputs "Arch. modif.
(merging)" and "Instr. set", new scheduler, instr. encoding.]
Figure 2 Single RT visualised.
Each RT speciﬁes which resources on the path must be
activated and how the resources are occupied. All
resources used by a RT obtain a usage speciﬁcation. The
resources are found on the left-hand side of the ‘=’ sign
and the usage is positioned on the right-hand side.
Different RTs with common resources can be executed in
parallel when the common resources have the same usage.
The example shows an ‘add’ on an OPU called ‘acu_1’
using two operands and writing the result into a register of
the OPU ‘ram_1’ via the ﬁrst of two available multiplexer
inputs.
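The parallelism rule above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an RT is represented as a mapping from resource name to usage. The helper name `compatible` and the RTs `rt_inca` and `rt_mult` are hypothetical and are not part of the original tools.

```python
# Hypothetical representation: an RT is a dict {resource: usage}.
def compatible(rt_a, rt_b):
    """Two RTs may be combined if every common resource has the same usage."""
    common = rt_a.keys() & rt_b.keys()
    return all(rt_a[r] == rt_b[r] for r in common)

# The RT of figure 2, written as resource usages.
rt_add = {
    "acu_1": "add",
    "buf_1_acu_1": "write",
    "bus_1_acu_1": "add(Opr_1, Opr_2)",
    "mux_2_ram_1": "pass[0,1]",
}
# Invented RTs, for illustration only.
rt_inca = {"acu_1": "inca", "buf_1_acu_1": "write"}    # same OPU, other usage
rt_mult = {"mult_1": "mult", "buf_1_mult_1": "write"}  # disjoint resources

print(compatible(rt_add, rt_inca))  # False: acu_1 is used differently
print(compatible(rt_add, rt_mult))  # True: no common resources
```

Representing the usage as an opaque value keeps the check independent of the kind of resource (OPU, buffer, bus or multiplexer).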
Experience in using high level synthesis for actual
designs [1] has shown that the efficiency is strongly
influenced by the way the specification is written.
Therefore design iterations by rewriting the specification
are included in figure 1. Three aspects are important: first,
the feedback of the compiler must guide the designer in
rewriting the specification; secondly, the result of the
specification modifications must be predictable; and
thirdly, the design time must not increase
significantly. Experience has shown that this is possible.
4. Compiler used for code generation for in-house cores
The new compiler overview is shown in ﬁgure 1b and
consists of three steps: RT generation, RT modiﬁcation,
scheduling & instruction encoding.
For step 1 the existing RT generation tool is reused. The
generated RTs can be executed on an intermediate datapath
which is equivalent to the Piramid/Cathedral2 architecture
[12]. The ﬁnal datapath of the core can differ because
register ﬁles and busses can be merged later.
In step 2 the core specification is taken into account.
This means two things: first, the register files and busses
can be merged, and secondly, the instruction set is taken
into account. Both aspects are realized by modification of
the RTs.
The modified RTs are input for the scheduler (step 3),
which performs the ordering of the RTs and
combines RTs into instructions. The modifications ensure
that the scheduler only creates microcode instructions by
combining RTs that are physically possible and allowed by
the instruction set. If this does not result in a feasible
solution, an iteration cycle is required in which the source
must be improved.
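To make the role of the scheduler concrete, the following Python sketch packs RTs into instructions under a cycle budget. It is a simple greedy packer written for illustration and is not the scheduler used in the paper. It assumes single-cycle RTs, and the names `greedy_schedule`, `uses` and `deps` are hypothetical.

```python
def compatible(rt_a, rt_b):
    """RTs can share an instruction if all common resources have equal usage."""
    common = rt_a["uses"].keys() & rt_b["uses"].keys()
    return all(rt_a["uses"][r] == rt_b["uses"][r] for r in common)

def greedy_schedule(rts, budget):
    """Pack RTs cycle by cycle; return None if the cycle budget is exceeded."""
    done = {}       # RT name -> cycle in which it was issued
    schedule = []   # one list of RT names per instruction
    while len(done) < len(rts):
        if len(schedule) == budget:
            return None  # infeasible: the source has to be improved
        cycle, instr = len(schedule), []
        for name, rt in rts.items():
            # An RT is ready once all its predecessors ran in an earlier cycle.
            ready = name not in done and all(
                done.get(d, cycle) < cycle for d in rt["deps"])
            if ready and all(compatible(rt, rts[o]) for o in instr):
                instr.append(name)
                done[name] = cycle
        schedule.append(instr)
    return schedule

# Invented RTs: rt2 and rt3 use acu_1 differently, rt4 depends on rt1.
rts = {
    "rt1": {"uses": {"mult_1": "mult"}, "deps": []},
    "rt2": {"uses": {"acu_1": "add"}, "deps": []},
    "rt3": {"uses": {"acu_1": "inca"}, "deps": []},
    "rt4": {"uses": {"ram_1": "write"}, "deps": ["rt1"]},
}
print(greedy_schedule(rts, 4))  # [['rt1', 'rt2'], ['rt3', 'rt4']]
print(greedy_schedule(rts, 1))  # None: does not fit in one cycle
```

A `None` result corresponds to the iteration cycle mentioned above, in which the designer rewrites the source.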
[Figure 2 contents: blocks ACU_1 and RAM_1, with the RT:]
Dest_1:reg_2_ram_1 <- Opr_1:reg_1_acu_1, Opr_2:reg_2_acu_1
\ acu_1       = add,
  buf_1_acu_1 = write,
  bus_1_acu_1 = add(Opr_1, Opr_2),
  mux_2_ram_1 = pass[0,1];
5. Target architecture model
This section describes the class of architectures for
which code generation is possible. First the datapath
architecture is presented in ﬁgure 3. Then the controller is
illustrated in figure 4. Section 6 deals with the possible
instruction sets.
The datapath consists of a number of operation units
with a bus network for interconnection. OPUs can be any
processing unit such as ALU, MULT, RAM, ROM and
ASUs. ASUs are application speciﬁc units speciﬁcally
tuned towards the application area. All operands are
fetched from register ﬁles and after processing in an OPU
the result is stored via an optional multiplexer in the
destination register ﬁle. OPUs may also produce ﬂags
which can be used for conditional branching in the
controller.
The architecture modifications mentioned in figure 1b
specify the merging of resources such as busses and
register files. The merged resources are then shared, at the
cost of reduced parallelism.
Figure 3 Generic target datapath architecture.
Together with an instantiation of the presented datapath
architecture model, a controller is required to complete the
hardware of the processor core. The controller model
shown in ﬁgure 4 incorporates features to implement the
time-loop synchronization and for-loops efﬁciently. The
time-loop is the repetitive part of the (DSP) application.
Figure 4 Parameterisable controller model.
[Figure 3 labels: OPU, RF, Flags. Figure 4 labels:
Start_Signal, PC, Instruction ROM, IR, stack, Branch
address, program constant, Flag[ ], flags, Instr., datapath.]
The controller is pipelined via a program counter and an
instruction register. A stack is available for saving return
addresses for the time-loop and for possibly nested
for-loops. The program and instruction bus width, the stack
depth and the number of datapath flags are parameters of
the controller.
For real-time DSP applications the large design
freedom enables the creation of a highly suitable processor
core.
6. Instruction set conﬂict modelling
As indicated in section 4 the RT model plays a central
role in the compiler. RTs contain all necessary information
to decide if two RTs can be executed in parallel or not, i.e.
if parallel execution results in a conflict or not. However,
conflicts can be generated by the instruction set too. It is
possible that RTs without a resource conflict in the
datapath cannot be executed simultaneously because this
is not allowed by the instruction set, e.g. because a vertical
microcode is preferred. In this section we extend the previous
RT model such that the parallelism restrictions imposed by
the instruction set can also be represented.
First RT classes will be introduced in section 6.1. RT
classes are required to specify instruction sets. The way
instruction sets are speciﬁed is deﬁned in section 6.2. Next
the extra conﬂicts for the RTs can be generated
automatically to impose the instruction set (section 6.3).
6.1 RT classes
RT classes need to be introduced to be able to specify
instruction sets with the special property that all
parallelism restrictions imposed by the instruction set can
be modeled before scheduling. Every RT generated in step
1 of the compiler belongs to exactly one RT class. To
which RT class a RT belongs is determined by the
combination of the OPU resource it uses and the way the
resource is used (usage). Consider the following example:
Figure 5 Part of class identiﬁcation of RTs.
It shows a part of the RT classiﬁcation where every
class is identiﬁed with a letter A..E. In the example RT
class A is the set of all RTs performing an addition on
acu_1. A RT class can contain more than one usage for the
OPU resource. For example Class E is (ram_1,{read,
write}).
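The classification can be sketched as a lookup on the pair (OPU resource, usage). The Python fragment below is an illustration only. Class A and class E follow the text, while the assignment of classes B, C and D to the remaining usages of acu_1 is an assumption, and `RT_CLASSES` and `classify` are hypothetical names.

```python
# Assumed class table: only A and E are spelled out in the text;
# the mapping of B, C and D is a guess based on the order in figure 5.
RT_CLASSES = {
    "A": ("acu_1", {"add"}),
    "B": ("acu_1", {"pass"}),
    "C": ("acu_1", {"addmod"}),
    "D": ("acu_1", {"inca"}),
    "E": ("ram_1", {"read", "write"}),
}

def classify(opu, usage):
    """Return the single RT class of an (OPU resource, usage) pair."""
    matches = [name for name, (res, usages) in RT_CLASSES.items()
               if res == opu and usage in usages]
    if len(matches) != 1:
        raise ValueError(f"({opu}, {usage}) is not in exactly one RT class")
    return matches[0]

print(classify("acu_1", "add"))    # A
print(classify("ram_1", "write"))  # E
```

Raising an error for zero or multiple matches enforces the requirement that every RT belongs to exactly one class.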
[Figure 5 contents:]
OPU resource   Usage         Class
acu_1          add           A
acu_1          pass          B
acu_1          addmod        C
acu_1          inca          D
ram_1          read, write   E
6.2 Instruction set definition
As soon as RT classes are identiﬁed an instruction set
can be speciﬁed by listing all possible instruction types.
An instruction type is speciﬁed by a set of RT classes. The
empty set results in a NOP (no operation).
instruction type = {class1, class2, ...}
An RT class may occur only once in an instruction
type, but as often as needed in different instruction types.
An instruction type specifies all possible instructions
which can be created by replacing every RT class in the
instruction type by a single RT from that class.
instruction = {RT1 : RT1 ∈ class1, RT2 : RT2 ∈ class2, ...}
An instruction consists of RTs which can be executed in
parallel. The instruction set is the set of all possible
instruction types.
instruction set = {instr_type1, instr_type2, ...}
Instruction set modelling via fixed constraints leads to the
following construction rules:
1. All allowed instruction sets include the NOP (no
operation) as a possible instruction.
2. All individual RT classes must result in a valid
instruction type.
3. If the instruction set includes instruction type
{S, U, V} this automatically allows the instruction
types NOP, {S}, {U}, {V}, {S, U}, {S, V}, {U, V} and
{S, U, V}.
4. Comparable with the previous rule: If {S, U}, {S, V},
{U, V} are allowed instruction types then also
{S, U, V} must be an allowed instruction type.
Example:
Consider the following instruction set example with RT
classes S, T, U, V, X, Y and with desired instruction types
{S, T}, {S, U, V} and {X, Y}. Using the construction rules
an allowed instruction set is:
I = {NOP, {S}, {T}, {U}, {V}, {X}, {Y}, {S,U},
{S,V}, {U, V}, {S, U, V}, {S, T}, {X, Y}}
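The construction of I can be reproduced with a short Python sketch. It applies rule 3 (closure under subsets, with the empty set as NOP) to the desired instruction types and then checks rule 4. The function names are hypothetical, and the check extends rule 4 from triples to larger class sets, which is an assumption beyond the wording of the rule.

```python
from itertools import combinations

def close_instruction_set(desired_types):
    """Rule 3: every subset of an allowed type is allowed (empty set = NOP)."""
    allowed = set()
    for itype in desired_types:
        classes = sorted(itype)
        for size in range(len(classes) + 1):
            allowed.update(frozenset(c) for c in combinations(classes, size))
    return allowed

def rule_4_violations(allowed):
    """Class sets whose pairs are all allowed while the set itself is not."""
    classes = sorted(set().union(*allowed))
    missing = []
    for size in range(3, len(classes) + 1):
        for combo in combinations(classes, size):
            pairs_ok = all(frozenset(p) in allowed
                           for p in combinations(combo, 2))
            if pairs_ok and frozenset(combo) not in allowed:
                missing.append(set(combo))
    return missing

desired = [{"S", "T"}, {"S", "U", "V"}, {"X", "Y"}]
instr_set = close_instruction_set(desired)
print(len(instr_set))                # 13 instruction types, including NOP
print(rule_4_violations(instr_set))  # []
```

The exhaustive enumeration is exponential in the number of RT classes, which is acceptable for the handful of classes in these examples.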
6.3 Generating instruction set conﬂicts
For allowed instruction sets it is possible to generate
extra conﬂicts before scheduling such that the RT
combinations after scheduling will not violate the
instruction set. An efﬁcient method for automatically
ﬁnding the extra constraints is based on a conﬂict graph.
The individual RT classes form the nodes for the graph. An
edge exists between two nodes if the two RT classes do not
occur together in any of the instruction types of the
instruction set. Figure 6 shows the conﬂict graph of
instruction set I.
Figure 6 Conﬂict graph of I.
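As an illustration, the conflict graph of I can be derived mechanically from the list of instruction types. The Python sketch below uses the hypothetical helper `conflict_graph` and represents each instruction type as a set of class names.

```python
from itertools import combinations

# Instruction set I from section 6.2 (the empty set is the NOP).
I = [set(), {"S"}, {"T"}, {"U"}, {"V"}, {"X"}, {"Y"},
     {"S", "U"}, {"S", "V"}, {"U", "V"}, {"S", "U", "V"},
     {"S", "T"}, {"X", "Y"}]

def conflict_graph(instruction_set):
    """Edge between two RT classes that never share an instruction type."""
    classes = sorted(set().union(*instruction_set))
    return {frozenset(pair) for pair in combinations(classes, 2)
            if not any(set(pair) <= itype for itype in instruction_set)}

edges = conflict_graph(I)
print(sorted("".join(sorted(e)) for e in edges))
# ['SX', 'SY', 'TU', 'TV', 'TX', 'TY', 'UX', 'UY', 'VX', 'VY']
```

Of the 15 possible class pairs, the 5 pairs that occur together in some instruction type are absent from the graph, leaving 10 conflict edges.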
In this graph we ﬁnd a set of cliques such that all edges
in the conﬂict graph are covered once.
For the valid instruction set I a possible set of cliques is:
{{S, X}, {S, Y}, {T, U, Y}, {T, V, X}, {U, X}, {V, Y}}.
With these cliques we can model the instruction set
restrictions as resource conflicts before scheduling. Every
RT whose class occurs in a clique gets that clique as an
additional, artificial resource, with the RT class as its
usage.
Example:
Suppose RT_1 belongs to RT class S. There are two
cliques containing RT class S: {S, X} and {S, Y}. So SX and
SY are added as artificial resources with usage S. The same
is done for RT_2 (class U) and RT_3 (class X).
RT_1: .. <- .., .. /* RT_1 ∈ RT class ‘S’ */
/.
SX = S
SY = S
RT_2: .. <- .., .. /* RT_2 ∈ RT class ‘U’ */
/.
TUY = U
UX = U
RT_3: .. <- .., .. /* RT_3 ∈ RT class ‘X’ */
/.
SX = X
TVX = X
UX = X
It is clear that RT_1 and RT_3 will never be scheduled
in the same instruction, as SX = S and SX = X form a
conflict for the scheduler. Note that any clique cover will
lead to a valid schedule. The only motivation to look for a
cover of maximal cliques (i.e. few, large cliques) is to
minimize the run time of the scheduler.
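The procedure of this section can be checked with a small Python sketch. It verifies that the clique set given for I covers the ten conflict edges of I, attaches the artificial resources to an RT class, and tests whether two classes conflict. The function names are hypothetical, and the edge list is written out by hand from the definition of the conflict graph.

```python
from itertools import combinations

# Clique set from the text and the conflict edges of instruction set I.
cliques = [{"S", "X"}, {"S", "Y"}, {"T", "U", "Y"},
           {"T", "V", "X"}, {"U", "X"}, {"V", "Y"}]
edges = {frozenset(e) for e in
         ("SX", "SY", "TU", "TV", "TX", "TY", "UX", "UY", "VX", "VY")}

def covered_edges(cliques):
    """All class pairs that lie inside at least one clique."""
    return {frozenset(p) for c in cliques for p in combinations(sorted(c), 2)}

def artificial_resources(rt_class, cliques):
    """Each clique containing rt_class becomes a resource with that usage."""
    return {"".join(sorted(c)): rt_class for c in cliques if rt_class in c}

def conflict(res_a, res_b):
    """Conflict if a common artificial resource has different usages."""
    return any(res_a[r] != res_b[r] for r in res_a.keys() & res_b.keys())

print(covered_edges(cliques) == edges)  # True: a valid clique cover
rt_1 = artificial_resources("S", cliques)  # {'SX': 'S', 'SY': 'S'}
rt_2 = artificial_resources("U", cliques)  # {'TUY': 'U', 'UX': 'U'}
rt_3 = artificial_resources("X", cliques)  # {'SX': 'X', 'TVX': 'X', 'UX': 'X'}
print(conflict(rt_1, rt_3))  # True: SX = S versus SX = X
print(conflict(rt_1, rt_2))  # False: {S, U} is an allowed instruction type
```

Because every conflict edge lies inside some clique and every clique contains only conflicting classes, two RT classes conflict on an artificial resource exactly when they are joined by an edge in the conflict graph.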
7. Example
A typical signal processing example in the digital audio
domain is presented for which the efﬁciency of the code is
essential. It has been implemented manually before. The
application is shown in ﬁgure 7 and consists of
multiplications, additions, clip actions and delays.
[Figure 6 nodes: S, T, U, V, X, Y.]
Figure 7 Signal flow of an audio application (identical
for left & right channel).
For reasons of power dissipation the clock frequency of
the processor is chosen to be 2.8 MHz. With an incoming
sample rate of 44 kHz the cycle count for the time-loop is
limited to a maximum of 64 cycles. The time-loop is that
part of the program which is executed repeatedly. In this
case the time-loop may consist of at most 64 instructions.
The numbers of additions, RAM accesses and multiplications
form the bottlenecks in this application. The architecture
on which the application has to be implemented is shown
in figure 8. The distributed register files are characteristic
for this kind of signal processor. Note that the register
files support single cycle random read and random write.
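The cycle budget is the clock frequency divided by the sample rate. With the rounded figures quoted above the quotient is slightly below 64. The limit of 64 cycles is exact if one assumes the common digital audio sample rate of 44.1 kHz and a clock of 64 times that rate; both values are assumptions and are not stated in the text.

```latex
\[
N_{\text{cycles}} = \frac{f_{\text{clk}}}{f_{\text{s}}}
\approx \frac{2.8\ \text{MHz}}{44\ \text{kHz}} \approx 63.6,
\qquad
\frac{2.8224\ \text{MHz}}{44.1\ \text{kHz}} = 64 .
\]
```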
Figure 8 Processor architecture.
[Figure 7 labels: IN, u, u@1, u@2, v, v@1, v@2, out0, out1,
out2, out3; legend: Multiplication, Delay, Clip; the treble
section is marked.]
[Figure 8 labels: RAM, MULT, ADD, PASS, CLIP, ROM, ACU,
PRG_C.]
The available register transfers result in 13 RT classes.
RT class identification:
OPU     Usage      Class   Merged class
IPB     Read       A
OPB_1   Write      B
OPB_2   Write      C
ACU     AddMod     D
RAM     Read       E       X
RAM     Write      F       X
MULT    Mult       G
ALU     Add        H       Y
ALU     AddClip    I       Y
ALU     Pass       J       Y
ALU     PassClip   K       Y
ROM     Const      L
PRG_C   Const      M
Because a high parallelism is required and no special class
combinations using the RAM and ALU can be excluded, it
is not necessary to identify their individual classes. Classes
E and F can be combined in a single class X and classes H,
I, J and K can be combined to class Y, so the number of
classes is reduced to 9. Only for IO RTs the available
parallelism in the datapath is redundant and can be
eliminated. In this example it is sufficient to be able to do
input via the IPB or output via the OPB_1 or output via the
OPB_2, but not simultaneously. The instruction types which
are required are: {A, D, X, G, Y, L, M}, {B, D, X, G, Y, L,
M}, {C, D, X, G, Y, L, M}, together with all their
sub-instructions. A single artificial resource ‘ABC’ is required
to model the instruction set restrictions. This artificial
resource must be added to all RTs in the classes A, B and C
as prescribed in section 6.3.
The controller used for this application domain is a
stripped version of the controller presented in ﬁgure 4 as
there are no conditional instructions at all.
At this point the core is deﬁned by the presented
datapath, the controller and the instruction set. To show the
programming style used to successfully map the
application on the presented core a small part of the source
is presented:
/* Treble section */
  x0    := u@2; /* U delayed over 2 frames */
  m     := mlt(d2, x0);
  a     := pass(m);
  x2    := v@1; /* V delayed over 1 frame */
  m     := mlt(e1, x2);
  a     := add(m, a);
  x1    := u@1;
  m     := mlt(d1, x1);
  rd    := add_clip(m, a);
  v      = rd;
The source of the treble section is easy to verify and to
read. After scheduling this sequential source will result in
a small number of much more parallel instructions.
The total application is scheduled in 63 cycles. This
could be reduced by a few cycles if the time-loop could be
folded, which is not supported by the current system. The
schedule is illustrated by figure 9.
Figure 9 Occupation distribution of schedule of 63
cycles.
The occupation of the RAM, MULT and ALU is more than
90% for each, which is very high taking the
irregularities in the dataflow of the application into
account. This indicates the quality of the generated code.
8. Future work
Scheduling is one of the central tasks in the code
generation phase of the system presented in this paper. The
characteristic property of the scheduling task for this kind
of code generation is the large number of constraints and,
often, the fixed cycle budget. A promising technique is
being developed using execution interval analysis to prune
the search space of the scheduler [11].
[Figure 9 chart: occupation per resource over the schedule cycles]
92%  PRG_CNST * |********************* ****************************  *********
92%  ROM        |  **********************************************************
92%  MULT       |   **********************************************************
92%  ALU        |    **********************************************************
93%  ACU        |  ******************** **************************** ***********
92%  RAM        |   ******************** **************************** **********
 3%  IPB        |  *                     *
 6%  OPB_1      |                  *    *  *   *
 6%  OPB_2      |                                               *    * *    *
----------------|-----|----|----|----|----|----|----|----|----|----|----|----|----|
             -2  0    5   10   15   20   25   30   35   40   45   50   55   60   65
9. Conclusions
A target architecture model for reprogrammable
in-house DSP-cores is presented. The core definition
consists of a user defined datapath, controller and
instruction set. The instruction set must obey construction
rules in order to be able to model the imposed parallelism
restrictions with fixed conflicts before scheduling. Under
these conditions existing ASIC synthesis tools can be
modified for code generation; the modification is implemented
as a modification of RTs before scheduling. The approach is
illustrated with a real life example for which the efficiency
of the code is very important. In the future, scheduling
techniques like execution interval analysis will be studied
to exploit the large number of constraints available in the
problem specification.
Acknowledgements
I would like to thank the following people for their
support and constructive discussions:
Henry Janssen, Antoine Delaruelle (Philips Nat.Lab.,
Eindhoven, The Netherlands)
Wim Lempens, Ivo van Gelder, Jan de Mortel (EDC,
Leuven, Belgium)
References
[1] A. Delaruelle, J.A. Huisken, J. van Loon, F. Welten, “A
Channel Demodulator IC for Digital Audio Broadcasting”,
Proceedings of the IEEE 1994 Custom Integrated Circuits
Conference, IEEE Electron Devices Society, pp. 47 - 50,
May 1994
[2] A. Fauth, A. Knoll, “Automated generation of DSP
program development tools using a machine description
formalism”, Proceedings ICASSP 93, Minneapolis, Minn,
1993.
[3] M. Freericks, “The nML Machine Description Formalism”,
Technical Report 1991/15, Technische Universität Berlin,
Fachbereich Informatik, Berlin, 1991.
[4] G. Goossens, F. Catthoor, D. Lanneer, H. De Man,
“Integration of Signal Processing Systems on
Heterogeneous IC Architectures”, Proc. of Sixth
International Workshop on High-Level Synthesis, Laguna
Niguel, CA, Nov. 1992.
[5] D. Lanneer, M. Cornero, G. Goossens, H. De Man, “Data
Routing: a Paradigm for Efﬁcient Data-path Synthesis and
Code Generation”, to be presented at the High-Level
Synthesis Symposium, May 1994
[6] C. Liem et al., “Instruction-Set matching and Selection for
DSP and ASIP Code Generation”, Proceedings EDAC ’94,
pp. 31-37, Paris, March 1994.
[7] H. De Man, F. Catthoor, G. Goossens, J. Vanhoof, J. Van
Meerbergen, J. Huisken, “Architecture-driven synthesis
techniques for VLSI implementation of DSP algorithms”,
Proceedings of the IEEE, February 1990, pp. 319-335.
[8] P. Marwedel, “Tree-Based Mapping of Algorithms to
Predeﬁned Structures”, Digest of Technical Papers of
ICCAD-93, pp. 586-593, Santa Clara (CA), Nov. 1993.
[9] K. Van Nieuwenhoven, J. De Moortel, D. Genin, S. Note,
“Mistral 2 a True Architectural Synthesis™ Tool: from a
Behavioural Speciﬁcation down to a Register Transfer
Level Description”, to appear in DSP Applications and
Multimedia, October 1994.
[10] P. Paulin et al., “DSP Design Tool Requirements for
Embedded Systems: A Telecommunications Industrial
Perspective.”, to appear in Journal of VLSI Signal
processing.
[11] A.H. Timmer, J.A.G. Jess, “Exact Scheduling Strategies
based on Bipartite Graph Matching”, accepted EDAC’95,
Paris, March 1995.
[12] R. Woudsma, F. Beenker, J. Van Meerbergen, C. Niessen,
“An architecture-driven silicon compiler for complex DSP
applications”, Proceedings IEEE International Symposium
on Circuits and Systems, 1990, pp. 2696-2700.