Introduction
Depending on the desired flexibility and on the importance of area (cost) and power dissipation, different options exist for the implementation of signal processing algorithms. At one end of the design space, general purpose processors offer flexibility: many applications can be programmed on the same processor, but often at a high cost (area and dissipation). At the other end of the design space, ASICs offer cost-effective solutions because they are tailored towards a specific application [9].
In an attempt to combine the advantages of both alternatives, designers have recently started to look for solutions in between. This can be done in two ways. First of all, general purpose and ASIC components can be combined in one design where the ASIC is used as a co-processor. This approach is very popular with IC vendors of general purpose DSPs, because it lets them increase the efficiency of the total solution for their customers. They make the processor available as a (fixed) core which can be used as a qualified and verified building block on a chip. This approach is attractive when the programmable parts can be grouped together such that the communication with the other parts is limited.
If this is not the case, another solution can be found by approaching the problem from the other side, i.e. from the side of the systems industry and the ASIC vendor. In this case the problem is to design an application domain specific processor, i.e. an in-house core which is tuned towards a particular application domain. There is a relation between the efficiency and the size of these domains: the higher the required efficiency, the smaller the application domain that is chosen. In this paper the application domains are rather small. Typical examples are digital audio, DECT, GSM etc. For each domain an in-house core is designed in two phases:
1. Definition and implementation of the core (datapath, controller and the instruction set).
2. Code generation.
During phase 1 a representative set of applications within the target application domain is implemented using existing ASIC synthesis tools for the design space exploration. Based on this quantitative feedback a core architecture including the instruction set is defined. This core architecture is implemented in VLSI using existing libraries and methods. In phase 2 any application within the same domain can be programmed on the core. This is a feasibility problem since the core, the application and the timing constraints are given. The goal of this paper is to show that existing high level synthesis tools can be adapted for the code generation purpose.
Related work
In recent literature a number of papers on the design of ASIPs can be found [4][10]. The processor is specified in terms of the instruction set, using for example special languages like nML [2][3], or via a structural description of the processor [8]. Code generation is done in two major phases. First, instruction set matching and selection is done [6]. Next, the covered graph is scheduled and variables are assigned to registers, including data routing [5]. The target architectures include commercially available general purpose DSP processors which are designed to span a large application domain.
General purpose architectures often use special architectural constructs which highly complicate code generation, since code generation often comes as an afterthought. In the case of in-house cores we can control (to some extent) the architecture and the instruction set. Therefore we define a target architectural style such that retargetable code generation becomes possible. This means that we define a set of rules for the datapath, the controller and the instruction set. On the one hand the rules are a limitation, but on the other hand a large range of architectures is still accepted. This paper will concentrate on rules for the instruction set.
Existing compilers generate code of which the efficiency is not sufficient. The quality of the generated code is measured by comparing with a hand coded implementation. For our application domains the cycle budget is specified by the user (see example in section 7) and often taken from existing manual implementations. This means that efficiency is very important. To obtain this efficiency, user interaction with the specification and with the synthesis tools is more important than automation.
This paper is organized as follows. First the starting point is explained. Some characteristics of the high level synthesis tools for ASICs are discussed since they are the basis for the rest of the paper. Then an overview of the new approach is presented, followed by the class of architectures for which code generation is possible. Next the modelling of conflicts originating from the instruction set is discussed. Finally an example will show the possibilities.
High level synthesis for ASICs
Since parts of the existing high level synthesis tools for ASICs are reused, a short introduction will explain the most important concepts. In systems like Piramid and Cathedral2 [7][12] the overall system (figure 1a) consists of two major steps: RT generation, and scheduling & controller generation. Step 1 translates the input source into register transfers (RTs). The scheduler (step 2) performs the ordering of the RTs and combines RTs into VLIW instructions.
RTs correspond to paths in the architecture (figure 2). The characteristic property of RTs is that they start with one or more operands originating from register files as input for an operation executed on an operation unit (OPU), which is possibly pipelined. The result is transferred through a buffer onto a bus and optionally through a multiplexer into a destination register. Each RT specifies which resources on the path must be activated and how the resources are occupied. All resources used by an RT obtain a usage specification. The resources are found on the left-hand side of the '=' sign and the usage is positioned on the right-hand side. Different RTs with common resources can be executed in parallel when the common resources have the same usage. The example shows an 'add' on an OPU called 'acu-1' using two operands and writing the result into a register of the OPU 'ram-1' via the first of two available multiplexer inputs.
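The parallelism rule stated above can be sketched in a few lines of Python. This is a minimal illustration, assuming an RT is reduced to a mapping from resource name to usage; apart from 'acu-1', the resource names and usage labels below are hypothetical and are not taken from figure 2.

```python
def can_run_in_parallel(rt_a: dict, rt_b: dict) -> bool:
    """Two RTs are compatible when every resource they share has the same usage."""
    shared = rt_a.keys() & rt_b.keys()
    return all(rt_a[res] == rt_b[res] for res in shared)


# Hypothetical RTs: resource -> usage ("resource = usage" in the RT notation).
rt_add = {"acu-1": "add", "bus-acu-1": "drive", "mux-ram-1": "input-1"}
rt_sub = {"acu-1": "sub", "bus-acu-1": "drive", "mux-ram-1": "input-1"}
rt_mult = {"mult-1": "mult", "bus-mult-1": "drive"}

print(can_run_in_parallel(rt_add, rt_mult))  # no shared resources
print(can_run_in_parallel(rt_add, rt_sub))   # 'acu-1' is used differently
```

Because the check only compares usages of shared resources, any extra resource that is attached to an RT automatically takes part in conflict detection without changing the scheduler.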
Experience in using high level synthesis for actual designs [1] has shown that the efficiency is strongly influenced by the way the specification is written. Therefore design iterations by rewriting the specification are included in figure 1. Three aspects are important. First, the feedback of the compiler must guide the designer in rewriting the specification. Secondly, the result of the specification modifications must be predictable. Thirdly, the design time may not be increased significantly. Experience has shown that this is possible.
Compiler used for code generation for in-house cores
The new compiler overview is shown in figure 1b and consists of three steps: RT generation, RT modification, and scheduling & instruction encoding.
For step 1 the existing RT generation tool is reused. The generated RTs can be executed on an intermediate datapath which is equivalent to the Piramid/Cathedral2 architecture [12]. The final datapath of the core can differ because register files and busses can be merged later.
In step 2 the core specification is taken into account. This means two things: first, the register files and busses can be merged, and secondly, the instruction set is taken into account. Both aspects are realized by modification of the RTs.
The modified RTs are input for the scheduler (step 3), which performs the ordering of the RTs. The scheduler combines RTs into instructions. The modifications ensure that the scheduler only creates µcode instructions by combining RTs that are physically possible and allowed in the instruction set. If this does not result in a feasible solution, an iteration cycle is required in which the source must be improved.
Target architecture model
This section describes the class of architectures for which code generation is possible. First the datapath architecture is presented in figure 3. Then the controller is illustrated in figure 4. Section 6 will deal with the possible instruction sets.
The datapath consists of a number of operation units with a bus network for interconnection. OPUs can be any processing unit such as ALU, MULT, RAM, ROM and ASUs. ASUs are application specific units specifically tuned towards the application area. All operands are fetched from register files and after processing in an OPU the result is stored via an optional multiplexer in the destination register file. OPUs may also produce flags which can be used for conditional branching in the controller.
The architecture modifications mentioned in figure 1b specify the merging of resources such as busses and register files. These resources can then be shared at the cost of a reduction of parallelism. For real-time DSP applications this large design freedom enables the creation of a highly suitable processor core.
Instruction set conflict modelling
As indicated in section 4 the RT model plays a central role in the compiler. RTs contain all necessary information to decide whether two RTs can be executed in parallel, i.e. whether parallel execution results in a conflict. However, conflicts can be generated by the instruction set too. It is possible that RTs without a resource conflict in the datapath cannot be executed simultaneously because this is not allowed by the instruction set, e.g. because a vertical µcode is preferred. In this section we extend the previous RT model such that the parallelism restrictions imposed by the instruction set can also be represented.
First RT classes will be introduced in section 6.1. RT classes are required to specify instruction sets. The way instruction sets are specified is defined in section 6.2. Next the extra conflicts for the RTs can be generated automatically to impose the instruction set (section 6.3).
RT classes
RT classes need to be introduced to be able to specify instruction sets with the special property that all parallelism restrictions imposed by the instruction set can be modeled before scheduling. Every RT generated in step 1 of the compiler belongs to exactly one RT class. The RT class to which an RT belongs is determined by the combination of the OPU resource it uses and the way the resource is used (usage). Consider the following example, which shows a part of the RT classification where every class is identified with a letter A..E. In the example, RT class A is the set of all RTs performing an addition on acu-1. An RT class can contain more than one usage for the OPU resource. For example, class E is (ram-1, {read, write}).
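The classification can be sketched as a lookup from (OPU, usage) to a class letter. This is a minimal sketch: classes A and E follow the description above, while class B and the function name are hypothetical additions made only to show that one OPU can carry several classes.

```python
# RT class -> (OPU resource, set of usages belonging to that class).
RT_CLASSES = {
    "A": ("acu-1", frozenset({"add"})),            # from the text
    "B": ("acu-1", frozenset({"sub"})),            # hypothetical class
    "E": ("ram-1", frozenset({"read", "write"})),  # from the text
}


def classify(opu: str, usage: str) -> str:
    """Return the single RT class that matches the OPU and its usage."""
    matches = [name for name, (cls_opu, usages) in RT_CLASSES.items()
               if cls_opu == opu and usage in usages]
    if len(matches) != 1:
        # Every RT must belong to exactly one class.
        raise ValueError(f"({opu}, {usage}) matches {len(matches)} classes")
    return matches[0]


print(classify("acu-1", "add"))
print(classify("ram-1", "write"))
```

Raising an error for zero or multiple matches makes the "exactly one RT class" requirement explicit instead of silently picking a class.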
Instruction set definition
As soon as RT classes are identified, an instruction set can be specified by listing all possible instruction types. An instruction type is specified by a set of RT classes. The empty set results in a NOP (no operation).
instruction type = {class_1, class_2, ...}
An RT class may occur only once in an instruction type, but as often as needed in different instruction types.
An instruction type specifies all possible instructions which can be created by replacing every RT class in an instruction type by a single RT from that class.
An instruction consists of RTs which can be executed in parallel. The instruction set is the set of all possible instruction types.
instruction set = {instr_type_1, instr_type_2, ...}
Instruction set modeling via fixed constraints leads to the following construction rules:
1. All allowed instruction sets include the NOP (no operation) as a possible instruction.
2. All individual RT classes must result in a valid instruction type.
3. If the instruction set includes instruction type {S, U, V}, this automatically allows the instruction types NOP, {S}, {U}, {V}, {S, U}, {S, V}, {U, V} and {S, U, V}.
4. Comparable with the previous rule: if {S, U}, {S, V} and {U, V} are allowed instruction types, then {S, U, V} must also be an allowed instruction type.
Example:
Consider the following instruction set example with RT classes S, T, U, V, X, Y and with desired instruction types {S, T}, {S, U, V} and {X, Y}. Using the construction rules an allowed instruction set is:
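For this example the construction rules can be applied mechanically. The sketch below is an illustration, not the authors' tool: it treats two classes as compatible when they occur together in some desired instruction type and then accepts every set of pairwise compatible classes. Extending rule 4 from triples to sets of any size is an assumption of this sketch, and the function name is hypothetical.

```python
from itertools import combinations


def allowed_instruction_set(classes, desired_types):
    """Smallest instruction set that contains the desired types and obeys rules 1-4."""
    compatible = set()
    for instr_type in desired_types:
        compatible.update(frozenset(p) for p in combinations(sorted(instr_type), 2))
    allowed = set()
    for size in range(len(classes) + 1):
        for combo in combinations(sorted(classes), size):
            # size 0 yields NOP (rule 1), size 1 the single classes (rule 2);
            # larger sets need every pair to be compatible (rules 3 and 4).
            if all(frozenset(p) in compatible for p in combinations(combo, 2)):
                allowed.add(frozenset(combo))
    return allowed


classes = {"S", "T", "U", "V", "X", "Y"}
desired = [{"S", "T"}, {"S", "U", "V"}, {"X", "Y"}]
for instr_type in sorted(allowed_instruction_set(classes, desired),
                         key=lambda t: (len(t), sorted(t))):
    print("{" + ", ".join(sorted(instr_type)) + "}" if instr_type else "NOP")
```

Under these assumptions the result contains NOP, the six single-class types, the pairs {S, T}, {S, U}, {S, V}, {U, V}, {X, Y} and the triple {S, U, V}.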
Generating instruction set conflicts
For allowed instruction sets it is possible to generate extra conflicts before scheduling such that the RT combinations after scheduling will not violate the instruction set. An efficient method for automatically finding the extra constraints is based on a conflict graph. The individual RT classes form the nodes for the graph. An edge exists between two nodes if the two RT classes do not occur together in any of the instruction types of the instruction set. Figure 6 shows the conflict graph of instruction set I. In this graph we find a set of cliques such that all edges in the conflict graph are covered once.
For the valid instruction set I a possible set of cliques is:
With these cliques we can model the instruction set restrictions as resource conflicts before scheduling. For RTs from a class which is also present in a clique, a conflict must be added with the clique as artificial resource. The clique is added as an artificial resource, with the RT class as its usage.
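The construction of the conflict graph and of a clique cover can be sketched as follows. The greedy strategy below is an assumption chosen for brevity, not necessarily the method used in the compiler, and the cover it finds is only one possible cover that need not coincide with the one listed in the paper. The class names and instruction types are those of the example in section 6.2.

```python
from itertools import combinations


def conflict_edges(classes, instruction_types):
    """Edges between RT classes that never occur together in an instruction type."""
    together = set()
    for instr_type in instruction_types:
        together.update(frozenset(p) for p in combinations(sorted(instr_type), 2))
    return {frozenset(p) for p in combinations(sorted(classes), 2)} - together


def clique_cover(classes, edges):
    """Greedy cover in which every conflict edge belongs to exactly one clique."""
    uncovered = set(edges)
    cliques = []
    while uncovered:
        clique = set(min(uncovered, key=sorted))  # seed with one uncovered edge
        for node in sorted(classes):
            # Grow the clique only along edges that are still uncovered.
            if node not in clique and all(frozenset((node, m)) in uncovered for m in clique):
                clique.add(node)
        uncovered -= {frozenset(p) for p in combinations(sorted(clique), 2)}
        cliques.append(frozenset(clique))
    return cliques


classes = ["S", "T", "U", "V", "X", "Y"]
types = [{"S", "T"}, {"S", "U", "V"}, {"X", "Y"}]
edges = conflict_edges(classes, types)
for clique in clique_cover(classes, edges):
    print(sorted(clique))
```

Larger cliques mean fewer artificial resources per RT, which is why a cover with few, large cliques is preferable for the run time of the scheduler.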
Example:
Suppose RT-1 belongs to RT class S. There are two cliques containing RT class S: {S, X} and {S, Y}. So SX and SY are added as artificial resources with as usage S. The same is performed for RT-2 and RT-3.
RT-1: ..., SX = S, SY = S /* RT-1 ∈ RT class 'S' */
RT-2: ... /* RT-2 ∈ RT class 'U' */
RT-3: ..., SX = X, ... /* RT-3 ∈ RT class 'X' */
It is clear that RT-1 and RT-3 will never be scheduled in the same instruction as SX = S and SX = X form a conflict for the scheduler. Note that any clique cover will lead to a valid schedule. The only motivation to look for a maximal clique cover is to minimize the run time of the scheduler.
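The effect of the artificial resources on the scheduler's conflict check can be sketched as below. The cliques {S, X} and {S, Y} come from the example; the remaining cliques form one possible cover assumed for illustration, and the datapath resources of the three RTs are hypothetical.

```python
def add_instruction_set_conflicts(rt: dict, rt_class: str, cliques) -> dict:
    """Attach every clique containing the RT's class as an artificial resource."""
    extended = dict(rt)
    for clique in cliques:
        if rt_class in clique:
            extended["".join(sorted(clique))] = rt_class  # e.g. SX = S
    return extended


def can_run_in_parallel(rt_a: dict, rt_b: dict) -> bool:
    """Shared resources, real or artificial, must have the same usage."""
    return all(rt_a[r] == rt_b[r] for r in rt_a.keys() & rt_b.keys())


# Assumed clique cover of the conflict graph for classes S, T, U, V, X, Y.
CLIQUES = [{"S", "X"}, {"S", "Y"}, {"T", "U", "X"}, {"T", "V", "Y"}, {"U", "Y"}, {"V", "X"}]

rt_1 = add_instruction_set_conflicts({"acu-1": "add"}, "S", CLIQUES)
rt_2 = add_instruction_set_conflicts({"mult-1": "mult"}, "U", CLIQUES)
rt_3 = add_instruction_set_conflicts({"ram-1": "read"}, "X", CLIQUES)

print(rt_1)
print(can_run_in_parallel(rt_1, rt_3))  # SX = S conflicts with SX = X
print(can_run_in_parallel(rt_1, rt_2))  # {S, U} is an allowed instruction type
```

The scheduler itself is unchanged: it keeps comparing usages of shared resources, and the instruction set restrictions appear to it as ordinary resource conflicts.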
Example
A typical signal processing example in the digital audio domain is presented for which the efficiency of the code is essential. It has been implemented manually before. The application is shown in figure 7 and consists of multiplications, additions, clip actions and delays. For reasons of power dissipation the clock frequency of the processor is chosen to be 2.8 MHz. With an incoming sample rate of 44 kHz the cycle count for the time-loop is limited to a maximum of 64 cycles. The time-loop is that part of the program which is executed repeatedly. In this case the time-loop may consist of at most 64 instructions. The numbers of additions, RAM accesses and multiplications form the bottlenecks in this application.
The architecture on which the application has to be implemented is shown in figure 8. The distributed register files are characteristic for this kind of signal processor. Note that the register files support single cycle random read and random write. The available register transfers result in 13 RT classes. Because a high parallelism is required and no special class combinations using the RAM and ALU can be excluded, it is not necessary to identify their individual classes. Classes E and F can be combined in a single class X, and classes H, I, J and K can be combined into class Y, so the number of classes is reduced to 9. Only for IO RTs the available parallelism in the datapath is redundant and can be eliminated. In this example it is sufficient to be able to do input via the IPB or output via the OPB-1 or output via the
The source of the treble section is easy to verify and to read. After scheduling, this sequential source will result in a small number of much more parallel instructions.
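The cycle budget of the time-loop quoted above is the number of clock cycles available per incoming sample. The sketch below works this out. The exact rates of 2.8224 MHz and 44.1 kHz are an assumption, namely that the rounded figures in the text stand for the usual digital audio rates; the function name is hypothetical.

```python
def cycle_budget(clock_hz: int, sample_rate_hz: int) -> int:
    """Whole clock cycles available for one pass through the time-loop."""
    return clock_hz // sample_rate_hz


# Assumed exact rates: 64 * 44.1 kHz = 2.8224 MHz.
print(cycle_budget(2_822_400, 44_100))  # 64

# Rounded figures as quoted in the text give a slightly smaller budget.
print(cycle_budget(2_800_000, 44_000))  # 63
```

Either way the budget leaves no slack for an inefficient schedule, which is why the RT class count and the allowed parallelism matter in this example.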
The total application is scheduled in 63 cycles. This could be reduced by a few cycles if the time-loop could be folded, which is not supported by the current system. The schedule is illustrated by figure 9. The occupation of the RAM, MULT and ALU is more than 90% for each unit, which is extremely high taking the irregularities in the dataflow of the application into account. This indicates the quality of the generated code.
Future work
Scheduling is one of the central tasks in the code generation phase of the system presented in this paper. The characteristic property of the scheduling task for this kind of code generation is the large number of constraints and, often, the fixed cycle budget. A promising technique is being developed that uses execution interval analysis to prune the search space of the scheduler [11].
Conclusions
A target architecture model for reprogrammable in-house DSP cores is presented. The core definition consists of a user defined datapath, controller and instruction set. The instruction set must obey construction rules in order to be able to model the imposed parallelism restrictions with fixed conflicts before scheduling. Under these conditions existing ASIC synthesis tools can be adapted for code generation, which is implemented as a modification of RTs before scheduling. The approach is illustrated with a real life example for which the efficiency of the code is very important. In the future, scheduling techniques like execution interval analysis will be studied to exploit the large number of constraints available in the problem specification.
