A formal approach for the transformation of computation intensive digital signal processing algorithms into suitable array processor architectures is presented. It covers the complete design flow from algorithmic specifications in a high-level programming language to architecture descriptions in a hardware description language. The transformation itself is divided into manageable design steps and implemented in the CAD-tool DECOMP which allows the exploration of different architectures in a short time. With the presented approach data independent algorithms can be mapped onto array processor architectures. To allow this, a known mapping methodology for array processor design is extended to handle inhomogeneous dependence graphs with nonregular data dependences. The implementation of the formal approach in the DECOMP is an important step towards design automation for massively parallel systems.
INTRODUCTION
Progress in VLSI technology makes it possible to integrate more and more transistors on a single chip.
Thus, microelectronic systems with an increasing complexity can be realized. But this also results in a large quantity of design work manageable only with efficient support by design tools.
At the same time, the algorithms developed in digital signal processing (DSP) grow in complexity, thereby requiring more and more computational power and higher throughput rates. This is in particular the case in the area of image and video processing, where algorithms for high definition television (HDTV) and video telephone have to be applied under real-time conditions. Application specific integrated circuits (ASICs) for such systems can be realized only using special purpose architectures (cf. [1]). One possible architecture is the array processor [2], because it meets the requirements by a massive application of pipelining and parallel processing. In addition, due to their regularity and modularity, array processors are well suited for a design process automated by design tools. These trends influence the design methodology for microelectronic systems. In the past, design work mainly consisted of logic design and layout synthesis. Today these design tasks are well supported by commercial tools. But there is a need to extend these tools because increasing emphasis is given to decisions at the architecture level.
Today the derivation of architectures is performed manually in a laborious and error-prone process.
In most cases only a few different architectures are examined. As a result, an unsuitable architecture may be derived, and it becomes impossible to fulfill the requirements of a given algorithm. Thus, design methodologies supporting the architecture level must be developed and implemented in CAD-tools which enable designers to explore different architectures in a short time.
In this direction a lot of research is performed. But due to the complexity of the design process the solutions inevitably are restricted to small and/or regular design problems. Exploiting regularity several methodologies for mapping algorithms onto architectures have been developed [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] and partly implemented in design tools (see references in [12] ). A disadvantage of these methodologies and tools is that most of them are restricted to special architecture types (e.g. array processors consisting of one type of processing element (PE) connected by regular data dependences) or to a special class of algorithms (e.g. regular algorithms representable by nested loop programs). Furthermore, they do not support the complete design flow starting with the specification of the algorithms and ending with a netlist description at the gate level.
For these reasons the CAD-tool DECOMP has been developed to support the mapping of algorithms onto array processor architectures [13, 14]. The DECOMP requires PASCAL descriptions [15] of the algorithms as input and produces EDIF netlists [16] at the register-transfer level as output. Later developments led to a new implementation of the frontend in the DECOMP, which is now able to compile data independent algorithms [17] into dependence graphs (DGs) [12, 18]. The resulting DGs consist of different node types connected by nonregular data dependences. Thus, they cannot be mapped onto array architectures by the known design methodologies. To allow the mapping of these DGs, a procedure for combining nodes of different types into one PE has been derived [19], and in addition the mapping procedure proposed in [2] has been extended to handle nonregular data dependences. Currently the new mapping is implemented in the DECOMP.
The design process captured by the DECOMP cannot be performed in one step. Because of its complexity it has to be split into manageable design tasks each of them performing a specific design step. This results in a method referred to as stepwise transformation. A similar technique is known from high-level synthesis where it is applied to transform a behavioural description step by step into hardware (cf [20] ). The purpose of this paper is to outline the formal approach underlying the stepwise transformation and its implementation. Furthermore, two data representations, one assigned to the algorithm level and the other assigned to the architecture level, are defined, based on which one of the main design steps of the transformation is explained in more detail. With the presented approach data independent algorithms can be mapped onto highly parallel array processor architectures. The main advantages of the presented transformation are its ability to process nonregular algorithms and its degree of automation.
In Section 2 of this paper the stepwise transformation is outlined, and in Section 3 the data representations are introduced. One of the main design steps is explained in more detail in Section 4. The implementation in the CAD-tool DECOMP is described in Section 5, and finally a design example is given in Section 6.
THE STEPWISE TRANSFORMATION
The design process of mapping a given algorithm onto an array processor architecture is performed in four phases. These are 1. a specification phase, 2. a compilation phase, 3. a mapping phase, and 4. an optimization phase.
The phases themselves are divided into smaller design tasks, each of them performing a correctness preserving transformation, that is, a transformation which does not change the I/O-behaviour of the algorithm. Thus, a given algorithm is transformed step by step into an array processor architecture. The phases and their design tasks are depicted in Fig. 1.
The specification phase consists of only one step which is the Program development. In this step the given algorithm is manually specified in a high-level language which is executable using standard compilers. Besides the algorithm the specification may contain an interface description specifying how the input data is provided and how the output data is required. In addition design constraints like maximum chip area and maximum delay times can be specified for the array processor or its PEs.
In the four steps of the compilation phase the description of the algorithm is modified in such a way that a dependence graph can be built from it. First, by application of compiler techniques [21] the given specification is symbolically executed [17] and the performed assignments are listed in the so-called run time protocol (RTP) [12]. [...]
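The idea of listing the performed assignments in a run time protocol can be illustrated with a minimal Python sketch. It assumes a simple multiply-accumulate loop (`s := 0; for i := 1 to n do s := s + a[i]*b[i]`) that is not taken from this paper, and the tuple layout of a protocol entry is hypothetical; the actual RTP format of the DECOMP is defined in [12]. Every executed assignment gets its own indexed target, which corresponds to single assignment code.

```python
def symbolic_execution(n):
    """Unroll the assumed loop and record every assignment that is performed.

    Each protocol entry is (target, operation, operands), where variables are
    written as (name, index) pairs. This layout is an illustrative assumption.
    """
    # s := 0 becomes the assignment to the first instance s[0]
    protocol = [(("s", 0), "const", (0,))]
    for i in range(1, n + 1):
        # s := s + a[i]*b[i] becomes s[i] := s[i-1] + a[i]*b[i]
        protocol.append(
            (("s", i), "mac", (("s", i - 1), ("a", i), ("b", i)))
        )
    return protocol


if __name__ == "__main__":
    for entry in symbolic_execution(3):
        print(entry)
```

Because the loop bounds do not depend on the data, the list of assignments is fully determined before any input value is known, which is what makes such a protocol usable as a starting point for a dependence graph.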
The mapping phase consists of two steps: the DG-derivation and the mapping onto signal flow graphs (SFGs). In the DG-derivation the RTP is transformed into a DG consisting of nodes and arcs (see the example in Sec. 6). Then a mapping which is based on the multi-projection method proposed in [2] [...]
In contrast to most of the known methodologies, the proposed mapping has the advantage of not being restricted to homogeneous DGs with regular data dependences. It is also capable of mapping different nodes into the same PE by merging their internal structure as described in [19]. In addition, the method proposed in [2] is extended to handle DGs with irregular data dependences [22]. Furthermore [...]
Thereafter the derived architecture can be adapted to given design constraints. For example, it is not always possible to derive an architecture which requires the input data in the same way as it is provided by the external input interface. Therefore, register-multiplexer circuits for sorting the data coming from the input interface can be synthesized and placed in front of the derived array processor. The problem of data supply for array processors has been studied in [23, 24].
The last step, the extraction of a netlist in a hardware description language, is performed by a direct conversion of the data structure in use. The netlist is given at the register-transfer level. The smallest blocks at this level are registers and arithmetic building blocks such as adders and multipliers, which can be generated using building block generators [25, 26]. [...]
[...] (5)
with bb in as the block at which the signal starts and bb eer as the block at which it ends. The description of a block then contains redundant information which can be used, for example, for a consistency check of the netlist.
A problem arises if internal signals of a block are allowed to be outputs, too. In this case the output cannot be recognized automatically. In the presented model this is handled by fork-elements which have one input and more than one output. The function for a fork-element is, for example, $f_{fork}(fork_1, \{s_1\}, \{s_2, s_3\})$.
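The role of the fork-element can be made concrete with a small Python sketch. The `Block` class, the helper names and the example signals are hypothetical and only mirror the (name, inputs, outputs) shape of the fork function given above; they are not the data structure of the DECOMP. The sketch assumes that an external output is a signal which is produced by some block but consumed by none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A building block described by its name, input signals and output signals."""
    name: str
    inputs: frozenset
    outputs: frozenset


def make_fork(name, source, branches):
    """Create a fork-element: exactly one input and more than one output."""
    if len(branches) < 2:
        raise ValueError("a fork-element needs more than one output")
    return Block(name, frozenset({source}), frozenset(branches))


def external_outputs(blocks):
    """Signals that are produced by a block but consumed by no block."""
    produced = set().union(*(b.outputs for b in blocks))
    consumed = set().union(*(b.inputs for b in blocks))
    return produced - consumed


# Without a fork, s1 feeds the register and is therefore not seen as an output.
without_fork = [
    Block("add1", frozenset({"a", "b"}), frozenset({"s1"})),
    Block("reg1", frozenset({"s1"}), frozenset({"q"})),
]

# With a fork, s1 is split into s2 (internal) and s3 (external output).
with_fork = [
    Block("add1", frozenset({"a", "b"}), frozenset({"s1"})),
    make_fork("fork1", "s1", {"s2", "s3"}),
    Block("reg1", frozenset({"s2"}), frozenset({"q"})),
]

if __name__ == "__main__":
    print(sorted(external_outputs(without_fork)))
    print(sorted(external_outputs(with_fork)))
```

With the fork in place, every signal has a single source and a single sink, so outputs can be found by a purely structural rule.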
As an example of the presented data representation, the PE shown in Fig. 2 [...] (Table 2) as follows: [...]
With these sets the external input and output arcs of the DG are given by Eqs. 8 and 9, respectively. The symbol e means that there is no value at this place. The set $J_{ind}$ is obtained by applying Eq. 13. This set contains all assignments of the given run time protocol which produce a variable with the index ind. Thus, all these assignments are placed in the same node.
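The grouping of assignments by index can be sketched in a few lines of Python. The protocol entries and the `index_of` accessor are hypothetical, since the exact notation of Eq. 13 is not reproduced here; the sketch only illustrates that all assignments producing a variable with the same index end up in the same node, and that one pass over the protocol is sufficient.

```python
from collections import defaultdict


def group_by_index(protocol, index_of):
    """Collect all assignments with the same target index into one node.

    protocol : list of assignments in execution order
    index_of : function returning the index of the variable an assignment produces
    """
    nodes = defaultdict(list)
    for assignment in protocol:  # single linear pass over the protocol
        nodes[index_of(assignment)].append(assignment)
    return dict(nodes)


# Hypothetical protocol entries: (variable name, index, expression text)
example_protocol = [
    ("x", (1, 1), "a[1] * b[1]"),
    ("y", (1, 1), "y[1,0] + x[1,1]"),
    ("x", (1, 2), "a[1] * b[2]"),
    ("y", (1, 2), "y[1,1] + x[1,2]"),
]

if __name__ == "__main__":
    nodes = group_by_index(example_protocol, index_of=lambda entry: entry[1])
    for index, assignments in nodes.items():
        print(index, [name for name, _, _ in assignments])
```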
[...] (13)
A synthesis algorithm for the derivation of DGs from the RTP has to apply the given equations in the order described above. The main advantage of this approach is that the basic sets [...] and the respective $J_{ind}$ are derived by linear processing [...]
To show the feasibility of the formal approach presented in this paper, the stepwise transformation has been implemented in the CAD-tool DECOMP. COMMON LISP [27] has been used as the implementation language because it offers various possibilities for the implementation of object-oriented data structures as required by the presented formal approach. In addition, it is well suited for rapid prototyping. The program structure of the DECOMP is shown in Fig. 3.
The transformation of a given algorithm starts with an input description in the high-level language PASCAL. Besides the algorithm, this input description may also contain a specification of the external interfaces and given design constraints. The first transformation step in the compilation phase, the symbolic execution, is performed by application of compiler techniques [21]. More precisely, the frontend compiler of the DECOMP analyses the input description based on a grammar describing the input language. Thus, by changing this grammar, other input languages can be implemented easily without changing the source code of the compiler.
The symbolic execution as well as the introduction of single assignment code (SAC), the localisation, and the placement (see Fig. 1) [...] as output. This RTP is input to the program component DG-derivation, which translates it into the internal data structure. The internal data structure is able to represent graphs (DGs and SFGs) as well as PEs and their building blocks. Furthermore, to avoid confusion with the design data, the internal data structure is only accessible via a macro-shell. [...]
Fig. 3. The structure of the DECOMP.
Via the DIF a building block generator [25, 26] is connected to the DECOMP. Thereby, the synthesis of the required building blocks can be performed outside the DECOMP, and the performance data of the generated building blocks can be written back into the DIF. Furthermore, different netlist converters are implemented, allowing the translation of the DIF into the standards EDIF and VHDL. Thus, further design steps like simulation and layout synthesis can be performed with commercially available CAD-tools.
The performance data of the generated building blocks can be transferred back into the internal data structure of the DECOMP. Then, based on this data the created architectures, or rather SFGs, can be interactively analysed with respect to the specified design constraints. The program component interactive analysis provides functions e.g., for the calculation of area and delay parameters of the architectures. If an architecture does not fulfill required constraints it can be modified and extracted again. This design cycle implements the optimization step 'adaptation to design constraints'.
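A minimal Python sketch of such an analysis is given below. It assumes that each building block carries one area value and one delay value, that the area of a PE is the sum of its block areas, and that its delay is the longest combinational path through an acyclic block graph. The block names, the numbers and the function names are hypothetical; the actual analysis functions of the DECOMP are not specified in this section.

```python
from functools import lru_cache


def total_area(blocks):
    """Sum of the block areas; blocks maps name -> (area, delay)."""
    return sum(area for area, _ in blocks.values())


def critical_path_delay(blocks, successors):
    """Longest accumulated delay over all paths of an acyclic block graph."""

    @lru_cache(maxsize=None)
    def longest_from(name):
        # Delay of this block plus the worst case over everything it drives
        tail = max((longest_from(s) for s in successors.get(name, ())), default=0)
        return blocks[name][1] + tail

    return max(longest_from(name) for name in blocks)


def meets_constraints(blocks, successors, max_area, max_delay):
    """Check a candidate against maximum chip area and maximum delay time."""
    return (total_area(blocks) <= max_area
            and critical_path_delay(blocks, successors) <= max_delay)


# Hypothetical performance data in arbitrary integer units: (area, delay)
pe_blocks = {"mul": (40, 30), "add": (10, 10), "reg": (5, 2)}
pe_successors = {"mul": ("add",), "add": ("reg",)}

if __name__ == "__main__":
    print(total_area(pe_blocks))
    print(critical_path_delay(pe_blocks, pe_successors))
    print(meets_constraints(pe_blocks, pe_successors, max_area=60, max_delay=40))
```

If the check fails, the architecture would be modified and extracted again, which corresponds to the design cycle described above.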
In addition, the program component data I/O adaptation provides functions for the generation of register-multiplexer circuits which adapt the input interface of a designed architecture to a given external interface. The generated circuits are represented in the same data structure as the SFGs. Therefore, the same design steps can be applied to them.
The program components of the DECOMP allow a straightforward conversion of an algorithm into an architecture as intended by the formal approach of the stepwise transformation. The components of the DECOMP can be used interactively or in an automated way. In the latter case the program component control/strategy performs an experience based heuristic search to find architectures close to given design constraints. To allow this the control / strategy has access to all data structures. 
n and m can only take integer values of the specified interval. In the following, only the calculation of U is considered. First of all, the algorithm has to be formulated in the high-level language PASCAL, which is required as input for the DECOMP. The algorithmic part of the input description, which has been formulated for N = 3, is shown in Fig. 4. It consists of four nested loops, two of them for the calculation of the sums over [...] and k and two of them for calculating the sums over the variable displacements in m and n. The input description is neither localized nor given in single assignment code. It should be noted that the declaration of all variables x and y as arrays of length one is necessary due to the prototype implementation of the DECOMP-compiler, which otherwise cannot distinguish between variables used for the calculation of data and variables used as loop counters. Further, for verification purposes the input description can be executed using a standard PASCAL compiler.
The given input description is translated into an RTP by the DECOMP-compiler, which at the same time automatically introduces SAC. Then a placement algorithm is applied which consistently changes [...] Table 3.
With the DG-derivation as described in Sec. 4 
