The paper presents the parallelizing programming environment CoDe-X, which introduces hardware/software codesign strategies on two levels of partitioning for data-driven Xputer-based accelerators. At the first level, CoDe-X performs a profiling-driven host/accelerator partitioning for performance optimization. At the second level, it performs a resource-driven sequential/structural partitioning of the accelerator source code to optimize the utilization of its reconfigurable resources. CoDe-X accepts a C dialect that also includes optional data-procedural language features, which can be used to achieve the highest acceleration factors the Xputer hardware can provide.
Introduction
Today, increasingly complex and computation-intensive tasks have to be performed by computers. From empirical studies [30] it can be concluded that most of the computation time is spent in rather simple loop constructs. Since these loop constructs are, in addition, combined with indexed array data structures, ordinary von Neumann style computers are burdened mainly with addressing computations. First efforts to reduce addressing overhead and to introduce parallelism were undertaken with the development of pipelined and vector supercomputers [23], [17]. Together with the achievements in supercomputer technology, parallelizing compilers have been developed, where compilation is based on data dependence analysis [2], [32] and produces executable code for different parallel target machines. Unfortunately, the hardware structures do not reflect the structure of the algorithms very well, which restricts the exploitation of the parallelism inherent in the algorithm to a large extent.
Emanating from the technology of field-programmable logic (FPL), the new paradigm of structural programming has evolved. Instead of loading the program code sequentially into memory (procedural programming), hardware structures are configured to fulfill the application needs (structural programming). With the development of Custom Computing Machines (CCMs: [13], [16], [24]), the combination of structural and procedural programming has been developed further by FCCMs [4], [5]. The hardware/software co-design community has introduced a number of approaches to speed up performance, to optimize the hardware/software trade-off and to reduce total design time [6], [7], [21]. The von Neumann paradigm (used in the FCCM approach) does not efficiently support "soft" hardware because of its extremely tight coupling between instruction sequencer and ALU: architectures fall apart as soon as the data path is changed. So a new paradigm is desirable, like that of Xputers, which conveniently supports "soft" ALUs such as the rALU concept (reconfigurable ALU) [11] or the rDPA (reconfigurable data path array) [18], [19]. In such an environment parallelizing compilers require two levels of partitioning: a host/accelerator partitioning (first level) for optimizing performance, and a structural/sequential partitioning (second level) for optimizing the hardware/software trade-off of the Xputer resources. Furthermore, the application development environment CoDe-X combines three programming paradigms into one more powerful approach: the control-procedural paradigm reflected in C language features, the data-procedural paradigm realized in an optional language extension for specifying selected data-procedural application parts that are executed faster by Xputer hardware, and the structural programming paradigm for the reconfigurable Xputer hardware components.
To stress the significance of this application development methodology, the paper first gives an introduction to the underlying hardware platform. Section 3 presents the codesign framework CoDe-X and its strategies including the option of additional data procedural features for experienced users. In section 4 computation-intensive application examples from the area of image processing are discussed.
Xputer-based Accelerators
Many applications require the same data manipulations to be performed on a large amount of data, e.g. statement blocks in nested loops. The Xputer machine paradigm aims at the acceleration of such applications. Xputers are especially designed to reduce the von Neumann bottleneck of repetitive decoding and interpreting address and data computations. The data memory, which is local to each Xputer module, is primarily organized two-dimensionally, but can also be interpreted as higher-dimensional. The data is arranged in a special order, called the data map, to optimize the data access sequences. The scan windows serve as the interface of the rALU subnets to the data memory. The main memory of the host and the local memories are logically a single shared memory. A large amount of input data is typically organized in arrays, where the array elements are referenced as operands of a computation in the current iteration. The sequence of data accesses in iterations shows a regularity which allows this sequence to be described by a number of parameters. The reconfigurable generic address generators (GAGs) of the data sequencer (DS) interpret these parameters and compute generic address sequences to access the data. This results in a controlled movement of the scan windows over the data map, each controlled by a single GAG. Each time the scan windows move one step, a compound operator of the corresponding rALU is evaluated. Thus the data sequencer represents the main control instance of an Xputer.

The data sequencer can be implemented with reconfigurable logic. The control part is provided with features for reconfiguration and is realized with a single Xilinx XC4013 FPGA. Because of the wide datapath of the generic address generator, such a fine-grained structure is not suitable. All I/O operations, as well as the memory management, are done by the host, taking advantage of the functionality of the host's operating system.
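The parameter-driven address generation described above can be illustrated in plain C. The following is only a minimal software sketch under stated assumptions: the struct `ScanParams`, its field names and the function `video_scan` are hypothetical and do not reflect the actual GAG parameter set or hardware interface. It shows how a handful of parameters (start position, limits, step widths) suffice to generate the address sequence of a row-by-row scan over a two-dimensional data map.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter set for one scan pattern. The field names are
   illustrative only, not the actual GAG register layout. */
typedef struct {
    size_t base;          /* address of data-map element (0,0)       */
    size_t width;         /* number of elements per data-map row     */
    size_t x0, y0;        /* first scan-window position              */
    size_t x_end, y_end;  /* last scan-window position (inclusive)   */
    size_t dx, dy;        /* step widths in x and y direction        */
} ScanParams;

/* Writes the window addresses of a row-by-row scan into out and
   returns how many addresses were produced (at most cap). */
size_t video_scan(const ScanParams *p, size_t *out, size_t cap)
{
    size_t n = 0;
    assert(p->dx > 0 && p->dy > 0);
    for (size_t y = p->y0; y <= p->y_end; y += p->dy) {
        for (size_t x = p->x0; x <= p->x_end; x += p->dx) {
            if (n == cap)
                return n;
            /* linear address of position (x, y) in the data map */
            out[n++] = p->base + y * p->width + x;
        }
    }
    return n;
}
```

In this sketch the loop nest plays the role of the address generator. In the Xputer, the same sequence would be produced by hardware, so that no addressing instructions have to be fetched and decoded per iteration.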
The host has direct access to all memories on the Xputer modules via a small bus interface (IF, see figure 1). On the other hand the Xputer can access data in the host's memory. The Xputer activation, its synchronization and data transfers with the host are controlled by the Xputer Run Time System (XRTS) [25] .
The H/S Co-Design Framework CoDe-X
For the hardware platform described above, a partitioning and parallelizing compilation framework, CoDe-X, is being implemented. CoDe-X is based on two-level hardware/software co-design strategies and accepts X-C source programs (figure 2). X-C (Xputer-C) is a C dialect. CoDe-X consists of a 1st level partitioner (partially implemented), a GNU C compiler, and an X-C compiler (fully implemented).
The X-C source input is partitioned at the first level into a part for execution on the host (host tasks, which may also contain dynamic structures and operating system calls) and a part for execution on the Xputer (Xputer tasks). Parts for Xputer execution are expressed in a C subset, which lacks only dynamic structures [27]. At the second level this input is partitioned by the X-C subset compiler into a sequential part for the DS and a structural part for the rDPA. Additionally, experienced users may directly include data-procedural MoPL code (Map-oriented Programming Language [1]) in the original C description of applications, or can add new functions described in MoPL to the library so that every user can access them (see figure 2).
Less experienced users may use these special MoPL library functions like C function calls to take full advantage of the high acceleration factors made possible by the Xputer paradigm. In this section the profiling-driven host/Xputer partitioning is sketched briefly. Then the inclusion of data-procedural options for experienced and less experienced users is explained.

Optional Data-Procedural Features
The CoDe-X framework contains an optional data-procedural extension, with which experienced users can use and build a generic function library containing a large repertory of Xputer-library functions. These should be functions that are well suited for Xputer use and yield high acceleration factors compared to single-processor workstations, e.g. different kinds of filters, matrix operations, etc. The scan patterns as well as the rALU compound operators can be described in the Xputer language MoPL [1], which fully supports all hardware features.
The input specification is then compiled into Xputer code, namely the data map, the data sequencer code and the rALU code, depending on the concrete function calls in the input X-C program, which are similar to C function calls. One difference is the use of generic parameters, where the dimensions of the actual parameters have to be specified (see figure 3). In this example an 8 by 10 matrix is multiplied (see figure 3). Users who are not familiar with MoPL do not need to build new library functions. An experienced user who wants to create new Xputer-library functions can describe the required functions in MoPL (described in detail, including examples, in [1]) and include them in an existing library or create a new library. As X-C is an ANSI C extension, it is also possible to program MoPL parts directly into C programs (figure 3) if no suitable library function is available. For users familiar with MoPL, the optional data-procedural language features make the highest acceleration factors provided by Xputer hardware possible. An example of using Xputer-library functions as well as including MoPL code directly in an X-C program is given in section 4.
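The notion of a generic function whose actual parameter dimensions are specified at the call can be mimicked in plain C by passing the sizes next to the arrays. The sketch below is an illustration only: the name `matmul` and its signature are hypothetical and are not taken from the Xputer library or from figure 3, and the code is ordinary sequential C rather than X-C or MoPL.

```c
#include <assert.h>

/* Computes c = a * b for row-major integer matrices, where
   a is rows x inner, b is inner x cols and c is rows x cols.
   The dimensions are passed explicitly, so one function serves
   all matrix sizes. */
void matmul(const int *a, const int *b, int *c,
            int rows, int inner, int cols)
{
    assert(rows > 0 && inner > 0 && cols > 0);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int acc = 0;
            for (int k = 0; k < inner; k++)
                acc += a[i * inner + k] * b[k * cols + j];
            c[i * cols + j] = acc;
        }
    }
}
```

For an 8 by 10 first operand, a call would pass `rows = 8` and `inner = 10`. The value of `cols` depends on the second operand, whose shape is not stated in the text.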
Resource-driven second level Partitioning
The X-C compiler realizes the 2nd level of partitioning and translates an X-C program into code which can be executed on the Xputer. The compiler performs a data and control flow analysis. First the control flow of the program graph is partitioned according to the algorithm of Tarjan [31], resulting in a partial execution order. This algorithm partitions the program graph into subgraphs, which are called Strongly Connected Components. These components correspond to connected statement sequences, for example fully nested loops, which are possibly parallelizable. The Xputer hardware resources are exploited in an optimized way by analyzing whether the structure of statement sequences can be mapped well to the Xputer hardware, providing instruction-level parallelism and avoiding reconfigurations or idle hardware resources [27]. Further details and examples about the X-C compiler can be found in [15], [27], [28].
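The partitioning of a program graph into strongly connected components can be sketched as follows. This is a textbook version of Tarjan's algorithm on a small adjacency matrix, given only for illustration; the types `Graph` and `TarjanState`, the limit `MAX_NODES` and the function names are assumptions and are not taken from the X-C compiler.

```c
#include <assert.h>

#define MAX_NODES 16

typedef struct {
    int n;                          /* number of nodes (statements)    */
    int adj[MAX_NODES][MAX_NODES];  /* adj[u][v] != 0 means edge u->v  */
} Graph;

typedef struct {
    int index[MAX_NODES], low[MAX_NODES], on_stack[MAX_NODES];
    int stack[MAX_NODES], sp, next_index, n_comp;
} TarjanState;

static void visit(const Graph *g, TarjanState *s, int u, int comp[])
{
    s->index[u] = s->low[u] = s->next_index++;
    s->stack[s->sp++] = u;
    s->on_stack[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->adj[u][v])
            continue;
        if (s->index[v] < 0) {               /* tree edge: recurse */
            visit(g, s, v, comp);
            if (s->low[v] < s->low[u])
                s->low[u] = s->low[v];
        } else if (s->on_stack[v] && s->index[v] < s->low[u]) {
            s->low[u] = s->index[v];         /* edge back into the stack */
        }
    }
    if (s->low[u] == s->index[u]) {          /* u is the root of a component */
        int v;
        do {
            v = s->stack[--s->sp];
            s->on_stack[v] = 0;
            comp[v] = s->n_comp;
        } while (v != u);
        s->n_comp++;
    }
}

/* Stores the component number of every node in comp[] and returns the
   number of components. Components are numbered in reverse topological
   order of the condensed graph. */
int strongly_connected_components(const Graph *g, int comp[])
{
    TarjanState s = {0};
    assert(g->n >= 0 && g->n <= MAX_NODES);
    for (int i = 0; i < g->n; i++)
        s.index[i] = -1;
    for (int i = 0; i < g->n; i++)
        if (s.index[i] < 0)
            visit(g, &s, i, comp);
    return s.n_comp;
}
```

Because the components come out in reverse topological order, traversing them from the highest number to the lowest yields an order in which every component is visited after all components it depends on, which corresponds to the partial execution order mentioned above.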
Example: Image Processing Algorithm
Smoothing operations are used primarily for diminishing spurious effects that may be present in a digital image as a result of a poor sampling system or transmission channel.
Neighborhood averaging is a straightforward spatial-domain technique for image smoothing. Given an N x N image f(x,y), the procedure is to generate a smoothed image g(x,y), whose gray level at each point (x,y) is obtained by averaging the gray-level values of the pixels of f contained in a predefined neighborhood (kernel) of (x,y). In other words, the smoothed image is obtained by using the equation:

$$g(x,y) = \frac{1}{M} \sum_{(n,m) \in S} f(n,m)$$
for x, y = 0, 1, ..., N-1. S is the set of coordinates of points in the neighborhood of (but not including) the point (x,y), and M is a pre-computed normalization factor.

This small example for illustrating the methods of CoDe-X was divided into four tasks. Two of them were filtered out in the first step for being executed on the host in any case. These were tasks containing I/O routines for reading input parameters and routines for plotting the image, which cannot be executed on the Xputer. The remaining two tasks were potential candidates for mapping onto the Xputer. In one of these two tasks strip mining was applied by the 1st level partitioner. The resulting smaller independent loops can be executed in parallel on different Xputer modules. In addition, the X-C compiler performed loop unrolling up to the limit of available hardware resources (see section 3.3). The final acceleration factor in comparison with a SUN SPARC 10/51 was 73. For details about this example see [14].
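Neighborhood averaging and the effect of strip mining on its outer loop can be sketched in plain C. The sketch rests on several assumptions that are not taken from the paper: the neighborhood S is the 8-neighborhood of a pixel, border pixels average only the neighbors that exist (so M is counted per pixel instead of being pre-computed), and the function names `smooth_rows` and `smooth_strip_mined` are hypothetical. The strips are processed one after another here, whereas the text assigns them to different Xputer modules.

```c
#include <assert.h>

/* Smooths rows [row_begin, row_end) of the n x n image f into g.
   Each output pixel is the average of its existing 8-neighbors;
   the pixel itself is excluded from the sum. */
void smooth_rows(const unsigned char *f, unsigned char *g, int n,
                 int row_begin, int row_end)
{
    assert(n > 1 && 0 <= row_begin && row_begin <= row_end && row_end <= n);
    for (int y = row_begin; y < row_end; y++) {
        for (int x = 0; x < n; x++) {
            int sum = 0, m = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if ((dx || dy) && yy >= 0 && yy < n && xx >= 0 && xx < n) {
                        sum += f[yy * n + xx];
                        m++;
                    }
                }
            }
            g[y * n + x] = (unsigned char)(sum / m);
        }
    }
}

/* Strip mining: the row loop is cut into strips of strip_rows rows.
   Every strip only reads f and writes its own rows of g, so the
   strips are independent of each other. */
void smooth_strip_mined(const unsigned char *f, unsigned char *g, int n,
                        int strip_rows)
{
    assert(strip_rows > 0);
    for (int begin = 0; begin < n; begin += strip_rows) {
        int end = begin + strip_rows < n ? begin + strip_rows : n;
        smooth_rows(f, g, n, begin, end);
    }
}
```

The independence of the strips follows from the fact that the input image is never modified and each strip writes a disjoint range of output rows. This is the property that allows the smaller loops to be distributed.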
Other applications from the area of image processing are sharpening algorithms, which are special cases of discrete convolution for detecting edges. The data-procedural function shown in figure 4 is executed by the accelerator in any case and is transformed by CoDe-X into the corresponding MoPL code optimized for Xputer hardware execution. First the parameter transfer is done, then different declarations (scan pattern, window, rALU subnet) are performed, and finally the main program part applies a video scan over the image, computing in every step the sharpening value of one pixel. In figure 4 the MoPL code is outlined only briefly; for details about programming with MoPL please see [1]. If no library function for sharpening had been available, the MoPL code in figure 4 could also have been included directly in the X-C program by experienced users.

Figure 4. Transformation of a data procedural function into corresponding MoPL-code

A novel characteristic of CoDe-X is the combination of the control-procedural and data-procedural paradigms into one input specification language, X-C, whereas the structural programming paradigm for the reconfigurable Xputer hardware components is integrated into the synthesis path of CoDe-X transparently for users.
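The video scan used for sharpening, in which a window moves over the image and one output pixel is computed per step, can be sketched in plain C. This sketch is not the MoPL code of figure 4. The kernel coefficients (identity plus a 4-neighbor Laplacian), the border handling and the function names `sharpen` and `clamp_u8` are assumptions chosen for illustration, since the paper does not list them.

```c
#include <assert.h>

/* Example 3x3 sharpening kernel. The coefficients are an assumption;
   the paper does not state which kernel is used. */
static const int KERNEL[3][3] = {
    { 0, -1,  0},
    {-1,  5, -1},
    { 0, -1,  0}
};

/* Limits a value to the gray-level range 0..255. */
static unsigned char clamp_u8(int v)
{
    return (unsigned char)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Moves a 3x3 window over the image row by row and writes one
   sharpened pixel per step. Border pixels are copied unchanged. */
void sharpen(const unsigned char *f, unsigned char *g, int width, int height)
{
    assert(width >= 3 && height >= 3);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                g[y * width + x] = f[y * width + x];
                continue;
            }
            int acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += KERNEL[ky + 1][kx + 1]
                         * f[(y + ky) * width + (x + kx)];
            g[y * width + x] = clamp_u8(acc);
        }
    }
}
```

In terms of the Xputer model, the two outer loops correspond to the scan pattern, the 3x3 neighborhood to the scan window, and the weighted sum to the compound operator evaluated by the rALU at each step.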
Conclusions
