In this paper we present an extension of the class of piecewise linear algorithms (PLAs) 
Introduction
In the last two decades, much research has been devoted to parallel algorithms that can be systematically mapped onto a class of massively parallel architectures called processor arrays. Today these architectures are of great interest, since the increasing integration densities and minimal structure sizes of modern ULSI devices allow the implementation of hundreds of 32-bit microprocessors or more on a single die. Moreover, with the advent of reconfigurable architectures, the design of processor arrays has become as flexible as the design of software. Such arrays can efficiently solve a large number of problems in signal, image, and video processing, or numerical linear algebra. Key components of mapping methodologies that belong to the area of loop parallelization in the polytope model [5] are linear transformations and schedules, which are used to derive preferably homogeneous processor arrays with local and regular communication structures and a high degree of pipelining and parallelism. For this purpose, mostly only data-flow-dominant algorithms with static control have been considered.
Many computationally intensive algorithms from the domains listed above also contain a small and simple amount of control flow that cannot be evaluated in advance at compile time but has to be considered at run-time. In order to handle these algorithms as well, we propose an extension of the class of piecewise linear algorithms by one type of run-time conditional. We describe how these algorithms may be scheduled and mapped by adapting existing methods.
Related Work
Loop parallelization is of great interest for accelerating applications in either software or hardware. Transformations can be performed on a program given in imperative form or in single assignment code (SAC), where the whole parallelism is explicitly expressed. SAC is closely related to a set of recurrence equations, a formalism introduced by Karp, Miller, and Winograd [11]. This formalism has been used in many languages and has been extended over the years by affine dependencies and piecewise definitions. Examples are Systems of Affine Recurrence Equations (SARE), which are used in the Alpha language [3], the class of Affine Indexed Algorithms (AIA) [4], and the class of Piecewise Linear Algorithms (PLA) [19, 20]. None of these classes can handle dynamic data dependencies or is used to schedule them. Among parallelizing compilers, LooPo [6] is worth mentioning since it can handle not only static loop bounds, like the algorithm classes described above, but also while-loops.
In this area, only a few synthesis tools for the design of application-specific circuits exist: PICO Express [18], which was originally developed as PICO-N by the Hewlett-Packard Laboratories [12, 17]; Compaan [14], which deals with process networks; and PARO [2, 15], which is based on the class of PLAs. PARO is a design system project for modeling, transformation, optimization, and processor synthesis for the class of PLAs. PARO can be used during the process of automated synthesis of regular circuits.
Background and Notation
The purpose of this section is (i) to recapitulate the class of algorithms we are dealing with, called piecewise linear algorithms (PLAs), and (ii) to extend this algorithm class by one type of dynamic data dependency.
The class of PLAs has been defined in [19, 20]. This class extends the notion of regular iterative algorithms [16], which may be related to regular processor arrays. In the following, the properties of PLAs are defined:
Single assignment property: Any instance of an indexed variable appears at most once on the left-hand side of an equation, or all equations defining the same variable are identical.
Computability: There exists a partial ordering of the equations such that any instance of any variable appearing on the right-hand side of an equation appears on the left-hand side earlier in the partial ordering.
Execution Model: The execution model of programs is architecture independent. A program may be executed as follows: (1) All instances of equations are ordered respecting the above defined partial ordering. (2) The indexed variables are determined by successive evaluation of equations.
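The two steps of the execution model can be illustrated with a minimal Python sketch. The recurrence used here (a running sum `s[i] = s[i-1] + a[i]`) and the function name `evaluate` are hypothetical and not taken from the paper; the sketch only assumes that the dependencies between equation instances are known explicitly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def evaluate(a):
    """Evaluate the hypothetical recurrence s[i] = s[i-1] + a[i], s[0] = a[0]."""
    n = len(a)
    # Map every equation instance (variable, index) to the instances it reads.
    deps = {("s", i): ({("s", i - 1)} if i > 0 else set()) for i in range(n)}
    # Step (1): order all instances so that operands are defined first.
    order = list(TopologicalSorter(deps).static_order())
    # Step (2): determine the indexed variables by successive evaluation.
    s = {}
    for _, i in order:
        s[i] = a[i] + (s[i - 1] if i > 0 else 0)
    return order, [s[i] for i in range(n)]

order, values = evaluate([1, 2, 3, 4])
```

Any ordering that respects the partial order would give the same values; the topological sort is just one way to obtain such an ordering.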
The domains $\mathcal{I}_i$ are defined as follows:
Definition 3.3 (Linearly Bounded Lattice). A linearly bounded lattice denotes an index space of the form
$$\mathcal{I} = \{ I \in \mathbb{Z}^n \mid I = M\kappa + c \,\wedge\, A\kappa \geq b \}$$
where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$, and $b \in \mathbb{Z}^m$. The set $\{\kappa \in \mathbb{Z}^l \mid A\kappa \geq b\}$ denotes the set of integral points within a convex polyhedron or, in case of boundedness, within a polytope in $\mathbb{Z}^l$. This set is affinely mapped onto iteration vectors $I$ using an affine transformation ($I = M\kappa + c$).
Throughout the paper, we assume that the matrix M is square and of full rank. Then, each vector κ is uniquely mapped to an index point I. Furthermore, we require that the index space is bounded.
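As an illustration of the definition, the following Python sketch enumerates the points of a small linearly bounded lattice by brute force. The function name `lattice_points`, the `box` parameter (an assumed search range that must enclose the polytope), and the concrete matrices are hypothetical choices for this example only.

```python
from itertools import product

def lattice_points(M, c, A, b, box):
    """Enumerate I = M*kappa + c for all integer kappa with A*kappa >= b.

    `box` is an assumed list of (lo, hi) search bounds, one per component
    of kappa; it must enclose the polytope {kappa | A*kappa >= b}.
    """
    def matvec(X, v):
        return [sum(x * y for x, y in zip(row, v)) for row in X]

    points = []
    for kappa in product(*(range(lo, hi + 1) for lo, hi in box)):
        # Keep kappa only if every row of A*kappa >= b holds.
        if all(lhs >= rhs for lhs, rhs in zip(matvec(A, kappa), b)):
            points.append(tuple(m + ci for m, ci in zip(matvec(M, kappa), c)))
    return points

# Hypothetical instance: 0 <= k1 <= 2, 0 <= k2 <= 1, M = diag(2, 1), c = (1, 0).
M = [[2, 0], [0, 1]]
c = [1, 0]
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [0, -2, 0, -1]
pts = lattice_points(M, c, A, b, box=[(-5, 5), (-5, 5)])
```

Because M is square and of full rank in this instance, the six admissible vectors κ yield six distinct index points.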
In order to allow not only iteration-dependent conditionals $C^I(I)$, which are static and known at compile time, we extend the algorithm class in the following by run-time dependent conditionals.
Definition 3.4 (Run-Time Dependent Conditional). Let
where
Note that by this definition we can strictly partition each condition into an iteration-dependent conditional and a run-time dependent conditional (separability). Since both the run-time dependent conditional ($C_i^{RT}$) and the negated run-time dependent conditional ($\neg C_i^{RT}$) are present, the left-hand side variable of an equation is defined whenever $C_i^I(I)$ is fulfilled, and thus the computability property of a program remains satisfied. Furthermore, a corresponding static dependence graph of a DPLA can be specified, as will be shown subsequently. But first, in Ex. 3.1 and Ex. 3.2 we give examples of a DPRA and a DPLA, respectively.
Example 3.1
A PRA can be expressed by a so-called reduced dependence graph (RDG) [19]. Likewise, a DPRA can be expressed by an RDG extended by run-time dependent conditionals. Fig. 1 (a)-(d
Definition 3.6 (RCDG). The reduced control/dependence graph (RCDG) G = (V, E, D) associated with a dynamic regular algorithm as defined above is defined as follows: The set of nodes V can be divided into three disjoint subsets
In the following small example, all the definitions are brought together. In the majority of cases, the starting point is a given program in a high-level language like C or Java.
Example 3.3 Consider the following fictitious program fragment given in a pseudo language:
This program can be formulated as a DPRA as follows: [13]. A corresponding reduced dependence graph is depicted in Fig. 1 (e).
Scheduling of DPRAs
Let $w_C$ be the execution time needed to evaluate a run-time dependent conditional. Furthermore, let $w_{F_1}$ and $w_{F_0}$ be the execution times of the if- and the else-branch of an equation, respectively, and $w_{\max} = \max\{w_C, w_{F_1}, w_{F_0}\}$. Then, with respect to scheduling dynamic data dependencies, different cases can be considered:
Nearly balanced branches. The conditional branches are (nearly) balanced if the following condition holds: $w_C = w_{\max} \lor w_{F_1} = w_{F_0}$. Then, two hardware resource models might be considered:
• Assuming that enough resources are available, the different branches of a run-time dependent conditional may be executed in parallel to achieve the highest performance. These types of run-time conditionals are very common in image processing algorithms, where absolute, threshold, or min/max values are often computed. Because the execution times of the branches are balanced, an optimal static linear schedule can be derived at compile time.
• If the computation of one branch is more costly in hardware, it makes sense to share the resources, since the different branches of a conditional are mutually exclusive.
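The balance condition and the first resource model can be sketched in Python as follows. The function names and the absolute-value kernel are hypothetical illustrations; the sketch merely mimics hardware in which both branches are computed for every iteration and a multiplexer selects the result.

```python
def nearly_balanced(w_c, w_f1, w_f0):
    """Balance condition from the text: w_C = w_max or w_F1 = w_F0."""
    w_max = max(w_c, w_f1, w_f0)
    return w_c == w_max or w_f1 == w_f0

def absolute_values(xs):
    """Hypothetical kernel: both branches are evaluated for every iteration
    and a multiplexer-like select picks the result of the taken branch."""
    out = []
    for x in xs:
        cond = x >= 0                    # run-time dependent conditional
        f1 = x                           # result of the if-branch
        f0 = -x                          # result of the else-branch
        out.append(f1 if cond else f0)   # select (multiplexer)
    return out

balanced = nearly_balanced(1, 2, 2)
result = absolute_values([3, -4, 0, -1])
```

In the second resource model, `f1` and `f0` would instead be computed on a shared unit, which is possible because only one of the two results is ever used.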
Unbalanced branches. When the execution times of branches are different (unbalanced, |w
, linear static scheduling may lead to sub-optimal execution times, since the worst-case execution time is always given by the longest branch. In this case, worst-case and best-case run-time estimations might be of interest, or techniques such as loop shifting and compaction as in [7] can be considered in order to balance the branches. If the branches cannot be balanced properly, so that the overhead in execution time would not be tolerable, mixed scheduling concepts may be considered. These consist of mixed static/dynamic schedules or quasi-static schedules, in which events generated at run-time from the evaluation of dynamic data dependencies trigger statically optimized sub-schedules.
In the following we consider only the case in which the branches are simple¹ and nearly balanced, and both branches are computed in parallel to allow the fastest execution. Then, an optimal static schedule may be derived by formulating and solving a mixed integer linear program (MILP), similar to [9, 21]. For this purpose, a resource graph has to be specified in addition, which expresses the binding possibilities of operations to functional units as well as the execution times and pipeline rates of these units. For the sake of brevity, and because the concepts are well known, we omit the MILP formulation here and refer to [9, 21]. The MILP has to satisfy the following conditions: Obviously, one necessary condition for parallel execution is that the given resource graph contains disjoint binding possibilities of operations for each conditional and its branches. Then the parallel execution is ensured by the following constraints
where $\tau(v_i)$ denotes the relative start time of each operation $v_i$.
¹ The branches are not nested and can be encapsulated as shown in Fig. 1 (e) and Fig. 3, respectively.
Example, Edge Detection
Many computationally intensive applications for video and image processing consist of nested loop programs with only a few small run-time dependent conditionals. As an example, we consider in the following an edge detection algorithm, which is given as pseudo code in Fig. 2. In order to formulate the scheduling problem as a MILP, we have to specify the available resources. This can be expressed by a resource graph as shown in Fig. 4. An edge in this graph models the possibility that $v_i$ might be executed on one instance of resource type $r_k$. Each edge is associated with the execution time of node $v_i$ on resource type $r_k$. Furthermore, each resource type $r_k$ is associated with a number $\alpha(r_k)$ of available instances. In the example, the assignment and multiplex operations are considered to have a delay of zero clock cycles. Furthermore, the given resource graph permits the parallel execution of branches. The schedule of one possible optimal solution is shown in Fig. 5 for three subsequent instances. After a period of P = 3, a new computation can start on the same resources. In Fig. 6, a corresponding hardware realization is depicted.
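To make the role of the resource graph and the period concrete, the following Python sketch checks whether a given periodic schedule respects the number of available resource instances. It is not the MILP formulation of [9, 21]; it only verifies the resource constraint for a candidate solution. All operation names, resource types, and numbers are hypothetical and are not taken from Fig. 4 or Fig. 5, units are assumed to be non-pipelined (an operation occupies its unit for its whole execution time), and data dependencies are not checked.

```python
from collections import Counter

def feasible(ops, alpha, tau, binding, period):
    """Check a periodic schedule against a resource graph.

    ops:     {operation: {resource type: execution time}}  (graph edges)
    alpha:   {resource type: number of available instances}
    tau:     {operation: relative start time}
    binding: {operation: chosen resource type}
    period:  a new computation starts every `period` cycles
    """
    usage = Counter()
    for v, r in binding.items():
        if r not in ops[v]:
            return False  # no edge (v, r) in the resource graph
        # Occupy resource type r in every time slot of the operation,
        # folded modulo the period; zero-delay operations occupy no slot.
        for t in range(tau[v], tau[v] + ops[v][r]):
            usage[(r, t % period)] += 1
    return all(n <= alpha[r] for (r, _), n in usage.items())

# Hypothetical conditional with two branches and a zero-delay multiplexer.
ops = {"cmp": {"alu": 1}, "f1": {"alu": 1}, "f0": {"alu": 1}, "mux": {"mux": 0}}
binding = {"cmp": "alu", "f1": "alu", "f0": "alu", "mux": "mux"}
tau = {"f1": 0, "f0": 0, "cmp": 1, "mux": 2}

ok_two_alus = feasible(ops, {"alu": 2, "mux": 1}, tau, binding, period=2)
ok_one_alu = feasible(ops, {"alu": 1, "mux": 1}, tau, binding, period=2)
```

With two ALU instances both branches can start in the same cycle, whereas a single instance cannot host them in parallel, which reflects the necessary condition of disjoint binding possibilities.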
Conclusions and Future Directions
In this paper we presented an extension of the class of PLAs in order to model one type of dynamic data dependency. This extension significantly increases the range of applications that can be parallelized and mapped to massively parallel processor arrays. Furthermore, we outlined in which case these extensions can be used directly, with slight changes, within traditional mapping methodologies based on loop parallelization in the polytope model.
We are currently investigating nested and computationally intensive branches for which the parallel execution of both branches is too expensive [10]. In the future we would also like to consider unbalanced branches. Here, a two-stage scheduling methodology might be applied: within a branch, a static linear schedule can be determined at compile time, while around these static parts a dynamic or data-flow-driven concept has to be developed. In the case of reconfiguration at run-time, our resource graph has to be extended to allow the modeling of reconfiguration times in order to perform precise worst-case/best-case execution time estimations.
The new class of DPLAs introduced in this paper is currently being integrated into the PARO design system [1, 15]. Furthermore, in the future we would like to adapt our design methodology in order to also target coarse-grained reconfigurable architectures [8].
