Abstract. We present a new methodology for controlling the space-time behavior of VLSI and FPGA-based processor arrays. The main idea is to generate simple local control elements that control the activity of each attached processor element. Each control element propagates a "start" and a "stop execution" signal to its neighbors. We show that our control mechanism is much more efficient than existing approaches because 1) only two control signals (start/stop) are required, 2) no extension of the computation space is necessary, and 3) energy is saved by the local propagation of just one start and one stop signal, as processing elements are only active between the time they receive the start signal and the time they receive the stop signal. Our methodology is applicable to one- and multi-dimensional processor arrays and is based on local control signal propagation. We provide a theoretical analysis of the overhead caused by the control structure.
Introduction
Parallel processor arrays and mapping methodologies for such architectures are becoming more and more important, especially due to the advent of reusable reconfigurable hardware such as FPGAs [1, 20], and due to the increasing amount of computing power implementable on a single chip (SoC technology). Only if we are able to map computation-intensive algorithms onto such architectures efficiently are we able to exploit the benefit of the given technology. This paper contributes to the hardware implementation of computation-intensive algorithms specified as loop-like algorithms such as piecewise regular algorithms [16, 17] or uniform recurrence equations [7]; see also [2, 6].
The major task of mapping a given algorithm with indexed computations onto hardware is to assign space (processor index) and time to each computation in a computational domain that is often called index space. In the area of VLSI processor arrays, linear index transformations [8, 10, 11] are typically used and known as space-time mappings. Similar approaches are also used in the area of parallelizing compilers for supercomputers with linear transformations such as loop skewing, loop tiling, and loop permutation [19]. The problem addressed in this paper is that of controlling the operations of a given algorithm after it has been linearly transformed by a space-time mapping. Let the index space be described by a bounded polyhedron (polytope); then a space-time mapping leads to a skewed polytope. In such a polytope, the operations assigned to each processing element do not necessarily start executing at the same time 0. One could apply techniques of control generation such as those described in [13, 14, 21], which have in common that the index space is extended by dummy operations such that all processors start at the reset time t = 0. In order to control correct execution, a control signal is generated for each bounding hyperplane of the transformed index space such that predicates on control signals help a processing element to identify whether and what operation it has to perform at each time step. This is, however, inefficient in the sense that the number of control signals generated depends on the number of bounding hyperplanes of the algorithm and that much energy is consumed by the extension of index spaces and by the introduction of propagated control signals.
Here, we present a new idea for efficiently controlling the boundaries of a space-time transformed index space polytope. It introduces only two additional signals that indicate at each processing element the first and the last time step at which it has to execute an operation. These two signals are propagated locally to neighbor processing elements. The approach is energy-efficient in the sense that a processing element only consumes energy during the time interval between reception of its start and its stop signal.
The rest of the paper is structured as follows. First, in Section 2, we show how the principle works for 1-D arrays (2-D index spaces). In Section 3, we extend this generation of boundary control hierarchically to higher dimensions including optimization techniques to reduce the control overhead in terms of number of control elements. Finally, results are presented in Section 4.
Control Mechanism for Linear Processor Arrays
First, we present our approach to controlling the activity of linear processor arrays. Let $\mathcal{I}$ denote a two-dimensional convex index space defined as follows:
$$\mathcal{I} = \{ I \in \mathbb{Z}^d \mid A I \ge b \}, \quad A \in \mathbb{Z}^{m \times d},\ b \in \mathbb{Z}^m \qquad (1)$$
and let d = 2. Hence, $\mathcal{I}$ is the set of integral points in the intersection of m two-dimensional halfspaces
$$P_i = \{ I \in \mathbb{Z}^2 \mid A_i I \ge b_i \},$$
where $A_i$ is the i-th row vector of matrix A and $b_i$ is the i-th element of vector b. For each index vector $I = (p\ t)^T \in \mathcal{I}$, p denotes the processor index and t the (discrete) time index or simply time. Hence, a space-time mapping of the operations inside the polytope is assumed to be given.
Let $A_i = (A_{i,p}\ A_{i,t})$. If the element $A_{i,t} \ne 0$, the bounding hyperplane of $P_i$ can be written as a linear equation:
$$t = \frac{b_i - A_{i,p}\, p}{A_{i,t}} \qquad (2)$$
In Eq. (2), t denotes the time at which the bounding hyperplane of $P_i$ is crossed, depending on the processor index p. If $A_{i,t} = 0$, then p is constant. Fig. 1 shows an index space $\mathcal{I}$ with 7 bounding hyperplanes. Each integral point inside or on the border of the index space corresponds to a computation of a loop body of a given algorithm. For the purpose of this paper, it does not matter how complex and what type of computation is specified at each index point. In Fig. 1, we have already given a space-time mapping: The p-axis denotes the processor index, and t denotes the time of a computation. By projecting the polytope along the time axis, it can be seen that the corresponding processor array consists of 7 processing elements.
Boundary Control
It can be seen in Fig. 1 that some processor elements start their first computation at different times than others. A similar observation holds for the last time a processor performs a computation inside the index space polytope. Hence, control is necessary that tells each processing element when to start executing and when to stop executing.
In previous approaches to control generation such as [13, 14, 15], it was observed that the space-time-transformed index space of an algorithm that corresponds to a synchronously clocked processor array reset at time 0 must have the shape of a right prism, as indicated by dashed lines in Fig. 1 for the index space shown there. The authors proposed to extend the index space to such a right prism hull and to propagate control signals from the border of the array that tell a processing element whether it computes an operation inside I or not. The complexity of this kind of boundary control in terms of the number of control signals is m. Another disadvantage of the control proposed in [14, 21] is that the m generated control signals are propagated through the complete space I, hence causing an overhead of m|I| propagations and a correspondingly high energy consumption.
In the following, we develop our main idea of boundary control by propagating just a single start and a single stop signal to the processing elements, and only once during the complete algorithm execution. Before that, some definitions are in order.
We assume that I is an integral polytope. In this case, all vertices of I are integral points. Otherwise, we may apply the cutting plane algorithm according to [12] to compute the integer hull.
The vertex set V of I is defined as follows:
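For example, the vertex set of the index space shown in Fig. 1 is highlighted by black dots. In the following, let the processor index of $v_i \in V$ be denoted $p_i$ and the time index be denoted $t_i$ ($v_i = (p_i\ t_i)^T \in \mathcal{I}$). We define
$$\mathcal{I}_{tmin} = \{ v_i \in V \mid \forall v_j \in V : t_i \le t_j \}, \qquad \mathcal{I}_{tmax} = \{ v_i \in V \mid \forall v_j \in V : t_i \ge t_j \},$$
$$\mathcal{I}_{pmin} = \{ v_i \in V \mid \forall v_j \in V : p_i \le p_j \}, \qquad \mathcal{I}_{pmax} = \{ v_i \in V \mid \forall v_j \in V : p_i \ge p_j \}.$$
Hence, $\mathcal{I}_{tmin}$ denotes the set of vertices with the property that there is no vertex with a smaller time step, and so on. Let $I_{tmin}$ ($I_{tmax}$, $I_{pmin}$, $I_{pmax}$) be an arbitrary element of $\mathcal{I}_{tmin}$ ($\mathcal{I}_{tmax}$, $\mathcal{I}_{pmin}$, $\mathcal{I}_{pmax}$). Finally, we define two predicates TLO and THI on index vectors $I = (p\ t)^T$ as follows:
$$TLO(I) \Leftrightarrow I \in \mathcal{I} \wedge \neg\exists\, I' = (p\ t')^T \in \mathcal{I} : t' < t$$
$$THI(I) \Leftrightarrow I \in \mathcal{I} \wedge \neg\exists\, I' = (p\ t')^T \in \mathcal{I} : t' > t$$
In other words, TLO(I) (THI(I)) is true for an index point $I = (p\ t)^T$ iff $I \in \mathcal{I}$ and there is no point $I' = (p\ t')^T$ in $\mathcal{I}$ at the same processor index p with a smaller (bigger) time step than t. We use these predicates in order to define the boundary control graph $G(\mathcal{I}) = (V_G(\mathcal{I}), E_G(\mathcal{I}))$ with vertices $V_G(\mathcal{I})$ and edges $E_G(\mathcal{I})$: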
In Fig. 2, a sample index space I and the corresponding graph G(I) are shown. Since TLO($I_{tmin}$) and THI($I_{tmax}$) hold, $I_{tmin}, I_{tmax} \in V_G(I)$.
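The predicates TLO and THI can be illustrated with a minimal sketch that enumerates, for each processor index p, the first and last time step inside a 2-D index space given as A I >= b. The function name `boundary_times`, the brute-force enumeration over a bounding box, and the example polytope are assumptions made for illustration only; they are not part of the methodology described here.

```python
from typing import Dict, List, Tuple

def boundary_times(A: List[Tuple[int, int]], b: List[int],
                   p_range: range, t_range: range) -> Dict[int, Tuple[int, int]]:
    """Map each processor index p to its (first, last) active time step.

    TLO holds at (p, first) and THI holds at (p, last).
    """
    times: Dict[int, Tuple[int, int]] = {}
    for p in p_range:
        # (p, t) is in the index space iff every halfspace A_i . (p, t) >= b_i holds.
        active = [t for t in t_range
                  if all(a_p * p + a_t * t >= b_i
                         for (a_p, a_t), b_i in zip(A, b))]
        if active:
            times[p] = (min(active), max(active))
    return times

# Hypothetical polytope (not the one in Fig. 2): 0 <= p <= 3, t >= p, t <= 4.
A = [(1, 0), (-1, 0), (-1, 1), (0, -1)]
b = [0, -3, 0, -4]
times = boundary_times(A, b, range(-1, 6), range(-1, 8))
print(times)
```

For this example, processor p starts at time p and every processor stops at time 4, so only the start times differ between neighbors.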
The idea of boundary control is now to traverse the boundary control graph node by node, starting in a node I tmin ∈ I tmin in two directions (L for left and R for right) as indicated in Fig. 3 until a node I tmax ∈ I tmax is reached. These two paths can be represented as vectors, where each vector component is a pair 
The traces L and R are called boundary control paths. Each vector element of a boundary control path represents the flow of a control signal from a control element to the next. The implied hardware structure of control elements is defined in the next subsection.
Control Hardware
All index vectors $I = (p\ t)^T$ with the same p-coordinate are mapped onto the same processor element. Thus, the operations at index points $I_{tmin}$, $I_{tmax}$, $I_{pmin}$, $I_{pmax}$ are mapped onto the processor elements with indices $p_{tmin}$, $p_{tmax}$, $p_{pmin}$, $p_{pmax}$, respectively.
The hardware structure of our control methodology is defined as follows: To each processing element, we associate a dedicated controller called control element (CE). Each control element has exactly two control input/output pairs. Control elements are interconnected locally with neighbor control elements in the form of a control chain as indicated in Fig. 4 , showing the control chain for the index space in Fig. 2 .
Thereby, each trace L and R represents a control signal flow from $p_{tmin}$ to $p_{tmax}$. A control signal that initiates the execution of a processing element is passed from $p_{tmin}$ to $p_{pmin}$ and $p_{pmax}$. Each processor element in between starts its operation upon receiving this start signal. When a control signal is passed back from $p_{pmin}$ and $p_{pmax}$ to $p_{tmax}$, the processor elements in between stop their execution (stop signal). The CE at location $p_{tmin}$ has a start input which must be set active for one clock cycle when the computation of the array starts. Two control signals are duplicated from the start signal and propagated to the first and the last CE in the control chain (at $p_{pmin}$ and $p_{pmax}$) and then back to the CE at $p_{tmax}$. The control signals are delayed in each CE, and both control signals must arrive at $p_{tmax}$ at the same time step. Thus, the sum of the delays must be the same for both control signal paths. By determining an appropriate delay for both control signals when passing from one CE to the next, we obtain a control mechanism for controlling the boundary of an index space I.
Fig. 4. Control chain for the index space in Fig. 2. The data links between the processor elements are not shown here.
Each CE thus handles two control signals and therefore has two control input/output pairs, except for the CEs at $p_{tmin}$ and $p_{tmax}$, which are slightly different. A control signal on an input is passed to the corresponding output after a delay of δ cycles. An additional control output (enable) is connected to the attached processor element. This enable signal is set active when the first (start) control signal arrives and is reset when the second (stop) control signal arrives. The processor element is active as long as the enable signal is set and stopped otherwise.
Computation of δ-Delays
The number δ of time steps a control signal must be delayed before being passed to the next CE depends on the number of isotemporal hyperplanes that the corresponding edge $e = (v, w) \in E_G(I)$ crosses in the index space I. Thus, we define functions $delay_L : \{0, \ldots, k-1\} \to \mathbb{N}^3$ and $delay_R : \{0, \ldots, k-1\} \to \mathbb{N}^3$ that return the triples $(i, j, \delta_L)$ and $(i, j, \delta_R)$ with the following semantics: A control signal passed from the CE at $p_i$ to the CE at $p_j$ must be delayed for $\delta_L$ ($\delta_R$) time steps. Remember that each vector element $l_i$ ($r_i$) of the vector L (R) is of the form 
Computing both $delay_L$ and $delay_R$ on their whole domains gives us the complete structure of the control chain and all necessary $\delta_L$ and $\delta_R$ delays of the CEs.
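As a rough illustration of how such (i, j, δ) triples relate to the boundary of the index space, the following sketch derives per-hop delays from the first and last active time step of each processor. It is a simplification under stated assumptions, not the definition of $delay_L$ and $delay_R$: processor indices are assumed contiguous, start hops run outward from $p_{tmin}$, stop hops run from both array ends back to $p_{tmax}$, and the turn-around delay at the two end elements is not modeled. The name `chain_delays` and the example data are hypothetical.

```python
from typing import Dict, List, Tuple

Times = Dict[int, Tuple[int, int]]   # p -> (first, last) active time step
Hop = Tuple[int, int, int]           # (i, j, delta): CE at p_i -> CE at p_j

def chain_delays(times: Times, p_tmin: int, p_tmax: int) -> Tuple[List[Hop], List[Hop]]:
    """Return (start_hops, stop_hops) for a linear array with contiguous indices."""
    ps = sorted(times)
    p_lo, p_hi = ps[0], ps[-1]
    start_hops: List[Hop] = []
    stop_hops: List[Hop] = []
    # Start signal: outward from p_tmin toward both ends of the array.
    for step, end in ((-1, p_lo), (+1, p_hi)):
        p = p_tmin
        while p != end:
            q = p + step
            start_hops.append((p, q, times[q][0] - times[p][0]))
            p = q
    # Stop signal: from both ends of the array back toward p_tmax.
    for step, end in ((+1, p_lo), (-1, p_hi)):
        p = end
        while p != p_tmax:
            q = p + step
            stop_hops.append((p, q, times[q][1] - times[p][1]))
            p = q
    # For a convex index space no hop needs a negative delay.
    assert all(d >= 0 for _, _, d in start_hops + stop_hops)
    return start_hops, stop_hops

# Hypothetical convex boundary: earliest start at p = 2, latest stop at p = 3.
example = {0: (2, 3), 1: (1, 4), 2: (0, 5), 3: (1, 6), 4: (3, 5)}
start_hops, stop_hops = chain_delays(example, p_tmin=2, p_tmax=3)
print(start_hops)
print(stop_hops)
```

Each hop delay equals the number of time steps between the boundary points of two neighboring processors, which matches the intuition of counting isotemporal hyperplanes crossed by a boundary edge.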
Architecture of the Control Elements
A refinement of the internal circuitry of a control element is shown in Fig. 5. The CE consists of two binary counters used for realizing the delay of the control signals. Each counter is controlled by a delay logic that starts the counter when a control signal event occurs at the $in_L$ ($in_R$) input. The counter is stopped and reset after $\delta_L$ ($\delta_R$) clock cycles, and the $out_L$ ($out_R$) output is set active for one clock cycle. A small finite state machine (FSM) is used to generate the signal pe_enable, which is set active when the $in_L$ ($in_R$) signal arrives and is reset when the $in_R$ ($in_L$) signal arrives.
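The behavior described above can be mimicked by a small cycle-level software model. This is only a sketch under assumptions: the class name `ControlElement`, the `tick` interface, and the choice to model the FSM as a toggle (the first arriving event sets pe_enable, the second resets it) are illustrative and do not describe the actual circuit of Fig. 5. An input pulse in cycle c is assumed to produce the output pulse in cycle c + δ.

```python
from typing import Dict, Optional, Tuple

class ControlElement:
    """Software model of a CE with two delay counters and an enable FSM."""

    def __init__(self, delta_l: int, delta_r: int) -> None:
        self.delta: Dict[str, int] = {"L": delta_l, "R": delta_r}
        self.count: Dict[str, Optional[int]] = {"L": None, "R": None}  # None = idle
        self.pe_enable = False

    def tick(self, in_l: bool = False, in_r: bool = False) -> Tuple[bool, bool]:
        """Advance one clock cycle and return the (out_L, out_R) pulses."""
        outs: Dict[str, bool] = {}
        for side, pulse in (("L", in_l), ("R", in_r)):
            if pulse:
                # Assumed FSM: the first event enables the PE, the second disables it.
                self.pe_enable = not self.pe_enable
                self.count[side] = 0          # start the delay counter
            out = False
            current = self.count[side]
            if current is not None:
                if current == self.delta[side]:
                    out = True                # one-cycle output pulse
                    self.count[side] = None   # stop and reset the counter
                else:
                    self.count[side] = current + 1
            outs[side] = out
        return outs["L"], outs["R"]

# Start event on in_L in cycle 0, stop event on in_R in cycle 3.
ce = ControlElement(delta_l=2, delta_r=1)
trace = [ce.tick(in_l=(c == 0), in_r=(c == 3)) + (ce.pe_enable,) for c in range(5)]
for cycle, row in enumerate(trace):
    print(cycle, row)
```

In this run the processor element is enabled during cycles 0 to 2, out_L pulses in cycle 2, and out_R pulses in cycle 4.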
The CE architecture differs from that shown in Fig. 5 at processor locations $p_{tmin}$ and $p_{tmax}$ since they have an additional start input or ready output. In the case of $p_{tmin} = p_{tmax}$, we use another type of control element that has both start and stop signals.
The computed delays for the space-time transformed index space introduced in Fig. 2 are represented by the slope of the edges of the boundary control paths in Fig. 3, i.e., the number of isotemporal hyperplanes crossed by each edge corresponds to the δ-delay. The δ-delays are also annotated at the interconnect of the control elements in Fig. 4.
Boundary Control for Multi-Dimensional Arrays
The methodology of boundary control presented in Section 2 is applicable only to two-dimensional index spaces (1-D arrays). Here, we show how the concept may be extended to multi-dimensional arrays by a hierarchical application of the principle introduced earlier.
Multi-dimensional Control
An extension of our model to algorithms with index spaces of higher dimensions is possible due to the observation that each n-dimensional convex polytope can be decomposed into a finite set of (n − 1)-dimensional convex polytopes by slicing it using a hyperplane partitioning. Such a slicing algorithm will be used in this section to extend the method of boundary control to higher-dimensional index spaces. For example, a space-time mapped 3-D index polytope corresponds to a two-dimensional processor array. A hyperplane partitioning in 3-D space divides the 2-D processor space into a set of 1-D (linear) arrays, see, e.g., Fig. 6. For each of the partitions, we may apply the boundary control for 2-D polytopes as introduced in Section 2. Finally, an additional controller is required that generates the start signal for each of the linear control chains. For this purpose, we introduce another global controller chain. A sample control structure of a two-dimensional processor array is shown in Fig. 6.
The number of required additional control elements obviously depends on the orientation of the slices of the original polytope. If the slices in Fig. 6 were oriented in the vertical direction, 8 additional control elements instead of 4 would be required. Other hyperplane directions used for slicing the 3-D space would lead to different directions of controller chains and different numbers of required global control elements.
Hyperplane Partitioning Algorithm
Our slicing algorithm is applicable to any convex index space of arbitrary dimension d > 1. For illustrative reasons, however, the algorithm is described here only for d = 3. The algorithm decomposes a given index space I into a finite number of parallel, (d − 1)-dimensional hyperplanes such that
- the hyperplane normal vector chosen is not equal to the time axis, and
- the number of hyperplane partitions is minimal.
The last condition minimizes the number of additional control elements.
Optimal Hyperplane Direction
Let an index space I according to Def. 1 be given, and let the index space be space-time transformed. This means that, without loss of generality, we may assume that the index points $I = (p_0\ p_1 \ldots p_{d-2}\ t)^T$ have d − 1 leading processor indices and the time t as the d-th coordinate.
The first step in our algorithm is to find the slice-minimal orientation of the hyperplanes. Let $v = (v_0\ v_1 \ldots v_{d-2}\ v_t)$ be the normal vector of feasible hyperplanes (being orthogonal to the time axis). This problem is very similar to finding an optimal schedule vector for processor arrays, see [4, 18], and can be solved as follows: Find a vector v that minimizes the number of hyperplanes, i.e., that minimizes $\max\{ v(I_2 - I_1) \mid A I_1 \ge b \wedge A I_2 \ge b \}$. This problem can be represented in its dual form as one single linear program as follows:
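As a hedged sketch of how such a dual form can be derived (the symbols $y_1$, $y_2$ and the exact side constraints on v are assumptions, and the formulation may differ from the linear program referred to as Eq. (4)): for a fixed v, linear programming duality applied to the inner maximization over a non-empty polytope gives

```latex
\max\{\, v (I_2 - I_1) \mid A I_1 \ge b,\; A I_2 \ge b \,\}
  \;=\;
\min\{\, -(y_1 + y_2)\, b \mid y_1 A = v,\; y_2 A = -v,\; y_1 \ge 0,\; y_2 \ge 0 \,\},
\qquad y_1, y_2 \in \mathbb{Q}^{1 \times m}.
```

Minimizing the right-hand side jointly over $v$, $y_1$, and $y_2$ then yields a single linear program. In such a formulation, additional constraints are needed to exclude the trivial solution $v = 0$ and to keep v orthogonal to the time axis ($v_t = 0$); how these are expressed is left open here.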
This linear program may be solved in polynomial time. For example, for the processor space shown in Fig. 6 , the optimal hyperplane obtained is v = (0 1 0) leading to a hyperplane with 4 control elements.
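For small processor arrays, the effect of the slicing direction can also be checked by brute force instead of solving a linear program. The sketch below counts the non-empty slices for a few candidate normal vectors restricted to the processor coordinates (the time component is dropped since it is zero). The names `slice_count` and `best_direction`, the candidate list, and the 8 × 4 array are assumptions for illustration; the array only loosely mirrors the 4-versus-8 comparison made for Fig. 6.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[int, ...]

def slice_count(points: Iterable[Point], v: Sequence[int]) -> int:
    """Number of distinct hyperplanes v . p = z that contain at least one processor."""
    return len({sum(vi * pi for vi, pi in zip(v, p)) for p in points})

def best_direction(points: List[Point], candidates: List[Point]) -> Point:
    """Candidate normal vector with the fewest non-empty slices."""
    return min(candidates, key=lambda v: slice_count(points, v))

# Hypothetical 8 x 4 processor array with indices (p0, p1).
grid = [(p0, p1) for p0 in range(8) for p1 in range(4)]
candidates = [(1, 0), (0, 1), (1, 1)]
for v in candidates:
    print(v, slice_count(grid, v))
print("best:", best_direction(grid, candidates))
```

Here the normal vector (0, 1) yields 4 slices and thus 4 top-level control elements, whereas (1, 0) would require 8.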
The purpose of the top-level linear array of control elements is to inject only start signals into the local linear control chains that are obtained by the hyperplane partitioning. This is because the control elements at the lowest level generate the stop signals that finish operations; stop signals are not needed at higher levels. Hence, the control hierarchy is obtained by having one processing element at which the start signal is injected. This element, see, e.g., Fig. 6, belongs to a hyperplane containing a processing element that starts at a globally minimal time step. All neighbor elements obtain a delayed start signal. On the top level, this delay is given by the difference between the minimal time step of one hyperplane and the minimal time step of the index space in the neighbor partition.
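The top-level delays described above can be sketched as follows. The function groups processors into slices by z = v · p, determines the minimal start time of each slice, and reports the delay of each hop of the top-level chain moving away from the slice that starts first. The name `top_level_delays`, the input format, and the example start times are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

def top_level_delays(first_time: Dict[Tuple[int, ...], int],
                     v: Sequence[int]) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Return (z_start, hops) where each hop is (z_from, z_to, delay)."""
    t_min: Dict[int, int] = {}
    for p, t in first_time.items():
        z = sum(vi * pi for vi, pi in zip(v, p))
        t_min[z] = min(t, t_min.get(z, t))
    zs = sorted(t_min)
    # The start signal is injected at the slice with the globally minimal time step.
    k = min(range(len(zs)), key=lambda i: t_min[zs[i]])
    hops: List[Tuple[int, int, int]] = []
    for i in range(k, 0, -1):                 # chain toward lower z
        hops.append((zs[i], zs[i - 1], t_min[zs[i - 1]] - t_min[zs[i]]))
    for i in range(k, len(zs) - 1):           # chain toward higher z
        hops.append((zs[i], zs[i + 1], t_min[zs[i + 1]] - t_min[zs[i]]))
    return zs[k], hops

# Hypothetical first active time step of each processor (p0, p1), sliced by v = (0, 1).
first_time = {(0, 0): 3, (1, 0): 4, (0, 1): 0, (1, 1): 1, (0, 2): 2, (1, 2): 5}
z_start, hops = top_level_delays(first_time, (0, 1))
print(z_start, hops)
```

In the example, the slice z = 1 starts first; its neighbors z = 0 and z = 2 receive the start signal 3 and 2 time steps later, respectively.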
In case of more than 3 dimensions, the hyperplane method can be applied iteratively to obtain a control hierarchy with one linear array at the top level. Each of its elements starts a linear control array at the next lower level, and so on, until a control chain at the lowest level is started.
For completeness, we specify next how the description of the slices is obtained.
Description of Slices
Once a hyperplane direction v has been found, we can proceed to the next lower hierarchical level by a) computing all non-empty slices, and b) applying the boundary control to each slice of dimension d − 1.
In the following, we simply present an algorithm for computing the descriptions of each slice, given an index space I and a hyperplane partitioning vector v.
All non-empty slices satisfy:
The first slice is obtained by solving the linear program
$$z_{min} = \min\{ vI \mid AI \ge b \}.$$
The last slice is obtained by solving the linear program
$$z_{max} = \max\{ vI \mid AI \ge b \}.$$
By the convex nature of the index space, there is a non-empty slice for each value $z_{min} \le z \le z_{max}$. Each hyperplane description is given by $\{ I \in \mathbb{Z}^d \mid vI = z \wedge AI \ge b \}$. In order to apply the control to the next lower level of hierarchy, we need to eliminate a variable of I to obtain a description with d − 1 variables. This can be done easily by choosing a variable of the index vector I whose coefficient in v is nonzero, e.g., $p_0$, solving vI = z for $p_0$, i.e.,
$$p_0 = \frac{1}{v_0} \Big( z - \sum_{i \ge 1} v_i p_i - v_t t \Big),$$
and replacing $p_0$ in AI ≥ b with the right-hand side of the above equation.
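The substitution step can be sketched in a few lines of code. Given a row $A_i I \ge b_i$ and the slice equation vI = z, eliminating variable k yields the row $\sum_{j \ne k} (A_{i,j} - (A_{i,k}/v_k) v_j) I_j \ge b_i - (A_{i,k}/v_k) z$. The function name `eliminate`, the use of exact fractions, and the example box are assumptions made for illustration.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

def eliminate(A: Sequence[Sequence[int]], b: Sequence[int],
              v: Sequence[int], z: int, k: int = 0
              ) -> Tuple[List[List[Fraction]], List[Fraction]]:
    """Substitute variable k (solved from v . I = z) into A . I >= b."""
    if v[k] == 0:
        raise ValueError("variable k must have a nonzero coefficient in v")
    A_new: List[List[Fraction]] = []
    b_new: List[Fraction] = []
    for row, rhs in zip(A, b):
        f = Fraction(row[k], v[k])
        # Coefficients of the remaining variables after the substitution.
        A_new.append([row[j] - f * v[j] for j in range(len(row)) if j != k])
        b_new.append(rhs - f * z)
    return A_new, b_new

# Hypothetical 3-D box: 0 <= p0 <= 3, 0 <= p1 <= 1, 0 <= t <= 2.
A = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
b = [0, -3, 0, -1, 0, -2]
A2, b2 = eliminate(A, b, v=(1, 1, 0), z=2)   # slice p0 + p1 = 2, eliminate p0
for row, rhs in zip(A2, b2):
    print([int(c) for c in row], ">=", int(rhs))
```

For the slice p0 + p1 = 2, the constraint p0 ≥ 0 becomes −p1 ≥ −2 and p0 ≤ 3 becomes p1 ≥ −1, while the constraints on p1 and t are unchanged.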
Results
In this section, our methodology is demonstrated on a realistic example. As an instance, we consider the well-known problem of LU-decomposition. In the LU-decomposition, a given matrix $C \in \mathbb{R}^{N \times N}$ is decomposed into C = A · B, where $A \in \mathbb{R}^{N \times N}$ is a lower triangular and $B \in \mathbb{R}^{N \times N}$ is an upper triangular matrix. This problem may be formulated by a recursive algorithm [9]. A corresponding piecewise regular algorithm is given by the following set of quantified equations
where the index space is given by
For illustration purposes, let N = 5 in the following. As mentioned before, linear transformations as in Equation (5) are used as space-time mappings [10, 11] in order to assign a processor index $p \in \mathbb{Z}^{n-1}$ (space) and a sequencing index $t \in \mathbb{Z}$ (time) to index vectors $I \in \mathcal{I}$.
$$\begin{pmatrix} p \\ t \end{pmatrix} = \begin{pmatrix} Q \\ \lambda \end{pmatrix} I \qquad (5)$$
In Eq. (5), $Q \in \mathbb{Z}^{(n-1) \times n}$ and the schedule vector $\lambda \in \mathbb{Z}^{1 \times n}$. Now, assume that we have obtained an optimal space-time mapping by exploring the design space in terms of cost and performance. In [6], we have presented efficient pruning techniques for the search of optimal projection vectors (space-time mappings). We only summarize the main ideas: 1) only consider co-prime vectors, and 2) only consider co-prime vectors with the property that at least two points in I are projected onto each other. This leads to a search space of co-prime vectors in a convex polytope called the difference body of points in I. Finally, in this reduced search space, we can exploit symmetry to exclude search vectors v' = −v such that typically only few projection vector candidates v have to be investigated. Ehrhart polynomials [3, 5] may be evaluated to count the number of points (control elements) in the projected index space.
Let u = (0 1 0) T be a projection vector and λ = (0 1 2) be a schedule vector candidate during exploration. Herewith, index points I are mapped onto processors p at time t
This point set can also be described by the following polytope $\mathcal{I}_{pt}$:
Fig. 7. In (a), the transformed index space is shown for u = (0 1 0)^T and λ = (0 1 2). The isotemporal hyperplanes are visualized. The matching processor array is shown in (b). In (c), the control structure for the processor array is shown.
The transformed index space is visualized in Fig. 7 (a); its shape resembles a skewed pyramid. The triangular shape of the processor array is shown in Fig. 7 (b). For example, at time steps t = 0 and t = 1, only five processor elements perform computations (marked as white points in Fig. 7); at times t = 2 and t = 3, in total nine processors are used; at time t = 4, twelve processors; at time t = 5, seven processors; and so on.
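A space-time mapping of this kind can be made concrete with a short sketch. The schedule vector λ = (0 1 2) and the projection vector u = (0 1 0)^T are taken from the text; the allocation matrix Q below is an assumption (one possible integer matrix with Q u = 0), as are the function name `space_time_map` and the sample index point.

```python
from typing import Sequence, Tuple

def space_time_map(I: Sequence[int], Q: Sequence[Sequence[int]],
                   lam: Sequence[int]) -> Tuple[Tuple[int, ...], int]:
    """Return (p, t) with p = Q . I (processor index) and t = lam . I (time)."""
    p = tuple(sum(q * x for q, x in zip(row, I)) for row in Q)
    t = sum(l * x for l, x in zip(lam, I))
    return p, t

u = (0, 1, 0)                  # projection vector from the text
lam = (0, 1, 2)                # schedule vector from the text
Q = [(1, 0, 0), (0, 0, 1)]     # assumed allocation matrix, chosen so that Q . u = 0

# Points that differ by a multiple of u share a processor but not a time step.
p_a, t_a = space_time_map((4, 3, 2), Q, lam)
p_b, t_b = space_time_map((4, 4, 2), Q, lam)
print(p_a, t_a, p_b, t_b)
```

Both sample points are mapped onto the same processor (4, 2) at consecutive time steps 7 and 8, since λ · u = 1.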
Next, an optimal hyperplane direction is determined. By solving the corresponding linear program (see Eq. (4)), the normal vector v = (1 0 0) is obtained as an optimal solution vector. Using this direction, all processors with k = 0 can be started at time 0. Therefore, a broadcast is sent to all the white processor elements in Fig. 7 (c). In Fig. 8 (a) 
and for the right path R
The controlled processors are shown in Fig. 8 (b); the edges between the controller elements are annotated by the delays $\delta_L$ and $\delta_R$, respectively. At time 0, the processor element $p_{40}$ is enabled by its controller. After two time steps ($\delta_R = 2$), the next processor element $p_{41}$ is enabled. Every second time unit, the start signal is propagated to the right neighbor and the corresponding processor is enabled. The first stop signal is produced at time step 4, where processor $p_{40}$ is disabled. Then, one processor element from $p_{41}$ to $p_{44}$ is turned off at each subsequent time step. Applying our new control methodology to the LU-decomposition algorithm shows several advantages compared to the classical approaches to control generation [13, 14, 21]. An extension of a given index space to a right prism is not necessary. In our case of the LU-decomposition, only approximately half of the number of data transfers (propagations) is necessary. The energy reduction is evident.
Fig. 8. In (a), a slice for i = 4 of the mapped index space $\mathcal{I}_{pt}$ is visualized. In (b), the corresponding controller structure of the processor array is shown. The edges between the controller elements are annotated by the δ-delays. The lower edges between the control elements propagate a start signal to the right neighbor every second time step. After the delay $\delta_L = 4$, the stop signal is subsequently sent from one control element to its right neighbor.
Conclusion
In this paper, we have introduced a new method for controlling the operations of loop-like algorithms when mapped onto processor arrays. The idea is to reconstruct the border of an index polytope of computation by propagating a single start and a single stop event locally to neighbor processors, which identifies the execution interval of each processor. The effort amounts to a fixed number of two control signals. In contrast to existing approaches, our approach is much more area- and energy-conscious. Our method can be seen as complementary to existing control generation approaches that are still necessary, e.g., for replacing iteration-dependent predicates of operations inside a given index space by predicates on locally propagated control signals.
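The presented approach can easily be extended to a more general class of index spaces, so-called linearly bounded lattices:
$$\mathcal{I} = \{ I \in \mathbb{Z}^n \mid I = M\kappa + c \wedge A\kappa \ge b \}$$
where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$ and $b \in \mathbb{Z}^m$. As throughout this paper, $\{ \kappa \in \mathbb{Z}^l \mid A\kappa \ge b \}$ defines an integral convex polyhedron or, in case of boundedness, a polytope in $\mathbb{Z}^l$. This set is affinely mapped onto iteration vectors I using an affine transformation ($I = M\kappa + c$).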
Our new distributed loop control methodology is integrated as part of the PARO design system and can be used during the process of automated synthesis of regular circuits. PARO is a design system project for modeling, transforming, optimization, and processor synthesis for the class of piecewise linear algorithms. For further information on the PARO design system project, check the following website: http://www-date.upb.de/research/paro/.
