Abstract
Introduction
The perennial need for faster computation by exploiting concurrency has driven the development of parallel architectures. On the one hand, general-purpose parallel computers consist of farms of SIMD or MIMD processors. These computers offer an optimal platform for the parallel execution of number-crunching algorithms from the fields of image processing, numerical simulation, visualization, etc. Therefore, a major field of research is the automatic parallelization of software loops (e.g., for or while loops in C) onto parallel architectures. PIPS is a state-of-the-art parallelizing compiler based on the intuitive polytope model [8]. On the other hand, reducing integration densities and considerations of area, cost, and power for reasons of mobility and portability have led to the development of domain-specific architectures. There are numerous well-known software/hardware alternatives, including digital signal processors (DSPs), custom application-specific integrated circuits (ASICs), application-specific instruction processors (ASIPs), field-programmable gate arrays (FPGAs), and also lesser-known ones such as coarse-grained processor arrays (CGAs). DSPs and ASIPs exploit instruction-level parallelism (ILP), which may not suffice for many applications, thus calling for flexible alternatives such as multi-DSP systems, ASICs, FPGAs, etc. These solutions can exploit both iteration-level and instruction-level parallelism. Accelerators for loop programs in ASICs, CGAs, and FPGAs can be realized as massively parallel processor arrays. A processor array consists of a 1-D or 2-D array of processing elements (PEs) surrounded by memory banks and a hierarchy of caches for fast, parallel data access. These architectures also call for mapping tools in order to realize their full potential.

* Supported in part by the German Science Foundation (DFG) in project under contract TE 163/13-1
The mapping tools should be able to obtain synthesizable architecture descriptions (ASICs, FPGAs) or generate programs for a given fixed processor array architecture (e.g., CGAs) from software descriptions. In the last two decades, a lot of research in academia and industry has spawned state-of-the-art design tools such as PICO-Express [11], MMalpha [2], PARO [10], etc. A perpetual challenge for such tools is matching the resource constraints of a given architecture (e.g., number of PEs, functional units, memory banks, I/O pins) with the parallelized program or generated hardware description through the interaction of program transformations and architecture constraints. Partitioning is a fundamental transformation for parallelizing compilers which tackles this problem of hardware matching. Partitioning divides the index space of the loop program into disjoint subsets called tiles and schedules the corresponding iterations. Popular partitioning schemes are local parallel global sequential (LPGS) and local sequential global parallel (LSGP). Massive parallelism may be expressed by different types of hierarchical parallelism: (1) several processing elements (PEs) working in parallel, (2) functional and software pipelining, (3) multiple functional units within one PE, and finally, (4) sub-word parallelism (SWP) within the PEs. The mentioned partitioning schemes do not suffice to match the available massive parallelism with I/O rate and hierarchical memory constraints. Therefore, co-partitioning, which partitions an already partitioned index space, was introduced to enhance hardware matching [5]. This idea has been generalized to n-hierarchical partitioning, where an index space is recursively tiled n times in order to match hierarchical massive parallelism, hierarchical memory constraints, and I/O rates. In this paper, we introduce for the first time a detailed methodology encompassing hierarchical partitioning.
First, Section 2 reviews related work. Section 3 then gives a brief description of our design flow. Section 4 contains an intuitive introduction to and an algebraic formulation of basic partitioning transformations. In Section 5, we present our exact methodology for hierarchical partitioning. Finally, in Section 6, we conclude and present an outlook on future work.
Related Work
Partitioning is known in compiler theory under different names such as strip-mining or loop tiling [13]. Traditional loop tiling as used in compiler theory introduces integer division, mod, ceil, and floor operators, which lead to index spaces that are non-convex sets. This is not only inefficient for hardware designs but is also not allowed in recurrence equations [9], which form the underlying framework of design methodologies in the polytope model. The DTSE methodology [1] from IMEC is another compilation method based on the polytope model; it has studied hierarchical partitioning, but only for general-purpose embedded processors and not for processor arrays or multi-processors. PIPS tries to solve the problem of hardware matching by multi-dimensional scheduling (in other words, partitioning in the processor-time domain). Notable among partitioning schemes is co-partitioning, which was introduced in [5].
Background and Notation
Loop parallelization in the polytope model has its origin in research on the automatic generation of systolic arrays. However, that research lacked a treatment of the compilation of loop algorithms under architectural constraints, i.e., for a given fixed processor array architecture. In this section, we first give a brief overview of our existing mapping methodology PARO. The starting point is an algorithmic description given as a set of recurrence equations. The algorithm description is then transformed by embedding of variables, localization (vectorization), and other transformations in the polytope model for the purpose of hardware generation. Furthermore, a space-time mapping of the transformed program is carried out in order to obtain an architecture description, which serves as input to a back-end code generator that compiles the program onto a given processor array architecture, or to HDL generation [6], [4].
In the following, we illustrate the notation for piecewise linear algorithms (PLAs) with the help of an FIR filter example.
with the Fig. 1(b) ). This enables a regular communication structure in hardware and enhances data reuse. After localization of variable u, the FIR filter has the following PLA description.
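As a rough illustration of what localizing u means, the following minimal Python sketch compares a direct FIR filter with a single-assignment version on the 2-D index space (i, j). It is an assumed, standard formulation and not a transcription of the PLA in the paper: the function names, the zero initial condition for u, and the choice to read the coefficients a directly (i.e., without localizing a) are illustrative assumptions. The uniform dependence vectors (1, 1) for u and (0, 1) for y are the ones listed in Table 2.

```python
def fir_direct(a, u):
    """Reference FIR filter: y[i] = sum_j a[j] * u[i - j], with u[k] = 0 for k < 0."""
    return [
        sum(a[j] * u[i - j] for j in range(len(a)) if i - j >= 0)
        for i in range(len(u))
    ]


def fir_localized(a, u):
    """Single-assignment FIR on the index space (i, j) with variable u localized.

    Hypothetical formulation: every value U[i, j] and Y[i, j] is written once.
    """
    T, N = len(u), len(a)
    U, Y = {}, {}
    for i in range(T):
        for j in range(N):
            if j == 0:
                U[i, j] = u[i]              # input enters at the border j = 0
            elif i == 0:
                U[i, j] = 0                 # samples before the start are zero
            else:
                U[i, j] = U[i - 1, j - 1]   # uniform dependence d = (1, 1)
            x = a[j] * U[i, j]              # a is read directly (not localized)
            # Accumulate along j: uniform dependence d = (0, 1)
            Y[i, j] = x if j == 0 else Y[i, j - 1] + x
    return [Y[i, N - 1] for i in range(T)]
```

In the localized version each PE only needs the value of u from its diagonal neighbour, which is what gives the regular communication structure mentioned above.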
Partitioning
Partitioning is a well-known transformation that covers the index space of a computation using congruent hyperplanes, hyperrectangles, or parallelepipeds called tiles. The transformation has been studied in detail for compilers, and its use has led to program acceleration through better cache reuse on sequential processors (i.e., loop tiling or blocking) [13] and to the implementation of algorithms on given parallel architectures ranging from supercomputers to multi-DSPs and FPGAs. It is carried out in order to match a loop nest implementation to resource constraints in terms of the available number of processing elements (PEs), local memory, and communication bandwidth. Well-known partitioning techniques are multi-projection, LSGP (local sequential global parallel, often also referred to as clustering or blocking), and LPGS (local parallel global sequential, also referred to as tiling). Formally, partitioning divides the index space $\mathcal{I}$ using congruent tiles such that it is decomposed into spaces $\mathcal{I}_1$ and $\mathcal{I}_2$, i.e., $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2$. $\mathcal{I}_1 \subseteq \mathbb{Z}^n$ represents the points within a tile, and $\mathcal{I}_2 \subseteq \mathbb{Z}^n$ accounts for the regular repetition of the tiles, i.e., the origin of each tile. In the case of parallelepiped-shaped tiles, the tiles are defined by a tiling matrix $P$. Hierarchical partitioning methods have been studied in [5]. These partitioning techniques use different hierarchies of tiling matrices to divide the index space.

Co-partitioning² is an example of a 2-level hierarchical partitioning [5]: the index space is first partitioned into LS (local sequential) tiles, and this tiled index space is then tiled once more using GS (global sequential) tiles, as shown in Fig. 1. Formally, it is defined as the splitting of an index space into spaces $\mathcal{I}_1$, $\mathcal{I}_2$, and $\mathcal{I}_3$, i.e., $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3$, using two congruent tiles defined by the tiling matrices $P_{LS}$ and $P_{GS}$. $\mathcal{I}_1 \subseteq \mathbb{Z}^n$ represents the points within the LS tiles, and $\mathcal{I}_2 \subseteq \mathbb{Z}^n$ accounts for the regular repetition of the origins of the LS tiles (i.e., the tiles marked with dashed lines in Fig. 1(a)). $\mathcal{I}_3 \subseteq \mathbb{Z}^n$ accounts for the regular repetition of the GS tiles (i.e., the bigger tiles marked with solid lines in Fig. 1(a)). Similarly, an n-hierarchical partitioning method splits the index space $\mathcal{I}$ into n + 1 spaces. The different partitioning schemes such as LSGP, LPGS, and co-partitioning are defined by specific schedules, which are typically realized through appropriate affine transformations defining the allocation and scheduling (see Section 5.4).

Example 4.1 Co-partitioning of the iteration space of the localized FIR filter example and of the FIR filter with no localization of variable a is shown in Fig. 1(a) and 1(b), respectively. The loop matrices used for tiling are
The new PLA obtained after applying co-partitioning to the PLA given in Eq. (3) is shown in Eq. (4), which can also be checked against Fig. 1. One can verify that the index point I = (4, 3) is uniquely

² Co-partitioning uses both LSGP and LPGS methods in order to balance local memory requirements with I/O bandwidth, with the advantage of problem size independence.

Therefore, the problem dealt with in this paper is the following: given an input PLA with both uniform and affine dependencies, how does one obtain an output PLA that preserves the dependencies after hierarchical partitioning? The approach presented in this paper is not limited to processor arrays but can also be used for program analysis in parallelization. In the next section, we introduce our methodology for hierarchical partitioning of PLAs.
Hierarchical Partitioning
The methodology proposed in this section encompasses all possible partitioning techniques (i.e., LSGP, LPGS, co-partitioning, ...). The only major assumption is the standard use of congruent parallelepiped tiles for partitioning. Fig. 1(b) shows a partitioned index space with affine and uniform data dependencies. The first step, tiling of the index space, is equivalent to the problem of strip-mining or loop tiling in compiler theory. However, our methodology goes one step further by embedding the data dependencies in the new tiled index space. The advantage of this data dependence analysis step is that we remain in the polytope framework, which offers the possibility of mapping the algorithms onto massively parallel architectures. Modern compilers use the tiling transformation only to enhance data reuse and hence create faster programs by reducing cache misses on uni-processor machines. Therefore, no embedding of dependencies is done in modern compilers. Partitioning embeds not only the data dependencies but also the iteration-dependent conditionals in the new index space. Furthermore, the new data dependencies are themselves associated with unique iteration-dependent conditionals. Finally, scheduling is an important step that describes the space and time coordinates of the execution of each iteration in the tiled index space and is needed for hardware generation. Our approach to partitioning therefore consists of four steps: tiling, embedding of data dependencies, embedding of control conditionals, and scheduling. These steps are explained in the following subsections.
Tiling: Decomposition of the Index Space
Similar to the idea that square tiling of a loop nest of depth 2 gives a loop nest of depth 4, n-hierarchical partitioning converts a global iteration space of dimension m into an (n + 1) · m-dimensional index space (for loop tiling, n = 1). That is, the global iteration space $\mathcal{I}$ is decomposed into the direct sum of n + 1 subspaces $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_{n+1}$ such that $\mathcal{I} \subseteq \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \ldots \oplus \mathcal{I}_{n+1}$. $\mathcal{I}_1$ accounts for the index points in the innermost tiles, $\mathcal{I}_2$ accounts for the regular repetition of the origins of the innermost tiles (i.e., of $\mathcal{I}_1$), and so on; together they form the new index space. The tiles are parallelepipeds and are described by the n tiling matrices $(P_1, P_2, \ldots, P_n)$ and a tiling offset $q$, which describes the origin. If we assume the initial index space to be of the form $\mathcal{I} = \{I \mid A \cdot I \ge b\}$, then the iteration bounds of the new index space $\mathcal{I}_{new}$ are represented as follows:
$\det(P_i)$ and $P^{n} = \prod_{i=1}^{n} P_i$. Proj denotes the orthogonal projection onto the subspace defined by the variables in $\mathcal{I}_n$, which eliminates all variables in $\mathcal{I}$. Note that the above linearly bounded lattices (LBLs) can be written as for loops.
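To make the decomposition concrete, the following minimal Python sketch tiles the 12 × 6 FIR iteration space twice (n = 2, i.e., co-partitioning) with rectangular tiles. The tile sizes are hypothetical, chosen only for illustration, as are all function names; the offset q is assumed to be zero and the tiles are assumed to divide the space exactly. The sketch uses div and mod purely to compute tile coordinates for checking purposes, whereas the methodology itself expresses the tiled space by linear inequalities.

```python
from itertools import product

SPACE = (12, 6)             # FIR iteration space: 0 <= i <= 11, 0 <= j <= 5
TILES = [(2, 3), (3, 2)]    # hypothetical diagonals of P_1 (LS) and P_2 (GS)


def decompose(point, tiles=TILES):
    """Return [I_1, ..., I_{n+1}] with point = I_1 + P_1*I_2 + P_1*P_2*I_3 + ..."""
    parts, rest = [], tuple(point)
    for sizes in tiles:
        parts.append(tuple(r % s for r, s in zip(rest, sizes)))
        rest = tuple(r // s for r, s in zip(rest, sizes))
    parts.append(rest)
    return parts


def recompose(parts, tiles=TILES):
    """Inverse of decompose: fold the levels back, starting from the outermost."""
    point = parts[-1]
    for sizes, part in zip(reversed(tiles), reversed(parts[:-1])):
        point = tuple(p + s * q for p, s, q in zip(part, sizes, point))
    return point


def tiled_points(space=SPACE, tiles=TILES):
    """Enumerate the tiled index space as a loop nest of depth (n + 1) * m."""
    # Number of tiles per dimension at the outermost level (exact division assumed).
    outer = list(space)
    for sizes in tiles:
        outer = [o // s for o, s in zip(outer, sizes)]
    ranges = [range(s) for sizes in tiles for s in sizes]
    ranges += [range(o) for o in outer]
    m = len(space)
    for flat in product(*ranges):
        # Regroup the flat loop counters into the vectors I_1, ..., I_{n+1}.
        yield [tuple(flat[k:k + m]) for k in range(0, len(flat), m)]
```

With these assumed tile sizes, `decompose((4, 3))` yields the three vectors (0, 0), (2, 1), and (0, 0), and `recompose` maps every point produced by `tiled_points` back to a distinct point of the original space, which is the uniqueness property the direct sum requires.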
Example 5.1 The iteration space of the FIR filter example in Section 3 ($\mathcal{I} = \{(i\ j)^T : 0 \le i \le 11,\ 0 \le j \le 5\}$)
Embedding: Splitting of Dependencies
The existing affine and regular dependencies need to be embedded in the new tiled iteration space. This step introduces new equations with new dependencies and iteration dependent conditionals in case dependencies cross different tiles. All equations (as in Def. 
The purpose of embedding is to embed the equations in the new index space as follows:
) (as all the R i terms cancel each other). Therefore the problem is to find all distinct
In [12], it was shown that for n = 1 (i.e., simple partitioning) one can set up a constraint polytope and enumerate all its points to find the different possible values of $R_0$. In this section, we introduce the method for embedding dependencies for n-hierarchical partitioning. We first explain the extension with the help of co-partitioning (2-hierarchical partitioning). Co-partitioning amounts to partitioning an index space twice, i.e., $\mathcal{I} \to \mathcal{I}_1 \oplus \mathcal{I}_2 \oplus \mathcal{I}_3$; however, the tiling and embedding operations are applied only once. Therefore, we need to define the embedding operation for hierarchical partitioning as follows.
therefore the problem is to find all
From Eq. (5) and (c), we infer
Similarly, from Eq. (5) and (b), we infer
Hence, $R_0$ must satisfy
After replacing $R_0$, we get the following set of inequalities, i.e., the constraint polytope.
The above polytope has 5n variables, and one must enumerate all its integral points, which yields the distinct values of $(R_0, R_1)$. For each distinct value, a new equation is generated. The enumeration is done by scanning the integer points in the rectangular hull of the polytope and keeping those that lie inside the polytope. The above argument can be extended by induction to n-hierarchical partitioning and gives the following polytope.
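The effect of this enumeration can be illustrated with a brute-force Python sketch that avoids the polytope altogether: it visits every index point, decomposes the target and the source of a uniform dependence d, and records which tile-crossing offsets occur. Everything here is an assumption made for illustration: the rectangular tile sizes, the function names, and in particular the convention $J_1 = I_1 - d + P_1 R_0$, $J_2 = I_2 - R_0 + P_2 R_1$, $J_3 = I_3 - R_1$ (with J the source of the dependence), whose signs and arrangement may differ from the convention used in the paper.

```python
from itertools import product

SPACE = (12, 6)             # FIR iteration space
TILES = [(2, 3), (3, 2)]    # hypothetical diagonals of P_LS and P_GS


def decompose(point, tiles=TILES):
    """Return [I_1, ..., I_{n+1}] for rectangular tiles with zero offset."""
    parts, rest = [], tuple(point)
    for sizes in tiles:
        parts.append(tuple(r % s for r, s in zip(rest, sizes)))
        rest = tuple(r // s for r, s in zip(rest, sizes))
    parts.append(rest)
    return parts


def tile_offsets(d, space=SPACE, tiles=TILES):
    """Collect the distinct (R_0, ..., R_{n-1}) occurring for dependence vector d."""
    found = set()
    for target in product(*(range(n) for n in space)):
        source = tuple(t - k for t, k in zip(target, d))
        if not all(0 <= s < n for s, n in zip(source, space)):
            continue  # source lies outside the index space (border case)
        tgt, src = decompose(target, tiles), decompose(source, tiles)
        carry, offsets = tuple(d), []
        for level, sizes in enumerate(tiles):
            # Assumed convention: src[level] = tgt[level] - carry + P_level * R_level
            r = tuple((s - t + c) // p
                      for s, t, c, p in zip(src[level], tgt[level], carry, sizes))
            offsets.append(r)
            carry = r  # the offset of this level is the "dependence" of the next
        found.add(tuple(offsets))
    return found
```

For the dependence (1, 1) of variable u this sketch finds six distinct offset pairs, and for the dependence (0, 1) of variable y it finds two, so each dependence is split into that many equations under the assumed tiling. Scanning the constraint polytope, as described above, achieves the same result without visiting every iteration, which matters when the problem size is large or parametric.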
Iteration dependent Conditionals
In this subsection, we transform the initial iteration-dependent conditionals into the tiled space. Furthermore, embedding leads to new equations, which in turn are associated with unique iteration-dependent conditionals. For n-hierarchical partitioning, each new equation has the following new conditionals, depending on the corresponding $(R_0, \ldots, R_{n-1})$, because the conditions
$\ldots,\ Q I_{n+1} + R_n \in \mathcal{I}_{n+1}$ need to be guaranteed. These conditions are guaranteed by the following inequalities.
The above conditionals contain many inequalities, which are costly for hardware implementation. A larger number of iteration-dependent conditionals means that control costs burden the computation for partitioned examples. However, after the removal of redundant inequalities (e.g., inequalities already implied by the iteration space), one obtains a simplified form of the iteration-dependent conditionals for practical examples. Furthermore, assume that an initial iteration-dependent conditional ($I \in \mathcal{I}_c$) is a linearly bounded lattice of the following form.
Then the transformed conditional is as follows:

The set of equations in the example denotes the operations to be performed for a single iteration. These operations and iterations need to be scheduled on the processor array. This is done using an affine transformation that realizes the different partitioning schemes, as discussed in the next section.
Scheduling
Linear transformations are used as space-time mappings in order to assign a processor p (space) and a sequencing index t (time) to index vectors [6]. In LSGP, all index points within a tile are executed sequentially by the same processor, and the index points in different tiles are executed in parallel by different processors. Therefore, the number of processors is equal to the number of tiles. This observation is incorporated into the affine transformation that defines the space-time mapping for the LSGP method (Definition 5.1).

Table 2. The iteration-dependent control conditionals for the co-partitioned FIR filter example.

| Variable | Control conditional | $Q$ | $d$ | $R_0$ | $R_1$ |
|---|---|---|---|---|---|
| a | Empty | $\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$ | $(0\ 0)^T$ | $(0\ 0)^T$ | $(0\ 0)^T$ |
| u | $i_1 > 0 \wedge j_1 > 0$ | $E$ | $(1\ 1)^T$ | $(0\ 0)^T$ | $(0\ 0)^T$ |
| u | $i_1 = 0 \wedge j_1 > 0 \wedge i_2 > 0$ | $E$ | $(1\ 1)^T$ | $(1\ 0)^T$ | $(0\ 0)^T$ |
| u | $i_1 > 0 \wedge j_1 = 0 \wedge j_2 > 0$ | $E$ | $(1\ 1)^T$ | $(0\ 1)^T$ | $(0\ 0)^T$ |
| u | $i_1 = 0 \wedge j_1 = 0 \wedge i_2 > 0 \wedge j_2 > 0$ | $E$ | $(1\ 1)^T$ | $(1\ 1)^T$ | $(0\ 0)^T$ |
| u | $i_1 = 0 \wedge j_1 > 0 \wedge i_2 = 0 \wedge i_3 > 0$ | $E$ | $(1\ 1)^T$ | $(0\ 0)^T$ | $(-1\ 0)^T$ |
| u | $i_1 = 0 \wedge j_1 = 0 \wedge i_2 = 0 \wedge j_2 > 0 \wedge i_3 > 0$ | $E$ | $(1\ 1)^T$ | $(0\ 1)^T$ | $(-1\ 0)^T$ |
| x | | $E$ | $(0\ 0)^T$ | $(0\ 0)^T$ | $(0\ 0)^T$ |
| y | $j_1 = 0 \wedge j_2 = 0 \wedge j_3 = 0$ | $E$ | $(0\ 0)^T$ | $(0\ 0)^T$ | $(0\ 0)^T$ |
| y | $j_1 > 0$ | $E$ | $(0\ 1)^T$ | $(0\ 0)^T$ | $(0\ 0)^T$ |
| y | $j_1 = 0 \wedge j_2 > 0$ | $E$ | $(0\ 1)^T$ | $(0\ 1)^T$ | $(0\ 0)^T$ |
Definition 5.1 (Space-time mapping for LSGP and LPGS)
where E is the identity matrix,
and $n_1 + n_2 = n$. Here, p defines the processor index and t defines the time step of execution.
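The difference between the two allocations can be sketched in a few lines of Python. This is an illustrative sketch and not the affine mapping of Definition 5.1: the function name, the 3 × 3 tile size, and the use of rectangular tiles that divide the space exactly are assumptions. In LSGP the tile origin serves as the processor index, whereas in LPGS the position within the tile does.

```python
from itertools import product


def processors(space, tile, scheme):
    """Return the set of processor indices used by an LSGP or LPGS allocation.

    space:  extent of the rectangular index space per dimension
    tile:   hypothetical rectangular tile size per dimension
    scheme: "LSGP" (one processor per tile) or "LPGS" (one per point in a tile)
    """
    if scheme not in ("LSGP", "LPGS"):
        raise ValueError("scheme must be 'LSGP' or 'LPGS'")
    used = set()
    for point in product(*(range(n) for n in space)):
        inner = tuple(x % s for x, s in zip(point, tile))    # position in the tile
        outer = tuple(x // s for x, s in zip(point, tile))   # which tile
        used.add(outer if scheme == "LSGP" else inner)
    return used
```

For the 12 × 6 FIR space and an assumed 3 × 3 tile, LSGP needs one processor per tile, whereas LPGS needs one processor per point of a tile, so the LPGS array size is independent of the problem size.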
Similarly in co-partitioning, the index points within the LS tiles (i.e. smaller tiles) are executed sequentially. All the LS tiles within a GS tile (i.e. larger tile) are executed in parallel by the processor array. Therefore, the number of processors in the array is equal to the number of LS tiles within a GS tile. The GS tiles are executed sequentially.
Definition 5.2 (Space-time mapping for co-partitioning).
A space-time mapping in case of co-partitioning is an affine transformation of the form
where $E \in \mathbb{Z}^{n_2 \times n_2}$ is the identity matrix,
Similarly, other hierarchical partitioning schemes can be realized by an appropriate selection of the affine transformation that characterizes the scheduling and the allocation of the index points. The problem of determining an optimal sequencing index (i.e., $\lambda_1, \lambda_2, \ldots$) that takes into account timing constraints of the processor arrays and the availability of resources can be formulated as a mixed integer linear program (MILP) [7].
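As a small illustration of such a space-time mapping for co-partitioning, the following Python sketch assigns each iteration of the 12 × 6 FIR space a processor $p = I_2$ and a time step $t = \lambda_1 I_1 + \lambda_2 I_2 + \lambda_3 I_3$, and then checks by brute force that the schedule is conflict-free and respects the uniform dependencies of u and y. The tile sizes and the schedule vectors are hypothetical values picked by hand, every operation is assumed to take one time step, and the brute-force check merely stands in for the MILP formulation; none of this is taken from the paper.

```python
from itertools import product

SPACE = (12, 6)                     # FIR iteration space
LS, GS = (2, 3), (3, 2)             # hypothetical diagonals of P_LS and P_GS
LAMBDA = ((1, 2), (1, 5), (6, 0))   # hypothetical (lambda_1, lambda_2, lambda_3)
DEPS = [(1, 1), (0, 1)]             # uniform dependencies of u and y


def space_time(point):
    """Map an index point to (processor p, time step t)."""
    i1 = tuple(x % s for x, s in zip(point, LS))      # point inside the LS tile
    rest = tuple(x // s for x, s in zip(point, LS))
    i2 = tuple(x % s for x, s in zip(rest, GS))       # LS tile inside the GS tile
    i3 = tuple(x // s for x, s in zip(rest, GS))      # GS tile
    t = sum(l * x
            for lam, part in zip(LAMBDA, (i1, i2, i3))
            for l, x in zip(lam, part))
    return i2, t


def check_schedule():
    """Return (conflict_free, causal, number of processors, number of time steps)."""
    points = list(product(*(range(n) for n in SPACE)))
    slots = {space_time(p) for p in points}
    # No two iterations may occupy the same processor at the same time.
    conflict_free = len(slots) == len(points)
    # Every iteration must start strictly after the iterations it depends on.
    causal = all(
        space_time(p)[1] > space_time(tuple(a - b for a, b in zip(p, d)))[1]
        for p in points for d in DEPS
        if all(a - b >= 0 for a, b in zip(p, d))
    )
    procs = {p for p, _ in slots}
    times = [t for _, t in slots]
    return conflict_free, causal, len(procs), max(times) - min(times) + 1
```

Under these assumptions the array has as many processors as there are LS tiles within a GS tile. The $\lambda_2$ term skews the start times of neighbouring processors so that values crossing an LS tile boundary are available in time; an MILP would search for the vectors that minimize the total number of time steps instead of fixing them by hand.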
The description obtained by applying the above affine transformation to the output PLA of Example 4.1 is shown in Eq. (9). From this PLA, one can derive the description of the processor array architecture, which is shown in Fig. 2.
Conclusions and Future Directions
In this paper, we presented an exact methodology for the partitioning of piecewise linear algorithms (i.e., perfectly nested loop programs) that considers both affine and uniform dependencies. This not only enables the mapping of algorithms onto massively parallel architectures but is also of interest for studying the additional control introduced by partitioning. The transformation has been implemented in the PARO design system [10]. Future work entails studying techniques for the efficient enumeration of the index space in order to reduce the execution time of the program.
