Abstract. Due to the complexity of modern data parallel applications such as image processing applications, automatic approaches to infer suitable and efficient hardware realizations are increasingly required. Typically, the optimization of the data transfer and storage micro-architecture plays a key role in exploiting data parallelism. In this paper, we propose a comprehensive method to explore the mapping of a high-level representation of an application onto a customizable hardware accelerator. The high-level representation is written in a language called Array-OL. The customizable architecture uses FIFO queues and a double buffering mechanism to mask the latency of data transfers and external memory accesses. The mapping of a high-level representation onto the given architecture is performed by applying a set of loop transformations in Array-OL. A method based on integer partitions is used to reduce the space of explored solutions.
Many studies exist on methods to map a high-level specification of an application onto a parallel architecture. These techniques are usually loop transformations [6], which enhance data locality and expose parallelism. Data-parallel applications are usually mapped onto systolic architectures [7], i.e., nets of interconnected processors through which data are streamed in a rhythmic way. Amar et al. [8] propose to map a high-level data parallel specification onto a Kahn Process Network (KPN), which is a network of processing nodes interconnected by FIFO queues.
We argue that KPNs are not suitable to take into account the transfer of multidimensional data.
The FIFO queues can be used only when two communicating processors produce and consume data in the same order. In the other cases, a local memory is necessary. When a local memory is used, the risk of conflicts on a memory location requires the processor producing the data and the processor consuming them to execute at different times. This creates a bottleneck in the pipeline execution, which can be avoided by using a double buffering mechanism [9]. Such a mechanism consists in using two buffers that can be accessed at the same time without conflict: the first buffer is written while the second one is read, and vice versa.
The complexity of data parallel architectures is so high that only an Electronic Design Automation (EDA) tool can efficiently undertake their design. One of the biggest challenges of EDA is to perform a Design Space Exploration (DSE) able to find in a short time an optimal hardware realization of a target application. We can distinguish two kinds of approaches to perform an exploration: exact and approximate approaches. Ascia et al. [10] briefly survey these approaches and propose to mix them in order to reduce the exploration time and approximate the optimal solution more precisely.
Our contribution. In this paper we present an exact DSE approach to map data parallel applications onto a specific hardware accelerator. The analysis starts from a high-level specification defined in a language called Array-OL [11]. The proposed method combines several optimizations from the three research domains previously presented: the memory hierarchy for data parallelism, the mapping of a high-level specification onto a parallel architecture, and DSE. The method is based on a customizable architecture including parallel processors and distributed local memories used to hide the data transfer latency. The target application is transformed in order to enhance data parallelism through data partitioning. The blocks of data are rhythmically streamed into the architecture. Several parallelism levels are possible: inter-task parallelism thanks to a systolic processing of blocks of data; parallelism between the data access and the computation thanks to the double buffering mechanism; data parallelism in a single task thanks to pipelining or to the instantiation of parallel hardware resources. The parallelism level and the size of the transferred blocks of data are chosen in order to hide the latency of data transfers.
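The double buffering mechanism described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `DoubleBuffer`, the function `run_pipeline` and the choice of a sum as the "computation" are hypothetical, and the sequential Python code merely mimics what the hardware does concurrently.

```python
class DoubleBuffer:
    """Two buffers: one is written by the producer while the other is read."""

    def __init__(self, block_size):
        self.buffers = [[None] * block_size, [None] * block_size]
        self.write_index = 0  # buffer currently owned by the producer

    def write_block(self, block):
        # The producer fills the buffer it currently owns.
        self.buffers[self.write_index][:] = block

    def read_block(self):
        # The consumer reads the other buffer, so no location is shared.
        return list(self.buffers[1 - self.write_index])

    def swap(self):
        # Exchange the roles of the two buffers at the end of a step.
        self.write_index = 1 - self.write_index


def run_pipeline(blocks, block_size):
    """Stream blocks through the double buffer and sum each one."""
    db = DoubleBuffer(block_size)
    results = []
    db.write_block(blocks[0])  # prologue: prefetch the first block
    db.swap()
    for step in range(len(blocks)):
        if step + 1 < len(blocks):
            db.write_block(blocks[step + 1])  # prefetch the next block
        results.append(sum(db.read_block()))  # compute on the current block
        db.swap()
    return results
```

In each step the prefetch of block `step + 1` and the computation on block `step` touch different buffers, which is what allows the two activities to overlap in hardware.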
To illustrate our approach we will refer to an example application called low-pass spatial filter (LPSF). The LPSF is used to eliminate the high spatial frequencies in the retina model [12, 13]. It is composed of four inter-dependent filters, as shown in Fig. 1. Each filter performs a low-pass filtering according to a given direction. For example, HFLR computes the value of the pixel of coordinates (i, j) at instant t = 1 as the weighted sum of the pixel (i, j − 1) at instant t = 1 and the pixel (i, j) at the previous instant t = 0. In Section 2 of this paper, we describe the Array-OL formalism and the associated specification transformations. Then in Section 3, we propose an implementation architecture for data parallel applications and we define a corresponding mapping model. In Section 4, we define a method to systematically apply a set of Array-OL transformations in order to optimize the memory hierarchy and the communication sub-system for a target application. We illustrate the method with the example of the low-pass spatial filter.
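The HFLR recurrence described above can be sketched as follows. The weights `a` and `b` and the zero value assumed for the missing left neighbour of the first pixel are hypothetical; the text only states that the output is a weighted sum of the left neighbour at the current instant and of the same pixel at the previous instant.

```python
def hflr(previous_frame, a=0.5, b=0.5):
    """Horizontal left-to-right recursive low-pass filter on one frame.

    a and b are hypothetical weights; the pixel (i, -1) is assumed to be 0.
    """
    out = []
    for row in previous_frame:
        new_row = []
        left = 0.0  # assumed boundary value for pixel (i, -1) at t = 1
        for pixel in row:
            # weighted sum of pixel (i, j-1) at t = 1 and pixel (i, j) at t = 0
            left = a * left + b * pixel
            new_row.append(left)
        out.append(new_row)
    return out
```

Each output pixel depends on the output pixel to its left, which is the kind of inter-repetition dependency that prevents the iterations of one row from being computed independently.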
2 High-level design of data parallel applications with Array-OL
Array-OL (Array-Oriented Language) is a formalism able to represent data intensive applications as a pipeline of tasks applied on multidimensional data arrays [11].
The target applications have affine array references. An Array-OL representation of the LPSF is given in Fig. 2. Each filter has an inter-repetition dependency, i.e., the result of a task repetition instance is used to compute the next repetition instance. In an Array-OL representation we distinguish three kinds of tasks: elementary, compound and repetitive tasks. An elementary task, e.g. the image generator in Fig. 2, is an atomic black box taken from a library and it cannot be decomposed into simpler tasks. A compound task, e.g. the LPSF filter in Fig. 2, can be decomposed into a hierarchy of simpler interconnected tasks. A repetitive task, e.g. the HFLR filter in Fig. 2, specifies how a task is repeated on different subsets of data, which expresses data parallelism. In an Array-OL representation, no information is given on the hardware realization of tasks. Fig. 3 shows a detailed Array-OL representation of the HFLR filter. The tilers (T1 and T2) describe the shape, the size and the parsing order of the patterns that partition the processed frames. The pattern of size $s_p$ that tiles the input data and is described by T1 is monodimensional. It contains 4 pixels and lies along the horizontal spatial dimension. The vector d is an inter-repetition dependency vector.
The information on the array tiling is given by the origin vector O, the fitting matrix F and the paving matrix P. The origin vector specifies the coordinates of the first datum to be processed. The fitting matrix F says how to parse each data pattern and the paving matrix P says how the chosen pattern covers the data array. The size of the pattern is denoted by $s_p$. Array-OL is a single-assignment and deterministic language, i.e., each datum can be written only once and any possible scheduling that respects the data dependencies produces the same result. For this reason, an application described in Array-OL can be statically scheduled. A crucial problem for an Array-OL description is to define the optimal granularity of the repetitions, i.e., to find an optimal data paving in order to optimize the target implementation architecture.
Glitia et al. [14] describe several Array-OL transformations. In our work, we use the fusion, the change paving and the tiling. These transformations are respectively equivalent to loop fusion [15], loop unrolling [16] and loop tiling [17, 18]. One of the biggest problems for a competitive exploration is to define an optimal composition of these transformations. In fact, by composing them we can optimize the following criteria: the calculation overhead, the hierarchy complexity, the size of the intermediate arrays and the parallelism level. We propose to compose them in order to find an efficient architecture that masks the data access and transfer latencies.
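To make the role of O, F and P concrete, the sketch below enumerates the array elements covered by one pattern. It assumes the usual Array-OL tiler formulation, which is not spelled out in this text: the reference element of repetition r is O + P·r and the element of index i inside the pattern is the reference plus F·i, both taken modulo the array shape. The function names and the 8×8 example are hypothetical.

```python
from itertools import product


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def tile_elements(origin, fitting, paving, pattern_shape, repetition, array_shape):
    """Coordinates of the array elements covered by one repetition's pattern."""
    # Reference element of the pattern: origin + paving * repetition index.
    ref = [o + p for o, p in zip(origin, mat_vec(paving, repetition))]
    elements = []
    for index in product(*(range(n) for n in pattern_shape)):
        # Offset inside the pattern: fitting * pattern index.
        offset = mat_vec(fitting, index)
        elements.append(
            tuple((r + d) % s for r, d, s in zip(ref, offset, array_shape))
        )
    return elements
```

For a monodimensional pattern of 4 pixels lying along the horizontal dimension of an 8×8 frame, one may take `fitting = [[0], [1]]` and `paving = [[1, 0], [0, 4]]` with coordinates ordered as (row, column); the repetition (2, 1) then covers the pixels (2, 4) to (2, 7).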
3 A target customizable architecture
Overview
We propose a synchronous architectural model [19]. Two processors may communicate through a stand-alone FIFO queue or through a local memory. In the former case, the local memory controller and the buffers are not instantiated.
inria-00522786, version 1 -1 Oct 2010
In the latter case, the buffers and the controller are instantiated. Each buffer is a single-port memory because the realization of multiport memories is not mature yet and is subject to many technological problems [20]. For this reason, a processor receiving two data flows includes two distinct buffers, one for each communication flow. Furthermore, as a buffer receives a single data flow, all data transfers can be parallel without any conflict. Finally, each local buffer has a double buffering mechanism, which allows the data access and the computation to proceed in parallel.
For a given application, the data to be processed are stored in the external memory. They are streamed into the architecture in groups of parallel blocks. The size of the data blocks is chosen, on the one hand, in order to mask the data access latency thanks to the double buffering and, on the other hand, in order to respect the constraints on the external memory bandwidth.
The data transfer and storage model
As shown in Fig. 5(a), each processor of the proposed architecture can have several input data blocks $x_j$ but a single output data block $x_i$. Thanks to the double buffering mechanism, each processor fetches the data needed to compute the next group of iterations while it is computing the current group of iterations. One of the aims of the proposed method is to choose a data block size that allows the time to fetch to be masked by the time to compute. As a result, such a block size must respect the following mapping rule:
$$\max_{j}\{t_{fetch}(x_j)\} \le t_{com}(x_i) \qquad (1)$$
As illustrated in Fig. 5(b), the execution timeline has two parallel contributions: the time to pre-fetch and the time to compute. The duration of the longest of these contributions synchronizes the beginning of a new set of computations in the pipeline. The time to pre-fetch a set of data depends on the communication type, i.e., on whether the processor receives data from the external memory or from another processor. We propose a model for each of these transfer types.
Data transfer with external memory. We make the following hypotheses:
The external memory is accessed in burst mode [21], i.e. there is a latency before accessing the first datum of a burst, then a new datum of the burst can be accessed at each processor cycle. A whole data partition is contained in a memory burst and the access to a whole data partition is an atomic operation. It is possible to store m data per word of the external memory (for example, 4 pixels of 8 bits in each memory word of 32 bits).
Under the above hypotheses, we propose the following definition:
Definition 1. Given $L$ and $m$ denoting respectively the latency before accessing a burst and the number of data per external memory word, the time to fetch a set of data $x_j$ from an external memory is:
$$t_{fetch}(x_j) = L(j) + \frac{x_j}{m}$$
The latency $L(j)$ has three contributions: the latency due to the other data transfers between the target architecture and the external memory, the latency due to the burst address decode and the latency due to other data transfers which are external to the target architecture (these last two contributions are denoted by $L_m$). In the worst case, this latency is:
with $x_z \in X_{Mem}$, $X_{Mem}$ being the set of all the data transfers (in input and output) between the given architecture and the external memory. Hence, we have:
where $N_k$ is the number of data blocks exchanged between the target architecture and the external memory. The expression $Vect(1)$ indicates a row vector whose coordinates are all 1.
Transfer between two processors and the computation time. In the proposed architecture, the data computed by a processor are immediately fetched into a second processor, in a dedicated buffer. Thus, the time to fetch a set of data $x_j$ into a processor $P_k$ corresponds to the time to compute a set of data $x_j$ by a processor $P_l$. To define the time to compute we use the simplifying hypothesis proposed by Schreiber et al. in [22]:
The iterations of a task can be pipelined on the same hardware, but in order to avoid conflicts on the streams or memory accesses, a minimum interval of time (called Initiation Interval, II) has to pass between the beginning of two successive iterations. The time to access data increases with the number of iterations while the datapath latency remains constant (cf. Fig. 6). If the number of pipelined iterations N is sufficiently high, the datapath latency can be neglected with respect to the time spent to access data.
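The effect of this hypothesis can be checked numerically. The sketch below assumes the textbook pipeline timing model, in which N iterations started every II cycles on a datapath of fixed latency take (N − 1)·II + latency cycles; this closed form and the function names are assumptions made for illustration, not taken from the text.

```python
def pipelined_time(n_iterations, initiation_interval, datapath_latency):
    """Cycles needed for n pipelined iterations (assumed textbook model)."""
    if n_iterations == 0:
        return 0
    # A new iteration starts every II cycles; the last one still needs
    # the full datapath latency to complete.
    return (n_iterations - 1) * initiation_interval + datapath_latency


def approximation_error(n_iterations, initiation_interval, datapath_latency):
    """Relative error made when the time is approximated by N * II."""
    exact = pipelined_time(n_iterations, initiation_interval, datapath_latency)
    approx = n_iterations * initiation_interval
    return abs(exact - approx) / exact
```

With an II of 3 cycles and a datapath latency of 10 cycles, the relative error of the N·II approximation is about 19% for 10 iterations and below 1% for 1000 iterations, which is the regime the hypothesis relies on.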
This simplification leads to a transaction-based model, without precision on the execution time of the functional core. Under the given hypothesis, we define:
Definition 2. Let $c_{x_l}$ be the Initiation Interval of a processor $P_l$. The time to compute a set of data $x_j$ is: $t_{com}(x_j) = c_{x_l} x_j$.
Example 1. An application of inequality (1). Let $P_k$ be a processor, as presented in Fig. 7. It has three input flows: one from the external memory, one from a processor $P_l$ and another due to an inter-repetition. To avoid a deadlock, all the data needed by an inter-repetition are stored in internal memories without any latency. From Definitions 1 and 2, the time to fetch is $t_{fetch} = \max\{L + \frac{x_0}{m}, 3x_j\}$ and the time to compute is $t_{com} = 6x_i$. Thanks to the access affinity, we know that $x_0 = 6x_i$ and $x_j = 3x_i$. Applying inequality (1) therefore requires both $L + \frac{6x_i}{m} \le 6x_i$, which is possible only if $m > 1$, and $9x_i \le 6x_i$, which is impossible. In this last case, the usage of data parallelism can mask the time to fetch.
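Example 1 can be replayed numerically. The sketch below uses the expressions that appear in the example (a burst fetch time of L + x/m and a computation time equal to the Initiation Interval times the block size); the function names and the sample values of L, m and $x_i$ are hypothetical.

```python
def t_fetch_external(block_size, latency, data_per_word):
    """Time to fetch a block from the external memory: L + x / m."""
    return latency + block_size / data_per_word


def t_compute(block_size, initiation_interval):
    """Time to compute a block: Initiation Interval times block size."""
    return initiation_interval * block_size


def example_1(x_i, latency, data_per_word):
    """Check the two fetch flows of Example 1 against the compute time."""
    x_0 = 6 * x_i  # block read from the external memory
    x_j = 3 * x_i  # block produced by processor P_l (Initiation Interval 3)
    compute = t_compute(x_i, 6)
    external_ok = t_fetch_external(x_0, latency, data_per_word) <= compute
    internal_ok = t_compute(x_j, 3) <= compute
    return external_ok, internal_ok
```

For instance, `example_1(8, 10, 4)` reports that the external transfer can be masked while the transfer from $P_l$ cannot, and `example_1(8, 10, 1)` shows that with a single datum per memory word the external transfer is not masked either.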
Scaling the parallelism level. In the proposed architecture there are two levels of parallelism: 1) the parallelism between data access and computation; 2) the parallelism due to the pipeline of iterations on the same processor. It is possible to further scale the parallelism level by computing independent groups of data in parallel. We denote by $N_{cu_k}$ the number of parallel computation units of a processor $P_k$, which is able to compute $N_{cu_k}$ blocks of data in parallel.
To apply the data parallelism, we distinguish between the processors communicating with the external memory and the processors communicating with each other. For the processors communicating with the external memory, the number of parallel computation units $N_{cu_k}$ is limited by the memory bandwidth. When a processor writes a data flow into the external memory, its number of parallel computation units $N_{cu_k}$ is limited to $m$, the number of data per memory word:
$$N_{cu_k} \le m$$
When a processor receives an input data flow from the external memory, all the data are sequentially streamed on a unique input FIFO queue. It is possible to duplicate the internal memories and the computation units, but the time to fetch the input data is multiplied by the number of parallel computation units $N_{cu_k}$, while the computation time is divided by a factor $N_{cu_k}$:
$$N_{cu_k}\, t_{fetch}(x_j) \le \frac{t_{com}(x_i)}{N_{cu_k}}$$
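For the processors communicating with each other, the parallelism level reduces both the time to fetch and the time to compute. The inequality (1) becomes
$$\frac{c_{x_l}\, x_j}{N_{cu_l}} \le \frac{c_{x_k}\, x_i}{N_{cu_k}}.$$
Thanks to the affinity of the array accesses, i.e. $x_j = k_{ij} x_i$ with $k_{ij} \in \mathbb{N}^*$, we can infer a constraint on the parallelism level of two communicating processors:
$$\frac{N_{cu_l}}{N_{cu_k}} \ge k_{ij}\, \frac{c_{x_l}}{c_{x_k}}.$$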
Generalization of the transfer and storage model. We generalize the mapping rule of inequality (1) to the whole processor network, by taking into account the processor interconnections. For that, we consider the following definitions, presented in a progressive way.
Definition 3. Given a network of processors, a column-vector $X$ of possible data block sizes produced or consumed by the processors in the network is: $X = (x_0, \ldots, x_{n-1})$ where $x_i$ is a data block produced or consumed by
The vector $X$ is associated with two vectors $X_{out}$ and $X_{in}$. The coordinates of $X_{in}$ (respectively $X_{out}$) are either 0 or those of $X$ at the same position and corresponding to the input (respectively output) data blocks in the network. A coordinate $x_j$ of $X_{in}$ equals $k_{ij} x_i$, where $x_i \in X_{out}$ and $k_{ij} \in \mathbb{N}^*$. The data blocks of size $x_j$ can be received either from the external memory or from another processor of the network.
In the following definitions, we give the relation between $X_{in}$ and $X_{out}$ and we distinguish the case when the input data blocks are received from the external memory.
Definition 4. Let $X^{Mem}_{in}$ be a vector of possible sizes for the data blocks read from the external memory. The matrices $K_\delta$ and $K^{Mem}_\delta$ giving the mapping between the sizes of input and output data blocks are:
$$X_{in} = K_\delta X_{out}$$
for all the input communications, and
$$X^{Mem}_{in} = K^{Mem}_\delta X_{out}$$
for the communications from the external memory, where $X_{out}$ and $X_{in}$ are vectors of possible sizes for respectively output and input data blocks.
An [...] is defined as follows: [...] is:
Definition 6. Given a processor $P_k$, a column vector $C_x$ giving the Initiation Interval per processor is $C_x = \{c_{x_k} : c_{x_k}$ is the Initiation Interval of a processor $P_k\}$.
Definition 7. Given a processor $P_k$, a column vector $N_{cu}$ giving the number of parallel computation units per processor is $N_{cu} = \{N_{cu_k} : N_{cu_k}$ is the number of parallel computation units in $P_k\}$.
From the above definitions, we infer the mapping criteria to mask the data access latency for the input, output and internal communications of the target architecture.
Mapping Criterion 1. Input communication from the external memory. Let be [...] $X_{out}$ and $Diag(N_{cu})$ a diagonal matrix whose diagonal elements are the coordinates of vector $N_{cu}$. It is possible to write equation 3 as follows: [...] and Definition 2 as follows: [...]. The substitution of $t_{fetch}(X^{Mem})$ and $t_{com}(X^{Mem}_{in})$ in inequality (1) gives:
Mapping [...]. From inequality 6, we infer:
The above mapping criteria form a system of inequalities whose variables are the parallelism level $N_{cu}$ and the size of the transferred data blocks $X_{out}$. Solving the system of inequalities means finding $N_{cu}$ and $X_{out}$ that mask the time to access data and respect the external memory bandwidth.
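As an illustration of what solving such a system involves, the sketch below enumerates the feasible pairs for a toy instance with a single processor reading from the external memory. It assumes, as stated earlier for input flows from the external memory, that the fetch time is multiplied and the computation time divided by the number of computation units; the function name, the candidate block sizes and the numerical values are hypothetical, and the real criteria involve the whole processor network.

```python
def feasible_configurations(latency, data_per_word, initiation_interval,
                            k, block_sizes, max_units):
    """Enumerate (n_cu, x_out) pairs for which the fetch time is masked.

    Toy single-processor instance: the input block has size k * x_out.
    """
    solutions = []
    for n_cu in range(1, max_units + 1):
        for x_out in block_sizes:
            # Fetch time of the input blocks, multiplied by n_cu.
            fetch = n_cu * (latency + k * x_out / data_per_word)
            # Computation time, divided by n_cu.
            compute = initiation_interval * x_out / n_cu
            if fetch <= compute:
                solutions.append((n_cu, x_out))
    return solutions
```

With a latency of 10 cycles, 4 data per word, an Initiation Interval of 6 and k = 6, only one computation unit with blocks of at least 4 data masks the fetch time among the candidate sizes 1, 2, 4, 8 and 16.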
4 The Design Space Exploration approach
We propose a Design Space Exploration (DSE) which aims to efficiently map an application described in Array-OL onto an architecture as proposed in Fig. 4. The DSE chooses a task fusion that improves the execution time and uses the minimal inter-task patterns. Then it changes the data paving in order to mask the latency of the data accesses. The space of the exploration is a set of solutions with a given parallelism, a fusion configuration and data block sizes that meet the constraints on the external memory bandwidth. The result of the exploration is a set of solutions which are optimal (also termed Pareto) with respect to two optimization criteria: the architecture latency and the internal memory amount.
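The selection of the optimal solutions with respect to the two criteria can be sketched as a plain dominance filter. The representation of a solution as a `(latency, memory)` tuple and the function name are assumptions made for this illustration; lower is better for both criteria.

```python
def pareto_front(solutions):
    """Keep the solutions not dominated on (latency, memory)."""
    front = []
    for candidate in solutions:
        # A candidate is dominated if another, different solution is at
        # least as good on both criteria.
        dominated = any(
            other != candidate
            and other[0] <= candidate[0]
            and other[1] <= candidate[1]
            for other in solutions
        )
        if not dominated:
            front.append(candidate)
    return front
```

For the candidate list `[(10, 100), (12, 80), (11, 120), (15, 80), (9, 200)]`, the filter keeps `(10, 100)`, `(12, 80)` and `(9, 200)`: each of them is better than every other solution on at least one criterion.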
The optimization criteria. A processor $P_i$ contains a buffer of size $LM_j$ for each input flow:
The factor 2 is due to the double buffering. The total amount of used internal memory is:
We define the architecture latency as the latency to compute a single output image. Since, in our model, the times to access data are always masked, we can approximate the architecture latency with the time necessary to execute all the output transfers towards the external memory:
where [...] denotes the determinant. The image size can be inferred from the Array-OL specification.
The Array-OL model on the target architecture. An Array-OL model can directly be mapped onto an architecture like that presented in Fig. 4, by using the following rules:
1. The analysis starts from a canonical Array-OL specification, which is defined to be equivalent to a perfectly nested loop code [23]: it cannot contain a repetition around the composition of repetitive tasks.
2. The hierarchy of the Array-OL model is analyzed from the highest to the lowest level.
3. An element $c_{x_i}$ of $C_x$ is $c_{x_i} = \max_j\{\delta_{i,j}\}$. The values of the elements of $I^{Mem}_\delta$ and $I^{com}_\delta$ depend on the inter-task links.
4. At each level we distinguish among an elementary, a compound or a repetitive task. If the task is elementary or a composition of elementary tasks, a set of library elements is instantiated to realize it. When a task is a repetition we distinguish two cases: if it is a repetition of a single task we instantiate a single processor; if it is a repetition of a compound task we instantiate a set of processors in a SIMD, MIMD or pipelined configuration. The repetitions of the same task are iterated on the same hardware or executed in parallel according to the mapping criteria.
5. Each instantiated processor contains at least a datapath (of library elements) and may contain some local buffers and a local memory controller.
Reducing the space of possible fusions. To reduce the exploration space of the possible fusions, we propose to adapt the method used by Talal et al. [24] to generate optimal coalition structures. Given a number of tasks n, we can map the space of possible fusions onto the integer partitions of n. An integer partition of n is a vector of positive integers whose components add up to n. For example, the integer partitions of n = 4 are [1, 1, 1, 1], [2, 1, 1], [2, 2], [3, 1] and [4]. Each integer partition can be a mapping for several possible fusions, as shown in Fig. 9. Let a macrotask be the result of a fusion. As proposed by Talal et al., we reduce the number of sub-spaces by merging the sub-spaces whose solutions contain the same number of macrotasks.
For the example of Fig. 9, the sub-spaces mapped on the integer partitions [3, 1] and [2, 2] are merged. In this way the number of sub-spaces is limited to n. This mapping reduces the number of comparisons between the possible solutions. In fact we search for the Pareto solutions of each sub-space and we compare them to each other in order to find the Pareto solutions of the whole space. For the example of Fig. 9, we perform 32 comparisons instead of 56. The Pareto solutions are denoted by $P_{sol_i}$ in Fig. 9. Among these solutions a user can choose the one best suited to his or her objectives. In our case, we have chosen the solution $P_{sol_1}$, which has the most advantageous tradeoff between the used internal memory and the architecture latency.
Example of an exploration for a LPSF. Solutions merging VFDT and HFRL have to store a whole image, thus they use 2M of internal memory.
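The reduction of the fusion space can be sketched by generating the integer partitions of n and grouping them by their number of parts, i.e. by the number of macrotasks. The function names are hypothetical and the sketch only enumerates the sub-spaces; it does not enumerate the fusions inside each sub-space.

```python
def integer_partitions(n, largest=None):
    """Yield the partitions of n as non-increasing lists of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield []
        return
    for first in range(min(n, largest), 0, -1):
        # The remaining parts must not exceed the part just chosen.
        for rest in integer_partitions(n - first, first):
            yield [first] + rest


def group_by_macrotask_count(n):
    """Merge the partitions that contain the same number of macrotasks."""
    groups = {}
    for partition in integer_partitions(n):
        groups.setdefault(len(partition), []).append(partition)
    return groups
```

For n = 4, the partitions [3, 1] and [2, 2] fall into the same group of two macrotasks and the five partitions are merged into four groups, in agreement with the bound of n sub-spaces.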
Conclusion
We presented a method to explore the space of possible data transfer and storage micro-architectures for data parallel applications described in Array-OL. This method starts from a canonical Array-OL representation and applies a set of transformations in order to infer an application-specific architecture that masks the time to transfer data with the time to perform the computations.
We propose a customizable model of the target architecture including FIFO queues and a double buffering mechanism. The mapping of a given image processing application onto the proposed architecture is performed through a flow of Array-OL transformations aimed at improving the parallelism level and reducing the size of the used internal memories. We also use a method based on integer partitions to reduce the space of explored transformations. This method is intended to be integrated into Gaspard, an Array-OL framework able to map Array-OL models onto different kinds of target architectures [25]. An industry-size case study is currently in progress.
