Abstract
Introduction
Embedded real-time multimedia applications usually involve data-parallel processing: image processing, for instance, operates on vectors or matrices of pixels. Moreover, these applications are often statically predictable and well structured, so that the potential parallelism can be extracted at compile time. More than 80% of the execution time is spent in loop nests that define and use arrays. Performing these regular computations efficiently is therefore a key issue.
SIMD (Single Instruction Multiple Data) architectures are well suited to this multimedia application domain because they naturally take advantage of stream-oriented parallelism. Even if, architecturally speaking, this technology can be seen as old-fashioned, it remains easily scalable and can be exploited in addition to other techniques. Such SIMD units are now embedded in SoCs and GPPs (General Purpose Processors), e.g. AltiVec technology from Motorola [4] and Streaming SIMD Extensions from Intel [1].
Programming such domain-specific applications for these targets requires data placement and operation scheduling, which are NP-complete problems. Constraint technology [6], however, enables an efficient exploration of the combinatorial space of tilings and schedules. A set of cost functions lets the user automatically obtain a variety of solutions, close to those an expert would give, that bound the valid design space.
This article briefly presents our APOTRES tool and its underlying constraint technology [5, 11], and focuses on a case study to show how the design space is explored. The next section describes the application and the architecture. Section 4 introduces the constraint-based models, such as tiling and scheduling, which define the solution space. Section 5 presents our Concurrent Constraint Logic Programming approach. Several optimization criteria are then considered: machine size (Section 7), memory capacity (Section 8), and execution time under technological or real-time constraints (Sections 9 and 10). Results for the different cost functions are compared in Section 11.
the bit rate needed to transfer images of the same quality. However, H.264 kernels require more computation levels and their implementation should be adapted to various architectures.
We use the fractional sample interpolation of H.264 [13] as a running example. Interpolation is used to determine intensity values at non-integer pixel positions in order to perform motion compensation. In Figure 1 the grey squares represent existing pixels. Pixel i is the desired output. A vertical convolution produces pixels such as cc, dd, h, m, ee, ff. A horizontal convolution produces j from these pixels. Finally, i is derived from h and j.
Figure 1. Fractional sample interpolation
For our application we compute the 8×8 matrix of pixels i with the following equations:
where X is the 13×13 input matrix, H an 8×13 intermediate matrix, K an 8×8 intermediate matrix, Y the 8×8 output matrix, and conv a 6-tap convolution filter.
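As an illustration only, the three stages can be sketched in Python from the matrix shapes given above. The filter taps (1, -5, 20, 20, -5, 1) are those of the H.264 half-sample filter; the rounding offsets, the choice to keep H unnormalised between the two passes, and the column offset used to pick h are assumptions of this sketch and are not taken from the equations of the article. The function names `conv`, `clip` and `interpolate` are hypothetical.

```python
# Sketch of the three stages: vertical 6-tap filter, horizontal 6-tap filter, average.
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 half-sample taps; they sum to 32


def conv(samples):
    """Unnormalised 6-tap convolution of six consecutive samples."""
    return sum(t * s for t, s in zip(TAPS, samples))


def clip(v):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, v))


def interpolate(X):
    """X is a 13x13 list of lists; returns (H, K, Y) of shapes 8x13, 8x8, 8x8."""
    # Vertical pass: H[r][c] filters six vertically adjacent pixels of X.
    H = [[conv([X[r + k][c] for k in range(6)]) for c in range(13)]
         for r in range(8)]
    # Horizontal pass over the unnormalised H (assumed rounding: +512, >> 10).
    K = [[clip((conv(H[r][c:c + 6]) + 512) >> 10) for c in range(8)]
         for r in range(8)]
    # Output: rounded average of the normalised h sample and j
    # (assumed: h sits at column offset 2 of the 6-sample window).
    Y = [[(clip((H[r][c + 2] + 16) >> 5) + K[r][c] + 1) >> 1 for c in range(8)]
         for r in range(8)]
    return H, K, Y
```

A flat input image is a convenient sanity check: since the taps sum to 32, every output pixel should equal the input value.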
Application code
The interpolation code is expressed in a pseudo-C language in Figure 2. It is derived from the equations defining H, K and Y. The code is composed of parallel loops in single-assignment form. Each loop nest implements a task that reads one or more multidimensional data arrays and updates one different array. The arguments of each function call represent all the array elements read by the function. The loop nest tasks are called T0, T1, T2, T3 and T4.
The global approach
The goal is to find a mapping of the application onto the machine that satisfies the programming model and the architectural constraints. The global optimization problem is decomposed into many subproblems: memory minimization, task tiling and scheduling, communication overlapping, respect of dependencies, latency minimization, etc., each being a full optimization problem in its own right. Only a modular approach based on concurrent models can meet all our needs simultaneously. Figure 3 illustrates our global approach. A Constraint Logic Programming approach is used (Section 5). The application, the architecture, the memory, the task scheduling and tiling, the communication, the data-flow dependencies and the latency are modeled with linear and non-linear constraints. These models are introduced in Section 4 and represent the core of APOTRES.
The tool takes as input the application and the architectural parameters. The number of processors, the local and global memory sizes, and the pipeline depth are given in a specification file. The information coming from the application is automatically extracted using the PIPS compiler. The task durations are estimated using [8], and the array elements referenced by each task are evaluated by the array region analysis [14].
These pieces of information are taken into account and propagated, together with deduced information, from model to model. During the resolution the models exchange range information about their variables. The cost function, selected among execution time, memory, architecture cost and latency, guides the resolution through a specific heuristic. A first solution satisfying all the constraints is found. The solving process is automatic. There is never a bad solution, since any solution produced meets all the constraints. To obtain legal results, the tool designer has to approximate the architecture and the solution design by proper models. The introduction of a new architectural scheme implies the development of a new set of constraints.
Models
As explained above the mapping of applications onto the architecture uses different interacting models. The main ones used concurrently by the solver are introduced in this section.
Tiling
For each loop, the tiling partitions the iteration set and distributes it along three dimensions: (1) a cyclic temporal dimension, (2) a "processor" dimension, (3) a local dimension which exploits the local memory.
The tiling is formally defined in [12] and summarized here. Let I be the iteration set of a loop nest of n loops, contained in Z^n and defined by
Let P and L be n × n diagonal integer matrices with nonzero determinant. Then for each point i of I, there exists one and only one triplet (c, p, l) such that:
The associated triplet (c, p, l) can be interpreted as follows: at logical time c, processor p runs the local iterations l.
P and L define a tiling of the iteration domain. The solver must find the numerical values of their elements.
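The decomposition can be illustrated with a short Python sketch. It assumes the usual form i = L·P·c + L·p + l with 0 ≤ p < diag(P) and 0 ≤ l < diag(L), and non-negative iteration indices; this formula is an assumption of the sketch, since the defining equation is not reproduced here. P and L are passed as lists of their diagonal entries, and the names `tile` and `untile` are hypothetical.

```python
def tile(i, P, L):
    """Split iteration vector i into (c, p, l) for diagonal matrices P and L.

    Assumes i = L.P.c + L.p + l, applied independently in each dimension.
    """
    c, p, l = [], [], []
    for i_d, P_d, L_d in zip(i, P, L):
        l.append(i_d % L_d)             # local iteration inside a block
        p.append((i_d // L_d) % P_d)    # processor coordinate
        c.append(i_d // (L_d * P_d))    # cyclic (temporal) coordinate
    return c, p, l


def untile(c, p, l, P, L):
    """Rebuild the iteration vector from its (c, p, l) triplet."""
    return [L_d * P_d * c_d + L_d * p_d + l_d
            for c_d, p_d, l_d, P_d, L_d in zip(c, p, l, P, L)]
```

For an 8×8 iteration set with P = diag(4, 1) and L = diag(2, 1), iteration (5, 3) maps to c = (0, 3), p = (2, 0), l = (1, 0): processor 2 runs it as the second local iteration of its block, at logical time (0, 3).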
Scheduling
Schedules are computed with respect to tilings. For each loop nest, an affine function represents the schedule. A schedule is legal if it respects the data flow dependence constraints. A schedule function is
where c is the index of a computation block, α a line vector of the same dimension as c, "." the standard scalar product and β an integer. The solver selects values for all α and β.
Data flow dependence
If a block c' uses a value defined by a block c, then c must be executed first. This is called a data flow dependence. If there is a data flow dependence between two computation blocks c and c', then any legal schedule meets the constraint
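The constraint itself is not reproduced in this text. As a sketch only, the following Python code evaluates an affine schedule α.c + β and checks legality under the assumption that a consumer block must be dated at least one logical step after its producer. The unit latency, the function names `date` and `is_legal`, and the data layout are assumptions of this sketch, not the formulation used by APOTRES.

```python
def date(alpha, beta, c):
    """Logical date of computation block c under the affine schedule alpha.c + beta."""
    return sum(a * x for a, x in zip(alpha, c)) + beta


def is_legal(schedules, dependences, latency=1):
    """Check every data flow dependence against the schedules.

    schedules maps a task name to its (alpha, beta) pair.
    dependences is a list of ((producer_task, c), (consumer_task, c_prime)).
    Assumed constraint: date(consumer) >= date(producer) + latency.
    """
    for (src, c_src), (dst, c_dst) in dependences:
        alpha_s, beta_s = schedules[src]
        alpha_d, beta_d = schedules[dst]
        if date(alpha_d, beta_d, c_dst) < date(alpha_s, beta_s, c_src) + latency:
            return False
    return True
```

With a producer scheduled at date c and a consumer at date c + 1, every dependence between blocks of the same index is satisfied; giving both tasks the same β makes the schedule illegal.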
Memory capacity
A capacitive memory model is used. As all processors execute the same code, the required memory size is the same for each processor. Each task is allocated a private input buffer, but all tasks share a common output buffer. As soon as they are computed, results are sent to all input buffers of tasks using them. The output buffer is sized to fit the output of any task and to support a flip-flop mechanism used to overlap computations and communications.
The size of a task input buffer is the sum of the spaces required for each argument. Its capacity is the minimum obtained with four possible schemes. The memory constraints are linked to the partitioning, dependence and scheduling parameters.
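The accounting described above can be sketched as follows. The sketch sums the argument footprints of each task for its private input buffer and sizes the shared output buffer for the largest task output, doubled on the assumption that the flip-flop mechanism is a form of double buffering. The four alternative schemes mentioned above are not modeled, and all names and sizes are hypothetical.

```python
def input_buffer_size(arg_sizes):
    """Private input buffer of one task: sum of the space needed per argument."""
    return sum(arg_sizes)


def memory_per_processor(task_args, task_outputs, double_buffered=True):
    """Capacity needed on each processor (same code, hence same size everywhere).

    task_args maps a task name to the list of its argument footprints.
    task_outputs maps a task name to the footprint of its output.
    Assumption: the flip-flop mechanism doubles the shared output buffer.
    """
    inputs = sum(input_buffer_size(sizes) for sizes in task_args.values())
    output = max(task_outputs.values()) * (2 if double_buffered else 1)
    return inputs + output
```

For two hypothetical tasks with argument footprints [6] and [6, 2] and outputs of 1 and 4 elements, the input buffers take 14 elements and the shared output buffer 8, for a total of 22.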
A Concurrent Constraint Logic Programming Model (CCLP)
While some constraints can be translated into linear inequalities and solved by classical linear programming algorithms, others, such as resource constraints, require nonlinear expressions. Solving both kinds of constraints requires a combination of integer programming and search. We use the Constraint Logic Programming approach.
CCLP handles linear and nonlinear constraints and yields, through the concurrent propagation of constraints over all models, solutions satisfying the global problem. Figure 3 illustrates our models. The models are linked by variables that appear concurrently in different models. For example, Constraint 6 links the tiling and architectural models: for any task k, the number of processors required by the tiling must not exceed the number of processors available.
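A minimal sketch of this link between models, assuming that the number of processors required by a tiling is the product of the diagonal entries of its matrix P (an assumption of this sketch; the constraint itself is not reproduced here). The function names and the example tilings are hypothetical.

```python
from math import prod


def processors_required(P_diag):
    """Processors used by a tiling, assumed to be the determinant of diagonal P."""
    return prod(P_diag)


def satisfies_processor_constraint(tilings, available):
    """True if every task's tiling fits on the available processors.

    tilings maps a task name to the diagonal entries of its matrix P.
    """
    return all(processors_required(P) <= available for P in tilings.values())
```

With hypothetical tilings P = diag(4, 1) and P = diag(4, 4), the constraint holds on a 16-processor machine and fails on an 8-processor one.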
As another example, the computation block index c appears:
•
• in the scheduling model (Equation 4).
During the resolution process, the models communicate their partial information about their variables, in the form of value intervals, to the other models.
The CCLP system builds a solution space on a model-per-model basis. The global search looks for partial solutions in the different concurrent models. Only relevant information is propagated between models. Several global heuristics are used to improve the resolution, e.g. schedule choices are driven by computing the shortest path in the data-flow graph.
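The exchange of value intervals can be illustrated by a small bounds-propagation sketch. It narrows the intervals of two positive integer variables x and y under a nonlinear constraint x·y ≤ limit, such as two diagonal entries of P bounded by the number of processors. This is plain interval reasoning written for illustration; it is not the propagation algorithm of the CCLP system, and the function name `narrow` is hypothetical.

```python
def narrow(x, y, limit):
    """Narrow integer intervals x and y under the constraint x * y <= limit.

    x and y are (low, high) pairs with low >= 1.
    Returns the narrowed pair of intervals, or None if no value remains.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x, y
    # The largest feasible x is reached when y takes its smallest value,
    # and conversely for y.
    x_hi = min(x_hi, limit // y_lo)
    y_hi = min(y_hi, limit // x_lo)
    if x_hi < x_lo or y_hi < y_lo:
        return None  # inconsistency: the search must backtrack
    return (x_lo, x_hi), (y_lo, y_hi)
```

If another model has already deduced y ≥ 4, then with a limit of 16 the interval of x shrinks from [1, 16] to [1, 4] without any enumeration; if x ≥ 8 is also required, the inconsistency is detected immediately.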
Case study
To show the characteristics of our tool, we present in the next sections various mappings of the application obtained for three different optimization criteria. These results are successively taken into account to size the architecture:
• First, the cost criterion (Section 7) gives the cheapest circuit configuration able to execute the application, whether or not it meets the real-time constraint;
• Second, the memory minimization criterion (Section 8) gives a lower bound for the local memory capacity per processor;
• Finally, the execution time criterion (Sections 9 and 10), together with the constraints chosen from the previous results, provides solutions that fit both the architectural and the real-time constraints.
We choose a target architecture having up to 16 processors, each with a local memory of 128 or 256 bytes. These parameters are entered as architectural constraints in APOTRES.
A pipelined multiply-add is used, and the task durations estimated by PIPS [9], the optimizing compiler used as a front end to APOTRES, are 1, 6, 6, 1 and 1 cycles, respectively.
Cost minimization
The cost of SoCs is key for industrial exploitation. It depends on the surface of a single processor, on the surface of a memory unit, and on the number of processors and memory units required by the application.
The cost of a processor outweighs that of the memory. So, cost minimization induces solutions where the processor number is minimal.
Without the memory constraint, APOTRES selects a one-processor target machine, as expected. If a constraint on the memory capacity per processor, such as #mem ≤ 128, is added, APOTRES chooses the mapping of Table 2 on four processors, using 117 bytes each. Reducing the number of pipelined local iterations per computation block shortens data liveness and thus reduces the memory used. In particular, 56 elements of Array H have to be stored between the two orthogonal convolutions T1 and T2, instead of 104 for the first solution.
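The effect of the memory constraint on the cost criterion can be sketched as follows. The two candidates are the configurations reported in this section (one processor with 203 bytes, four processors with 117 bytes each). The area weights are purely hypothetical and only encode the statement that a processor costs far more than its memory; the cost formula is an assumption of this sketch, not the one used by APOTRES.

```python
def cost(n_proc, mem_bytes, proc_area=1000.0, byte_area=1.0):
    """Hypothetical silicon cost: each processor plus its local memory."""
    return n_proc * (proc_area + mem_bytes * byte_area)


def cheapest(candidates, mem_limit=None):
    """Pick the cheapest (n_proc, mem_bytes) candidate that fits the memory limit."""
    feasible = [c for c in candidates
                if mem_limit is None or c[1] <= mem_limit]
    return min(feasible, key=lambda c: cost(*c))


# (number of processors, bytes of local memory per processor)
CANDIDATES = [(1, 203), (4, 117)]
```

Without a limit the one-processor configuration wins; with a limit of 128 bytes per processor it becomes infeasible and the four-processor configuration is selected.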
Table 2. Tilings for cost minimization, 128 bytes
From the α and β scheduling parameters (Section 4.2), the schedules shown in Figure 4 can be expressed using regular expressions:
In order to shorten the scheduling formulations, C_i denotes the computation block with l_i pipelined local iterations of Task T_i, that is C_i = T_i^(l_i), and C_ALL is used for
The logical durations of the schedules in Figure 4 are respectively 1355 and 519 cycles, while the memory capacities per processor are 203 and 117 bytes.
Memory constraint minimization
In order to size the architecture we wish to know the minimum memory required to execute the application on the target architecture having up to 16 processors.
The best solution found by APOTRES uses only 46 bytes on the 16 processors. The tiling chosen is shown in Table 3 .
Table 3. Tilings with memory minimization
To minimize data storage during the execution of data-flow-dependent tasks, computation blocks have only one local iteration, except for Tasks T0 and T1. Consider the code of T1 in Figure 2: it uses six array elements produced by Task T0. Pipelining two contiguous local iterations of T0 and T1 therefore requires only one additional element of storage, and all 16 processors can be used to exploit the parallelism available in Tasks T2, T3 and T4.
The execution time is 117 cycles. Computations are again scheduled according to an as-soon-as-possible schedule:
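The storage argument is that of a sliding window: consecutive outputs of a 6-tap filter share five of their six inputs. A minimal sketch, assuming contiguous iterations and a window that advances by one element per iteration (the function name is hypothetical):

```python
def inputs_needed(local_iterations, taps=6):
    """Distinct input elements read by contiguous iterations of a `taps`-tap filter."""
    return local_iterations + taps - 1


# One iteration reads six elements; a second contiguous one adds a single element.
extra = inputs_needed(2) - inputs_needed(1)
```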
Execution time under memory constraint
Multimedia applications often must meet real-time constraints. Here we wish to set the execution time to a value strictly less than the 500 cycles found in one previous solution in Section 7. Furthermore, the results of Sections 7 and 8 make processors having from 64 to 128 bytes good candidates for our application. Two cases are successively studied with memory sizes of 128 and 64 bytes.
To reduce the execution time, APOTRES has to maximize the number of processors. Table 4 presents a solution that takes advantage of the available parallelism: processors and software pipelines.
Table 4. Tilings for time optimization, 128 bytes
The schedule is not interleaved:
The execution time, computed by the solver, is 98 cycles and the memory capacity required per processor is 72 bytes.
When memory size is limited to 64 bytes, the application cannot be executed on the machine without additional partitioning. APOTRES finds the solution in Table 5 and its related schedule:
This solution uses 110 cycles to execute the application and requires 62 bytes per processor. It actually does not differ from the previous one, except that Task T0 has been tiled. Now, at each iteration, Task T0 writes only one array 
Execution time under processor constraint
In order to compare the efficiency of the solutions, the results have to be computed for the same optimization criterion. The mappings of the application on one and four processors, using the execution time as the cost function, are briefly given here.
On one processor, the memory capacity required is 241 bytes and the execution time is 1343 cycles. On four processors, the memory capacity required per processor is 125 bytes and the execution time is 367 cycles.
Discussion
Table 8 summarizes the different solutions according to the optimization function used and the additional constraints. Typically, to minimize the silicon area of a SoC, the two main components to take into account are the number of processors and the memory size of each processor. These two criteria yield the solution with 1 processor and 203 bytes of local memory. Unfortunately, this solution executes too slowly for a real-time embedded application such as video decoding, so the number of cycles should also be considered. Furthermore, decreasing the number of cycles required to execute an application makes it possible to lower the SoC frequency and thus to reduce the SoC surface by tightening the layers. Trade-offs have to be made between speed and cost. Solutions 4 and 6 are efficient: they satisfy both the application and the architectural constraints.
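The figures of the two rows of Table 8 that survive in this text (solutions 3 and 4) can be cross-checked with a few lines of Python. The total memory is the number of processors times the memory per processor. The efficiency formula below, a reference sequential time divided by the processor-time product, is an inference: with the 1355 cycles of the one-processor cost solution as reference it reproduces 0.65 and 0.92 after rounding, but the article's own definition is not reproduced here.

```python
def total_memory(n_proc, mem_per_proc):
    """Total local memory over all processors, in bytes."""
    return n_proc * mem_per_proc


def efficiency(t_ref, n_proc, t_par):
    """Assumed definition: reference time / (processors * parallel time)."""
    return t_ref / (n_proc * t_par)


T_REF = 1355  # cycles of the one-processor cost solution (assumed reference)

solution_3 = (total_memory(4, 117), round(efficiency(T_REF, 4, 519), 2))
solution_4 = (total_memory(4, 125), round(efficiency(T_REF, 4, 367), 2))
```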

Related work
Using constraint programming to solve such mapping and scheduling problems can be seen as an automatic approach. Manual approaches may be preferred by designers who want to use their own design strategy and control the whole process. Mapping in GEDAE [2] is performed manually by allocating tasks to hardware resources, and scheduling is based on dynamic heuristics. But architecture features such as memory hierarchy or communication paths are not shown explicitly and are managed by the tool itself. This restricts in practice the scope of usable architectures. The SPEAR [10] Design Environment removes this limitation.

Table 8 (rows 3 and 4 only). Solutions according to the optimization function and additional constraints
No.  Criterion    Constraint   Proc.  Cycles  Mem./proc. (bytes)  Total mem. (bytes)  Efficiency
3    Cost         mem ≤ 128    4      519     117                 468                 0.65
4    Exec. time   #proc = 4    4      367     125                 500                 0.92

This enables rapid exploration of the design space. These tools make the architecture model and design choices explicit to guide the design space exploration.
Our work is unique because it takes into account simultaneously: the mapping of the complete application, with scheduling and sizing using tiling, the application memory requirement and the operational constraints.
Conclusion
This article shows how a multimedia application can be rapidly mapped onto a SIMD architecture. Our mapping tool explores the tiling and scheduling spaces within the combinatorial space of solutions according to different criteria. It finds, in a few minutes, the best trade-off with respect to the embedded real-time constraints and the target cost.
APOTRES is connected to PIPS, a tool that automatically analyzes and transforms codes written in Fortran. Another potential input is ANSI-C code whose functional results can be checked (by THOMSON). Hence our prototyping chain is nearly seamless in the sense that a multimedia code can be parallelized from any standard specification (sometimes not parallel at all) translated "à la Fortran" or from a C sequential code.
To make APOTRES more useful, improvements are planned in two areas: data communication and code generation (control, allocation, communication). Communications have to be modeled with respect to the communication resources. APOTRES generates integer values which are interpreted as mapping directives. We are currently studying control code generation using CLooG [7], which generates efficient C control code from a description of iteration domains and schedules.
