In this study we propose a framework and a combined temporal partitioning and design space exploration method for run time reconfigurable processors. Our objective is to help designers to implement an algorithm in limited FPGA area resources while respecting the execution time constraint. The algorithm to be implemented is represented by a task graph with different implementation alternatives (design points) for each task. We study the effect of hardware resources limitation in the choice of the algorithm implementation design point. The proposed method is based on an heuristic technique which consists on combining temporal partitioning and task design points selection to obtain solutions that satisfy the imposed constraints.
INTRODUCTION
With today's deep sub-micron technology, the state-of-the-art FPGA have exceeded 10 million system gates, allowing for multimillion gates FPGAs operating at speeds surpassing 400 MHz. Many designs, which previously could only achieve speed and cost of density goals in ASICs, are converting to much more flexible and productive reprogrammable solutions. The major developments in FPGA logic density, speed, packaging etc. have made implementing a system of processor/s, IP blocks, and user logic in an FPGA (System On a Programmable Chip: SOPC) a possibility. This technology is currently being used for the acceleration of a wide variety of applications on a large number of systems. It has evolved so much that the real-time aspect is not the only objective of the designer [1] . It has allowed the association of the flexibility and the specificity. Several applications can be realized by specialized architectures by simply configuring the FPGAs each time the FPGA-based board is supplied.
With the advent of new device architectures and new software tools, the interest in Run-Time Reconfiguration (RTR) or dynamically reconfiguration logic has increased. This concept has introduced several advantages. It helps the designer to optimize his implementation by increasing the functional density of the FPGA coprocessor. It offers the possibility of sharing in time the available resources in the FPGA between the different tasks of an application. This can be accomplished by using either total or partial dynamic reconfiguration. This later allows the configuration of a part of an FPGA design to change while the circuit is running. The AT40K40 FPGA family of ATMEL allows the reconfiguration of any area of the component by modifying the SRAM configuration contents [1] . Actually, Xilinx offers this technology for his more recent families such as the Virtex-II Pro FPGAs [2] [3] [4] . In this study we propose a method for efficient management of a given FPGA area resources for a particular algorithm by exploiting the dynamic reconfiguration for possible use in SOPC. The proposed method consists in combining temporal partitioning techniques and design points selection of the different tasks constituting the algorithm. We aim to resolve the temporal partitioning problem for a given application while considering the characteristic of multi design point of each algorithm task. This can be very useful for possibly variable resources available on the FPGA when adding a new service or an update [5] . By using different alternatives for the algorithm tasks we increase the chance to meet the application constraints. In fact choosing the best design point for each task may not necessarily result in the best overall design. This depends on the architectural constraints and the dependency constraints among the tasks [6] .
State of the ART in the dynamic reconfiguration domain:
In literature a lot of interest was given to the dynamic reconfiguration and the opportunities given by the new FPGA technologies. Works in this field aim To reduce the difficulty in managing the dynamically reconfigured application and to provide a reliable implementation by developing a set of tools and associated methodologies addressing many issues related to the Dynamic Reconfiguration such as: Automatic partitioning of a conventional design, Specification of the dynamic constraints, Verification of the dynamic implementation through dynamic simulations, Automatic generation of the configuration controller, etc. In [7] [8] [9] the FSS (FPGA Support System) environment which is developed at Manchester University is represented. It facilitates the execution of hardware-based tasks on a dynamically reconfigurable FPGA. It supports the placement, execution and removal of blocks on the FPGA. A framework for the Design and Implementation of Dynamically and Partially Reconfigurable Systems "PaDReH" is proposed in [10, 11] . It is presented with a design flow including partitioning, scheduling and validation. Papers [12, 13] are related to the integrated design system called SPARCS (Synthesis and Partitioning for Adaptive Reconfigurable Computing Systems) which is developed in the ECECS Department at Cincinnati University. It aims to automatically partitioning and synthesizing designs for reconfigurable boards with multiple field-programmable devices (FPGA). The system contains a temporal partitioning tool, a spatial partitioning tool, and a high-level synthesis tool. In [14] a run-time reconfiguration system for FPGA computing resources is proposed; System behaviour and architecture are represented as a problem graph, and an architecture graph, respectively. The Model-Integrated Development Environment for Adaptive Computing (MIDE) project of Vanderbilt University [15] has as goal to develop high-level system design tools for implementing dynamically reconfigurable systems using adaptive computing technology. It is aimed at embedded systems of weapons like missile guiding systems. It uses DSP processors coupled to Virtex Xilinx FPGAs. The Berkeley Reconfigurable Architectures, Software, and Systems (BRASS) project of Berkeley University has proposed SCORE (Stream Computations for Reconfigurable Execution) [16] , a computation model based on the organization of reconfigurable systems around the virtualization of three main hardware concepts: paged reconfigurable hardware, page communication through the use of streams, and storage. The Dynamically Reconfigurable Hardware Research at Bournemouth University [17, 18] has proposed the DYNASTY tool, which is a generic CAD framework for research in the area of reconfigurable system design techniques and methodologies.
The Specific action « dynamically reconfigurable Architectures » of the CNRS (National Center of Scientific Research in France) Multi-field themes network on SOC provides three main research projects in dynamic reconfiguration: ARDOISE, DART and DNODE [19] . In difference to ARDOISE, DART and DNODE are note based on the use of the dynamically reconfigurable FPGA. ARDOISE propose a dynamically reconfigurable Architecture dedicated to embedded image and signal processing. In [1] , an imageprocessing application, image rotation, that exploits the FPGAs dynamic reconfiguration method is presented. It shows that the choice of an implementation, static or dynamic reconfiguration, depends on the application nature. Paper [20] describes Implementation of JPEG2000 Arithmetic Decoder using Dynamic Reconfiguration. The target architecture is ARDOISE. Works about Dynamic Reconfiguration methods and applications presented in [21] [22] [23] are also related to the ARDOISE architecture.
Numerous algorithms for partitioning, scheduling and placement of tasks for reconfigurable computing devices are proposed in the literature. The paper [24] treats in addition to the partitioning, the scheduling of the tasks. The advantage of the approach presented in this study is the capability to model communication between nonadjacent on-chip configurations and multiple levels of logic. Other more recent works were interested in depth to the time placement problem [25] [26] [27] [28] [29] [30] . The execution of a task on the reconfigurable device leads to the online placement problem, for which a method based on free rectangles management and heuristics fitting has been proposed in [28] and improved in [30, 31] . In [25] the authors modeled the time placement as a three-dimensional problem. A task is represented by a cube in which the X and the Y coordinates represent the width and the height of the task in the given FPGA. The Z coordinates represents the time at which the task will be mapped in the FPGA. In spite of the variety and the importance of the state of the art, this field remains very active. Many problems remain without efficient solutions or have yet to be solved and the academic community continues the suggestion of new techniques and methods to better exploit the dynamic reconfiguration in different applications and systems. The handled problems are due to new applications needs, environment and technology constraints.
Motivations and problem formulation:
The actual systems for multimedia services have demanding applications that can be driven by portability, performance, cost, consumption and flexibility. A key challenge of mobile computing, for example, is that many attributes of the environment vary dynamically [32] . The key issue in the design of portable multimedia systems is to find a good balance between flexibility and high-processing power on one side, and area and energy-efficiency of the implementation on the other side. the rapid evolution of multimedia services and their quality necessitates the use of dedicated electronic systems that assure a big level of flexibility by giving the possibility for updates and new services addition while respecting the application constraints. Dynamically reconfigurable systems, usually based on dynamically reconfigurable FPGA, present a very interesting solution for such problems. In fact, with the development of new height performance FPGA families it becomes possible to achieve high performance in term of latency at relatively low cost in term of used hardware resources. Our research interest is related to SORC (system on reconfigurable chip) conception and Algorithm Architecture Adequation (AAA) for multimedia applications [33] . Resolving problems related to the us of dynamic reconfiguration in SOPC is one important issue in our work .
Critical treatments of a given application are usually executed by hardware modules. By considering the application constraints and the available hardware resources the designers have to choose between dynamic reconfiguration and static configuration (or ASIC implementation). In this study we consider that we have limited reconfigurable hardware resources that should be exploited for the implementation of a given algorithm. The resources limitation can be caused by global consideration related to the system and the application constraints or for the fact that the system already exists and no extensions are possible whereas adding a new service or an update or modification of existing service is required (Fig. 1:) . Our objective is then to be able to run our algorithm in this limited FPGA area resources while respecting the execution time constraint. The main idea consists in considering different implementation alternatives for the tasks set of the algorithm to be implemented and combining the temporal partitioning and tasks design points selection to obtain a solution satisfying the constraints (Fig. 2) .
Different implementation alternatives are possible for the same task. They suggest different area-time tradeoff points. These different implementations are design points in the design space of a task. Choosing the best design point for each task may not necessarily result in the best overall design. When combining several design points we can satisfy different application constraints by using an adequate static selection method [5] to choose the convenient combination of tasks.
Exploring a very large design space can be too computationally expensive. To limit the number of candidate design points in our work we consider tasks with average or high granularity. Design space exploration of tasks does not concern the operators level.
The first step in our method is the Static Estimation (SE). This step can help the designer to choose between static and dynamic implementation. It also gives important information that can be used to take decisions in the partitioning algorithm progress. In the following paragraphs we represent the proposed method and the associated tool which integrate the different techniques used in this method and support the required input and output format.
Static estimation:
The principal input of the proposed method is a tasks graph with different area and latency for each task. Figure 3 represent two possible methods to enter the task graph and parameters of the different tasks: graphical method and text based method. In this figure the number "2" written near the first task means that the correspondent task has for the moment two design points. Design points can also be introduced by using a text editor. Design points are composed of three parameters: area, time and consumption. In this work we interested only in the two first parameters.
The static estimation is done according to an iterative process based on a technique of unconstrained scheduling. It gives important information about static realization. In a first step the algorithm determines the As Soon As Possible (ASAP) and the As Late As Possible (ALAP) schedules of tasks. In a second step the mobility of each task is calculated. The mobility of a task (T i ) is calculated as follow: The proposed framework shows the ASAP and ALAP scheduling results and the different possible schedules (Figure 4) . It also allows calculation of the limit values of area (A min , A max : minimum area, maximum area) and latency (L min , L max : minimum latency, maximum latency) when using the best area design points or the best latency design points of each task . Figure 4 represent an example of a task graph and the scheduling results for the design points presented in 
Mobility (T i ) = ALAP (T i ) -ASAP (T i ) Between the two scheduling limits ((ASAP
)
Static Estimation
Partition (1) Partition(n) The proposed partitioning method: The proposed method consists in modifying the partitioning algorithm presented in [34] to take in to account the different design point of tasks constituting the algorithm to be implemented. The input of the new partitioning algorithm is tasks graph with different area and latency for each task, area constraint, memory constraint and the reconfiguration time. The output of the algorithm is a set of time partitions. The tasks of each partition are executed when the partition is mapped on the reconfigurable device. We aim to obtain the optimal number of temporal segments and to place each task in the appropriate partition while satisfying area, memory and time constraints. The latency reduction is performed by selecting the appropriate design points in each algorithm iteration. The different steps of the proposed method are summarized by Fig. 5 .
To obtain the optimal number of partitions the algorithm preserves the minimal number of partitions for which the area and memory constraints are satisfied (N min ) And verify the possibility to obtain a solution that satisfies the time constraint. To determine N min the algorithm starts by calculating the theoretical minimal number of time partitions Nth min , which is given by : area constraint (FPGA area) ceil is a C++ function. Ceil (x) returns a value representing the smallest integer that is greater than or equal to x If area and memory constraints are satisfied for at least one possible scheduling, N min is equal to Nth min , if not we increment the number of partitions.
The partitioning process for a specified number of partition (N) gives as a result the possible schedules that respect the memory and area constraints. The constraints test allows identification of the best solution that respect the time constraint. If no solution satisfies the time constraint a design points selection is performed to improve the latency of each partition. The constraints test will be applied for the obtained solutions. If, also, no satisfactory solution is obtained N will be incremented and we repeat the different steps for N = N+1. Before repeating the described steps a test called "iteration test" is done. Its objective is to verify if the actual number of partitioning (N) allows a feasible and improved solution to be obtained. As a simple condition to verify the feasibility is that N must be smaller than N tmax which represent the theoretic maximal number of partitions. N tmax is obtained by:
reconfiguration time Floor is a C++ function. Floor(x) returns a value representing the largest integer that is less than or equal to x N tmax represents a very coarse criterion. In our algorithm we consider a better criterion definition based on the time left for the execution of the different tasks for a given number of partitions (N). This execution time is defined as:
When the number of partitions increases using design points with a higher area becomes possible and a higher speed can be obtained, but the available execution time, E T , decreases. Before studying solutions for the actual number of partitions we have to verify if E T is sufficient to execute tasks. In this step the Static Estimation results are useful. Partitioning process: In the partitioning process for a given number of partitions (N) we always start with the minimum area design point for each task and we identify the possible schedules that respect the memory and area constraints. The obtained schedules must respect the dependency constraint: a task T i on which another task T j is dependent has to be placed either in the same partition as T j or in an earlier one. The reconfiguration time and the area, memory and time constraints are introduced by using the graphical interface represented by Fig. 6 . To respect the area constraint the sum of area costs of all the tasks mapped to a temporal partition must be less than the area constraint (Ac):
A (P i ) ≤ Ac A (P i ): area of partition P i . It is calculated as following:
Data transfer between partitions take place due to dependent tasks belonging to different partitions ( figure  7 ). This intermediate data needs to be stored between partitions and should be less than the memory constraint (Mc). For a given partition P i we have to take into account only the available memory resources (Ma) which is calculated by: Ma = Mc -Mu (M u ) represents memory resource used by previous partitions of P i to communicate with subsequent partitions of P i . Data used for this communication must still be saved until the execution of the concerned partitions. In Figure 7 the DATA (D2) will be used by the partition P 3 . For the memory constraint of partition P 2 we have to subtract the necessary memory resources from D2. The Partitioning problem resolution starts by extracting a dependency vector (DV) from the task graph as defined in [34] . In a second step, the algorithm builds the dependency matrix D. The matrix D has as dimension N(lines) x Nt( columns); with N represent the number of partition and Nt represent the number of tasks. From this matrix, the algorithm extracts the possible schedules of tasks while respecting the dependency constraints.
Design point selection:
To be able to meet the latency constraint the design point's selection has as objective to improve the latency of each partition by exploring the different design points of each task constituting the considered partition. For a given partition the selection problem can be formulated as represented in figure 8 . We aim to obtain the optimal latency while respecting the area and memory constraints. Design points of each task is represented in the following form:
: area of the design point N° j of task i L(T i ) j : latency of the design point N° j of task i Area constraint is calculated by considering the sum of tasks area in the considered partition. The latency of a given solution depends on the execution order of tasks (sequential and parallel) and their dependency (Fig. 9 ). It will be the maximal latency among all the paths of the task graph mapped to the partition.
T2 T 5 T3
Ti-s2 
. .
, if the task (T j ) respectively the task (T k ) belongs to the partition (P i ), else ∅ ij = ∅ ik = 0 ∅ jk = 1, if the task (T k ) depends on the task (T j ), else ∅ jk = 0 . For the example represented by figure 9 , the latency is calculated as follows:
We use the same principle to calculate the global latency for a given solution (L(S n )) :
THE PARTITION RESULT REPRESENTATION
The final result is a set of partition of tasks. Each task takes an appropriate design point so that the combinations of the different tasks according to the partition order correspond to a satisfying solution. The final result can be represented graphically as shown in Fig. 10 . 
CONCLUSION
In this study we have described a method and a framework that allows facilitating the implementation of a given algorithm in a limited run time reconfigurable FPGA area. The proposed method consists on combining temporal partitioning and a selection of task design points to obtain constraint satisfying solutions. This is based on the fact that choosing the best design point for each task may not necessarily result in the best overall design as this depends on the architectural constraints and the dependency constraints among the tasks. A dedicated framework was designed to implement the method with graphical interface to be more helpful for the user.
Our work currently concerns different aspects that aim to improve the effectiveness of the results and to concretize the use of DR and the proposed methods in several applications. In our work group we are interested in the multimedia field represented mainly by content based video indexing applications that can be a very important application domain for the RD. To improve the obtained results of the proposed methods for the DR it is necessary to take into account the internal structure of the new FPGA devices in terms of internal memory and resources and their dispositions. This depends on the FPGA family and characteristics that will be used as a supplementary input for our methods. It will represent additional conditions to be taken into account for result reliability improvement.
