I. INTRODUCTION
Reconfigurable computing is an emerging paradigm that addresses the present and future requirements of embedded applications in terms of flexibility and performance. Complex algorithms are conventionally executed in one of two ways. The first is traditional Von Neumann computing, in which microprocessors or microcontrollers are programmed to process data sequentially. The second is the use of application-specific processors or integrated circuits such as ASICs (Application Specific Integrated Circuits), which offer true data-parallel processing. The first solution offers increased flexibility at the cost of degraded performance. In the second solution, the system is designed for a customized application with tight performance constraints (such as latency, throughput, power consumption, and area), which makes future modifications costly and unlikely. The gap between these two approaches can be bridged by a reconfigurable architecture based on an FPGA (Field Programmable Gate Array). Major advances in FPGA logic density, speed, and packaging have made it possible to implement a system of processors, IP blocks, and user logic in a single FPGA (System on a Programmable Chip: SoPC). This technology is currently used to accelerate a wide variety of applications on a large number of systems. It has evolved so much that real-time operation is no longer the designer's only objective [1]. It combines flexibility with specificity: several applications can be realized on specialized architectures simply by configuring the FPGAs each time the FPGA-based board is powered up.
With the advent of new device architectures and new software tools, interest in Run-Time Reconfiguration (RTR), or dynamically reconfigurable logic, has increased. This concept brings several advantages. It helps designers optimize their implementation by increasing the functional density of the FPGA coprocessor. It also makes it possible to time-share the resources available in the FPGA among the different tasks of an application. This can be accomplished by using either total or partial dynamic reconfiguration. For a given application, the reconfigurable resources on the SoPC are not allocated to a single operator but to a set of operators that must be placed at the same time and removed at the same time. The application must therefore be partitioned into sets of operators. The partitions are then implemented successively, at different times, on the device. This process, called temporal partitioning, allows an application to be computed sequentially through temporal sharing of resources among different sets of operators.
Several objective functions can be defined for the temporal partitioning problem [1] - [3] . One objective could be the minimization of the number of partitions to reduce the overall reconfiguration overhead. Another objective could be the minimization of the computation time. This can be expressed for example through the minimization of the maximum computational delay across all the partitions. A third objective could be the communication cost of the design. This aim can be reached by minimizing the transfer of data required between design partitions.
In this paper, we focus on the communication cost between partitions and develop a methodology to solve the temporal partitioning problem for SoPC. The resulting algorithm optimizes both the transfer of data required between design partitions and the reconfiguration overhead.
II. RELATED WORKS
In the literature, many methods have been developed to solve the temporal partitioning problem. In [4] - [6], the authors used traditional scheduling methods such as list scheduling. The idea behind the list scheduling approach is first to place all the nodes of the graph representing the problem in a list. A new partition (also called a configuration) is built stepwise by removing nodes from the list and allocating them to the partition until the size of the partition reaches a given limit (the size of the FPGA). A new partition is then created, and the process is repeated until all the nodes in the list have been placed in partitions. Other authors extended existing scheduling techniques from high-level synthesis [7], [8]. In [9], [10], the authors used an integer linear programming (ILP) algorithm. ILP is a mathematical method for determining the best outcome, such as the lowest latency. The main problem of the ILP approach is its high execution time; therefore, the algorithm can only be applied to small examples. In [11], the authors combined the force-directed scheduling (FDS) algorithm with a network flow algorithm to reduce the whole latency and the communication cost at the same time. In [12], the author used a node-leveling method to determine the communication cost. However, the cut obtained at the end of each stage is only a level cut, not a minimum cut. The network flow algorithm has also been used to reduce the communication cost across temporal partitions in [12], [13]. The first network flow algorithms were used in [12] - [14] and improved in [3]. The method is a recursive bipartitioning approach that successively splits the set of remaining nodes into two sets: one is a final partition, and a further partitioning step must be applied to the other. The network flow approach may minimize the communication cost. However, the model is constructed by inserting a large number of nodes and edges into the original graph, so the resulting graph may grow too big. In the worst case, the number of nodes in the new graph can be twice the number of nodes in the original graph. The number of additional edges also grows dramatically and becomes difficult to handle. Furthermore, the network flow algorithm is a heuristic; there is no mathematical model behind it.
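The list scheduling idea described above can be sketched in a few lines. This is only a minimal illustration, not the implementation used in [4] - [6]: the function name, the choice of a plain topological order as the list priority, and the example sizes are assumptions made for the sketch.

```python
from collections import deque


def list_schedule_partitions(sizes, edges, capacity):
    """Greedy list-scheduling sketch (hypothetical helper).

    sizes:    dict mapping node -> area of the operation
    edges:    iterable of (src, dst) data dependencies
    capacity: area limit of one partition (the size of the FPGA)
    Returns a list of partitions, each a list of nodes, in execution order.
    """
    successors = {v: [] for v in sizes}
    indegree = {v: 0 for v in sizes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Step 1: place all nodes in a list that respects the dependencies
    # (Kahn's topological sort).
    ready = deque(sorted(v for v in sizes if indegree[v] == 0))
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(sizes):
        raise ValueError("the graph contains a cycle")

    # Step 2: remove nodes from the list and fill the current partition
    # until the next node would exceed the size limit.
    partitions, current, used = [], [], 0
    for v in order:
        if sizes[v] > capacity:
            raise ValueError(f"node {v!r} does not fit on the device")
        if used + sizes[v] > capacity:
            partitions.append(current)
            current, used = [], 0
        current.append(v)
        used += sizes[v]
    if current:
        partitions.append(current)
    return partitions


example_sizes = {"a": 3, "b": 2, "c": 4, "d": 1}
example_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(list_schedule_partitions(example_sizes, example_edges, capacity=5))
```

Because nodes are taken in dependency order, every partition only depends on earlier ones, but nothing in this greedy fill looks at the amount of data crossing partition boundaries.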
III. DESIGN FLOW
Current multimedia systems run demanding applications driven by portability, performance, cost, power consumption, and flexibility. A key challenge of mobile computing, for example, is that many attributes of the environment vary dynamically. The key issue in the design of portable multimedia systems is to find a good balance between flexibility and high processing power on one side, and area and energy efficiency of the implementation on the other. The rapid evolution of multimedia services and of their quality requires dedicated electronic systems that offer a high level of flexibility, allowing updates and the addition of new services while respecting the application constraints. Dynamically reconfigurable systems, usually based on dynamically reconfigurable FPGAs, are a very attractive solution to such problems. With the development of new high-performance FPGA families, it has become possible to achieve high performance in terms of latency at a relatively low cost in terms of hardware resources. Our research interest is SoPC (System on a Programmable Chip) design and Algorithm Architecture Adequation (AAA) for multimedia applications. Resolving the problems raised by dynamic reconfiguration in SoPC is an important part of our work.
In this paper, we consider that limited reconfigurable hardware resources must be exploited to implement a given algorithm [8]. The resource limitation may stem from global considerations related to the system and the application constraints, or from the fact that the system already exists and cannot be extended, while a new service must be added or an existing service updated or modified. Our objective is then to run the algorithm within this limited FPGA area while respecting all constraints.
Our methodology aims to solve the following problem: given a multimedia application modeled as a data flow graph (DFG) G = (V, E) and a set of constraints, find a partitioning of the graph into an optimal number of temporal partitions such that the communication cost is minimal while all constraints are respected.
IV. PROPOSED TEMPORAL PARTITIONING ALGORITHM
The temporal partitioning problem [1] - [3] can be formulated as a graph-based problem. A program or application can be represented by a data flow graph (DFG). A DFG is a directed acyclic graph $G = (V, E)$, where $V$ is the set of nodes, $|V| = n$ is the number of nodes in $G$, and $E$ is the set of edges. Each node $v_i \in V$ represents a functional operation, and $a(v_i)$ correspondingly represents the operation size. A directed edge $e_{ij} \in E$ exists if there is a data dependency between nodes $v_i$ and $v_j$. We define the weight $w_{ij}$ of $e_{ij}$ as the amount of data transferred from $v_i$ to $v_j$. A temporal partitioning $P$ of the graph $G = (V, E)$ is its division into disjoint partitions $P = \{P_1, \dots, P_k\}$. The temporal partitioning problem is formulated as follows.
• Inputs: a data flow graph $G = (V, E)$.
• Constraints:
1) $\bigcup_{m=1}^{k} P_m = V$, where $k$ is the number of partitions.
2) All dependency constraints are satisfied for all $k$ partitions. Let $\le$ denote the dependency constraint between nodes: for two nodes $v_i$ and $v_j$, we define $v_i \le v_j$ if $v_i$ must be scheduled no later than $v_j$.
3) $a(P_m) \le A$ for every partition $P_m$, where $a(P_m)$ denotes the area of partition $P_m$ and $A$ denotes the total area of the reconfigurable part of the SoPC.
4) For every partition $P_m$, we have $\sum_{v_i \in P_m,\ v_j \notin P_m} w_{ij} \le T$, where $T$ is the number of programmable input/outputs (I/Os) per device.
We extend the ordering relation $\le$ to $P$ as follows: $P_m \le P_{m'}$ if and only if, for all $v_i \in P_m$ and $v_j \in P_{m'}$, either $v_i \le v_j$ or the relation is not defined for $v_i$ and $v_j$. The partitioning $P$ is ordered if and only if an ordering relation exists for $P$. An ordered partitioning is characterized by the fact that, for any pair of partitions, one should always be implemented after the other with respect to any scheduling relation.
• Objectives: Several objective functions can be defined for the temporal partitioning problem. One objective could be the minimization of the number of partitions to reduce the overall reconfiguration overhead. Another could be the minimization of the computation time, expressed for example through the minimization of the maximum computational delay across all partitions. A third objective could be the minimization of the communication cost of the design, which is reached by minimizing the transfer of data required between design partitions. Fig. 1 shows the partitioning methodology for SoPC. In this paper, we focus on the communication cost between partitions in order to develop an algorithm to solve temporal partitioning problems for SoPC.
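The constraints listed above can be turned into a small feasibility check. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `is_feasible`, the parameter names `area_limit` and `io_limit`, and the reading of the I/O constraint as "total weight of the edges with exactly one endpoint inside the partition" are all choices made for the example.

```python
def is_feasible(partitions, sizes, edges, area_limit, io_limit):
    """Check a candidate temporal partitioning (hypothetical helper).

    partitions: list of lists of nodes, given in execution order
    sizes:      dict node -> area
    edges:      dict (src, dst) -> amount of data transferred
    area_limit: total area of the reconfigurable part (assumed name)
    io_limit:   number of programmable I/Os per device (assumed name)
    """
    # Constraint 1: the partitions are disjoint and cover all nodes.
    index = {}
    for k, part in enumerate(partitions):
        for v in part:
            if v in index:
                return False
            index[v] = k
    if set(index) != set(sizes):
        return False

    # Constraint 2: a producer never runs in a later partition
    # than its consumer.
    if any(index[u] > index[v] for (u, v) in edges):
        return False

    # Constraint 3: every partition fits in the reconfigurable area.
    if any(sum(sizes[v] for v in part) > area_limit for part in partitions):
        return False

    # Constraint 4 (assumed reading): the data crossing the boundary of
    # each partition must fit in the available I/Os.
    for k in range(len(partitions)):
        crossing = sum(w for (u, v), w in edges.items()
                       if (index[u] == k) != (index[v] == k))
        if crossing > io_limit:
            return False
    return True


example_sizes = {"a": 3, "b": 2, "c": 4, "d": 1}
example_edges = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 3}
print(is_feasible([["a", "b"], ["c", "d"]], example_sizes, example_edges, 5, 4))
```

A partitioning that passes this check is feasible but not necessarily good: the objective (for instance the communication cost) still has to be optimized over all feasible partitionings.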
A. Definition 1
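Given a data flow graph $G = (V, E)$ with $n$ nodes, we define:
• The $(n \times n)$ weighted adjacency matrix $W(G)$ as follows: $W_{ij} = w_{ij}$ if an edge joins $v_i$ and $v_j$ (edge direction is ignored, so that $W(G)$ is symmetric), and $W_{ij} = 0$ otherwise;
• The $(n \times n)$ degree matrix $D(G)$ as follows: $D_{ii} = \sum_{j=1}^{n} W_{ij}$, and $D_{ij} = 0$ for $i \ne j$;
• The $(n \times n)$ Laplacian matrix $L(G)$ as follows: $L(G) = D(G) - W(G)$.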
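The three matrices of Definition 1 are straightforward to build. The sketch below uses plain Python lists and assumes that edge direction is ignored when filling the adjacency matrix, so that the matrices are symmetric; the function name and the 4-node example weights are illustrative only.

```python
def laplacian(n, weighted_edges):
    """Build W, D and L = D - W for a graph with n nodes.

    weighted_edges: iterable of (i, j, w) with 0 <= i, j < n.
    Direction is ignored (assumption), so W is symmetric.
    """
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        W[i][j] += w
        W[j][i] += w
    # Degree matrix: row sums of W on the diagonal, zeros elsewhere.
    D = [[sum(W[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Laplacian: entry-wise difference.
    L = [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]
    return W, D, L


# Illustrative 4-node graph: 0->1 (1), 0->2 (2), 1->3 (1), 2->3 (3).
EXAMPLE_EDGES = [(0, 1, 1.0), (0, 2, 2.0), (1, 3, 1.0), (2, 3, 3.0)]
W, D, L = laplacian(4, EXAMPLE_EDGES)
for row in L:
    print(row)
```

Every row of the Laplacian sums to zero, which is why the all-ones vector is an eigenvector with eigenvalue 0.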
B. Definition 2
An n-vector $\mu$ is an eigenvector of $L(G)$ with eigenvalue $\lambda$ if and only if $L(G)\,\mu = \lambda\,\mu$. We denote the set of eigenvectors of $L(G)$ by $\mu_1, \dots, \mu_n$, with corresponding eigenvalues $\lambda_1, \dots, \lambda_n$. The $n \times n$ eigenvector matrix $U_n$ has columns $\mu_1, \dots, \mu_n$, and the $n \times n$ eigenvalue matrix $\Lambda_n$ has diagonal entries $\lambda_1, \dots, \lambda_n$ and 0 entries elsewhere.
We assume that the eigenvectors are normalized, i.e., $\mu_j^T \mu_j = 1$ for $1 \le j \le n$. The eigenvectors of $L(G)$ have many desirable properties, including [15]:
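• The eigenvectors are all mutually orthogonal; hence they form a basis in n-dimensional space.
• $L(G)$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$.
• For every vector $x = \{x_1, \dots, x_n\}$, we have: $x^T L(G)\, x = \sum_{e_{ij} \in E} w_{ij}\,(x_i - x_j)^2$.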
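These spectral properties can be checked numerically on a small example. The sketch below relies on `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order with orthonormal eigenvectors as columns. The 4-node Laplacian and the test vector are illustrative assumptions, and the quadratic-form check uses the standard identity for symmetric weighted Laplacians.

```python
import numpy as np

# Laplacian of an illustrative 4-node graph with undirected weights
# (0,1):1, (0,2):2, (1,3):1, (2,3):3.
EDGES = [(0, 1, 1.0), (0, 2, 2.0), (1, 3, 1.0), (2, 3, 3.0)]
L_example = np.array([[3, -1, -2, 0],
                      [-1, 2, 0, -1],
                      [-2, 0, 5, -3],
                      [0, -1, -3, 4]], dtype=float)

# Eigen-decomposition: eigenvalues ascending, eigenvectors in the columns of U.
eigenvalues, U = np.linalg.eigh(L_example)

# Mutually orthogonal, normalized eigenvectors: U^T U = I.
orthonormal = bool(np.allclose(U.T @ U, np.eye(4)))

# Real, non-negative eigenvalues whose smallest value is 0.
smallest_is_zero = abs(eigenvalues[0]) < 1e-9
all_non_negative = bool(np.all(eigenvalues > -1e-9))

# Quadratic form: x^T L x equals the weighted sum of squared differences
# over the edges.
x = np.array([1.0, 0.0, 2.0, -1.0])
quadratic_form = float(x @ L_example @ x)
edge_sum = float(sum(w * (x[i] - x[j]) ** 2 for i, j, w in EDGES))

print(orthonormal, smallest_is_zero, all_non_negative)
print(quadratic_form, edge_sum)
```

The quadratic form is the link to partitioning: when `x` is a 0/1 indicator vector of a set of nodes, the sum on the right counts exactly the weight of the edges leaving that set.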
Given a temporal partitioning of $G = (V, E)$ into $k$ disjoint partitions $P = \{P_1, P_2, \dots, P_k\}$, the communication cost $CCost(P_m)$ of partition $P_m$ has been defined in [13] as follows: $CCost(P_m) = \sum_{v_i \in P_m} \sum_{v_j \in \bar{P}_m} w_{ij}$.
This implies that: $TCCost = \sum_{m=1}^{k} CCost(P_m)$,
where $TCCost$ is the total communication cost, $|P_m|$ is the number of nodes inside partition $P_m$, and $|\bar{P}_m|$ is the number of nodes outside partition $P_m$. Hence, we have $|P_m| + |\bar{P}_m| = |V|$.
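The communication cost can be computed directly from the edge list. The sketch below is an illustration under two assumptions: the cost of a partition is the total weight of the edges with exactly one endpoint inside it, and the total cost is the plain sum over all partitions, so each cut edge is counted once for each of the two partitions it connects. Function names and example weights are hypothetical.

```python
def ccost(partition, weighted_edges):
    """Weight of the edges with exactly one endpoint inside `partition`."""
    inside = set(partition)
    return sum(w for i, j, w in weighted_edges
               if (i in inside) != (j in inside))


def tccost(partitions, weighted_edges):
    """Total communication cost: sum of the per-partition costs (assumed)."""
    return sum(ccost(p, weighted_edges) for p in partitions)


# Illustrative 4-node graph: 0->1 (1), 0->2 (2), 1->3 (1), 2->3 (3).
EXAMPLE_EDGES = [(0, 1, 1), (0, 2, 2), (1, 3, 1), (2, 3, 3)]

print(ccost([0, 1], EXAMPLE_EDGES))             # edges 0->2 and 1->3 are cut
print(tccost([[0, 1], [2, 3]], EXAMPLE_EDGES))
print(tccost([[0, 2], [1, 3]], EXAMPLE_EDGES))
```

On this example, grouping nodes 0 and 2 together keeps the heaviest edge inside a partition, so it yields a lower total than grouping nodes 0 and 1.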
D. Lemma 1
Given a temporal partitioning of $G = (V, E)$ into $k$ disjoint partitions $P = \{P_1, P_2, \dots, P_k\}$, we have:
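E. Lemma 2
Given a temporal partitioning of $G = (V, E)$ into $k$ disjoint partitions $P = \{P_1, P_2, \dots, P_k\}$, we define the $n \times k$ matrix $M$ as follows:
$M = (m_{ih})$, where $m_{ih} = 1$ if $v_i \in P_h$ and $m_{ih} = 0$ otherwise.
We have: $TCCost = \mathrm{trace}(M^T L(G)\, M)$. We define the function max sum vector: $g(S) = \sum_{h=1}^{k} \|Y_h\|^2$, where $Y_h$ is the sum of the vectors in subset $S_h$.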
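We call $Y_h = \sum_{y \in S_h} y$ the subset vector for $S_h$. Let $y_i^d$ denote row $i$ of the scaled eigenvector matrix $\hat{U}_d$. Consider the d-dimensional vector partitioning instance with the vector set $Y = \{y_1^d, \dots, y_n^d\}$, in which each graph task $T_i$ corresponds to a vector $y_i^d$. (Observe that $y_i^d$ is the indicator vector corresponding to $T_i$ in the scaled eigenspace.) We say that a graph partitioning $P = \{P_1, \dots, P_k\}$ corresponds to a vector partitioning $S = \{S_1, \dots, S_k\}$ if and only if $T_i \in P_h$ whenever $y_i^d \in S_h$.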
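The role of the n x k partition matrix can be illustrated with a short check. The sketch below assumes that the matrix has a 1 in row i, column h exactly when node i belongs to partition h, and it verifies on a small example the standard identity that the trace of M^T L M equals the total weight leaving the partitions (each cut edge counted once per side). The helper names and the example Laplacian are assumptions made for the illustration.

```python
def assignment_matrix(n, partitions):
    """n x k 0/1 matrix: entry (i, h) is 1 when node i is in partition h."""
    M = [[0] * len(partitions) for _ in range(n)]
    for h, part in enumerate(partitions):
        for i in part:
            M[i][h] = 1
    return M


def trace_mt_l_m(M, L):
    """trace(M^T L M), computed column by column as sum_h X_h^T L X_h."""
    n, k = len(M), len(M[0])
    total = 0
    for h in range(k):
        column = [M[i][h] for i in range(n)]
        total += sum(column[i] * L[i][j] * column[j]
                     for i in range(n) for j in range(n))
    return total


# Laplacian of an illustrative graph with undirected weights
# (0,1):1, (0,2):2, (1,3):1, (2,3):3.
L_EXAMPLE = [[3, -1, -2, 0],
             [-1, 2, 0, -1],
             [-2, 0, 5, -3],
             [0, -1, -3, 4]]

M = assignment_matrix(4, [[0, 1], [2, 3]])
print(M)
print(trace_mt_l_m(M, L_EXAMPLE))
```

For the partitioning {0, 1}, {2, 3} the cut edges weigh 2 and 1, and each is seen from both sides, which gives the printed total.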
G. Lemma 3
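Let $S = \{S_1, \dots, S_k\}$ be a partitioning of the vector set $\{y_1, \dots, y_n\}$, where $y_i$ is row $i$ of $\hat{U}_n$; then, if $P$ corresponds to $S$: $n\,l - TCCost = \sum_{h=1}^{k} \|Y_h\|^2$.
Proof: For a given $S_h$, let $Y_h = \sum_{y_i \in S_h} y_i$; we have: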
To establish a reduction between min-cut graph partitioning (minimization of $TCCost$) and max-sum vector partitioning, we reformulate the min-cut objective as the maximization objective $(n\,l - TCCost)$, where $l$ is some constant greater than or equal to $\lambda_n$. Using (6), we formulate a new maximization objective for data flow graph partitioning as:
The choice of $l$ ensures that $n\,l - TCCost \ge 0$, where $n$ is the total number of tasks.
H. Definition 4
The $n \times d$ scaled eigenvector matrix $\hat{U}_d$ is given by $\hat{U}_d = U_d \sqrt{l\,I_d - \Lambda_d}$, i.e., by the matrix $U_d$ with each column $\mu_j$ scaled by $\sqrt{l - \lambda_j}$.
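The link between the scaled eigenvectors and the cut objective can be verified numerically. The sketch below makes several assumptions that are not stated in the text: all n eigenvectors are kept (d = n), the constant is chosen equal to the largest eigenvalue, each eigenvector column is scaled by the square root of (constant minus eigenvalue), and the total communication cost counts every cut edge once per side. Under those assumptions, the sum of squared norms of the subset vectors equals n times the constant minus the total cost.

```python
import numpy as np

# Laplacian of an illustrative graph with undirected weights
# (0,1):1, (0,2):2, (1,3):1, (2,3):3.
L_example = np.array([[3, -1, -2, 0],
                      [-1, 2, 0, -1],
                      [-2, 0, 5, -3],
                      [0, -1, -3, 4]], dtype=float)
partitions = [[0, 1], [2, 3]]
n = L_example.shape[0]

eigenvalues, U = np.linalg.eigh(L_example)   # ascending eigenvalues
l_const = float(eigenvalues[-1])             # any value >= largest eigenvalue

# Scaled eigenvector matrix: column j of U times sqrt(l_const - lambda_j).
# clip guards against tiny negative values caused by rounding.
scale = np.sqrt(np.clip(l_const - eigenvalues, 0.0, None))
scaled = U * scale

# Row i of `scaled` is the vector attached to node i; a subset vector is
# the sum of the rows of the nodes placed in the same partition.
subset_vectors = [scaled[p].sum(axis=0) for p in partitions]
max_sum = sum(float(y @ y) for y in subset_vectors)

# Total communication cost as sum_h X_h^T L X_h (sum of the L entries
# restricted to each partition).
total_cost = sum(float(L_example[np.ix_(p, p)].sum()) for p in partitions)

print(max_sum, n * l_const - total_cost)
```

Since the left-hand side grows exactly when the cut shrinks, maximizing the sum of squared subset-vector norms is equivalent to minimizing the total communication cost for a fixed constant.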
I. Lemma

V. EXPERIMENTS
In our experiments, we compared four approaches: list scheduling [6], the initial network flow [13], the improved network flow [3], and the proposed algorithm. We evaluated the performance of each algorithm in terms of total communication cost, whole latency of the graph, and run time of the algorithm. The platform used for the evaluation was the Virtex-II XC2V1000 FPGA, whose characteristics are shown in Table I. Fig. 2 shows the main blocks of H.264. The prediction algorithms are the main elements of H.264. The "Inter" prediction exploits the temporal correlation between successive images, and the "Intra" prediction exploits the spatial correlation within the same image. These two prediction modes allow a considerable gain in terms of quality and compression ratio. In intra mode, the predicted block is based on previously encoded blocks. This predicted block is subtracted from the current block prior to encoding. For the luminance (luma) samples, the predicted block may be formed for each 4×4 sub-block or for a 16×16 macroblock. There are nine optional prediction modes for each 4×4 luma block and four optional modes for a 16×16 luma block [16], [17]. Table II gives the different solutions provided by the ILP algorithm, list scheduling, the initial network flow technique, the enhanced network flow, and the proposed algorithm. The results show that our algorithm always yields the lowest number of partitions. They also show a D(G) improvement of 25.74% compared to list scheduling, 19.35% compared to the network flow algorithm, and 15.52% compared to the enhanced network flow algorithm.
Fig. 2. Main blocks of H.264 (block labels: Entropy Coding, Transform, Quantization, Intra Prediction, Core encoder).
VI. CONCLUSION
Dynamically reconfigurable computing systems have the potential to achieve high performance at a relatively low cost for a wide range of applications. In this paper, we have proposed a new temporal partitioning algorithm for Systems on Programmable Chip that reduces the communication cost. Experiments on benchmark circuits such as the H.264 task graph have shown the effectiveness of the proposed algorithm.
