Abstract. In recent years, dynamic reconfigurable processors that can be reconfigured within a few cycles have been proposed. Such fast reconfiguration makes run-time reconfiguration possible, which in turn opens a new use of the dynamic reconfigurable processor: it can execute a task partitioned into independent subtasks by repeating reconfiguration and execution. However, the performance of an execution with run-time reconfiguration must be evaluated together with various overheads, such as reconfiguration and memory accesses. These overheads depend on the reconfigurable architecture and are generally difficult to evaluate. Because they may critically affect performance, designers should carefully explore the design space for suitable architectures. In this paper, we propose a dynamic reconfigurable architecture exploration method based on a Parameterized Reconfigurable Processor model (PRP-model) and a task partitioning optimization algorithm for that model. Experimental results show that the proposed PRP-model and task partitioning algorithm can quickly evaluate various reconfigurable architectures, and that designers can easily find suitable reconfigurable architectures by changing the PRP-model parameters.
Introduction
Portable information systems such as cellular phones and mobile MP3 players are becoming widespread in daily life. The design of such systems must satisfy various requirements, e.g. low hardware cost and high performance. Low hardware cost leads to inexpensive and portable products, and high performance is usually needed for media processing. Moreover, flexibility is required to support various coding standards with the same on-board chips. To fulfill these requirements, system designers work under hard constraints on the design quality metrics: performance, area, and power. When designing embedded systems, it is therefore essential to explore the design solution space and choose the solution that best fits the designers' needs.
To meet these requirements, Instruction Set Processors (ISPs) or ASICs have usually been used. However, neither can completely satisfy them. ISPs can flexibly execute various functions by changing software, but they cannot achieve high performance. ASICs can realize high performance for specific applications, but they are difficult to reuse for other applications. Therefore, dynamic reconfigurable processors, which combine the flexibility of ISPs with the high performance of ASICs, have attracted attention as a new approach [1, 2].
Dynamic reconfigurable processors usually have many coarse-grained processing elements (PEs), which are connected to each other by flexible interconnections. Since many PEs operate simultaneously, tasks with high parallelism are executed efficiently. The function of a dynamic reconfigurable processor is defined by configuration data, e.g. the settings of the interconnections and the function of each PE. That is, according to configuration data prepared beforehand, the dynamic reconfigurable processor can reconfigure itself into various circuits. The reconfiguration speed depends on the total amount of configuration data and on the specification of the reconfigurable architecture, and the reconfiguration timing is decided by the reconfiguration speed.
In recent years, dynamic reconfigurable processors that can be reconfigured within a few cycles have been proposed. Such fast reconfiguration makes run-time reconfiguration possible, which in turn opens a new use of the dynamic reconfigurable processor: it can execute a task partitioned into independent subtasks by repeating reconfiguration and execution. Run-time reconfiguration is a feature that traditional programmable devices do not have, and we focus on this feature and its potential.
However, the performance of an execution with run-time reconfiguration must be evaluated together with various overheads, such as reconfiguration and memory accesses. These overheads depend on the reconfigurable architecture and are generally difficult to evaluate. Because they may critically affect performance, designers should carefully explore the design space for suitable architectures. The variety of reconfigurable architectures and the difficulty of evaluating them make it hard for designers to explore the vast design space for the best solution, so a fast evaluation method for various reconfigurable architectures is needed.
In this paper, we propose a Parameterized Reconfigurable Processor model (PRP-model) and a task partitioning optimization algorithm for architecture exploration based on the proposed PRP-model. The task partitioning optimization algorithm divides a task into subtasks so as to minimize the execution cycles. The algorithm is applicable to various reconfigurable architectures and supports their evaluation for specific applications by changing the PRP-model parameters. Designers can thus easily find reconfigurable architectures suitable for run-time reconfiguration. This paper is structured as follows: Section 2 gives an overview of related work and highlights our contribution. Sections 3, 4, and 5 present the Parameterized Reconfigurable Processor model, the port expansion DFG, and the proposed algorithm, respectively. Experimental results are given in Section 6. Section 7 concludes this paper.
Related Work
Various reconfigurable processor architectures have been proposed, and they are classified by reconfiguration granularity or reconfiguration time [1]. To use these architectures effectively, many design methods, algorithms, and applications have been studied. In particular, task partitioning is one of the essential problems in using dynamic reconfigurable processors, and many studies have addressed it. However, to our knowledge, no task partitioning method has been proposed for the execution model of repeated reconfigurations and executions.
[3] proposed a HW/SW task partitioning method that considers task assignment to Instruction Set Processors (ISPs). [4] proposed a task partitioning method that partitions tasks between two reconfigurable processors with different granularities; it cannot be applied to other reconfigurable processor architectures. [5] proposed a task partitioning algorithm that considers reconfiguration overhead, defined as the number of FPGA CLBs used for communication. [6] proposed a behavior partitioning method that performs high-level synthesis of each task of the behavior simultaneously. It can obtain the optimal solution because the problem is reduced to NLP, but it takes a long time even for a small target application. [7] proposed a task partitioning method under a constraint on the number of memory ports, which can obtain the optimal solution. [8] considers task partitioning with a limit on the number of partitioned subtasks; in this paper, we do not limit the number of partitioned subtasks. [9] proposed a task partitioning method for the DRL architecture.
The methods in [3-8] target FPGA platforms. An FPGA needs to store intermediate data in external memory at a reconfiguration, because it cannot hold the data during reconfiguration. In recent years, many dynamic reconfigurable architectures with registers in the array have been proposed, and these architectures can hold data in the array during reconfiguration. In this paper, we propose a task partitioning method not only for FPGAs but also for dynamic reconfigurable architectures that can hold data in the array during reconfiguration.
[10, 11] proposed reconfigurable architecture exploration methods. [10] proposed a design space exploration method using the task partitioning method proposed in [6]. [11] proposed the ADRES architecture template, a unique architecture consisting of a VLIW processor and a PE array, together with an architecture exploration method applicable only to that template. In this paper, to realize the execution model of repeated reconfigurations and executions, we propose an architecture exploration method that lets designers quickly evaluate various reconfigurable architectures by changing architecture parameters.
Parameterized Reconfigurable Processor Model
To evaluate many reconfigurable architectures in a short time, a reconfigurable processor model that covers many kinds of processing elements and memory architectures is needed. In this paper, we propose a Parameterized Reconfigurable Processor model (PRP-model) and a task partitioning algorithm based on it. Figure 1 illustrates the proposed PRP-model. The PRP-model includes a PE array in which processing elements (PEs) are arranged, internal memories with different capacities, and a configuration memory that stores configuration data.
PE Array. A PE array is composed of three types of PEs: pPE, rPE, and prPE. A pPE (Processing PE) has an ALU, an rPE (Register PE) has a register file, and a prPE (Processing and Register PE) has both an ALU and a register file. The numbers of pPEs, rPEs, and prPEs in a PRP-model are denoted as $n_{pPE}$, $n_{rPE}$, and $n_{prPE}$, respectively. pPEs can perform operations but cannot hold data during reconfiguration since they have no registers. rPEs can hold data during reconfiguration but cannot perform any operations since they have no ALUs. Since prPEs have both ALUs and registers, they can both operate on and hold data. The ALUs in pPEs and prPEs have the same functionality, and we assume that the application tasks are decomposed into operations that pPEs or prPEs can execute. In this model, we measure the amount of data as a number of data packets, and the numbers of data that rPEs and prPEs can hold are denoted as $n^{reg}_{rPE}$ and $n^{reg}_{prPE}$, respectively. We consider only the available number of PEs and not their interconnection or placement, because the interconnection and placement can be defined after the number of PEs is decided.
Processor Structure Model
Memory Structure. The memory structure of the PRP-model comprises three types of memories: external memory, internal memory, and the register files of rPEs and prPEs. The number of internal memories is $n^{in}_{mem}$. Each memory has read and write access ports, and at reconfiguration it can read or write as many data at once as it has access ports.
External memory is large enough to hold any data, and it has $p^{ex}_r$ read access ports and $p^{ex}_w$ write access ports. One read access needs $t^{ex}_r$ cycles, and one write access needs $t^{ex}_w$ cycles. Thus, up to $p^{ex}_r$ data are read from external memory in $t^{ex}_r$ cycles, and up to $p^{ex}_w$ data are written to external memory in $t^{ex}_w$ cycles.
The i-th internal memory holds $n^{in}_{mem\_dat}(i)$ data. The numbers of read ports and write ports of an internal memory are $p^{in}_r$ and $p^{in}_w$, respectively, and $t^{in}_r$ cycles or $t^{in}_w$ cycles are needed to read or write data.
rPEs and prPEs have $p^{reg}_r$ read ports and $p^{reg}_w$ write ports, and $t^{reg}_r$ cycles or $t^{reg}_w$ cycles are needed to read or write data.
In the PRP-model, the memory accesses for data read or write can be performed in parallel, and the computation starts after all memory accesses have finished.
Configuration Memory. The configuration memory can store $n_{config}$ configuration data. All configuration data have the same size, and a reconfiguration always needs $t_{config}$ cycles. After the k-th reconfiguration, the k-th configuration data can be overwritten with new configuration data from external memory. It takes $t^{cfg}_r$ cycles to store one configuration data in the configuration memory.
Let $n_{subtask}$ be the number of partitioned subtasks to execute; that is, $n_{subtask}$ configurations are needed to execute the target task. When $n_{subtask}$ does not exceed $n_{config}$, all configuration data can be kept in the configuration memory. On the other hand, when $n_{subtask}$ is greater than $n_{config}$, $n_{subtask} - n_{config}$ configuration data cannot be kept in the configuration memory. Thus, at run-time, the configuration data that cannot be kept in the configuration memory must be read from external memory before the corresponding reconfiguration. The configuration data read and the task execution can proceed simultaneously.
Usually, a memory specification is defined by bit width and memory depth. Let $BW_{CM}$ and $MD_{CM}$ be the bit width and the memory depth of the configuration memory, respectively. The total number of bits of the configuration memory, $TB_{CM}$, is calculated as follows:
$$TB_{CM} = BW_{CM} \times MD_{CM} \qquad (1)$$
Let $TB_{config}$ be the total number of bits of one configuration data; $TB_{config}$ is expressed by Eq.(2).
where Scale is a parameter of reconfigurable architecture complexity, defined for each reconfigurable architecture by, e.g., the function of each PE and the interconnection. In this research, we assume that the configuration data increase linearly with the number of PEs, because the function of the PEs and the interconnection are not considered in the PRP-model. Then, $n_{config}$ and $t^{cfg}_r$ are expressed by Eq.(3) and Eq.(4).
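As an illustration of how these configuration memory parameters relate, the following Python sketch computes them under stated assumptions: one configuration is taken to need `scale * n_pe` bits (the linear growth assumed above), the configuration memory is taken to hold as many whole configurations as fit, and one memory word of `bw_cm` bits is taken to be transferred per cycle. These forms, the function name, and the example memory sizes are assumptions for illustration, not necessarily the exact definitions of the PRP-model.

```python
def config_memory_params(bw_cm, md_cm, scale, n_pe):
    """Return (tb_cm, tb_config, n_config, t_cfg_r) under the stated assumptions."""
    tb_cm = bw_cm * md_cm                          # total bits of the configuration memory
    tb_config = scale * n_pe                       # assumed: bits per configuration, linear in PEs
    n_config = tb_cm // tb_config                  # assumed: whole configurations that fit
    t_cfg_r = (tb_config + bw_cm - 1) // bw_cm     # assumed: one bw_cm-bit word per cycle
    return tb_cm, tb_config, n_config, t_cfg_r


# Hypothetical memory: 256 bits wide, 1024 words deep, Scale = 128.
for n_pe in (16, 32):
    print(n_pe, config_memory_params(256, 1024, 128, n_pe))
```

Under these assumptions, doubling the number of PEs halves the number of stored configurations and doubles the cycles needed to load one configuration.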
Memory Access Overhead
The PRP-model has the following storage resources: external memory, internal memories, and the register files of rPEs and prPEs. The input data of the application are read from external memory, and its output data are finally written to external memory. Internal memories and register files in PEs are therefore used to keep temporary data across reconfigurations. When the temporary data cannot be kept in internal memories or register files because of their limited capacity, the data are stored in external memory. The memory access cycles on the PRP-model are calculated as follows. Let $req^{in}_r(i, j)$ be the number of data requested from the i-th internal memory at configuration j. The data read cycles at configuration j, $T^{in}_r(j)$, are expressed by Eq.(5).
where $T^{in}_r(i, j)$, the data read cycles from the i-th internal memory at configuration j, are as follows:
Let $req^{ex}_r(j)$ be the number of data requested from external memory at configuration j. The data read cycles at configuration j, $T^{ex}_r(j)$, are expressed by Eq.(7).
Let $req^{prPE}_r(i, j)$ be the number of data requested from the i-th prPE at configuration j. The data read cycles at configuration j, $T^{prPE}_r(j)$, are expressed by Eq.(8).
$$T^{prPE}_r(j) = \max_{i} T^{prPE}_r(i, j) \qquad (8)$$
where $T^{prPE}_r(i, j)$, the data read cycles from the i-th prPE at configuration j, are as follows:
Let $req^{rPE}_r(i, j)$ be the number of data requested from the i-th rPE at configuration j. The data read cycles at configuration j, $T^{rPE}_r(j)$, are expressed by Eq.(10).
$$T^{rPE}_r(j) = \max_{i} T^{rPE}_r(i, j) \qquad (10)$$
where $T^{rPE}_r(i, j)$, the data read cycles from the i-th rPE at configuration j, are as follows:
The data read cycles ("Memory Access (Data Read)" in Figure 2) of configuration j are expressed by Eq.(12).
where $T^{PE}_r(j)$, the data read cycles from prPEs and rPEs at configuration j, are as follows:
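The read-cycle calculation can be sketched in Python as follows. The sketch assumes that a resource with p ports and access latency t serves r requests in ceil(r / p) * t cycles, and that all storage resources are accessed in parallel so that the slowest one determines the read time. The function names and the parameter dictionary keys are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def access_cycles(requests, ports, latency):
    """Cycles to transfer `requests` data through `ports` ports (assumed form)."""
    return ceil_div(requests, ports) * latency


def read_cycles(req_ex, req_in, req_prpe, req_rpe, p):
    """Data read cycles of one configuration.

    req_ex   : number of data requested from external memory
    req_in   : list of request counts, one per internal memory
    req_prpe : list of request counts, one per prPE
    req_rpe  : list of request counts, one per rPE
    p        : dict of port counts and latencies (hypothetical key names)
    """
    t_ex = access_cycles(req_ex, p["p_ex_r"], p["t_ex_r"])
    t_in = max((access_cycles(r, p["p_in_r"], p["t_in_r"]) for r in req_in), default=0)
    t_prpe = max((access_cycles(r, p["p_reg_r"], p["t_reg_r"]) for r in req_prpe), default=0)
    t_rpe = max((access_cycles(r, p["p_reg_r"], p["t_reg_r"]) for r in req_rpe), default=0)
    t_pe = max(t_prpe, t_rpe)          # register files of prPEs and rPEs
    return max(t_ex, t_in, t_pe)       # all resources are read in parallel


params = {"p_ex_r": 1, "t_ex_r": 10, "p_in_r": 2, "t_in_r": 2, "p_reg_r": 1, "t_reg_r": 1}
print(read_cycles(3, [4, 1], [2], [1], params))
```

With these hypothetical parameters, three external reads dominate the read time, which illustrates why the storage resource holding intermediate data matters.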
The data write cycles at configuration j, $T_w(j)$, are expressed in the same way.
The PRP-model is configured into the target circuit according to the configuration data, and input data or intermediate data are read from external memory, internal memories, and the register files of rPEs and prPEs. The partitioned subtask is then processed, and finally the output data or intermediate data are stored in the storage resources. The PRP-model repeats this processing flow. However, because the configuration memory is limited, the processing flow may stall while the configuration data needed for the next reconfiguration are being read. The waiting cycles for configuration data read are calculated as follows.
Processing and Reconfiguration
Let $End_{exec}(i)$ and $End^{cfg}_r(i)$ be the end time of the execution of configuration i and the end time of the read of the i-th configuration data from external memory, respectively. The end time of the i-th reconfiguration is expressed by Eq.(14).
where $Start_{reconf}(i)$ is as follows:
After the i-th reconfiguration, configuration i is executed. Let $T_{proc}(i)$ be the processing cycles of the i-th subtask excluding memory access cycles. The end time of the execution of configuration i is expressed by Eq.(16).
where $Start_{exec}(i)$ is as follows:
The PRP-model can read configuration data and execute a task simultaneously. The start time of the configuration data read is expressed by Eq.(18).
where $End^{cfg}_r(i)$ is as follows:
Therefore, the waiting cycles $T_{wait}$ caused by the stalled processing flow are expressed by Eq.(20).
The total execution cycles are evaluated including $T_{wait}$.
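The interplay of execution, reconfiguration, and configuration data read can be sketched as a small timeline simulation in Python. This is only a sketch under assumptions that the text does not fully specify: the first `n_config` configurations are preloaded, later configuration data are loaded one at a time, the slot for configuration i becomes free once reconfiguration `i - n_config` has finished, and execution (memory accesses plus processing) starts immediately after reconfiguration. The function and argument names are hypothetical.

```python
def simulate(t_proc, t_mem, n_config, t_config, t_cfg_r):
    """Return (total_cycles, waiting_cycles) for a sequence of subtasks.

    t_proc[i] : processing cycles of subtask i (without memory accesses)
    t_mem[i]  : data read plus data write cycles of subtask i
    """
    n = len(t_proc)
    if n == 0:
        return 0, 0
    end_reconf = [0] * n
    end_exec = [0] * n
    end_load = [0] * n            # stays 0 for preloaded configurations
    prev_load_end = 0
    wait = 0
    for i in range(n):
        if i >= n_config:
            # Assumed: the load waits for the previous load and for the slot to be free.
            start_load = max(prev_load_end, end_reconf[i - n_config])
            end_load[i] = start_load + t_cfg_r
            prev_load_end = end_load[i]
        prev_exec_end = end_exec[i - 1] if i > 0 else 0
        start_reconf = max(prev_exec_end, end_load[i])
        wait += start_reconf - prev_exec_end      # stall caused by configuration read
        end_reconf[i] = start_reconf + t_config
        end_exec[i] = end_reconf[i] + t_mem[i] + t_proc[i]
    return end_exec[-1], wait


print(simulate([10, 10, 10], [2, 2, 2], n_config=3, t_config=1, t_cfg_r=100))
print(simulate([10, 10, 10], [2, 2, 2], n_config=1, t_config=1, t_cfg_r=100))
```

In the second call the configuration memory holds only one configuration, so most of the total cycles are waiting cycles, which mirrors the stall described above.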
Port Expansion DFG
We define the port expansion DFG to evaluate reconfiguration overhead by calculating the exact number of memory accesses. A data flow graph (DFG) is a task representation that expresses execution dependencies. A traditional DFG is composed of nodes and edges. A node represents an operation, and an edge connecting a pair of nodes represents the data flow between them. The incoming edges of a node represent the input data of the corresponding operation, and the outgoing edges represent its results. When a datum is used by several nodes, it may be read several times, but it is written only once. The traditional DFG cannot precisely represent the number of memory accesses, because it cannot distinguish whether different outgoing edges carry the same datum or different data. Thus, we label each input and output of a node of the traditional DFG as a "port", which represents one input or output datum. We call this DFG a port expansion DFG. Each node of the port expansion DFG has ports, and each edge connects a pair of ports. Figure 4 shows an example of a port expansion DFG based on the traditional DFG shown in Figure 3. In Figure 3, it is difficult to calculate the number of memory accesses when configuration n is reconfigured into configuration n + 1. Using the port expansion DFG, however, we can see that node 0 outputs one datum used at nodes 2 and 3, and that node 1 outputs two data used at nodes 3 and 4, respectively. Thus, three data writes are required after configuration n.
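The counting argument can be sketched in Python. The sketch represents each edge of a port expansion DFG as a tuple (source node, source port, destination node, destination port) and counts one write per distinct source port whose datum is consumed in a later configuration. The edge list mirrors the example described in the text, but the port indices and the helper names are assumptions, since the figure itself is not reproduced here.

```python
def count_writes(edges, cfg, boundary):
    """Number of data written when leaving configuration `boundary`.

    One write per distinct (node, output port) that has at least one
    consumer in a later configuration.
    """
    ports = {(src, sp) for (src, sp, dst, dp) in edges
             if cfg[src] <= boundary < cfg[dst]}
    return len(ports)


def count_crossing_edges(edges, cfg, boundary):
    """What a traditional DFG would suggest: one access per crossing edge."""
    return sum(1 for (src, sp, dst, dp) in edges if cfg[src] <= boundary < cfg[dst])


# Nodes 0 and 1 are in configuration 0; nodes 2, 3, and 4 are in configuration 1.
cfg = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
edges = [
    (0, 0, 2, 0),   # node 0, port 0 -> node 2
    (0, 0, 3, 0),   # same datum of node 0 -> node 3
    (1, 0, 3, 1),   # node 1, port 0 -> node 3
    (1, 1, 4, 0),   # node 1, port 1 -> node 4
]
print(count_writes(edges, cfg, 0), count_crossing_edges(edges, cfg, 0))
```

Counting edges alone gives four, while counting distinct output ports gives the three writes stated in the text.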
Task Partitioning Algorithm

Task Partitioning Problem
In this section, we define the task partitioning problem for the PRP-model.
Task Partitioning Problem. Given a port expansion DFG of the target application and a PRP-model, find a task partition and a storage resource assignment that minimize the execution cycles while keeping the execution order defined by the port expansion DFG.
Outline of Task Partitioning Algorithm
In this section, we propose a task partitioning algorithm for the PRP-model that uses Simulated Annealing (SA) and consists of configuration assignment and storage resource assignment. The outline of the algorithm is as follows. In configuration assignment, the task is partitioned into several subtasks, and each subtask is assigned to a configuration according to the execution order. Then, in storage resource assignment, the intermediate data between configurations are assigned to storage resources.
Configuration Assignment
In this section, we explain how to generate a neighbor solution (configuration assignment) using the MOVE operation. The MOVE operation in SA moves a randomly selected node to an adjacent configuration. We define two MOVE operations: $MOVE_{BWD}$ and $MOVE_{FWD}$. $MOVE_{BWD}$ moves the node to the previous configuration, and $MOVE_{FWD}$ moves the node to the next configuration (Figure 5).
Let $Child(x)$ and $Parent(x)$ be the set of child nodes and the set of parent nodes of node x, respectively. Let $Cfg(s, x)$ be the number of the configuration to which node x is assigned in solution s, and let $Vacant(s, c)$ be the number of vacancies of configuration c in solution s. We call configuration c empty when it contains no nodes.
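When node n in solution s satisfies Eq.(21) and Eq.(22), $MOVE_{BWD}$ can be applied to node n.
$$\forall x \in Parent(n):\ Cfg(s, x) < Cfg(s, n) \qquad (21)$$
$$Vacant(s, Cfg(s, n) - 1) > 0 \qquad (22)$$
Similarly, when node n in solution s satisfies Eq.(23) and Eq.(24), $MOVE_{FWD}$ can be applied to node n.
$$\forall x \in Child(n):\ Cfg(s, x) > Cfg(s, n) \qquad (23)$$
$$Vacant(s, Cfg(s, n) + 1) > 0 \qquad (24)$$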
The $MOVE_{BWD}$ and $MOVE_{FWD}$ operations sometimes produce an empty configuration. In Figure 6(a), an empty configuration c appears in the configuration sequence because $MOVE_{BWD}$ was applied to node k. In such a case, $MOVE_{BWD}$ removes the empty configuration c after moving node k (Figure 6(b)). We call this operation "Packing". Thanks to packing, the number of configurations never exceeds the number of nodes of the port expansion DFG.
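The applicability checks and the packing step can be sketched in Python as follows. The sketch stores a solution as a dictionary from node to configuration number and assumes that every configuration offers the same number of node slots (`capacity`), that a backward move needs an existing previous configuration, and that a forward move from the last configuration may open a new one. These assumptions and all names are illustrative.

```python
def vacant(cfg, c, capacity):
    """Free node slots in configuration c (assumed uniform capacity)."""
    return capacity - sum(1 for v in cfg.values() if v == c)


def can_move_bwd(cfg, parents, n, capacity):
    """All parents lie in earlier configurations and the previous one has room."""
    c = cfg[n]
    return (c > 0                                        # assumed: no move before configuration 0
            and all(cfg[x] < c for x in parents.get(n, ()))
            and vacant(cfg, c - 1, capacity) > 0)


def can_move_fwd(cfg, children, n, capacity):
    """All children lie in later configurations and the next one has room."""
    c = cfg[n]
    return (all(cfg[x] > c for x in children.get(n, ()))
            and vacant(cfg, c + 1, capacity) > 0)


def pack(cfg):
    """Renumber configurations so that no empty configuration remains."""
    renumber = {old: new for new, old in enumerate(sorted(set(cfg.values())))}
    return {n: renumber[c] for n, c in cfg.items()}


parents = {"k": ["a"], "d": ["k"]}
cfg = {"a": 0, "b": 1, "k": 2, "d": 3}
if can_move_bwd(cfg, parents, "k", capacity=2):
    cfg["k"] -= 1        # configuration 2 becomes empty
    cfg = pack(cfg)      # configuration 3 is renumbered to 2
print(cfg)
```

After the move and packing, node d follows directly after the configuration that now contains both b and k.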
Storage Resource Assignment
In this section, we define the storage resource assignment.
The data "data1" and "data2" illustrated in Figure 7 must be saved in storage resources, because they cannot otherwise be kept during the reconfiguration. After configuration assignment, the storage resource assignment algorithm therefore assigns these inter-configuration data to storage resources according to the following policy:
- Data are assigned to the highest-priority resource that has free space.
- The first-priority resource is rPE, the second is prPE, the third is internal memory, and the last is external memory.
Figure 8 shows an example of storage resource assignment, in which "data1" and "data2" are stored in external memory and internal memory, respectively, at reconfiguration. Note that memory accesses for reading and writing are needed at reconfiguration. When assigned data are no longer needed, their space can be overwritten with other data.
Since the number of inter-configuration data changes when the MOVE operation changes the configuration assignment, the storage resource assignment is recalculated after each MOVE operation.
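The priority policy can be sketched as a greedy assignment in Python. The sketch treats each resource class as a pool with a number of free data slots and treats external memory as unlimited. It ignores data lifetimes (freeing and reusing slots), and the function name and the capacity dictionary are hypothetical.

```python
def assign_storage(data, capacity):
    """Assign each inter-configuration datum to the highest-priority free resource.

    data     : iterable of data names, in the order they are assigned
    capacity : free slots per resource class, e.g. {"rPE": 1, "prPE": 0, "internal": 2}
    """
    priority = ["rPE", "prPE", "internal"]      # external memory is the fallback
    free = dict(capacity)
    assignment = {}
    for d in data:
        for resource in priority:
            if free.get(resource, 0) > 0:
                free[resource] -= 1
                assignment[d] = resource
                break
        else:
            assignment[d] = "external"          # large enough to hold any data
    return assignment


print(assign_storage(["d1", "d2", "d3", "d4"], {"rPE": 1, "prPE": 0, "internal": 2}))
```

Data that fall through to external memory incur the slowest accesses, so the cost of a partition depends on how many inter-configuration data it creates.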
Experiment
To demonstrate the efficiency of the proposed method, we show two experimental results: an evaluation of the task partitioning algorithm for the PRP-model, and a reconfigurable architecture exploration. The experimental environment is a Pentium D 2.8 GHz CPU with 2 GB of memory running Fedora 4. The Simulated Annealing (SA) parameters were as follows: the initial temperature was 10, the final temperature was 0.01, and the cooling ratio was 0.98. Solutions obtained by SA depend on the random seed. Thus, we evaluate the average execution cycles and CPU times of 10 solutions obtained with different random seeds.
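For orientation, the following Python sketch shows a generic SA loop with the temperature schedule stated above. The acceptance rule is the standard Metropolis criterion and one move is tried per temperature step; both are assumptions, since the text does not specify them, and the toy cost function merely stands in for the execution-cycle evaluation.

```python
import math
import random


def anneal(initial, neighbor, cost, t_init=10.0, t_final=0.01, cooling=0.98, seed=0):
    """Generic simulated annealing; returns (best, best_cost, temperature_steps)."""
    rng = random.Random(seed)
    current = best = initial
    cur_cost = best_cost = cost(initial)
    t = t_init
    steps = 0
    while t > t_final:
        cand = neighbor(current, rng)
        c = cost(cand)
        # Metropolis criterion (assumed): always accept improvements,
        # accept a worse solution with probability exp(-delta / t).
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        t *= cooling
        steps += 1
    return best, best_cost, steps


# Toy problem: minimize x^2 over the integers by stepping +1 or -1.
best, best_cost, steps = anneal(20, lambda x, rng: x + rng.choice((-1, 1)), lambda x: x * x)
print(best, best_cost, steps)
```

With an initial temperature of 10, a final temperature of 0.01, and a cooling ratio of 0.98, the schedule runs for 342 temperature steps.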
Evaluation of Task Partitioning Algorithm
To demonstrate the efficiency of the proposed algorithm, we applied it to two applications: DCT8 and CHENG. DCT8 is a one-dimensional eight-point discrete cosine transform (DCT) implemented completely in parallel. CHENG is Cheng's DCT algorithm, also implemented completely in parallel. The port expansion DFG of DCT8 has 50 nodes, and that of CHENG has 13 nodes. We evaluate the proposed task partitioning algorithm in terms of solution quality and CPU time. We evaluated 8 architectures with various values of $n_{pPE}$, $n_{prPE}$, and $n_{config}$. The other architecture parameters were fixed as described in Table 1. Table 2 compares the execution cycles for the target application CHENG. "Proposed" in Table 2 is the execution cycles obtained by the proposed task partitioning algorithm, and "Opt." is the optimal execution cycles obtained by a branch and bound strategy. For all architectures, the proposed task partitioning algorithm obtained the same execution cycles as the optimal solutions. Note that the execution cycles in Table 2 decreased when the number of PEs changed from four to eight, because the more operations can be executed simultaneously, the fewer reconfigurations are needed. Using prPEs, which have internal register files, decreases the execution cycles more than using pPEs: the reconfiguration overhead decreases because prPEs offer register files, which pPEs lack, to the storage resource assignment. Next, we compared CPU time under the same conditions for the target applications CHENG and DCT8. Table 3 shows the CPU time comparison. "Proposed" is the CPU time of the proposed algorithm, and "Opt." is the CPU time needed to obtain an optimal solution by branch and bound. Table 3 shows that the proposed algorithm obtains CHENG's solution in 30.7 seconds in the worst case, whereas the optimal solution is obtained in about 57 minutes (3391 sec.).
Furthermore, when DCT8 is the target application, the optimal solution cannot be obtained in practical time, whereas the proposed algorithm obtains solutions in about 150 sec. in the worst case. Thus, the proposed algorithm can obtain solutions for various architectures in practical time. Table 4 shows the CPU time for more complex port expansion DFGs under the same conditions. The number of nodes N of the port expansion DFG is 100, 300, or 500, and the number of pPEs $n_{pPE}$ is 8, 64, or 256. In Table 4, the worst search time is about 53 minutes (3160.3 sec.). Thus, the proposed algorithm can obtain a solution for complex inputs in practical time with various architecture parameters.
From the perspective of solution quality, however, the proposed algorithm cannot always obtain a good solution. In the case of N = 500 and $n_{pPE}$ = 256, the number of configurations results in four, and this solution is not feasible. An analysis of this result shows that some configurations can be merged, and that the execution cycles can be drastically reduced by decreasing the number of configurations. The same phenomenon occurs when $n_{pPE}$ = 256. The reason the proposed algorithm cannot obtain a good solution is the discontinuity of the solution space, which prevents an efficient search. To handle larger problems with more than 128 PEs, the proposed algorithm should be modified.
Reconfigurable Architecture Exploration
In this section, we demonstrate reconfigurable architecture exploration using the PRP-model while varying the number of PEs and the configuration memory. The target application is the sample DFG with 500 nodes used in the previous experiment, and the experimental environment is the same as before. The experimental results above show that the proposed algorithm cannot always obtain a good solution with more than 128 PEs. Therefore, we explore architectures with 16 to 128 PEs. Table 5 shows the fixed parameters of the explored reconfigurable architectures, and Table 6 shows the specifications of the configuration memories. We assume that the reconfigurable architecture complexity parameter Scale in Eq.(2) equals 128. The resulting $n_{config}$ and $t^{cfg}_r$ are shown in Table 7. Table 7 shows that $n_{config}$ is halved when the number of PEs is doubled. Figure 9 shows the execution cycles of each architecture with configuration memories A, B, and C when the number of prPEs is zero. In Figure 9, as the number of PEs increases, the execution cycles monotonically increase with configuration memory A and monotonically decrease with configuration memory C. With configuration memory B, on the other hand, the execution cycles gradually decrease as the number of PEs increases, but at 128 PEs they rise above the execution cycles at 16 PEs. Figure 10 shows the execution cycles of each architecture with configuration memories D, E, and F. Again, as the number of PEs increases, the execution cycles monotonically increase with configuration memory D and monotonically decrease with configuration memory F. The rate of increase is lower than in Figure 9.
At the same time, the execution cycles always decrease as the number of PEs increases with configuration memory E, whose total number of bits equals that of configuration memory B.
When the total number of bits of the configuration memory is the same, its bit width affects the increase in execution cycles. A wider configuration memory shortens the configuration data read, i.e. the overhead of configuration data read decreases. For less overhead, designers should choose a configuration memory with as wide a bit width as possible. Table 8 shows the execution cycles of each architecture and the ratio of waiting cycles to execution cycles. Where the execution cycles increase, waiting cycles account for a large percentage of the execution cycles. Where they decrease, on the other hand, the ratio of waiting cycles is always zero. The results above were obtained without prPEs. Table 9 shows the execution cycles and the ratio of waiting cycles when only prPEs are used. The results in Table 9 show that the execution cycles are 150-300 cycles fewer than in Table 8 when the ratio of waiting cycles is zero. However, the execution cycles are almost the same as in Table 8 when the ratio of waiting cycles is not zero. Because the configuration memory is limited, $n_{config}$ monotonically decreases as the number of PEs increases. In contrast, $t^{cfg}_r$ monotonically increases with the number of PEs, because we assume that the configuration data grow linearly with the number of PEs. Therefore, the overhead of configuration data read increases drastically with the number of PEs and critically affects the execution cycles.
The experimental results show that, when a reconfigurable architecture with a small configuration memory is used, the overhead of configuration data read is dominant and critically affects the execution cycles. To use the reconfigurable processor effectively, designers should carefully design the configuration memory and its parameters, i.e. $t_{config}$, $n_{config}$, and $t^{cfg}_r$. Next, we assume that the reconfigurable architecture complexity parameter Scale in Eq.(2) is decreased to 64, i.e. designers use a simpler reconfigurable architecture. Table 10 shows $n_{config}$ and $t^{cfg}_r$ for Scale = 64. Figure 11 shows the execution cycles of each architecture with Scale = 64 and configuration memories A, B, and C. In Figure 11, in contrast to Figure 9, the execution cycles always decrease as the number of PEs increases, because halving Scale halves the configuration data. Reducing the architecture complexity Scale is thus also effective in decreasing the overhead.
When performance is insufficient, designers often add more PEs to exploit richer hardware. However, for execution with run-time reconfiguration, adding PEs does not always improve performance because of the complex effect of the overhead. The best reconfigurable architecture that satisfies the requirements should be chosen by carefully considering the effect of the overhead. Carefully analyzing the details of the reconfigurable architecture is the first step toward effective execution with run-time reconfiguration.
Conclusion
In this paper, we proposed a dynamic reconfigurable architecture exploration method based on a Parameterized Reconfigurable Processor model (PRP-model) and a task partitioning optimization algorithm for that model. Using the proposed method, designers can quickly evaluate various reconfigurable architectures and easily find suitable ones by changing the PRP-model parameters. Future work includes modifying the task partitioning optimization algorithm to handle large architectures, establishing a reconfigurable architecture exploration that considers area and power, and developing a method for deciding the processing element functions under a constraint on the total configuration data.
