SUMMARY Multi-process execution in dynamically reconfigurable processors is a technique that enhances throughput by exploiting more of the inherent parallelism of applications. Basically, the total process of an application is divided into small processes, which are assigned to limited areas of a reconfigurable array and executed concurrently in a pipelined manner. To improve the efficiency of multi-process execution, a systematic method for mapping processes onto a reconfigurable array consisting of multiple hardware execution units is essential. This paper proposes and investigates a systematic method for mapping an application modeled as a Kahn Process Network onto a dynamically reconfigurable processing array. In order to execute streaming applications in a pipelined manner, the size of Tiles, which are the unit areas of a dynamically reconfigurable array, and the grouping of processes are adjusted. Using real applications such as DCT, a JPEG encoder and a Turbo encoder, the performance impact of different versions mapped onto the NEC Dynamically Reconfigurable Processor is evaluated. Evaluation results show that the proposed mapping algorithm achieves the best performance in terms of throughput and execution time.
Introduction
To date, a large body of research in the area of reconfigurable computing has resulted in a number of academic and commercial Dynamically Reconfigurable Processing Arrays (DRPAs). These devices play an important role in balancing high performance demands and low power consumption, especially in embedded devices. One of the trends in developing reconfigurable devices is dynamic reconfigurability based on a multi-context mechanism, as in DRP, DAPDNA-2, FE-GA and ADRES, in order to minimize the reconfiguration overhead and greatly improve the performance of reconfigurable systems. The datapath mapped to a piece of physical hardware is called a context. A target application is divided into a set of different contexts, and a multi-context DRPA executes them by changing contexts with each clock cycle. Basically, with such multi-context DRPAs, an application is designed and mapped into hardware as a single thread of control. At any time, only one required context is activated and executed. In order to increase the throughput, techniques such as software pipelining and loop unrolling can be applied to exploit more parallelism.
Target Application Model
We assume that a target application can be represented as a Kahn Process Network (KPN), which is similar to a model proposed for streaming processors [14]. In this model, a total job is represented by multiple processes that can be executed in a pipelined manner. That is, data streams arrive continuously at a certain interval, and the results of a process are transferred to adjacent processes. A KPN has the following characteristics:
For example, the upper part of Fig. 3 shows the KPN graph of a JPEG encoder. In this case, the graph has a linear structure. Media processing programs can easily be translated into KPNs [14], and here we assume that the KPN corresponding to a target application has already been formed.
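The pipelined process model described above can be illustrated with a minimal Python sketch. It assumes a linear chain in which each process performs a blocking read on its input FIFO and writes to its output FIFO. The names `run_linear_kpn`, `process` and `_END`, and the placeholder stage functions, are hypothetical and do not come from the paper.

```python
import queue
import threading

_END = object()  # end-of-stream marker (an assumption of this sketch)

def process(func, fifo_in, fifo_out):
    """One KPN process: blocking read, compute, write to the output FIFO."""
    while True:
        token = fifo_in.get()      # blocks while the input FIFO is empty
        if token is _END:
            fifo_out.put(_END)     # forward end-of-stream and stop
            return
        fifo_out.put(func(token))

def run_linear_kpn(stages, stream):
    """Run a linear chain of processes over a finite input stream."""
    fifos = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=process, args=(f, fifos[i], fifos[i + 1]))
        for i, f in enumerate(stages)
    ]
    for w in workers:
        w.start()
    for token in stream:
        fifos[0].put(token)
    fifos[0].put(_END)
    results = []
    while True:
        token = fifos[-1].get()    # blocks until the last process produces data
        if token is _END:
            break
        results.append(token)
    for w in workers:
        w.join()
    return results

# Placeholder stage bodies standing in for real kernels
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_linear_kpn(stages, range(5)))  # [-1, 1, 3, 5, 7]
```

Because every process only communicates through its FIFOs, the result does not depend on how the threads are scheduled, which is the property that makes the model suitable for independent pipeline stages.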
Goal of Mapping
Each process of a target KPN (pi) can be mapped into a Tile Group TGj of the target DRPA and executed in a pipelined manner. The lower part of Fig. 3 shows an example of mapping for a JPEG encoder, in which the processes mapped into each Tile Group get a data stream from the input FIFO, execute their own computation, and write results to the output FIFO. If there is no data in the input FIFO or the output FIFO is full, the process execution is stalled. The data stream is assumed to arrive at a certain interval corresponding to the total throughput of the DRPA. In order to improve the total throughput and the execution time of an application, it is critical to balance each computation stage against the other processes in the pipelined chain. In a pipelined processing model, the total throughput is limited by the most time-consuming process. Here, by increasing the size of a TGj, the throughput can be enhanced through parallel processing with a larger number of processing elements. If, for example, the process DCT is the bottleneck of the JPEG encoder in Fig. 3, the total throughput can be improved by mapping DCT into a larger TG, because the number of execution clock cycles of the bottleneck process (DCT) could be reduced.
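The effect of the bottleneck stage can be sketched as follows. All cycle counts are hypothetical values chosen for illustration, and the stage names `p1` and `p3` are placeholders; only DCT is named in the text.

```python
def pipeline_interval(stage_cycles):
    """Cycles between consecutive results: set by the slowest stage."""
    return max(stage_cycles.values())

def bottleneck(stage_cycles):
    """Name of the stage with the largest cycle count."""
    return max(stage_cycles, key=stage_cycles.get)

# Hypothetical cycle counts per data block, one Tile per stage
cycles = {"p1": 40, "DCT": 120, "p3": 50}

# Hypothetical cycle counts of DCT for different Tile Group sizes
dct_cycles_by_tg_size = {1: 120, 2: 64, 4: 36}

print(bottleneck(cycles), pipeline_interval(cycles))      # DCT 120

# Map DCT into a 4-Tile group: the bottleneck moves to another stage
enlarged = dict(cycles, DCT=dct_cycles_by_tg_size[4])
print(bottleneck(enlarged), pipeline_interval(enlarged))  # p3 50
```

In this example, enlarging the Tile Group of DCT beyond the point where another stage becomes the bottleneck brings no further gain, which is why the stages need to be balanced against each other.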
The goal of mapping is to find the best combination of processes and Tile Groups in order to improve the throughput and to reduce the execution time of each pipeline stage while respecting the system constraints: (1) the total number of Tiles used in all TGj must be smaller than or equal to the number of Tiles supported in the target DRPA, and (2) the sum of contexts required for the processes mapped into a TG must be smaller than or equal to the number of contexts supported in the target DRPA.
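The two constraints can be written as a small feasibility check. This is a sketch only: the `TileGroup` class, the function name and the numeric limits in the example are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TileGroup:
    tiles: int       # number of Tiles in this group
    contexts: list   # contexts required by each process mapped to the group

def is_feasible(tile_groups, total_tiles, max_contexts):
    """Check constraint (1) on Tiles and constraint (2) on contexts."""
    tiles_ok = sum(tg.tiles for tg in tile_groups) <= total_tiles
    contexts_ok = all(sum(tg.contexts) <= max_contexts for tg in tile_groups)
    return tiles_ok and contexts_ok

# Hypothetical mapping: three groups, the second one holding two processes
mapping = [TileGroup(2, [3]), TileGroup(4, [5, 6]), TileGroup(2, [4])]

# Limits below are assumed values for the example
print(is_feasible(mapping, total_tiles=8, max_contexts=16))  # True
print(is_feasible(mapping, total_tiles=8, max_contexts=10))  # False
```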
Mapping Algorithm

Target Process Graphs
In this research, target process graphs are limited to simple unidirectional linear graphs with a fork-join structure. As shown in Fig. 4, a process can send a data stream to multiple processes (fork), and the results are gathered in the next process (join). Each process is connected through a FIFO and can work independently. Stream data arrive at the starting process at a certain interval, and all processes can be executed in a pipelined manner. Although complicated graphs cannot be represented because of the above limitations, most graphs in stream processing are rather simple and fall within this limitation.
Here, a process number pi is assigned to each process, from the starting process to the terminal one. Parallel execution processes (processes 2, 3, and 4 in Fig. 4) can be numbered in any order.
Fig. 4 The target process graph.
(1) All possible mapping exploration
Since the number of Tiles in a system is limited to a small number (for example, eight in DRP, as seen in Table 1), the best topological mapping can be chosen in a reasonable amount of time by searching the complete solution space and retrieving all possible mapping variants.
In order to limit the search space, we decide the best allocation of Tiles for each list (Size(TG0), Size(TG1),...,Size(TGk)), which consists of the sizes of the TGs after adjustment, where k denotes the number of used TGs. There are multiple possibilities of Tile assignment for each list. For example, the allocations in Fig. 7 all correspond to the list (2,1,1,2,2), since an arbitrary combination of tiles is allowed in DRP-1. However, allocating a TG to separated tiles introduces long wires that connect distant tiles and degrades the operational frequency. Moreover, communication between TGs is done through FIFOs allocated at the edges of a tile, so neighboring TGs that need to communicate should be mapped into neighboring tiles. Considering these points, we selected beforehand only one allocation among all possible patterns for each list. For example, Fig. 7 (a) was selected for the combination (2,1,1,2,2). It is called a prepared pattern in this paper. Figure 8 shows an example of the list (2,1,1,4) and another example of (2,1,1,2,2). In this method, a pattern is pre-assigned for every branch of the list. Clearly, this method causes an explosion in the number of possible patterns if the number of Tiles becomes large. This mapping problem has been proved to be NP-complete, and finding an optimal solution by exhaustive search requires exponential time. However, for many up-to-date DRPAs of realistic size, the number of patterns is reasonable.
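The enumeration of size lists can be sketched in Python as follows. The sketch assumes a linear chain of Tile Groups and a hypothetical table `cycles[i][s]` giving the cycle count of Tile Group i when it occupies s Tiles. It selects the list with the shortest pipeline interval and does not model the prepared patterns of Fig. 7 and Fig. 8.

```python
from itertools import product

def size_lists(num_groups, total_tiles):
    """Yield every list (Size(TG0), ..., Size(TGk)) that fits in the array."""
    for sizes in product(range(1, total_tiles + 1), repeat=num_groups):
        if sum(sizes) <= total_tiles:
            yield sizes

def best_size_list(cycles, total_tiles):
    """Exhaustively pick the size list with the smallest bottleneck stage.

    cycles[i][s] is a hypothetical cycle count of Tile Group i on s Tiles.
    Returns (interval, sizes), or None if no list is feasible.
    """
    best = None
    for sizes in size_lists(len(cycles), total_tiles):
        if any(s not in cycles[i] for i, s in enumerate(sizes)):
            continue  # no implementation available for this size
        interval = max(cycles[i][s] for i, s in enumerate(sizes))
        if best is None or interval < best[0]:
            best = (interval, sizes)
    return best

# Hypothetical cycle counts for three Tile Groups
cycles = [{1: 40, 2: 30}, {1: 120, 2: 64, 4: 36}, {1: 50, 2: 28}]
print(best_size_list(cycles, total_tiles=8))  # (36, (2, 4, 2))
```

The number of candidate lists grows quickly with the number of Tiles and Tile Groups, which matches the observation that this approach is practical only for arrays of modest size such as the eight-Tile DRP.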
(2) Dynamic programming approach
In the possible solution space, the same sequence of physical tiles for a specific sequence of TGj often appears as a part of many mapping solutions. Instead of searching the whole solution space, dynamic programming can be used to find the optimal topological mapping for the complete sequence of TGj from the solutions for smaller subsequences. Once an optimal solution for mapping up to TGi is determined, the execution time for executing up to TG(i+1) can be determined. This step is applied recursively to compute the final optimal mapping. Given a sequence of Tile Groups (Size(TG0), Size(TG1),...,Size(TGk)), the minimum execution time for executing up to process i in TGj, denoted E_{ij}, can be computed using the following recursive expression.
E_{ij} = t_{ij} + \min(E_{i-1})
In the expression, t_{ij} is the execution time of process i in the physical TGj, and \min(E_{i-1}) is the total of the minimum execution times of the processes up to process (i-1). The expression shows that all possible ways of mapping process i are examined once the mapping of process (i-1) has been completed.
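One way to read the recurrence is sketched below in Python. The state used here, the number of Tiles consumed so far, is an assumption of this sketch and is not stated in the text. The table `t[i][s]` of execution times of process i on s Tiles is hypothetical. Following the additive form of the expression, the sketch minimizes the summed execution time.

```python
import math

def dp_min_time(t, total_tiles):
    """Minimize the summed execution time over Tile Group sizes.

    t[i][s] is a hypothetical execution time of process i on s Tiles.
    E[i][u] is the minimum time for processes 0..i using exactly u Tiles.
    Returns (minimum time, chosen sizes), or (inf, None) if infeasible.
    """
    n = len(t)
    E = [[math.inf] * (total_tiles + 1) for _ in range(n)]
    choice = [[None] * (total_tiles + 1) for _ in range(n)]
    for s, time in t[0].items():
        if s <= total_tiles:
            E[0][s] = time
            choice[0][s] = s
    for i in range(1, n):
        for u in range(total_tiles + 1):
            for s, time in t[i].items():
                # E_i = t_i + min(E_{i-1}) over the remaining Tile budget
                if s <= u and E[i - 1][u - s] + time < E[i][u]:
                    E[i][u] = E[i - 1][u - s] + time
                    choice[i][u] = s
    u = min(range(total_tiles + 1), key=lambda b: E[n - 1][b])
    best = E[n - 1][u]
    if best == math.inf:
        return math.inf, None
    sizes = []
    for i in range(n - 1, -1, -1):  # walk back through the stored choices
        s = choice[i][u]
        sizes.append(s)
        u -= s
    return best, tuple(reversed(sizes))

# Hypothetical execution times for three processes
t = [{1: 40, 2: 30}, {1: 120, 2: 64, 4: 36}, {1: 50, 2: 28}]
print(dp_min_time(t, total_tiles=8))  # (94, (2, 4, 2))
print(dp_min_time(t, total_tiles=5))  # (132, (1, 2, 2))
```

Each subproblem is solved once and reused, so the work grows with the product of the number of processes, the number of Tiles and the number of candidate sizes rather than with the number of complete mappings.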
Target Device: DRP
DRP Architecture
Here, for evaluation, we used a real DRPA named DRP-1. It is a coarse-grain dynamically reconfigurable processor core released by NEC Electronics in 2002 [2]. It carries on-chip configuration data corresponding to multiple contexts, and it dynamically reschedules them to realize multiple functions with one chip.
The primitive unit of the DRP core is called a tile, and a DRP core consists of an arbitrary number of tiles. The number of tiles can be expanded horizontally and vertically. The primitive modules of a tile are processing elements (PEs), a State Transition Controller (STC), 2-ported Vertical Memories (VMEMs), and 1-ported Horizontal Memories (HMEMs). The structure of a tile is shown in Fig. 9. There are 64 PEs located in one tile. The architecture of a
• A process can be mapped to a Tile Group formed in any shape (like the 4-tile example in Fig. 6).
the stage with the largest execution time. Table 2 shows that all implementations in multi-process execution improve the throughput to a certain degree over single-process execution, up to nearly three times in the JPEG implementation. In other words, by representing an application as a process network whose processes are mapped onto a DRPA as separate threads of control, the throughput can be improved. As the size of an input data block is the same, the throughput can be improved by reducing either the critical path or the number of execution clock cycles. In multi-process execution, the critical path is usually shorter than that in single-process execution. While a target application is mapped onto the whole reconfigurable array in single-process execution, in multi-process execution the critical path of an application is the longest one among its child processes, each of which is often mapped to one or several tiles rather than the whole reconfigurable array. For example, in DCT, the single-process implementation requires 8 tiles of NEC's DRP, but in multi-process execution the largest process is mapped to only 5 tiles (version DCT1 in Table 2).
Moreover, the calculation of the throughput in multi-process execution takes into account the largest number of execution clock cycles among the processes. By dividing an application into independent processes, that number is often smaller than that of one big process in single-process execution, although the total number of clock cycles of all processes is often larger due to process overhead. Since both the critical path and the number of execution clock cycles can be reduced in multi-process execution, the throughput is likely to be improved.
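The relation between throughput, execution clock cycles and critical path can be sketched as follows. The sketch assumes that the clock period equals the critical path delay and that one data block leaves the pipeline every `max_cycles` clock cycles. All numbers are hypothetical and are not taken from Table 2.

```python
def throughput(block_size, max_cycles, critical_path_ns):
    """Data items per second for one block every max_cycles clock cycles.

    Assumes the clock period equals the critical path delay (in ns).
    """
    seconds_per_block = max_cycles * critical_path_ns * 1e-9
    return block_size / seconds_per_block

# Hypothetical single-process implementation: more cycles, longer critical path
single = throughput(block_size=64, max_cycles=200, critical_path_ns=30.0)

# Hypothetical multi-process implementation: the slowest process sets max_cycles
multi = throughput(block_size=64, max_cycles=120, critical_path_ns=20.0)

print(f"speedup: {multi / single:.2f}x")
```

With the same block size, the ratio of the two throughputs depends only on the product of cycle count and critical path delay, so a reduction in either factor raises the throughput.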
The throughput can also be improved by taking advantage of pipelining, where multiple processes are arranged to operate in a pipelined manner. In most cases, the output of a computation step is the input of the next step with no data or control hazard, so the applications are suitable for pipelining.
Among the implementations in multi-process execution mode, the one produced by our mapping algorithm achieves the best throughput. According to Hasegawa [1], each implementation has an optimum context size (the number of tiles) at which the performance becomes optimum. Other context sizes, whether smaller or larger than the optimum one, result in performance degradation. The implementation produced by the mapping algorithm is likely the one in which the constituent processes are mapped with their optimum context sizes; therefore, the optimum throughput can be achieved. More importantly, in a pipelined environment, the throughput is greatly influenced by the balance of the computation stages. The proposed mapping algorithm produces the most appropriate result in terms of process workload balancing. For example, balancing the two processes in DCT, row-direction computation and column-direction computation, is the most important factor for the throughput, since they occupy the largest part of the total execution time.
Although the proposed mapping algorithm could improve the throughput to a certain degree, the main limitation
and Viterbi) and applications modeled with five processes (JPEG, Turbo and MPEG) on the same target architecture is considerably different when the method of all possible mapping exploration is applied. Table 3 also shows that the time for topological mapping depends on the number of processes into which each application is divided, since applications with the same number of processes take almost the same time. Moreover, the number of tiles on the target DRPA influences the mapping time as well.
Conclusion
A systematic method for mapping an application modeled as a KPN onto a dynamically reconfigurable processing array is proposed. Using real applications and a real target architecture, DRP-1, the impact of the proposed method on performance and area utilization is evaluated and analyzed. Evaluation results show that the throughput of multi-process execution is two to three times that of single-process execution, while higher area utilization is achieved as a result of processes being executed in parallel. In addition, the proposed mapping method results in the best throughput and execution time.
