Abstract-Multi-process execution on dynamically reconfigurable processors is a technique that enhances throughput by exploiting more of the inherent parallelism of applications. To improve the efficiency of multi-process execution, this paper proposes a systematic method for mapping an application modeled as a Kahn Process Network (KPN) onto a dynamically reconfigurable processing array. Using real applications, we evaluate the performance impact of different mapping variants on the Dynamically Reconfigurable Processor (DRP). The evaluation results show that the proposed mapping algorithm achieves the best performance in terms of both throughput and execution time.
I. INTRODUCTION AND RELATED WORK
For Dynamically Reconfigurable Processing Arrays (DRPAs), the achievable performance improvement depends on the inherent parallelism of the target application. In many cases, the parallelism of an application is smaller than the number of processing elements (PEs) available in the array for a context. Consequently, a large number of PEs are not used efficiently. One way to address this is to make the best use of stream-level pipelined execution by dividing the total process into independent processes and executing them in a pipelined manner.
Besides single-process execution, where an application is implemented as a single thread of control, some devices also support "multi-process execution", which allows multiple threads of control to run concurrently. An application is divided into several processes; a large reconfigurable array is partitioned into small arrays called Tiles; and each process is assigned to different Tiles so that the processes execute in parallel. An interprocess communication mechanism using internal memories is defined. Although multi-process execution may offer an effective way of partitioning applications to improve throughput and energy efficiency, methods for efficiently mapping processes onto Tiles have not been well researched. Here, we propose a systematic mapping method and show examples on DRP-1.
Many research efforts have addressed models of computation such as Synchronous Dataflow (SDF) [1], Dataflow Process Networks [2], and Kahn Process Networks [3]. Partitioning applications is a well-known technique. HW/SW partitioning specifies which parts of an application should be mapped to hardware or software components [4]. Partitioning applications between reconfigurable hardware blocks of different granularity tries to map parts of an application onto fine-grain reconfigurable units and coarse-grain arrays [5]. In research on job mapping onto partitioned areas of FPGAs with partial reconfiguration capabilities, the area, the execution time, and the scheduling algorithm decide the order and the place in which multiple jobs are mapped and started. Since a three-dimensional (x, y, and temporal) placement problem must be solved for efficient placement, a number of theoretical studies have been done [8] [9]. However, our target problem differs in the following ways, which prevent the previous results from being applied directly.
1) In the multi-process execution treated here, each process is a part of a single job and works in a pipelined manner. Thus, the optimization target is the throughput of a single job consisting of multiple processes.
2) With multi-context execution, a Tile can execute multiple processes sequentially as long as the number of contexts is sufficient.
3) Multiple Tiles can be assigned to a single process to speed up its execution with a larger number of PEs.
II. THE MAPPING ALGORITHM

A. Target process graphs
The target process graphs treated here are limited to unidirectional linear graphs with fork-join structures. As shown in Fig. 1, a process can send data streams to multiple processes (fork), and the results are gathered in the next process (join). The processes are connected by FIFOs and can work independently. Stream data arrives at the starting process at a certain interval, and all processes can be executed in a pipelined manner. Although complicated graphs cannot be represented in this form, most graphs in stream processing are rather simple and fall within this limitation. In Fig. 1, a process number Pi is assigned to each process, from the starting process to the terminal process. Processes that execute in parallel (processes 2, 3, and 4) can be numbered in any order among themselves. The process numbers are then mapped onto the underlying DRPA, whose Tiles are partitioned into Tile Groups. We assume that a target application can be represented as a Kahn Process Network (KPN). In this model, a total job is represented by multiple processes that can be executed in a pipelined manner. For example, the graph corresponding to the JPEG coder is shown in Fig. 2. Programs for typical media processing can easily be translated into simple KPNs [10].
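The restricted graph form described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stage-list representation, the function name `number_processes`, the process names, and the choice to start numbering at 1 are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

def number_processes(stages: List[List[str]]) -> Tuple[Dict[str, int], List[Tuple[int, int]]]:
    """Number the processes of a linear fork-join graph and list its FIFOs.

    `stages` is a hypothetical representation: one list of process names
    per pipeline stage. A stage with several names is a fork region whose
    results are joined by the single process of the next stage.
    """
    # Reject shapes outside the linear fork-join model: two adjacent
    # multi-process stages would need a more general graph.
    for a, b in zip(stages, stages[1:]):
        if len(a) > 1 and len(b) > 1:
            raise ValueError("adjacent parallel stages are not fork-join")

    numbers: Dict[str, int] = {}
    next_id = 1  # assumption: numbering starts at 1
    for stage in stages:
        for name in stage:  # order inside a parallel stage is arbitrary
            numbers[name] = next_id
            next_id += 1

    # One FIFO per producer/consumer pair between neighbouring stages.
    fifos = [(numbers[s], numbers[d])
             for src, dst in zip(stages, stages[1:])
             for s in src for d in dst]
    return numbers, fifos

# Hypothetical graph: one source, three parallel processes, a join, a sink.
numbers, fifos = number_processes([["src"], ["a", "b", "c"], ["join"], ["sink"]])
```

With this hypothetical input, the three parallel processes receive the numbers 2, 3, and 4, and swapping their order inside the stage changes only which of them gets which number.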
C. Process mapping
Each process Pi of the target KPN can be mapped onto a Tile Group TGj of the target DRPA and executed in a pipelined manner. Fig. 2 shows an example of such a mapping. The goal of mapping is to find the combination of processes and Tile Groups that maximizes the throughput and reduces the execution time of each pipeline stage while respecting the system limitations: (1) the total number of Tiles used in TGj must be smaller than or equal to the number of Tiles supported in the DRPA, and (2)
3) Topological mapping: With the previously described steps, processes are mapped onto TGj, and the last step is to fit TGj into the physical shape of the M x N structure. Since the number of Tiles in a system is limited to a small number (for example, eight in DRP-1), the best topological mapping can be selected in a reasonable amount of time by searching the complete solution space and taking all possible mappings into account. The list (Size(TG0), Size(TG1), ..., Size(TGn)) is matched against prepared patterns to find a possible Tile allocation. Fig. 3 shows examples for the lists (2,1,1,4) and (2,1,1,2,2). In this method, a pattern is pre-assigned for every branch of the list. Clearly, the number of possible patterns explodes as the number of Tiles becomes large. However, for many current DRPAs of realistic size, the number of patterns is reasonable.
Fig. 3. Example of topological mapping
III. TARGET DEVICE: DRP
Here, a real DRPA, DRP-1, is used for evaluation. It is a coarse-grain dynamically reconfigurable processor core released by NEC Electronics [6]. It holds configuration data corresponding to multiple contexts, which are dynamically scheduled to realize multiple functions on one chip.
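For the topological mapping step of Section II on an eight-Tile device such as DRP-1, the exhaustive search over the solution space can be sketched as follows. This is not the pattern-matching procedure of the paper: it swaps in a plain backtracking search, it returns the first feasible placement rather than the best one, and it assumes a 2 x 4 Tile grid and that each Tile Group must be a 4-connected region. The grid shape, the connectivity rule, and the names `is_connected` and `place_groups` are all assumptions made for illustration.

```python
from itertools import combinations

def is_connected(cells):
    """True if the given grid cells form one 4-connected region."""
    cells = set(cells)
    start = next(iter(cells))
    seen, stack = {start}, [start]
    while stack:
        r, c = stack.pop()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == cells

def place_groups(sizes, rows=2, cols=4):
    """Find one placement of Tile Groups with the given positive sizes.

    Returns a list with one tuple of (row, col) cells per group, or None
    if no placement exists. The 2 x 4 default grid is an assumption.
    """
    all_cells = frozenset((r, c) for r in range(rows) for c in range(cols))
    if sum(sizes) > len(all_cells):  # constraint (1): not more Tiles than exist
        return None

    def search(i, free):
        if i == len(sizes):
            return []
        # Try every subset of free Tiles of the required size.
        for cand in combinations(sorted(free), sizes[i]):
            if is_connected(cand):
                rest = search(i + 1, free - set(cand))
                if rest is not None:
                    return [cand] + rest
        return None

    return search(0, all_cells)

placement = place_groups([2, 1, 1, 4])
```

With only eight Tiles the search space is tiny, which matches the observation that complete enumeration is affordable at this scale; the same code would slow down quickly on larger arrays.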
The primitive unit of the DRP core is called a tile. The primitive modules of a tile are processing elements (PEs), a State Transition Controller (STC), 2-ported memories (VMEM), and 1-ported memories (HMEM). The structure of a tile is shown in Fig. 4. Each tile contains 64 PEs. The STC is a programmable sequencer in which any finite state machine (FSM) can be stored. The STC has 64 states, each of which is associated with an instruction pointer.
As shown in Fig. 5
An application can be mapped onto the 8-tile architecture of DRP-1 with the proposed mapping algorithm once it has been partitioned and modeled as a KPN, subject to the following constraints: at most eight tiles can be allocated to an application; separate processes must be mapped to different tiles; and a process can be mapped to a Tile Group of any shape. An example mapping of the JPEG encoder is shown in Fig. 2.
IV. EVALUATION
A. Target applications
Table I presents the target applications and the evaluation results. "#" denotes the number of processes into which each application is partitioned. "Variant" gives the name of each mapping variant: Single is the result of single-process execution, and A is the variant generated by the proposed mapping algorithm. "Mapping" shows the tile structure assigned to the processes of each variant. For example, the A version of DCT contains 1, 3, 3, 1, which means that DCT is modeled with four processes that are mapped to groups of 1, 3, 3, and 1 tile(s), respectively. Theoretically, any mapping is possible, but some variants either cannot be implemented or cannot be synthesized in the place-and-route phase. The columns "PEs/context" and "Memories/context" show the average number of PEs required and memories used in a context.
B. Throughput
Table I shows that almost all multi-process implementations improve the throughput to some degree over single-process execution, by up to nearly three times for JPEG. In other words, representing an application as a process network whose processes are mapped onto a DRPA as separate threads of control can improve the throughput.
Among the multi-process implementations, the one produced by our mapping algorithm achieves the best throughput. According to Hasegawa [7], each implementation has an optimum context size at which the throughput is maximized. The version produced by the mapping algorithm is likely the one in which the constituent processes are mapped with their optimum context sizes. Moreover, in a pipelined environment, the throughput is greatly influenced by the balance of the computation stages. The proposed mapping algorithm produces the most appropriate result in terms of process workload balancing.
Although the proposed mapping algorithm improves the throughput to a certain degree, its main limitation is the number of available contexts. Since the number of required contexts easily exceeds 16, the possibilities for grouping processes are strictly limited. This is why the execution times of the processes are still unbalanced.
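The interaction between stage balance and the context limit can be illustrated with a small model. The sketch below assumes that processes sharing a Tile Group run one after another, so their cycle counts add up, and that the pipeline interval equals the time of the slowest Tile Group. The function name `evaluate`, the balance ratio, and all cycle and context counts are hypothetical; only the limit of 16 contexts comes from the text.

```python
MAX_CONTEXTS = 16  # context limit mentioned in the text

def evaluate(groups):
    """Evaluate one grouping of processes onto Tile Groups.

    `groups` is a list of Tile Groups; each group is a list of dicts with
    hypothetical "cycles" and "contexts" entries for its processes.
    Returns (interval, balance), or None if a group needs too many contexts.
    """
    for group in groups:
        if sum(p["contexts"] for p in group) > MAX_CONTEXTS:
            return None  # grouping does not fit into the available contexts
    # Processes on the same Tile Group execute sequentially.
    times = [sum(p["cycles"] for p in group) for group in groups]
    interval = max(times)  # the slowest stage sets the pipeline interval
    balance = sum(times) / (len(times) * interval)  # 1.0 means perfectly balanced
    return interval, balance

# Hypothetical numbers: merging two short processes balances the pipeline.
balanced = evaluate([
    [{"cycles": 100, "contexts": 6}],
    [{"cycles": 60, "contexts": 8}, {"cycles": 40, "contexts": 8}],
    [{"cycles": 50, "contexts": 4}],
])
# The same merge is rejected once the combined context demand exceeds 16.
rejected = evaluate([
    [{"cycles": 60, "contexts": 10}, {"cycles": 40, "contexts": 8}],
])
```

In this toy model, a grouping that would even out the stage times can still be infeasible because of its context demand, which is the effect described above.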
C. Execution time
The performance of an implementation can be expressed by its execution time for a given set of data. In single-process execution, the number of execution clock cycles increases linearly with the number of input data blocks. In multi-process execution, however, this is no longer true because of the effect of pipelined processing. This is illustrated in Fig. 6. As Table I shows, among the multi-process implementations, the one produced by the proposed mapping algorithm achieves the smallest execution time, since it has either the shortest critical path or the smallest number of execution clock cycles.
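The difference between the two execution modes can be made concrete with the textbook model of an ideal linear pipeline. The sketch assumes no stalls and no FIFO back-pressure, and the function names and cycle counts are hypothetical rather than measured values from Table I.

```python
def cycles_single(n_blocks, cycles_per_block):
    """Single-process execution: cost grows linearly with the block count."""
    return n_blocks * cycles_per_block

def cycles_pipelined(n_blocks, stage_cycles):
    """Ideal linear pipeline (assumption: no stalls).

    The first block passes through every stage; each further block
    completes one interval later, where the interval is the slowest stage.
    """
    if n_blocks == 0:
        return 0
    return sum(stage_cycles) + (n_blocks - 1) * max(stage_cycles)

# Hypothetical stage times for a three-stage pipeline.
stages = [100, 100, 50]
single = cycles_single(10, sum(stages))
pipelined = cycles_pipelined(10, stages)
```

For one block, both modes cost the same in this model; the gap opens as more blocks are streamed, and it widens further when the slowest stage is shortened.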
V. CONCLUSION
A systematic method for mapping applications modeled as KPNs onto a dynamically reconfigurable processing array has been proposed. Using real applications and a real architecture, DRP-1, the impact of the proposed method on performance was evaluated. The evaluation results show that multi-process execution increases the throughput by a factor of two to three compared with single-process execution, while achieving higher area utilization because processes are executed in parallel. In addition, the proposed mapping method yields the best throughput and execution time.
