In this paper, we propose a Dynamically Reconfigurable Processor Array (DRPA) generator that can generate various types of DRPAs. Our target DRPA architecture is fully parameterized: given a set of architectural parameters, the generator automatically produces an RTL model, a simulation environment, and finally a chip layout. Although the fundamental design of a processing element (PE) and of the inter-PE connection is fixed, the array size, the PE granularity, and the flexibility of intra- and inter-PE connections are selectable. We have generated various types of DRPAs and evaluated their area and speed using the ASPLA/STARC 90-nm CMOS technology. From the evaluation results, we analyze fundamental trade-offs between the architectural parameters and area/delay.
INTRODUCTION
In recent years, coarse-grained dynamically reconfigurable processor arrays (DRPAs) have received attention as flexible and efficient off-loading engines for various types of Systems-on-Chip (SoCs). Some devices are commercially available [1, 2, 3, 4], and some of them have been integrated into digital appliances [5].
To achieve better area and power efficiency than traditional field-programmable devices such as FPGAs, they incorporate the following properties: (1) a simple coarse-grained processor consisting of an ALU, a data manipulator, a register file, and other functional modules is used as the primitive processing element (PE) of the array, and (2) dynamic reconfiguration enables the PE array to perform time-multiplexed execution.
Unlike common FPGAs, which are based on Look-Up Tables (LUTs) and island-style interconnection, DRPAs have a wide design space, including the PE granularity, the number of hardware contexts that can be switched dynamically, the total amount of wiring resources, and the PE array size itself. Our previous work revealed that the optimal PE array size in terms of area and power consumption differs from application to application [6]. Thus, we believe that there is no all-round DRPA architecture, and that the structure should be configurable or customizable for its main target applications. Since DRPAs are embedded into an SoC, their architectures should be customized at design time.
The objective of our project, the Multi-Core Configurable Reconfigurable Architecture (MuCCRA) project, is to develop a design methodology and framework that generate highly configurable DRPAs for various target applications. In this paper, as the first step of the project, we develop a flexible architecture generator, and we model and parameterize the target DRPAs. We then analyze the impact of the architectural parameters on area and delay.
Fig. 1
The most fundamental parameter of a DRPA is the granularity of the PE, denoted by G. G specifies the data width handled in a PE and in the interconnection. In most cases, G is set between 4 and 32.
DESIGN ENVIRONMENT
The flexibility of interconnection within a PE Core is defined by the number of selectors provided on the inputs and outputs of functional units such as the ALU. Each functional unit of a PE Core has an input selector, and the number of input channels that the unit can select from is an important parameter. As shown in Fig. 2, the numbers of input channels for the SMU, the ALU, and the register file are denoted by Fsmu, Falu, and Freg, respectively. These parameters correspond to the flexibility of intra-PE local routing.
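The link between these channel counts and configuration data can be illustrated with a minimal sketch. It assumes that each input selector is controlled by a plain binary-encoded select field, which the paper does not state, and that each unit has exactly one such selector as described above. The function names, the example value of 4 channels per unit, and the use of 32 contexts (the value used in the evaluation in this paper) are for illustration only.

```python
def select_bits(num_channels: int) -> int:
    """Bits needed to binary-encode a choice among num_channels inputs.

    Assumption: plain binary encoding of the selector control field.
    """
    if num_channels < 1:
        raise ValueError("a selector needs at least one input channel")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (num_channels - 1).bit_length()


def pe_core_select_bits(f_smu: int, f_alu: int, f_reg: int) -> int:
    """Select bits per context for the three input selectors of a PE Core."""
    return select_bits(f_smu) + select_bits(f_alu) + select_bits(f_reg)


if __name__ == "__main__":
    # Hypothetical setting: 4 selectable channels per functional unit.
    bits_per_context = pe_core_select_bits(f_smu=4, f_alu=4, f_reg=4)
    contexts = 32  # number of hardware contexts used in the evaluation
    print(bits_per_context, bits_per_context * contexts)
```

Under these assumptions the select field grows only logarithmically with the number of channels, whereas the selector hardware itself has to accommodate every additional input.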
Each PE is connected to the global routing wires via connection blocks. The connection blocks pick up data from the global routing wires and distribute it to all functional units of the PE Core. We define the numbers of inputs and outputs that can be connected to the connection blocks as Fpi and Fpf. If the connection blocks can get the data from global routing wires in 4 directions, the number of connections in 
Array Architecture
Our DRPAs have a two-dimensional PE array whose size is denoted by (M, N). An island-style interconnection structure, like that of traditional FPGAs, is adopted. Fig. 3 shows an example of a DRPA with (M, N) = (4, 4). As shown in this figure, each PE is surrounded by programmable routing wire segments, and the connection blocks in each PE mediate the connection between the PE and the global routing resources.
At each intersection of a vertical and a horizontal channel, a Switching Element (SE) is placed. The SE is a set of simple programmable switches through which an incoming link is connected to other SEs.
Because of space limitations, the other parameters are fixed: the channel number W = 4, the array size (M, N) = (4, 4), and the number of contexts C = 32. In this work, the analysis is limited to the PE array; the distributed memory modules provided at the edge of the PE array are excluded.
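The parameter set used in this evaluation can be written down as a small record, as sketched below. This is only an illustration: the class name, the field names, and the validation are hypothetical and do not reflect the generator's actual input format, and the list of granularities in the sweep (8, 16, 24, 32) is an assumption based on the 8-bit steps mentioned in the evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrpaParams:
    """Hypothetical record of the architectural parameters named in the text."""

    granularity: int      # G: data width of a PE and the interconnection, in bits
    m: int = 4            # M: first dimension of the PE array
    n: int = 4            # N: second dimension of the PE array
    channels: int = 4     # W: channel number
    contexts: int = 32    # C: number of hardware contexts

    def __post_init__(self) -> None:
        # Reject non-positive values; the real generator may impose other limits.
        for name in ("granularity", "m", "n", "channels", "contexts"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")

    @property
    def num_pes(self) -> int:
        """Number of PEs in the M x N array."""
        return self.m * self.n


# Sweep over G with all other parameters fixed as in the evaluation.
sweep = [DrpaParams(granularity=g) for g in (8, 16, 24, 32)]

if __name__ == "__main__":
    for p in sweep:
        print(p.granularity, p.num_pes, p.channels, p.contexts)
```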
Granularity and Area/Delay
The PE granularity G is usually chosen to match the data size mainly handled by the DRPA. Fig. 4 shows the area of the PE array for various values of G. The area increases almost linearly with G, independent of the SE flexibility (FSW). For any FSW, the area doubles when G is quadrupled (from 8 bits to 32 bits). With G = 32, a die of only 1.5 mm to 2 mm square is needed. This demonstrates that the DRPA is small enough to be used as an IP core in an SoC.
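The area trend can be condensed into a rough linear model, sketched below. It assumes a straight line through the two reported points (relative area 1 at G = 8 and 2 at G = 32); values in between are interpolated and are not measured results. The function name and its default arguments are illustrative only.

```python
def relative_pe_array_area(
    g_bits: float,
    g_ref: float = 8.0,
    g_top: float = 32.0,
    ratio_at_top: float = 2.0,
) -> float:
    """PE array area at granularity g_bits, relative to the area at g_ref.

    Assumption: the area is exactly linear in G between g_ref and g_top,
    with area(g_top) = ratio_at_top * area(g_ref).
    """
    slope = (ratio_at_top - 1.0) / (g_top - g_ref)
    return 1.0 + slope * (g_bits - g_ref)


if __name__ == "__main__":
    for g in (8, 16, 24, 32):
        print(g, round(relative_pe_array_area(g), 3))
```

Under this model each additional 8 bits of granularity adds about one third of the 8-bit area, so the area per bit of data width falls as G grows.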
As shown in Fig. 5, the critical path delay also increases with G, but the impact is modest compared with that on area. Each 8-bit increase in G adds about 2 ns of delay in the 90-nm CMOS technology. This suggests that a large granularity is advantageous from the viewpoint of the critical path delay. The delay is also not very sensitive to Funit.
Intra-PE Flexibility and PE Area
Fig. 6 shows the total cell area of a PE for each G and Funit. As expected, the area grows with increasing G and Funit. Funit affects the area of the input selectors of each functional unit and that of the output selectors of the connection blocks. In Fig. 6, the increase in area comes mainly from the selectors.
Increasing Funit also enlarges the area of the context memory, as shown in Fig. 7. However, this increase is less severe than the increase in the cell area of a PE; that is, the number of configuration bits is not sensitive to the granularity. Moreover, the additional configuration data for the PE Core is a constant amount each time G doubles. Hence, from the viewpoint of the context memory, a large G is area-efficient.
We have generated various types of DRPAs and evaluated their hardware area and speed using the ASPLA/STARC 90-nm CMOS technology. The evaluation results show that when the PE granularity changes from 8 bits to 32 bits, the area doubles and the delay increases by about 6 ns.
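The overall delay figure can be cross-checked against the per-step figure from the granularity analysis (about 2 ns per 8 bits of G). The sketch below assumes that the delay grows linearly with G at that rate, which is an approximation of the reported trend; the constant and function names are illustrative.

```python
DELAY_STEP_NS = 2.0   # approximate delay increase per step, as reported
STEP_BITS = 8         # granularity step to which the figure refers


def delay_increase_ns(g_from: int, g_to: int) -> float:
    """Approximate critical-path delay increase when G goes from g_from to g_to.

    Assumption: a constant rate of DELAY_STEP_NS per STEP_BITS of granularity.
    """
    return DELAY_STEP_NS * (g_to - g_from) / STEP_BITS


if __name__ == "__main__":
    print(delay_increase_ns(8, 32))  # three 8-bit steps of about 2 ns each
```

Three 8-bit steps separate 8 bits from 32 bits, so the per-step figure is consistent with the total of about 6 ns.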
As future work, we would like to establish an automatic design framework for application-customized DRPAs on the basis of these results. For this purpose, a re-targetable DRPA compiler is now under construction, and we will analyze architectural trade-offs based on real applications.
