Abstract-Many-core processors are designed to improve thread-level parallelism (TLP) across cores while preserving instruction-level parallelism (ILP) within each core. However, each application has its own characteristic TLP and ILP, so a "pre-fabricated" chip multiprocessor (CMP) cannot accommodate a wide range of applications. Recent work has attempted to reconfigure the CMP in order to fit the processor to the application. This paper proposes a scalable processor that can scale the datapath of the adaptive processor up and down. The scaling is based on chaining segmented interconnection networks. The adaptive processor uses a linear topology to form a stack structure. To map the linear array onto a two-dimensional array, we also propose a new topology that scales well, which we call the S-topology. We assess the area and delay costs of the S-topology applied to the adaptive processor, as well as the resulting peak performance.
I. INTRODUCTION
Semiconductor technology gives architects more design space for integrating additional functionality into a microprocessor. Large-scale processors, however, have introduced a design complexity problem: a larger building block lengthens the critical path and increases power consumption [13]. We can no longer rely on scaling the clock frequency, or on adding higher-level functionality, to improve performance. In addition, background logic circuits have come to consume the majority of the chip area in order to improve instruction-level parallelism (ILP). In the 1990s, the multi-core processor was introduced and researched [4]. Multi-core processors are now in routine use, and the number of cores will soon grow into the many-core range.
Many-core processors are designed to improve thread-level parallelism (TLP) across cores while preserving the ILP within each core. However, each application has its own characteristic TLP and ILP, so a pre-fabricated chip multiprocessor (CMP) cannot accommodate a wide range of applications. Recent research efforts have attempted to reconfigure the CMP in order to fit the processor to the application [5] [2]. This approach is called a dynamic CMP. An alternative approach is to use reconfigurable technology such as field-programmable gate array (FPGA) devices [23] [22]. Recently, conventional microprocessors and FPGAs have been merged into a single computing node [19] [20]. However, such a processor and its applications are more complex to design, because an application has to be partitioned into a host program and many hardware tasks. This higher application design workload is an obstacle to bringing reconfigurable computing into the mainstream.
A heterogeneous massively parallel processor (MPP) has difficulty tolerating defects. Heterogeneous MPP technology will be applied only after the major cores have been verified, in a late phase of the market, as an optimization of the homogeneous MPP. Alternatively, rather than implementing higher-level functionality, reconfiguring a datapath with medium-grained functionality provides more flexibility and more resource sharing, which improves utilization. The debate over homogeneous versus heterogeneous pre-fabricated MPPs will become moot as research begins on merging CMP and reconfigurable technology.
Our prior work proposed an adaptive processor that removes the need for partitioning, supports resource management and scheduling on chip, reduces the reconfiguration workload, and reduces the workload of designing the processor and its applications [14]. This paper studies a dynamic CMP architecture for the adaptive processor (AP). We call this architecture a very large-scale integrated (VLSI) processor. Because the AP does not need an instruction set architecture in its basic model, we need to investigate how to interface the VLSI processor with its application without incurring area and time overheads. Our approach to up- and down-scaling is simply to chain and unchain segmented interconnection networks by programming switches. The AP can configure multiple application datapaths in a sequential configuration manner. Giving the AP the capabilities of a dynamic CMP allows control-intensive applications consisting of basic-block datapaths to perform well, in addition to data-intensive streaming applications. The AP uses a linear topology, and the linear array has to be folded efficiently into two dimensions. To map the linear array onto a two-dimensional array, we propose a new topology that serves the scaling, which we call the S-topology. We propose a simple and general model of the S-topology and a model of the VLSI processor. The VLSI processor provides the following benefits:
• Processor Optimization. Each application has its own characteristic locality and dependency. By reconfiguring the processor, we obtain the configuration that best fits these characteristics. For example, a streaming application with heavy data dependency probably needs more resources to configure its datapath. The application requests the resources, and its datapath is configured on a large-scale AP. Application designers know the optimal amount of resources, so they should be able to control the reconfiguration through some methodology.
• Balance between General-Purpose and Application-Specific. A general-purpose processor does not reach the performance of application-specific processors (ASIPs). By reconfiguring to the optimal processor configuration, we can coordinate the computational trade-off between the application and the processor. Which point the designer wants to optimize is application dependent. It is probably a coordination between the clock cycle time and the number of resources, which together control performance, throughput, area, and power consumption. At the least, the number of resources can be controlled by this technology.
• Guard Data-Intensive Datapaths from Control-Intensive Datapaths. Control flow sometimes flushes the processor pipeline and decreases performance and throughput. In the AP, control flow breaks the regular reconfiguration of the datapath and causes unpredictable swap-in and swap-out of objects. If the basic blocks, which are delimited by the control flow, are each mapped to the VLSI processor, they can execute with a small overhead and without interfering with one another. The isolated basic-block processors communicate through inter-processor communication that activates the following basic block(s).
• Defect Tolerance. Scaling to hundreds or thousands of processing elements and memory blocks on a chip will increase the number of defects. With the VLSI processor architecture, an AP that has failed can be removed from the system. For example, suppose four APs are on a chip and can be fused into one large-scale processor, two medium-scale processors, or four small-scale processors. When the second AP fails, the first AP can serve as a small-scale processor, and the third and fourth APs can be fused into a medium-scale processor or split into two small-scale processors.
The next section reviews our prior work on the adaptive processor. The third section explains and discusses the basic S-topology. The fourth section assesses the costs in terms of area and delay, and the peak performance derived from the delays. Related work is discussed in the fifth section, and the final section concludes this paper.
II. ADAPTIVE PROCESSOR
As we move towards thousands of processing and memory elements on a single chip, one of the most important requirements for achieving peak performance is resource management and scheduling. The CMP does not support resource management and scheduling on chip. A larger-scale many-core processor will therefore easily exhibit a larger gap between peak and effective performance, probably because managing and scheduling resources costs many cycles; the alternative is to spend more effort on optimization. Because the most frequently used operations should be on chip, we implemented this functionality on chip. This section explains the resulting processor, called the adaptive processor, which is designed to reduce the workload of processor and application design, resource management and scheduling, and reconfiguration.
A. Object and 2-Level Configuration
A processing element, called a physical object, performs the operation defined by its configuration data, which is called local configuration data. A pair of initial data and local configuration data is called a logical object, and a logical object bound to a physical object is called an object. The binding is not tight: the logical object moves across the array of physical objects, as explained later. To configure an application datapath, the chaining between operators is defined by global configuration data, each element of which consists of a sink object ID and source object IDs. In the global configuration data stream, a dependency is therefore represented by these IDs.
B. Processor Pipeline
The adaptive processor has the following pipeline stages:
1) Pointer Update Pipeline stage, which updates a pointer addressing an element of the global configuration data stream. This stage is independent of the following pipeline stages.
2) Request Fetch Pipeline stage, which fetches an element of the global configuration data. This stage is similar to the instruction fetch stage in conventional pipelined processors.
3) Request Evaluation Pipeline stage, which evaluates the request carried by the fetched element. A memory access request is also evaluated in this stage.
4) Request Pipeline stage, which requests resources (the necessary objects). The global configuration data stream for an object cache miss is inserted at this stage.
5) Acquirement Pipeline stage, which acquires the resources for the request. Routing is performed during this stage by an acquirement signal from special registers, called the working-set register file (WSRF), which keep the acquired elements.
After the acquirement, objects are free from control. An object is released by receiving and firing release token(s) from preceding object(s).
C. Configuring an Application
Figure 1 shows the procedure for configuring part of an application datapath, which consists of a set of objects. The figure shows the requested objects, the procedure for chaining the objects, and the termination of their operation. At the request pipeline stage, the necessary resources (logical objects) are searched for in the processor. An object that hits acknowledges the hit and activates the execution fabric in its physical object. In the next cycle, the acquirement pipeline stage, the object receives an acquirement signal from a WSRF. The acquirement signal indicates which communication port should be used for chaining between objects. On an object cache miss, the missing logical object(s) are loaded from the library in the memory blocks into configuration buffer object(s). After all requested but missing logical objects have been loaded, the processor performs a stack shift from the top of the stack towards the bottom in order to enter the loaded logical object(s) into the object space. Once the logical objects have been entered, they are requested again and are then chained (acquired).
D. Stack Shift and Dependency
An array of physical objects forms a stack structure. The stack structure makes placement deterministic and locality-based: placement always occurs at the top of the stack. Because the stack shift keeps the objects in the array sorted, replacement based on an LRU algorithm is easy to implement: objects close to the bottom of the stack are the candidates for replacement.
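The placement and replacement behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism: the class name `ObjectStack` and its `request` method are hypothetical, and it assumes that a hit moves the object back to the top of the stack, as in a classical LRU stack.

```python
from typing import List, Optional, Tuple


class ObjectStack:
    """Toy model of the physical-object array used as an LRU stack."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots: List[str] = []  # index 0 is the top of the stack

    def request(self, obj_id: str) -> Tuple[bool, Optional[str]]:
        """Request a logical object; return (hit, swapped_out_id)."""
        swapped_out = None
        if obj_id in self.slots:
            hit = True
            self.slots.remove(obj_id)  # assumption: a hit re-enters at the top
        else:
            hit = False
            if len(self.slots) == self.capacity:
                # The bottom of the stack holds the least recently used object.
                swapped_out = self.slots.pop()
        self.slots.insert(0, obj_id)  # placement is always at the top
        return hit, swapped_out


stack = ObjectStack(capacity=3)
results = [stack.request(x) for x in ["a", "b", "c", "a", "d", "b"]]
```

With a capacity of three, the second request for `a` hits, while the later request for `b` misses because `b` had drifted to the bottom of the stack and was swapped out when `d` entered.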
The study of stack algorithms has shown a relationship between the stack distance and the cache hit rate [11] . The stack distance is the distance from the top of the stack to the cache hit location (physical object). In order to make a hit always occur, the stack distance has to be less than or equal to C, where C is the capacity of the cache, namely the array size regarding the adaptive processor.
An element of the global configuration data stream requests resources. Once the request has been acquired, the resources are on chip and chained to one another. An element simply expresses object IDs, so the stream shows the dependencies used for the chaining. The stack distance is equivalent to the dependency distance in the CACHE model [14]. The dependency distance can be observed from the object code, which lists the object IDs.
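The relationship between stack distance and capacity C can be checked on a short sequence of object IDs. The following is a minimal sketch: the function names are hypothetical, the input is assumed to be the object IDs in the order they appear in the global configuration data stream, and a distance of 1 is taken to mean the top of the stack.

```python
from typing import List, Optional, Sequence


def stack_distances(refs: Sequence[str]) -> List[Optional[int]]:
    """Return the LRU stack distance of each reference (None on first use)."""
    stack: List[str] = []  # index 0 is the top of the stack
    distances: List[Optional[int]] = []
    for obj_id in refs:
        if obj_id in stack:
            distances.append(stack.index(obj_id) + 1)  # 1 = top of the stack
            stack.remove(obj_id)
        else:
            distances.append(None)  # never seen before: a compulsory miss
        stack.insert(0, obj_id)
    return distances


def hit_rate(refs: Sequence[str], capacity: int) -> float:
    """Fraction of references whose stack distance is at most `capacity`."""
    distances = stack_distances(refs)
    hits = sum(1 for d in distances if d is not None and d <= capacity)
    return hits / len(distances)


trace = ["a", "b", "c", "a", "d", "b"]
distances = stack_distances(trace)
```

For this trace the repeated references have distances 3 and 4, so an array of three objects serves only the first of them, while an array of four serves both.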
E. Virtual Hardware
An unused object should be swapped out to a memory block in order to make room for newly requested object(s). This replacement is equivalent to the write-back policy of a conventional cache memory. On an object cache miss, the missing object(s) are loaded, and replaceable object(s) are stored if necessary. The replacement is scheduled by a special interconnection network that forms a scheduling table [14]. Virtual hardware is supported when the processor works on purely scalar operations. When an operation involves streaming, the reconfigured datapath has to be smaller than the capacity C, because streaming does not allow part of the datapath to be swapped out.
F. Channel Segmentation Distribution Model
The basic AP uses a global interconnection network for the chaining. Previous work, which aimed to demonstrate the CACHE model, did not consider this network. A global interconnection network is suitable only for a small number of physical objects (a small area), and we needed to remove this limitation.
1) Chaining Processors as Scaling:
A channel segmentation distribution (CSD) on the AP keeps the number of channels (tracks) constant. Scaling the AP simply chains the segmented global interconnection networks that are used for finding LRU object(s), for the stack shift, and so on. Cache hit detection can be processed centrally in the WSRF instead of by searching the array, and the WSRFs can be searched in parallel.
2) Extended Model of CSD Network: We have found that allocating a channel to each chaining in advance is suitable only for static configuration. This subsection discusses how to extend the basic CSD network. In a CSD network, the channel has to be selected dynamically. Our approach is a dynamic CSD network with chain/unchain control, in which each channel is completely segmented into one-hop segments. The segments are chained in the initial state and are unchained by the routing procedure. Figure 2 shows a much simplified logic circuit.
The source object broadcasts a request signal to every channel. The signal passes through a request network that is also segmented into one-hop segments and whose default state is "chained". The sink object has a priority encoder that decides which channel is used for the request. The sink object checks the grant signal from the encoder, from which it learns of the routing request. The grant signal is stored in a memory cell that unchains the request network and gates data from the channel into the sink object. The grant signal is also sent back to the source object as an acknowledgement. This approach supports stack shifting from the top to the bottom of the stack. Consequently, deciding which channel to select, sending the channel number, and the acquirement signal are unnecessary in this sequence.
We developed a CSD simulator to evaluate this approach. Figure 3 shows the evaluation result for a one-source model (not a two-source model): the number of channels used by a random datapath configuration. N_object is the number of physical objects. The figure plots locality against the number of used channels. A random request on the sink object and a locality-based request on the source object were used. The leftmost plots have higher locality and generally use fewer channels. The figure shows that N_object channels were never needed and that N_object/2 channels are sufficient for the random datapath. Although fan-out (broadcast) requires more channels, up to N_object, the remaining channels can be allocated to the fan-out.
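A channel-usage experiment of this kind can be approximated in a few lines. The sketch below is not the authors' simulator. It assumes a linear array in which segment s of every channel lies between objects s and s + 1, that a route claims every segment between its source and sink on one channel, and that the priority encoder grants the lowest-numbered channel that is free along that span. All names and the random request model are hypothetical.

```python
import random
from typing import List, Optional


class SegmentedChannels:
    """Channels over a linear array, each cut into one-hop segments."""

    def __init__(self, n_objects: int, n_channels: int) -> None:
        # busy[c][s] is True once segment s (between objects s and s + 1)
        # of channel c has been claimed by a route.
        self.busy: List[List[bool]] = [
            [False] * (n_objects - 1) for _ in range(n_channels)
        ]

    def route(self, source: int, sink: int) -> Optional[int]:
        """Grant the lowest-numbered channel that is free between the two
        objects (the priority-encoder choice), or None if none is free."""
        if source == sink:
            raise ValueError("source and sink must differ")
        lo, hi = sorted((source, sink))
        for channel, segments in enumerate(self.busy):
            if not any(segments[lo:hi]):
                for s in range(lo, hi):
                    segments[s] = True
                return channel
        return None

    def channels_used(self) -> int:
        return sum(1 for segments in self.busy if any(segments))


def channels_for_random_datapath(
    n_objects: int, max_distance: int, seed: int = 0
) -> int:
    """One source per sink, drawn uniformly within max_distance hops."""
    rng = random.Random(seed)
    net = SegmentedChannels(n_objects, n_channels=n_objects)
    for sink in range(n_objects):
        lo = max(0, sink - max_distance)
        hi = min(n_objects - 1, sink + max_distance)
        source = rng.choice([s for s in range(lo, hi + 1) if s != sink])
        if net.route(source, sink) is None:
            raise RuntimeError("ran out of channels")
    return net.channels_used()
```

Calling `channels_for_random_datapath` with a small `max_distance` models high locality; sweeping that parameter gives a rough analogue of a locality-versus-channels plot, although the numbers it produces are not those of the paper.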
This approach must consider how much area reduction is acceptable while still providing sufficient routability.
G. Summary
The processor pipeline serves the resource management and includes placement and routing as pipeline stages. Placement always occurs at the top of the stack, and the stack structure serves both LRU replacement and object caching. Dependencies configure the application datapath, so the global configuration data simply expresses the dependencies. The dependency distance is key to efficient processing: it should not be larger than the capacity, in order to avoid object cache misses. The dynamic CSD approach was proposed in order to reduce the area requirement, and routability is traded off against area.
Figure 3. Locality versus Number of Used Channels
III. 2D ARRANGEMENT FOR LINEAR ARRAY
The AP uses a linear array topology to form a stack structure. This section explains a new scalable topology, which we call the S-topology, for mapping the linear array onto a two-dimensional array.
A. S-Topology
A topology has to have the following properties:
1) The topology has to be hierarchical or fractal. A hierarchical or fractal topology is necessary for fair placement on any processing element. In addition, it helps to keep the structure for scaling simple.
2) The array has to have a minimum number of layout patterns. A minimum cluster of processing (and memory) elements allows more flexibility and reconfigurability.
3) The chain/unchain switch points have to form a regular pattern. Such a pattern provides predictability for control and routing, which supports placement and routing on a very large-scale array.
Figure 4 (a) shows the S-topology. The cluster shown in Figure 4 (b) is simply replicated. Each AP has a stack structure and thus a linear topology. The linear network is folded into a 2D arrangement as shown in Figure 4 (c). Although the figure shows a single direction, which is the stack shift direction, bidirectional communication is possible with the proposed dynamic CSD architecture. By applying the dynamic CSD and a modular structure that does not require a large number of metal wire layers, the folded linear network can be placed on the higher metal layers in a fashion similar to that of a recent many-core processor [8]. The S-topology network can unchain (split) the array into an arbitrary shape formed by connecting clusters, as shown in Figure 5. Such a shape can form a ring topology on the 2D array.
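The idea of folding a linear array into two dimensions while keeping neighbours adjacent can be sketched with the simplest possible mapping, a row-by-row serpentine. This is an illustration only and is not the S-topology itself, whose cluster structure is defined by Figure 4; the function name `fold` and the fixed row width are assumptions.

```python
from typing import Tuple


def fold(index: int, width: int) -> Tuple[int, int]:
    """Map a position in the linear array to (row, column) on a 2D grid.

    Even rows run left to right and odd rows right to left, so that
    consecutive positions in the linear array stay physically adjacent.
    """
    row, col = divmod(index, width)
    if row % 2 == 1:
        col = width - 1 - col
    return row, col


positions = [fold(i, width=4) for i in range(12)]
```

Every pair of consecutive indices lands on grid cells one step apart, which is the property a stack shift along the folded array relies on.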
B. Programmable Switch
Any arbitrarily shaped region can be configured on the array by the programmable switches. Figure 6 (b) and (c) show the basic architecture, in which each box is a programming register: (b) is a programmable switch for the unidirectional stack shift interconnection network, and (c) is a programmable switch for the bidirectional chain interconnection network. The default state of the programmable switches is "unchained". The VLSI processor can also be implemented with die stacking (chip-on-chip) by connecting the lower and upper dies, as shown in Figure 6 (d).
C. Scaling Operations
To configure an AP of the necessary scale, a processor of executable scale (the minimum requirement) is configured first; up-scaling can be performed afterwards. Scaling is done by programming the switches, so the processor is reconfigured by storing the appropriate configuration data in the appropriate switches. Figure 6 (e) shows the basic state diagram, consisting of the release, sleep, active, and inactive states. A processor starts from and ends in the release state, in which it is neither used nor allocated. After the switches of a minimum AP have been programmed, the processor enters the inactive state, in which it is ready to execute but is not read- and write-protected from other processors. Once a timer and the read and write protections for the scaled region are set, the processor is invoked as a scaled active AP. An active processor can return to the inactive state by clearing the read and/or write protection. In the inactive state, other processors can access its memory blocks; storing global configuration data and libraries, and spilling and filling data in the memory blocks, are therefore done in this state. Whether global configuration data is fetched depends on the application. In the sleep state, the processor is ready to execute and is read- and write-protected from others, but global configuration data is not fetched. A scaled active AP can sleep while waiting for an event, which is raised either by the timer or from inside the processor.
Figure 7 shows a very simple example of the scaling procedure and execution. The application can be partitioned into four atomic blocks, as shown in Figure 7 (b), and each atomic block can be mapped to a scaled AP. Another processor, which may be a preceding atomic block or a supervisor processor, configures the four processors. Configuring them in order may yield a spatially local placement, as shown in Figure 7 (b). The configuration is based on wormhole routing, as shown in Figure 7 (c). Immediately after configuration, each processor is in the inactive state.
A preceding processor sends data by accessing and writing to the memory block of the following processor (shown in Figure 7 (d)). The first processor sends data to either the second or the third processor, depending on the condition. The second or third processor is then activated and sends its result to the fourth processor. The fourth processor receives the appropriate data while in its inactive state. This can become a pipelined execution across the multiple processors. In general, conditional execution can break the regular partial reconfiguration of a scaled AP and have a negative impact. In this example, the second and third basic blocks form the irregular sequence. Because the basic blocks are isolated, this example does not suffer such an impact.
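The processor states used in this section can be written down as a small transition table. The sketch below is one reading of the prose, not a transcription of Figure 6 (e): the event names are hypothetical, waking from sleep is assumed to return the processor to the active state, and the transition from active to release follows the description of down-scaling by clearing the active state.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class State(Enum):
    RELEASE = "release"    # not used and not allocated
    INACTIVE = "inactive"  # ready to execute, memory blocks open to others
    ACTIVE = "active"      # executing, read/write protected
    SLEEP = "sleep"        # protected, not fetching configuration data


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    (State.RELEASE, "program_switches"): State.INACTIVE,
    (State.INACTIVE, "set_protection"): State.ACTIVE,
    (State.ACTIVE, "clear_protection"): State.INACTIVE,
    (State.ACTIVE, "wait_event"): State.SLEEP,
    (State.SLEEP, "event"): State.ACTIVE,  # assumed wake-up target
    (State.ACTIVE, "clear_active"): State.RELEASE,
}


def step(state: State, event: str) -> State:
    """Apply one event, rejecting transitions the table does not list."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.value}")


def run(events: Iterable[str], start: State = State.RELEASE) -> State:
    state = start
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["program_switches", "set_protection", "wait_event", "event", "clear_active"])` walks one processor from release through inactive, active, and sleep, and back to release.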
D. Inter-Processor Communication
An interconnection network used for inter-processor communication can also be used for the scaling. Figure 7 (c) shows the proposed reconfiguration methodology for up-scaling. We use wormhole routing for both up- and down-scaling. Execution uses the inactive state: a preceding processor activates the following processor, and before the activation it stores the data to be sent in that processor's memory block, as shown in Figure 7 (d). Figure 7 (e) shows the current router architecture, which is under development. Down-scaling is performed with wormhole routing along the unidirectional network by clearing the active state, which returns the processor from the active state to the release state.
IV. COST ASSESSMENTS
Table I lists the area requirements for the physical object, obtained from [12]. The general-purpose compute fabric includes 64-bit floating-point and ALU modules. Because the reference for these area requirements does not include dividers, we used weight values estimated from [17] to calculate the raw area of the dividers. Table II lists the area requirements for the memory block. The ALU-II is used for the vector length, the hardware loop, and so on. An instruction register is used for a sequencer object. We set [21]. A global wire delay is calculated as the square root of the total area of the physical object in λ² (shown in Table IV). We obtained peak GOPS (giga operations per second) values, excluding load and store streams, as shown in Table IV. These values are derived by treating the global wire delay as the critical delay of the chaining between a memory block and a physical object; because memory blocks are not relocatable, the global network is still required.
V. RELATED WORK
The [6] . Project [4] is the first tile processor where each processor has bypass paths connecting it to its neighbours [7]. Such an architecture requires the application to be partitioned into working sets (tasks), each consisting of a program and data, because the tiles are not scaled.
The unchained approach also requires synchronization between the tasks, so an inter-processor communication methodology is required as well. Moreover, placement (configuration) is probably static. Our approach is based on scaling, which eliminates the need to compile to a set of dedicated working sets. Objects form the application datapath. An application compiler simply needs to take care that the linear array size fits the application datapath to the fused region, in order to enable streaming on the datapath.
A. VLSI Processor
Recently, core fusion [5] and composable processors [2] have been proposed as scalable methods. Core fusion has the advantage that it does not require a compiler to schedule instructions for fusing and splitting; however, its scalability is limited to fusion into a processor of at most eight-issue width. The splitting and fusion instructions are part of its instruction set architecture. The composable lightweight processor has the advantage of reconfiguration to a large scale; however, it requires a special compiler for scheduling. In addition, data-flow processing built on the conventional processor philosophy results in an inefficient flow of data: the commitment of instructions is delayed until the data flow on the critical path completes, and a large-scale array could potentially have a long critical path. The VLSI processor uses unchaining and chaining, and has no specific instruction for them. There is also no specific procedure for chaining and unchaining; it simply requires routing and storing a data set, as in ordinary communication. The scalability is limited only by the format of the configuration data and by the wire delay. A resource is released by firing release tokens, which frees idle resources as early as possible and so reduces idle time.
Recently, a ring topology has been used for multi-core processors [15] [8]. The topology supports relocatability and therefore flexible placement. Its latency increases with the number of cores, so the technique is scalable only for a small number of cores. The combination of a modular structure and this topology allows flexible configuration, so that the number of cores may be changed at design time without significant impact on the layout. More recently, however, the mesh topology has become a popular alternative [1] [3] [6]. The topology is very simple and completely scalable and relocatable, and it has abundant bisection bandwidth. Although it allows freedom of placement, a host system has to manage placement, routing, replacement, and defragmentation. As shown previously, the ring topology can be implemented on the S-topology, and the VLSI processor is able to manage these tasks itself.
VI. CONCLUSION
This paper proposed a scalable processor, called a very large-scale integrated (VLSI) processor, that can scale the datapath of the adaptive processor up and down. Up- and down-scaling is simply chaining and unchaining of the segmented interconnection networks. The scaling does not require a dedicated instruction; it simply stores the appropriate configuration data in the appropriate programmable switches in a wormhole reconfiguration (communication) manner. There is no specific logic circuit for the scaling, so its area cost is very low. The adaptive processor uses a linear topology to form a stack structure. To map the linear array onto a two-dimensional array, we also proposed the S-topology. The dynamic CSD network was applied to the VLSI processor and is able to reduce the area requirement. Architects must reduce the number of channels carefully, because the number of channels determines the routability. Costs in terms of area and delay, and the peak performance, were assessed in this study. A pure 64-bit performance of 276 GOPS can be achieved in an area of 1 cm² without SIMD features or fused operations, using current process technology.
