Abstract-This paper presents a hardware-supported resource management methodology for massively parallel processor arrays. It enables processing elements to autonomously explore resource availability in their neighborhood. To support resource exploration, we introduce specialized controllers, which can be attached to each of the processing elements. We propose different types of architectures for the exploration controller: fast FSM-based designs as well as flexible programmable controllers. These controllers make it possible to implement different distributed resource exploration strategies, so that parallel programs can explore and reserve available resources according to different application requirements. Hardware cost evaluations show that the cost of the simplest implementation of our programmable controller is comparable to that of our FSM-based implementations, while offering the flexibility to implement different exploration strategies. We show that the proposed distributed approach can achieve a significant speedup in comparison with centralized resource exploration methods.
reserve (invade) them for executing a parallel program for a certain amount of time. In order to provide hardware support for implementing resource exploration strategies in a decentralized manner within the processing elements, a specialized exploration controller, denoted as invasion controller, is proposed. In this paper, we present different architectural designs for the invasion controller. These architectural versions range from simple FSM-based hardware controllers to fully programmable VLIW-based controllers, which can be parameterized at synthesis time to exploit different levels of instruction parallelism. Different distributed exploration strategies may be implemented on these controllers, which make it possible to capture regions of PEs with various shapes and sizes. The hardware controllers we present implement two different flavors of exploration strategies, which capture available resources either a) in a linearly connected fashion or b) in rectangular connected regions. These shapes reflect typical interconnect topology requirements of many signal and image processing algorithms. The hardware cost of our two types of designs is compared, and their speedup over a centralized approach is evaluated. For this comparison, our proposed resource management methods are also prototyped in a centralized way on a LEON3 processor and compared with our decentralized implementations in terms of exploration time.
The rest of this paper is organized as follows: A brief overview of related work is given in the next section. In Section III, our distributed resource exploration methodology is explained. Section IV presents our different architectures for distributed resource management, followed by a quantitative evaluation in terms of speedup for different design options and their area cost in Section V. Finally, the paper is concluded in Section VI.
II. RELATED WORK
In this section, we give an overview of recent projects from academia on dynamic application mapping and resource-aware computing in MPSoCs. In the TRIPS project [1], an array of small processors is used for the flexible allocation of resources dynamically to different types of concurrency, ranging from running a single thread on a logical processor composed of many distributed cores to running many threads on separate physical cores. In the CAPSULE project [2], the authors describe a component-based programming paradigm combined with hardware support for processors with simultaneous multithreading in order to handle the parallelism in irregular programs. Here, an application is dynamically parallelized at run-time. A pure software version of CAPSULE, demonstrated on an Intel Core 2 Duo processor, is presented in [3]. However, the above approaches address neither the major problems of algorithmic design nor the issue of hardware support for resource management and distribution of workload across a given architecture. Concerning dynamically reconfigurable architectures, the MORPHEUS project [4] aims to develop new heterogeneous reconfigurable SoCs with various types of reconfiguration granularity. Resano and others [5] developed a hybrid design/run-time prefetch heuristic that schedules reconfigurations at run-time, but carries out the scheduling computations at design-time. In [6], a configuration management mechanism is presented for multi-context reconfigurable systems targeting DSP applications, in order to minimize configuration latency. Similarly, in [7] a scheduling algorithm is proposed to tackle the scheduling problem in dynamically reconfigurable FPGAs. The application mapping for DRP [8], PACT XPP [9], and ADRES [10] is done in a similar way, where the array can be switched between multiple contexts or can be reconfigured quickly at run-time.
The authors in [11] have introduced an approach based on integer linear programming for loop-level task partitioning, task mapping and pipeline scheduling, while taking the communication time into account for embedded applications.
All the aforementioned mapping approaches have in common that they are relatively rigid, since they have to know the number of available resources at compile time. Furthermore, the above architectures are controlled centrally and often provide no mechanisms to manage the utilization of the computing resources. In order to tackle this problem, we introduce novel, distributed hardware architectures for resource management in MPSoCs. For large MPSoCs with hundreds to thousands of tightly-coupled processor elements, we show that these concepts scale better than centralized resource management approaches. The distributed nature of our approach makes it possible to perform resource exploration in parallel. Another advantage is that only short connections between neighboring resources are needed and no global communication mechanism is required.
III. DISTRIBUTED RESOURCE EXPLORATION METHODOLOGY
In this section, we describe our new adaptive methodology for dynamic, decentralized exploration and reservation of resources for MPSoC architectures. Its main contribution is to give each application the ability to explore and claim (invade) available resources in a specific neighborhood, to copy its configuration code (infect) to such captured resources, and then to execute the given program in parallel on the employed resources. Using this approach, applications are able to claim their computational resources in order to exploit a dynamically changing degree of parallelism. After finishing a phase of execution, the application may free its previously occupied resources by performing a release operation (retreat).
The chart depicted in Fig. 1 shows a typical state transition diagram that occurs during the execution of a parallel program. In the beginning, an application may change into the INVADE state and claim a number of PEs by issuing an invade command to the local neighbors. This command triggers a process that propagates recursively through the PE network in order to explore and reserve the required number of resources. After that, the application transits to the INFECT & EXECUTE state in order to copy and then to execute the application code on the captured PEs. Once the parallel execution phase is finished, the application can change into the RETREAT state and free the captured resources by issuing a retreat command. As our methodology is mainly applied to loop programs, which are relatively small pieces of program code, the infection can be performed in a short time. In addition, it is possible to group the PEs that receive the same configuration and infect them simultaneously, which shortens the infection time.
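The phase sequence described above can be written down as a small transition table. The following is a minimal sketch, assuming an additional IDLE state before the first invade and assuming that an application may invade again after a retreat; both are illustrative choices, and the names `AppState`, `TRANSITIONS` and `step` are hypothetical.

```python
from enum import Enum, auto


class AppState(Enum):
    IDLE = auto()            # assumed starting state, not named in the text
    INVADE = auto()          # explore and reserve PEs
    INFECT_EXECUTE = auto()  # copy code to the captured PEs and run it
    RETREAT = auto()         # free the captured PEs


# Allowed transitions, following the order of phases described in the text.
# The edges leaving RETREAT are an assumption made for this sketch.
TRANSITIONS = {
    AppState.IDLE: {AppState.INVADE},
    AppState.INVADE: {AppState.INFECT_EXECUTE},
    AppState.INFECT_EXECUTE: {AppState.RETREAT},
    AppState.RETREAT: {AppState.INVADE, AppState.IDLE},
}


def step(state, target):
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {target.name}")
    return target
```

Encoding the transitions as data rather than as nested conditionals keeps the sketch easy to compare against a state diagram.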
The next section explains different exploration methods, which provide support for acceleration of typical signal and image processing algorithms.
A. Exploration Methods
Different applications may have different computational requirements to obtain their best performance. Such requirements may include the type and number of computational resources and an appropriate interconnection topology, which couples the resources together. In order to minimize the communication overhead, the required resources may be claimed according to certain specifications, which can include information about the size and shape of the domain. In [12], in order to map applications into bounded regions, convex regions of processing elements are considered. This problem becomes even more difficult when considering not only the shape, but also the interconnection topology of MPSoC architectures [13]. In this work, we consider MPSoCs whose PEs are connected in a 2D mesh. For these types of systems, we introduce different exploration methods (invasion strategies), which make it possible to claim a domain of processing elements with different topologies and shapes. Our methodology is evaluated for strategies that reserve linear and rectangular regions, since these are very frequently required for typical signal and image processing algorithms. These strategies are explained briefly in the following.
Linear region strategies, LIN: The main objective of this type of invasion strategy is to obtain a chain of linearly connected PEs inside of an array of PEs. Like all the invasion strategies, this one works in a distributed, recursive manner. Each PE controller performs one step of invasion by finding a single available neighbor according to a certain invasion policy and invading it, which means signaling it to continue the invasion. The invaded neighbor continues invasion recursively. In this paper we propose three different policies for claiming of a rectangular invasion strategy, it can be specified, which directions should be explored. InstrOperands denote the size of the claimed region. In case of a linear invasion strategy, the number of needed PEs is specified in this field and in case of a rectangular strategy, these operands denote the required dimensions of the rectangular region. As an example, the invasion command for the linear invasion shown in Fig. 2(b) is (INV, LIN, RND, 15), which characterizes a request for reserving 15 linearly-connected PEs in a random-walk fashion. In case of the rectangular exploration, depicted in Fig. 3, the invasion command is (RECT, SE, 3, 5), which leads to the reservation of a rectangular region that originates from PE(0, 0), expands to the East and the South, and contains 3 rows and 5 columns of PEs.
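The two strategies can be illustrated with a small software model. The following is a minimal sketch, not the hardware implementation: it assumes a boolean occupancy grid, counts the starting PE as part of the claimed region, and gives up without backtracking when the random walk has no free neighbor left. The function names and the grid representation are hypothetical, and the rectangular function only computes the resulting region instead of modeling the step-by-step propagation between controllers.

```python
import random

# Mesh directions as (row, column) offsets.
DIRS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def invade_linear_random(free, start, count, rng):
    """Claim `count` linearly connected PEs by a random walk from `start`.

    `free[r][c]` is True if PE(r, c) is available. Returns the chain of
    claimed coordinates, or None if the walk gets stuck (assumption: no
    backtracking).
    """
    rows, cols = len(free), len(free[0])
    if not free[start[0]][start[1]]:
        return None
    chain = [start]
    claimed = {start}
    while len(chain) < count:
        r, c = chain[-1]
        # One invasion step: collect the free, not yet claimed neighbors.
        candidates = [
            (r + dr, c + dc)
            for dr, dc in DIRS.values()
            if 0 <= r + dr < rows and 0 <= c + dc < cols
            and free[r + dr][c + dc] and (r + dr, c + dc) not in claimed
        ]
        if not candidates:
            return None  # ran into an inaccessible region
        nxt = rng.choice(candidates)  # random-walk policy
        chain.append(nxt)
        claimed.add(nxt)
    return chain


def invade_rect(free, origin, direction, n_rows, n_cols):
    """Claim an n_rows x n_cols rectangle growing from `origin`.

    `direction` combines a vertical and a horizontal letter, e.g. "SE".
    Returns the list of claimed coordinates, or None if any PE of the
    rectangle is outside the array or not free.
    """
    rows, cols = len(free), len(free[0])
    dr = 1 if "S" in direction else -1
    dc = 1 if "E" in direction else -1
    r0, c0 = origin
    region = [(r0 + i * dr, c0 + j * dc)
              for i in range(n_rows) for j in range(n_cols)]
    for r, c in region:
        if not (0 <= r < rows and 0 <= c < cols) or not free[r][c]:
            return None
    return region
```

For example, `invade_rect(free, (0, 0), "SE", 3, 5)` on an empty 4 x 6 array models the rectangular request from the text, and `invade_linear_random(free, (0, 0), 15, random.Random())` models the random-walk request for 15 PEs, which may return `None` when the walk traps itself.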
In order to enable a fast and efficient resource exploration, a decentralized hardware controller is proposed that can be attached to each of the processing elements of the system (see Fig. 4). We propose two different types of controller: FSM-based controllers that implement a certain exploration strategy, and programmable controllers that are flexible and can implement different strategies. The next section describes our different controller types.
IV. INVASION CONTROLLER
The task of resource exploration might be done by the PEs in the processor array, but in order to accelerate this process, a specialized controller may be designed to perform this functionality within the PEs. Such dedicated hardware is not only necessary for single-threaded PEs like in [13], but can also support multi-threaded processors by removing the burden of resource exploration from the processing units. In addition, invasion strategies are mainly control-flow-intensive algorithms; consequently, they may execute inefficiently if mapped to processing elements that are designed for signal processing applications.
In previous work [14] , a basic FSM-based invasion con arrays of PEs and was able to acquire one PE in every cycle during the invasion phase. We extended this work for general 2D architectures, and different approaches for gathering and transmitting the results of a resource exploration were studied [15] . But in both the mentioned works, only a simple linear invasion strategy was considered and the proposed architectures were not able to perform any complex explorations like rectangular explorations as depicted in Fig. 3 .
In the following, we propose new architectural designs of the invasion controllers, which allow us to implement more sophisticated invasion strategies. In this section we describe two different designs:
• Hard-wired FSM controllers that implement just a single invasion strategy;
• Programmable controllers that are flexible and can implement all aforementioned strategies.
Each of these approaches has its advantages and disadvantages.
A programmable invasion controller can be easily reprogrammed for a wide range of invasion strategies, so that it can be adapted for different application requirements. An FSM-based solution typically allows faster resource exploration, but is rigid and inflexible.
FSM-based Invasion Controller:
As mentioned in Section III-A, two different types of invasion strategies are studied in this work. The basic structure of each of the implementations consists of a state machine that corresponds to the state transition diagram depicted in Fig. 1. Here, after receiving an invade command from a neighbor PE, the state of the controller transits to the INVADE state, where it issues invade commands to one or several neighboring PEs. In case of linear invasion, it chooses one free neighbor according to the exploration policy defined in the instruction (see Section III-A), and requests it to continue invading available PEs. In case of rectangular exploration (see Fig. 3), the horizontal and vertical neighbors to be invaded are chosen according to the specification in the invasion command.
Programmable Invasion Controller:
Our architecture of the programmable invasion controller is a typical VLIW architecture. It can be partitioned into three 5 ). The underlying architecture is highly parameterizable at synthesis time. The whole interconnect structure of the device, e.g., the interconnect between the FU ports and the register file, is generated automatically. The high generality of the design allows us to quickly create and explore a wide range of different configurations with different performance/cost trade-offs.
The controller performs resource explorations by decoding invasion instructions, modifying them and sending them to the invasion controllers of the neighboring resources. These instructions are received and stored in the register file. The register file also provides fine-granular access to the sub-fields of the stored instructions, which allows us to decode instruction fields individually. The execution unit of the device consists of one or several FUs working in parallel. We can decide before synthesis how to configure the controller in order to achieve the needed functionality: with several specialized FUs, with one universal FU, or with a combination of them. Since the interconnect structure is generated automatically, we can easily compare different options without a time-consuming and error-prone redesign of the interconnect, and achieve the best trade-off between speed and resource overhead for a given set of applications. The control unit of the device takes care of the control flow of the exploration programs loaded into the processor. Implementing the invasion strategies shows that the corresponding programs have to make many decisions that depend on many parameters, so they are quite control-intensive. To deal with this, the control unit allows building and encoding of a wide range of logical functions out of the FU flags, evaluating them in hardware within a single cycle, and taking branches according to the evaluation results. The execution of each FU may also be predicated on the branch condition result. Every FU can be chosen to execute when the branch condition evaluates to true or to false, which makes it possible to encode an "if-then-else"-like construct in a single instruction and execute it within a single clock cycle.
Our intention is to attach an invasion controller to each PE of an MPSoC with tightly-coupled processors (see Fig. 4). Since the PEs of such processor arrays are typically domain-specific and optimized for low area consumption, they are often very small. A configuration of the invasion controller shall be significantly smaller in size than a typical PE. In order to reduce the size of the controller, all the parameters like number and size of registers, number of FUs and number of supported instructions shall be reduced to a minimum.
V. EXPERIMENTAL RESULTS
In the following, we describe and compare the speed and the overhead of different controller implementations. These controllers can be embedded into a class of processor architectures consisting of an array of tightly-coupled lightweight processor elements. In order to verify the functionality of the proposed invasion controllers, a simulation model of each of the controllers is integrated with a C++ simulation model of the underlying tightly-coupled processor array. In this paper, as a case study, we profiled applications from the field of robotics. These applications are highly dynamic due to a changing environment (e.g., object recognition, path planning). They can be grouped into 1D applications, working on a linearly-connected array of PEs, such as digital filters, and 2D applications, working on a 2D-mesh array of PEs, such as an edge detection algorithm or an optical flow algorithm [16]. Our linear resource invasion strategy can be used to reserve the required resources for 1D applications, and the rectangular invasion strategy fits well for 2D applications. In this section, the exploration methodologies are evaluated from two aspects: their ability to successfully find and reserve a set of required processing elements, and their timing overhead for performing such an exploration process.
A. Success ratio
In order to successfully meet the dynamic computational requirements of a given set of applications, it is important to develop exploration methods that can adapt at run-time to the dynamic requirements of the application. This is especially important when multiple applications are competing for their resources. In order to investigate the success of different invasion strategies, we use a simulation platform, which is configurable for different experimental scenarios. In each experiment, a portion of the array is randomly occupied by some other 1D or 2D applications as an initial setup. This situation may happen in the field of robotics, since a robot dispatches different applications at run-time depending on its environmental situation. The amount of occupied PEs $R_{occ}$ is randomly chosen. Subsequently, a new application is started and performs an invasion for a specific number of PEs. claim ratio value, are depicted in Fig. 6(a). In the case of linear invasion methods, the method based on the meander policy has a higher success ratio than the other methods, meaning that this method offers the greatest chance to claim the required resources. The random policy method performed poorly. As shown in Fig. 2(b), a randomized linear invasion is likely to run into inaccessible regions, and consequently this method has a very high probability of failing. As can be seen, the success ratio for all the methods decreases with increasing claim ratio. Fig. 6(b) shows the dependence of the success ratio of the mentioned methods on the $R_{occ}$ value. Also in this case the meander policy method dominates the other ones. As expected, the success ratio of every method diminishes with increasing array occupation ratio.
B. Hardware cost and timing overhead
As explained in Section IV, our invasion controllers are categorized in two groups: programmable controllers and FSM-based controllers. In order to compare the hardware cost of each of these controller types, we configured the programmable controller for the lowest affordable hardware cost by reducing the level of instruction parallelism and other parameters to a minimum. The results in Table I show that the size of the simplest version of the programmable controller is slightly greater than the size of the dedicated FSM-based implementations. This controller contains a single functional unit, which is able to perform usual arithmetic and logical instructions in a sequential way. The values in Table I and Table II are based on the synthesis results of the designs for an implementation on a Virtex6 FPGA architecture. In order to implement different exploration strategies, we have configured our programmable controller for different sizes of the instruction memory. In our experience, the controller with 64 lines of code would be sufficient for very simple invasion strategies, but for complex strategies, bigger instruction memories might be needed. Table II Here, we took the simplest design with a memory for 64 instructions and programmed it to implement the aforementioned strategies. As linear strategy, the meander version is implemented because of its superiority over the other suggested approaches concerning the success ratio. In addition, these strategies (linear meander strategy and rectangular strategy) are implemented by dedicated FSM-based controllers and on a LEON3 processor. The implementation on the LEON3 processor performs the same explorations as our distributed approaches, but in a centralized software-based manner. In this case, a data structure is used that holds the current status of each of the PEs in the array. The LEON3 processor manages resource exploration by searching in this data structure and finding the available resources.
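The centralized variant can be pictured as a scan over a status array. The following is a minimal sketch under stated assumptions, not the LEON3 code: the status structure is modeled as a 2D list of booleans (True = occupied), the rectangle always grows to the South and East of a given origin, and the function name `explore_rect_centralized` is hypothetical. It also counts the number of status lookups, since each lookup costs processor time in a centralized search.

```python
def explore_rect_centralized(status, origin, n_rows, n_cols):
    """Search and reserve an n_rows x n_cols rectangle in a central status array.

    `status[r][c]` is True if PE(r, c) is occupied. Returns a pair
    (region, checks): `region` is the list of reserved coordinates or None
    on failure, `checks` is the number of status entries that were read.
    """
    rows, cols = len(status), len(status[0])
    r0, c0 = origin
    checks = 0
    region = []
    for r in range(r0, r0 + n_rows):
        for c in range(c0, c0 + n_cols):
            if r >= rows or c >= cols:
                return None, checks  # rectangle does not fit into the array
            checks += 1              # one lookup in the status structure
            if status[r][c]:
                return None, checks  # PE already occupied
            region.append((r, c))
    # All entries were free: mark them as occupied (reservation).
    for r, c in region:
        status[r][c] = True
    return region, checks
```

In this model a successful search reads every entry of the requested rectangle once, so the number of lookups is the product of the two dimensions.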
In case of the linear invasion strategy, the total exploration time depends on the latency needed to invade a single PE and the number of claimed PEs. Table III shows the average latency for invading one PE for each of our three different implementations in terms of number of clock cycles. For the LEON3 implementation, these clock cycles denote the latency for checking each element in the array status data structure in case of linear invasion. The latency is obtained empirically by simulation of our resource management strategies on the LEON3 processor. In case of the linear invasion, the speedup gained by applying the distributed resource management does not depend on the requested number of PEs, since the exploration is performed in a step-by-step manner and is not parallelized among the PEs. If we define the exploration speedup as the timing overhead of the centralized exploration divided by that of the decentralized exploration, $S = t_{e,\text{centralized}} / t_{e,\text{decentralized}}$, the speedup for the FSM-based approach is $S_{FSM} = 45.5$ and for the programmable controller is $S_{prog} = 2.6$.
The exploration time of the rectangular invasion strategy on the centralized LEON3 processor implementation can be estimated as follows:
where $t_{e,\text{centralized}}$ denotes the latency needed to explore an $N \times M$ field of PEs. $L_c$ denotes the average latency for checking a single PE entry in the PE status data structure by the LEON3 core. In case of decentralized rectangular explorations, the vertical explorations are done in parallel; consequently, the total exploration time grows with the weighted sum of the dimensions of the requested domain. According to this, the exploration time of the decentralized approaches for an $N \times M$ rectangular region is represented by:
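The difference between the two scaling behaviors can be illustrated numerically. The following is a minimal sketch under explicit assumptions: the centralized time is modeled as one check per entry of the N x M field, i.e. N * M * L_c, and the decentralized time as a weighted sum w_row * N + w_col * M. The weights `w_row` and `w_col`, the function names and all numeric values are hypothetical and are not taken from the measurements.

```python
def t_centralized(n, m, l_c):
    """Assumed model: every entry of the n x m field is checked once."""
    return n * m * l_c


def t_decentralized(n, m, w_row, w_col):
    """Assumed model: weighted sum of the two dimensions (hypothetical weights)."""
    return w_row * n + w_col * m


def speedup(n, m, l_c, w_row, w_col):
    """Exploration speedup S = t_centralized / t_decentralized."""
    return t_centralized(n, m, l_c) / t_decentralized(n, m, w_row, w_col)


if __name__ == "__main__":
    # Illustrative values only: doubling both dimensions doubles the speedup.
    for size in (4, 8, 16):
        print(size, speedup(size, size, l_c=10, w_row=2, w_col=2))
```

Under these assumptions the centralized time grows with the area of the region while the decentralized time grows with its side lengths, so the ratio grows linearly when both dimensions are scaled together.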
VI. CONCLUSION
In this paper, different architectural designs for a distributed resource management methodology for massively parallel tightly-coupled processor arrays were presented. The designs are used to explore and claim resources dynamically in arrays show that a higher success ratio is obtained in the case of the meander method. The meander linear invasion and the rectangular invasion were implemented for programmable invasion controllers, FSM-based invasion controllers, and in a centralized way on a LEON3 processor. The hardware cost of the programmable controller was compared to that of the FSM-based ones. Although the programmable controllers offer more flexibility for implementing different exploration strategies than the FSM-based controllers, they suffer from higher exploration latencies. Our distributed methods achieved speedups of about 2.6 to 45 over the centralized implementation on the LEON3 for the linear exploration strategy. In case of the rectangular exploration strategy, the speedup grows proportionally with the linear dimensions of the explored area. We believe this work is an important step toward fast, scalable and reliable resource management in future many-core architectures with 1000 and more PEs, where centralized approaches no longer scale because of latency and for reasons of fault tolerance. This will be explored in future work.
VII. ACKNOWLEDGEMENT
This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research
