In this paper, a new methodology for tolerating link as well as node defects in self-adaptive reconfigurable networks is presented. Currently, networked embedded systems need a certain level of redundancy for each node and link in order to tolerate defects and failures in a network. Due to monetary constraints as well as space and power limitations, the replication of each node and link is not an option in most embedded systems. Therefore, we present a hardware/software partitioning algorithm for reconfigurable networks that optimizes the binding of tasks onto resources at runtime such that node and link defects can be handled and data traffic on the links between computational nodes is minimized. This paper presents the new hardware/software partitioning algorithm, an experimental evaluation, and, to demonstrate realizability, an implementation on a network of FPGA-based boards.
INTRODUCTION
Available networked embedded systems, e.g., in the field of automotive networks, bind functionality statically onto electronic control units (ECUs). Thus, if a node fails, the functionality hosted by the control unit will be lost, and due to data dependencies, other functions on working nodes may not operate either. The same holds true for erroneous communication links, which may isolate a node or change routes in point-to-point networks. Currently, tolerating defects of computational nodes requires replicating nodes and links such that a certain degree of redundancy is available. Obviously, introducing redundancy into embedded networks has essential drawbacks concerning monetary costs, power consumption, size, weight, etc. Therefore, we present an approach for tolerating permanent faults such as node or link defects by separating functionality from the physical hardware and rebinding tasks from defective nodes onto working nodes.
When using reconfigurable devices such as FPGAs together with internal CPU cores, it will be possible to assign tasks implemented in either hardware or software, dynamically to the resources in the network. Besides these reconfigurable devices, computational nodes in a network contain dedicated analog hardware for driving sensors and actuators which leads to a certain heterogeneity in the network. This irregularity is typical for embedded systems that consist of specialized nodes for certain purposes.
For such reconfigurable networks we will show how to reduce the degree of redundancy on the one hand while increasing fault tolerance and flexibility on the other hand. Essential for achieving these objectives is a novel class of algorithms called online hardware/software partitioning which determines a binding of tasks to available resources in heterogeneous networks.
Binding tasks onto computational nodes has been investigated in many research fields. The offline approach towards hardware/software partitioning has been considered by many researchers [14, 16, 10]. For example, Blickle [10] synthesizes so-called Pareto-optimal systems out of many design alternatives with the help of Evolutionary Algorithms. Such an approach helps a system designer to make unbiased decisions.
Also, so-called load balancing algorithms have received considerable interest. They solve the problem of task binding with the objective of homogeneously distributing the tasks' loads onto CPUs at runtime. The goals of load balancing are a) the reduction of latency or average response time, b) the provision of fairness, and c) the reduction of overheads due to many context switches on highly utilized nodes. Load balancing approaches like Token Distribution [18, 8], Diffusion [11, 12], and so-called Balancing Circuits [9] are distributed algorithms and are thus applicable to fault-tolerant networked systems.
Another field of placing functionality on resources has been opened by Vahid et al. Based on a platform consisting of reconfigurable hardware and a CPU [17] , a profiler extracts critical code regions, decompiles them and synthesizes them to hardware. Achieving an average speedup of 2.6 [19] for different benchmarks, this approach to dynamic hardware/software partitioning shows the potential of dynamically assigning tasks to software or hardware resources.
A first approach to online hardware/software partitioning for reconfigurable networks [6, 4], which is based on a combination of diffusion algorithms and bi-partitioning, balances the load between the resources and thus maximizes the amount of free resources on each single node. With this strategy, the likelihood that any node can take over the load of defective nodes or newly arriving tasks is increased. Unfortunately, all the presented approaches either do not consider hardware/software reconfigurability at all or provide no extension to reconfigurable networks. Also, heterogeneities due to sensors and actuators attached to single nodes in the network are not respected by these algorithms, although they strongly affect the placement of functionality onto nodes. Moreover, none of these approaches considers the minimization of data traffic on links between computational nodes.
Figure 1: Functionality is modeled with a so-called sensor-controller-actuator chain. This functionality will be bound with certain restrictions onto the nodes of the network topology.
To overcome these drawbacks, we explain in Sec. 2 a network model in which certain sensors and actuators can only be connected to certain nodes, so that tasks reading the sensor values or controlling the actuators have limited binding possibilities. In contrast to the approaches in [4], where the binding of tasks to resources is done with the objective of minimizing the load on each single resource, the methodology presented here tries to minimize the congestion on the communication links while respecting utilization constraints of hardware and software resources. The entire hardware/software partitioning approach runs distributed in the system and is described in Sec. 3. In Sec. 4, we present an implementation of the online hardware/software partitioning algorithm as well as an evaluation and a comparison of our methodology to an approach with global knowledge.
CONCEPTS AND MODELS
In this paper, we consider networks consisting of hardware/software reconfigurable nodes. The networks have a fixed topology which is only influenced by node and link defects. In contrast to ad-hoc networks, the size and the dynamic effects are not arbitrary. Assuming that all network nodes are connected via point-to-point connections and have more than one incoming and outgoing communication link, the considered networks should be fault-tolerant against permanent and transient faults as well as babbling idiot failures. Presuming a network with reconfigurable nodes allows for implementing a task either in software, where it runs locally and sequentially together with other software tasks, or in hardware (e.g., using reconfigurable hardware technology). Typical for embedded systems are dedicated IO interfaces which might not be available on each node and thus lead to a heterogeneous network structure.
As an example, Fig. 1 shows a network topology with four computational nodes $c_i \in C$, sensors $s_i \in S$, actuators $a_i \in A$, and communication links represented by the edges between the nodes $c_i$. The sensors and actuators are not connected to all nodes in the network, but only to some. Thus, the methodology presented in Sec. 3 has to be able to bind functionality onto a heterogeneous network structure. Similar to the network structure, the functionality is modeled by a so-called sensor-controller-actuator chain graph, which distinguishes between sensor tasks $t^s_i$, controller tasks $t^c_i$, and actuator tasks $t^a_i$. While sensor tasks produce data which are processed by one or more controller tasks, actuator tasks consume data entities. In Fig. 1, such a sensor-controller-actuator chain is represented by gray nodes and the edges in between, where the edges represent data dependencies. Annotated to these nodes and edges are the following attributes, which are necessary for the online hardware/software partitioning approach:
Execution Time ($C_i$) and Deadline ($D_i$): In order to analyze the schedulability of a task $t^c_i$ on a CPU without violating deadlines, the execution time $C_i$ and the deadline $D_i$ are required. With the help of schedulability analyses for real-time schedulers, the utilization can be computed. Based on this utilization, it can be determined whether a task can be executed in software.

Look-Up Tables ($LUT$), Memory Cells ($MC$): Before a task is assigned to hardware resources at runtime, it has to be checked whether enough resources are available. Considering current FPGA architectures, not only look-up tables are required to implement the hardware functionality, but also memory cells or embedded RAM blocks. Additional placement constraints together with the shape of a hardware module might also prevent the binding of tasks onto free hardware resources. In summary, the feasibility of placing functionality onto hardware resources depends in our model not on one parameter, but on a set of parameters. Note that due to this set of parameters, online hardware/software partitioning algorithms which are based on load balancing algorithms are not applicable. In general, load balancing algorithms are appropriate for architectures or topologies where a single parameter decides about executability.

Migration Size ($M$): This parameter is used to reduce the probability of migrating huge tasks between nodes. In FPGA-based architectures with a CPU, the migration size is given by the sum of the binary size and the bitstream size of the task.

Due to the heterogeneity caused by the sensors $s_i$ and actuators $a_i$ in the network topology, the binding of sensor tasks $t^s_i$ and actuator tasks $t^a_i$ is restricted. In particular, a sensor task $t^s_i$ is only allowed to be bound onto a corresponding sensor node $s_i \in S$. Likewise, an actuator task $t^a_i$ is only allowed to be bound onto an actuator node $a_i$. We assume that all controller tasks $t^c_i$ may run on each computational node $c_i$.
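The attributes above can be illustrated with a minimal Python sketch. The class and function names (`Task`, `Node`, `fits_in_software`, `fits_in_hardware`) are hypothetical and not taken from our implementation. The software test assumes a simple density check (the sum of $C_i / D_i$ must not exceed a bound), which is only one of many possible schedulability analyses, and the hardware test ignores placement and shape constraints.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    exec_time: float      # C_i
    deadline: float       # D_i
    luts: int             # LUT demand of the hardware variant
    mem_cells: int        # MC demand of the hardware variant
    migration_size: int   # M: binary size plus bitstream size


@dataclass
class Node:
    free_luts: int
    free_mem_cells: int
    cpu_utilization: float = 0.0  # sum of C/D of software tasks already bound


def fits_in_software(task: Task, node: Node, bound: float = 1.0) -> bool:
    # Assumption: density test, i.e. the summed C_i / D_i stays below `bound`.
    return node.cpu_utilization + task.exec_time / task.deadline <= bound


def fits_in_hardware(task: Task, node: Node) -> bool:
    # Several parameters decide together, so a single load value is not enough.
    return task.luts <= node.free_luts and task.mem_cells <= node.free_mem_cells
```

The hardware check makes the point of the text concrete: a node may have enough free look-up tables but too few memory cells, so a single scalar load cannot describe feasibility.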
Considering Fig. 1, sensor task $t^s_1$ may be bound onto the sensor node $s_1$, but not onto $s_2$. Analogously, sensor task t
Formal Model
The model described informally above can be formally defined as follows:
The nodes of the topology graph can be refined as:

Definition 3 (Computational Node). A computational node $c_j$ has ports $p_i \in P$ with $|P| = \deg(c_j) + 1$. While the ports $p_i$, $i = 1, \ldots, \deg(c_j)$, are dedicated to communication with sensor, computational, or actuator nodes, the port $p_0$ is dedicated to node-internal communication.
For modeling the functionality, we define so-called sensor-controller-actuator chains.
Definition 4 (Sensor-Controller-Actuator Chain).
represent the data dependencies between the tasks. Different parameters, which do not explicitly belong to the model, can be annotated to the edges and nodes. The parameters required by our online hardware/software partitioning approach are presented above.
In order to express, where a task t

In order to obtain a feasible communication, it might be necessary that an edge $e^{sca}_i = (t_k, t_l)$ is routed over several edges $e^{tg}_j$. In particular, the path constructed from the edges $e^{tg}_j$ must connect the resources onto which the tasks $t_k$ and $t_l$ are bound.
Definition 7 (Binding Restrictions). Sensor tasks $t^s_i \in T^s$ may only be bound onto sensor nodes $s_i \in S$. Controller tasks $t^c_i \in T^c$ may be bound onto all computational nodes $c_j \in C$. Actuator tasks $t^a_i \in T^a$ may only be bound onto actuator nodes $a_i \in A$. Note that additional binding restrictions may occur during the hardware/software partitioning process due to attributes annotated to edges or nodes in the graphs $G^{tg}$ and $G^{sca}$.
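The binding restrictions and the routing requirement can be sketched in a few lines of Python. The names `binding_allowed` and `find_route`, the string labels for task and node kinds, and the use of a breadth-first search are illustrative assumptions; the model itself only requires that some path of topology edges connects the two resources.

```python
from collections import deque


def binding_allowed(task_kind: str, node_kind: str) -> bool:
    # Definition 7: each task kind may only be bound onto one kind of node.
    allowed = {
        "sensor": "sensor",
        "controller": "computational",
        "actuator": "actuator",
    }
    return allowed[task_kind] == node_kind


def find_route(links, src, dst):
    """Return a list of nodes from src to dst over undirected links, or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no path, e.g. after link defects isolated a node
```

A data dependency between two tasks is feasible in this sketch if `find_route` returns a path between the nodes the two tasks are bound onto.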
Problem Statement
Online hardware/software partitioning aims at binding tasks to free hardware or software resources at runtime. Typically, hardware/software partitioning is executed during the design phase of an embedded system. But since dynamic effects like node or link defects as well as newly arriving tasks corrupt an optimal binding, it is inevitable to determine a new binding online. Our approach to online hardware/software partitioning consists of two main steps, of which the second step will be refined later on. In Fig. 2, these two steps are shown in an exemplifying scenario. The presented network topology consists of four computational nodes $c_1, \ldots, c_4$, a sensor $s_1$, and an actuator $a_4$. The controller tasks t
A requirement for this replica binding is that a task $t^c_i$ and its replica must not be bound onto the same computational node $c_j$. Next, the computational node $c_1$ fails in Fig. 2, and thus all tasks bound onto this node are lost. During the fast-repair phase, the replicas of the lost tasks become the main tasks $t^c_i$, and new routes for the task-to-task communication have to be established. Obviously, the tasks are suboptimally bound after the fast-repair phase; this binding will be improved during the optimization phase. The optimization phase tries to find a binding of tasks $t^c_i$ to resources such that the data traffic on the communication links is minimized and the constraints on CPU utilization and on the usage of hardware resources, respectively, are not violated. In order to tolerate another node defect, new replicas of the tasks $t^c_i$ need to be created and bound onto the computational nodes $c_j$.
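The replica requirement and the fast-repair step can be sketched as follows. This is a simplified illustration with hypothetical names (`replica_binding_valid`, `fast_repair`); bindings are plain dictionaries from task names to node names, and the re-establishment of routes as well as the creation of new replicas are left out.

```python
def replica_binding_valid(binding, replicas):
    # A task and its replica must never share a computational node.
    return all(binding[task] != replicas[task] for task in replicas)


def fast_repair(binding, replicas, failed_node):
    """Promote replicas of tasks lost on failed_node.

    Returns (new_binding, remaining_replicas, lost_tasks). Tasks whose
    replica was promoted, or whose replica sat on the failed node, have no
    entry in remaining_replicas and need a new replica later.
    """
    new_binding, remaining, lost = {}, {}, []
    for task, node in binding.items():
        replica_node = replicas.get(task)
        if node == failed_node:
            if replica_node is None or replica_node == failed_node:
                lost.append(task)  # no surviving copy of this task
            else:
                new_binding[task] = replica_node  # replica becomes main task
        else:
            new_binding[task] = node
            if replica_node is not None and replica_node != failed_node:
                remaining[task] = replica_node
    return new_binding, remaining, lost
```

After this step, every task without an entry in the returned replica dictionary has to receive a new replica before a further node defect can be tolerated.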
ONLINE HW/SW PARTITIONING
As presented in Fig. 3, the overall approach to hardware/software partitioning consists mainly of two phases. While the concepts and implementations of the first phase (fast repair) have been described in detail in [3], this paper concentrates on the second phase (optimization). This optimization phase is subject to several constraints:
• distributed computation: Due to fault-tolerance aspects, the binding of tasks has to be determined in a distributed manner at the computational nodes. 
• data traffic: Communicating tasks produce a certain amount of data that has to be transferred over links between the nodes. Therefore, a requirement to the algorithm is the optimization of traffic in the network.
• local knowledge: Gathering of data to obtain global knowledge about the network is time consuming and produces communication overhead. Thus, it is desired to optimize the binding with limited information.

The next section shows how our approach fulfils these constraints by determining improvement values for each task and migrating tasks according to these values.
Task Binding
The proposed methodology for determining an optimal binding is based on three improvement values: a) a communication improvement that tries to cluster functionality with data dependencies, b) a migration improvement which reduces the overhead caused by task migrations, and c) a partitioning improvement that tries to implement a task according to its favorite implementation style.
communication improvement: The communication improvement $I^{com}_{i,j}$ is defined as the improvement for task $t^c_i$ if it is migrated from node $c_m$ over port $p_j$ to a neighboring computational node ($j \neq 0$):
The outer sum adds deg(ci) + 1 terms because not only the traffic over the node's ports but also the node internal traffic needs to be considered.
Considering Fig. 4 as an exemplifying binding where the communication improvement I

partitioning improvement: The partitioning improvement $I^{par}_{i,j}$ is required for optimizing the implementation style (hardware/software) of a task $t^c_i$. For certain applications, e.g., video stream processing, it might be desirable to implement a task in hardware, while a state machine might be executed more efficiently in software. Assuming that each task $t^c_i$ has a favorite implementation style, a likelihood value $l_i \in \mathbb{R}$ with $0 \leq l_i \leq 1$ is defined at design time. The decision whether a task is better implemented in hardware or software can be taken based on resource utilization or on a quality-of-service measure. The resulting improvement $I^{par}_{i,j}$ will be defined as:
$$
I^{par}_{i,j} =
\begin{cases}
1 & \text{if } t^c_i \text{ was implemented in its non-favorite style and can be implemented in its favorite style after migration over } p_j \\
-1 & \text{if } t^c_i \text{ was implemented in its favorite style and can only be implemented in its non-favorite style after migration over } p_j \\
0 & \text{else}
\end{cases}
$$

The resulting improvement $I_{i,j}$ for migrating a task $t^c_i$ over port $p_j$ to a neighboring computational node is:
As shown in Fig. 5, this improvement will be determined for all migratable tasks $t^c_i \in T_m \subseteq T$ and all ports $p_j$ of node $c_m$. After calculating the improvement values for the migratable tasks, negative improvement values might be in the list and can impair the current binding. Therefore, two possibilities exist: a) to remove all negative improvement values or b) to allow for negative improvement values depending on the migration count $mc_i$ of task $t^c_i$. In the next step, the algorithm selects the task $t^c_i$ with the highest improvement value $I_{i,j}$ and asks the neighboring computational node at port $p_j$ whether the task can be scheduled on the CPU or bound onto the reconfigurable hardware device, respectively (see Fig. 5). If enough resources are available for scheduling or placing the task, the task will be migrated and all its improvement values $I_{i,j}$, $\forall p_j \in P$, will be deleted. Otherwise, only the improvement value $I_{i,j}$ for the considered port $p_j$ and task $t^c_i$ is deleted. These two steps of selecting the task with the highest improvement value and trying to migrate it are repeated locally until no improvement value $I_{i,j}$ remains. Note that the set of migratable tasks $T_m$ contains only tasks with a migration counter not exceeding a certain limit: $mc_i \leq mc_{limit}$. The counter $mc_i$ is incremented after each migration of task $t^c_i$ and reset after a node or link defect. This constraint prevents alternating behavior and thus guarantees that the algorithm terminates. All in all, our methodology runs asynchronously in the network, i.e., there are no periodic migration rounds, but a node tries to improve the binding whenever its number of tasks has increased.
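The local selection loop described above can be sketched in Python. The names (`partitioning_improvement`, `migrate_greedily`, `neighbor_accepts`) are hypothetical, the improvement values are assumed to be given as a dictionary keyed by (task, port), and the sketch implements variant a), i.e., negative improvement values are removed. The actual migration and the resource query to the neighbor are abstracted into a callback.

```python
def partitioning_improvement(in_favorite_now: bool, favorite_after: bool) -> int:
    # +1: gains its favorite style, -1: loses its favorite style, 0: no change.
    if not in_favorite_now and favorite_after:
        return 1
    if in_favorite_now and not favorite_after:
        return -1
    return 0


def migrate_greedily(improvements, migration_count, mc_limit, neighbor_accepts):
    """Repeatedly pick the best (task, port) pair and try to migrate it.

    improvements:     {(task, port): improvement value I_ij}
    migration_count:  {task: mc_i}, updated in place after each migration
    neighbor_accepts: callback (task, port) -> bool, stands for the resource
                      query to the neighboring node (an assumption here)
    """
    candidates = {
        key: value
        for key, value in improvements.items()
        if value >= 0 and migration_count.get(key[0], 0) <= mc_limit
    }
    migrations = []
    while candidates:
        task, port = max(candidates, key=candidates.get)
        if neighbor_accepts(task, port):
            migrations.append((task, port))
            migration_count[task] = migration_count.get(task, 0) + 1
            # The task has left the node: drop its values for all ports.
            candidates = {k: v for k, v in candidates.items() if k[0] != task}
        else:
            # Only this port is ruled out for the task.
            del candidates[(task, port)]
    return migrations
```

Each iteration removes at least one entry from the candidate list, so the local loop always terminates; the migration counter additionally bounds how often a task can move through the network.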
Discussion
Our approach to online hardware/software partitioning is derived from two strategies. The first one is the partitioning algorithm by Kernighan and Lin [13]. This algorithm deals with the combinatorial problem of partitioning the nodes of a graph with costs on its edges into subsets no larger than a given maximum size, so as to minimize the total cost of the edges cut. The essence of their method is the following: starting with an arbitrary partition, the algorithm tries to decrease the initial external cost by a series of simultaneous interchanges of the subsets' nodes. Similarly, our approach tries to minimize the overall cost by calculating the improvement obtained if a task is migrated to a neighboring node. Unlike the Kernighan-Lin algorithm, we cannot interchange tasks simultaneously, because the migrated task has to be started at the new computational node before the resources at the old computational node are released. Additionally, our algorithm considers the migration of tasks only to neighbors and not to all computational nodes or partitions, respectively. Therefore, we allow for migrating tasks even if the migration makes the task binding worse. This second strategy is similar to Simulated Annealing [15] approaches: in order to escape from local minima, worse solutions are accepted with a certain probability, and this probability decreases during the optimization.
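How a worsening migration could be accepted with decreasing probability is sketched below. This is only an illustration under stated assumptions: the function name `accept_migration`, the exponential acceptance rule borrowed from Simulated Annealing, and the linear "temperature" derived from the migration counter are not taken from our implementation, which only requires that the acceptance depends on the migration count $mc_i$.

```python
import math
import random


def accept_migration(improvement, migration_count, mc_limit, rng=random):
    """Decide whether a migration with the given improvement is performed.

    Assumes mc_limit > 0. `rng` only needs a random() method returning a
    float in [0, 1), which makes the decision reproducible in tests.
    """
    if improvement >= 0:
        return True  # never reject a migration that does not worsen the binding
    # Assumed cooling schedule: the more often a task moved, the colder it gets.
    temperature = 1.0 - migration_count / mc_limit
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp(improvement / temperature)
```

With this rule, a task that has not migrated yet may take a slightly worse position to escape a local minimum, while a task close to its migration limit only accepts real improvements.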
RESULTS
We implemented our approach to distributed online hardware/software partitioning on a network of four reconfigurable FPGA-based boards incorporating a RISC CPU and additional logic for implementing hardware. The operating system microC-OS II [2] has been extended such that node and link defects are automatically detected. Additionally, a task manager has been designed and implemented which gathers information about the task binding and locally decides where to bind the tasks. This decision is taken by our approach to online hardware/software partitioning.
On top of this network infrastructure, a driver assistance application has been implemented. With the help of pattern recognition algorithms, the application tracks the lane, and in case of an unintended lane change, the assistant sets off an acoustic warning. The entire driver assistant runs on the network in a distributed manner. Thus, if one node fails, the tasks have to be dynamically reassigned to free resources in the network. Interestingly, this small example shows how redundancy can be minimized with the help of reconfigurable networks. Considering, e.g., the acoustic warning, we require memory for storing the audio samples, a controller task for starting and stopping the warning, as well as a D/A converter. We implemented the D/A converter as a pulse-width modulator whose output is filtered with a simple low-pass consisting of one capacitor and one resistor. Due to our online hardware/software partitioning approach, it is possible to migrate the pulse-width modulator, the audio samples, as well as the controller task to another node at runtime. In this particular example, only the analog low-pass filter has to be implemented redundantly. For further information on this system, please refer to [1, 3].
For a detailed evaluation of our approach to online hardware/software partitioning, we additionally implemented a behavioural model of the previously described network. This model has been supplied with nine different scenarios, where each scenario consists of a sensor-controller-actuator chain and a network topology. Three scenarios were created with 40 tasks and 10 computational nodes, three scenarios with 80 tasks and 20 nodes, and three scenarios with 200 tasks and 50 nodes. For each scenario, 10 arbitrary initial bindings of tasks onto computational nodes were determined, such that in total 90 test cases were examined. Starting from such an initial binding, the algorithm tries to improve the binding by migrating functionality between the hardware and software resources in the network. After each migration step, we determine the overall traffic T in the network and the fraction of tasks which are executed in their non-favourite implementation style N :
is implemented in its non-favourite style 0 : else

We compared the solutions $s_i = (T, N)$, $s_i \in S$, of each optimization run with a hardware/software partitioning algorithm based on Evolutionary Algorithms (EA) [7] that incorporates global knowledge. Note that our algorithm tries to optimize the binding with local knowledge only. The EA-based approach determines a reference set $R_{EA}$ of so-called Pareto-optimal solutions $r_{EA} = (T, N)$, $r_{EA} \in R_{EA}$. The minimal normalized distance d(s) between

For each locally determined solution $s \in S$, we computed the distance $d(s)$ to the reference set $R_{EA}$ of Pareto-optimal solutions. The distance between the Pareto front determined by the EA-based approach and the solution $s$ after each task migration is shown in Fig. 6. Each plot in Fig. 6 represents one test case with either 40 tasks/10 nodes, 80 tasks/20 nodes, or 200 tasks/50 nodes. Due to the migration counter, the smaller test cases terminate earlier than the bigger test cases, but it can be clearly seen that our methodology improves the initial partitioning and approaches a global optimum. In Fig. 7, the two objectives (traffic $T$ and percentage of suboptimally implemented tasks $N$) after each task migration are shown. For these plots, we normalized the traffic by dividing by the maximal traffic of each optimization run. Interestingly, our algorithm is able to reduce the traffic $T$ by at least 20%. Additionally, the number of suboptimally implemented tasks $N$, which was about 50% at the beginning, has been reduced to 25% on average.
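The evaluation measures can be sketched as follows. The function names are hypothetical, and the distance is an assumption: the sketch uses the smallest Euclidean distance to the reference set after scaling each objective by its maximum over the compared points, which is one common way to obtain a normalized distance and not necessarily the exact definition used for Fig. 6.

```python
import math


def non_favorite_fraction(in_favorite_style):
    """N: fraction of tasks that are not implemented in their favourite style."""
    flags = list(in_favorite_style)
    return sum(1 for ok in flags if not ok) / len(flags)


def distance_to_reference(solution, reference_set):
    """Assumed d(s): minimal normalized distance of (T, N) to the reference set."""
    points = list(reference_set) + [solution]
    # Scale each objective by its maximum; fall back to 1.0 if that is zero.
    scale = [max(p[k] for p in points) or 1.0 for k in range(2)]

    def dist(ref):
        return math.sqrt(
            sum(((solution[k] - ref[k]) / scale[k]) ** 2 for k in range(2))
        )

    return min(dist(ref) for ref in reference_set)
```

A distance of zero means that the locally determined solution coincides with one of the Pareto-optimal reference solutions.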
CONCLUSIONS AND FUTURE WORK
Online hardware/software partitioning aims at binding functionality onto free resources at runtime. While other approaches solve this partitioning problem offline or just assign software tasks dynamically to network nodes, our approach solves the hardware/software partitioning problem at runtime. Moreover, it runs in a distributed manner, requires only local knowledge, and respects various resource limitations on the nodes. While assigning functionality to nodes, our algorithm reduces the congestion in the network. All in all, we presented an online hardware/software partitioning approach for FPGA-based or general reconfigurable networks.
