Abstract
Introduction
The development of dynamically reconfigurable architectures/systems offers new abilities for a flexible and fast execution of computations which change their requirements to the architecture during runtime. Such systems can be reconfigured during runtime to change their internal communication structure and/or their functional units. A problem that arises with increasing abilities for reconfiguration is the growing amount of information that is needed for a reconfiguration operation to define the new state. In addition, the high integration of reconfigurable hardware, e.g. reconfigurable circuits on an FPGA chip, requires large bandwidths for the transfer of reconfiguration information when reconfiguration operations are done at runtime speed. The amount of reconfiguration data that is needed for modern FPGAs is, for example, several megabytes for a single reconfiguration step. This large amount of necessary information transfer becomes especially time critical for computations that exploit the full capacity of dynamically reconfigurable architectures by frequent reconfigurations.
* This work was financed by the German Research Foundation (DFG) through the project "Models and Algorithms for Hyperreconfigurable Architectures" within the priority programme 1148 "Reconfigurable Computing Systems".
Some approaches have been proposed in recent years to cope with this problem. Off-line compression methods have been applied to the reconfiguration bit stream before it is loaded onto the system by Dandalis and Prasanna [3]. Additional hardware on the chip then decompresses the reconfiguration bit stream during runtime before it is needed to define the next configuration. A compression method that is especially suitable for the reconfiguration bit stream of the Xilinx XC6200 architecture has been described by Hauck et al. [6]. Another approach is to compute the bits which are necessary for reconfiguration directly on chip (see Köster and Teich [8], Sidhu et al. [13], Wadhwa and Dandalis [16]). For multi FPGA systems it has been proposed not to reconfigure all FPGAs at the same time, but to perform the reconfiguration incrementally (Lee and Wong [10]). All these approaches have in common that they do not change the reconfiguration information itself.
A different approach with the aim to reduce the reconfiguration information has been proposed recently by the authors (see [12, 9]). The idea of this approach is to make the potential for reconfiguration itself reconfigurable. This means that the reconfiguration potential of an architecture can be decreased during such periods of a computation where it requires only minor reconfiguration features. During such periods only the new states of the (few) actually available reconfiguration features have to be defined during reconfiguration and therefore the amount of necessary reconfiguration information is small. Two types of reconfiguration steps are used on such architectures: i) reconfiguration steps where the reconfiguration potential of the architecture is defined (hyperreconfiguration steps), and ii) ordinary reconfiguration steps which are used to reconfigure the actual context which is used by the computation (simply called reconfiguration steps). Reconfigurable architectures that use these two types of reconfiguration steps are called hyperreconfigurable architectures.
Hyperreconfigurable architectures are a promising concept for fast dynamic reconfiguration, especially for applications where computations typically consist of different phases that use only small parts of the whole reconfiguration potential of the architecture. The smaller the potential of the architecture that is actually available, as defined through a hyperreconfiguration, the fewer reconfiguration bits are needed for the (ordinary) reconfiguration steps and the faster such a reconfiguration step is.
In this paper we propose several models for partially hyperreconfigurable architectures where several tasks can run in parallel. For such multi task hyperreconfigurable architectures we study the problem of finding optimal (hyper)reconfigurations. While this problem is known to be NP-complete even for a single task under a general cost model, we identify an interesting special case that can be solved in polynomial time even for multiple tasks. We give an example of a multi task hyperreconfigurable architecture and show some experimental results for the (hyper)reconfiguration problem on this architecture. Due to space limitations some topics can only be sketched and will be discussed in more detail in the full paper.
Hyperreconfigurable Architectures
In this section we describe hyperreconfigurable machines as they have been introduced in [12, 9]. A reconfigurable system is called hyperreconfigurable when its ability for reconfiguration is reconfigurable itself ([12]). Hyperreconfigurable architectures have two types of reconfiguration steps. Ordinary reconfiguration steps change the context of an algorithm during runtime, e.g. the context can be the communication structure or the functional units that are used. Hyperreconfiguration steps define the potential for reconfiguration that is available to the following ordinary reconfiguration steps. Thus, hyperreconfiguration steps define the set of contexts that are available to the following (ordinary) reconfiguration steps. Such a set of available contexts is called a hypercontext. An (ordinary) reconfiguration always takes place within the actual hypercontext. A central aspect of hyperreconfigurable architectures is that the costs (i.e. the time, or the amount of bits necessary to be loaded onto the architecture) for a reconfiguration step depend on the actual hypercontext.
An algorithm/computation is characterized by a sequence of context requirements that specify which reconfigurable features are needed during runtime for every reconfiguration step on a hyperreconfigurable machine. The context requirements are minimal requirements that have to be satisfied in order to guarantee a successful computation. Examples are the number of switches that are available for reconfiguration in order to satisfy the routing demands or the number of functional units that are available for reconfiguration to satisfy the computation demands. Note that the actual demand of a computation during runtime might depend on the data and cannot be determined exactly in advance. In that case the context requirements are worst case upper bounds. When the meaning is clear we sometimes call the context requirements of an algorithm/computation simply its contexts. The reason is that each context requirement corresponds to exactly one new context which will be realized during runtime by a reconfiguration operation. Reconfiguration into a new context can only be realized at runtime when the machine is in a state (more exactly in a hypercontext) that provides the necessary reconfigurable features, i.e., it satisfies the corresponding context requirement.
where $|S_i|$ is the length of $S_i$, i.e., the number of context requirements in $S_i$.
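The terminology above can be illustrated with a minimal Python sketch. It assumes that a hypercontext is represented simply by the set of context requirements it satisfies and that a computation is split into phases, each consisting of one hypercontext and the sequence of context requirements executed under it. The names `Hypercontext`, `satisfies`, and `is_feasible` are hypothetical and are not taken from [12, 9].

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Hypercontext:
    """A hypercontext h, represented by the set h(C) of requirements it satisfies."""
    name: str
    satisfied: FrozenSet[str]

    def satisfies(self, requirement: str) -> bool:
        return requirement in self.satisfied


def is_feasible(phases: Sequence[Tuple[Hypercontext, List[str]]]) -> bool:
    """True if every requirement of every phase S_i is satisfied by its hypercontext h_i."""
    return all(h.satisfies(c) for h, s_i in phases for c in s_i)


# Hypothetical example data: a small and a full hypercontext.
small = Hypercontext("small", frozenset({"c1", "c2"}))
full = Hypercontext("full", frozenset({"c1", "c2", "c3"}))

ok_schedule = [(small, ["c1", "c2", "c1"]), (full, ["c3", "c1"])]
bad_schedule = [(small, ["c1", "c3"])]  # c3 is not available under "small"

print(is_feasible(ok_schedule))   # True
print(is_feasible(bad_schedule))  # False
```

The second schedule is rejected because a reconfiguration into a context for `c3` would require a hypercontext that provides the corresponding features.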
The next model is intended for coarse grained reconfigurable machines where the number of possible hypercontexts is not too large. It is assumed that the different hypercontexts can be ordered with respect to their computational power by a precedence relation. The precedence relation is given as a directed acyclic graph (DAG). It is assumed that there always exists a hypercontext h which satisfies every possible context requirement, i.e. h(C) = C.
DAG model: Given a DAG $G = (V, E)$ with $V = H$ for a set $H$ of hypercontexts and for each $h \in H$ a set $h(C)$ such that for each edge $(h_1, h_2) \in E$ the relation $h_1(C) \subset h_2(C)$ holds. In addition there are costs $\mathrm{cost}(h) > 0$ and $\mathrm{init}(h) = w$ defined for each $h \in H$ and a constant $k \ge 0$ such that for each edge
The total reconfiguration cost of a computation is given by
For each context requirement $c \in C$ let $c(H)$ be the set of minimal (with respect to the precedence relation defined by $E$) hypercontexts $h$ in the DAG which satisfy $c \in h(C)$.
The next model can also be used for fine grained reconfigurable machines. It is assumed that there exists a set of small (similar) reconfigurable units and that every subset of these units can be used to define the reconfigurable machine that is available during a hypercontext. For example, each reconfigurable unit might be a switch and the set of switches defines the available part of the reconfigurable machine. The larger the routing requirements of an algorithm for a certain context, the more switches should be made available by the corresponding hypercontext at runtime. During a reconfiguration operation the state of each available switch has to be defined. Hence, the costs of reconfiguration are simply the number of available reconfigurable units.
Switch model: Given a set of reconfigurable units or switches $X = \{x_1, \ldots, x_n\}$, the set of context requirements $C$ and the set of hypercontexts $H$ with $C = H = 2^X$, i.e., $C$ and $H$ are the set of all subsets of $X$, and a se-
where |h| is the size of h, i.e. the number of switches available in h and init(h) = w > 0 for each h ∈ H. The total reconfiguration time of a computation is defined as
Note that in the PHC-DAG model the number of hypercontexts, i.e., the number of nodes of the graph, is part of the size $n + m$ of the problem instance. For the PHC-Switch model there exist $2^n$ hypercontexts, but this number is not part of the size of the problem instance, which still is $n + m$.
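A minimal sketch of how costs could be evaluated in the switch model follows. It rests on an assumption that should be checked against [9]: a computation is split into phases $S_1, \ldots, S_r$, phase $S_i$ runs under a hypercontext $h_i$, each hyperreconfiguration costs $w$, and each of the $|S_i|$ reconfigurations in the phase costs $|h_i|$, so that the total is $\sum_i (w + |h_i| \cdot |S_i|)$. The function names are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Switches = FrozenSet[str]


def minimal_hypercontext(phase: Sequence[Switches]) -> Switches:
    """Smallest set of switches that satisfies all requirements of a phase: their union."""
    result: set = set()
    for c in phase:
        result |= c
    return frozenset(result)


def switch_model_cost(phases: List[List[Switches]], w: int) -> int:
    """Assumed total cost: sum over all phases of w + |h_i| * |S_i|."""
    total = 0
    for s_i in phases:
        h_i = minimal_hypercontext(s_i)
        total += w + len(h_i) * len(s_i)
    return total


# Hypothetical requirement sequence over the switches a, b, c, d.
reqs = [frozenset("a"), frozenset("ab"), frozenset("a"), frozenset("cd"), frozenset("c")]
one_phase = [reqs]
two_phases = [reqs[:3], reqs[3:]]

print(switch_model_cost(one_phase, w=4))   # 4 + 4*5 = 24
print(switch_model_cost(two_phases, w=4))  # (4 + 2*3) + (4 + 2*2) = 18
```

In this example one additional hyperreconfiguration pays off, because both phases can then run with only two available switches instead of four.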
Multi Task Hyperreconfigurable Machines
In this section we introduce models for multi task hyperreconfigurable machines. A new feature of these machines is their ability to perform so called partial hyperreconfiguration operations which have an effect only for a subset of the tasks. As an example consider a machine where a hypercontext is characterized by different types of properties: i) qualitative properties (e.g., good, medium, or low routability), ii) quantitative properties (e.g., number of switches in the switch boxes), and iii) limited resources (e.g., number of I/O pins). Routability might be characterized by local wires and corresponding switches that are used only within a single task and by global wires and corresponding switches that can be used by all tasks. So, there can be different types of reconfigurable resources that are available on a hyperreconfigurable machine and they are used differently by the tasks. In the following we identify several aspects that are important for the design of multi task hyperreconfigurable architectures. For a multi task environment where several tasks run in parallel three types of hyperreconfigurable resources can be distinguished:
• Private global resources: Reconfigurable resources that have to be shared between the tasks, i.e. each task uses some amount of the resource (which has to be assigned to it) and where the total amount that is available and the assignment to the tasks is defined by the hypercontext. For example I/O units might be a resource that has to be shared between the tasks. Thus the number of I/O units available in the hypercontext has to be at least as large as the maximum total number of I/O units that is used by all tasks during any context under this hypercontext.
• Public global resources: Reconfigurable resources that can be used by all tasks at the same time and according to the same quality as defined by the actual hypercontext. As an example assume that the type of switches that is actually available on an FPGA is defined by the hypercontext. Then, when tasks run in disjoint areas of the FPGA they all can do their internal routing independently from the other tasks but all have the same type of switches available within their own area.
• Local resources: Reconfigurable resources for which the quality or amount that is available for a task can be defined during hyperreconfiguration independently from the quality or amount that is used by the other tasks. An example is an FPGA where tasks run in disjoint areas and where the type of switches available in the area of a task can be defined independent from the type of switches available in areas of other tasks.
Note that private global and local resources differ in whether they can change their ownership. We assume here that local resources are permanently assigned to a task at initialization. In contrast, the private global resources are assigned to the tasks by global hyperreconfigurations. To define partially hyperreconfigurable machines we distinguish between two types of hyperreconfigurations:
• Global hyperreconfigurations determine all available (private and public) global hyperreconfigurable resources.
• Local hyperreconfigurations determine all available local hyperreconfigurable resources and also which of the private global resources that are assigned to a task (by a global hyperreconfiguration) are available.
Note that private global resources are involved in both the global and the local hyperreconfigurations. The global hyperreconfiguration defines the maximal availability (for the local hyperreconfigurations) and the assignment to the tasks. For example, it can be defined that altogether 12 I/O units are available, of which 5 are assigned to task 1. The local hyperreconfiguration can determine to which extent the assigned private global resources are available for reconfiguration. In the example, a local hyperreconfiguration can define that only 3 (of the at most 5) I/O units for task 1 are actually available and can be reconfigured.
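The I/O unit example can be written down as a small Python sketch. The class `GlobalHypercontext` and the function `local_hyperreconfigure` are hypothetical names chosen for illustration; the sketch only models the two limits described above (the global total and the per-task assignment), not any real device interface.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GlobalHypercontext:
    """Global hypercontext for one private global resource type (here: I/O units)."""
    total_io_units: int
    assigned: Dict[str, int] = field(default_factory=dict)

    def assign(self, task: str, units: int) -> None:
        # Units already assigned to the other tasks.
        used_by_others = sum(self.assigned.values()) - self.assigned.get(task, 0)
        if used_by_others + units > self.total_io_units:
            raise ValueError("assignment exceeds the globally available I/O units")
        self.assigned[task] = units


def local_hyperreconfigure(g: GlobalHypercontext, task: str, units: int) -> int:
    """Make `units` of the task's assigned I/O units available for reconfiguration."""
    limit = g.assigned.get(task, 0)
    if units > limit:
        raise ValueError(f"{task} may use at most {limit} I/O units")
    return units


g = GlobalHypercontext(total_io_units=12)
g.assign("T1", 5)                               # global hyperreconfiguration
available = local_hyperreconfigure(g, "T1", 3)  # local hyperreconfiguration
print(available)  # 3
```

Subsequent reconfigurations of task 1 then only need to define the state of 3 I/O units rather than 5 or 12.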
Depending on the extent to which a hyperreconfigurable machine allows partial (hyper)reconfigurations we distinguish three types of partially hyperreconfigurable machines:
• A hyperreconfigurable machine is called partially reconfigurable when a subset of the tasks can perform a reconfiguration without interruption of the computation of the other tasks but hyperreconfigurations can only be done for all tasks at a time.
• A hyperreconfigurable machine is called partially hyperreconfigurable when a subset of the tasks can perform local hyperreconfigurations and reconfigurations without interrupting the computations of the other tasks.
• A hyperreconfigurable machine is called restricted partially hyperreconfigurable when a subset of the tasks can perform local hyperreconfigurations without interruption of the computations of the other tasks but reconfigurations can only be done for all tasks at a time.
For partially hyperreconfigurable machines two types of hypercontexts can be distinguished:
• Global hypercontexts define the available global hyperreconfigurable resources and the assignment of private global resources to the tasks.
• Local hypercontexts define the available local hyperreconfigurable resources for different tasks. The extended local hypercontexts define the local and the private global resources (within the limits of what has been assigned to a task) that are available for the different tasks.
It is assumed in this paper that during a global hyperreconfiguration step no task can perform computations and that the old extended local hypercontext and the context are no longer valid afterwards. The new extended local hypercontext and a new context have to be defined after a global hyperreconfiguration. Thus a global hyperreconfiguration has a synchronizing effect. For reconfigurations and partial hyperreconfigurations different modes of synchronization are possible. We distinguish the following modes of synchronization between the tasks for (partially) hyperreconfigurable machines:
• Hypercontext synchronized: Partial hyperreconfigurations are synchronized between all tasks so that no task can perform computations during a partial hyperreconfiguration no matter whether the task actually executes a partial hyperreconfiguration or is idle.
• Context synchronized: Reconfigurations are synchronized between all tasks so that no task can perform computations during reconfigurations no matter whether the task actually executes a reconfiguration or is idle.
• Fully synchronized: The machine is hypercontext synchronized and context synchronized.
• Non-synchronized: The machine is neither hypercontext synchronized nor context synchronized.
In this paper we assume that synchronization always means explicit synchronization in the form of a barrier synchronization. Hence a program for a task contains statements for (partial) hyperreconfigurations and for reconfigurations, and the tasks wait until all tasks arrive at their corresponding next (hyper)reconfiguration statement. In a synchronized machine a no-hyperreconfiguration (or no-reconfiguration) statement is used by those tasks that do not perform a corresponding (hyper)reconfiguration.
Note that public global resources exist in the partially (hyper)reconfigurable models only when they are context synchronized or fully synchronized. The reason is that a reconfiguration of the public global resources (potentially) influences all tasks and therefore has to be executed synchronously.
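The barrier synchronization with no-hyperreconfiguration statements can be mimicked in software. The following Python sketch is only an analogy using threads and `threading.Barrier`; the statement names `"hyper"` and `"no-hyper"` and the function `run_synchronized` are hypothetical. Every task executes one statement per step and no task proceeds before all tasks have finished the step.

```python
import threading
from typing import Dict, List


def run_synchronized(programs: Dict[str, List[str]]) -> List[List[str]]:
    """Run one thread per task; all tasks meet at a barrier before and after every step.

    Returns, for every step, the sorted names of the tasks that hyperreconfigured.
    """
    n_steps = len(next(iter(programs.values())))
    assert all(len(p) == n_steps for p in programs.values())
    barrier = threading.Barrier(len(programs))
    log: List[List[str]] = [[] for _ in range(n_steps)]
    lock = threading.Lock()

    def task(name: str, program: List[str]) -> None:
        for step, statement in enumerate(program):
            barrier.wait()  # wait until all tasks reach their next statement
            if statement == "hyper":
                with lock:
                    log[step].append(name)
            # "no-hyper": the task idles during this step
            barrier.wait()  # nobody resumes computing before the step is finished

    threads = [threading.Thread(target=task, args=item) for item in programs.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [sorted(names) for names in log]


programs = {
    "T1": ["hyper", "no-hyper", "hyper"],
    "T2": ["hyper", "hyper", "no-hyper"],
    "T3": ["no-hyper", "no-hyper", "no-hyper"],
}
result = run_synchronized(programs)
print(result)  # [['T1', 'T2'], ['T2'], ['T1']]
```

Task `T3` never hyperreconfigures, yet it still takes part in every barrier, which is exactly the role of the no-hyperreconfiguration statement.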
Cost models
In this subsection we extend the three models for measuring the runtime of an algorithm/computation as described in Section 2 to partially hyperreconfigurable multi task machines. Since the reconfiguration times of some tasks can overlap with the computation times of other tasks it is not enough to consider only the reconfiguration times on multi task machines.
Another important aspect that is new in the multi task environment for the definition of a suitable cost function is whether the reconfiguration bits can be loaded onto the machine in parallel for all tasks or not. We distinguish whether a reconfiguration operation is done
• task parallel, i.e., the reconfiguration bits for the tasks (i.e. for the private global and local resources) and the reconfiguration bits for the public global resources can be uploaded in parallel onto the machine, or
• task sequentially, i.e., all reconfiguration bits for the tasks (and for the public global resources) have to be uploaded sequentially onto the machine.
Analogous definitions can be made for partial hyperreconfiguration operations and for the reconfiguration bits that correspond to private global resources of global hyperreconfiguration operations. For the cost models that are described in the following we assume that non-synchronized operations are always executed task parallel. For synchronized operations either way is possible and we describe cost functions for both cases. Due to the limited space and for ease of description we assume in most cases that the costs for global hyperreconfiguration operations are constant.
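The difference between the two upload modes can be made concrete with a short Python sketch. It assumes, purely for illustration, that the upload time is proportional to the number of bits (one time unit per bit); the function name `upload_time` is hypothetical.

```python
from typing import Sequence


def upload_time(task_bits: Sequence[int], public_bits: int = 0,
                task_parallel: bool = True) -> int:
    """Time to upload reconfiguration bits, assuming one time unit per bit.

    task_bits[j] is the number of bits for task T_j (private global and local
    resources); public_bits is the number of bits for the public global resources.
    """
    parts = list(task_bits) + [public_bits]
    # Task parallel: the longest upload dominates. Task sequential: uploads add up.
    return max(parts) if task_parallel else sum(parts)


bits = [40, 25, 10]
print(upload_time(bits, public_bits=15, task_parallel=True))   # 40
print(upload_time(bits, public_bits=15, task_parallel=False))  # 90
```

The same distinction between a maximum and a sum reappears in the cost functions for synchronized machines.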
Let For a given hypercontext (h 0 , h 1 , . . . , h m ) 
The Asynchronous Case
Here we describe cost models for non-synchronized partial hyperreconfigurable machines (but recall that global hyperreconfigurations are always barrier synchronized) where reconfigurations and partial hyperreconfigurations are executed task parallel.
1. General Multi Task model: The maximal total time for (hyper)reconfigurations of all tasks from $h$ to $h'$ is
where $\mathrm{init}(h)$ is the cost for a global hyperreconfiguration, $\mathrm{init}(h_j, f_j^{\mathrm{loc}})$ is the cost for a local hyperreconfiguration for task $T_j$, and $\mathrm{cost}(h_{i,j}^{\mathrm{loc}}, h_{i,j}^{\mathrm{priv}})$ are the reconfiguration costs for task $T_j$. Note that we assume that after a global hyperreconfiguration each task has to perform a local hyperreconfiguration.
Multi Task DAG (MT-DAG) model: Given are a DAG
The maximal total time for (hyper)reconfigurations of all tasks from $h$ to $h'$ is
The maximal total (hyper)reconfiguration time of all tasks from $h$ to $h'$ is
and a typical special case is
A model variant assumes that a hyperreconfiguration has fixed costs $w$ plus costs for the difference between the new hypercontext $h$ and the predecessor hypercontext $h'$ (see [9]). The latter costs are called changeover costs. The difference is measured as the size of the symmetric difference between the sets of switches $h$ and $h'$, denoted by $|h \,\triangle\, h'|$. The motivation for the introduction of changeover costs is applications where only the difference information to the predecessor hypercontext has to be loaded onto the machine.
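A changeover cost of this kind maps directly onto set operations. The following Python sketch assumes that hypercontexts are given as sets of switch names; the function name `hyperreconfiguration_cost` is hypothetical.

```python
from typing import FrozenSet


def hyperreconfiguration_cost(h_new: FrozenSet[str], h_prev: FrozenSet[str], w: int) -> int:
    """Fixed cost w plus changeover cost, i.e. the size of the symmetric difference."""
    return w + len(h_new ^ h_prev)  # ^ is the symmetric difference on Python sets


h_prev = frozenset({"x1", "x2", "x3"})
h_new = frozenset({"x2", "x3", "x4", "x5"})

# x1 is switched off, x4 and x5 are switched on: three switches change.
print(hyperreconfiguration_cost(h_new, h_prev, w=2))  # 5
print(hyperreconfiguration_cost(h_prev, h_prev, w=2))  # 2
```

Keeping the hypercontext unchanged therefore costs only the fixed part $w$, while a changeover pays for every switch that enters or leaves the hypercontext.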
The Synchronous Case
As an example of a cost model for a synchronized machine we describe the cost model for the fully synchronized MT-Switch machine. For ease of description we assume formally that there is a partial hyperreconfiguration immediately before every reconfiguration (since a barrier synchronization model is assumed, each task executes a local hyperreconfiguration or a no-hyperreconfiguration operation). We assume further that the computation time after a reconfiguration up to the next hyperreconfiguration is the same for all tasks (if this is not the case we count the maximum computation time of the tasks between corresponding reconfigurations). An example of such a machine is a reconfigurable mesh where a reconfiguration is done at the start of each computational cycle.
Assume that each task performs $n$ local (no-)hyperreconfiguration operations and $n$ reconfigurations between global hyperreconfigurations $h$ and $h'$. For $j \in [1 : m]$ and $l \in [1 : n]$ let $I_{j,l} = 0$ when the $l$th local (no-)hyperreconfiguration operation of task $T_j$ is a no-hyperreconfiguration operation and otherwise let $I_{j,l} = 1$. For task $T_j$ and $l \in [1 : n]$ let $f_j(l)$ be the number of the last local hyperreconfiguration operation in the sequence of its first $l$ local (no-)hyperreconfiguration operations. The maximal total (hyper)reconfiguration time of all tasks when partial hyperreconfiguration and reconfiguration are executed task parallel is
When partial hyperreconfiguration or reconfiguration operations are executed task sequentially, the corresponding "max" in the formula is essentially replaced by a "$\sum$", e.g. when partial hyperreconfiguration is done task sequentially and reconfiguration is done task parallel the costs are
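The structure of such a cost function can be sketched in Python. The sketch rests on assumptions that go beyond the text: a local hyperreconfiguration of task $T_j$ is charged a fixed cost `init_cost[j]`, a reconfiguration of $T_j$ in step $l$ costs the size of its current hypercontext $h_{j, f_j(l)}$, and the per-step costs are combined by a maximum (task parallel) or a sum (task sequential). The function name and parameters are hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence


def synchronized_cost(hypercontexts: Sequence[Sequence[Optional[FrozenSet[str]]]],
                      init_cost: Sequence[int],
                      hyper_parallel: bool = True,
                      reconf_parallel: bool = True) -> int:
    """Total (hyper)reconfiguration time of a fully synchronized machine (assumed form).

    hypercontexts[j][l] is the new hypercontext of task j in step l, or None for a
    no-hyperreconfiguration operation (I_{j,l} = 0).
    """
    m, n = len(hypercontexts), len(hypercontexts[0])
    current: List[Optional[FrozenSet[str]]] = [None] * m
    combine_hyper = max if hyper_parallel else sum
    combine_reconf = max if reconf_parallel else sum
    total = 0
    for l in range(n):
        hyper_costs = []
        for j in range(m):
            h = hypercontexts[j][l]
            if h is not None:
                current[j] = h  # this is now h_{j, f_j(l)}
                hyper_costs.append(init_cost[j])
            else:
                hyper_costs.append(0)
        if any(c is None for c in current):
            raise ValueError("every task needs a hypercontext before its first reconfiguration")
        reconf_costs = [len(c) for c in current]
        total += combine_hyper(hyper_costs) + combine_reconf(reconf_costs)
    return total


# Hypothetical instance with m = 2 tasks and n = 3 steps.
plan = [
    [frozenset("ab"), None, frozenset("a")],   # task T1
    [frozenset("c"), frozenset("cde"), None],  # task T2
]
print(synchronized_cost(plan, init_cost=[2, 3]))                        # 16
print(synchronized_cost(plan, init_cost=[2, 3], hyper_parallel=False))  # 18
```

Switching the hyperreconfigurations to task sequential mode only changes the first step, in which both tasks hyperreconfigure at the same time (2 + 3 instead of max(2, 3)).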
The Synchronized MT-Switch Problem
Under the general cost model the problem of finding the best time steps at which to perform global and local hyperreconfigurations in order to minimize the runtime of a computation is NP-hard even for the single task environment ([9]). Here we show that the problem can be solved efficiently under the fully synchronized switch cost model for the multi task environment where hyperreconfigurations and reconfigurations are done task parallel. Due to space limitations we describe only the case that there are only local resources and no global resources, and we omit the corresponding algorithm. Recall that there are no global hyperreconfigurations in this case.
The MT-Switch problem is defined as follows: Given tasks $T_j$, $j \in [1 : m]$, and for each task a sequence of context requirements $c_{j,1} \ldots c_{j,n}$, the problem is to find the best time steps when to perform local hyperreconfigurations and to define the corresponding hypercontexts so that the total (hyper)reconfiguration time is minimal.
With a dynamic programming algorithm we can obtain the following theorem (it should be noted that the runtime can be further improved with pointer techniques as described in the full paper).
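To illustrate the dynamic programming idea, the following Python sketch solves only the single task special case ($m = 1$), not the multi task problem treated here. It assumes a cost of $w + |h| \cdot |S|$ for a phase $S$ that runs under the hypercontext $h$ given by the union of its context requirements. The function name `optimal_single_task` is hypothetical, and the sketch makes no claim to match the algorithm of the full paper.

```python
from typing import FrozenSet, List, Sequence, Tuple


def optimal_single_task(reqs: Sequence[FrozenSet[str]], w: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Minimal total cost and the optimal phases as half-open index ranges [k, i)."""
    n = len(reqs)
    best = [0] + [float("inf")] * n  # best[i]: minimal cost for the first i requirements
    split = [0] * (n + 1)            # split[i]: start index of the last phase
    for i in range(1, n + 1):
        union: set = set()
        # Try every start k of the last phase, growing the union incrementally.
        for k in range(i - 1, -1, -1):
            union |= reqs[k]
            candidate = best[k] + w + len(union) * (i - k)
            if candidate < best[i]:
                best[i], split[i] = candidate, k
    # Reconstruct the phases from the stored split points.
    phases: List[Tuple[int, int]] = []
    i = n
    while i > 0:
        phases.append((split[i], i))
        i = split[i]
    phases.reverse()
    return best[n], phases


reqs = [frozenset("a"), frozenset("ab"), frozenset("a"), frozenset("cd"), frozenset("c")]
cost, phases = optimal_single_task(reqs, w=4)
print(cost, phases)  # 18 [(0, 3), (3, 5)]
```

The two nested loops give a quadratic number of candidate phases, each evaluated with one set union, so the sketch runs in polynomial time in the number of requirements and switches.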
Example Architecture and Experimental Results
We define the Simple HYperReconfigurable Architecture (SHyRA) as an example of a minimalistic model of a rapidly reconfiguring machine. As depicted in Figure 1 it features two reconfigurable look-up tables (LUTs), each with three inputs and one output. For storing signals a file of 10 registers is used, which are reconfigurably connected to both LUTs by a 10:6 multiplexer and a 2:10 demultiplexer. The small number of LUTs poses a bottleneck for the test applications we have run on SHyRA and forces them to make extensive use of reconfigurations. The test applications therefore naturally lend themselves to profit from the use of partial hyperreconfigurations. For the test runs SHyRA worked in fully synchronized mode and partial hyperreconfigurations were performed in task parallel mode. We compare the multiple task case, where each of the four components (LUT1, LUT2, MUX, DeMUX) forms a task ($m = 4$) and partial hyperreconfiguration is possible, with the single task case, where all components are combined into one single task ($m = 1$).
As a first example application, a 4 bit counter with a variable upper bound was mapped onto SHyRA. The counter increments its value that is stored in the first four registers until it has reached the value stored in registers five to eight. As all operations can only be performed through the use of the LUTs it is impossible to implement the counter in one clock cycle. The design is thus time partitioned.
The resulting code was executed on a simulated SHyRA system where hyperreconfiguration was disabled. The initial counter value was 0000 and the upper bound was set [...] multiple tasks case and 30 hyperreconfiguration steps for the single task case. The total reconfiguration costs when hyperreconfiguration is disabled were 5280. This has to be compared with total (hyper)reconfiguration costs of 3761 for the single task case and costs of 2813 for the multiple task case (this is 71.2% and 53.3%, respectively, of the reconfiguration costs without using hyperreconfiguration). Figure 3 shows which tasks have performed a hyperreconfiguration and which have performed a no-hyperreconfiguration operation. Note that since tasks $T_1$, $T_2$, and $T_3$ have the same size of local context (i.e., $l_1 = l_2 = l_3$) and hyperreconfigurations are done task parallel, either all four tasks perform a partial hyperreconfiguration or tasks $T_1, \ldots, T_3$ perform a partial hyperreconfiguration. The results show that partial hyperreconfiguration is a promising concept that has the potential to increase the speed of runtime reconfiguration.
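The reported percentages follow directly from the three cost values given above, as the following short Python check shows. Only the numbers 5280, 3761, and 2813 are taken from the text; the variable names are chosen for illustration.

```python
baseline = 5280  # reconfiguration costs with hyperreconfiguration disabled
costs = {"single task": 3761, "multiple tasks": 2813}

# Cost of each variant relative to the baseline, in percent with one decimal.
relative = {label: round(100 * cost / baseline, 1) for label, cost in costs.items()}
for label, percent in relative.items():
    print(f"{label}: {percent}% of the baseline")
```

This reproduces the 71.2% for the single task case and the 53.3% for the multiple task case.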
Conclusion
The concept of hyperreconfigurable architectures that can adapt their ability of reconfiguration during runtime in order to speed up ordinary reconfiguration steps has been extended in this paper to multi task environments. Several models for partially and dynamically hyperreconfigurable architectures have been proposed. It was shown that the problem of finding optimal (hyper)reconfigurations can be solved by a polynomial time dynamic programming algorithm for the switch cost model when (hyper)reconfigurations are done fully synchronized. The introduced concepts for partial hyperreconfiguration were illustrated by a simulated example architecture that has different reconfigurable units (2 LUTs, multiplexer, and demultiplexer). Experimental results for a small application (4-bit counter) that was run on the example architecture under the multiple tasks switch cost model show how partial hyperreconfiguration can lead to reduced (hyper)reconfiguration costs.
