Abstract. A method for managing reconfigurable designs, which supports run-time configuration transformation, is proposed. This method involves structuring the reconfiguration manager into three components: a monitor, a loader, and a configuration store. Different trade-offs can be achieved in configuration time, optimality of the configured circuits, and the complexity of the reconfiguration manager, depending on the reconfiguration methods and the amount of run-time information available at compile time. The proposed techniques, implementable in hardware or software, are supported by our tools and can be applied to both partially and non-partially reconfigurable devices. We describe the combined and the partitioned reconfiguration methods, and use them to illustrate the techniques and the associated trade-offs.
Introduction
Exploiting the run-time configurability of FPGAs has been regarded by many as the key to overcoming their reduced capacity and speed compared with custom integrated circuit implementations. The approach will, however, only be valid if the time for reconfiguring the FPGAs does not outweigh the benefits of increased capacity. Techniques are required to manage reconfigurable resources efficiently at run time; such techniques may also provide abstractions which hide low-level details from users when appropriate. This paper presents a method for efficient run-time management of reconfigurable designs, which involves structuring the reconfiguration manager into three components: a monitor, a loader, and a configuration store. The method can be implemented in hardware, software, or a combination of both. It can be applied to dynamically reconfigurable systems containing one or more FPGAs, which may or may not support partial reconfiguration. Techniques such as run-time transformation and partitioning the reconfiguration manager can be used to optimise configuration store usage or to reduce reconfiguration time.
Our work complements related research on tool development and run-time support for reconfigurable systems [1], [2], [3], [4], [8], [10]. The important aspects of our work include: (a) exploitation of compile-time information for optimising run-time performance, (b) flexibility of implementing the reconfiguration manager in hardware or software, (c) support for both partially reconfigurable and non-partially reconfigurable FPGAs.
Framework Overview
This section provides an overview of our framework for reconfiguration management. Details of the components in this framework will be presented later. While the discussion below centres on one dynamically reconfigurable FPGA, the framework can be extended to deal with multiple devices.
In this framework, the reconfiguration manager contains three components: a monitor, a loader, and a configuration store (Figure 1). The monitor maintains information about the configuration state, which may include the type and location of the circuits currently operating in the FPGAs. When the conditions for advancing to the next configuration state are met (such as receiving a request from the application or from the FPGA), the monitor notifies the loader to install the new circuit at particular locations on the FPGA. In situations such as image processing, as long as the image size is fixed, the number of cycles for many operations is data independent and can be determined at compile time. The monitor can then be simplified to contain a few counters. The loader, on receiving a request from the monitor, configures the FPGA using data from a configuration store. When finished, it signals completion to the monitor, and normal operation can resume.
The configuration store contains a directory for the circuit configurations. The configurations are usually stored in the form of address-data pairs, where the data specify the configuration for an FPGA cell while the address indicates its location in the FPGA. A transformation agent can be used to transform or compose circuit configurations at run time; such details will be discussed later.
Our framework can be used to construct generic or customised reconfiguration managers. A generic reconfiguration manager can deal with a variety of applications, and is therefore likely to be more complex and less efficient. A customised reconfiguration manager is developed for one or a few applications, and can be optimised at compile time based on knowledge about run-time conditions. It is often simpler, more compact and more efficient than a generic reconfiguration manager, but is not as flexible.

For this paper, we shall focus on the sequencing step. In this step, the design is captured as a network with control blocks connecting together the possible configurations for each reconfigurable component, together with the sequence of conditions for activating a particular configuration for each control block. In the next section, we shall describe how compile-time information captured in the activation sequence can be used to optimise the reconfiguration manager.
The above procedure can be explained using our model [6] for reconfigurable designs. In this model, a component that can be configured to behave either as A or as B is described by a network with A and B connected between two control blocks. The control blocks, RC DMux and RC Mux, route the data and results from the external ports x and y to either A or B at the desired instant, depending on the value c on their select lines (Figure 2). Each control block will be mapped either into a real multiplexer or demultiplexer to form a single-cycle reconfigurable design, or into virtual ones which model the control mechanisms for replacing one configuration by another [6]. We shall see how this model can be used in developing and optimising the reconfiguration manager in later sections.
Monitor
The purpose of the monitor is to keep track of the configurations in the FPGA. The monitor also contains information about possible transitions to the next state from a particular state.
Since run-time conditions usually require rapid capture and may involve a large amount of data, part of the monitor often resides on the dynamically reconfigurable FPGA, and is mainly used for data-driven reconfiguration. The monitor checks for the user condition that activates reconfiguration. If the user condition for the next configuration is met and the desired configuration is not in a usable form on the FPGA, the monitor notifies the loader to introduce the configuration. When finished, the monitor may signal the completion of the configuration process if required.
The monitor includes one or more reconfiguration state machines. These state machines can be produced from our tools automatically and are based on the activation sequence from the user specifying the reconfiguration conditions (Section 3). A reconfiguration state machine indicates which configuration to load from the configuration store.
There are three possibilities for the monitor operation, depending on the information in the reconfiguration sequence available at compile time.
(a) The duration for which the current configuration remains valid is known at compile time, and the next configuration is also known.
(b) The duration for which the current configuration remains valid is not known, although the next configuration is known.
(c) Neither the duration for which the current configuration remains valid nor the next configuration is known.

Case (a) is the simplest: a timing mechanism such as a counter could be included in the monitor to indicate when the next configuration will be loaded. This happens, for instance, in video processing when the hardware reconfigures to a known next state after a fixed number of frames whose size is also known. Recall that RC Mux/RC DMux pairs are used to indicate the reconfigurable regions, and that changing the value on their select lines corresponds to reconfiguring between components delimited by the RC Mux/RC DMux pair (Figure 2). For case (a), these select lines will be connected to the timing mechanism.
For FPGAs supporting partial reconfiguration such as Xilinx 6200 devices, this means that partial reconfiguration will be performed after a fixed duration; for non-partially reconfigurable FPGAs such as Xilinx 4000 devices, entire chip configurations will be swapped. Provided that there are enough FPGA resources, one can implement the RC Mux/RC DMux pairs and the associated configurations as physical components on the FPGA to produce a single-cycle reconfigurable design [6], [9].
Case (b) requires inputs from run-time conditions, from the FPGA or from application software, to decide when the next configuration is required. In this case, the select lines of the RC Mux/RC DMux pairs are connected to the source that triggers reconfiguration. The same is true for case (c); however, since the choice of the next configuration is determined at run time, all possible next configurations will have to be produced either in advance at compile time or on demand at run time.
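As an illustration of case (a), the following minimal Python sketch models a counter-based monitor. It is not taken from our tools: the class name `CounterMonitor`, the `(name, duration)` sequence format, the cyclic repetition of the sequence, and the `request_load` callback standing in for the loader are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


class CounterMonitor:
    """Case (a) sketch: both the next configuration and its duration are known.

    `sequence` is a hypothetical compile-time activation sequence of
    (configuration name, duration in cycles) entries; durations must be >= 1.
    The sequence is assumed to repeat cyclically.
    """

    def __init__(self, sequence: List[Tuple[str, int]],
                 request_load: Callable[[str], None]) -> None:
        self.sequence = sequence
        self.request_load = request_load  # stands in for notifying the loader
        self.index = 0
        self.remaining = sequence[0][1]
        request_load(sequence[0][0])      # install the initial configuration

    def tick(self) -> None:
        """Advance one cycle; request the next configuration when the counter expires."""
        self.remaining -= 1
        if self.remaining == 0:
            self.index = (self.index + 1) % len(self.sequence)
            name, duration = self.sequence[self.index]
            self.remaining = duration
            self.request_load(name)
```

For cases (b) and (c), the counter would be replaced by an external trigger from the FPGA or the application, with case (c) also supplying the identity of the next configuration at run time.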
Our scheme allows an abstraction layer above the RC Mux/RC DMux level. A mapping function can be defined that relates a value from the user design to the corresponding RC Mux/RC DMux pairs. In the constant adder example provided in Section 8, a user only needs to supply an integer constant, which is then mapped to selecting the corresponding RC Mux/RC DMux pairs that indicate the reconfiguration to be performed.
Sometimes the designer can determine whether reducing the reconfiguration time, or optimising the size or speed of the new circuit, should take priority. For instance, one configuration may contain circuit elements usable by its successor, but in a suboptimal way. One can then decide whether to reduce the reconfiguration time and tolerate a suboptimal circuit, or to have a longer reconfiguration time in return for a better circuit. Alternatively, circuit elements from the next configuration can be included in the current configuration, such that circuit behaviour is preserved while reducing reconfiguration time. Facilities for estimating reconfiguration time will be useful [8].
Loader
The purpose of the loader is to carry out the reconfiguration of the FPGA, as specified by the select value for the RC Mux/RC DMux components. On receiving a request from the monitor, the loader obtains the location of the requested configuration from the configuration directory, extracts the configuration from the configuration store and then initiates the configuration process. On completion, the loader may, when appropriate, set a new clock speed for the new circuit. It then signals completion to the monitor, and normal operation can resume.
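The loader steps described above can be sketched in Python as follows. This is an illustrative model only: the `Loader` class, the `(start, length)` directory layout, the dictionary standing in for the FPGA cells, and the `on_done` callback standing in for the completion signal are hypothetical and are not the API provided by our tools.

```python
from typing import Callable, Dict, List, Optional, Tuple

AddressData = Tuple[int, int]  # (cell address, configuration word)


class Loader:
    """Hypothetical software loader; the FPGA is modelled as a dict of cells."""

    def __init__(self, directory: Dict[str, Tuple[int, int]],
                 store: List[AddressData], fpga: Dict[int, int]) -> None:
        self.directory = directory  # name -> (start index, length) in the store
        self.store = store
        self.fpga = fpga
        self.clock_mhz: Optional[float] = None

    def load(self, name: str, on_done: Callable[[str], None],
             clock_mhz: Optional[float] = None) -> None:
        start, length = self.directory[name]      # 1. look up the location
        pairs = self.store[start:start + length]  # 2. extract the configuration
        for address, data in pairs:               # 3. configure the FPGA
            self.fpga[address] = data
        if clock_mhz is not None:                 # 4. optionally set a new clock speed
            self.clock_mhz = clock_mhz
        on_done(name)                             # 5. signal completion to the monitor
```

Keeping the directory separate from the configuration data means that only the small directory needs to be searched when a request arrives.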
The software version of the loader runs on the host processor. API functions are provided to facilitate design development by hiding the mechanisms used for performing run-time reconfiguration. We follow an object-oriented approach, treating each RC Mux/RC DMux pair as an object which loads a new configuration when the value on its select lines changes. When the object is created, the configuration data associated with the RC Mux/RC DMux pair are loaded into the host's main memory to ensure fast configuration of the FPGA. The resulting facilities are similar to those supported by JERC [4].
To improve reconfiguration speed, we have developed a scheme to implement the loader in hardware. This enables dynamic reconfiguration to be performed at the maximum speed that the FPGA can handle. This is difficult to achieve by loading configurations from a loosely-coupled processor, for example an FPGA co-processor board that resides on a PCI bus.
A handshaking scheme is used to synchronise the user design with the reconfiguration manager, since the reconfiguration manager can be clocked faster than the user design. This allows multiple configuration cycles to occur in a single compute cycle, thus reducing reconfiguration overhead.
Configuration Store and Run-Time Transformations
The configuration store contains three components: a configuration directory, a repository for configuration data, and a transformation agent (Figure 3). The configuration directory and the configuration data can be arranged as shown in Figure 4. If required, the transformation agent transforms a configuration before loading it into the FPGA; this can be used in minimising configuration store usage, as discussed below. For performance-critical applications, the transformation agent can itself be implemented in hardware. If the next configuration can be predicted at compile time or at run time before it is required, there may be sufficient time for a software transformation agent to perform its tasks.

Fast storage is often scarce. To minimise configuration storage, three transformation methods are explored. The first method covers regular circuits: if the same configuration information is used in two or more locations of the FPGA, an offset (Figure 4) can be added repeatedly to the address of the base configuration to produce the required configurations. Our tools automatically calculate these offsets and the number of replications, and place them as transformation parameters in the configuration store. The replicas of the configuration data at the row and column offsets are generated by the transformation agent during reconfiguration of the FPGA.
The second method is to maximise sharing of lower-level components in the design hierarchy: for instance the same adder configuration can be used in producing different kinds of multipliers. This method is an extension of the first method to support hierarchical representations of configuration data.
The third method adopts a small number of configuration templates, which can be transformed by operations such as stretching or partial evaluation, for building the actual configuration bitstreams at run time. This method is particularly useful in, for example, producing constant-coefficient adders or multipliers. Further parameters can be included to support specific transformations.
All three transformation methods assume that the configurations are relocatable [10], and work best if there are minimum constraints on the placement of the circuits. These methods can be implemented in hardware to reduce their run-time overhead. While other configuration store architectures may result in greater utilisation, they may do so at the expense of increasing reconfiguration time or complicating the transformation agent.
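The first transformation method (offset replication) can be illustrated with a short Python sketch. It assumes a flat integer address space in which a row or column step is a fixed additive offset; the function name `replicate` and its parameters are hypothetical and do not reflect the exact layout of Figure 4.

```python
from typing import List, Tuple

AddressData = Tuple[int, int]  # (cell address, configuration word)


def replicate(base: List[AddressData],
              row_offset: int, rows: int,
              col_offset: int, cols: int) -> List[AddressData]:
    """Expand one stored base configuration into rows x cols relocated copies.

    Only `base` and the four transformation parameters need to be stored;
    the copies are generated while the FPGA is being reconfigured.
    """
    expanded: List[AddressData] = []
    for r in range(rows):
        for c in range(cols):
            shift = r * row_offset + c * col_offset
            expanded.extend((address + shift, data) for address, data in base)
    return expanded
```

With this arrangement the store holds one copy of the base configuration regardless of the number of replicas, which is where the saving in fast storage comes from.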
Reconfiguration Methods
This section presents two reconfiguration methods, and assesses their impact on our framework. An example will be considered in Section 8; further case studies, such as arithmetic and video processing designs, are under development.
Combined reconfiguration method. For a design with n configurations, there are n(n-1) possibilities of changing from one configuration to another. If the reconfiguration sequence is known at compile time, then we can generate incremental configurations instead of full configurations [7]. At run time, the transformation agent produces the required configuration from incremental configurations, including the computation of offsets (Section 6). For devices supporting partial reconfiguration or simultaneous reconfiguration, there will be an improvement in reconfiguration time since only the parts that change need to be reconfigured.
However, if the reconfiguration sequence is only available at run time, then up to n(n-1) configurations will need to be generated at compile time. Alternatively, the configurations will have to be produced on demand at run time.
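The n(n-1) count and the idea of an incremental configuration can be made concrete with a small Python sketch. It assumes each full configuration is a mapping from cell address to configuration word and that all configurations cover the same set of cells; the function names are hypothetical, and device-specific optimisations such as wildcarding are not modelled.

```python
from itertools import permutations
from typing import Dict, Tuple

Config = Dict[int, int]  # cell address -> configuration word


def incremental(src: Config, dst: Config) -> Config:
    """Words that must be written to turn configuration `src` into `dst`."""
    return {address: data for address, data in dst.items()
            if src.get(address) != data}


def all_incrementals(configs: Dict[str, Config]) -> Dict[Tuple[str, str], Config]:
    """One incremental configuration per ordered pair: n(n-1) entries for n configurations."""
    return {(s, t): incremental(configs[s], configs[t])
            for s, t in permutations(configs, 2)}
```

When the sequence is known at compile time, only the entries for transitions that actually occur need to be kept; otherwise the full table, or an on-demand call to `incremental`, is required.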
Partitioned reconfiguration method. An alternative method is based on the principle that more efficient implementations can often be obtained by moving the RC Muxes and RC DMuxes to a lower level of description [6]. For the above example, this method is applicable if the n configurations can each be decomposed into m components, so that each component is controlled by its group of RC Mux/RC DMux pairs. m reconfiguration state machines are generated, one for each group of RC Mux/RC DMux pairs, so that the design can be configured to be one of the n possible configurations.
At run time, the required configuration is produced by the transformation agent from data for each of the m components. The configuration state machine in the monitor for each component determines if the conditions for transition have been reached; if so, it signals the loader to load the appropriate partial configuration. In this example, the partitioned reconfiguration method reduces the number of partial reconfigurations from n(n-1) to an application-specific value dependent on m. However, the reconfiguration controller is more complex than that for the combined reconfiguration method, since there are now m reconfiguration state machines instead of one. This method may not be able to take advantage of simultaneous reconfiguration techniques, unless the relevant control information (such as wildcard data for the Xilinx 6200 FPGA) can be computed rapidly [7].
Finally, a mapping function may be required to produce the appropriate control information for the m state machines; this will be illustrated in the next section.
Constant Adder
In this example, a bitslice of a variable adder is partially evaluated, resulting in the two circuits shown in Figure 5(a), which correspond to a constant zero adder and a constant one adder. Our tools [9] automatically find the reconfigurable regions in these two designs and insert RC Muxes and RC DMuxes to delimit the reconfigurable regions, resulting in the bitslice in Figure 5(b). This bitslice can then be replicated to give a constant adder of a particular size. For a Xilinx 6200 FPGA, the use of a constant adder in place of a variable adder reduces the size by 50%, and increases the speed by 33%.

Combined reconfiguration method. In this method, the user specifies the constants in the command file along with the duration between reconfigurations if available. The configuration state diagram in Figure 6(a) is produced by our tools. If the duration between reconfigurations is known at compile time, then a timing mechanism will be included in the monitor to trigger the reconfiguration automatically. If the duration is not known, then the monitor keeps track of the configuration state so that, when the conditions for reconfiguration occur, it requests the loader to initiate the reconfiguration. For this method, the ease of reconfiguration comes at the expense of increasing the amount of configuration data. For each bit that differs between two successive constants, two configuration cycles are needed in the Xilinx 6200: one for reconfiguring the XNOR gate to the XOR gate, and the other for reconfiguring the OR gate to the AND gate. Our tools can take advantage of device-specific optimisations such as wildcarding in the Xilinx 6200, thus reducing the number of reconfiguration cycles between the constants "1111" and "0000" [7]. There are a total of 20 configuration words for the reconfiguration sequence in Figure 6(a). In general, if the user would like to reconfigure between all 2^n different n-bit constants, 2^n(2^n - 1) partial configurations would have to be generated and stored.
Partitioned reconfiguration method. An alternative is to partition the adder into bitslices, and calculate the configuration needed for each bitslice to add a 0 or 1. To change a constant, a mapping function is defined that selects the appropriate RC Muxes/RC DMuxes for each bitslice shown in Figure 5(b). The monitor has access to a reconfiguration state machine for each bitslice, which determines if its bit of the constant has changed; if so, it signals the loader to load the appropriate partial configuration. The configurations for the bitslices are stitched together by the transformation agent to form the required configuration.
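A mapping function of this kind can be sketched in Python as follows. The sketch assumes that bitslice i is controlled by bit i of the constant and that a select value of 0 or 1 chooses the constant-zero or constant-one bitslice; the function names are hypothetical and are not part of our tools.

```python
from typing import List


def select_lines(constant: int, width: int) -> List[int]:
    """Select value for each bitslice's RC Mux/RC DMux pair, least significant bit first."""
    return [(constant >> i) & 1 for i in range(width)]


def slices_to_reconfigure(old: int, new: int, width: int) -> List[int]:
    """Indices of the bitslices whose bit of the constant has changed."""
    changed = old ^ new
    return [i for i in range(width) if (changed >> i) & 1]
```

Each per-bitslice state machine then only has to react when its own index appears in the list returned by `slices_to_reconfigure`, so a change that affects one bit triggers one partial reconfiguration.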
This method significantly reduces the amount of configuration data for an n-bit constant adder. Four configuration words are needed for each bitslice. Apart from the component at the least significant bit position, which differs because of the external carry input, the configuration bits for the bitslices are the same except for an address offset. Hence we only need to store the configuration bits for the component at the least significant bit position and the repeating bitslice. During reconfiguration, the transformation agent in the configuration store adds the corresponding offsets to reconfigure the bitslice. Only 8 configuration words need to be stored using this method.
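The storage figures quoted for the two methods can be reproduced with a few lines of Python. This is only a worked calculation under the numbers stated in the text (four words per bitslice, two distinct bitslices stored); the function and constant names are hypothetical. Note that the combined figure counts partial configurations while the partitioned figure counts configuration words, so the two are not directly comparable unit for unit.

```python
WORDS_PER_BITSLICE = 4  # from the text: four configuration words per bitslice
STORED_BITSLICES = 2    # least significant bitslice plus the repeating bitslice


def combined_partial_configs(n_bits: int) -> int:
    """Partial configurations needed to move between all 2^n constants: 2^n(2^n - 1)."""
    constants = 2 ** n_bits
    return constants * (constants - 1)


def partitioned_stored_words() -> int:
    """Configuration words stored by the partitioned method, independent of n."""
    return WORDS_PER_BITSLICE * STORED_BITSLICES
```

For a 4-bit adder, `combined_partial_configs(4)` evaluates to 16 x 15 = 240, whereas `partitioned_stored_words()` remains 8 for any adder width.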
Summary
This paper presents a framework for efficient run-time management of reconfigurable designs, which exploits compile-time information for optimising run-time performance. The reconfiguration manager can be implemented in hardware or software, and supports both partially and non-partially reconfigurable FPGAs. Current and future work includes refining and extending our framework and tools, exploring their use in multi-tasking systems, and applying them to realistic applications.
