This paper describes a framework and tools for automating the production of designs that can be partially reconfigured at run time. The approach involves several stages, including: (i) a partial evaluation stage, which produces configuration files for a given design, where the number of configurations is minimised during the compile-time sequencing stage; (ii) an incremental configuration calculation stage, which takes the output of the partial evaluator and generates an initial configuration file and incremental configuration files that partially update preceding configurations; and (iii) an optimisation stage for devices or systems supporting simultaneous configuration of multiple components. While many of our techniques are independent of the design language and device used, experimental tools have been developed that target Xilinx 6200 devices. Simultaneous configuration, for example, can reduce the time for reconfiguring an adder to a subtractor from linear in its size to constant at best and logarithmic at worst. Our tools have been used in developing a variety of designs, including arithmetic, video and database applications.
Introduction
The run-time reconfigurability of FPGAs gives them an increasingly competitive edge over microprocessors, which tend to be flexible but slow, and over custom-designed integrated circuits, which tend to be fast but inflexible and in addition require a long time to develop. Run-time reconfiguration has been featured in a growing list of applications, including genomic database searching [15], neural networks [8], and boolean satisfiability solving [33]. Products incorporating run-time reconfiguration are beginning to reach the market place [4], and some predict that even microprocessors will eventually be implemented using reconfigurable hardware [3].
While rapid advances have been made, many obstacles remain to be surmounted before run-time reconfiguration can become a common feature in FPGA-based systems in general and reconfigurable computing in particular. The major challenge is to improve understanding of reconfigurable systems, and to provide facilities for developing and optimising them with much less effort and specialised knowledge than is required now.
Our objective is to provide a framework and tools for automating the exploitation of such hardware features in run-time reconfigurable designs. Although there has been work on simulating [28], optimising [24] and deriving [12] reconfigurable designs, the development of practical compilation tools for such designs is still largely unexplored. Pioneering research on compilation tools for run-time reconfigurable systems has been described by Bellows and Hutchings [1] and by Gokhale and Marks [6]. Our approach, in contrast, is largely language independent.
Prototype versions of the tools reported in this paper have been distributed to a number of institutions. They have been used in developing a variety of designs, including computer arithmetic [11], image interpolation [13], video processing [18], augmented reality [22], and database searching [30].
The contributions of this paper can be seen in the context of previous work on models, tools and devices. For instance, while partial evaluation is not a new idea, our prototype tools are probably the first to apply it to run-time reconfiguration based on an abstract model [24]. Similarly, although wildcarding was invented by Xilinx, we are not aware of any analysis of its effects comparable to the description in Section 7. Our tools for incremental configuration appear unique, although there has been research on using wildcarding for configuration compression [9].
Overview of Framework
We strive to develop design tools for run-time reconfiguration that will become standard in future synthesis systems. From experience, the desirable features for such tools include: the ability to produce a wide range of implementations that are globally or locally reconfigurable [14], covering devices that provide special hardware for rapid reconfiguration; support for simulating, optimising and validating designs at various levels of abstraction; and facilities assisting design reuse and performance analysis so that optimal designs can be produced rapidly.
This section outlines a framework that meets the above requirements. There are six steps in our framework: decomposition, sequencing, partial evaluation, incremental configuration calculation, simultaneous configuration generation, and validation (Figure 1). The first three steps and the last step can be applied to any reconfigurable design; step 4 is specific to devices or systems that support partial reconfiguration, and step 5 is specific to those that support simultaneous reconfiguration. Tools are being developed for each of the six steps in our framework; a more detailed illustration of the design flow for three of our tools is shown in Figure 2.
Figure 1: The six steps in our design framework. The dotted boxes indicate steps that are specific to devices or systems supporting partial reconfiguration or simultaneous reconfiguration.
In the first step of our framework, a design is decomposed into appropriate reconfigurable regions. This procedure should take the following into account: (i) trade-offs between maximising resource usage and minimising reconfiguration overhead in both space and time, and (ii) chip boundaries when there is more than one device in the implementation. Methods [24] are available to guide the decomposition step. We follow a library-based approach [21] to facilitate reusing designs, and to simplify development of configurations with compatible size, shape and interface constraints for partially-reconfigurable components. At the end of this step, the design is captured as a network with control blocks connecting together the possible configurations for each reconfigurable component, together with the sequence of conditions for activating a particular configuration for each control block.
In the second step, the activation sequence is used to decide which configurations are required at run time. For a component with $n$ configurations, there are $n(n-1)$ possible changes from one configuration to another. If the activation sequence is not available, all of these will need to be generated at compile time, or alternatively the configurations will have to be produced on demand at run time. If the number of configurations is too large, one can return to the first step for an alternative decomposition. Each control block will be mapped onto a real or a virtual component; further explanations will be given in the next section.
During the third step, the actual configuration files are produced by partially evaluating the design according to the activation sequence. Inputs having a fixed value throughout a configuration can be used to simplify the hardware for that configuration; this process involves propagating the constant values through the circuit, and is sometimes called data folding [5]. Partial evaluation is usually carried out at compile time, and the resulting netlists are compiled by FPGA vendor tools (Figure 2). Partial evaluation can also take place at run time if the overheads involved can be tolerated [30].
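Data folding can be illustrated with a minimal Python sketch. The netlist representation, the gate names (AND, XOR, BUF, NOT) and the function `fold` are assumptions made purely for illustration; they are not the representation used by the tools in this paper, which operate on EDIF netlists.

```python
def fold(netlist, constants):
    """Propagate constant inputs through a netlist of two-input gates.

    netlist:   dict mapping output net -> (op, in_a, in_b), listed in
               topological order; op is "AND" or "XOR" (illustrative only).
    constants: dict mapping net name -> 0 or 1, fixed for one configuration.
    Returns the simplified (residual) netlist for that configuration.
    """
    known = dict(constants)
    residual = {}
    for out, (op, a, b) in netlist.items():
        assert op in ("AND", "XOR")
        va, vb = known.get(a), known.get(b)
        if va is not None and vb is not None:
            # Both inputs are constant: the gate disappears entirely.
            known[out] = (va & vb) if op == "AND" else (va ^ vb)
        elif va is None and vb is None:
            # Nothing is known: the gate is kept unchanged.
            residual[out] = (op, a, b)
        else:
            # Exactly one input is constant: the gate is simplified.
            const, var = (va, b) if va is not None else (vb, a)
            if op == "AND":
                if const == 0:
                    known[out] = 0
                else:
                    residual[out] = ("BUF", var)
            else:
                residual[out] = ("NOT", var) if const else ("BUF", var)
    return residual


# One bit of a hypothetical datapath with fixed control inputs.
bit = {"d": ("XOR", "x", "sub"), "e": ("AND", "d", "en")}
print(fold(bit, {"sub": 1, "en": 1}))
```

With `sub` and `en` held at one, the XOR gate reduces to an inverter and the AND gate to a buffer, which is the kind of simplification that partial evaluation exploits.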
The fourth step, incremental configuration calculation, concerns only devices or systems supporting partial reconfiguration. The partial evaluation step results in complete configuration files; the purpose of this step is to produce incremental configuration files to minimise their size and reconfiguration time. When this step is completed, each reconfigurable component will be assigned an initial configuration file and one or more incremental configuration files.
The fifth step, simultaneous configuration generation, concerns only devices or systems supporting simultaneous reconfiguration of multiple array cells, such as Xilinx 6200 series FPGAs. While this step is application-dependent and device-dependent, as shown later the reconfiguration time can often be substantially reduced for regular circuits.
The sixth and final step, validation, involves checking that the design behaves as expected and meets the constraints on performance and resource usage. A comprehensive model of the reconfigurable component will be useful here for two reasons. First, it can be used to investigate the detailed behaviour of the device during reconfiguration, for formulating efficient and reliable reconfiguration methods. Second, it can be used to validate more abstract models which contain less information, but are more amenable to dealing with large designs.
Design tools for the first and the last steps are based on parametrised libraries [21] developed using the Ruby and Rebecca tools [17], the Pebble system [23], and commercial VHDL tools. These libraries and tools enable us to support a high-level and modular design approach for design compilation [7], visualisation [20] and validation [26].
The following sections describe, in greater detail, the prototype tools that we have been developing to support the sequencing, partial evaluation, incremental configuration calculation and simultaneous configuration generation steps (Figure 2). All of our tools are functioning and have been used in developing the examples in Section 7. While most of our techniques are device-independent, our tools currently target Xilinx 6200 devices, which support both partial and simultaneous reconfiguration, the latter by a procedure known as wildcarding [2]. Also, to maintain compatibility with Xilinx 6200 design tools, the data files and the results of the partial evaluation step are captured in the EDIF format.
Partial Evaluation
The basic idea behind the way we specify run-time reconfigurable regions is straightforward [24].
A block that can be configured to behave either as A or as B is described by a network with A and B sandwiched between two control blocks C and C'. The selection is modelled by a multiplexer, called an RC Mux (Figure 4), which is used to select between components A and B. At compile time the select value, MUX SEL, can be specified; as a result, either block A or B is instantiated, and the RC Mux is removed. If the MUX SEL value is not specified at compile time, a netlist in the EDIF format for each block will be produced and compiled separately, and each will then be loaded into the FPGA on demand at run time. The RC Mux can have more than two inputs in order to describe reconfiguration between multiple components, and each input and output can be a multi-bit bus. One advantage of using the RC Mux to model run-time reconfiguration is that the circuit can be simulated without modification, since the behaviour of RC Muxes can be modelled by normal multiplexers. This approach also covers the possibility that the RC Muxes are mapped onto actual multiplexers, provided that enough chip area is available [24]. Since we adopt a library-based approach, the locations of input and output ports of the components connected to the RC Mux are known and will be extended to match those for the largest component.
At compile time, the partial evaluator searches for an instance of an RC Mux. When one is found, the instance is removed. If the value of the select line of the RC Mux is given, the unselected block is only removed if it is connected to just the RC Mux, that is, if it has a fan-out of one. The output of the selected block is then connected to the component that was connected to the output of the RC Mux, and the net names are resolved. The initial configuration is compiled using the largest component connected to the RC Mux, so that sufficient chip area is reserved for the reconfigurable units. Since the connected components are selected from a parametrised library, their sizes, shapes and interface constraints are known before the design is processed by vendor tools. This process is continued until all the RC Muxes have been dealt with.
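The removal of RC Muxes with a known select value can be sketched as follows. This is an illustrative Python model under stated assumptions, not the implementation of our partial evaluator: the dictionary-based design representation, the type name `RC_MUX` and the function `partially_evaluate` are all hypothetical, and a mux whose select value is not given is simply left in place.

```python
def partially_evaluate(instances, select):
    """Remove RC Muxes whose select value is known at compile time.

    instances: dict name -> {"type": str, "inputs": [driver names]}
               (a hypothetical netlist model; the output net of an
               instance is identified with the instance name).
    select:    dict mux name -> index of the selected mux input.
    """
    design = {n: {"type": i["type"], "inputs": list(i["inputs"])}
              for n, i in instances.items()}
    while True:
        mux = next((n for n, i in design.items()
                    if i["type"] == "RC_MUX" and n in select), None)
        if mux is None:
            return design
        candidates = design.pop(mux)["inputs"]
        chosen = candidates[select[mux]]
        # Reconnect everything driven by the mux to the selected block.
        for inst in design.values():
            inst["inputs"] = [chosen if s == mux else s for s in inst["inputs"]]
        # Drop an unselected block only if it drove nothing but this mux.
        for block in candidates:
            still_used = any(block in i["inputs"] for i in design.values())
            if block != chosen and block in design and not still_used:
                del design[block]


design = {
    "add": {"type": "ADDER", "inputs": ["a", "b"]},
    "sub": {"type": "SUBTRACTOR", "inputs": ["a", "b"]},
    "mux": {"type": "RC_MUX", "inputs": ["add", "sub"]},
    "reg": {"type": "REGISTER", "inputs": ["mux"]},
}
print(partially_evaluate(design, {"mux": 1}))
```

In this example the subtractor is kept and wired directly to the register, while the adder is removed because its only fan-out was the RC Mux.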
Compile-Time Sequencing
If the sequence of configurations is known at compile time, the number of different incremental configurations that need to be generated can be reduced from $n(n-1)$ to $m$, where $m$ is the number of times an RC Mux select line is changed. As shown in Figure 2, a command file is used to specify the sequence of configurations. Additional commands can be given in order to use this file for simulation as well as for compilation.
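The saving can be made concrete with a small Python sketch. The function names and the example sequence are illustrative assumptions only.

```python
def all_transitions(n):
    """Number of ordered pairs of distinct configurations: n(n - 1)."""
    return n * (n - 1)


def required_transitions(select_sequence):
    """Transitions actually needed for a known sequence of select values.

    One incremental configuration is needed each time the select
    value changes, so the length of the result is m.
    """
    return [(a, b)
            for a, b in zip(select_sequence, select_sequence[1:])
            if a != b]


# A component with n = 4 configurations (0..3) and a known sequence.
sequence = [0, 1, 1, 0, 2]
print(all_transitions(4))                   # 12 possible transitions
print(len(required_transitions(sequence)))  # m = 3
```

Here only three incremental configurations are needed instead of twelve, because the sequence changes the select value three times.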
The configuration sequence is specified in the command file by assigning a value to a net in the circuit connected to the select lines of an RC Mux or to registers within the FPGA. If the net is connected to one or more select inputs of an RC Mux, this means that a new configuration corresponding to the selected hardware should be loaded into the FPGA. If the net is connected to a register within the FPGA, a register read or register write should be performed. The number of clock cycles can also be specified so that the time between reconfigurations is known.
The output of the sequencer tool is either a C routine or a hardware sequencer. The C routine is generated by translating the commands in the command file to their equivalent C functions. At run time, the C routine can be used as a template and other functions can be added. If very fast reconfiguration is needed, the sequencer can be generated partially or completely in hardware as a state machine [30].
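The translation from commands to a C routine might be sketched as below. Everything specific here is an assumption for illustration: the command tuples `(net, value, cycles)`, the generator `emit_c`, and the C function names `load_configuration`, `write_register` and `wait_cycles` are hypothetical and do not reflect the actual command file syntax or run-time library. Register reads are omitted for brevity.

```python
def emit_c(commands, mux_select_nets):
    """Translate sequencing commands into the text of a C routine.

    commands:        list of (net, value, cycles) tuples (hypothetical format).
    mux_select_nets: set of nets that drive RC Mux select inputs.
    """
    lines = ["void run_sequence(void) {"]
    for net, value, cycles in commands:
        if net in mux_select_nets:
            # Assignment to a select line: load a new configuration.
            lines.append(f'    load_configuration("{net}", {value});')
        else:
            # Assignment to any other net: treat it as a register write.
            lines.append(f'    write_register("{net}", {value});')
        if cycles:
            lines.append(f"    wait_cycles({cycles});")
    lines.append("}")
    return "\n".join(lines)


print(emit_c([("MUX_SEL", 1, 100), ("threshold", 42, 0)], {"MUX_SEL"}))
```

The generated text can then serve as a template to which further functions are added by hand.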
Calculating Incremental Configurations
Since Xilinx 6200 FPGAs support partial reconfiguration, it is possible to minimise the size of configuration files and to reduce reconfiguration time by calculating incremental configuration files. A program called ConfigDiff (Figure 2) was written to calculate the incremental configurations between two successive configurations for the Xilinx 6200 FPGA. Suppose we need to reconfigure a design from configuration current to configuration next. For this purpose, the incremental configuration will consist of two parts. The first will obviously be the regions which are specified in next but not in current; these correspond to functions which are not in the current configuration, and the cells involved will therefore need to be included in the incremental configuration. The regions in current but not in next correspond to functions which are no longer required, so the cells involved should be configured to unused logic. Since in most cases the sequence of configurations is known at compile time, only the necessary incremental configurations are calculated.
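The two parts of an incremental configuration can be sketched in Python as follows. This is not the ConfigDiff program itself: the mapping from cell coordinates to configuration data, the `UNUSED` encoding and the function name `config_diff` are assumptions, as is the treatment of cells that appear in both configurations with different data, which this sketch also rewrites.

```python
UNUSED = 0  # assumed encoding for a cell configured as unused logic


def config_diff(current, nxt):
    """Return the incremental configuration taking `current` to `nxt`.

    Both arguments map a cell position (row, column) to its
    configuration data (a hypothetical representation).
    """
    # Part 1: cells specified in next whose data is new or has changed.
    incremental = {cell: data for cell, data in nxt.items()
                   if current.get(cell) != data}
    # Part 2: cells used only by current are reset to unused logic.
    incremental.update({cell: UNUSED for cell in current if cell not in nxt})
    return incremental


current = {(0, 0): 5, (1, 0): 7, (2, 0): 9}
nxt = {(0, 0): 5, (1, 0): 8, (3, 0): 2}
print(config_diff(current, nxt))
```

Only three of the four cells involved are written: cell (0, 0) is common to both configurations and is left untouched.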
Simultaneous Configuration Generation
Xilinx 6200 FPGAs have a feature called 'wildcarding' that allows more than one cell within a column to be written to simultaneously with the same data [2]. This is performed by supplementing the address decoder with a wildcard register. During configuration, a logic one in the wildcard register indicates that the corresponding bit in the row address is to be taken as a 'don't care'; in other words, the address decoder will match addresses where this bit is a one or a zero.
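The matching rule can be expressed as a short Python sketch. The function name `rows_written` and the 6-bit row address (giving 64 rows) are assumptions chosen for illustration.

```python
def rows_written(address, wildcard, bits=6):
    """Rows selected by a row address combined with a wildcard register.

    A one in `wildcard` makes the corresponding address bit a don't-care,
    so a row matches when it agrees with `address` on every other bit.
    """
    care = ~wildcard & ((1 << bits) - 1)
    return [row for row in range(1 << bits)
            if ((row ^ address) & care) == 0]


print(rows_written(0b000100, 0b000011))  # rows 4, 5, 6 and 7
print(rows_written(0b000100, 0b000000))  # row 4 only
```

A wildcard with $k$ bits set therefore selects $2^k$ rows in a single write.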
An extension to ConfigDiff was written to take advantage of the wildcarding feature. Wildcard optimisation was performed by first building a look-up table. For the Xilinx 6216 device, this table was constructed by enumerating each of its 64 row addresses with all 64 wildcard values. Each location of the look-up table is a 64-bit value; each bit indicates which of the 64 rows would be written, given an address and a wildcard value. A function is provided to search the look-up table for the best wildcard value, given the rows which need to be written to simultaneously with the same data. Since there may not always be an exact match between the rows that need to be written to and the rows that actually will be written to, this function returns a 64-bit value indicating which rows will be affected. The configuration file is processed by repeatedly applying the best match function on a column of cells, until there are three cells or fewer that are configured with the same data; because of the overheads involved, it is not economical to apply wildcarding to three or fewer cells. Since the current implementation applies wildcarding to a single column of cells, the number of combinations is small enough that the optimal wildcard value can be obtained by exhaustive search.
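A minimal Python sketch of the look-up table and the repeated best-match search is given below. It is an illustration under assumptions rather than the ConfigDiff extension itself: the function names are hypothetical, "best" is taken to mean the write covering the most target rows without touching any other row, and configuration cycles are not modelled.

```python
BITS = 6
ROWS = 1 << BITS  # 64 rows, as in the 6216 example


def popcount(x):
    return bin(x).count("1")


def build_table():
    """Map (address, wildcard) to a 64-bit mask of the rows written."""
    table = {}
    for addr in range(ROWS):
        for wc in range(ROWS):
            care = ~wc & (ROWS - 1)
            mask = 0
            for row in range(ROWS):
                if ((row ^ addr) & care) == 0:
                    mask |= 1 << row
            table[(addr, wc)] = mask
    return table


def best_match(table, target):
    """Exhaustively find the write covering most rows within `target`."""
    best = (0, 0, 0)
    for (addr, wc), mask in table.items():
        if (mask & ~target) == 0 and popcount(mask) > popcount(best[2]):
            best = (addr, wc, mask)
    return best


def plan_writes(table, target):
    """Cover the rows in `target` (same data) with wildcarded writes."""
    writes, remaining = [], target
    while popcount(remaining) > 3:
        addr, wc, mask = best_match(table, remaining)
        if popcount(mask) <= 3:
            break  # wildcarding three or fewer cells is not economical
        writes.append((addr, wc))
        remaining &= ~mask
    # Whatever is left is written one row at a time, without a wildcard.
    writes += [(row, 0) for row in range(ROWS) if (remaining >> row) & 1]
    return writes


table = build_table()
print(plan_writes(table, (1 << 64) - 1))  # one write covers all 64 rows
print(plan_writes(table, 0b1111111))      # seven rows need several writes
```

The second call shows why sizes of the form $2^m - 1$ are awkward: no single address and wildcard pair selects exactly seven rows, so one wildcarded write is followed by individual writes.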
Run-Time Reconfigurable Design Examples
To evaluate the effectiveness of simultaneous reconfiguration, we tested wildcard optimisation using two examples from our parametrised design libraries [21] which have very different properties. The first example illustrates reconfiguration from one regular structure, an n-bit adder, to another regular structure, an n-bit subtractor. In the worst case, simultaneous reconfiguration reduces the reconfiguration time from linear to logarithmic time; in the best case, the reconfiguration time is constant (Figure 5). The second example illustrates reconfiguration between irregular designs using a 64-bit pattern matcher. These examples, both of which have been tested on a Xilinx 6200 FPGA in a PCI-based platform [24], will be described in more detail below.
Adder/Subtractor Example
In a Xilinx 6200 FPGA, an n-bit ripple adder/subtractor using only localised routing can be implemented using $6n$ cells. The size of this adder/subtractor can be reduced by 33% if the adder is changed into a subtractor using run-time reconfiguration; this can be achieved by inverting one of the input bits of each adder component, and also changing the carry-in to the adder array from a logic zero to a logic one.
Without wildcarding, it takes $n$ cycles to reconfigure the n-bit adder to the n-bit subtractor.
This linear configuration time is shown in Figure 5. When using wildcard optimisation, the best-case reconfiguration time, which is a constant 4 cycles, occurs when $n$ can be expressed in the form $2^m$. The worst-case reconfiguration time, which occurs when $n = 2^m - 1$, is due to the inability to apply a single wildcard to a large number of address bits, so that multiple wildcard writes are needed. An expression can be derived for the worst-case reconfiguration time in terms of the number of configuration cycles [25]; for our adder/subtractor example, this expression is $3\log_2(n+1) - 2$, where $n$ is the adder size. This logarithmic configuration time is shown in Figure 5 by a dashed line above the actual results. Since the best case occurs when $n = 2^m$ and the worst case occurs when $n = 2^m - 1$, the worst case can be improved by reconfiguring an additional cell to maximise wildcarding.
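As a worked illustration of these expressions, substituting the worst-case size $n = 2^m - 1$ gives

$$3\log_2(n+1) - 2 = 3\log_2(2^m) - 2 = 3m - 2.$$

For a 63-bit adder ($m = 6$) the expression evaluates to $3 \times 6 - 2 = 16$ configuration cycles, compared with 63 cycles without wildcarding. Extending the design by one cell to $n = 64 = 2^6$ moves it to the best case of 4 cycles.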
Pattern Matcher Example
Our second example is a 64-bit pattern matcher. The structure of the reconfigurable version of our pattern matcher is shown in Figure 6 [5]; this design takes up $64 \times 2 = 128$ FPGA cells, whereas a design including an additional shift register for storing the pattern and an additional row of comparators will be twice as large.
Figure 5: Variation of time against design size for reconfiguring a multi-bit adder to become a subtractor.
The test for the worst-case configuration time is performed by changing the pattern matcher to match the one's complement of the number it was previously matching, so that all 64 cells in the column are reconfigured. An experiment involving 10,000 test cases was conducted, during which the pattern matcher was constructed to match a 64-bit random constant. The results from this test are shown in Figure 7. Without wildcarding it takes 64 write cycles to reconfigure the pattern matcher. With wildcarding, it takes on average around 53 cycles, saving around 17% of the reconfiguration time. Since this analysis assumes the worst case, in practice there will usually be some regularity in the matching pattern to remove the need for reconfiguring every bit of the pattern matcher, resulting in a shorter reconfiguration time. However, it will be harder to apply a wildcard of 32 or 16 bits if there are fewer cells to reconfigure.
This example illustrates a common technique for dealing with irregular designs. Since it is impractical to generate the circuits for matching all possible 64-bit patterns, we produce instead the two possible configurations for each of the 64 gates in the design (Figure 6). We then compute the wildcarding for the complete configuration file from configuration data for each of the 64 gates. This technique reduces the number of configurations from $2^{64}(2^{64}-1) \approx 3.4 \times 10^{38}$ to $64 \times 2 = 128$. Sometimes the wildcard computation cannot be carried out at compile time because, for instance, the matching pattern is not available. Under these circumstances it may be possible to compute the wildcarding at run time, provided that this can be achieved with acceptable efficiency.
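The per-gate technique and the counts quoted above can be checked with a short Python sketch. The placeholder configuration names `MATCH_0` and `MATCH_1` and the function names are assumptions for illustration; real per-gate configuration data would take their place.

```python
WIDTH = 64
GATE_CONFIGS = ("MATCH_0", "MATCH_1")  # placeholder per-gate configurations


def column_configuration(pattern):
    """Assemble a full configuration by choosing one option per gate."""
    return [GATE_CONFIGS[(pattern >> bit) & 1] for bit in range(WIDTH)]


def rows_to_reconfigure(old_pattern, new_pattern):
    """Gates whose configuration differs between two patterns."""
    changed = old_pattern ^ new_pattern
    return [bit for bit in range(WIDTH) if (changed >> bit) & 1]


# Counting whole-design transitions against per-gate configurations.
whole_design_transitions = 2**WIDTH * (2**WIDTH - 1)
per_gate_configurations = WIDTH * len(GATE_CONFIGS)
print(f"{whole_design_transitions:.1e}")  # about 3.4e+38
print(per_gate_configurations)            # 128

# Worst case: the new pattern is the one's complement of the old one.
old = 0x0123456789ABCDEF
new = old ^ (2**WIDTH - 1)
print(len(rows_to_reconfigure(old, new)))  # all 64 gates change
```

The set of changed gates is what a wildcard search would then try to cover with as few writes as possible.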
Concluding Remarks
We have presented a framework and the associated tools for developing run-time reconfigurable designs, and have demonstrated their benefits and costs in two applications. The framework is capable of supporting a wide variety of FPGAs, including those with special support for rapid reconfiguration such as facilities for partial and simultaneous reconfiguration. Our tools are compatible with existing industry-standard tools for simulation and synthesis. A library-based approach is adopted which simplifies physical conformance of configurations for a reconfigurable component; it also facilitates design reuse and performance analysis. Our framework is supported by the Rebecca [17] and Pebble [23] systems, which provide (i) a path for formally verifying reconfigurable design optimisations, and (ii) additional tools such as those for mixed-level symbolic simulation and visualisation [19].
To be successful, such toolsets for run-time reconfigurable designs must include facilities that can exploit device-specific features whenever possible. For instance, our work has shown that the wildcard capability of Xilinx 6200 devices can result in substantial reduction of reconfiguration time.
In related work, we have developed a tool that automates the identification and mapping of reconfigurable regions [29]. Two successive circuit configurations for a partially reconfigurable system are matched to locate the components common to them. Such components will not be reconfigured when the second configuration replaces the first, hence reducing reconfiguration time. This tool has been integrated with the tools described in this paper.
Current and future work is focused on refining and extending our framework and tools to cover further applications and devices, such as Xilinx Virtex FPGAs [16]. We are also improving run-time support [31], providing an interface to higher-level tools [32], and including support for platforms containing multiple and heterogeneous processing elements [10] as well as systems with both hardware and software [27].
Figure 7: Worst-case analysis of reconfiguring a 64-bit pattern matcher using wildcarding.
