Many algorithms have a very efficient hardware implementation that cannot be captured by a general-purpose processor. The static nature of hardware implementations has previously made them unsuitable for a flexible computer. Modern dynamically-reprogrammable hardware, however, provides the ability to realise new algorithms in hardware at run-time. These devices are typically more limited in speed and computing resource than static hardware. In order to reclaim some of the cost of using reprogrammable hardware, we must look to new design methods for optimising implementations for dynamic hardware. By drawing on ideas from software design, this paper demonstrates how the technique of partial evaluation can be used to derive, systematically and formally, efficient specialisations of hardware implementations optimised for dynamic hardware, and further, how one might feasibly perform such specialisation at run-time with minimal cost.
Introduction
The traditional model of a general purpose computer is that of static hardware which executes a stored program, specified as a sequence of instructions. However, many algorithms, such as complex image filtering algorithms, are not best implemented as a sequence of instructions. They are better executed by a custom piece of hardware which can exploit their inherent parallelism. Custom hardware of this kind might be added to a computer as an accelerator, often in the form of a plug-in card. For instance, the popular image manipulation program Adobe Photoshop is able to make use of specialist image filtering cards to accelerate particular image processing functions. Unfortunately, such accelerators are, by their very nature, limited to benefiting only specific types of programs.
However, the advent of dynamically reprogrammable FPGAs (Field Programmable Gate Arrays) makes possible the concept of a general purpose accelerator card on which custom hardware algorithms may be implemented at program run-time. Indeed, the new Xilinx XC6200 series of FPGAs has been designed specifically for such purposes, with special features for interfacing to a processor [21]. For instance, an image processing program might contain hardware descriptions of image filtering functions. When the program executes, these descriptions could be used to program the general purpose accelerator. The advantage of the general purpose accelerator is that another program can implement a different algorithm using the same hardware, for instance a data compression algorithm.
However, whilst FPGA implementations of certain algorithms have been shown to be many times faster than equivalent software [17], FPGA technology is still at least an order of magnitude slower than static hardware and is typically more limited in number of gates and routing resources. If we only use a general purpose card to implement circuits that would otherwise have been implemented in static hardware, we trade off speed against re-use. In order to maximise the benefit of FPGAs we must instead look at how best to make use of their dynamic nature to implement hardware that we would not otherwise have been able to.
Various systems have been developed which do just this, such as the Dynamic Instruction Set Computer (DISC) project [20], the RRANN2 neural network [6] and a system for scanning genomic databases [12]. All of these systems increase the functional density of an FPGA by modifying the hardware implementation at run-time. For instance, the genomic database search engine codes the search sequence directly into the configuration information as the FPGA is programmed, resulting in a speed increase of 3 to 4 orders of magnitude over a software implementation of the algorithm. However, these systems have been hand-designed for dynamic hardware implementation; in general this is difficult and dependent on the FPGA architecture. We describe here a possible way of automatically deriving dynamic hardware implementations of algorithms using formal transformation from static hardware descriptions. Formal transformation is an important consideration, as dynamically generated hardware cannot be tested with long periods of simulation in the manner that a static design can be.
In the rest of this paper we introduce a formal transformational technique for run-time optimisation taken from software design. We demonstrate how it can be applied to hardware and, in particular, specifications given in the hardware description language Ruby [16] . We end with a discussion of a likely implementation strategy.
Partial Evaluation
The problem of how best to utilise reprogrammable hardware parallels that of how best to utilise the processor in software design. Typically the processor is used to execute different fixed pieces of code. However, attention is now being drawn to the possible speed increases that can be obtained by dynamically generating code [3] [5]. An application program tends to contain portions of code that are generalised over certain parameters. These parameters, which are perhaps the result of user input, often change little in comparison to the number of times the code is executed. Dynamic code generation is typically based on partial evaluation: the process of specialising a piece of code for particular parameter values.
Partial evaluation is a technique commonly used in functional programming languages, such as Haskell [8] . These languages are declarative, that is they are based on expressions, and have graph-reduction based semantics making them particularly well suited to partial evaluation [10] . For instance, consider the following Haskell function:
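A minimal sketch of such a function, assuming the curried two-argument integer addition named add that the following paragraphs refer to. The names add' and addTen are introduced here purely for illustration.

```haskell
module Main where

-- Assumed form of the example: a curried two-argument integer addition.
add :: Int -> Int -> Int
add a b = a + b

-- The same function written as nested lambda abstractions, which is
-- closer to the graph that a graph-reduction machine works on.
add' :: Int -> Int -> Int
add' = \a -> \b -> a + b

-- Partial application: supplying only the first argument eliminates the
-- outer lambda and leaves a function that still expects one integer.
addTen :: Int -> Int
addTen = add' 10

main :: IO ()
main = print (map addTen [1, 2, 3]) -- prints [11,12,13]
```

The type Int -> Int -> Int reads as Int -> (Int -> Int), which is why applying add to a single argument is well typed.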
The type of this function indicates that it can be partially evaluated [4] 
This representation more closely reflects the graph that this function is implemented as, and is more amenable to demonstrating partial evaluation. For example, by applying the add function to the number 10, we can reduce the graph by eliminating the outer lambda abstraction:
Note that, as expected, the result is a function which takes one integer and returns an integer result. This application of the function add is known as a partial application, as the function has not been provided with all of the parameters necessary to reduce it fully to a value. The process of evaluating a partial application is known as partial evaluation.
Run-time Partial Evaluation
We can see how partial evaluation can be used to optimise a piece of code by considering an example. The following piece of Haskell code describes an algorithm for rotating a list of coordinates around the origin by a given angle, perhaps as part of a drawing program:
The type synonym F is used here simply for conciseness. Now consider a call to rotatePoints of the form:
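A minimal sketch of such a function and call, under stated assumptions: F is taken to be a synonym for Double, the angle is in radians, and the specialised name noRotation is hypothetical. The body is written so that the sine and cosine depend only on the angle, which is what allows them to be reduced once when the angle alone is supplied.

```haskell
module Main where

-- Assumption: F abbreviates a floating-point type.
type F = Double

-- Rotate a list of coordinates about the origin by theta radians.
-- s and c depend only on theta, so a partial application to an angle
-- can evaluate them once and share them across every point.
rotatePoints :: F -> [(F, F)] -> [(F, F)]
rotatePoints theta = map rotate
  where
    s = sin theta
    c = cos theta
    rotate (x, y) = (x * c - y * s, x * s + y * c)

-- A partial application: the angle is known, the points are not.
noRotation :: [(F, F)] -> [(F, F)]
noRotation = rotatePoints 0

main :: IO ()
main = do
  print (noRotation [(1, 2), (3, 4)])      -- prints [(1.0,2.0),(3.0,4.0)]
  print (rotatePoints (pi / 2) [(1, 0)])   -- close to [(0.0,1.0)], up to rounding
```

Whether a compiler preserves this sharing is an implementation matter; the sketch only shows where the opportunity for specialisation lies.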
We might obtain the following sequence of graph reductions:
Note that the use of tuples here is not strict lambda calculus, but has been done for simplicity. Whilst this is a slightly contrived example, we can see that the final function which is applied to each coordinate in the list has been reduced to a very simple function. This is important if the list of coordinates is very long.
This, however, is partial evaluation within the context of a functional language. As functional languages are executed by graph-reduction, partial evaluation is a natural extension of their execution. It is less obvious how partial evaluation can be utilised for dynamic code generation in imperative languages, such as C. In such languages, the source code is compiled down to a sequence of machine instructions; at this level the scope for partial evaluation is very limited. The original source code contains the necessary information to be able to do partial evaluation, but source-level manipulation requires run-time compilation, the expense of which outweighs the possible execution speed gains.
A possible way of managing this problem is to look at a particular piece of code and analyse the different ways the code might evaluate based on the applied values. Consider the following short C function:
If the a parameter remains constant for a number of executions of this function, we can specialise the function to improve performance. However, we cannot compile in advance with the knowledge of a particular value for a, only with the knowledge that a will be known. With the knowledge that certain variables will remain constant for a relatively long period of execution, we can break a piece of code down into three parts: statements which are not affected by these variables; expressions which can be statically calculated knowing these variables, so-called holes; and different pieces of code which can be selected from at run-time based on these variables, so-called templates [3]. For the previous example, we obtain a new function of the form:
where:
    template   = if l > 100 then template_1 else template_2
    template_1 = { *c += hole_1; }
    template_2 = { *c *= hole_2; }
    hole_1     = 2*l
    hole_2     = l
    l          = a*a
The different templates can be compiled at normal compilation-time such that the holes can be filled in at run-time. When a particular value of a becomes known at run-time, the value of l can be calculated, the appropriate template chosen and the hole value calculated and filled in. This form of specialisation can be done relatively quickly, without the need for any re-compilation, perhaps by simply changing jump locations and literal values. It is worth noting, however, the difficulty of formally proving these program transformations. Whilst this can be done for simple language semantics, it is a non-trivial problem for a complex language.
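The template and hole mechanism can be sketched as follows. This is an illustration in Haskell rather than C, under stated assumptions: the C out-parameter *c is modelled as a value passed in and returned, the function general is a reconstruction of the unspecialised behaviour from the template definitions above, and the names general, specialise, template1 and template2 are hypothetical.

```haskell
module Main where

-- Assumed unspecialised behaviour: l and the branch are recomputed on
-- every call, even when a has not changed.
general :: Int -> Int -> Int
general a c
  | l > 100   = c + 2 * l
  | otherwise = c * l
  where
    l = a * a

-- The two templates, each with one hole to be filled in at run-time.
template1, template2 :: Int -> Int -> Int
template1 hole c = c + hole
template2 hole c = c * hole

-- Specialisation: once a is known, compute l, select a template and
-- fill its hole. The result no longer mentions a at all.
specialise :: Int -> (Int -> Int)
specialise a
  | l > 100   = template1 hole1
  | otherwise = template2 hole2
  where
    l     = a * a
    hole1 = 2 * l
    hole2 = l

main :: IO ()
main = do
  let f = specialise 11      -- l = 121, so template1 with hole1 = 242
  print (map f [1, 2, 3])    -- prints [243,244,245]
```

The specialised function f is built once and then reused for every value of c, which mirrors the claim that specialisation pays off when a stays constant over many executions.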
Partial Evaluation of Hardware
So how might partial evaluation be used in the design of hardware? Returning to the add example given in the previous section, let us consider a simple 8-bit ripple-carry adder circuit. Such a circuit would be composed of 8 full-adder circuits with their carries chained together, as shown in Figure 1.
Figure 1: 8-bit ripple-carry adder
Now if we partially apply the number 10 (00001010 in binary) to the b inputs, this results in the individual 1 or 0 bit values being partially applied to the b input of each full-adder. The logic for the full-adder reduces to one of two simpler circuits depending on whether the b input is a 0 or a 1. Figure 2 shows the general full-adder circuit (a), the full-adder specialised for a 0 input (b), and the full-adder specialised for a 1 input (c). This kind of partial evaluation is similar to constant propagation. For the adder example given, applying the number 10 to the adder results in a circuit which adds 10 to its remaining input. This new circuit requires only 16 gates in comparison to the 40 gates required for the generic adder.
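The gate counts can be checked with a small simulation. The sketch below assumes a conventional five-gate full-adder (two xor, two and, one or) and models a bit as a Bool; the names fullAdder, faZero, faOne and addConst are hypothetical. With b fixed at 0 the cell reduces to an xor and an and, and with b fixed at 1 to an xnor and an or, so eight specialised cells use 8 x 2 = 16 gates against 8 x 5 = 40.

```haskell
module Main where

import Data.Bits (bit, testBit)
import Data.List (mapAccumL)

type Bit = Bool

-- Generic full-adder, returning (sum, carry out): 5 gates.
fullAdder :: Bit -> Bit -> Bit -> (Bit, Bit)
fullAdder a b cin = (p /= cin, (a && b) || (p && cin))
  where
    p = a /= b -- xor

-- Full-adder specialised for b = 0: one xor and one and.
faZero :: Bit -> Bit -> (Bit, Bit)
faZero a cin = (a /= cin, a && cin)

-- Full-adder specialised for b = 1: one xnor and one or.
faOne :: Bit -> Bit -> (Bit, Bit)
faOne a cin = (a == cin, a || cin)

-- 8-bit vectors, least significant bit first.
toBits :: Int -> [Bit]
toBits n = [testBit n i | i <- [0 .. 7]]

fromBits :: [Bit] -> Int
fromBits bs = sum [bit i | (i, True) <- zip [0 ..] bs]

-- Specialise the ripple-carry adder on a constant k: each cell is
-- selected by one bit of k, then the carries are chained as usual.
addConst :: Int -> Int -> Int
addConst k = \x -> fromBits (snd (mapAccumL step False (zip cells (toBits x))))
  where
    cells = [if b then faOne else faZero | b <- toBits k]
    step cin (cell, a) = let (s, cout) = cell a cin in (cout, s)

main :: IO ()
main = print (addConst 10 5) -- prints 15
```

As with the hardware, the final carry is discarded, so the result is the sum modulo 256.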
The example given above shows how one might partially evaluate hardware by hand, but what about automatic partial evaluation of hardware descriptions? Complex hardware now tends to be designed using Hardware Description Languages (HDLs) such as VHDL [9]. VHDL is a behavioural description language based on the Ada programming language. This similarity with programming languages tempts us to consider using the same approach as has been described in the previous section for imperative languages. However, VHDL is a very rich language with very complicated semantics, which would make partial evaluation very difficult, and doing it by formal transformation more difficult still. Indeed, the complexity of VHDL is making it difficult to provide it with any kind of consistent formal semantics [11]. We have already shown the ease with which a declarative language such as Haskell can be partially evaluated, so it would seem sensible that a declarative HDL would suit partial evaluation better.
Using these definitions various laws can be derived, such as associativity of serial composition. These laws allow formal refinement and transformation of Ruby circuits. An important feature of Ruby is the capture of layout as well as behaviour; serial composition represents horizontal layout and parallel composition represents vertical layout. By considering the tiles to be 4-sided instead of 2-sided, more complex layouts can be described. A 4-sided tile is defined as a binary relation between two pairs: the domain pair representing the left and top sides of the tile, and the range pair representing the bottom and right sides, as shown in Figure 4(a). The definition of 4-sided tiles in terms of binary relations rather than 4-relations allows 4-sided and 2-sided tiles and operations to be used together as appropriate.
This definition shows the use of some Ruby wiring relations: apl (append left) and apr (append right) for manipulating tuples, and id (identity). The converse operation "flips" a tile such that its domain and range are swapped.
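To make the relational reading concrete, the sketch below models circuits as finite binary relations and checks two laws of the kind mentioned above. It assumes the usual relational definitions of serial composition, parallel composition and converse; the representation as lists of pairs, and the names Rel, serial, par, conv and sameRel, are choices made for this illustration and are not Ruby syntax.

```haskell
module Main where

import Data.List (nub, sort)

-- A circuit as a finite binary relation between domain and range values.
type Rel a b = [(a, b)]

-- Serial composition: a is related to c when some b links them.
serial :: Eq b => Rel a b -> Rel b c -> Rel a c
serial r s = [(a, c) | (a, b) <- r, (b', c) <- s, b == b']

-- Parallel composition: relate pairs componentwise.
par :: Rel a c -> Rel b d -> Rel (a, b) (c, d)
par r s = [((a, b), (c, d)) | (a, c) <- r, (b, d) <- s]

-- Converse: swap domain and range.
conv :: Rel a b -> Rel b a
conv r = [(b, a) | (a, b) <- r]

-- Compare relations as sets, ignoring order and duplicates.
sameRel :: (Ord a, Ord b) => Rel a b -> Rel a b -> Bool
sameRel r s = sort (nub r) == sort (nub s)

bits :: [Bool]
bits = [False, True]

andR :: Rel (Bool, Bool) Bool
andR = [((a, b), a && b) | a <- bits, b <- bits]

notR :: Rel Bool Bool
notR = [(a, not a) | a <- bits]

main :: IO ()
main = do
  -- Associativity of serial composition.
  print (sameRel (serial (serial andR notR) notR)
                 (serial andR (serial notR notR)))  -- prints True
  -- Converse distributes over serial composition, reversing the order.
  print (sameRel (conv (serial andR notR))
                 (serial (conv notR) (conv andR)))  -- prints True
```

Checking a law on sample relations is of course not a proof; it only illustrates what the laws assert.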
The ability of Ruby to specify the architecture of an algorithm as well as its behaviour is important when considering the design of circuits for FPGAs [18] . The layout information inherent in a Ruby circuit description can be used as constraints to a place-and-route system enabling it to perform place-and-route more quickly. Further, we can formally transform Ruby descriptions according to a rich set of proven laws, allowing us to derive different implementations with different properties from the same specification [7] .
Partial Evaluation of Ruby
The Ruby described in the previous section is that designed by Sheeran and Jones [16]. To examine partial evaluation of Ruby, however, we will look at a particular implementation of Ruby known as T-Ruby [15]. T-Ruby, designed at DTU, implements a variation of Ruby based on typed lambda calculus. T-Ruby allows the systematic transformation of expressions based on proven rules, making it an ideal framework for investigating partial evaluation. Note though that there is only room here for a cursory explanation of the T-Ruby syntax; a more detailed explanation can be found in [14]. Continuing with the adder example, let us now consider how to specify this in T-Ruby and how it might be partially evaluated. As before, a ripple-carry adder is composed of a number of full-adders, each of which can be described in terms of two half-adders. A half-adder can be specified as follows:
ha : (bit*bit) ~ (bit*bit) ;;
circuit(2) ha = dub ; [and,xor] ;;
The first line gives the type of the half-adder circuit: a pair of bit signals to a pair of bit signals. It is not strictly necessary to give the type in T-Ruby, but we will give types for clarity. The second line specifies the body of the circuit, which is given as a dub wiring component in serial with an and gate and an xor gate in parallel. A full-adder can be described using two half-adder circuits as follows:
fa : ((bit*bit)*bit) ~ (bit*(bit,bit)) ;;
circuit(4) fa = (Fst ha) ; reorg ; (Snd ha) ; reorg~ ; (Fst or) ;;
Note that T-Ruby allows the specification of the class of a circuit as 2-sided or 4-sided; this is used for drawing tiles. The half-adder and full-adder are shown in Figure 5(a) and Figure 5(b) respectively. The reorg circuit is used to regroup wires.
Figure 5: Pictorial representation of the full-adder
In order to partially evaluate a circuit, we have to be able to partially apply an input. We introduce a value into a circuit using a new circuit combinator Intr, defined as follows:
This specifies that Intr is a function which, given a value v, returns a circuit with that value on its domain and range (Intr is symmetrical). The spread function takes a function relating a domain and range and lifts this to a circuit. Functions in T-Ruby are specified as typed lambda abstractions (\v:T.E). Using this basic combinator we define two more useful combinators:
These combinators take a value and return a circuit which takes a single input on its domain and produces a pair of outputs on its range, one of which is the input and the other is the given value. The converse projection circuits, p1~ and p2~, construct a pair from a single signal; these are used to effectively discard the domain of Intr. Given these combinators for introducing constant values, we can now define the basic rules with which partial evaluation is done. For example, the rules for applying a constant 0 or 1 input to the second input of an and gate and an xor gate are given as follows:
Here the circuit any relates any domain to any range, and the circuit zero relates 0 to 0; composed together these relate any input to a 0 output. Taking a simple example to begin with, let us consider using the Force2 combinator to partially apply a 1 input to a half-adder:
The type of this expression is given to demonstrate that this circuit takes just the single remaining bit input on its domain. This expression can be evaluated as follows: 
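The effect of such rules on the half-adder can be imitated outside T-Ruby. The sketch below is a Haskell analogue, not the T-Ruby derivation itself: it represents a one-input circuit as a small expression tree and applies constant-propagation rules for and and xor of the kind described above. The type Circ and the functions simplify, halfAdder, specialise and eval are hypothetical names introduced for this illustration.

```haskell
module Main where

-- A circuit over one remaining input (In) and constants.
data Circ
  = In
  | Const Bool
  | Not Circ
  | And Circ Circ
  | Xor Circ Circ
  deriving (Eq, Show)

-- Constant-propagation rules: and with 0 gives 0, and with 1 is the
-- identity, xor with 0 is the identity, xor with 1 is an inverter.
simplify :: Circ -> Circ
simplify (And x y) = case (simplify x, simplify y) of
  (_, Const False) -> Const False
  (x', Const True) -> x'
  (Const False, _) -> Const False
  (Const True, y') -> y'
  (x', y')         -> And x' y'
simplify (Xor x y) = case (simplify x, simplify y) of
  (x', Const False) -> x'
  (x', Const True)  -> simplify (Not x')
  (Const False, y') -> y'
  (Const True, y')  -> simplify (Not y')
  (x', y')          -> Xor x' y'
simplify (Not x) = case simplify x of
  Const b -> Const (not b)
  Not x'  -> x'
  x'      -> Not x'
simplify c = c

-- Half-adder as an and gate and an xor gate in parallel: (carry, sum).
halfAdder :: Circ -> Circ -> (Circ, Circ)
halfAdder a b = (And a b, Xor a b)

-- Partially apply a constant to the second input and reduce.
specialise :: Bool -> (Circ, Circ)
specialise b = (simplify carry, simplify total)
  where
    (carry, total) = halfAdder In (Const b)

-- Evaluate a circuit for a given value of the remaining input.
eval :: Bool -> Circ -> Bool
eval _ (Const b) = b
eval i In        = i
eval i (Not x)   = not (eval i x)
eval i (And x y) = eval i x && eval i y
eval i (Xor x y) = eval i x /= eval i y

main :: IO ()
main = do
  print (specialise True)   -- prints (In,Not In)
  print (specialise False)  -- prints (Const False,In)
```

The two results are the two candidate circuits between which a run-time choice would be made when the applied value is not known in advance.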
Late Binding
These evaluations show circuit reductions given known constant values. However, what about unknown constant values? For the half-adder example, the input being applied can only be either a 1 or a 0. This could be described as a choice between two possible outcomes as follows:
Here %v denotes the value of the partially applied input using the T-Ruby syntax for a free variable. The T-Ruby system allows us to give values to such variables as a final stage before converting the specification into an implementable circuit. However, if we were to delay the resolving of this free variable, we could imagine a system which would prepare both possibilities for instantiation, leaving a suitable space in the circuit. The final choice could then be made at run-time by inserting the appropriate piece. This is similar to the template mechanism described for software. We call this late binding of the free variable. Extending this example, it can be shown that the full-adder circuit on partial application of one input can be reduced, as was shown in Figure 2, to two possibilities:
We can now consider the partial evaluation of a complete adder. An 8-bit adder can be constructed from a column of full-adders as follows:
add : ((nlist [8] (bit*bit))*bit) ~ (bit*(nlist [8] bit)) ;;
circuit(4) add = col 8 fa ;;
Note the type nlist [8] bit which indicates a vector of 8 bits. We use the Force2 combinator to apply a value to this adder circuit as follows:
In this example, the %v variable is an integer which we convert to an 8-bit vector with the intToBits combinator. The zip wiring combinator takes two bit vectors and pairs the individual bits together, suitable for input to the individual full-adder components. We can now demonstrate the partial evaluation of this circuit beginning as follows:
The first thing to note is that the variable %v will need to be separated into its individual bits for application to each full-adder in turn. We use a definition for intToBits which involves taking 8 copies of the input number and taking the i'th bit of each one, where mapf varies i from 1 to 8:
Note that mapf takes a combinator as its second parameter, in this case intToBit. We can now begin to push the free variable through the expression towards the full-adders:
It should be noted at this point that, as the p1~ and the Intr move further apart, complex and unnecessary wiring forms between them. The expression can be simplified by pushing the projection through to follow the Intr:
This gives us a column of full-adders, each specialised on a single bit of the input value, as might have been expected. However, we have arrived at this systematically using steps that can be formally proven.
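The bit-splitting step can be sketched in Haskell as follows. This is an analogue of the T-Ruby combinators, not their definition: the argument order of mapf, the 1-based indexing and the least-significant-bit-first ordering are assumptions made for the illustration.

```haskell
module Main where

import Data.Bits (testBit)

-- Apply an indexed function to a vector, with i running from 1 to n.
mapf :: Int -> (Int -> a -> b) -> [a] -> [b]
mapf n f xs = [f i x | (i, x) <- zip [1 .. n] xs]

-- Take the i'th bit of a number (i = 1 is the least significant bit).
intToBit :: Int -> Int -> Bool
intToBit i v = testBit v (i - 1)

-- Take 8 copies of the number and the i'th bit of the i'th copy.
intToBits :: Int -> [Bool]
intToBits v = mapf 8 intToBit (replicate 8 v)

main :: IO ()
main = print (intToBits 10)
-- prints [False,True,False,True,False,False,False,False]
```

Each element of the resulting vector is the constant that would be applied to one full-adder in the column.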
Implementation
Given a circuit specialised as shown above, how could this be implemented? The circuit specification first has to be converted into FPGA programming information. This would normally be done by converting it into a netlist and invoking a place-and-route system to find an FPGA layout based on the netlist. However, in order to find a placement for a circuit containing free variables, we must consider separate placements for the different possibilities. For the example above, we need to find placements for [or,xnor] and [and,xor]. These placements must have the same shape, which can be used as a place-holder while the rest of the circuit is placed and routed. At run-time, the appropriate piece can be configured into the space left by the place-holder. Figure 6 shows a possible implementation of a specialised full-adder with a place-holder (a), and the two possible specialisations (b) and (c).
Figure 6: An implementation of the specialised adder
Having produced such a placement, an application could request the run-time system to instantiate the circuit given the values of the free variables. The run-time system would evaluate the given variables in order to produce a complete placement which would be used to configure the FPGA device. If, at a later stage, the application changed one of the free variables, the run-time system would be able to compute the affected parts and partially reconfigure the FPGA to implement the new specialisation without needing to change the static portion of the circuit.
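The bookkeeping such a run-time system would need can be sketched as follows. This is a hypothetical model, not an interface to any real FPGA tool: a placement is represented simply as the list of pre-placed pieces chosen for each of the 8 place-holders, and the names Piece, placement and reconfigure are assumptions.

```haskell
module Main where

import Data.Bits (testBit)

-- The two pre-placed alternatives for each place-holder.
data Piece = PieceZero | PieceOne
  deriving (Eq, Show)

-- Instantiate the free variable: one piece per bit, bit 0 first.
placement :: Int -> [Piece]
placement v = [if testBit v i then PieceOne else PieceZero | i <- [0 .. 7]]

-- When the free variable changes, only the place-holders whose piece
-- differs need to be reconfigured; the static portion is untouched.
reconfigure :: Int -> Int -> [(Int, Piece)]
reconfigure old new =
  [ (i, p')
  | (i, p, p') <- zip3 [0 ..] (placement old) (placement new)
  , p /= p'
  ]

main :: IO ()
main = do
  print (reconfigure 10 11)  -- prints [(0,PieceOne)]
  print (reconfigure 10 10)  -- prints []
```

Changing the variable from 10 to 11 touches a single place-holder, which illustrates why partial reconfiguration can be much cheaper than reloading the whole circuit.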
Discussion and Further Work
To make partial evaluation worthwhile, it is important that the run-time system is able to instantiate specialised circuits very quickly in comparison to the length of time that the free variables are likely to remain constant. As this approach moves much of the computational overhead of partial evaluation into the compilation process, the run-time system described should be efficiently implementable. Besides the possible speed gains from specialising circuits, an important side-effect of partial evaluation is a reduction in the complexity of a circuit. This is very beneficial for FPGAs, where the physical resource in terms of gates and routing is typically very limited.
Using the T-Ruby system we have already done some stepwise partial evaluation transformations of simple circuits. We hope to extend this work to more complex circuits using automatic transformation strategies. The result of this work will be an "engine" capable of specialising a Ruby circuit specification containing free variables. T-Ruby requires these variables to be bound to values before it can produce output suitable for place and route, so the specialised circuit specification will need to be processed by a separate tool for run-time reconfiguration.
Unfortunately, current place and route systems lack the functionality required to produce a layout suitable for this form of run-time reconfiguration. However, newer place and route systems, such as the system used for programming Xilinx XC6200 devices (currently still in Beta release), are able to produce partial placements for specific parts of a circuit. It is possible that, using placement constraints, different possibilities could be separately placed meeting the same shape requirements. One of these placements could then be used as a place-holder for placing the rest of the circuit. We have already investigated using Ruby layout information to supply constraints to the XC6200 placement system, with promising results. Ideally we would like to implement a place and route system which works directly on Ruby circuit specifications, using the layout information to best effect for producing an FPGA placement.
Summary
We have presented here a systematic approach to producing specialised implementations of circuits for realisation on dynamic hardware based on the formal technique of partial evaluation. The process we describe involves partial evaluation that can be done statically at compile time based on the "promise" of circuit parameters. The required run-time mechanism is thus relatively simple and should introduce minimal delay into the circuit configuration process. The run-time mechanism can also re-configure the circuit for different parameters in an efficient manner as the affected parts of the circuit can be quickly computed, and only these parts re-configured. The resulting specialised circuits are less complex in terms of required FPGA cell and routing resources, allowing effectively larger circuits to be implemented. The reduced circuit is also likely to have a shorter critical path, allowing it to be executed faster.
We believe that the mechanism we describe provides a formal way of describing, understanding and implementing run-time re-configuration of dynamic hardware. 
