Abstract. This paper presents an overview of a prototype hardware compiler which compiles a design expressed in the Ruby language into FPGAs. The features of two important modules, the refinement module and the floorplanning module, are discussed and illustrated. Target code can be produced in various formats, including device-specific formats such as XNF or CFG, and device-independent formats such as VHDL. The viability of our floorplanning scheme is demonstrated by a compiler backend for Algotronix's CAL1024 FPGAs. The implementation of a priority queue is used to illustrate our approach.
Introduction
Compiling selected parts of application programs into hardware, such as FPGAs, has recently attracted much interest. This method holds promise of producing better special-purpose systems more rapidly than existing techniques. A number of hardware compilers (see, for example, [8], [11]) have been developed for compiling designs described in various languages into hardware netlists, which can then be mapped onto FPGAs by vendor software.
This paper presents an overview of two important modules, the refinement module and the floorplanning module, in a prototype compilation system. The system is based on Ruby [4], [9], a relational language for capturing block diagrams parametrically. There are mechanisms in Ruby for describing spatial and temporal iteration, allowing succinct and precise design specification. Moreover, the explicit representation of different forms of spatial iteration simplifies the production of layouts, and the declarative nature of the language allows designs to be refined by simple equational reasoning. Our aim is to exploit these features of Ruby to provide an efficient hardware compilation system.
The refinement module enables users to focus on the high-level structure of a design without being overwhelmed by details such as the size of individual datapaths. It is based on a constraint-propagation procedure. Given the size of inputs and a library of bit-level operators, it automatically constructs efficient low-level designs rapidly and in a provably-correct manner; this facilitates exploring architectures and evaluating the effects of different bit-level data representations.
Another important module, the floorplanning module, is devised to reduce the time to place and route a netlist produced by a hardware compiler. Since Ruby expressions carry information about the way a circuit can be assembled from primitive parts, our method is designed to exploit the structure of the source program in generating a layout. It is also possible for the user to guide the placement of components and to import layouts that are developed manually or by other tools. Much of our floorplanning procedure is syntax-directed and is therefore very efficient.
While our floorplanning scheme is largely device-independent, to demonstrate its viability a compiler backend has been developed for Algotronix CAL1024 FPGAs. The implementation of a priority queue will be used to illustrate this approach.
Ruby
Ruby is a language of functions and relations. It has been used in developing a wide range of designs including signal processing architectures [2] and butterfly networks [4], and it has also been used in producing implementations partly in hardware and partly in software [7]. Detailed descriptions of Ruby can be found, for instance, in [4] and [9].
In Ruby a design is captured by a binary relation $R$, which relates the interface signals $x$ and $y$ in the form $x\ R\ y$. For instance the max operator, which produces the maximum of two numbers, can be described by $\langle x, y\rangle\ \mathit{max}\ (\mathit{maximum}(x, y))$; so $\langle 3, 4\rangle\ \mathit{max}\ 4$ and $\langle 10, 6\rangle\ \mathit{max}\ 10$. The min operator for finding the minimum of two numbers can be described in a similar way. The identity relation id is given by $x\ \mathit{id}\ x$. To select or regroup components of composite data, there are wiring primitives such as fork, $\pi_1$ and rsh, given by $x\ \mathit{fork}\ \langle x, x\rangle$, $\langle x, y\rangle\ \pi_1\ x$ and $\langle x, \langle y, z\rangle\rangle\ \mathit{rsh}\ \langle\langle x, y\rangle, z\rangle$. To reflect a component along its trailing diagonal, we can use the converse operator, given by $x\ R^{-1}\ y \Leftrightarrow y\ R\ x$. Repeated compositions of $n$ copies of $Q$ can be described by $Q^n$ or $\mathit{map}_n\ Q$, so for instance fork [...]. Components with connections on four sides can be joined together by the beside and below operators; below (Figure 1c) is given by
$\langle\langle a, b\rangle, c\rangle\ (Q \updownarrow R)\ \langle p, \langle q, r\rangle\rangle \Leftrightarrow \exists s : (\langle a, s\rangle\ Q\ \langle p, q\rangle) \wedge (\langle b, c\rangle\ R\ \langle s, r\rangle).$
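The relational definitions above can be checked on small examples. The following is a minimal sketch, not part of the Ruby system: it assumes relations are modelled as Python predicates on a (domain, range) pair, and that the existential quantifier in below ranges over a finite, caller-supplied set of candidate values. All function names, including the helper relation swap, are hypothetical.

```python
from typing import Any, Callable, Iterable

# Assumption: a relation is a predicate taking (domain value, range value).
Rel = Callable[[Any, Any], bool]

def max_rel(x, y):
    """<a, b> max maximum(a, b)."""
    a, b = x
    return y == max(a, b)

def fork(x, y):
    """x fork <x, x>."""
    return y == (x, x)

def pi1(x, y):
    """<a, b> pi1 a."""
    return x[0] == y

def rsh(x, y):
    """<a, <b, c>> rsh <<a, b>, c>."""
    a, (b, c) = x
    return y == ((a, b), c)

def swap(x, y):
    """Hypothetical helper relation: <a, b> swap <b, a>."""
    return y == (x[1], x[0])

def converse(r: Rel) -> Rel:
    """Exchange the roles of domain and range."""
    return lambda x, y: r(y, x)

def below(q: Rel, r: Rel, candidates: Iterable[Any]) -> Rel:
    """<<a,b>,c> (Q below R) <p,<q,r>> iff some s satisfies both parts."""
    pool = list(candidates)  # finite search space for the hidden signal s
    def rel(x, y):
        (a, b), c = x
        p, (q_out, r_out) = y
        return any(q((a, s), (p, q_out)) and r((b, c), (s, r_out)) for s in pool)
    return rel

print(max_rel((3, 4), 4), max_rel((10, 6), 10))
print(below(swap, swap, range(10))(((1, 2), 3), (3, (1, 2))))
```

Predicates keep the sketch close to the notation $x\ R\ y$; a real implementation would more likely manipulate relations symbolically rather than search for the hidden signal.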
To deal with designs operating on time-varying data, a relation in Ruby can be considered to relate an infinite sequence of data in its domain to another infinite sequence in its range; elements in these infinite sequences can be regarded as values appearing at an interface at successive clock cycles. Given that $\forall t$ denotes "for all values of $t$", a squarer can be described by $x\ \mathit{sq}\ y \Leftrightarrow \forall t : y_t = x_t^2$. Latches are used in designs with feedback to prevent unbuffered loops. A design $R$ containing an internal feedback path $s$ can be modelled by the operator loop (Figure 1d):
$x\ (\mathit{loop}\ R)\ y \Leftrightarrow \exists s : \langle x, s\rangle\ R\ \langle s, y\rangle.$
Refinement
We can use Ruby to describe word-level designs, like the max or the min operator for integers. At bit-level, these operators can be built from logic gates which can also be captured in Ruby. The aim of our refinement system is to automatically produce the most efficient bit-level design from a high-level description.
Bit-level designs produced by the refinement system should satisfy constraints specified by the designer. Examples of constraints include the speed, size, latency and power consumption of a design, the maximum and minimum values of inputs and outputs, or a combination of the above. Of course, if the constraints are too strict, there may not be any bit-level design that satisfies them all. Our efforts so far have been concentrated on constraints specifying the maximum and minimum values of inputs for a circuit.
There may be many possible bit-level designs which can implement a given word-level design. Also each data representation (such as two's complement representation) will result in a specific family of bit-level implementations. The refinement system can refine a word-level design into several bit-level implementations, depending on the bit-level data representation.
The refinement module is based on a constraint-propagation algorithm. The maximum and minimum values of inputs are propagated across the circuit. For a given component, once all constraints on its inputs are known, the constraints on its outputs can be derived. Resolving the constraints fixes the size of the components and the width of the output data path. Given a library of parametrised bit-level operators and their sizes, our constraint-propagation procedure can be used to determine the widths of all the data paths. A bit-level Ruby design can then be constructed. As an example, consider a priority queue which can be specified in Ruby as follows.
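[Ruby program for the priority queue, expressions (1) to (6): the array of length 4 (1) built from the repeating unit pqcell (2), followed by the definitions of pqcell (3), sort2 (4), mux2 (5) and fork2 (6).]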
Let us briefly introduce the correspondence between the Ruby program and the pictorial description of the priority queue; further details about possible designs and their development can be found in [9]. The Ruby description of the word-level design (Figure 2) is shown above. The design is implemented as a linear array of a repeating unit pqcell (expression 2), and the length of the array is 4 (expression 1). The repeating unit pqcell (expression 3) consists of three parts: an insertion sorter cell sort2 (expression 4), a selection unit mux2 (expression 5) and a data distribution unit fork2 (expression 6). There is an internal path in pqcell where the minimum output of the sorter is fed back while the maximum value is output to the next cell (expression 3). A latch (shown as a small triangle) is placed at the top of the feedback path, and it is initialised to the value 127.
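The sorting behaviour of the array described above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the authors' design: it models only the sort2 part of each pqcell and its feedback latch, omits the mux2 selection and fork2 distribution units, and assumes that a newly inserted value ripples through all four cells in a single step. The names `insert`, `LATCH_INIT` and `N_CELLS` are hypothetical.

```python
from typing import List, Tuple

LATCH_INIT = 127  # initial latch value stated in the text
N_CELLS = 4       # length of the array stated in the text

def sort2(a: int, b: int) -> Tuple[int, int]:
    """Return (minimum, maximum) of two values."""
    return (min(a, b), max(a, b))

def insert(latches: List[int], value: int) -> List[int]:
    """Push one value through the array and return the new latch contents."""
    updated = []
    carry = value
    for stored in latches:
        lo, hi = sort2(carry, stored)
        updated.append(lo)  # the minimum is fed back into this cell's latch
        carry = hi          # the maximum is passed on to the next cell
    return updated          # the final carry falls off the end of the array

queue = [LATCH_INIT] * N_CELLS
for v in (5, 3, 9):
    queue = insert(queue, v)
print(queue)  # [3, 5, 9, 127]
```

Initialising every latch to 127, the largest permitted input, makes an empty cell behave as if it already held the largest possible value, so real data always settles in front of it.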
Suppose the constraint specified by the designer is that the input data are natural numbers no larger than 127. Given that a bit is either T (True) or [...] max, min and the multiplexer muxr 2 operating on unsigned integers. The bit-level implementation of the priority queue is shown in Figure 3. Notice that the big triangle D 7 represents seven D latches in parallel.
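The width of seven bits follows from the stated input constraint. The sketch below illustrates the idea of propagating minimum and maximum values through max and min operators and deriving an unsigned width from the result; it is an assumed reconstruction of the principle, not the refinement module itself, and the names `max_range`, `min_range` and `unsigned_width` are hypothetical.

```python
from typing import Tuple

Range = Tuple[int, int]  # (minimum, maximum), both inclusive

def max_range(a: Range, b: Range) -> Range:
    """Range of max(x, y) when x lies in a and y lies in b."""
    return (max(a[0], b[0]), max(a[1], b[1]))

def min_range(a: Range, b: Range) -> Range:
    """Range of min(x, y) when x lies in a and y lies in b."""
    return (min(a[0], b[0]), min(a[1], b[1]))

def unsigned_width(r: Range) -> int:
    """Number of bits needed to hold every value of r as an unsigned integer."""
    lo, hi = r
    if lo < 0:
        raise ValueError("unsigned representation needs a non-negative range")
    return max(1, hi.bit_length())

data_in: Range = (0, 127)  # designer's constraint: naturals no larger than 127
latch: Range = (0, 127)    # the latch starts at 127 and only stores sorter outputs

to_next_cell = max_range(data_in, latch)
fed_back = min_range(data_in, latch)

print(unsigned_width(to_next_cell), unsigned_width(fed_back))  # 7 7
```

Because both sorter outputs stay within 0 to 127, the same constraint reaches every cell of the array, which is consistent with each latch being seven bits wide.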
There are compiler backends for converting a bit-level description into various formats, such as XNF (Xilinx Netlist Format) or VHDL. The physical mapping onto FPGAs can then be carried out using commercial tools. An alternative implementation path will be sketched in the next section.
Floorplanning
A major bottleneck in automatic hardware synthesis is the time to place and route the netlist produced by a hardware compiler. The aim of our floorplanning module is to expedite the placement and routing procedure by exploiting the structure of the source descriptions. To achieve high quality layouts, our floorplanning scheme includes facilities which allow combination of layouts produced both automatically and manually.
The floorplanning procedure consists of two phases. The first phase is the global placement and routing, which is mainly device-independent. In this phase a design is modelled as a rectangular block with connecting points on its four sides. Our floorplanning scheme allows the variation of block sizes, so that connecting positions between two adjacent blocks match each other to minimise the routing between them. In the second phase, the detailed routing within the blocks and their interface will be determined.
Consider first the global placement and routing phase. A design in Ruby is represented by a binary relation, while in pictorial form it is modelled as a rectangular block. A convention is required for assigning the domain and range variables of a relation to each side of the block; this step is known as direction assignment. The following convention is chosen: the domain data will be mapped onto the western or northern side, while the range data will be mapped onto the southern or eastern side [4].
Following this convention, the layout of a relation with its domain in the form of a two-tuple $\langle x, y\rangle$ can be a block with $x$ on the western side and $y$ on the northern side, or both $x$ and $y$ on either the western or the northern side. Similarly, the layout of a relation with its range in the form of a two-tuple can be a block with some of its connecting points on the southern side and some on the eastern side, or all of them on either the southern or the eastern side. One can show that, for a relation with both its domain and range in the form of a two-tuple, there are nine possible layouts [3]. The choice of which layout to adopt is determined by context or by a default convention. For instance some combinators in Ruby carry contextual information about possible direction assignment; the below combinator requires two of its domain and two of its range connections to be horizontal (Figure 1c).
After direction assignment, we check the compatibility of the interfaces between connected components. Since polymorphism is allowed in the domain and range of some Ruby primitives such as fork, a simple structure comparison is insufficient. Instead a general unification algorithm was used to determine the most general substitution for the domain and range components, so that the interface constraints can be satisfied.
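The count of nine layouts for a relation with a two-tuple domain and a two-tuple range can be reproduced by enumeration. The sketch below is only an illustration of the counting argument: the side labels and the helper `side_options` are hypothetical, and which range component goes south and which goes east in the split case is an assumption.

```python
from itertools import product
from typing import List, Tuple

Placement = Tuple[str, str]  # side assigned to (first, second) tuple component

def side_options(side_a: str, side_b: str) -> List[Placement]:
    """Ways to place a two-tuple when only two sides are permitted."""
    return [
        (side_a, side_b),  # components split across the two sides
        (side_a, side_a),  # both components on the first side
        (side_b, side_b),  # both components on the second side
    ]

# Convention from the text: domain on west/north, range on south/east.
domain_options = side_options("W", "N")
range_options = side_options("S", "E")  # split order is an assumption

layouts = list(product(domain_options, range_options))
for dom, rng in layouts:
    print("domain", dom, "range", rng)
print(len(layouts))  # 9
```

Three choices for the domain and three for the range are independent, which gives the nine combinations; context such as the below combinator then rules some of them out.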
Sometimes information on the direction of signal flow is necessary for certain devices, such as the cells used in Algotronix's CAL1024. In these cases we apply a constraint-propagation algorithm to determine the direction of signal flow for each Ruby wiring construct.
The placement stages of our floorplanning system are not time-consuming because we exploit the structure of Ruby programs for placement. If we want to include a circuit which has been placed and routed manually or by other tools, we need to specify its size and the connection positions. The interface between the original and the imported layouts can then be produced by the compiler. A pair of curly braces is employed in the source Ruby program to indicate which part of the circuit should be laid out separately. The right curly brace is followed by a pair of parentheses which enclose the name of the manual layout file, so that the compiler can import this part of the layout and link it with others.
Further descriptions of our syntax-guided placement technique can be found in [3].
Device-Specific Mapping
While our approach to global placement and routing is largely device-independent, the detailed placement and routing flattens each block produced after global placement and routing, and it requires information specific to a particular device. To demonstrate the viability of our floorplanning scheme, a compiler backend has been customised for CAL1024 FPGAs developed by Algotronix (now Xilinx Development Corporation).
CAL1024 arrays are orthogonally connected structures obtained by replicating a basic cell which has one input port and one output port on each of its four sides. An input port can be programmed to connect to one or more output ports, or to a function unit which can be programmed to behave as a two-input combinational logic gate or as a latch. The output of this function unit may also connect to one or more output ports. Hence a CAL cell may be used to perform processing and routing simultaneously. Figure 4 shows a CAL cell with its northerly output connected to its easterly input, and its easterly output is the Boolean conjunction of its westerly and northerly inputs.
During global placement and routing, two kinds of blocks are produced: blocks for combinational primitives such as AND, and blocks for wiring primitives like fork. For combinational primitive blocks, we have developed a simple river routing algorithm to connect the connecting points on the four sides of the block to the cell performing the logic function of the primitive. A simple switch-box routing algorithm has also been devised to implement the detailed routing for the wiring blocks. The output of the floorplanner is a program in OAL [6], a variant of Ruby specialised for CAL devices. The OAL compiler can then be used to generate CFG files used for FPGA programming.
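The cell configuration described for Figure 4 can be modelled in a few lines. This is a behavioural sketch based only on the description in this section, not on the CAL1024 data sheet: the class `CalCell`, its fields and the marker `FUNC` are hypothetical, only combinational function units are covered, and the latch mode is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

FUNC = "F"  # hypothetical marker: this output is driven by the function unit

@dataclass
class CalCell:
    """One cell: four input sides, four output sides, one function unit."""
    func: Optional[Callable[[bool, bool], bool]] = None
    func_inputs: Tuple[str, str] = ("W", "N")  # sides feeding the function unit
    routes: Dict[str, str] = field(default_factory=dict)  # output side -> source

    def evaluate(self, inputs: Dict[str, bool]) -> Dict[str, bool]:
        """Compute the driven outputs from the values on the four inputs."""
        f_out = None
        if self.func is not None:
            a, b = (inputs[side] for side in self.func_inputs)
            f_out = self.func(a, b)
        return {
            out: (f_out if src == FUNC else inputs[src])
            for out, src in self.routes.items()
        }

# Figure 4 as described: north output takes the east input, and the
# east output is the conjunction of the west and north inputs.
cell = CalCell(
    func=lambda a, b: a and b,
    func_inputs=("W", "N"),
    routes={"N": "E", "E": FUNC},
)
print(cell.evaluate({"N": True, "E": False, "S": False, "W": True}))
```

The example shows one cell doing routing (east input to north output) and processing (the AND gate) at the same time, which is the property the river routing and switch-box routing steps rely on.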
Although the floorplanner can perform the placement and routing fully automatically, the quality of the final implementation may be inferior to one produced by hand or by other tools. It is our intention to give the designer the flexibility to use our compiler for global placement and routing, while part of or all of the detailed placement and routing can be produced by other means. For instance, a designer may wish to develop by hand the repeating unit of an array-based circuit, since any inefficiency in the basic cell will be multiplied many times. The compiler can incorporate existing CAL designs into the implementation according to the annotations specified by the designer in the source program, as described in section 4.
Consider a priority queue implementation obtained by optimising the bit-level design in section 3 (see [9]). The bit-level repeating unit (Figure 5) was developed by hand and is highly optimised; this unit is then replicated vertically to form a column which corresponds to the core of a pqcell in Figure 3. The CAL implementation of the priority queue is shown in Figure 6. Note that the number and order of the interface connections correspond to those in Figure 3, except that the two bottom outputs of the rightmost pqcell are discarded.
Fig. 6. CAL implementation of a bit-level priority queue (n = 4, m = 7).
Future Work
In the refinement module of our compilation system, we have focused on constraints specifying the maximum and minimum values of inputs for a word-level circuit. Our method can be extended to take into consideration other kinds of constraints: examples include critical path, latency or the number of a particular component. If no solutions exist that satisfy all user-specified constraints, we can choose the solution that satisfies most of the high-priority constraints.
The CAL backend of our compiler demonstrates the viability of our floorplanning module. We have not, however, optimised the switch-box routing or the river-routing algorithms, and the layouts produced automatically can become rather large. For better results, we can use methods like min-cut or simulated annealing hierarchically in placement and routing [10]. Device-specific compaction techniques should also be studied.
Much of our method for generating layouts is syntax-directed. The quality of the compiled implementation depends largely on the Ruby source program which describes the design; therefore source transformation can be adopted for optimisation. One way to automate this step is to have an accurate performance estimation procedure to drive the transformation engine.
It will also be interesting to extend our work to support partial and runtime reconfiguration of FPGAs, to support developing multi-chip systems, and to support implementing asynchronous and self-timed designs [1].
