This paper describes an integrated system for developing regular array designs based on the block description language Ruby. Ruby supports concise design description and formal verification. A parametrised Ruby description can be used in simulating, refining and visualising designs, and in compiling hardware implementations such as field programmable gate arrays. Our system enables rapid design production, while good design quality is achieved by (a) the efficient instantiation of device-specific libraries, (b) the size optimisation of bit-level components using the design refiner, and (c) the exploitation of regularity information at source level in the library composition process. The development and implementation of several median filters are used to illustrate the system.
Introduction
A regular array design is a circuit consisting of processing elements connected to their neighbours in a regular manner. Regular array designs have been widely used in signal and image processing, multimedia and communication systems [17, 28]. Many designers recognise the importance of developing high performance regular array designs rapidly and cheaply, particularly for systems-on-a-chip applications or embedded systems where demanding computations have to be implemented efficiently.
We believe that the key to effective development of regular array designs is to use descriptions and support tools that can fully exploit their characteristics, such as regularity and spatial and temporal locality, for their specification, validation and implementation. Most VLSI CAD tools based on standard hardware description languages such as VHDL and Verilog, however, do not make the best use of these characteristics. A VHDL description of a regular array design, for example, is usually larger than it needs to be, and takes a long time to code, debug and modify. Besides, VHDL does not provide an efficient mechanism for capturing the regular layout of a regular array design, and final implementations often rely on an automatic placement and routing tool to generate physical layouts which may not preserve the inherent regularity in the original description. Consequently the placement and routing process is often time-consuming, and does not always result in area and time efficient designs. Furthermore VHDL is a complex language, and formal verification of VHDL designs is often difficult.
The objective of the work described in this paper is to build a system in which regular array designs can be efficiently captured, refined, validated and implemented. The system should significantly reduce design effort and shorten design cycles. To meet this objective, we have developed an integrated system which has the following main features:
1. The system is based on a simple and powerful notation for representing regular array designs. Both architecture and behaviour can be captured in a single parametrised description, and both high-level (such as word-level) and low-level (such as gate-level) aspects of a design can be described in a uniform framework.
2. The system facilitates graphical representations which can be generated automatically from high-level descriptions. It provides useful visual feedback which enables designers to rapidly obtain an overview of a design, to locate specific parts on which they can focus, and to obtain intuitions for design optimisation and verification.
3. The system supports a hierarchical design paradigm: there is a refiner that automatically produces, from high-level descriptions, efficient low-level designs that satisfy user-given constraints. This enables designers to focus on the high-level architectural aspects without being overwhelmed by low-level details.
4. There are also high-level synthesis tools such as hardware compilers to reduce design time and complexity. As a result, designers can spend more time on exploring alternative designs and on evaluating the effects of different data representations.
5. Our system provides a variety of ways for design validation, depending on the level of detail, generality and confidence in design correctness. It supports design visualisation, mixed bit-level, numerical and symbolic simulation as well as algebraic transformation.
6. Finally, the system is based on a declarative language which supports design optimisation by correctness-preserving transformations.
This framework offers users confidence in the correctness of its optimisations, since their validity can be checked by calculation and proof. The declarative language that we use, Ruby, was invented by Sheeran [38] and further developed by Jones [13–15], Luk [20–23] and others [36, 37, 39]. The main reasons that we have chosen Ruby rather than other hardware description languages such as VHDL [6, 31] are as follows:
· Ruby supports succinct design capture. A circuit, especially one with a regular structure, can be described more concisely in Ruby than in other hardware description languages such as VHDL. The regularity information can be exploited in the synthesis and validation processes.
· Ruby can capture both the behaviour and the architecture of a design within a single notation and in an elegant manner; this is seldom the case with other languages.
· Ruby has a well-understood semantics. Consequently it is easy to document design decisions, to refine a design, to compare alternative designs, and to demonstrate the correctness of an implementation.
Various median filters will be used to illustrate these points.
While several theories for regular array design [28] have been under development for over ten years, the contribution described in this paper appears unique in providing an integrated system for producing such designs, particularly in dealing with practical bit-level implementation aspects. The following reviews some other tools for automatic synthesis of regular array designs from high-level descriptions.
ALPHA is a system with a functional language based on the formalism of affine recurrence equations [7, 40, 41]. Regular array designs are derived through formal, correctness-preserving transformations applied to programs in ALPHA. The resulting design is specified in a variant form of ALPHA known as ALPHA0, which can be used for generating a netlist, and the netlist can then be implemented using a commercial CAD tool. ALPHA has shown its strength in automatic synthesis of regular array designs, especially for applications related to affine problems. Our work is complementary to ALPHA in that Ruby is best for designs conveniently captured using binary relations, while ALPHA is best for designs conveniently captured using affine recurrence equations.
Another tool based on a functional language for hardware development is Lava [4]. Lava exploits polymorphism and higher order functions in the functional language Haskell [3] to capture design regularity, which results in abstract and general descriptions. Functional language features such as type classes and monads have been exploited to implement standard circuit analyses such as simulation, formal verification and the generation of VHDL and EDIF for producing real circuits. So far, Lava does not include automatic refinement facilities or visual aids as our system does.
Singh prototyped a hardware compiler, known as the Glasgow Ruby Compiler, which compiles a dialect of Ruby into FPGA devices [39]. The version of Ruby which Singh has adopted is specialised for describing FPGAs, and uses a different convention from the one used in this paper. Layout information is explicitly specified in his Ruby expressions, which are similar to those in the OAL language [26] that our descriptions are compiled to (see Section 3.5 for further details of our hardware compiler). A design sketcher [5] has also been designed as part of the Glasgow Ruby Compiler. The sketcher differs from ours in that it is based on an interface conversion scheme, and the diagrams produced seem to contain more jogs.
Sharp and Rasmussen have been working on the T-Ruby design system for VLSI circuits starting from a high-level, mathematical specification of their behaviour [37]. The T-Ruby system provides facilities to perform design transformation and simulation, to prove the correctness of a design and to translate the Ruby descriptions into VHDL for synthesis by a commercial tool. T-Ruby, at present, does not include facilities such as an automatic refiner, design sketcher or visualiser.
Li and Leeser [18] have developed the HML system based on the functional language SML [30]. HML supports advanced type checking and type inference techniques to verify hardware design rules, and designs can be translated into a synthesisable subset of VHDL. Unlike our approach, HML does not exploit regularity in design implementation, and no visualisation facilities have been reported so far.
The above overview is intended to provide an outline of related work, and is not meant to be exhaustive. For instance, many useful theoretical and practical results have been reported in the series of Conferences on Application-specific Array Processors (in which [7] appears). However, it is not always clear whether the tools described in related publications have been used in developing actual hardware implementations for working systems. We have used the tools presented in this paper for two FPGA-based systems: CHS2x4 [1] and Riley [16]. Further details can be found in Sections 3.5 and 4.7.
The paper is structured as follows. Section 2 presents a brief introduction to Ruby. An overview of the integrated system is described in Section 3. Section 4 presents the development of several regular median filter designs as a case study. Conclusions are drawn in Section 5.
The short introduction to Ruby in Section 2 is written mainly for readers who have a background in declarative languages, particularly those supporting higher order functions such as Haskell [3] and ML [30]. We have, however, separated the details involving the Ruby language from the essential ideas. Readers should be able to appreciate the main features of our system in Section 3 and the block descriptions in Section 4 (using Tables 1–5 in Sections 4.3–4.6 to understand the function and connectivity of the blocks), without following the intimate details of the Ruby expressions.
Introduction to Ruby
In this section, we provide a brief introduction to Ruby, the language used in our system for parametrised description of block diagrams. Our major focus will be on the definitions and concepts relevant to this paper; further details about the background and the theoretical aspects of Ruby can be found elsewhere [13, 21, 37].
In Ruby a design is captured by a binary relation $R$, which relates the interface signals $x$ and $y$ in the form $x \; R \; y$. For instance the max operator, which produces the maximum of two numbers, can be described by
$$\langle x, y \rangle \; \mathit{max} \; z,$$
where $z = \mathrm{maximum}(x, y)$. So $\langle 3, 4 \rangle \; \mathit{max} \; 4$ and $\langle 10, 6 \rangle \; \mathit{max} \; 10$.
The min operator for finding the minimum of two numbers can be described in a similar way.
Combinational primitives
There are two kinds of combinational primitives in Ruby: wiring primitives and computation primitives.
Wiring primitives select or regroup components of composite data. The simplest wiring primitive is id, the identity relation, defined by:
$$x \; \mathit{id} \; y \Leftrightarrow x = y.$$
Other common wiring primitives are de®ned as follows:
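$$x \; \mathit{fork} \; \langle y, z \rangle \Leftrightarrow x = y = z,$$
$$\langle x, y \rangle \; \mathit{and} \; z \Leftrightarrow z = x \land y, \qquad (11)$$
$$\langle x, y \rangle \; \mathit{or} \; z \Leftrightarrow z = x \lor y, \qquad (12)$$
$$\langle x, y \rangle \; \mathit{xor} \; z \Leftrightarrow z = (\lnot x \land y) \lor (x \land \lnot y), \qquad (13)$$
$$x \; \mathit{not} \; y \Leftrightarrow y = \lnot x. \qquad (14)$$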
Series and parallel composition
The central idea in Ruby is that complex designs can be formed by composing simpler designs, using combinators which are higher order functions describing common patterns of computation. For instance, two components R and S with a compatible interface can be put together by (;) to form a composite design R ; S (Fig. 1(a)):
$$x \; (R \,;\, S) \; y \Leftrightarrow \exists t : (x \; R \; t) \land (t \; S \; y),$$
where $x$ is the domain and $y$ is the range of (R ; S). Clearly, the domain of (R ; S) is that of R, its range is that of S, and the range of R is connected to the domain of S. The $\exists$ symbol means that, unlike $x$ and $y$, $t$ is not an interface signal of the composite and cannot be observed.
In general, we can describe the composition of n identical copies of a circuit R by repeated series composition $R^n$ (Fig. 2(a)), where
$$R^0 = \mathit{id}, \qquad R^{n+1} = R^n \,;\, R,$$
and id, the identity relation, has been defined in Section 2.1. If there are no connections between R and S, the composite is represented by parallel composition [R, S] (see Fig. 1(b)), where
$$\langle x_1, x_2 \rangle \; [R, S] \; \langle y_1, y_2 \rangle \Leftrightarrow (x_1 \; R \; y_1) \land (x_2 \; S \; y_2).$$
Using id, we have the definitions of the following operators:
$$\mathit{fst}\, R = [R, \mathit{id}], \qquad (18)$$
$$\mathit{snd}\, R = [\mathit{id}, R]. \qquad (19)$$
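The combinators above can be tried out with a small executable model. The following is a minimal sketch in Python, assuming that every block is functional (each domain value determines exactly one range value); Ruby itself is relational, so this covers only a special case. The helper names `series`, `par`, `power`, `max2`, `min2` and `inc` are hypothetical and chosen for illustration only.

```python
from typing import Any, Callable

# Assumption: a block is modelled as a function from its domain to its range.
Block = Callable[[Any], Any]


def ident(x: Any) -> Any:
    """id: the identity wiring primitive."""
    return x


def series(r: Block, s: Block) -> Block:
    """R ; S: the range of R is connected to the domain of S."""
    return lambda x: s(r(x))


def par(r: Block, s: Block) -> Block:
    """[R, S]: R and S placed together with no connection between them."""
    return lambda xy: (r(xy[0]), s(xy[1]))


def fst(r: Block) -> Block:
    """fst R = [R, id]."""
    return par(r, ident)


def snd(r: Block) -> Block:
    """snd R = [id, R]."""
    return par(ident, r)


def power(r: Block, n: int) -> Block:
    """R^n: n copies of R in series, with R^0 taken to be id."""
    result = ident
    for _ in range(n):
        result = series(result, r)
    return result


def max2(xy):
    """<x, y> max z, where z is the maximum of x and y."""
    return max(xy)


def min2(xy):
    """<x, y> min z, where z is the minimum of x and y."""
    return min(xy)


def inc(x):
    """A hypothetical one-input block that adds one."""
    return x + 1


print(series(max2, inc)((3, 4)))           # max feeds inc
print(par(max2, min2)(((3, 4), (10, 6))))  # two independent blocks
print(power(inc, 3)(0))                    # three copies of inc in series
```

Because the model is functional, it cannot express inverse or other genuinely relational constructs; it is intended only as an aid to reading the definitions.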
Repeated parallel compositions of n copies of R are described by $\mathit{map}_n\, R$ (Fig. 2(b)).
A triangular-shaped array is described by $\triangle_n R$ (Fig. 2(c)).
Other combinators
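Inverse is defined by
$$x \; R^{-1} \; y \Leftrightarrow y \; R \; x.$$
It can be treated as a reflected version of R. Two rectangular components with connections on every side can be assembled together by the beside ($\leftrightarrow$) and the below ($\updownarrow$) operators such that:
$$\langle a, \langle b, c \rangle \rangle \; (R \leftrightarrow S) \; \langle \langle p, q \rangle, r \rangle \Leftrightarrow \exists t : \langle a, b \rangle \; R \; \langle p, t \rangle \land \langle t, c \rangle \; S \; \langle q, r \rangle, \qquad (23)$$
$$\langle \langle a, b \rangle, c \rangle \; (R \updownarrow S) \; \langle p, \langle q, r \rangle \rangle \Leftrightarrow \exists t : \langle b, c \rangle \; S \; \langle t, r \rangle \land \langle a, t \rangle \; R \; \langle p, q \rangle. \qquad (24)$$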
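Note that we adopt the convention that signals on the western and the northern sides are mapped onto the domain, and signals on the southern and the eastern sides are mapped onto the range [10]. Block diagrams for beside and below are shown in Fig. 3(a) and (b), respectively. $\mathit{row}_n$ is the generalisation of beside, which is defined as follows (Fig. 4(a)): given that $x = \langle x_0, \ldots, x_{n-1} \rangle$ and $y = \langle y_0, \ldots, y_{n-1} \rangle$ are two n-tuples of signals and R is a relation, $\mathit{row}_n$ is defined by
$$\langle a, x \rangle \; (\mathit{row}_n\, R) \; \langle y, b \rangle \Leftrightarrow \exists s : (s_0 = a) \land (s_n = b) \land \forall i : 0 \le i < n .\; \langle s_i, x_i \rangle \; R \; \langle y_i, s_{i+1} \rangle.$$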
Similarly, given that $x = \langle x_0, \ldots, x_{n-1} \rangle$ is an n-tuple of signals and R is a relation, $\mathit{rdl}_n$ (Fig. … where $x$ and $y$ are time sequences containing data at successive clock cycles; this describes the behaviour of a latch. It has been shown that the Ruby laws for the static interpretation, such as those in Section 4.5, are also valid in the time sequence interpretation [13]. Latches are also used in serial circuits to prevent unbuffered loops in the feedback path. A design R containing an internal feedback path $s$ can be modelled by the loop operator loop (see Fig. 4(c)):
$$x \; (\mathit{loop}\, R) \; y \Leftrightarrow \exists s : \langle x, s \rangle \; R \; \langle s, y \rangle. \qquad (27)$$
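The beside, below and row combinators can be read as rules for threading a hidden signal between neighbouring cells. The following minimal Python sketch models them for functional cells only. It assumes that row threads a carry signal from the leftmost cell to the rightmost cell, as a generalisation of beside; the cell `compare_exchange` and all function names are hypothetical.

```python
def beside(r, s):
    """R beside S: the hidden signal t passes from R (left) to S (right)."""
    def block(inp):
        a, (b, c) = inp
        p, t = r((a, b))
        q, out = s((t, c))
        return ((p, q), out)
    return block


def below(r, s):
    """R below S: the hidden signal t passes from S to R, as in equation (24)."""
    def block(inp):
        (a, b), c = inp
        t, out = s((b, c))
        p, q = r((a, t))
        return (p, (q, out))
    return block


def row(cell):
    """row_n R for functional cells: n is taken from the length of the tuple x."""
    def block(inp):
        carry, xs = inp
        ys = []
        for x in xs:
            y, carry = cell((carry, x))
            ys.append(y)
        return (tuple(ys), carry)
    return block


def compare_exchange(pair):
    """Hypothetical cell: smaller value leaves downwards, larger value moves on."""
    a, x = pair
    return (min(a, x), max(a, x))


# A row of compare-exchange cells pushes the largest value to the far end.
print(row(compare_exchange)((5, (3, 8, 1))))
# With two cells, row agrees with beside.
print(beside(compare_exchange, compare_exchange)((5, (3, 8))))
```

The sketch fixes a direction of data flow (domain to range), which the relational definitions do not require.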
3. The integrated system
Overview
Fig. 5 shows an overview of the integrated system that we have implemented. The system consists of two parts. The first part includes an optimising transformer, a numeric/symbolic simulator and a performance analyser. The second part includes a Ruby refiner, a diagram sketcher, a design visualiser and a Ruby hardware compiler (labelled * in Fig. 5). Detailed descriptions of the first part can be found in [19, 27]; this paper focuses on the second part, although the entire system is outlined as follows:
· The optimising transformer provides assistance to optimise a high- or low-level specification according to user requirements on area and performance.
· The numeric/symbolic simulator performs numerical, gate-level and symbolic simulation.
· The performance analyser assesses the characteristics of a design, such as the number of a specific component, the critical path delay and the latency.
· The refiner produces a bit-level design from a high-level design, given the constraints on the input and/or output of a design.
· The Ruby sketcher produces design diagrams from the specification (high-level or low-level).
· The visualiser supports visualisation of both the behaviour and structure of a design by combining the facilities of the simulator and the sketcher.
· The hardware compiler compiles a Ruby program into hardware such as FPGAs.
There are three ways to check the correctness of a design with our integrated system: (1) formal verification, (2) numerical/symbolic simulation, and (3) design visualisation. We shall not describe formal verification tools [37] or numerical/symbolic simulation tools [19] in this paper, since they have been covered elsewhere. We shall focus on using the design visualiser to validate our designs; relevant material will be presented in Sections 3.3 and 4.4.
Let us now look at the intended design flow supported by our integrated system. A design starts with its high-level specification. It is then validated using the numeric/symbolic simulator or the visualiser. Next, the performance of the design can be analysed using the performance analyser, and the layout of the design can be obtained using the sketcher. If the design is not satisfactory, the optimising transformer may be employed to optimise the design. This process can be iterated several times until a satisfactory design is obtained. Once a satisfactory high-level design is obtained, a low-level design can be produced using the refiner. Like the high-level design, the low-level design can be further improved using the tools in the system. Finally, the resulting hardware design can be directly implemented using the hardware compiler.
The following sections describe the features of the sketcher, the visualiser, the refiner and the hardware compiler.
The sketcher
The sketcher produces design diagrams from Ruby programs. Since regular patterns are captured using combinators and computation patterns are explicitly represented in Ruby, the sketching procedure is largely syntax-directed and hence efficient. Furthermore, a simple sketching scheme has been developed which allows the component sizes to vary, so that connection positions can be adapted to align the inter-connecting wires either to the horizontal or to the vertical. Consequently, the sketcher can produce design diagrams with a minimum number of jogs, which minimises the effort for routing and improves the comprehensibility of the diagrams produced. The sketcher also includes facilities for drawing particular parts of a circuit and for producing layouts to a specified level of detail.
Since the diagram generation process is syntax-directed, the quality of the diagram depends largely on the Ruby source program. To optimise the quality of the diagram, there is a module in the sketcher which performs source-level transformation. An example of the source-level transformation is shown in Fig. 6. Fig. 6(a) is the diagram of the Ruby expression $\mathit{row}_2\, A \,;\, B \,;\, Q \,;\, C \,;\, D$ produced without source-level transformation, while Fig. 6(b) is the diagram of the same Ruby expression generated after source-level transformation. For this example, given the relation $[-]$ such that $x \; [-] \; \langle x \rangle$, the following Ruby law can be used to optimise the design to become the one in Fig. 6(b):
$$\mathit{row}_n\, A \,;\, B \,;\, Q \,;\, C \,;\, D = \mathit{row}_n\, A \,;\, B \,;\, \mathit{snd}\,[-] \,;\, \mathit{row}_1\, Q \,;\, \mathit{fst}\,[-]^{-1} \,;\, C \,;\, D. \qquad (28)$$
More Ruby laws have been identified for layout optimisation and the details can be found in Ref. [8].
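The law used for Fig. 6 replaces a cell Q by a one-cell row of Q, wrapped by wiring that forms and removes a one-element tuple. A quick sanity check of this rewriting can be sketched in Python for functional cells; this is a test on sample values and not a proof. The cell `q` and the names `wrap` and `unwrap` are hypothetical, and `row` assumes a carry threaded from left to right.

```python
def series(*blocks):
    """Series composition of any number of functional blocks."""
    def composed(x):
        for block in blocks:
            x = block(x)
        return x
    return composed


def fst(r):
    """Apply r to the first component of a pair."""
    return lambda xy: (r(xy[0]), xy[1])


def snd(r):
    """Apply r to the second component of a pair."""
    return lambda xy: (xy[0], r(xy[1]))


def row(cell):
    """Row of functional cells with a carry threaded from left to right."""
    def block(inp):
        carry, xs = inp
        ys = []
        for x in xs:
            y, carry = cell((carry, x))
            ys.append(y)
        return (tuple(ys), carry)
    return block


def wrap(x):
    """Wiring that relates x to the one-element tuple <x>."""
    return (x,)


def unwrap(t):
    """Inverse wiring: relates <x> back to x."""
    (x,) = t
    return x


def q(inp):
    """Hypothetical two-input, two-output cell used only for the check."""
    a, x = inp
    return (a + x, a * x)


# Right-hand side of the rewriting: snd wrap ; row_1 Q ; fst unwrap
rewritten = series(snd(wrap), row(q), fst(unwrap))

samples = [(0, 0), (2, 5), (-3, 7)]
law_holds = all(q(s) == rewritten(s) for s in samples)
print(law_holds)
```

Both sides describe the same behaviour; the rewritten form differs only in the structure that a syntax-directed sketcher would see.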
Like most tools in our system, the design sketcher is written in the language SML [30]. There is a parser for converting expressions in concrete syntax to the corresponding internal representations in abstract syntax. Other main modules include:
· a preprocessor for source-level transformation;
· a type checker based on the unification algorithm [32] to ensure that interconnected blocks have a compatible interface;
· a mode manager that decides the level of detail at which the layout should be drawn, according to the source Ruby expression;
· a placer which produces a hierarchical placement of the layout;
· a sizer which adds dimensions and connection points to the description of primitive cells;
· a router which ensures that the connections between adjacent cells are joined together properly;
· an output generator, which takes the result from the router and generates diagrams in formats such as LaTeX and Tcl [29].
The visualisation system
The visualisation system is a graphics-based tool that integrates the visualisation of design behaviour and structure by combining the sketcher and simulation facilities.
The main modules in our system are shown in Fig. 7. There is a graphical user interface for convenient interaction between the user and the system, which does not appear in the figure for the sake of clarity. The Ruby program to be visualised and simulation data are first supplied to the visualisation system as input. The sketcher produces a circuit schematic and a netlist which specifies the connectivity of components. The numerical/symbolic simulator takes the netlist and simulation data to produce a state table which records circuit states in terms of time sequences. The visualiser displays the schematic superimposed with numerical or symbolic data values indicating circuit states at each clock cycle. The circuit states change as simulation progresses, and visualisation sequences showing these changes are produced accordingly. We shall illustrate the operation of the system in Section 4.4.
Fig. 8 shows a snapshot of the graphical user interface of our visualisation system. The centre of the frame displays the block diagram of a convolver design. The operation of the design is animated by projecting a dataflow model on the block diagram. One can select to view data values on specific input and output ports and internal paths. Numerical, symbolic and bit-level simulation and their combination are supported, and the animation speed can be adjusted.
There are two simulation modes: one simulates the design cycle by cycle and the other supports sub-cycle simulation, showing how signals propagate through combinational components one after another. Fig. 8 shows the visualiser running symbolic simulation cycle by cycle. The button at the top left-hand corner allows the selection of simulation modes and input options. Simulation data can be provided from a file or from the command line input at the bottom. Various controls can be used to magnify designs, to choose step-by-step, continuous or cyclic simulation, and to adjust the simulation speed.
Our visualiser has been used in developing hardware libraries and designs involving run-time reconfiguration [25].
The refiner
In our system, both high-level (such as word-level) and low-level (such as bit-level) designs can be described in Ruby. A high-level design operates on abstract data types such as integer or real. At the low level, abstract data such as integers are replaced by concrete data, such as bits; the associated operations on abstract data are also replaced by concrete ones. The data refinement from a high-level design to a low-level design is automatically performed using the refiner.
The bit-level implementation differs principally from the word-level one in that some constraints such as the size of components have to be considered, since in reality we cannot build circuits which are arbitrarily large. Also, there may be many possible bit-level implementations which meet the requirements of a given word-level design. One of the features of the refiner is that it can produce the most efficient bit-level design which satisfies the constraints given by the designer. The current version of the refiner takes constraints specifying the maximum and minimum values of external inputs to a circuit, and optionally the size of any internal connections. A constraint-propagation algorithm has been developed to calculate the size of a particular component. The idea is that the maximum and minimum values of inputs are propagated across the circuit. Once all constraints on a given component's inputs are known, the constraints on its internal connections and its outputs can be derived. Resolving the constraints fixes the size of the components and the width of the output data path. Given a library of parametrised bit-level operators and their sizes, our constraint-propagation procedure can be used to determine the widths of all data paths. A bit-level Ruby design can be constructed that is guaranteed to be the most efficient design produced by the constraint-propagation algorithm.
Another important feature of the refiner is that it can refine a word-level design into a number of efficient bit-level implementations, depending on the bit-level data representations such as unsigned and two's complement representations. As a result, the refiner facilitates exploring architectures and evaluating the effects of different bit-level data representations.
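The idea of propagating minimum and maximum values to size the data paths can be illustrated with a short Python sketch. This is an assumption-laden illustration and not the refiner's actual algorithm: the netlist format, the operator set (`add`, `max`, `min`) and all function names are hypothetical, and the netlist is assumed to be listed in topological order.

```python
def add_range(a, b):
    """Range of x + y when x lies in a and y lies in b."""
    return (a[0] + b[0], a[1] + b[1])


def max_range(a, b):
    """Range of max(x, y)."""
    return (max(a[0], b[0]), max(a[1], b[1]))


def min_range(a, b):
    """Range of min(x, y)."""
    return (min(a[0], b[0]), min(a[1], b[1]))


OPS = {"add": add_range, "max": max_range, "min": min_range}


def propagate(input_ranges, netlist):
    """Push (min, max) constraints from the external inputs to every signal."""
    ranges = dict(input_ranges)
    for out, op, (a, b) in netlist:
        ranges[out] = OPS[op](ranges[a], ranges[b])
    return ranges


def unsigned_width(rng):
    """Smallest number of bits holding every value of rng as an unsigned number."""
    lo, hi = rng
    if lo < 0:
        raise ValueError("unsigned representation cannot hold negative values")
    return max(1, hi.bit_length())


def twos_complement_width(rng):
    """Smallest number of bits holding every value of rng in two's complement."""
    lo, hi = rng
    width = 1
    while lo < -(1 << (width - 1)) or hi > (1 << (width - 1)) - 1:
        width += 1
    return width


# Hypothetical word-level design: s = max(x, y) + z
netlist = [("m", "max", ("x", "y")), ("s", "add", ("m", "z"))]
ranges = propagate({"x": (0, 255), "y": (0, 100), "z": (0, 15)}, netlist)
widths = {name: unsigned_width(r) for name, r in ranges.items()}
print(ranges["s"], widths["s"])
```

Switching from `unsigned_width` to `twos_complement_width` changes the widths of the same design, which is the kind of comparison between data representations that the text describes.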
The hardware compiler
The hardware compiler compiles into hardware a bit-level Ruby program generated by the refiner, and the target code can be produced in various formats so that the same design can target different devices and different technologies. Some of the formats are device-independent, such as structural VHDL and EDIF netlists for both ASIC development and FPGA implementation, and some are device-dependent, such as XNF and CFG, which can only be implemented using the specific FPGAs. We have used the VHDL compiler in implementing designs on the Riley system [16].
An important feature of our hardware compiler is that there is a floorplanning module which provides rapid placement and routing of components. The floorplanning module generates layout by exploring the structure of the source description and it is therefore fast. This is important because a major bottleneck in automatic hardware synthesis is the time to place and route the netlist produced by a hardware compiler. Besides, such source language-guided placement and routing can preserve the inherent regularity of a bit-level regular array design. It is hence possible to produce area and time efficient designs.
The floorplanning module is divided into two parts: the global placement and routing part and the detailed placement and routing part. The former is device-independent while the latter is device-dependent. The separation between the device-independent and device-dependent parts is desirable because (a) such an arrangement facilitates re-targeting: it is comparatively easy for us to build a new floorplanning module for a different device as we only need to add the device-dependent part; (b) floorplanning a complex design can be extremely difficult, and a "divide and conquer" approach should reduce the complexity of the problem; (c) some algorithms used for the Ruby sketcher (Section 3.2) can be applied to the first part of the floorplanning module; (d) such an arrangement makes it possible to separately explore various device-independent and device-dependent optimisation techniques.
The viability of our floorplanning scheme has been demonstrated by a proof-of-concept compiler backend [8] customised for CAL1024 FPGAs [1], a precursor of Xilinx 6200 FPGAs. The simplicity of CAL1024 FPGAs enables us to focus on the essential tasks of mapping the high-level Ruby block descriptions into low-level device-specific FPGA cells, described using the OAL language [26]. The next section illustrates the system using a case study. The objective is to show how step-by-step design development can be achieved using our framework. We show the design flow of several median filter designs, which includes word-level architecture development and optimisation, data refinement, bit-level optimisation and hardware compilation.
Case study: Median filter designs

Introduction
Median filtering is a special, but popular, case of ranked order filtering. It has been widely used in the area of digital image processing (see, for instance, [33]). One of its common applications is in edge detection algorithms for image feature extraction. Many such algorithms use the significant changes of the grey levels of pixels in an image to indicate where edges exist. Since noise causes false edges to be detected, the smoothing of images is indispensable. If a linear filter is used, it will not only remove noise, but blur the potential edges as well, and hence will result in mis-locating edges or missing them altogether. A median filter, however, does not have this problem, since it can reduce noise spikes without extensively blurring or damaging edges. Some other fields in which median filters have been successfully applied include speech signal processing [34] and data compression [2].
There are a number of reasons for choosing median filter designs to illustrate our framework. First, the implementations described in this section are complex enough to illustrate the capability of our integrated system but still simple enough to be understood. Second, median filters are a popular topic in signal processing in general and in non-linear filtering for feature detection in particular. Due to the non-linear nature of the median filter, it is highly desirable in many applications to implement median filters in the form of regular arrays for high performance and small area. Some of our designs are similar to those in [35], which have been implemented in 2 μm CMOS technology. Finally, we show that our median filter designs can be very efficiently implemented in FPGAs.
Specification
We shall restrict our discussion here to one-dimensional median filtering, although a two-dimensional separated median filter can be directly performed using two one-dimensional median filters. More general descriptions of median filtering can be found, for example, in [12, 33].
To specify the median filtering operation, let us assume that a filter window is slid along the input sequence to be filtered. Fig. 10 shows the case when the filter size is 5. At each filter window position, the elements inside the window are sorted and the median element is extracted as output. In each cycle when the filter window progresses, one new element enters the filter window and one element that has been kept in five successive windows is deleted. Since there is only one element which is different between two successive window positions, the sorting result of the current window can be exploited to simplify the sorting process of the next window. For instance, in Fig. 10, given that elements in window W2 have already been sorted, the next window, W3, can be obtained from W2 simply by deleting element X2 and inserting element X7 so that the resulting sequence is still ordered.
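The specification and the incremental observation above can be compared directly in software. The following Python sketch is illustrative only and says nothing about the hardware architecture: `median_filter_naive` sorts every window from scratch, while `median_filter_incremental` keeps the window sorted and performs one deletion and one insertion per step. Both function names and the sample data are hypothetical.

```python
from bisect import bisect_left, insort


def median_filter_naive(xs, size=5):
    """Sort each window independently and take its middle element."""
    return [sorted(xs[k:k + size])[size // 2]
            for k in range(len(xs) - size + 1)]


def median_filter_incremental(xs, size=5):
    """Reuse the sorted previous window: delete one element, insert one."""
    if len(xs) < size:
        return []
    window = sorted(xs[:size])
    out = [window[size // 2]]
    for k in range(size, len(xs)):
        # The element leaving the window is the one that entered `size` steps ago.
        del window[bisect_left(window, xs[k - size])]
        # Insert the new element so that the window stays ordered.
        insort(window, xs[k])
        out.append(window[size // 2])
    return out


samples = [4, 9, 1, 7, 3, 8, 2, 6]
print(median_filter_naive(samples))
print(median_filter_incremental(samples))
```

The two functions agree on every input; the incremental version mirrors the deletion and insertion steps that the state machine in the following sections implements in hardware.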
Based on the above observation, we now describe a median filter as a finite state machine which involves an ordered state $s = \langle s_0, s_1, s_2, s_3 \rangle$. Given an input $i$ where $s_2 < i \le s_3$, an ordered array $u = \langle s_0, s_1, s_2, i, s_3 \rangle$ is produced by inserting $i$ into $s$.
Parametrised description
We need three blocks to implement the architecture Mstl that we adopt for the state transition logic described above: InstSortB, LocatorB and DeleterB, which are stacked on top of one another (Fig. 11). In addition, there are some interface and rewiring circuits Intfi and Intfo at the input and output, respectively.
Before we get into the detailed Ruby expressions, let us give a brief explanation of the architecture of Mstl so that readers can get a feel for our design without following the Ruby descriptions. The components in Mstl and their input and output signals are described in Tables 1 and 2. Intfi takes a new input $i$ at each clock cycle and passes it to InstSortB; similarly, it connects the signal $v$ from LocatorB to DeleterB. It also outputs $a$, the element to be deleted, and $b$, a boolean signal. $b$ is defined as follows: given that $v$ is the minimum element of $u$, $b = 0$ if $v = a$; otherwise $b = 1$. Intfo outputs the median $o$ from InstSortB and discards signals $a$, $t$ and $d$, which are outputs from blocks LocatorB and DeleterB.
Block InstSortB performs insertion sorting and generates the median. It takes $i$, the new input element, and $s$, the current state which is sorted in ascending order, and then generates a sorted array $u$ and the median $o$.
LocatorB locates the element to be deleted within the current state. It takes as input $u$, the current state containing a sorted array of data, $a$, the element to be deleted, and the boolean $b$. In addition to outputting $v$, $a$, and $t$, a boolean indicating whether $a$ is the maximum value in $u$, LocatorB generates $r$, an array of boolean and element pairs, such that the boolean value true indicates that the element to be deleted has not been found. Otherwise, the element to be deleted has been found.
DeleterB deletes the element located by LocatorB. It takes $r$, the array of boolean and element pairs, and $v$, the minimum element of array $u$, and generates the new state $s'$ and $d$, the element to be deleted.
Let us now derive a parametrised description for the median filter. The block Mstl can be captured in Ruby as

$$\mathit{Mstl} = \mathrm{fst}\ \mathit{Intfi} \;;\; \mathit{Stl0} \;;\; \mathrm{snd}\ \mathit{Intfo}, \tag{29}$$

where

$$\mathit{Stl0} = \mathit{DeleterB} \updownarrow \mathit{LocatorB} \updownarrow \mathit{InstSortB}.$$

The definitions of ``;'', fst, snd, $\updownarrow$, etc. can be found in Section 2. The input interface Intfi obtains a new input $i$, records the element $a$ which will be deleted, and rearranges wires. It is defined by 
where flatr is a Ruby function for flattening a wiring structure. For instance, $\langle x_1, \langle x_2, \langle x_3, \langle x_4, \langle x_5, \langle\rangle \rangle\rangle\rangle\rangle\rangle \;\mathit{flatr}_5\; \langle x_1, x_2, x_3, x_4, x_5 \rangle$. The block LocatorB takes as inputs the sorted array $u$, the element to be deleted $a$, and a boolean $b$ indicating whether $a = v$, where $v$ is the minimum element of $u$, and generates $v$, $r$, $a$ and $t$ as shown in Table 2. As an example, assume $u = \langle 3, 5, 6, 7, 10 \rangle$ and $a = 5$. Then $b = \text{`T'}$, $t = \text{`F'}$, and the array of boolean and element pairs is $r = \langle \langle T, 5\rangle, \langle F, 6\rangle, \langle F, 7\rangle, \langle F, 10\rangle \rangle$.
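The effect of the flattening relation on a right-nested structure can be illustrated in software. The sketch below is an assumption-laden analogue in Python: Ruby relations are not functions, the size parameter is left implicit, and nested tuples with `()` as the empty structure are a representation chosen here for illustration only.

```python
def flatr(nested):
    """Flatten a right-nested structure (x1, (x2, (..., (xn, ())))) into a list."""
    flat = []
    while nested != ():
        head, nested = nested  # peel off the leftmost element
        flat.append(head)
    return flat


# Hypothetical five-element wiring structure, mirroring the example in the text.
nested = ("x1", ("x2", ("x3", ("x4", ("x5", ())))))
print(flatr(nested))  # ['x1', 'x2', 'x3', 'x4', 'x5']
```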
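A parametrised version implementing LocatorB is shown below:

$$\mathit{LocatorB} = \mathrm{snd}\ \mathit{apr}_n^{-1} \;;\; \mathit{Locator}\ 6\ \langle\langle\langle x, y\rangle, z\rangle, z\rangle\, \$\mathit{wire}\, \langle x, y\rangle \;;\; \mathit{lsh}, \tag{38}$$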
where Locator (Fig. 13) The block DeleterB takes as inputs r, the array of boolean and element pairs, and v, the minimum element of array u. 
Composing the above blocks together, the complete median filter can now be expressed by the following state machine
where

$$\mathit{Mcore0} = \mathrm{loop}\,(\mathit{Stl0} \;;\; \mathrm{fst}\,(\mathrm{map}_n\, D)). \tag{41}$$
Note that the correct operation of $M_0$ requires the feedback latches to be initialised with I. At this stage of the design development, the sketcher described in Section 3.2 is useful for inspecting the architecture of the design, and the behaviour of the design is readily validated using our simulation and visualisation facilities. Design validation is described in the following section.
Design validation
To study the behaviour of the median filter $M_0$, one can use the visualiser to examine Intfi, Intfo, InstSortB, LocatorB and DeleterB separately and then their composition. For brevity, we present only the visualisation of the complete design $M_0$ here.
To visualise $M_0$, we supply simulation data for the input $i$ at each clock cycle, and the visualiser displays the architecture of $M_0$ and the values of user-selected wires. Fig. 12 shows a frame from an animation sequence when the number 8 is inserted into the median filter. The animation sequence shows that the design $M_0$ behaves as expected.
Word-level optimisation
In the preceding subsection, a parametrised architecture $M_0$ of the median filter was obtained and validated using our system. The next task is to optimise $M_0$ to increase regularity and to produce a systolic implementation by pipelining. While the algebraic laws of Ruby facilitate systematic transformation and proof of correctness, we shall also make use of design diagrams to gain insight into our optimisation. As described in Section 4.3, the sketcher described in Section 3.2 can be employed in the early stages of development to rapidly generate diagrams of Ruby designs. The behaviour of the transformed designs can be studied through design simulation using the simulator or the visualisation system described in Section 3.3.
To simplify our transformation, we shall at first ignore the input and output interfaces and focus on the structure of Mcore0 depicted in Fig. 13. The input and output signals of the repeating cells are described in Table 3. Our basic idea for optimising Mcore0 is to decompose the state transition logic containing three repeating structures, InsertSort, Locator and Deleter (Fig. 13), into a cascade of state machines forming a single repeating structure, Mcorecell (Fig. 14), and to introduce pipelining to increase the performance of the design.
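Let us now examine the architecture of Mcore0. Clearly, the blocks Deleter, Locator and InsertSort can be merged to form a repeating structure Mcore1 if the block MedExtract is ignored and the maximum output of InsertSort is rewired through the output port. From the algebraic law

$$\mathrm{row}_n\, Q \updownarrow \mathrm{row}_n\, R = \mathrm{row}_n\, (Q \updownarrow R),$$

we obtain

$$\begin{aligned} \mathit{Mcore1} &= \mathrm{loop}\,(\mathrm{row}_n\, \mathit{dltcell} \updownarrow \mathrm{row}_n\, \mathit{lctcell} \updownarrow \mathrm{row}_n\, \mathit{scell}) \\ &= \mathrm{loop}\,(\mathrm{row}_n\, \mathit{Stl1}), \end{aligned} \tag{44}$$

where

$$\mathit{Stl1} = \mathit{dltcell} \updownarrow \mathit{lctcell} \updownarrow \mathit{scell}. \tag{45}$$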
Fig. 12. A frame extracted from a design animation sequence, in which 8 is input. As shown in the frame, the current state is $s = \langle 1, 5, 7, 9 \rangle$. After inserting the current input 8, InstSortB generates a sorted array $u = \langle 1, 5, 7, 8, 9 \rangle$ and outputs the median 7. Before this clock cycle, 1, 9, 5 and 7 have already been input in that order, hence 1 is the element to be deleted at the current clock cycle. Therefore, the new state $s'$ equals $\langle 5, 7, 8, 9 \rangle$.

Applying the theorem for state machine decomposition [21],

$$\mathrm{loop}\,(\mathrm{row}_n\, R) = (\mathrm{loop}\, R)^n, \tag{46}$$

we obtain a regular state machine ( 

To generate the medians from the state machine, we need to output the sorted array, from which the median can be extracted. This can be achieved by slightly modifying the architecture Mcorecell obtained above while still retaining its regular feature (Fig. 15). The Ruby expression for capturing the architecture shown in Fig. 15 

Here Intfi has been defined in expression (31). Further optimisation of $M_2$ can be achieved by introducing various degrees of pipelining, which includes registers between adjacent Mcell [22]. For instance, a fully pipelined version of the implementation $M_2$ is shown in Fig. 16. At the top of the diagram, there are 6 registers forming a triangular-shaped array at the output; they ensure that all data at the output emerge in the same clock cycle.
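The count of 6 output registers is consistent with a simple skew calculation. The sketch below assumes, for illustration only, that the fully pipelined design has four cells with one register between adjacent cells, so that each cell's output is produced one cycle later than that of its predecessor; the function name `output_delays` is hypothetical.

```python
def output_delays(n_cells):
    """Registers needed on each output of a fully pipelined chain of cells.

    Assumes one register between adjacent cells, so the output of cell k
    (counting from 0 at the input end) appears k cycles after that of
    cell 0 and must wait n_cells - 1 - k cycles to line up with the last cell.
    """
    return [n_cells - 1 - k for k in range(n_cells)]


delays = output_delays(4)
print(delays, sum(delays))  # [3, 2, 1, 0] 6
```

Under these assumptions the deskewing registers form a triangle of $n(n-1)/2$ elements, which gives 6 for $n = 4$.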
Other optimisations using transposition [20] and serialisation [21] are also possible, but we shall not go into the details here.
Data refinement and bit-level optimisation
Using the refinement system described in Section 3.4, it is straightforward to obtain an optimised bit-level version of the word-level median filter described in Eq. (51) in the preceding section. For instance, given input elements in the range 0 to +127 (7-bit grey level image data, for example), our refinement system generates the bit-level implementation shown in Fig. 17, where the widths of the data paths are indicated. $\mathit{scellb}_{7,7}$ is the parametrised bit-level version of scell, with two parameters, both 7 in this example, specifying the widths of its first and second inputs. $\mathit{lctcellb}_{7,7,7}$ is the bit-level version of lctcell, with three parameters specifying the widths of its second, third and fourth inputs (labelled as b, d and i, respectively, in Fig. 17). Similarly, $\mathit{dltcellb}_{7,7}$ is the parametrised bit-level version of dltcell; its first parameter describes the first input (labelled as x in Fig. 17) and its second parameter specifies the third input (labelled as y in Fig. 17). The parametrised bit-level repeating cells and their input and output signals are described in Tables 4 and 5. These cells can be implemented using hardware libraries in a specific technology (see Section 4.7).

Table 4. The parametrised bit-level repeating cells.
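The 7-bit data paths quoted above follow directly from the stated input range. As a minimal sketch, assuming unsigned binary encoding, the width implied by a pair of minimum and maximum constraints can be computed as follows; the function name `unsigned_width` is hypothetical and this is not a description of how the refiner itself works.

```python
def unsigned_width(min_value, max_value):
    """Smallest number of bits that can hold every integer in [min_value, max_value]."""
    if min_value < 0 or max_value < min_value:
        raise ValueError("expects 0 <= min_value <= max_value")
    # int.bit_length() returns 0 for the value 0, so keep at least one bit.
    return max(1, max_value.bit_length())


print(unsigned_width(0, 127))  # 7
```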
Although the hardware libraries used for implementing scellb, lctcellb and dltcellb can be highly optimised and technology-specific, the overall implementation using FPGAs is usually inefficient due to wiring congestion between the connected blocks. It is desirable to further optimise the design at the bit level. The objective of such an optimisation is to develop bit-level cells which can be replicated to form Mcore2.
Such cells are shown in Fig. 18. A column of cell A implements Intfi (Fig. 11), which serves as the input port of the filter. It is composed of a bit-wise shift register array and some other logic gates for comparison. Note that for the cell at the most significant bit position, the input $K_8$ should be hardwired to logic zero for initialisation. The number of registers in the shift register depends on the window size of the filter. For instance, for a median filter with size 5, there should be 4 registers. Block Stl2 (Fig. 15) is implemented using a column of cell B, which is made up of a bit-wise comparator COMP and two two-way multiplexers, MUX1 and MUX2. An init wire and an OR gate are introduced for presetting the latch to a desired value. The purpose is to initialise the latch to logic one so that each latch is initialised to the maximum number that the filter can input. Also, for the cell at the most significant bit position, the input $I_8$ should be hardwired to logic zero, while for the cell at the least significant bit position, $C_0$ should be wired to logic one. C and D are the periphery cells for signal propagation.
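The relation between window size and shift register depth can be illustrated with a small software model. The sketch below assumes one shift register per bit position, each window size minus one stages deep, with all stages initially zero; the class name `BitSlicedDelayLine` and the zero initial fill are assumptions for illustration and are not taken from cell A itself.

```python
from collections import deque


class BitSlicedDelayLine:
    """Delay line built from one shift register per bit position."""

    def __init__(self, n_bits, window_size, fill=0):
        depth = window_size - 1  # e.g. 4 registers for a window of 5
        # lanes[k] holds bit k of every stored word, oldest word first.
        self.lanes = [deque([(fill >> k) & 1] * depth) for k in range(n_bits)]

    def shift(self, word):
        """Clock in `word` and return the word that leaves the delay line."""
        leaving = 0
        for k, lane in enumerate(self.lanes):
            leaving |= lane.popleft() << k
            lane.append((word >> k) & 1)
        return leaving


# Hypothetical run with 7-bit data and a window of 5.
line = BitSlicedDelayLine(n_bits=7, window_size=5)
for value in (1, 9, 5, 7):
    line.shift(value)
print(line.shift(8))  # 1
```

With four stages, the word clocked in four cycles earlier reappears at the output, which is the element that has to leave a window of size 5.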
The above bit-level cells can be used to implement a median filter with any filter size and any number of input bits. As an example, a median filter with filter size 5 and 7-bit input is shown in Fig. 19.
Using the Ruby simulator and the visualiser, it is straightforward to validate the bit-level design against the word-level design. One can also use algebraic transformations to verify that the bit-level design is a faithful implementation of the word-level description.

Table 5. Input/output signals of the bit-level repeating cells in Fig. 17.
FPGA implementation
With our hardware compiler, we can directly compile the bit-level Ruby description of the median filter into hardware such as FPGAs. For the highest performance of a design, however, it is desirable to exploit device-specific features. For a particular implementation technology we may, for instance, manually place and route the repeating cells to achieve the optimal layout, because any inefficiency in a repeating unit will be amplified many times in the whole design.
Our compiler takes only a few seconds to generate the CAL-based implementation of a median filter with a 5-element window and 8-bit input. This design, with a compact implementation of the repeating cell, is shown in Fig. 20. This example illustrates that the regularity information from the Ruby source code can be used to produce a compact layout. A hardware implementation based on this median filter design has been used to filter noise and locate edges from range data generated by range sensors [11].
Conclusion
In this paper, we have presented an overview of an integrated system for developing regular array designs. We have demonstrated that the system supports a simple and powerful notation for representing regular array designs, and that both the architecture and the behaviour of an array can be captured in a single parametrised description. Various facilities have been developed and integrated for refining, sketching, simulating, optimising, visualising and compiling regular array designs. We have used several median filter designs to illustrate our system. The following steps sum up the procedure for implementing a specific algorithm using our integrated system.
· Develop a hardware architecture for the given algorithm. Initially, the architecture may not be efficient but should be obviously correct. More efficient designs can be obtained from the initial design by optimisation.
· Capture the architecture as a parametrised expression in Ruby. This step may require parametrising the design description and using available library blocks.
· Check the correctness of the design by formal verification, numerical and symbolic simulation, and design visualisation. This step ensures that the design conforms to the intended behaviour.
· Apply correctness-preserving transformations to optimise the design to satisfy user constraints, such as employing pipelining [22] to increase parallelism, serialisation [21] to reduce size, or state machine decomposition to increase regularity. It is often helpful to use design diagrams as a vehicle to gain insights into design transformations. In such cases, the design sketcher is a useful tool for rapidly generating design diagrams.
· Develop the most efficient implementation of the repeating units before hardware compilation, using device-specific descriptions like OAL if necessary.
The system has been used in developing a number of regular array designs, including a systolic convolver [11], a systolic priority queue [9], a systolic sorter [10] and a sine calculator [8], and in developing reconfigurable libraries for FPGAs [24, 25]. It has also been used in producing implementations partly in hardware and partly in software [23].
There are a number of ways in which our system can be refined and enhanced. The current version of the refiner, for example, deals only with constraints on the maximum and minimum values of inputs for a word-level circuit. Future work includes extending our approach to take into consideration other kinds of constraints, such as critical path delay, latency or size.
The sketcher can automatically produce design diagrams from Ruby programs. A tool capable of producing Ruby programs from design diagrams would be very useful, because schematic representations of designs produced by other design systems, such as Viewlogic and Mentor Graphics, could then be automatically incorporated in our design framework for further development.
Both the refiner and the hardware compiler rely on various kinds of parametrised bit-level libraries to facilitate efficient design. These include both device-dependent and device-independent libraries. It is desirable to develop a wide variety of implementations, such as digit-serial and pipelined designs, to facilitate selecting and reusing the most appropriate ones.
