Abstract
INTRODUCTION
At one extreme of the computing spectrum, computing systems based on the traditional von Neumann model provide a single, generic computational medium for applications with diverse characteristics. These systems are known as general-purpose processors (GPPs). At the other end of the spectrum are systems whose architectures are customized for particular applications; these systems are built around one or more Application Specific Integrated Circuits (ASICs). In certain application areas, software executed on a single sequential processor no longer meets ever-increasing efficiency requirements, while the direct architecture-to-algorithm mapping restricts the range of applicability of ASIC-based systems. This led to the introduction of reconfigurable computing (RC), combining the flexibility of general-purpose processors with the high performance of ASICs [1] [2] [3] [4] [5] [6] [7] [8] [9]. Field Programmable Gate Arrays (FPGAs), an instance of RC, have recently enabled RC chips with millions of gates (Xilinx), affording more scalability and cost effectiveness through hardware reuse. FPGAs offer considerable flexibility for the design of parallel integrated circuit (IC) chips.
Generally, parallelism and implementation in hardware provide two alternatives that can often deliver dramatic improvements in efficiency. With the emergence of such reconfigurable hardware chips, a rapid development environment for these scalable hardware circuits is very useful. Moreover, it would constitute a cornerstone solution to the ever-increasing need for efficiency, scalability and flexibility in realizing massively parallel algorithms across a wide area of applications [2].
The proposed rapid development model (RDM) adopts the transformational programming approach for deriving massively parallel algorithms from functional specifications (see Figure 1) [1] [2] [3] [4]. The functional notation is used for specifying algorithms and for reasoning about them. This is usually done by carefully combining a small number of higher-order functions (such as map, zip and fold) to serve as the basic building blocks for writing high-level programs. The systematic methods for massive parallelisation of algorithms work by carefully composing "off-the-shelf" massively parallel implementations of each of the building blocks involved in the algorithm. The emphasis in this method is on correctness, scalability and reusability.
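As a minimal illustration of this building-block style (our own example, not taken from the paper), the following Haskell fragment composes map and fold into a high-level specification of the sum of squares of a list:

-- A minimal illustration (our own example) of composing the building blocks
-- map and fold into a high-level specification: the sum of squares of a list.
sumSquares :: [Int] -> Int
sumSquares = foldr (+) 0 . map (\x -> x * x)

main :: IO ()
main = print (sumSquares [1, 2, 3, 4])  -- prints 30

Since each building block has a known massively parallel implementation, a composition of this kind already suggests the shape of the parallel network.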
To describe parallelism we use Hoare's CSP, which allows issues of immense practical importance (such as data distribution, network topology, and locality of communications) to be carefully reasoned about [12]. Relating the functional programming and CSP fields makes it possible to exploit well-established functional programming paradigms and transformation techniques in order to develop efficient CSP processes.
The reconfigurable hardware realization step is done using Handel-C within an automated compilation development model [8, 11]. Handel-C uses much of the syntax of conventional C with the addition of explicit parallelism. Handel-C relies on the parallel constructs of CSP to model concurrent hardware resources; accordingly, algorithms described with CSP can be implemented in Handel-C. For the desired hardware realization, Handel-C enables integration with VHDL and EDIF along with various synthesis and place-and-route tools. Our targeted compilation output is EDIF, generated using DK1. The EDIF output is compiled to a bit-format file using the Xilinx place-and-route utility in the Xilinx ISE 5.1 package [10]. For downloading the bit file onto the FPGA and for performance analysis we use the Visual C++ IDE with the RC1000-PP support libraries. The compilation steps are shown in Figure 2.
BACKGROUND AND PREVIOUS WORK
Abdallah and Hawkins defined in [5] some constructs used in the development model. That work looked in some depth at data refinement: the means of expressing structures in the specification as communication behaviour in the implementation. Firstly, streams are defined as a sequence of messages on a single channel and correspond to a sequential method of communicating a list. Streams facilitate the communication of finite sequences and require some means of signalling the end of transmission (EOT).
Secondly, vectors of items are a means of communicating a list on more than one channel. The assumption is that there are as many channels in the vector as there are items in the list, such that each item is communicated on its own channel. Thirdly, vectors of streams are the parallel composition of n streams, each communicating a sublist independently as a stream. Each stream has its own end-of-transmission signal (EOT), and the streams can finish transmitting at different times. Lastly, the stream of vectors is defined, in which a complete sublist is communicated in a single step.
AIM AND MOTIVATION
The current research comprises three main tasks that converge to a single goal: presenting and realizing a novel method for rapid prototyping of reconfigurable circuits. Previous work established the method for refining high-level functional specifications into parallel processes; the current work investigates the translation of the processes so derived into Handel-C for mapping onto hardware.
The three tasks can be summarized as follows: firstly, discussing different implementations of all the conceptual constructs, affording rules for implementation; secondly, completing a set of assisting utility program constructs; thirdly, targeting complex engineering and industrial applications to test the applicability of the model and the validity of the suggested implementation rules. The area of application of the RDM is broadened by addressing fields such as information coding, computer and communications security, and molecular modelling, among others.
DATA REFINEMENT
In the following we present some datatypes used for refinement.
Stream of Values
The stream is a purely sequential method of communicating a group of values. It comprises a sequence of messages on a channel, with each message representing a value. Values are communicated one after the other. Assuming the stream is finite, after the last value has been communicated, the end of transmission (EOT) is signalled on a different channel. Given some type A, a stream containing values of type A is denoted simply as a stream of A.
Vector of n Values
Each item to be communicated by the vector is dealt with independently, in parallel. A vector refinement of a simple list of items communicates the entire structure in a single step. Given some type A, a vector of length n containing values of type A communicates each of the n values on its own channel. The vector is shown in Figure 5.

Figure 5. A vector of size n.
Refinement of a List of Lists
Whenever we deal with multi-dimensional data structures, for example lists of lists, implementation options arise from differing compositions of our "primitive" data refinements, streams and vectors. Examples of the combined forms are the stream of streams, the stream of vectors, the vector of streams, and the vector of vectors (see Figure 6).
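As a loose model of these shapes (a sketch with our own naming; in the actual refinement they are CSP channel protocols rather than Haskell lists), the combined forms can be written as:

-- A hypothetical Haskell model (our own naming) of the refinement shapes
-- described above; in the CSP refinement these become channel protocols.
data Msg a = Item a | EOT deriving Show

type Stream a          = [Msg a]            -- one channel, values sent one after another, then EOT
type Vector a          = [a]                -- n channels, one value per channel, in parallel
type VectorOfStreams a = Vector (Stream a)  -- n independent streams, each with its own EOT
type StreamOfVectors a = Stream (Vector a)  -- whole sublists communicated in single steps

-- Example: the list of lists [[1,2],[3]] viewed as a vector of streams.
example :: VectorOfStreams Int
example = [[Item 1, Item 2, EOT], [Item 3, EOT]]

main :: IO ()
main = print example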
HIGHER-ORDER FUNCTIONS
Functional programming environments facilitate reusability through higher-order functions: many algorithms can be built from components that are instances of some more general scheme. In this section we introduce the refinement of some higher-order functions detailed in [5], along with the refinement and implementation of the higher-order function zipWith.
Map applies a function to a list of items. Thus, in the functional setting, we have:

map f [x1, x2, ..., xn] = [f x1, f x2, ..., f xn]

The fold family of functions is used to "reduce" a list by inserting a binary operator between each neighbouring pair of elements. The basic fold operator (/) has no concept of direction and as such requires an associative binary operator to be well defined:

(⊕) / [x1, x2, ..., xn] = x1 ⊕ x2 ⊕ ... ⊕ xn

Refining fold to CSP gives a process in which F is the refinement of the operator ⊕; an instance of VFOLD is shown in Figure 9. The higher-order function zipWith is used to zip two lists, taking one element from each list, with a certain operation (see Figure 10). It is specified as:

zipWith (⊕) [x1, x2, ..., xn] [y1, y2, ..., yn] = [x1 ⊕ y1, x2 ⊕ y2, ..., xn ⊕ yn]
To implement the data-parallel version of this higher-order function, we refine it to a process VZIP that takes two vector channels as input, together with their length, and zips the two lists with a process F, where F is the refinement of the operator ⊕. Figure 11 is a visualization of vector zipping with a process F; F could be addition, multiplication or any other operation on two lists.
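For reference, standard recursive Haskell renderings of the three higher-order functions discussed above are given below; the process names VFOLD and VZIP come from the text, while the Haskell names here are ours:

-- Reference sketch: standard recursive forms of the higher-order functions
-- whose CSP refinements are discussed above.
mapSpec :: (a -> b) -> [a] -> [b]
mapSpec _ []       = []
mapSpec f (x : xs) = f x : mapSpec f xs

foldSpec :: (a -> a -> a) -> a -> [a] -> a   -- assumes an associative operator
foldSpec _ e []       = e
foldSpec f e (x : xs) = f x (foldSpec f e xs)

zipWithSpec :: (a -> b -> c) -> [a] -> [b] -> [c]
zipWithSpec f (x : xs) (y : ys) = f x y : zipWithSpec f xs ys
zipWithSpec _ _        _        = []

main :: IO ()
main = print (foldSpec (+) 0 (zipWithSpec (*) [1, 2, 3] [4, 5, 6]))  -- prints 32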
CASE STUDY: SYNTHESIS OF PARALLEL MATRIX MULTIPLICATION ALGORITHM
In this section we demonstrate the use of the RDM to develop several designs for the refinement of the standard matrix multiplication algorithm. The purposes of this section are: 1) to give an example of applying the RDM; 2) to show the flexibility of design using the RDM; and 3) to prepare for the performance evaluation step, which will feed back into the realization of the RDM. In each step the functional specification is introduced along with the CSP implementation. The reconfigurable hardware implementation follows in the next section.
Functional Specification
A functional specification of matrix multiplication is formulated as a function mmult that takes two matrices as inputs and returns a matrix as a result. In this definition, we assume the first matrix is represented as a list of rows and the second matrix is represented as a list of columns.
The function vmmult takes two inputs ass and bs and returns a list (a column in the resulting matrix). The vmmult function maps the function (scalarp bs) over the list ass.
The function scalarp defines the scalar product of two vectors. This function is the composition of the two functions zipwith (*) and sum. The inputs are two lists as and bs; the output is a single value obtained by zipping the two input lists and then folding the elements over addition. The function sum adds up all the numbers in a given list.
The functional specification of zipwithmul takes two input lists as and bs and outputs a list of the same length; the ith element of the output list is the product of the ith elements of the input lists.
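The complete functional specification can be reconstructed from the prose above as the following Haskell sketch (our reconstruction, not the paper's exact notation), with ass the first matrix as a list of rows and bss the second as a list of columns:

-- Our reconstruction of the functional specification described above.
type Matrix = [[Int]]

zipwithmul :: [Int] -> [Int] -> [Int]
zipwithmul = zipWith (*)

scalarp :: [Int] -> [Int] -> Int        -- zip with (*), then fold over addition
scalarp as bs = sum (zipwithmul as bs)

vmmult :: Matrix -> [Int] -> [Int]      -- one column of the resulting matrix
vmmult ass bs = map (scalarp bs) ass

mmult :: Matrix -> Matrix -> Matrix     -- the result as a list of columns
mmult ass bss = map (vmmult ass) bss

main :: IO ()
main = print (mmult [[1, 2], [3, 4]] [[5, 7], [6, 8]])  -- prints [[19,43],[22,50]]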
First Design

In this design we consider the refinement of the input bss and the output css as vectors of items, where each item is a list. The list ass is passed as an argument to each of the processes VMMULT(ass) in this design. The list ass could instead be explicitly communicated to the process VMMULT by exploiting an algebraic identity; in that version the list ass is locally produced and fed to each process VMMULT in the vector. The effect of having k parallel copies of PRD(ass) communicating with k instances of VMMULT can be achieved by factorizing the process PRD(ass) and broadcasting its output to the relevant processing elements in the network. Applying this rule results in a semantically equivalent version of MMULT with a different layout, justified by a formal transformation rule.
Figure 17. VSCALARP as a piping of two processes
For completeness, the simple addition and multiplication functions are also refined into CSP processes. Keeping the refinement of the remaining functions the same, the MMULT process takes the form shown in Figure 19.
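The piping view of the scalar product suggested by Figure 17 can be sketched functionally as follows, with ordinary function composition standing in for the CSP pipe (a sketch, not the paper's process definition):

-- Sketch of the piping view: the zipping stage feeds the folding stage.
vscalarp :: [Int] -> [Int] -> Int
vscalarp as = foldr (+) 0 . zipWith (*) as

main :: IO ()
main = print (vscalarp [1, 2, 3] [4, 5, 6])  -- prints 32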
Third Design
In this design we make use of pipelined parallelism, which is a very effective means of achieving efficiency in numerous algorithms. Pipelined parallelism is in general much harder to detect than data parallelism. For this task we use the function decomposition strategy found in [9], which aims at exhibiting pipelined parallelism in functional programs; the process SPEC is shown in Figure 20. The recursive function in this case is vmmult. The value to be passed to the next stage of the pipe is a tuple whose first component is the input vector and whose second component is the result of applying vscalarp to the input vector from the matrix bss and the present argument from the matrix ass. The pipelined network is shown in Figure 21. In this design, the matrix bss is input to the network as a stream of vectors (columns) <bs1, bs2, ..., bsn>. The vectors of the matrix ass (row by row) are passed as arguments in the pipe stages. The result is considered as a stream of streams <cs1, cs2, ..., csn>. The first result to appear from the network is the output stream (column) <csn>, corresponding to the first input vector bsn. The resultant matrix then appears step by step.
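As a purely functional sketch of this decomposition (our own, hypothetical formulation that ignores the streaming order), each pipe stage can be modelled as a function that holds one row of ass, receives a column bs together with the partial result column, and appends the scalar product computed at that stage:

-- Hypothetical sketch of the pipelined decomposition: pushing a column
-- through every stage yields the corresponding result column.
scalarp :: [Int] -> [Int] -> Int
scalarp as bs = sum (zipWith (*) as bs)

stage :: [Int] -> ([Int], [Int]) -> ([Int], [Int])
stage ai (bs, cs) = (bs, cs ++ [scalarp ai bs])

mmultPipe :: [[Int]] -> [[Int]] -> [[Int]]
mmultPipe ass bss = [snd (foldl (flip stage) (bs, []) ass) | bs <- bss]

main :: IO ()
main = print (mmultPipe [[1, 2], [3, 4]] [[5, 7], [6, 8]])  -- prints [[19,43],[22,50]]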
Fourth Design
This design makes use of an optimization of the previous design. In this case, the input matrix bss is refined as a stream of vectors <bs1, bs2, ..., bsn> and the matrix ass is refined as arguments in the pipeline stages. The output is taken from each stage as a vector of streams, as shown in Figure 22. For instance, the first vector to be fed into the pipe is the vector bsn; accordingly, the output vector of streams gives the first resultant vector csn.
General Evaluation
Generally, the suggested algorithms inherit all the advantages of the rapid development method applied. The key issue in the adopted model is the engineering of efficient, scalable, reusable and correct solutions by construction, as opposed to trial and testing. Correctness is ensured by construction through the functional specification step. Reusability is best exemplified by the use of "off-the-shelf" building blocks of code; higher-order functions serve as the basic building blocks from which the parallel algorithm is constructed. The use of message passing rather than shared memory in the rapid development model affords scalability. The targeted hardware contributes to the adopted model by being faster and smaller than truly general-purpose hardware such as a workstation. Also, compared to an ASIC, it has smaller non-recurring engineering (NRE) costs, since it can easily be reconfigured.
The next direct step is the Handel-C implementation step, in which these algorithms are compiled and mapped onto the FPGA for testing and performance analysis.
CONCLUSION
In this paper we have presented an extension to the implementation stage of a methodology that takes intuitive, high-level specifications of algorithms. These algorithms are specified in the functional style and then refined into efficient, parallel implementations to be later compiled in Handel-C and mapped onto reconfigurable hardware circuits. The targeted hardware is the RC-1000 Virtex 2000E FPGA from Celoxica. A case study for matrix multiplication is presented in which several radically different designs are systematically derived from a common specification. At the time of writing, we did not have the performance data corresponding to the outlined designs. Future work will include broadening the area of application of the RDM to cover algorithms in digital coding, molecular modeling, DNA matching and cryptography.
