Programmable Logic Devices (PLDs) continue to grow in size and currently contain several million gates. At the same time, research effort is going into higher-level hardware synthesis methodologies for reconfigurable computing that can exploit PLD technology. In this paper, we extend one such formal methodology and explore its effectiveness in the design of massively parallel algorithms. We take a step-wise refinement approach to the development of correct reconfigurable hardware circuits from formal specifications. A functional programming notation is used for specifying algorithms and for reasoning about them. The specifications are realised through a combination of function decomposition strategies, data refinement techniques, and off-the-shelf refinements based upon higher-order functions. The off-the-shelf refinements are inspired by the operators of Communicating Sequential Processes (CSP) and map easily to programs in Handel-C (a hardware description language). The Handel-C descriptions are directly compiled into reconfigurable hardware. The practical realisation of this methodology is evidenced by a case study of the third generation mobile communication security algorithms. The investigated algorithm is the KASUMI block cipher. We obtain several hardware implementations with different performance characteristics by applying different refinements to the algorithm. The developed designs are compiled and tested on Celoxica's RC-1000 reconfigurable computer with its 2-million-gate Virtex-E FPGA. Performance analysis and evaluation of these implementations are included.
Introduction
The rapid progress in electronic chip technology provides a variety of new implementation options for system engineers. The choice ranges from flexible programs running on a general purpose processor (GPP) to a fixed hardware implementation using an application specific integrated circuit (ASIC). Many other implementation options exist, for instance, a system with a RISC processor and a DSP core. Other options include graphics processors and microcontrollers. Specialist processors certainly improve performance over general-purpose ones, but this comes at the expense of flexibility. Combining the flexibility of GPPs and the high performance of ASICs leads to the introduction of reconfigurable computing (RC) as a new implementation option with a balance between versatility and speed.
Field Programmable Gate Arrays (FPGAs), which are nowadays important components of RC systems, have shown a dramatic increase in their density over the last few years. For example, companies like Xilinx [1] and Altera [2] have enabled the production of multi-million-gate FPGA chips.
The traditional implementation of a function on an FPGA is done using logic synthesis based on VHDL, Verilog or a similar HDL (hardware description language). These discrete event simulation languages are rather different from languages such as C, C++ or Java. An interesting step towards more success in hardware compilation is to grant a higher level of abstraction from the point of view of the programmer. Designer productivity can be improved and time-to-market can be reduced by making hardware design more like programming in a high-level language. Recently, vendors have introduced tools based on high-level languages, such as Handel-C [3], Forge [4], Nimble [5], and SystemC [6].
With the availability of powerful high-level tools accompanying the emergence of multi-million-gate FPGA chips, more emphasis should be placed on affording an even higher level of abstraction in programming reconfigurable hardware. Building on these research motivations, in this work we extend and examine a methodology whose main objective is to allow for a higher-level correct synthesis of massively parallel algorithms and to map (compile) them onto reconfigurable hardware. Our main concern is with behavioural refinement, in particular the derivation of parallel algorithms. The presented methodology systematically transforms functional specifications of algorithms into parallel hardware implementations. It builds on the work of Abdallah and Hawkins [7, 8], extending their treatment of data and process refinement. The following sections introduce the adopted development methodology. Section 3 presents the theoretical background. In Section 4, we put some emphasis on the approach to developing different implementations of the KASUMI cryptographic algorithm. The following section details the development steps. Section 7 demonstrates selected implementations. In Section 8, we analyse and evaluate the performance of the suggested implementations. Finally, Section 10 concludes the paper.
The Development Method
The suggested development model adopts the transformational programming approach for deriving massively parallel algorithms from functional specifications (see Figure 1). The functional notation is used for specifying algorithms and for reasoning about them. This is usually done by carefully combining a small number of higher-order functions that serve as the basic building blocks for writing high-level programs. The systematic methods for massive parallelisation of algorithms work by carefully composing an "off-the-shelf" massively parallel implementation of each of the building blocks involved in the algorithm. The underlying parallelisation techniques are based on both pipelining and data parallelism.
Higher-order functions, such as map, filter, and fold, provide a high degree of abstraction in functional programs [9]. Not only do they allow clear and succinct specifications for a large class of algorithms, but they are also ideal starting points for generating efficient implementations by a process of mathematical calculation using the Bird-Meertens Formalism (BMF). The essence of this approach is to design a generic solution once, and to use instances of the design many times for various applications. Accordingly, this approach allows portability by implementing the design on different parallel architectures.
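As a small illustration of this specification style, the following Haskell sketch combines map, filter, and fold into one definition and states one BMF-style law (map fusion) as a checkable property. The names sumOfSquaresOfEvens and mapFusionHolds are ours and are not taken from the paper.

```haskell
-- A specification built only from higher-order building blocks:
-- keep the even numbers, square each one, and fold the results with (+).
sumOfSquaresOfEvens :: [Int] -> Int
sumOfSquaresOfEvens = foldr (+) 0 . map (^ 2) . filter even

-- Map fusion, a typical BMF-style law: two passes over a list
-- can be replaced by one pass with the composed function.
mapFusionHolds :: [Int] -> Bool
mapFusionHolds xs = (map (+ 1) . map (* 2)) xs == map ((+ 1) . (* 2)) xs
```

For the list [1..6], the even elements are 2, 4, and 6, so sumOfSquaresOfEvens returns 4 + 16 + 36 = 56.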
In order to develop generic solutions for general parallel architectures, it is necessary to formulate the design within a concurrency framework such as Hoare's CSP [10]. Parallel functional programs often show peculiar behaviours which are best understood in terms of concurrency rather than by relying on hidden implementation details. The formalisation of the parallel behaviour in CSP leads to better understanding and allows for analysis of performance issues. The establishment of refinement concepts between functional and concurrent behaviours may allow systematic generation of parallel implementations for various architectures.
The previous stages of development require a back-end stage for realising the developed designs. We note at this point that the Handel-C language relies on the parallel constructs in CSP to model concurrent hardware resources. Most algorithms described in CSP can be implemented in Handel-C. Accordingly, this language is suggested as the final reconfigurable hardware realisation stage in the proposed methodology. For the desired hardware realisation, Handel-C enables integration with VHDL and EDIF (Electronic Design Interchange Format), and thus with various synthesis and place-and-route tools.
Background
Abdallah and Hawkins defined in [8] some constructs used in the development model. Their investigation looked in some depth at data refinement, which is the means of expressing structures in the specification as communication behaviour in the implementation.
Data Refinement
In the following, we present some datatypes used for refinement: the stream, the vector, and their combined forms.
The stream is a purely sequential method of communicating a group of values. It comprises a sequence of messages on a channel, with each message representing a value. Values are communicated one after the other. Assuming the stream is finite, after the last value has been communicated, the end of transmission (EOT) is signalled on a different channel. Given some type A, a stream containing values of type A is denoted as A .
Each item to be communicated by the vector will be dealt with independently in parallel. A vector refinement of a simple list of items will communicate the entire structure in a single step. Given some type A, a vector of length n, containing values of type A, is denoted as A n .
Whenever dealing with multi-dimensional data structures, for example, lists of lists, implementation options arise from differing compositions of our primitive data refinements: streams and vectors. Examples of the combined forms are the Stream of Streams, Stream of Vectors, Vector of Streams, and Vector of Vectors.
These forms are denoted by:
Process Refinement
The refinement of the formally specified functions to processes is the key step towards understanding the possible parallel behaviour of an implementation. In this section, the interest is in presenting refinements of a subset of functions, some of which are higher-order. A larger set of refined functions is discussed in [7].
Generally, these highly reusable building blocks can be refined to CSP in different ways. This depends on the setting in which these functions are used (i.e. with streams, vectors, etc.), and leads to implementations with different degrees of parallelism. Note that we do not use CSP in a totally formal way, but in a way that facilitates the Handel-C coding stage later. Recall for the following subsections that values are communicated through an elements channel, while a single bit is communicated through a separate eotChannel channel to signal the end of transmission (EOT).
Basic Definitions
The produce/store process (PRD/STORE) is fundamental to process refinement. It is used to produce/store values on/from the channels of a certain communication construct (Item, Stream, Vector, and so on). These values are to be received and manipulated by other processes.
The feed operator in CSP models function application. The feed operator is written £.
Consider a potential refinement for f , a process F . The operator denotes a process refinement, where the left hand side is a function, and the right hand side is a process. To state that f is refined to F , or in other words, the process F is a valid refinement of the function f , the following may be used:
f F
These rules were proven once [7] , and in this paper we use them systematically to refine the functional specification into a network of communicating processes.
Process Refinement of Higher-order Functions
Now the attention is turned to the refinement of higher-order functions presented in [8], showing the refinement of the higher-order function map as an instance. The use of this function in stream and vector settings is presented.
Streams
A process implementing the functionality of map f in stream terms should input a stream of values, and output a stream of values with the function f applied.
In general, the handling of the EOT channels will be the same. However, the handling of the value will vary depending on the type of the elements of the input and output stream.
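The behaviour described above can be simulated in Haskell with concurrent threads and channels. This is only a minimal sketch under several assumptions of ours: a single channel carries values of type Maybe a, with Nothing standing in for the signal on the separate EOT channel; Haskell's Chan is buffered, unlike the synchronous channels of CSP and Handel-C; and the names prd, smap, store, and runSMap are hypothetical stand-ins for PRD, the stream refinement of map, and STORE.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- A stream message: Just v carries a value, Nothing plays the role of EOT.
-- (Assumption: the paper uses a separate EOT channel instead.)
type StreamChan a = Chan (Maybe a)

-- PRD: produce the values of a list on a stream, then signal EOT.
prd :: [a] -> StreamChan a -> IO ()
prd xs out = mapM_ (writeChan out . Just) xs >> writeChan out Nothing

-- Stream map: apply f to every value read from the input stream
-- and forward the EOT signal unchanged.
smap :: (a -> b) -> StreamChan a -> StreamChan b -> IO ()
smap f inp out = do
  msg <- readChan inp
  case msg of
    Just v  -> writeChan out (Just (f v)) >> smap f inp out
    Nothing -> writeChan out Nothing

-- STORE: collect the values of a stream back into a list.
store :: StreamChan a -> IO [a]
store inp = do
  msg <- readChan inp
  case msg of
    Just v  -> (v :) <$> store inp
    Nothing -> pure []

-- Pipe PRD into the stream map and store the result.
runSMap :: (a -> b) -> [a] -> IO [b]
runSMap f xs = do
  inp <- newChan
  out <- newChan
  _ <- forkIO (prd xs inp)
  _ <- forkIO (smap f inp out)
  store out
```

Under these assumptions, runSMap f xs yields the same list as map f xs, which is the correspondence between function and process that the refinement is meant to capture.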
Vectors
In functional terms, the functionality of map f in a list setting is modelled by vmap f in the vector setting. Consider F as a valid refinement of the function f . The implementation of VMAP can then proceed by composing n instances of F in parallel, and directing an item from the input vector to each instance for processing. In CSP we have:
Handel-C as a Stage in the Development Model
Based on datatype refinement and the skeleton afforded by process refinement, the desired reconfigurable circuits are built. Circuit realisation is done using Handel-C, as it is based on the theories of CSP [10] and Occam [11] .
From a practical standpoint, each refined datatype is defined as a structure in Handel-C, while each process is implemented as a macro procedure. We divide the constructs corresponding to the CSP stage into 2 main categories for organisation purposes. The first category represents the definitions of the refined datatypes. The second category implements the refined processes.
The refined processes are divided into different groups: utility, basic, and higher-order processes. A separate group contains the macros that handle the FPGA card setup and general functionality.
The datatype definitions are implemented using structures. This method supports recursive as well as simple types. The definition for an Item of a type Msgtype is a structure that contains a communicating channel of that type.
#define Item(Name, Msgtype) struct { chan Msgtype channel; Msgtype message; } Name
For generality in implementing processes the type of the communicating structure is to be determined at compile time. This is done using the typeof type operator, which allows the type of an object to be determined at compile time. For this reason, in each structure we declare a message variable of type Msgtype.
A stream of items, called StreamOfItems, is a structure with three declarations: a communicating channel, an EOT channel, and a message variable [8]. Other definitions are possible, but they affect the way a channel is accessed using the structure member operator (.).
The utility processes used in the implementation are related to the employed datatypes. The Handel-C implementation of these processes relies on their corresponding CSP implementation. In the following, we present an instance of these utility macros. 
Higher-Order Processes Macros
An example of a Handel-C implementation of the CSP refinement of a higher-order function (map) in its vector setting is as follows:
In a similar way to what has been introduced before, the implementations of the stream and vector settings SZipWith and VZipWith are straightforward.
Different tools are used to measure the performance metrics used for the analysis. These tools include the design suite (DK) from Celoxica, which reports the number of NAND gates for the design as compiled to the Electronic Design Interchange Format (EDIF). The DK simulator also reports the number of cycles taken by a design. Accordingly, the speed of a design can be calculated from the cycle count and the expected maximum frequency of the design, which is determined by the timing analyser. To get the practical execution time as observed from the computer hosting the RC-1000, the C++ high-precision performance counter is used. The hardware area occupied by a design, i.e. the number of Slices used after placing and routing the compiled code, is determined by the ISE place and route tool from Xilinx.
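The relation between cycle count, clock frequency, and speed can be written down directly. The sketch below assumes the usual estimate that throughput equals the number of bits processed per block times the clock frequency divided by the cycles per block; the function names are ours, and the numbers in the note after the code are purely illustrative and are not measurements from this paper.

```haskell
-- Estimated throughput in Mbit/s for a design that consumes
-- bitsPerBlock bits every cyclesPerBlock clock cycles at freqMHz.
throughputMbps :: Double -> Double -> Double -> Double
throughputMbps bitsPerBlock cyclesPerBlock freqMHz =
  bitsPerBlock * freqMHz / cyclesPerBlock

-- Estimated execution time in microseconds for a given cycle count.
executionTimeUs :: Double -> Double -> Double
executionTimeUs cycles freqMHz = cycles / freqMHz
```

For a hypothetical design that processes a 64-bit block in 128 cycles at 50 MHz, throughputMbps 64 128 50 gives 25 Mbit/s and executionTimeUs 128 50 gives 2.56 microseconds per block.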
The Third Generation of Mobile System Security Algorithms
KASUMI is a modern and strong encryption algorithm designed for use in the Third Generation Partnership Project (3GPP) security functions for mobile systems [12]. KASUMI ciphers a 64-bit input data block by repeating a round procedure 8 times. Each round comprises a 32-bit non-linear mixing block (FO) and a 32-bit linear mixing block (FL). The FO block is an iterated "ladder-design" consisting of 3 rounds of a 16-bit non-linear mixing block FI. In turn, the FI randomising function is defined as a 4-round structure using the non-linear look-up tables S7 and S9. All functions involved mix the data input with the key. S7 and S9 have been designed in a way that avoids linear structures in FI; this fact has been confirmed by statistical testing. Each functional component of KASUMI has been carefully studied to reveal any weakness that could be used as a basis for an attack on the entire algorithm. The fact that the key schedule of KASUMI is very simple did not constitute any real weakness. There seems to be no gain in practice from making it more complicated.
Hardware implementation of this cryptographic algorithm is currently an active area of research. KASUMI was addressed by HoWon et al. [13] and Alcantara et al. [14]. Intel [15] proposed processor architectures for 3G control including KASUMI. Moreover, SCI-WORX [16] produced a system board for the KASUMI cipher.
Formal Functional Specification
We will consider the following specifications for the key scheduler and the main algorithm (KASUMI). The key scheduler takes the private key as an input, and outputs the desired set of subkeys. This set of subkeys consists of 4 packs (see Figure 2). KASUMI takes two inputs, the generated subkeys and the input data, and gives the corresponding output.
Generally, the functional specification style applied throughout this research uses higher-order functions as the main keys for later parallelism. As a start, we define some types to be used in the following formal specification:
The following specifications are also tested using the Hugs98 Haskell interpreter.
Key Scheduling
As shown in Figure 2, the 64 16-bit subkeys are organised into 4 packs of 8 sets of subkeys kL_i1, kL_i2, kO_i1, kO_i2, kO_i3, kI_i1, kI_i2, and kI_i3, where i is an index corresponding to the round number where a subkey is to be used. These subkeys are generated from the 128-bit private encryption key.
Key scheduling is specified as the function keySchedule that inputs a private key and outputs 4 packs of subkeys. We divide each pack into 6 groups for later ease of distribution to the encrypting rounds. Each group is a list of subkeys selected from the predefined lists kL_i1, kL_i2, kO_i1, kO_i2, kO_i3, kI_i1, kI_i2, and kI_i3. For instance, the first pack would contain:
The function keySchedule generates the subkeys by first determining the predefined ks and ks'. ks is specified using the function segs as (segs 16 key). Recall that segs selects n sublists from a list xs.
After specifying ks, we formalise the computation of ks' using the higher-order function zipWith, zipping two lists with the function exor. These lists correspond to ks and C. After ks and ks' are ready, the KASUMI subkeys are determined by employing the higher-order functions mapWith and map, together with the functions shift and copy.
Finally, the functions group and transpose arrange the subkeys in the form mentioned earlier. The arranged groups are then merged into the final 4 packs. To make these steps easier to follow, we include the chart shown in Figure ??.
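The first steps of the key schedule can be sketched in Haskell as follows. This is not the paper's keySchedule: it is a simplified illustration that uses Word16 values instead of lists of Bool for the subkeys, reads segs n as splitting a list into consecutive segments of length n, and returns one record per round rather than the 4 packs of 6 groups described above. The constants C and the rotation amounts follow the published KASUMI specification (3GPP TS 35.202); the names cs, bitsToWord, RoundKeys, and roundKeys are ours.

```haskell
import Data.Bits (rotateL, shiftL, xor, (.|.))
import Data.Word (Word16)

-- Constants C1..C8 of the KASUMI key schedule (3GPP TS 35.202).
cs :: [Word16]
cs = [0x0123, 0x4567, 0x89AB, 0xCDEF, 0xFEDC, 0xBA98, 0x7654, 0x3210]

-- Split a list into consecutive segments of length n
-- (our reading of the function segs).
segs :: Int -> [a] -> [[a]]
segs n xs
  | n <= 0 || null xs = []
  | otherwise         = take n xs : segs n (drop n xs)

-- Interpret a list of bits, most significant bit first, as a 16-bit word.
bitsToWord :: [Bool] -> Word16
bitsToWord = foldl (\acc b -> (acc `shiftL` 1) .|. (if b then 1 else 0)) 0

-- The eight subkeys used by one round.
data RoundKeys = RoundKeys
  { kl1, kl2, ko1, ko2, ko3, ki1, ki2, ki3 :: Word16 }
  deriving (Eq, Show)

-- Subkeys for rounds 1..8 from a 128-bit key given as a list of 128 bits.
-- (Shorter keys are not handled in this sketch.)
roundKeys :: [Bool] -> [RoundKeys]
roundKeys key = [mk i | i <- [0 .. 7]]
  where
    ks   = map bitsToWord (segs 16 key)  -- K1..K8
    ks'  = zipWith xor ks cs             -- K'j = Kj xor Cj
    k j  = ks  !! (j `mod` 8)
    k' j = ks' !! (j `mod` 8)
    mk i = RoundKeys
      { kl1 = k i `rotateL` 1
      , kl2 = k' (i + 2)
      , ko1 = k (i + 1) `rotateL` 5
      , ko2 = k (i + 5) `rotateL` 8
      , ko3 = k (i + 6) `rotateL` 13
      , ki1 = k' (i + 4)
      , ki2 = k' (i + 3)
      , ki3 = k' (i + 7)
      }
```

With an all-zero key every Kj is zero, so the primed subkeys reduce to the constants themselves; this gives a simple sanity check of the zipWith step.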
The KASUMI Block Cipher
The KASUMI block cipher has two inputs, a 64-bit data block and the private key. The corresponding ciphered output is also a 64-bit data block. In this specification, we suggest the division of the KASUMI structure into 4 similar rounds, where each single round consists of two subrounds, called the first and second subrounds. The 4 generated packs of subkeys (using the function keySchedule) are distributed to the 4 KASUMI rounds respectively. The total of 8 subrounds of KASUMI constitutes a Feistel network. This is visualised in Figure ??.
KASUMI is formally specified as the function kasumi, which takes two lists of Bool, input and key. This function outputs a list of Bool corresponding to the ciphered data. The specification is done by folding a function singleRound with the input over the generated packs of subkeys. With respect to the network shape, the foldable single round is specified as the function singleRound.
kasumi :: DataBlock -> Private -> DataBlock
kasumi input key = foldl singleRound input (keySchedule key)
A single round consists of two blocks, the odd block formalised as the function firstSubRound and the even block formalised as the function secondSubRound. The function singleRound is specified as the functional composition of the functions firstSubRound and secondSubRound. The inputs to the function singleRound are an input block of data and a single pack of subkeys. The function firstSubRound can be described as follows. It first takes the 64-bit data input block and divides it into left and right 32-bit words as shown in Figure ??. It also inputs a pack of subkeys and distributes them to their specific destinations. The left half of the data input is passed to a function fL, which corresponds to the FL block. The function fL forwards its output to a function fO (the functional specification of the FO block). The output from the function fO is XORed with the right half of the input data, giving the final left half l1. The function firstSubRound outputs a 64-bit word, which is the concatenation of the final left half with the initial left half. It also outputs the subkeys needed for the second subround.
firstSubRound ::
The remaining fL, fI, fO, s7, and s9 building blocks are specified in a similar style.
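The shape of this specification, a fold of a two-subround function over the packs of subkeys, can be illustrated independently of the real F-blocks. In the sketch below, fLStub and fOStub are placeholder mixing functions of our own and are not the FL and FO blocks of KASUMI; a pack is simplified to a pair of (kl, ko) key pairs, one per subround. The ordering of FO before FL in the even subround follows the published KASUMI specification rather than the text above. An inverse is included only to check that the structure is a genuine Feistel network.

```haskell
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

type Block   = (Word32, Word32)        -- (left half, right half)
type SubKeys = (Word32, Word32)        -- simplified (kl, ko) pair (assumption)
type Pack    = (SubKeys, SubKeys)      -- keys for the two subrounds of a round

-- Placeholder mixing functions; NOT the real FL and FO of KASUMI.
fLStub, fOStub :: Word32 -> Word32 -> Word32
fLStub k x = (x `xor` k) `rotateL` 1
fOStub k x = (x + k) `rotateL` 7

-- Odd subround: FL then FO on the left half, XORed with the right half.
firstSubRound :: SubKeys -> Block -> Block
firstSubRound (kl, ko) (l, r) = (r `xor` fOStub ko (fLStub kl l), l)

-- Even subround: FO then FL.
secondSubRound :: SubKeys -> Block -> Block
secondSubRound (kl, ko) (l, r) = (r `xor` fLStub kl (fOStub ko l), l)

-- A single round is the composition of the two subrounds.
singleRound :: Block -> Pack -> Block
singleRound blk (k1, k2) = secondSubRound k2 (firstSubRound k1 blk)

-- The cipher folds singleRound over the packs of subkeys.
cipher :: [Pack] -> Block -> Block
cipher packs blk = foldl singleRound blk packs

-- Inverse subrounds and inverse cipher, used only to check the structure.
unFirst, unSecond :: SubKeys -> Block -> Block
unFirst  (kl, ko) (a, b) = (b, a `xor` fOStub ko (fLStub kl b))
unSecond (kl, ko) (a, b) = (b, a `xor` fLStub kl (fOStub ko b))

decipher :: [Pack] -> Block -> Block
decipher packs blk = foldr undo blk packs
  where undo (k1, k2) b = unFirst k1 (unSecond k2 b)
```

Because each subround only XORs one half with a function of the other, decipher packs (cipher packs blk) returns blk for any choice of the stub functions.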
Algorithms Refinements
We now move to the second stage of development, following the same proposed method. The refinements of the key scheduling and KASUMI specifications are presented in the following subsections.
Key Scheduling
Getting closer to hardware implementation, the general datatypes used in specifying the function keySchedule are refined as follows:
The key is a 128-bit Integer item, and the output packs of groups of lists can be refined to a vector of 4 vectors, each of 6 vectors of 16-bit Integer items. The refined process KEYSCHEDULE corresponds to the function keySchedule.
keySchedule KEYSCHEDULE
From the specification, the process KEYSCHEDULE inputs the key and then divides it into segments using the process SEGS, the refinement of segs. These segments are broadcast to be used later 5 times. At this point, two parallel events can occur, corresponding to the right and left branches depicted in Figure 4. The right branch of processes refines the following part of the specification:
To compute ks', the vector setting refinement of zipWith (VZIPWITH) is used. Then the vector refinement of mapWith, VMAPWITH, is used to compute the first set of subkeys.
The parallel left branch of processes computes the second set of subkeys by piping two instances of the refined process VMAPWITH. This refines the following recalled specification:
The remaining processes are used to refine the functions responsible for ordering the subkeys in the suggested form, that is, packs of groups of lists. The complete network of processes (see Figure 4) is described as follows:
The KASUMI Block Cipher
The KASUMI block is the main ciphering part used for the confidentiality and integrity algorithms standardised for 3GPP. Based on the functional specification stage of development, we suggest two refined designs for implementing the KASUMI block. The first is a 4-round pipelined design, while the second is a single-round stream-based design.
First Design
In this design, we construct a fully pipelined network implementing the KASUMI block. Four single rounds are replicated to work in parallel, forming a pipeline of processes. Accordingly, this design is expected to have a high degree of parallelism, and therefore to be highly efficient. However, replicating processes in this way requires large amounts of processing resources.
The first step in refining the function kasumi observes its inputs as items with a precision of 64 bits for the data block and 128 bits for the key. This is described as follows:
where the function kasumi is refined to the process KASUMI.
In this design, the four groups of subkeys are piped from the process KEYSCHEDULE to the replicated SINGLEROUND processes. The foldl higher-order function in this case is refined to its vector setting VVFOLDL. Thus, the process KASUMI is refined as follows:
Note that the upper input to each SINGLEROUND is a list of lists of subkeys, refined as a vector of vectors. This is depicted in Figure 5.
Moving to the refinement of the KASUMI sub-blocks, the datatypes employed in the function singleRound can be refined as follows:
where the function singleRound is refined to the process SINGLEROUND.
Recalling the functional specification of singleRound, we have:
singleRound input64 subKeys =
This functional composition is refined to the piping of two processes, FIRSTSUBROUND and SECONDSUBROUND. The process SINGLEROUND is depicted in Figure 6 (a) and described as follows:
In refining the function firstSubRound, the datatypes can be refined as follows:
After getting its inputs, and following the functional specification, the process FIRSTSUBROUND first broadcasts the input left half r1 to be used twice. Then, the subkeys are produced to the processes FL and FO in the order needed. The communication between FL and FO is implicitly synchronised by the ( ) operator. The output from FO is passed to the process EXOR together with the produced input right half. At this point, the process CONCAT synchronises on the output of the process EXOR and the broadcast r1. Finally, the remaining subkeys are produced to be forwarded to the process SECONDSUBROUND. These processes are shown in Figure 6.
CONCAT PRD v (kss [3] )
PRD v (kss [4] ) PRD v (kss [5] )
where fL FL
fO FO
Similarly, for the function secondSubRound, the refinement is done as follows:
Second Design
In this design, the packs of subkeys are passed in a stream setting to a single SINGLEROUND process. The stream refinement of foldl, implemented by SVFOLDL, uses the SINGLEROUND process to compute the final folded result. This design affords an economical use of computing resources, but at the expense of efficiency. This CSP network is pictured in Figure 7 and implemented as follows:
Third and Fourth Designs
The aim of introducing the third and fourth designs is to reduce the communication at the fine levels, mainly inside the FL, FI, and FO blocks. These blocks are implemented with basic operations instead of communicating processes. For example, an addition is implemented using a (+) operator instead of a process ADDITION. The refinement of the remaining blocks stays the same, as do the external communications with the FL, FI, and FO blocks. The third design uses the new descriptions of the F-blocks to modify the first fully-pipelined design, while the fourth design applies the changes to the second stream-based design.
Reconfigurable Hardware Implementations
Based on the refined networks of CSP processes we include samples of the Handel-C code used in the realisation of the hardware circuit.
As a sample from KASUMI's main blocks, we present the macro SingleRound realising the process SINGLEROUND. The correspondence with the CSP description is clear by referring to the implementation presented in the previous stage. In this macro, the macros FirstSubRound and SecondSubRound are piped in parallel to create the macro SingleRound as follows:
Performance Analysis and Evaluation
In this paper, we have demonstrated a methodology that can produce intuitive, high-level specifications of algorithms in the functional programming style. The development continues by deriving efficient, parallel implementations described in CSP and realised using Handel-C that can be compiled into hardware on an FPGA. We have provided a concrete study that exploited both data and pipelined parallelism and the combination of both. The implementation was achieved by combining 'off-the-shelf' behavioural implementations of commonly used components that refine the higher-order functions which form the building blocks of the starting functional specification.
The development originates from a specification stage, whose key feature is its powerful higher level of abstraction. During the specification, the isolation from parallel hardware implementation technicalities allowed for deep concentration on the specification details. For the most part, the style of specification favours the use of higher-order functions. Two other inherent advantages of using the functional paradigm are the clarity and conciseness of the specification. This was reflected throughout all the presented studies. At this level of development, the correctness of the specification is ensured by construction from the correct building blocks used. The implementation of the formalised specification is tested under Haskell by performing random tests for every level of the specification.
The correctness will be carried forward to the next stage of development by applying the provably correct rules of refinement. The available pool of refinement formal rules enables a high degree of flexibility in creating parallel designs. This includes the capacity to divide a problem into completely independent parts that can be executed simultaneously (pleasantly parallel). Conversely, in a nearly pleasantly parallel manner, the computations might require results to be distributed, collected and combined in some way. Remember at this point, that the refinement steps are systematic and done by combining off-the-shelf reusable instances of basic building blocks.
In the following, we address the results found after compiling, placing and routing, and running the proposed designs. In Table ??, the key scheduling design occupied 8905 Slices and performed at a throughput of 27.7 Mbps. The KASUMI block algorithm in the stream-based second design occupied 13225 Slices and performed at a throughput of 1.68 Mbps (see Table 2). The third and fourth designs outperformed the second design with speeds of 4.92 Mbps and 32 Mbps. The fourth design had a better running frequency (72.71 MHz) than the third design (49.06 MHz).
These testing results, as compared to the requirements and to other hardware implementations, reveal the high cost of applying the methodology in this manner. Even with some tuning, such as tracking the critical paths in timing analysis to increase the maximum possible frequency of the design, a substantially higher throughput is not expected. The high cost in hardware resources arises because the systematically applied rules hide possibilities for intuitive ad hoc optimisations. The trials for better speed could continue in a similar way to those undertaken in the KASUMI third and fourth designs. Nevertheless, this lessens the use of communications at the level of the fine-grained processes.
Acknowledgement
I would like to thank Dr. Ali Abdallah, Prof. Mark Josephs, Prof. Wayne Luk, Dr. Sylvia Jennings, and Dr. John Hawkins for their insightful comments on the research which is partly presented in this paper.
Conclusion
Recent advances in the area of reconfigurable computing came in the form of FPGAs and their high-level HDLs such as Handel-C. In this paper, we build on these recent technological advances by presenting, demonstrating and examining a systematic approach for synthesising parallel hardware implementations from functional specifications. We have examined a case study from applied cryptography, namely the KASUMI algorithm for 3GPP. The testing of the realised reconfigurable circuits allowed ciphering with KASUMI at a throughput of 32 Mbps with an occupied area of 5594 Slices. However, this confirms the conclusion about the expense of using the adopted higher-level approach. Future work includes extending the theoretical pool of rules for refinement, investigating the automation of the development process, and optimising the realisation for more economical implementations with higher throughput.

