Parallel Algorithms Development for Programmable Devices with
  Application from Cryptography by Damaj, Issam
The link to the formal publication is via
https://doi.org/10.1007/s10766-007-0046-1
Issam W. Damaj,∗
April 12, 2019
Abstract
Reconfigurable devices, such as Field Programmable Gate Arrays (FP-
GAs), have been witnessing a considerable increase in density. State-of-
the-art FPGAs are complex hybrid devices that contain up to several
millions of gates. Recently, research effort has been going into higher-
level parallelization and hardware synthesis methodologies that can ex-
ploit such a programmable technology. In this paper, we explore the
effectiveness of one such formal methodology in the design of parallel ver-
sions of the Serpent cryptographic algorithm. The suggested methodology
adopts a functional programming notation for specifying algorithms and
for reasoning about them. The specifications are realized through the
use of a combination of function decomposition strategies, data refine-
ment techniques, and off-the-shelf refinements based upon higher-order
functions. The refinements are inspired by the operators of Communicat-
ing Sequential Processes (CSP) and map easily to programs in Handel-C
(a hardware description language). In the presented research, we obtain
several parallel Serpent implementations with different performance
characteristics. The developed designs are tested on Celoxica’s RC-1000
reconfigurable computer with its 2-million-gate Virtex-E FPGA.
Performance analysis and evaluation of these implementations are included.
Key Words: Parallel algorithms, Methodologies, Data encryption, Formal
Models, Gate Array.
1 Introduction
The rapid progress in integrated circuit (IC) technology provides a variety
of new implementation options for system engineers. The choice varies between
flexible programs running on a general-purpose processor (GPP) and a fixed
hardware implementation using an application-specific integrated circuit
(ASIC). Many other implementation options exist, for instance, a system with
a RISC processor and a DSP core. Other options include graphics processors
and microcontrollers. Specialized processors certainly improve performance
over general-purpose ones, but this comes at the expense of flexibility.
Combining the flexibility of GPPs and the high performance of ASICs leads to
the introduction of reconfigurable computing (RC) as a new implementation
option with a balance between versatility and speed.
∗I. Damaj is with Dhofar University, Salalah, Oman. Email: i damaj@du.edu.om
arXiv:1904.05437v1 [cs.DC] 7 Apr 2019
Field Programmable Gate Arrays (FPGAs) are nowadays important components
of RC systems. FPGAs have shown a dramatic increase in their density
over the last few years. For example, companies such as Xilinx [1] and
Altera [2] have enabled the production of FPGAs with several millions of
gates, such as the Virtex-II Pro and Stratix-II FPGAs. The versatility of
FPGAs has opened up completely new avenues in high-performance computing.
These programmable hardware circuits can be supported with flexible parallel
algorithm design methodologies to form a powerful paradigm for computing.
The traditional implementation of a function on an FPGA is done using
logic synthesis based on VHDL, Verilog, or a similar HDL (hardware description
language). These discrete-event simulation languages are rather different from
software languages such as C, C++, or Java. An important step towards more
successful hardware compilation was to give the programmer a higher level of
abstraction. Accordingly, vendors have recently introduced high-level
languages such as Handel-C [3, 4], Forge [5], Nimble [6, 7], and
SystemC [8].
Although modern hardware compilation tools have significantly reduced the
complexity of hardware design, many research opportunities remain for
reducing design complexity even further. Accordingly, in this paper we
investigate a methodology that enables a high level of abstraction in the
process of hardware design. The proposed methodology is a step-wise refinement
approach for developing parallel algorithms. The development is based on
higher-order skeletons that exploit inherent algorithmic parallelism.
Algorithmic skeletons provide a promising basis for the automatic utilization
of parallelism at sites of higher-order functions [9]. The correctness of the
developed hardware is put forward for further discussion in this paper.
The research presented in this paper builds on the work of Abdallah and
Hawkins [10, 11, 12, 13], which adopts the transformational programming
approach for deriving massively parallel algorithms from functional
specifications (see Figure 1). In this approach, the functional notation is
used for specifying algorithms and for reasoning about them. This is usually
done by carefully combining a small number of generic higher-order functions
(such as map, filter, and fold) that serve as the basic building blocks for
writing high-level programs. The parallelization of algorithms works by
carefully composing an "off-the-shelf" parallel implementation of each of the
building blocks involved in the algorithm. The underlying parallelization
techniques are based on both pipelining and data parallelism. The essence of
this approach is to design a generic solution once, and to use instances of
the design many times for various applications.
Figure 1: An overview of the transformational derivation and the hardware
realisation processes.
In order to develop generic solutions for general parallel architectures, it
is necessary to formulate the design within a concurrency framework such as
CSP [11, 14, 4]. Parallel functional programs often show peculiar behaviors
that are understandable only in terms of concurrency, rather than by relying
on hidden implementation details. The formalization in CSP of the parallel
behavior leads to a better understanding of the described network of processes
and allows for the analysis of its performance. The establishment of
refinement concepts between functional and concurrent behaviors allows for the
generation of parallel implementations for various architectures. This gives
the ability to exploit well-established functional programming (FP) paradigms
and transformation techniques in order to develop efficient parallel and
sequential CSP processes independent of architectural details. The refinement
from functional specification to CSP descriptions is shown in Figure 1 as
transformational derivation. The transformational derivation is supported by
strategies for parallelism, CSP laws, and refinement rules, including those
for the refinement to CSP networks of processes.
The initial stages of development require a back-end hardware compiler
stage for realizing the developed parallel designs. In the proposed
methodology, Handel-C is adopted as the last stage of development, generating
the final hardware product. Note that the Handel-C language relies on the
parallel constructs of CSP to model concurrent hardware resources. Most
algorithms described in CSP can be implemented in Handel-C. Handel-C
enables integration with VHDL and EDIF (Electronic Design Interchange
Format) and thus with various synthesis and place-and-route tools. The
Handel-C development stage is shown in Figure 1 as an automated compilation
step, supported by different code libraries and place-and-route tools, that
produces the desired hardware.
The adopted methodology is systematic in the sense that it is carried out
using step-by-step procedures. The development is still manual and is applied
according to the following informal procedure:
• Specify the algorithm in a functional setting relying on high-order func-
tions as the main building constructs wherever necessary.
• Apply the predefined set of rules to create the corresponding CSP net-
works according to a chosen degree of parallelism.
• Write the equivalent Handel-C code and complete the hardware compila-
tion.
These steps are aided by different compilers and integrated development
environments, as shown in Figure 2. The available mathematical rules
belong mainly to the refinement-to-CSP stage. The automation of the
development process, including the creation of a preprocessor, is currently
under investigation.
The research related to the adopted methodology was initiated by Abdallah,
who investigated the refinement from functional specifications into
concurrency [11] and presented a calculus of decomposition of higher-order
functions for the derivation of parallel programs [15]. The work of Hawkins
and Abdallah included the formalism for proving the refinement rules for both
datatypes and processes, and investigated possible Handel-C implementations
[13]. Case studies were developed for a JPEG decoder, the closest pair
algorithm, sorting algorithms, DNA processing algorithms [10, 12, 16, 17],
the Kasumi cryptographic algorithm [18], and various parallel implementations
of a matrix multiplication algorithm [19].
The main focus of this paper is on the realization and application of the
theory suggested by the development methodology. An additional focus is to
test the development method and to broaden its area of use to include an indus-
trial level application. Furthermore, it includes investigating the performance
of the developed designs by carrying out a thorough analysis and evaluation.
This leads to critically extending, tuning, and enhancing the suggested method
and its realization. In addition, the current investigation enriches the adopted
method by providing libraries that support and promote the method for further
investigation and possibly adoption by mainstream engineers.
The remaining sections of the paper are organized so that Section 2 intro-
duces background material. In Section 3, related work is discussed. The case
study from cryptography is proposed in Section 4. The analysis and perfor-
mance evaluation are included in Sections 5 and 6. Section 7 concludes the
paper.
2 Background
Abdallah and Hawkins defined in [12] some constructs used in the adopted
development model. Their investigation looked in some depth at data
refinement, which is the means of expressing structures in the specification
as communication behavior in the implementation.
Figure 2: Assisting tools used in the proposed development method, including
the Haskell Hugs98 interpreter to test the specification, Handel-C for
hardware compilation, and the Visual C++ integrated development environment
to create the host program driving the RC-1000 device with its FPGA.
The following parts of this section briefly introduce the proposed steps of
development: the functional paradigm, CSP, and Handel-C. In addition, we
introduce the basis of the refinement from a functional specification to
networks of CSP processes. The benefits of each development step and of the
adopted refinement approach are stressed in Section 5.
2.1 The Functional Paradigm
Functional programming (FP) is quite different from imperative (or procedural)
programming and also from object-oriented programming. FP's main concern is
expressions: everything reduces to an expression. An expression, that is, a
collection of operations and variables, results in a single value. Functions
are the main building blocks of functional programs and can be passed around
within a program like other values. Functional programs are usually
list-oriented, and they focus on describing the problem to be solved rather
than on the mechanism of its solution. There are currently many functional
languages; one of the most widely used is Haskell [20].
As a brief overview, functions are considered the basic unit of program
development and the major route to reuse. In addition, strong typing is
considered an aid to development before, during, and after implementation.
Some of the fundamental features of FP are powerful higher-order functions,
parametric polymorphism, and support for user-defined datatypes. Other
features of no less importance are lazy evaluation and programming with
infinite data structures. Overloading of function names is not supported in
all functional languages [21].
High-order functions are an important feature supported in functional lan-
guages. A high-order function is a function which takes another function as a
parameter. The most commonly used high-order functions are map, zipWith,
fold, and filter.
The functions map and zipWith are introduced in this section. The function
map takes a list and a function as parameters, then it applies the input function
to all elements of the input list, for example:
map even [1, 2, 3, 4] = [False,True,False,True]
Where even is a function that checks whether a number is even or not.
The function zipWith takes two lists and a function as inputs, then it
applies the function pairwise, to one element taken from each input list at a
time, for example:
zipWith add [1, 2, 3, 4] [2, 3, 4, 5]
= [3, 5, 7, 9]
Where add is a function that adds two numbers.
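The two examples above can be checked directly in Haskell. The following is a minimal sketch: map, zipWith, and even come from the standard Prelude, while add, evens, and sums are illustrative names introduced here.

```haskell
-- `add` is the two-argument addition used in the zipWith example.
add :: Int -> Int -> Int
add x y = x + y

-- map applies `even` to every element of the list.
evens :: [Bool]
evens = map even [1, 2, 3, 4 :: Int]

-- zipWith applies `add` to one element from each list at a time.
sums :: [Int]
sums = zipWith add [1, 2, 3, 4] [2, 3, 4, 5]
```

Evaluating evens and sums in an interpreter such as Hugs or GHCi reproduces the two results shown in the text.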
Related work adopting FP in hardware development is introduced in Sec-
tion 3, and the main benefits gained in using this paradigm in the adopted
model are discussed in Section 5.
2.2 Communicating Sequential Processes (CSP)
The Communicating Sequential Processes (CSP) notation is based on events
and processes. A CSP process engages in a series of events, which can be local
or can perform channel-based communication with synchronization. A channel
communication is an event in which at most two processes participate, one
inputting and the other outputting.
The alphabets of the components of a concurrent system determine the over-
all structure and interface of that system. The fact that the notation regards
basic concurrency operations as primitives enables the developer to concentrate
on the concurrent behavior of the system without needing to worry about the
implementation of these basic functions. A synchronization or communication
can be specified in one operation without any concern over how it takes place
[14].
Valuable features of CSP include its strong support for formal reasoning.
CSP allows generic assertions to be made about the behavior of a system, such
as deadlock freedom. Moreover, specific assertions about the required behavior
of a process can be made. These assertions are made with reference to a
number of models of the process. Two typical models are the traces model and
the failures-divergences model.
CSP also has the advantage of generality. The primitive operations of CSP
are simple enough that almost any form of concurrency can be represented using
them. Thus, CSP can be used to specify a wide variety of concurrent systems.
Moreover, it can be used to specify the intended functionality of a message
passing system at a formal level without requiring the system to be modified
for a specific architecture (as may be required by an implementation).
Employing such features in hardware development gives the designer the freedom
to choose an appropriate architecture and organization for an implementation
without affecting the original description. Many research projects have
employed CSP in hardware design; this is discussed in Section 3.
2.3 Data Refinement
In the following, the main concern is explaining the main constructs and rules
to be used in refining a possible functional specification with its description in
CSP notation. Accordingly, we start by presenting some communication entities
used for refining datatypes declared in the initial functional step of development;
these are Item, Stream, Vector, and some of their combined forms. We note
here that the suggested methodology relies on the message passing technique to
implement parallelism.
The Item corresponds to a basic type, such as an Integer data type, and it
is communicated on a single channel.
The Stream is a purely sequential method of communicating a list of values
(a list is the functional counterpart of an array in a language like C). It
comprises a sequence of messages on a channel, with each message representing
a value. Values are communicated one after the other. Assuming the stream
is finite, after the last value has been communicated, the end of transmission
(EOT) is signaled on a different channel. Given some type A, a Stream
containing values of type A is denoted as $\langle A \rangle$.
The Vector is a parallel method of communicating a list of values: each item
communicated by the vector is dealt with in parallel. A vector refinement of
a simple list of items communicates the entire structure in a single step.
Given some type A, a Vector of length n, containing values of type A, is
denoted as $\lfloor A \rfloor_n$.
Whenever dealing with multi-dimensional data structures, for example, lists
of lists, implementation options arise from differing compositions of our
primitive data refinements, streams and vectors. Examples of the combined
forms are the Stream of Streams, Stream of Vectors, Vector of Streams, and
Vector of Vectors. These forms are denoted by
$\langle S_1, S_2, \ldots, S_n \rangle$, $\langle V_1, V_2, \ldots, V_n \rangle$,
$\lfloor S_1, S_2, \ldots, S_n \rfloor$, and $\lfloor V_1, V_2, \ldots, V_n \rfloor$.
2.4 Process Refinement
The refinement is continued by looking into the functions specified in the first
stage of development. Accordingly, the refinement of the formally specified
functions to processes is the key step towards understanding possible parallel
behaviour of an implementation. In this section, the interest is in presenting
refinements of a subset of functions - some of which are higher-order. A bigger
refined set of these functions is discussed in [11].
Generally, these highly reusable building blocks can be refined to CSP in
different ways. This depends on the setting in which these functions are used
(i.e. with streams, vectors, etc.), and leads to implementations with different
degrees of parallelism. Note that we do not use CSP in a totally formal way,
but in a way that facilitates the later Handel-C coding stage. Recall for
the following subsections that values are communicated through an elements
channel, while a single bit is communicated through a separate eotChannel
channel to signal the end of transmission in the case of Streams.
2.4.1 Produce
The producer process (PRD) is fundamental to process refinement. It is used
to produce values on the channels of a certain communication construct (Item,
Stream, Vector, etc.). These values are to be received and manipulated by
other processes.
Items For simple, single item types (int, char, bool, etc.), the producer
process is very simple. This is depicted in Figure 3. Here the output is just
a single channel. The definition in CSP notation is very straightforward:
\[ \mathit{PRD}(\mathit{Item}\ a) = \mathit{out.element.channel}\ !\ a \rightarrow \mathit{SKIP} \]
Figure 3: The Producer process (PRD) for items
Figure 4: The Producer process (PRD) for streams
Streams The producer process for streams is depicted in Figure 4. As already
noted, the output in this case is a pair of channels. One channel carries the
values of the stream, and the other is a simple channel used to signal
EOT.
In a more general case, the structure of the values which the stream is
carrying is not necessarily known. These may be simple items, but may also be
streams or vectors. Generally, producing a stream could be described as:
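\[ \mathit{PRD}(\langle s \rangle) = \Big( ;_{i=1}^{\mathit{length}(s)}\ (\mathit{PRD}\ s_i)[\mathit{out.elements.channel}/\mathit{out}] \Big) ;\ \mathit{out.eotChannel}\ !\ \mathit{eot} \rightarrow \mathit{SKIP} \]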
This description defines PRD as a process that produces items sequentially
(this is described using the sequential execution operator ";"). The number of
items is equal to the length of the stream. After all elements are produced, an
end of transmission signal is produced on the eotChannel channel.
Vectors For vectors of size n, n instances of the producer process are
composed in parallel, one for each item in the vector. The output here is an
array of channels. This is depicted in Figure 5. A general definition is given
below:
\[ \mathit{PRD}(\lfloor v \rfloor_n) = |||_{i=1}^{n}\ (\mathit{PRD}\ v_i)[\mathit{out.elements}_i\mathit{.channel}/\mathit{out}] \]
The operator $|||_{i=1}^{n}$ indicates that n copies of the process PRD, each
producing a single item $v_i$, run concurrently.
Figure 5: The Producer process (PRD) for vectors
A process STORE stores a communication construct in a variable. We use
this process to store items, vectors, streams, or combinations of vectors and
streams. A subscript letter is used with the processes PRD and STORE to
indicate the type of communication. We sometimes omit this subscript if the
communication structure is clear from context.
2.4.2 Feeding Processes
The feed operator in CSP models function application. The feed operator is
written $\rhd$. It takes two processes, composes them together in
parallel, and renames both the output of the first and the input of the second
to a new name, which is then hidden. Given the lifted concepts of CSP channel
renaming and hiding, the definition can remain the same regardless of the type
of the communicating construct (Item, Stream, Vector or any combination).
\[ P \rhd Q = (P[\mathit{mid}/\mathit{out}] \parallel Q[\mathit{mid}/\mathit{in}]) \setminus \{\mathit{mid}\} \]
2.4.3 Formal Process Refinement
Given the definition of a feed operator that operates on processes, a formal
definition of process refinement can be given. Consider a function f, which
takes input values of type A and returns values of type B. Assume that the
data refinement step has already been performed, such that A and B are both
types of some transmission value:
f :: A → B
Then, consider a potential refinement for the function f, a process F. The
operator $\sqsubseteq$ denotes a process refinement, where the left hand side
is a function, and the right hand side is a process. To state that f is refined
to F, or in other words, that the process F is a valid refinement of the
function f, the following may be used:
\[ f \sqsubseteq F \]
The rules of refinement were proven once in [11] and are applied in this paper
to refine a functional specification into a network of communicating processes.
Figure 6: The SMAP process for streams
2.4.4 MAP: the Process Refinement of the Higher-order Function map
Now the attention is turned to the refinement of the widely used higher-order
function map [12]. Its use in stream and vector settings is presented. The
refinement for combined structures is made in a similar way.
Streams A process implementing the functionality of map f in stream terms
should input a stream of values, and output a stream of values with the function
f applied (See Figure 6).
In general, the handling of the EOT channels will be the same. However,
the handling of the value will vary depending on the type of the elements of the
input and output stream.
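\[
\begin{aligned}
\mathit{SMAP}(F) = \mu X \bullet\ & (\mathit{in.eotChannel}\ ?\ \mathit{eot} \rightarrow \mathit{out.eotChannel}\ !\ \mathit{eot} \rightarrow \mathit{SKIP}) \\
\Box\ & (F[\mathit{in.elements.channel}/\mathit{in},\ \mathit{out.elements.channel}/\mathit{out}] ;\ X)
\end{aligned}
\]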
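The behaviour of SMAP can be illustrated with a small Haskell model. This is only a sketch under simplifying assumptions: a stream is modelled as a list of messages with the EOT marker carried in-line rather than on a separate channel, and the names Msg, produce, smap, store, and agreesWithMap are hypothetical, introduced here for illustration. It is not CSP semantics, but it shows the intended refinement property: producing a stream, feeding it through SMAP, and storing the result agrees with map.

```haskell
-- A message on a stream: either a value or the end-of-transmission marker.
-- (Simplification: the paper signals EOT on a separate channel.)
data Msg a = Val a | EOT
  deriving (Eq, Show)

-- Model of PRD for streams: emit every element in order, then signal EOT.
produce :: [a] -> [Msg a]
produce xs = map Val xs ++ [EOT]

-- Model of SMAP: on EOT, forward it and stop; otherwise apply f and recurse.
smap :: (a -> b) -> [Msg a] -> [Msg b]
smap _ []           = []
smap _ (EOT : _)    = [EOT]
smap f (Val x : ms) = Val (f x) : smap f ms

-- Model of STORE: collect the values received before EOT.
store :: [Msg a] -> [a]
store (Val x : ms) = x : store ms
store _            = []

-- The refinement property, checked on one sample input.
agreesWithMap :: Bool
agreesWithMap =
  store (smap (* 2) (produce [1, 2, 3 :: Int])) == map (* 2) [1, 2, 3]
```

The two branches of smap mirror the two sides of the external choice in the CSP definition: the EOT branch terminates, while the value branch applies the refined function and recurses.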
Vectors In functional terms, the functionality of map f in a list setting is
modelled by vmap f in the vector setting. Consider F as a valid refinement of
the function f. The implementation of VMAP can then proceed by composing
n instances of F in parallel, and directing an item from the input vector to
each instance for processing (see Figure 7). In CSP we have:
\[ \mathit{VMAP}_n(F) = |||_{i=1}^{n}\ F[\mathit{in}_i/\mathit{in},\ \mathit{out}_i/\mathit{out}] \]
Figure 7: The VMAP process for vectors
Figure 8: The SZIPWITH process for streams
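The data-parallel structure of VMAP, one instance of F per vector element with its own input and output, can be mimicked in software with lightweight threads. The sketch below is an assumption-laden analogy rather than the paper's implementation: it uses Haskell threads from Control.Concurrent and one MVar per element as a stand-in for each output channel, and the names spawn and vmapIO are hypothetical.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Start one worker for a single element; the MVar plays the role of out_i.
spawn :: (a -> b) -> a -> IO (MVar b)
spawn f x = do
  out <- newEmptyMVar
  _ <- forkIO (putMVar out $! f x)  -- force the result, then send it
  return out

-- Model of VMAP: n independent workers, results read back in index order.
vmapIO :: (a -> b) -> [a] -> IO [b]
vmapIO f xs = do
  outs <- mapM (spawn f) xs
  mapM takeMVar outs
```

For example, vmapIO (+ 1) [1, 2, 3] yields the same list as map (+ 1) [1, 2, 3]; only the evaluation strategy differs. On an FPGA the n instances are physically separate circuits, which this thread-based model does not capture.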
2.4.5 ZIPWITH: the Process Refinement of the Higher-order Function zipWith
Recall another higher-order function, namely zipWith. This function is used
to zip two lists (taking one element from each list) with a certain operation.
Formally:
zipWith :: (A → B → C) → [A] → [B] → [C]
zipWith (⊕) [x1, x2, ..., xn] [y1, y2, ..., yn] =
[x1 ⊕ y1, x2 ⊕ y2, ..., xn ⊕ yn]
Streams The process implementation of (zipWith f) in stream terms should
input two streams of values, and output a stream of values with the function f
applied (see Figure 8).
Again, the handling of the EOT channel is the same. Nevertheless, the
handling of the values varies depending on the type of the elements of the
input and output streams.
\[
\begin{aligned}
\mathit{SZIPWITH}(F) = \mu X \bullet\ & (\mathit{in.eotChannel}\ ?\ \mathit{eot} \rightarrow \mathit{out.eotChannel}\ !\ \mathit{eot} \rightarrow \mathit{SKIP}) \\
\Box\ & (F[\mathit{in}_1\mathit{.elements.channel}/\mathit{in}_1,\ \mathit{in}_2\mathit{.elements.channel}/\mathit{in}_2,\ \mathit{out.elements.channel}/\mathit{out}] ;\ X)
\end{aligned}
\]
Figure 9: The VZIPWITH process for vectors
Vectors To implement the data-parallel version of this higher-order function,
we refine it to a process VZIPWITH that takes two vectors as input and zips
the two lists with a process F; F is a refined process from the function (⊕).
This is depicted in Figure 9.
\[ \mathit{vzipWith}\ (\oplus) :: \lfloor A \rfloor_n \rightarrow \lfloor B \rfloor_n \rightarrow \lfloor C \rfloor_n \]
\[ \mathit{VZIPWITH}(\oplus) = |||_{i=1}^{n}\ F[\mathit{out}_i/\mathit{out},\ c_i/\mathit{in}_1,\ d_i/\mathit{in}_2] \]
2.5 Handel-C as a Stage in the Development Model
Based on datatype refinement and the skeleton afforded by process refinement,
the desired reconfigurable circuits are built. Circuit realisation is done using
Handel-C, as it is based on the theories of CSP [14] and Occam [22].
From a practical standpoint, each refined datatype is defined as a structure
in Handel-C, while each process is implemented as a macro procedure. We divide
the constructs corresponding to the CSP stage into two main categories for
organisation purposes. The first category represents the definitions of the
refined datatypes. The second category implements the refined processes. The
refined processes are divided into different groups. The utility processes
group contains macros responsible for producing, storing, sinking,
broadcasting data, etc. The basic processes group contains macros that
correspond to simple arithmetic and logical operations, such as addition and
multiplication. The higher-order processes group contains the macros realising
the CSP implementations corresponding to the higher-order functions. A
separate group contains the macros that handle the FPGA card setup and general
functionality. The reusable macros found in these groups serve as building
blocks for constructing a specified algorithm.
2.5.1 Datatypes Definitions
The datatype definitions are implemented using structures. This method
supports recursive as well as simple types. The definition for an Item of a
type Msgtype is a structure that contains a communicating channel of that type.
#define Item(Name, Msgtype)   \
    struct {                  \
        chan Msgtype channel; \
        Msgtype message;      \
    } Name
For generality in implementing processes the type of the communicating
structure is to be determined at compile time. This is done using the typeof
type operator, which allows the type of an object to be determined at compile
time. For this reason, in each structure we declare a message variable of type
Msgtype.
A stream of items, called StreamOfItems, is a structure with three
declarations: a communicating channel, an EOT channel, and a message variable
[12]:
#define StreamOfItems(Name, Msgtype) \
    struct {                         \
        Msgtype message;             \
        chan Msgtype channel;        \
        chan Bool eotChannel;        \
    } Name
A vector of items, called VectorOfItems, is a structure with a variable mes-
sage and another array of sub-structure elements [12].
#define VectorOfItems(Name, n, Msgtype) \
    struct {                            \
        struct {                        \
            chan Msgtype channel;       \
        } elements[n];                  \
        Msgtype message;                \
    } Name
Other definitions are possible, but they affect the way a channel is accessed
using the structure member operator (.). Examples of different extended
definitions are as follows (the first definition reuses the Item structure,
while the second one employs the channel arrays supported in Handel-C):
#define VectorOfItems(Name, n, Msgtype) \
    struct {                            \
        struct {                        \
            Item(element, Msgtype);     \
        } elements[n];                  \
    } Name
#define VectorOfItems(Name, n, Msgtype) \
    struct {                            \
        chan Msgtype channel[n];        \
        Msgtype messages;               \
    } Name
2.5.2 Utilities Macros
The utility processes used in the implementation are related to the employed
datatypes. The Handel-C implementation of these processes relies on their cor-
responding CSP implementation. An instance of these utility macros is shown
in the following code segment:
macro proc ProduceItem(Item, x){
Item.channel ! x;}
macro proc StoreItem(Item, x){
Item.channel ? x;}
2.5.3 Higher-Order Processes Macros
An example implementation in Handel-C of the CSP refinement of a
higher-order function (map) is as follows. The process runs through a
loop which terminates when the variable eot is set to true. At each step of the
loop, the process enters a wait state until either the EOT or the value channel
of the input stream is willing to communicate. If the EOT channel is willing
to communicate, the input is consumed from it and stored in the variable eot,
and an EOT message is then output on the output stream. If the value channel
of the input stream is willing to communicate, the value is consumed and F is
applied to it, giving the result on the output stream channel.
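macro proc SMAP(streamin, streamout, F) {
    Bool eot;
    eot = False;
    do {
        prialt {
            // EOT received: forward it on the output stream and stop.
            case streamin.eotChannel ? eot:
                streamout.eotChannel ! True;
                break;
            // Otherwise apply F to the next value of the stream.
            default:
                F(streamin.elements, streamout.elements);
                break;
        }
    } while (!eot);
}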
We now turn to a definition in Handel-C for the behaviour of the process
VMAP. Here we can employ Handel-C's enumerated par construct to place n
instances of the process F in parallel. Each instance is passed the
corresponding pair of channels from the input and output vectors.
macro proc VMAP(n, vectorin, vectorout, F) {
    typeof (n) c;
    // One instance of F per vector element, all running in parallel.
    par (c = 0; c < n; c++) {
        F(vectorin.elements[c], vectorout.elements[c]);
    }
}
2.6 Evaluation Tools and Performance Metrics
Different tools are used to measure the performance metrics used in the
analysis. These tools include the DK design suite from Celoxica, which reports
the number of NAND gates for the design as compiled to EDIF. The DK simulator
also gives the number of cycles taken by a design. Accordingly, the speed of a
design can be calculated from its cycle count and the expected maximum
frequency of the design.
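The calculation implied here is the usual ratio of cycle count to clock frequency. A minimal sketch follows; the function name executionTime and the sample figures in the usage note are hypothetical and are not measurements from this paper.

```haskell
-- Estimated execution time in seconds, given the simulated cycle count
-- and the expected maximum clock frequency in Hz.
executionTime :: Double -> Double -> Double
executionTime cycles frequencyHz = cycles / frequencyHz
```

For instance, a hypothetical design taking 1000 cycles at an expected 50 MHz would be estimated as executionTime 1000 50e6, that is, 20 microseconds.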
To get the practical execution time as observed from the host computer,
the C++ high-precision performance counter is used. The counter times the
execution of the design from the point after the image of the design is loaded
into the FPGA until termination. In practice, timing starts directly after
writing a control signal to the FPGA that enables execution. The counter stops
immediately after a completion signal is received by reading the status
register. The speed of execution is calculated from this measurement.
The information about the hardware area occupied by a design, i.e. number
of Slices used after placing and routing the compiled code, is determined by the
ISE place and route tool. In the current investigation the only used metrics are
the number of Slices and the Total Equivalent Gate Count for a design.
3 Related Work
In this section we define four perspectives, not necessarily mutually exclusive or
unconnected, to be considered for relating our work with its global literature:
• Purpose: Related to frameworks created for refining correct hardware
implementations.
• Implementation Framework: Related to the use of the Functional Paradigm
in hardware development. Related work in this area might also meet the
purpose of developing correct reconfigurable hardware.
• Description: Related to the use of CSP in hardware development.
• Application: Related to the use of FPGAs in implementing the Serpent
cryptographic algorithm.
The idea of deriving implementations from the specification through correct,
well-defined refinement steps has been motivated by many technical facts.
For instance, there are limitations in commonly used synthesis tools and in
the formal verification techniques used for equivalence checking between the
synthesized hardware and the abstract specification [23]. Many frameworks for
developing correct hardware have been proposed in the literature
[24, 25, 26, 23]. Our work shares with these multi-stage frameworks the aim of
refining correct hardware from a specification.
The Provably Correct Systems project (ProCoS) suggested a mathematical
basis for the development of embedded and real-time computer systems. The
project used FPGAs as back-end hardware for realizing the developed designs [24].
In [26], a formal approach is proposed for correctly generating an
architecture-level model of a system from its specification model. The approach
relies on formal transformations to refine a specification model into a provably
correct architectural model. Tools have been created to support automatic
generation of refined models [23].
The attractions of the functional paradigm for hardware development have
motivated many researchers and triggered many investigations in this area, such
as Lava [27], Hawk [28, 29], Hydra [30], HML [31], MHDL [32], the DDD system
[33], SAFL [34], MuFP [35], Ruby [36], and Form [37].
The compilation of Occam into FPGAs [38, 39] and the Handel-C compiler [3]
are considered the major work introducing CSP into hardware development.
Susan Stepney at the University of York [4, 40] investigated ways to translate
between CSP and Handel-C, where the Handel-C compiler is used to map designs
onto FPGAs. The suggested translation uses FDR2 as a front-end specification and
proof tool, then automatically translates the formal designs into executable
Handel-C.
Many efforts have been made to implement the Serpent efficiently in hardware.
R. Anderson proposed the Serpent algorithm in [41] and evaluated its
performance under different processing systems. Elbirt et al. in [42] presented
an FPGA implementation and performance evaluation of the Serpent. Multiple
architecture options of the Serpent algorithm were explored, with a strong focus
placed on high-speed implementations. Bora in [43] investigated the
possibilities of realising the Serpent using the ALTERA FLEX10K FPGA series.
Implementations of this algorithm were also introduced in [44], in an effort
to determine the most suitable candidate for hardware implementation within
commercially available FPGAs.
4 Case Study: The Serpent Cryptographic Algorithm
The Serpent algorithm is chosen as a test case for the proposed development
model. The motivation behind choosing the Serpent is its proven strength and
suitability for hardware implementation [41]. The Serpent algorithm is a
32-round substitution-permutation (SP) network operating on four 32-bit words.
The algorithm encrypts and decrypts 128-bit input data under a key of 128, 192,
or 256 bits in length. The Serpent algorithm consists of three main blocks: an
initial permutation (IP), a 32-round block, and a final permutation (FP). One
round function comprises three operations occurring in sequence: a bit-wise XOR
with the 128-bit round key, substitution via 32 copies of one of eight S-boxes,
and data mixing via a linear transformation. These operations are performed in
each of the 32 rounds with the exception of the last round. In the last round,
the linear transformation is replaced with a bit-wise XOR with a final 128-bit
key.
This section develops parallel implementations of the Serpent algorithm,
showing all stages of development and the results of testing. The following
subsections present the functional specification, followed by the refinement and
the implementation in Handel-C. Various designs with different degrees of
parallelism are investigated. Different solutions are presented to some
realization pitfalls. The final section presents the results of running the
compiled designs, with a comparison among different processing systems.
4.1 Formal Functional Specification
The Serpent is constructed from two main building blocks: the key scheduling
block and the encryption (decryption) block. The key scheduling block takes the
private key as input and outputs the desired 132 subkeys. The encryption block
takes data segments representing the plaintext as input and outputs the
corresponding ciphered data segments. The formal functional specification
employs the following type synonyms to clarify the type definitions.
type Private = [Bool]
type SubKey = [Bool]
type DataBlock = [Bool]
The following subsections present the specification of the Serpent algorithm.
The implementation of the specification under the Hugs 98 Haskell interpreter is
tested at the unit, component and integration levels.
4.1.1 Key Scheduling
Two main steps are carried out to generate the required 132 32-bit subkeys for
the Serpent. The algorithm for generation is as follows:
• Generate an intermediate list ws by:
– Padding the input key to 256 bits if necessary.
– Then, partitioning the key into eight segments of equal length (32 bits),
$ws_0, \ldots, ws_7$.
– Then, expanding these to the intermediate prekeys $ws_8, \ldots, ws_{139}$ by
the following recurrence:
$$ws_i := (ws_{i-8} \oplus ws_{i-5} \oplus ws_{i-3} \oplus ws_{i-1} \oplus \mathrm{9e3779b9}_{hex} \oplus (i-8)) \ll 11$$
where $(\ll n)$ is the n-element left circular shift operator.
• The round subkeys ks are now calculated from the prekeys ws using the
S-boxes as follows:
$\{k_0; k_1; k_2; k_3\} = S_3(w_0; w_1; w_2; w_3)$
$\{k_4; k_5; k_6; k_7\} = S_2(w_4; w_5; w_6; w_7)$
$\{k_8; k_9; k_{10}; k_{11}\} = S_1(w_8; w_9; w_{10}; w_{11})$
$\{k_{12}; k_{13}; k_{14}; k_{15}\} = S_0(w_{12}; w_{13}; w_{14}; w_{15})$
$\{k_{16}; k_{17}; k_{18}; k_{19}\} = S_7(w_{16}; w_{17}; w_{18}; w_{19})$
...
$\{k_{124}; k_{125}; k_{126}; k_{127}\} = S_4(w_{124}; w_{125}; w_{126}; w_{127})$
$\{k_{128}; k_{129}; k_{130}; k_{131}\} = S_3(w_{128}; w_{129}; w_{130}; w_{131})$
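The prekey recurrence above can be sketched on 32-bit machine words as follows. This is only an illustrative sketch under stated assumptions: it uses Word32 and Data.Bits instead of the list-of-Bool representation used in the specification, it assumes the key has already been padded and split into eight words, and the names phi and prekeys are introduced here and are not part of the paper.

```haskell
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

-- The constant 9e3779b9 (hex) used in the recurrence.
phi :: Word32
phi = 0x9e3779b9

-- | Extend the eight key words ws_0 .. ws_7 to ws_0 .. ws_139 and return
-- the 132 prekeys ws_8 .. ws_139 (assumption: the key is already padded).
prekeys :: [Word32] -> [Word32]
prekeys key8 = drop 8 (go 8 key8)
  where
    go :: Int -> [Word32] -> [Word32]
    go i ws
      | i < 140   = go (i + 1) (ws ++ [next i ws])
      | otherwise = ws
    -- ws_i = (ws_{i-8} xor ws_{i-5} xor ws_{i-3} xor ws_{i-1}
    --         xor phi xor (i-8)) rotated left by 11 bits
    next i ws =
      rotateL
        (foldr1 xor
           [ ws !! (i - 8), ws !! (i - 5), ws !! (i - 3), ws !! (i - 1)
           , phi, fromIntegral (i - 8) ])
        11
```

Appending to the end of a list and indexing with (!!) keeps the sketch close to the recurrence; for 140 elements the quadratic cost is negligible.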
The function keySchedule formally specifies the above algorithm. This func-
tion inputs the private key and outputs the desired subkeys following the steps
clarified in Figure 10. This figure also shows the format of the final output as
ordered for later use in the functions specifying the encryption.
keySchedule :: Private -> [[SubKey]]
keySchedule key = concat kss
  where
    ws  = drop 8 (generateWs 8 (segs 32 key))
    kss = map (mapWith [s3, s2, s1, s0,
                        s7, s6, s5, s4])
              (segs 8 (segs 4 ws))
The S-boxes are applied by mapping the function (mapWith
[s3, s2, s1, s0, s7, s6, s5, s4]) over the prepared segmentation of ws, (segs 8
(segs 4 ws)). Note that the list ws contains 132 elements at this point.
Grouping this list into segments of four and then of eight gives four lists,
each of eight 4-element sublists, covering 128 elements of ws. The remaining
4 elements constitute a final list of four elements. With the lazy evaluation
property found in functional programming, the final mapped mapWith only
applies the function s3 to the remaining list. This gives the desired output
list of lists representing the 132 round subkeys.
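The grouping behaviour described above can be reproduced with two small helpers. The definitions of segs and mapWith are not shown in this section, so the versions below are assumptions chosen to match the described behaviour: segs splits a list into chunks of a given size, and mapWith pairs a list of functions with a list of arguments and stops at the shorter list.

```haskell
-- | Split a list into consecutive chunks of n elements; the last chunk
-- may be shorter. (Assumed definition, not taken from the paper.)
segs :: Int -> [a] -> [[a]]
segs n xs
  | n <= 0    = error "segs: chunk size must be positive"
  | null xs   = []
  | otherwise = take n xs : segs n (drop n xs)

-- | Apply the i-th function to the i-th argument. Because zipWith stops
-- at the shorter list, a one-element argument list only uses the first
-- function. (Assumed definition, not taken from the paper.)
mapWith :: [a -> b] -> [a] -> [b]
mapWith = zipWith ($)
```

With these definitions, segs 8 (segs 4 ws) on a 132-element list yields five groups of sizes 8, 8, 8, 8 and 1, so the last application of mapWith only reaches its first function, which corresponds to s3 in keySchedule.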
The function generateWs, responsible for generating the prekeys, is specified as follows:
Figure 10: Steps for Serpent subkeys generation
generateWs :: Int -> [[Bool]] -> [[Bool]]
generateWs i ws
  | (i < 140) && (i > 7) =
      generateWs (i+1) (ws ++ [wsD])
  | otherwise = ws
  where
    wsD = shift 11 (foldr1 fullexor
            [ws!!(i-8), ws!!(i-5),
             ws!!(i-3), ws!!(i-1),
             const, itob (i-8)])
    const = concat
              (map (itob . htoi) ["9e37", "79b9"])
The S-boxes are specified using the logic functions fullexor, fullOR, fullAND,
and fullComplement. These correspond to the full-word bitwise versions of the
XOR, OR, AND, and NOT logic operations. For instance, the first S-box is
specified as the function s0, with a list of lists of Bool as input and output.
The input list elements [a, b, c, d] are distributed to different operations
computing the final output list [w, x, y, z]. Temporary variables used to
compute the final output list are grouped to be zipped with their operation
using the higher-order function zipWith. The current specification does not
reflect the order in which these operations should be carried out. A dependency
analysis has to be done to aid the later refinement. Note that the decryption
inverse S-boxes are specified in a similar way. In the following we show the
specification of the s0 function.
s0 :: [[Bool]] -> [[Bool]]
s0 [a,b,c,d] = [w, x, y, z]
  where
    [t01, t03, z, t06, y,
     t12, t13, t15, t17, x] =
      zipWith fullexor
        [b, a, t02, a, t09,
         c, t07, t06, w, t12]
        [c, b, t01, d, t08,
         d, t11, t13, t14, t17]
    [t05, t07, t02] =
      zipWith fullOR [c, b, a] [z, c, d]
    [t08, t09, t11, t14] =
      zipWith fullAND [d, t03, t09, b]
                      [t05, t07, y, t06]
    w = fullComplement t15
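To make the data dependencies between the temporaries explicit, the same equations can be written on 32-bit words with the bindings listed in one valid evaluation order. This is a sketch under assumptions: Word32 with Data.Bits replaces the list-of-Bool representation, and the names s0Word and s0Nibble are introduced here for illustration only.

```haskell
import Data.Bits (complement, shiftR, xor, (.&.), (.|.))
import Data.Word (Word32)

-- | Bitsliced S0 on four 32-bit words, using the same equations as s0,
-- listed in an order in which each binding only uses earlier ones.
s0Word :: (Word32, Word32, Word32, Word32) -> (Word32, Word32, Word32, Word32)
s0Word (a, b, c, d) = (w, x, y, z)
  where
    t01 = b `xor` c
    t02 = a .|. d
    t03 = a `xor` b
    z   = t02 `xor` t01
    t05 = c .|. z
    t06 = a `xor` d
    t07 = b .|. c
    t08 = d .&. t05
    t09 = t03 .&. t07
    y   = t09 `xor` t08
    t11 = t09 .&. y
    t12 = c `xor` d
    t13 = t07 `xor` t11
    t14 = b .&. t06
    t15 = t06 `xor` t13
    w   = complement t15
    t17 = w `xor` t14
    x   = t12 `xor` t17

-- | Apply the bitsliced S-box to one 4-bit value, taking a as the least
-- significant input bit and w as the least significant output bit.
s0Nibble :: Int -> Int
s0Nibble n = fromIntegral (bit0 w + 2 * bit0 x + 4 * bit0 y + 8 * bit0 z)
  where
    pick k       = fromIntegral ((n `shiftR` k) .&. 1) :: Word32
    (w, x, y, z) = s0Word (pick 0, pick 1, pick 2, pick 3)
    bit0 v       = v .&. 1
```

Evaluating map s0Nibble [0 .. 15] gives 3, 8, 15, 1, 10, 6, 5, 11, 14, 13, 4, 2, 7, 0, 9, 12, which is the S0 table of the Serpent specification.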
4.1.2 Serpent Block Cipher
Flowcharts showing the steps to carry out the encryption and the decryption are
shown in Figure 11. Decryption differs from encryption in that the inverses
of the S-boxes must be used in reverse order, together with the inverse linear
transformation and the reverse order of the subkeys.
A functional specification formulates Serpent encryption as a function
serpentEncrypt. This function takes a list of lists of data blocks as input.
It then maps the function serpentEncryptSeg, responsible for encrypting a single
128-bit data block with the input private key, over all the input list
elements. The functional specification of serpentEncrypt is as follows:
serpentEncrypt :: [[DataBlock]] -> Private
                  -> [[DataBlock]]
serpentEncrypt inputs key =
    map (serpentEncryptSeg (keySchedule key))
        inputs
Figure 11: Serpent encryption (a) and decryption (b) flowcharts
The formalised function serpentEncryptSeg takes the generated round subkeys,
in the form of a list of lists, together with the 128-bit plaintext input data
block. The subkeys of the first 31 rounds are taken from the input list of
subkeys and zipped into a list of pairs with the corresponding S-box numbers.
The higher-order function foldl is used with the function serpentFold to fold
the input data block over the zipped list of pairs. In other words, the function
foldl replicates the required 31 rounds in a pipelined fashion. The final round
is carried out by XORing the output from the 31st round with the 32nd set of
subkeys (sKeys!!31); the result is then passed to the function s7. The final
ciphered output is the result of XORing the output from the function s7 with
the last set of subkeys (sKeys!!32). The suggested formal functional
specification is as follows:
serpentEncryptSeg :: [[SubKey]] ->
                     [DataBlock] -> [DataBlock]
serpentEncryptSeg sKeys input =
    zipWith fullexor (sKeys!!32) (s7 xorOut)
  where
    xorOut    = zipWith fullexor (sKeys!!31)
                  roundsOut
    roundsOut = foldl serpentFold input
                  (zip (take 31
                         (concat (copy1 [0,1,2,3,4,5,6,7] 5)))
                       (take 31 sKeys))
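The folding pattern used by serpentEncryptSeg can be isolated from the Serpent-specific operations. The sketch below is an illustration only: sboxSchedule and encryptShape are names introduced here, the round and final functions are left as parameters, and cycle is used in place of the paper's copy1 on the assumption that copy1 repeats the list of S-box numbers.

```haskell
-- S-box number used by each of the first 31 rounds: 0,1,..,7,0,1,..,6.
sboxSchedule :: [Int]
sboxSchedule = take 31 (cycle [0 .. 7])

-- | Shape of the encryption function: fold the state through the first
-- 31 keyed rounds, then apply a separate final step. The arguments
-- roundFn and finalFn are placeholders standing in for serpentFold and
-- for the last round; keys plays the role of sKeys.
encryptShape :: (s -> (Int, k) -> s) -> (s -> s) -> [k] -> s -> s
encryptShape roundFn finalFn keys input =
  finalFn (foldl roundFn input (zip sboxSchedule (take 31 keys)))
```

Because the round function is a parameter, the same shape can be exercised with a toy round function to check the order in which S-box numbers and subkeys are consumed before plugging in the real operations.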
A Serpent fold, specified as the function serpentFold, takes a data block
and a pair consisting of a list of four subkeys and the S-box number employed
in that fold. The subkeys are zipped with the input, passed to the
corresponding S-box, and finally linearly transformed using the function
lTransform. The input S-box number is used to choose one of the available
S-boxes held in the list of functions s. A possible formalisation is as follows:
serpentFold :: [DataBlock] ->
               (Int, [SubKey]) -> [[Bool]]
serpentFold input (i,skey) =
    lTransform ((s!!i)
                (zipWith fullexor skey input))
  where
    s = [s0, s1, s2, s3,
         s4, s5, s6, s7]
The function lTransform linearly transforms a list of 4 inputs into a list of
4 outputs. The transformation uses the left circular shift function shift and
the left shift function lshift as follows:
lTransform :: [[Bool]] -> [[Bool]]
lTransform [x0, x1, x2, x3] =
    [y0, y1, y2, y3]
  where
    [y0i, y2i, y0, y1, y2, y3] =
      mapWith [(shift 13), (shift 3),
               (shift 5), (shift 1),
               (shift 22), (shift 7)]
              [x0, x2, y0ii, y1i, y2ii, y3i]
    [y1i, y3i, y0ii, y2ii] =
      zipWith fullexor
        (zipWith fullexor
           [x1, y2i, y0i, y2i]
           [y0i, (lshift 3 y0i),
            y1, y3])
        [y2i, x3, y3, (lshift 7 y1)]

lshift :: Int -> [Bool] -> [Bool]
lshift n ls =
    (drop n ls) ++ (copy False n)
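For comparison, the same linear transformation can be sketched on 32-bit words, with the intermediate values named as in lTransform and listed in dependency order. This is an illustrative sketch, assuming Word32 with rotateL and shiftL from Data.Bits in place of shift and lshift on lists of Bool; the name lTransformWord is introduced here.

```haskell
import Data.Bits (rotateL, shiftL, xor)
import Data.Word (Word32)

-- | Word-level sketch of the linear transformation. rotateL plays the
-- role of the circular shift and shiftL the role of lshift.
lTransformWord :: (Word32, Word32, Word32, Word32)
               -> (Word32, Word32, Word32, Word32)
lTransformWord (x0, x1, x2, x3) = (y0, y1, y2, y3)
  where
    y0i  = x0 `rotateL` 13
    y2i  = x2 `rotateL` 3
    y1i  = x1 `xor` y0i `xor` y2i
    y3i  = x3 `xor` y2i `xor` (y0i `shiftL` 3)
    y1   = y1i `rotateL` 1
    y3   = y3i `rotateL` 7
    y0ii = y0i `xor` y1 `xor` y3
    y2ii = y2i `xor` y3 `xor` (y1 `shiftL` 7)
    y0   = y0ii `rotateL` 5
    y2   = y2ii `rotateL` 22
```

Writing the bindings in this order shows the two dependency layers of the transformation: y1 and y3 depend only on the first pair of rotations, while y0 and y2 additionally depend on y1 and y3.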
4.2 Algorithms Refinement to CSP
For the key scheduling part we suggest two designs. The first design implements
the scheduling in a data-parallel fashion. The second design economises the
implementation by carefully removing replication from one of the main building
blocks. For the encryption part, we suggest three designs. The first design
presents a fully pipelined network of rounds. The second design uses only one
stage from the pipeline suggested in the first design. In this case inputs and
outputs are refined to streams. The third design leaves a flexible choice for the
level of parallelism, allowing control over the number of pipelined stages.
4.2.1 Key Scheduling
At this development stage, we refine each function from the specification of the
key scheduling part. In the following section, the two suggested designs are
presented and explained.
First Design The types used in the specification of the function keySchedule
are refined to a 256-bit Integer item for the private key, and a vector of vectors
of items of size (33× 4) for the output subkeys:
$keySchedule :: Int256 \rightarrow \lfloor \lfloor Int32 \rfloor_{4} \rfloor_{33}$
The refinement implements the function keySchedule as a process KEYSCHED-
ULE. According to the specification, the first event to occur is the segmentation
of the input key into eight segments using a predefined process SEGS. These
eight segments are passed to the process GENERATEWS.
KEYSCHEDULE = ((PRD(32)  SEGS)8 STOREv (ws)); (GENERATEWS(8,ws))132
(VMAP4(VMAPWITH ([S0,S1,S2,S3,S4,S5,S6,S7])) ‖ S3)
where,
$S0 \sqsubseteq s0$; $S1 \sqsubseteq s1$; $S2 \sqsubseteq s2$; $S3 \sqsubseteq s3$; $S4 \sqsubseteq s4$; $S5 \sqsubseteq s5$; $S6 \sqsubseteq s6$; $S7 \sqsubseteq s7$;
The higher-order process VMAP4 creates four parallel instances of the process
VMAPWITH. In turn, 32 parallel instances of the S-box processes are available
for parallel computation. These 32 S-box processes take 128 items from the 132
prekeys generated by the process GENERATEWS. The final four prekeys are passed
to a parallel instance of the process S3. The output from these parallel S-box
processes is the desired vector of 132 round subkeys. The process KEYSCHEDULE
is depicted in Figure 12.
Figure 12: The process KEYSCHEDULE, first design
The function generateWs could be refined as follows:
$generateWs :: Int32 \rightarrow \lfloor Int32 \rfloor_{8} \rightarrow \lfloor Int32 \rfloor_{132}$
$generateWs \sqsubseteq GENERATEWS$
GENERATEWS(i ,ws) =
if (7 < i < 140)
then WsD(i ,ws) StoreItem(wsd);
GENERATEWS(i + 1,ws ++ [wsd ])
else PRD(ws)
Unrolling the above recursive implementation for GENERATEWS (8,ws):
GENERATEWS(8,ws) =
WsD(8,ws) STOREv (wsd); GENERATEWS(9,ws ++ [wsd ]);
WsD(9,ws) STOREv (wsd); GENERATEWS(10,ws ++ [wsd ]);
.
.
.
WsD(139,ws) STOREv (wsd); GENERATEWS(140,ws ++ [wsd ]); PRD(ws)
This could be done as:
GENERATEWS(8,ws) =
for(i = 8; i < 140; i++){
WsD(i ,ws) StoreItem(wsd); }
where,
$WsD(i, ws) = out\,!\,((ws[i-8] \oplus ws[i-5] \oplus ws[i-3] \oplus ws[i-1] \oplus \mathrm{9e3779b9}_{hex} \oplus (i-8)) \ll 11)$
Figure 13: The process KEYSCHEDULE, second design with replication reduced
Second Design The second design aims to eliminate the replication in the
S-box computation processes. This leads to a smaller hardware circuit in the
later stage, at the expense of speed. The change from the first design
is made by refining map to its stream setting. This implementation is depicted
in Figure 13 and described in the following CSP network:
KEYSCHEDULE = ((32  SEGS)8 STOREv (ws));
(GENERATEWS(8,ws))
(SMAP(VMAPWITH ([S0,S1,S2,S3,S4,S5,S6,S7])) ‖ S3)
4.2.2 Serpent Block Cipher
The current refinement is done in three different designs. The process responsi-
ble for a single block ciphering is SERPENTESEG, the refinement of the func-
tion serpentEncryptSeg. The input data items, for instance, could be passed as
a stream of vectors of four 32-bit data items to the encrypting block SERPEN-
TESEG. The output is refined also to a stream of items as follows:
$serpentEncrypt(key) :: \langle \lfloor Int32 \rfloor_{4} \rangle \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle$
Consequently, we suggest the following refinement employing the higher-
order process SMAP. The key, in this case, is passed as an argument to the
process SERPENTENCRYPT.
Figure 14: The process SERPENTESEG, first pipelined design
SERPENTENCRYPT (key) = KEYSCHEDULE(Key) SMAP(SERPENTESEG)
A multi-way Serpent encryption version is implemented as follows:
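$serpentEncrypt(key) :: \langle \lfloor \langle \lfloor Int32 \rfloor_{4} \rangle \rfloor_{n} \rangle \rightarrow \langle \lfloor \langle \lfloor Int32 \rfloor_{4} \rangle \rfloor_{n} \rangle$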
SERPENTENCRYPT (key) = KEYSCHEDULE(key)
SMAP(VMAPn(SMAP(SERPENTESEG)))
where the value of n is limited by the ability to realise this network on
the available hardware in the following stage. The following three designs are
suggested for the implementation of the process SERPENTESEG.
First Design This design suggests a fully pipelined implementation of the
Serpent encryption specification. The pipeline is constructed by replicating the
single round specified as the function serpentFold. The replication is done using
the vector setting refinement of the higher-order function foldl, where the input
is a vector of items. The input 132 subkeys are distributed to the pipelined folds
as shown in Figure 14. Also, the number of the round in use is distributed to the
pipelined folds. The output from the pipeline is the input to the higher-order
process VZIPWITH(EXOR), zipping it with a set of four subkeys. The result
of zipping is passed to an S7 S-box process, whose output vector is zipped again
using another VZIPWITH(EXOR) with the last generated set of four subkeys.
The CSP description is as follows:
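$serpentEncryptSeg :: \lfloor Int32 \rfloor_{132} \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle$
$serpentEncryptSeg \sqsubseteq SERPENTESEG$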
SERPENTESEG = (BROADCAST3([0..7]) ‖ (PRD([0..6])))  (VVFOLDL(SERPENTFOLD) ‖
VZIPWITH4(EXOR)4 S7 ‖ VZIPWITH4(EXOR))
where,
serpentFold v SERPENTFOLD
The serpent fold is implemented as in the following:
Figure 15: The process SERPENTESEG, second design with stream of subkeys
$serpentFold :: \lfloor Int32 \rfloor_{4} \rightarrow (Int3, \lfloor Int32 \rfloor_{4}) \rightarrow \lfloor Int32 \rfloor_{4}$
SERPENTFOLD = (in?i → SKIP); VZIPWITH4(EXOR) Si  LTRANSFORM
where
$lTransform \sqsubseteq LTRANSFORM$
The linear transformation function lTransform is refined to the process
LTRANSFORM. The input and output are refined as a vector of items as fol-
lows:
$lTransform :: \lfloor Int32 \rfloor_{4} \rightarrow \lfloor Int32 \rfloor_{4}$
The process LTRANSFORM is implemented as follows:
LTRANSFORM = $(|||_{i=0}^{3}\ in[i]?x[i] \rightarrow SKIP)$;
LSHIFT (3) ‖ LSHIFT (7) ‖
(VZIPWITH4(EXOR)4 VZIPWITH4(EXOR)) ‖
VMAPWITH ([SHIFT (1),SHIFT (13),SHIFT (3),
SHIFT (5),SHIFT (1),SHIFT (22),SHIFT (7)])
Second Design In this design, the component processes of the network remain the
same as in the first design, with a modification to the way they communicate.
Stream communication with the main process SERPENTFOLD allows the elimination
of copies of this process by using SVFOLDL, the stream refinement of foldl,
where the input is a vector of items. The subkeys, at this point, are passed
sequentially to the process SERPENTFOLD. Only the last two sets of subkeys are
produced as vectors, to be used in the two similar parallel processes
VZIPWITH4(EXOR). This network is shown in Figure 15.
The CSP description is as follows:
$serpentEncryptSeg :: \langle Int32 \rangle \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle$
SERPENTESEG = (BROADCAST3([0..7]) ‖ (PRD([0..6])))  (SVFOLDL(SERPENTFOLD) ‖
VZIPWITH4(EXOR)4 S7 ‖ VZIPWITH4(EXOR))
Figure 16: The process SERPENTESEG, third partially pipelined design
Third Design Based on the above suggested implementations, this design
composes both a pipelined part and a stream-based part to build the final de-
sired Serpent network. This implementation is shown in Figure 16 and done as
follows:
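$serpentEncryptSeg :: \langle Int32 \rangle \rightarrow \lfloor Int32 \rfloor_{132} \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle \rightarrow \langle \lfloor Int32 \rfloor_{4} \rangle$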
SERPENTESEG = (BROADCAST3([0..7]) ‖ (PRD([0..6])))
 (((PRD(n)  VVFOLDL(SERPENTFOLD)) ‖
SVFOLDL(SERPENTFOLD)) ‖ VZIPWITH4(EXOR)4
S7 ‖ VZIPWITH4(EXOR))
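At the specification level, the split between a pipelined part and a stream-based part rests on the fact that a left fold over a list can be cut at any point: folding the first n rounds and then folding the rest from the intermediate result gives the same value as a single fold. The following Haskell sketch illustrates this property only; the name partialPipeline and its parameters are introduced here and do not model the timing or the communication of the CSP network.

```haskell
-- | Fold the first n rounds as one stage (standing in for the pipelined
-- part) and the remaining rounds as a second stage (standing in for the
-- streamed part). The result equals foldl step input rounds for any n.
partialPipeline :: Int -> (s -> r -> s) -> s -> [r] -> s
partialPipeline n step input rounds = foldl step afterPipeline streamed
  where
    (pipelined, streamed) = splitAt n rounds
    afterPipeline         = foldl step input pipelined
```

In this reading, n corresponds to the number of replicated folds, so the choice of n trades hardware area against the number of rounds that must be executed sequentially without changing the computed result.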
4.3 Reconfigurable Hardware Implementations
The part of the hardware implementation included in this section aims
to show samples of the implemented code. We put some emphasis on the
code segments that could not be derived from the previous stage in a
straightforward manner. Remember that the main reason behind the
coding difficulties faced resides in the level of generality of the constructs
to be implemented.
The following macros are for the two designs of key scheduling. The first
macro KeySchedule1st outputs the subkeys as a vector of vectors of vectors of
items from the macros GenerateWsVOVOV and S3.
macro proc KeySchedule1st
(keyIn, KssOutVOVOV, lastksV){
.
.
.
    par{
        Segs(keyIn, segmentsOut);
        GenerateWsVOVOV
            (segmentsOut, WsOutVOVOV, lastwsV);
        VMap
            (WsOutVOVOV, 4, KssOutVOVOV, VMapWithSs);
        S3(lastwsV, lastksV);}}
The macro for the second design with its stream implementation is as follows:
macro proc KeySchedule2nd
(keyIn, KssOutSOVOV, lastksV) {
.
.
.
par{
Segs(keyIn, segmentsOut);
GenerateWsSOVOV
(segmentsOut, WsOutSOVOV, lastwsV);
Map(WsOutSOVOV, KssOutSOVOV, VMapWithSs);
S3(lastwsV, lastksV);}}
The stream-version macro implementing the process GenerateWs is shown
in the following code section. In this macro, an array ws of 140 32-bit Integer
elements is used to store the generated prekeys. This occupies a large area
of the targeted FPGA. An alternative implementation is to use the available
internal RAM, which dramatically reduces the needed space. However, a RAM
allows only one access at a time (a read or a write), which imposes some
restrictions. For instance, the final calculated prekeys should be produced as
a stream of items instead of a stream of vectors of vectors of items. Both
cases are shown in the following code sections:
macro proc GenerateWsSOVOV
(wsIn, wsOutSOVOV, lastwss) {
.
.
.
Int32 ws[140];
par(j = 0; j < 8; j++){
jTemp[j] = 0@j;
wsIn.elements[j].channel ?
ws[jTemp[j]];}
PHI = 0x9e3779b9;
for(i = 8; i < 140; i++){
iTemp = 0@i;
wTemp = ws[i-3]^ws[i-5]^
ws[i-8]^ws[i-1]^
PHI^(iTemp-8);
par{
ProduceItem(wItem, wTemp);
Shift(wItem, 11, sOut);
StoreItem(sOut, ws[i]);}
if (i == 139){
break;}}
ProduceSOVOVOItemsFromArrayWithOffset
(wsOutSOVOV, 4, 8, 4, ws, 8);
par{
lastwss.elements[0].channel ! ws[136];
lastwss.elements[1].channel ! ws[137];
lastwss.elements[2].channel ! ws[138];
lastwss.elements[3].channel ! ws[139];}}
The second version is as follows:
macro proc GenerateWsRam
(wsIn, wssOut) {
.
.
.
ram Int32 ws[140];
par(j = 0; j < 8; j++){
jTemp[j] = 0@j;
wsIn.elements[j].channel ?
ws[jTemp[j]];}
PHI = 0x9e3779b9;
for(i = 8; i < 140; i++){
iTemp = 0@i;
wTemp = ws[i-1]^PHI^(iTemp-8);
wTemp1 = wTemp ^ ws[i-8];
wTemp2= wTemp1^ws[i-5];
wTemp3 = wTemp2^ws[i-3];
par{
ProduceItem(wItem, wTemp3);
Shift(wItem, 11, sOut);
StoreItem(sOut, ws[i]);}
if (i == 139){
break;}}
ProduceStreamOfItems(wssOut, 140, ws);}
The use of an FPGA’s on-chip memory is constrained by its supported
memory capabilities and the corresponding Handel-C compilation options. The
sophisticated SelectRAM memory hierarchy available on the used
Virtex-E FPGA supports True Dual-Port BlockRAMs and Distributed RAMs.
However, a Handel-C declaration of an array is equivalent to declaring a number
of variables. Each entry in an array may be used exactly like an individual
variable, with as many reads, and as many writes to different elements of the
array, as required within a clock cycle. Arrays are therefore more efficient in
terms of the concurrent access required by fast pleasantly parallel designs.
Arrays are implemented using the available logic blocks in an FPGA (Slices in
the case of Xilinx devices). RAMs are normally more efficient than arrays in
terms of hardware resources, since they use the on-chip RAM blocks. However, a
RAM allows only one location to be accessed in any one clock cycle.
To take advantage of the available multi-port memory blocks, one can
use the mpram declaration in Handel-C instead of ram. A design that uses
an mpram with two ports would outperform the sequential design in terms of
speed, but replication of some processes would still be necessary to cope with
the doubled amount of information retrieved. A design that uses a dual-ported
memory to store a list should refine the list as a stream of vectors of two
elements in the description stage.
Before we present parts of the realization of the encryption designs, we note
the solution we suggest for implementing the higher-order process VMAPWITH
with a list of different processes. The macro VMapWith needs to map a list of
macros to a list of items. The problem we faced is how to pass a list of
macros as an argument to the macro VMapWith. The best-case scenario would be
the following code implementation:
macro proc VMapWith
(vIn, vProcesses, vOut, n){
    par(i = 0; i < n; i++){
        vProcesses[i]
            (vIn.elements[i],
             vOut.elements[i]);}}
Passing the vector of macros vProcesses to the macro VMapWith is not
supported in the current version of Handel-C. A second possible form of
implementation in Handel-C is as follows:
macro proc VMapWith
(vIn, P1, P2,..., Pn, vOut, n){
    par{
        P1(vIn.elements[0], vOut.elements[0]);
        P2(vIn.elements[1], vOut.elements[1]);
        .
        .
        .
        Pn(vIn.elements[n-1], vOut.elements[n-1]);}}
A step forward in the code generation leads to the third possible form of
implementation. This form would fit the calling of the process VMAPWITH
from another higher-order macro as had been done in:
VMap(WsOutVOVOV, 4,
KssOutVOVOV, VMapWithSs);
This suggests removing the zipped-with macro names from the argument
list in the macro procedure definition as follows:
macro proc VMapWithPs(vIn, vOut, n){
    par{
        P1(vIn.elements[0], vOut.elements[0]);
        P2(vIn.elements[1], vOut.elements[1]);
        .
        .
        .
        Pn(vIn.elements[n-1], vOut.elements[n-1]);}}
A possible solution to such a limitation is, again, the availability of a
preprocessor that automatically generates the allowed implementation from the
best-case scenario presented. For the case of mapping with the list of S-box
macros, the code is as follows:
macro proc VMapWithSs(vIn, vOut){
par{
S3(vIn.elements[0], vOut.elements[0]);
S2(vIn.elements[1], vOut.elements[1]);
S1(vIn.elements[2], vOut.elements[2]);
S0(vIn.elements[3], vOut.elements[3]);
S7(vIn.elements[4], vOut.elements[4]);
S6(vIn.elements[5], vOut.elements[5]);
S5(vIn.elements[6], vOut.elements[6]);
S4(vIn.elements[7], vOut.elements[7]);}}
For the encryption part, we include the implementation of the third
design, in which a combination of parallel and sequential folds is employed
with a vector of items as input. Based on the CSP implementation, the macro
EncryptSegsVVandSV is implemented as follows:
macro proc EncryptSegsVVandSV
(input, sKeysVOV, VRnds,
sKeysSOV, finalKeys, output) {
par{
VVFoldL(sKeysVOV, output1, 4,
NParalRnds, SerpentFold, input);
SVFoldL(sKeysSOV, 4, output2,
SerpentFold, output1, 4, NParalRnds);
VZipWith(4, output2,
finalKeys.elements[0], output3, EXOR);
S7(output3, output4);
VZipWith(4, output4,
finalKeys.elements[1], output, EXOR);}}
The macro SerpentFold implements its corresponding process as follows:
macro proc SerpentFold
(input, i, sKeys, output) {
VectorOfItems (vOut, 4, Int32);
VectorOfItems (output1, 4, Int32);
par{
par{
VZipWith(input, sKeys, vOut, EXOR);}
if(i==0)
S0(vOut, output1);
else if(i==1)
S1(vOut, output1);
else if(i==2)
S2(vOut, output1);
else if(i==3)
S3(vOut, output1);
else if(i==4)
S4(vOut, output1);
else if(i==5)
S5(vOut, output1);
else if(i==6)
S6(vOut, output1);
else if(i==7)
S7(vOut, output1);
LinearTransformation
(output1, output);}}
5 General Evaluation
The contribution of the presented work can be found in many
aspects. Some additions were crucial to the realization step of the method so
that it can cope with real-life complex areas of application. A well-known
algorithm from cryptography has been targeted as a test case that gives a clear
idea about the practical use of the methodology. Reusable libraries are created
at all levels of development. The availability of such libraries supports and
facilitates the development in general. The libraries created for the different
studies from cryptography are highly reusable for developing other
cryptographic algorithms. This might include introducing new components to the
libraries, or slightly modifying the available ones. According to these points,
we stress the following aspects:
The development originates from a specification stage, whose key
feature is its powerful higher level of abstraction. During the specification,
the isolation from parallel hardware implementation issues allowed for deep
concentration on the specification details. For the most part, the style
of specification favors the use of higher-order functions. Two other
inherent advantages of using the functional paradigm are the clarity and
conciseness of the specification. This was reflected throughout all the
presented studies. At this level of development, the correctness of the
specification is ensured by construction from the correct building blocks used.
The implementation of the formalized specification is tested under Haskell by
performing random tests for every level of the specification.
The correctness is carried forward to the next stage of development
by applying the provably correct rules of refinement. The available pool of
formal refinement rules enables a high degree of flexibility in creating
parallel designs. This includes the capacity to divide a problem into
completely independent parts that can be executed simultaneously (pleasantly
parallel). Conversely, in a nearly pleasantly parallel manner, the computations
might require results to be distributed, collected and combined in some way.
Recall that the refinement steps are done by combining off-the-shelf reusable
instances of basic building blocks.
6 Performance Analysis
In this section we show the testing results of mapping the designs, analyze
their timing, and report the speeds as measured on the RC-1000 board
from the host P4 machine. The fully-pipelined design was over-mapped; thus,
the speeds presented below are for the remaining designs. Note that in the
suggested Serpent implementations, the finest grains of basic building blocks are
refined as processes rather than using Handel-C operators. Thus, an increase
in communication cost between processes is found.
In Table 1, we show the testing results of the encryption subkeys genera-
tion. The keys generation (second design) runs with a throughput of 96 Mbps
occupying 13097 Slices, i.e. 68% of the FPGA area.
Table 1: Testing results of Serpent encryption subkeys generation
Table 2 includes the testing results of the second and third Serpent
designs; the first design failed to compile because of its large gate
count. The maximum achieved parallelism was in running the third design with
2 parallel folds and a third fold performing the remaining 29 sequential rounds.
This implementation has a throughput of 12.21 Mbps, occupying an area of 19198
Slices (99% of the available FPGA area). The second design, with its sequential
single-fold implementation, achieved a throughput of 12.15 Mbps with an area of
12291 Slices.
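One simple way to relate the two reported figures is throughput per unit area. The following Haskell sketch computes this ratio from the numbers quoted above; the metric itself and the names designs and kbpsPerSlice are introduced here for illustration and are not part of the reported evaluation.

```haskell
-- Figures quoted in the text: (design, throughput in Mbps, area in Slices).
designs :: [(String, Double, Double)]
designs =
  [ ("second design (single sequential fold)", 12.15, 12291)
  , ("third design (2 parallel folds + 1 sequential)", 12.21, 19198)
  ]

-- | Throughput per unit area, in Kbps per Slice (illustrative metric).
kbpsPerSlice :: (String, Double, Double) -> Double
kbpsPerSlice (_, mbps, slices) = mbps * 1000 / slices
```

Under this metric the second design delivers more throughput per Slice than the third, since the additional parallel folds increase the area considerably while the measured throughput changes only slightly.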
In Table 3, we include some results from the literature mapping the same
algorithm onto FPGAs. From the performance perspective, the optimised
implementations reported there achieve clearly higher speeds than our
high-level implementation, which is not yet optimised. The shown results include
a high-speed implementation of the Serpent (333 Mbps) presented by Elbirt
et al. [42]. Gaj et al. in [45] presented another high-speed implementation of
the Serpent (431.4 Mbps).
In Table 4 [46, 47, 48] we compare the number of cycles for different hardware
implementations of the Serpent including a number of microprocessor-based
implementations.
The higher-level development caused high replication in the use of basic
building blocks and, more noticeably, in their communications. Many instances
of the PRODUCE and STORE processes caused a high use of intermediate variables.
Other processes were used for structuring data in the format corresponding to
their functional definitions, for instance, to collect some vectors of subkeys
and produce them as a vector of vectors of vectors of items. Such use also plays
a big role in occupying a larger silicon area after realization.
If an algorithm were implemented without our proposed method, the whole design
could be written with a small number of macros and minimal use of
communication. Moreover, hand-made enhancements could be introduced with the
aid of shared variables. This would reduce the cost of implementing
communicating parallel processes and might lead to a more economical, less
congested design with a higher frequency. This cost is the price paid for the
step-wise development.
Table 2: Testing results of Serpent encryption
Table 3: Comparisons among similar FPGA systems implementing optimized
Serpent
Table 4: Comparisons among different hardware systems, with respect to num-
ber of clock cycles, implementing the Serpent
7 Conclusion
Mapping parallel versions of algorithms onto hardware can greatly improve
computational efficiency. Recent advances in reconfigurable computing have come
in the form of FPGAs and high-level HDLs such as Handel-C. In this paper, we
build on these advances by presenting, demonstrating, and examining a
high-level hardware development method. The method starts from a functional
specification of an algorithm that does not define parallelism. An efficient
parallel implementation is then derived in the form of a CSP network of
processes, from which we create different parallel implementations in Handel-C.
The presented work covers both the theory and the practice of the suggested
methodology. As a case study from applied cryptography, we examined the Serpent
algorithm, addressing both the block encryption and the key expansion. The
correctness, conciseness, and clarity of the specification are emphasized. The
systematic and flexible refinement of the specification allowed reasoning about
various implementations with different degrees of parallelism for each case.
The described designs range from fully pipelined and partially pipelined to
streamed input and output implementations. The realization in Handel-C is
presented, with emphasis on code segments that tackle the implementation
pitfalls noted. Future work includes extending the theoretical pool of
refinement rules, investigating the automation of the development process, and
optimizing the realization for more economical implementations with higher
throughput.
Acknowledgment
I would like to thank Dr. Ali Abdallah, Prof. Mark Josephs, Prof. Wayne Luk,
Dr. Sylvia Jennings, and Dr. John Hawkins for their insightful comments on
the research which is partly presented in this paper.
References
[1] Xilinx, Information available from, http://www.xilinx.com (2007).
[2] Altera, Information available from, http://www.Altera.com (2007).
[3] Celoxica, Information available from, http://www.celoxica.com (2007).
[4] S. Stepney, CSP/FDR2 to Handel-C translation, Tech. Rep. YCS-2002-357,
Department of Computer Science, University of York (June 2003).
[5] D. Edwards, S. Harris, J. Forge, High performance hardware from java,
Xilinx Whitepaper http://www.xilinx.com (2007).
[6] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, J. Stockwood,
Hardware-software codesign of embedded reconfigurable architectures, in:
Proceedings of the 37th Design Automation Conference, Los Angeles - USA,
2000.
[7] N. Technology, Information available from, http://www.nimble.com (2007).
[8] S. Network, Information available from, http://www.systemc.org (2007).
[9] G. Michaelson, N. Scaife, P. Bristow, P. King, Nested algorithmic skeletons
from higher order functions (2000).
[10] A. E. Abdallah, Functional process modelling, in: Research Directions in
Parallel Functional Programming, Springer Verlag, 1999, pp. 339–360.
[11] A. E. Abdallah, Derivation of Parallel Algorithms: From Functional Speci-
fications to csp Processes, in: B. Moller (Ed.), Proceedings of Mathematics
of Program Construction, Vol. 947 of Lecture Notes in Computer Science,
Springer-Verlag, 1994, pp. 67–96.
[12] A. E. Abdallah, J. Hawkins, Calculational Design of Special Purpose Par-
allel Algorithms, in: Proceedings of 7th IEEE International Conference on
Electronics, Circuits and Systems (IEEE/ICECS), IEEE Computer Society
Press, 2000, pp. 261–267.
[13] A. E. Abdallah, J. Hawkins, Formal Behavioural Synthesis of handel-c Par-
allel Hardware Implementation for Functional Specifications, in: Proceed-
ings of the 36th Annual Hawaii International Conference on System Sci-
ences, IEEE Computer Society Press, 2003, pp. 278–288.
[14] C. A. R. Hoare, Communicating Sequential Processes, Prentice-Hall, 1985.
[15] A. E. Abdallah, Synthesis of massively pipelined algorithms for list
manipulation, in: L. Bouge, P. Fraigniaud, A. Mignotte, Y. Robert (Eds.),
Proceedings of the European Conference on Parallel Processing, EuroPar’96,
LNCS 1024, Springer Verlag, 1996, pp. 911–920.
[16] J. Hawkins, A. Abdallah, Synthesis of a highly parallel JPEG decoder
implementation from its functional specification, in: Proceeding of IFIP
Working Conference on Distributed and Parallel Embedded Systems,
Kluwer, 2004.
[17] A. E. Abdallah, G. Simiakakis, T. Theoharis, Formal Development of
a Reconfigurable Tool for Parallel dna Matching, in: Proceedings of
7th IEEE International Conference on Electronics, Circuits and Systems
(IEEE/ICECS), IEEE Computer Society Press, 2000, pp. 268–272.
[18] I. Damaj, Higher-level hardware synthesis of the kasumi cryptographic al-
gorithm, Journal of Computer Science and Technology 22 (1) (2007) 60–70.
[19] I. Damaj, Parallel algorithms development for programmable logic devices,
Advances in Engineering Software 37 (9) (2006) 561–582.
[20] S. Thompson, Haskell: The Craft of Functional Programming, 2nd Edition,
Addison-Wesley, 1999.
[21] D. J. Russel, Fad: A functional analysis and design methodology, Ph.D.
thesis, The University of Kent at Canterbury, United Kingdom (August
2000).
[22] I. Ltd., OCCAM 2 reference manual, Prentice-Hall International (1988).
[23] J. Peng, S. Abdi, D. Gajski, Automatic model refinement for fast
architecture exploration, in: the Asia-Pacific Design Automation Conference,
2002, pp. 332–337.
[24] J. Bowen, M. Fränzle, E. Olderog, A. Ravn, Developing correct systems,
in: Proc. 5th Euromicro Workshop on Real-Time Systems, IEEE Computer
Society Press, 1993, pp. 176–187.
[25] J. Bowen, C. A. R. Hoare, H. Langmaack, E. Olderog, A. Ravn, A ProCoS
II project final report: ESPRIT basic research project 7071, in: Bulletin
of the European Association for Theoretical Computer Science (EATCS),
1996, pp. 59:76–99.
[26] S. Abdi, D. Gajski, Provably correct architecture refinement, Technical
Report CECS0329, Center for Embedded Computer Systems at University
of California Irvine, Irvine-USA (September 2003).
[27] K. Claessen, Embedded languages for describing and verifying hardware,
Ph.D. thesis, Chalmers University of Technology and Göteborg University,
Sweden (April 2001).
[28] J. Launchbury, J. Lewis, B. Cook, On embedding a microarchitectural de-
sign language within haskell, in: Proceedings of the fourth ACM SIGPLAN
international conference on Functional programming, ACM Press, 1999, pp.
60–69.
[29] J. Matthews, J. Launchbury, B. Cook, Specifying microprocessors in hawk,
in: Proceedings of the International Conference on Computer Languages,
IEEE, 1998, pp. 90–101.
[30] J. O’Donnell, Hydra: hardware description in a functional language using
recursion equations and high order combining forms, in: G. J. Milne (Ed.),
The Fusion of Hardware Design and Verification, North-Holland, Amster-
dam, 1988, pp. 309–328.
[31] Y. Li, M. Leeser, HML: An innovative hardware design language and its
translation to VHDL, in: Conference on Hardware Design Languages, 1995.
[32] D. Barton, Advanced modeling features of MHDL, in: International
Conference on Electronic Hardware Description Languages, 1995.
[33] S. Johnson, B. Bose, DDD: A system for mechanized digital design deriva-
tion, Tech. Rep. 323, Indiana University, Indiana (1990).
[34] R. Sharp, Higher-level hardware synthesis, Ph.D. thesis, Robinson College
University of Cambridge, Cambridge (November 2002).
[35] M. Sheeran, muFP: a language for VLSI design, in: Proc. ACM Symposium
on LISP and Functional Programming, ACM Press, 1984, pp. 104–112.
[36] G. Jones, M. Sheeran, Circuit design in ruby, In Formal Methods for VLSI
design (1990) 13–70.
[37] T. Cheung, G. Hellestrand, Multi-level equivalence in design transforma-
tion, in: Proceedings of International Conference on Computer Hardware
Description Languages, Chiba Japan, 1996, pp. 559–566.
[38] I. Page, W. Luk, Compiling Occam into field-programmable gate arrays,
in: W. Moore, W. Luk (Eds.), FPGAs, Oxford Workshop on Field Pro-
grammable Logic and Applications, Abingdon EE&CS Books, 15 Harcourt
Way, Abingdon OX14 1NV, UK, 1991, pp. 271–283.
[39] H. Jifeng, I. Page, J. Bowen, Towards a provably correct hardware im-
plementation of Occam, in: G. Milne, L. Pierre (Eds.), Correct Hardware
Design and Verification Methods (CHARME’93), Vol. 683 of Lecture Notes
in Computer Science, Springer-Verlag, 1993, pp. 214–225.
[40] C. T. Library, CSP/FDR2 to Handel-C translation,
http://www.celoxica.com/techlib/files/CEL-W0309221A18-133.htm.
[41] R. Anderson, E. Biham, L. Knudsen, Serpent: A proposal for the advanced
encryption standard, in: Proceedings of the First Advanced Encryption
Standard (AES) Conference, Ventura - CA, 1998.
[42] A. Elbirt, C. Paar, An FPGA implementation and performance evaluation
of the Serpent block cipher, in: Proceedings of the 2000 ACM/SIGDA
eighth international symposium on Field programmable gate arrays, ACM
Press, New York - USA, 2000, pp. 33 – 40.
[43] P. Bora, T. Czajka, Implementation of the SERPENT algorithm using
ALTERA FPGA devices, Public Comments on AES Candidate Algorithms
- Round 2 (October 2000).
[44] A. Yip, W. Chetwynd, B. Paar, An FPGA-based performance evaluation
of the AES block cipher candidate algorithm finalists, IEEE Transactions
on Very Large Scale Integration (VLSI) Systems 9 (4) (2001) 545–557.
[45] K. Gaj, P. Chodowiec, Fast implementation and fair comparison of the final
candidates for advanced encryption standard using field programmable gate
arrays, Lecture Notes in Computer Science 2020 (2001) 84–100.
[46] B. Gladman, Implementation experience with aes candidate algorithms, in:
Proceedings of the Second AES Candidate Conference, 1999.
[47] V. Journot, Evaluation of serpent, one of the aes finalists on 8-bit micro-
controllers, in: Proceedings of the Third AES Candidate Conference, 2000.
[48] R. Anderson, E. Biham, L. Knudsen, Information available from,
http://csrc.nist.gov/encryption/aes.
