Parallel algorithms development for programmable logic devices by Damaj, Issam
The formal publication is available at https://doi.org/10.1016/j.advengsoft.2006.01.009
Parallel Algorithms Development for Programmable
Logic Devices
Issam W. Damaj
Electrical and Computer Engineering Department, Hariri Canadian Academy of Sciences
and Technology, Meshref P.O.Box: 10 Damour- Chouf 2010 Lebanon,
damajiw@hariricanadian.edu.lb
Abstract
Programmable Logic Devices (PLDs) continue to grow in size and currently
contain several millions of gates. At the same time, research effort is going
into higher-level hardware synthesis methodologies for reconfigurable comput-
ing that can exploit PLD technology. In this paper, we explore the effectiveness
and extend one such formal methodology in the design of massively parallel
algorithms. We take a step-wise refinement approach to the development of
correct reconfigurable hardware circuits from formal specifications. A func-
tional programming notation is used for specifying algorithms and for reasoning
about them. The specifications are realised through the use of a combination of
function decomposition strategies, data refinement techniques, and off-the-shelf
refinements based upon higher-order functions. The off-the-shelf refinements
are inspired by the operators of Communicating Sequential Processes (CSP)
and map easily to programs in Handel-C (a hardware description language).
The Handel-C descriptions are directly compiled into reconfigurable hardware.
The practical realisation of this methodology is evidenced by a case study of
the matrix multiplication algorithm, chosen because it is relatively simple and
well known. In this paper, we obtain several hardware implementations with
different performance characteristics by applying different refinements to the
algorithm. The developed designs are compiled and tested under Celoxica’s
RC-1000 reconfigurable computer with its 2-million-gate Virtex-E FPGA.
Performance analysis and evaluation of these implementations are included.
Preprint submitted to Elsevier, April 15, 2019. arXiv:1904.05980v1 [cs.PL], 1 Apr 2019.
1. Introduction
The rapid progress and advancement in electronic chip technology provides
a variety of new implementation options for system engineers. The choice varies
between the flexible programs running on a general purpose processor (GPP)
and the fixed hardware implementation using an application specific integrated
circuit (ASIC). Many other implementation options exist, for instance, a system
with a RISC processor and a DSP core. Other options include graphics
processors and microcontrollers. Specialist processors certainly improve
performance over general-purpose ones, but this comes at the cost of flexibility.
Combining the flexibility of GPPs and the high performance of ASICs leads to
the introduction of reconfigurable computing (RC) as a new implementation
option with a balance between versatility and speed.
Generally, reconfigurable computing is computer processing with highly flex-
ible computing fabrics. The principal difference when compared to using ordi-
nary microprocessors is the ability to make substantial changes to the data path
itself in addition to the control flow. In the last decade, there was a renaissance
in the area of reconfigurable computing research, with many reconfigurable
architectures developed in both industry and academia, such as Matrix,
Garp, RAW, DPGA, RaPiD, PRISM, Pleiades, and Morphosys [1]. Such designs
were feasible due to the relentless progress of silicon technology that allowed
complex designs to be implemented on a single chip.
Field Programmable Gate Arrays (FPGAs), which are nowadays important
components of RC systems, have shown a dramatic increase in their density over
the last few years. For example, companies like Xilinx [2] and Altera [3] have
enabled the production of FPGAs with several millions of gates, such as the
Virtex-II Pro and Stratix-II FPGAs. The versatility of FPGAs has opened up
completely new avenues in high-performance computing. These reconfigurable
digital electronic hardware circuits can be combined with high-level software
and design methodologies to form a powerful paradigm for computing.
The traditional implementation of a function on an FPGA is done using
logic synthesis based on VHDL, Verilog or a similar HDL (hardware description
language). These discrete event simulation languages are rather different
from languages such as C, C++ or Java. Many FPGA implementation tools
are primarily HDL-based and not well integrated with high-level software tools.
Furthermore, HDL-based IP (intellectual property) cores are expensive
and have complex licensing schemes [4]. These obstacles have hindered the
adoption of FPGAs as the main platform solution for hardware
engineers. An interesting step towards more success in hardware compilation
is to grant a higher level of abstraction from the point of view of the programmer.
Designer productivity can be improved and time-to-market can be reduced by
making hardware design more like programming in a high-level language.
Recently, vendors have introduced tools based on high-level languages,
such as Handel-C [5, 6, 7, 8], Forge [9], Nimble [10, 11], SystemC [12] and Viva
[13] (an object-oriented graphical development environment for programming
FPGAs).
With the availability of powerful high-level tools accompanying the emergence
of multi-million-gate FPGA chips, more emphasis should be placed on affording
an even higher level of abstraction in programming reconfigurable hardware.
Building on these research motivations, in the work in hand, we extend and
examine a methodology whose main objective is to allow for a higher-level correct
synthesis of massively parallel algorithms and to map (compile) them onto
reconfigurable hardware. Our main concern is with behavioural refinement, in
particular the derivation of parallel algorithms. The presented methodology
systematically transforms functional specifications of algorithms into parallel
hardware implementations. It builds on the work of Abdallah and Hawkins
[14, 15, 16, 17], extending their treatment of data and process refinement.
This paper is divided so that the following section introduces the adopted
development methodology. Section 3 presents the theoretical background. In
Section 4, we put some emphasis on the approach to develop different imple-
mentations of the matrix multiplication algorithm. The following section details
the development steps. Section 7 demonstrates selected implementations. In
Section 8, we analyze and evaluate the performance of the suggested implemen-
tations. Finally, Section 10 concludes the paper.
2. The Development Method
Although compilers can expose parallelism through data flow analysis [18],
imperative languages are perhaps not ideal as a starting point. This is because
imperative programs already incorporate design decisions (concerning control
flows and data structures), preconditions (that can be assumed), post-conditions
(that must be achieved), and invariants (that must be maintained). The direct
manipulation of state makes it difficult both to prove that any two pieces of code
are equivalent and to perform substitutions or to modify and rewrite the algorithm.
Functional languages [19], such as Haskell [20], however, do not manipulate
state directly, and as such gain the property of referential transparency. Any
sub-expression of an algorithm can be substituted for any other that is provably
equivalent. This is aided by an effective set of laws given to us by such reasoning
frameworks as the Bird-Meertens Formalism (BMF) [21], along with a wealth of
other work in the functional programming and parallel processing fields [22, 23,
24, 25, 26, 27].
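As a small illustration of this kind of substitution, the sketch below applies the well-known map-fusion law, map f . map g = map (f . g), to replace two list traversals with one. The names twoPass and onePass are hypothetical and are introduced only for this example.

```haskell
-- Equational reasoning sketch: the map-fusion law
--   map f . map g = map (f . g)
-- twoPass and onePass are hypothetical names used for illustration only.

-- Two traversals of the list: first double every element, then add one.
twoPass :: [Int] -> [Int]
twoPass = map (+ 1) . map (* 2)

-- One traversal, obtained by rewriting twoPass with the fusion law.
onePass :: [Int] -> [Int]
onePass = map ((+ 1) . (* 2))
```

Because neither definition manipulates state, the two can be exchanged freely wherever either one appears.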
Although many hardware development methods, such as Viva [13], still use
powerful data flow analysis, the attractions of the functional paradigm
have motivated many researchers. This has triggered many investigations in this
area, such as Lava [28], Hawk [29, 30], Hydra [31], HML [32], MHDL [33], the DDD
system [34], SAFL [35], MuFP [36], Ruby [37], and Form [38].
The suggested development model adopts the transformational programming
approach for deriving massively parallel algorithms from functional specifica-
tions (See Figure 1). The functional notation is used for specifying algorithms
and for reasoning about them. This is usually done by carefully combining a
small number of higher-order functions that serve as the basic building blocks
for writing high-level programs. The systematic methods for massive
parallelisation of algorithms work by carefully composing an "off-the-shelf" massively
parallel implementation of each of the building blocks involved in the algorithm.
The underlying parallelisation techniques are based on both pipelining and data
parallelism.
[Figure 1 diagram. Stages: High-Level Functional Specification; Network of Communicating CSP Processes; Reconfigurable Hardware. Steps and supporting elements: Transformational Derivation, Refinement to Processes, Functional Calculus, Strategies for Parallelism, CSP Algebraic Laws, Handel-C, Developed Libraries, Automated Compilation, Place and Route Tools.]
Figure 1: An overview of the transformational derivation and the hardware realisation processes.
Higher-order functions, such as map, filter, foldl, and foldr, provide a high
degree of abstraction in functional programs [20]. Not only do they allow clear
and succinct specifications for a large class of algorithms, but they are also
ideal starting points for generating efficient implementations by a process of
mathematical calculation using BMF. Over the past decade, there have been
attempts to apply BMF for generating data parallel programs from abstract
specifications using the skeleton approach [24, 15]. The main attraction of this
approach is the potential for increasing reusability of parallel programs without
sacrificing too much performance. The essence of this approach is to design a
generic solution once, and to use instances of the design many times for various
applications. Accordingly, this approach allows portability by implementing the
design on different parallel architectures.
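To make the notion of specifying with higher-order functions concrete, the following sketch builds three small specifications purely from map, filter, foldl, and foldr. The function names are hypothetical examples and do not come from the paper.

```haskell
-- Specifications assembled only from standard higher-order functions.
-- sumOfSquares, total and evenSquares are hypothetical examples.

-- Square every element, then combine the results with foldr.
sumOfSquares :: [Int] -> Int
sumOfSquares = foldr (+) 0 . map (^ (2 :: Int))

-- Left-to-right accumulation with foldl.
total :: [Int] -> Int
total = foldl (+) 0

-- Keep the even elements, then square them.
evenSquares :: [Int] -> [Int]
evenSquares = map (^ (2 :: Int)) . filter even
```

Each definition is an instance of the same few building blocks, which is what allows a refinement designed once for a building block to be reused across many specifications.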
In order to develop generic solutions for general parallel architectures, it is
necessary to formulate the design within a concurrency framework such as CSP
[15, 8]. Often parallel functional programs show peculiar behaviours which
are only understandable in the terms of concurrency rather than relying on
hidden implementation details. The formalisation in CSP (of the parallel be-
haviour) leads to better understanding and allows for analysis of performance
issues. The establishment of refinement concepts between functional and con-
current behaviours may allow systematic generation of parallel implementations
for various architectures. This gives the ability to exploit well-established func-
tional programming (FP) paradigms and transformation techniques in order to
develop efficient CSP processes. These systematic refinement rules refine the
specification to what we call the CSP implementation stage, where parallelism
is described using Hoare’s CSP [39]. Again, this allows issues of immense
practical importance to be addressed, such as careful reasoning about data
distribution, network topology, and locality of communications.
The previous stages of development require a back-end stage for realising
the developed designs. We note at this point that the Handel-C language relies
on the parallel constructs of CSP to model concurrent hardware resources.
Most algorithms described with CSP can be implemented in Handel-C.
Accordingly, this language is suggested as the final reconfigurable hardware
realisation stage in the proposed methodology. For the desired
hardware realisation, Handel-C enables integration with VHDL and EDIF
(Electronic Design Interchange Format) and thus with various synthesis and
place-and-route tools.
3. Background
Abdallah and Hawkins defined in [16] some constructs used in the development
model. Their investigation looked in some depth at data refinement,
which is the means of expressing structures in the specification as
communication behaviour in the implementation.
3.1. Data Refinement
In the following, we present some datatypes used for refinement: the
stream, the vector, and combined forms.
The stream is a purely sequential method of communicating a group of
values. It comprises a sequence of messages on a channel, with each message
representing a value. Values are communicated one after the other. Assuming
the stream is finite, after the last value has been communicated, the end of
transmission (EOT) on a different channel will be signaled. Given some type
A, a stream containing values of type A is denoted as 〈A〉.
Each item to be communicated by the vector will be dealt with independently
in parallel. A vector refinement of a simple list of items will communicate the
entire structure in a single step. Given some type A, a vector of length n,
containing values of type A, is denoted as ⌊A⌋_n.
Whenever dealing with multi-dimensional data structures, for example, lists
of lists, implementation options arise from differing compositions of our
primitive data refinements, streams and vectors. Examples of the combined forms
are the Stream of Streams, Stream of Vectors, Vector of Streams, and Vector
of Vectors. These forms are denoted by 〈S1, S2, ..., Sn〉, 〈V1, V2, ..., Vn〉,
⌊S1, S2, ..., Sn⌋, and ⌊V1, V2, ..., Vn⌋, respectively.
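The two primitive refinements can be pictured with a small functional model. This is only an illustrative sketch: the names Msg, toStream, fromStream and toVector are assumptions of this example, and the model merges the EOT signal into the same sequence as the values, whereas the refinement described above signals EOT on a separate channel.

```haskell
-- Illustrative model of the stream and vector refinements.
-- Msg, toStream, fromStream and toVector are hypothetical names.

-- A stream message is either a value or the end-of-transmission marker.
data Msg a = Val a | EOT
  deriving (Eq, Show)

-- Stream refinement: values are sent one after the other, then EOT.
toStream :: [a] -> [Msg a]
toStream xs = map Val xs ++ [EOT]

-- Recover the list by reading values until EOT (or the end) is reached.
fromStream :: [Msg a] -> [a]
fromStream (Val x : rest) = x : fromStream rest
fromStream _              = []

-- Vector refinement: exactly n items, one per parallel channel.
-- Returns Nothing when the list does not have exactly n items.
toVector :: Int -> [a] -> Maybe [a]
toVector n xs
  | length xs == n = Just xs
  | otherwise      = Nothing
```

In this model a Stream of Vectors would correspond to a stream whose values are themselves fixed-length lists, and similarly for the other combined forms.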
3.2. Process Refinement
The refinement of the formally specified functions to processes is the key
step towards understanding possible parallel behaviour of an implementation.
In this section, the interest is in presenting refinements of a subset of functions,
some of which are higher-order. A larger set of refined functions is
discussed in [15].
Generally, these highly reusable building blocks can be refined to CSP in
different ways. This depends on the setting in which these functions are used
(i.e. with streams, vectors, etc.), and leads to implementations with different
degrees of parallelism. Note that we do not use CSP in a totally formal way, but
in a way that facilitates the Handel-C coding stage later. Recall for
the following subsections that values are communicated through an elements
channel, while a single bit is communicated through another channel, eotChannel,
to signal the end of transmission (EOT).
3.2.1. Produce
The produce process (PRD) is fundamental to process refinement. It is used
to produce values on the channels of a certain communication construct (Item,
Stream, Vector, and so on). These values are to be received and manipulated
by other processes.
Items. For simple, single item types (int, char, bool, etc.), the produce process
is very simple. This is depicted in Figure 2. Here the output is just a single
channel.
The definition in CSP notation is very straightforward:
\[ PRD(Item\ a) = out.element.channel \,!\, a \to SKIP \]
[Figure 2 diagram: the process PRD(x) with a single output channel carrying x.]
Figure 2: The Produce process (PRD) for items
Streams. The produce process for streams is depicted in Figure 3. As already
noted, the output in this case is a pair of two other channels. One channel will
produce the values of the stream, and the other will be a simple channel used
to signal EOT.
In a more general case, the structure of the values which the stream is
carrying is not necessarily known. These may be simple items, but may also be
streams or vectors. Generally, producing a stream could be described as:
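\[ PRD(\langle s \rangle) = \Big( \mathop{;}\limits_{i=1}^{length(s)} PRD(s_i)[out.elements.channel/out] \Big) \,;\; out.eotChannel \,!\, eot \to SKIP \]
[Figure 3 diagram: the process PRD(〈..., x3, x2, x1〉) outputting the values ..., x3, x2, x1 in sequence.]
Figure 3: The Produce process (PRD) for streams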
Vectors. For vectors of size n, n instances of the produce process are composed
in parallel, one for each item in the vector. The output here is an array of
channels. This is depicted in Figure 4. A general definition is given below:
\[ PRD(\lfloor v \rfloor_n) = \mathop{|||}\limits_{i=1}^{n} PRD(v_i)[out.elements_i.channel/out] \]
A process STORE stores a communication construct in a variable. We use
this process to store items, vectors, streams, or combinations. A subscript letter
is used with the processes PRD and STORE to indicate the type of commu-
nication. We sometimes omit this subscript if the communication structure is
clear from context.
[Figure 4 diagram: the process PRD([xn, ..., x2, x1]) with n parallel output channels carrying x1, x2, ..., xn.]
Figure 4: The Produce process (PRD) for vectors
3.2.2. Feeding Processes
The feed operator in CSP models function application. The feed operator is
written ▷. It takes two processes, composes them together in
parallel, and renames both the output of the first and the input of the second
to a new name, which is then hidden. Given the lifted concepts of CSP channel
renaming and hiding, the definition can remain the same regardless of the type
of the communicating construct (Item, Stream, Vector or any combination).
\[ P \triangleright Q = (P[mid/out] \parallel Q[mid/in]) \setminus \{mid\} \]
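A purely functional reading of the feed operator may help fix the idea. In the sketch below a process is modelled as a plain function from its input to its output, so feeding one process into another becomes function composition; this deliberately ignores concurrency, and the names Process, feed and produce are hypothetical.

```haskell
-- Illustrative model: a process is viewed as a function from input to output.
-- Process, feed and produce are hypothetical names for this sketch.
type Process a b = a -> b

-- feed p q connects the output of p to the input of q. The intermediate
-- value plays the role of the hidden channel mid.
feed :: Process a b -> Process b c -> Process a c
feed p q = q . p

-- A producer ignores its (unit) input and emits a fixed value.
produce :: a -> Process () a
produce x = \() -> x
```

Under this reading, produce x `feed` f delivers f x, which is the sense in which feeding models function application.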
3.2.3. Formal Process Refinement
Given the definition of a feed operator that operates on processes, a formal
definition of process refinement could be delivered. Consider a function f , which
takes in values of type A and returns values of type B . Assume that the data
refinement step has already been performed, such that A and B are both types
of some transmission value:
f :: A→ B
Then, consider a potential refinement for f, a process F. The operator ⊑
denotes a process refinement, where the left hand side is a function, and the
right hand side is a process. To state that f is refined to F, or in other words,
the process F is a valid refinement of the function f, the following may be used:
f ⊑ F
The rules of refinement were proven once [15], and in this paper we use them
systematically to refine the functional specification into a network of communi-
cating processes.
3.2.4. MAP the Process Refinement of the Higher-order Function map
Now the attention is turned to the refinement of the widely used higher-order
function map [16] . Employing this function in stream and vector settings is
presented. The refinement for combined structures is to be made in a similar
way.
Streams. A process implementing the functionality of map f in stream terms
should input a stream of values, and output a stream of values with the function
f applied (See Figure 5).
In general, the handling of the EOT channels will be the same. However,
the handling of the value will vary depending on the type of the elements of the
input and output stream.
\[
\begin{array}{l}
SMAP(F) = \mu X \bullet\ in.eotChannel \,?\, eot \to out.eotChannel \,!\, eot \to SKIP \\
\qquad \Box \\
\qquad F[in.elements.channel/in,\ out.elements.channel/out] \,;\, X
\end{array}
\]
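[Figure 5 diagram: the SMAP process takes the input stream ..., x3, x2, x1 with its EOT channel and produces the output stream ..., f x3, f x2, f x1 with its EOT channel.]
Figure 5: The SMAP process for streams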
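The behaviour of SMAP can be mimicked by a recursive function over a sequence of messages. This is a sketch under stated assumptions: the Msg type and the function smap are hypothetical, and EOT is modelled as a marker in the same sequence rather than as a separate channel.

```haskell
-- Illustrative functional model of SMAP; Msg and smap are hypothetical names.
data Msg a = Val a | EOT
  deriving (Eq, Show)

-- On EOT, forward EOT and stop (the SKIP branch).
-- On a value, apply f, emit the result, and recurse (the F ; X branch).
smap :: (a -> b) -> [Msg a] -> [Msg b]
smap _ []             = []
smap _ (EOT : _)      = [EOT]
smap f (Val x : rest) = Val (f x) : smap f rest
```

For instance, smap (* 2) applied to the messages Val 1, Val 2, Val 3, EOT yields Val 2, Val 4, Val 6, EOT.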
Vectors. In functional terms, the functionality of map f in a list setting is mod-
elled by vmap f in the vector setting. Consider F as a valid refinement of the
function f . The implementation of VMAP can then proceed by composing n
instances of F in parallel, and directing an item from the input vector to each
instance for processing (See Figure 6). In CSP we have:
\[ VMAP_n(F) = \mathop{|||}\limits_{i=1}^{n} F[in_i/in,\ out_i/out] \]
[Figure 6 diagram: n instances of F in parallel, the i-th instance connecting in_i to out_i.]
Figure 6: The VMAP process for vectors
3.2.5. ZIPWITH the Process Refinement of the Higher-order Function zipWith
Recall another higher-order function, namely zipWith. This function is used
to zip two lists (taking one element from each list) with a certain operation.
Formally:
zipWith :: (A → B → C) → [A] → [B] → [C]
zipWith (⊕) [x1, x2, ..., xn] [y1, y2, ..., yn] = [x1 ⊕ y1, x2 ⊕ y2, ..., xn ⊕ yn]
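As a brief worked instance of this definition, the sketch below uses the standard Prelude zipWith with addition and multiplication. The names sums, products and innerProduct are hypothetical.

```haskell
-- Worked instances of the Prelude function zipWith.
-- sums, products and innerProduct are hypothetical names.

sums :: [Int]
sums = zipWith (+) [1, 2, 3] [10, 20, 30]     -- [11, 22, 33]

products :: [Int]
products = zipWith (*) [1, 2, 3] [4, 5, 6]    -- [4, 10, 18]

-- Summing the element-wise products gives an inner product.
innerProduct :: [Int] -> [Int] -> Int
innerProduct xs ys = sum (zipWith (*) xs ys)
```

The inner product is the kind of row-by-column combination that a matrix multiplication specification needs for each result element.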
3.2.6. Streams
The process implementation of (zipWith f ) in stream terms should input two
streams of values, and output a stream of values with the function f applied
(See Figure 7).
Again, the handling of the EOT channel will be the same. Nevertheless, the
handling of the value will vary depending on the type of the elements of the
input and output streams.
\[
\begin{array}{l}
SZIPWITH(F) = \mu X \bullet\ in.eotChannel \,?\, eot \to out.eotChannel \,!\, eot \to SKIP \\
\qquad \Box \\
\qquad F[in_1.elements.channel/in_1,\ in_2.elements.channel/in_2,\ out.elements.channel/out] \,;\, X
\end{array}
\]
[Figure 7 diagram: the SZIPWITH(F) process takes the input streams xn ... x2, x1 and yn ... y2, y1 and outputs the stream of zipped results.]
Figure 7: The SZIPWITH process for streams
3.2.7. Vectors
To implement the data parallel version of this higher-order function, we
refine it to a process VZIPWITH that takes two vectors as input and zips the
two lists with a process F; F is a refined process from the function (⊕). This is
depicted in Figure 8.
vzipWith (⊕) :: ⌊A⌋_n → ⌊B⌋_n → ⌊C⌋_n
\[ VZIPWITH(\oplus) = \mathop{|||}\limits_{i=1}^{n} F[out_i/out,\ c_i/in_1,\ d_i/in_2] \]
[Figure 8 diagram: n instances of F in parallel, the i-th instance taking xi and yi as inputs and producing the i-th zipped result.]
Figure 8: The VZIPWITH process for vectors
3.3. Decomposition of Higher-Order Functions for Parallel Programs Derivation
The intrinsic richness and usefulness of higher-order functions can be made
clear by recalling some of the previous work presented in [40]. This work
concentrated on providing systematic decomposition methods for exploiting pipelined
parallelism in instances of the higher-order function foldr. In this section, the
decomposition for the higher-order function map is shown. This decomposition
rule is used in the forthcoming applications.
The following decomposition rule is recalled along with its corresponding
CSP implementation. This rule decomposes a specification of the form (map (h
m)), where h is a function and m is a given list of values. The CSP network
SPEC which refines this specification is shown in Figure 9.
spec :: [A] → [B]
h :: [T] → A → B
m :: [T]
e :: B
f :: T → A → B → B
spec = map (h m)
h [] x = e
h (a : s) x = f a x (h s x)
This could be decomposed to:
spec = (∘)/ [final∗] ++ ((map ∘ f′) ∗ m) ++ [initial∗]
f′ a 〈x, y〉 = 〈x, f a x y〉
initial x = 〈x, e〉
final 〈x, y〉 = y
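The pipelined network of CSP processes SPEC, which refines the functional
specification spec, is synthesised as follows:
\[
\begin{array}{l}
SPEC = (\triangleright)/\ [MAP(initial)] \mathbin{+\!\!+} ((MAP \circ f') \ast (reverse\ m)) \mathbin{+\!\!+} [MAP(final)] \\
MAP(initial) = \mu X \bullet\ left\,?\,eot \to right\,!\,eot \to SKIP \\
\qquad\qquad |\ \ left\,?\,x \to right\,!\,\langle x, e \rangle \to X \\
MAP(f'\ a) = \mu X \bullet\ left\,?\,eot \to right\,!\,eot \to SKIP \\
\qquad\qquad |\ \ left\,?\,\langle x, y \rangle \to right\,!\,\langle x, f\ a\ x\ y \rangle \to X \\
MAP(final) = \mu X \bullet\ left\,?\,eot \to right\,!\,eot \to SKIP \\
\qquad\qquad |\ \ left\,?\,\langle x, y \rangle \to right\,!\,y \to X
\end{array}
\]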
It is important not to jump to the conclusion that every parallel algorithm can
be developed this way. There are two limitations to this approach. First, it
only deals with systems which can be specified functionally. Second, it may not
be possible to develop some parallel algorithms which use multi-directional
communications using this method. This second point will be practically assessed
in a later section while designing a multilevel-pipelined parallel program.
[Figure 9 diagram: a pipeline of processes MAP(initial), MAP(f′ an), ..., MAP(f′ a2), MAP(f′ a1), MAP(final).]
Figure 9: The decomposed network SPEC
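The decomposition rule can be checked on a concrete instance. The sketch below is an illustration only: the choice of f and e (which makes h m x evaluate a polynomial with coefficient list m at x by Horner's scheme) is an assumption of this example, as is the name specPipeline.

```haskell
-- Sketch of the decomposition rule for map (h m).
-- The concrete f and e are assumptions chosen for illustration.

f :: Int -> Int -> Int -> Int
f a x y = a + x * y

e :: Int
e = 0

-- h as specified: h [] x = e ; h (a : s) x = f a x (h s x)
h :: [Int] -> Int -> Int
h []      _ = e
h (a : s) x = f a x (h s x)

-- The original specification.
spec :: [Int] -> [Int] -> [Int]
spec m = map (h m)

-- Pipeline stages operating on pairs (x, y).
initial :: Int -> (Int, Int)
initial x = (x, e)

f' :: Int -> (Int, Int) -> (Int, Int)
f' a (x, y) = (x, f a x y)

final :: (Int, Int) -> Int
final (_, y) = y

-- Decomposed form: one map stage per element of m, composed into a pipeline.
specPipeline :: [Int] -> [Int] -> [Int]
specPipeline m = map final . foldr (.) id (map (map . f') m) . map initial
```

With m = [1, 2, 3], both spec and specPipeline map the inputs 0, 1, 2 to 1, 6, 17. Each element of m contributes one map stage, which corresponds to one process in the pipelined network.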
3.4. Handel-C as a Stage in the Development Model
Based on datatype refinement and the skeleton afforded by process refine-
ment, the desired reconfigurable circuits are built. Circuit realisation is done
using Handel-C, as it is based on the theories of CSP [39] and Occam [41].
From a practical standpoint, each refined datatype is defined as a structure
in Handel-C, while each process is implemented as a macro procedure. The
constructs corresponding to the CSP stage are divided into 2 main categories
for organisation purposes. The first category represents the definitions of the
refined datatypes. The second category implements the refined processes. The
refined processes are divided into different groups. The utility processes group
contains macros responsible for producing, storing, sinking, and broadcasting
data, etc. The basic processes group contains macros that correspond to simple
arithmetic and logical operations, such as addition and multiplication.
The higher-order processes group contains the macros
realising the CSP implementations corresponding to the higher-order functions.
A separate group contains the macros that handle the FPGA card setup and
general functionality. The reusable macros found in these groups serve as
building blocks used for constructing a certain specified algorithm. This organisation
is visualised in Figure 10.
[Figure 10 diagram: the Main Handel-C Program draws on four groups of files: Basic Processes Macros header and definition files (Modulus, Squaring, Multiplication, Addition); Utility Macros header and definition files (Produce, PP1000Load, PP1000Store, Sink); Higher-Order Processes macro definitions and header files (Map, Fold, Filter, ZipWith); and RC-1000 header files and libraries (Control, Status, Load, Store).]
Figure 10: Handel-C code constructs organisation
3.4.1. Datatypes Definitions
The datatypes definitions are implemented using structures. This method
supports recursive as well as simple types. The definition for an Item of a type
Msgtype is a structure that contains a communicating channel of that type.
#define Item(Name, Msgtype) \
struct {                    \
    chan Msgtype channel;   \
    Msgtype message;        \
} Name
For generality in implementing processes the type of the communicating
structure is to be determined at compile time. This is done using the typeof
type operator, which allows the type of an object to be determined at compile
time. For this reason, in each structure we declare a message variable of type
Msgtype.
A stream of items, called StreamOfItems, is a structure with three
declarations: a communicating channel, an EOT channel, and a message variable
[16]:
#define StreamOfItems(Name, Msgtype) \
struct {                             \
    Msgtype message;                 \
    chan Msgtype channel;            \
    chan Bool eotChannel;            \
} Name
A vector of items, called VectorOfItems, is a structure with a variable mes-
sage and another array of sub-structure elements [16].
#define VectorOfItems(Name, n, Msgtype) \
struct {                                \
    struct {                            \
        chan Msgtype channel;           \
    } elements[n];                      \
    Msgtype message;                    \
} Name
Other definitions are possible, but they affect the way a channel is accessed
using the structure member operator (.). Examples of different extended
definitions are as follows (the first definition reuses the Item structure, while
the second one employs channel arrays supported in Handel-C):
#define VectorOfItems(Name, n, Msgtype) \
struct {                                \
    struct {                            \
        Item(element, Msgtype);         \
    } elements[n];                      \
} Name
#define VectorOfItems(Name, n, Msgtype) \
struct {                                \
    chan Msgtype channel[n];            \
    Msgtype messages;                   \
} Name
In general, there are certain limitations in the Handel-C language, which
make the expression of a number of useful constructs either difficult or
impossible. An example of a declaration that is impossible to implement is as follows:
StreamOfItems(Name, VectorOfItems(Name, n, Int16));
A simple preprocessor would facilitate a higher level of Handel-C generic
definitions, and allow them to flow much more freely from our functional spec-
ifications. The implementation of such a preprocessor is being investigated
within our research group.
3.4.2. Utilities Macros
The utility processes used in the implementation are related to the employed
datatypes. The Handel-C implementation of these processes relies on their
corresponding CSP implementation. In the following, we present an instance of
these utility macros.
macro proc ProduceItem(Item, x){
Item.channel ! x;}
macro proc StoreItem(Item, x){
Item.channel ? x;}
3.4.3. Basic Processes Macros
This group of macros represents the fine-grained processes. A sample basic
macro procedure Addition is included as an example.
macro proc Addition(xItem, yItem, output){
typeof (xItem.message) x,y;
xItem.channel ? x;
yItem.channel ? y;
output.channel ! (x + y);}
3.4.4. Higher-Order Processes Macros
An example implementation in Handel-C of the CSP refinement of
a higher-order function (map) is given below. The process hinges around a
loop which terminates when the variable eot is set to true. At each step of the
loop, the process enters a wait state until either the EOT or the value channel
of the input stream is willing to communicate. If the EOT channel is willing
to communicate, the input is consumed from it and stored in the variable eot,
and an EOT message is then output on the output stream. If the value channel of
the input stream is willing to communicate, the value is consumed and F is
applied to it, giving the result on the output stream channel [16].
macro proc SMAP (streamin, streamout, F){
    Bool eot;
    eot = False;
    do{
        prialt{
            // EOT received: forward it on the output stream and stop
            case streamin.eotChannel ? eot:
                streamout.eotChannel ! True;
                break;
            // otherwise apply F to the next element
            default:
                F (streamin.elements, streamout.elements);
                break;
        }
    } while (!eot);
}
We turn the attention to providing a definition in Handel-C for the behaviour
of the process VMAP. Here we can employ Handel-C ’s enumerated par construct
to place n instances of the process F in parallel. Each instance is passed the
corresponding channels from both the input and output conduits [16].
macro proc VMAP (n, vectorin, vectorout, F) {
typeof (n) c;
par (c = 0 ; c < n ; c++){
F(vectorin.elements[c], vectorout.elements[c]);}}
The stream and vector implementations of the remaining
higher-order functions are done in a similar manner.
3.4.5. The RC-1000 System Control Macros
The Celoxica RC-x000 boards provide high-performance, real-time processing
capabilities and are optimised for the Celoxica DK design suite. The RC-1000
is a standard PCI bus card, with four onboard banks of SRAM, equipped
with a Xilinx Virtex with up to 2 million system gates [6].
According to the characteristics of the used system, some reusable macro
procedures were implemented to be employed in the development model. For
instance, reading an Item from a bank or storing an Item in a bank could be
implemented as in the following macros:
macro proc ReadItemFromBank1(r){
Int temp;
PP1000ReadBank1(0, temp);
r.channel ! temp;}
macro proc StoreItemToBank1(r) {
Int temp;
unsigned int 21 count;
r.channel? temp;
PP1000WriteBank1(0, temp);}
3.5. Evaluation Tools and Performance Metrics
Different tools are used to measure the performance metrics used for the
analysis. These tools include the design suite (DK ) from Celoxica, where we
get the number of NAND gates for the design as compiled to the Electronic
Design Interchange Format (EDIF ). The DK also affords the number of cycles
taken by a design using its simulator. Accordingly, the speed of a design can
be calculated from the expected maximum frequency of the design, which
is determined by the timing analyzer. From this frequency,
the time to execute one cycle (the Period) is obtained. The execution time
of a design is then the Period multiplied by the number of cycles. The
throughput is then calculated from the amount of data processed in that
execution time.
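These relationships can be written down directly. The sketch below assumes the maximum frequency is given in Hz and the amount of data as a count of items; the function names and any figures used with them are hypothetical and are not measurements from this work.

```haskell
-- Sketch of the performance relations described above.
-- period, executionTime and throughput are hypothetical names.

-- Period (seconds per cycle) from the maximum clock frequency in Hz.
period :: Double -> Double
period fMax = 1 / fMax

-- Execution time in seconds: the Period multiplied by the number of cycles.
executionTime :: Double -> Double -> Double
executionTime fMax cycles = period fMax * cycles

-- Throughput: amount of data processed per second of execution time.
throughput :: Double -> Double -> Double -> Double
throughput fMax cycles dataItems = dataItems / executionTime fMax cycles
```

For a hypothetical design clocked at 50 MHz that takes 1000 cycles, the execution time is about 20 microseconds; if 100 items are processed in that time, the throughput is about 5 million items per second.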
To get the practical execution time as observed from the host computer,
the C++ high-precision performance counter is used. The counter probes the
execution of the design after loading the image of the design into the FPGA till
termination.
The information about the hardware area occupied by a design, i.e. number
of Slices used after placing and routing the compiled code, is determined by
the ISE place and route tool. Using the same tool we are able to get more
detailed statistics about our compilation. In the current investigation the only
used metrics are the number of Slices and the Total Equivalent Gate Count for
a design.
4. A New Approach for Developing a Matrix Multiplication Algorithm
Many parallel implementations of matrix multiplication have been investigated
in the literature. Although this algorithm is simple and has a long
history, the continuous advancement in computer architectures has kept its
study very interesting. Many matrix multiplication algorithms have been
suggested for parallel implementation. Horowitz and Zorat in [42] and Hake in
[43] suggested a recursive divide-and-conquer solution. Fox et al. in [44] and
Cannon in [45] presented different ways this algorithm could be developed for a
mesh topology. Other implementations were discussed in [46, 47, 48, 49].
An important requirement for any parallel hardware computation is the
proper use of available computing resources. The aim is either to minimise
overall computation time, to minimise the chip area, or to compromise between
these two goals. Design flexibility is one of the main advantages granted by
the development model in hand. Accordingly, five designs refined from the
functional specification of the standard matrix multiplication algorithm are
presented. These designs range from high-speed implementations with expensive
use of resources to lower-speed implementations with less use of resources. The
development of the matrix multiplication algorithm is presented in the following
sections.
The development will start by formalizing the matrix multiplication algorithm.
The functional specification will favor the use of the predefined higher-order
functions. This functional specification will be the source for different
refinements with different degrees of parallelism described using CSP notation.
The created parallel designs will be used in Section 7 as the basis of
the code written in Handel-C.
5. Formal Functional Specification
An informal definition of the problem considers that the multiplication of
two matrices ass and bss produces the matrix css whose elements c_{ij}
(0 ≤ i < n, 0 ≤ j < k) are computed as follows:
\[ c_{ij} = \sum_{t=0}^{m-1} a_{i,t} \, b_{t,j} \]
where ass is an n × m matrix and bss is an m × k matrix. Items a and b
correspond to elements from matrices ass and bss, respectively.
Generally, partitioning can be done very easily with matrix multiplication,
where each matrix is divided into sub-matrices that can be manipulated as if
they were a single matrix element [44]. This method is used to divide matrices
with large dimensions to suit the expected limited capability of the available
computer.
We now turn our attention to the formalisation of the algorithm. A functional
specification of matrix multiplication is formulated as a function mmult that
takes two matrices as inputs and returns a matrix as a result. In this definition,
we assume the first matrix is represented as a list of rows and the second matrix
is represented as a list of columns.
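import Prelude hiding (sum)  -- sum is redefined below
-- Matrix product: ass is a list of rows, bss a list of columns
mmult :: [[Int]] -> [[Int]] -> [[Int]]
mmult ass bss = map (vmmult ass) bss
-- Product of the matrix ass with one column bs
vmmult :: [[Int]] -> [Int] -> [Int]
vmmult ass bs = map (scalarp bs) ass
-- Scalar product of two vectors
scalarp :: [Int] -> [Int]-> Int
scalarp as bs = sum (zipwithmul as bs)
-- Fold a non-empty list with addition
sum :: [Int] -> Int
sum rs = foldr1 (+) rs
-- Element-wise multiplication of two vectors
zipwithmul :: [Int] -> [Int] -> [Int]
zipwithmul as bs = zipWith (*) as bs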
The suggested algorithm multiplies two matrices by mapping (using the
higher-order function map) the function (vmmult ass) over all vectors
in bss. This function is the multiplication of a vector with a matrix. It takes
two inputs, a matrix (list of lists) ass and a vector (list) bs, and returns a
list cs (a column in the resulting matrix).
In turn, vmmult maps the function (scalarp bs) over the list of lists ass. The
function scalarp defines the scalar product of two vectors. The inputs to this
function are two lists as and bs. The higher-order function zipWith is used in
the function zipwithmul to zip the inputs with multiplication, then the function
sum is employed to fold the already zipped lists with addition. The output of
this composition is an element from the resultant matrix.
According to this specification, the implementation is tested under the
Hugs 98 Haskell interpreter at the unit, component and integration levels.
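As an illustration of such a test, the sketch below restates the specification and compares it with a direct transcription of the index formula for c_ij. The reference function mmultRef and the sample matrices are assumptions made for this example and are not taken from the original test suite.

```haskell
import Prelude hiding (sum)

-- Specification from the text: ass is a list of rows, bss a list of columns,
-- and the result is a list of columns.
mmult :: [[Int]] -> [[Int]] -> [[Int]]
mmult ass bss = map (vmmult ass) bss

vmmult :: [[Int]] -> [Int] -> [Int]
vmmult ass bs = map (scalarp bs) ass

scalarp :: [Int] -> [Int] -> Int
scalarp as bs = sum (zipWith (*) as bs)

sum :: [Int] -> Int
sum = foldr1 (+)

-- Hypothetical reference: c_ij = sum over t of a_it * b_tj, written with
-- explicit indices and returning columns, like mmult.
mmultRef :: [[Int]] -> [[Int]] -> [[Int]]
mmultRef ass bss =
  [ [ foldr (+) 0 [ (ass !! i) !! t * (bs !! t) | t <- [0 .. length bs - 1] ]
    | i <- [0 .. length ass - 1] ]
  | bs <- bss ]

-- Sample data: A = [[1,2],[3,4]] by rows, B = [[5,6],[7,8]] given by columns.
sampleA, sampleB :: [[Int]]
sampleA = [[1, 2], [3, 4]]
sampleB = [[5, 7], [6, 8]]
```

Evaluating `mmult sampleA sampleB == mmultRef sampleA sampleB` in an interpreter checks the two definitions against each other on this input.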
6. Algorithm Refinements
Clearly, parallelism is not a part of the starting specification of the stated
problem. Typically, parallelism and communications are introduced at this stage
in the development for the sole purpose of capturing functionally equivalent, but
parallel, designs.
Applying the provably correct refinement rules, the previous specification is
refined to CSP as a middle stage towards hardware realisation. The ability to
apply different data refinements makes various designs available, each with
different characteristics and levels of parallelism.
In the following refinements, five designs are presented. The first design is
a data-parallel design, while the second is a stream-based design. The
third and fourth designs address refinement to pipelined parallelism using
the function decomposition strategy. The last design is a 2D pipelined design
with an extension leading to a systolic implementation of the problem.
For more clarification of the used terms we recall the following informal
definitions. A data-parallel design replicates the same processes in order to
compute for different data inputs. Commonly, a Single Program Multiple Data
(SPMD) approach is used in data-parallel models, where data are distributed
across processors. In pipelined computations, a program is divided into a series
of tasks that have to be completed consecutively. Accordingly, these tasks are
executed by separate pipeline stages. The pipe stages stream data from stage to
stage to form the required computation. A stream-based design eliminates some
replication (in data-parallel designs) or some stages (in pipelined designs) and it
processes the input and output as streams of data. Systolic arrays are another
parallel computing architecture, best described by analogy with the regular
pumping of blood by the heart. In systolic arrays, processors are arranged in
an array where data flow across the array elements between neighbours. For
instance, a process firstly takes in data from one or more neighbours (North and
West). Secondly, the process manipulates the input data. Finally, the process
outputs results in the opposite direction (South and East).
6.1. First Design - Data Parallelism
Recalling the high-level specification of mmult :
mmult :: [[Int]] -> [[Int]] -> [[Int]]
mmult ass bss = map(vmmult ass) bss
In this design we consider the refinement of the input bss and the output
css as vectors of items of size k, where each item is a list.
mmult(ass) :: ⌊[Int]⌋_k → ⌊[Int]⌋_k
The CSP implementation of the functional specification of the matrix
multiplication algorithm mmult realises this function as a process MMULT. The
CSP process MMULT is the parallel execution of k copies of VMMULT (the
refinement of vmmult). This description is implemented using VMAP, the vector
setting of the higher-order function map. Therefore, it is the interleaving
with renaming of the process VMMULT for all columns of bss:
mmult(ass) ⊑ MMULT(ass)
MMULT(ass) = VMAPk(VMMULT(ass))
The list ass is passed as an argument to each of the processes VMMULT(ass)
in the above design. This design can be pictured as in Figure 11.
Figure 11: The process MMULT (k parallel VMMULT(ass) processes; process i receives bs_i on channel in_i and emits cs_i on channel out_i)
The list ass could be explicitly passed to the process VMMULT by exploiting
the following algebraic identity:
VMMULT (ass) = PRD(ass)  VMMULT
The effect of applying this step to the previous design is visualised in
Figure 12. In this version, the list ass is locally produced and fed to each
process VMMULT in the vector. The effect of having k parallel copies of
PRD(ass) communicating with k instances of VMMULT can be achieved by
factorising the process PRD(ass) and broadcasting its output to the relevant
processing elements in the network. Applying this rule results in a
semantically equivalent version of MMULT with a different layout, as shown in
Figure 13.
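At the functional level, the factorisation can be pictured as replacing an argument shared by every mapped copy with one value that is copied k times and paired with the k columns. The sketch below is an informal analogy only, not the CSP law itself; broadcast and mmultBroadcast are hypothetical names introduced here.

```haskell
scalarp :: [Int] -> [Int] -> Int
scalarp as bs = foldr (+) 0 (zipWith (*) as bs)

vmmult :: [[Int]] -> [Int] -> [Int]
vmmult ass bs = map (scalarp bs) ass

-- Original form: ass is an argument of every mapped copy of vmmult.
mmult :: [[Int]] -> [[Int]] -> [[Int]]
mmult ass bss = map (vmmult ass) bss

-- Hypothetical stand-in for BROADCAST: make k copies of one produced value.
broadcast :: Int -> a -> [a]
broadcast k x = replicate k x

-- Broadcast form: a single producer of ass, paired copy by copy with the
-- columns of bss.
mmultBroadcast :: [[Int]] -> [[Int]] -> [[Int]]
mmultBroadcast ass bss = zipWith vmmult (broadcast (length bss) ass) bss
```

Both forms return the same list of columns for any inputs, which mirrors the claim that the two process layouts are semantically equivalent.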
Figure 12: The process MMULT, an alternative implementation (each VMMULT receives a locally produced copy of ass)
Figure 13: The process MMULT, optimised implementation (a single BROADCAST_k(ass) feeds the k VMMULT processes)
Now we turn our attention to the refinement of the function vmmult.
vmmult :: [Int] -> [[Int]] -> [Int]
vmmult bs ass = map (scalarp bs) ass
Clearly, vmmult(bs) is a map pattern. Since map has two different
implementations, we will consider them in turn in this and the next design. In
this design, the CSP implementation realises the function vmmult as a process
VMMULT with a vector of items ass = ⌊as1, as2, ..., asn⌋ as input and a vector
of items cs = ⌊c1, c2, ..., cn⌋ as output. By refining scalarp into VSCALARP,
the CSP implementation of vmmult(bs) is again the off-the-shelf refinement of a
vector map:
vmmult(bs) :: ⌊[Int]⌋_m → [Int]_n
VMMULT(bs) = VMAPn(VSCALARP(bs))
By appealing to the same technique already used in the refinement of
VMMULT we get:
VSCALARP(bs) = PRD(bs)  VSCALARP
This leads to a new design of VMMULT(bs):
VMAPn(PRD(bs)  VSCALARP)
BROADCASTn(bs)  VMAPn(VSCALARP(bs))
Figure 14 shows the process VMMULT. This step also shows clearly the repli-
cation of the process VSCALARP, which is an indicator for a later replication
in the hardware implementation.
Figure 14: The process VMMULT (BROADCAST_n(bs) feeds n parallel VSCALARP processes; process i receives as_i on channel in_i and emits c_i on channel out_i)
Figure 15 expands the main building block in Figure 13 with the corresponding
configuration in Figure 14. This gives a two-dimensional visualisation of the
process MMULT as a data-parallel implementation.
The next step is to present the building block VSCALARP corresponding
to a CSP refinement of the function:
scalarp :: [Int] -> [Int] -> Int
scalarp as bs = sum (zipwithmul as bs)
This function can be refined as the piping of two processes VZIPWITH
and VFOLD corresponding to refinements of the functions zipwithmul and sum
respectively.
scalarp :: ⌊Int⌋_m → ⌊Int⌋_m → Int
VSCALARP = VZIPWITHm(MUL) >>m VFOLDm(ADD)
This description is depicted in Figure 16.
Figure 15: The process MMULT, data-parallel design (BROADCAST_k(ass) feeds k VMMULT blocks; block j contains BROADCAST_n(bs_j) feeding n VSCALARP processes that produce c_1j, ..., c_nj)
For completeness, the CSP implementations of the simple addition and mul-
tiplication functions are:
ADD = (in1?a → SKIP ||| in2?b → SKIP); out !(a + b)
MUL = (in1?a → SKIP ||| in2?b → SKIP); out !(a × b)
Figure 16: VSCALARP as a piping of two processes (VZIPWITH(MUL) takes the inputs as and bs and pipes its results into VFOLD(ADD), which produces the output)
6.2. Second Design - Streaming I/O
For this design the refinement of mmult is not changed. The change will
appear in the refinement of the function vmmult. Recall the formal specification
of this function:
vmmult::[Int] -> [[Int]] -> [Int]
vmmult bs ass = map (scalarp bs) ass
At this point, the input list ass is viewed as a stream of values ass=
〈as1, as2, ...asn〉 and the output list as a stream of values as well. The CSP re-
finement of vmmult(bs) is directly obtained from the off-the-shelf stream-based
implementation of the higher-order function map:
vmmult(bs) :: 〈[Int ]〉 → 〈Int〉
VMMULT (bs) = MAP(VSCALARP(bs))
Figure 17 shows the new version of the process VMMULT. This step also
shows clearly that there is no more replication of the process VSCALARP, which
is an indicator for the later reduction in use of hardware resources.
Keeping the refinement of the remaining functions the same, the MMULT
process is as shown in Figure 18.
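The behaviour of a stream-based map can be modelled functionally as a single worker that handles one item at a time until an end-of-transmission marker arrives. The sketch below is a simplified model under that assumption; the type Msg and the names mapProcess and vmmultStream are introduced here for illustration and are not the paper's definitions.

```haskell
-- A stream message: either a data item or the end-of-transmission marker.
data Msg a = Item a | EOT
  deriving (Eq, Show)

-- Model of a stream map: apply f to each item in turn, then copy the
-- end-of-transmission marker and stop.
mapProcess :: (a -> b) -> [Msg a] -> [Msg b]
mapProcess _ []            = []
mapProcess _ (EOT : _)     = [EOT]
mapProcess f (Item x : ms) = Item (f x) : mapProcess f ms

scalarp :: [Int] -> [Int] -> Int
scalarp as bs = foldr (+) 0 (zipWith (*) as bs)

-- Stream-based vmmult for a fixed column bs: the rows of ass arrive one at
-- a time and a single scalar-product worker handles all of them.
vmmultStream :: [Int] -> [Msg [Int]] -> [Msg Int]
vmmultStream bs = mapProcess (scalarp bs)
```

Only one instance of the scalar product is used here, in contrast to the n replicated instances of the data-parallel design.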
6.3. Third Design - Pipelining
Demonstrating the refinement to pipelined parallelism is the purpose of this
design. Generally, this kind of parallelism is a very effective means of
achieving efficiency in numerous algorithms. Usually, pipelined parallelism is
much harder to detect than data parallelism. Accordingly, the function
decomposition strategy, found in [40] and recalled in Section 3.3, is used.
This strategy aims at exhibiting pipelined parallelism in functional programs.
According to the decomposition rule, the definition of the function mmult is
pipelined.
Figure 17: The process VMMULT, with the input and output refined as streams of values (MAP(VSCALARP(bs)) consumes the stream as1, as2, ... and produces the stream c1, c2, ...)
Figure 18: The process MMULT, with the input and output refined as streams of values (BROADCAST_k(ass) feeds k VMMULT blocks; block j is MAP(VSCALARP) for column bs_j and produces the stream c_1j, c_2j, ...)
mmult ass bss = map (vmmult ass) bss
vmmult [] bs = []
vmmult (as : ass) bs = vscalarp as bs : vmmult ass bs
The recursive function in this case is vmmult. The value to be passed to the
next stage of the pipe is a tuple. Its first component is the input vector and
its second is the result of applying vscalarp to the input vector from matrix
bss and the present argument from matrix ass. According to the decomposition
rule, the efficient implementation of mmult as a pipelined network of CSP
processes can be as follows:
MMULT = PRD(bss)  (()/(MAP ◦ f ′) ∗ [asn , asn−1, ..., as0])
MAP(initial) = µZ • left?eot → right !eot → SKIP
|
left?x → right !〈bs, []〉 → Z
MAP(f ′ as) = µZ • left?eot → right !eot → SKIP
|
left?〈bs, y〉 → right !〈bs, y++(vscalarp as bs)〉 → Z
MAP(final) = µZ • left?eot → right !eot → SKIP
|
left?〈bs, y〉 → right !y → Z
The decomposed pipelined network is shown in Figure 19. In this design,
the matrix bss is input to the network as a stream of vectors (columns)
〈bs1, bs2, ...bsk 〉. The matrix ass vectors (row by row) are produced in the pipe
stages. The result is considered as a stream of streams 〈cs1, cs2, ...csk 〉. The
first result to appear from the network is the output stream (column) 〈csk 〉
corresponding to the first input vector bsk . This design is independent of
k, a dimension of bss and css.
Figure 19: The process MMULT as a pipelined network, third design (MAP(initial), then the stages MAP(scalarp as_n), ..., MAP(scalarp as_1), then MAP(final); tuples such as (bs_k, <c_nk>) flow from left to right and grow by one result per stage)
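A functional model of this pipeline may help to follow the flow of tuples. In the sketch below each stage receives a pair (bs, y) and extends y with its own scalar product; the names stage, initial, final, pipeColumn and mmultPipe are assumptions made for this example, and the rows are processed in list order, which simplifies the stage ordering used in the CSP network.

```haskell
scalarp :: [Int] -> [Int] -> Int
scalarp as bs = foldr (+) 0 (zipWith (*) as bs)

-- One pipe stage, holding a row as: receives (bs, partial column) and
-- appends its own scalar product before passing the tuple on.
stage :: [Int] -> ([Int], [Int]) -> ([Int], [Int])
stage as (bs, y) = (bs, y ++ [scalarp as bs])

-- The initial stage wraps an incoming column into a tuple with an empty result.
initial :: [Int] -> ([Int], [Int])
initial bs = (bs, [])

-- The final stage keeps only the accumulated column.
final :: ([Int], [Int]) -> [Int]
final (_, y) = y

-- The pipe for one column: compose the stages so that the first row's
-- stage is applied first.
pipeColumn :: [[Int]] -> [Int] -> [Int]
pipeColumn ass = final . foldl (flip (.)) id (map stage ass) . initial

-- The whole network: every column of the input stream passes through the pipe.
mmultPipe :: [[Int]] -> [[Int]] -> [[Int]]
mmultPipe ass bss = map (pipeColumn ass) bss
```

The number of stages equals the number of rows in ass, while the stream of columns can have any length, which matches the remark that the design does not depend on k.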
6.4. Fourth Design - Pipelined Turnout Stages
This design makes use of an optimisation of the previous design. In this
case, the input matrix bss is refined as a stream of vectors 〈bs1, bs2, ...bsk 〉 and
the matrix ass is refined as arguments in the pipeline stages. The output from
each stage is turned out as a result, besides forwarding the input from bss to
the next stage. Thus, the output from this pipeline is a vector of streams as
shown in Figure 20. Note that, this design also doesn’t depend on the size of k
- the dimension in bss and css. The CSP implementation is as follows:
MMULT = PRD(Bss)(()/(MAP ◦ f ′) ∗ [asn , asn−1, ..., as0]) ‖ SINK
MAP(f ′ as) = µZ • left?eot → right !eot → SKIP
|
left?bs → down!(vscalarp as bs)→ right !bs → Z
Figure 20: The process MMULT as a pipelined network, fourth design (stages MAP(scalarp as_n), ..., MAP(scalarp as_1) forward each bs_k to the right and to a final SINK; stage i turns out the stream c_ik = <scalarp as_i bs_k>)
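The turnout behaviour can be modelled in the same functional style. In this sketch each stage produces its own output stream and forwards the incoming columns unchanged; turnoutStage and mmultTurnout are hypothetical names, and lists stand in for both the streams and the vector of streams.

```haskell
scalarp :: [Int] -> [Int] -> Int
scalarp as bs = foldr (+) 0 (zipWith (*) as bs)

-- One turnout stage, holding a row as: for every incoming column it emits a
-- result on its own "down" stream and forwards the column to the right.
turnoutStage :: [Int] -> [[Int]] -> ([Int], [[Int]])
turnoutStage as bss = (map (scalarp as) bss, bss)

-- Chain the stages: collect each stage's down stream. The columns forwarded
-- by the last stage are dropped, as they would be by SINK.
mmultTurnout :: [[Int]] -> [[Int]] -> [[Int]]
mmultTurnout [] _ = []
mmultTurnout (as : rest) bss = down : mmultTurnout rest right
  where
    (down, right) = turnoutStage as bss
```

The result is one stream per row of ass, so it is organised by rows of the product, whereas the earlier specification returns a list of columns.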
6.5. Multilevel Pipelined Design
This design applies the function decomposition strategy for pipelined paral-
lelism on two levels. The first level is decomposing the vector matrix multipli-
cation into a pipeline of processes performing the scalar product of two vectors,
this is similar to the fourth design. An addition is made by pipelining the
scalar product routine creating a second level pipeline. The final structure of
the suggested design is multilevel pipelines realising the matrix multiplication
algorithm. The decomposition of the process VSCALARP is done in a similar
manner (See Figure 21):
VSCALARP = PRD(0)  (()/(MAP ◦ f ′) ∗ [am , am−1, ..., a0])
MAP(f ′ a) = µZ • left?eot → right !eot → SKIP
|
left?x → up?b → right !(x + (a × b))→ Z
Thus MMULT implementation is as follows:
MMULT = PRD(Bss)(()/(MAP ◦ f ′) ∗ [asn , asn−1, ..., as0]) ‖ SINK
MAP(f ′ as) = µZ • left?eot → right !eot → SKIP
|
left?bs → (PRD(bs)  VSCALARP(as)[down/right ]);
right?bs → Z
Figure 21: The process MMULT as 2D network design (each stage of the turnout pipeline is itself a pipeline of multiply-and-add steps that starts from 0, consumes the items b_1, ..., b_k of the incoming vector and produces c_ik = <vscalarp as_i bs_k>; the vectors bs_k are forwarded between stages and finally sunk)
An optimisation of this design would lead to a systolic solution. The main
idea of the change is to enable the communication between parallel VSCALARP
stages in MMULT. A VSCALARP is to be the parallel execution of basic cells,
each responsible for forwarding down the upper input from bss. The cell is also
responsible for doing its part of the scalar product computation. This part
is done by outputting to the right the result of adding the left input to the
multiplication of the upper input b (from bss) with the item a (from ass). The
basic process is called CELL (see Figure 22) and is defined as:
CELL(a) = PRD(a)  (up?u → left?l → right !(u ∗ a + l)→ down!u)
Then, VSCALARP is implemented as:
VSCALARP(as) = PRD(0) 
(‖_{i=1}^{m} CELL(as[i])[d_i/left, d_(i+1)/right, e_i/up, e_(i+1)/down])
Thus, MMULT implementation is as follows:
MMULT = PRD(Bss)(()/(MAP ◦ f ′) ∗ [asn , asn−1, ..., as0]) ‖ SINK
MAP(f ′ as) = µZ • up?eot → down!eot → SKIP
|
VSCALARP(as)[right/dm ]; Z
This implementation is depicted in Figure 23.
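The role of a single cell and of a row of cells can be illustrated with a small functional model. This is a sketch under simplifying assumptions: communication is modelled by function arguments and results, and the names cell, vscalarp and column are chosen for this example.

```haskell
-- Model of CELL(a): takes the upper input u and the left input l, and
-- returns (right output, down output) = (u * a + l, u).
cell :: Int -> (Int, Int) -> (Int, Int)
cell a (u, l) = (u * a + l, u)

-- Model of a row of m cells: the left input of the first cell is 0, each
-- cell passes its sum to the right and its upper input down.
-- Returns (scalar product, values forwarded down to the next row).
vscalarp :: [Int] -> [Int] -> (Int, [Int])
vscalarp as bs = foldl step (0, []) (zip as bs)
  where
    step (l, downs) (a, u) =
      let (r, d) = cell a (u, l)
      in  (r, downs ++ [d])

-- One column of the product: the vector bs enters the top row and is
-- forwarded from row to row; each row contributes one result item.
column :: [[Int]] -> [Int] -> [Int]
column ass bs = fst (foldl row ([], bs) ass)
  where
    row (cs, up) as =
      let (c, down) = vscalarp as up
      in  (cs ++ [c], down)
```

Each row forwards exactly the vector it received, so the same column reaches every row of cells without a separate broadcast.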
For the sake of giving a similar design, an intuitive CSP implementation of
the matrix multiplication algorithm shown in Figure 23 is as follows:
CELL(i,j )(ass[i , j ]) = µX •PRD(ass[i , j ])  ((up?u → down!u → (SKIP l
u = eot m left?l → right !(u ∗ ass[i , j ] + l))→ SKIP → X )
The matrix multiplication process MMULT is then implemented as:
MMULT = BROADCASTn(0)[d/out ] 
(‖_{i=1}^{n} (‖_{j=1}^{m} CELL(i, j)[b_ij/left, b_i(j+1)/right, e_ij/up, e_(i+1)j/down]))
Finally, these designs depend only on the dimensions n and m from ass.
Figure 22: Basic cell (CELL_ij has inputs up and left and outputs down and right; the right output is (up*a_ij)+left)
7. Reconfigurable Hardware Implementation
As a stage in the development model, Handel-C code follows the refined CSP
implementation of the presented designs. The targeted circuit implementation
is to have the same topology as the communicating processes shown in the
refinement section. In the following subsections, some pieces of code taken from
the five different designs are presented.
7.1. First Design - Data Parallelism
From the first design, we recall the CSP implementation of the process MMULT:
MMULT(ass) = VMAPk(VMMULT(ass))
The code corresponding to the process MMULT is the macro MatrixMult,
with bss as an input and css as an output. This macro is a vector mapping of the
macro VectMatrixMult, where the vector of vectors ass is internally produced.
Recall that the macro VMap is the implementation of the higher-order process
VMAP. In this case, VMap works by distributing the matrix bss among the
VectMatrixMult processes, each of which outputs a vector of css.
macro proc MatrixMult (bss, css, n){
VMap(bss, css, n, VectMatrixMult);}
Figure 23: The process MMULT as a systolic network (an n × m grid of cells CELL_11 ... CELL_nm; the items b_11 ... b_mk enter from the top and leave through SINK processes at the bottom, zeros enter from the left, and the results c_11 ... c_nk leave on the right; each cell outputs (up*a_ij)+left to the right)
Then, the macro VectMatrixMult implements the process VMMULT, refined
as:
VMMULT (ass) = PRD(ass)  VMMULT
This again applies VMap, calling the macro VScalarP (the implementation
of the process VSCALARP) for each vector in ass. At this point, bss is sunk,
as it will later be internally produced in VScalarP.
macro proc VectMatrixMult (bss, cs) {
VectorOfVectorsOfItems(ass, n, m, Int);
SinkVectorOfVectorsOfItems(bss, n, m, sink);
par {
ProduceVectorOfVectorsOfItems(ass, n, m, tempAss);
VMap(ass, cs, n, VScalarP);}}
The macro VScalarP implements the CSP process VSCALARP :
VSCALARP(bs) = PRD(bs)  VSCALARP
The internal production for a vector from bss is done according to an in-
dex. This allows producing a different vector bs from bss for each execution of
VScalarP.
macro proc VScalarP(as, outputItem, index){
VectorOfItems(internalV, m, Int);
VectorOfItems(bs, m, Int);
par {
ProduceVectorOfItems(bs, m, tempBss[index]);
VZipWith(m, as, bs, internalV, Multiplication);
VFoldR (internalV, outputItem, m, Addition, 0);}}
The macros Addition and Multiplication corresponding to the processes
ADD and MUL are implemented as:
macro proc Multiplication(xItem, yItem, output) {
Int x,y;
xItem.Channel ? x;
yItem.Channel ? y;
output.Channel ! (x * y);}
macro proc Addition(xItem, yItem, output) {
Int x,y;
xItem.Channel ? x;
yItem.Channel ? y;
output.Channel ! (x + y);}
Running the above implementation is done practically by producing bss and
storing css from/to a buffer. The RC-1000 board internal SRAMS are used as
the input and output buffers. The main macro call that runs the code imple-
menting the first design for the matrix multiplication algorithm is as follows:
LoadVectorOfVectorsOfItemsFromBank0(n, m, ass);
par{
ProduceVectorOfVectorsOfItems(bss, m, k, bssTemp);
MatrixMult(bss, css, m);
StoreVectorOfVectorsOfItems(css, n, k, cssTemp);}
7.2. Second Design - Streaming I/O
The code implementation of the current design reflects the change applied
to the first design in the refinement to CSP. The process VMMULT
definition is recalled:
VMMULT (ass) = PRD(ass)  VMMULT
The only modification to the implementation is done by refining ass to a
stream of vectors produced internally within VectMatrixMult. To meet the
change, this macro employs the macro SMap, the sequential implementation
of the process SMAP. Accordingly, the Handel-C code is:
macro proc VectMatrixMult (bss, cs) {
StreamOfVectorsOfItems(ass, n, m, Int);
SinkVectorOfVectorsOfItems(bss, n, m, sink);
par {
ProduceStreamOfVectorsOfItems(ass, n, m, tempAss);
SMap(ass, cs, n, VScalarP);}}
7.3. Third Design - Pipelining
For this design, the decomposed process for a single pipe stage is
implemented as the macro PipeStage. This macro starts by inputting from the
left a vector bs and a stream cs. These two inputs are the output of the
identical pipe stage to its left. For the initial pipe stage, the only input is
the produced stream of vectors bss; nothing arrives on the stream channel. The
PipeStage then computes the scalar product, forwards the new stream cs and the
vector bs to the identical process on its right, and finally waits for another
input. When the end of transmission is reached, an EOT signal is sent to the
right pipe stage. The output bss from the final stage is sunk, while the
stream of streams output css is stored as the result. The code of a single pipe
stage is as follows:
macro proc PipeStage(tupleIn, tupleOut, iAs){
.
.
.
Item(outputItem, Int16);
VectorOfItems(tempBs, m, Int16);
VectorOfItems(tempAs, m, Int16);
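do{
prialt{
// A vector bs is available on the left: read its m items
case tupleIn.element1.elements[0].channel ? tempVbs[0]:
par(j = 1; j < m; j++){
tupleIn.element1.elements[j].channel ? tempVbs[j];}
// Read the partial result stream computed by the previous stages
StoreStreamOfItems(tupleIn.element2.elements, iAs, tCs);
// Compute this stage's scalar product and append it to the stream
par{
ProduceVectorOfItems(tempBs, m, tempVbs);
ProduceVectorOfItems(tempAs, m, ass[iAs]);
VScalarP(tempAs, tempBs, m, outputItem);
StoreItem(outputItem, tCs[iAs]);}
// Forward bs and the extended stream to the stage on the right
ProduceVectorOfItems(tupleOut.element1, m, tempVbs);
ProduceStreamOfItems(tupleOut.element2, (iAs + 1), tCs);
break;
// End of transmission: propagate it on both output channels
case tupleIn.element1.eotChannel ? eot:
tupleIn.element2.eotChannel ? eot1;
tupleOut.element1.eotChannel ! True;
tupleOut.element2.eotChannel ! True;
break;}} while (!eot);}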
The general replicating macro that corresponds to the employed decomposition
is implemented as in the following code section. This macro takes advantage of
the ifselect and par Handel-C constructs. By using ifselect,
whole statements can be selected or discarded at compile time, depending on
the evaluation of the expression. Accordingly, the par statement selects only
one macro execution for P according to the value of c for each replication. The
parameter c corresponds to the pipe stage number. In this design this parameter
is initialised to 1 instead of 0, since a different initial pipe stage is
implemented to overcome a limitation in Handel-C. This limitation forbids the
production of a stream of streams of size 0, which is needed for the first pipe
stage. The final picture of this implementation is best depicted in Figure 19.
macro proc Pipe (tIn, tOut, n, P){
typeof(tIn) cmids[n - 1];
par (c = 1; c < n; c++){
ifselect (c == 1)
P(tIn, cmids[c], c);
else ifselect (c < n - 1)
P(cmids[c - 1], cmids[c], c);
else
P(cmids[c - 1], tOut, c);}}
The execution of the matrix multiplication is done as follows.
par {
ProduceStreamOfVectorsOfItemsFromBank0(bssIn, m, bss);
PipeStageInitial(bssIn, tIn, 0);
Pipe (tIn, tOut, n, PipeStage);
StoreStreamOfStreamsOfItemsInBank1(tOut.element2, n, tempOut1);
SinkStreamOfVectorsOfItems(tOut.element1, m, tempOut2);}
7.4. Fourth Design - Pipelined Turnout Stages
As in the implementation of the third design, this design uses a modified
version of the previous macros. In the new TurnoutPipeStage the output is a
vector of streams vOSOut. Each pipe stage outputs its own stream and issues
its own termination signal through its own eot channel. This macro works by
first inputting a vector from the stream bss, forwarding it to the right stage,
and then computing the scalar product and turning out its result. These steps
are repeated until the end of transmission of bss. At that point, local
termination signals are issued from the stage. The final picture of this
implementation is best depicted in Figure 20.
macro proc TurnoutPipeStage(sOVIn, sOVOut, vOSOut, indexForAs) {
.
.
.
VectorOfItems(tempBs, m, Int16);
VectorOfItems(tempAs, m, Int16);
do{
prialt{
case sOVIn.elements[0].channel ? tempVbs[0]:
par(j = 1; j < m; j++){
sOVIn.elements[j].channel ? tempVbs[j];}
ProduceVectorOfItems(sOVOut, m, tempVbs);
par{
ProduceVectorOfItems(tempBs, m, tempVbs);
ProduceVectorOfItems(tempAs, m, ass[indexForAs]);
VScalarP(tempAs, tempBs, m, vOSOut);}
break;
case sOVIn.eotChannel ? eot:
sOVOut.eotChannel ! True;
vOSOut.eotChannel ! True;
break;}} while (!eot);}
This pipe pattern is implemented as:
macro proc TurnoutPipe(in1, out1, out2, n, p) {
typeof(in1) cmids1[n + 1];
par (c = 0; c < n; c++){
ifselect (c == 0)
p (in1, cmids1[c], out2.elements[c], c);
else ifselect (c < n - 1)
p (cmids1[c - 1], cmids1[c], out2.elements[c], c);
else
p(cmids1[c - 1], out1, out2.elements[c], c);}}
The execution of this design’s implementation is done as follows:
void main(void) {
.
.
.
StreamOfVectorsOfItems(bssIn, m, Int16);
StreamOfVectorsOfItems(bssOut, m, Int16);
VectorOfStreamsOfItems(cssOut, n, Int16);
par {
ProduceStreamOfVectorsOfItems(bssIn, n, m, ass);
TurnoutPipe(bssIn, bssOut, cssOut, 3, TurnoutPipeStage);
StoreVectorOfStreamsOfItems(cssOut, n, k, tempCss);
SinkStreamOfVectorsOfItems(bssOut, m, tempBss);}
StoreMatrixInBank2(tempBss, n, k);}
It is clear from this design that n parallel output streams are employed.
Consequently, a buffer with multiple concurrent accesses is needed. This kind of
access is not allowed with the available single R/W onboard SRAMs. Besides,
the number of available banks for concurrent access is only 4. This introduces
the following limitations to the practical implementation of this design:
• The parameter k will appear as a static constant in the compilation. Thus,
for any new stream with a certain length a new compilation is needed.
• The limited ability of the FPGA to store the results on its internal area,
especially, for matrices with large dimensions.
The suggested general solution to this problem is storing the resultant
VectorOfStreamsOfItems on the local FPGA memory, and then storing them back
as a stream of values or a vector of streams with up to 4 parallel streams only.
7.5. Multilevel Pipelines Design
This design implementation uses two kinds of pipe replicating macros for the
two needed pipelined levels. The first pipe implements the decomposition for the
vector matrix multiplication process. Thus, we reuse the macro TurnoutPipe
employed in the fourth design. The second pipe macro, called
SystolicPipe, implements the scalar product process. This macro replicates pipe
stages with two inputs and two outputs. Recall the CSP implementation for
the basic computation cell:
CELL(a) = PRD(a)  (up?u → left?l → right !(u ∗ a + l)→ down!u)
The code corresponding to CELL is:
macro proc SystolicPipeCell(li, upV, r, downV, i, j) {
.
.
.
Item(temp, Int16);
upV.Channel ? tempb;
result = tempb * ass[i][j];
par{
Addition(temp, li, r);
ProduceItem(temp, result);}
downV.Channel ! tempb;}
To implement the complete algorithm, these cells are composed to form a
scalar product pipeline using the macro SystolicPipe.
macro proc SystolicPipe (in1l, in2up, out1r, out2d, n, i,P){
typeof(in1l) cmids1[n + 1];
par (c = 0; c < n; c++){
ifselect (c == 0)
P(in1l,in2up.elements[c],cmids1[c],out2d.elements[c],i,c);
else ifselect (c < n - 1)
P(cmids1[c-1],in2up.elements[c],cmids1[c],out2d.elements[c],i,c);
else
P(cmids1[c-1],in2up.elements[c],out1r,out2d.elements[c],i,c);}}
This scalar product pipe is the core of the pipe stage needed for the pipeline
implementing the matrix multiplication algorithm. A pipe stage is implemented
as:
macro proc PipeStage(bssIn, bssOut, cssOut, i){
VectorOfItems(tempBs, m, Int16);
Item(rO, Int16);
Item(lI, Int16);
.
.
.
do{
prialt{
case bssIn.elements[0].channel ? temp[0]:
par(j = 1; j < m; j++){
bssIn.elements[j].channel ? temp[j];}
par{
ProduceItem(lI, 0);
ProduceVectorOfItems(tempBs, m, temp);
SystolicPipe(lI, tempBs, rO, cssOut, m, i, SystolicPipeCell);
StoreItem(rO, tempItem);}
break;
case bssIn.eotChannel ? eot:
bssOut.eotChannel ! eot;
break;}} while (!eot);}
This pipe stage is then replicated to form a pipeline using the predefined
macro TurnoutPipe. The main code section running the above implementation
is:
void main(void) {
.
.
.
StreamOfVectorsOfItems(bssIn, m, Int16);
StreamOfVectorsOfItems(bssOut, m, Int16);
StreamOfVectorsOfItems(cssFinal, n, Int16);
par{
ProduceStreamOfVectorsOfItemsFromBank1(bssIn, m);
{TurnoutPipe(bssIn, bssOut, cssFinal, n, PipeStage);
cssFinal.eotChannel !True;}
SinkStreamOfVectorsOfItemsToBank3(bssOut, m);
StoreStreamOfVectorsOfItemsInBank2(cssFinal, n);}}
8. Performance Analysis and Evaluation
The development originates from a specification stage, whose key
feature is its powerful higher level of abstraction. During the specification,
the isolation from parallel hardware implementation technicalities allowed for
deep concentration on the specification details. For the most part,
the style of specification comes out in favor of using higher-order functions.
Two other inherent advantages of using the functional paradigm are clarity
and conciseness of the specification. This was reflected throughout all the
presented studies. At this level of development, the correctness of the
specification is ensured by construction from the correct building blocks used.
The implementation of the formalised specification is tested under Haskell by
performing random tests for every level of the specification.
The correctness is carried forward to the next stage of development
by applying the provably correct rules of refinement. The available pool of
formal refinement rules enables a high degree of flexibility in creating
parallel designs. This includes the capacity to divide a problem into completely
independent parts that can be executed simultaneously (pleasantly parallel).
Conversely, in a nearly pleasantly parallel manner, the computations might
require results to be distributed, collected and combined in some way. Remember
at this point that the refinement steps are systematic and done by combining
off-the-shelf reusable instances of basic building blocks.
In this case study, we measure the speed in Items per Second (ips). For
instance, a 3 × 3 matrix has 9 items.
In Table 1 the results for running the different designs are presented. The
first design occupied an area of 564 Slices per Item running at a speed of 257.14
Kips for a network with dimensions of (3×3×3). The second design occupied a
smaller area per item (323 Slices per Item), as expected, and runs at a speed
of 280 Kips. The pipelined third and fourth designs achieved a better speed of
2.25 Mips with smaller areas of 227 and 237 Slices per Item, respectively. The
smallest area ratio of 158.7 Slices per Item was obtained when placing and
routing the 2D pipelined design. The speed achieved by this design is 3.1 Mips.
Table 1: The results of testing the suggested matrix multiplication designs
Metrics                      First Design       Second Design      Third Design       Fourth Design      2D Pipelines Design
Highest Dimension Reached    3x3x3 (nxmxk)      7x7 (mxk)          9x9 (nxm)          9x9 (nxm)          11x11 (nxm)
Number of Gates              105158 NAND Gates  257125 NAND Gates  366667 NAND Gates  388103 NAND Gates  466955 NAND Gates
Measured Execution Time      35 Micro Sec.      175 Micro Sec.     36 Micro Sec.      36 Micro Sec.      36 Micro Sec.
Measured Speed               257.14 Kips        280 Kips           2.25 Mips          2.25 Mips          3.1 Mips
Number of Occupied Slices    5076 (26%)         15831 (82%)        18385 (95%)        19198 (99%)        19198 (99%)
Total equivalent gate count  363682             248752             301947             327024             376515
Slices to Items ratio        564 Slices/Item    323 Slices/Item    227 Slices/Item    237 Slices/Item    158.7 Slices/Item
The second design is 8.9% faster than the first design, and its Slices to
Items ratio is 42.7% lower. Thus, the second design can accommodate a larger
number of items at a better speed than the first design. The modification
applied to the third pipelined design to obtain the fourth pipelined design
had no effect on the speed of execution. However, the fourth design occupied
a 4.4% larger area, so the modification did not improve performance. The 2D
pipelined design performed better than the other designs. For instance, it
occupies 30.1% fewer Slices per Item than the third pipelined design while
achieving a 38% higher speed.
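The relative figures quoted above can be reproduced from the speeds and
Slices per Item reported in Table 1. The sketch below is only a check of the
arithmetic; the names pctChange, round1 and comparisons are hypothetical.

```haskell
-- Percentage by which x differs from the baseline b (negative when smaller).
pctChange :: Double -> Double -> Double
pctChange b x = 100 * (x - b) / b

-- Round to one decimal place for reporting.
round1 :: Double -> Double
round1 v = fromIntegral (round (v * 10) :: Integer) / 10

-- Comparisons discussed in the text, using the values reported in Table 1.
comparisons :: [(String, Double)]
comparisons =
  [ ("second vs first, speed (Kips)",           pctChange 257.14 280)
  , ("second vs first, Slices per Item",        pctChange 564 323)
  , ("fourth vs third, Slices per Item",        pctChange 227 237)
  , ("2D pipelined vs third, Slices per Item",  pctChange 227 158.7)
  , ("2D pipelined vs third, speed (Mips)",     pctChange 2.25 3.1)
  ]
```

Evaluating map (round1 . snd) comparisons in GHCi gives the percentage for
each comparison. The last entry comes out just under 38%, which the text
rounds to 38%.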
The (11×11) 2D pipelined cells design is independent of the third dimension
k. In Table 2 we compare the execution time for different values of k between
the RC-1000 and two personal computers: a 1.2 GHz AMD Athlon machine with
512MB of RAM, and a 1.4 GHz P4 with 1GB of RAM. Figures 24 and 25 show that
both machines outperform the suggested design implementation on the RC-1000
once k reaches about 299 items. We note here the possible effect of the bus
connecting the memory and the FPGA on the speed of execution. To cope with
this limitation, a cache memory could be added to handle the input and output
streams of data.
Table 2: Comparisons between the results of testing the 2D pipelined design and a C++
implementation running on two different personal computing machines; the results shown are
in Micro Seconds
Dimension 11x11xk, k =   RC-1000   Athlon Machine 1.2 GHz   Pentium 4 Machine 1.4 GHz
11                       39        239.3                    37.19
99                       56        540.739                  342.339
199                      477       702.552                  661.62
299                      1403      1030.5                   946.59
599                      3890      2711.7                   1887.76
999                      7158      4051                     3176
2999                     23569     10790.4                  9484
6999                     57374     26249.5                  22358
9999                     81963     34702.8                  32502.15
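The point at which the two machines overtake the RC-1000 can be read off
Table 2 mechanically. The following sketch transcribes the table and brackets
the crossover. The names table2, lastAhead and firstBehind are hypothetical,
and no interpolation between the measured values of k is attempted.

```haskell
-- (k, RC-1000, Athlon 1.2 GHz, Pentium 4 1.4 GHz); times in microseconds,
-- transcribed from Table 2.
table2 :: [(Int, Double, Double, Double)]
table2 =
  [ (11,   39,    239.3,   37.19)
  , (99,   56,    540.739, 342.339)
  , (199,  477,   702.552, 661.62)
  , (299,  1403,  1030.5,  946.59)
  , (599,  3890,  2711.7,  1887.76)
  , (999,  7158,  4051,    3176)
  , (2999, 23569, 10790.4, 9484)
  , (6999, 57374, 26249.5, 22358)
  , (9999, 81963, 34702.8, 32502.15)
  ]

-- Largest measured k at which the RC-1000 is faster than both machines.
lastAhead :: Maybe Int
lastAhead =
  case [ k | (k, rc, athlon, p4) <- table2, rc < athlon, rc < p4 ] of
    [] -> Nothing
    ks -> Just (last ks)

-- Smallest measured k at which the RC-1000 is slower than both machines.
firstBehind :: Maybe Int
firstBehind =
  case [ k | (k, rc, athlon, p4) <- table2, rc > athlon, rc > p4 ] of
    []      -> Nothing
    (k : _) -> Just k
```

With the values above, lastAhead is Just 199 and firstBehind is Just 299, so
the crossover lies between those two measurements. At k = 11 the RC-1000 is
faster than the Athlon machine but marginally slower than the Pentium 4.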
9. Acknowledgement
I would like to thank Dr. Ali Abdallah, Prof. Mark Josephs, Prof. Wayne
Luk, Dr. Sylvia Jennings, and Dr. John Hawkins for their insightful comments
on the research which is partly presented in this paper.
[Chart: execution time in Micro Seconds (0 to 4200) against dimension
(11x11x11, 11x11x99, 11x11x199, 11x11x299, 11x11x599) for three series:
16-Bit 2D Pipelines, Athlon 1.2, P4]
Figure 24: A chart showing the change in execution wrt dimension as shown in Table 2 for
small values of k
[Chart: execution time in Micro Seconds (0 to 88000) against dimension
(11x11x11 to 11x11x9999) for three series: 16-Bit 2D Pipelines, Athlon 1.2,
P4]
Figure 25: A chart showing the change in execution wrt dimension as shown in Table 2
10. Conclusion
Mapping parallel versions of algorithms onto hardware can enormously improve
computational efficiency. Recent advances in the area of reconfigurable
computing came in the form of FPGAs and their high-level HDLs such as
Handel-C. In this paper, we build on these advances by presenting,
demonstrating and examining a systematic approach to behavioural synthesis.
The approach starts from a functional specification of an algorithm that does
not define parallelism. From it, an efficient parallel implementation is
derived in the form of a CSP network of processes, which is then realised in
Handel-C. The presented work covered both the theory and the practice of the
suggested methodology. The paper also demonstrated the use of the proposed
model to synthesise reconfigurable hardware for the matrix multiplication
algorithm. The general functional specification was discussed first, followed
by the provably correct step-wise refinement to CSP. Several designs with
different levels of parallelism were engineered and compiled to reconfigurable
hardware. The hardware implementation in Handel-C was shown, stressing its
correspondence to the refined CSP networks. To complete the synthesis, these
designs were compiled to EDIF format and then placed and routed. A performance
study of the realised designs was included. The first design required the
largest area with respect to the number of items used, that is 564 Slices per
Item for a speed of 257.14 Kips. The modification to the first design which
led to the second design reduced the area to 323 Slices per Item for a speed
of 280 Kips. The third and fourth pipelined designs occupied areas of 227 and
237 Slices per Item for a speed of 2.25 Mips. The last realised design, the 2D
pipelines design, has the best area to items ratio of 158.7 Slices per Item
and a speed of 3.1 Mips. Future work includes extending the theoretical pool
of refinement rules, investigating the automation of the development process,
and optimising the realisation for more economical implementations with
higher throughput.
References
[1] G. Estrin, B. Bussell, R. Turn, J. Bibb, Parallel processing in a restruc-
turable computer system, IEEE Transactions on Electronic Computers
12 (6) (1963) 747–755.
[2] Xilinx, Information available from, http://www.xilinx.com.
[3] Altera, Information available from, http://www.Altera.com.
[4] Altium, Altium unveils new board-on-chip technology, Altium Limited Cat-
egory: Press Releases : Industry News (Market) http://www.altium.com
(April 2003).
[5] K. Torkelsson, J. Ditmar, Header Compression in Handel-C An Internet
Application and A New Design Language, in: Symposium on Digital Sys-
tems Design, Euromicro, 2001, pp. 2–7.
[6] Celoxica, Information available from, http://www.celoxica.com.
[7] I. Page, Logarithmic greatest common divisor example in Handel-C, Em-
bedded Solutions (April 1998).
[8] S. Stepney, CSP/FDR2 to Handel-C translation, Tech. Rep. YCS-2002-357,
Department of Computer Science, University of York (June 2003).
[9] D. Edwards, S. Harris, J. Forge, High performance hardware from java,
Xilinx Whitepaper http://www.xilinx.com.
[10] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, J. Stockwood,
Hardware-software codesign of embedded reconfigurable architectures, in:
Proceedings of the 37th Design Automation Conference, Los Angeles - USA,
2000, p. 30.
[11] N. Technology, Information available from, http://www.nimble.com.
[12] S. Network, Information available arom, http://www.systemc.org.
[13] Viva, Information available from, http://www.starbridgesystems.com.
[14] A. E. Abdallah, Functional process modelling, Research Directions in Par-
allel Functional Programming, (Springer Verlag, October 1999) (1999) 339–
360.
[15] A. E. Abdallah, Derivation of Parallel Algorithms: From Functional Speci-
fications to csp Processes, in: B. Moller (Ed.), Proceedings of Mathematics
of Program Construction, Vol. 947 of Lecture Notes in Computer Science,
Springer-Verlag, 1994, pp. 67–96.
43
[16] A. E. Abdallah, J. Hawkins, Calculational Design of Special Purpose Par-
allel Algorithms, in: Proceedings of 7th IEEE International Conference on
Electronics, Circuits and Systems (IEEE/ICECS), IEEE Computer Society
Press, 2000, pp. 261–267.
[17] A. E. Abdallah, J. Hawkins, Formal Behavioural Synthesis of handel-c Par-
allel Hardware Implementation for Functional Specifications, in: Proceed-
ings of the 36th Annual Hawaii International Conference on System Sci-
ences, IEEE Computer Society Press, 2003, pp. 278–288.
[18] Y. Lee, B. Ryder, A comprehensive approach to parallel data flow analysis,
in: Proceedings of the 6th international conference on Supercomputing,
ACM Press, 1992, pp. 236–247.
[19] R. Bird, P. Wadler, Introduction to Functional Programming, Prentice-
Hall, 1988.
[20] R. Bird, Introduction to Functional Programming Using Haskell, Addison
Wesley, 1999.
[21] R. Bird, An introduction to the theory of lists, in: M. Broy (Ed.), Logic of
Programming and Calculi of Discrete Design, Springer, Berlin, Heidelberg,
1987, pp. 5–42.
[22] M. Cole, Algorithmic Skeletons: A Structured Approach to the Manage-
ment of Parallel Computation, Ph.D. thesis, Computer Science Depart-
ment, University of Edinburgh, Edinburgh, Scotland, UK (1988).
[23] J. Darlington, A. Field, P. Harrison, H. Paul, J. Kelly, D. Sharp, Q. Wu,
Parallel programming using skeleton functions, in: Proceedings of the 5th
International PARLE Conference on Parallel Architectures and Languages
Europe, Springer-Verlag, 1993, pp. 146–160.
[24] S. Gorlatch, C. Lengauer, Parallelization of divide-and-conquer in the bird-
meertens formalism, Formal Aspects of Computing 7 (6) (1995) 663–682.
[25] F. A. Rabhi, Exploiting Parallelism in Functional Languages: A
”Paradigm-Oriented“ Approach, in: T. Lake, P. Dew (Eds.), Abstract Ma-
chine Models for Highly Parallel Computers, Oxford University Press, 1993,
p. 30.
[26] F. Hanna, W. Howells, Parallel Theorem Proving, in: C. Runciman,
D. Wakeling (Eds.), Applications of Functional Programming, UCL Press,
1994, Ch. 12, pp. 221– 235.
URL http://www.cs.ukc.ac.uk/pubs/1994/432
[27] D. Skillicorn, Foundations of Parallel Programming, Cambridge University
Press, 1994.
44
[28] K. Claessen, Embedded languages for describing and verifying hardware,
Ph.D. thesis, Chalmers Univesity of Technology and Go¨teborg University,
Sweden (April 2001).
[29] J. Launchbury, J. Lewis, B. Cook, On embedding a microarchitectural de-
sign language within haskell, in: Proceedings of the fourth ACM SIGPLAN
international conference on Functional programming, ACM Press, 1999, pp.
60–69.
[30] J. Matthews, J. Launchbury, B. Cook, Specifying microprocessors in hawk,
in: Proceedings of the International Conference on Computer Languages,
IEEE, 1998, pp. 90–101.
[31] J. O’Donnell, Hydra: hardware description in a functional language using
recursion equations and high order combining forms, in: G. J. Milne (Ed.),
The Fusion of Hardware Design and Verification, North-Holland, Amster-
dam, 1988, pp. 309–328.
[32] Y. Li, M. Leeser, HML: An innovative hardware design language and its
translation to VHDL, in: Conference on Hardware Design Languages, 1995.
[33] D. Barton, Advanced modeling features of MHDL, in: In International
Conference on Electronic Hardware Description Languages, 1995.
[34] S. Johnson, B. Bose, DDD: A system for mechanized digital design deriva-
tion, Tech. Rep. 323, Indiana University, Indiana (1990).
[35] R. Sharp, Higher-level hardware synthesis, Ph.D. thesis, Robinson College
University of Cambridge, Cambridge (November 2002).
[36] M. Sheeran, muFP: a language for VLSI design, in: Proc. ACM Symposium
on LISP and Functional Programming, ACM Press, 1984, pp. 104–112.
[37] G. Jones, M. Sheeran, Circuit design in ruby, In Formal Methods for VLSI
design (1990) 13–70.
[38] T. Cheung, G. Hellestrand, Multi-level equivalence in design transforma-
tion, in: Proceedings of International Conference on Computer Hardware
Description Languages, Chiba Japan, 1996, pp. 559–566.
[39] C. A. R. Hoare, Communicating Sequential Processes, Prentice-Hall, 1985.
[40] A. E. Abdallah, Synthesis of massively pipelined algorithms for list manip-
ulation, in: L. Bouge, P. Fraigniaud, A. Mignotte, Y. Robert (Eds.), Pro-
ceedings of the European Conference on Parallel Processing, EuroPar’96,
LNCS 1024, (Springer Verlag, 1996), Springer Verlag, 1996, pp. 911–920.
[41] I. Ltd., OCCAM 2 reference manual, Prentice-Hall International (1988).
[42] E. Horowitz, A. Zorat, Divide-and-conquer for parallel processing, IEEE
Trans. Comput. C32 (6) (1983) 582–585.
45
[43] J. Hake, Parallel algorithms for matrix operations and their performance
on multiprocessor systems, Advances in Parallel Algorithms.
[44] G. Fox, M. Johnson, G. Lyenga, S. Otto, J. Salmon, D. Walker, Solv-
ing Problems on Concurrent Processors, Vol. 1, Prentice Hall, Emglewood
Cliffs, New Jersy, 1988.
[45] L. Cannon, A celluler computer to implement the kalman filter algorithm,
Ph.D. Thesis, Montana State University, Bozman - Montana (1969).
[46] J. Choi, J. Dongarra, R. Pozo, D. Walker, SCALAPACK: A scalable lin-
ear algebra library for distributed memory concurrent computers, in: Pro-
ceedings of the Fourth Symposium on the Frontiers of Massively Parallel
Computation, IEEE Comput. Soc., 1992, pp. 120–127.
[47] J. Choi, J. Dongarra, D. Walker, PUMMA: Parallel universal matrix mul-
tiplication algorithms on distributed memory concurrent computers, Con-
currency: Practice and Experience 6 (1994) 543–57.
[48] J. Dongarra, I. Duff, D. Sorensen, H. V. D. Vorst, Solving linear systems
on vector and shared memory computers, in: SIAM, 1991, p. 30.
[49] S. Huss-Lederman, E. Jacobson, A. Tsao, Comparison of scalable parallel
matrix multiplication libraries, in: Proceedings of the Scalable Parallel
Libraries Conference, IEEE Comput. Soc., Starksville - MS, 1993, pp. 120–
127.
46
11. Biography for Issam Damaj
Issam W. Damaj (Ph.D. M.Eng. B.Eng. MIEEE MIEE) received his B.Eng.
in Computer Engineering from Beirut Arab University in 1999 (with high dis-
tinction), and his M.Eng. in Computer and Communications Engineering from
the American University of Beirut in 2001 (with high distinction). He was
awarded his Ph.D. degree in Computer Science from London South Bank Uni-
versity, London, United Kingdom in 2004. Currently, he is with the Electrical
and Computer Engineering Department at Hariri Canadian Academy for Sciences
and Technology, Lebanon. His research interests include reconfigurable
computing, parallel processing, hardware/software co-design, computer
interfacing and applications, fuzzy logic, and computer security. He has more
than 25 international and regional research publications and projects. He is
a Member of the IEEE and IEE professional organizations, and of the Order of
Engineers in Beirut.
