High Performance Reconfigurable Computing Systems by Damaj, Issam
  
 
 
 
 
Chapter 3 
 
 
 
HIGH PERFORMANCE RECONFIGURABLE 
COMPUTING SYSTEMS 
 
 
Issam W. Damaj* 
Dept. of Electrical and Computer Engineering Dhofar University 
P.O. Box 2509, 211 Salalah, Oman 
Abstract 
The rapid progress in electronic chip technology provides a variety of 
new implementation options for system engineers. The choice varies between flexible 
programs running on a general purpose processor (GPP) and a fixed hardware implementation 
using an application specific integrated circuit (ASIC). Many other implementation options 
exist, for instance, a system with a RISC processor and a DSP core. Other options include 
graphics processors and microcontrollers. Specialist processors certainly improve performance 
over general-purpose ones, but this comes at the expense of flexibility. Combining the 
flexibility of GPPs and the high performance of ASICs leads to the introduction of 
reconfigurable computing (RC) as a new implementation option with a balance between 
versatility and speed. 
Field Programmable Gate Arrays (FPGAs), nowadays important components of RC-systems, 
have shown a dramatic increase in density over the last few years. For example, 
companies like Xilinx [1] and Altera [2] have enabled the production of FPGAs with several 
million gates, such as the Virtex-2 Pro and the Stratix-2 FPGAs. Considerable research 
efforts have been made to develop a variety of RC-systems. Research prototypes with fine-grain 
granularity include Splash [3], DECPeRLe-1 [4], DPGA [5] and Garp [6]. Examples of systems 
with coarse-grain granularity are RaPiD [7], MorphoSys [8], and RAW [9]. Many other systems 
were also developed, for instance, rDPA [10], MATRIX [11], REMARC [12], DISC [13], Spyder 
[14] and PRISM [15].  
The focus of this chapter is on introducing reconfigurable computers as modern 
supercomputing architectures. The chapter also investigates the main reasons behind the current 
advancement in the development of RC-systems. Furthermore, a technical survey of various 
RC-systems is included, laying common ground for comparisons. In addition, this chapter 
mainly presents case studies implemented under the MorphoSys RC-system. The selected case 
studies belong to different areas of application, such as computer graphics and information 
coding. Parallel versions of the studied algorithms are developed to match the topologies 
supported by the MorphoSys. Performance evaluation and results analyses are included for 
implementations with different characteristics. 

* E-mail address: i_damaj@du.edu.om 
1. Introduction 
The rapid progress in electronic chip technology provides a variety of new 
implementation options for system engineers. The choice varies between flexible programs 
running on a general purpose processor (GPP) and a fixed hardware implementation using 
an application specific integrated circuit (ASIC). Many other implementation options exist, 
for instance, a system with a RISC processor and a DSP core. Other options include graphics 
processors and microcontrollers. Specialist processors certainly improve performance over 
general-purpose ones, but this comes at the expense of flexibility. Combining the flexibility 
of GPPs and the high performance of ASICs leads to the introduction of reconfigurable 
computing (RC) as a new implementation option with a balance between versatility and speed. 
GPPs are programmed entirely through software. GPPs have wide applicability; 
nevertheless, they may not match the computational needs of many applications. ASICs are 
custom designed for applications. The architecture of an ASIC exploits intrinsic characteristics 
of an application's algorithm, which leads to high performance. However, the direct mapping 
of algorithm to architecture restricts the range of applicability of ASIC-based systems. ASICs 
provide precisely the function needed for a specific task. By tailoring each ASIC to a given 
job, the designer can produce chips that are fast and cheap, and that consume less power than 
programmable or general-purpose processors. 
Along with trying to find the right balance between versatility and speed, designers have 
to face the cost constraint too. In addition to the cost of manufacturing several ASICs, there 
is the cost of design. A well-designed ASIC can solve a certain problem but not a 
slightly modified problem, and the design of a new ASIC cannot reuse much of the effort 
spent on the old one, since the old design is too highly customized. Thus, the effort expended 
on the design of an ASIC is almost entirely lost when designing another ASIC, even one that 
performs a closely related task. 
With reconfigurable computing (RC), a new computation paradigm has emerged over the 
last decade, which intends to fill the gap between conventional microprocessors and 
application-specific integrated circuits (ASICs) [16]. All reconfigurable systems share the same 
basic idea: to benefit from programmable logic, which allows the system's functionality to be 
dynamically adapted to the requirements of the running application. The most popular devices, which 
actually enabled reconfigurable computing, are Field-Programmable Gate Arrays (FPGAs), 
which were introduced in the mid-eighties. Many approaches to reconfigurable systems have 
been proposed in recent years; some of them are known as hybrids, which combine 
reconfigurable hardware with a processor core, such as the MorphoSys reconfigurable system 
designed at the University of California, Irvine (UCI). 
The focus of this chapter is on introducing reconfigurable computers as modern 
supercomputing architectures. The chapter also investigates the main reasons behind the current 
advancement in the development of RC-systems. Furthermore, a technical survey of various 
RC-systems is included, laying common ground for comparisons. In addition, this chapter 
mainly presents case studies implemented under the MorphoSys RC-system. The selected case 
studies belong to different areas of application, such as computer graphics, information coding, 
and signal processing. Parallel versions of the studied algorithms are developed to match the 
topologies supported by the MorphoSys. Performance evaluation and results analyses are 
included for implementations with different characteristics. 
2. Reconfigurable Computing Systems 
2.1. History of Reconfigurable Computing 
The first descriptions of computing automata capable of reconfiguration were put forward by 
John von Neumann in a series of lectures and unfinished manuscripts dating back to the late 
40’s and early 50’s. After the death of von Neumann in 1957, his works on self-reproducing 
automata were collected and edited by Arthur Burks, who published them in 1966. Although 
von Neumann is generally regarded as the main developer of the conventional serial model of 
computing, it seems obvious that during the last years of his life, he was more interested in 
more complicated computing automata. The earliest electronic computers were built with error-
prone vacuum tube technology. In 1959, Jack Kilby invented the monolithic integrated circuit 
at Texas Instruments, but it was not until the introduction of Intel’s 4004 microprocessor that 
general-purpose computers began to be integrated on the same silicon chip. 
In 1963, Gerald Estrin of the University of California at Los Angeles proposed a variable 
structure computer system to achieve performance gains in a variety of computational tasks 
[18]. The central idea was to combine both fixed and variable structure computer organizations, 
where the variable subsystem could be reorganized into a variety of problem-oriented special 
purpose configurations. 
The first suggestion for a programmable logic device is due to Sven Wahlstrom, who in 
1967 proposed the inclusion of additional gates to customize an array of integrated circuitry. 
However, the silicon “real estate” was an extremely scarce resource in those days. In 1967, 
Robert Minnick published a survey of microcellular research. He described both fixed cell-
function arrays and variable cell-function arrays. In fixed cell-function arrays the switching 
function of each cell remained fixed, and only the interconnections between cells were 
programmable. In the case of variable cell-function arrays, the function produced by each cell 
could also be determined by parameter selection. 
In the 70’s, interest in non-serial forms of computation, and the corresponding financial 
support for them, seem to have tapered off. This was most probably caused by the introduction of 
the first microprocessor, the famous 4004 by Intel, in 1971 and by the ever-growing number and 
scope of applications enabled by the expanding market of microprocessors and 
microcontrollers. 
In the eighties, there was a revived interest in both systolic and parallel architectures. This 
interest was inspired partly by the advances in semiconductor integration technology and the 
evolution of system design concepts (the seminal work on Very Large-Scale Integration 
(VLSI) design by Carver Mead and Lynn Conway was published in 1980), and partly by new and 
more demanding applications of supercomputing. An interesting design combining parallelism 
with reconfigurability was the Texas Reconfigurable Array Computer (TRAC) [19]. In the TRAC 
project, reconfigurability meant the reprogramming of interconnections between individual 
computing elements. 
In the late eighties and early nineties, the first platforms for reconfigurable computing were 
built. One of the first such platforms was designed at Digital Equipment Corporation’s (DEC) 
Paris Research Laboratory (PRL) and was called Programmable Active Memory (PAM) [20]. 
The speedups achieved by the PAM project were very impressive and as similar results were 
reported by other research groups at approximately the same time, one could say that 
reconfigurable computing had passed its first test. This was also realized by two influential 
engineering societies, the Association for Computing Machinery (ACM) and the Institute of 
Electrical and Electronics Engineers (IEEE). These engineering societies began sponsoring two 
annual conference series about the applications of FPGAs, namely the ACM/SIGDA 
International Symposium on Field-Programmable Gate Arrays and the IEEE Symposium on 
FPGA-Based Custom Computing Machines. In Europe, the first International Workshop on 
Field-Programmable Logic and Applications was held in 1991. Reconfigurable computing 
would not be possible without the advances in electronics, because large FPGA circuits have 
enormous silicon overhead; for example, a programmable logic device with 50,000 usable gates 
may have well over a million transistors. The FPGA market has therefore benefited 
tremendously from advances in semiconductor manufacturing technology. 
Xilinx, which was founded in 1984, introduced the world’s first FPGA in 1985. Being the 
first in the FPGA market, Xilinx continued to dominate it well into the nineties, but in recent 
years Altera Corporation has increased its market share substantially, due to the popularity of 
its new SRAM-based FPGAs, the FLEX 8000 and FLEX 10K families. The convergence of the 
theoretical path and the technological path in the nineties may mark the beginning of a 
promising era for reconfigurable computing. Many influential industry observers feel that the 
promises of reconfigurable computing are not just superficial media hype, but that there are 
real advantages to be gained by applying reconfigurable computing. 
2.2. Field Programmable Gate Arrays 
FPGAs are a hybrid between Programmable Array Logic devices (PALs) and Mask-Programmable 
Gate Arrays (MPGAs). Like PALs, they are fully electrically programmable, and they can be 
customized nearly instantaneously. Like MPGAs, they can implement very complex computations 
on a single chip, with multi-million-gate devices currently in production. These devices have 
opened up completely new avenues in high-performance computation, forming the basis of 
reconfigurable computing. Most current FPGAs are SRAM-programmable. This means that SRAM 
bits are connected to the configuration points in the FPGA, and programming the SRAM bits 
configures the FPGA. Thus, these chips can be programmed and reprogrammed as easily as a 
standard static RAM. A programming bit turns on a routing connection when it is configured 
with a true value, allowing a signal to flow from one wire to another, and disconnects these 
resources when it is set to false. With a proper interconnection of these elements, which 
may include millions of routing choice points within a single device, a rich routing network can 
be created. 
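The role of a configuration bit can be illustrated with a small software model. The sketch below is only an illustration under simplifying assumptions: the names RoutingSwitch and propagate are hypothetical, wires are modelled as named nodes, and electrical effects such as delay and signal contention are ignored. A value flows across a switch only when its SRAM bit is set.

```python
from dataclasses import dataclass


@dataclass
class RoutingSwitch:
    """One programmable routing point: an SRAM bit gating a wire-to-wire link."""
    src: str                # name of the driving wire
    dst: str                # name of the driven wire
    sram_bit: bool = False  # True connects src to dst, False disconnects them


def propagate(switches, driven):
    """Return the values on all wires reachable from the initially driven wires.

    `driven` maps wire names to logic values. A value flows from src to dst
    only through switches whose SRAM bit is set.
    """
    values = dict(driven)
    changed = True
    while changed:  # repeat until no further wire receives a value
        changed = False
        for sw in switches:
            if sw.sram_bit and sw.src in values and sw.dst not in values:
                values[sw.dst] = values[sw.src]
                changed = True
    return values


# Hypothetical configuration: a->b and b->d are connected, b->c is left open.
switches = [
    RoutingSwitch("a", "b", sram_bit=True),
    RoutingSwitch("b", "c", sram_bit=False),
    RoutingSwitch("b", "d", sram_bit=True),
]
routed = propagate(switches, {"a": 1})
```

Rewriting a single sram_bit and calling propagate again corresponds to reprogramming one configuration point; no other part of the model changes.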
Xilinx, one of the major manufacturers of FPGAs, defines them as high-density Application 
Specific Integrated Circuits (ASICs) combining the logic integration of custom Very Large 
Scale Integration (VLSI) with the time-to-market and cost advantages of standard products [1]. 
Another major producer, Lucent, also stresses in its publications the versatility of the FPGA 
and its time-to-market benefits. In the beginning, FPGAs were mostly regarded as being of 
importance for rapid prototyping of circuits that would later be hardwired, and thus as 
a design tool to aid profitability rather than as being of direct use in applications. That is 
changing as chips with higher gate counts and faster placement and routing tools become 
available [21]. 
With high-level languages, a complete picture of the structure and functionality of the 
whole circuit is built up in a text file, which is then fed through the appropriate 
software to produce a layout of gates, known as a netlist, that is downloaded to the FPGA. This 
is where the advantages of a system based around FPGAs for rapid prototyping are seen. 
2.3. Reconfigurable Systems Generalities 
Reconfigurable hardware can be used to provide reconfigurable functional units within a host 
processor. This allows for a traditional programming environment with the addition of custom 
instructions that may change over time. The reconfigurable units execute as functional units 
controlled by a main microprocessor. Registers are used to hold the input and output operands. 
Thus, the basic architecture of an RC-system comprises a software-programmable core processor 
and a reconfigurable hardware component. The core processor executes sequential tasks of the 
application and controls data transfers between the programmable hardware and data memory. 
Generally, the reconfigurable hardware is dedicated to the exploitation of parallelism available in 
the application's algorithm. This hardware typically consists of a collection of interconnected 
reconfigurable elements. Both the functionality of the elements and their interconnection are 
determined through a special configuration program called the context. 
State-of-the-art RC-systems have different architectures. A general classification can be 
made in terms of four main categories: granularity, depth of programmability, reconfigurability, 
and interface coupling. We note at this point that more detailed taxonomies are presented in 
[22, 23]. These four main categories are defined as follows: 
2.3.1. Granularity 
 
System granularity is defined by the internal structure of the reconfigurable elements. 
Computation blocks within the reconfigurable hardware vary from system to system. Each unit 
of computation can be as simple as a 3-input look-up table (LUT), or as complex as a 4-bit ALU. 
This difference in block size is commonly referred to as the granularity of the logic block. A 
3-input LUT is an example of a very fine-grained computational element, and a 4-bit ALU is an 
example of a quite coarse-grained unit. A fine-grained element operates at the bit level, 
implementing a Boolean function or a finite-state machine. The finer-grained blocks are useful 
for bit-level manipulations, while the coarse-grained blocks are better optimized for standard 
datapath applications. Several reconfigurable systems use a medium-sized granularity of logic 
block. A number of these architectures operate on two or more 4-bit wide data words. Examples of 
fine-grain reconfigurable systems are Splash and DECPeRLe-1, and MATRIX is an example of a 
coarse-grained reconfigurable system. 
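The contrast between the two block sizes can be made concrete with a short software sketch. It is an illustration only: the helper names make_lut and alu4, the choice of a majority function, and the ALU operation set are assumptions made for this example, not features of any particular device. The LUT is configured by its truth table and produces one bit; the ALU operates on whole 4-bit words.

```python
def make_lut(truth_table):
    """Build a k-input LUT from a list of 2**k output bits.

    Input bits are packed most-significant first to form the table index.
    """
    k = (len(truth_table) - 1).bit_length()
    if len(truth_table) != 2 ** k:
        raise ValueError("truth table length must be a power of two")

    def lut(*bits):
        if len(bits) != k:
            raise ValueError(f"expected {k} input bits")
        index = 0
        for b in bits:
            index = (index << 1) | (b & 1)
        return truth_table[index]

    return lut


def alu4(op, a, b):
    """A 4-bit ALU: works on whole 4-bit words, results wrap modulo 16."""
    results = {
        "add": a + b,
        "sub": a - b,
        "and": a & b,
        "or": a | b,
        "xor": a ^ b,
    }
    return results[op] & 0xF


# Fine-grained: a 3-input LUT configured as a majority function.
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
bit_result = majority(1, 0, 1)

# Coarse-grained: one ALU operation on two 4-bit words.
word_result = alu4("add", 9, 8)
```

Changing the eight table entries turns the same LUT into any other 3-input Boolean function, which is what makes the fine-grained element suited to bit-level manipulation; the ALU offers fewer functions but processes a full word per operation.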
 
 
2.3.2. Depth of Programmability 
 
In terms of depth of programmability, a reconfigurable system may have a single context or 
multiple contexts. In single-context systems, only one configuration program (context) may be 
resident in the system. In this case, the system's functionality is limited to the context currently 
loaded. In contrast, in multiple-context systems, several contexts can be resident in the 
system at once. This allows execution of different tasks simply by changing the operating 
context. 
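The difference between the two depths can be sketched in software. In the following minimal model, the class ContextMemory, its methods, and the eviction rule are hypothetical choices made for illustration; a context is represented simply by a Python callable. A capacity of one models a single-context system, while a larger capacity models a multiple-context system.

```python
class ContextMemory:
    """Holds configuration contexts; capacity 1 models a single-context system."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.contexts = {}  # name -> configuration (here, a Python callable)
        self.active = None  # name of the context currently in operation

    def load(self, name, config):
        """Load a context; when full, evict an inactive one if possible."""
        if name not in self.contexts and len(self.contexts) >= self.capacity:
            inactive = [n for n in self.contexts if n != self.active]
            victim = (inactive or list(self.contexts))[0]
            del self.contexts[victim]
            if victim == self.active:
                self.active = None
        self.contexts[name] = config

    def switch(self, name):
        """Make a resident context the operating context."""
        if name not in self.contexts:
            raise KeyError(f"context {name!r} is not resident")
        self.active = name

    def run(self, *operands):
        """Execute the operating context on the given operands."""
        if self.active is None:
            raise RuntimeError("no operating context selected")
        return self.contexts[self.active](*operands)


# Multiple-context system: both contexts stay resident, switching is cheap.
multi = ContextMemory(capacity=2)
multi.load("add", lambda a, b: a + b)
multi.load("mul", lambda a, b: a * b)
multi.switch("add")
total = multi.run(3, 4)
multi.switch("mul")
product = multi.run(3, 4)

# Single-context system: loading "mul" displaces "add".
single = ContextMemory(capacity=1)
single.load("add", lambda a, b: a + b)
single.load("mul", lambda a, b: a * b)
```

In the single-context case, returning to the earlier task requires a full reload of its context, whereas the multiple-context model only changes which resident context is in operation.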
2.3.3. Reconfigurability 
 
Reconfigurability pertains to the ability of the system to overlap execution with the loading 
of a new context. In statically reconfigurable systems, reconfiguration of the programmable 
hardware can occur only if the current execution is interrupted or when it finishes. On the other 
hand, in dynamically reconfigurable systems, reconfiguration can be done concurrently with 
execution. The interface coupling of a reconfigurable system refers to the level of integration 
of the core processor and the reconfigurable hardware. 
Frequently, the areas of a program that can be accelerated through the use of reconfigurable 
hardware are too numerous or complex to be loaded simultaneously onto the available 
hardware. For these cases, it is helpful to use dynamically reconfigurable systems to swap 
different configurations in and out of the reconfigurable hardware as they are needed during 
program execution. Accordingly, run-time reconfigurability is more likely to lead to an overall 
improvement in performance. 
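The benefit of overlapping context loading with execution can be estimated with a simple timing model. The sketch below is an idealization under stated assumptions: each task is a hypothetical (load time, run time) pair in arbitrary time units, the dynamic system can load exactly one upcoming context while the current one executes, and loading never slows execution down.

```python
def static_time(tasks):
    """Total time when every context load must wait for execution to stop."""
    return sum(load + run for load, run in tasks)


def dynamic_time(tasks):
    """Total time when the next context loads while the current one runs."""
    if not tasks:
        return 0
    total = tasks[0][0]  # the first context cannot be hidden behind execution
    for i, (_, run) in enumerate(tasks):
        next_load = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        # The step lasts until both the run and the background load finish.
        total += max(run, next_load)
    return total


# Hypothetical workload: (load time, run time) for three successive contexts.
tasks = [(4, 10), (4, 6), (8, 5)]
t_static = static_time(tasks)    # 14 + 10 + 13 = 37
t_dynamic = dynamic_time(tasks)  # 4 + 10 + 8 + 5 = 27
```

In this example only the load of the third context is not fully hidden, because it takes longer than the execution of the second; the more execution time dominates load time, the closer the dynamic total comes to pure execution time.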
2.3.4. Interface Coupling 
 
An RC-system is tightly coupled if the core processor and the programmable component reside 
on the same chip. The system is loosely coupled if the core processor and programmable logic are 
implemented as separate devices. 
In loosely coupled systems, an attached reconfigurable processing unit behaves as if it were 
an additional processor in a multiprocessor system. The host processor's data cache is invisible 
to the attached reconfigurable processing unit. Thus, a higher delay exists in communication 
between the host processor and the reconfigurable hardware, for example when communicating 
configuration information, input data, and results. However, this type of reconfigurable 
hardware does allow for a great deal of computation independence, by shifting large chunks of 
a computation over to the reconfigurable hardware. The most loosely coupled form of 
reconfigurable hardware is that of an external standalone processing unit. This type of 
reconfigurable hardware communicates infrequently with a host processor. This model is 
similar to that of networked workstations, where processing may occur for very long periods 
of time without a great deal of communication. 
Each of the addressed styles has distinct benefits and drawbacks. The tighter the integration 
of the reconfigurable hardware, the more frequently it can be used within an application or set 
of applications, due to a lower communication overhead. The more loosely coupled styles allow 
for greater parallelism in program execution but suffer from higher communication overhead. 
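The trade-off can be made tangible with a back-of-the-envelope model. The sketch below is an assumption, not a result from the cited systems: the function overall_speedup and all numeric values are hypothetical, a fixed fraction of the run time is accelerated by a constant factor, and every invocation of the reconfigurable hardware pays a fixed communication cost.

```python
def overall_speedup(total_time, fraction, accel, calls, comm_per_call):
    """Speedup of a program when part of it moves to reconfigurable hardware.

    total_time    : run time of the all-software version
    fraction      : share of total_time that is moved to the hardware
    accel         : factor by which the moved part runs faster
    calls         : number of separate invocations of the hardware
    comm_per_call : host/hardware communication cost paid per invocation
    """
    remaining = (1 - fraction) * total_time
    accelerated = fraction * total_time / accel
    overhead = calls * comm_per_call
    return total_time / (remaining + accelerated + overhead)


# Hypothetical numbers: 80% of a 100-unit program is accelerated 10 times.
tight = overall_speedup(100, 0.8, 10, calls=1000, comm_per_call=0.001)
loose_fine = overall_speedup(100, 0.8, 10, calls=1000, comm_per_call=0.1)
loose_coarse = overall_speedup(100, 0.8, 10, calls=10, comm_per_call=0.1)
```

With these numbers, frequent fine-grained calls pay off only when communication is cheap, as in a tightly coupled system; with expensive communication the same call pattern makes the program slower than software alone, while shifting the work over in a few large chunks restores the gain.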
2.4. Reconfigurable Systems 
There has been considerable research effort to develop a variety of RC-systems. Research 
prototypes with fine-grain granularity include Splash [3], DECPeRLe-1 [4], DPGA [5] and 
Garp [6]. Array processors with coarse-grain granularity, such as rDPA [10], MATRIX [11], and 
REMARC [12], form another class of reconfigurable systems. Other systems with coarse-grain 
granularity include MorphoSys [25-27], RaPiD [7], and RAW [9]. Reconfigurable systems 
with a core control processor and FPGAs as the reconfigurable part include DISC [13], Spyder [14], 
and PRISM [15]. Other systems include PipeRench [24]. 
The Splash and DECPeRLe-1 computers were among the first research efforts in 
reconfigurable computing. Splash, a linear array of processing elements with limited routing 
resources, is useful mostly for linear systolic applications. DECPeRLe-1 is organized as a two-
dimensional array of 16 FPGAs with more extensive routing. Both systems are fine-grained, 
with remote interface, single configuration and static reconfigurability. 
The reconfigurable data-path architecture (rDPA) consists of a regular array of 
identical data-path units (DPUs). Each DPU consists of an ALU, a micro-programmable control 
and four registers. The rDPA array is dynamically reconfigurable and scalable. The ALUs are 
intended for parallel and pipelined implementation of complete expressions and statement 
sequences. The configuration is done through mapping of statements in high-level languages 
to rDPA using DPSS (Data Path Synthesis System). 
MATRIX is an array of 8-bit basic units (BUs) interconnected through a hierarchy of three 
levels. MATRIX aims to unify resources for instruction storage and computation: a basic unit 
can serve either as a memory or as a computation unit. Each BU has a 256-word 
memory, an ALU-multiply unit and reduction control logic. The array can deliver up to 
10 GOPS (Giga-operations/s) with 100 BUs 
when operating at 100 MHz. 
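The quoted peak figure is consistent with simple arithmetic if each basic unit is assumed to complete one operation per clock cycle; that assumption, and the helper name peak_gops, are introduced here for illustration only.

```python
def peak_gops(units, clock_hz, ops_per_unit_per_cycle=1):
    """Peak throughput in giga-operations per second.

    Assumes every unit completes `ops_per_unit_per_cycle` operations in
    every clock cycle, which is an idealized upper bound.
    """
    return units * clock_hz * ops_per_unit_per_cycle / 1e9


# 100 basic units at 100 MHz, one operation per unit per cycle (assumed).
matrix_peak = peak_gops(units=100, clock_hz=100e6)
```

The same expression gives a quick upper bound for any array of identical units, although sustained performance depends on how well an application keeps all units busy.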
REMARC consists of a reconfigurable coprocessor, which has a global control unit for 64 
programmable blocks (nano-processors). Each 16-bit nano-processor has a 32-entry instruction 
RAM, a 16-bit ALU, a 16-entry data RAM, an instruction register, and several registers for program 
data, input data and output data. The interconnection is two-level (2D mesh and global buses 
across rows and columns). The global control unit (a 1024-instruction RAM with data and control 
registers) controls the execution of the nano-processors and transfers data between the main 
processor and the nano-processors. This system performs well for multimedia applications, such 
as MPEG encoding and decoding (though it is not specified whether it satisfies real-time 
constraints). 
RaPiD is a linear array (8 to 32 cells) of functional units, configured to form a linear 
computation pipeline. Each array cell has an integer multiplier, three ALUs, registers and local 
memory. Segmented buses are used for efficient utilization of interconnection resources. It 
achieves performance close to its peak of 1.6 GOPS for applications such as FIR filters or motion 
estimation. 
The Reconfigurable Architecture Workstation (RAW) is a set of replicated tiles, where each 
tile contains a simple RISC processor, some bit-level reconfigurable logic and some memory 
for instructions and data. Each RAW tile has an associated programmable switch, which 
connects the tiles in a wide-channel point-to-point interconnect. When tested on benchmarks 
ranging from encryption and sorting to FFT and matrix operations, it provided gains of up to 100 
times compared to a Sun SparcStation 20. 
The Dynamically Programmable Gate Array (DPGA) is a fine-grained prototype system 
that uses traditional 4-input lookup tables as the basic processing elements. DPGA supports rapid 
run-time reconfiguration. 
Garp is a loosely coupled system with fine granularity. Garp has rows of blocks, which 
are like the CLBs of the Xilinx 4000 FPGA series. The Garp architecture has more than 24 columns 
of blocks, whilst the number of rows is implementation dependent. The blocks operate on 2-bit 
data. There are vertical and horizontal block-to-block wires for data movement within the array. 
Separate memory buses move information (data as well as configuration) in and out of the 
array. Speedups ranging from 2 to 24 times are obtained for applications such as encryption, 
image dithering and data sorting. 
MorphoSys (Morphing System) features a novel architecture for reconfigurable computing 
systems. MorphoSys is primarily targeted at applications with inherent parallelism, high 
regularity, word-level granularity, and a computation-intensive nature. Some examples of 
such applications are video compression, image processing, graphics acceleration, and security. 
The first Processor Reconfiguration through Instruction-Set Metamorphosis prototype (PRISM-I) 
consists of a board with four Xilinx 3090’s plugged into a host system based around a Motorola 
68010 processor. A compiler for a C-like language was created for PRISM-I to automatically translate 
subroutines to be mapped onto the reconfigurable hardware. Compiled programs run partly on 
the host processor and partly on the attached FPGAs. The second prototype, PRISM-II, brought 
the host processor and FPGAs closer together, attaching an AMD Am29050 directly to three 
Xilinx 4010’s. 
Spyder extended a custom processor with three Xilinx 4010 FPGAs acting as 
reconfigurable execution units. With Spyder, the programmer is responsible for dividing a 
program between the main processor and the reconfigurable units, and for programming each in a 
special subset of the C++ programming language. 
DISC, the Dynamic Instruction Set Computer, implements the main processor and the 
reconfigurable component together within the FPGA parts. The first DISC was made with two 
National Semiconductor CLAy31 FPGAs. A primitive main processor is implemented on a 
part of one CLAy31, with the majority of the same chip supplying the prototype reconfigurable 
component. The second CLAy31 serves only to control the loading of configurations on the 
first one. With DISC-2, the main processor is moved onto a separate third CLAy31. 
PipeRench is a reconfigurable fabric, that is, an interconnected network of configurable logic and 
storage elements. By virtualizing the hardware, PipeRench overcomes the disadvantages of 
using FPGAs as reconfigurable computing fabrics. Unlike FPGAs, PipeRench is designed to 
efficiently handle computations. Using a technique called pipeline reconfiguration, PipeRench 
improves compilation time, reconfiguration time, and forward compatibility. 
On the industrial level, many RC-systems are currently being produced by different 
companies like Xilinx [1], Celoxica [28], Elixent [29], Altera [2], Lucent [21], Actel [30], 
NallaTech [31], Chameleon Systems [32], MorphoTech [33], and Intel [34]. 
2.5. Application of Reconfigurable Systems 
Reconfigurable computers have been applied in many areas, such as information 
coding, digital signal processing, digital image processing, space and solar applications, 
biomedical engineering, networking, and computer and communications security. 
2.5.1. Information Coding 
 
A reconfigurable hardware implementation of the Viterbi algorithm is presented in [35]. 
The investigation includes the implementation of a reduced-complexity adaptive Viterbi 
algorithm (AVA). Run-time dynamic reconfiguration is used in response to changing channel 
noise conditions to achieve improved decoder performance. Implementation parameters for the 
decoder have been determined through simulation. The decoder has been implemented on a 
Xilinx XC4036-based PCI board. An overall decode performance improvement of 7.5 times for 
AVA has been achieved versus an implementation of the algorithm on a Celeron-processor-based system. 
The use of dynamic reconfiguration leads to a 20 percent performance improvement over a 
static implementation with no loss of decode accuracy. 
2.5.2. Space and Solar Applications 
 
For solar applications, the research in [36] describes how the advent of FPGAs has 
allowed the replacement of a huge number of DSP chips (30,000) by a smaller number (231) of 
custom chips in the Nobeyama antenna array for monitoring solar flares. 
In [37] N-body methods are used to simulate the evolution and interaction of galaxies using 
reconfigurable computing. These simulations are usually run on large-scale supercomputers or 
on very expensive full-custom reconfigurable hardware. 
2.5.3. Digital Signal Processing 
 
Considerable research has been done on digital signal processing (DSP), including RC hardware 
implementations of elliptic IIR filters, the discrete Fourier transform (DFT), FIR filters, and 
adaptive digital filters. 
The use of FPGAs to implement a fourth-order band-pass elliptic IIR filter is investigated 
in [38]. The key idea is to replace multipliers with Distributed Arithmetic (DA), in order to 
simplify the process involved. In DA, numbers are represented in two's complement 
format, so that the multiplier can be implemented using adders and accumulators. 
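The principle of DA can be sketched for a sum of products with fixed coefficients. The code below is a generic illustration of the technique and not the filter design of [38]: the function names, the 8-bit word length, and the example values are assumptions. For each bit position, the corresponding bits of all inputs form an address into a precomputed table of coefficient sums; the table outputs are shifted and accumulated, and the sign bit of the two's complement words contributes with negative weight.

```python
def build_da_table(coeffs):
    """Precompute, for every input-bit pattern, the sum of selected coefficients."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(coeffs, samples, bits=8):
    """Compute sum(c * x) using only table look-ups, shifts and additions.

    `samples` must fit in `bits`-bit two's complement.
    """
    if len(coeffs) != len(samples):
        raise ValueError("coeffs and samples must have the same length")
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if any(not low <= x <= high for x in samples):
        raise ValueError("sample out of range for the chosen word length")

    table = build_da_table(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in samples]  # two's complement bit patterns

    acc = 0
    for b in range(bits):
        # Gather bit b of every input word into one table address.
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k
        partial = table[addr] << b
        # The most significant (sign) bit carries negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc


# Hypothetical 4-tap example with signed 8-bit inputs.
coeffs = [3, -1, 4, 2]
samples = [5, -7, 100, -128]
result = da_dot(coeffs, samples)
```

The table has 2^n entries for n coefficients, so the approach trades memory, which is plentiful in LUT-based FPGAs, for the multipliers it removes.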
In [39], the investigation describes a new approach to computing the Discrete Fourier 
Transform (DFT) that significantly increases the power of DSP chips. The DFT is reduced to 
additions performed in parallel register arrays ideal for implementation in FPGAs. The 
algorithm is highly flexible and 10 to 100 times faster than standard approaches due to its 
elimination of add, multiply and accumulate operations. 
FIR filter implementations have been carried out by different research groups [40 - 43] and 
industrial firms (e.g. Xilinx). Some of that work was done at the University of Kansas on 
synthesizing an efficient reconfigurable FIR filter architecture using FPGAs. The FPGA 
implementation suggests that 60-70 tap chips with sampling rates exceeding 100 MHz should 
be feasible [42]. Another automatic implementation of FIR filters on FPGAs is given in [43]. 
In [44] the application of reconfigurable computing to exploit the inherent serial features 
in some DSP algorithms is investigated with the aim of building cost-effective, real-time 
hardware. The addressed DSP algorithms are decomposed so that the most computationally-
intensive and data-flow-oriented part is separated from the less intensive and more control-
flow-oriented part. The different modules in the intensive part are expected to be serialized and 
implemented on a reconfigurable platform. The proposed architecture consists of an array of a 
minimum of two Field Programmable Gate Arrays (FPGAs). The FPGAs are grouped in two 
sets such that when one set executes the current batch of modules the second set could be 
configured to execute the next batch of modules. The control-flow-oriented part of the algorithm 
is implemented on a DSP processor. 
Research has also been carried out on the creation of adaptive digital filters implemented in 
FPGAs, with the attraction that the coefficients and the filter structure can be changed through the 
reconfiguration of the FPGA [45]. 
2.5.4. Digital Image Processing 
 
Accelerating desktop publishing (DTP) with reconfigurable computing engines is discussed 
in [46]. This research deals with how a reconfigurable computing engine can be used to 
accelerate DTP functions, and with how PostScript rendering can be accelerated using 
commercially available FPGA co-processor cards. In the case of the FPGA PostScript project, 
a Xilinx XC6200 FPGA is used to accelerate the computationally intensive areas of PostScript 
rendering. Efforts were also made to develop plug-ins for Adobe Photoshop that use 
the same board to accelerate image processing operations like color space conversion and image 
convolution. 
In [47], the authors describe the use of the SPLASH-2 custom computing platform for 
real-time median and morphological filtering of images. SPLASH-2 is an FPGA-based attached 
processor that can be reconfigured to perform a wide variety of tasks. Although not specifically 
designed for image processing, the architecture is well suited to the repetitive computations 
and high data transfer rates that characterize most low-level image processing problems. 
Median filtering is a particularly good benchmark, since nonlinear rank ordering must be 
performed for 2D neighborhoods at every pixel location in an image. General-purpose 
workstations are inefficient at such tasks, whereas SPLASH-2 is configured to perform this at 
a rate of 30 images per second. This research presents the hardware/software co-design process 
that has been used to implement this operation, which can be pipelined with other operations 
by using additional SPLASH-2 processor boards. 
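As a software point of reference for the rank-ordering workload described above, the following minimal Python sketch applies a 3 x 3 median filter to a small grayscale image. It is not the SPLASH-2 design from [47]; the function name median_filter_3x3 and the decision to leave border pixels unchanged are assumptions made purely for illustration.

```python
def median_filter_3x3(image):
    """Median-filter a grayscale image given as a list of equal-length rows.

    Border pixels are copied unchanged (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Gather the nine pixels of the 2D neighbourhood around (r, c).
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            # Rank ordering: the median is the 5th smallest of 9 values.
            out[r][c] = sorted(window)[4]
    return out


noisy = [[10, 10, 10],
         [10, 255, 10],
         [10, 10, 10]]
print(median_filter_3x3(noisy)[1][1])  # prints 10: the spike is removed
```

Every interior pixel needs its own nine-element sort, which is the per-pixel cost that makes the operation a natural candidate for a pipelined hardware implementation.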
A prototype end-to-end real-time video codec is described in [45]. The presented system is 
implemented on a CLAy31 (Configurable Logic Array) by National Semiconductor. One of 
the steps decomposes the image using a discrete wavelet transform (DWT), whose filters 
consume many resources of the FPGA. The rest of the chip is used by the associated addressing 
and control logic and frame grabber interface. 
Another example of partial reconfiguration in video processing applications is presented in 
[48]. The hardware platform used in this case was the Field Configurable Multi-Chip Module 
(FCM) by National Semiconductor. 
 
2.5.5. Biomedical Engineering 
 
Reconfigurable computing is also used in the field of biomedical engineering. A high-speed, 
high-performance computer architecture aimed at computationally intensive applications of 
biomedical volume (voxel) data processing is proposed in [49]. Algorithms used on volumetric 
data include visualization and digital radiograph reconstruction by ray-casting, oncological 
dose calculation, 3D FFT, and 3D-convolution processing. Voxel addresses for algorithms that 
perform voxel data manipulation are generated using a special address generator FPGA. 
2.5.6. Networking 
 
Some research on networking applications of reconfigurable computing examines 
the possibility of introducing the MorphoSys reconfigurable system into IP routing. 
IP routing algorithms are mapped onto this system and its performance is evaluated in order to 
measure its efficiency relative to existing IP routing implementations, mainly 
multi-homed computers and dedicated physical routers. The results were obtained by running 
the emulator program mULATE to implement and evaluate the performance of MorphoSys. 
The results indicated that MorphoSys performed much better than both multi-homed 
computers and dedicated physical routers, and can therefore be integrated as an active 
element in IP routing. 
The research in [50] introduces a suite of tools called NCHARGE (Networked Configurable 
Hardware Administrator for Reconfiguration and Governing via End-systems). The system has 
been developed to simplify the co-design of hardware and software components that process 
packets within a network of FPGAs. A key feature of NCHARGE is that it provides a 
high-performance packet interface to hardware and a standard Application Programming Interface 
(API) between software and reprogrammable hardware modules. Using this API, multiple 
software processes can communicate with one or more hardware modules using standard TCP/IP 
sockets. NCHARGE also provides a web-based user interface to simplify the configuration and 
control of an entire network switch that contains several software and hardware modules. 
2.5.7. Security 
 
In recent years, we have witnessed a rapid increase in the number of individuals and 
organizations using advanced data communications and computer networks for personal and 
professional activities. Among the variety of new uses of data communications, there are 
several applications which are highly sensitive to data security. Examples are commercial 
exchange on the Internet, computer networks, wireless communications, and military systems. 
Reconfigurable devices such as FPGAs are a highly attractive option for hardware security, 
mainly for cryptographic algorithms. They provide the flexibility of a dynamic system as well 
as the ability to easily implement a wide range of algorithms [51]. Potential advantages of 
encryption algorithms implemented in FPGAs include the following: 
Algorithm Agility: This term refers to changing the cryptographic algorithm in use 
during operation. For example, during a session in one of the modern security 
protocols like SSL, the cipher could be switched among 3DES, Blowfish, IDEA, or any other 
algorithm. As they are reprogrammable, FPGAs seem to be a cheap alternative to traditional 
hardware in such a continuously evolving area [52]. 
Algorithm Upload: FPGAs can be upgraded with a new cipher that did 
not exist (or was not standardized) at design time. 
Algorithm Modification: A standardized security algorithm can be modified, for instance, 
by using different S-boxes or permutations, or even by changing the mode of operation. 
Such modifications are easily made with reconfigurable hardware. 
Architecture Efficiency: In certain cases, a hardware architecture can perform better if it is 
designed for a specific set of parameters. For example, with fixed keys the main operation in 
the IDEA cipher [53] degenerates into a constant multiplication, which is far more efficient than 
a general multiplication. Using FPGAs makes it possible to switch to a much more efficient 
implementation in such cases. 
Throughput: Although typically slower than ASIC implementations, FPGA 
implementations have the potential of running substantially faster than software 
implementations. 
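The fixed-key IDEA case mentioned under Architecture Efficiency can be illustrated in software. The sketch below implements IDEA's multiplication modulo 2^16 + 1 (with the word 0 standing for 2^16) and then a version specialised for one key word. The helper names idea_mul and specialise_for_key are hypothetical; a Python closure only mimics the idea, whereas in an FPGA the specialisation would replace a general multiplier with smaller constant-multiplier logic.

```python
MODULUS = 0x10001  # 2**16 + 1, the modulus of IDEA's multiplication


def idea_mul(a, b):
    """Multiply two 16-bit words modulo 2**16 + 1; the word 0 encodes 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    # A result of 2**16 is encoded back as the 16-bit word 0.
    return ((a * b) % MODULUS) & 0xFFFF


def specialise_for_key(key_word):
    """Return a one-operand multiplier with the key word fixed in advance."""
    k = key_word or 0x10000

    def mul_by_key(x):
        return (((x or 0x10000) * k) % MODULUS) & 0xFFFF

    return mul_by_key


mul_by_3 = specialise_for_key(3)
print(idea_mul(2, 3), mul_by_3(2), idea_mul(0, 0))  # prints 6 6 1
```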
Reconfigurable hardware security has been the aim of many research investigations. Ploog 
et al. studied in [54] how modern smartcards can perform high-security operations. The 
authors emulated the main smartcard algorithms in FPGA hardware and quantified the impact 
of the main ASIC design parameters on overall speed and silicon area. 
Kim et al. presented in [55] the use of FPGAs for fully pipelined, 56-bit DES encryption 
(decryption) and authentication at memory-bus bandwidths. Another DES implementation was 
presented by Tom Kean and Ann Duncan from Xilinx in [56]. Prototype designs were realized 
and tested on the XC6200DS PCI Development System. 
The IDEA cipher was addressed in [53] by Davor and Mario, who presented an FPGA core 
implementation. Gao et al. introduced a compact, fast elliptic curve crypto coprocessor with 
variable key size, which makes heavy use of the internal SRAM and registers in a Xilinx FPGA [57]. 
In [58] the authors discuss a reconfigurable PAM implementation of RSA. The PAM 
implementation of RSA used many advanced techniques to reduce the execution time of 
the algorithm; for example, Chinese remainders, star chains, Hensel's odd division, carry-save 
representation, quotient pipelining and asynchronous carry completion adders. An FPGA-based 
implementation of DEA is reported in [59] with a complete performance study and evaluation. 
Kim et al. presented various architectures (low hardware complexity and high-performance 
versions) of the KASUMI 3GPP block cipher using a Xilinx FPGA [60]. 
Adam et al. studied in [51] the hardware implementation of the potential AES candidates 
within commercially available FPGAs. Multiple architectural implementation options were 
explored for each algorithm. 
2.6. High-Level Reconfigurable Hardware Development 
The need for higher-level solutions has always been present when using any computer 
architecture. One example of a high-level solution was the historic leap from punched cards to 
assembly language, made during the early advancement of computers. Another example 
is the move up to high-level programming languages, supported by many design 
methodologies and paradigms. A further step up was made with languages 
capable of automatic code generation. 
Verilog and VHDL (Very High Speed Integrated Circuit Hardware Description Language) 
[61] are by far the most commonly used hardware description languages (HDLs) in industry. 
Both HDLs support different styles for describing hardware, for example, 
behavioral style, structural gate-level style, etc. VHDL became IEEE Standard 1076 in 1987. 
Verilog became IEEE Standard 1364 in December 1995. The Verilog language uses the 
module construct to declare logic blocks (with several inputs and outputs). In VHDL, each 
structural block consists of an interface description and an architecture. VHDL enables behavioral 
descriptions in dataflow and algorithmic styles. 
Focused efforts to create tools with higher levels of abstraction led to the production 
of many powerful modern hardware design tools. Ian Page and Wayne Luk developed a 
compiler that transformed a subset of Occam into a netlist [62]. Nearly ten years later came 
the development of Handel-C, the first commercially available high-level language for 
targeting programmable logic devices. Handel-C is a parallel programming language based on 
the theories of communicating sequential processes (CSP) and Occam with a C-like syntax 
familiar to most programmers. This language is used for describing computations which are to 
be compiled into hardware [28]. 
Building on the work carried out in Oxford’s Hardware Compilation Group by Page and 
Luk, Saul at Oxford’s Programming Research Group introduced a different co-design compiler, 
Dash FPGA-Based Systems [63]. This compiler provides a co-synthesis and co-simulation 
environment for mixed FPGA and processor architectures. It compiles a C-like description to 
a solution containing both processors and custom hardware. 
Luk and McKeever in [64] introduced Pebble, a simple language designed to improve the 
productivity and effectiveness of hardware design. This language improves productivity by 
adopting reusable word-level and bit-level descriptions which can be customized by different 
parameter values, such as design size and the number of pipeline stages. Such descriptions can 
be compiled without flattening into various VHDL dialects. Pebble improves design 
effectiveness by supporting optional constraint descriptions, such as placement attributes, at 
various levels of abstraction; it also supports runtime reconfigurable designs. 
Todman and Luk in [65] proposed a method that combines declarative and imperative 
hardware descriptions. They investigated the use of the Cobble language, which allows 
abstractions to be expressed in an imperative setting. Designs written in Cobble benefit from 
efficient bit-level implementations developed in Pebble. Transformations are suggested to 
allow the declarative Pebble blocks to be used in Cobble's imperative programs. 
Weinhardt in [66] proposes a high-level language programming approach for 
reconfigurable computers. The approach automatically partitions the design between hardware 
and software and synthesizes pipelined circuits from parallel for loops. 
W. Najjar et al in [67] presented a high-level, algorithmic language and optimizing 
compiler for the development of image processing applications on RC-systems. SA-C, a single 
assignment variant of the C programming language, was designed for this purpose. 
A prototype HDL called Lava was developed by Satnam Singh at Xilinx and Mary Sheeran 
and Koen Claessen at Chalmers University in Sweden [68]. Lava allows circuit tiles to be 
composed using powerful higher-order combinators. This language is embedded in the Haskell 
lazy functional programming language. The Xilinx implementation of Lava is designed to support 
the rapid representation, implementation and analysis of high-performance FPGA circuits. 
Besides the above advances in the area of high-level hardware synthesis, the current market 
has other tools employed to aid programmable hardware implementations. These tools include 
the Forge compiler from Xilinx, the SystemC language, the Nimble compiler for the Agileware 
architecture from Nimbel Technology, and Superlog. 
Other well-known hardware design tools include Altera’s Quartus, Xilinx ISE, Mentor Graphics 
HDL Designer, Leonardo Spectrum, Precision Synthesis, and ModelSim. 
2.7. Future of RC-Systems at a Glance 
Finally, after this brief overview of the reconfigurable computing world, we would like to 
stress the expectations attached to this field. Wim Roelandts, the president and CEO of Xilinx, 
said: “I expect that within the first ten years of the next millennium we will see programmable 
logic devices inside every piece of electronic equipment, because hardware will become just as 
programmable as software” [69]. A different point is made by Mangione-Smith et al. in [70-73], 
setting expectations for the near future of reconfigurable computing: “Mainstream 
microprocessor vendors will not adopt FPGA blocks into their products, and barring some 
dramatic discovery programmable logic will not be widely available on the main datapath of 
high performance processors. However, it does appear likely that FPGA blocks will soon be 
incorporated in embedded processor devices.” 
Development trends in the field can be summarized as the expectation of ever 
denser FPGAs at very low cost. Internet reconfigurable logic is of potential interest for 
designing hardware that can be reconfigured remotely. In addition, efforts could be invested in 
creating high-level, fully integrated development tools for both large and small designs. All in 
all, in the coming years, bigger, faster, and cheaper reusable programmable devices are likely 
to appear. 
3. The MorphoSys 
One of the emerging RC systems is MorphoSys, designed and implemented at the 
University of California, Irvine. Its block diagram is shown in Figure 1. It is composed of 
the following components: 
 
• An array of reconfigurable cells (ReC) called the ReC array. 
• ReC array configuration data memory called context memory. 
• A control processor (TinyRISC). 
• A data buffer called the frame buffer. 
• A DMA controller. 
 
Figure 1. MorphoSys Block Diagram. 
3.1. The Core Processor 
The core processor known as TinyRISC is a MIPS-like processor with a stage scalar pipeline. 
The core processor has sixteen-bit registers, and three functional units; a bit ALU, a bit shift 
unit, and a memory unit. An on-chip data cache memory minimizes the accesses to external 
main memory. In addition to typical RISC instructions, TinyRISC is augmented with specific 
instructions for controlling other MorphoSys components. The special instructions fall into two 
categories: DMA instructions and ReC array instructions. DMA instructions initiate data 
transfers between main memory and the Frame Buffer, and context loading from main memory 
into the Context Memory. ReC array instructions control the operation of the reconfigurable 
component, the ReC array, by specifying the context and the broadcast mode. 
In more detail, the DMA instructions specify the direction (load/store), the memory address, 
the number of bytes to be transferred, and the destination storage address (frame buffer or 
context memory). The ReC array instructions specify the context for execution, the frame 
buffer address, and the broadcast mode (row, column, broadcast, or selective). 
3.2. The Reconfigurable Cell 
The ReC is the basic programmable element in MorphoSys. Each ReC comprises five 
components; the ALU-Multiplier, the shift unit, the input multiplexers, a register file with four 
bit registers, and the context register. 
 
3.3. The Reconfigurable Cell Array 
The reconfigurable part of MorphoSys is the ReC array, which contains 64 ReCs arranged 
as an 8 x 8 matrix (see Figure 2). An important feature of the ReC array is its three-layer 
interconnection network, which provides a two-dimensional mesh topology, complete row and 
column connectivity within a quadrant, and inter-quadrant connectivity. 
 
 
Figure 2.  ReC Array Interconnection. 
The ReC array operates in one of two modes: column context broadcast mode, in which all 
the cells of a column perform the same operation, or row context broadcast mode, in which all 
the cells of a row perform the same operation. Any application that is mapped onto MorphoSys 
and is expected to benefit from the reconfigurable core has to use the functions provided by the 
ReCs and the specific interconnections that increase its speed of execution. Each ReC has two 
ports, A and B, through which it 
has access to: 1) the output of other cells, 2) the operand data bus, and 3) the internal register 
file of the ReC. 
3.4. The Context Memory 
The Context Memory stores the configuration program (the context) for the ReC array. The Context 
Memory is logically organized into two context blocks, each containing eight context sets. 
Each context set has sixteen context words. The major focus of the ReC array is on data-parallel 
applications, which exhibit a definite regularity. Following this principle of regularity and 
parallelism, the context is broadcast on a row/column basis. The context words from one context 
memory block are broadcast along the rows, while context words from the other block are 
broadcast along the columns. Each context set is 
associated with a specific row/column of the ReC array. The context word from a context set is 
broadcast to all eight ReCs in the corresponding row/column. Thus all ReCs in a row/column 
share a context word and perform the same operations. Recall that a context word is stored in 
the context register within each ReC. 
A context plane is formed by the corresponding context words within each context set 
across the Context Memory. As there are sixteen context words in a context set, up to sixteen 
context planes can be simultaneously resident in each of the two blocks of Context Memory. 
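The organisation described in this section (two blocks, eight sets per block, sixteen words per set) can be modelled in a few lines of Python. This is a tracing sketch only: context words are placeholder strings, and the block labels "column" and "row", together with the function names, are assumptions introduced here rather than MorphoSys terminology.

```python
ROWS = COLS = 8
SETS_PER_BLOCK = 8      # one context set per row or column
WORDS_PER_SET = 16      # hence 16 context planes per block
BLOCKS = ("column", "row")


def make_context_memory():
    """Fill both blocks with placeholder context words (strings, for tracing)."""
    return {
        block: [[f"{block}:set{s}:word{w}" for w in range(WORDS_PER_SET)]
                for s in range(SETS_PER_BLOCK)]
        for block in BLOCKS
    }


def broadcast(memory, block, plane):
    """Return the 8 x 8 grid of context words the ReCs would latch."""
    # A context plane takes word number `plane` from each of the eight sets.
    words = [memory[block][s][plane] for s in range(SETS_PER_BLOCK)]
    if block == "column":
        # Set c drives column c: every cell in a column shares one word.
        return [[words[c] for c in range(COLS)] for _ in range(ROWS)]
    # Set r drives row r: every cell in a row shares one word.
    return [[words[r] for _ in range(COLS)] for r in range(ROWS)]


memory = make_context_memory()
grid = broadcast(memory, "column", plane=5)
print(grid[0][3], grid[7][3])   # the same word for both cells of column 3
print(len(BLOCKS) * WORDS_PER_SET, "context planes in total")
```

Sixteen planes in each of the two blocks gives 32 resident planes in total.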
3.5. The Frame Buffer and the DMA Controller 
The Frame Buffer is an internal data memory logically organized into two sets. Each set is 
further subdivided into two banks. Each bank has 64 rows of 8 bytes (therefore, the entire Frame 
Buffer has 128 x 16 bytes). A 128-bit operand bus is used to transfer data operands from the 
Frame Buffer to the ReC array. This bus is connected to the column elements of the ReC array. 
The cells along a ReC array row share the same 16-bit segment of the operand bus. In this way, 
eight different operands can be loaded into all cells of a ReC array column in just a single cycle. 
Results from the ReC Array are written back to the Frame Buffer through a separate 128-bit 
bus called the result bus. The physical connection of the result bus to the ReC array is similar 
to that of the operand bus, i.e., the bus segments running along the rows. 
The DMA controller performs data transfers between the Frame Buffer and the main 
memory. It is also responsible for loading contexts into the Context Memory. The TinyRISC 
core processor uses DMA instructions to specify the necessary data-context transfer parameters 
for the DMA controller. 
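The sizes quoted in this section can be cross-checked with a little arithmetic. The sketch below derives the row width of a bank from the stated total of 128 x 16 bytes and counts the 16-bit segments on the operand bus. The variable names are illustrative, and the last line (how many bank rows fit in one bus transfer) is an observation about the numbers, not a statement about how the hardware actually sequences its accesses.

```python
SETS = 2
BANKS_PER_SET = 2
ROWS_PER_BANK = 64
TOTAL_BYTES = 128 * 16            # total Frame Buffer size quoted in the text

banks = SETS * BANKS_PER_SET
bytes_per_row = TOTAL_BYTES // (banks * ROWS_PER_BANK)

OPERAND_BUS_BITS = 128
SEGMENT_BITS = 16
segments = OPERAND_BUS_BITS // SEGMENT_BITS          # one per ReC array row
bank_rows_per_transfer = OPERAND_BUS_BITS // (8 * bytes_per_row)

print(bytes_per_row, segments, bank_rows_per_transfer)  # prints 8 8 2
```

Eight bytes per bank row is the only width consistent with four banks of 64 rows and a total of 2048 bytes, and eight 16-bit segments match the eight rows of the ReC array.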
3.6. The MorphoSys Execution Flow Model 
A program runs on MorphoSys so that general-purpose operations are handled by the TinyRISC 
processor, while operations that have a certain degree of parallelism, regularity, or intensive 
computations are mapped to the ReC array. The TinyRISC processor controls, through the DMA 
controller, the loading of the context words to context memory. These context words define the 
function and connectivity of the cells in the ReC array. The processor also initiates the loading 
of application data, such as image frames, from main memory to the frame buffer.  This is also 
done through the DMA controller. Now that both configuration and application data are ready, 
the TinyRISC processor instructs the ReC array to start execution. The ReC array performs the 
needed operation on the application data and writes the results back to the frame buffer. The 
ReC array then loads new application data from the frame buffer and possibly new configuration 
data from context memory. Since the frame buffer is divided into two sets, new application data 
can be loaded into it without interrupting the operation of the ReC array. Configuration data is 
also loaded into context memory without interrupting ReC array operation; this overlap is what 
is expected to let MorphoSys achieve high execution speeds. 
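The alternation between the two frame buffer sets can be sketched as a simple ping-pong schedule. The Python below is a sequential model under stated assumptions: run_double_buffered and its arguments are hypothetical names, `compute` stands in for whatever the ReC array has been configured to do, and the "DMA load" is an ordinary assignment that in hardware would proceed concurrently with the computation.

```python
def run_double_buffered(batches, compute):
    """Process batches while alternating between two frame buffer sets."""
    frame_buffer = [None, None]              # the two frame buffer sets
    if batches:
        frame_buffer[0] = batches[0]         # initial load before execution
    results = []
    for i in range(len(batches)):
        active, idle = i % 2, (i + 1) % 2
        if i + 1 < len(batches):
            # In hardware this load would overlap the computation below.
            frame_buffer[idle] = batches[i + 1]
        # The array works only on the active set.
        results.append([compute(x) for x in frame_buffer[active]])
    return results


print(run_double_buffered([[1, 2], [3, 4], [5]], lambda x: 2 * x))
# prints [[2, 4], [6, 8], [10]]
```

Because the array never reads the set that is being filled, the loading time of batch i + 1 is hidden behind the processing of batch i whenever the load is the shorter of the two.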
 
3.7. Important Features of MorphoSys 
MorphoSys is a coarse-grain multiple-context reconfigurable system with considerable depth 
of programmability (32 contexts) and two different context broadcast modes. It provides a high 
degree of flexibility for application mapping, by offering two levels of reconfigurability: 
Operand-type configurability: this is configurability that allows switching between classes 
of applications with different data types. It affects two aspects of the system: (a) configuration 
of the array multiplier in each ReC as either a signed or unsigned multiplier, and (b) 
configuration of the operand bus (as either interleaved or contiguous) and of the result bus (as 
either 8-bit or 16-bit). 
Functional configurability: this is the short-term, run-time reconfigurability level. It 
controls ReC functionality and ReC Array connectivity on a cycle-to-cycle basis. 
The hierarchical ReC Array interconnection network also contributes to algorithm 
mapping flexibility. Structures like the express lanes enhance global connectivity. Even 
irregular communication patterns that would otherwise require extensive interconnections can be 
handled efficiently. For instance, an eight-point butterfly can be accomplished in only three 
cycles. Finally, bus configurability supports applications with different data sizes and data flow 
patterns. 
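The three-cycle figure for an eight-point butterfly agrees with the number of stages in a radix-2 butterfly network, log2(8) = 3, if one stage completes per cycle. The sketch below lists the partner pairs of each stage; it assumes a standard radix-2 pattern with halving strides and says nothing about how MorphoSys routes the exchanges internally. The function name butterfly_schedule is hypothetical.

```python
def butterfly_schedule(n):
    """Return, per stage, the index pairs exchanged in a radix-2 butterfly."""
    stages = []
    stride = n // 2
    while stride >= 1:
        # Each element pairs with the one whose index differs in one bit.
        pairs = [(i, i ^ stride) for i in range(n) if i < (i ^ stride)]
        stages.append(pairs)
        stride //= 2
    return stages


for number, pairs in enumerate(butterfly_schedule(8)):
    print(number, pairs)
# 0 [(0, 4), (1, 5), (2, 6), (3, 7)]
# 1 [(0, 2), (1, 3), (4, 6), (5, 7)]
# 2 [(0, 1), (2, 3), (4, 5), (6, 7)]
```

The first stage pairs elements four positions apart, which is the kind of long-distance exchange that global links such as the express lanes are meant to serve.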
4. Graphics Geometrical Transformations under MorphoSys 
Graphics hardware accelerators are of major importance to modern high-quality computer 
graphics. Geometrical transformations play a major role in computer graphics. 
Transformations, with their computational complexity, are usually supported in graphics 
accelerators. In this chapter, and for the purpose of hardware acceleration, parallel versions of 
basic and composite 2D transformations are developed and implemented under MorphoSys 
[74]. 
4.1. Geometrical Transformations 
 
Figure 3. Image tracking while applying different 2D transformations. 
Transformations are a fundamental part of computer graphics. Transformations are used to 
position, shape, and change viewing positions of objects, as well as change how they are viewed 
(e.g. the type of projection used). Basic transformations include Translation, Scaling, Rotation, 
and Shear. These basic transformations can also be combined to obtain more complex 
transformations. Figure 3 shows the effects of some 2D transformations on an image. 
A point in 2D space can be represented by its name, abscissa, and ordinate; for instance, 
P(x, y) is a point with coordinates x and y. 2D objects are often represented as a set of points 
(vertices), {P1, P2, ..., Pn}, and an associated set of edges, {e1, e2, ..., em}. An edge is defined as a 
pair of points, e = {Pi, Pj}. We can also represent points in vector/matrix notation as: 

\[ P = \begin{bmatrix} x \\ y \end{bmatrix} \]
4.1.1. Translations 
 
A translation can also be represented by a pair of numbers, t = (tx, ty), where tx is the change in 
the abscissa and ty is the change in the ordinate. To translate the point P by t, we simply add to 
obtain the translated point Q(x’,y’). 

\[ Q = \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \end{bmatrix} \]
 
To express translation as a matrix multiplication, homogeneous coordinates can be 
used [71]. The matrix form of the translation is then: 

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]
 
 
Homogeneous coordinates allow many transformations to be concatenated into a single 
matrix representation. 
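The concatenation property can be checked numerically. The following Python sketch builds 3 x 3 homogeneous translation matrices as nested lists and shows that the product of two translations equals a single translation by the summed offsets. The helper names mat_mul, translation and apply are introduced here for illustration only.

```python
def mat_mul(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(tx, ty):
    """Homogeneous translation matrix for the offset (tx, ty)."""
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def apply(matrix, x, y):
    """Apply a homogeneous transform to the point (x, y)."""
    column = (x, y, 1)
    xp, yp, _ = (sum(matrix[i][k] * column[k] for k in range(3))
                 for i in range(3))
    return xp, yp


combined = mat_mul(translation(2, 3), translation(4, -1))
print(combined == translation(6, 2))   # prints True
print(apply(combined, 1, 1))           # prints (7, 3)
```

Once the combined matrix is formed, every point of the scene needs only one matrix-vector product, no matter how many translations were chained.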
4.1.2. Scaling 
 
Matrices can easily represent scaling transformations. Let the scale matrix be S, where: 

\[ S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \]
 
The scaled point Q(x’,y’) could be determined as follows: 

\[ Q = S \cdot P = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} s_x x \\ s_y y \end{bmatrix} \]
The matrix form for scaling using homogeneous coordinates is as follows: 

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]
 
 
The matrix product is also called compounding, catenation, concatenation or composition. 
Once the transform is formulated, it may be applied to all points in the scene. All this applies 
to other transformations also. Homogeneous coordinates allow many scaling operations to be 
concatenated into a single matrix representation. 
4.1.3. Rotation and Shearing 
 
For rotation by an angle θ counterclockwise about the origin, the functional form is x' = xcosθ 
− ysinθ and y' = xsinθ + ycosθ. Written in matrix form, this becomes as follows: 

\[ Q = R \cdot P = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{bmatrix} \]
 
The matrix form for rotation using homogeneous coordinates is as follows: 

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]
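
As a quick check of the homogeneous rotation matrix, take θ = 90° and the point P = (1, 0); both values are chosen here purely for illustration. With cos 90° = 0 and sin 90° = 1:

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \]

The point moves from the positive x axis to the positive y axis, as expected for a counterclockwise rotation, and the third coordinate stays equal to 1.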


 
 
For shearing, there are two possibilities. A shear parallel to the x axis has x' = x + ky and 
y' = y; the matrix form is: 

\[ Q = H_x \cdot P = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x + ky \\ y \end{bmatrix} \]
 
A shear parallel to the y axis has x' = x and y' = y + kx, which has matrix form: 

\[ Q = H_y \cdot P = \begin{bmatrix} 1 & 0 \\ k & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y + kx \end{bmatrix} \]
 
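The matrix form for shearing using homogeneous coordinates is as follows: 

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & k & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]

and 

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ k & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]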
 
It is clear from the presented formulas that 2D geometrical transformations, basic or 
composite, ultimately reduce to a computationally intensive matrix multiplication problem. 
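To make the reduction to matrix multiplication concrete, the sketch below composes three of the basic matrices from this section into one composite transform: a rotation about an arbitrary pivot, built as translate-to-origin, rotate, translate-back. The pivot construction is a standard textbook composition used here as an example and is not taken from the chapter; all function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def rotation(theta):
    """Counterclockwise rotation about the origin, theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def rotate_about(pivot, theta):
    """Composite matrix: move pivot to origin, rotate, move back."""
    px, py = pivot
    return mat_mul(translation(px, py),
                   mat_mul(rotation(theta), translation(-px, -py)))


def transform(matrix, point):
    """Apply a homogeneous transform to a 2D point."""
    column = (point[0], point[1], 1)
    return tuple(sum(matrix[i][k] * column[k] for k in range(3))
                 for i in range(2))


m = rotate_about((1, 1), math.pi / 2)
x, y = transform(m, (2, 1))
print(round(x, 9), round(y, 9))   # prints 1.0 2.0
```

The composite matrix is computed once (two 3 x 3 products) and then applied to every vertex, so the per-point cost is a single matrix-vector product regardless of how many basic transformations were combined.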
4.2. Mapping Translation and Scaling in Basic Forms 
In geometrical transformations, points are represented by vectors. Vectors are scaled or 
translated by adding (subtracting) or multiplying (dividing) them with constant values 
(scalars), which can themselves be represented as vectors. In the M1 chip version of the 
MorphoSys system, such operations can be classified as vector-vector and vector-scalar operations. 
These operations are used to implement basic translation, scaling, composite translations, and 
composite scaling. 
4.2.1. Translation Using Vector-Vector Operations 
 
A one-dimensional n-element vector could have the following transposed form: 

\[ M^T = [\, M_0 \;\; M_1 \;\; M_2 \;\; \ldots \;\; M_{n-1} \,] \]
 
A vector U could be considered as the original coordinates, while a vector V could be 
considered as the corresponding translation values. Mapping an algorithm for addition, or any 
other operation, of the two vectors is done by first storing them in the Frame Buffer set “0” and 
set “1”. Then we can exploit the properties of the interconnection, where some contents of 
Frame Buffer set “0” are added to some contents of Frame Buffer set “1” and the result would 
be in columns 0-7 of the ReC array. Figure 4 shows the final output in the ReC array after 
running the algorithm of adding two 64-element vectors. 
 
Row  C0       C1       C2       C3       C4       C5       C6       C7 
R0   U0+V0    U8+V8    U16+V16  U24+V24  U32+V32  U40+V40  U48+V48  U56+V56 
R1   U1+V1    U9+V9    U17+V17  U25+V25  U33+V33  U41+V41  U49+V49  U57+V57 
R2   U2+V2    U10+V10  U18+V18  U26+V26  U34+V34  U42+V42  U50+V50  U58+V58 
R3   U3+V3    U11+V11  U19+V19  U27+V27  U35+V35  U43+V43  U51+V51  U59+V59 
R4   U4+V4    U12+V12  U20+V20  U28+V28  U36+V36  U44+V44  U52+V52  U60+V60 
R5   U5+V5    U13+V13  U21+V21  U29+V29  U37+V37  U45+V45  U53+V53  U61+V61 
R6   U6+V6    U14+V14  U22+V22  U30+V30  U38+V38  U46+V46  U54+V54  U62+V62 
R7   U7+V7    U15+V15  U23+V23  U31+V31  U39+V39  U47+V47  U55+V55  U63+V63 
Figure 4. ReC array contents after vector addition. 
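The layout in Figure 4 follows a simple rule: element i of the result sits in row i mod 8 and column i div 8, so each ReC array column holds eight consecutive vector elements. The Python sketch below reproduces that placement in software; the function name vector_add_on_array is hypothetical, and the code models only where each sum ends up, not the cycle-by-cycle behaviour of the hardware.

```python
def vector_add_on_array(u, v):
    """Place U + V into an 8 x 8 grid using the layout shown in Figure 4."""
    if len(u) != 64 or len(v) != 64:
        raise ValueError("both vectors must have 64 elements")
    array = [[None] * 8 for _ in range(8)]
    for col in range(8):
        for row in range(8):
            i = 8 * col + row          # column-major index into the vectors
            array[row][col] = u[i] + v[i]
    return array


u = list(range(64))
v = [100] * 64
grid = vector_add_on_array(u, v)
print(grid[0][1], grid[7][7])   # prints 108 163 (U8+V8 and U63+V63)
```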
 
For MorphoSys to perform the required calculations, three sets of data must first be 
loaded into the M1 chip. The first data set is the TinyRISC program, which is placed in main 
memory. The TinyRISC program handles all the operations that are not mapped onto the ReC 
array, such as data transfers and the loading of ReC array contexts. The second data set contains 
the context codes of the ReC array. The context codes are written for column mode, row mode, or 
both. The context codes also define what operation each row or column is going to carry 
out, what input it takes, and where the output is to be stored. The last set loaded into 
MorphoSys is the input data set required for the computations. 
For translating vectors, the desired function of the ReC array interconnection is Out = A + 
B. Assume that vector U is stored at address 10,000hex of main memory, vector V is stored at 
address 20,000hex, and the context word is stored at address 30,000hex. The output is stored back 
at address 40,000hex. The TinyRISC code for translating a 64-element vector, with its discussion, 
is provided in Table 1. 
Table 1. TinyRISC code translating a 64-element vector, (a) addresses setup, (b) 
broadcast data and operations, (c) write-back output. 
0:  ldui r1, 0x1;              R1 ← 10000hex. This is where vector U is stored. 
1:  ldfb r1, 0, 0, 16;         FB ← 16 x 32 bits at set 0, bank A, address 0. 
2:  add r0, r0, r0;            No-operation (NOP). 
. . . 
33: ldui r1, 0x2;              R1 ← 20000hex. This is where vector V is stored. 
34: ldfb r1, 1, 0, 16;         FB ← 16 x 32 bits at set 0, bank B, address 0. 
35: add r0, r0, r0;            NOP. 
. . . 
66: ldui r3, 0x3;              R3 ← 30000hex. This is where the context word is stored in main memory. 
67: ldctxt r3, 0, 0, 0, 1;     Load one context word from main memory, starting at the address stored in register 3, into plane 0, block 0, starting at word 0. 
68: add r0, r0, r0;            NOP. 
. . . 
(a) 
 
71: ldui r4, 0x0;                        R4 ← 00000hex. 
72: dbcdc r4, 0, 0, 0, 0, 0, 0;          Double bank column broadcast. It sends data from both banks address 0 in the frame buffer and broadcasts the context words column-wise. It triggers the ReC array to start execution of column 0 by the context word of address 0 in the column block of context memory operating on data in set 0. Bank A starting at 0x0. Bank B starting at (0x0 + 0). 
73: ldli r4, 0x4;                        R4 ← 4hex. 
74: dbcdc r4, 0, 0, 1, 0, 0, 0x40;       It sends data from both banks address 40hex in the frame buffer. Bank A starting at 0x40. Bank B starting at (0x4 + 0x0 = 0x40). 
75: ldli r4, 0x8;                        R4 ← 8hex. 
76: dbcdc r4, 0, 0, 2, 0, 0, 0x80;       It sends data from both banks. 
77: ldli r4, 0xC;                        R4 ← Chex. 
78: dbcdc r4, 0, 0, 3, 0, 0, 0xC0;       It sends data from both banks address C0hex in the frame buffer. Bank A starting at 0xC0. Bank B starting at (0xC + 0x0 = 0xC0). 
79: ldli r4, 0x10;                       R4 ← 10hex. 
80: dbcdc r4, 0, 0, 4, 0, 0, 0x100;      It sends data from both banks address 100hex in the frame buffer. Bank A starting at 0x100. Bank B starting at (0x10 + 0x0 = 0x100). 
81: ldli r4, 0x14;                       R4 ← 14hex. 
82: dbcdc r4, 0, 0, 5, 0, 0, 0x140;      It sends data from both banks address 140hex in the frame buffer. Bank A starting at 0x140. Bank B starting at (0x14 + 0x0 = 0x140). 
83: ldli r4, 0x18;                       R4 ← 18hex. 
84: dbcdc r4, 0, 0, 6, 0, 0, 0x180;      It sends data from both banks. 
85: ldli r4, 0x1C;                       R4 ← 1Chex. 
86: dbcdc r4, 0, 0, 7, 0, 0, 0x1C0;      It sends data from both banks. 
(b) 
 
87: wfbi 0, 0, 0, 1, 0x0; 
Write data back to the frame buffer from the output 
registers 
of column 0 into set 1, address 0. 
88: wfbi 1, 0, 0, 1, 0x40; of column 1 into set 1, address 64. 
89: wfbi 2, 0, 0, 1, 0x80; of column 2 into set 1, address 128. 
90: wfbi 3, 0, 0, 1, 0xC0; of column 3 into set 1, address 192. 
91: wfbi 4, 0, 0, 1, 0x100; of column 4 into set 1, address 256. 
92: wfbi 5, 0, 0, 1, 0x140; of column 5 into set 1, address 320. 
93: wfbi 6, 0, 0, 1, 0x180; of column 6 into set 1, address 384. 
94: wfbi 7, 0, 0, 1, 0x1C0; of column 7 into set 1, address 448. 
95: ldui r5, 0x4; R5 ← 40000hex.
96: stfb r1, 1, 0, 10hex;
Store data from frame buffer set 1, address 0 into main 
memory starting at address stored in reg1. 
(c) 
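
The unrolled write-back rows above follow a regular pattern: column k is written to frame buffer offset k x 0x40. The Python sketch below only regenerates those listing rows as text to make the pattern explicit. The helper name wfbi_rows and the parameter dest_set are hypothetical and are not part of any MorphoSys tool; the stride 0x40 is read off the listing.

```python
# Illustrative only: regenerate the unrolled wfbi rows of the listing.
# COLUMN_STRIDE is taken from the offsets in the table (0x0, 0x40, ..., 0x1C0).
COLUMN_STRIDE = 0x40
NUM_COLUMNS = 8


def wfbi_rows(first_cycle, dest_set=1):
    """Return one 'cycle: wfbi ...' text row per ReC array column."""
    rows = []
    for col in range(NUM_COLUMNS):
        offset = col * COLUMN_STRIDE  # frame buffer offset for this column
        rows.append(
            f"{first_cycle + col}: wfbi {col}, 0, 0, {dest_set}, 0x{offset:X};"
        )
    return rows


if __name__ == "__main__":
    for row in wfbi_rows(87):
        print(row)
```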
 
4.2.2. Scaling Using Vector-Scalar Operations 
 
Consider the following 8-element vector U: 
 
 $U^T = [U_0 \; U_1 \; U_2 \; U_3 \; U_4 \; U_5 \; U_6 \; U_7]$
 
Scaling U with a constant c, the following is obtained: 
 
 $W = cU, \quad W^T = [cU_0 \; cU_1 \; \dots \; cU_7]$
 
For n-element vectors: 
 
 $W^T = [cU_0 \; cU_1 \; \dots \; cU_{n-1}]$.
 
Multiplication of a vector by a scalar, or any other arithmetic or logical vector-scalar
operation, is mapped by first storing the vector in Frame Buffer set “0”. The properties of the
interconnection are then exploited so that the contents of Frame Buffer set “0” are multiplied
by a constant stored in the context word. Figure 5 shows the final contents of the ReC array
after running the algorithm on a 64-element vector.
The desired function of the interconnection is: Out (t+1) = c x A. Assume that a vector U
is stored at address 30,000hex of main memory, and that the context word is stored at address
40,000hex. The result is then stored back at address 50,000hex. The code for scaling a
64-element vector, together with its discussion, is provided in Table 2.
 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 c.U0 c.U8 c.U16 c.U24 c.U32 c.U40 c.U48 c.U56 
R1 c.U1 c.U9 c.U17 c.U25 c.U33 c.U41 c.U49 c.U57 
R2 c.U2 c.U10 c.U18 c.U26 c.U34 c.U42 c.U50 c.U58 
R3 c.U3 c.U11 c.U19 c.U27 c.U35 c.U43 c.U51 c.U59 
R4 c.U4 c.U12 c.U20 c.U28 c.U36 c.U44 c.U52 c.U60 
R5 c.U5 c.U13 c.U21 c.U29 c.U37 c.U45 c.U53 c.U61 
R6 c.U6 c.U14 c.U22 c.U30 c.U38 c.U46 c.U54 c.U62 
R7 c.U7 c.U15 c.U23 c.U31 c.U39 c.U47 c.U55 c.U63 
Figure 5. ReC array contents after multiplication with a constant. 
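
The column-wise layout of Figure 5 can be illustrated with a short Python sketch. It is a software model only, not MorphoSys code: the function names scale_on_array and read_back are hypothetical, and the placement of element (col x 8 + row) in row `row`, column `col` is read off the figure.

```python
# Software model of the scaling mapping: each array column receives 8
# consecutive vector elements and every cell computes Out = c * A.
ROWS, COLS = 8, 8


def scale_on_array(u, c):
    """Return the 8x8 array contents after scaling a 64-element vector by c."""
    assert len(u) == ROWS * COLS
    array = [[0] * COLS for _ in range(ROWS)]
    for col in range(COLS):        # one column broadcast per column
        for row in range(ROWS):    # the cells of a column operate together
            array[row][col] = c * u[col * ROWS + row]
    return array


def read_back(array):
    """Write the columns back in order, recovering a flat 64-element vector."""
    return [array[row][col] for col in range(COLS) for row in range(ROWS)]


if __name__ == "__main__":
    u = list(range(64))
    result = scale_on_array(u, 3)
    print(result[1][2])            # cell R1, C2 holds c.U17
```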
4.3. Basic Transformation Compositions 
4.3.1. Composition of Two Translations 
 
The composition of two successive translations can be viewed as a combination of
vector-vector and vector-scalar MorphoSys operations. Two successive translations mean first
translating a vector V1 by the translation values held in another vector V2. The MorphoSys
ALU features are then exploited to perform the second translation by adding the result of the
first translation to a third vector V3 containing the second set of translation values. This
second step uses a new vector-scalar operation that adds the output of the ReC array cells to
another vector in the frame buffer. In other words, the algorithm can be divided into two steps:
Table 2. TinyRISC code for the uniform scaling routine of a 64-element vector. 
0: ldui r1, 0x1; R1 ← 10000hex. This is where vector U is stored.
1: ldfb r1, 0, 0, 16; FB ← 16 x 32 bits at set 0, bank A, address 0.
2: add r0, r0, r0; No-operation.
. . .  
33: ldui r3, 0x3;
R3 ← 30000hex. This is where the context word is stored
in main memory.
34: ldctxt r3, 0, 0, 0, 1; 
Load one context word from main memory starting at the 
address stored in register 3 into plane 0, block 0 and 
starting at word 0. 
35: add r0, r0, r0; NOP 
. . .  
38: sbcb 1, 0, 0, 0, 0, 0, 0x0; 
Single bank column broadcast causing all the cells in the 
RC array to perform their operations specified by the 
context word in context memory starting with data from 
frame buffer, set 0, bank A, address offset 0. 
39: sbcb 1, 0, 0, 0, 0, 0, 0x40; It sends data from bank A address 40hex in FB.
40: sbcb 1, 0, 0, 0, 0, 0, 0x80; It sends data from bank A address 80hex in FB.
41: sbcb 1, 0, 0, 0, 0, 0, 0xC0; It sends data from bank A address C0hex in FB.
42: sbcb 1, 0, 0, 0, 0, 0, 0x100; It sends data from bank A address 100hex in FB.
43: sbcb 1, 0, 0, 0, 0, 0, 0x140; It sends data from bank A address 140hex in FB.
44: sbcb 1, 0, 0, 0, 0, 0, 0x180; It sends data from bank A address 180hex in FB.
45: sbcb 1, 0, 0, 0, 0, 0, 0x1C0; It sends data from bank A address 1C0hex in FB.
46: wfbi 0, 0, 0, 1, 0x0; 
Write data back to the frame buffer from the output 
registers 
of column 0 into set 1, address 0. 
47: wfbi 1, 0, 0, 1, 0x40; of column 1 into set 1, address 64. 
48: wfbi 2, 0, 0, 1, 0x80; of column 2 into set 1, address 128. 
49: wfbi 3, 0, 0, 1, 0xC0; of column 3 into set 1, address 192. 
50: wfbi 4, 0, 0, 1, 0x100; of column 4 into set 1, address 256. 
51: wfbi 5, 0, 0, 1, 0x140; of column 5 into set 1, address 320. 
52: wfbi 6, 0, 0, 1, 0x180; of column 6 into set 1, address 384. 
53: wfbi 7, 0, 0, 1, 0x1C0; of column 7 into set 1, address 448. 
54: ldui r5, 0x4; R5 ← 40000hex.
55: stfb r1, 1, 0, 10hex; 
Store data from frame buffer set 1, address 0 into main 
memory starting at address stored in reg1. 
 
The output of the ReC array is: Out (t) = V1 + V2, where V1 and V2 are found in the frame 
buffer. 
The output is then added to the third vector V3 using the context word that yields: Out (t+1) 
= CxV3 +Out (t), where C is considered to be a constant of unity value and V3 a vector in the 
frame buffer. Therefore, Out (t+1) = V3 +Out (t) or Out (t+1) = V1 + V2 +V3. 
The mapping works as follows: the contents of Frame Buffer set “0” are added to the
contents of Frame Buffer set “1”, and the result is held in column 0. This is
demonstrated in Figure 6. The main motivation behind the composite translations mapping is
its expected major saving in cycle count over composition using matrix operations.
 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 U0 +V0        
R1 U1 +V1        
R2 U2 +V2        
R3 U3 +V3        
R4 U4 +V4        
R5 U5 +V5        
R6 U6 +V6        
R7 U7 +V7        
Figure 6. ReC array contents after addition is performed. 
The next step is to take the output of the previous cycle and to change the ReC-
interconnection and the ALU-operation to perform the next calculation (See Figure 7). 
 
Columns|Rows C0 … C6 C7
R0 c.U0 + out(t)    
R1 c.U1+ out(t)    
R2 c.U2+ out(t)    
R3 c.U3+ out(t)    
R4 c.U4+ out(t)    
R5 c.U5+ out(t)    
R6 c.U6+ out(t)    
R7 c.U7+ out(t)    
Figure 7. The result of a scalar*vector operation for an 8-element vector to a previously calculated output. 
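
The two steps described above, Out(t) = V1 + V2 followed by Out(t+1) = C x V3 + Out(t) with C = 1, can be checked with a small Python model. This is a sketch of the arithmetic only; the function name composite_translation is hypothetical and nothing here models the frame buffer or context memory.

```python
# Arithmetic model of the composite translation in two ALU steps.
def composite_translation(v1, v2, v3, c=1):
    """Return V1 + V2 + c*V3, computed in the two steps used by the mapping."""
    # Step 1: vector-vector addition, Out(t) = V1 + V2
    out_t = [a + b for a, b in zip(v1, v2)]
    # Step 2: Out(t+1) = C x V3 + Out(t), with C held in the context word
    out_t1 = [c * x + prev for x, prev in zip(v3, out_t)]
    return out_t1


if __name__ == "__main__":
    v1 = [1, 2, 3, 4, 5, 6, 7, 8]
    v2 = [10] * 8
    v3 = [100] * 8
    print(composite_translation(v1, v2, v3))
```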
The vectors V1, V2, and V3 can be loaded into Bank A/Set 0, Bank B/Set 0, and Bank A/Set
1 of the frame buffer, respectively, and the constant is loaded in the context word. Assuming
that the needed context words and the vectors are available in main memory, the TinyRISC code
has to take care of the rest. It first loads the vectors from main memory into the frame buffer.
It then loads the needed context from main memory into the column block of context memory
twice, once for each context word. Assume that vector V1 is stored at address 10,000hex of main
memory, vector V2 at address 20,000hex, and vector V3 at address 30,000hex. The context words
are stored at addresses 40,000hex and 50,000hex. The code and its discussion are in Table 3.
Table 3. TinyRISC code for the composite translations routine (a) addresses setup, (b) 
broadcast data and operations, (c) write-back output. 
0: ldui r1, 0x1; R1 ← 10000hex. This is where vector V1 is stored.
1: ldfb r1, 0, 0, 16; FB ← 16 x 32 bits at set 0, bank A, address 0.
2: add r0, r0, r0; No-operation.
. . .
33: ldui r1, 0x2; R1 ← 20000hex. This is where vector V2 is stored.
34: ldfb r1, 1, 0, 16;
FB ← 16 x 32 bits at set 0, bank B, address 0, from main
memory starting at the address stored in register 1.
35: add r0, r0, r0; No-operation.
. . .
66: ldui r1, 0x3; R1 ← 30000hex. This is where vector V3 is stored.
67: ldfb r1, 0, 1, 16; FB ← 16 x 32 bits at set 1, bank A, address 0.
68: add r0, r0, r0; NOP
. . .
99: ldui r3, 0x4; R3 ← 40000hex. This is where the context is stored in memory.
100: ldctxt r3, 0, 0, 0, 1;
Load one context word from main memory starting at the address
stored in register 3 into plane 0, block 0 and starting at word 0.
101: add r0, r0, r0; NOP
. . .
104: ldui r3, 0x5; R3 ← 50000hex. This is where the context is stored in memory.
105: ldctxt r3, 0, 0, 1, 1;
Load one context word from main memory starting at the address
stored in register 3 into plane 1, block 0 and starting at word 0.
106: add r0, r0, r0; NOP
. . . .
(a) 
 
109: ldui r4, 0x0; R4 ← 00000hex.
110: dbcdc r4, 0, 0, 0, 0, 0, 0;
Double bank column broadcast. It sends data from both
banks address 0 in the frame buffer and broadcasts the
context words column-wise. It triggers the RC array to
start execution of column 0 by the context word of address
0 in the column block of context memory operating on data
in set 0. Bank A starting at 0x0. Bank B starting at (0x0 +
0).
111: ldli r4, 0x4; R4 ← 4hex.
112: dbcdc r4, 0, 0, 1, 0, 0, 0x40;
It sends data from both banks address 40hex in the frame
buffer. Bank A starting at 0x40. Bank B starting at (0x4 +
0x0 = 0x40).
113: ldli r4, 0x8; R4 ← 8hex.
114: dbcdc r4, 0, 0, 2, 0, 0, 0x80;
It sends data from both banks address 80hex in the frame
buffer. Bank A starting at 0x80. Bank B starting at (0x8 +
0x0 = 0x80).
115: ldli r4, 0xC; R4 ← Chex.
116: dbcdc r4, 0, 0, 3, 0, 0, 0xC0;
It sends data from both banks address C0hex in the frame
buffer. Bank A starting at 0xC0. Bank B starting at (0xC +
0x0 = 0xC0).
117: ldli r4, 0x10; R4 ← 10hex.
118: dbcdc r4, 0, 0, 4, 0, 0, 0x100;
It sends data from both banks address 100hex in the frame
buffer. Bank A starting at 0x100. Bank B starting at (0x10
+ 0x0 = 0x100).
119: ldli r4, 0x14; R4 ← 14hex.
120: dbcdc r4, 0, 0, 5, 0, 0, 0x140;
It sends data from both banks address 140hex in the frame
buffer. Bank A starting at 0x140. Bank B starting at (0x14
+ 0x0 = 0x140).
121: ldli r4, 0x18; R4 ← 18hex.
122: dbcdc r4, 0, 0, 6, 0, 0, 0x180;
It sends data from both banks address 180hex in the frame
buffer. Bank A starting at 0x180. Bank B starting at (0x18
+ 0x0 = 0x180).
123: ldli r4, 0x1C; R4 ← 1Chex.
124: dbcdc r4, 0, 0, 7, 0, 0, 0x1C0;
It sends data from both banks address 1C0hex in the frame
buffer. Bank A starting at 0x1C0. Bank B starting at (0x1C
+ 0x0 = 0x1C0).
125: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
Single bank column broadcast causing all the cells in the 
RC array to perform their operations specified by the 
context word in context memory/plane, 1 starting with data 
from frame buffer, set 1, bank A, address offset 0. 
126: sbcb 1, 0, 0, 1, 0, 1, 0x40; It sends data from bank A/Set 1 address 40hex in FB. 
127: sbcb 1, 0, 0, 1, 0, 1, 0x80; It sends data from bank A/Set 1 address 80hex in FB. 
128: sbcb 1, 0, 0, 1, 0, 1, 0xC0; It sends data from bank A/Set 1 address C0hex in FB 
129: sbcb 1, 0, 0, 1, 0, 1, 0x100; 
It sends data from bank A/Set 1 address 100hex in the 
frame buffer. 
130: sbcb 1, 0, 0, 1, 0, 1, 0x140; 
It sends data from bank A/Set 1 address 140hex in the 
buffer. 
140: sbcb 1, 0, 0, 1, 0, 1, 0x180; 
It sends data from bank A/Set 1 address 180hex in the 
buffer. 
150: sbcb 1, 0, 0, 1, 0, 1, 0x1C0; 
It sends data from bank A/Set 1 address 1C0hex in the 
buffer. 
(b) 
 
151: wfbi 0, 0, 0, 1, 0x0;
Write data back to the frame buffer from the output registers 
of column 0 into set 1, address 0. 
152: wfbi 1, 0, 0, 1, 0x40; of column 1 into set 1, address 64. 
153: wfbi 2, 0, 0, 1, 0x80; of column 2 into set 1, address 128. 
154: wfbi 3, 0, 0, 1, 0xC0; of column 3 into set 1, address 192. 
155: wfbi 4, 0, 0, 1, 0x100; of column 4 into set 1, address 256.
156: wfbi 5, 0, 0, 1, 0x140; of column 5 into set 1, address 320. 
157: wfbi 6, 0, 0, 1, 0x180; of column 6 into set 1, address 384. 
158: wfbi 7, 0, 0, 1, 0x1C0; of column 7 into set 1, address 448. 
159: ldui r5, 0x4; R5 ← 40000hex.
160: stfb r1, 1, 0, 10hex;
Store data from frame buffer set 1, address 0 into main 
memory starting at address stored in register 1. 
161: add r0, r0, r0; NOP 
. . . . 
224: add r0, r0, r0; NOP 
(c) 
4.3.2. Composition of Two Scaling Transformations 
 
The composition of two scaling operations means either calculating a composite scaling
parameter, or performing two successive scaling operations, i.e., calling the scalar-vector
multiplication routine of Section 4.2.2 twice. The same holds for any other geometrical
transformation. Generally, constructing a subroutine for two successive transformations is
useful for code reusability, especially if such a subroutine yields a major saving in
execution time. In the first approach, a composite scaling parameter is calculated as the
product of the two scaling factors. The points are then scaled by multiplying them by the
composite scaling factor. Using the MorphoSys ALU basic operations, the steps can be written
as follows:
 
 Out(t) = C1xC2 
 Out(t+1) = Out(t) x A 
 
In the second approach, the two scaling operations are applied in succession:
 
 Out(t) = C1xA. 
 Out(t+1) = C2 x Out(t). 
 
Both approaches require operations that are not supported by the ALU of the MorphoSys
ReC array. This led us to consider modifying the ALU to support the operations
Out(t+1) = Out(t) x A (where A is a vector), Out(t+1) = C x Out(t) (where C is a
scalar), and Out (t) = B x A (where B is a vector).
 
Only the modification that adds the operation Out(t+1) = C x Out(t) is discussed here;
it assumes the presence of a multiply operation in the ReCs. The code after this
modification is discussed in Table 4.
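
As a sanity check on the two formulations above, the following Python sketch models both orderings and confirms that they produce the same scaled vector. It is an arithmetic illustration only; the function names are hypothetical, and integer inputs are used so that the comparison is exact.

```python
# Two ways of composing scaling operations, written as ALU-style steps.
def scale_with_composite_factor(a, c1, c2):
    """First approach: Out(t) = C1 x C2, then Out(t+1) = Out(t) x A."""
    out_t = c1 * c2
    return [out_t * x for x in a]


def scale_twice(a, c1, c2):
    """Second approach: Out(t) = C1 x A, then Out(t+1) = C2 x Out(t)."""
    out_t = [c1 * x for x in a]
    return [c2 * x for x in out_t]


if __name__ == "__main__":
    vector = list(range(8))
    print(scale_with_composite_factor(vector, 2, 3))
    print(scale_twice(vector, 2, 3))
```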
4.3.3. Composition of Translation and Scaling 
 
For a composite translation and scaling, the desired output at the ReCs is: 
 Out(t) = (C x A) + B, where A and B are vectors, while C is a constant. 
 
The code for such a composite transformation is similar to what is presented in the previous 
code segments. 
Table 4. TinyRISC code, composite scaling operations using the suggested modification. 
0: ldui r1, 0x1; R1 ← 10000hex. This is where vector A is stored.
1: ldfb r1, 0, 0, 16; FB ← 16 x 32 bits at set 0, bank A, address 0.
2: add r0, r0, r0; No-operation.
. . .
33: ldui r3, 0x4; R3 ← 40000hex. This is where the context word is stored in memory.
34: ldctxt r3, 0, 0, 0, 1;
Load one context word from main memory starting at
the address stored in register 3 into plane 0, block 0 and
starting at word 0.
35: add r0, r0, r0; NOP
. . .
38: ldui r3, 0x5; R3 ← 50000hex. This is where the context word is stored in memory.
39: ldctxt r3, 0, 0, 1, 1;
Load another context word from main memory starting
at the address stored in register 3 into plane 1, block 0
and starting at word 0.
40: add r0, r0, r0; NOP
. . . .
42: ldui r4, 0x0; R4 ← 00000hex.
43: sbcb 1, 0, 0, 0, 0, 0, 0x0; 
Single bank column broadcast causing all the cells in 
the RC array to perform their operations specified by 
the context word in context memory starting with data 
from frame buffer, set 0, bank A, address offset 0. 
44: sbcb 1, 0, 0, 0, 0, 0, 0x40; It sends data from bank at address 40hex in FB. 
45: sbcb 1, 0, 0, 0, 0, 0, 0x80; It sends data from bank at address 80hex in FB. 
46: sbcb 1, 0, 0, 0, 0, 0, 0xC0; It sends data from bank at address C0hex in FB 
47: sbcb 1, 0, 0, 0, 0, 0, 0x100; It sends data from bank at address 100hex in the FB. 
48: sbcb 1, 0, 0, 0, 0, 0, 0x140; It sends data from bank at address 140hex in the FB. 
49: sbcb 1, 0, 0, 0, 0, 0, 0x180; It sends data from bank at address 180hex in the FB. 
50: sbcb 1, 0, 0, 0, 0, 0, 0x1C0; It sends data from bank address 1C0hex in the FB . 
51: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
Single bank column broadcast causing all the cells in 
the RC array to perform their operations specified by 
the new suggested context word (with ALU 
enhancement) in context memory causing Out (t+1) = 
C2 x Out (t). Where, no data is taken from the frame 
buffer. 
52: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
53: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
54: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
55: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
56: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
57: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
58: sbcb 1, 0, 0, 1, 0, 1, 0x0; 
59: wfbi 0, 0, 0, 1, 0x0;
Write data back to the frame buffer from the output 
registers of column 0 into set 1, address 0. 
60: wfbi 1, 0, 0, 1, 0x40; of column 1 into set 1, address 64. 
61: wfbi 2, 0, 0, 1, 0x80; of column 2 into set 1, address 128. 
62: wfbi 3, 0, 0, 1, 0xC0; of column 3 into set 1, address 192. 
63: wfbi 4, 0, 0, 1, 0x100; of column 4 into set 1, address 256. 
64: wfbi 5, 0, 0, 1, 0x140; of column 5 into set 1, address 320. 
65: wfbi 6, 0, 0, 1, 0x180; of column 6 into set 1, address 384. 
66: wfbi 7, 0, 0, 1, 0x1C0; of column 7 into set 1, address 448. 
4.4. Transformations using the General Matrix Form 
All basic and composite transformations, such as rotation and shearing, can be performed
using the general matrix form. In this section, a parallel matrix multiplication algorithm is
developed and mapped onto the MorphoSys. Because of the ReC array size, the matrix
dimensions are chosen to be 8x8.
In the standard matrix multiplication algorithm, multiplying a matrix A by a matrix B
means multiplying row one (r1) of A by column one of B element by element and adding the
products to yield (c11) of the resultant matrix C. The multiplication with (r1) is repeated for
all columns of B, resulting in (c12 … c1n). Then, (r2) of A is multiplied by all columns of B.
The procedure is repeated until the last row of A. Matrices A, B, and C are dense matrices.
4.4.1. First Mapping 
 
This simple algorithm can be mapped onto the ReC array as follows. The contents of
matrix A are passed row by row through the context words, and are thus stored in the context
memory for later retrieval and manipulation by the ReCs. The contents of matrix B are also
broadcast row by row to the columns of the ReC array (see Figures 8, 9 and 10). The
multiplication stage (row x column) is done using the CMUL ALU operation, where
Out(t) = C x A, which is the required computation. Note that CMUL is a vector-scalar operation.
 
 
CMUL by a11 a12 a13 a14 a15 a16 a17 a18 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 b11 b21 . . . . . . 
R1 b12 b22 . . . . . . 
R2 b13 . . . . . . . 
R3 b14 . . . . . . . 
R4 b15 . . . . . . . 
R5 b16 . . . . . . . 
R6 b17 . . . . . . . 
R7 b18 . . . . . . b88 
Figure 8. ReC array before CMUL operation. 
CMUL by a11 a12 a13 a14 a15 a16 a17 a18 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 a11x b11 . . . . . . . 
R1 a11x b12 . . . . . . . 
R2 a11x b13 . . . . . . . 
R3 a11x b14 . . . . . . . 
R4 a11x b15 . . . . . . . 
R5 a11x b16 . . . . . . . 
R6 a11x b17 . . . . . . . 
R7 a11x b18 . . . . . . a18x b88 
Figure 9. ReC array after CMUL operation. 
Columns|Rows C0 C1 … 
R0 a11x b11 a11x b11+a12x b21 …
R1 a11x b12 . 
R2 a11x b13 . 
R3 a11x b14 . 
R4 a11x b15 . 
R5 a11x b16 . 
R6 a11x b17 . 
R7 a11x b18 . 
Figure 10. ReC array after one accumulation. 
After finishing the multiplication of the first row of A with the matrix B, the results output
from the ReCs need to be accumulated to produce the first element of matrix C. The required
ReC operation is CMULOADD, defined as Out(t + 1) = Out(t) + Out [From Left Cell]. The
contents of the ReC array are then as shown in Figure 10. Note that the general form of
this operation is Out(t +1) = Out(t) + (A x C), where A is the output from the left cell, and C is
a constant stored in the context word; here C is equal to 1. The contents of column 7 of
the ReC array are then stored back to the frame buffer and then to the main memory. The same
steps are repeated with the same context word but with a different constant field containing the
data from matrix A until the resultant matrix C is obtained. The code and its discussion are in
Table 5.
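
The first mapping can be modelled in a few lines of Python. This is a behavioural sketch, not TinyRISC or context-word code: the function name matmul_first_mapping is hypothetical, column j of the modelled array is assumed to hold row j of B (as in Figure 8), and the accumulation is assumed to sweep from column 0 to column 7 so that column 7 ends up holding one row of C.

```python
# Behavioural model of the first mapping for 8x8 matrices.
N = 8


def matmul_first_mapping(a, b):
    """Multiply a by b using a CMUL stage followed by left-to-right accumulation."""
    c = [[0] * N for _ in range(N)]
    for i in range(N):  # one pass per row of A
        # CMUL: cell (row k, column j) holds b[j][k] scaled by the constant a[i][j]
        array = [[a[i][j] * b[j][k] for j in range(N)] for k in range(N)]
        # Accumulate: each column adds the output of its left neighbour
        for j in range(1, N):
            for k in range(N):
                array[k][j] += array[k][j - 1]
        # Column 7 now holds row i of the result
        for k in range(N):
            c[i][k] = array[k][N - 1]
    return c


if __name__ == "__main__":
    a = [[i + j for j in range(N)] for i in range(N)]
    identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
    print(matmul_first_mapping(a, identity) == a)
```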
4.4.2. Second Mapping 
 
Using the same standard matrix multiplication algorithm, this section introduces a new
parallelization scenario that takes advantage of other supported topologies in the MorphoSys.
The mapping uses the upper left quadrant along with the bottom right quadrant of the ReC
array. The matrices are considered to be of size 4x4. Accordingly, the matrix multiplication
finishes in 2 steps. Matrix B is broadcast to the ReC array, while matrix A is stored in the
frame buffer. Figure 11 shows the contents of the ReC array before any operation. Figure 12
shows the contents of the ReC array after the CMUL operation. Figure 13 shows the contents
of the ReC array after one accumulation. The algorithm stops after one repetition of the
same procedure over the remaining elements in A and B. The code is discussed in Table 6.
 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 b11 b21 b31 b41 . . . . 
R1 b12 b22 b32 b42 . . . . 
R2 b13 b23 b33 b43 . . . . 
R3 b14 b24 b34 b44 . . . . 
R4 . . . . b11 b21 b31 b41 
R5 . . . . b12 b22 b32 b42 
R6 . . . . b13 b23 b33 b43 
R7 . . . . b14 b24 b34 b44 
Figure 11. ReC array Before CMUL Operation. 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 a11x b11 … … … . . . . 
R1 a11x b12 … … … . . . . 
R2 a11x b13 … … … . . . . 
R3 a11x b14 … … … . . . . 
R4 . . . . a11x b14 … … … 
R5 . . . . a12x b11 … … … 
R6 . . . . a12x b12 … … … 
R7 . . . . a12x b13 … … … 
Figure 12. ReC array after one CMUL Operation. 
 
Columns|Rows C0 C1 C2 C3 C4 C5 C6 C7
R0 a11x b11 a11x b11+a12x b21 … … . . . .
R1 a11x b12 … … … . . . . 
R2 a11x b13 … … … . . . . 
R3 a11x b14 … … … . . . . 
R4 . . . . a12x b14 a12x b11+a12x b21 … …
R5 . . . . a12x b11 … … … 
R6 . . . . a12x b12 … … … 
R7 . . . . a12x b13 … … … 
Figure 13. ReC array after the accumulation. 
Table 5. TinyRISC code for a parallel 8x8 matrix multiplication algorithm. 
0: ldui r1, 0x1; R1 ← 10000hex. This is where matrix B is stored.
1: ldfb r1, 0, 0, 16; FB ← 16 x 32 bits at set 0, bank A, address 0.
2: add r0, r0, r0; No-operation.
. . .
33: ldui r3, 0x3;
R3 ← 30000hex. This is where the first set of context
words is stored in main memory.
34: ldctxt r3, 0, 0, 0, 8;
Load 8 context words from main memory starting at the
address stored in register 3 into plane 0, block 0,
starting at word 0, for a single row calculation from
matrix A.
35: add r0, r0, r0; NOP 
. . .  
38: sbcb 0, 0, 0, 0, 0, 0, 0x0; 
Single bank column broadcast causing all the cells in 
the RC array column 0 to perform their operations 
specified by the context word in context memory 
starting with data from frame buffer, set 0, bank A 
[which contain the matrix B elements], address offset 0. 
39: sbcb 0, 1, 0, 0, 0, 0, 0x40; 
It sends data from address 40hex in FB. 
Column 1 in the RC-Array. 
40: sbcb 0, 2, 0, 0, 0, 0, 0x80; 
It sends data from address 80hex in FB. 
Column 2 in the RC-Array. 
41: sbcb 0, 3, 0, 0, 0, 0, 0xC0; It sends data from address C0hex in FB 
42: sbcb 0, 4, 0, 0, 0, 0, 0x100; It sends data from address 100hex in FB 
43: sbcb 0, 5, 0, 0, 0, 0, 0x140; It sends data from address 140hex in FB 
44: sbcb 0, 6, 0, 0, 0, 0, 0x180; It sends data from address 180hex in FB 
45: sbcb 0, 7, 0, 0, 0, 0, 0x1C0; It sends data from address 1C0hex in FB 
46: ldui r3, 0x6;
R3 ← 60000hex. This is where the context word for
CMULOADD is stored in main memory.
47: ldctxt r3, 0, 0, 8, 1; 
Load 1-context word from main memory starting at the 
address stored in register 3 into plane 8, block 0 and 
starting at word 0. 
48: add r0, r0, r0; NOP 
. . .  
51: ldui r4, 0x0; R4 ← 00000hex.
52: sbcb 0, 1, 0, 0, 0, 0, 0x40;
Single bank column broadcast causing all the cells in
the RC array column 1 to perform their operations
specified by the context word in context memory with
data from the left cell output.
53: sbcb 0, 2, 0, 0, 0, 0, 0x80;
54: sbcb 0, 3, 0, 0, 0, 0, 0xC0;
55: sbcb 0, 4, 0, 0, 0, 0, 0x100;
56: sbcb 0, 5, 0, 0, 0, 0, 0x140;
57: sbcb 0, 6, 0, 0, 0, 0, 0x180;
58: sbcb 0, 7, 0, 0, 0, 0, 0x1C0;
59: wfbi 7, 0, 0, 1, 0x0; 
Write data back to the frame buffer from the output 
registers of column 7 into set 1, address 0. 
248:
Repeat the above code 7 times, with the appropriate memory shifts inside the
instructions; this repetition takes an additional 189 cycles.
 
Table 6. TinyRISC code for a parallel 4x4 matrix multiplication algorithm. 
0: ldui r1, 0x1; R1 ← 10000hex. This is where matrix B is stored.
1: ldfb r1, 0, 0, 16; FB ← 4 x 32 bits at set 0, bank A, address 0.
2: add r0, r0, r0; No-operation.
. . .
9: ldui r3, 0x3;
R3 ← 30000hex. This is where the first set of context words
is stored in main memory.
10: ldctxt r3, 0, 0, 0, 8;
Load 8 context words from main memory starting at the
address stored in register 3 into plane 0, block 0,
starting at word 0, for a single row calculation from matrix
A.
11: add r0, r0, r0; NOP 
. . .  
14: sbcb 0, 0, 0, 0, 0, 0, 0x0; 
Single bank column broadcast causing all the cells in the 
RC array column 0 to perform their operations specified 
by the context word in context memory starting with data 
from frame buffer, set 0, bank A [which contain the matrix 
B elements], address offset 0. 
15: sbcb 0, 1, 0, 0, 0, 0, 0x20; 
It sends data from address 20hex in FB. 
Column 1 in the RC-Array. 
16: sbcb 0, 2, 0, 0, 0, 0, 0x40; 
It sends data from address 40hex in FB. 
Column 2 in the RC-Array. 
17: sbcb 0, 3, 0, 0, 0, 0, 0x60; It sends data from address 60hex in FB 
18: sbcb 0, 4, 0, 0, 0, 0, 0x0; It sends data from address 0hex in FB 
19: sbcb 0, 5, 0, 0, 0, 0, 0x20; It sends data from address 20hex in FB 
20: sbcb 0, 6, 0, 0, 0, 0, 0x40; It sends data from address 40hex in FB 
21: sbcb 0, 7, 0, 0, 0, 0, 0x60; It sends data from address 60hex in FB 
22: ldui r3, 0x6;
R3 ← 60000hex. This is where the context word for
CMULOADD is stored in main memory.
23: ldctxt r3, 0, 0, 8, 1; 
Load 1-context word from main memory starting at the 
address stored in register 3 into plane 8, block 0 and 
starting at word 0. 
24: add r0, r0, r0; NOP 
. . .  
27: ldui r4, 0x0; R4 ← 00000hex.
28: sbcb 0, 1, 0, 0, 0, 0, 0x0;
Single bank column broadcast causing all the cells in the
RC array column 1 to perform their operations specified
by the context word in context memory with data from the
left cell output.
29: sbcb 0, 2, 0, 0, 0, 0, 0x00;
30: sbcb 0, 3, 0, 0, 0, 0, 0x00;
31: sbcb 0, 4, 0, 0, 0, 0, 0x00;
32: sbcb 0, 5, 0, 0, 0, 0, 0x00;
33: sbcb 0, 6, 0, 0, 0, 0, 0x00;
34: sbcb 0, 7, 0, 0, 0, 0, 0x00;
35: wfbi 7, 0, 0, 1, 0x0; 
Write data back to the frame buffer from the output 
registers of column 7 into set 1, address 0. 
70:
Repeat the above code once, with the appropriate memory shifts inside the instructions;
this repetition takes an additional 35 cycles.
4.5. Performance Evaluation and Analysis 
The performance evaluation is based on the execution speed of the developed parallel
algorithms. The execution speed is calculated from the number of cycles a program takes to
execute under MorphoSys. The MorphoSys system is considered to be operational
at a frequency of 100 MHz.
Table 7 shows the results, including the number of elements, the number of cycles, the
execution time in μsec, the number of elements processed per cycle, and the number of cycles
taken to produce an element of the desired output.
Table 7. Summary of Findings.
Algorithm | No. of Elements | No. of Cycles | Execution Time in μsec | Elements per Cycle | Cycles per Element
Translation | 8 | 21 | 0.21 | 0.38 | 2.625
Translation | 64 | 96 | 0.96 | 0.667 | 1.5
Composite Translation | 8 | 44 | 0.44 | 0.18 | 5.5
Composite Translation | 64 | 224 | 2.24 | 0.285 | 3.5
Scaling | 8 | 14 | 0.14 | 0.57 | 1.75
Scaling | 64 | 55 | 0.55 | 1.16 | 0.859
Composite Scaling | 8 | 23 | 0.23 | 0.35 | 2.875
Composite Scaling | 64 | 66 | 0.66 | 0.97 | 1.03
Composition of Translation and Scaling Operations | 8 | 21 | 0.21 | 0.380 | 2.625
Composition of Translation and Scaling Operations | 64 | 151 | 1.51 | 0.42 | 2.36
Transformations using Matrix Multiplication Algorithm I | 64 | 248 | 2.48 | 0.258 | 3.875
Transformations using Matrix Multiplication Algorithm II | 16 | 70 | 0.7 | 0.228 | 4.375
 
With a cycle count of 21, the number of elements per cycle for the 8-element translation
algorithm is 0.38. For a 64-element translation, which takes 96 cycles, the number of elements
per cycle is 0.667. Accordingly, the maximum utilization of the ReC array has led to a
higher throughput. The same conclusion is reached for all the algorithms mapped: the
higher the utilization of the ReC array, the higher the throughput.
The composite routines (translation and scaling operations) achieve a speedup over
performing the same composite operations by running a single translation (or scaling operation)
twice. Table 8 shows the speedup of the composite routines over the standard basic routines.
The speedup is calculated as the ratio of the numbers of cycles.
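
The derived figures can be reproduced from the cycle counts with a few lines of Python. This is a minimal sketch assuming the stated 100 MHz clock; the function names metrics and speedup are hypothetical, and the sample inputs are the 8-element translation row of Table 7 and the 64-element scaling rows of Table 8.

```python
# Derived performance figures from a cycle count at a 100 MHz clock.
CLOCK_HZ = 100e6


def metrics(elements, cycles):
    """Return execution time in microseconds and the two throughput ratios."""
    return {
        "time_us": cycles / CLOCK_HZ * 1e6,
        "elements_per_cycle": elements / cycles,
        "cycles_per_element": cycles / elements,
    }


def speedup(baseline_cycles, composite_cycles):
    """Speedup as the ratio of cycle counts (baseline over composite)."""
    return baseline_cycles / composite_cycles


if __name__ == "__main__":
    print(metrics(8, 21))      # 8-element translation
    print(speedup(110, 66))    # scaling twice vs. composite scaling, 64 elements
```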
 
 
 
 
 
 
 
 
 
 
 
 
Table 8. Comparisons among different translation and scaling operation mappings.
Algorithm | No. of Elements | No. of Cycles | Speedup | Elements per Cycle | Cycles per Element
Translation
Translating twice using two calls of the translation routine | 8 | 42 | - | 0.19 | 5.2
Translating twice using the composite translation routine | 8 | 31 | 1.35 | 0.258 | 3.075
Translating twice using two calls of the translation routine | 64 | 192 | - | 0.33 | 3
Translating twice using the composite translation routine | 64 | 150 | 1.28 | 0.42 | 2.34
Scaling
Scaling twice using two calls of the scaling routine | 8 | 28 | - | 0.285 | 3.5
Scaling twice using the composite scaling routine | 8 | 23 | 1.2 | 0.35 | 2.875
Scaling twice using two calls of the scaling routine | 64 | 110 | - | 0.58 | 1.78
Scaling twice using the composite scaling routine | 64 | 66 | 1.67 | 0.97 | 1.03
Translation and Scaling
Executing the translation routine then scaling routine | 8 | 35 | - | 0.228 | 4.38
Translation and scaling in a single routine | 8 | 21 | 1.67 | 0.25 | 3.88
Executing the translation routine then scaling routine | 64 | 151 | - | 0.42 | 2.35
Translation and scaling in a single routine | 64 | 96 | 1.57 | 0.42 | 0.34
 
The developed parallel matrix multiplication algorithms allow for increased flexibility
in performing geometrical transformations. The matrix representation is a general
representation that implements all basic and composite geometrical transformations; this is
one of its key advantages. From a performance point of view, the parallel matrix multiplication
algorithms are not necessarily faster than those designed for specific transformations. The
matrix multiplication-based transformation algorithms become faster when the number of calls
of basic transformations becomes large. For example, once the composite translation constant
has been calculated, the matrix-based representation is more efficient than running the basic
translation routine a large number of times. The matrix form allows the matrix
multiplication routine to be applied once with a single accumulated translation value. The same
is true for scaling, rotation, and shearing.
5. Cyclic Redundancy Checkers under MorphoSys 
Redundant encoding is a method of error detection that spreads the information across more
bits than the original data. The more redundant bits used, the greater the chance of detecting
errors. CRCCs check for differences between transmitted data and the original data. CRCCs are
effective for two reasons. Firstly, they provide excellent protection against common errors,
such as burst errors where consecutive bits in a data stream are corrupted during transmission.
Secondly, systems that use CRCCs are easy to implement [72, 75, 76]. When a CRCC is used
to verify a frame of data, the frame is treated as one very large binary number, which is then
divided by a generator number. This division produces a remainder, which is transmitted along
with the data. At the receiving end, the data is divided by the same generator number and the
remainder is compared with the one sent at the end of the data frame. If the two remainders are
different, then an error occurred during data transmission. The types of errors that a CRCC
detects depend on the generator polynomial. Table 9 shows the most common generator polynomials.
Table 9. Common generator polynomials.
Generator        Polynomial
SDLC (CCITT)     X^16 + X^12 + X^5 + X^0
SDLC Reverse     X^16 + X^11 + X^4 + X^0
CRC-16           X^16 + X^15 + X^2 + X^0
CRC-16 Reverse   X^16 + X^14 + X^1 + X^0
CRC-12           X^12 + X^11 + X^3 + X^2 + X^1 + X^0
Ethernet         X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4
                 + X^2 + X^1 + X^0
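The division described above can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the chapter's implementation: it uses the SDLC (CCITT) generator, reads the frame most significant bit first, and extends it by 16 zero bits before dividing, which is the usual convention but is not stated in the text. The names remainder and receiver_accepts are hypothetical.

```python
# X^16 + X^12 + X^5 + X^0, the SDLC (CCITT) generator.
GENERATOR = (1 << 16) | (1 << 12) | (1 << 5) | 1
DEGREE = 16


def remainder(data: bytes) -> int:
    """Remainder of the frame, read as one large binary number and extended
    by DEGREE zero bits (an assumption), after modulo-2 division."""
    nbits = 8 * len(data) + DEGREE
    value = int.from_bytes(data, "big") << DEGREE
    for shift in range(nbits - 1, DEGREE - 1, -1):
        if value & (1 << shift):
            # Modulo-2 subtraction is XOR; this clears the leading 1 bit.
            value ^= GENERATOR << (shift - DEGREE)
    return value


def receiver_accepts(data: bytes, sent_remainder: int) -> bool:
    """The receiver repeats the division and compares the two remainders."""
    return remainder(data) == sent_remainder


frame = b"123456789"
check = remainder(frame)

# A burst error: three consecutive bits of the third byte are flipped.
corrupted = bytearray(frame)
corrupted[2] ^= 0b00111000

print(hex(check))
print(receiver_accepts(frame, check), receiver_accepts(bytes(corrupted), check))
```

With a degree-16 generator that has a nonzero X^0 term, any burst no longer than 16 bits changes the remainder, so the corrupted frame is rejected.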
5.1. CRC Sequential Implementation 
 
Figure 14. LFSR implementation of the CCITT CRC-16. 
CRC implementation is usually done with linear-feedback shift registers (LFSRs). Figures 14 
and 15 show the CCITT CRC-16 and CRC-16 generators with their serial implementation using 
LFSRs. This serial method works well when the data is available in bit-stream form. 
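A bit-serial LFSR can be modelled in software as follows. This is a sketch under assumptions that cannot be checked against Figures 14 and 15 here: the register shifts right, data bits arrive least significant bit first, and the register starts at zero. The tap constants are the bit-reversed forms of the generators in Table 9; the function names are hypothetical.

```python
CCITT_TAPS = 0x8408  # X^16 + X^12 + X^5 + X^0, bit-reversed, X^16 implicit
CRC16_TAPS = 0xA001  # X^16 + X^15 + X^2 + X^0, bit-reversed, X^16 implicit


def lfsr_shift(state, bit, taps):
    """Advance the 16-bit LFSR by one input bit."""
    feedback = (state ^ bit) & 1   # outgoing register bit XOR incoming data bit
    state >>= 1
    if feedback:
        state ^= taps              # feed the bit back at the tap positions
    return state


def lfsr_crc(data, taps, state=0):
    """Feed a byte string through the LFSR one bit at a time."""
    for byte in data:
        for k in range(8):         # least significant bit first (assumed)
            state = lfsr_shift(state, (byte >> k) & 1, taps)
    return state


print(hex(lfsr_crc(b"123456789", CCITT_TAPS)))
print(hex(lfsr_crc(b"123456789", CRC16_TAPS)))
```

Each input bit costs one shift, which is why the serial form suits bit-stream data but not byte-wide or word-wide data paths.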
 
 
 
Figure 15. LFSR implementation of the CRC-16. 
5.2. CRC Parallel Implementation 
With the currently available high-speed digital signal processing (DSP) systems, data is
processed in byte, word, double-word, or larger widths rather than serially. Even with serial
telecommunication systems, data is buffered in the chips responsible for synchronization and
framing. For parallel implementation, the data is available in 8-bit frames at a manageable
speed [72]. A one-channel parallel CRC algorithm based on the LFSR approach is obtained by
considering the state of the circuit after every 8 shifts [73]. Tables 10 and 11 show the
resulting implementations of the CCITT CRC-16 and the CRC-16, respectively. The term
Register_i represents the LFSR internal register number "i", while XOR_j represents the output
of the XOR gate number "j", and XOR indicates the exclusive-OR operation. With the emergence
of highly scalable reconfigurable circuits, more implementation options are available. Along
with the byte-wise or word-wise CRC implementation, it is possible to implement parallel
channels, each with a byte-wise CRC implementation.
  
Table 10. The states of the registers after 8 shifts for the CCITT CRC-16 algorithm.
New values of the registers after 8 shifts, in terms of the outputs of the XOR gates
(⊕ denotes exclusive-OR)
XOR_i = Register_i ⊕ DataIn_i,   i = 0, 1, …, 7

Register0 = Register8 ⊕ XOR4 ⊕ XOR0
Register1 = Register9 ⊕ XOR5 ⊕ XOR1
Register2 = Register10 ⊕ XOR6 ⊕ XOR2
Register3 = Register11 ⊕ XOR0 ⊕ XOR7 ⊕ XOR3
Register4 = Register12 ⊕ XOR1
Register5 = Register13 ⊕ XOR2
Register6 = Register14 ⊕ XOR3
Register7 = Register15 ⊕ XOR4 ⊕ XOR0
Register8 = XOR0 ⊕ XOR5 ⊕ XOR1
Register9 = XOR1 ⊕ XOR6 ⊕ XOR2
Register10 = XOR2 ⊕ XOR7 ⊕ XOR3
Register11 = XOR3
Register12 = XOR4 ⊕ XOR0
Register13 = XOR5 ⊕ XOR1
Register14 = XOR6 ⊕ XOR2
Register15 = XOR7 ⊕ XOR3
Table 11. The states of the registers after 8 shifts for the CRC-16 algorithm.
New values of the registers after 8 shifts, in terms of the outputs of the XOR gates
(⊕ denotes exclusive-OR)
XOR_i = Register_i ⊕ DataIn_i,   i = 0, 1, …, 7

X = XOR0 ⊕ XOR1 ⊕ … ⊕ XOR7

Register0 = Register8 ⊕ X
Register1 = Register9
Register2 = Register10
Register3 = Register11
Register4 = Register12
Register5 = Register13
Register6 = Register14 ⊕ XOR0
Register7 = Register15 ⊕ XOR1 ⊕ XOR0
Register8 = XOR3 ⊕ XOR2
Register9 = XOR4 ⊕ XOR3
Register10 = XOR5 ⊕ XOR4
Register11 = XOR6 ⊕ XOR5
Register12 = XOR7 ⊕ XOR6
Register13 = XOR8 ⊕ XOR7
Register14 = XOR8 ⊕ X
Register15 = X
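A software check of the 8-shift CRC-16 update is sketched below. It is derived here from a right-shifting LFSR with taps 0xA001, with Register i taken as bit i and DataIn0 as the first bit shifted in; these are assumptions, not statements from the source. The sketch numbers the XOR-gate outputs from 0 in every row. The rows for Register8 to Register14 of Table 11 read as if the outputs were numbered from 1 (hence the term XOR8), so the sketch uses those indices shifted down by one. The helper names are hypothetical.

```python
def to_bits(value, width):
    """Bit i of value becomes element i of the list."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def crc16_after_8_shifts(reg, data_in):
    """New register values after 8 shifts (0-based XOR indices throughout)."""
    x = [reg[i] ^ data_in[i] for i in range(8)]   # XOR_i
    big_x = 0
    for bit in x:
        big_x ^= bit                              # X = XOR0 ⊕ XOR1 ⊕ … ⊕ XOR7
    new = [0] * 16
    new[0] = reg[8] ^ big_x
    for i in range(1, 6):
        new[i] = reg[8 + i]                       # Registers 1..5 are plain shifts
    new[6] = reg[14] ^ x[0]
    new[7] = reg[15] ^ x[1] ^ x[0]
    for i in range(8, 14):
        new[i] = x[i - 6] ^ x[i - 7]              # Register8 = XOR2 ⊕ XOR1, and so on
    new[14] = x[7] ^ big_x
    new[15] = big_x
    return new


def serial_8_shifts(state, byte):
    """Reference: eight single-bit shifts of a right-shifting LFSR (assumed)."""
    for k in range(8):
        feedback = (state ^ (byte >> k)) & 1
        state >>= 1
        if feedback:
            state ^= 0xA001
    return state


def crc16_bytewise(data, state=0):
    for byte in data:
        state = from_bits(crc16_after_8_shifts(to_bits(state, 16), to_bits(byte, 8)))
    return state


print(hex(crc16_bytewise(b"123456789")))
```

The value X is shared by Register0, Register14, and Register15, which is the redundancy that a parallel mapping can compute once and reuse.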
5.3. Mapping CRC Algorithms 
From the point of view of the underlying architecture, mapping any algorithm onto the proposed
reconfigurable system requires in-depth knowledge of all the available interconnection
topologies. Moreover, the designer should take into consideration the possibility of
dynamically changing the shape of the interconnection. From the algorithmic point of view, the
design of a parallel version of the addressed algorithms requires the best use of resources with
the least possible redundancy in computations.
 
 
 
5.3.1. The CCITT CRC-16 Algorithm Mapping 
 
The mapping of the parallel CCITT CRC-16 algorithm exploits the computations that are
repeated in several steps of the algorithm. Firstly, the values of XOR_i for all values of i
from 0 to 7 are calculated. Table 10 contains computations that are used more than once; in
particular, the repeated values are (XOR_{i+4} ⊕ XOR_i) for i from 0 to 3. Thus, the second
computation step involves registers 4, 5, 6, 11, 12, 13, 14, and 15, which depend only on the
results found in the first step; registers 12 to 15 are exactly the repeated values. In the
final step, the computations for registers 0, 1, 2, 3, 7, 8, 9, and 10 are carried out using
the results calculated in the first and second steps.
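The three computation steps can be sketched as ordinary code, with the repeated values computed once and reused. This is an illustration of the schedule only, not of the RC array mapping; it assumes Register i is bit i of a right-shifting LFSR, and the names ccitt_three_steps and crc_ccitt are hypothetical.

```python
def ccitt_three_steps(reg, data_in):
    """One 8-shift CCITT CRC-16 update, organised in three dependent steps."""
    # Step 1: XOR_i = Register_i ⊕ DataIn_i for i = 0..7.
    x = [reg[i] ^ data_in[i] for i in range(8)]

    # Step 2: everything that needs only the step 1 results.
    pair = [x[i + 4] ^ x[i] for i in range(4)]   # the repeated values
    new = [None] * 16
    new[12:16] = pair                            # Registers 12..15
    new[11] = x[3]                               # Register11
    for i in (4, 5, 6):
        new[i] = reg[i + 8] ^ x[i - 3]           # Registers 4..6

    # Step 3: everything that reuses the repeated values.
    for i in (0, 1, 2):
        new[i] = reg[i + 8] ^ pair[i]            # Registers 0..2
        new[i + 8] = x[i] ^ pair[i + 1]          # Registers 8..10
    new[3] = reg[11] ^ x[0] ^ pair[3]            # Register3
    new[7] = reg[15] ^ pair[0]                   # Register7
    return new


def crc_ccitt(data, state=0):
    """Run the three-step update over a byte string."""
    for byte in data:
        reg = [(state >> i) & 1 for i in range(16)]
        din = [(byte >> i) & 1 for i in range(8)]
        state = sum(bit << i for i, bit in enumerate(ccitt_three_steps(reg, din)))
    return state


print(hex(crc_ccitt(b"123456789")))
```

Within each step the individual results are independent of one another, so each step can be evaluated by a group of cells working in parallel.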
The algorithm mapping is explained by introducing the three sets of code that are needed. The
first set is the interconnection context words. The context word used in this algorithm is that
for XOR with column broadcast, where each cell XORs two inputs from frame buffers A and B.
This context word is stored at address 30000hex.
The second set is the input data and the initial data in the circuit's registers. These
two sets of data are stored at main memory addresses 10000hex and 20000hex, respectively.
The third set is the TinyRISC code, which is the main code. This code and its discussion
are shown in Table 12. The main steps of the addressed algorithm are shown in Figures 16 and
17. The final contents of frame buffer A are shown in Figure 18.
Table 12. The TinyRISC code for the CRC CCITT-16 Algorithm. 
ldui r1, 0x1 ; R1 ← 10000hex. This is where DataIn is stored.
ldfb r1, 0, 0, 2 ; FB ← 2 x 32 bits at set 0, bank A, address 0.
add r0, r0, r0 ; No-operation.
. . .
ldui r2, 0x2 ; R2 ← 20000hex. This is where the circuit registers are stored.
ldfb r2, 1, 0, 2 ; FB ← 2 x 32 bits at set 0, bank B, address 0.
add r0, r0, r0 ; No-operation.
. . .
ldui r3, 0x3 ; R3 ← 30000hex. This is where the context word is stored in main memory.
ldctxt r3, 0, 0, 0, 1 ; Load 1 context word from main memory, starting at the address stored in register 3, into plane 0, block 0, starting at word 0.
add r0, r0, r0 ; No-operation.
. . .
ldui r4, 0x0 ; R4 ← 00000hex.
dbcdc r4, 0, 0, 0, 0, 0, 0 ; Double bank column broadcast. It sends data from both banks at address 0 in the FB and broadcasts the context words column-wise. It triggers the RC array to start execution of column 0 by the context word at address 0 in the column block of context memory, operating on data in set 0. Bank A starting at 0. Bank B starting at (0 + 0).
wfbi 0, 0, 0, 0, 0x0 ; Write data back to FB A from the output registers of column 0 into set 0, address 0. Results for XOR_i. The value of Register11.
wfbi 0, 0, 1, 0, 10hex ; Write data back to FB B from the output registers of column 0 into set 0, address 10hex.
dbcdc r4, 14hex, 0, 0, 0, 1, 0 ; Double bank column broadcast. Bank A starting at 0x0. Bank B starting at (0x0 + 14hex), i.e. shifted 4 words from the starting point of XOR0. The items calculated in this operation are the repeatedly used operations of Table 10.
wfbi 0, 0, 0, 0, 30hex ; Write data back to FB A from the output registers of column 0 into set 0, address 30hex. Values for circuit registers [12..15] are now available in FB A starting from address 30hex.
dbcdc r4, 8hex, 0, 0, 0, 0, 30hex ; Double bank column 0 broadcast. Bank A starting at 30hex. Bank B starting at (0x0 + 8hex). Thus, the values of Registers [0..2] are now available.
dbcdc r4, Chex, 0, 1, 0, 0, 1hex ; Double bank column 1 broadcast. Bank A starting at 1hex. Bank B starting at (0x0 + Chex). Thus, the values of Registers [4..6] are now available.
dbcdc r4, 10hex, 0, 2, 0, 0, 31hex ; Double bank column 2 broadcast. Bank A starting at 31hex. Bank B starting at (0x0 + 10hex). Thus, the values of Registers [8..10] are now available.
 
  
wfbi 0, 0, 0, 0, 40hex ; Write data back to FB A from the output registers of column 0 into set 0, address 40hex. Values for circuit registers [0..2].
wfbi 1, 0, 0, 0, 50hex ; Write data back to FB A from the output registers of column 1 into set 0, address 50hex. Values for circuit registers [4..6].
wfbi 2, 0, 0, 0, 55hex ; Write data back to FB A from the output registers of column 2 into set 0, address 55hex. Values for circuit registers [8..10].
dbcdc r4, 10hex, 0, 0, 0, 0, 43hex ; Double bank column 0 broadcast. Bank A starting at 43hex. Bank B starting at (0x0 + 10hex). The output calculated here is the XOR of XOR0, XOR7, and XOR3; the value is in cell 0 of column 0.
wfbi 0, 0, 0, 0, 60hex ; Write data back to FB A from the output registers of column 0 into set 0, address 60hex.
dbcdc r4, Chex, 0, 0, 0, 0, 60hex ; Double bank column 0 broadcast. Bank A starting at 60hex. Bank B starting at (0x0 + Chex). Thus, the value of Register3 is now calculated.
wfbi 0, 0, 0, 0, 60hex ; Write data back to FB A from the output registers of column 0 into set 0, address 60hex.
dbcdc r4, 10hex, 0, 0, 0, 0, 30hex ; Double bank column 0 broadcast. Bank A starting at 30hex. Bank B starting at (0x0 + 10hex). Thus, the value of Register7 is now calculated.
wfbi 0, 0, 0, 0, 65hex ; Write data back to FB A from the output registers of column 0 into set 0, address 65hex.
 
Rows | Columns   C0                     C1  C2  C3  C4  C5  C6  C7
R0               Register0 ⊕ DataIn0    .   .   .   .   .   .   .
R1               Register1 ⊕ DataIn1    .   .   .   .   .   .   .
R2               Register2 ⊕ DataIn2    .   .   .   .   .   .   .
R3               Register3 ⊕ DataIn3    .   .   .   .   .   .   .
R4               Register4 ⊕ DataIn4    .   .   .   .   .   .   .
R5               Register5 ⊕ DataIn5    .   .   .   .   .   .   .
R6               Register6 ⊕ DataIn6    .   .   .   .   .   .   .
R7               Register7 ⊕ DataIn7    .   .   .   .   .   .   .
Figure 16. RC array contents after calculating XOR_i.
Rows | Columns   C0             C1  C2  C3  C4  C5  C6  C7
R0               XOR4 ⊕ XOR0    .   .   .   .   .   .   .
R1               XOR5 ⊕ XOR1    .   .   .   .   .   .   .
R2               XOR6 ⊕ XOR2    .   .   .   .   .   .   .
R3               XOR7 ⊕ XOR3    .   .   .   .   .   .   .
R4               .              .   .   .   .   .   .   .
R5               .              .   .   .   .   .   .   .
R6               .              .   .   .   .   .   .   .
R7               .              .   .   .   .   .   .   .
Figure 17. RC array contents for the second computation step for CRC-CCITT-16.
Address (hex)   Frame Buffer A               Address (hex)   Frame Buffer A
0               DataIn0                      40              Register0
1               DataIn1                      41              Register1
2               DataIn2                      42              Register2
3               DataIn3                      .               .
4               DataIn4                      50              Register4
5               DataIn5                      51              Register5
6               DataIn6                      52              Register6
7               DataIn7                      .               .
.               .                            55              Register8
10              XOR0                         56              Register9
11              XOR1                         57              Register10
12              XOR2                         .
13              XOR3 / Register11            60              Register3
14              XOR4                         .
15              XOR5                         65              Register7
16              XOR6                         .               .
17              XOR7                         .               .
.               .                            .               .
30              XOR4 ⊕ XOR0 / Register12     .               .
31              XOR5 ⊕ XOR1 / Register13     .               .
32              XOR6 ⊕ XOR2 / Register14     .               .
33              XOR7 ⊕ XOR3 / Register15     .               .
Figure 18. Contents of frame buffer A after the algorithm terminates. After one computation
step, the new register values are found at the specified locations.
5.3.2. The CRC-16 Algorithm Mapping 
 
The mapping of this algorithm also depends on eliminating redundant computations, in addition
to computing the required values in parallel. The mapping has three steps. Firstly, the
values of XOR_i for all values of i from 0 to 7 are calculated. Table 11 contains computations
that are used more than once, in particular the value X = (XOR0 ⊕ … ⊕ XOR7).
Thus, the second computation step calculates X. Thirdly, the rest of the values are calculated
in parallel.
Table 13. The TinyRISC code for the CRC-16 Algorithm. 
ldui r1, 0x1 ; R1 ← 10000hex. This is where DataIn is stored.
ldfb r1, 0, 0, 2 ; FB ← 2 x 32 bits at set 0, bank A, address 0.
add r0, r0, r0 ; No-operation.
. . .
ldui r2, 0x2 ; R2 ← 20000hex. This is where the circuit registers are stored.
ldfb r2, 1, 0, 2 ; FB ← 2 x 32 bits at set 0, bank B, address 0.
add r0, r0, r0 ; No-operation.
. . .
ldui r3, 0x3 ; R3 ← 30000hex. This is where the context words are stored in main memory.
ldctxt r3, 0, 0, 0, 3 ; Load 3 context words from main memory, starting at the address stored in register 3, into plane 0, block 0, starting at word 0.
add r0, r0, r0 ; No-operation.
. . .
ldui r4, 0x0 ; R4 ← 00000hex.
dbcdc r4, 0, 0, 0, 0, 0, 0 ; Double bank column broadcast. It sends data from both banks at address 0 in the frame buffer and broadcasts the context words column-wise. It triggers the RC array to start execution of column 0 by the context word at address 0 in the column block of context memory, operating on data in set 0. Bank A starting at 0x0. Bank B starting at (0x0 + 0).
wfbi 0, 0, 0, 0, 0x0 ; Write data back to FB A from the output registers of column 0 into set 0, address 0. Results for XOR_i.
wfbi 0, 0, 1, 0, 10hex ; Write data back to FB B from the output registers of column 0 into set 0, address 10hex.
sbcb 0, 0, 0, 1, 0, 0, 0hex ; Single bank column broadcast, causing all the cells in column 0 of the RC array to perform the operations specified by the second context word in context memory, starting with data from the frame buffer, set 0, bank A.
sbcb 0, 1, 0, 1, 0, 0, 1hex ; It sends data from address 1hex in the FB. Column 1.
sbcb 0, 2, 0, 1, 0, 0, 2hex ; It sends data from address 2hex in the FB. Column 2.
sbcb 0, 3, 0, 1, 0, 0, 3hex ; It sends data from address 3hex in the FB.
sbcb 0, 4, 0, 1, 0, 0, 4hex ; It sends data from address 4hex in the FB.
sbcb 0, 5, 0, 1, 0, 0, 5hex ; It sends data from address 5hex in the FB.
sbcb 0, 6, 0, 1, 0, 0, 6hex ; It sends data from address 6hex in the FB; the new value of Register14.
sbcb 0, 7, 0, 1, 0, 0, 7hex ; It sends data from address 7hex in the FB; the new value of Register15.
wfbi 6, 0, 0, 0, 30hex ; Write data back to FB A from the output registers of column 6 into set 0, address 30hex. Value of Register14.
wfbi 7, 0, 0, 0, 31hex ; Write data back to FB A from the output registers of column 7 into set 0, address 31hex. Value of Register15.
dbcdc r4, 13hex, 0, 0, 0, 0, 2hex ; Double bank column broadcast. Bank A starting at 2hex. Bank B starting at 13hex. New values for Registers [8..13] are now available.
wfbi 0, 0, 0, 0, 35hex ; Write data back to FB A from the output registers of column 0 into set 0, address 35hex.
dbcdc r4, 8hex, 0, 0, 0, 0, 31hex ; Double bank column broadcast. It sends data from Bank A starting at 31hex. Bank B starting at 8hex. New value for Register0.
wfbi 0, 0, 0, 0, 45hex ; Write data back to FB A from the output registers of column 0 into set 0, address 45hex.
dbcdc r4, 14hex, 0, 0, 0, 0, 0hex ; Double bank column broadcast. It sends data from Bank A starting at 0hex. Bank B starting at 14hex. New value for Register6.
wfbi 0, 0, 0, 0, 50hex ; Write data back to FB A from the output registers of column 0 into set 0, address 50hex.
dbcdc r4, 10hex, 0, 0, 0, 0, 35hex ; Double bank column broadcast. It sends data from Bank A starting at 35hex. Bank B starting at 10hex. New value for Register7.
wfbi 0, 0, 0, 0, 55hex ; Write data back to FB A from the output registers of column 0 into set 0, address 55hex.
 
The algorithm mapping is explained by introducing the three sets of code that are needed. The
first set is the interconnection context words. Two kinds of context word are used in this
algorithm. The first is that for XOR with column broadcast, where each cell XORs two inputs
from frame buffers A and B. The second uses the same cell operation but takes one input from
the frame buffer and the other from the output of the left adjacent cell. The context words are
stored at address 30000hex. The second set is the input data and the initial data in the
circuit's registers. These two sets of data are stored at main memory addresses 10000hex and
20000hex, respectively. The third set is the TinyRISC code, which is the main code. This code
and its discussion are shown in Table 13. The main steps of the addressed algorithm are shown
in Figures 19 and 20. The final contents of frame buffer A are shown in Figure 21.
 
Rows | Columns   C0                     …   C7
R0               Register0 ⊕ DataIn0    .   .
R1               Register1 ⊕ DataIn1    .   .
R2               Register2 ⊕ DataIn2    .   .
R3               Register3 ⊕ DataIn3    .   .
R4               Register4 ⊕ DataIn4    .   .
R5               Register5 ⊕ DataIn5    .   .
R6               Register6 ⊕ DataIn6    .   .
R7               Register7 ⊕ DataIn7    .   .
Figure 19. RC array contents after calculating XOR_i.
 
Rows | Columns   C0     C1            C2  C3  C4  C5  C6  C7
R0               XOR0   XOR0 ⊕ XOR1   .   .   .   .   .   XOR0 ⊕ … ⊕ XOR7
R1               .      .             .   .   .   .   .   .
R2               .      .             .   .   .   .   .   .
R3
R4
R5
R6
R7
Figure 20. RC array contents for the second computation step for CRC-16.
Address (hex)   Frame Buffer A                    Address (hex)   Frame Buffer A
0               DataIn0                           .               .
1               DataIn1                           36              Register8
2               DataIn2                           37              Register9
3               DataIn3                           38              Register10
4               DataIn4                           39              Register11
5               DataIn5                           40              Register12
6               DataIn6                           41              Register13
7               DataIn7                           .               .
.               .                                 45              Register0
10              XOR0                              .               .
11              XOR1                              50              Register6
12              XOR2                              .               .
13              XOR3                              55              Register7
14              XOR4                              .               .
15              XOR5                              .               .
16              XOR6                              .               .
17              XOR7                              .               .
.               .                                 .               .
30              XOR0 ⊕ … ⊕ XOR7 / Register14      .               .
31              XOR0 ⊕ … ⊕ XOR8 / Register15      .               .
Figure 21. Contents of frame buffer A after the algorithm terminates. After one computation
step, the new register values are found at the specified locations.
5.4. Performance Evaluation and Analysis 
The cycle time of the MorphoSys is 10 nsec. The algorithm in Table 12 (the CRC-CCITT-16
parallel algorithm for a single channel) takes 30 cycles to complete. Since each run processes
8 bits, its speed is 0.267 bits/cycle, i.e. 3.75 cycles for each bit. The time for the
algorithm to terminate is 0.3 µsec, and the data rate is 26.67 Mbps.
The algorithm in Table 13 (the CRC-16 parallel algorithm for a single channel) takes 26 cycles
to terminate. Its speed is 0.307 bits/cycle, i.e. 3.25 cycles for each bit. The time for the
algorithm to terminate is 0.26 µsec, and the rate in mega bits per second (Mbps) is 30.76 Mbps.
The MorphoSys can process the input of up to 8 channels in parallel. The results are shown
in Table 14.
Table 14. Table of results.
Algorithm      No. of Channels   No. of Cycles   Time (µsec)   Bits per Cycle   Mega Bits per Second   Cycles per Bit
CRC-CCITT-16   1                 30              0.3           0.267            26.67                  3.75
CRC-16         1                 26              0.26          0.307            30.76                  3.25
CRC-CCITT-16   8                 30              0.3           2.13             213.33                 0.46
CRC-16         8                 26              0.26          2.46             246.15                 0.41
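The arithmetic behind these figures can be reproduced with a short sketch. It assumes that each channel contributes one 8-bit frame per run and that the cycle time is 10 nsec, as stated in the text; the function name throughput and the dictionary keys are hypothetical.

```python
CYCLE_TIME_NS = 10        # MorphoSys cycle time stated in the text
BITS_PER_CHANNEL = 8      # one byte per channel per run (assumption)


def throughput(cycles, channels):
    """Derive the performance figures from a cycle count and a channel count."""
    bits = BITS_PER_CHANNEL * channels
    time_us = cycles * CYCLE_TIME_NS / 1000      # nsec to µsec
    return {
        "time_us": time_us,
        "bits_per_cycle": bits / cycles,
        "mbps": bits / time_us,                  # bits per µsec equals Mbps
        "cycles_per_bit": cycles / bits,
    }


for name, cycles in (("CRC-CCITT-16", 30), ("CRC-16", 26)):
    for channels in (1, 8):
        r = throughput(cycles, channels)
        print(name, channels, round(r["time_us"], 2), round(r["bits_per_cycle"], 3),
              round(r["mbps"], 2), round(r["cycles_per_bit"], 2))
```

Because the channels are processed in the same number of cycles, the run time stays constant while the bit rate scales with the number of channels.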
6. Conclusion 
This chapter has introduced reconfigurable computers as powerful modern supercomputing
architectures. The chapter has also investigated the main reasons behind the current
advancement in the development of reconfigurable systems. A technical survey of various
reconfigurable systems is included, laying common ground for comparisons. In addition, this
chapter has presented case studies implemented under the MorphoSys. The selected case
studies belong to computer graphics and information coding. Parallel versions of the studied
algorithms are developed to match the topologies supported by the MorphoSys. Performance
evaluation and analyses of the results are included for implementations with different
characteristics.
References 
[1] Xilinx. http://www.xilinx.com. 
[2] Altera. http://www.altera.com.
[3] M. Gokhale; W. Holmes; A. Kopser; S. Lucas; R. Minnich; D. Sweely; and D. Lopresti.
IEEE Computer. 1991, 81-89.
[4] P. Bertin, D. Roncin; J. Vuillemin. Introduction to Programmable Active Memories, in 
Systolic Array Processors, Prentice-Hall: Englewood Cliffs NJ, 1989; 300-309. 
 
[5] E. Tau; D. Chen; I. Eslick; J. Brown; and A. DeHon. A first generation DPGA 
implementation. Canadian Workshop of Field-Programmable Devices, 1995. 
[6] J. Hauser; J. Wawrzynek. Garp: A MIPS processor with a reconfigurable coprocessor.
IEEE Symposium on FPGAs for Custom Computing Machines. 1997.
[7] C. Ebeling; D. Cronquist; and P. Franklin. Configurable computing: The catalyst for high-
performance architectures. IEEE Conference on Application-specific Systems, 1997, 364-
72. 
[8] G. Lu; H. Singh; M. Lee; N. Bagherzadeh; and F. Kurdahi. Parallel reconfigurable
system. EuroPar Conference Parallel Processing, 1999.
[9] J. Babb; M. Frank; V. Lee; E. Waingold; R. Barua; M. Taylor; J. Kim; S. Devabhaktuni; 
and A. Agrawal. The RAW benchmark suite: computation structures for general-purpose 
computing. IEEE Conference on Application-specific Systems, 1997, 134-43. 
[10] R. Hartenstein; R. Kress. A datapath synthesis system for the reconfigurable datapath 
architecture. Asia and South Pacific Design Automation Conference. 1995, 479-484. 
[11] E. Mirsky and A. DeHon. Matrix: A re-configurable computing architecture with 
configurable instruction distribution. IEEE Symposium on FPGAs for Custom Computing 
Machines. 1996, 157-66. 
[12] T. Miyamori and K. Olukotun. A quantitative analysis of reconfigurable coprocessors for 
multimedia applications. IEEE Symposium on Field Programmable Custom Computing 
Machines, 1998. 
[13] M. Wirthlin and B. Hutchings. A dynamic instruction set computer. IEEE Workshop on 
FPGAs for Custom Computing Machines. 1995, 99-107. 
[14] C. Iseli and E. Sanchez. A C++ compiler for FPGA custom execution units synthesis. 
IEEE Workshop on FPGAs for Custom Computing Machines, 1995, 173–179. 
[15] L. Agarwal, M.Wazlowski, and S. Ghosh. An asynchronous approach to efficient 
execution of programs on adaptive architectures utilizing FPGAs. IEEE Workshop on 
FPGAs for Custom Computing Machines. 1994, 101-110. 
[16] M. Gokhale; W. Holmes; A. Kopser; S. Lucas; R. Minnich; D. Sweely; D.Lopresti; 
Building and Using a Highly Programmable Logic Array. IEEE Computer, 1991, 81-89. 
[17] Brey B. The Intel Microprocessors 8086/8088, 80186/80188, 80286, 80386,
80486, Pentium, and Pentium Pro Processor: Architecture, Programming, and
Interfacing. Book News Inc: Portland OR. 1999.
[18] Estrin G.; Bussell B.; Turn R.; Bibb J. IEEE Transactions on Electronic Computers. 1963: 
12, 747-755. 
[19] Lipovski G.; Malek M. Parallel Computing: Theory and Comparisons, John Wiley &
Sons: New York, 1987.
[20] Bertin P.; Roncin D.; Vuillemin J.; Introduction to Programmable Active Memories, 
ftp://ftp.digital.com/pub/DEC/PRL/research-reports/ PRL-RR-3.pdf   
[21] Lucent, FPGA Data Book, www.alcatel-lucent.com.  
[22] Radunovic B.; Milutinovic V. A survey of reconfigurable computing architectures. 
Workshop on Field Programmable Logic and Applications, 1998. 
[23] Radunovic B. An overview of advances in reconfigurable computing systems. Hawaii 
International Conference on System Sciences, 1999, 3, 3039. 
[24] Cadambi S.; Weener J.; Goldstein S.; Schmit H.; Thomas D. Managing pipeline-
reconfigurable FPGAs. Symposium on Field Programmable Gate Arrays, 1998, 55-64. 
[25] Lu G.; Singh H.; Lee M.; Bagherzadeh N.; Kurdahi F.; Filho E.; and Alves V. The 
MorphoSys Dynamically Reconfigurable System-on-Chip. Workshop on Evolvable 
Hardware. 1999, 152-160. 
[26] Lu G.; Singh H.; Lee M.; Bagherzadeh N.; Kurdahi F. The MorphoSys Parallel 
Reconfigurable System. Euro-Par Conference - Parallel Processing. 1999, 727. 
[27] Singh H.; Lee M.; Lu G.; Kurdahi F.; Bagherzadeh. MorphoSys: A Reconfigurable 
Architecture for Multimedia Applications. Symposium on Integrated Circuit Design. 
1998, 573 -578. 
[28] Celoxica. www.celoxica.com. 
[29] Elixent. www.elixent.com. 
[30] Actel. www.actel.com. 
[31] NALLATECH. www.nallatech.com. 
[32] Tag X.; Aalsma M.; and Jou R.; A compiler directed approach to hiding configuration 
latency in chameleon processors. Conference on Field-Programmable Logic and 
Applications, 2000. 
[33] MorphTech. www.morphotech.com. 
[34] Intel. www.intel.com. 
[35] Swaminathan S.; Tessier R.; Goeckel D.; and Burleson W. A Dynamically Reconfigurable 
Adaptive Viterbi Decoder. Symposium on Field Programmable Gate Arrays. 2002. 
[36] Mark S. Multiplication substitutions enable fast algorithm implementations in FPGAs. 
Personal Engineering, 1995. 
[37] Cook T.; Kim H.; Louca L. Hardware Acceleration of n-body simulations for galactic 
dynamics, Field Programmable Gate Arrays. Fast Board Development and 
Reconfigurable Computing. 1995, 2607, 12. 
[38] Paul C.; Personal Engineering, 1995.  
[39] Ashok B. FPGAS moving to boost DSP applications. Electronic Engineering Times. 
1995, 837.  
[40] Wei W.; Swamy M.; Ahmad M. Low power FIR filter FPGA implementation based on 
distributed arithmetic and residue number system. Symposium on Circuits and Systems. 
2001, 102-105. 
[41] Kah-Howe T.; Wen Fung L.; Kaluri K.; Soderstrand M., Johnson L. FIR filter design 
program that matches specifications rather than filter coefficients results in large savings 
in FPGA resources. Asilomar Conference on Signals, Systems and Computers. 2001, 
1349-1352. 
[42] Evans J. An efficient FIR filter architecture. IEEE Symposium on Circuits and Systems. 
1993, 627-630. 
[43] Mohanakrishnan S.; Evans J. IEEE Signal Processing Letters. 1995, 2, 3, 50-53. 
[44] Erdogan S.; Wahab A.; Jainandunsing K. Reconfigurable computing for multimedia DSP 
applications, Field Programmable Gate Arrays. Fast Board Development and 
Reconfigurable Computing. 1995, 2607, 1. 
[45] Rosenberg J. (1997). Application Note: DSP Acceleration Using a Reconfigurable 
Coprocessor FPGA. Atmel Corporation. www.atmel.com/atmel/acrobat/doc0724.pdf  
[46] Singh S.; Bellec P. Virtual Hardware for Graphics Applications using FPGAs. IEEE 
Workshop on FPGAs for Custom Computing Machines, 1994. 
 
[47] Abbott A.; Athanas P.; Tarmaster A. Accelerating image filters using custom computing 
machine, Field Programmable Gate Arrays. Fast Board Development and Reconfigurable 
Computing. 1995, 2607, 7. 
[48] Villasenor J.; Jones C.; Schoner B. IEEE Transactions on Circuits and Systems for Video 
Technology. 1995, 5, 565-567. 
[49] Sallinen S.; Alakuijala J.; Helminen H.; Laitinen J.; Oy E. A high-speed, voxel data 
processing. IEEE Computer Architecture, 1995, 2, 1041-1042. 
[50] Sproull T.; Lockwood J.; Taylor D. Control and Configuration Software for a 
Reconfigurable Networking Hardware Platform, Workshop on Cryptographic Hardware 
and Embedded Systems, 2002. 
[51] Elbirt A.; Yip W.; Chetwynd B.; Paar C. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems. 2001, 9, 545.
[52] Torkelsson K.; Ditmar J. Header Compression in Handel-C An Internet Application and 
A New Design Language. Symposium on Digital Systems Design. 2001, 2-7. 
[53] Runje D.; Kovac M. Universal strong encryption FPGA core implementation. 
Proceedings of Design, Automation and Test in Europe. 1998, 923-924. 
[54] Ploog H.;Timmermann D.  FPGA based architecture evaluation of cryptographic 
coprocessors for smartcards. IEEE Symposium on FPGAs for Custom Computing 
Machines. 1998, 292-293. 
[55] Kim I.; Steele S.; Koller J. A fully pipelined, 700 MBytes/s DES encryption. IEEE Ninth 
Great Lakes Symposium on core VLSI. 1999, 386-387. 
[56] Kean T.; Duncan A.  DES key breaking, encryption and decryption on the XC6216. IEEE 
Symposium on FPGAs for Custom Computing Machines. 1998, 310-311. 
[57] Gao L.; Lee H.; Sobelman G. A compact fast variable key size elliptic curve cryptosystem 
coprocessor. IEEE Symposium on Field-Programmable Custom Computing Machines, 
1999, 304-305. 
[58] Shand M.; Vuillemin J. Fast Implementations of RSA Cryptography. Symposium on 
Computer Architecture. 1993, 252-259 
[59] Wing K.; Yuk T.; Chan S. Implementation of the Data Encryption Standard Algorithm 
with FPGAs. More FPGAs. Abingdon EE&CS Books. 1994, 412-419. 
[60] Kim H.; Choi Y.; Kim M.; Ryu H. Hardware implementation of GPP KASUMI crypto 
algorithm. Conference on Circuits/Systems, Computers and Communications. 2002, 1, 
317-320. 
[61] IEEE, (1993) Standard VHDL reference manual, IEEE Standard 1076. www.ieee.org. 
[62] Page I.; Luk W. Compiling Occam into field-programmable gate arrays. Workshop on 
Field Programmable Logic and Applications. 1991, 271-283. 
[63] Saul J. Hardware/software codesign for FPGA-based systems, Conference on System 
Sciences. 1999, 3, 3040. 
[64] Luk W.; McKeever S. Pebble: a language for parameterized and reconfigurable hardware 
design. Field Programmable Logic and Applications. 1998, 1482, 9-18. 
[65] Todman T.; Luk W. Combining imperative and declarative hardware descriptions. 
Conference on System Sciences. 2003, 280. 
[66] Weinhardt M. Portable pipeline synthesis for FCCMs, Field Programmable Logic: Smart
Applications, New Paradigms and Compilers. 1996, 1-13.
[67] Najjar W.; Draper B.; Bohm W.; Beveridge R. The cameron project: High-level 
programming of image processing applications on reconfigurable computing machines. 
Workshop on Reconfigurable Computing. 1998. 
[68] Claessen K. PhD Thesis: Embedded Languages for Describing and Verifying Hardware. 
Chalmers University of Technology and Göteborg University. 2001. 
[69] Seither M.; Keller K. New quickturn design verification system major step forward in 
commercializing technology. Xilinx Inc. www.xilinx.com  
[70] Mangione-Smith W.; Hutchings B. Reconfigurable Architectures: High Performance by 
Configware, Microsystems Engineering Series, IT Press: Chicago. 1997, 81-96. 
[71] Foley J.; Van Dam A.; Feiner S.; Hughes J. Computer Graphics: Principles and Practice 
in C. Addison-Wesley: Redwood City, CA. 1995.  
[72] Lee R. Cyclic Code Redundancy. Digital Design. 1987.
[73] Perez A. Byte-wise CRC Calculations. IEEE Micro. 1983, 40-50. 
[74] I. Damaj, H. Diab, Performance Evaluation of Linear Algebraic Functions Using 
Reconfigurable Computing, The International Journal of Super Computing, Kluwer 
Academic Publishers, Springer, 2003. I 1, V 24, P 91 – 107. 
[75] H. Diab, I. Damaj, F. Kurdahi, Optimizing FIR Filter Mapping on MorphoSys, The 
International Journal of Parallel and Distributed Systems and Networks, ACTA press, 
2002. I 3, V 5, P 108 – 115. 
[76] H. Diab, I. Damaj, Cyclic Coding Algorithms Under MorphoSys Reconfigurable 
Computing System, Advances in Engineering Software, Elsevier Science, 2003. I 2, V 34, 
P 61 – 72.  
 
