Just-in-time Hardware generation for abstracted reconfigurable computing by Grocutt, Thomas Christopher
Durham E-Theses
Just-in-time Hardware generation for abstracted
reconfigurable computing
Grocutt, Thomas Christopher
How to cite:
Grocutt, Thomas Christopher (2005) Just-in-time Hardware generation for abstracted reconfigurable
computing, Durham theses, Durham University. Available at Durham E-Theses Online:
http://etheses.dur.ac.uk/2704/
Use policy
The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or
charge, for personal research or study, educational, or not-for-profit purposes provided that:
• a full bibliographic reference is made to the original source
• a link is made to the metadata record in Durham E-Theses
• the full-text is not changed in any way
The full-text must not be sold in any format or medium without the formal permission of the copyright holders.
Please consult the full Durham E-Theses policy for further details.
Academic Support Office, Durham University, University Office, Old Elvet, Durham DH1 3HP
e-mail: e-theses.admin@dur.ac.uk Tel: +44 0191 334 6107
http://etheses.dur.ac.uk
Just-In-Time Hardware 
Generation For Abstracted 
Reconfigurable Computing 
Thomas Christopher Grocutt 
The copyright of this thesis rests with the
author or the university to which it was
submitted. No quotation from it, or
information derived from it may be published
without the prior written consent of the author
or university, and any information derived
from it should be acknowledged.
A Thesis presented for the degree of 
Doctor of Philosophy 
Durham
University 
Centre for Electronic Systems 
School of Engineering 
University of Durham 
England 
2005 
1 OCT 2006 
Declaration 
The work in this thesis is based on research carried out in the Centre for Electronic 
Systems, School of Engineering, University of Durham, England. No part of this 
thesis has been submitted elsewhere for any other degree or qualification and it is 
all my own work unless otherwise referenced to the contrary in the text. 
Copyright (c) 2005 by Thomas Christopher Grocutt. 
The copyright of this thesis rests with the author. No quotation from it should be 
published in any format, including electronic and the Internet, without the author's
prior written consent. All information derived from this thesis must be acknowledged 
appropriately. 
Acknowledgements 
I would like to thank my project supervisor, Simon Johnson, for his time, help and 
encouragement; the staff in the engineering department for their continued help and 
the AIG group in the department of physics for allowing me to use their Cray XD1.
In particular, I would like to thank my father for his patience whilst proof reading
this thesis. I would also like to thank the open source programming team that
created LaTeX, thus saving me from Word.
Abstract 
This thesis addresses the use of reconfigurable hardware in computing platforms, in
order to harness the performance benefits of dedicated hardware whilst maintaining
the flexibility associated with software. Although the reconfigurable computing
concept is not new, the low level nature of the supporting tools normally used,
together with the consequent limited level of abstraction and resultant lack of
backwards compatibility, has prevented the widespread adoption of this technology. In
addition, bandwidth and architectural limitations have seriously constrained the
potential improvements in performance. A review of existing approaches and tool
flows is conducted to highlight the current problems being faced in this field.
The objective of the work presented in this thesis is to introduce a radically new
approach to reconfigurable computing tool flows. The runtime based tool flow
introduces complete abstraction between the application developer and the underlying
hardware. This new technique eliminates the ease of use and backwards compatibility
issues that have plagued the reconfigurable computing concept, and could pave
the way for viable mainstream reconfigurable computing platforms. An easy to
use, cycle accurate behavioural modelling system is also presented, which was used
extensively during the early exploration of new concepts and architectures. Some
performance improvements produced by the new reconfigurable computing tool flow,
when applied to both a MIPS based embedded platform and the Cray XD1, are also
presented. These results are then analyzed and the hardware and software factors
affecting the performance increases that were obtained are discussed, together with
potential techniques that could be used to further increase the performance of the
system.
Lastly, a heterogeneous computing concept is proposed, in which a computer system
containing multiple types of computational resource is envisaged, each having
its own strengths and weaknesses (e.g. DSPs, CPUs, FPGAs). A revolutionary
new method of fully exploiting the potential of such a system, whilst maintaining
scalability, backwards compatibility, and ease of use, is also presented.
Contents 
Acknowledgements
Abstract
List of Figures
List of Tables
List of Listings
Glossary
1 Introduction
1.1 Conventional Computing Architectures
1.1.1 Instruction Set Expansion
1.1.2 Dedicated Hardware
1.2 Reconfigurable Computing
1.2.1 Hardware Implementations
1.2.1.1 Direct Processor Integration
1.2.1.2 External RC Area
1.2.2 Tool Flows
1.2.2.1 Separate SW And HW Tool Flows
1.2.2.2 Common High Level Language
1.2.2.3 Low Level Tool Flows
1.2.3 Scheduling
1.2.4 Floating Point Maths
1.3 Research Conducted
1.4 Warp Processing
1.4.1 Hardware Platforms
1.4.2 Tool Flow
1.4.3 Performance Improvements
1.5 Summary
2 Behavioural Simulation
2.1 Background
2.2 CPUSim
2.2.1 Concurrency
2.2.2 Flexibility
2.2.3 Instantiation and Connection
2.2.4 Visualisation and Statistics
2.3 Summary
3 EPIC Simulation
3.1 CPU Architecture
3.1.1 Additional RC Hardware Blocks
3.1.1.1 Code Profiler
3.1.1.2 Reconfigurable Execution Unit
3.2 RC Conversion Algorithms
3.2.1 Instruction Combination
3.2.2 Loop Conversion
3.2.2.1 Data Flow Pipeline
3.3 Performance evaluation
3.3.1 Test Algorithms
3.3.1.1 Copy algorithm
3.3.1.2 Half Brightness Algorithm
3.3.1.3 Mandelbrot Algorithm
3.3.2 Loop Dependencies
3.4 Results
3.4.1 Copy algorithm
3.4.2 Half Brightness Algorithm
3.4.3 Mandelbrot Algorithm
3.4.4 HW Pipeline Implementation
3.4.5 Summary
4 Loop Conversion
4.1 Abstract Instruction Model
4.1.1 Supporting Additional ISAs
4.2 Targeting The Hardware Pipeline
4.3 Loop Identification
4.4 Instruction Linearization
4.4.1 MUX Insertion
4.4.2 Instruction Guarding
4.5 Optimization
4.5.1 Hardware Dependency Removal
4.5.2 Stack Removal
4.5.3 Iteration Dependency Removal
4.5.4 Instruction Removal
4.5.5 Tree Re-balancing
4.6 Pipeline Generation
4.6.1 Operation Scheduling
4.6.1.1 Pointer Aliasing
4.6.2 Data Forwarder Addition
4.6.3 Register Remapping
4.7 Target Implementation
4.7.1 Pipeline configuration generation
4.7.2 Program modification
5 MIPS Test Platform
5.1 Platform Details
5.1.1 RC Area Integration
5.1.2 Hardware
5.1.2.1 CPU/RC Area Interface
5.2 Software
5.2.1 Console Software
5.2.2 Data Transfer Software
5.3 Test Algorithms
5.3.1 PRBS Generator
5.3.2 FFT
5.3.3 Low Pass Filter
5.3.4 Normalization
5.3.5 Block Search
5.3.6 Mandelbrot
5.3.7 Half Brightness
5.3.8 Factorial and Series Sum
5.3.9 Copy
5.3.10 Sort
5.4 Performance Scalability
5.4.1 Bandwidth
5.4.2 Parallelism
5.4.3 Algorithm Complexity
5.4.3.1 Increased section size
5.4.3.2 Increased complexity
5.5 Platform Evaluation
5.5.1 Abstraction
5.5.2 Automatic conversion
5.5.3 Low conversion time
5.5.4 Large performance increase
5.6 Summary
6 Cray XD1 Platform
6.1 Cray XD1 Overview
6.2 Platform Details
6.2.1 Execution
6.2.2 Tool flow
6.2.3 Memory Access
6.3 Performance Evaluation
6.3.1 PRBS Generator
6.3.2 Half Brightness
6.3.3 Low Pass Filter
6.3.4 Normalization
6.3.5 Copy
6.3.6 Series Sum
6.4 MIPS RC Platform Comparison
6.5 Summary
6.5.1 Cray XD1 Platform Limitations
6.5.2 Clock Speeds
6.5.3 Performance Improvements
7 Optimisations Of The Reconfigurable Computing System
7.1 Hardware Conversion Tools
7.1.1 Loop Extraction
7.1.2 Floating Point Operations
7.1.3 Optimization
7.1.4 Scheduling
7.1.4.1 Operation Variants
7.1.4.2 Local Feedback
7.1.4.3 DMA Operation Scheduling
7.1.4.4 FPGA Tool Integration
7.1.5 Hardware Software Integration
7.2 Platforms
7.2.1 MIPS
7.2.2 XD1
7.2.3 Benchmark Algorithms
7.3 An Ideal Reconfigurable Computing Platform
7.3.1 Processor Integration
7.3.2 Homogeneous RC Area
7.3.2.1 Partial Reconfigurability
7.3.2.2 Homogeneous Structure
7.3.2.3 Configuration Controller
7.3.2.4 Specialized Hardware
7.3.2.5 Design For Place And Route
7.3.2.6 Clock Domains
7.3.3 Memory Sub-System
7.3.4 Code Profiling
7.3.5 Hardware Scheduling
7.4 Heterogeneous Computing
7.5 Future Research
7.6 Summary
8 Conclusion
Bibliography
A MIPS Test Algorithms Source Code
A.1 PRBS Generator (Standard)
A.2 PRBS Generator (Unrolled)
A.3 FFT
A.4 Low Pass Filter
A.5 Normalization
A.6 Block Search (Planar)
A.7 Block Search (Packed)
A.8 Mandelbrot
A.9 Half Brightness
A.10 Factorial and Series Sum
A.11 Copy
A.12 Sort
List of Figures 
1.1 Separate SW and HW tool chains
1.2 Common high level language
1.3 Low level tool chain
1.4 Warp processor architecture
2.1 Single function clocking
2.2 Dual function clocking
2.3 CPUSim simulating an experimental processor, showing the contents of the instruction cache and a Mandelbrot fractal generated by software running on the simulated processor
3.1 EPIC CPU core with additional RC blocks shown in red
3.2 EPIC CPU RC tool flow
3.3 Example RC data flow pipeline
3.4 Performance improvement of copy algorithm with loop extraction
3.5 Performance improvement of half brightness algorithm with instruction combination
3.6 Performance improvement of half brightness algorithm with loop extraction
3.7 Performance improvement of Mandelbrot algorithm with instruction combination
3.8 Performance improvement of Mandelbrot algorithm with loop extraction
3.9 FPGA tiles used to implement real hardware data flow pipeline
4.1 Simplified Loop conversion block diagram
4.2 Example operation scheduling onto data flow pipeline
4.3 Loop constant and stage 0 shift register arrangement
5.1 FPGA tiles used to implement MIPS CPU, RC pipeline, and peripherals
5.2 MIPS system block diagram
5.3 Example logic analyzer trace showing RC pipeline execution
5.4 USB data transfer application displaying the contents of the data buffer as an image
5.5 Logic analyzer trace showing RC pipeline execution of FFT algorithm
5.6 Pixel graphics formats
5.7 Logic analyzer trace showing RC pipeline execution of half brightness algorithm
5.8 Logic analyzer trace showing RC pipeline execution of quick sort algorithm
5.9 Generalized effects of parallelism on performance (Software vs Hardware)
5.10 Picture processing and compression
6.1 Cray XD1 blade architecture
6.2 Cray FPGA interface cores
6.3 Hardware conversion tool flow for Cray XD1
6.4 Cross bar architecture for QDR memory interface
6.5 Operations on the hardware data flow pipeline for the low pass filter algorithm
7.1 Example data flow pipelines with and without the local feedback optimization
7.2 Block diagram of idealized CPU architecture
List of Tables 
1.1 Warp processing test platforms
3.1 EPIC CPU specification
3.2 Test algorithms and characteristics
3.3 Test cases implemented in FPGA hardware
5.1 MIPS platform summary
5.2 Test algorithms used on MIPS test platform
6.1 Test algorithms used on the Cray XD1
6.2 Predicted performance improvement after resolving current Cray XD1 limitations
List of Listings 
3.1 Sample code before instruction combination
3.2 Sample code after instruction combination code
3.3 Example code loop with dependency
3.4 Example code loop without dependency
4.1 If-else statement implemented with branches
4.2 If-else statement implemented with MUXs
4.3 Conditional store implemented with branching
4.4 Conditional store implemented with guarding
4.5 Hardware dependency present
4.6 Hardware dependency removed
4.7 Stacking ("push" first)
4.8 Stack operations removed ("push" first)
4.9 Stacking ("pop" first)
4.10 Stack operations removed ("pop" first)
4.11 Before instruction removal
4.12 After instruction removal
4.13 Sequential value combination
4.14 Balanced value combination
4.15 Program before trigger instruction insertion
4.16 Program after trigger instruction insertion
5.1 Example RC trigger instruction sequence
7.1 Example code with multiple nested loops
Glossary 
AIM Abstract Instruction Model
AMD Advanced Micro Devices
API Application Programming Interface 
ASIC Application Specific Integrated Circuit
BSD Berkeley Software Distribution 
CISC Complex Instruction Set Computer 
CMS Code Morphing Software 
CPU Central Processing Unit 
DDR Double Data Rate 
DMA Direct Memory Access 
DSP Digital Signal Processing 
EPIC Explicitly Parallel Instruction Computing 
FFT Fast Fourier Transform 
FIFO First In First Out 
FIR Finite Impulse Response 
FPGA Field Programmable Gate Array 
GCC GNU Compiler Collection 
GNU GNU's Not UNIX
GUI Graphical User Interface 
HDL Hardware Description Language 
HPC High Performance Computing 
HSI Hardware Software Interface 
HW Hardware 
IEEE Institute of Electrical & Electronic Engineers 
ILP Instruction Level Parallelism 
IO Input/Output
IP Intellectual Property 
ISA Instruction Set Architecture 
JIT Just In Time 
JPEG Joint Photographic Experts Group 
LFSR Linear Feedback Shift Register 
LOG Logarithm 
LPF Low Pass Filters 
LRU Least Recently Used 
LUT Look Up Table 
MAC Multiply and Accumulate 
MMU Memory Management Unit
MMX Multi-Media Extensions
MPEG Motion Pictures Experts Group 
MUX Multiplexer 
NRE Non-Recurring Expenditure 
OS Operating System 
PAR Place And Route 
PC Personal Computer 
PCB Printed Circuit Board 
PCI Peripheral Component Interconnect 
PNG Portable Network Graphics 
PRBS Pseudo Random Binary Sequence 
QDR Quad Data Rate 
RAM Random Access Memory 
RLE Run Length Encoding 
ROM Read Only Memory 
RC Reconfigurable Computing 
RGB Red Green and Blue 
RS232 Serial Interface 
SAD Sum of Absolute Differences 
SIMD Single Instruction Multiple Data 
SMT Simultaneous Multi-Threading 
SoC Systems On a Chip 
SDRAM Synchronous Dynamic RAM 
SRAM Static RAM 
SW Software 
TSF Technology Scale Factor 
TV Television 
USB Universal Serial Bus 
WAV Waveform Audio 
VHDL VHSIC Hardware Description Language 
VHSIC Very High Speed Integrated Circuit 
VLIW Very Long Instruction Word
VU Vertical Unit
YUV Y = Luminance, U = Normalised BY, V = Normalised RY 
Chapter 1 
Introduction 
1.1 Conventional Computing Architectures 
Since the birth of the "Von Neumann architecture" [1] the performance of computers 
has continued to increase. In 1965 Gordon Moore made the observation [2] that the 
number of components on an integrated circuit doubles every 18 months, leading 
in turn to a doubling in processor performance over the same time period. Despite 
this dramatic and ongoing increase in computational speed, there are applications 
such as games, computational fluid dynamics and realtime multimedia applications 
that tax even the fastest modern computers. This problem has been addressed in 
the past by expanding the instruction set and also adding dedicated hardware. 
1.1.1 Instruction Set Expansion 
Although central processing unit (CPU) instruction sets allow the programmer to
perform almost any task, it can require many tens of instructions to perform some
relatively simple tasks. If an operation occurs frequently it may, under certain
circumstances, be added to the instruction set to advantage. A good example of
this is floating point maths. Many early instruction sets (e.g. x86, 68k, MIPS)
didn't include this functionality, requiring it to be emulated using numerous
integer operations. Floating point instructions were added when applications required
the performance boost that they provided and when the available transistor count
increased to the point where this became a practical solution.
Many modern CPU instruction sets have been extended to include single instruction 
multiple data (SIMD) instructions [3, 4]. This allows multiple arithmetic operations 
to be performed by a single instruction, by packing multiple values into a single 
register. An example of this would be a 32-bit CPU that has a SIMD instruction
that performs 4x8-bit additions. This is implemented in hardware by disabling the
carry propagation between the 8-bit groups, which results in a significant performance
improvement with very little additional hardware.
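The 4x8-bit addition described above can be imitated in software to show what suppressing the carry between groups means. The following is a minimal sketch only: the function name `packed_add_4x8` and the mask constants are illustrative assumptions, not part of any real instruction set, and a hardware SIMD unit would do this in a single operation rather than with masking.

```python
MASK_LOW7 = 0x7F7F7F7F  # low 7 bits of each 8-bit lane
MASK_HIGH = 0x80808080  # top bit of each 8-bit lane


def packed_add_4x8(a: int, b: int) -> int:
    """Add four 8-bit lanes packed into 32-bit words, with no carry between lanes."""
    # Add the low 7 bits of every lane; any carry lands in bit 7 of the same lane.
    low = (a & MASK_LOW7) + (b & MASK_LOW7)
    # Fold in the top bit of each lane with XOR, which discards the lane's carry out.
    return (low ^ ((a ^ b) & MASK_HIGH)) & 0xFFFFFFFF


# Lanes (most significant first): 0x01+0x01, 0xFF+0x01, 0x7F+0x01, 0x80+0x01
packed = packed_add_4x8(0x01FF7F80, 0x01010101)
ordinary = (0x01FF7F80 + 0x01010101) & 0xFFFFFFFF
print(hex(packed))    # each lane wraps on its own
print(hex(ordinary))  # the overflow of 0xFF+0x01 leaks into the next lane
```

In the ordinary 32-bit addition the overflow from the second lane changes the top lane, whereas the packed version keeps the four results independent.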
Although extending the instruction set can speed up a wide range of applications,
the performance gain will always be limited by the instruction pipeline, register file
and other fundamental components of the Von Neumann architecture.
1.1.2 Dedicated Hardware 
As the execution units of a processor occupy only a small proportion of the total
die area, CPUs are very inefficient in terms of both hardware utilization and power
consumption. A popular way to address this and overcome the limitations of the
Von Neumann architecture is to create dedicated hardware for specific tasks. As
this solution doesn't require many of the hardware blocks found in a processor (e.g.
instruction fetch/decode, register files, etc), a larger proportion of the hardware is
used to perform useful calculations.
A common example of this approach can be found in the current generation of
personal computers (PCs). Since the computational power required by modern 3D
computer games is far greater than that available from even the fastest processors,
dedicated hardware on the graphics card [5, 6] is used to render the 3D scenes. This
releases the CPU to perform all the other tasks that are required in order to create
the game environment. This "offloading" of computationally intensive tasks to
hardware is even more frequently employed in embedded systems (e.g. set top boxes
[7], mobile phones [8], etc) where CPU resources are often limited. Other common
examples of hardware acceleration are cryptography, MPEG compression/decompression,
and image improvement algorithms.
Using dedicated hardware is extremely powerful; however, it is only a practical
solution where a task needs to be performed very frequently, as such hardware is
only capable of performing the specific task that it was designed to do. If that
task doesn't need to be performed then the dedicated hardware will be idle. Rising
mask costs and other non-recurring expenditures (NREs) are forcing manufacturers,
especially in the embedded markets, to create devices for a much larger market
segment. Consequently it is not always economically viable to design and implement
hardware that is specific to only a small section of the market. A popular technique
to reduce the amount of application specific hardware in these types of devices is
to use one or more digital signal processing (DSP) cores [7, 8]. However, there is a
limit to how much processing power can be provided by this approach.
1.2 Reconfigurable Computing 
By adding reconfigurable logic to a system it is possible to obtain a substantial
performance improvement [9]. Although the performance achieved is very dependent
on the application, speedups of 50 times or greater are not uncommon, whilst
maintaining most of the flexibility associated with software [10, 11, 12, 13, 14].
In general, algorithms with the characteristics listed below will benefit the most
from the use of reconfigurable computing. Algorithms that possess all of these
characteristics (e.g. encryption) achieve speed up factors of 1000 or more [15, 16].
High levels of instruction level parallelism (ILP): The greater the amount of
ILP the more data can be processed in parallel.
Inherently parallel algorithms: Some algorithms are inherently parallel in nature;
as such, different parts of the computation can be done in parallel (this
parallelism is orthogonal to ILP). One example of this kind of parallelism is
a simple brightness reduction algorithm, in which the values of the
pixels are calculated independently of one another.
Simple operations: Some classes of operation (e.g. bitwise AND, OR, etc) require
minimal hardware resources to implement. The lower the number of resources
required by an algorithm, the more instances of that algorithm can be
implemented in the reconfigurable hardware.
Low memory bandwidth requirements: Since the core algorithm will be instantiated
many times in the FPGA, the total bandwidth required can become
considerable. Since the amount of "off chip" bandwidth is restricted by physical
factors (i.e. number of pins and maximum frequency), the amount of
memory bandwidth available can severely limit the overall performance of a
reconfigurable computing system.
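The brightness reduction example mentioned in the list above can be sketched to show why such an algorithm is inherently parallel. This is an illustrative sketch under stated assumptions: the function names are hypothetical, pixels are taken to be 8-bit greyscale values, and "half brightness" is assumed to mean a right shift by one bit. The second function stands in for several hardware instances each handling one block of the image.

```python
def half_brightness(pixels):
    """Halve each 8-bit pixel value; no pixel depends on any other pixel."""
    return [p >> 1 for p in pixels]


def half_brightness_in_blocks(pixels, block_count):
    """Process the image as independent blocks, in an arbitrary order."""
    size = max(1, -(-len(pixels) // block_count))  # ceiling division
    blocks = [pixels[i:i + size] for i in range(0, len(pixels), size)]
    # Blocks are deliberately processed last-to-first: the order does not matter.
    results = [half_brightness(block) for block in reversed(blocks)]
    results.reverse()
    return [p for block in results for p in block]


image = [0, 1, 128, 255, 17, 200, 64]
print(half_brightness(image))
print(half_brightness_in_blocks(image, 3) == half_brightness(image))
```

Because the blocks share no data, each one could be given to a separate instance of the algorithm, which is the property that makes the "simple operations" and "low memory bandwidth" characteristics matter for how many instances can usefully run at once.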
1.2.1 Hardware Implementations 
1.2.1.1 Direct Processor Integration 
In some systems reconfigurable logic is an integral part of the processor [12, 13]. In
these cases there are usually two communication channels. The first channel allows
instructions to be issued to the reconfigurable computing (RC) area, enabling the
software to control the execution and set up initial parameters. The second channel
gives access to memory, allowing the RC area to perform direct memory access
(DMA) operations independently of the rest of the CPU core. This approach has
several advantages:-
Low Latency: As the RC area is directly connected to the instruction pipeline, only
a few clock cycles are required to trigger execution. This is a major performance
advantage for applications where the RC area is triggered repeatedly.
Cache Coherency: As the DMA channel from the RC area is usually connected
to the data cache, rather than directly to the memory interface, the system is
naturally cache coherent.
Virtual Memory: On systems that implement virtual memory, the DMA connection
from the RC area can be made in the processor before the memory management
unit (MMU), resulting in both the application hardware and software
existing in the same virtual address space. This removes the overhead associated
with translating pointers from the virtual to the physical address space
and also reduces the complexity of the software to hardware conversion process.
High Bandwidth: Because the computational throughput of reconfigurable computers
is significantly higher than conventional processors, the memory bandwidth
available will have an even greater impact on performance [17]. As a
result the RC area must be closely integrated with the existing memory system
to maximize bandwidth and to minimize the effect of this bottleneck.
In many systems where there is a direct connection to the instruction pipeline, the
CPU will stall and remain idle while execution takes place in the RC area. However,
this inefficiency is avoided on systems that implement simultaneous multi-threading
(SMT) [18], as the CPU issues instructions from multiple thread contexts at the
same time. Consequently, a stall in one thread does not result in the entire CPU
core becoming idle. Because the RC area is very tightly integrated with the CPU
core, it must be fabricated on the same die, resulting in a significant increase in the
overall die area of the device; this can also provide a significant improvement over
dual core processors [19, 20, 21], which at most yield a 2x speedup for the cost of a
doubling in the die area.
1.2.1.2 External RC Area
Although directly connecting the RC area to the CPU has many performance
advantages, it is not always feasible to do this due to the high levels of integration
required. The notable exception is CPUs that provide an interface to the
instruction pipeline which allows the user to create their own execution units; this
feature is not present on mainstream processors, and is usually only found on CPU
cores that are specifically developed for FPGAs [22, 23].
In systems where it is not possible to directly connect the RC area to the CPU
pipeline, the RC area can instead be connected to one of the peripheral buses,
usually the peripheral component interconnect (PCI) bus [24, 25, 26]. As discussed
in section 1.2.1.1, the bandwidth available to the RC area has a significant impact
on performance; however, the standard PCI bus (32 bit, 33 MHz) only provides
133 MBytes/s, which rises to 533 MBytes/s with the PCI-X bus (64 bit, 66 MHz).
Both are significantly lower than the 6.4 GBytes/s provided to the processor by
the dual channel DDR-400 memory interface present on modern computers. To
try to address these bandwidth issues, research has been conducted into connecting
field programmable gate arrays (FPGAs) directly to a computer's memory interface.
This can be accomplished by mounting an FPGA onto a printed circuit board
(PCB) with the same form factor as a memory module. The FPGA can then be
simply plugged into any computer system that supports the appropriate memory
standard [27, 28, 29]. Although this improves both bandwidth and latency,
additional problems are created:-
DMA: Conventional memory buses are designed to be single master, which limits
the FPGA to only accessing memory to which it is directly connected.
Obsolescence: Unlike PCI, backwards compatibility is not a design goal for the
memory bus, therefore any system that utilizes such an interface will need to
be redesigned for each new memory standard.
Operating system (OS) integration: Due to the low level and unorthodox nature
of this interface, integration with the OS is required at the kernel level.
This generally limits the use of the technique to those operating systems where
the kernel source is available, for example Linux/BSD.
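The bus bandwidth figures quoted earlier in this section follow from the bus width and the transfer rate. The sketch below reproduces them as peak theoretical values. It assumes the nominal PCI clocks of 33.33 MHz and 66.66 MHz, and a 64-bit data path per DDR-400 channel at 400 million transfers per second; the helper name `peak_bandwidth_mbytes` is illustrative.

```python
def peak_bandwidth_mbytes(bus_width_bits, mega_transfers_per_s, channels=1):
    """Peak theoretical bandwidth in MBytes/s: bytes per transfer x rate x channels."""
    return bus_width_bits / 8 * mega_transfers_per_s * channels


pci = peak_bandwidth_mbytes(32, 33.33)            # standard PCI
pci_x = peak_bandwidth_mbytes(64, 66.66)          # 64-bit, 66 MHz bus
ddr400_dual = peak_bandwidth_mbytes(64, 400, 2)   # dual channel DDR-400

print(round(pci), round(pci_x), round(ddr400_dual))
print(round(ddr400_dual / pci))  # how many times faster memory is than PCI
```

On these assumptions the memory interface offers roughly 48 times the peak bandwidth of standard PCI, which is the gap that motivates connecting FPGAs directly to the memory bus.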
High performance computing (HPC) vendors Cray [30] and SGI [31] have introduced
systems where FPGAs are directly connected to the main system bus, which results
in high bandwidths. In addition, because these buses are multi master, there are
no issues associated with DMA transactions. Although the latency is significantly
better than in PCI based solutions, it is not as low as that in systems where the RC
area is directly connected to the CPU. To reduce the performance impact caused,
both the Cray and SGI systems have high bandwidth, low latency QDR-SRAM
memories directly connected to the FPGAs. Both these external memories and the
SRAM present inside the FPGA itself can be used to create small caches and buffers
to improve performance.
1.2.2 Tool Flows
Creating the hardware infrastructure for a reconfigurable computing system is
relatively straightforward and several solutions [26, 30, 31, 32] are commercially
available. However, a viable platform also requires a tool flow to create both the software
application and the accompanying hardware configuration.
1.2.2.1 Separate SW And HW Tool Flows
Figure 1.1 shows the simplest tool flow. This has a high level language compiler (e.g.
C/C++) for the software and a completely separate hardware description language
(HDL) (e.g. VHDL or Verilog) flow to generate the hardware configuration
image [33]. Although this approach gives the application developer great flexibility,
it also exposes him/her to a split programming paradigm. This can introduce design
flaws and bugs, as most developers do not possess both hardware and software skill
sets.
As current HDLs are very low level languages, manually porting existing software
to hardware can take a considerable amount of time. One example of this is a seismic
data processing algorithm, the kernel of which was ported to hardware [34].
Although this corresponds to a mere 80 lines of C code, the process took six man
months to complete. Although the result provided a considerable increase in
performance, the effort levels and skills required represent a significant barrier to the
widespread adoption of this technology.
[Figure 1.1: Separate SW and HW tool chains. Hardware description (e.g. VHDL/Verilog) -> hardware generator -> hardware config image; high level language (e.g. C/C++) -> compiler -> program executable.]
It is possible to build library components of common functions (e.g. fast Fourier
transforms (FFTs), random number generators, sorting algorithms, etc) to reduce
the amount of effort required to port applications to reconfigurable computing
platforms. These library components consist of the hardware and a software wrapper
to control it. This not only promotes intellectual property (IP) reuse but also
provides a familiar interface to software developers requiring the performance benefits
of hardware. An additional advantage of this approach is that the library function
can also be implemented in software, thus allowing the system to decide, at runtime,
whether functions are to be performed in software or in hardware [35].
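The runtime choice between a software and a hardware version of a library function can be pictured with a small dispatcher. This is a hypothetical sketch, not the mechanism used in [35]: the class `LibraryFunction`, the `rc_area_available` flag and the stand-in "hardware" routine are all assumptions made for illustration, and a real wrapper would drive an RC area rather than call Python code.

```python
class LibraryFunction:
    """A library component with a software version and an optional hardware version."""

    def __init__(self, software, hardware=None):
        self.software = software
        self.hardware = hardware
        self.last_path = None  # records which implementation ran last

    def __call__(self, data, rc_area_available):
        # Assumed policy: use hardware only if it exists and the RC area is free.
        if self.hardware is not None and rc_area_available:
            self.last_path = "hardware"
            return self.hardware(data)
        self.last_path = "software"
        return self.software(data)


def sort_in_software(data):
    return sorted(data)


def sort_in_hardware(data):
    # Stand-in for a software wrapper controlling a hardware sorter.
    return sorted(data)


library_sort = LibraryFunction(sort_in_software, sort_in_hardware)
print(library_sort([3, 1, 2], rc_area_available=True), library_sort.last_path)
print(library_sort([3, 1, 2], rc_area_available=False), library_sort.last_path)
```

The caller sees one familiar interface in both cases; only the dispatcher knows which implementation ran.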
1.2.2.2 Common High Level Language 
To overcome the problems outlined in section 1.2.2.1 the tool flow shown in
figure 1.2 is becoming increasingly common. This flow uses a high level language 
to describe both the hardware and software [36, 37, 38, 39, 40, 41]; often a variant 
of C (as a large proportion of software is written in this language, porting software 
kernels to hardware is made considerably easier). 
As with many other high level languages, C [42] was originally created as a pure
software language. Consequently it lacks some of the constructs required in a high level
HDL. Most notably, there is no way of explicitly specifying parallelism. In addition,
for reconfigurable computing platforms, it is not possible to specify whether a
section of code should be implemented in software or in hardware. To overcome these
limitations, most hardware C compilers extend the language with either additional
keywords or pragmas. As both these directives have a significant impact on the
efficiency of the resulting hardware and also have the potential to introduce bugs into
the system, it is important that the developer has, at the very least, a rudimentary
knowledge of both hardware in general and the target platform in particular.
[Figure 1.2: Common high level language. High level language (e.g. C/C++) -> compiler -> program executable, and -> hardware generator -> hardware image.]
The resulting executable from these types of tool flow contains the RC area
configuration data, which is very closely related to the underlying hardware architecture.
Because of these low levels of abstraction, applications need to be compiled for the
specific RC area concerned. Therefore, in order to create a viable platform one of
the following options must be chosen:-
Fixed RC area: By fixing the architecture, size, and speed of the RC area it would
be possible to distribute applications to customers with a variety of different 
reconfigurable computing systems without the need to recompile and re-target. 
However this approach is not without its drawbacks. Given that different 
market segments have vastly different needs (e.g. mobile, desktop, server) it 
would be impractical to fix the RC area. In addition, since the speed
of processors is increasing (as stated in section 1.1), if the parameters of the
RC area were to be fixed, the performance of pure software would eventually
overtake the performance of hardware.
Compile every application for every RC area: If there is no requirement for
backwards compatibility, the RC area may be freely changed with each version
of the processor, allowing the performance of the RC area to be appropriately 
scaled as technology advances. However with no backwards compatibility, 
every application would need to be compiled for every system. Clearly this is 
not feasible in the mainstream computer market. However this does provide a 
possible solution for embedded markets, where applications are typically written
for specific platforms. 
To address some of the above problems, it has been proposed that the hardware 
configuration image could be distributed in a high level, abstracted form [43]. As 
a result, the target platform would perform both the mapping and place and route
(PAR) stages. New device architectures and tools have been designed that significantly
reduce the time taken to perform these steps [44], albeit at the expense
of device size and operating frequency. However there are still several problems
associated with this approach:-
Application vendors need to target hardware: Because the program contains
an abstracted hardware binary, application vendors will require additional time 
and hardware expertise to gain the performance advantages associated with 
reconfigurable computing. 
Performance does not automatically scale with RC size: Although the size
of the RC area can be increased without the need to recompile, only those 
applications that have been written specifically to take advantage of the in-
creased hardware resources will benefit. 
Compiler modifications: The compiler tool chain would need to be extended in
order to support both the additional partitioning and synthesis steps required.
This would have to be done for every supported language (e.g. C/C++, Fortran,
Java).
[Figure 1.3: Low level tool chain. Blocks shown: program executable, execution profile, hardware generator, new program executable, hardware config image.]
1.2.2.3 Low Level Tool Flows 
Instead of using a high level language as the starting point for the reconfigurable tool 
chain as described in section 1.2.2.2, it is possible to start from an existing software 
binary [45]. This approach can be extended to form the tool flow shown in figure 1.3 
by using runtime execution profiling [46, 47] to identify computationally intensive 
sections of code. The key difference between this and other tool flows is that it is run 
by end users on their systems and not by application developers, as is usually the 
case. This is similar to just in time (JIT) compiler technology, now commonplace 
in the software domain, where it is used to execute non native code at close to 
native performance [48, 49, 50]. However, instead of translating one instruction set 
to another, this tool flow translates computationally intensive sections of code from 
the native instruction set to hardware whilst at the same time generating a modified 
version of the application in order to utilize the newly generated hardware. 
With this tool flow, the reconfigurable hardware is completely abstracted from the 
application software. Indeed, the software developer is not even aware of the presence
of the reconfigurable hardware. Two benefits arise from this approach: application
developers do not need specific hardware knowledge and existing applications
will benefit from hardware acceleration without having to be recompiled. This complete
abstraction also gives hardware vendors the flexibility to change the size, speed,
and architecture of the RC area without adversely affecting backwards compatibility.
As a result the performance, cost, and power consumption of the system can be
tailored to suit different market segments (e.g. mobile, desktop, and server) whilst
still providing the flexibility to allow the system to keep up with the ever increasing
pace of technological change.
1.2.3 Scheduling 
Amdahl's law [51] states:-
"If F is the fraction of a calculation that is sequential, and (1-F) is
the fraction that can be parallelised, then the maximum speed-up that
can be achieved by using P processors is $\frac{1}{F + \frac{1-F}{P}}$."
Although originally related to multi processor systems, Amdahl's law can also be 
applied to reconfigurable computing since, in order to achieve a large overall increase 
in performance, a large proportion of the computationally intensive code must be 
converted to hardware. Because computer systems often perform multiple tasks with 
each task containing several computationally intensive loops, there are, potentially, a 
large number of code sections that may benefit from hardware conversion. However 
merely increasing the size of the RC area to accommodate simultaneously all the 
intensive sections of code is very wasteful of hardware resource, since only a small 
proportion of the RC area will be active at any one time. To address this problem it
is possible, utilizing the dynamic nature of a reconfigurable computing environment,
to move hardware in and out of the RC area as required [52, 53].
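The effect of the sequential fraction can be seen with a short worked calculation. The Java fragment below is an illustrative sketch only: the class name and the sample values are hypothetical, and P is read loosely as the factor by which the converted code is accelerated, which is an assumption made for this example rather than part of Amdahl's original formulation.

```java
// Illustrative sketch: evaluates Amdahl's law for a few sequential fractions.
final class AmdahlSketch {
    /** Maximum speed-up when a fraction f is sequential and (1 - f) is shared over p processors. */
    static double speedup(double f, double p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    public static void main(String[] args) {
        double p = 10.0; // hypothetical acceleration factor for the converted code
        for (double f : new double[] {0.5, 0.1, 0.01}) {
            System.out.printf("F = %.2f -> speed-up %.2fx%n", f, speedup(f, p));
        }
    }
}
```

With half of the calculation left sequential, the overall speed-up stays below 2x however large P becomes, which is why a large proportion of the intensive code must be converted.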
To minimize the idle time associated with reconfiguring the RC area, it has been 
proposed that multi context devices [54] be used, allowing the RC area to be active 
with one context whilst another context is being reprogrammed. These contexts can 
then be quickly swapped thus activating the newly programmed context. This is 
similar to the double buffering technique often used in computer graphics. Multi-context
aware scheduling algorithms can be used to minimize both the memory
bandwidth and the idle time associated with moving configuration data around the
system [55].
1.2.4 Floating Point Maths 
The hardware resource required to implement some mathematical functions can be 
considerable. As a result most modern FPGAs contain dedicated hardware to per-
form such operations as multiplication, since this and similar functions are relatively 
common and consume a considerable amount of hardware resource. Until recently, 
reconfigurable computing has been restricted to integer and fixed point arithmetic 
due to the large hardware resources required by floating point units. However, the 
combination of floating point cores that are specifically designed and optimized for 
FPGAs [56, 57, 58], together with the latest generation of larger devices means that 
floating point arithmetic is now achievable on FPGAs. 
Due to their size, it is important to match floating point cores to the specific ap-
plication that is to be implemented in hardware. For example, in most cases, it is 
advantageous to increase the latency of a core to match the surrounding hardware. 
Not only does this eliminate the need to pipeline the result through several stages of
registers to match the rest of the data flow, but it also provides the opportunity to 
further optimize the core and reduce its hardware utilization. To this end, research 
has been conducted to produce floating point core generators [59] rather than fixed 
cores. These core generators allow the RC tool flow to generate different cores for 
different sections of the algorithm depending on both the latency and throughput 
requirements. 
To reduce further the hardware requirements, it is possible to change the number 
representations that are used depending on both the number and type of operations 
present in the hardware. For example, by using a higher radix floating point nota-
tion, it is possible to reduce the hardware requirements of floating point cores by 
12-25% [60]. However, the decision to use a higher radix notation must be made on 
a case by case basis. This is due to the fact that the amount of hardware required to 
convert to and from the IEEE format used by the host CPU can, in some cases, be 
greater than the hardware saved by using the high radix notation. Using logarithm 
(LOG) notation has also been investigated, as this can also produce substantial re-
ductions in hardware usage provided the algorithm contains mainly multiply and/or 
divide operations [61]. The decision to employ LOG notation has to be informed by 
the overhead associated with converting to and from the IEEE floating point format 
and also by the number of addition/subtraction operations involved, as these require 
significant amounts of hardware resource in the LOG domain. 
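The trade-off behind LOG notation can be sketched in a few lines. The Java fragment below is illustrative only and is not taken from the cited work: the class and method names are hypothetical, sign and zero handling are omitted, and double precision arithmetic stands in for whatever fixed width representation a hardware core would use.

```java
// Illustrative sketch: a positive value is held as its base-2 logarithm.
final class LogNumberSketch {
    static double toLog(double x)   { return Math.log(x) / Math.log(2.0); }
    static double fromLog(double l) { return Math.pow(2.0, l); }

    // Multiply and divide reduce to adding or subtracting the stored logarithms.
    static double mul(double a, double b) { return a + b; }
    static double div(double a, double b) { return a - b; }

    // Addition needs the non-linear term log2(1 + 2^(lo - hi)), which is the
    // part that is costly to realise in the LOG domain.
    static double add(double a, double b) {
        double hi = Math.max(a, b);
        double lo = Math.min(a, b);
        return hi + Math.log1p(Math.pow(2.0, lo - hi)) / Math.log(2.0);
    }
}
```

In this sketch a multiplication costs a single addition, whereas an addition requires a comparison, a subtraction and the evaluation of a non-linear function, which mirrors the reason given above for counting the addition/subtraction operations before choosing this notation.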
1.3 Research Conducted 
The research presented in this thesis concentrates on the development and investi-
gation of low level tool flows (see section 1.2.2.3) with the ultimate aim of creating 
a reconfigurable computing system that would be suitable for mass market adoption.
In order to achieve this goal it is important that the final system exhibit the
following characteristics:-
Abstraction: Complete abstraction between the hardware and software is required
to provide backward compatibility. Without this basic requirement it is impossible
to create a platform that will achieve mass market adoption.
Automatic conversion: The conversion from software to hardware, and the subsequent
integration of the newly created hardware, must be completely automatic
and require no user intervention. This requirement is particularly
important if the conversion is performed on the end user's computer, rather
than the original software developer's (see section 1.2.2.3).
Low conversion time: To obtain widespread user acceptance the software to hardware
conversion process must be performed relatively quickly. Again this is
especially important on systems where the conversion is performed by the end
user. Although the output of the conversion process can be cached and reused
during all subsequent executions of a program, this technique can't be
used during the software development process, where the program is changing
frequently. In addition some users would find the relatively slow, initial
execution unacceptable.
Large performance increase: The use of reconfigurable technology in mainstream
environments would represent a significant departure from the conventional
techniques used to increase performance (e.g. larger caches, wider execution
pipelines, faster clock speeds, etc). If such a radical technology is to be adopted
then it must offer significant performance advantages over existing approaches
over a wide range of different types of software.
Much of the previous research conducted in this field has concentrated on producing 
systems that produce significant increases in performance [10, 11, 12, 13, 14, 15].
Although this is an important aspect of any reconfigurable computing platform,
it must not overshadow other aspects like abstraction, as without these extra
features it is not possible to create a viable reconfigurable computing platform.
The research presented in this thesis addresses all of the required features, with
the exception of the time required to perform the software to hardware conversion, 
as this is an issue that is closely linked to the architecture of the reconfigurable 
area, and not the conversion tool chain. This has, however, been addressed in other
research [44]. In addition to the work presented in this thesis, extra development and
investigation is required before a viable platform is available; details of the required
work can be seen in chapter 7.
1.4 Warp Processing 
The concept of generating hardware at runtime from an existing software binary was 
termed "warp processing" by F Vahid and his team at the Department of Computer 
Science and Engineering, University of California, Riverside [62]. This research
group performed some preliminary research into the development and use of low
level tool flows similar to those described in section 1.2.2.3.
1.4.1 Hardware Platforms 
During the course of their research, evaluations of several reconfigurable computing 
systems based on different host processors were performed. The general hardware 
architecture that they used is shown in figure 1.4. The first thing to note about this 
particular architecture is the use of a completely separate processor to perform the 
software to hardware conversion which, in many of the test cases that they presented, 
was a duplicate of the main host CPU [62, 63, 64]. Although not made explicit in
their research, this would result in a device that is considerably larger than twice
the size of the original processor core, after the additional hardware required for
the RC area is taken into account. As a result, the hardware acceleration must
produce substantially more than a 2x increase in performance for their platform to
produce any advantage. In the majority of computer systems the CPU will spend
a significant proportion of its time idle. Since this idle time could, in most cases,
be used to perform the hardware conversion without impacting performance, it is
questionable whether this second, dedicated processor is required.
1.4.2 Tool Flow 
The hardware generation tool flow utilized by the "warp processor" developed by F
Vahid and his team differs slightly from the tool flow described in section 1.2.2.3
as their first step was to decompile the program that was to be converted to the C 
language [63, 65]. They then used the additional information available in this high 
level language to further optimize the hardware generation process [65, 66]. However 
deriving this additional information necessitated making certain assumptions based 
on a knowledge of the original compilation process. Such assumptions can lead to 
errors in the decompiled source code and this is especially true in cases where the 
software was originally written in a different language (e.g. Fortran). This is due 
to the fact that different languages can handle high level constructs (e.g. arrays) 
very differently. Any errors in the decompiled source code can result in bugs, data 
corruption, and/or system instability.
It is not always possible to decompile a program executable. The use of intricate 
branching, differing high level languages and high levels of optimization all increase
the complexity of the program executable, reducing the likelihood that the decompilation
of a specific section of code will be successful. In a "warp processing" system,
this can lead to highly computationally intensive sections of code not being converted
to hardware. This is probably the reason why the majority of the compiler
optimizations were turned off for some of the test cases evaluated [65]. Disabling
the compiler optimizations reduces the performance of the software and
can artificially increase the speedup achieved by the conversion to hardware.
[Figure 1.4: Warp processor architecture. Blocks shown: RC area, host CPU, hardware conversion block (CPU and memory), instruction cache, data cache, links to main memory.]
Host CPU     Reference   CPU clock   RC area clock   Average speedup   Peak speedup
ARM 7        [62]        100 MHz     250 MHz         7.4x              16x
MicroBlaze   [64]        85 MHz      250 MHz         5.8x              16.9x
ARM 7        [68]        75 MHz      60 MHz          2.1x              4.2x
MIPS         [69]        100 MHz     100 MHz         1.4x              1.9x
Table 1.1: Warp processing test platforms
1.4.3 Performance Improvements 
The "warp processing" concept was evaluated by the California team on a number of 
different host processors, including ARM, MIPS, and the MicroBlaze. A summary of 
their test platforms is shown in table 1.1. The performance increase produced by the 
first two test cases averaged 7.4x and 5.8x respectively. These performance increases 
are partially due to the fact that the clock speed of the RC area is considerably higher 
than that of the CPU. However, as a result of the overheads associated with the 
reconfigurable nature of the RC area, it is not uncommon for the RC area to be an 
order of magnitude slower than the CPU. One example of this is the Cray XD1 [67],
where the maximum clock speed of the FPGA is 11x lower than the CPU. In the
"Warp" test cases where the clock speed of the RC area is either the same as or 
slightly lower than the CPU, the average performance improvement drops to a mere 
2.1x and 1.4x respectively. 
Due to the complex nature of the hardware platforms, the performance improve-
ments were calculated by combining the results of VHDL simulations of the RC 
area with behavioural simulations of the CPU cores. Due to the lack of actual hard-
ware, several real world factors, such as the communications overhead between the 
CPU and the RC area, would not have been taken into account. These factors will 
further reduce the performance improvements that were reported. 
1.5 Summary 
The performance of computer systems roughly doubles every 18 months. However, 
since the vast amount of computing power that this provides is still not enough 
for many applications, there has been a shift to the use of dedicated hardware. In
the PC arena this is illustrated by the fact that some graphics cards now have
higher transistor counts than their companion CPUs (GeForce 7800 GTX = 302
million transistors, Dual core AMD Opteron = 233 million transistors). Although
this translates into extremely high levels of performance, it is limited to 3D graphics
applications. A new computing paradigm is required that provides the performance
offered by these dedicated hardware solutions without losing the flexibility and
backwards compatibility associated with conventional software. 
Reconfigurable computing has the potential to provide the required performance 
without sacrificing flexibility. However, memory bandwidth, device size and tool 
flow issues have blocked the widespread adoption of this technology. The latest gen-
eration of larger FPGAs combined with new techniques for implementing floating 
point arithmetic, have finally made reconfigurable computing an achievable goal. 
As such, several major HPC vendors have started to incorporate FPGAs into their 
product ranges. The tight coupling and integration provided by these systems have
dramatically increased the memory bandwidth available when compared to traditional
expansion card solutions. With the emergence of runtime JIT based tool
flows, reconfigurable computing is finally becoming a possibility; however, significant
work is required to fully realize the potential benefits.
Chapter 2 
Behavioural Simulation 
2.1 Background
Modern systems on a chip (SoCs) are becoming increasingly complex, and often
include a mixture of dedicated hardware blocks together with real time software, 
running on one or more embedded processors [70]. To help ensure that the next 
generation of these devices hit their market windows, avoiding expensive redesigns 
and/or over engineered solutions, it is essential that the initial design concept is 
correct, and that the system is capable of performing all the specified use-cases
without exceeding such device performance limits as memory bandwidth, CPU uti-
lization, and latency. Usually the first stage in the design process is to define the 
system use-cases and calculate the device performance required to implement each 
of these. Commonly, this is determined by adding up the required bandwidth for 
each component of a use-case and adding a safety margin. However, this approach 
can be very prone to error as the bandwidths required by the different components 
in the system are frequently estimates. In addition, this approach will only give
the average bandwidth required by the use-case, and will not take into account any 
peaks that might occur in the required bandwidth. These peaks may cause buffers 
to overflow and data to be lost, resulting in, for example, dropped video frames or 
other artefacts that are not acceptable to the customer. 
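The gap between average and peak demand can be illustrated with a short sketch. The Java fragment below is not part of CPUSim; the class name, FIFO depth, drain rate and traffic patterns are hypothetical values chosen purely for illustration. It counts the words lost by a fixed depth buffer that is drained at a constant rate.

```java
// Illustrative sketch: two traffic patterns with the same average bandwidth.
final class BurstSketch {
    /**
     * Returns the number of words dropped when arrivals[i] words arrive in cycle i,
     * the FIFO holds at most depth words, and up to drainPerCycle words leave per cycle.
     */
    static int dropped(int[] arrivals, int drainPerCycle, int depth) {
        int fill = 0;
        int lost = 0;
        for (int arriving : arrivals) {
            fill += arriving;
            if (fill > depth) {           // buffer overflow: excess words are lost
                lost += fill - depth;
                fill = depth;
            }
            fill -= Math.min(fill, drainPerCycle);
        }
        return lost;
    }

    public static void main(String[] args) {
        int[] smooth = {2, 2, 2, 2, 2, 2, 2, 2}; // average 2 words per cycle
        int[] bursty = {8, 0, 0, 0, 8, 0, 0, 0}; // same average, delivered in bursts
        System.out.println("smooth: " + dropped(smooth, 3, 4) + " words lost");
        System.out.println("bursty: " + dropped(bursty, 3, 4) + " words lost");
    }
}
```

Both patterns average two words per cycle and the drain rate of three words per cycle includes a safety margin, yet only the bursty pattern overflows the buffer. A calculation based on averages alone would not reveal this.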
It has been shown [71, 72] that the use of behavioural simulation tools (with a 
high level of abstraction) in the early stages of a project, produces useful data 
which can, in turn, be used to guide architectural decisions. In addition to this, 
behavioural models have many other uses. For example if a complete behavioural 
model of a system is produced, it may be used to provide an early hardware target for 
embedded software engineers [73], enabling software development to run in parallel 
with hardware development throughout the majority of the project. This not only 
decreases time to market, but also reduces the knock on effect of late and/or unstable 
hardware on the software development cycle. Even when a stable hardware platform 
is available, a cycle accurate model can still be extremely useful. For example, it 
may be used when writing optimised DSP software [74] in order to provide detailed 
information on which sections of code need to be optimised, and it will also indicate 
the types of performance bottlenecks that are present. To date, a lot of behavioural 
models have been written using proprietary and often inflexible frameworks. This 
can hinder the re-use of blocks within the model and cause problems when trying to 
integrate multiple models (e.g. DSP core, MPEG decoder) into a model of a SoC. 
Open standards such as SystemC [75] are starting to appear that provide a common 
framework that enhances re-use. Unfortunately these standards often have serious 
flaws in their architectures and do not address many of the other problems, such 
as visualisation, simulator overheads and ease of use, associated with behavioural 
modelling.
2.2 CPUSim 
A behavioural simulation system called CPUSim was developed during the course
of this research to address some of the problems traditionally associated with behavioural
modelling, e.g. ease of use, simulation speed and re-use. This system allows
the quick and early evaluation of different device concepts and also provides data 
on a wide variety of performance metrics. The language chosen for CPUSim was 
Java [76], which enables the simulation to be run, without modification, on a wide 
range of platforms including Windows, Linux, Mac and Unix. In addition to being 
used extensively throughout the course of this research, CPUSim has also been used 
by Philips Semiconductors to model the effects of changing the arbitration settings 
and the sizes of the first in first out (FIFO) buffers present in the memory hub of 
the PNX8550 hybrid TV processor [7. 
2.2.1 Concurrency 
The inherent parallel nature of hardware requires that any model incorporates some 
mechanism for emulating this. Commonly, this is done at the block level where the 
behaviour of a block is defined using normal sequential code and then each block 
in the system is run pseudo-concurrently. In most cases this pseudo-concurrency is 
implemented by running each block in its own thread [75]. This approach, unfortu-
nately, has several side effects:-
• As most threading systems are random in nature, blocks can be executed in 
any order. This can result in the system being non-deterministic. 
• As there is no concept of time within the system, it is extremely problematic 
or even impossible to get accurate data on any time related metric such as 
bandwidth. 
• The threading mechanism is exposed to the model writer. Writing multi-
threaded programs can be very error prone and can lead to deadlocks, con-
current access, and priority inversions. These are often very difficult to debug 
owing to the random nature of these types of problems. 
• Since each block in the system exists in its own thread, a complex simula-
tion may easily contain over 50 threads. Context switching between so many 
threads is a very time consuming process, leading to large overheads and poor 
simulation speed. 
To resolve some of these problems, it is common to implement some form of syn-
chronisation within the system; usually in the form of a clock event that triggers the 
actions of the blocks in the simulator [77]. However, this does not solve the problems 
associated with debugging a multi-threaded application, or the poor performance 
associated with this type of approach. 
To avoid the above problems, CPUSim uses a different method to simulate concurrent
execution. A clock generation block calls the "clockPulse()" function on
each block in the clock domain, advancing the state of the block by one clock cycle.
Once this is complete the process is repeated. However, there is a serious problem
with this simple clocking scheme. For instance, take the example of a counter which
is incremented every clock cycle and a display, which is also updated once every
cycle. The simulator will give different results depending on the order in which
the blocks are clocked (see figure 2.1). Clearly this is unacceptable. To resolve this
problem, a second clock function called "clockSwapState()" may be used. This,
when combined with the following rules, will ensure an accurate simulation:-
• A block is only allowed to communicate with another block during the execution
of the "clockPulse()" function.
• The externally visible state of a block is only allowed to change during the
execution of the "clockSwapState()" function.
• The clock block must ensure that all the "clockPulse()" functions on the
blocks have been completed before any of the "clockSwapState()" functions
are called and vice versa (see figure 2.2).
[Figure 2.1: Single function clocking. Two timing diagrams of Reset, Counter Value and Display Value against Time. Counter clocked before display: display reads "1" after first clock pulse. Counter clocked after display: display reads "0" after first clock pulse.]
[Figure 2.2: Dual function clocking. Two timing diagrams of Counter Value and Display Value against Time. Counter clocked before display: display reads "0" after first clock pulse. Counter clocked after display: display reads "0" after first clock pulse.]
This clocking scheme has the following advantages:-
• As the simulator is deterministic and cycle accurate, it can be used to provide 
accurate performance metrics. 
• As the threading scheme is handled centrally, there is only a very small amount 
of multi-threading code, simplifying testing and debugging. Additionally as 
there is no synchronization code in the models of the blocks, the development 
time is also reduced. 
• Since the number of threads used to clock the system can be based on the 
number of processors present in the host computer and not the number of 
blocks in the simulator (as is often the case), a lot of the simulation overheads 
are eliminated. 
It is also worth noting that this two stage clocking strategy is similar to the use of the 
rising and falling edges of a conventional clock signal. This facilitates the integration 
of real hardware with the simulation environment as is described in section 3.4.4. 
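The two stage clocking scheme described in this section can be sketched using the counter and display example. The Java fragment below is a minimal illustration and not the CPUSim source: only the function names "clockPulse()" and "clockSwapState()" come from the text, while the interface name, the classes and the single threaded clock loop are assumptions made for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interface; only the two method names are taken from the text.
interface ClockedBlock {
    void clockPulse();      // read other blocks' visible state and compute the next state
    void clockSwapState();  // make the next state externally visible
}

final class Counter implements ClockedBlock {
    private int visible = 0;
    private int next = 0;
    int value() { return visible; }
    public void clockPulse() { next = visible + 1; }
    public void clockSwapState() { visible = next; }
}

final class Display implements ClockedBlock {
    private final Counter source;
    private int visible = 0;
    private int next = 0;
    Display(Counter source) { this.source = source; }
    int value() { return visible; }
    public void clockPulse() { next = source.value(); }
    public void clockSwapState() { visible = next; }
}

final class ClockSketch {
    // One clock cycle: every clockPulse() completes before any clockSwapState() starts.
    static void tick(List<ClockedBlock> blocks) {
        for (ClockedBlock b : blocks) b.clockPulse();
        for (ClockedBlock b : blocks) b.clockSwapState();
    }

    // Returns the displayed value after the given number of cycles for either clocking order.
    static int displayAfterTicks(boolean counterFirst, int ticks) {
        Counter counter = new Counter();
        Display display = new Display(counter);
        List<ClockedBlock> order = counterFirst
                ? Arrays.<ClockedBlock>asList(counter, display)
                : Arrays.<ClockedBlock>asList(display, counter);
        for (int i = 0; i < ticks; i++) tick(order);
        return display.value();
    }
}
```

In this sketch the display reads 0 after the first cycle regardless of the order in which the two blocks are clocked, which is the behaviour shown in figure 2.2. A real clock block could divide each of the two loops between several worker threads, provided the barrier between the two phases is preserved.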
2.2.2 Flexibility 
To reduce the cost of modelling a system, it is important that the blocks in the 
simulator are reusable. There are, therefore, two factors that should be considered
before writing a block:-
Parameterization: How generic is the model - how many different situations can
it be used in?
Interface: How many different types of blocks can the block be connected to?
The concept of having a parameterizable simulation has been used in the past to 
investigate different architectures [78]. However, parameterization can be taken a 
step further to create generic blocks which can be used in multiple situations. One 
example of this is the case of a simple memory block with parameterizable size, 
latency and number of ports. If a memory were to be instantiated with the follow-
ing features: large size, high latency and one port, it could be used to simulate a 
synchronous dynamic random access memory (SDRAM) device. If the same mem-
ory were to be instantiated with a much smaller size, single clock cycle latency and 
multiple read and write ports, it could then be used to simulate the register file in 
a processor. 
For this "write once, use anywhere" approach to work, there must be common 
interfaces. For example, a data cache must fetch data from the memory using the 
same mechanism that the instruction decoder uses to fetch data from the register 
file. This, in turn, means that the interface must be generic enough to allow its 
use in both of these situations, whilst at the same time allowing for a high degree 
of simulation detail. One example of this is the "Addressable" interface which 
specifies the following functions:-
• boolean write ( int address, int data )
• boolean issueRead ( int address, Track trackObj )
• Integer checkRead ( Track trackObj )
This interface allows a connected block to issue a read or a write operation to the 
destination block, and in the case of the "read" obtain the result at a later point in 
31 
2. Behavioural Simulation 
time by using the "checkRead ( ) " function. This approach offers the implementer 
a great deal of flexibility. For example, a burst transaction could be implemented 
by returning the result of the first read operation after ten clock cycles together 
with the result of a subsequent read to an adjacent address with a one cycle latency. 
This type of interface also allows transactions to be rejected and, by changing the 
way in which transactions are rejected, the destination block can effectively alter 
the number of simultaneous pipelined transactions that are allowed. 
Because this interface is highly abstracted, it can be implemented by a wide variety 
of blocks (e.g. register files, frame buffers, caches, and hardware input/output (IO)
links) and as a result all of these blocks become interchangeable. This gives the user 
great flexibility when choosing precisely which blocks to add to a simulation and 
how best to connect them. 
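One possible shape for such a block is sketched below. The three method signatures follow the "Addressable" interface listed above; everything else is an assumption made for illustration, including the empty "Track" class, the "LatencyMemory" class and its parameters, the use of a null return from "checkRead()" to mean "not ready yet", and the rule for rejecting transactions.

```java
import java.util.HashMap;
import java.util.Map;

final class Track {}  // hypothetical opaque handle identifying one outstanding read

interface Addressable {
    boolean write(int address, int data);
    boolean issueRead(int address, Track trackObj);
    Integer checkRead(Track trackObj);
}

// Hypothetical memory block with parameterizable size, latency and pipelining depth.
final class LatencyMemory implements Addressable {
    private final int[] cells;
    private final int latency;          // clock cycles before a read completes
    private final int maxOutstanding;   // simultaneous reads accepted
    private final Map<Track, int[]> pending = new HashMap<>(); // track -> {address, cyclesLeft}

    LatencyMemory(int size, int latency, int maxOutstanding) {
        this.cells = new int[size];
        this.latency = latency;
        this.maxOutstanding = maxOutstanding;
    }

    public boolean write(int address, int data) {
        if (address < 0 || address >= cells.length) return false;
        cells[address] = data;
        return true;
    }

    public boolean issueRead(int address, Track trackObj) {
        if (address < 0 || address >= cells.length) return false;
        if (pending.size() >= maxOutstanding) return false; // reject: pipeline is full
        pending.put(trackObj, new int[] {address, latency});
        return true;
    }

    public Integer checkRead(Track trackObj) {
        int[] entry = pending.get(trackObj);
        if (entry == null || entry[1] > 0) return null;     // unknown or not ready yet
        pending.remove(trackObj);
        return cells[entry[0]];
    }

    // Advances every outstanding read by one clock cycle.
    void clockPulse() {
        for (int[] entry : pending.values()) {
            if (entry[1] > 0) entry[1]--;
        }
    }
}
```

Under these assumptions, instantiating the class with a large size and a high latency approximates an SDRAM device, while a small size and a single cycle latency approximates a register file, and the caller's code is the same in both cases.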
2.2.3 Instantiation and Connection 
The time taken between the conception of a new system architecture and obtaining 
a set of performance metrics from it is one of the limiting factors which determines 
how many different architectures can be evaluated before market pressures force the 
start of the implementation phase. The two major contributing factors to this round 
trip time are: the time taken to implement new architectures in the simulator and 
the speed of the simulation. The majority of the simulation environments that are 
available produce static models [75]. This obliges the user to go through a laborious 
process of changing the source code and then recompiling the model, before the 
simulation can be run and the effects of the changes evaluated. In addition to being 
a time consuming task, this also requires software development skills that many 
hardware engineers and system architects lack. Because CPUSim uses common 
interfaces for both instantiating and connecting the blocks, it is possible to make 
changes to:-
• The number and type of blocks 
• Their parameters 
• How they are connected 
at runtime from the graphical user interface (GUI) without the need to recompile 
the program. This fact allows the user to make changes quickly and easily to the 
system and to observe instantly the effects of any modifications. 
At the heart of this flexibility is the run-time connection and checking interface. 
This is comprised of the following two functions:-
• connectToDestination(HardwareBlock destinationBlock, 
int portNum) 
• connectionsComplete() 
By calling the "connectToDestination()" function on a block, the simulator 
can create a connection between that block and the "destinationBlock". As 
some blocks have to be connected to several blocks of different types (e.g. an 
instruction decoder needs to be connected to the instruction fetch unit and also to 
the register file), a "portNum" is used to specify from which port on the block the 
connection is being made. The top level simulation infrastructure also enumerates 
these port numbers in order to present meaningful names to the user. If an error 
occurs (e.g. the "destinationBlock" is not the correct type of block), an exception 
is thrown. When all the connections have been made, the simulator calls the 
"connectionsComplete()" function, allowing the block to throw an exception 
if any of the required connections are missing. Any exceptions that are thrown by 
these two functions are caught by the simulator and turned into a meaningful error 
message which is displayed to the user. 
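The connect-then-check protocol described above can be sketched as follows. This is a minimal illustration in Python, not the CPUSim source: the class names, the port numbering and the "build" helper are all assumptions made for the example; only the two function names and the exception-based error reporting come from the text.

```python
class BlockConnectionError(Exception):
    """Raised when a connection is invalid or a required one is missing."""


class HardwareBlock:
    """Base class: every block exposes the same two connection functions."""

    name = "block"

    def connect_to_destination(self, destination_block, port_num):
        raise BlockConnectionError(f"{self.name} has no output ports")

    def connections_complete(self):
        pass  # nothing is required by default


class RegisterFile(HardwareBlock):
    name = "register file"


class InstructionFetch(HardwareBlock):
    name = "instruction fetch"


class InstructionDecoder(HardwareBlock):
    name = "instruction decoder"
    # Hypothetical port numbers and the block type each port must be wired to.
    PORT_TYPES = {0: InstructionFetch, 1: RegisterFile}

    def __init__(self):
        self.ports = {}

    def connect_to_destination(self, destination_block, port_num):
        expected = self.PORT_TYPES.get(port_num)
        if expected is None:
            raise BlockConnectionError(f"{self.name}: no port {port_num}")
        if not isinstance(destination_block, expected):
            raise BlockConnectionError(
                f"{self.name}: port {port_num} needs a {expected.name}")
        self.ports[port_num] = destination_block

    def connections_complete(self):
        missing = sorted(set(self.PORT_TYPES) - set(self.ports))
        if missing:
            raise BlockConnectionError(
                f"{self.name}: unconnected ports {missing}")


def build(connections, blocks):
    """Simulator side: wire the blocks, then turn exceptions into a message."""
    try:
        for source, destination, port_num in connections:
            source.connect_to_destination(destination, port_num)
        for block in blocks:
            block.connections_complete()
    except BlockConnectionError as error:
        return f"Connection error: {error}"
    return "ok"
```

Because every block answers the same two calls, the wiring can be described as plain data (a list of source, destination and port number) and changed at runtime, which is the property the GUI relies on.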
2.2.4 Visualisation and Statistics 
Good visualisation of a simulated system is vital, as this can greatly speed up both 
the development and the debugging of the blocks in the model. Good visualisation 
also aids the user in understanding the behaviour of the block within the context of 
the wider system. 
Often the visualisation of models is, unfortunately, an afterthought and is not an 
integrated part of many behavioural modelling systems. In some cases, the visualisation 
framework is not even written in the same language as the model implementation 
[79]. This can slow down the development of a visualisation and it can also 
discourage the model writer from producing a visualisation altogether. Not taking 
the time to produce an effective means of visualisation is a false economy that is far 
outweighed by the additional time taken to debug and understand the behaviour of 
a block within the context of the wider system. 
To address the above, CPUSim has an integrated visualisation framework that reduces 
the time taken to produce the visualisation for a block, whilst also providing 
several visualisation modules that can be reused. For example, the visualisation 
of the frame buffer block provides a graphical view of the buffer together with an 
editable, table based view of the raw data within the buffer. This table based view 
reuses the same visualisation component that is used by the memory block. It is 
worth noting that because the simulated system is built up graphically from individual 
blocks using a hierarchical block diagram, the user also has an automatically 
generated structural tree diagram of the system. 
Although visualisation is a good way to examine the state of a system at any given 
point in time, it is not well suited to gathering information on the behaviour of the 
system over a period of more than a few tens of clock cycles. In order to overcome 
this limitation, a statistics infrastructure is incorporated into CPUSim, allowing 
any block in the system to generate statistics on any aspect of its operation. The 
underlying infrastructure can then graphically display these statistics to the user on 
a block by block basis and/or produce a report containing some generic details of 
the simulation run, together with the statistics data from every block in the system. 
Some of the visual elements that make up the CPUSim system are shown in figure 
2.3. 
Figure 2.3: CPUSim simulating an experimental processor, showing the contents of 
the instruction cache and a Mandelbrot fractal generated by software running on 
the simulated processor. 
2.3 Summary 
Behavioural modelling is a vital tool for the exploration of device and system architectures. 
However, existing simulation systems lack key features such as an integral 
visualization framework and a statistics gathering mechanism. This not only increases 
model development time, but also hinders the user's understanding and usage 
of the system. In addition, the reliance on threading techniques in these simulators 
to provide the parallelism inherent in hardware is not only a source of 
bugs and longer development times, but also significantly decreases the speed of the 
resulting simulation. 
CPUSim was developed to address the shortcomings of existing behavioural simulation 
systems. Its centralized clocking mechanism significantly improves both model 
development and execution times. This, together with its integral visualisation/statistics 
frameworks and GUI based parameterization and connection system, makes 
CPUSim an ideal tool for quickly creating and evaluating new system concepts and 
architectures. 
Chapter 3 
EPIC Simulation 
3.1 CPU Architecture 
To evaluate potential reconfigurable computing tool flows and architectures, a model 
of a CPU and its associated peripherals was created for the behavioural simulator 
CPUSim, which was described in section 2.2. Although based on an experimental 
architecture, the CPU core is fundamentally an explicitly parallel instruction computing 
(EPIC) processor [80], sharing many design concepts with the Intel Itanium 
[81, 82] range of processors. However, the following differences are worth noting:-
• The instruction bundle size is variable, instead of being fixed at three instruc-
tions as is the case with the Itanium. 
• There is no register stack engine [83], so as with most other CPU architectures 
stacking has to be handled explicitly. 
The specification of the simulated CPU is shown in table 3.1. 
Effective core/bus clock ratio       10 
Max instruction fetches per cycle    6 
Max instruction issues per cycle     5 
General purpose register file        256 x 32 bit registers 
Guard register file size             32 registers 
Num exec units per type              3 
L1 I cache                           16 KBytes, 4 way set associative 
L1 D cache                           16 KBytes, 4 way set associative 
L2 cache                             256 KBytes, 8 way set associative 
L3 cache                             9 MBytes, 18 way set associative 
Table 3.1: EPIC CPU specification 
[Block diagram: Instruction Cache, Fetch, Decode, Schedule, Fixed Exec Units, Reconfigurable Exec Unit, Retire, Register File, Data Cache, Code Profiler] 
Figure 3.1: EPIC CPU core with additional RC blocks shown in red 
3.1.1 Additional RC Hardware Blocks 
For the purpose of evaluating potential reconfigurable computing architectures and 
tool flows, the direct connection topology described in section 1.2.1.1 was used. 
This topology has clear performance advantages and can easily be implemented in 
the simulation environment. To support reconfigurable computing, two additional 
blocks are added to the CPU instruction pipeline as shown in figure 3.1. 
3.1.1.1 Code Profiler 
The code profiler block is connected to the instruction fetch unit and monitors any 
jumps or branches that occur. This block consists of some control logic together with 
a cache that holds both the target address of the branch and the number of times 
that the branch has occurred. As new branches are encountered, additional cache 
entries are created and old entries are flushed out to memory, as necessary, based on 
a least recently used (LRU) policy [84]. This approach has two main advantages:-
• Since it is performed in hardware there is no software overhead associated with 
profiling the execution of a program. 
• Since a cache is incorporated into the block, the memory bandwidth consumed 
by the profile data is minimized. 
It has been demonstrated that performance profiling of this nature can be added 
to a MIPS 4Kp processor with a 2.4% increase in power consumption and a 10.5% 
increase in die area [47]. As the MIPS 4Kp is a relatively small CPU core by modern 
standards, the percentage die area utilized and the power consumed are both significantly 
higher than would be the case if a larger, more modern CPU core were to be 
used. 
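The behaviour of the profiler cache can be sketched in software as follows. This is a rough functional model only, assuming a fully associative cache keyed by branch target address; the class name, the "memory_writes" counter (a stand-in for memory bandwidth) and the cache size are illustrative and are not taken from the thesis.

```python
from collections import OrderedDict


class BranchProfiler:
    """Counts taken branches per target address in a small LRU cache."""

    def __init__(self, cache_entries):
        self.cache_entries = cache_entries
        self.cache = OrderedDict()  # target address -> count since last flush
        self.memory = {}            # backing store for flushed counts
        self.memory_writes = 0      # stands in for consumed memory bandwidth

    def record_branch(self, target_address):
        if target_address in self.cache:
            self.cache[target_address] += 1
            self.cache.move_to_end(target_address)  # mark most recently used
            return
        if len(self.cache) >= self.cache_entries:
            # Evict the least recently used entry and flush it to memory.
            victim, count = self.cache.popitem(last=False)
            self._flush(victim, count)
        self.cache[target_address] = 1

    def _flush(self, address, count):
        self.memory[address] = self.memory.get(address, 0) + count
        self.memory_writes += 1

    def profile(self):
        """Return total counts, merging cached and flushed entries."""
        totals = dict(self.memory)
        for address, count in self.cache.items():
            totals[address] = totals.get(address, 0) + count
        return totals


profiler = BranchProfiler(cache_entries=2)
for target in [0x100] * 5 + [0x200, 0x300, 0x100]:
    profiler.record_branch(target)
```

In this run eight branches are recorded but only two entries are written back to memory, which illustrates why holding the counts in a cache keeps the bandwidth used by the profile data low.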
3.1.1.2 Reconfigurable Execution Unit 
This block presents the same interface as other execution units in the processor, but 
in addition it has a connection to the data cache which enables DMA operations 
from within the reconfigurable hardware. The exact nature of this block is defined 
by the output of the RC conversion tools; however, there are two distinct operating 
modes:-
• Behavioural simulation, using the simulation infrastructure provided by CPUSim. 
• A wrapper interface to a real hardware implementation of the RC hardware 
as described in section 3.4.4. 
3.2 RC Conversion Algorithms 
The experimental RC tool flow used (see figure 3.2) is essentially the same as the 
flow described in section 1.2.2.3, differing only in that the hardware generation step 
operates on assembly code and not on the raw program binary; as such, there 
are additional disassembly/assembly stages in the flow. 
To facilitate the evaluation of different hardware conversion algorithms, the hardware 
generation tool contains a plug-in interface that abstracts the conversion process 
from the majority of the tool flow. Instruction combination and loop extraction 
hardware conversion plug-ins have also been developed (see sections 3.2.1 and 
3.2.2). The performance improvements produced by these conversion algorithms are 
compared in section 3.4. As the loop conversion algorithm was found to be effective, 
an enhanced version was developed; this is described in chapter 4. 
3.2.1 Instruction Combination 
As discussed in section 1.1.1, it is common practice to add instructions to processors 
that are specific to certain classes of applications [85, 86]. Although 
these operations can usually be performed using generic instructions, the number of 
instructions required is often greater, thus leading to lower performance. Additionally, 
it is not possible to predict exactly what specialist instructions will be required 
for every application. Furthermore, power consumption and die area limitations 
make it impractical to implement such a large number of additional instructions. 
Figure 3.2: EPIC CPU RC tool flow 
Processor manufacturers are, therefore, forced to implement only such commonly 
used instructions as the multiply and accumulate (MAC) [87]. 
By analyzing the instruction interrelationships in computationally intensive sections 
of code and by looking for patterns of instructions that occur repeatedly, it is possible 
to dynamically extend the instruction set to include instructions that would be 
beneficial to a program that is currently being executed. For example, code listing 
3.1 contains two occurrences of the "iadd(iadd(a, b), imul(c, d))" instruction 
pattern. Once these instruction patterns have been identified, they are converted 
into hardware blocks with the number of instances of each hardware block being 
based on the number of occurrences of the instruction pattern in the software, together 
with the available hardware resources in the RC area. The final stage of the 
optimization process is to remove all occurrences of the instruction pattern from 
the software and replace them with instruction(s) that refer to the newly created 
hardware, as shown in listing 3.2. It is worth noting that, because the instruction 
pattern in this example uses four different parameters, it requires two instructions 
to feed the data into the new execution unit. 
    and  r10, r10 0xff 
    imul r58, r2  r113   # Inst group 1 
    and  r34, r34 0xff 
    and  r11, r11 0xff 
    imul r59, r3  r113   # Inst group 2 
    and  r35, r35 0xff 
    imul r42, r34 r111 
    imul r50, r10 r112 
    imul r43, r35 r111 
    imul r51, r11 r112 
    iadd r97, r42 r50    # Inst group 1 
    iadd r98, r43 r51    # Inst group 2 
    iadd r97, r97 r58    # Inst group 1 
    iadd r98, r98 r59    # Inst group 2 
    ssr  r97, r97 0xf 
    ssr  r98, r98 0xf 
Listing 3.1: Sample code before instruction combination 
    and  r10, r10 0xff 
    and  r34, r34 0xff 
    and  r11, r11 0xff 
    and  r35, r35 0xff 
    imul r42, r34 r111 
    imul r50, r10 r112 
    imul r43, r35 r111 
    imul r51, r11 r112 
    hw0  r97, r42 r50    # RC insts for group 1 
    hw1  r97, r2  r113   # RC insts for group 1 
    hw0  r98, r43 r51    # RC insts for group 2 
    hw1  r98, r3  r113   # RC insts for group 2 
    ssr  r97, r97 0xf 
    ssr  r98, r98 0xf 
Listing 3.2: Sample code after instruction combination 
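To illustrate how the "iadd(iadd(a, b), imul(c, d))" pattern might be located in listing 3.1, the following Python sketch walks the straight-line code and follows each operand back to the instruction that produced it. It is only an assumed approach for illustration: the tuple encoding of instructions and the function name are hypothetical, and the real tool must also check details this sketch ignores (operand order, other users of the intermediate values, and registers being overwritten between the matched instructions).

```python
# Listing 3.1 encoded as (operation, destination, source 1, source 2).
LISTING_3_1 = [
    ("and", "r10", "r10", "0xff"),
    ("imul", "r58", "r2", "r113"),
    ("and", "r34", "r34", "0xff"),
    ("and", "r11", "r11", "0xff"),
    ("imul", "r59", "r3", "r113"),
    ("and", "r35", "r35", "0xff"),
    ("imul", "r42", "r34", "r111"),
    ("imul", "r50", "r10", "r112"),
    ("imul", "r43", "r35", "r111"),
    ("imul", "r51", "r11", "r112"),
    ("iadd", "r97", "r42", "r50"),
    ("iadd", "r98", "r43", "r51"),
    ("iadd", "r97", "r97", "r58"),
    ("iadd", "r98", "r98", "r59"),
    ("ssr", "r97", "r97", "0xf"),
    ("ssr", "r98", "r98", "0xf"),
]


def find_pattern(instructions):
    """Find dest = iadd(iadd(a, b), imul(c, d)) in straight-line code."""
    last_writer = {}  # register -> index of the instruction that last wrote it
    matches = []
    for index, (op, dest, src1, src2) in enumerate(instructions):
        if op == "iadd" and src1 in last_writer and src2 in last_writer:
            inner_add = instructions[last_writer[src1]]
            inner_mul = instructions[last_writer[src2]]
            if inner_add[0] == "iadd" and inner_mul[0] == "imul":
                # (a, b) feed the inner add, (c, d) feed the multiply.
                matches.append((dest, inner_add[2:], inner_mul[2:]))
        last_writer[dest] = index
    return matches


MATCHES = find_pattern(LISTING_3_1)
```

For listing 3.1 this yields one match writing r97 with operands (r42, r50) and (r2, r113), and one writing r98 with operands (r43, r51) and (r3, r113), which are the operand pairs carried by the "hw0" and "hw1" instructions in listing 3.2.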
3.2.2 Loop Conversion 
The loop conversion algorithm uses the execution profile to locate the most compu-
tationally intensive loop that is viable and then translates this in its entirety to a 
hardware data flow pipeline. The following three factors are considered to determine 
whether a loop is viable and if it can be converted into hardware:-
Hardware implementable instructions: All the instructions in the candidate 
loop must be able to be converted into hardware. Whilst this is true of the 
vast majority of instructions, some classes of instructions cannot easily be implemented 
in a data flow pipeline (e.g. traps, breaks, and cache operations). 
Non-branching code: The loop body must not contain any branching code, as 
a hardware pipeline cannot implement non-linear execution flow. Since the 
host EPIC CPU implements guarding [88, 89], the majority of non-linear code 
flow is resolved to data flow by the compiler. In most cases this results in the 
remaining branching being due to function calls and/or nested loops. 
Hardware resource usage: During the early stages of the conversion process, the 
tools estimate the hardware resource usage and the candidate loop is rejected 
if this estimate is greater than the available hardware resources. 
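The three viability checks above could be expressed along the following lines. This is a hedged sketch rather than the tool's actual logic: the mnemonic sets, the per-operation logic cell costs and the default cost are all invented for the example.

```python
# Assumed mnemonics; the real instruction set names may differ.
UNSUPPORTED_OPS = {"trap", "break", "cacheop"}
BRANCH_OPS = {"ajump", "call"}


def loop_is_viable(body_ops, available_cells, cell_cost, default_cost=100):
    """Return (viable, reason) for a loop body given as a list of mnemonics.

    body_ops excludes the loop's own closing branch. cell_cost maps a
    mnemonic to an estimated logic cell count (illustrative figures only).
    """
    for op in body_ops:
        if op in UNSUPPORTED_OPS:
            return False, f"'{op}' cannot be implemented in the pipeline"
        if op in BRANCH_OPS:
            return False, f"'{op}' introduces non-linear execution flow"
    estimate = sum(cell_cost.get(op, default_cost) for op in body_ops)
    if estimate > available_cells:
        return False, f"needs about {estimate} cells, {available_cells} available"
    return True, f"estimated {estimate} cells"


COSTS = {"iadd": 40, "imul": 600, "aload": 120, "astore": 120}
BODY = ["aload", "isub", "astore", "iadd", "neq"]
```

Ordering the checks from cheapest to most expensive means a loop containing a function call is rejected before any resource estimate is made.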
The actual mapping from software to hardware is performed by instantiating a 
hardware primitive for each instruction in the loop, and these are then placed onto 
a hardware pipeline similar to the one shown in figure 3.3. This results in parallelism 
in two orthogonal directions:-
• Conventional instruction level parallelism (ILP) is exploited by packing multiple 
operations onto the same pipeline stage. 
• Loop level parallelism is exploited by having multiple data sets iterating through 
the pipeline at the same time. This is equivalent to processing multiple iterations 
of the loop simultaneously. 
[Diagram: source and result registers and loop constant registers feeding successive stages of operations, separated by inter-stage registers] 
Figure 3.3: Example RC data flow pipeline 
3.2.2.1 Data Flow Pipeline 
As shown in figure 3.3 there are three distinct classes of registers in the pipeline: 
source/result, inter-stage and loop constant registers:-
Source/result registers: The source/result registers are on the first stage of the 
pipeline. They hold all initial values and can only be read by the operations 
on this first stage; however, they may be written to from any stage of the 
pipeline. Once all the required registers of this type have been written to, the 
next iteration can start regardless of whether the previous one has finished. 
Inter-stage registers: The inter-stage registers hold temporary/intermediate values 
used throughout the loop, and are only accessible by operations on adjacent 
stages. As a result, if a value is generated on one stage but only used on a 
stage much further down the pipeline, data forward operations must be placed 
into the intermediate stages in order to carry the value down the pipeline to 
the stage where it is required. 
Loop constant registers: The loop constant registers are used to hold values that 
are read, but not written to, during the execution of the pipeline and these 
are directly accessible from every pipeline stage. Although their functionality 
could be implemented by a combination of the data forward operations, 
source/result and inter-stage registers, the use of a dedicated set of registers 
for this function significantly reduces the hardware resources that are required. 
Because this hardware architecture uses a distributed register file instead of the 
centralized one found in most processors, there are no limitations imposed by the 
number of registers or read/write ports. It is also worth noting, that this approach is 
ideally suited to modern FPGAs which have flip flops distributed evenly throughout 
their logic fabric. 
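The interaction between the three register classes can be illustrated with a small scheduling sketch. It assumes an as-soon-as-possible placement in which every operation takes exactly one stage; the thesis does not state which placement strategy the tools use, so the function, its encoding of instructions and the example loop body (loosely based on a load, modify, store sequence with an array base in "r4") are assumptions.

```python
def schedule(instructions, loop_constants=()):
    """Place each (op, dest, sources) on the earliest possible stage.

    Returns (stages, forward_ops). Stage 0 holds the source/result
    registers, so operations start at stage 1. Sources listed in
    loop_constants are readable from every stage and need no forwarding.
    """
    produced_at = {}  # register -> stage whose output registers hold it
    last_use = {}     # (register, producing stage) -> furthest consuming stage
    stages = []
    for op, dest, sources in instructions:
        inputs = [s for s in sources if s not in loop_constants]
        stage = 1 + max((produced_at.get(s, 0) for s in inputs), default=0)
        for s in inputs:
            key = (s, produced_at.get(s, 0))
            last_use[key] = max(last_use.get(key, 0), stage)
        produced_at[dest] = stage
        stages.append(stage)
    # A value made on stage p and used on stage c must be carried through
    # the c - p - 1 stages in between by data forward operations.
    forwards = sum(use - made - 1 for (_, made), use in last_use.items())
    return stages, forwards


# Register operands only; immediates are omitted for brevity.
LOOP_BODY = [
    ("iadd", "r6", ["r2"]),
    ("aload", "r5", ["r4", "r6"]),
    ("isub", "r5", ["r5"]),
    ("astore", "mem", ["r5", "r4", "r6"]),
    ("iadd", "r2", ["r2"]),
]

WITH_CONSTANTS = schedule(LOOP_BODY, loop_constants={"r4"})
WITHOUT_CONSTANTS = schedule(LOOP_BODY)
```

Under these assumptions the body occupies four stages. Treating "r4" as a loop constant leaves two data forward operations (carrying "r6" to the store), whereas holding it in an ordinary source register would need five, which is the saving the dedicated loop constant registers are intended to provide.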
Execution in the data flow pipeline is split into three distinct phases: initialization, 
execution and retirement. In the initialization phase, the initial values are written 
into the loop constant and source/result registers. Once this operation is complete, 
the pipeline is triggered and the data flows down the pipeline and is processed as it 
passes through the operations on each stage. When the source/result registers have 
been filled with all the data required for the next iteration and the loop branch operation 
indicates another iteration is required, the next data set starts flowing down 
the pipeline. Once the loop branch instruction indicates no further iterations are 
required, and all remaining data has propagated through the pipeline, the final retirement 
phase is started. In this phase all the required values from the source/result 
registers are copied back to the appropriate registers in the host processor. 
3.3 Performance evaluation 
3.3.1 Test Algorithms 
To evaluate the performance of the hardware conversion system described in section 
3.2, three algorithms with different characteristics were chosen. These are shown in 
table 3.2. For each algorithm the performance of the software was compared to that 
of the hardware optimized version over a range of values for two critical parameters:-
Technology Scale Factor (TSF): This is a measure of how much slower the reconfigurable 
area of the processor is when compared to a standard fixed execution 
unit (i.e. the ratio of the time taken for execution in the reconfigurable 
logic to the time taken in hardwired application specific integrated 
circuit (ASIC) logic). Thus the TSF is a measure of the overhead present in 
reconfigurable hardware when compared to standard ASIC logic. It is worth 
noting that the TSF does not take into account other factors that will affect 
performance, such as the number of execution units present in the processor. 
Loop unroll factor: This is the number of times that the body of the critical loop 
has been duplicated and is consequently equal to the reduction factor in the 
number of loop iterations. This increases the ILP in the code, and as this is a 
common software optimization technique, it becomes important to know how 
this affects the hardware conversion. 
Algorithm         Bandwidth to        Loop branch      Instruction dependencies 
                  instruction ratio   point position   (no unroll) 
Copy              High                Beginning        High 
Half brightness   Medium              Beginning        Medium 
Mandelbrot        Low                 End              High 
Table 3.2: Test algorithms and characteristics 
3.3.1.1 Copy algorithm 
This test was a simple 80 KByte copy from one area of memory to another. The 
main purpose of this test was to see if the hardware conversion would have a detrimental 
effect under the worst case scenario (i.e. a low number of highly dependent 
instructions which are extremely bandwidth limited). 
3.3.1.2 Half Brightness Algorithm 
This algorithm converts an image from the RGB colour space to the YUV colour 
space, halves the luminance value of each pixel and converts it back to the RGB 
colour space. 
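As a concrete illustration of the per-pixel work, a floating point sketch of the half brightness operation is given below. The conversion weights are the common BT.601 values and are an assumption, since the thesis does not state which coefficients were used; the tested implementation is also unlikely to have used floating point arithmetic in this form.

```python
def half_brightness(pixel):
    """Halve the luminance of one (r, g, b) pixel with 8 bit channels."""
    r, g, b = pixel
    # RGB -> YUV (BT.601 weights, assumed for this example).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    # Halve the luminance, leaving the chrominance untouched.
    y *= 0.5
    # YUV -> RGB, inverting the equations above.
    r2 = y + v / 0.877
    b2 = y + u / 0.492
    g2 = (y - 0.299 * r2 - 0.114 * b2) / 0.587
    # Clamp to the valid channel range and round to integers.
    return tuple(max(0, min(255, round(c))) for c in (r2, g2, b2))
```

Each pixel needs one load, a moderate number of multiply and add operations, and one store, which is consistent with the medium bandwidth to instruction ratio listed for this algorithm in table 3.2.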
3.3.1.3 Mandelbrot Algorithm 
The Mandelbrot fractal was chosen as a test algorithm because of its computationally 
intensive nature and minimal bandwidth requirements. In this case, performing a 
traditional loop unroll (i.e. replicating the body of the loop) is of little benefit, due 
to the high number of data dependencies. To address this, the loop is unrolled by 
processing multiple pixels in a single iteration. Since there are no data dependencies 
between adjacent pixels, a significant performance improvement in the software case 
is produced and this operation is typical of the type of optimization that a software 
developer might perform. 
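The multi-pixel form of unrolling can be sketched as follows. The function name, the escape radius of 2 and the iteration limit are assumptions for the example; the point is only that each pass of the inner loop advances several independent pixels, so the updates within a pass have no dependencies on one another.

```python
def mandelbrot_row(c_values, max_iter=50, pixels_per_iteration=2):
    """Iteration counts for a row, advancing several pixels per loop pass."""
    counts = [0] * len(c_values)
    for start in range(0, len(c_values), pixels_per_iteration):
        group = range(start, min(start + pixels_per_iteration, len(c_values)))
        z = {i: 0j for i in group}
        active = set(group)
        while active:
            # One pass of the unrolled loop body: the per-pixel updates
            # are independent, so they could all be issued in parallel.
            for i in list(active):
                z[i] = z[i] * z[i] + c_values[i]
                counts[i] += 1
                if abs(z[i]) > 2.0 or counts[i] >= max_iter:
                    active.discard(i)
    return counts


ROW = [0j, 2 + 2j, -1 + 0j, 1 + 0j]
```

The result is independent of "pixels_per_iteration"; only the amount of independent work available in each pass changes.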
3.3.2 Loop Dependencies 
Software loops commonly contain variables whose values are used to determine if 
another iteration of the loop is required. Listing 3.3 shows a simple code loop that 
decrements the values in a ten element array. In this example the load and store 
operations use the value of the loop counter "r2" as an offset into the array and as 
a result, the loop counter cannot be incremented until the store operation has been 
issued. When the loop is converted into hardware, an inter-iteration dependency 
is created that reduces the number of simultaneous loop iterations that can be 
executed, impairing performance. By replicating the value of the loop counter as 
shown in listing 3.4, this dependency can be removed together with the associated 
performance limitation. Where applicable, this optimization was performed by hand 
on the test algorithms and the performance was then compared to the original non-optimized 
test case. 
          iadd   r2, 0 0 
    loop: 
          aload  r5, r4 r2 
          isub   r5, r5 -1 
          astore r5, r4 r2 
          iadd   r2, r2 1 
          neq    g2, r2 10 
    g2    ajump  loop 
Listing 3.3: Example code loop with dependency 
          iadd   r2, 0 0 
    loop: 
          iadd   r6, r2 0 
          aload  r5, r4 r6 
          isub   r5, r5 -1 
          astore r5, r4 r6 
          iadd   r2, r2 1 
          neq    g2, r2 10 
    g2    ajump  loop 
Listing 3.4: Example code loop without dependency 
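The cost of the inter-iteration dependency can be estimated with a simple timing model. It rests on the rule from section 3.2.2.1 that a new iteration may start once the previous one has written all of its source/result registers, and additionally assumes one pipeline stage per clock cycle with no memory stalls. The function name and the four stage depth used in the example are illustrative.

```python
def pipeline_cycles(iterations, depth, last_source_write_stage):
    """Cycles to run a loop through a data flow pipeline.

    depth: number of pipeline stages.
    last_source_write_stage: stage on which the final source/result
    register (here the loop counter) is written, which sets the interval
    between the starts of successive iterations.
    """
    if iterations == 0:
        return 0
    interval = last_source_write_stage
    return (iterations - 1) * interval + depth


# Ten iterations through an assumed four stage pipeline.
WITH_DEPENDENCY = pipeline_cycles(10, depth=4, last_source_write_stage=4)
WITHOUT_DEPENDENCY = pipeline_cycles(10, depth=4, last_source_write_stage=1)
```

Under these assumptions, writing the loop counter only on the last stage gives 40 cycles, because the iterations effectively run one after another, while writing it on the first stage gives 13 cycles, because a new iteration enters the pipeline every cycle.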
3.4 Results 
3.4.1 Copy algorithm 
Due to the small number of instructions present in this algorithm, the instruction 
combination conversion was not able to identify any candidate groups of instructions 
and as a result no hardware acceleration was performed. 
The results for the copy algorithm with the loop conversion applied are shown in 
figure 3.4. These graphs have been scaled so that the effects of the hardware conversion 
can be clearly observed and as a result the minor fluctuations in performance 
(0.6%) that are produced by the pseudo-random nature of the cache appear to be 
quite large. 
As shown in figure 3.4(a), the hardware conversion produces a marginal increase in 
performance even in the worst test case (unroll factor = 1, TSF = 16). However, 
as the unroll factor increases, the performance of all hardware accelerated test cases 
quickly reaches the bandwidth limit. 
By removing the loop iteration dependency, as described in section 3.3.2, the perfor-
mance of all hardware accelerated test cases reaches the bandwidth limit, as shown 
in figure 3.4(b). Removing this dependency had no effect on the performance of the 
software only case. 
Overall the performance improvement ranged from 0.02% to 0.8%. Although these 
improvements are minimal and would not normally warrant the hardware resources 
used, this demonstrates that the loop conversion algorithm performs well even under 
the extreme condition of a low number of highly dependent and bandwidth intensive 
instructions. 
3.4.2 Half Brightness Algorithm 
Figure 3.5(a) shows the increase in performance generated by the instruction combination 
when compared to the normal software case. At best, the conversion produced 
a 6.8% speed improvement with a TSF value of one. Additionally, as the TSF 
value is increased this marginal gain quickly turns into a reduction in performance. 
Unrolling the loop increases the amount of ILP. This reduces the performance improvement 
of the instruction combination with low TSF values and also reduces the 
performance degradation produced at higher TSF values. 
[Plots against unroll factor for SW and HW (TSF = 1, 2, 4, 8, 16): (a) Dependency present, (b) Dependency removed] 
Figure 3.4: Performance improvement of copy algorithm with loop extraction 
As shown in figure 3.5(b), removing the loop iteration dependency increased the 
performance of both software and hardware accelerated test cases by around 9%. 
However, this improvement in performance quickly diminishes as the unroll factor 
is increased. Again, this is due to the increased ILP associated with unrolling the 
loop. 
The performance improvement provided by the loop conversion algorithm is shown in 
figure 3.6(a). A performance improvement of 3.1x is demonstrated in the ideal case 
of a TSF value of one. This decreases to 67% as the unroll factor is increased to eight. 
More realistic values of the TSF produce a substantial reduction in performance. At 
high unroll factors, the performance of the software is limited by the maximum issue 
rate of the instruction pipeline. Since this does not affect the hardware pipeline, 
the performance continues to increase as the unroll factor increases. As a result, 
the performance degradation produced by high values of the TSF is reduced as the 
unroll factor increases. With the loop iteration dependency removed, all hardware 
accelerated test cases produced a significant performance improvement ranging from 
6.6x to 1.75x as shown in figure 3.6(b). As the unroll factor is increased beyond the 
value of three, there is little additional improvement in the performance of the 
hardware due to the performance of the hardware pipeline being limited by the 
available memory bandwidth. 
3.4.3 Mandelbrot Algorithm 
As with the half brightness test, when the instruction combination algorithm is 
applied to the Mandelbrot test, only a marginal improvement in performance is 
observed with low values of the TSF (28% performance improvement, TSF = 1, 
unroll factor = 15). This quickly becomes a reduction in performance as the TSF 
increases, as shown in figure 3.7. It was not possible to perform loop dependency 
removal on the Mandelbrot algorithm, because the decision as to whether or not to 
perform another iteration is based on the resultant values from the body of the 
loop and not on values that are available earlier in the iteration. 
[Plots against unroll factor for SW and HW (TSF = 1, 2, 4, 8, 16): (a) Dependency present, (b) Dependency removed] 
Figure 3.5: Performance improvement of half brightness algorithm with instruction 
combination 
[Plots against unroll factor for SW and HW (TSF = 1, 2, 4, 8, 16): (a) Dependency present, (b) Dependency removed] 
Figure 3.6: Performance improvement of half brightness algorithm with loop extraction 
Figure 3.8 shows the effects of the loop conversion algorithm on the Mandelbrot test. 
As the unroll factor increases, the performance of the software only case increases 
until the issue rate of the instruction pipeline becomes the limiting factor. Since 
the data flow pipeline is not affected by either the instruction issue rate or by the 
available bandwidth, its performance continues to scale almost linearly throughout 
all the test cases. This results in a maximum speed improvement of 6.4 times with 
a TSF of one and an unroll factor of sixteen. As shown previously, the performance 
improvement degrades as the TSF increases. With a TSF value equal to eight the 
performance becomes worse than that in the software only case. 
3.4.4 HW Pipeline Implementation 
To validate the performance improvements obtained by the loop conversion algorithm, 
the model of the data flow pipeline in the simulator was replaced with a 
wrapper block linking the simulator with a real pipeline, which was implemented 
using FPGAs as shown in figure 3.9. To keep the simulator synchronized with the 
hardware, the master simulator clock was used to clock the FPGAs. The parallel 
port was chosen for the interface between the FPGAs and the PC due to its 
low latency characteristics. This was necessary in order to achieve an acceptable 
simulation speed, given the clock and DMA connections between the FPGAs and 
the simulator. Additionally, the hardware conversion tool was modified to produce 
VHDL for the FPGAs instead of a simulator model of the pipeline. 
[Plot against unroll factor (1 to 16) for SW and HW (TSF = 1, 2, 4, 8, 16)] 
Figure 3.7: Performance improvement of Mandelbrot algorithm with instruction 
combination 
[Plot against unroll factor (1 to 16) for SW and HW (TSF = 1, 2, 4, 8, 16)] 
Figure 3.8: Performance improvement of Mandelbrot algorithm with loop extraction 
[Photograph key: 1 PC interface FPGA, 2 RC pipeline FPGA, 3 Monitoring tile, 4 PC parallel port connection, 5 Logic analyzer connections] 
Figure 3.9: FPGA tiles used to implement real hardware data flow pipeline 
The following factors determined which test cases could be implemented in the 
FPGAs:-
Hardware resources: The FPGA used to implement the data flow pipeline was 
an Altera APEX 20K1500. As this device contains 51,840 logic cells, it is not 
possible to implement a pipeline that requires a higher logic cell count. 
Floating point maths: Due to time constraints, floating point primitives were not 
created for the FPGAs. Therefore, it was not possible to implement any 
pipeline that contains floating point instructions. 
TSF: To simplify the generation of the VHDL, only pipelines with a TSF value of 
one were implemented in hardware. 
Table 3.3 shows which tests were re-implemented using the FPGA. In each case, the 
number of clock cycles required to run the test was exactly the same as the simulation 
model. Even though only a limited number of test cases could be implemented in the 
FPGA, the fact that the clock cycle counts precisely matched those in the previous 
simulations provides a reasonable level of confidence in the accuracy of the remaining 
test cases. 
                  FPGA implemented test cases 
Algorithm         Iteration dependency removed   TSF value   Unroll factors 
Copy              No                             1           1 - 8 
Copy              Yes                            1           1 - 8 
Half brightness   No                             1           1 - 4 
Half brightness   Yes                            1           1 - 6 
Mandelbrot        None 
Table 3.3: Test cases implemented in FPGA hardware 
3.4.5 Summary 
The instruction combination algorithm produced only marginal improvements in
performance, due to limited exploitation of the available parallelism and also
because of the bottlenecks imposed by the instruction pipeline. The improvements
were only observed when extremely low and unrealistic values were chosen for the
TSF.
The loop conversion algorithm significantly improved performance over a wide range
of TSF values, especially when it was combined with the loop iteration
dependency removal optimization. This conversion technique even managed to
produce a marginal performance improvement under the extreme conditions that
were present in the copy test cases (i.e. low unroll factors, high TSFs and high
bandwidths). As such, this hardware generation algorithm is a viable candidate
for further investigation and optimization. To this end, an enhanced version of the
algorithm was developed and is described in chapter 4.
Chapter 4 
Loop Conversion 
During the evaluation of different hardware conversion algorithms in section 3.4, it 
was concluded that the loop conversion algorithm had the greatest potential to form 
the basis of a viable reconfigurable computing platform. Accordingly, this conversion 
algorithm was re-implemented with additional features and optimizations that would 
both increase performance and also allow it to be evaluated on a wider range of test 
systems. In order to achieve the goal of a completely abstracted reconfigurable 
computing system this entire tool flow and set of optimizations would be performed 
at runtime. A block diagram of this loop conversion process is shown in figure 4.1. 
This system contains two abstraction layers which are shown in green:-
Platform abstraction layer This isolates the conversion algorithms from the host
platforms, allowing the tool to be easily and quickly ported to different
computer systems.
Output target abstraction layer This layer separates the pipeline generation 
stage from the physical implementation details of the RC area, thus enabling 
the same conversion system to be used with different hardware platforms, as 
well as the behavioural simulation system described in section 2.2. 
Figure 4.1: Simplified loop conversion block diagram (inputs: execution profile
and program executable; platform abstraction plug-ins; stages: identification,
instruction linearization, optimisation with a plug-in library, instruction
insertion and pipeline generation; output target converters; outputs: new
program executable and hardware config images)
In addition to being able to work with a variety of host platforms
and target RC areas, the new loop conversion tool flow is also able to convert
several computationally intensive loops from the same program to hardware. It is
worth noting that the tool flow operates on small sections of the program in situ
and therefore avoids the need to disassemble the entire program. This significantly
speeds up the conversion process when it is applied to large applications.
4.1 Abstract Instruction Model 
As part of the platform abstraction layer, the instructions in the source program are
converted to an internal abstract instruction model (AIM). The abstract
instructions are generic in nature and can be used with any input parameter type (i.e.
static value, register, constant register), unlike conventional instructions, where the
types of the parameters are usually fixed when the instruction set is defined. This
flexibility reduces the number of abstract instructions that must be supported, as
several instructions from the same CPU ISA can be mapped onto the same abstract
instruction (e.g. the MIPS XOR and XORI map to the same abstract instruction).
The generic nature of the abstract instructions also means that instructions from
different ISAs will map onto the same abstract instruction (e.g. both the MIPS and
IA64 ISAs have 32 bit addition operations). Most of the abstract instructions
correspond to hardware primitives that can be instantiated in the RC area (e.g. a 32
bit adder). Any instructions that do not have hardware primitives must be removed by
the optimisation steps in the conversion process, or the hardware generation will fail.
To utilise the essentially unlimited number of registers (due to the 1:1 flipflop to
LUT ratio) that can be instantiated in a hardware pipeline, the AIM must also
have an unlimited number of registers. During the mapping from CPU ISA to AIM,
each CPU register is mapped onto a different AIM register. This leads to
an initial register utilisation equal to the number of registers present in the host
ISA. The various optimisation steps performed during the hardware conversion
(described in sections 4.5 and 4.5.1) remap the register numbers so that a much larger
number of registers are used than were present in the original host ISA. This increases
ILP and therefore performance.
The AIM is a representation of what operations are performed, together with
information about what data is passed between operations (and not how it is passed).
Therefore the AIM can represent a wide variety of instruction sets including EPIC,
RISC, and CISC. The concept of using an AIM to represent a wide variety of ISAs
(e.g. MIPS [90], IA32 [86]) is proven and is extensively used by Transitive
[91].
4.1.1 Supporting Additional ISAs
The following steps must be performed to add support for a new instruction set:-
Write instruction decoder An instruction decoder must be written to take the
binary representation of the instructions and translate them into the op-codes
and the register numbers/parameter values. The instruction decoder translates
from the CPU specific ISA to a format suitable for the AIM.
Create mapping tables The mapping table contains all the op-codes in the ISA
together with references to the abstract instructions they are equivalent to.
NULL references are placed in the tables for any op-code that cannot be
supported in the AIM (e.g. cache instructions).
Implement additional abstract instructions The majority of instructions are
common to nearly all CPU ISAs. As a result the same abstract instructions
can be re-used for many host CPUs. However, some rare instructions will be
specific to certain CPUs (e.g. vector instructions like MMX). Although such
instructions could be left out of the AIM, this would prevent the conversion
system from processing any code that contained such instructions. The best
approach is to add the CPU specific instructions to the AIM. This is a simple
matter of writing a small VHDL fragment (called a primitive) that performs
the same function as the instruction, and adding it to the AIM table along with
various estimates of its latency and hardware resource usage.
General information In addition to detailed information about the instructions, 
the hardware conversion system requires some general information about the 
host platform. This includes the number of branch delay slots, the register the 
stack pointer is stored in, and which instructions will trigger execution in the 
RC area. 
Physically the AIM is separated from the CPU ISA by Java interfaces [76]. This
enforces the strict rules defined by the abstraction layer, and provides a starting point
when implementing front ends for new host CPUs.
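The mapping table step above can be illustrated with a minimal sketch. It is written in Python rather than the Java used by the tool, and the names AbstractOp, MIPS_MAP and to_abstract, as well as the three abstract instructions shown, are hypothetical and chosen only for illustration. None stands in for the NULL reference described above.

```python
from enum import Enum, auto


class AbstractOp(Enum):
    """Hypothetical abstract instructions (the real AIM set is not listed here)."""
    ADD = auto()
    XOR = auto()
    LOAD = auto()


# Mapping table: ISA op-code mnemonic -> abstract instruction.
# None plays the role of the NULL reference for unsupported op-codes.
MIPS_MAP = {
    "ADDU": AbstractOp.ADD,
    "ADDIU": AbstractOp.ADD,   # immediate form shares the same abstract op
    "XOR": AbstractOp.XOR,
    "XORI": AbstractOp.XOR,    # XOR and XORI map to one abstract instruction
    "LW": AbstractOp.LOAD,
    "CACHE": None,             # cache instructions cannot be represented
}


def to_abstract(mapping, mnemonic):
    """Look up the abstract instruction, rejecting unsupported op-codes."""
    op = mapping.get(mnemonic)
    if op is None:
        raise ValueError(f"{mnemonic} has no abstract instruction")
    return op
```

Because the parameter type (register or immediate) is not part of the abstract instruction, the register and immediate forms collapse onto a single table entry, which is the reduction in instruction count described in section 4.1.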
4.2 Targeting The Hardware Pipeline 
The final hardware is a pure data flow pipeline; it does not process instructions in the
same way as a conventional CPU. Instead, each instruction in the original software
becomes a separate piece of hardware in the pipeline. The following
characteristics enable the creation of an efficient hardware pipeline that produces significant
increases in performance:-
Linear non-branching code A hardware pipeline cannot handle non-linear
instruction streams; however, software typically contains branching code.
Therefore all branching must be removed (see section 4.4).
High levels of ILP The higher the level of ILP, the more parallelism will be
exploited by the final hardware pipeline. Several optimizations are performed
during the conversion process to increase the levels of ILP (see sections 4.5.5,
4.5.1 and 4.6.1).
Low memory bandwidth requirements In many cases the performance of the
hardware pipeline is limited by the available memory bandwidth. Although
memory bandwidth is primarily a characteristic of the original algorithm and
the way the software was written, some optimizations can be performed that
will reduce the bandwidth required (see section 4.5.2).
No inter-iteration dependencies As the hardware pipeline can simultaneously
process several iterations, the presence and position of inter-iteration
dependencies (see section 3.3.2) will significantly reduce performance.
Instruction complexity The latency of each instruction is related to its
complexity (e.g. a multiply has a much higher latency than a bitwise AND). The lower
the latencies, the more instructions can be packed onto a single pipeline stage,
therefore leading to higher levels of ILP and performance.
Sections 4.3 to 4.6.1 describe the processes and optimizations that are performed to 
make the software suitable for implementation in a hardware data flow pipeline. 
4.3 Loop Identification 
The first stage in the loop conversion process is to inspect those sections of the
program surrounding the branch target addresses in the execution profile. This
process allows the algorithm to determine which branches are due to loops and which
branches are due to other control flows, such as "if, else, switch" statements.
Candidate loops are then discarded if they fail to meet the following criteria:-
Hardware implementable instructions All the instructions in the candidate
loop must be capable of being converted into hardware (whilst this is true
of the vast majority of instructions, some classes of instructions cannot easily
be implemented e.g. traps, breaks, cache operations).
Function calls or nested loops The loop must not contain any nested loops or
function calls (although this appears to be a major limitation, small functions
will usually be in-lined by the compiler, and if a nested loop is present that
loop itself will still be considered as a candidate for conversion).
Once the candidate loops have been identified, a measure of their computational 
intensity is calculated by the hardware conversion tools (equation 4.1a shows the 
simplest way of doing this). However, as many loops contain "if, else, switch"
statements, the total number of instructions can be very different from the number 
of instructions that are executed per iteration. To allow for this, a worst case value 
is calculated using equation 4.1b. This rough measure of execution time is used, 
together with an estimate of the required hardware resources, when deciding which 
loops should be implemented in the RC area. 
\text{Loop exec time} \approx \text{Num iterations} \times \text{Total num instructions} \quad (4.1a)
\text{Maximum exec time} \approx \text{Num iterations} \times \text{Num instructions in longest path} \quad (4.1b)
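The two estimates can be worked through on a small example. The sketch below is illustrative only: the block names, the representation of the loop body as a dictionary of basic block sizes, and the function names are assumptions, and the body is assumed to be acyclic (which matches the requirement that candidate loops contain no nested loops). The example body has the shape of an if-else: one branch instruction followed by two alternative paths of two instructions each.

```python
from functools import lru_cache


def loop_exec_time(num_iterations, block_sizes):
    """Equation 4.1a: iterations x total number of instructions in the loop."""
    return num_iterations * sum(block_sizes.values())


def max_exec_time(num_iterations, block_sizes, successors, entry):
    """Equation 4.1b: iterations x number of instructions on the longest path."""
    @lru_cache(maxsize=None)
    def longest_from(block):
        # Longest instruction count from this block to any exit of the body.
        tails = [longest_from(nxt) for nxt in successors.get(block, ())]
        return block_sizes[block] + max(tails, default=0)

    return num_iterations * longest_from(entry)


# Hypothetical loop body: a branch, then either the "if" or the "else" path.
BLOCK_SIZES = {"head": 1, "if_body": 2, "else_body": 2}
SUCCESSORS = {"head": ("if_body", "else_body")}
```

For 100 iterations, equation 4.1a counts all 5 instructions per iteration (500 in total), whereas equation 4.1b counts only the 3 instructions on the longest path (300 in total).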
4.4 Instruction Linearization 
The target RC area is used to implement a data flow pipeline similar to the one 
described in section 3.2.2.1. Since this type of pipeline cannot directly implement 
the branches in the execution flow that are common in the software domain, the loop 
conversion system removes all branches by using a combination of two techniques: 
multiplexer (MUX) insertion and instruction guarding. This process is performed
recursively, in order that complex program flows containing sets of nested branch
instructions can be converted to hardware. The end product of the instruction
linearization stage is a serial stream of non-branching instructions that perform the
same function as the original loop body.

        blez r2 elseBody
        r3 add r3 1
        b end
    elseBody:
        r3 sub r3 1
        r1 mov r2
    end:
Listing 4.1: If-else statement implemented with branches

    g0 blez r2
    r5 add r3 1
    r6 sub r3 1
    r4 mov r2
    r1 mux g0 r1 r4
    r3 mux g0 r5 r6
Listing 4.2: If-else statement implemented with MUXs
4.4.1 MUX Insertion 
Listing 4.1 shows the way in which an "if / else" statement is coded in assembly.
First, the branching is removed by executing both branch paths, and then one
or more MUXs, as shown in listing 4.2, are appended (note that the MUXs are
controlled by a boolean value generated by the original branch instruction). If a
register is changed in both branch paths (e.g. "r3"), then the corresponding MUX
will select between the two newly generated values. If, however, the register value
was only changed on one of the branch paths (e.g. "r1"), then the MUX will
select between the original value and the value generated by the branch path. MUX
insertion may lead to redundant instructions being present. An example of this is the
"mov" instruction in listing 4.2; however, this will be removed by the optimization
process described in section 4.5.4.
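The equivalence of listings 4.1 and 4.2 can be checked with a short behavioural model. This is a sketch only: the function names are hypothetical, and the operand order of the MUX (the second data input is selected when the control value is true) is inferred from listing 4.2 rather than stated in the text.

```python
def branching(r1, r2, r3):
    """Listing 4.1: if-else implemented with branches."""
    if r2 <= 0:            # blez r2 elseBody
        r3 = r3 - 1        # r3 sub r3 1
        r1 = r2            # r1 mov r2
    else:
        r3 = r3 + 1        # r3 add r3 1
    return r1, r3


def mux(select, when_false, when_true):
    """2:1 multiplexer; operand order is an assumption inferred from listing 4.2."""
    return when_true if select else when_false


def linearized(r1, r2, r3):
    """Listing 4.2: both paths are executed, then MUXs pick the live values."""
    g0 = r2 <= 0           # g0 blez r2
    r5 = r3 + 1            # r5 add r3 1
    r6 = r3 - 1            # r6 sub r3 1
    r4 = r2                # r4 mov r2
    r1 = mux(g0, r1, r4)   # r1 changed on one path only: old value or new value
    r3 = mux(g0, r5, r6)   # r3 changed on both paths: choose between new values
    return r1, r3
```

Both functions return the same register values for any input, which is the property the linearization stage has to preserve.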
        blez r2 end
        sw r3 r4
    end:
Listing 4.3: Conditional store implemented with branching

    g0 blez r2
    sw g0 r3 r4
Listing 4.4: Conditional store implemented with guarding
4.4.2 Instruction Guarding 
MUX insertion cannot be used on instructions that have effects external to the CPU
core/RC area (e.g. store instructions), since both paths are executed regardless of
the branch condition. Instead, a guard [88] value is generated by the original branch instruction
to control whether or not the store instruction performs the memory operation when
it is executed (see listings 4.3 and 4.4). Although not necessary for correct pipeline
function, guarding of the load instructions is also carried out to reduce memory
bandwidth and consequently increase performance.
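A behavioural sketch of listings 4.3 and 4.4 is given below. Two details are assumptions made for illustration: that "sw r3 r4" stores the value in r3 at the address in r4, and that the guard suppresses the store when the original branch would have been taken. The function names are hypothetical.

```python
def store_with_branch(memory, r2, r3, r4):
    """Listing 4.3: the branch skips the store when r2 <= 0."""
    if r2 <= 0:              # blez r2 end
        return
    memory[r4] = r3          # sw r3 r4 (assumed: value r3, address r4)


def sw_guarded(memory, skip, value, address):
    """A guarded store is always issued, but only writes when not suppressed."""
    if not skip:
        memory[address] = value


def store_with_guard(memory, r2, r3, r4):
    """Listing 4.4: no branch, the store carries the guard value instead."""
    g0 = r2 <= 0                      # g0 blez r2
    sw_guarded(memory, g0, r3, r4)    # sw g0 r3 r4
```

The memory contents after either version are identical, while the guarded version contains no branch and can therefore be placed in a linear pipeline.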
4.5 Optimization 
A plug-in interface is provided to accelerate the development of additional
optimizations, allowing any number of optimizations to be applied to the abstract
representation of the serial instruction stream. It is interesting to note that, generally, as
the number of optimizations is increased, the overall conversion time decreases.
Although these optimizations slightly increase the time taken to generate the pipeline,
they also simplify its structure, thus reducing the time required to perform the PAR
(by far the most time consuming stage of the entire process). The following
optimization plug-ins have been developed:-
• Hardware dependency removal 
• Stack removal 
• Iteration dependency removal 
• Instruction removal 
• Tree re-balancing 
4.5.1 Hardware Dependency Removal 
The amount of ILP present is often limited by hardware dependencies, where
instructions share a common hardware resource that prevents them from being executed in
parallel. This can be seen in listing 4.5, where the temporary register "r2" is used
by two sets of otherwise independent instructions. This dependency can be removed
by remapping some of the instructions so that they use a new register "r5", as
shown in listing 4.6. This is similar to the register renaming [92] technique used in
superscalar processors. However, the increase in ILP produced by register renaming
may be limited by the number of physical registers present in the CPU. Since it is
possible to create as many registers as may be required in an RC area, hardware
dependency removal can yield higher levels of ILP in a reconfigurable computing
environment than it would in a conventional superscalar processor.
    r2 srl r1 1
    r3 or r2 1
    r2 add r1 1
    r4 and r2 255
Listing 4.5: Hardware dependency present

    r5 srl r1 1
    r3 or r5 1
    r2 add r1 1
    r4 and r2 255
Listing 4.6: Hardware dependency removed
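One simple way of producing listing 4.6 from listing 4.5 is sketched below. This is not necessarily the algorithm used by the plug-in: it is a minimal renaming pass, under the assumption that the last definition of each register keeps its original name (so that values live after the loop body are unchanged) and every earlier definition is given a fresh register. Instructions are modelled as hypothetical (destination, operation, sources) tuples.

```python
from collections import Counter


def remove_hw_dependencies(instrs, first_free_reg):
    """Give every non-final definition of a register a fresh register name."""
    remaining = Counter(dest for dest, _, _ in instrs)  # definitions still to come
    current = {}          # original register -> name holding its latest value
    next_reg = first_free_reg
    out = []
    for dest, op, srcs in instrs:
        # Sources read whichever name currently holds the value they need.
        srcs = tuple(current.get(s, s) for s in srcs)
        remaining[dest] -= 1
        if remaining[dest] > 0:           # redefined later: use a new register
            new_dest = f"r{next_reg}"
            next_reg += 1
        else:                             # last definition keeps the original name
            new_dest = dest
        current[dest] = new_dest
        out.append((new_dest, op, srcs))
    return out


LISTING_4_5 = [
    ("r2", "srl", ("r1", 1)),
    ("r3", "or", ("r2", 1)),
    ("r2", "add", ("r1", 1)),
    ("r4", "and", ("r2", 255)),
]
```

Calling remove_hw_dependencies(LISTING_4_5, 5) renames the first definition of "r2" and its use to "r5", leaving two independent pairs of instructions as in listing 4.6.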
4.5.2 Stack Removal 
Since CPUs have fixed numbers of registers, if a section of code requires more
registers than are present, temporary values must be "pushed" and "popped" to and from
the stack, as shown in listing 4.7. Stacking increases the number of instructions
that need to be executed, decreases ILP, and increases the amount of bandwidth
required. Stack operations may be removed on a reconfigurable computing platform
by creating additional registers, as shown in listing 4.8. This increases the amount
of parallelism and also reduces the bandwidth required by the RC area. To maintain
consistency with the original code, a single "push" to the stack is performed on the
last iteration of the loop, so as to leave the stack in exactly the same state as the
software which the hardware pipeline replaced.

    loop:
        sw r1 0(sp)
        r1 srl r2 1
        r2 or r2 31
        r2 or r2 r1
        r1 lw 0(sp)
        bne r3 r1 loop
Listing 4.7: Stacking ("push" first)

    loop:
        r5 srl r2 1
        r2 or r2 31
        r2 or r2 r5
        g0 bne r3 r1 loop
        sw g0 r1 0(sp)
Listing 4.8: Stack operations removed ("push" first)
In some situations the stack "pop" occurs before the "push", as shown in listing 4.9. 
The majority of the stack operations may still be removed by creating an additional 
register. However, the "pop" operation remains in the loop, guarded by a flag that 
is true for the first iteration (see listing 4.10). This allows the initial value to be 
retrieved from the stack. The "push" operation is again guarded so it only occurs 
on the final iteration of the loop. 
4.5.3 Iteration Dependency Removal 
As shown in section 3.4, removing the loop iteration dependency (see section 3.3.2) 
can have a significant effect on the overall performance of the system. As a result, 
the optimization is automatically performed by this plug-in. 
    loop:
        r1 lw 0(sp)
        r1 add r2 r1
        sw r1 0(sp)
        r1 srl r2 1
        r2 or r2 r1
        bnez r3 loop
Listing 4.9: Stacking ("pop" first)

    loop:
        r6 lw isFirst 0(sp)
        r1 mux isFirst r1 r6
        r1 add r2 r1
        r5 srl r2 1
        r2 or r2 r5
        g0 bnez r3 loop
        sw g0 r1 0(sp)
Listing 4.10: Stack operations removed ("pop" first)
4.5.4 Instruction Removal 
Because of the limited capabilities of CPU instruction sets, it is quite common for 
code to contain instructions that can be removed without affecting the functionality. 
These instructions fall into two categories:-
Immediate values Many instructions can only take values directly from the reg-
ister file. In addition, instructions that are able to take immediate values can, 
usually, only accept small numbers (16 bits in the case of the MIPS instruction 
set [90]). As a result, it is common to find groups of instructions that set the 
value of a register, followed by further instructions that use this value (shown 
in listing 4.11). 
Copy instruction Due to the centralized register file architecture present in most
processors, copying values from one register to another is a common process.
An additional source of these copy instructions is the linearization process
described in section 4.4.1.
Due to the flexible nature of the RC area compared to that of a fixed instruction
set, it is possible to remove both of these types of redundant instructions, as shown
in listing 4.12.
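The effect of this optimization on listing 4.11 can be sketched as follows. This is an illustrative pass rather than the plug-in's actual implementation: it folds "lui" and "or" on known constants into a single immediate, and it removes a "mov" that directly follows the instruction producing its source by retargeting that instruction. The second rule assumes the copied register is not read again afterwards. Instructions are hypothetical (destination, operation, sources) tuples.

```python
def remove_redundant(instrs):
    """Fold immediate-building instructions and drop trailing copies."""
    const = {}    # register -> constant value known at this point
    out = []
    for dest, op, srcs in instrs:
        # Replace any source register whose value is a known constant.
        srcs = tuple(const.get(s, s) for s in srcs)
        if op == "lui":
            const[dest] = srcs[0] << 16          # load upper 16 bits
        elif op == "or" and all(isinstance(s, int) for s in srcs):
            const[dest] = srcs[0] | srcs[1]      # combine with lower 16 bits
        elif op == "mov" and out and out[-1][0] == srcs[0]:
            # Assumption: the copied register is dead after the copy, so the
            # producing instruction can write the destination directly.
            _, prev_op, prev_srcs = out[-1]
            out[-1] = (dest, prev_op, prev_srcs)
            const.pop(dest, None)
        else:
            const.pop(dest, None)                # dest no longer holds a constant
            out.append((dest, op, srcs))
    return out


LISTING_4_11 = [
    ("r1", "lui", (0x802,)),
    ("r1", "or", ("r1", 0xA578)),
    ("r1", "lw", ("r1",)),
    ("r4", "mov", ("r1",)),
]
```

Applied to LISTING_4_11, the four instructions reduce to the single load from address 0x802a578 shown in listing 4.12.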
    r1 lui 0x802
    r1 or r1 0xa578
    r1 lw r1
    r4 mov r1
Listing 4.11: Before instruction removal

    r4 lw 0x802a578
Listing 4.12: After instruction removal
    r1 or r1 r2
    r1 or r1 r3
    r1 or r1 r4
Listing 4.13: Sequential value combination

    r5 or r1 r2
    r6 or r3 r4
    r1 or r5 r6
Listing 4.14: Balanced value combination
4.5.5 Tree Re-balancing 
If a series of values needs to be combined, it is common for compilers to generate a
sequential stream of instructions, as shown in listing 4.13. Although this approach
minimizes the number of temporary registers required and consequently the need
for stacking, the resultant code has very low levels of ILP. By using additional
temporary registers and rearranging the way in which the values are combined, it is
possible to increase the ILP, as shown in listing 4.14. This is similar to the concept
of balancing binary trees [93].
If all the source values for the tree are produced on the same stage of the data
flow pipeline, then the tree can be balanced so that each data path is of the same
length. However, this is not always the case. For example, if one of the source
values is not available until much further down the pipeline, a symmetrically
balanced tree would have to wait for it, and the result would be a
reduction in performance. To compensate for this, the balancing algorithm estimates
at which point in the pipeline each source value becomes available and skews the
tree accordingly.
It is important to note that the number of stages required for the pre-optimization
sequential case scales linearly with the number of input values (see equation 4.2a),
whilst in the post-optimization balanced tree case this scaling becomes logarithmic
(see equation 4.2b). Therefore, the performance improvement produced by this
optimization increases dramatically as the number of input values increases.
\text{Num stages (Sequential)} = \text{Num values} - 1 \quad (4.2a)
\text{Num stages (Balanced)} = \log_2(\text{Num values}) \quad (4.2b)
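The two scaling laws can be compared with a short sketch. The pairwise reduction below assumes that all inputs are available on the same stage (i.e. no skewing), and the ceiling applied in balanced_stages is an added assumption to cover input counts that are not powers of two; equation 4.2b itself is stated without it. All function names are hypothetical.

```python
import math
import operator


def sequential_stages(num_values):
    """Equation 4.2a: one stage per combining instruction in a chain."""
    return num_values - 1


def balanced_stages(num_values):
    """Equation 4.2b, rounded up for counts that are not powers of two."""
    return math.ceil(math.log2(num_values))


def balanced_combine(values, op):
    """Combine values pairwise, level by level; returns (result, stages used)."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # odd value out is carried to the next level
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages
```

For the four values of listings 4.13 and 4.14, the sequential form needs 3 stages and the balanced form needs 2. For 16 values the figures are 15 and 4.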
4.6 Pipeline Generation 
After the code linearization and optimization steps have been performed, the pipeline
is generated. A simplified overview of this procedure is shown in figure 4.2. This
pipeline generation process can be broken down into the following distinct stages:-
• Operation scheduling 
• Data forwarder addition 
• Register remapping 
4.6.1 Operation Scheduling 
The position of the operations in the pipeline may be calculated by analyzing the
data dependencies. The results of this calculation are shown in figure 4.2(b). RC
areas are based on a regular array of fundamental logic elements. In the case of
FPGAs each element usually consists of a single flipflop and a 4 input Look Up Table
(LUT) [94], whereas most software instructions have only 2 input parameters.
Consequently, placing a single operation between registers can result in half the inputs
to the logic elements remaining unused. This leads to high latencies and inefficient
usage of the available hardware resources. To reduce both latency and hardware
usage, the estimated propagation delay of each operation is calculated based on the
types of its input parameters (i.e. register or static value). If the combined delay of
two or more dependent operations is less than the target propagation delay between
registers, then the operations are packed into the same pipeline stage. Since this
optimization is performed when the pipeline is generated, a considerable reduction
in the overall length of the pipeline is produced. This is similar to the technique of
register retiming [95, 96], where registers are moved forwards or backwards through
the logic to even out propagation delays. However, with register retiming, the
total number of registers and therefore the number of pipeline stages must remain
constant.

    r10 srl r12 29
    r8 srl r12 26
    r8 and r8 0x1
    r10 and r10 r13
    r10 xor r10 r8
    r9 srl r12 30
    r9 and r9 r10
    r1 xor r8 r10
(a) Sequential operations

    Stage 0   r10 srl r12 29    r8 srl r12 26      r9 srl r12 30
    Stage 1   r8 and r8 0x1     r10 and r10 r13
    Stage 2   r10 xor r10 r8
    Stage 3   r9 and r9 r10     r1 xor r8 r10
(b) Scheduled operations

    Stage 0   r10 srl r12 29    r8 srl r12 26      r9 srl r12 30
    Stage 1   r8 and r8 0x1     r10 and r10 cr13   r9 ^ r9
    Stage 2   r10 xor r10 r8    r8 ^ r8            r9 ^ r9
    Stage 3   r9 and r9 r10     r1 xor r8 r10
(c) Scheduled operations with forwarders

    Stage 0   r0 srl r0 29      r1 srl r0 26       r2 srl r0 30
    Stage 1   r0 and r1 0x1     r1 and r0 cr0      r2 ^ r2
    Stage 2   r0 xor r1 r0      r1 ^ r0            r2 ^ r2
    Stage 3   b r1 and r2 r0    r1 xor r1 r0
(d) Pipeline with remapped registers

Figure 4.2: Example operation scheduling onto data flow pipeline ("^" marks a
data forwarder and a "c" prefix marks a loop constant register)
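The dependency analysis that produces figure 4.2(b) from figure 4.2(a) can be sketched as an as-soon-as-possible scheduler. The sketch makes several simplifying assumptions that are not part of the tool described here: every operation occupies exactly one stage (no delay estimation or packing of dependent operations), only read-after-write dependencies are tracked, and registers that are not written in the loop body are available at stage 0. Instructions are hypothetical (destination, operation, sources) tuples.

```python
def schedule(instrs):
    """Place each operation on the earliest stage where its inputs are ready."""
    ready = {}     # register -> first stage at which its latest value can be read
    stages = []
    for dest, op, srcs in instrs:
        # Static values (ints) and registers never written here are ready at 0.
        stage = max((ready.get(s, 0) for s in srcs if isinstance(s, str)),
                    default=0)
        while len(stages) <= stage:
            stages.append([])
        stages[stage].append((dest, op, srcs))
        ready[dest] = stage + 1        # result is registered for the next stage
    return stages


FIGURE_4_2A = [
    ("r10", "srl", ("r12", 29)),
    ("r8", "srl", ("r12", 26)),
    ("r8", "and", ("r8", 0x1)),
    ("r10", "and", ("r10", "r13")),
    ("r10", "xor", ("r10", "r8")),
    ("r9", "srl", ("r12", 30)),
    ("r9", "and", ("r9", "r10")),
    ("r1", "xor", ("r8", "r10")),
]
```

Running schedule(FIGURE_4_2A) places three operations on stage 0, two on stage 1, one on stage 2 and two on stage 3, which is the arrangement of figure 4.2(b). The packing optimization described above would then merge stages whose combined propagation delay fits within the target delay.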
4.6.1.1 Pointer Aliasing 
In addition to conventional data dependencies, pointer aliasing can also cause
operations to be dependent upon each other. For example, when the addresses of two
memory operations are the same, the original order of these memory operations
must be maintained. Unfortunately, it is very difficult to identify which pointers are
prone to aliasing [97, 98]. Simply assuming that all pointers have the potential to
alias is the safest solution. However, this approach significantly limits both ILP and
loop level parallelism. One potential solution is to perform aggressive optimizations
and to implement hardware alias detection. If pointer aliasing is detected, the state
of the hardware is rolled back to the point before the occurrence of the alias and
the code is then re-executed with less aggressive optimizations. This technique has
been successfully implemented in the Crusoe range of processors from Transmeta
[99, 50]. However, this procedure could not be implemented in the current hardware
generation system because the error detection stage would have to be based on
exceptions and/or interrupts, which are either not implemented or not accessible on
the current host target platforms. To overcome this limitation, the user is able to
specify whether or not pointer aliasing is present. This is the only user intervention
that is required in the entire conversion process and this can be eliminated with the
next generation of target platforms.
4.6.2 Data Forwarder Addition 
As discussed in section 3.2.2.1, the lack of a centralised register file means that values
generated on one stage need to be piped forward to the stage where they are to be
used. This can be seen in figure 4.2(c), where the value of register "r9" is generated
on stage 0 but is not used until stage 3. As a result of this, data forwarders have
been added to stages 1 and 2.
If a value is read but never written to during the execution phase of the pipeline,
then additional data forwarders are not required. Instead, the input to the operation
is directly connected to the loop constant centralized register file, which can only be
read by the pipeline. This can be seen in figure 4.2(c), where the second operation
on stage 1 has the input parameter "r13" flagged as a constant.
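The placement of forwarders in figure 4.2(c) can be reproduced with the sketch below. It takes the schedule of figure 4.2(b) as a hypothetical list of stages, each holding (destination, operation, sources) tuples, and assumes that a value written on stage p can be read directly on stage p + 1, so forwarders are only needed on the stages in between. Registers that are never written in the pipeline (such as "r12" and "r13") are treated as coming from the stage 0 source or loop constant registers and receive no forwarders.

```python
def add_forwarders(stages):
    """Return, per stage, the set of registers that need a data forwarder."""
    produced = {}                              # register -> stage that wrote it
    forwarders = [set() for _ in stages]
    for n, ops in enumerate(stages):
        for _, _, srcs in ops:
            for s in srcs:
                if isinstance(s, str) and s in produced:
                    # Pipe the value through every stage between its producer
                    # and this consumer.
                    for k in range(produced[s] + 1, n):
                        forwarders[k].add(s)
        for dest, _, _ in ops:                 # results become visible afterwards
            produced[dest] = n
    return forwarders


FIGURE_4_2B = [
    [("r10", "srl", ("r12", 29)), ("r8", "srl", ("r12", 26)), ("r9", "srl", ("r12", 30))],
    [("r8", "and", ("r8", 0x1)), ("r10", "and", ("r10", "r13"))],
    [("r10", "xor", ("r10", "r8"))],
    [("r9", "and", ("r9", "r10")), ("r1", "xor", ("r8", "r10"))],
]
```

For this schedule the result is a forwarder for "r9" on stage 1 and forwarders for "r8" and "r9" on stage 2, matching figure 4.2(c).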
4.6.3 Register Remapping 
Initially, the instructions all accessed a global register file. Once the operations 
have been placed into the data flow pipeline, they are only able to access the loop 
constant registers and the registers on adjacent pipeline stages. The final stage of 
the pipeline generation process is to remap the register numbers so that each set of
registers is numbered sequentially, as shown in figure 4.2(d). During this process, 
the number of registers required on each stage is also calculated. 
Operations that produce either the final result values or the data required for the
next iteration of the pipeline are remapped so that the data is written to stage 0
instead of to the next pipeline stage, as would normally be the case. An example is
operation 1 on stage 3, shown in figure 4.2(d).
Figure 4.3: Loop constant and stage 0 shift register arrangement (loop constant
registers numbered 0 to N - 1, followed by stage 0 registers numbered 0 to N - 1
in the order: source registers, source & result registers, result registers)
To eliminate the need for complex clock enables and large MUXs, the loop constant 
and stage 0 registers are connected together to form a shift register. Additionally, 
stage 0 parameters are assigned register numbers according to their usage, as shown 
in figure 4.3. By grouping registers that are used as source values close to the start 
of the shift register and by also grouping result registers close to the end of the shift 
register, the number of values that must be shifted into and out of the RC area is 
minimized. 
4.7 Target Implementation 
The final stage of the loop conversion process is to generate a target specific pipeline
configuration from the abstract representation produced by the pipeline scheduling
phase. In addition to this, a new version of the program executable is created that
utilizes the newly created hardware.
4.7.1 Pipeline configuration generation 
The configuration generation is abstracted from the rest of the tool flow by a plug-in
interface. This allows multiple configuration formats to be generated for different
target platforms. These include:-
MIPS hardware This plug-in generates a VHDL representation of the pipeline
that can be directly synthesized using the Altera Quartus [100] software, thus
producing the FPGA configuration bitstream.
MIPS simulation A simulation model of a MIPS CPU and its peripherals was
created using the behavioural simulator described in section 2.2. The RC area
in the simulation is configured from a file that is generated by this plug-in.
Cray XD1 hardware In order to target the FPGAs present in the Cray XD1 [67],
this output plug-in creates a VHDL representation of the pipeline, which in
turn is converted to an FPGA configuration bitstream by the Xilinx ISE [101]
software.
In addition to producing the configuration for the RC area, these plug-ins also 
provide the following information to the rest of the tool flow:-
• The RC area size 
• The maximum number of pipelines that the RC area can contain 
• The latencies of the operations for each input configuration 
• The sizes of the operations for each input configuration 
4.7.2 Program modification 
The register remapping information produced during the pipeline generation phase, 
identifies which register values need to be sent to the pipeline before execution, 
and which values should be copied back to the CPU register file. This information 
is also used to determine the order of the parameters in the trigger instructions. 
    loop:
        r1 sub r1 1
        r2 mul r2 r1
        bne r1 0 loop
        <Rest of program>
Listing 4.15: Program before trigger instruction insertion

    loop:
        b trigInsts
        r2 mul r2 r1
        bne r1 0 loop
    endOfLoop:
        <Rest of program>
    trigInsts:
        r2 rcpt0 r1 r2 7
        b endOfLoop
Listing 4.16: Program after trigger instruction insertion
These trigger instructions are generated and, together with a jump instruction, are
appended to the program (as shown in listings 4.15 and 4.16). Once the pipeline
has finished executing, this jump instruction returns the program execution back to
the main body of the software. Placing the trigger instructions at the end of the
program provides two main advantages:-
• The hardware acceleration can be enabled or disabled by simply swapping the
first instruction in the loop with a jump to the start of the trigger instructions
at the end of the program.
• In some cases, the number of trigger instructions required will be greater than
the number of instructions in the original loop. Since only a single instruction
needs to be replaced in the original loop, this removes the need to move the
rest of the program up in the address space and consequently removes the need
to relink.
Header information is also added to the trigger instructions at the end of the
program. This can be read by the program at runtime and is used to give the end user
the ability to enable or disable the hardware acceleration.
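The patching scheme of listings 4.15 and 4.16 can be modelled on a program held as a list of assembly-like strings. This is a sketch only: the function names are hypothetical, the "endOfLoop:" label is assumed to be present already, and returning the displaced instruction to the caller stands in for however the real tool records it (for example in the header information).

```python
def enable_acceleration(program, loop_label, trigger_insts):
    """Redirect the loop to trigger instructions appended after the program."""
    patched = list(program)
    first = patched.index(loop_label + ":") + 1   # first instruction of the loop
    saved = patched[first]                        # kept so the patch can be undone
    patched[first] = "b trigInsts"
    patched += ["trigInsts:", *trigger_insts, "b endOfLoop"]
    return patched, first, saved


def disable_acceleration(patched, first, saved):
    """Swap the original first instruction back in; nothing else moves."""
    restored = list(patched)
    restored[first] = saved
    return restored


PROGRAM = [
    "loop:",
    "r1 sub r1 1",
    "r2 mul r2 r1",
    "bne r1 0 loop",
    "endOfLoop:",
    "<Rest of program>",
]
```

Only one entry of the original program changes and everything after the loop keeps its position, which is why no relinking is needed. Swapping the saved instruction back disables the hardware acceleration without removing the appended trigger instructions.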
Chapter 5 
MIPS Test Platform 
To evaluate the performance of the loop conversion system outlined in chapter 4, a
test platform was created using the MIPS architecture [102]. This was chosen due
to the widespread availability of compilers, open source CPU cores, and reference
manuals.
5.1 Platform Details 
The MIPS test platform consists of three separate targets:-
Hardware A standalone hardware implementation of the MIPS platform and RC
area was created using FPGAs. All the performance evaluations were
performed on this target.
Simulation A behavioural simulation of the MIPS CPU, RC area and peripherals
was created using the simulator described in section 2.2. The high level of
visibility provided by this environment makes it an ideal debug tool for both
the MIPS and the hardware conversion software. Because the hardware
software interface (HSI) exactly matches the standalone hardware target, the same
software binaries can be run on both without either modification or porting.
Mixed simulation and hardware pipeline This target was used as both a
debug system and as an intermediate stage to the full hardware solution. It
combines the behavioural simulation of the CPU core and its associated
peripherals with an RC area implemented in an Altera APEX 20K1500, and
is similar in concept to the mixed hardware/simulation system described in
section 3.4.4.
5.1.1 RC Area Integration
The MIPS test platform uses the directly connected topology outlined in section
1.2.1.1. As a result, the RC area is connected to the instruction pipeline of the CPU
as an additional execution unit. To control the RC area, four extra trigger
instructions have been added to the MIPS instruction set. All four of these instructions
are identical except that each one refers to a different data flow pipeline
within the RC area. The trigger instructions have four parameters:-
Destination register This is set to the register number where the value shifted out
of the pipeline will be stored. If the instruction does not return a value, then
this parameter should be set to "$0", which corresponds to the hard wired
zero register in the MIPS ISA.
Source register 1 & 2 The next two parameters are the source registers that contain
the values to be loaded into the pipeline. If a source value is not required, the
parameter should again be set to "$0".
Flags The final parameter is the bitwise OR of three flags that control the execution
of the instruction. The first bit is set to trigger the start of the pipeline
execution; the second bit is set if the source parameters of the instruction
need to be shifted into the pipeline; the third bit is set if a result parameter
needs to be shifted out of the pipeline. An instruction which triggers execution
in the data flow pipeline will stall until the RC area has completed execution.
An example trigger instruction sequence, for data flow pipeline zero, is shown in
listing 5.1. The first instruction loads two parameters into the pipeline; the
second instruction loads two additional source parameters and then triggers execution.
Once execution in the RC area has completed, the first result parameter is stored
in the register "a1" and the final instruction stores another result parameter in the
register "a0".
    rcpt0 $0, $v1 $v0 0x2
    rcpt0 $a1, $a1 $a0 0x7
    rcpt0 $a0, $0 $0 0x4
Listing 5.1: Example RC trigger instruction sequence
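The flags parameter can be illustrated with a short sketch. The bit values below are an assumption: they take "first bit" to mean the least significant bit, which is consistent with the values 0x2, 0x7 and 0x4 used in listing 5.1. The names TrigFlags and describe are hypothetical and are not part of the platform.

```python
from enum import IntFlag


class TrigFlags(IntFlag):
    EXEC = 0x1        # trigger the start of pipeline execution
    SHIFT_IN = 0x2    # shift the two source parameters into the pipeline
    SHIFT_OUT = 0x4   # shift a result parameter out to the destination register


def describe(flags):
    """List the actions requested by a flags value, in the order the text describes them."""
    value = TrigFlags(flags)
    order = (TrigFlags.SHIFT_IN, TrigFlags.EXEC, TrigFlags.SHIFT_OUT)
    return [flag.name for flag in order if flag in value]
```

Under this reading, the three instructions of listing 5.1 decode as: load only (0x2), load, execute and return a result (0x7), and return a further result (0x4).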
5.1.2 Hardware 
Figure 5.1 shows the hardware used to implement the standalone target. By
recompiling the code for the RC area with different interface libraries, the pipeline can
also be connected to the behavioural simulator for use in a mixed
simulation/hardware target. This connection is made via the parallel port due to the low latency
requirements of the connection.
The specification of this system is outlined in table 5.1. As shown in figure 5.2, one
FPGA contains the open source Plasma MIPS CPU core [103], together with its
peripherals, and the second FPGA contains the reconfigurable data flow pipeline(s).
An RS232 port is used to provide a command line interface to the software running
on the target. Due to the low data rates provided by this interface, a USB port is
also present to aid the transfer of large sets of test data to and from the target.
A 64 bit clock cycle counter is present in the system, the value of which can be
read by the CPU and is used to time, accurately, the execution of the various test
algorithms.
The Plasma CPU core used is open source and, as such, the VHDL code for the
processor is freely available. Modifying the instruction pipeline to incorporate the
RC area was, therefore, a relatively simple task. The processor, based on a three
stage pipeline design, is capable of issuing one instruction per clock cycle and the
majority of instructions will execute without stalling the pipeline. However, the
multiplier, divider, and load/store units all take multiple cycles and can, therefore,
stall the instruction pipeline. Since the memory interface takes multiple cycles to
perform an operation, and is not pipelined, the memory bandwidth available is
limited. This limited bandwidth prevents the high computational rate of the RC
area from being fully realized. However, due to the relatively low computational
rate of the CPU, this limitation does not significantly impact on the performance
of this processor. As the memories connected to the CPU are low latency SRAMs,
the lack of caches in the core has only a marginal effect on performance.
Due to the limited number of block RAMs in the APEX 20K200 used for the CPU,
it was not possible to implement the code profiler block described in section 3.1.1.1.
Consequently, the code profiling was performed on the simulation target. This,
together with the fact that it is not possible to run the Quartus [100] synthesis tool
on the platform, prevents the system being used in a runtime self adaptive mode.
Instead, the conversion process is performed pre-runtime using only the program
binary and profile report; no access to the original source code is required.
The clock speed of both the processor core and the RC area is 16 MHz, with the
limiting factor being the path from the load/store unit in the processor to the
external SRAMs. The timing report for the CPU core indicates that it should be
capable of running at 30 MHz. This would be considerably higher if the CPU core
were to be implemented in ASIC logic rather than in an FPGA. In comparison, the
timing report for the RC area suggests that clock speeds in excess of 100 MHz are
achievable; however, this is very dependent on the particular algorithm that is being
converted to hardware (note that routing delays are minimized as most operations
in the data flow pipeline only access registers on adjacent stages, thus minimizing
the length of the signal paths). In practice, the potentially higher clock rate of the
RC area is unlikely to increase performance, as most algorithms will be limited by
the available bandwidth, which remains constant. Consequently, it was decided to
minimize the complexity of the system by using a single clock domain, which results
in a TSF value of one.

    1 Plasma MIPS CPU core and peripherals
    2 RC area FPGA
    3 2 MByte SRAM & power distribution tile
    4 PC parallel port connection to simulator
    5 Misc (Monitor, CPUSim interfaces, etc)
    6 Logic analyzer connections
    7 Monitoring tile
    8 RS232 console interface
    9 USB data transfer interface
Figure 5.1: FPGA tiles used to implement MIPS CPU, RC pipeline, and peripherals

    CPU      Plasma MIPS CPU @ 16 MHz in Altera APEX 20K200 FPGA
    RC area  Data flow pipeline @ 16 MHz in Altera APEX 20K1500 FPGA
    IO       RS232 for console command line interface
             USB for data transfer
    Memory   8 KByte ROM (read = 1 cycle)
             256 KByte RAM (read = 1 cycle, write = 2 cycles)
             2 MByte RAM (read = 1 cycle, write = 2 cycles)
             64 bit cycle counter for timing program execution
Table 5.1: MIPS platform summary

Figure 5.2: [block diagram; caption and contents not recoverable from the extracted text]

[Signals shown: output res, shift en, start exec, rc running, mem complete,
mem is load, DMA tran, pipeline ID, bus a, bus b]
Figure 5.3: Example logic analyzer trace showing RC pipeline execution
5.1.2.1 CPU/RC Area Interface
A logic analyzer trace of the interface between the Plasma MIPS CPU core and the
RC area is shown in figure 5.3. Execution in the pipeline is split into four stages as
detailed below:-
Setup During the setup stage, initial values are loaded into the constant and stage
0 registers as described in section 4.6.3. Because the MIPS instruction set
architecture (ISA) allows instructions to have two source parameters, two
parallel shift registers are used to set up the data flow pipeline. On every clock
cycle where the "shift en" signal is high, two new values from "bus a" and
"bus b" are shifted into the pipeline indicated by the "pipeline ID" signal.
Trigger Once all the required parameters have been loaded into the pipeline,
execution can be triggered. This is accomplished by setting the "pipeline ID"
signal to the number of the pipeline that is to be triggered and by holding the
"start exec" signal high for one clock cycle. On the following clock cycle, the
pipeline will drive the "rc running" signal high to indicate that the pipeline is
busy.
Execute During the execution stage, the RC area can issue DMA operations to the
main memory. A DMA operation is started by driving the "DMA tran" signal
with one of three possible values (i.e. "Word", "Half Word" or "Byte") which
indicate the size of the operation that is to be performed. At the same time
as the "DMA tran" signal is asserted, the "bus a" signal is used to indicate
the address of the operation and "mem is load" indicates whether the
operation is a load or a store. The signal "bus b" is used to transfer the data
associated with the DMA operation and is driven by either the CPU or the
RC area, depending on whether the operation is a load or a store. Once the
DMA operation is complete, the "mem complete" signal is driven high by the
CPU. The RC area is then free to post the next DMA operation or return to
the "IDLE" state. Once the pipeline has finished executing, the "rc running"
signal will go low.
Retire The retirement stage is optional, and is only entered if results from the RC
area need to be moved back into the central register file in the CPU. This
is done by the CPU asserting the "output res" signal. At this point, both
"bus a" and "bus b" are driven by the RC area with the last values from the
two parallel shift registers that make up the constant and stage 0 registers. If
additional results are required, the CPU will drive the "shift en" signal high
to shift out the next two values.
The limited number of connections between the MIPS CPU board and the board
containing the RC area restricts the "pipeline ID" signal to a width of 2 bits. As
a result, the maximum number of simultaneous data flow pipelines that can be
implemented is four. This is more than adequate for most applications.
5.2 Software 
5.2.1 Console Software 
The main application that runs on the MIPS platform is a console program, written
in C. This presents a simple command line interface that allows the user to run
a variety of test algorithms (e.g. low pass filters (LPF), FFTs, image processing,
sorting, etc). In addition, the console application contains low level drivers for
the USB interface, a cycle counter, and functions to test/diagnose any hardware
problems that are found during the platform bring up phase.
A cut down version of the console application, without the test algorithms, is placed
in the 8 KByte ROM inside the FPGA. This boots the CPU and is then used to
download the much larger version of the console application, complete with the test
algorithms, to the external 256 KByte SRAM (this memory is also used for the stack
and all non-constant variables). This approach removes the need to resynthesize the
FPGA bit stream every time that the program is updated. The remaining 2 MByte
external SRAM is reserved for test data (e.g. images, audio, raw data, etc).
Figure 5.4: USB data transfer application displaying the contents of the data buffer
as an image
5.2.2 Data Transfer Software 
The data transfer application is used to transfer test data to and from the hardware
target via the USB interface. At the heart of the application is a 2 MByte buffer,
which can be uploaded/downloaded to/from the 2 MByte SRAM on the target.
Multiple view windows display the buffer contents in the following three formats:
raw, image and audio waveform (see figure 5.4). By default, there are two image
views, each displaying a different area of the buffer. Test data can be loaded into the
buffer from a variety of different image and audio file formats (e.g. JPEG, PNG,
WAV, etc). In addition, the data can also be loaded or saved in a raw format.
5.3 Test Algorithms 
The loop conversion algorithm, described in chapter 4, was applied to a series of
test algorithms with different characteristics such as: the number of instructions,
the bandwidth requirements, the levels of ILP, etc. The source code for all the test
algorithms (kernels of the algorithms are included in appendix A) was combined with
that of the console software (see section 5.2.1) and compiled into a single application.
The types of benchmark algorithm used were restricted due to the various platform
limitations outlined below:-
Non-static arrays Non-static arrays that are local to a function are allocated on
the stack. To reduce the amount of stack space consumed, the GCC C compiler
uses misaligned loads and stores to access the array elements. However, for
patent reasons these instructions were not implemented by the author of the
open source CPU core used [103]. As a result, software that contains
non-static arrays cannot be used.
Standard C functions The console software does not include a full C runtime
library. Therefore, most of the standard C functions cannot be used (e.g.
malloc, sin, fopen, sqrt, etc).
Floating point maths The CPU core does not contain any floating point maths
hardware. Normally floating point support would be emulated by a maths
library; however, the console software does not contain such a library. As such,
software that requires floating point arithmetic could not be executed.
For the purposes of the evaluation, all the algorithms were run on the full hardware
target. A summary of the speedups produced, together with the hardware resource
utilization for the various test algorithms, is shown in table 5.2.
Table 5.2: [rotated table; caption and cell values not recoverable from the extracted
text. Legible headings: algorithm, sub algorithm, loops converted, hardware usage
(% of FPGA), execution time (cycles), speedup]

5.3.1 PRBS Generator
Linear feedback shift registers (LFSRs) are often used to create pseudo-random
binary sequences (PRBS) [104]. The high number of bitwise operations present in this
algorithm often leads to highly efficient hardware implementations. In comparison,
the fixed instruction pipeline present in CPUs can often severely limit the
performance of a software solution. Although there are several classes of application (e.g.
Monte Carlo simulations) that use random numbers directly [105, 106], there are
many other classes of application (like encryption [107]) that share with the PRBS
the characteristic of having large numbers of bitwise operations.
The random number generation test used a 32 bit LFSR to generate 2 MBytes of
random numbers (see appendix A.1) and took 156.2 million clock cycles to perform in
software. On application of the automatic hardware conversion, this was reduced to
26.2 million cycles. Although this represents a significant speedup of a factor of 6, there
is still scope for improvement, as the hardware occupied only 0.4% of the FPGA and
the algorithm used was not limited by the available bandwidth. To increase the level
of ILP, the inner loop was removed by completely unrolling it (see appendix A.2)
and, as a result, each iteration of the remaining outer loop produced a completely
new 32 bit value. Performing this common optimization technique improved the
software only performance by 55.4%. Although this is a major improvement in
performance, it is dwarfed by that obtained by the hardware case, which is over 99
times faster than the original non-optimized algorithm and is 54.9 times faster than
the optimized software version.
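The structure of the test can be sketched as follows. This is an illustrative model only, not the code of appendix A.1: the Galois form and the feedback mask 0x80200003 (the polynomial x^32 + x^22 + x^2 + x + 1) are assumptions, as are the function names.

```python
TAPS = 0x80200003  # assumed feedback mask; the taps used in the thesis are not given here


def lfsr_step(state):
    """Advance a 32 bit Galois LFSR by one bit; return (new_state, output_bit)."""
    bit = state & 1
    state >>= 1
    if bit:
        state ^= TAPS
    return state, bit


def next_word(state):
    """Inner loop: 32 single-bit steps assemble one 32 bit value.

    Fully unrolling this loop is what lets each outer iteration
    produce a complete word, as described in the text.
    """
    word = 0
    for i in range(32):
        state, bit = lfsr_step(state)
        word |= bit << i
    return state, word
```

Every step consists only of shifts, masks and exclusive-ORs, which is why the loop maps so compactly onto logic cells while remaining expensive on a fixed instruction pipeline.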
The performance of the hardware is limited by the poor memory bandwidth of the
MIPS host. If this limitation were to be removed, the hardware would be capable of
generating a new 32 bit value every clock cycle, which would triple the performance.
Because the software only test case is highly instruction limited, its performance is
unlikely to increase significantly. As a result, the hardware would be approximately
160 times faster than the software. Since this hardware occupies only 1.2% of the
FPGA and the algorithm can be linearly scaled with increased hardware resources,
the potential increase in performance, given unlimited bandwidth, is extremely large.
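The figures quoted for this test can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text and treats the quoted factors of 99 and 54.9 as exact, which is an assumption; the variable names are illustrative.

```python
sw_cycles = 156.2e6          # original software, clock cycles
hw_cycles = 26.2e6           # converted hardware (standard version), clock cycles
speedup_standard = sw_cycles / hw_cycles          # about 5.96, quoted as a factor of 6

hw_vs_original = 99.0        # unrolled hardware vs original software
hw_vs_unrolled_sw = 54.9     # unrolled hardware vs unrolled software
# Implied execution time of the unrolled software relative to the original software
unrolled_sw_fraction = hw_vs_unrolled_sw / hw_vs_original   # about 0.55

# Tripling the hardware rate (one 32 bit value per clock) with software unchanged
projected = hw_vs_unrolled_sw * 3                 # about 165, quoted as roughly 160
```

On this reading, the two quoted speedup factors imply that the unrolled software runs in roughly 55% of the original execution time.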
Figure 5.5: Logic analyzer trace showing RC pipeline execution of FFT algorithm
5.3.2 FFT
This test performs an integer fixed point FFT on 2 MBytes of audio data (16 bits
per channel @ 44.1 kHz). Since the FFT algorithm used (see appendix A.3) can
only process data sets up to a maximum size of 1024 samples, the audio data is split
into blocks of this size which are then processed independently. The conversion to
hardware resulted in a 5.5x speed improvement for both the FFT and the inverse
FFT. To achieve this, the tools automatically converted the four innermost loops
present in the FFT algorithm to hardware.
During the course of the FFT, multiple passes of the data set are performed. Due to
the butterfly nature of the FFT algorithm [108, 109], the later passes of the data set
result in the RC area being triggered repeatedly with only a few data values being
processed on each occasion. As is shown in figure 5.5, this results in a significant
amount of idle time between RC executions. This, and the bandwidth intensive
nature of the algorithm, are the two main performance limiting factors.
5.3.3 Low Pass Filter 
A low pass audio filter was implemented using a finite impulse response (FIR) [110]
design, the coefficients of which are shown in equation 5.1. As the audio data was
arranged so that a single 32 bit word contains both the left and right 16 bit samples,
the algorithm was implemented to filter both channels simultaneously (see appendix
A.4). This reduces the number of passes of the data set and therefore increases the
efficiency of the memory sub-system.
yn = ^ (5.1) 
The conversion to hardware yielded a 7.2 times increase in performance and, although
this is a significant improvement, it is by no means the maximum that is achievable.
Computing the new sample values for both channels requires eight clock cycles, five
of which are spent idle while the pipeline is stalled awaiting data transfers to or
from the memory. The overall performance of the system could be further increased by
improving the memory controller and also by implementing a data cache in the
MIPS CPU core.
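Filtering both channels from a single packed word can be sketched as below. This is not the code of appendix A.4: the coefficients in COEFFS are hypothetical placeholders (the values of equation 5.1 are not reproduced here), and placing the left sample in the upper half of the word is an assumption.

```python
COEFFS = [1, 2, 1]   # hypothetical taps, chosen only so that they sum to 4
SHIFT = 2            # divide by the coefficient sum (4) with a shift


def unpack(word):
    """Split a 32 bit word into signed 16 bit (left, right) samples."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)


def pack(left, right):
    """Combine two signed 16 bit samples into one 32 bit word."""
    return ((left & 0xFFFF) << 16) | (right & 0xFFFF)


def fir_stereo(words):
    """Filter both channels in a single pass over the packed data."""
    out = []
    for n in range(len(COEFFS) - 1, len(words)):
        acc_l = acc_r = 0
        for k, c in enumerate(COEFFS):
            left, right = unpack(words[n - k])   # one memory read serves both channels
            acc_l += c * left
            acc_r += c * right
        out.append(pack(acc_l >> SHIFT, acc_r >> SHIFT))
    return out
```

Each input word is read once per tap and supplies a sample for both channels, which is the source of the reduced number of passes noted above.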
5.3.4 Normalization 
Normalization of audio data is commonly performed to maximize the dynamic range, 
thereby improving the signal to noise ratio during playback. This requires two passes 
of the data set. The first pass determines the maximum sample value and the second 
scales all the data so that the maximum value is set to the maximum possible value 
(see appendix A.5). The loop conversion tool was used not only to extract these 
loops but also to convert them into hardware. An order of magnitude speedup was 
obtained, using only 5.3% of the FPGA, despite the fact that the performance of 
the hardware is bandwidth limited. 
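The two passes can be sketched as follows. This is a minimal illustration rather than the code of appendix A.5; the function name, the 16 bit full scale value and the use of integer division are assumptions.

```python
def normalize(samples, full_scale=32767):
    """Scale signed integer samples so that the peak magnitude reaches full_scale."""
    # Pass 1: find the maximum absolute sample value
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)          # silence: nothing to scale
    # Pass 2: rescale every sample (integer arithmetic, rounding is an assumption)
    return [s * full_scale // peak for s in samples]
```

Both passes perform very little arithmetic per sample read, which is consistent with the observation that the hardware version is bandwidth limited.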
5.3.5 Block Search 
Motion estimation is a key component in such video processing systems as MPEG
[111] and picture improvement algorithms [112]. The basic approach is to divide
the image into macro blocks (16x16 pixels in the case of MPEG); a sum of
absolute differences (SAD) algorithm is then used to determine the closest match to
each block in the next frame. A motion vector for each block is then derived from
the difference in position of the blocks in the two frames. Typically, this operation
is only performed on the luminance component of the pixels in the YUV colour
space, as this considerably reduces the processing workload with little effect on the
accuracy of the resultant motion vectors.
Image data is commonly arranged in one of two pixel formats in computer graphics: 
packed and planar :-
Packed All the values for a single pixel are grouped together in the packed format,
with the data for adjacent pixels on a line being stored in adjacent memory
locations (see figure 5.6(a)).
Planar In the planar format, the colour components are grouped together to form
planes, with each plane able to exist in a completely different section of memory
(see figure 5.6(b)).
    A1Y1U1V1 A2Y2U2V2 A3Y3U3V3 A4Y4U4V4
    (a) Packed
    A1A2A3A4 Y1Y2Y3Y4 U1U2U3U4 V1V2V3V4
    (b) Planar
Figure 5.6: Pixel graphics formats
The block search test searches an image for the closest match to a particular 16x16
pixel macro block. This test is performed on both planar and packed pixel images
(see appendices A.6 and A.7 respectively for the source code). The packed pixel
implementation executes fewer instructions but necessitates a much greater number
of memory operations. This is reflected in the performance results, where the planar
format is 8.4% slower than the packed pixel format when both are performed in
software. This situation is reversed when the algorithms are converted to hardware,
as the higher number of operations present in the planar version translates into
a higher FPGA resource utilization rather than into an increased execution time.
Additionally, the significantly lower number of memory operations also contributes
to a clear performance advantage for this data format. Overall, the packed pixel
search was 3.1 times faster, and the planar search 7.4 times faster, in hardware when
compared to the software implementation. In both cases, the hardware performance
was severely limited by the available memory bandwidth.
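The trade-off between the two formats can be illustrated with a simplified sketch that computes the SAD over a single row of pixels and counts word reads. It is not the code of appendices A.6 and A.7: the byte order within a word (A in the top byte, Y in bits 16 to 23 for packed data; four Y values per word, lowest byte first, for planar data) is an assumption, and reads of the reference block are ignored.

```python
def sad_packed(pixels, ref):
    """pixels: one 32 bit AYUV word per pixel. Returns (sad, word_reads)."""
    total = reads = 0
    for word, r in zip(pixels, ref):
        reads += 1                              # one memory read per pixel
        total += abs(((word >> 16) & 0xFF) - r)
    return total, reads


def sad_planar(y_words, ref):
    """y_words: luminance plane, four 8 bit Y values per 32 bit word."""
    total = reads = 0
    word = 0
    for i, r in enumerate(ref):
        if i % 4 == 0:
            word = y_words[i // 4]              # one memory read per four pixels
            reads += 1
        y = (word >> (8 * (i % 4))) & 0xFF      # extra shift and mask work per pixel
        total += abs(y - r)
    return total, reads
```

The planar version does more work per pixel to extract each byte but issues a quarter of the reads, which matches the behaviour described above: extra operations cost logic cells in hardware, whereas extra memory operations cost time.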
5.3.6 Mandelbrot 
Rendering the Mandelbrot fractal is a highly intensive computational task, with 
minimal bandwidth requirements. Due to the fact that the MIPS CPU core and 
therefore the RC conversion tools do not support the use of floating point instruc-
tions, the algorithm was modified to use fixed point arithmetic (see appendix A.8). 
By converting the core loop to hardware, a speedup factor of 21.6 was obtained 
using only 5.4% of the FPGA. 
The use of fixed point arithmetic can cause values to saturate and become corrupted, 
if the outer loop is unrolled to process multiple pixels simultaneously (as it was 
in section 3.3.1.3). Although this problem can be resolved by adding additional 
code, the overhead and register stacking that this introduces results in no significant 
increase in performance. 
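A fixed point version of the core loop can be sketched as below. It is not the code of appendix A.8: the number of fractional bits, the iteration limit and the function names are assumptions, and Python integers do not overflow, so the saturation problem described above does not appear in this model.

```python
FRAC_BITS = 12               # assumed fixed point format: 12 fractional bits
ONE = 1 << FRAC_BITS         # fixed point representation of 1.0


def fx_mul(a, b):
    """Multiply two fixed point values and rescale the result."""
    return (a * b) >> FRAC_BITS


def mandel_iters(cx, cy, max_iter=64):
    """Count iterations of z = z^2 + c before |z|^2 exceeds 4 (fixed point inputs)."""
    x = y = 0
    for i in range(max_iter):
        xx, yy = fx_mul(x, x), fx_mul(y, y)
        if xx + yy > 4 * ONE:
            return i
        x, y = xx - yy + cx, 2 * fx_mul(x, y) + cy
    return max_iter
```

The loop reads no memory at all, only the two coordinates passed in, which is why this test has minimal bandwidth requirements.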
5.3.7 Half Brightness 
The half brightness algorithm converts an image to the YUV colour space, where the
luminance value is halved; the image is then converted back to the RGB colour space
(see appendix A.9) to produce a correctly colour balanced reduction in brightness.
The performance improvement produced by the hardware conversion is due to a
combination of two factors: performing an average of 7.2 operations in parallel
on each pipeline stage, and processing up to 10 loop iterations simultaneously due
to the automatic removal of the loop iteration dependency. However, in practice
the hardware only performs 44.3 times faster than the software solution. A logic
analyzer trace shows that the "DMA tran" signal never returns to the idle state
after the pipeline has started (see figure 5.7), indicating that the hardware is severely
bandwidth limited. If this limitation were to be removed, the resulting performance
would be in the order of 220 times faster than the original software.
Figure 5.7: Logic analyzer trace showing RC pipeline execution of half brightness
algorithm
Since only 14% of the FPGA was used to create the pipeline, the complexity of
the algorithm could be increased without reducing the performance of the hardware
solution, provided that the independent nature of the loop iterations was maintained.
This type of scalability is unmatched in the software domain, where every additional
instruction increases the execution time.
5.3.8 Factorial and Series Sum
The factorial calculation and series sum tests are both very tight loops with low
levels of ILP (see appendix A.10). In order to increase the execution time and
therefore obtain an accurate measure of the speedup factor, both tests were run
with a starting value of 10,000,000 and, as a result, the multiplication result register
overflowed. Although this means that the value calculated for the factorial is invalid,
this does not affect the time taken to calculate the value.
The speed increases obtained for the series sum and the factorial were 5x and 11x
respectively. The significantly higher performance increase of the factorial was
mainly due to the poor performance of the multiplier used by the software in the
MIPS CPU core. As expected, the hardware utilization for both algorithms is
quite low, with that of the factorial hardware loop being slightly larger due to the
multiplier.
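The two loops, and the effect of overflow on the factorial, can be sketched as follows. This is not the code of appendix A.10: wrapping the result to 32 bits is an assumption about how the overflow manifests, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF   # assumed 32 bit result register


def factorial32(n):
    """Factorial with the running product truncated to 32 bits on every step."""
    acc = 1
    while n != 0:
        acc = (acc * n) & MASK32
        n -= 1
    return acc


def series_sum32(n):
    """Sum of 1..n with the running total truncated to 32 bits on every step."""
    acc = 0
    while n != 0:
        acc = (acc + n) & MASK32
        n -= 1
    return acc
```

In this model the truncated factorial is already zero for a starting value of 34, since 34! is divisible by 2^32, yet the loop still performs one multiply per iteration. This is why an invalid result does not affect the timing.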
5.3.9 Copy 
The low number of highly data dependent, bandwidth intensive instructions present
in the copy test (see appendix A.11) represents one of the worst test cases that a
reconfigurable computing platform would have to handle. It is important to know,
therefore, what performance improvements, if any, are produced under these extreme
conditions. The hardware conversion tools successfully exploit the minimal amount
of ILP and loop level parallelism that is present to produce a speedup factor of 2.8.
5.3.10 Sort 
Data sorting is a common task in modern computing, and consequently many
different sorting algorithms have been developed [113]. Three such algorithms (bubble,
heap, and quick sort, see appendix A.12) have been tested with the hardware
conversion tools. In each case a data set generated by a PRBS was used, allowing the
sorting algorithms to be consistently tested with the same set of random numbers.
The heap and quick sorts were performed on a 262144 element data set, whereas
the bubble sort was performed on an 8192 element data set. The reduced number of
elements for the bubble sort results from the very slow nature of this algorithm and
the need to keep the test times manageable.
The speedups produced by the conversion to hardware for the bubble, heap and quick
sorts are 3x, 2.3x and 1.2x respectively; in all cases under 5% of the FPGA was used.
As expected, the quick sort is the faster algorithm in software. However, when the
algorithms are translated into hardware the heap sort is 32% faster than the quick
sort, due to the fact that the hardware conversion can only operate on the innermost
loops. In the case of the quick sort algorithm, these loops are very small and are
executed many times, resulting in the hardware pipeline being triggered repeatedly.
This can be deduced from the fact that the "start exec" signal frequently goes high,
as shown in figure 5.8. The overhead associated with this has a significant impact on
performance, as the "output res" and "shift en" signals are active for a considerable
amount of time.
Figure 5.8: Logic analyzer trace showing RC pipeline execution of quick sort
algorithm
5.4 Performance Scalability 
The conversion of kernel loops to hardware increased performance by up to a factor
of 54.9. Although this is a significant increase, the average hardware utilization
was only 6.3%, so there is considerable room for further improvement. Furthermore,
the current generation of FPGAs [114, 115] on the market are up to four times
larger than the APEX 20K1500 used for this evaluation.
5.4.1 Bandwidth 
Many of the algorithms tested here are highly bandwidth limited and this is partly 
due to the non-pipehned nature of the memory interface. As a result, load operations 
97 
5. M I P S Test Platform 
take two clock cycles and store operations take three. By pipelining the memory 
interface, a potential and additional 2-3x increase in performance might be obtained. 
The current hardware conversion system extracts parallelism in two orthogonal
directions and, in the majority of test cases, the resulting parallelism is enough to
saturate the available bandwidth. The parallel nature of many algorithms, when
combined with the additional optimizations outlined in section 7.1.3, enables the
utilization of all available hardware resources. However, considerable bandwidth
may be required. A good example of this is the random number generation test,
which is capable of producing 64 MBytes/s of data at a clock speed of 16 MHz
whilst occupying a mere 1.2% of the FPGA. If the hardware were to be scaled to
use all the available logic cells in the FPGA, it would generate over 5.3 GBytes/s
of data. Memory systems that can handle this level of bandwidth are readily
available (e.g. dual channel DDR-400 = 6.4 GBytes/s); however, the limiting factor
soon becomes the bandwidth available out of the FPGA itself. The FPGA used in
the MIPS test platform has 488 user I/O pins. At 16 MHz, this results in a maximum
bandwidth of 967 MBytes/s, which is 5.5x lower than the data rate that the system
requires. This scales to a required bandwidth of 83.3 GBytes/s compared to an
available bandwidth of 15.3 GBytes/s when the clock speed is raised to its maximum
of 250 MHz (the fastest that the I/O pins will allow). Although it is possible
to create memory systems with this level of performance, this is impractical in a
mainstream computing environment.
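The scaling argument above can be retraced with a short back-of-envelope calculation. The sketch below uses only the figures quoted in the text (64 MBytes/s at 1.2% utilization, 488 user I/O pins, 16 MHz and 250 MHz). The function names are illustrative, and it assumes that every pin transfers one bit per clock cycle and that the required bandwidth scales linearly with utilization and clock speed. Under these assumptions the results come out close to the quoted figures.

```python
# Back-of-envelope check of the bandwidth scaling argument.
# Inputs are the figures quoted in the text; the helper names are illustrative.

def scaled_bandwidth(measured_mbytes_s, fpga_fraction_used):
    """Data rate if the design were replicated to fill the whole FPGA."""
    return measured_mbytes_s / fpga_fraction_used


def pin_bandwidth(num_pins, clock_mhz):
    """Peak off-chip rate in MBytes/s, assuming one bit per pin per clock."""
    return num_pins * clock_mhz / 8.0


required_16 = scaled_bandwidth(64.0, 0.012)   # MBytes/s at 16 MHz
required_250 = required_16 * 250.0 / 16.0     # assumed linear scaling with clock
available_16 = pin_bandwidth(488, 16.0)
available_250 = pin_bandwidth(488, 250.0)

print(f"required  at  16 MHz: {required_16 / 1000:.2f} GBytes/s")
print(f"available at  16 MHz: {available_16 / 1000:.2f} GBytes/s")
print(f"required  at 250 MHz: {required_250 / 1000:.2f} GBytes/s")
print(f"available at 250 MHz: {available_250 / 1000:.2f} GBytes/s")
print(f"shortfall factor    : {required_16 / available_16:.1f}x")
```

The shortfall factor is independent of the clock speed in this model, since both the required and the available bandwidth scale with it.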
5.4.2 Parallelism 
Increasing the amount of parallelism in code will produce dramatically different 
results, depending on whether the algorithm is to be performed in software or in 
hardware. This is because processors and FPGAs are based on completely different 
computing paradigms:-
Processors Even the most modern superscalar and EPIC CPUs only operate on
small sections of code at a time, executing instructions pseudo-sequentially.
In general, since a single processor can only exploit ILP, the number of
instructions that can be simultaneously executed is limited by both the data
dependencies and the maximum issue rate of the instruction pipeline. As a
result, increasing the number of instructions to be executed leads directly to
an increase in execution time. Equation 5.2 demonstrates this: the execution
time is directly proportional to the number of instructions executed.
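\[
\text{SW exec time}
  = \frac{\text{Num loop iterations} \times \text{Num instructions in loop}}
         {\text{Average issue rate} \times \text{Clock speed}}
  = \frac{\text{Number of instructions executed}}
         {\text{Average issue rate} \times \text{Clock speed}}
\tag{5.2}
\]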
FPGAs Once converted to hardware, each instruction is represented by its own
block. The clock rate is limited by the speed of the slowest block and not by the
number of blocks. Because hardware is inherently parallel, several orthogonal
types of parallelism can be exploited (e.g. ILP, loop iteration and inter-loop
parallelism). Therefore, the number of operations that can be performed in
any one clock cycle is limited only by the data dependencies and not by the
issue rate of an instruction pipeline. The parallelism is demonstrated by the
fact that the number of operations that the pipeline performs has no effect on
performance and is therefore not a factor in equation 5.3, which calculates
the execution time. Consequently, as the number of instructions increases, the
hardware utilization also increases, whilst the clock speed remains roughly
constant.
\[
\text{HW exec time} \approx \frac{(C \times I) + L}{\text{Clock speed}} \tag{5.3}
\]
C = Cycles between iteration starts
L = Pipeline length
I = Number of loop iterations per run
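To make the contrast between equations 5.2 and 5.3 concrete, the following sketch evaluates both for a loop whose body doubles in size. All numeric inputs are hypothetical and chosen only for illustration. It also assumes that C and L stay fixed when instructions are added, which is a simplification, since a longer loop body may lengthen the pipeline.

```python
def sw_exec_time(loop_iterations, instructions_in_loop, issue_rate, clock_hz):
    """Equation 5.2: software time grows with the number of instructions executed."""
    return (loop_iterations * instructions_in_loop) / (issue_rate * clock_hz)


def hw_exec_time(cycles_between_starts, loop_iterations, pipeline_length, clock_hz):
    """Equation 5.3: hardware time depends on C, I and L, not on the operation count."""
    return (cycles_between_starts * loop_iterations + pipeline_length) / clock_hz


ITERATIONS = 1_000_000   # hypothetical loop trip count
CPU_HZ = 2.2e9           # hypothetical CPU clock
FPGA_HZ = 150e6          # hypothetical FPGA clock

for instructions in (20, 40):
    sw = sw_exec_time(ITERATIONS, instructions, issue_rate=2.0, clock_hz=CPU_HZ)
    # C = 1 and L = 30 are assumed to be unchanged by the larger loop body.
    hw = hw_exec_time(1, ITERATIONS, 30, FPGA_HZ)
    print(f"{instructions} instructions: SW {sw * 1e3:.2f} ms, HW {hw * 1e3:.2f} ms")
```

Doubling the instruction count doubles the software time, while the hardware time is unchanged; the extra instructions appear as additional hardware utilization instead.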
Several common optimization techniques, like loop unrolling and loop pipelining,
are designed to increase the ILP in the code. The effects of increasing the level of
parallelism, for both hardware and software, are shown in figure 5.9. The speed
of the software only case increases until the maximum issue rate of the pipeline
is reached, at which point the performance remains constant. If the parallelism is
increased further by unrolling, the performance will eventually start to decrease as
instruction cache misses and register stacking reduce the efficiency of the processor.
Because the hardware is not limited to only exploiting ILP, its performance will
increase more rapidly than the performance of the software. At some point, the
performance of the hardware will level off as the bandwidth limit is reached. Any
further increase in parallelism due to unrolling will only result in additional
hardware utilization without any further increase in performance, and should therefore
be avoided. Ultimately, the hardware conversion will fail, as the resources required
will become greater than those available.
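The qualitative behaviour described above can be captured in a toy piecewise model. This is only a sketch: the limits are hypothetical, performance is expressed in arbitrary relative units, and the model deliberately omits the eventual decline of the software case (cache misses) and the steeper initial slope of the hardware case.

```python
def software_speed(parallelism, max_issue_rate):
    """Relative software speed: rises with ILP until the issue rate saturates."""
    return min(parallelism, max_issue_rate)


def hardware_speed(parallelism, bandwidth_limit, resource_limit):
    """Relative hardware speed, or None once the design no longer fits the FPGA."""
    if parallelism > resource_limit:
        return None  # conversion fails: more resources required than available
    return min(parallelism, bandwidth_limit)


# Hypothetical limits, chosen only so that all three regions are visible.
for p in (1, 2, 4, 8, 16, 32, 64):
    print(p, software_speed(p, max_issue_rate=3),
          hardware_speed(p, bandwidth_limit=16, resource_limit=32))
```

In this model, unrolling beyond the bandwidth limit (16 here) adds hardware without adding speed, and unrolling beyond the resource limit (32 here) makes the conversion fail.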
5.4.3 Algorithm Complexity 
An initial analysis of the results suggests that reconfigurable computing platforms
will always be bandwidth limited. However, by increasing the size and/or the
complexity of the section of code that is to be converted into hardware, it is possible to
further increase the performance of bandwidth limited algorithms.
Figure 5.9: Generalized effects of parallelism on performance (Software vs Hardware).
Plotted series: Software, Hardware. Annotation: further increases in parallelism
cause hardware conversion to fail due to limited FPGA resources.
5.4.3.1 Increased section size 
Generally, the larger a section of code is (that is to be converted into hardware), the 
greater the parallelism and therefore the higher the performance (see section 5.4.2), 
even if the system is already bandwidth limited. If the additional code being added 
to the RC area operates on a data set that is already processed or produced by the 
RC area, then the bandwidth requirements may not change significantly. In some 
cases, the off chip bandwidth requirements will actually be reduced. One example 
of this is picture processing and compression, a simple example of which is shown 
in figure 5.10. If the RC area contains a histogram correction algorithm, adding run 
length encoding (RLE) compression will reduce the off chip bandwidth. Similarly, 
adding a low pass filter to the RC area will not change the bandwidth requirements 
but would increase performance. The process of adding additional sections of code 
to the RC area can be automated using the procedure described in section 7.1.1. 
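As an illustration of how an added compression stage can reduce the off chip bandwidth, the sketch below implements a simple byte-oriented run length encoder and compares input and output sizes. The (count, value) pair format and the sample data are assumptions made for this example; the encoding used in the thesis is not specified in this section.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (run length, value) byte pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        out += bytes((run, value))
        i += run
    return bytes(out)


# Hypothetical scan line with long runs, as might follow histogram correction.
scan_line = bytes([0] * 100 + [255] * 50 + [128] * 106)
encoded = rle_encode(scan_line)
print(len(scan_line), "bytes in,", len(encoded), "bytes out")
```

Note that this simple format expands data that contains no runs (two output bytes per input byte), so the bandwidth saving depends on the content of the data being processed.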
Figure 5.10: Picture processing and compression. Stages shown: Raw Picture Data,
Low Pass Filter, Histogram Correction, RLE Compression, Picture Data.
5.4.3.2 Increased complexity 
A cursory analysis suggests that there is little or no benefit to be obtained from
reconfigurable computing on a real time system such as MPEG decode and
playback, since once the real time constraints have been met, there is little advantage in
increasing performance. However, in many environments the complexity of an
algorithm or data processing system is limited by the available computational power. A
good example of this is 3D games, where, with the advent of ever faster CPUs and
3D graphics cards (that offload the rendering of the display to dedicated hardware),
the amount of processing power available has increased dramatically. However, the
amount of idle time has remained roughly constant due to the ever increasing
complexity of the physical models on which the games are based. The same trend can
be seen in HPC environments, where the time required to perform intensive tasks,
like simulation, has also remained approximately constant. This is due to the fact
that the complexity, and therefore the accuracy, of these systems has increased in line
with processing power.
The bandwidth constraints outlined in section 5.4.1 can severely limit the
performance improvement obtained when reconfigurable computing is applied to existing
algorithms. However, in many cases, bandwidth does not limit the size of the
algorithm. As a result, in addition to producing significant increases in performance,
reconfigurable computing could yield dramatic increases in the complexity, and
therefore the accuracy, of many computing tasks/models.
5.5 Platform Evaluation 
Section 1.3 outlined the features required to produce a viable reconfigurable
computing platform. In general, these can be summarized by the following requirements:-
• Abstraction 
• Automatic conversion 
• Low conversion time 
• Large performance increase 
5.5.1 Abstraction 
Because the conversion process used (see chapter 4) is designed to be performed at
runtime instead of compile time, the original software executable contains no
information specific to the target RC area. In addition to providing complete abstraction,
this has the benefit that existing legacy software will also benefit from hardware
acceleration.
5.5.2 Automatic conversion 
With the exception of the user setting for pointer aliasing (see section 4.6.1.1), the
tool flow is completely automatic and autonomous. Since section 4.6.1.1 outlines
a solution that eliminates the need for any user intervention, it is expected that the
requirement for an automatic conversion process can be fulfilled by the tool flow
presented in this thesis.
5.5.3 Low conversion time 
The total time required to perform the conversion process is largely dominated by
the time required to perform the FPGA place and route, and is therefore outside the
control of the tools presented here. However, research has already been conducted on
possible methods to significantly reduce this overhead [44]. A considerable amount
of effort has gone into ensuring the conversion tools do not require large amounts
of CPU time (currently a few seconds, even for the most complex conversions), so
that in the future the overall conversion time should be minimal.
5.5.4 Large performance increase 
The use of reconfigurable computing offers the opportunity to greatly accelerate the
performance of a system. With speedup factors of up to 55 times achieved, the new tool
flow presented in this thesis is no exception. In the future, much greater
performance improvement should be achievable once the bandwidth restrictions outlined
in section 5.4.1 have been resolved.
5.6 Summary 
In most cases the performance enhancement is constrained by the memory
bandwidth available. Although this is a general problem with all computing paradigms, it
is especially true of reconfigurable computing platforms because of the higher
computational throughput, and therefore the requirement for faster data transmission.
Unfortunately, the memory interface/controller that was present on the host MIPS
CPU used was only designed to support the minimal bandwidth requirements of
a single issue CPU. As a result, it requires 2 cycles to perform a load and 3 cycles
to perform a store operation, and this is further compounded by the fact that the
memory interface is not pipelined. Although this severely limits the performance
produced, the conversion to hardware still achieved speedups in the range of 1.2x to
54.9x.
In the software domain, the execution time is governed by the number of
instructions that are executed. However, in the hardware domain, this translates into the
amount of hardware used in implementing the loop, and as such does not directly
affect the execution time. Instead, the hardware execution time is determined by both
the amount of parallelism that can be exploited and the available bandwidth.
The average FPGA utilization was only 6.3% (3200 logic cells), showing that there
is a considerable amount of unused hardware resource which could be used to
increase the algorithm complexity/parallelism. This would directly lead to additional
increases in the speedup produced.
During the execution of some of the test algorithms, it was noted that the hardware
pipeline was triggered repeatedly, often with large amounts of idle time between
executions. This is due to the fact that the hardware conversion can only process the
innermost loops of an algorithm. Substantial further improvements could be
obtained by optimizing the hardware triggering method and by changing the hardware
generation process so as to be able to handle nested loops.
Chapter 6 
Cray XD1 Platform
In order to evaluate the performance potential of reconfigurable computing when
utilized in a high performance computing environment, the hardware conversion tools
outlined in chapter 4 were used in conjunction with the Cray XD1 supercomputer.
6.1 Cray XD1 Overview
The Cray XD1 [67] is the first of a new generation of computers that contain FPGAs
as an integral part of the system. In these, the FPGAs are tightly coupled to the
processor and have high bandwidth, low latency access to memory. The XD1 is a
modular and scalable system, consisting of one or more 3 VU rack mount chassis.
Each chassis contains 6 compute blades, with each blade consisting of an FPGA and
two Opteron CPUs running Linux (as shown in figure 6.1). Although XD1s based
on the new generation of dual core Opteron processors and Virtex4 FPGAs are now
available, the system used during the course of this research contained single core
CPUs clocked at 2.2 GHz together with Virtex II Pro 50 FPGAs.
In addition to providing a high bandwidth link between the FPGA and the rest
of the system, the chipset also provides a set of software accessible registers to
control the supporting hardware for the FPGA. Included in this is a programming
interface which allows the user to reconfigure the FPGA. Since the user defined logic
in the FPGA is in a different clock domain from the interface to the CPU and main
memory, the user logic is able to run at any clock speed between 63 and 199 MHz,
in 1 MHz increments. This can be changed whilst the system is running and is
controlled by the chipset.
To interface the FPGA to the rest of the system, Cray provides several interface
cores. These are combined with the user defined logic to form the FPGA
configuration, as shown in figure 6.2. Together, these cores occupy approximately 8% of
the Virtex II Pro 50 FPGA, leaving a substantial number of logic cells and other
hardware resources available for the implementation of user logic.
Clock & reset core This block generates all the clock signals required by the
interface hardware and user logic, derived from the reference clock which is
generated by the chipset. The block also provides additional reset signal
generation and distribution.
RT core The data interface from the chipset is a simplified version of the
HyperTransport protocol [116]. This interface is very complex and runs at a high
clock rate. The RT core translates this into a wide, low clock rate interface
that simplifies the design of the user logic section in the FPGA.
QDR interface There are four instances of the QDR interface, one for each
external memory. Like the RT core, they provide a simple, easy to use interface
to these otherwise complex devices.
Figure 6.1: Cray XD1 blade architecture. Blocks shown: two 2.2 GHz AMD Opteron
processors, each with DDR-400 SDRAM and a Cray chipset; a Xilinx Virtex II Pro 50
FPGA with four QDR-SRAMs; links to the rest of the system. Link bandwidth labels:
6.4 GBytes/s, 3.2 GBytes/s and 2.0 GBytes/s.
Figure 6.2: Cray FPGA interface cores. Blocks shown inside the Xilinx FPGA: Clock &
Reset Logic, User Defined Logic and four QDR Interfaces, each connected to a
QDR-SRAM.
6.2 Platform Details 
6.2.1 Execution 
The user defined logic section of the FPGA is used to implement a data flow pipeline
similar to the one described in section 3.3. Unlike the MIPS test platform described
in chapter 5, the FPGA is not integrated into the instruction pipeline of the
processor. As a result, execution in the FPGA is triggered by accessing memory mapped
control registers within the FPGA and not by executing specific RC trigger
instructions, as is the case with the MIPS test platform.
To measure the performance increase of both the software only and hardware
accelerated test cases, the clock cycle counter present in the Opteron processor was used
[117]. This was the most accurate way available to measure the execution time.
The hardware in the XD1 contains a link between the FPGA and the interrupt
controller. This would normally allow the CPU to trigger execution in the FPGA and
then continue conventional software execution. When the FPGA finishes its task or
requires software intervention, it should be able to trigger an interrupt that would
cause the CPU to jump to the FPGA control code. However, the software to
support the hardware interrupt connection is unfortunately not yet available, although
it is scheduled for a future release of the Cray FPGA software application
programming interface (API). This, together with other platform limitations, reduces the
functionality and performance of the hardware conversion tools described in
chapter 4. In particular, the lack of interrupt capability results in the CPU having to
continuously poll the FPGA registers to determine when the execution has finished.
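The polling scheme described above can be sketched with a small software model. The class below is a hypothetical stand-in for the memory mapped registers: the register names, bit layout and behaviour are assumptions made for illustration and do not represent the Cray FPGA API. Each status read stands for one transaction over the link between the CPU and the FPGA.

```python
class SimulatedFpga:
    """Hypothetical stand-in for the FPGA's memory mapped registers (not the Cray API)."""

    CTRL_START = 0x1   # assumed bit layout, for illustration only
    STATUS_DONE = 0x1

    def __init__(self, busy_polls):
        self._busy_polls = busy_polls  # status reads that report "busy" before done
        self._remaining = None         # None means execution was never triggered

    def write_control(self, value):
        if value & self.CTRL_START:
            self._remaining = self._busy_polls

    def read_status(self):
        # Each call models one register read travelling over the system link.
        if self._remaining is None:
            return 0
        if self._remaining > 0:
            self._remaining -= 1
            return 0
        return self.STATUS_DONE


def run_and_poll(fpga):
    """Trigger execution, then busy-wait on the status register; return the poll count."""
    fpga.write_control(SimulatedFpga.CTRL_START)
    polls = 0
    while True:
        polls += 1
        if fpga.read_status() & SimulatedFpga.STATUS_DONE:
            return polls


print("status reads issued:", run_and_poll(SimulatedFpga(busy_polls=1000)))
```

With an interrupt, the CPU would issue a single trigger write and then be free to do other work; with polling, the number of status reads grows with the run time of the hardware, and each read consumes bandwidth.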
6.2.2 Tool flow 
Like many other modern computing platforms, the XD1 implements a virtual
memory system. Since the FPGA is connected directly to the main system bus instead of
to the MMU, it exists in the physical address space, whereas the software is executed
in one or more virtual address spaces. This makes it extremely difficult to produce
an x86 front end for the hardware conversion tools. As a result, the C source code
is compiled to a MIPS executable and is processed using the existing MIPS front
end for the hardware generator (as shown in figure 6.3). As outlined in section
7.2.2, once the next release of the Cray support software is made available, this
problem can be resolved and an x86 front end produced. Since a MIPS executable
is the starting point for the hardware conversion process, it must be accompanied
by a MIPS execution profile. This is because the location of the branches will be
very different in the x86 executable, due to the differences in the compiler and the
complex instruction set computer (CISC) nature of this processor. Therefore, the
execution profile is generated by the behavioural simulation of the MIPS platform,
described in section 5.1. The original source code is compiled, without modification,
using the x86 GCC compiler; the execution of this code provides a reference which is
then used to determine the speedup factor produced by the hardware acceleration.
The use of the MIPS front end instead of an x86 front end will have an effect on the
results; however, it is expected that any differences will be minimal and would be
dwarfed by other factors, such as available bandwidth and the amount of parallelism
present in the code. This is due to the fact that both of the C compilers used (MIPS
and x86) are variants of GCC, and therefore the same types of optimisations will
be performed in both cases.
Due to the complex and time consuming nature of handwriting the RC trigger 
instructions, the conversion process is limited to accelerating only a single software 
loop. 
Figure 6.3: Hardware conversion tool flow for Cray XD1. Blocks shown: program source
code in C, MIPS GCC compiler, MIPS execution profile, generator, hardware config
image, hand written trigger code insertion, x86 GCC compiler, original XD1 program,
accelerated program.
6.2.3 Memory Access 
Due to the issues associated with virtual memory that were outlined in section 6.2.2,
all the DMA operations in the data flow pipeline access the QDR-SRAM memories,
instead of the main SDRAM memory that is connected to the processor. Even if the
virtual memory issues were to be resolved so that DMA operations could access the
main SDRAM, the performance of the FPGA would still be severely limited owing
to the high levels of bandwidth consumed by the CPU polling the FPGA, described
in section 6.2.1. A connection between the RT core and the QDR-SRAM cross bar
switches is present, which allows the host CPU to read and write test data to these
memories.
Each Cray QDR-SRAM interface core provides independent 72 bit read and write
ports capable of running at up to 200 MHz. Consequently, the combined bandwidth
to each SRAM is 3.2 GBytes/s. All four SRAMs are interleaved to form one 16
MByte address space. This is accomplished by the two cross bars that are used to
interface the memory operations in the data flow pipeline to the SRAMs, as shown
in figure 6.4. The cross bars analyse the addresses of the memory transactions
and allocate them to the corresponding memories. Each cross bar can issue up to
four operations per cycle, provided that the addresses of pending operations do not
reside in the same physical memory device. As a result, the maximum theoretical
bandwidth is 12.8 GBytes/s, but in practice this is reduced to 6.4 GBytes/s, as the
maximum width of a value in the pipeline is 32 bits compared to the 64 bit width of
the memories. However, this reduced bandwidth exactly matches the bandwidth to
the main DDR-SDRAM memory available to the Opteron processors, which confers
more validity to the performance comparison between the software only and the
hardware accelerated test cases. It is worth noting that the latency of the SDRAM
is dependent on many factors, which include previous access patterns and which
rows, columns, and pages are active. In contrast, the latency of the QDR-SRAM,
including the controllers and cross bars, is always 10 clock cycles.
Figure 6.4: Cross bar architecture for QDR memory interface. Blocks shown: Load
Op 1 to Load Op N feeding the load operation cross bar, Store Op 1 to Store Op M
feeding the store operation cross bar, and four QDR-SRAMs.
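The QDR-SRAM bandwidth figures quoted in this section can be retraced as follows. The sketch assumes that 64 of the 72 bits of each port carry data (consistent with the 64 bit memory width mentioned in the text, with the remaining 8 bits assumed to be parity or check bits) and that every port transfers one word per clock cycle at 200 MHz.

```python
def port_bandwidth_gbytes(data_bits, clock_mhz):
    """Bandwidth of one port in GBytes/s, assuming one word per clock cycle."""
    return data_bits / 8 * clock_mhz / 1000.0


per_port = port_bandwidth_gbytes(64, 200)   # assumed 64 data bits of the 72 bit port
per_sram = 2 * per_port                     # independent read and write ports
theoretical = 4 * per_sram                  # four interleaved SRAMs
usable = theoretical * 32 / 64              # pipeline values are at most 32 bits wide

print(f"per port    : {per_port:.1f} GBytes/s")
print(f"per SRAM    : {per_sram:.1f} GBytes/s")
print(f"theoretical : {theoretical:.1f} GBytes/s")
print(f"usable      : {usable:.1f} GBytes/s")
```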
Although in theory the minimum operating frequency for the RT interface and
other support cores is 63 MHz, due to a bug in the implementation of the QDR core
provided by Cray, clock speeds lower than 130 MHz resulted in memory access errors.
Algorithms that contain DMA operations are, therefore, limited to a frequency range
of 130 - 199 MHz instead of 63 - 199 MHz, as both the QDR memories and the data
flow pipeline are in the same clock domain.
To minimize the performance impact due to the latency of the memory interface,
all the DMA operations in the data flow pipeline are fully pipelined. For example,
a load operation spans ten pipeline stages and can have active data in every stage.
As the latency of the DMA operations in the pipeline matches the latency of the
memories, the pipeline will only stall if the addresses of multiple operations issued
on the same cycle reside on the same SRAM device. In the ideal case, the pipeline
is able to issue four load and four store operations on every clock cycle without
stalling. This is a considerable improvement over the MIPS test platform, where
every load operation stalled for 2 clock cycles and each store operation stalled for 3
cycles.
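The stall condition described above can be illustrated with a small model of one cross bar. The interleaving granularity is not stated in this section, so the sketch assumes that consecutive 8 byte words map to consecutive SRAMs; this mapping and the function names are hypothetical. Operations aimed at the same SRAM in the same cycle are serialized, so the number of cycles needed equals the largest number of operations that target any one device.

```python
from collections import Counter

NUM_SRAMS = 4
WORD_BYTES = 8  # assumed interleave granularity (one 64 bit word per device)


def sram_of(address):
    """Device selected by an address under the assumed word interleaving."""
    return (address // WORD_BYTES) % NUM_SRAMS


def cycles_to_issue(addresses):
    """Cycles one cross bar needs to issue all pending operations."""
    if not addresses:
        return 0
    per_device = Counter(sram_of(a) for a in addresses)
    return max(per_device.values())


print(cycles_to_issue([0, 8, 16, 24]))    # streaming access: one op per device
print(cycles_to_issue([0, 32, 64, 96]))   # stride of 32 bytes: all ops hit one device
```

Under this assumed mapping, sequential (streaming) accesses spread evenly over the four devices, whereas a stride equal to the full interleave width sends every operation to the same device and stalls the pipeline.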
6.3 Performance Evaluation 
All the test algorithms are based on the same source code and run under the same
test conditions as those used on the MIPS platform (see section 5.3 and appendix
A). Each test was executed both in software and also with the hardware acceleration
enabled; the number of clock cycles required was recorded and then used to calculate
the improvement in performance. For algorithms that contain memory operations,
the software case accessed a 16 MByte buffer allocated in the main DDR-SDRAM
of the system, while the hardware accelerated version of the algorithm accessed the
local QDR-SRAMs. The effect of the different memory latencies between the FPGA
and CPU is minimized by the intelligent cache prefetching performed by the
CPU, together with the streaming nature of many of the test algorithms. This,
combined with the identical memory bandwidths (as described in section 6.2.3), makes
the results of the software only and hardware accelerated test cases approximately
comparable. Both memories were initialized with the same test data, and a bitwise
comparison was performed on the results once the test had completed. In every
case, the contents exactly matched, indicating that the hardware present in the
FPGA functioned correctly.
For each test case, the average number of parallel operations executed in the pipeline
was recorded. This measure of parallelism can be used to give a rough indication of
the performance increase that is to be expected. It is worth noting that the number
of pipeline operations will be different from the number of software instructions, due
to the nature of the optimization steps performed during the hardware conversion.
Because the frequency of the FPGA is set at the maximum for each algorithm, the
TSF value is different for each test case. The TSF values, hardware utilizations,
and performance improvements for each algorithm are shown in table 6.1.
Due to the bugs in the Cray XD1 outlined in section 6.2 and summarised in section
6.5.1, the following restrictions were placed on the test algorithms:-
• Only a single software loop can be converted to hardware.
• Complex algorithms that require repeated triggering of the hardware cannot
be evaluated.
• The hardware resulting from the conversion process must be able to run at
130 MHz or greater.
The predicted results with these limitations resolved are shown in table 6.2.
6.3.1 PRBS Generator
The PRBS test generates 2 MBytes of random numbers using an LFSR. The
predicted number of operations that can be performed per clock cycle is 161; this is
due to the high levels of ILP present in the algorithm. The simple bitwise nature
of this algorithm results in a high operating frequency of 196 MHz, which leads to
a relatively low TSF value of 11.2. As the amount of parallelism is considerably
higher than the TSF value, a significant increase in performance was expected when
the algorithm was converted into hardware. This was found to be the case as the 
hardware accelerated version of the algorithm was in excess of 18x faster than the 
pure software case. 
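The relationship between the TSF value and the expected benefit can be sketched numerically. The TSF is defined elsewhere in the thesis; the code below assumes that it is the ratio of the CPU clock (2.2 GHz) to the FPGA clock, an assumption that reproduces the quoted value of 11.2 at 196 MHz. The ratio of parallelism to TSF is offered here only as a crude indicator, not as the method used to obtain the predicted results; it ignores, for example, the ability of the CPU to issue several instructions per cycle.

```python
CPU_CLOCK_MHZ = 2200.0  # Opteron clock speed quoted in section 6.1


def tsf(fpga_clock_mhz):
    """Assumed definition: CPU clock divided by FPGA clock."""
    return CPU_CLOCK_MHZ / fpga_clock_mhz


def parallelism_headroom(ops_per_cycle, fpga_clock_mhz):
    """Crude indicator: operations per FPGA cycle relative to the clock handicap."""
    return ops_per_cycle / tsf(fpga_clock_mhz)


prbs_tsf = tsf(196.0)
print(f"PRBS TSF      : {prbs_tsf:.1f}")
print(f"PRBS headroom : {parallelism_headroom(161, 196.0):.1f}")
```

A headroom well above one suggests that the extracted parallelism more than compensates for the slower FPGA clock, which is consistent with the speedup observed for this test.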
As the software case uses a mere 41 MBytes/s out of the 6.4 GBytes/s of available
bandwidth, it is clear that the CPU is instruction limited, rather than bandwidth
limited. In comparison, the hardware accelerated case only issues one memory write
per cycle out of a possible four. This, combined with the low hardware utilization of
8.4% and the parallel nature of the algorithm, suggests that the hardware could be
further improved to give a 72x overall increase in performance, when compared to
the software only case.
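For readers unfamiliar with LFSR based generators, a minimal sketch is given below. The width and tap positions used in the thesis are not stated in this section, so the example uses a well known 16 bit maximal length configuration (polynomial x^16 + x^14 + x^13 + x^11 + 1); this choice, the seed and the function names are assumptions made for illustration.

```python
def lfsr16_step(state):
    """Advance a 16 bit Fibonacci LFSR by one bit (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def prbs_bytes(count, seed=0xACE1):
    """Generate `count` pseudo random bytes, taking eight LFSR steps per byte."""
    state = seed
    out = bytearray()
    for _ in range(count):
        for _ in range(8):
            state = lfsr16_step(state)
        out.append(state & 0xFF)
    return bytes(out)


print(prbs_bytes(8).hex())
```

Each new bit is an XOR of a few state bits, so in software every step costs several shift and XOR instructions, whereas in hardware many steps can be unrolled into a single cycle of simple logic. This is one way to picture the high ILP reported for this test.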
6.3.2 Half Brightness 
The half brightness test algorithm produces a reduction in brightness that is
correctly colour balanced; this is accomplished by performing the operation in the YUV
colour space instead of in the RGB colour space. The complex operations present
in this algorithm lead to a reduced clock speed of 161 MHz and therefore to a TSF
value of 13.7, a value that is higher than that in the PRBS generator test described
above. This, combined with the lower amount of parallelism, results in
a 2.6x increase in performance when compared to the software only case.
As the hardware implementation of this algorithm merely issues one load and one
store operation per clock cycle, it is only utilizing a quarter of the available
bandwidth. The body of the loop was unrolled by a factor of two to increase the amount
of parallelism in the section of code. This, consequently, improved the performance
by using more of the available bandwidth. At first sight, it was expected that this
would double the performance of the system, but in fact the performance only
increased by an additional 1.7x, equating to a 4.4x increase over the pure software
case. This is due to the increased complexity of the hardware reducing the
maximum clock speed from 161 MHz to 146 MHz. In theory, the unroll factor could
be increased to four, which would result in an overall performance increase of
approximately 6.4x. However, this configuration could not be tested as the clock speed
would fall below 130 MHz and the bug described in section 6.2.3 would then produce
memory corruption.
6.3.3 Low Pass Filter 
Converting the audio low pass filter algorithm to hardware produced a 7x
reduction in performance due to the combination of a low clock speed (145 MHz) and
the relatively low number (4.2) of operations per clock cycle. Because of the data
dependencies present between iterations of the loop, the next iteration cannot be
started until the majority of the calculations for the previous iteration have been
completed. This, combined with the high latency of the load operation, leads to
eight clock cycles of idle time per iteration, as is shown by the lack of any data
processing operations between stages 1 and 8 in figure 6.5. A potential way to increase
the efficiency is to unroll the loop, so that the algorithm would occupy more than
the 13% of the FPGA utilized in this test. Again, it was impossible to evaluate the
effects of this optimization due to the clock speed bug in the QDR interface core.
Alternatively, it may be possible to further increase the performance by changing
the way in which the hardware is generated so that the data is pre-fetched from
memory, thus greatly reducing the effects of the high memory latency.
6.3.4 Normalization 
This algorithm is similar to that of the audio normalization test performed on the
MIPS platform (see section 5.3.4 and appendix A.5). However, because the XD1
platform is limited to implementing one loop in hardware, only the value scaling
section of the algorithm is tested.
As the normalization algorithm has very few inter-iteration dependencies, a new 
iteration of the loop may be started on every clock cycle. This not only eliminates 
the effects of the memory latency but also increases the parallelism to 17 operations 
per clock cycle. This conversion to hardware actually produced a 32% reduction in 
performance. This is because the parallelism that was extracted was not sufficient 
to overcome the fact that the CPU used can execute multiple instructions per cycle 
and has a clock speed approximately 15x higher than the FPGA. 
The above test algorithm issues two memory operations per cycle out of a possible
eight. Given that hardware utilization is a mere 9.8%, the main loop is a prime
candidate for unrolling. Once again, the effects of this could not be evaluated due to
the bug in the QDR interface core. However, it is expected that the hardware would
be 2.1x faster than the pure software case if the loop were to be unrolled by a factor
of four. Because this case would use the maximum available bandwidth, further
increases in performance could only be achieved by either optimizing the hardware
to increase the clock speed, or by increasing the complexity of the algorithm (see
section 5.4.3).
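To illustrate why an unroll factor of four would saturate the memory interface, the
sketch below shows a value scaling loop before and after unrolling. The loop body,
the fixed-point gain and the names scale_rolled and scale_unrolled4 are assumptions
made purely for illustration (the test code actually used is given in appendix A.5).
The only point being made is that each original iteration issues one load and one
store, so four copies of the body issue eight memory operations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scaling step using a Q8 fixed-point gain (256 == 1.0).
 * Assumes s * gain_q8 fits in 32 bits. */
static int32_t scale_sample(int32_t s, int32_t gain_q8)
{
    return (s * gain_q8) / 256;
}

/* Original form: one load and one store per iteration. */
void scale_rolled(const int32_t *in, int32_t *out, size_t n, int32_t gain_q8)
{
    for (size_t i = 0; i < n; i++)
        out[i] = scale_sample(in[i], gain_q8);
}

/* Unrolled by four: four loads and four stores per iteration, i.e. eight
 * memory operations that a fully pipelined data flow could issue together. */
void scale_unrolled4(const int32_t *in, int32_t *out, size_t n, int32_t gain_q8)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        out[i]     = scale_sample(in[i],     gain_q8);
        out[i + 1] = scale_sample(in[i + 1], gain_q8);
        out[i + 2] = scale_sample(in[i + 2], gain_q8);
        out[i + 3] = scale_sample(in[i + 3], gain_q8);
    }
    for (; i < n; i++)                 /* remaining 0 to 3 samples */
        out[i] = scale_sample(in[i], gain_q8);
}
```

Both functions produce identical results; only the number of independent memory
operations exposed in each iteration differs.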
6.3.5 Copy 
The copy algorithm is extremely bandwidth intensive and only contains a small 
number of instructions. When the original version of the source code is converted 
into hardware, only two memory operations per cycle out of a possible eight are 
issued. This confers a significant performance advantage to the CPU. However, 
in practice, the hardware solution is only 41% slower than the software test case. 
When the loop is unrolled by a factor of two, the performance of the software 
remains constant, whereas the performance of the hardware doubles. This results
in a speedup of 20% when compared to the software case. Although it could not be
tested due to the bug in the QDR interface, the performance of the hardware is
expected to increase to 1.9x that of the software case if the loop were to be unrolled
by a factor of four.
The most likely cause of the poor performance of the CPU is the "read before
write" architecture used in the caches. This architecture reduces the complexity of
the cache by using only one data valid bit per cache line. However, this can result
in up to a 100% increase in the bandwidth required for some classes of algorithms,
because a cache line fill must be performed before any cache writes are allowed to
take place.
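The size of this penalty can be estimated with some simple arithmetic. The sketch
below is an assumed first-order model (it ignores prefetching, write combining and
partially written lines, and the function names are illustrative only): every byte
that is written is first fetched as part of a line fill. Under this model a pure
store stream moves twice the necessary data, which is the 100% worst case, whereas
a copy moves three bytes for every two that are strictly required.

```c
#include <stdint.h>

/* Bytes moved over the memory bus when bytes_read are loaded and
 * bytes_written are stored. If fill_before_write is non-zero, each
 * written line is first filled from memory. */
uint64_t bus_traffic(uint64_t bytes_read, uint64_t bytes_written,
                     int fill_before_write)
{
    uint64_t fills = fill_before_write ? bytes_written : 0;
    return bytes_read + fills + bytes_written;
}

/* Percentage increase in bus traffic caused by the line fills. */
unsigned fill_overhead_percent(uint64_t bytes_read, uint64_t bytes_written)
{
    uint64_t base = bus_traffic(bytes_read, bytes_written, 0);
    uint64_t with = bus_traffic(bytes_read, bytes_written, 1);
    return base ? (unsigned)(((with - base) * 100) / base) : 0;
}
```

For example, fill_overhead_percent(0, 4096) evaluates to 100 (store only stream)
and fill_overhead_percent(4096, 4096) evaluates to 50 (copy).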
6.3.6 Series Sum 
The series sum test consists of a very tight loop with an extremely low level of ILP;
consequently, the hardware accelerated test case only performs 3 operations per clock
cycle. The simple nature of this algorithm leads to the relatively high clock speed
of 190 MHz (TSF = 11.6). Overall, the algorithm produced a 5.9x reduction in
performance when it was converted into hardware. On account of the high number
of data dependencies together with the tight nature of the loop, it is unlikely that
unrolling the loop would produce a significant increase in performance.
6.4 MIPS RC Platform Comparison
Although the MIPS and Cray XD1 platforms share the same hardware conversion
tools, there are significant differences in the hardware architectures of these two
systems, which are detailed below. The most obvious difference is the way in which
their FPGAs are connected to their processors. In the MIPS platform, the FPGA
is directly connected to the instruction pipeline of the CPU, which enables the RC
area to be triggered with a minimal overhead. In the XD1 platform, the FPGA is
connected to the system bus via a Cray proprietary chipset; as such, execution in the
RC area is triggered by accessing a series of memory mapped registers. Although the
Cray approach enables reconfigurable computing to be added to an existing system,
it considerably increases the number of clock cycles required to trigger execution in
the RC area. Consequently, a much higher overhead is experienced by applications
that repeatedly trigger the RC area. In the MIPS platform, because the FPGA is
directly connected to the processor, it utilizes the same load/store interface as the
CPU. This interface severely constrains the performance of the data flow pipeline,
because the memory interface was designed to support a single issue
CPU core and not a highly parallel RC area. At peak performance the MIPS FPGA
can issue one DMA operation every 2-3 clock cycles. By comparison, the memory
subsystem present in the XD1 platform is fully pipelined and parallelized, and is
therefore capable of processing up to 8 DMA operations per clock cycle. To fully
exploit the available bandwidth in the XD1, the data flow pipeline instantiated
within the FPGA must be highly parallel in nature.
The FPGA used to implement the RC area in the MIPS platform was an Altera
Apex 20K1500. This is a relatively old device compared to the Virtex II Pro 50 used
in the Cray XD1. Consequently, the Apex device does not contain any hardware
multiplier blocks, so multiply operations have to be implemented using standard
logic cells. This reduces the clock speed of the system and leads to significantly
higher hardware resource utilization than that found in the Cray XD1. The Virtex
device used in the XD1 is also capable of running at much higher clock frequencies
than the Apex device, further increasing the performance of the Cray platform.
Since the FPGA in the XD1 exists in a separate clock domain from the rest of the
system, its clock speed can be set to the optimal value for the specific hardware
that is to be implemented. However, in the MIPS platform the FPGA runs from
the same, fixed frequency clock source as the CPU core.
6.5 Summary 
6.5.1 Cray XD1 Platform Limitations
Due to a series of bugs, unimplemented features, and architectural problems
associated with the XD1 host platform, the scope of the performance evaluation was
limited in the following ways:-
Manual integration: The FPGA is connected to the system after the MMU. As
a result, the FPGA exists in the physical address space, whereas the software
exists in a virtual address space. This complicates the integration of the
automatically generated hardware with the software, so this integration
is currently done manually, limiting the system to converting a single loop to
hardware. The need for manual integration also prevents the XD1 system from
running in a runtime self adaptive mode. As with the MIPS platform (see chapter 5),
the conversion process is performed pre-runtime using only the program binary
and an execution profile. No access to the original source code is required.
No concurrent execution: The current release of the Cray software does not
support the use of the hardware interrupt line between the FPGA and the CPU.
Consequently, the processor must continually poll the FPGA to determine if
the execution has finished, or if there are any outstanding requests that need
servicing. This prevents the CPU working concurrently with the FPGA.
No DMA to main memory: DMA operations must be performed to the QDR-SRAM
memories that are connected to the FPGA instead of to the main
SDRAM memory, as the majority of the available bandwidth is consumed by
the CPU polling the FPGA.
Restricted clock speed range: Due to a bug in the QDR-SRAM interface core
provided by Cray, clock speeds less than 130 MHz resulted in data corruption.
Therefore, some algorithms could not be tested with higher loop unroll factors,
despite the abundance of hardware resources.
Although these factors do influence the results, in many cases the effect is to
artificially reduce the performance produced by the hardware acceleration. Therefore
these limitations do not change the conclusion that using automatic hardware
conversion tools on HPC platforms like the XD1 can produce significant increases in
performance.
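The polling arrangement described under "No concurrent execution" can be sketched
as follows. The register layout, the bit assignments and the function name are
hypothetical and are not taken from the Cray software; the sketch only shows why the
CPU cannot do useful work while the RC area is running, and why each poll consumes
bus bandwidth.

```c
#include <stdint.h>

/* Hypothetical memory mapped register block (not the real XD1 layout). */
typedef struct {
    volatile uint32_t control;  /* write RC_START to begin execution      */
    volatile uint32_t status;   /* RC_DONE is set when the loop completes */
    volatile uint32_t arg;      /* single input argument                  */
    volatile uint32_t result;   /* single output value                    */
} rc_regs;

enum { RC_START = 1, RC_DONE = 1 };

/* Trigger the RC area and busy-wait for completion.
 * Returns 0 on success, or -1 if max_polls reads elapse without RC_DONE. */
int rc_run_polled(rc_regs *regs, uint32_t arg, uint32_t max_polls,
                  uint32_t *result)
{
    regs->arg = arg;
    regs->control = RC_START;

    for (uint32_t i = 0; i < max_polls; i++) {
        /* Every read of the status register is a bus transaction. */
        if (regs->status & RC_DONE) {
            *result = regs->result;
            return 0;
        }
    }
    return -1;
}
```

With interrupt support, the loop would be replaced by a blocking wait, leaving the
CPU free to execute other work until the FPGA signals completion.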
6.5.2 Clock Speeds 
During the course of the evaluation, it was noted that algorithms that contained
multiply operations had lower clock speeds. To increase the clock speed, multiply
operations were pipelined. This was achieved by placing additional sets of registers
after each multiply operation and ensuring that the register re-timing feature was
enabled. Although this was in accordance with the Xilinx guidelines, it did not
result in a significant increase in clock speed, despite the fact that the timing
reports indicated that multiply operations were still the limiting factor. This could be
resolved by hand coding the multiply operations to explicitly specify the placement
of the additional pipeline registers. In addition, since a new range of FPGAs
containing dedicated cascade logic between the hardware multiplier blocks has become
available (e.g. Virtex-4 [114]), it is now possible to create 32 bit wide multipliers
that are capable of running at up to 500 MHz.
It was noted that algorithms containing large numbers of DMA operations also had
lower clock speeds. This is because, as the size of the crossbar switches increases, so
do the MUXs that they contain, causing their propagation delays to rise. In order
to maintain high clock rates for algorithms with large numbers of DMA operations,
it would be possible to modify the data flow pipeline generation system to also
generate the crossbar switches, so that additional registers are automatically added
as required.
6.5.3 Performance Improvements 
The Opteron processor present in the Cray XD1 can issue multiple instructions
per cycle, and has a clock speed of 2.2 GHz. Since the FPGA that is present
has a maximum clock speed of 200 MHz, a high level of parallelism is required
for this FPGA to produce a significant increase in performance. In general, if the
conversion to hardware produces more than 30 parallel operations, a significant
increase in performance will result. Some algorithms, such as the series sum test,
are not well suited to hardware conversion due to the low levels of parallelism present.
However, since the number of parallel operations is calculated at an early stage in
the conversion process, this calculation can be used to both identify and screen
out non-ideal sections of code from the hardware generation process. Although the
current hardware conversion system produced performance improvements of over
18x, it is clear that there is scope for further enhancement.
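The relationship between parallelism and clock speed can be made concrete with a
first-order throughput model. The model below is an assumption introduced here for
illustration, not a formula from the evaluation: it treats both devices as throughput
bound and ignores memory stalls and trigger overheads. The sustained number of
instructions per cycle achieved by the CPU (cpu_ipc) is not stated in the text and
must be supplied by the reader.

```c
/* Estimated speedup of the FPGA over the CPU under the assumed model:
 *   (operations per FPGA cycle * FPGA clock) / (CPU IPC * CPU clock)   */
double estimated_speedup(double fpga_ops_per_cycle, double fpga_hz,
                         double cpu_ipc, double cpu_hz)
{
    return (fpga_ops_per_cycle * fpga_hz) / (cpu_ipc * cpu_hz);
}

/* Parallel operations per FPGA cycle needed merely to match the CPU. */
double break_even_ops(double fpga_hz, double cpu_ipc, double cpu_hz)
{
    return (cpu_ipc * cpu_hz) / fpga_hz;
}
```

With the clock speeds quoted above (2.2 GHz and 200 MHz) the clock ratio is 11, so
the break-even point is 11 parallel operations for every instruction per cycle that
the CPU sustains. This is the screening calculation that could be applied early in
the conversion process to reject loops such as the series sum.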
Chapter 7 
Optimisations Of The 
Reconfigurable Computing System 
The results from both the MIPS embedded platform in chapter 5 and the XD1
HPC platform described in chapter 6 demonstrate that a considerable increase in
performance can be achieved by converting computationally intensive sections of
software into hardware. However, it is also clear that further improvements can be
made by changing both the conversion software and the hardware platform.
7.1 Hardware Conversion Tools 
7.1.1 Loop Extraction 
As the hardware conversion tools are only capable of processing the innermost loops,
a large proportion of the execution time may still be spent outside of these loops,
resulting in some computationally intensive code not being converted to hardware.
This is one of the major causes of the poor performance of the hardware in the
quick sort algorithm on the MIPS platform (see section 5.3.10). Another side effect
is the high level of overheads for algorithms where the inner loop is triggered
repeatedly, one example of this being the FFT test (see section 5.3.2). It is possible,
however, to alter the hardware conversion system so that these outer loops might
also be converted to hardware. In addition to removing the overhead associated
with triggering the inner loop multiple times, this would also enable the conversion
system to exploit parallelism in an additional, orthogonal direction. This can be
seen in listing 7.1. When using conventional software execution, the two inner loops
and the calculation of the sum of absolute differences are performed sequentially.
However, as there are no dependencies between them, they may be executed in parallel in
the hardware domain. This can produce a substantial increase in the amount of
parallelism, and consequently makes fuller use of the performance increase enabled by
hardware acceleration.
    for ( y = 0; y < 600; y++ )
    {
        // Inner loop A
        for ( x = 0; x < 800; x++ )
        {
            if ( pictureA[y][x] > max ) max = pictureA[y][x];
        }
        // Inner loop B
        for ( x = 0; x < 800; x++ )
        {
            if ( pictureB[y][x] < min ) min = pictureB[y][x];
        }
        // calculate sum of absolute differences
        diff = pictureA[y][0] - pictureB[y][0];
        SAD += ( diff >= 0 ) ? diff : -diff;
    }
Listing 7.1: Example code with multiple nested loops
The current hardware generation tools are not capable of converting loops that
contain function calls. The loop extraction algorithm could easily be modified to
automatically inline suitable functions. This would increase the performance of the
system by increasing the number of loops that could be converted to hardware.
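As an illustration of the transformation, the sketch below shows a loop containing
a call to a small leaf function, followed by the same loop after inlining. The
function names and the clamping operation are hypothetical examples chosen for
brevity; they are not taken from the test algorithms.

```c
#include <stddef.h>

/* Small leaf function: a typical candidate for automatic inlining. */
static int clamp_to_byte(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Loop containing a function call: the kind of loop that the current
 * tools cannot convert. */
void brighten_call(const int *in, int *out, size_t n, int offset)
{
    for (size_t i = 0; i < n; i++)
        out[i] = clamp_to_byte(in[i] + offset);
}

/* The same loop after inlining: the body is now straight-line code
 * with no call, so it can be handled like any other innermost loop. */
void brighten_inlined(const int *in, int *out, size_t n, int offset)
{
    for (size_t i = 0; i < n; i++) {
        int v = in[i] + offset;
        v = (v < 0) ? 0 : v;
        v = (v > 255) ? 255 : v;
        out[i] = v;
    }
}
```

The conditional assignments in the inlined body map naturally onto multiplexers in
a data flow pipeline, which is why small functions of this kind are suitable
candidates.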
7.1.2 Floating Point Operations 
The use of floating point operations is currently not supported. Implementing
this feature is a matter of adding the VHDL primitives for the various
operations to the operation library in the hardware generation software. There are
several commercially available floating point cores on the market that are specifically
optimized for use in FPGAs [58, 57]. Consequently, adding support for floating point
arithmetic is a relatively simple task. However, as discussed in section 1.2.4, the
use of higher radix notations [60] and logarithmic formats [61] can significantly reduce
the hardware utilization in some circumstances. The decision about which floating
point notation/format provides the optimal implementation for a specific data flow
pipeline can be made at run time by the hardware generation tools. This is possible
because the conversion tools possess detailed information about the number, type
and connectivity of all the operations in the pipeline.
7.1.3 Optimization 
As demonstrated in sections 6.3 and 3.4, loop unrolling can substantially increase
the speedup factor produced when the code is converted to hardware. Although loop
unrolling is a common software optimization technique, the optimum unroll factor is
different depending on whether a section of code is being executed in software or in
hardware. In addition, the unroll factor can have a dramatic effect on performance;
too small a value can result in low levels of parallelism, thereby limiting the
performance of the system; too large a value can result in inefficient usage of the available
hardware resources, which in extreme cases can prevent the loop from fitting into
the RC area. Since the optimal unroll factor is dependent on several parameters
that may differ from one system to another (e.g. bandwidth, memory latency and
RC area size), it is not possible to determine the most appropriate value at compile
time. A potential solution to this problem is for the hardware generation tools to
examine the code and to calculate the most suitable unroll factor. The conversion
tools could then either unroll or re-roll the loop as appropriate, producing code
which is specifically optimized for the particular system, without reducing the level
of abstraction.
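One possible form of such a calculation is sketched below. It is a deliberately
simple heuristic, offered as an assumption rather than as the method used by the
conversion tools: it takes the largest unroll factor permitted by both the memory
issue limit and the available RC area, and it assumes that area grows linearly with
the unroll factor. Memory latency, which the text also lists as a parameter, is not
modelled. All parameter names are illustrative.

```c
#include <limits.h>

/* Largest unroll factor allowed by bandwidth and area.
 *   mem_ops_per_iter  loads + stores issued by one loop iteration
 *   mem_ports         memory operations the platform accepts per cycle
 *   area_per_copy     estimated area of one copy of the loop body (%)
 *   area_budget       area available to the pipeline (%)
 * Returns 0 if even a single copy of the loop does not fit. */
unsigned choose_unroll_factor(unsigned mem_ops_per_iter, unsigned mem_ports,
                              unsigned area_per_copy, unsigned area_budget)
{
    unsigned by_area = area_per_copy ? area_budget / area_per_copy : UINT_MAX;
    unsigned by_bw   = mem_ops_per_iter ? mem_ports / mem_ops_per_iter
                                        : UINT_MAX;

    if (by_area == 0)
        return 0;       /* loop cannot be placed in the RC area */
    if (by_bw == 0)
        by_bw = 1;      /* already over-subscribed: do not unroll further */

    return (by_area < by_bw) ? by_area : by_bw;
}
```

For a loop that issues two memory operations per iteration on a platform accepting
eight per cycle, with roughly 10% of the device used per copy,
choose_unroll_factor(2, 8, 10, 100) evaluates to 4, the point at which the memory
interface is saturated.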
When a host platform becomes available that supports exceptions/interrupts, an 
error detection and roll back system could be implemented. This would resolve 
the pointer aliasing issue outlined in section 4.6.1.1, and also allow more aggressive 
optimizations to be performed. In this case, if the optimization were not suitable for 
a specific section of code, producing errors and therefore invalid data, the hardware
execution would be terminated and the state of the system rolled back to a point 
before the error occurred. The hardware conversion system would then re-implement 
the hardware without the optimization that caused the error. 
7.1.4 Scheduling 
7.1.4.1 Operation Variants 
Complex operations like multipliers, dividers and barrel shifters can take a
considerable number of logic cells to implement. Operations can be implemented using
different methods, depending on the latency, throughput requirements and device
usage constraints. The hardware generation software described in chapter 4
currently implements all operations with the minimum possible latency. Since in some
cases the result of an operation is not required until further down the pipeline, data
forwarders are added to the pipeline, as described in section 4.6.2. This provides
the potential for the hardware generation tools to decrease the hardware utilization
of an operation, albeit at the expense of increasing its latency, without
affecting the overall performance of the system.
Figure 7.1: Example data flow pipelines with and without the local feedback
optimization. (a) Feedback to stage 0. (b) Local feedback. Legend: Src/Result
Register, Inter-stage Register, Stage Operation.
7.1.4.2 Local Feedback 
In several of the test algorithms evaluated (e.g. the low pass filter), the number of clock
cycles between iterations is artificially high, since the resultant values can only be
passed backwards to the very first stage of the pipeline; this limits the performance.
One such example is shown in figure 7.1(a), where the result of the addition operation
on stage 2 is forwarded back to stage 0, delaying the next iteration until after
the previous iteration finishes. This performance restriction can be removed by
modifying the pipeline to include local feedback, so that the value of register "r1"
on stage 0 is not required to start the next iteration (as shown in figure 7.1(b)).
The number of clock cycles between starting iterations in the pipeline has a
significant effect on performance: as shown in equation 5.3, the overall execution time is
roughly proportional to it. Performance will, therefore, be significantly enhanced by
adding the capability to perform local feedback optimizations, such as in the simple
example shown in figure 7.1, where this optimization has tripled the performance.
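The tripling quoted for figure 7.1 can be reproduced with a simple cycle count.
The function below is an assumed model written for this illustration and is not
equation 5.3 itself: the first iteration takes the full pipeline depth, and each
subsequent iteration starts a fixed number of cycles after the previous one.

```c
#include <stdint.h>

/* Total clock cycles to run `iterations` loop iterations through a
 * pipeline of `depth` stages when a new iteration can start every
 * `interval` cycles. */
uint64_t pipeline_cycles(uint64_t iterations, uint64_t depth,
                         uint64_t interval)
{
    if (iterations == 0)
        return 0;
    return depth + (iterations - 1) * interval;
}
```

For a three stage pipeline and 1000 iterations, feedback to stage 0 (an interval of
three cycles) gives 3000 cycles, whereas local feedback (an interval of one cycle)
gives 1002 cycles, a ratio that approaches three as the iteration count grows.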
7.1.4.3 DMA Operation Scheduling 
The current scheduling algorithm places operations onto the pipeline based solely on
the data dependencies that are present. However, if the scheduling algorithm places
more DMA operations onto a stage than the system can support, the pipeline is
forced to stall. By altering the placement of DMA operations within the pipeline,
the amount of idle time could be reduced, and thus performance increased. The
exact placement of these operations would be influenced by the following factors:-
• The DMA issue constraints of the RC area. 
• The number and type of DMA operations in the section of code being converted 
to hardware. 
• The stages in the pipeline where data from load operations is required. 
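A minimal sketch of a placement pass that respects the first of these factors is
given below. It is a greedy illustration under stated assumptions, not the
scheduling algorithm used by the tools: each DMA operation is placed on the earliest
stage permitted by its data dependencies that still has a free issue slot. A
complete scheduler would also have to move any operations that depend on a deferred
load. The names place_dma_ops and MAX_STAGES are hypothetical.

```c
#include <stddef.h>

#define MAX_STAGES 64

/* earliest[i]  first stage at which DMA operation i may issue, as
 *              determined by its data dependencies
 * stage_out[i] stage actually assigned to operation i
 * Returns the number of stages used, or -1 if the operations cannot be
 * placed within MAX_STAGES. */
int place_dma_ops(const unsigned *earliest, size_t n_ops,
                  unsigned max_dma_per_stage, unsigned *stage_out)
{
    unsigned used[MAX_STAGES] = {0};
    int last = -1;

    if (max_dma_per_stage == 0)
        return -1;

    for (size_t i = 0; i < n_ops; i++) {
        unsigned s = earliest[i];

        /* Stage full: defer the operation instead of stalling. */
        while (s < MAX_STAGES && used[s] >= max_dma_per_stage)
            s++;
        if (s >= MAX_STAGES)
            return -1;

        used[s]++;
        stage_out[i] = s;
        if ((int)s > last)
            last = (int)s;
    }
    return last + 1;
}
```

With an issue limit of two operations per stage, five operations whose earliest
stages are 0, 0, 0, 1 and 1 are placed on stages 0, 0, 1, 1 and 2, so no stage
exceeds the limit.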
7.1.4.4 FPGA Tool Integration
Estimates of both the latency and the hardware utilization of the operations
that make up the data flow pipeline are used extensively during the hardware
generation process. In particular, the estimated latency is used during the scheduling
phase to determine which operations can be packed onto the same pipeline stage
without exceeding the target propagation delay, as exceeding it would reduce the clock speed.
Although the estimation process takes account of the effects of the four possible input
configurations, its accuracy is severely limited due to the following factors:-
Logic cell packing: The majority of the operations that will be scheduled onto the
pipeline have two inputs, whereas the logic cells present in most FPGAs have
four inputs. The tools provided by FPGA vendors automatically combine
operations and pack them into the available logic cells. Exactly how this
is accomplished has a considerable impact on both latency and hardware
utilization.
Hardware resource type used: Some structures, like memories or multipliers,
can be implemented using either general purpose logic cells or dedicated
hardware blocks present in the FPGA. The type of resource to be used is
determined by the FPGA tools, and can have a significant effect on both latency
and resource usage.
Routing delays: The position of hardware blocks relative to that of connected
blocks within the FPGA is the major factor that influences routing delays,
and as a consequence affects the overall delay between register stages.
In the current system, the hardware conversion tools are completely separate from
the FPGA vendor's synthesis and PAR tools. By closely integrating these two tool
chains, the limiting factors listed above may be eliminated, producing an optimized
pipeline layout. In addition, the PAR tools could be made to relay information both
about the routing congestion and also about any unmet timing constraints. This
would enable the hardware generation tools to insert additional registers and thus
reorder the pipeline in an iterative process to produce the optimal configuration.
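The way in which latency estimates drive the packing of operations onto stages can
be sketched as follows. The routine is a simplified assumption (it handles only a
linear chain of dependent operations, and the name pack_chain is hypothetical), but
it shows why an inaccurate estimate matters: an under-estimate packs too many
operations into a stage and lowers the achievable clock speed, whereas an
over-estimate inserts registers that are not needed.

```c
#include <stddef.h>

/* Pack a linear chain of operations into pipeline stages so that the
 * summed delay estimate of each stage stays within target_ps.
 * An operation whose own delay exceeds the target occupies a stage by
 * itself. stage_out[i] receives the stage of operation i.
 * Returns the number of stages used. */
size_t pack_chain(const unsigned *delay_ps, size_t n_ops,
                  unsigned target_ps, size_t *stage_out)
{
    size_t stage = 0;
    unsigned used = 0;

    for (size_t i = 0; i < n_ops; i++) {
        if (used > 0 && used + delay_ps[i] > target_ps) {
            stage++;            /* insert an inter-stage register */
            used = 0;
        }
        stage_out[i] = stage;
        used += delay_ps[i];
    }
    return n_ops ? stage + 1 : 0;
}
```

In an integrated tool flow, the delay_ps values would be replaced by post-PAR
figures after each iteration and the packing repeated until the timing constraints
are met.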
7.1.5 Hardware Software Integration 
The instructions that trigger execution in the RC area are currently placed at the 
end of the program. As described in section 4.7.2, this approach avoids the need to 
relink the program in the event that the trigger instructions are larger than the loop 
that they replace. However, if the RC pipeline is triggered repeatedly, the additional 
jump instructions that are required can introduce a significant overhead. In cases 
where the number of trigger instructions is less than the number of instructions that 
the RC area replaces, it is possible to embed the trigger instructions into the body 
of the program without relinking. This proposed optimization reduces the overhead
associated with starting execution in the RC area; however, this is at the expense of
increasing the time required to enable or disable the hardware acceleration. Since
the number of times that the RC area will be triggered is considerably higher than 
the number of times that the hardware acceleration will be enabled/disabled, this 
trade off is of overall benefit. 
As mentioned in section 7.1.1, it is possible to automatically inline and thus
incorporate function calls into the hardware data flow pipeline. Although this allows the
conversion of sections of code that would otherwise be ineligible, it can lead to
inefficient usage of hardware resources in cases where the function is called conditionally
on a small number of loop iterations. Such cases can easily be recognized by further
analysis of the profile data that is used to identify candidate loops. These
inefficiencies can be avoided by replacing the function call with an interrupt trigger. This
enables the data flow pipeline within the RC area to be paused and execution of the
function to be carried out in the software domain. Once the function call completes,
the results would be passed to the data flow pipeline and execution resumed. This
would not only improve the efficiency of the hardware utilization, but would also
enable a loop to be converted to hardware that would otherwise be too large to fit
into the RC area.
7.2 Platforms 
7.2.1 MIPS 
The results in section 5.3 indicate that substantial increases in performance can be
obtained by performing the conversion to hardware. However, significant additional
speed improvements can be obtained if the memory bandwidth to the RC area is
increased. By simply pipelining the memory interface, it is possible to perform one
DMA operation per clock cycle which, in some test cases, results in a further tripling
of performance. Overall, this enables performance improvements in excess of two
orders of magnitude, whilst still consuming a level of memory bandwidth which is
in line with the capabilities of most modern embedded platforms.
Currently the FPGA exists in the same clock domain as the CPU core, forcing the 
two components to operate at the same clock speed. The maximum clock speed 
of the CPU is fixed at design time, whereas the maximum clock speed of the RC
area is determined at runtime when its configuration is generated. As a result, in
the majority of cases, the RC area is not operating at the optimal clock speed. 
Moving the FPGA to a separate clock domain would enable the RC area to run at 
its optimum clock speed, as determined by the hardware generation and synthesis 
tools that are described in section 7.1.4.4. 
7.2.2 XD1
As described in section 6.5.2, pipelines with high numbers of DMA operations
have lower maximum clock speeds due to the increased complexity of the cross
bar switches that connect the load/store operations to the memory interface. Since
the cross bars are inside the RC area, they can be easily modified to suit the specific
data flow pipeline to which they are connected. Extra registers could be added to
the cross bars for pipelines with high numbers of DMA operations, increasing the
maximum clock speed at the expense of increasing the memory latency. Clearly, this
is advantageous for highly parallel algorithms, as the additional latency would not
decrease the overall throughput. However, for some applications which contain high
numbers of data dependencies, the additional memory latency might result in an
overall reduction in performance. Since the level of parallelism is calculated during
the early stages of the hardware conversion process, the conversion tools would be
able to determine whether adding additional registers to the cross bars is likely to
increase performance. This information may then be used to decide whether or not
the optimization is performed.
Once software support for the FPGA interrupt line is made available by Cray, many
additional enhancements can be made to the hardware conversion process and
accompanying execution environment. One key improvement would be the introduction
of virtual memory. This could be effected by implementing an MMU inside
the FPGA to perform the virtual to physical address translation. Any DMA
transactions to memory pages that are not present in physical memory would pause
execution in the RC area and trigger an interrupt that would cause the main CPU
to re-load the page from disk. Once this had been accomplished, execution in the
RC area would resume. This would allow DMA operations to be performed on the
main SDRAM memory instead of on the local QDR-SRAMs. To reduce the effective
latency of the main SDRAM, the QDR-SRAMs would be used to implement
a cache. Since the RC area and the CPU would exist in the same virtual address
space, the software/hardware integration could be performed automatically, as
described in section 4.7.2. With the integration process fully automated, it would also
be feasible to implement multiple loops inside the RC area, as in the case of the
MIPS platform (see section 5.1).
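The translation step that such an MMU would perform can be sketched in software
as follows. The flat page table, the 4 KiB page size, the 32 bit addresses and all
names are assumptions chosen to keep the example short; they do not describe the
Opteron page table format or any Cray interface.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                       /* assumed 4 KiB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Hypothetical page table entry. */
typedef struct {
    uint32_t frame;    /* physical frame number                   */
    int      present;  /* non-zero if the page is in main memory  */
} pte;

/* Translate vaddr using a flat table of n_pages entries.
 * Returns 0 and writes *paddr on success. Returns -1 on a fault, at
 * which point the RC pipeline would pause and interrupt the CPU so
 * that the page can be loaded before execution resumes. */
int rc_translate(const pte *table, uint32_t n_pages,
                 uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;

    if (vpn >= n_pages || !table[vpn].present)
        return -1;

    *paddr = (table[vpn].frame << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    return 0;
}
```

In hardware the table lookup would normally be backed by a small translation
cache so that the common case completes within a single pipeline stage.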
7.2.3 Benchmark Algorithms 
Once the various platform limitations outlined in sections 5.3 and 6.5.1 have been
resolved, it would be possible to run a range of standard benchmarks and real
applications. The relevant sections of these pieces of software could then be automatically
converted to hardware. This would provide a much more detailed understanding of
the performance improvements that can be obtained in real world applications. In
addition to aiding the understanding and future development of the runtime
techniques outlined in this thesis, using standard benchmarks would also enable the
performance to be compared with other reconfigurable computing platforms and
tool sets.
7.3 An Ideal Reconfigurable Computing Platform 
The majority of current reconfigurable computing platforms (such as those
available from Cray [67] and SGI [31]) are based on existing, commonly used hardware
architectures, with FPGAs grafted onto the system. Although these systems are
capable of producing considerable performance improvements in excess of an order
of magnitude, their performance will always be limited by the system architecture.
To fully exploit the potential of the reconfigurable computing concept, the hardware
conversion tools must be combined with a hardware platform that has been
specifically designed with the reconfigurable computing environment in mind. The block
diagram of the ideal reconfigurable processor is shown in figure 7.2.
7.3.1 Processor Integration 
In many of the test algorithms evaluated, the RC area was triggered repeatedly (e.g. 
quick sort, FFT, Mandelbrot). As a result, the time taken to trigger execution can 
have a considerable impact on the overall performance of the system. To minimize 
the trigger time, the RC area can be integrated into the same die as the CPU core. 
This eliminates the high latency associated with slow, off chip busses. Additionally, 
if the CPU core were to be placed on the same die as the RC area, the RC area 
may be interfaced directly to the instruction pipeline via an APU port [22], further 
decreasing the time taken to transfer execution from the CPU core to the RC area.
Figure 7.2: Block diagram of the ideal reconfigurable processor
It is worth noting that the proposed system has only one CPU core, whereas many
modern processors are dual or multi core [19, 20, 21, 118]. Moving to a dual core
architecture more than doubles the die area, giving a maximum theoretical
performance improvement of 2x. The results from both the MIPS and XD1 platforms (see
sections 5.3 and 6.3 respectively) indicate that the die area used to implement the
second CPU core would produce a significantly larger increase in performance if it
were to be used to implement an RC area. In addition to increasing performance,
replacing a CPU core in a dual core system with an FPGA will also dramatically
reduce the power consumption and thus the heat generated, owing to the lower
operating frequencies of FPGAs. This is highly beneficial, as the thermal envelope,
and thus the operating temperature, is a major design consideration, with modern
systems generating up to 130 watts of heat [119].
Execution in the RC area is started by executing one or more trigger instructions, as
outlined in section 5.1.1. Once these trigger instructions have been issued, the CPU
core will stall until execution in the RC area has completed. Having such a large area
of silicon idle for even short periods of time will significantly reduce the efficiency of
the system. In addition, the CPU must be active in order to perform auxiliary tasks
such as virtual memory paging and the execution of sub-functions (described in section
7.1.5). In order to improve efficiency and also to accommodate the execution of vital
tasks, the CPU core in the ideal system implements SMT [18]. This enables the CPU
to simultaneously handle and issue instructions from multiple thread contexts. Since
the RC trigger instructions only stall the execution of a single thread, the CPU can
continue to execute instructions from other threads, thus enabling the RC area and
the CPU core to run in parallel.
7.3.2 Homogeneous RC Area
The following is a list of ideas and concepts that should be incorporated into the 
RC area of an ideal reconfigurable computing platform:-
• Partial reconfigurability 
• Homogeneous structure 
• Configuration controller 
• Specialized hardware
• Design for PAR 
• Clock domains 
7.3.2.1 Partial Reconfigurability 
The FPGA resources used to implement most software loops in hardware are typically
significantly lower than the resources available (in all the results thus far
obtained, the FPGA utilization did not rise above 20% for any of the test algorithms).
To obtain the highest possible improvement in performance, the RC area
can be used to implement several sections of code at the same time. Due to the dynamic
nature of computing environments, the specific hardware blocks that the RC
area will be required to implement change over time. The reconfiguration required
by this involves the RC area being idle. To minimize the associated performance
impact, the RC area can be partially reconfigured so that one section of the hardware
is actively processing data, whilst another area is being configured for the next
task.
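The benefit of overlapping configuration with execution can be illustrated with a
small timing model. This is only a sketch: it assumes exactly two reconfigurable
regions, hardware blocks that run strictly one after another, and made-up
configuration and execution times. The function name and the numbers are
illustrative and are not taken from the thesis.

```python
def total_time(tasks, partial):
    """Return the time to run `tasks` in order.

    tasks   -- list of (config_time, exec_time) pairs, arbitrary time units
    partial -- True if the next block can be configured in a second region
               while the current block executes (assumed two-region model)
    """
    if not partial:
        # Full reconfiguration: the RC area is idle during every configuration.
        return sum(config + execute for config, execute in tasks)

    elapsed = tasks[0][0]  # the very first configuration cannot be hidden
    for i, (_, execute) in enumerate(tasks):
        next_config = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        # Execution of block i and configuration of block i+1 run side by side.
        elapsed += max(execute, next_config)
    return elapsed


tasks = [(4, 10), (6, 3), (2, 8)]        # hypothetical (configure, execute) times
print(total_time(tasks, partial=False))  # 33
print(total_time(tasks, partial=True))   # 25
```

In this model the saving disappears when configuration times dominate execution
times, which is one reason the speed of the configuration path also matters.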
7.3.2.2 Homogeneous Structure 
As a direct result of the runtime scheduling and reconfiguration (described in section
7.3.2.1), it is not possible to determine exactly where, in the RC area, a hardware
block will be placed during the synthesis and PAR stages. To avoid the need to
re-run the PAR process every time the RC area is reconfigured, the RC area must
have a homogeneous structure that allows location independent hardware blocks to
be synthesized.
7.3.2.3 Configuration Controller 
To decrease the time taken to reconfigure the RC area, the configuration controller
is connected to a cache instead of directly to the MMU. As such, pre-fetching and
caching of configuration data will greatly speed up the reconfiguration process. Since
the configuration data for even small hardware blocks can be larger than the level 1
instruction cache, "cache thrashing" is likely to occur (large amounts of useful cache
data are flushed out to make room for large, infrequently used data sets). To prevent
this "cache thrashing", the configuration controller is connected directly to the much
larger level 2 cache.
7.3.2.4 Specialized Hardware 
Early FPGAs contained only logic cells together with their associated routing matrices.
However, modern FPGAs also contain block RAMs and hardware multipliers
[114, 115]. These additional, specialized hardware resources can significantly increase
the performance of hardware implemented on these devices. Due to the
complex nature of commonly used floating point operations, FPGAs designed for
reconfigurable computing would benefit from additional specialized hardware primitives.
As mentioned in section 7.3.2.2, the RC area must have a regular, homogeneous
structure to allow for the relocation of hardware blocks. Although the addition of
specialized hardware resources, like multipliers, disrupts the homogeneous structure
locally, if they are evenly distributed throughout the RC area a regular structure
can be maintained at the global level. Accordingly, it is still possible to create relocatable
hardware blocks whilst maintaining the advantages of specialized hardware.
In order to maintain the appearance of homogeneity, FPGA resources could only be
allocated at the granularity of the smallest repeating hardware unit. The hardware
utilization within a block will decrease slightly owing to the increased size of the
repeating units, caused by the addition of memory blocks, multipliers etc. This minor
decrease in efficiency is more than outweighed by the increase in performance
produced by the presence of specialized hardware primitives.
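The effect of allocating at the granularity of the repeating unit can be shown with
a few lines of arithmetic. The sketch below assumes that a block's requirement and
the size of a repeating unit can both be expressed as a single count of logic cells,
which is a simplification. The function name and all numbers are hypothetical.

```python
def allocate(cells_needed, unit_cells):
    """Round a block's requirement up to whole repeating units.

    Returns (units, cells_allocated, utilization within the allocation).
    """
    units = -(-cells_needed // unit_cells)  # ceiling division on integers
    allocated = units * unit_cells
    return units, allocated, cells_needed / allocated


# Same block, two hypothetical unit sizes: a larger repeating unit (for example
# one that also contains a multiplier) wastes slightly more of the allocation.
print(allocate(1000, 64))  # (16, 1024, 0.9765625)
print(allocate(1000, 96))
```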
7.3.2.5 Design For Place And Route 
Modern FPGAs are optimized to provide the most efficient usage of die area at 
the expense of ever increasing synthesis and PAR times. Although this trade-off
is ideal for situations where the hardware is compiled once and used to configure
many devices, it is not well suited to environments, such as runtime reconfigurable
computing, where the hardware compilation process occurs repeatedly. By changing
the structure of the routing matrix in the FPGA, it is possible to decrease the amount
of memory used during the PAR process by a factor of 18, whilst at the same time
reducing the time taken by a factor of 3 [44]. This kind of optimization will have a
significant impact on the system, given that the PAR dominates all the other stages 
of the conversion process, in terms of both execution time and memory usage. 
7.3.2.6 Clock Domains 
Although not shown in figure 7.2, the RC area exists in a separate clock domain
from the CPU core and all other blocks in the processor. This is primarily due to
the fact that the reconfigurable nature of the RC area prevents it from running at
the higher clock speeds used by the CPU. However, the presence of a clock domain
boundary between the RC area and the rest of the system allows its clock speed
to be set at the optimal value for the specific hardware being implemented. As the
RC area may be used to simultaneously implement multiple hardware blocks, the
RC area also contains multiple clock distribution trees. This enables each block to
be clocked at its optimum frequency, independently of the surrounding blocks. The
presence of multiple, independent clock trees is a common feature of many modern
FPGAs [114, 115].
7.3.3 Memory Sub-System 
As shown in figure 7.2, the DMA channel from the RC area connects to the memory
hierarchy at the same point as the CPU core (i.e. the level 1 data cache). Not only
does this mean that the RC area will benefit from both the level 1 and the level
2 caches, but it also results in the system being naturally cache coherent. There
is therefore no overhead associated with the snoop traffic that would otherwise be
required to keep the data in parallel caches synchronized. The configuration data for
the RC area is read directly from the unified level 2 cache, instead of from either of
the level 1 caches. As described in section 7.3.2.3, this eliminates the possibility of
"cache thrashing". However, as a result, newly generated configuration data will not
be visible to the configuration controller unless it has first been flushed back to main
memory from the level 1 data cache (this is similar to the cache coherency problem
faced by applications using self modifying code). One possible solution would be
to implement additional snoop hardware to ensure cache coherency. However, the
additional hardware might result in a lower overall clock speed for the system.
Since only a small section of software, which runs infrequently, handles the hardware
generation and configuration, the cache coherency can be easily and efficiently
managed in the software domain.
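The software-managed coherency step described above can be illustrated with a toy
model. This is a minimal sketch under strong assumptions: the level 1 data cache is
modelled as a dictionary of dirty words, the level 2 cache as another dictionary,
and the configuration controller as a function that reads level 2 only. None of
these names correspond to real hardware or to an interface defined in the thesis.

```python
class L1WriteBackCache:
    """Toy write-back cache: written words stay here until explicitly flushed."""

    def __init__(self, l2):
        self.l2 = l2      # shared level 2 store, modelled as a dict
        self.dirty = {}   # address -> value not yet written back

    def write(self, addr, value):
        self.dirty[addr] = value

    def flush(self, addrs):
        # Software-managed coherency: write the listed words back to level 2.
        for addr in addrs:
            if addr in self.dirty:
                self.l2[addr] = self.dirty.pop(addr)


def load_configuration(l2, addrs):
    """The configuration controller reads level 2 only; it never sees level 1."""
    return [l2.get(addr) for addr in addrs]


l2 = {}
l1 = L1WriteBackCache(l2)
bitstream = {0x100: 0xAB, 0x104: 0xCD}  # hypothetical configuration words
for addr, word in bitstream.items():
    l1.write(addr, word)

stale = load_configuration(l2, bitstream)  # before the flush: nothing visible
l1.flush(bitstream)
fresh = load_configuration(l2, bitstream)  # after the flush: data visible
```

Because only the hardware generation software writes configuration data, a single
explicit flush at the end of generation is enough in this model.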
Since the RC area is connected to the memory hierarchy before the MMU, any hardware
inside the RC area exists in a virtual address space instead of in the physical
address space. This eliminates many of the integration and hardware conversion issues
that are faced by existing reconfigurable computing platforms such as the Cray
XD1 [67].
The computational rate of the RC area is considerably higher than that of the CPU 
core, resulting in memory bandwidth requirements that are significantly higher than 
those of CPUs [17]. This is demonstrated by the fact that many of the algorithms
tested on both the MIPS and XD1 platforms (see sections 5.3 and 6.3 respectively)
became bandwidth limited after they were converted to hardware. To provide the RC
area with the maximum possible bandwidth, the RC area is connected, through the 
caches, to the latest dual channel DDR2 memory architecture [120]. Additionally, to 
decrease memory latency and to further increase performance, the memory controller 
is integrated directly into the processor. This approach has already been successfully 
implemented in the AMD Athlon and Opteron lines of processors and Intel plan to 
incorporate it in future products. 
7.3.4 Code Profiling 
A hardware code profiler, similar to the one described in section 3.1.1.1, monitors 
the software execution. Because the profiling operations are performed by dedicated 
hardware, there is little or no performance impact. Since the profiler contains its
own small cache, the bandwidth consumed by the profile data is minimal. As the 
profile data will never be read back by the profiler block, connecting the block to 
one of the data caches would not result in an increase in performance. In fact, due 
to the "read before write" policy of these caches, the overall performance would 
be reduced. In order to prevent this, the code profiler is connected directly to the 
MMU. The metrics gathered by the profiler are used by the hardware conversion 
software to determine which sections of code consume the most CPU time, and 
therefore to determine which sections should be converted to hardware. 
The code profiler monitors the execution of all the code running on the system, and
as a result the hardware conversion software itself will be profiled. This, in turn,
will result in the hardware generation software converting computationally intensive
sections of itself to hardware. This will further speed up the generation of hardware
for all other applications.
7.3.5 Hardware Scheduling 
As mentioned previously in section 7.3.2.1, the RC area is capable of simultaneously 
implementing multiple sections of code in hardware. This, combined with the fact 
that the hardware required will change as the execution of the application progresses, 
means that a runtime scheduling algorithm is required to load the hardware blocks 
into the RC area at the correct time. This process is described in section 1.2.3
[52, 53].
Due to the limited size of the RC area, it is not possible to load all the hardware 
blocks generated by the conversion software. As it is not always possible to predict 
which sections of code will be executed, it may be that the corresponding hardware 
block is not present in the RC area, when it is required. If this occurs, the section will 
be executed in the software domain by the SMT CPU core. A dedicated hardware
block called the "section monitor" (see figure 7.2) monitors the code being executed
by the CPU core. If the code being executed matches one of the sections that the
monitor is configured to detect, an interrupt will be raised. This interrupt then
triggers the hardware scheduling software, which will then load the corresponding
section of hardware into the RC area. Once this has been completed, the scheduler
will modify the application software to use the newly instantiated hardware. The
section monitor may also be configured to only raise an interrupt if the number of
times a section of code is used in a set time period exceeds a threshold value. This
allows the system to detect changes in the amount of execution resource used by
different sections of code, without the generation of large numbers of time consuming
interrupts.
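The thresholded behaviour of the section monitor can be modelled in software as
follows. This is a behavioural sketch only, not a description of the hardware block:
the class name, the sliding-window counting scheme and the decision to clear the
counter after an interrupt are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class SectionMonitor:
    """Raise an 'interrupt' when a watched section runs more than `threshold`
    times within `window` time units (hypothetical software model)."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.watched = set()
        self.hits = defaultdict(deque)  # section -> timestamps of recent runs

    def watch(self, section):
        self.watched.add(section)

    def observe(self, section, now):
        """Record one execution; return True if an interrupt would be raised."""
        if section not in self.watched:
            return False
        recent = self.hits[section]
        recent.append(now)
        # Discard executions that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) > self.threshold:
            recent.clear()  # re-arm so one hot section does not flood the CPU
            return True
        return False


monitor = SectionMonitor(threshold=3, window=10)
monitor.watch("loop_a")
events = [monitor.observe("loop_a", t) for t in (0, 1, 2, 3, 20, 21)]
print(events)  # [False, False, False, True, False, False]
```

Setting the threshold to zero in this model reproduces the simpler behaviour in
which every execution of a watched section raises an interrupt.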
The scheduling software supported by the two hardware monitoring blocks would 
gather information about code usage and RC area congestion. This information may
then be used by the hardware generation tools to inform the generation of additional
variants of the hardware blocks. For example, if the RC area can only accommodate 
2 out of 3 blocks that are simultaneously required, the system could automatically 
adapt and generate new versions of the blocks with lower unroll factors, thus allowing 
all 3 blocks to be concurrently implemented in the RC area. 
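The adaptation described above could take many forms. One possible greedy policy is
sketched below, under the assumption that the area of a block grows linearly with
its unroll factor. The area model, the halving step and every name and number in the
example are hypothetical and are not taken from the tool flow described in this
thesis.

```python
def fit_blocks(blocks, capacity):
    """Lower unroll factors until all blocks fit in the RC area.

    blocks   -- dict: name -> (base_area, area_per_unroll, initial_unroll)
    capacity -- total RC area available, in the same (arbitrary) units
    Returns a dict of unroll factors, or None if the blocks cannot fit.
    """
    unroll = {name: spec[2] for name, spec in blocks.items()}

    def area(name):
        base, per_unroll, _ = blocks[name]
        return base + per_unroll * unroll[name]

    while sum(area(name) for name in blocks) > capacity:
        shrinkable = [name for name in blocks if unroll[name] > 1]
        if not shrinkable:
            return None  # even fully rolled loops do not fit together
        # Greedy choice: halve the unroll factor of the largest block.
        largest = max(shrinkable, key=area)
        unroll[largest] //= 2
    return unroll


# Three identical blocks of 500 units each; only two fit in 1000 units.
blocks = {"a": (100, 50, 8), "b": (100, 50, 8), "c": (100, 50, 8)}
print(fit_blocks(blocks, capacity=1000))  # {'a': 4, 'b': 4, 'c': 4}
```

A real scheduler would also weigh the performance lost by each reduction, for
example by using the usage counts gathered by the monitoring hardware.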
7.4 Heterogeneous Computing 
Many different types of computational resource have been developed since the birth 
of the "Von Neumann architecture" in 1945 [1]. A few existing architectures, to-
gether with their advantages and disadvantages, are listed below:-
CISC/RISC CPU Conventional CISC and RISC CPUs are by far the most common
form of computational resource. This is due to their flexibility, scalability
and backwards compatibility. Their basic architecture is based on an instruction
pipeline that executes a series of sequential instructions, the generic nature
of which enables this type of processor to perform an almost limitless number
of different tasks. However, this flexibility comes at the cost of efficiency, for
this type of processor has a very low computational rate compared to the die
area used. Common examples of this architecture include the AMD Athlon
[121], the Intel Pentium 4 [122] and the MIPS [102] range of processors.
VLIW Very Long Instruction Word (VLIW) processors (e.g. the TriMedia [85]
and Efficeon [123]) have a lot in common with CISC and RISC CPUs, as they
are also based on an instruction pipeline. However, they differ in that the
instructions are grouped together to form bundles, where the instructions in
each bundle are independent and are executed in parallel with each other.
This explicit parallelism leads to greater performance per unit die area, due
to the absence of complex dependency checking and scheduling hardware. The
use of VLIW cores has been limited to embedded applications where the code
is compiled for a specific processor, due to the inherent lack of backwards
compatibility and scalability provided by this architecture.
FPGA In contrast to CISC, RISC and VLIW processors, FPGAs do not contain an
instruction pipeline. Instead they contain an array of logic elements that can
be connected together to form a great variety of different circuits. Although
they are capable of extreme computational workloads, a very high level of data
parallelism is required to enable this. This limitation is further compounded
by the fact that FPGAs inherently have considerably lower clock speeds than
those exhibited by conventional processors. In addition, the simple nature
of the logic cells used in FPGAs can lead to large inefficiencies when implementing
such complex functions as floating point arithmetic. The low level
of abstraction between the configuration data and the device itself leads to a
total lack of backwards compatibility.
Array processor Array processors (e.g. ClearSpeed CSX600 [124]) are similar to 
FPGAs, as they also consist of regular arrays of elements, connected together 
by a routing matrix. Array processors differ from FPGAs, in that each element 
is capable of performing complex mathematical functions, instead of just the 
simple 4 input logic functions of the FPGA. Although the architecture of an 
Array processor leads to highly efficient floating point implementations, it is 
not as efficient in performing bitwise and integer operations as the FPGA. 
Like FPGAs, array based processors can also suffer from a lack of backwards 
compatibility. 
All the architectures currently available have disadvantages, and many exist due to
their superior performance in certain specialized areas. It is, however, possible to
create a system that will provide the optimal performance for every type of application
by combining CISC, VLIW and array processors, together with FPGAs, in
a single platform. Since CISC processors have high levels of abstraction, and can
thus maintain backwards compatibility, they are the best choice for the primary processor
in the system, and therefore the architecture that all applications would be
written for. Runtime monitoring software would analyze computationally intensive
sections of code to determine which type of computational resource is best suited
to performing each task. The corresponding dynamic translation layer could then
convert the code to the appropriate assembly language or hardware configuration.
This approach would provide the performance advantages of a specialized system,
whilst still maintaining compatibility with existing software applications.
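The resource selection step can be pictured as a simple rule-based classifier built
from the strengths and weaknesses listed above. The sketch below is an illustration
only. The three input metrics, the threshold values and the function name are all
assumptions, and a practical system would derive its decisions from measured profile
data rather than fixed rules.

```python
HIGH_PARALLELISM = 64  # assumed: independent data elements needed for an array
WIDE_BUNDLE = 4        # assumed: independent instructions needed to fill a bundle


def choose_resource(data_parallelism, float_fraction, independent_ops):
    """Pick a computational resource for one section of code.

    data_parallelism -- independent data elements that can be processed at once
    float_fraction   -- fraction of operations that are floating point (0 to 1)
    independent_ops  -- instructions per cycle that have no mutual dependencies
    """
    if data_parallelism >= HIGH_PARALLELISM:
        # Array processors favour floating point; FPGAs favour bitwise/integer.
        return "array processor" if float_fraction >= 0.5 else "FPGA"
    if independent_ops >= WIDE_BUNDLE:
        return "VLIW"
    return "CPU"  # the flexible default that everything is written for


print(choose_resource(1024, 0.9, 2))  # array processor
print(choose_resource(1024, 0.0, 2))  # FPGA
print(choose_resource(4, 0.2, 6))     # VLIW
print(choose_resource(1, 0.1, 1))     # CPU
```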
For a heterogeneous computing system to provide significant increases in performance,
a viable hardware platform is required. Some multi-core, multi-architecture
devices have recently been introduced (e.g. the Cell processor [118]). However, because
the number and type of the processing elements in these devices is fixed, this
category of system will not be optimal for every class of application. One possible
solution is to base the hardware platform on a blade architecture with a very high
bandwidth between the compute blades (e.g. Cray XD1 [67]). Blades with different
types of computational resources could then be created. This approach would allow
system administrators to optimize systems for the specific needs of their applications,
by altering the proportions of the various computational resources present in
the system. Blade architectures also provide an open, flexible platform to ease the
introduction of new types of processor and/or new reconfigurable hardware.
Software to software translation is commonplace (e.g. Java [49] and Transmeta [50]),
and software to hardware translation is addressed throughout this thesis. This, combined
with the fact that the hardware platforms required either already exist or are
relatively easy to develop, indicates that the concept of a heterogeneous computing
environment is achievable.
7.5 Future Research 
In addition to implementing the additional optimizations and concepts outlined in 
this chapter, the following should be investigated during any future work:-
Cache configuration A lot of work has been conducted to determine the optimal
cache size and organization for CPUs with various types of memory sub-system.
Although a reconfigurable computing system will be used to execute
the same algorithms as conventional CPUs, the memory bandwidth RC systems
require is significantly higher due to the increase in computation rate.
In addition, any bandwidth required for instruction fetches is eliminated. A
detailed analysis of the effects of different cache architectures and hierarchies
should be conducted.
Clock speed to bandwidth ratio Because RC systems consume high amounts
of memory bandwidth, an investigation into the relationship between RC area
clock speed and required bandwidth should be performed. Although some
work in this field has already been performed [17], a good understanding of the
levels of bandwidth required is lacking, particularly in the context of a large
scale RC system with multiple high performance CPUs (e.g. the Cray XD1).
High level optimizations The possible use of high level optimization to restructure
algorithms should be investigated. Typically code is optimized for software
execution. Unfortunately, this does not provide the best solution for reconfigurable
computing platforms. In the future it might be possible for compilers
to include some form of meta data in the final program binary that could be
processed by the hardware conversion tools at runtime. This could provide
the additional information required to make many high level optimizations
possible.
Virtual memory Since the RC area is usually connected directly to the memory
sub-system, it exists in the physical address space. However, the instructions
that the RC hardware will be based on operate in a virtual address space. One
solution to this problem is to implement a full MMU between the RC area
and the memory sub-system. Since the core code loops that are converted to
hardware usually only access a small sub-set of the data present within a large
application, it may be possible to create a more effective memory mapping
system.
7.6 Summary 
By extracting parallelism in an additional orthogonal direction, and combining this 
with the new optimisation techniques described in section 7.1, it is possible to con-
siderably increase the performance achieved by the hardware generation tools de-
scribed in chapter 4. Several of these optimisations will also reduce the hardware 
resources required, and consequently allow a greater proportion of an application to 
be converted to hardware, thus further increasing performance. 
A detailed design of an idealised computing platform is presented, which allows the 
hardware conversion tools to fully exploit the potential offered by the reconfigurable 
computing concept. In addition, a heterogeneous computing system is proposed, 
which combines reconfigurable hardware with a variety of other computational re-
sources. This would enable runtime translation of code to the most appropriate 
computational resource, producing dramatic increases in performance over a wide
range of applications, whilst minimizing power consumption and maintaining backwards
compatibility.
Chapter 8 
Conclusion 
Over the past forty years, the performance of conventional computing platforms has 
increased exponentially, in line with "Moore's Law" [2]. Although this has resulted 
in enormous levels of computing power being available, some applications such as 
computer games, simulations, multimedia programs and others, require even greater 
levels of performance. In addition to this need for increased performance, power 
consumption is becoming a critical factor in the design of modern microprocessors. 
At up to 130 Watts per processor [119], not only are the power requirements for
supercomputers and large clusters a major cost issue, but the heat generated
causes thermal design problems for all modern computing systems.
The use of dedicated hardware (e.g. PC graphics cards [5, 6]) can dramatically 
increase performance and reduce power consumption, compared to a conventional 
software solution. However, dedicated hardware is only capable of performing the 
specific task for which it was designed, leaving it idle for significant amounts of time 
when not required. This has limited its usage to frequently performed computational
tasks. The use of reconfigurable hardware (e.g. FPGAs) has long been seen as a
potential solution to this problem, as it can be used to provide the increased performance
and also the reduced power consumption that is associated with dedicated
hardware. In addition, the reconfigurable nature provides the flexibility associated
with software solutions. Historically there have been two major barriers to the
widespread adoption of this technology:-
Hardware platforms The high performance of hardware based solutions leads
to high bandwidth requirements, which require the reconfigurable hardware
to be closely integrated with both the CPU and the rest of the system.
However, in the past, reconfigurable hardware has usually been added to existing
platforms as an afterthought. Consequently the RC area was usually
connected to one of the peripheral buses, which resulted in low bandwidth and
poor performance.
Abstraction and backwards compatibility Due to its low level nature, recon-
figurable hardware (unlike CPUs) has a very low level of abstraction between 
the configuration data and the underlying hardware. Consequently it is not 
possible, at the present time, to produce scalable systems that allow backwards 
compatibility. 
In this thesis it has been shown that by using JIT techniques, it is possible to 
convert computationally intensive sections of software to hardware at runtime, and 
thereby fully abstract the RC area from the application. Not only does this allow 
the performance of the system to be scaled, as new, faster FPGAs become available, 
but it also enables applications that were not written for reconfigurable computing, 
to be accelerated by this technology. 
To evaluate the performance improvements that could, potentially, be provided by 
runtime JIT techniques, a complete tool flow has been developed and described, 
which together with various test algorithms, was applied to two hardware platforms:-
MIPS platform The MIPS platform contains a small, embedded class CPU and
has an FPGA connected directly to the instruction pipeline. As the CPU
core and the FPGA are in the same clock domain, they run at the same clock
speed of 16 MHz. This single clock domain approach significantly reduces the
complexity of the hardware.
Cray XD1 platform The Cray XD1 is one of a new generation of supercomputers
that integrate FPGAs into the core of the system, giving the FPGA high
bandwidth and low latency access to the main memory. The CPU in the test
system was a 2.2 GHz AMD Opteron, which was paired with a Virtex II Pro
50 FPGA, having a maximum clock speed of 200 MHz.
The automatic hardware acceleration of programs on the MIPS test platform produced
performance increases ranging from 20% to a factor of 54.9, with many of the
test algorithms being over 10x faster than in the original software case. The results
obtained from the Cray XD1 platform ranged from a 7x decrease in performance
to an 18x increase in performance. The reduction in performance in some of the
test cases was due to the low levels of parallelism in the original software, combined
with the fact that the FPGA ran at clock speeds up to 15 times slower than
the CPU. It is worth noting that in the majority of cases, the performance of the
FPGA was limited by the available bandwidth, and not by the computational speed
of the FPGA itself. Although significant increases in performance, in excess of an
order of magnitude, have been obtained, it is clear from the results presented that
reconfigurable computing cannot accelerate every test case. However, it is possible
for the JIT hardware generation software to easily identify and screen out non-ideal
kernel loops, and thus prevent reductions in performance.
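The screening of non-ideal kernel loops can be illustrated with a simple
back-of-envelope estimate. The sketch below is not the method used by the tool flow
described in this thesis. It assumes that the hardware throughput is the lower of a
compute bound and a memory bandwidth bound, and every parameter name and every
number other than the 2.2 GHz and 200 MHz clock speeds is hypothetical.

```python
def estimated_speedup(cpu_hz, cpu_ipc, ops_per_iter,
                      fpga_hz, iters_per_cycle,
                      bytes_per_iter, bandwidth):
    """Estimate hardware speedup for one kernel loop (iterations per second)."""
    software_rate = cpu_hz * cpu_ipc / ops_per_iter  # loop rate on the CPU
    compute_rate = fpga_hz * iters_per_cycle         # loop rate of the pipeline
    bandwidth_rate = bandwidth / bytes_per_iter      # loop rate memory allows
    return min(compute_rate, bandwidth_rate) / software_rate


def should_convert(speedup, margin=1.0):
    """Screen out loops that are not expected to run faster in hardware."""
    return speedup > margin


# A data-parallel loop: four iterations per FPGA cycle, but bandwidth limited.
parallel = estimated_speedup(2.2e9, 1.0, 20, 200e6, 4, 8, 3.2e9)
# A mostly serial loop: the slower FPGA clock cannot be hidden by parallelism.
serial = estimated_speedup(2.2e9, 1.0, 2, 200e6, 0.5, 8, 3.2e9)

print(should_convert(parallel), should_convert(serial))  # True False
```

In the first example the bandwidth bound, not the compute bound, sets the result,
which mirrors the observation above that most test cases were bandwidth limited.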
Although significant increases in performance are produced by the JIT tool flow,
higher levels of performance can usually be obtained by handwriting the hardware
in an HDL (e.g. VHDL, SystemC [40]). However, this requires a time consuming
and expensive process that needs to be performed for every target platform.
In comparison, the JIT based automatic conversion system outlined in this thesis
provides the advantages of scalability and total abstraction. As a result, the die
area being utilized in the latest generation of CPUs to instantiate multiple cores
[119, 118] would produce much higher performance gains if it were used instead to
instantiate an RC area. This solution would also help address the power consumption
and thermal issues that are currently being faced by the industry. The proposed
JIT based tool flow does present some additional problems and drawbacks:-
Verification A runtime tool flow that dynamically changes the nature of an executable
presents some verification problems. This is because it is almost
impossible for the developer to verify the correctness of the system as a whole.
However, this level of verification is usually only required for life support, aviation,
and other critical applications. As such, the vast majority of applications
can still benefit from the increased performance provided by the use of JIT
techniques. In these cases the application developer must verify that their application
functions correctly on the platform without hardware acceleration,
and the JIT tool flow vendor must verify the conversion process itself. It can
therefore be assumed that in the majority of cases any application will still
function correctly once it has been hardware accelerated. The concept of relying
on abstraction in the verification model is commonplace (e.g. software
vendors do not verify their applications against every possible combination of
PC hardware components). It is worth noting that this approach is also used
in the verification of many other JIT based tool flows (e.g. Java [49], FX32
[48], Crusoe [50]).
Time consuming tool flow Although the time required to perform the place and
route operations, and therefore the overall conversion time, is quite high, this
will be reduced by the introduction of FPGAs designed for PAR (see section 7.3.2.5).
In most cases the high conversion time does not pose a significant problem, as the
majority of programs fall into one of the following two categories: applications
that are run repeatedly (e.g. word processing), where the conversion can be
performed during system idle time and the results stored for future runs, or
programs that have very long run-times (e.g. complex simulations), where the
time required to perform the conversion becomes insignificant.
Debugging Many threading related bugs are sensitive to the timing of the execution
process. As such, these bugs could be exposed and cause crashes on
platforms that implement dynamic hardware acceleration. As this issue is
common to any increase in system performance (e.g. increasing CPU clock
speeds), care should be taken when writing multi-threaded applications for
any platform. In addition, software development tools exist that can help
identify these problems before the software is released.
Application specific performance increases The use of reconfigurable hardware
can produce significant increases in performance. However, this increase
is very dependent on the application being accelerated, and in some cases the
performance might not increase at all. This drawback is common to virtually
every method of increasing performance (e.g. increasing cache sizes, number of
execution units, memory bandwidth, etc).
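The argument made under "Time consuming tool flow", that the conversion time becomes
insignificant for programs that are run repeatedly or for a long time, can be
expressed as a break-even calculation. The sketch below assumes that every run
spends the same time in the accelerated code and that the conversion is performed
once and its result reused. The function name and the figures are hypothetical.

```python
import math


def break_even_runs(conversion_time, software_time, speedup):
    """Number of runs after which a one-off conversion has paid for itself.

    conversion_time -- time spent generating the hardware (seconds)
    software_time   -- time one run spends in the converted code in software
    speedup         -- factor by which hardware is faster for that code
    Returns None when the hardware version is not faster than software.
    """
    if speedup <= 1:
        return None
    saved_per_run = software_time * (1 - 1 / speedup)
    return math.ceil(conversion_time / saved_per_run)


print(break_even_runs(600, 60, 10))     # 12 runs of a short program
print(break_even_runs(600, 36000, 10))  # 1 run of a long simulation
```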
Many different types of computational resource currently exist (e.g. RC areas [115, 
114], floating point array processors [124], conventional superscalar CPUs [122, 121]), 
each with their own inherent strengths and weaknesses. In the future, it should be 
possible to create a system that harnesses the combined power of all of these types of 
computation resource, whilst still providing scalability, abstraction and backwards 
compatibility. This could be achieved by combining existing software to software 
conversion techniques [50, 48, 49], with the software to hardware translation outlined 
in this thesis. The conversion software would locate intensive sections of code, 
identify the most suitable computational resource, and automatically migrate the 
code accordingly. This ability to dynamically translate and move sections of code 
from one resource to another, would produce a dramatic increase in performance 
and would also significantly reduce power consumption and operating costs. 
Bibliography 
[1] J. von Neumann, "First draft of a report on the EDVAC," IEEE Annals of
the History of Computing, vol. 15, no. 4, pp. 27-75, 1993.
[2] G. E. Moore, "Cramming more components onto integrated
circuits," Electronics, vol. 38, no. 8, Apr 1965. [Online]. Available:
ftp://download.intel.com/research/silicon/moorespaper.pdf
[3] A. Peleg and U. Weiser, "MMX technology extension to the Intel
architecture," IEEE Micro, vol. 16, no. 4, Aug 1996. [Online]. Available:
http://www.eecs.lehigh.edu/~mschulte/ece401/papers/mmx.ps
[4] L. Gwennap, "AltiVec vectorizes PowerPC," Microprocessor
Report, vol. 12, no. 6, May 1998. [Online]. Available:
http://docencia.ac.upc.edu/ETSETB/SEGPAR/microprocessors/altivec%20(mpr).pdf
[5] (2005, May) GeForce 6 series. nVidia Corporation. [Online]. Available:
http://www.nvidia.com/page/geforce6.html
[6] (2005, May) Radeon X850 graphics technology. ATI Technologies Inc.
[Online]. Available: http://www.ati.com/products/radeonx850/index.html
[7] (2005, May) Nexperia PNX8550. Philips Semiconductors. [Online].
Available: http://www.semiconductors.philips.com/acrobat/literature/9397/75012469.pdf
[8] (2005, May) OMAP 2 architecture: OMAP2420 processor. Texas Instruments.
[Online]. Available: http://focus.ti.com/pdfs/wtbu/TLomap2420.pdf
[9] Z. Guo, W. Najjar, F. Vahid, and K. Vissers, "A quantitative analysis
of the speedup factors of FPGAs over processors," in FPGA '04:
Proceedings of the 2004 ACM/SIGDA 12th international symposium on
Field programmable gate arrays, 2004, pp. 162-170. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/fpga04_anal.pdf
[10] G. Lu, H. Singh, M.-H. Lee, N. Bagherzadeh, F. J. Kurdahi, and E. M. C.
Filho, "The MorphoSys parallel reconfigurable system," in Euro-Par '99:
Proceedings of the 5th International Euro-Par Conference on Parallel Processing,
1999, pp. 727-734.
[11] M. Sima, S. Cotofana, S. Vassiliadis, J. T. van Eijndhoven, and K. Vissers,
"MPEG-compliant entropy decoding on FPGA-augmented TriMedia/CPU64,"
in 10th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines, Apr 2002, pp. 261-270.
[12] J. R. Hauser and J. Wawrzynek, "Garp: a MIPS processor with a
reconfigurable coprocessor," in IEEE Symposium on FPGAs for Custom Computing
Machines, Apr 1997, pp. 12-21.
[13] M. J. Wirthlin, "A dynamic instruction set computer," in IEEE Symposium
on FPGA's for Custom Computing Machines. IEEE Computer Society, 1995,
pp. 99-107.
[14] C. Ebeling, C. Fisher, G. Xing, M. Shen, and H. Liu, "Implementing an OFDM
receiver on the RaPiD reconfigurable architecture," IEEE Transactions On
Computers, vol. 53, no. 11, pp. 1436-1448, 2004.
[15] J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim,
S. Devabhaktuni, and A. Agarwal, "The raw benchmark suite: computation
structures for general purpose computing," in FCCM '97: Proceedings of
the 5th IEEE Symposium on FPGA-Based Custom Computing Machines.
Washington, DC, USA: IEEE Computer Society, 1997, p. 134. [Online].
Available: http://www.crhc.uiuc.edu/ mfrank/pubs/Babb-1997-FCCM.pdf
[16] S. Bono, M. Green, A. Stubblefield, A. Juels, A. Rubin, and
M. Szydlo, "Security analysis of a cryptographically-enabled," in 14th
USENIX Security Symposium, 2006, pp. 1-16. [Online]. Available:
https://www.usenix.org/events/sec05/tech/bono.html
[17] S. Derrien and S. Rajopadhye, "FCCMs and the memory wall," in IEEE 
Symposium on Field-Programmable Custom Computing Machines, Apr 2000, 
pp. 329-330. 
[18] J. L. Lo, J. S. Emer, H. M. Levy, R. L. Stamm, D. M. Tullsen, and S. J.
Eggers, "Converting thread-level parallelism to instruction-level parallelism
via simultaneous multithreading," ACM Transactions on Computer Systems,
vol. 15, no. 3, pp. 322-354, 1997.
[19] (2005, May) Intel dual-core processors. Intel Corporation. [Online]. Available:
http://www.intel.com/technology/computing/dual-core/?iid=search&:
[20] (2005, May) Introducing multi-core technology. Advanced Micro Devices, Inc.
[Online]. Available: http://multicore.amd.com/Technology/
[21] T. Takayanagi, J. L. Shin, B. Petrick, J. Y. Su, H. Levy, J. H. P. Son, N. Moon,
D. B. D, U. Nair, M. Singh, V. Mathur, and A. S. Leon, "A dual-core 64-bit
UltraSPARC microprocessor for dense server applications," IEEE Journal of
Solid-State Circuits, vol. 40, no. 1, Jan 2005.
[22] A. Ansari, P. Ryser, and D. Isaacs, "Accelerated system performance
with APU-enhanced processing," Xcell, no. 52, pp. 36-39, 2005. [Online].
Available: http://www.xilinx.com/publications/xcellonline/xcell_52/xc_pdf/xc_xcell52.pdf
[23] (2005, May) Custom instructions. Altera Corporation. [Online].
Available: http://www.altera.com/products/ip/processors/nios2/features/ni2-cust_instructions.html#custom_instructions
[24] R. Laufer, R. R. Taylor, and H. Schmit, "PCI-PipeRench and the SWOR-
DAPI: A system for stream-based reconfigurable computing," in IEEE Sym-
posium on Field-Programmable Custom Computing Machines, Apr 1999, pp. 
200-208. 
[25] S. D. Haynes, P. Y. K. Cheung, W. Luk, and J. Stone, "SONIC - a
plug-in architecture for video processing," in IEEE Symposium on
Field-Programmable Custom Computing Machines, Apr 1999, pp. 280-281. [Online].
Available: http://www.ee.ic.ac.uk/pcheung/publications/fpl99_sonic.pdf
[26] (2005, May) Nallatech FPGA computing solutions. Nallatech Inc. [Online].
Available: http://www.nallatech.com/?node_id=l.2.l&;id=l
[27] C. Plessl and M. Platzner, "TKDM - a reconfigurable co-processor
in a PC's memory slot," in IEEE International Conference on
Field-Programmable Technology, Dec 2003, pp. 252-259. [Online]. Available:
http://www.tik.ee.ethz.ch/~plessl/publications/fpt03/fpt03.pdf
[28] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. Lee,
"Pilchard - a reconfigurable computing platform with memory slot interface,"
in IEEE Symposium on Field-Programmable Custom Computing Machines,
Apr 2001, pp. 170-179.
[29] (2005, Jun) SRC Computers Inc. - Hardware elements. SRC Computers Inc.
[Online]. Available: http://www.srccomp.com/HardwareElements.htm
[30] (2005, Jun) Application acceleration with FPGA-based
reconfigurable computing. Cray Inc. [Online]. Available:
http://www.cray.com/products/xd1/acceleration.html
[31] (2004, Nov) Extraordinary acceleration of workflows with reconfigurable
application-specific computing from SGI. Silicon Graphics Inc. [Online].
Available: http://www.sgi.com/pdfs/3721.pdf
[32] (2005, May) The hypercomputer product line. Star Bridge Systems Inc.
[Online]. Available: http://www.starbridgesystems.com/products/hardware.html
[33] M. Verderber, A. Zemva, and D. Lampret, "HW/SW partitioned optimization
and VLSI-FPGA implementation of the MPEG-2 video decoder," in Proceedings
of the conference on Design, Automation and Test in Europe. IEEE
Computer Society, 2003, pp. 238-243.
[34] (2005, May) Case study: Oil & gas
seismic exploration. Nallatech Ltd. [Online]. Available:
http://www.nallatech.com/mediaLibrary/images/english/970.pdf
[35] P. Waldeck and N. Bergmann, "Dynamic hardware-software partitioning on
reconfigurable system-on-chip," in IEEE International Workshop on
System-on-Chip for Real-Time Applications, Jun 2003, pp. 102-105.
[36] T. Maruyama and T. Hoshino, "A C to HDL compiler for pipeline processing
on FPGAs," in IEEE Symposium on Field-Programmable Custom Computing
Machines, 2000, pp. 101-110.
[37] D. C. Cronquist, P. Franklin, S. G. Berg, and C. Ebeling, "Specifying and
compiling applications for RaPiD," in IEEE Symposium on FPGAs for Custom
Computing Machines, 1998, pp. 116-125.
[38] (2005, May) Complete design environment for C-based algorithmic
design entry, simulation and synthesis. Celoxica Ltd. [Online]. Available:
http://www.celoxica.com/products/dk/default.asp
[39] (2004, Oct) A true software approach to
FPGA programming. Mitrionics AB. [Online]. Available:
http://www.mitrion.com/press/Mitrion_whitepaper_041030.pdf
[40] G. Arnout, "SystemC standard," in Asia South Pacific Design Automation
Conference, 2000, pp. 573-578.
[41] J. Hopf, G. S. Itzstein, and D. Kearney, "Hardware join Java: A high level
language for reconfigurable hardware development," in IEEE International
Conference On Field Programmable Technology, Dec 2002, pp. 311-347.
[42] B. W. Kernighan and D. M. Ritchie, The C Programming Language, 2nd ed.
Prentice Hall, 1988.
[43] R. Lysecky, F. Vahid, and S. Tan, "Dynamic FPGA routing for
just-in-time FPGA compilation," in DAC '04: Proceedings of the 41st annual
conference on Design automation, 2004, pp. 954-959. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/dac04_jitfpgaroute.pdf
[44] R. Lysecky, F. Vahid, and S. Tan, "A study of the scalability of on-chip
routing for Just-in-Time FPGA compilation," in IEEE Symposium on
Field-Programmable Custom Computing Machines, 2005. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/fccm05_jitroute.pdf
[45] J. M. P. Cardoso and H. C. Neto, "Macro-based hardware compilation of
Java(tm) bytecodes into a dynamic reconfigurable computing system," in
IEEE Symposium on Field-Programmable Custom Computing Machines, 1999,
pp. 2-11.
[46] J. L. Schilling, "The simplest heuristics may be the best in Java JIT
compilers," ACM SIGPLAN Notices, vol. 38, no. 2, pp. 36-46, Feb 2003.
[47] A. Gordon-Ross and F. Vahid, "Frequent loop detection using
efficient non-intrusive on-chip hardware," in Proceedings of the
2003 international conference on Compilers, architecture and synthesis
for embedded systems, 2003, pp. 117-124. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/cases03_profile.pdf
[48] A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S. B.
Yadavalli, and J. Yates, "FX!32: a profile-directed binary translator," IEEE
Micro, vol. 18, no. 2, pp. 56-64, Mar 1998.
[49] C.-H. A. Hsieh, J. C. Gyllenhaal, and W.-m. W. Hwu, "Java bytecode to native
code translation: the caffeine prototype and preliminary results," in MICRO
29: Proceedings of the 29th annual ACM/IEEE international symposium on
Microarchitecture, 1996, pp. 90-99.
[50] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler,
A. Klaiber, and J. Mattson, "The transmeta code morphing™ software:
using speculation, recovery, and adaptive retranslation to address real-life
challenges," in CGO '03: Proceedings of the international symposium on
Code generation and optimization, 2003, pp. 15-24. [Online]. Available:
http://www.cs.aau.dk/~fleury/bug_cms/CMS_Reverse/Papers/DGB03.pdf
[51] G. M. Amdahl, "Validity of the single-processor approach to achieving large
scale computing capabilities," AFIPS Conference Proceedings, vol. 30, pp.
483-485, Apr 1967.
[52] H. Walder and M. Platzner, "Online scheduling for block-partitioned
reconfigurable devices," Design Automation and
Test in Europe, pp. 290-295, 2003. [Online]. Available:
http://www.tik.ee.ethz.ch/~walder/HomePage/XFORCES/DATE03.pdf
[53] A. Ahmadinia, C. Bobda, and J. Teich, "A dynamic scheduling and
placement algorithm for reconfigurable hardware," Lecture Notes In
Computer Science, vol. 2981, pp. 125-139, Mar 2004. [Online]. Available:
http://wwwl2.informatik.uni-erlangen.de/publications/pub2004/ABT04.pdf
[54] Z. Li, K. Compton, and S. Hauck, "Configuration caching management
techniques for reconfigurable computing," in IEEE Symposium on
Field-Programmable Custom Computing Machines, 2000, pp. 22-36. [Online].
Available: http://www.ece.wisc.edu/~kati/Publications/Li_FCCMOO.pdf
[55] M. Sanchez-Elez, M. Fernandez, R. Maestre, F. Kurdahi, R. Hermida, and
N. Bagherzadeh, "A complete data scheduler for multi-context reconfigurable
architectures," in Design Automation And Test In Europe, 2002, pp. 547-552.
[Online]. Available: http://www.eng.uci.edu/comp.arch/new_pubs/c84.pdf
[56] (2005, June) Floating point cores. Nallatech. [Online]. Available:
http://www.nallatech.com/mediaLibrary/images/english/2432.pdf
[57] (2005, June) Double precision floating point cores. Nallatech. [Online].
Available: http://www.nallatech.com/mediaLibrary/images/english/3269.pdf
[58] (2005, June) Quixilica(R) floating point cores. QinetiQ. [Online].
Available: http://www.qinetiq.com/home-rtes/quixilica_products/firmware_cores/quixillica_fp.SupportingPar.0001.File.pdf
[59] J. Liang, R. Tessier, and O. Mencer, "Floating point unit generation
and evaluation for FPGAs," in IEEE Symposium on Field-Programmable
Custom Computing Machines, 2003, pp. 185-194. [Online]. Available:
http://www.doc.ic.ac.uk/~oskar/pubs/fccm03.pdf
[60] B. Catanzaro and B. Nelson, "Higher radix floating-point representations for
FPGA-based arithmetic," in IEEE Symposium on Field-Programmable Custom
Computing Machines, 2005.
[61] M. Haselman, M. Beauchamp, A. Wood, S. Hauck, K. S. Hemmert, and
K. Underwood, "A comparison of floating point and logarithmic number systems
for FPGAs," in IEEE Symposium on Field-Programmable Custom Computing
Machines, 2005.
[62] F. Vahid. (2005, Jul) Warp processors. [Online]. Available:
http://www.cs.ucr.edu/~vahid/warp/
[63] G. Stitt, R. Lysecky, and F. Vahid, "Dynamic hardware/software
partitioning: a first approach," in DAC '03: Proceedings of the 40th
conference on Design automation, 2003, pp. 250-255. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/dac03_dhs.pdf
[64] R. Lysecky and F. Vahid, "A study of the speedups and competitiveness
of FPGA soft processor cores using dynamic hardware/software
partitioning," in DATE '05: Proceedings of the conference on Design,
Automation and Test in Europe, Mar 2005, pp. 18-23. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/date05_warp_microblaze.pdf
[65] G. Stitt, Z. Guo, W. Najjar, and F. Vahid, "Techniques for synthesizing
binaries to an advanced register/memory structure," in FPGA '05:
Proceedings of the 2005 ACM/SIGDA 13th international symposium on
Field-programmable gate arrays, 2005, pp. 118-124. [Online]. Available:
http://www.cs.ucr.edu/~vahid/pubs/fpga05-binsyn.pdf
[66] Z. Guo, B. Buyukkurt, and W. Najjar, "Input data reuse in compiling window
operations onto reconfigurable hardware," in Proceedings of the 2004 ACM
SIGPLAN/SIGBED conference on Languages, compilers, and tools for
embedded systems, 2004, pp. 249-256.
[67] (2005, Jun) Cray XD1 supercomputer. Cray Inc. [Online]. Available:
http://www.cray.com/products/xd1
[68] R. Lysecky and F. Vahid, "A configurable logic architecture for dynamic
hardware/software partitioning," in DATE '04: Proceedings of the conference
on Design, Automation and Test in Europe, vol. 1, no. 1, Feb 2004, pp. 480-485.
[Online]. Available: http://www.cs.ucr.edu/~vahid/pubs/date04_clf.pdf
[69] G. Stitt and F. Vahid, "Hardware/software partitioning of software binaries,"
in Proceedings of the 2002 IEEE/ACM international conference on
Computer-aided design, 2002, pp. 164-170.
[70] (2005, May) Highly integrated, programmable
system-on-chip (SoC). Philips Electronics. [Online]. Available:
http://www.semiconductors.philips.com/products/nexperia/
[71] P. Lieverse, P. V. D. Wolf, K. Vissers, and E. Deprettere, "A methodology for
architecture exploration of heterogeneous signal processing systems," Journal
of VLSI Signal Processing Systems, vol. 29, no. 3, pp. 197-207, Nov 2001.
[72] D. D. Gajski, S. Narayan, L. Ramachandran, F. Vahid, and P. Fung, "System
design methodologies: aiming at the 100 h design cycle," IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 1, pp. 70-82, Mar
1996.
[73] (2005, May) Virtio - Virtual platforms for embedded system design. Virtio.
[Online]. Available: http://www.virtio.com
[74] (2005, May) TriMedia SDK. Philips Semiconductors. [Online]. Available:
http://www.alacron.com/downloads/vncl98076xz/sde_2_75006255.pdf
[75] P. R. Panda, "SystemC: a modeling platform supporting multiple design
abstractions," in Proceedings of the 14th international symposium on Systems
synthesis, 2001, pp. 75-80.
[76] J. Gosling, B. Joy, G. L. S. Jr, and G. Bracha, The Java™ Language
Specification, 3rd ed. Addison-Wesley Professional, Jun 2005. [Online].
Available: http://java.sun.com/docs/books/jls/download/langspec-3.0.pdf
[77] R. K. Gupta and S. Y. Liao, "Using a programming language for digital system
design," IEEE Design & Test, vol. 14, no. 2, pp. 72-80, Apr 1997.
[78] J. Gong, D. D. Gajski, and A. Nicolau, "Performance evaluation for
application-specific architectures," IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 3, no. 4, pp. 483-490, Dec 1995.
[79] P. Paulin, C. Pilkington, and E. Bensoudane, "StepNP: a system-level
exploration platform for network processors," IEEE Design & Test, vol. 19, no. 6,
pp. 17-26, Nov 2002.
[80] M. S. Schlansker and B. R. Rau, "EPIC: Explicitly Parallel Instruction
Computing," Computer, vol. 33, no. 2, pp. 37-45, Feb 2000.
[81] H. Sharangpani and K. Arora, "Itanium processor microarchitecture,"
IEEE Micro, vol. 20, no. 5, pp. 24-43, Sep 2000. [Online]. Available:
http://courses.ece.uiuc.edu/ece512/Papers/itaniumarchitecture.pdf
[82] C. McNairy and D. Soltis, "Itanium 2 processor microarchitecture," IEEE
Micro, vol. 23, no. 2, pp. 44-55, Mar 2003.
[83] A. Settle, D. A. Connors, G. Hoflehner, and D. Lavery, "Optimization for the
Intel® Itanium® architecture register stack," in CGO '03: Proceedings of
the international symposium on Code generation and optimization, 2003, pp.
115-124. [Online]. Available:
http://rogue.colorado.edu/draco/papers/cgo-03-register.pdf
[84] M. Chrobak and J. Noga, "LRU is better than FIFO," in
SODA '98: Proceedings of the ninth annual ACM-SIAM symposium
on Discrete algorithms, 1998, pp. 78-81. [Online]. Available:
http://www.cs.ucr.edu/~marek/pubs/lru_vs_fifo.ps
[85] G. Slavenburg and M. Janssens, DataBook: PNX1300 Series
Media Processors. Philips Electronics North America
Corporation, Feb 2002, ch. Appendix A. [Online]. Available:
http://www.semiconductors.philips.com/acrobat_download/literature/9397/75010145.pdf
[86] IA-32 Intel® Architecture Software Developer's Manual,
Intel Corporation, Apr 2005. [Online]. Available:
ftp://download.intel.com/design/Pentium4/manuals/25366615.pdf,
ftp://download.intel.com/design/Pentium4/manuals/25366715.pdf
[87] R. B. Lee, "Multimedia extensions for general-purpose processors," in IEEE
Workshop on Signal Processing Systems, Nov 1997, pp. 9-23. [Online].
Available: http://www.ee.princeton.edu/~rblee/HPpapers/sipsl5go.ps
[88] M. L. Anido, A. Paar, and N. Bagherzadeh, "Improving the operation
autonomy of SIMD processing elements by using guarded instructions and pseudo
branches," in DSD '02: Proceedings of the Euromicro Symposium on Digital
Systems Design, 2002, pp. 148-156.
[89] D. N. Pnevmatikatos and G. S. Sohi, "Guarded execution and branch
prediction in dynamic ILP processors," in ISCA '94: Proceedings of the 21st annual
international symposium on Computer architecture, 1994, pp. 120-129.
[Online]. Available:
http://www.mhl.tuc.gr/research/publications/ISCA1994-GuardedExecution.pdf
[90] MIPS32™ Architecture For Programmers, MIPS Technologies Inc, Jun 2003.
[91] (2006, Apr) Transitive corporation: Technology overview. Transitive
Corporation. [Online]. Available: http://www.transitive.com/technology.htm
[92] D. Sima, "The design space of register renaming techniques," IEEE
Micro, vol. 20, no. 5, pp. 70-83, 2000. [Online]. Available:
http://www.dc.uba.ar/people/materias/ap/Articulos/The Design Space of
Register Renaming Techniques.pdf
[93] G. H. Gonnet, "Balancing binary trees by internal path reduction,"
Communications of the ACM, vol. 26, no. 12, pp. 1074-1081, 1983.
[94] V. Betz and J. Rose, "How much logic should go in an FPGA logic block?"
IEEE Design & Test, vol. 15, no. 1, pp. 10-15, 1998. [Online]. Available:
http://www.eecg.utoronto.ca/~jayar/pubs/betz/design98.pdf
[95] K. Eckl and C. Legl, "Retiming sequential circuits with multiple register
classes," in DATE '99: Proceedings of the conference on Design, automation
and test in Europe, Mar 1999, pp. 650-656.
[96] C. Leiserson and J. Saxe, "Optimizing synchronous systems," in Journal of
VLSI and Computer Systems, vol. 1, no. 1, 1983, pp. 41-67.
[97] W. Landi and B. G. Ryder, "A safe approximate algorithm for interprocedural
pointer aliasing," ACM SIGPLAN Notices, vol. 39, no. 4, pp. 473-489, 2004.
[Online]. Available: http://athos.rutgers.edu/pub/sigplan92-landi-ryder.ps
[98] G. Ramalingam, "The undecidability of aliasing," ACM Transactions on
Programming Languages and Systems, vol. 16, no. 5, pp. 1467-1471, 1994.
[99] D. R. Ditzel, "Transmeta's crusoe: Cool chips for mobile computing," Hot
Chips Symposium, Aug 2000.
[100] (2005, May) Quartus II software. Altera. [Online]. Available:
http://www.altera.com/products/software/products/quartus2/qts-index.html
[101] L. Hansen, "Design performance leaps forward with ISE
7.1i software," Xcell, no. 53, pp. 64-66, 2005. [Online].
Available: http://www.xilinx.com/publications/xcellonline/xcell_53/xc_pdf/xc_xcell53.pdf
[102] G. Kane and J. Heinrich, MIPS RISC architecture, 2nd ed. Prentice-Hall,
Inc., 1992.
[103] S. Rhoads. (2005, June) Plasma - most
MIPS I™ opcodes. Opencores.org. [Online]. Available:
http://www.opencores.org/projects.cgi/web/mips/overview
[104] T. Balph, "LFSR counters implement binary polynomial
generators," EDN, May 1998. [Online]. Available:
http://www.pldworld.com/html/technote/1 ldf_06.pdf
[105] H. Niederreiter, Random number generation and quasi-Monte Carlo methods.
Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1992.
[106] P. J. M. Laarhoven and E. H. L. Aarts, Eds., Simulated annealing: theory and
applications. Norwell, MA, USA: Kluwer Academic Publishers, 1987.
[107] B. Preneel, A. Biryukov, E. Oswald, B. V. Rompay, L. Granboulan, E. Dottax,
S. Murphy, A. Dent, J. White, M. Dichtl, S. Pyka, M. Schafheutle, P. Serf,
E. Biham, E. Barkan, O. Dunkelman, J. J. Quisquater, M. Ciet, F. Sica,
L. Knudsen, M. Parker, and H. Raddum, NESSIE security report. NESSIE,
Feb 2003, ch. 3.
[108] E. Çetin, R. C. S. Morling, and L. Kale, "An integrated 256-point
complex FFT processor for real-time spectrum analysis and
measurement," IEEE Proceedings of Instrumentation and Measurement
Technology Conference, vol. 1, pp. 96-101, May 1997. [Online]. Available:
http://dolphin.wmin.ac.uk/~cetine/IMTC97.pdf
[109] L. C. Ludeman, Fundamentals of digital signal processing. Wiley, 1996, ch. 6,
pp. 272-286.
[110] L. C. Ludeman, Fundamentals of digital signal processing. Wiley, 1996, ch. 5,
pp. 246-252.
[111] D. L. Gall, "MPEG: a video compression standard for multimedia
applications," Communications of the ACM, vol. 34, no. 4, pp. 46-58, 1991.
[112] G. de Haan, "IC for motion-compensated de-interlacing, noise
reduction, and picture-rate conversion," IEEE Transactions Consumer
Electronics, vol. 45, no. 3, pp. 617-624, 1999. [Online]. Available:
http://www.semiconductors.philips.com/acrobat_download/other/cms/99_225.pdf
[113] W. A. Martin, "Sorting," ACM Computing Surveys, vol. 3, no. 4, pp. 147-174,
Dec 1971.
[114] (2005, June) Virtex-4 family overview. Xilinx. [Online]. Available:
http://www.xilinx.com/bvdocs/publications/ds112.pdf
[115] (2005, June) Stratix II device family data sheet. Altera. [Online]. Available:
http://www.altera.com/literature/hb/stx2/stx2_sii5v1_01.pdf
[116] (2005, Jul) HyperTransport Consortium. [Online]. Available:
http://www.hypertransport.org/
[117] (1997) Using the RDTSC instruction for performance
monitoring. Intel Corporation. [Online]. Available:
http://www.math.uwaterloo.ca/~jamuir/rdtscpml.pdf
[118] (2005, Aug) The CELL project at IBM research. IBM Corporation. [Online].
Available: http://www.research.ibm.com/cell/
[119] Intel® Pentium® D Processor 840, 830 and 820
Datasheet, Intel Corporation, May 2005. [Online]. Available:
http://download.intel.com/design/Pentiumd/datashts/30750601.pdf
[120] B. Davis, T. Mudge, B. Jacob, and V. Cuppu, "DDR2 and low latency
variants," in The 27th Annual International Symposium on Computer
Architecture, May 2000.
[121] (2005, Aug) AMD Athlon™ 64. Advanced Micro Devices, Inc.
[Online]. Available:
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9484,00.html
[122] (2005, Aug) Intel® Pentium® 4 processor. Intel Corporation. [Online].
Available: http://www.intel.com/products/processor/pentium4/index.htm
[123] (2005, Aug) Transmeta™ Efficeon™ TM880 processor.
Transmeta Corporation. [Online]. Available:
http://www.transmeta.com/pdfs/brochures/tmta_efficeon_tm8800.pdf
[124] (2005, Aug) CSX600 application accelerator. ClearSpeed Technology plc.
[Online]. Available: http://www.clearspeed.com/products/si.php
Appendix A 
MIPS Test Algorithms Source Code
This appendix contains only the kernels of the test algorithms used to evaluate the 
MIPS platform. The source code for the console application (described in section 
5.2.1) is not included. 
A.1 PRBS Generator (Standard)
void randomFill( unsigned long data[], unsigned long length,
                 unsigned long seed )
{
    int i, curIndex;
    unsigned long bit31, bit28;
    unsigned long shiftReg;

    shiftReg = (seed == 0) ? DEFAULT_PRBS_SEED : seed;

    // fill the array
    for ( curIndex = 0; curIndex < length; curIndex++ )
    {
        // generate 32 bits of random data
        for ( i = 0; i < 32; i++ )
        {
            // get the last 2 bits of the register
            bit31 = (shiftReg >> 31) & 0x1;
            bit28 = (shiftReg >> 28) & 0x1;

            // shift the reg up and or in the new bit
            shiftReg <<= 1;
            shiftReg |= bit31 ^ bit28;
        }

        // store the random number
        data[ curIndex ] = shiftReg;
    }
}
A.2 PRBS Generator (Unrolled)
void randomFillOpt( unsigned long data[],
                    unsigned long length,
                    unsigned long seed )
{
    int curIndex;
    unsigned int bit31, bit30, bit29, bit28, bit27, bit26;
    unsigned int bit25, bit24, bit23, bit22, bit21, bit20;
    unsigned int bit19, bit18, bit17, bit16, bit15, bit14;
    unsigned int bit13, bit12, bit11, bit10, bit9, bit8;
    unsigned int bit7, bit6, bit5, bit4, bit3, bit2;
    unsigned int bit1, bit0;
    unsigned int temp0, temp1, temp2, shiftReg, newShiftReg;

    shiftReg = (seed == 0) ? DEFAULT_PRBS_SEED : seed;

    // fill the array
    for ( curIndex = 0; curIndex < length; curIndex++ )
    {
        // Get the bits of the shift reg into the registers
        bit31 = (shiftReg >> 31) & 0x1;
        bit30 = (shiftReg >> 30) & 0x1;
        bit29 = (shiftReg >> 29) & 0x1;
        bit28 = (shiftReg >> 28) & 0x1;
        bit27 = (shiftReg >> 27) & 0x1;
        bit26 = (shiftReg >> 26) & 0x1;
        bit25 = (shiftReg >> 25) & 0x1;
        bit24 = (shiftReg >> 24) & 0x1;
        bit23 = (shiftReg >> 23) & 0x1;
        bit22 = (shiftReg >> 22) & 0x1;
        bit21 = (shiftReg >> 21) & 0x1;
        bit20 = (shiftReg >> 20) & 0x1;
        bit19 = (shiftReg >> 19) & 0x1;
        bit18 = (shiftReg >> 18) & 0x1;
        bit17 = (shiftReg >> 17) & 0x1;
        bit16 = (shiftReg >> 16) & 0x1;
        bit15 = (shiftReg >> 15) & 0x1;
        bit14 = (shiftReg >> 14) & 0x1;
        bit13 = (shiftReg >> 13) & 0x1;
        bit12 = (shiftReg >> 12) & 0x1;
        bit11 = (shiftReg >> 11) & 0x1;
        bit10 = (shiftReg >> 10) & 0x1;
        bit9  = (shiftReg >> 9) & 0x1;
        bit8  = (shiftReg >> 8) & 0x1;
        bit7  = (shiftReg >> 7) & 0x1;
        bit6  = (shiftReg >> 6) & 0x1;
        bit5  = (shiftReg >> 5) & 0x1;
        bit4  = (shiftReg >> 4) & 0x1;
        bit3  = (shiftReg >> 3) & 0x1;
        bit2  = (shiftReg >> 2) & 0x1;
        bit1  = (shiftReg >> 1) & 0x1;
        bit0  = (shiftReg >> 0) & 0x1;

        // Calc the new shift register value
        temp2 = bit31 ^ bit28;
        temp1 = bit30 ^ bit27;
        temp0 = bit29 ^ bit26;
        newShiftReg  = (bit0 ^ temp0);
        newShiftReg |= (bit1 ^ temp1) << 1;
        newShiftReg |= (bit2 ^ temp2) << 2;
        newShiftReg |= (bit3 ^ bit0) << 3;
        newShiftReg |= (bit4 ^ bit1) << 4;
        newShiftReg |= (bit5 ^ bit2) << 5;
        newShiftReg |= (bit6 ^ bit3) << 6;
        newShiftReg |= (bit7 ^ bit4) << 7;
        newShiftReg |= (bit8 ^ bit5) << 8;
        newShiftReg |= (bit9 ^ bit6) << 9;
        newShiftReg |= (bit10 ^ bit7) << 10;
        newShiftReg |= (bit11 ^ bit8) << 11;
        newShiftReg |= (bit12 ^ bit9) << 12;
        newShiftReg |= (bit13 ^ bit10) << 13;
        newShiftReg |= (bit14 ^ bit11) << 14;
        newShiftReg |= (bit15 ^ bit12) << 15;
        newShiftReg |= (bit16 ^ bit13) << 16;
        newShiftReg |= (bit17 ^ bit14) << 17;
        newShiftReg |= (bit18 ^ bit15) << 18;
        newShiftReg |= (bit19 ^ bit16) << 19;
        newShiftReg |= (bit20 ^ bit17) << 20;
        newShiftReg |= (bit21 ^ bit18) << 21;
        newShiftReg |= (bit22 ^ bit19) << 22;
        newShiftReg |= (bit23 ^ bit20) << 23;
        newShiftReg |= (bit24 ^ bit21) << 24;
        newShiftReg |= (bit25 ^ bit22) << 25;
        newShiftReg |= (bit26 ^ bit23) << 26;
        newShiftReg |= (bit27 ^ bit24) << 27;
        newShiftReg |= (bit28 ^ bit25) << 28;
        newShiftReg |= temp0 << 29;
        newShiftReg |= temp1 << 30;
        newShiftReg |= temp2 << 31;

        // rotate vars and store the random number
        shiftReg = newShiftReg;
        data[ curIndex ] = shiftReg;
    }
}
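The unrolled listing computes the result of 32 single-bit steps in one pass: bits 3 to 31 of the new value are each the XOR of a state bit with the bit three places below it, and bits 0 to 2 additionally fold in bits 29 to 31 of that result. As an illustrative sketch only (it is not part of the thesis test code), the same update can be written with word-wide shifts and checked against the bit-serial form. The names prbs32Serial, prbs32Word and prbs32FormsAgree are hypothetical, and uint32_t is used on the assumption that unsigned long is 32 bits wide on the MIPS target.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-serial reference: 32 single-bit LFSR steps with feedback taken from
 * bits 31 and 28, mirroring the inner loop of randomFill(). */
static uint32_t prbs32Serial(uint32_t state)
{
    for (int i = 0; i < 32; i++) {
        uint32_t newBit = ((state >> 31) ^ (state >> 28)) & 0x1u;
        state = (state << 1) | newBit;
    }
    return state;
}

/* Word-wide form of the same 32 steps (derived here for illustration).
 * t holds b[p] ^ b[p-3] for every bit p; the three lowest result bits
 * additionally XOR in bits 29..31 of t, matching temp0..temp2 in the
 * unrolled listing. */
static uint32_t prbs32Word(uint32_t state)
{
    uint32_t t = state ^ (state << 3);
    return t ^ (t >> 29);
}

/* Returns 1 if both forms produce the same value for every seed given. */
static int prbs32FormsAgree(const uint32_t *seeds, int count)
{
    assert(seeds != 0 && count >= 0);
    for (int i = 0; i < count; i++) {
        if (prbs32Serial(seeds[i]) != prbs32Word(seeds[i])) {
            return 0;
        }
    }
    return 1;
}
```

Calling prbs32FormsAgree() on a handful of seeds is a cheap sanity check that an unrolled or hardware-generated variant still produces the same sequence as the simple loop.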
A.3 F F T 
/ * W r i t t e n by: Tom Roberts 11/8/89 
* Made p o r t a b l e : Malcolm Slaney 12/15/94 
* ma lco lm@in te rva l . com 
* Embedded MIPS p o r t : Thomas Grocu t t 10/1/2005 
*/ 
i n t f i x e d P o i n t F f t ( v o l a t i l e unsigned long d a t a [ ] , 
unsigned long m, bool inve r se ) 
{ 
i n t mr , nn , i , j , 1 , k , n , scale , temp ; 
bool s h i f t ; 
shor t qr , qi , t r , t i , wr , w i ; 
mr = 0; 
scale = 0; 
n = 1 « m; 
nn = n — 1; 
/ / check t ha t the f f t i s n ' t to b i t to process 
i f ( n > N_WAVE ) r e t u r n - 1 ; 
/ / dec ima t ion i n t ime — re —order data 
f o r ( m = 1; m < = nn; mH- ) 
{ 
1 = n ; 
do 
{ 
1 » = 1; 
} w h i l e ( (mr + 1) > nn ) ; 
175 
A. MIPS Test Algorithms Source Code 
mr = ( mr & (1 - 1) ) + 1 ; 
/ / make sure elements are i n the r i g h t order 
i f ( mr > m ) 
{ 
temp = data [ m ] ; 
da t a [ m ] = data [ mr ] ; 
data [ mr ] = temp; 
} 
} 
k = L0G2J^.WAVE - 1; 
f o r ( 1 = 1; 1 < n ; 1 « = 1 ) 
{ 
i f ( i nve r se ) 
{ 
/ / v a r i a b l e s c a l i n g , depending upon data 
s h i f t = f a l s e ; 
f o r ( i = 0; i < n ; i - f + ) 
{ 
temp = data [ i ] ; 
t r = temp & OxFFFF; 
t i = temp » 16; 
i f ( t r < 0 ) t r = - t r ; 
i f ( t i < 0 ) t i = - t i ; 
i f ( ( t r > 16383) | | ( t i > 16383) ) 
{ 
s h i f t = t r ue ; 
i = n ; 
} 
} 
i f ( s h i f t ) s c a l e + + ; 
} 
else 
{ 
/ * f i x e d s ca l ing , f o r proper n o r m a l i z a t i o n . 
* There w i l l be l o g 2 ( n ) passes, so t h i s r e s u l t s 
* i n an o v e r a l l f a c t o r of 1/n, d i s t r i b u t e d to 
* maximize a r i t h m e t i c accuracy . 
* / 
s h i f t = t r ue ; } 
        /* are we on the final butterfly, where there is only 1
         * pass of the inner loop
         */
        if ( (l << 1) < n )
        {
            /* it may not be obvious, but the shift will be
             * performed on each data point exactly once,
             * during this pass.
             */
            for ( m = 0; m < l; m++ )
            {
                j = m << k;
                wr = sinLut[ j + (N_WAVE / 4) ];
                wi = -sinLut[ j ];
                if ( inverse ) wi = -wi;
                if ( shift )
                {
                    wr >>= 1;
                    wi >>= 1;
                }
                for ( i = m; i < n; i += l << 1 )
                {
                    j = i + l;
                    temp = data[ j ];
                    qr = temp & 0xFFFF;
                    qi = temp >> 16;
                    tr = FIX_MUL( wr, qr )
                       - FIX_MUL( wi, qi );
                    ti = FIX_MUL( wr, qi )
                       + FIX_MUL( wi, qr );
                    temp = data[ i ];
                    qr = temp & 0xFFFF;
                    qi = temp >> 16;
                    if ( shift )
                    {
                        qr >>= 1;
                        qi >>= 1;
                    }
                    data[ j ] = ( (qi - ti) << 16 )
                              | ( (qr - tr) & 0xFFFF );
                    data[ i ] = ( (qi + ti) << 16 )
                              | ( (qr + tr) & 0xFFFF );
                }
            }
        }
        else
        {
            /* this is exactly the same as above, except it's
             * the special case on the last butterfly when
             * the inner loop only does 1 iteration.
             * Removing this loop reduces the overhead and
             * speeds things up.
             */
            for ( m = 0; m < l; m++ )
            {
                j = m << k;
                wr = sinLut[ j + (N_WAVE / 4) ];
                wi = -sinLut[ j ];
                if ( inverse ) wi = -wi;
                if ( shift )
                {
                    wr >>= 1;
                    wi >>= 1;
                }
                j = m + l;
                temp = data[ j ];
                qr = temp & 0xFFFF;
                qi = temp >> 16;
                tr = FIX_MUL( wr, qr )
                   - FIX_MUL( wi, qi );
                ti = FIX_MUL( wr, qi )
                   + FIX_MUL( wi, qr );
                temp = data[ m ];
                qr = temp & 0xFFFF;
                qi = temp >> 16;
                if ( shift )
                {
                    qr >>= 1;
                    qi >>= 1;
                }
                data[ j ] = ( (qi - ti) << 16 )
                          | ( (qr - tr) & 0xFFFF );
                data[ m ] = ( (qi + ti) << 16 )
                          | ( (qr + tr) & 0xFFFF );
            }
        }
        k--;
    }
    return ( scale );
}
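The FFT listing depends on a `FIX_MUL` macro and on each complex sample being packed into one 32-bit word (real part in the low half, imaginary part in the high half); neither is defined in this appendix. The sketch below shows one plausible form, assuming Q15 operands as in Roberts' original fixed-point FFT. The names `fixMulQ15`, `packComplex`, `unpackReal` and `unpackImag` are hypothetical and are not taken from the thesis.

```c
#include <stdint.h>

/* Assumed Q15 multiply: both operands are 16-bit values scaled by 2^15.
 * Right-shifting a negative value is implementation-defined in C; this
 * sketch assumes the usual arithmetic shift. */
static int16_t fixMulQ15(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}

/* Pack a complex sample: imaginary part in bits 31..16, real in 15..0. */
static uint32_t packComplex(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)im << 16) | (uint32_t)(uint16_t)re;
}

/* Recover the signed halves; the cast to int16_t restores the sign. */
static int16_t unpackReal(uint32_t word)
{
    return (int16_t)(word & 0xFFFFu);
}

static int16_t unpackImag(uint32_t word)
{
    return (int16_t)(word >> 16);
}
```

Assigning the masked word to a 16-bit signed variable, as the listing does with `qr` and `qi`, has the same sign-restoring effect as the casts here.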
A.4 Low Pass Filter 
void lpfAudio( volatile unsigned int buffer[],
               unsigned int length )
{
    int i;
    unsigned int temp;
    short c0;
    short c0Cur;
    short c0Prev1;
    short c0Prev2;
    int c0Prev3_mul2;
    int c0Prev4_mul2;
    int c0Prev5_mul2;
    short c1;
    short c1Cur;
    short c1Prev1;
    short c1Prev2;
    int c1Prev3_mul2;
    int c1Prev4_mul2;
    int c1Prev5_mul2;

    // pre calc some temp vars
    c0Cur = buffer[ 0 ] & 0xFFFF;
    c1Cur = buffer[ 0 ] >> 16;
    c0Prev1 = c0Cur;
    c0Prev2 = c0Cur;
    c0Prev3_mul2 = c0Cur * 2;
    c0Prev4_mul2 = c0Cur * 2;
    c0Prev5_mul2 = c0Cur * 2;
    c1Prev1 = c1Cur;
    c1Prev2 = c1Cur;
    c1Prev3_mul2 = c1Cur * 2;
    c1Prev4_mul2 = c1Cur * 2;
    c1Prev5_mul2 = c1Cur * 2;
    // go through the buffer low pass filtering each channel
    for ( i = 0; i < length; i++ )
    {
        // get the value of the 2 channels
        temp = buffer[ i ];
        c0Cur = temp & 0xFFFF;
        c1Cur = temp >> 16;
        // calc the lpf output
        c0 = ( (c0Cur * 4) + (c0Prev1 * 3) +
               (c0Prev2 * 3) + c0Prev3_mul2 +
               c0Prev4_mul2 + c0Prev5_mul2 ) / 16;
        c1 = ( (c1Cur * 4) + (c1Prev1 * 3) +
               (c1Prev2 * 3) + c1Prev3_mul2 +
               c1Prev4_mul2 + c1Prev5_mul2 ) / 16;
        // rotate values
        c0Prev5_mul2 = c0Prev4_mul2;
        c0Prev4_mul2 = c0Prev3_mul2;
        c0Prev3_mul2 = c0Prev2 * 2;
        c0Prev2 = c0Prev1;
        c0Prev1 = c0Cur;
        c1Prev5_mul2 = c1Prev4_mul2;
        c1Prev4_mul2 = c1Prev3_mul2;
        c1Prev3_mul2 = c1Prev2 * 2;
        c1Prev2 = c1Prev1;
        c1Prev1 = c1Cur;
        // store lpf result
        buffer[ i ] = ( c1 << 16 ) | ( c0 & 0xFFFF );
    }
}
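The filter in this listing is a 6-tap FIR with integer weights 4, 3, 3, 2, 2, 2 (summing to 16), with its history primed from the first sample. A minimal single-channel sketch of the same arithmetic is given below; it uses an explicit history array rather than the unrolled, pre-multiplied variables of the listing, and the name `lpfChannel` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 6-tap FIR, weights 4,3,3,2,2,2 over the current and 5 previous inputs.
 * The history is primed with the first sample, as in the listing. */
static void lpfChannel(int16_t *samples, size_t length)
{
    static const int weights[6] = { 4, 3, 3, 2, 2, 2 };
    int history[6];
    size_t i, t;

    if (length == 0) return;
    for (t = 0; t < 6; t++) history[t] = samples[0];

    for (i = 0; i < length; i++) {
        int acc = 0;
        /* shift the history so the newest input sits at index 0 */
        for (t = 5; t > 0; t--) history[t] = history[t - 1];
        history[0] = samples[i];
        for (t = 0; t < 6; t++) acc += weights[t] * history[t];
        /* the weights sum to 16, so dividing by 16 gives unity DC gain */
        samples[i] = (int16_t)(acc / 16);
    }
}
```

A constant input passes through unchanged, which is a quick sanity check on the weights and the divisor.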
A.5 Normalization
void normaliseAudio( volatile unsigned int buffer[],
                     unsigned int length )
{
    int i;
    short c0Val;
    short c0Min;
    short c0Max;
    short c1Val;
    short c1Min;
    short c1Max;
    short absMax;
    int upScale;
    unsigned int temp;

    /* go through the buffer getting the min and max
     * values of each channel
     */
    c0Max = buffer[ 0 ] & 0xFFFF;
    c0Min = c0Max;
    c1Max = buffer[ 0 ] >> 16;
    c1Min = c1Max;
    for ( i = 0; i < length; i++ )
    {
        // get the value of the 2 channels
        temp = buffer[ i ];
        c0Val = temp & 0xFFFF;
        c1Val = temp >> 16;
        // update min and max vals
        if ( c0Val > c0Max ) c0Max = c0Val;
        if ( c0Val < c0Min ) c0Min = c0Val;
        if ( c1Val > c1Max ) c1Max = c1Val;
        if ( c1Val < c1Min ) c1Min = c1Val;
    }
    // find the max absolute value and calc the scale factor
    absMax = (c0Min < 0) ? (-1 * c0Min) : c0Min;
    i = (c0Max < 0) ? (-1 * c0Max) : c0Max;
    if ( i > absMax ) absMax = i;
    i = (c1Min < 0) ? (-1 * c1Min) : c1Min;
    if ( i > absMax ) absMax = i;
    i = (c1Max < 0) ? (-1 * c1Max) : c1Max;
    if ( i > absMax ) absMax = i;
    upScale = (MAX_VALUE * MAX_VALUE) / absMax;
    // now apply the correction
    for ( i = 0; i < length; i++ )
    {
        // get the value of the 2 channels
        temp = buffer[ i ];
        c0Val = temp & 0xFFFF;
        c1Val = temp >> 16;
        c0Val = ( c0Val * upScale ) >> 15;
        c1Val = ( c1Val * upScale ) >> 15;
        buffer[ i ] = ( c1Val << 16 ) | ( c0Val & 0xFFFF );
    }
}
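The scale factor in the normalization listing can be checked by hand. The sketch below isolates the two arithmetic steps, assuming `MAX_VALUE` is 32767 (the value is not given in this appendix); the helper names `normaliseScale` and `applyScale` are hypothetical, and the zero guard is an addition that the listing does not contain.

```c
#include <stdint.h>

#define MAX_VALUE 32767   /* assumption: largest positive 16-bit sample */

/* Scale factor such that (absMax * factor) >> 15 is close to MAX_VALUE. */
static int32_t normaliseScale(int16_t absMax)
{
    if (absMax == 0) return 0;   /* guard for a silent buffer (not in the listing) */
    return ((int32_t)MAX_VALUE * MAX_VALUE) / absMax;
}

/* Apply the factor to one sample, as the second loop of the listing does.
 * Assumes arithmetic right shift for negative values. */
static int16_t applyScale(int16_t value, int32_t upScale)
{
    return (int16_t)(((int32_t)value * upScale) >> 15);
}
```

For a peak of 16384 the factor is 65532, so the peak maps to 32766, just below full scale; the product never exceeds `MAX_VALUE * MAX_VALUE`, so it fits in 32 bits under these assumptions.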
[Pages 182 to 187 of the original (sections A.6 and A.7) are landscape-oriented source listings; their text is not recoverable from this extraction.]
A.8 Mandelbrot 
void mandelbrot( IMAGE_PROC_FrameBuffer *fb )
{
    int Zr, Zi;
    int Cr, Ci;
    int x, y;
    int modulous;
    int i;
    int tempR, tempi;

    // iterate over the lines and the pixels on the lines
    for ( y = 0; y < 240; y++ )
    {
        for ( x = 0; x < 320; x++ )
        {
            // scale pixels so they are in the complex plane
            Cr = ( (x - X_OFFSET) << FIX_POINT_SCALE ) / X_SCALE;
            Ci = ( (y - Y_OFFSET) << FIX_POINT_SCALE ) / Y_SCALE;
            // iterate over the z value
            Zr = 0;
            Zi = 0;
            modulous = 0;
            for ( i = 0; (i < MAX_ITERATIONS) &&
                         (modulous < MOD_THRESHOLD); i++ )
            {
                // calc the modulus, used for the early exit
                modulous = Zr * Zr;
                R_DIV( modulous, FIX_POINT_SCALE );
                modulous += Zi + Zi;
                // calc the square of Z
                tempR = (Zr * Zr) - (Zi * Zi);
                R_DIV( tempR, FIX_POINT_SCALE );
                tempi = 2 * Zr * Zi;
                R_DIV( tempi, FIX_POINT_SCALE );
                // add C to get new Z value
                Zr = tempR + Cr;
                Zi = tempi + Ci;
            }
            // calc the modulus of z: |(a+ib)| = (a^2 + b^2)^(1/2)
            i = (modulous < MOD_THRESHOLD) ?
                    0xFFFFFF : (i * 15);
            fb->baseAddr[ (y * fb->stride) + x ] = i;
        }
    }
}
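The Mandelbrot listing iterates z = z^2 + c in fixed point, with `R_DIV` rescaling each product; the macro itself is not defined in this appendix. The sketch below shows one iteration and the squared modulus under explicit assumptions: a 12-bit fractional part (`FIX_SHIFT`), a plain truncating right shift in place of `R_DIV`, and 64-bit intermediates. The type `FixComplex` and the functions `mandelStep` and `modSquared` are hypothetical names.

```c
#include <stdint.h>

#define FIX_SHIFT 12   /* assumption: 12 fractional bits, so 1.0 == 4096 */

typedef struct {
    int32_t re;
    int32_t im;
} FixComplex;

/* One iteration z' = z*z + c, using (a+ib)^2 = (a^2 - b^2) + i(2ab).
 * Each product carries 2*FIX_SHIFT fractional bits and is shifted back.
 * Assumes arithmetic right shift for negative values. */
static FixComplex mandelStep(FixComplex z, FixComplex c)
{
    FixComplex next;
    int64_t re2 = (int64_t)z.re * z.re - (int64_t)z.im * z.im;
    int64_t im2 = 2 * (int64_t)z.re * z.im;

    next.re = (int32_t)(re2 >> FIX_SHIFT) + c.re;
    next.im = (int32_t)(im2 >> FIX_SHIFT) + c.im;
    return next;
}

/* Squared modulus a^2 + b^2 in the same fixed-point format. */
static int32_t modSquared(FixComplex z)
{
    int64_t sum = (int64_t)z.re * z.re + (int64_t)z.im * z.im;
    return (int32_t)(sum >> FIX_SHIFT);
}
```

Comparing the squared modulus against a squared threshold avoids the square root in the escape test.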
A.9 Half Brightness 
void halfBrightness( IMAGE_PROC_FrameBuffer *fb )
{
    int r, g, b, y, u, v;
    int curLine, curPixel, width;
    unsigned int value, *line;

    width = fb->width;
    // go through the lines in the picture
    for ( curLine = 0; curLine < fb->height; curLine++ )
    {
        line = fb->baseAddr + (curLine * fb->stride);
        // go through the pixels on the line
        for ( curPixel = 0; curPixel < width; curPixel++ )
        {
            // unpack the RGB value and convert to YUV
            value = line[ curPixel ];
            b = value & 0xFF;
            g = (value >> 8) & 0xFF;
            r = (value >> 16) & 0xFF;
            y = ( (Y_R_CONST * r) + (Y_G_CONST * g) +
                  (Y_B_CONST * b) ) >> 15;
            u = ( U_CONST * (b - y) ) >> 15;
            v = ( V_CONST * (r - y) ) >> 15;
            // halve the brightness
            y >>= 1;
/ / conver t back to RGB, c l i p 
r = ( (R.CONST * v ) » 
g = y - ( ( (G_V_CONST 
(G_U_CONST 
(B_CONST * u) » 
to range and repack 
b = 
r > 
r < 
g > 
g < 
b > 
b < 
value = 
value = 
value 1 = 
255 
0 
255 
0 
255 
0 
r « 
g « 
b ; 
value 
value 
value 
value 
value 
value 
16; 
8; 
15 ) + y ; 
* v ) + 
* u) ) 
15 ) + y ; 
255; 
0; 
255; 
0; 
255; 
0; 
» 15 ) ; 
l i n e [ c u r P i x e l ] = v a l u e ; 
A.10 Factorial and Series Sum
static void fact( const CONSOLE_GeneralCmdType *cmd,
                  void *params[] )
{
    int curVal;
    int fact;

    curVal = (int) params[ 0 ];
    fact = (curVal == 0) ? 0 : 1;
    // calc the factorial itself
    for ( ; curVal > 0; curVal-- ) fact *= curVal;
    // print out the results
    printNum( fact );
    putString( "\n" );
}

static void seriesSum( const CONSOLE_GeneralCmdType *cmd,
                       void *params[] )
{
    int curVal;
    int sum = 0;

    // calc the series sum itself
    for ( curVal = (int) params[ 0 ]; curVal > 0; curVal-- )
    {
        sum += curVal;
    }
    // print out the results
    printNum( sum );
    putString( "\n" );
}
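The two loops above are easy to check against known values. The sketch below restates them as pure functions, without the console types, and adds the closed form n(n+1)/2 for the series sum; the function names are hypothetical. Note that it mirrors the listing in returning 0 for an input of 0, whereas the mathematical convention is 0! = 1.

```c
/* Factorial by repeated multiplication, mirroring the listing
 * (including its result of 0 for an input of 0). */
static int factorialLoop(int n)
{
    int result = (n == 0) ? 0 : 1;
    for (; n > 0; n--) result *= n;
    return result;
}

/* Sum of 1..n by accumulation, as in the listing. */
static int seriesSumLoop(int n)
{
    int sum = 0;
    for (; n > 0; n--) sum += n;
    return sum;
}

/* Closed form for the same sum, useful as a cross-check. */
static int seriesSumClosed(int n)
{
    return (n > 0) ? (n * (n + 1)) / 2 : 0;
}
```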
A.11 Copy
static void copyFunc( const CONSOLE_GeneralCmdType *cmd,
                      void *params[] )
{
    unsigned long *srcAddr;
    unsigned long *destAddr;
    unsigned long len;
    unsigned long i;

    /* extract the data from the params array
     * and mask the addresses
     */
    srcAddr = (unsigned long *) ( ( (unsigned long)
                  params[ 0 ] ) & WORD_ADDRESS_MASK );
    destAddr = (unsigned long *) ( ( (unsigned long)
                  params[ 1 ] ) & WORD_ADDRESS_MASK );
    len = ( (unsigned long) params[ 2 ] ) >> 2;
    // do the memory copy
    for ( i = 0; i < len; i++ )
        destAddr[ i ] = srcAddr[ i ];
}
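The copy routine masks both addresses down to a word boundary and converts the byte length to a word count with a shift by 2. The sketch below illustrates that arithmetic on plain integers, assuming 4-byte words and a mask that clears the two low address bits; the actual value of `WORD_ADDRESS_MASK` is not given in this appendix, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 32-bit words, so the mask clears the two low address bits. */
#define WORD_ADDRESS_MASK (~(uintptr_t)0x3)

/* Round an address down to the start of the word that contains it. */
static uintptr_t alignToWord(uintptr_t addr)
{
    return addr & WORD_ADDRESS_MASK;
}

/* Number of whole 4-byte words in a byte length (remainder is dropped). */
static unsigned long bytesToWords(unsigned long byteLen)
{
    return byteLen >> 2;
}
```

Under these assumptions a length that is not a multiple of 4 loses its trailing bytes, so callers would need to pass word-multiple lengths.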
A.12 Sort 
/*
 * @(#)BubbleSortAlgorithm.java 1.6 95/01/31 James Gosling
 *
 * Copyright (c) 1994 Sun Microsystems, Inc. All Rights
 * Reserved.
 *
 * Permission to use, copy, modify, and distribute this
 * software and its documentation for NON-COMMERCIAL
 * purposes and without fee is hereby granted provided that
 * this copyright notice appears in all copies. Please refer
 * to the file "copyright.html" for further important
 * copyright and licensing information.
 *
 * SUN MAKES NO REPRESENTATIONS OR WARRANTIES ABOUT THE
 * SUITABILITY OF THE SOFTWARE, EITHER EXPRESS OR IMPLIED,
 * INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR
 * NON-INFRINGEMENT. SUN SHALL NOT BE LIABLE FOR ANY DAMAGES
 * SUFFERED BY LICENSEE AS A RESULT OF USING, MODIFYING OR
 * DISTRIBUTING THIS SOFTWARE OR ITS DERIVATIVES.
 */
/**
 * A bubble sort demonstration algorithm
 * SortAlgorithm.java, Thu Oct 27 10:32:35 1994
 *
 * @author James Gosling
 * @version 1.6, 31 Jan 1995
 *
 * Modified 23 Jun 1995 by Jason Harrison@cs.ubc.ca:
 * Algorithm completes early when no items have been
 * swapped in the last pass.
 * Modified 9-9-2004 by Thomas Grocutt
 * Minor optimizations and ported to C
 */
static void bubbleSort( unsigned long data[],
                        unsigned long length )
{
    int i;
    int j;
    unsigned long dataTemp;
    bool complete = false;

    // keep sweeping until we reach the end or complete early
    for ( i = length - 1; (i >= 0) && !complete; i-- )
    {
        complete = true;
        dataTemp = data[ 0 ];
        // sweep the array swapping out of order pairs
        for ( j = 0; j < i; j++ )
        {
            // should the 2 elements be swapped
            if ( dataTemp <= data[ j + 1 ] )
            {
                dataTemp = data[ j + 1 ];
            }
            else
            {
                data[ j ] = data[ j + 1 ];
                data[ j + 1 ] = dataTemp;
                complete = false;
            }
        }
    }
}
/*
 * @(#)HeapSortAlgorithm.java 1.0 95/06/23 Jason Harrison
 *
 * Copyright (c) 1995 University of British Columbia
 *
 * Permission to use, copy, modify, and distribute this
 * software and its documentation for NON-COMMERCIAL
 * purposes and without fee is hereby granted provided that
 * this copyright notice appears in all copies. Please refer
 * to the file "copyright.html" for further important
 * copyright and licensing information.
 *
 * UBC MAKES NO REPRESENTATIONS OR WARRANTIES ABOUT THE
 * SUITABILITY OF THE SOFTWARE, EITHER EXPRESS OR IMPLIED,
 * INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR
 * NON-INFRINGEMENT. UBC SHALL NOT BE LIABLE FOR ANY DAMAGES
 * SUFFERED BY LICENSEE AS A RESULT OF USING, MODIFYING OR
 * DISTRIBUTING THIS SOFTWARE OR ITS DERIVATIVES.
 */
/**
 * A heap sort demonstration algorithm
 * SortAlgorithm.java, Thu Oct 27 10:32:35 1994
 *
 * @author Jason Harrison@cs.ubc.ca
 * @version 1.0, 23 Jun 1995
 *
 * Modified 9-9-2004 by Thomas Grocutt
 * Minor optimizations and ported to C
 */
static void heapSort( unsigned long data[],
                      unsigned long length )
{
    unsigned long i;
    unsigned long temp;
    unsigned long curElement;

    for ( i = length / 2; i > 0; i-- )
    {
        heapSortDownHeap( data, i, length );
    }
    for ( curElement = length - 1; curElement > 0;
          curElement-- )
    {
        temp = data[ 0 ];
        data[ 0 ] = data[ curElement ];
        data[ curElement ] = temp;
        heapSortDownHeap( data, 1, curElement );
    }
}

static void heapSortDownHeap( unsigned long data[],
                              unsigned long k,
                              unsigned long curElement )
{
    unsigned long temp;
    unsigned long j;
    bool done = false;

    temp = data[ k - 1 ];
    while ( (k <= curElement / 2) && !done )
    {
        j = k + k;
        if ( (j < curElement) && (data[ j - 1 ] < data[ j ]) )
        {
            j++;
        }
        if ( temp >= data[ j - 1 ] ) done = true;
        else
        {
            data[ k - 1 ] = data[ j - 1 ];
            k = j;
        }
    }
    data[ k - 1 ] = temp;
}
/*
 * @(#)QSortAlgorithm.java 1.6f 95/01/31 James Gosling
 *
 * Copyright (c) 1994-1995 Sun Microsystems, Inc. All Rights
 * Reserved.
 *
 * Permission to use, copy, modify, and distribute this
 * software and its documentation for NON-COMMERCIAL or
 * COMMERCIAL purposes and without fee is hereby granted.
 * Please refer to the file
 * http://java.sun.com/copy_trademarks.html for further
 * important copyright and trademark information and to
 * http://java.sun.com/licensing.html for further important
 * licensing information for the Java (tm) Technology.
 *
 * SUN MAKES NO REPRESENTATIONS OR WARRANTIES ABOUT THE
 * SUITABILITY OF THE SOFTWARE, EITHER EXPRESS OR IMPLIED,
 * INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR
 * NON-INFRINGEMENT. SUN SHALL NOT BE LIABLE FOR ANY DAMAGES
 * SUFFERED BY LICENSEE AS A RESULT OF USING, MODIFYING OR
 * DISTRIBUTING THIS SOFTWARE OR ITS DERIVATIVES.
 *
 * THIS SOFTWARE IS NOT DESIGNED OR INTENDED FOR USE OR
 * RESALE AS ON-LINE CONTROL EQUIPMENT IN HAZARDOUS
 * ENVIRONMENTS REQUIRING FAIL-SAFE PERFORMANCE, SUCH AS IN
 * THE OPERATION OF NUCLEAR FACILITIES, AIRCRAFT NAVIGATION
 * OR COMMUNICATION SYSTEMS, AIR TRAFFIC CONTROL, DIRECT
 * LIFE SUPPORT MACHINES, OR WEAPONS SYSTEMS, IN WHICH THE
 * FAILURE OF THE SOFTWARE COULD LEAD DIRECTLY TO DEATH,
 * PERSONAL INJURY, OR SEVERE PHYSICAL OR ENVIRONMENTAL
 * DAMAGE ("HIGH RISK ACTIVITIES"). SUN SPECIFICALLY
 * DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR
 * HIGH RISK ACTIVITIES.
 */
/**
 * A quick sort demonstration algorithm
 * SortAlgorithm.java, Thu Oct 27 10:32:35 1994
 *
 * @author James Gosling
 * @version 1.6f, 31 Jan 1995
 *
 * 19 Feb 1996: Fixed to avoid infinite loop discovered by
 *              Paul Haeberli. Misbehaviour expressed when
 *              the pivot element was not unique.
 *              - Jason Harrison
 *
 * 21 Jun 1996: Modified code based on comments from Paul
 *              Haeberli, and Peter Schweizer
 *              (Peter.Schweizer@mni.fh-giessen.de). Used
 *              Daeron Meyer's (daeron@geom.umn.edu) code
 *              for the new pivoting code.
 *              - Jason Harrison
 *
 * 09 Jan 1998: Another set of bug fixes by Thomas Everth
 *              (everth@wave.co.nz) and John Brzustowski
 *              (jbrzusto@gpu.srv.ualberta.ca).
 * 9-9-2004 Modified by Thomas Grocutt
 *              Minor optimizations and ported to C
 */
static void quickSort( unsigned long data[],
                       unsigned long lo0,
                       unsigned long hi0 )
{
    unsigned long lo;
    unsigned long hi;
    unsigned long temp;
    unsigned long pivot;

    // set initial values
    lo = lo0;
    hi = hi0;
    if ( lo >= hi ) return;
    else if ( lo == hi - 1 )
    {
        // sort a two element list by swapping if necessary
        if ( data[ lo ] > data[ hi ] )
        {
            temp = data[ lo ];
            data[ lo ] = data[ hi ];
            data[ hi ] = temp;
        }
        return;
    }
    // Pick a pivot and move it out of the way
    pivot = data[ (lo + hi) / 2 ];
    data[ (lo + hi) / 2 ] = data[ hi ];
    data[ hi ] = pivot;
    while ( lo < hi )
    {
        /*
         * Search forward from a[lo] until an element is
         * found that is greater than the pivot or lo >= hi
         */
        while ( (data[ lo ] <= pivot) && (lo < hi) ) lo++;
        /*
         * Search backward from a[hi] until an element is
         * found that is less than the pivot, or lo >= hi
         */
        while ( (pivot <= data[ hi ]) && (lo < hi) ) hi--;
        // Swap elements a[lo] and a[hi]
        if ( lo < hi )
        {
            temp = data[ lo ];
            data[ lo ] = data[ hi ];
            data[ hi ] = temp;
        }
    }
    // Put the median in the "center" of the list
    data[ hi0 ] = data[ hi ];
    data[ hi ] = pivot;
    /*
     * Recursive calls, elements a[lo0] to a[lo-1] are
     * less than or equal to pivot, elements a[hi+1]
     * to a[hi0] are greater than pivot.
     */
    quickSort( data, lo0, lo - 1 );
    quickSort( data, hi + 1, hi0 );
}
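One property of the quick sort listing is worth noting: with `unsigned long` indices, the expression `lo - 1` wraps around to a very large value when the partition ends at index 0, so the `lo >= hi` test no longer stops the recursion. The Java original used signed `int` indices, for which `lo - 1` becomes -1 and the test returns immediately. The sketch below is a variant with signed indices under that reading; the name `quickSortSigned` is hypothetical, and the two-element special case is dropped because the general path also handles it.

```c
/* Quick sort over data[lo0..hi0] with signed indices, so that an empty
 * range such as (0, -1) is rejected by the first test. */
static void quickSortSigned(unsigned long data[], long lo0, long hi0)
{
    long lo = lo0;
    long hi = hi0;
    unsigned long pivot;
    unsigned long temp;

    if (lo >= hi) return;

    /* pick the middle element as pivot and park it at the top */
    pivot = data[(lo + hi) / 2];
    data[(lo + hi) / 2] = data[hi];
    data[hi] = pivot;

    while (lo < hi) {
        /* advance lo past elements <= pivot, retreat hi past elements >= pivot */
        while ((data[lo] <= pivot) && (lo < hi)) lo++;
        while ((pivot <= data[hi]) && (lo < hi)) hi--;
        if (lo < hi) {
            temp = data[lo];
            data[lo] = data[hi];
            data[hi] = temp;
        }
    }

    /* move the pivot into its final position */
    data[hi0] = data[hi];
    data[hi] = pivot;

    quickSortSigned(data, lo0, lo - 1);
    quickSortSigned(data, hi + 1, hi0);
}
```

The input `{3, 1, 2}` exercises the case in question: the pivot is the minimum, the partition ends at index 0, and the left-hand recursive call receives the range (0, -1).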
