
















Carl Frans van Schaik 
A dissertation submitted to the Department of Electrical Engineering, 

University of Cape Town 

in partial fulfilment of the requirements, for the degree of Master of 

Science in Engineering 

Cape Town, January 2002 











The copyright of this thesis vests in the author. No 
quotation from it or information derived from it is to be 
published without full acknowledgement of the source. 
The thesis is to be used for private study or non-
commercial research purposes only. 
 
Published by the University of Cape Town (UCT) in terms 













I, Carl Frans van Schaik, declare that all information contained within this dissertation
report is original and my own work to the best of my knowledge, except where indicated
in the text. This dissertation report has not been submitted, in whole or in part, to
any organisation other than those directly related to the University of Cape Town.
signed 




I would like to thank all the people who assisted in the completion of this project, as
well as:
My supervisor, Professor M.R. Inggs, for his time, dedication and support of this
project.
Alan Langman for being my 'unofficial' supervisor and who provided much input 





This dissertation concerns the design and implementation of a node for a hardware
reconfigurable parallel processor. The hardware that was developed allows for the
further development of a parallel processor with configurable hardware acceleration.
Each node in the system has a standard microprocessor and a reconfigurable logic device,
and has high-speed communications channels for inter-node communication.
The design of the node provided high-speed serial communications channels allowing
the implementation of various network topologies. The node also provided
a PCI master interface to provide an external interface and to communicate with local
nodes on the bus. A high-speed RISC processor provided communication and system
control functions, and the reconfigurable logic device provided communication
interfaces and data processing functions.
The node was designed and implemented as a PCI card that interfaced to a standard
PCI bus. VHDL designs were developed for the logic devices that provided system
support and for the reconfigurable logic FPGA, and software including drivers and
system software was written for the node. The 64-bit version of the Linux operating
system was then ported to the processor, providing a UNIX environment for the
system.
The node functioned as specified, and parallel and hardware accelerated processing
was demonstrated. The hardware acceleration was shown to provide substantial








1.1 Project Objectives
1.2 Outline of Dissertation
2 Theoretical Background
2.1 Parallel Computing
2.1.1 Introduction to Parallel Processing
2.1.2 Parallel Software
2.1.3 Hardware Configurable Parallel Processors
2.1.4 Communication Architectures
2.2 Example Parallel Processors
2.3 Example Reconfigurable Processors
2.4 Reconfigurable Logic
3 User Requirements and Specification
3.1 User Requirements
3.2 Project Deliverables
3.3 Requirements Analysis
3.3.1 Silicon requirements
3.3.2 Mechanical requirements
3.3.3 Interface requirements
3.3.4 VHDL / Macro-function requirements
3.3.5 Software requirements
3.4 System Specifications
3.4.1 Processing Algorithms
3.4.2 Communications




























































4 Concept Study
4.1 node
4.2 requirements
4.2.1 System Requirements
4.2.2 Selection
4.3 Reconfigurable logic requirements
system design
Communication infrastructure
Software requirements
5 Hardware Design and Implementation
5.1 Components
5.2 Design Choices
5.2.1 Choice A
5.2.2 Choice B
5.2.3 Choice C
5.3 High Level System Design
5.4 Functional Unit
5.4.1 Microprocessor
5.4.2 Memory Controller
5.4.3 Bus and Peripheral Bus
5.4.4 FPGA
5.4.5 Clocking
5.4.6 Power and Peripheral Devices
5.5 System Configuration Options
5.6 Printed Circuit Board
Conclusion
6 Hardware Verification
6.1 PCB Inspection
6.2 Power Supply
6.3 Configurable
6.4 Memory Controller
Processor
7 Firmware Implementation
7.1 design implications
7.2 Configuration CPLD
7.2.1 Requirements
7.2.2 Specification and Design
7.2.3 Implementation
7.3 Bus + Control CPLD
7.3.1 Requirements
7.3.2 Specification and Design
7.3.3 Implementation
7.4 FPGA Designs
7.4.1 Basic LED Test
7.4.2 LVDS VHDL Test
7.4.3 LVDS Schematic Test
7.4.4 Local Bus Emulation Test
7.4.5 16550 Compatible UART
7.4.6 Remote Bus Access
7.4.7 Local-Bus Test
7.4.8 Sigma-Delta Modulator
7.4.9 Propane UART
7.4.10 Propane Design
7.4.11 DSP Experiment
7.5 Conclusions
8 Software Development
8.1 Linux PC Software
8.1.1 PCI Driver
8.1.2 Bus Access and FLASH programmer
8.1.3 Debugging and Utilities
8.2 MIPS Software
8.2.1 Verification Programs
8.2.2 Diesel Boot-Loader
8.2.3 Linux Kernel
8.2.4 Linux Programs
9 Testing and Verification
9.1 Firmware Verification
9.1.1 Config CPLD
9.1.2 Bus + Control CPLD
9.1.3 FPGA Testing
9.2 Software Testing
9.3 Benchmarks
10 Conclusions and Future Work
10.1 Conclusions
10.2 Future Work
Bibliography
A Schematics and PCB














































































C Propane Interface
D Source Code and Datasheets
List of Figures 
3.1 Basic Component Requirements of a Node
3.2 Node interface requirements
5.1 Design architecture choice A
5.2 Design architecture choice B
5.3 Design architecture choice C
5.4 RC64574 Functional Block Diagram
5.5 RC64574 Interfaces
5.6 GT64115 Interfaces
5.7 Local and Peripheral Bus Configuration
5.8 Clock Distribution Architecture
5.9 Top side, unpopulated project hardware
5.10 Bottom side, unpopulated project hardware
5.11 Top side view, populated project hardware
7.1 Data-Strobe Encoding
7.2 LVDS Transmitter Logic
7.3 High level overview of Remote Bus Access system
9.1 Normalised FFTW performance
B.1 4X Clock Generation
B.2 4X Clock - dV/dt (1GV/s per division)
B.3 LVDS Clock - FFT (266MHz)
B.4 LVDS Cable Delay and Signal Quality
B.5 LVDS Difference Voltage


List of Tables 
5.1 Memory device assignments
5.2 Worst case power supply calculations
5.3 Galileo GT64115 Configuration
5.4 Interrupt and other Configuration
7.1 System control interface, on Chip Select 3 (CS3)
8.1 Virtual Interrupt Allocation






Glossary 
ASIC Application Specific Integrated Circuit: An integrated circuit that has been
specially designed for a custom application, typically proprietary.
CPLD Complex Programmable Logic Device: A programmable logic device similar
to, but providing more features, such as flip-flops and feedback, than PALs and
GALs.
DCT Discrete Cosine Transform: A mathematical algorithm to calculate the Cosine
function components of a signal represented by a discrete set of points.
FFT Fast Fourier Transform: A mathematical algorithm for converting time domain
data into frequency domain data.
FPGA Field Programmable Gate Array: A device providing a large array of
configurable logic building blocks and configurable routing interconnects for
implementing logic designs.
PVM Parallel Virtual Machine: A software library and application for running paral­





The goal of this MSc project was the design, implementation and testing of a node for
a run-time hardware reconfigurable parallel computing processor. Three major tasks
were undertaken in the completion of this project. The first was analysing a selection
of successful parallel computing architectures and evaluating their characteristics.
The second was specifying, designing and implementing the node hardware according to
an architecture derived from the first task, with the added advantage of configurable
logic. The last was implementing a software environment to provide a base for further
parallel processing research and demonstrating basic proof-of-concept examples.
Configurable logic allows the logical circuitry of a specialised silicon chip, in
particular an FPGA, to be configured and changed without modifying the physical devices
in the system. FPGA devices allow highly specialised digital designs to be
implemented in general-purpose silicon without the cost of developing custom silicon.
Reconfigurable logic allows the logic design configured in the device to be changed at
any time, especially while the device and system are in operation.
Parallel processing is the use of multiple processing units connected together to
handle an intensive computational task. It is most commonly employed to reduce the
processing time for very large or complicated processing tasks. This works because
certain processing tasks can be divided so that each processing node works on a subset
of the entire problem. Example applications in which parallel processing is used
include image processing, finite element simulations and computer generated animation.
In many processing tasks, a single algorithmic function, or a small number of them,
is used extensively on a large amount of data. In certain cases, the algorithm used can
be implemented directly as a digital logic design in hardware. When this is possible,
the hardware implementation is usually orders of magnitude faster than the same
algorithm running on a digital microprocessor. These hardware implementations can be
realised in reconfigurable logic devices rather than custom silicon. The ability of
these devices to implement arbitrary configurations allows a more general-purpose
hardware accelerated processing unit to be designed. Further, by using these hardware
accelerated processing functions in com-

Radar Remote Sensing Group at the University 
Commercial Off The Shelf
radar projects.
for processing on the GOLACH processor
is been investigated long term is 
developing a more O'PI4,pr~ 

a general purpose 
principles of u ...... ,"'-"v 
accelerated parallel processor can designed which can 
platform can 
processing, a hardware 
improve on the 
of standard parallel processors. This is 
employ hardware implementable algorithms. 
machines for parallel processing. The 
Pentium II machines running the 
Town has been 
Beowulf 
is used for the processing of data 
projects run to develop 
of an embedded
parallel processor that will more flexible use of parallel processing as
This thesis to investigate the of a node such an
embedded processor and to investigate the use of software configurable to
application at very high 
1.1 Project Objectives 

dissertation project is the development of a hardware """"'1".11 to create a 
that will enable reconfigurable hardware processing at UCT. 
The specific objectives of the project were to:
L 
various 
logic in orclces;sm 
processing 
Additionally, a review 
of the 
current use reconfigurable 
"HVUU..l be undertaken. parallel 
Devise a plan 
processing node. This involves 
selecting core components and where 
3. Design the prototype hardware of a node for an isotropic parallel
system.
4. Develop software and to enable the to run application
software.
5. Demonstrate simple examples of the hardware's capabilities and verify its
operation.
6. Keep the cost of the node low enough for the system to be compared to other
systems.
Page: 2 of 125 Radar Remote Sensing Group, Electrical Engineering, UCT
insight into the subject matter 
computing is given. It 
common methods 
Chapter 2 provides a 
various parallel 
reconfigurable 
overview of some 
V,",'_'''''JI architectures in existence. 
DrC'Ce~~SIrU! nodes is 
for implementing 
described. Various available COElliTIlerCl off the shelf (COTS) proices.sOI 
investigated. A processor was selected that was the most 
practical choice for the of the requirements of 
urable logic described choices and 
suitable. A high-level is formulated 
modules and their which the design would 
choice of networking topology is and the various methods 
1.2 Outline of Dissertation 

The following chapters of this dissertation are structured in the following manner.
general 
parallel the physical topologies protocols used are 
discussed. on reconfigurable logic benefits and 
problems can have. A discussion of the practical limitations 
in reconfigurable logic is provided. 
Chapter 3 provides and specifications of the 
quirements describe what the final system should be capable of 
deliverables are a and interpretation of the user 
out definite goals for The requirements analysis 
ments and deliverables how each 
Lastly, an acceptance test JIJ...,"'u."' .... tests 
which the project must pass to in 
ment of the 
Chapter 4 describes performed before the commencement the 
project and 
of a node for a parallel 
a processing node to 
It begins with a study of the 
looks at the core elements 
nodes. Following this is a review for the processor to is 
were analysed to u...,,,,'.,,,_ 





5 describes the hardware implementation processes. This 
"'".'.......,,, high-level conceptual low-level system design diagrams 
choices, component research and as well as the implementation 
cess. implementation of every Lll\J" of hardware is and their .....'\J 
with each other is described. 
Chapter 6 shows the taken to verify functionality of system. indi­
vidual component in is according to 
ification and requirements. A description the results of 
acceptance test 
in pre-codes in VHDL and schematic entry as as software 

dominantly and C as the acceptance test procedures is provided. 

Chapter 7 describes the of development of the for the various con­
figurable devices employed in the nrr\,,,,,roT design is 
sented with high-level detailed U"""'j:,l1 specifications and discusses 
implementation details. Details and their solutions are 
provided. 
Chapter 8 the development and implementation software 
node microprocessor and systems. chapter firstly the 
of software for the host that interfaces with the project hardware. The 
various development tools that were are and the implementation 
various utility software programs is described. Secondly, the and im­
plementation of for use on hardware itself is described. 
shows development software the project, with utility 
programs needed to initialise, control and test the system. is followed by 
the implementation of an system to the hardware environment 
allows the development and use of application software to execute without 
modification on hardware. 
Chapter 9 describes the testing procedures, verification and results from the final
implemented system. The algorithms and software used to benchmark the performance
of the system are laid out. The benchmarks are compared with other
processors to evaluate the performance of the system. Verification
procedures testing the validity of the results are also provided to
prove the correct operation of the hardware. The results are compared to
theoretical predictions of the hardware performance and a discussion of the analysis
is given.
Chapter 10 presents the conclusions as to the success of the project. Recommendations
for future hardware designs are provided.
Appendix A and PCB. are the _~,,.,..,.. u for the developed 
dissertation. 
from oscilloscope.VariousAppendix B Oscilloscope 
design specification. systemAppendix C TO""-"''''O 
Chapter 2 
Theoretical Background
2.1 Parallel Computing 
section gives a brief V....""""'A introduction to parallel the core 
components required in any It will also highlight techniques 
especially in embedded 1Ju.Au.A""" o:rocessmg which is the focus of this 
2.1.1 Introduction to Parallel Processing
There are two main types of multiple processor configurations used today: Single
Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD).
SIMD processors are usually difficult to design. They perform operations by
performing the same instruction on a number of parallel data inputs, producing a set
of parallel outputs. MIMD processors are more common. They typically consist of
multiple processing units, each executing separate instructions. MIMD processors fall
into two areas: Shared Memory and Distributed Memory systems.
Shared memory processors occur when processors are linked so that all have access
to the same memory space, taking care not to access regions of memory in a way that
errors would occur.
Distributed memory parallel processors have individual processors each with their own
memory. Each processor has its own program to execute and a communications channel.
For one processor to access the memory of another, messages must be passed between
the two processors.
Distributed Memory Parallel processing is the use of more than one processor linked
via a communications network to perform processing tasks that would not be possible
with a single processing unit. The advantage of this over a shared memory system is
that tens, hundreds or even thousands of these processors can be connected to
communicate. Each processor in the system is a node and is capable of operating
independently of the

others. These nodes may also run their own operating system with multi-threading,
allowing multiple programs to run on each node.
On a distributed memory parallel processor, the communication between nodes
is normally done by a set of libraries that work together to create a parallel virtual
machine. The processing power of the system depends on the communications
interconnects in the system as well as the parallel algorithm used. A good algorithm
keeps message passing to a minimum, as otherwise the processor can be kept waiting
unnecessarily for data to be moved. Also, for a system with a relatively large number
of nodes, an algorithm that is as close to 100% parallelisable as possible needs to be
used, according to Amdahl's law [1, sec 1.4.1]. Amdahl's law states that the maximum
speedup is limited by the serial fraction of the program: if a fraction s of the run
time is serial, the speedup can never exceed 1/s, however many nodes are added.
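The limit that Amdahl's law places on a large system can be illustrated with a short
calculation. The following is a minimal sketch in Python and is not part of the original
work; the function name `amdahl_speedup` and the example figures (a 5% serial fraction,
up to 1024 nodes) are assumptions chosen purely for illustration, and communication
overhead is ignored.

```python
def amdahl_speedup(serial_fraction, nodes):
    """Ideal speedup on `nodes` processors when `serial_fraction` of the
    run time cannot be parallelised (communication cost is ignored)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    # Assumed example: 5% of the program is serial.
    for n in (2, 16, 128, 1024):
        print(f"{n:5d} nodes: speedup {amdahl_speedup(0.05, n):6.2f}")
    # The speedup approaches, but never exceeds, 1 / 0.05 = 20.
```

Under these assumptions, going from 128 to 1024 nodes raises the speedup only from
roughly 17 to just under 20, which is why the serial fraction must be kept as small as
possible on a system with many nodes.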
2.1.2 Parallel Software 
There are two major types of software for parallel processing: shared memory models
for shared memory computers, and message passing libraries for distributed memory
systems. Message passing libraries are, however, also commonly used on large shared
memory machines like the Cray T3D. The message-passing paradigm is the dominant form
of software currently in use and matches the architectural design of a distributed
computer.
The parallel computing libraries make the task of writing software for a parallel
processor more focused on the algorithm, so the user does not need to worry about the
system topology or architecture. There are three industry standard libraries for
parallel computing: MPI (Message Passing Interface), PVM (Parallel Virtual Machine) and
SHMEM, which is run on Cray supercomputers. Software such as PVM can be modified
for a particular parallel processor simply by implementing the necessary low-level
message passing mechanisms. The application software will be unaware of these
implementation details.
When building a parallel computer, the systems programmers need to provide an
operating system and extensions to a parallel processing library for the system. The
most important part of the libraries is the communications subsystem, which needs to
use the hardware to its full capability.
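The message-passing model described in this section can be sketched in a few lines.
The example below is only an illustration under stated assumptions: it does not use the
MPI or PVM APIs, the `Node` class and the `send`, `recv` and `parallel_sum` names are
hypothetical, and Python threads with queues stand in for separate nodes and their
communication links. Each worker touches only its own `memory` dictionary and returns
its result to the master purely by sending a message.

```python
import queue
import threading


class Node:
    """Hypothetical node: private local memory plus an inbox for messages."""

    def __init__(self, rank):
        self.rank = rank
        self.memory = {}            # local memory, used only by this node
        self.inbox = queue.Queue()  # stands in for the communications link

    def send(self, dest, payload):
        # Deliver (sender rank, payload) to the destination node's inbox.
        dest.inbox.put((self.rank, payload))

    def recv(self):
        # Block until a message arrives, as a waiting processor would.
        return self.inbox.get()


def worker(node, chunk, master):
    node.memory["data"] = chunk                  # work on a subset of the problem
    node.send(master, sum(node.memory["data"]))  # one message back per worker


def parallel_sum(values, n_workers=2):
    master = Node(0)
    workers = [Node(rank) for rank in range(1, n_workers + 1)]
    chunks = [values[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(w, c, master))
               for w, c in zip(workers, chunks)]
    for t in threads:
        t.start()
    total = sum(master.recv()[1] for _ in workers)  # gather the partial sums
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(100)), n_workers=4))
```

Only one message per worker crosses the "network" here, in line with the earlier point
that a good parallel algorithm keeps message passing to a minimum.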
2.1.3 Hardware Configurable Parallel Processors 
The concept behind a hardware configurable parallel processor is that the electronic
circuitry in a node can be reprogrammed to perform the currently required operation
in the most efficient way. With the advent of advanced FPGA technology, these
logic devices can be reprogrammed while running to allow the processing logic to be
changed dynamically.
Presently, most projects using configurable hardware such as FPGAs use them to
create a configurable network or reconfigurable mesh. This allows the system to
implement various interconnection topologies to optimise various algorithms and to
experiment with arbitrary connection patterns.
or even in a system 
There 
simple 

FPGAs can be used to implement algorithmic processing cores for the type of
data processing required by the system. These can be performed many times faster
than with general-purpose processors, as it allows programmers to
optimise the circuitry and run multiple sub-parts of an algorithm in parallel.
2.1.4 Communication Architectures 
Various communication exist for distributed parallel 
its own strengths 
Depending 
of size and I.Uj;,vu.'I. ... " run on each one ..U 
and weaknesses. 
Important in parallel 
Connectivity: The ideal situation is when the system is fully connected, thus every
node has a direct link to every other node. This is impractical even for reasonably
sized clusters.
Degree of connectivity: This is the number of links from a node, or the number of
neighbours a node has. The higher this value, the quicker the communications, but it
makes the software and hardware more complicated.
Static: A parallel system is static when all the links are pre-defined and fixed.
Switch: A node or set of nodes that only perform communications and no processing.
A switch can be used to make the network topology dynamic or even fully connected.
It can connect nodes so that each can communicate with any other node on the same
switch directly.
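The reason full connectivity becomes impractical can be seen by counting links. The
sketch below is an illustration only, not a calculation from the original work; the
function names `fully_connected_links` and `mesh_2d_links` are hypothetical, and the
2-d array used for comparison is assumed to have no wrap-around edges.

```python
def fully_connected_links(nodes):
    """Point-to-point links needed so that every node reaches every other
    node directly: one link per unordered pair of nodes."""
    return nodes * (nodes - 1) // 2


def mesh_2d_links(rows, cols):
    """Links in a rows x cols 2-d array of nodes without wrap-around edges."""
    return rows * (cols - 1) + cols * (rows - 1)


if __name__ == "__main__":
    for side in (2, 4, 8, 16):
        n = side * side
        print(f"{n:4d} nodes: fully connected needs "
              f"{fully_connected_links(n):6d} links ({n - 1} ports per node), "
              f"2-d array needs {mesh_2d_links(side, side):4d} links")
```

The number of links in a fully connected system grows with the square of the number of
nodes, and every node needs one port per other node, whereas a node in the 2-d array
never needs more than four ports.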
In the case where a fully connected system is not possible, a network topology for
connecting the nodes must be designed. There are many topologies around and they all
aim to be either simple and cost effective or as close to fully connected as possible.
Bus type networks allow a type of fully connected system, but communications
are limited to the bandwidth of the bus.
Direct one-to-one connected lines allow for the best bandwidth and lowest latencies and
allow the communications to be simplified.
Some typical topologies are:
Bus: When a is used, it is a commercial 
OU!PVf'r all the commu-
may not be However mem­
ory busses allow for very performance but are limited 
to a few nodes. A cluster machines on a network 
running a Beowulf system is a typical example. 
Star: Each node is connected only to one central 
number of nodes that the central node can support is 
_,.---'---,---'----,_-'--_ bus like Ethernet 

Array: An n-way array of nodes, each with 2n lines to other nodes.
Typically a 2-d or 3-d array is used. There are multiple paths
from any given node to any other. This is the structure used
by the Intel supercomputers including ASCI Option Red
(1.8 TFLOPS).
Tree: Nodes are arranged in a tree configuration. Each node has
n children and one parent node except the top node, which
only has children.
Ring: Ring type structures can be formed by connecting opposite
edges of an array topology together. The picture shows a
ring formed from a 1-dimensional array.
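The array and ring topologies above can be made concrete by listing the neighbours of a
node. The following sketch assumes that "n-way" means an n-dimensional array and that
nodes are addressed by coordinate tuples; the function name `array_neighbours` and the
`wrap` flag are hypothetical and not taken from the original work. With `wrap=True` the
opposite edges are joined, which turns a 1-dimensional array into a ring.

```python
def array_neighbours(coord, shape, wrap=False):
    """Return the coordinates directly linked to `coord` in an array of the
    given `shape`. An interior node in n dimensions has 2n neighbours;
    with wrap=True opposite edges are joined (ring or torus)."""
    neighbours = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            pos = coord[axis] + step
            if wrap:
                pos %= size          # wrap around to the opposite edge
            elif not 0 <= pos < size:
                continue             # no link beyond the edge of the array
            neighbours.append(coord[:axis] + (pos,) + coord[axis + 1:])
    return neighbours


if __name__ == "__main__":
    print(array_neighbours((1, 1), (3, 3)))        # interior node of a 2-d array
    print(array_neighbours((0, 0), (3, 3)))        # corner node has fewer links
    print(array_neighbours((0,), (5,), wrap=True)) # ring built from a 1-d array
```

Coordinates are passed as tuples so that each neighbour can be built by slicing. Note
that with `wrap=True` and a dimension of size 1 or 2 the same neighbour is listed
twice, so such degenerate shapes are better avoided in this sketch.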
2.2 Example Parallel Processors 
Although many hundreds of parallel computers have been designed and built, very few
have been designed to specifically exploit the potential of configurable logic to
optimise the system's performance from a processing point of view.
Firstly, in an attempt to discover what traits a successful parallel processor has,
some of the most powerful and successful parallel computers not using configurable
logic are described.




ofparallel that was reasonably cheap 
compared to the commercial processors of time. processing nodes are 
The Paragon is a 
Cray T3E Multiprocessor 
ht : I .cray. sl I 3el 
a 2D nodes connected to a backplane for communications. 
that performs the message passing and 
Specialised on edge of array access. 
applications processor that provides user inteIface 
two one to run applications and the other is a 
dedicated communications processor. 




can connected via 
cessors and a PCI bus to 
are connected. Various 
pending on their function. 
to allow system to ,",VAAUU 
used no V,",\.<.:J.:J.'U,,", 
This is an older 
is a scalable shared-memory processor with nodes containing 
connected together in a 3D torus. Each node interconnect to
480MB/s of data, which supports the shared memory system. I/O access is via 
channel provides 267MB/s of bandwidth for every four 
incorporate advanced cache techniques to hide the 
To take advantage of these caches, 
incorporated into the compilers to take advantage of the 
T90 other vector processors 
opera­
These systems are true node 
connected to the main system memory with an 
Each is synchronised by a central clock 
memory system is considered a 
GigaRing system. 
The ASCI Option Red Supercomputer 
son/OVERVIEW.htmlhttp://www.cs.sandia.gov/ISUG97 
Timothy G. Mattson and Greg Henry, 
This super computer was developed as computers for the US 
Department of Energy (DOE) that had than the then current fastest 
supercomputers. Intel was challenged to build the world's first and currently only
> 1 TFLOPS (Trillion Floating point Operations Per Second = 1,000,000,000,000). It
utilises a 2D mesh interconnection structure controlled by custom mesh routing chips
providing four simultaneous 200MB/s to every other mesh routing chip. A
network interface chip on each node connects to a mesh routing chip and sets up a route
between two NICs though the Each node in the system has two pro-
such as RAID, ATM and FDDI cards
also run specialised operating systems de­
is designed with redundant components 
of hardware failures. This system 
was done in COTS processor chips. 
Kendall Square KSR1
ht :/ .cs.colorado. / ksr.html 
hardware supported "distributed-shared" 
they call ALLCACHE memory that allows 

each to reference any memory location in the The local memory then 
becomes a cache of another memory location. cache-miss prompt the 
to search first locally and then distributed for the location of addressed memory
search hardware is called search engine and it runs on a custom set of rings 
and directories for and memory 
Huinalu Linux SuperCluster 
http://www.mhpcc. doc/huinalu/huinalu-intro.html 
machine is in a stream of new highly 1J"'"""'""v' machines on COTS 
components. This machine has 260, 933MHz using standard IBM rack
mount computers. theoretical maximum processing is 478 GFLOPS. 
general-purpose components this machine cost 1/10th of an 
custom supercomputer. 
ht ://csepl. .ornl.gov .html 
CM-5 a large set processors divided into groups. partition has 
own processor, a manager. processor is a SPARC based processor with 
four vector units in The whole arrangement is a memory 
system. 
There are two communications systems in the CM-5. A data network and a control
network to which nodes are connected. network is to 
and global operations. The data network is used inter-node data 
at 20MB/s. 
machine supports a maximum 16,384 nodes a theoretical maximum 
speed 1000GFLOPS 
vpp Architecture 
://www.fujitsu.co. rtext/Products/ s/hpc 
/index.html 
Fujitsu high performance machines are Vector Processors (VPP)
on custom CMOS Each the VPP5000 example
over 8700 MFLOPS for the LINPACK benchmark. Each processing element
contains to 16GB of with the system up to 
The elements communicate on a crossbar switch supporting 1 
second in two directions simultaneously. The runs a operating 
provides parallel and high speed networking interfaces. 
maximum speed is rated at 1228 on 128 ele­
ments. 

2.3 Example Reconfigurable Processors 
Armstrong III
://www.lerns.brown.edu 
Armstrong is a 20-node parallel computer built to research merging
a processor and reconfigurable logic device in parallel node. Each node has
high-speed links, can allow the to be in numerous configurations.
contains a communications board and processor board. The communications
board contains a dedicated communications processor communications
The processor board contains a RISC and as well as memory 
and serial The system showed that the devices greatly accelerated 
performance of the system. 
ArMen 
http://ubolib.univ-brest. ~armen/armen -eng.html 
The processor is a parallel processor with each node containing a FPGA 
cessor. Each node has up to 4Mb of ram and has a processor with four 20Mb/s 
The interconnect architecture is various applications 
requirements. 
2.4 Reconfigurable Logic
many new speed computation and communication systems, the U\"'>lU....Y re­
quirements the are more demand. flexibility in the 
has been mostly concerned with software because was the only component 
the that was modifiable. With the of a new-generation FPGA devices 
that support run-time reconfiguration, the of reconfiguration a system can now 
include logic firmware flexibility. 
Firmware re-configuration of the in an device allows system design­
ers to implement optimised high-performance as well flexibility. 
Cost savings can also be achieved by using a single device for multiple functions, 
an appropriate configuration when required. 
current major problem with reconfigurable is that is a major lack 
design to enable Most static 
configurations that are loaded into device, as the development software evolves 
to catch up with the hardware, that reconfigure partially without 
other of the same device will be 0~UJ.fJ~"'A to implement. 
Chapter 3 

User Requirements and Specification 

This chapter develops the requirements and deliverables of the project as well as a
specification for the project hardware and software design. No past research into
hardware reconfigurable processors has been conducted at the University of Cape
Town, and thus the and specifications are largely drawn from the ba­
and on other reconfigurable and processing 
systems. 
Firstly, a user specification is provided which specifies the high level objectives
of the project. These make no reference to specific hardware or software
usage. The project deliverables are then developed from the user requirements and
are followed by an analysis of the user requirements. The system specification is
then developed, followed by a set of test specifications with which the system
must comply.
3.1 User Requirements 
The development of a software and hardware configurable parallel processor node has
the following functional requirements:
1. The system must demonstrate the principle of using a configurable co-processor
to accelerate common functions such as the FFT for use in radar processing, or the
Discrete Cosine Transform for use in processing. The system must also demonstrate
the parallel execution advantages of configurable logic.
2. A node shall be a low cost and power efficient processing unit that can run
as a stand-alone processor. All the nodes in the parallel system will be identical
from a hardware perspective (isotropic).
3. Each node must contain a communications unit, central processor, and a
configurable logic unit capable of interfacing with the processor.

4. 	 Each node must provide support for high-speed inter-node communications. 
5. Each node will run its own copy of an open-source operating system and parallel 
processing libraries. The nodes will form a MIMD processor. 
6. A minimum of two nodes must be built to demonstrate the nodes performing
parallel processing functions.
7. Speed comparisons 	of various networking topologies must be performed for 
common parallel algorithms. 
3.2 Project Deliverables 
The fundamental goal of this project is to produce a working node for a Parallel
Processing Unit (PPU) and to demonstrate a simple, hardware accelerated parallel
algorithm running on the node. The nodes are to comprise a configurable logic
unit and a general-purpose microprocessor with supporting hardware. Each node is
required to run an open-source operating system and demonstrate a parallel processing
example.
The system is to be configurable in software and hardware. Software configurability
means the ability to change the algorithms and software running on the system.
Hardware configurability is the ability of the system to change the behaviour of certain
logic components to optimise the system for specific processing needs.
As secondary goals, the following deliverables should be attained: the implementation
of networking protocols across the configurable communications infrastructure, and
the use of configurable logic for optimising various computations.
As tertiary goals, dynamic re-configuration to constantly optimise the system
performance and dynamic network configuration should be investigated.
The functional hardware platform is the major deliverable. Software for the
processors, configurable logic, communications interfaces, communications libraries and
parallel processing libraries is also part of the deliverables.
3.3 Requirements Analysis 
After considering the requirements, it is clear that no specific hardware design con­
straints are implied other than providing the required functionality. Therefore after an 
analysis of typical microprocessor systems and the requirements of a node in a parallel 
processor, the following more focused requirements have been developed. 
3.3.1 Silicon requirements 
The basic component requirements of the reconfigurable parallel processing node are 
illustrated in figure 3.1. 
Page: 14 of 125 Radar Remote Sensing Group, Electrical Engineering, UCT
Figure 3.1: Basic Component Requirements of a Node
1. 	 Processor, a high performance, low power embedded processor with the capa­
bility to interface with a: Communications controller, FPGA and Memory In­
frastructure. 
2. A high gate count FPGA with an interface to the processor's memory bus to
provide high-speed access. Advanced clock generation would be beneficial for
implementing high speed designs.
3. A low gate count communications FPGA with high speed interfacing technologies.
The most commonly provided interfaces being Low Voltage Differential
Signalling (LVDS) or PECL.
4. Ethernet-like communications infrastructure capable of at least 10Mb/s and
possibly 100Mb/s speeds.
5. Non-volatile boot memory for standalone system operation.
6. JTAG Test and Access port or similar debugging and testing interface.
7. 	 Serial communications (EIA-232) for system control and debugging. 
8. 	 High-speed inter-node communications infrastructure support. 
9. 	 Power supplies and system control components. 
10. Processor/system boot-up configuration if necessary. 
11. Status/debugging indicators. 
3.3.2 Mechanical requirements 
Ball Grid Array and other high density chip packaging products are to be avoided
to reduce system cost and the risk of manufacture problems or error. In future
designs, this limitation may be removed.
The physical dimensions and interfacing connectors must comply with the
backplane or architecture to which the node will connect, if any. These are
specified in the system specification and hardware design.
3.3.3 Interface requirements 
1. Ethernet or some other high-speed means must provide communications and
software downloading to each node.
2. Serial communications must be used for debugging and console use.
3. FPGA communications must use high speed (>100Mb/s) LVTTL, PECL or LVDS
signalling between nodes.
4. The FPGA must be memory mapped to the processor and provide an interface to
processing and logic.
5. Comms should have bus master or DMA capabilities for memory
transfer operations.
6. Parallel processing libraries will provide a technology independent interface for
communication between nodes.
7. The operating system must provide a TCP/IP stack over a custom communications
channel to allow simple communications between nodes.
8. For the prototype hardware design developed in this thesis, an interface to a
standard PC is required, across the PCI bus.
3.3.4 VHDL / Macro-function requirements
1. Communications controller.
2. Coprocessor functions, including an FFT or DCT processor.
3. System bus controller.
4. Advanced clock generation and deskewing.
5. Buffers for communications and processing.
Figure 3.2: Node interface requirements
6. FPGA configuration support 
7. System boot and configuration 
3.3.5 Software requirements 
1. Operating system (Linux or eCos) ported to a node. 
2. Ethernet or equivalent networking: Programming, control, messaging. 
3. FPGA run-time reconfiguration if FPGA supports it. 
4. Interfacing with FPGA processor and communications. 
5. Parallel processing libraries ported to OS/communications network.
6. Test suites. 
3.4 System Specifications 
3.4.1 Processing Algorithms 
1. FFT or DCT algorithm demonstration.
2. Parallel execution of algorithms. 
3.4.2 Communications
1. Custom interface via channels.
2. Ethernet or equivalent.
3. TCP/IP stack on communications network mesh.
4. Configurable inter-node topography.
3.5 Test Specifications
system will initially tested with a in order to 
firmware and port and test the operating to the platform. test pro­
tests will be developed to verify the functioning of the hardware and firmware 
Testing will performed using port and with test software in the 
If a direct to a PC system is implemented, this can be used 
to test system without the of the processor. 
Later testing at a over a communications link and will 
provide access to such as a remote if the Linux 
configuration as programming flash and setting 
node. 
Final testing will test processing algorithms running on the processor and FPGA,
verify the correct operation of those algorithms and finally demonstrate a parallel
processing example.
Chapter 4 
Concept Study 
The idea of a hardware and software reconfigurable parallel processor is not novel;
however, not many such systems have been developed. In most cases, the projects
have been either commercial or research projects using proprietary technology. This
makes open research in this area more difficult due to the lack of published
information and the lack of physical hardware.
A hardware configurable processor with the use of open-source technology hopes 
to further the use of open-source software for parallel computing and to develop an 
environment for further development and research. This chapter describes the concept 
study undertaken at the beginning of the project in order to determine the viability and 
further develop the requirements and specification of the system. 
Firstly, the requirements of the parallel node in general are discussed; this is fol­
lowed by more detailed looks at the processor and reconfigurable logic. A high level 
system design is specified followed by discussion of the various choices for inter-node 
communications. Finally the software requirements for the system are expanded upon. 
The result of this study was that the project goal of developing a node for
configurable parallel processing was determined to be within reach using currently
available technologies.
4.1 Parallel node requirements 
Essentially, each node needs to be capable of running as a stand-alone entity without
support from other nodes. As such, it will require some of the
basic elements needed for a microprocessor system:
1. 	 Processor: This can be either a stand-alone, System On Chip (SOC) or ASIC 
implemented processing device. This will provide the general processing and 
control requirements of the system. 
2. Memory: Memory for system and application software as well as data storage is
required.
(a) FLASH, a medium density non-volatile memory for program and
data storage.
(b) SRAM, a high speed, low-density memory for low latency,
high bandwidth data storage.
(c) DRAM/SDRAM, a high-density memory type with reasonably high bandwidth
for program execution and volume data storage.
3. Communications: In order for a node to communicate with other nodes, various
forms of communication may be required.
(a) Ethernet networking is a well-defined industry standard communications
bus, with average bandwidth capabilities.
(b) Backplane busses provide very high-speed connections but can
only support a limited number of devices.
(c) Custom communications can be provided through link layer chip-sets or
implemented using configurable logic. They have the potential for
producing high speed point-to-point or bus type connections.
(d) Standard serial ports. These low data rate communications can be used for
control and configuration as well as to aid in debugging.
4. Power supplies for the system components and microprocessor supervisory,
control and configuration devices.
5. Hardware interface and physical connectors.
To support the need for reconfigurable processing, a configurable device with
sufficient resources to support the algorithms required will be needed.
4.2 Processor requirements 

The scope of this project is to develop a node for parallel processing and the
focus is more on the design than absolute processing power. Where possible, devices
such as Ball Grid Array packages will not be used because of added cost and
manufacture problems.
4.2.1 	 System Requirements 
for the processor is processing 
4.2.2 Processor Selection 
orolces,sOI architecture. general, most come in a low 
count package configurations and have an interface to a support chip (companion) for 
m""rnA,n, bus and bus access. 
IDT [3] processor is a 64bit MIPS processor capable of 
to 333MHz with a maximum of MIPS (Millions 
structions Per Second) a 128pin This features a 
precision floating-point unit running up to 666 MFLOPS (Millions of Floating 
Operations Per It also includes extensions to 125 million 
and accumulates (MACs) per second, a full-featured virtual memory manager and 
and caches. '574 has a 32bit external data/address 
(SysAD bus) Other variations the 64bit MIPS pro­
cessors are available Toshiba all the same instruction 
sets and similar bus interfaces. The C compiler the full stan-
MIPS processor instruction sets. internal architecture of all MIPS nrc)ce:ssors 
are very similar internally and are backwards compatible which makes 
newer devices 
emloe(lOe;a PowerPC chips IBM and Motorola run up to and 
peripherals. devices typically come in the 300 to 
The major processing tasks of will be communications and 
point mathematics. DSP instructions may be used for particular pro-
but will usually need to routines written the proces­
language and carefully The Multiply aCcumulate 
of some processors can also up algorithms. 
processor (with companion memory if necessary) must able to 
.VA""'''''' to SDRAM, SRAM. FLASH and the preferably 
will mean that hardware design will be greatly "LU_....lAAU'""U 
with interfacing to complicated devices such as SDRAM will be reduced. 
to perform (DMA) transfers on bus will allow 
and processor to use a shared arrangement which will reduce the 
"\lprnp'>f1 of copying data the CPU. 
range 
to use packages very 
are supported by Linux, and the 
and most importantly GNU C Compilers (GeC). A port 
to project hardware will be greatly accele:ratt~ 
600 pin BGA types and are to with. 
are typically aimed at communication processors and support a host of 
can allow each node 
from the microprocessor. 
features. Most of the features are not needed for a processing node. 
The and most om~n-~murce operating the PowerPC 
range 
The Hitachi Super-H processor SH-4 runs up to ....V'-'lYH 
Dhrystone 1.1 MIPS. It comes in a 208 pin QFP or 256 pin
which could 
There is not a wide range of these 
these processors is 
at 206 MHz and come 
and are based on a 
future 
to complexity ratio 
to use packages 
available from multiple manufacturers. The processors 
units and added peripherals and interfaces are 
chips. and BSD's the 
aid of the chosen 
for vector manipulation mainly aimed at graphics 
accelerating certain 
<U1U'Ul .... and backward compatibility and future support 
Strongarm processors run 
package. 
ARM core. The 
this range of processors is questionable. 
the MIPS processors have the greatest 
processors evaluated. The are available in 
compatible 
to be dedicated 
4.3 Reconfigurable logic requirements 

an FPGA to be used as a viable device for implementing arbitrary C01)rOCeS,sor 
the the resources to support any synthesis able .......,v..... " 
to some limit. In cases, a parallel implementation of the algorithm 
can be created if 
._A"_-J parallel to parallelism 
tern. There is also the to possibly implement a reduced CPU core on 
This could allow more complicated functions to be 
on the FPGA and reduce amount control 
implemented CPU used to feed data into coprocessor units and results 
a shared as an example. This could also be used to perform 
communications provide intelligent switching of the inter-node commu­
nication links. 
A device in the of a 1 million 
ment a very wide 
copies of simpler to create a 
very expensive and a lower gate count device 
However 
probably be used for 
implementation. 
To experiment with high speed interconnects, the FPGA used must support high
speed I/O such as LVDS (Low Voltage Differential Signalling) serial links, which
preferably an IDT or 
The IDT '·....Af''''''.. 
perspective. At some 
different tasks. Therefore, the U'"''''E,H 
computing functions in 
node will contain a 
R4000 or R5000 64bit 
processor will be identical 
the system may 





can run up to 840Mb/s on some FPGAs. The FPGA-CPU interface must be capable of
handling the full burst rate of the bus. This should allow high data throughput
across this interface.
The Xilinx [4] support multiple (limited by pin count) 
622Mb/s LVDS channels off a single clock .....u not provide ...,uu,,,. 
(SERDES) circuitry nnUJPVI3T application notes 
All the Xilinx VH"A.u.vU these in low-level 
support up to 16 622Mb/s on prede­
pins. Only devices and KC range with than 400,000 gates 
support LVDS. do however provide hardware units however 
flexibility in their use. 
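As a rough illustration of the link rates quoted in this section, the following sketch multiplies a per-channel LVDS rate by a channel count. The 622Mb/s rate and the channel counts come from the text; the function name, the assumption that channels aggregate linearly, and the neglect of any protocol or encoding overhead are illustrative assumptions only, not measured results.

```python
def aggregate_lvds_mbps(channels: int, rate_mbps: float = 622.0) -> float:
    """Raw aggregate bandwidth in Mb/s for a number of identical LVDS channels.

    Assumes every channel runs at the same rate and ignores protocol overhead.
    """
    if channels < 0:
        raise ValueError("channel count cannot be negative")
    return channels * rate_mbps


# Illustrative channel counts: a single link, four links and sixteen links.
link_budget = {n: aggregate_lvds_mbps(n) for n in (1, 4, 16)}

for n, mbps in link_budget.items():
    # Divide by 8 to express the same raw rate in megabytes per second.
    print(f"{n:2d} channel(s): {mbps:7.0f} Mb/s ({mbps / 8:.1f} MB/s)")
```

Figures of this kind are upper bounds; the usable throughput of a real link would be lower once framing and clock recovery are accounted for.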
4.4 High level system design 
reconfigurable 
can double the integer 
provide a high speed floating point unit that operates in parallel with the integer
unit and employs dual issue support to create a superscalar pipelined architecture.
This means that the processor can execute an integer and a floating point instruction
simultaneously. These devices also support multiply and accumulate instructions
which can be utilised. The processors also all have advanced memory
management, which supports advanced operating systems like Linux and
can be used to create a totally hardware independent environment for applications to
execute in, which means that they can be ported to other hardware. Porting
between MIPS platforms will in most cases not even require a recompile.
MIPS processor an chip to provide memory 
bus support for the chip performs the 
of various memory without processor intervention. 
chip will connect to at 16MB of SDRAM memory possibly 
DIMM format modules or on board There will also be a 16Mbit 
memory for boot loader and system storage. 
FPGA will be connected with an SRAM or similar interface 
urable) to the memory bus. The Xilinx a flexability 
and support LVDS channels in all the 
will have a high-speed SRAM directly connected to it 
storage. This removes the ,,,v,,,v,,,,-, FPGA compete for 
usage the processor local bus. 
The FPGA will provide at least four LVDS channels for point-to-point inter-node
high-speed communication.
The low-speed memory devices (FLASH) may need to be located on a peripheral
bus to reduce the loading of the local bus.
The prototype hardware platform will interface to a standard PC via the
PCI bus. This will allow the devices in the system to be accessed transparently from
the PC, which will aid in development and debugging and provide communications
capabilities. A node may access devices on the bus in the system if it
provides bus master support. This will allow nodes to access standard PCI
ethemet At a stage, 
backplane for standalone operation. Because the nodes 
they must to specifications the PCI 
4.5 Communication infrastructure choices 
The various parallel processor systems examined in the pre-study use a variety
of networking topologies. Some were limited to system hardware capabilities
while others implemented complex communications processors. By using a
configurable logic device, we have the freedom to implement a wide variety of
communications structures if the basic physical interconnects are well planned.
The various options for a network topology are as follows:
1. A serial chain or loop of nodes. In this system, each node would contain two
links to its two opposite neighbours. This system has the advantage that switching
and communications software can be implemented in a very simple manner.
A message to a node can be broadcast in both directions (or one direction
in a loop) and passed from node to node until the addressed node receives it.
There is however a low level of connectivity and the total
communications bandwidth is limited to the bandwidth of one link.
2. A square matrix of nodes. This is a common structure and is most notably
used by Intel. Each node has a link to its neighbour on all
four sides. Messages are routed through the network using routing algorithms.
The total bandwidth of the network is higher than the chain topology as there
are multiple paths between any two nodes and a message only uses one path,
leaving the other paths for other messages.
3. Cubic and hyper-cube structures are also possible; however, the hardware and
software become increasingly more complicated.
4. A tree structure. These types of networks are useful when implementing an
algorithm where jobs are dispatched from a central node and all results are
returned back to that node. No inter-node communications other than with the
network. Each node will 
This structure can be used to using the FPGAs to switch 
through one access between 
reduce message by each node between the two communicating 11 v ...'"". 
basic system functionality and 
the mas­
ter to the furthest 
situation, the number of 
for a node with a number of 
master node is performed. 
star network. These networks are normally made of 
with a central communications 
rest the nodes. The aim this 
of nodes. 
llnr'Art for Ethernet 
Thus a bus structure need not be considered 
which 
bus may to make the system act as it was 
having much greater 10 .... aL/aUJ'U 
nodes 
to the 
however is to develop a low cost 
Based on structures, a square 
to implement allows for configurable 
bi-directional configurable ports for interfacing to 
of one or more of LVDS channel ...."',J"'......,...E'> 
The configurability 
a ring or torus by linking the outer nodes of 
can also to a tree structure 
of nodes configuration is simple 
This will give 
nodes. Each port may .... VA"1" ••". 
FPGA resources. 
supporting up to children. 
This project however will focus on the hardware design and
these options will be left for future work.
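To make the connectivity differences between the topologies discussed in this section more concrete, the following sketch computes the worst-case number of link hops between any two nodes for a chain, a ring, a square mesh and a torus. The node count of 16, the function names and the assumption of shortest-path routing are illustrative choices, not part of the design described here.

```python
from itertools import product


def chain_hops(a, b):
    """Hops between two positions on an open chain of nodes."""
    return abs(a - b)


def ring_hops(a, b, n):
    """Hops on a closed loop of n nodes, taking the shorter direction."""
    d = abs(a - b)
    return min(d, n - d)


def mesh_hops(a, b):
    """Hops on a square mesh; nodes are (row, col) and links join neighbours."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, side):
    """Hops on a mesh whose outer edges are linked together (a torus)."""
    return sum(min(abs(x - y), side - abs(x - y)) for x, y in zip(a, b))


def worst_case(nodes, hops):
    """Largest hop count over every pair of nodes (the network diameter)."""
    return max(hops(a, b) for a in nodes for b in nodes)


SIDE = 4                                      # hypothetical 4 x 4 = 16 node system
line = list(range(SIDE * SIDE))               # node indices for chain and ring
grid = list(product(range(SIDE), repeat=2))   # (row, col) pairs for mesh and torus

diameters = {
    "chain": worst_case(line, chain_hops),
    "ring": worst_case(line, lambda a, b: ring_hops(a, b, len(line))),
    "mesh": worst_case(grid, mesh_hops),
    "torus": worst_case(grid, lambda a, b: torus_hops(a, b, SIDE)),
}
print(diameters)
```

Under these assumptions the worst-case path shortens from 15 hops for the chain to 8 for the ring, 6 for the mesh and 4 for the torus, which is one way of quantifying why the higher-connectivity structures are attractive despite the extra links per node.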
4.6 Software requirements 
"'Hl"'U' for the system is to port an An""'''lh 
must ....."'...v ••" 
processing capabilities of a 
inter-node communications 
system to the 
node. Further work 
example. 
already open-source nrc"...,", 
All the software for this project is required 
selected will be open­
to built on open-source 
projects and tools. tools and operating 
source and most of the parallel computing libraries MPI and PVM
The operating available for running on are: Linux, 
eCos, NetBSD, L4, QNX 
NetBSD [5] runs on a 
as well as some Silicon 
based machines and on and Toshiba 
based MIPS palm-top The NetBSD 
provides support of platform independent including PCI, 
commercial busses. A strong point of NetBSD is that it is designed
specifically for portability and runs on multiple architectures. It can
applications with the aid of a compatibility mode or The GNU 
tools are primary development bases used the 
Linux [6] has been ported to the MIPS standard 
kernel and Linux. The Standard GNU are used to develop code the 
"''''".....1-'. those with an 
MIPS platform. Linux supports many of the R4000 and R5000 based MIPS 
controller. IDT MIPS processors
not feature thus should pose problems. 
eCos [7] supports the NEC Vr4300 64bit MIPS processor. eCos is a
hard-real-time system and thus does not use memory management and is statically
compiled. This makes changing applications or running multiple applications
simultaneously much more difficult.
The L4 microkernel is a real-time kernel and runs on the R4000 CPUs. As a
large percentage of it is hard coded assembler, it may be difficult to port. (see
User mode Linux can run under the L4 kernel; however, for this project, a real-time
system is not required and may prove unnecessary.
QNX is a non-open-source, free to use operating system that supports the IDT MIPS










Chapter 5 
Hardware Design and Implementation 

After the analysis and specification was performed, the hardware design
was developed. This chapter describes the hardware design process and the
implementation of that design. The design process is described starting with the high
level design showing major functional units and their methods of interaction.
Following this, a more detailed design of each unit and sub-parts thereof is
provided. The entire design would be too large to cover in a single low-level
description. Various difficulties and complicated design choices are explained in
further detail where applicable.
The design of the system and each sub-section was specified to conform to the
requirements and specifications laid out in Chapter 3. The design choices made are
indicated with motivations for the various choices given. Implementation details are
provided for each to aid an understanding of the low-level operation of the
components in the system.
5.1 Processing Node Components 
After the review, specific processor, companion devices, memory and FPGA
devices were chosen for the design. The major components needed to be chosen and
sourced early in the project, before the design had been finalised, due to long lead
times.
The processor chosen for the project was the IDT 79RC64574 64-bit MIPS RISC
microprocessor [3]. This part was chosen for a number of reasons. It has a 32-bit wide
external bus, which would make the hardware design much easier and reduce the
risk involved in designing with a 64-bit bus. The device provides a double-precision
floating point unit which operates in parallel with the integer unit through a dual-issue
architecture. Finally, the device is packaged in an easy to use 128-pin package.
The companion chip selected for the project is the Galileo GT64115 [9].
This was selected primarily because it interfaces to the same 32-bit SysAD bus as
the IDT processor. It also has an integrated PCI bus master interface, which would make
interfacing to the PCI bus simple. An on chip SDRAM and memory controller allows
for a 'glueless' interface to SDRAM. To access SRAM and FLASH devices however,
a bridge or decoder circuit is required. There are examples of such
interfaces in the datasheets.
The FPGA chosen is the Xilinx XCV200E, a 200,000-gate device in a 240-pin package.
This device supports up to 64 LVDS pairs, 114kbits of block RAM and 20 I/O
standards. The main reason for choosing this device was cost. Higher gate count
FPGAs are available for later use but at a much higher cost.
Other core components required are SDRAM for processor memory, FLASH
for program storage, SRAM buffers and supervisory devices. Also essential
are power supplies for the various components.
5.2 Design Choices 
This section presents three possible system implementation architectures and lists
the pros and cons associated with each. The implementation details differ from these
diagrams, which are from early in the design phase. They do however give an indication
of the basic architecture.
5.2.1 Choice A 
See figure 5.1.
The FPGA acts as the bridge between the local bus and the peripheral bus.
Advantages
• High-speed operation with minimal loading on memory/data bus.
• Direct 32-bit interface to FPGA maximises data transfer rates and allows DMA
access.
• FPGA can perform DMA transfers direct to SDRAM / bus. This reduces
CPU usage and doubles memory bandwidth.
• FPGA and CPU ROM allows stand-alone system operation (without host system
intervention).
• SRAM cache memory for high speed FPGA reconfiguration.
• Partial reconfiguration possible via direct PCI I/O operations.
• Single CPLD for system configuration and FPGA configuration management.
• Simpler routing of PCB.
Figure 5.1: Design architecture choice A





Disadvantages
• FPGA requires a simple configuration to allow access to CPLD and
SRAM for reconfiguration from the CPU and bus.
• Cannot perform direct I/O full configuration of FPGA, must be cached in SRAM.
• FPGA configuration errors will prevent system operation.
• Peripheral operation dependent on FPGA operation.
5.2.2 Choice B 
The FPGA resides on the local bus along with a CPLD to act as a bridge
to the peripheral bus.
Advantages
• High-speed operation with greater loading on memory/data bus.
• Direct 32-bit interface to FPGA maximises data transfer rates and allows DMA
access.
• FPGA can DMA direct to SDRAM / bus. This reduces
CPU usage for data transfers and doubles memory bandwidth.
• FPGA and CPU ROM allow stand-alone system operation (without host system
intervention).
• SRAM cache memory for high speed FPGA reconfiguration.
• Full/partial reconfiguration possible via direct/cached I/O.
• Hardwired peripheral bus through CPLD device not dependent on FPGA
operation.
Disadvantages
• Greater loading on the memory bus.
• Two CPLDs.
• More complex routing.
Figure 5.2: Design architecture choice B
5.2.3 Choice C 
See figure 5.3.
The FPGA resides on the peripheral bus, with only a CPLD device and SDRAM
on the local bus.
Advantages
• Reduced loading on memory/data bus.
• FPGA and CPU ROM allows stand-alone system operation (without host system
intervention).
• SRAM cache memory for high speed FPGA reconfiguration.
• Full/partial reconfiguration possible via direct/cached PCI I/O.
• Hardwired peripheral bus through CPLD device not dependent on FPGA
operation.
• Simpler routing.
Disadvantages
• Buffered access to the FPGA may be necessary.
• DMA operations may not be possible.
• Full bus width probably not available to FPGA, with performance implications.
5.3 High Level System Design 
The basic architecture from design option B in the previous section was selected
for the design. The major feature of this setup is that the FPGA is located on the
local bus of the memory controller along with a separate 'Bus CPLD' that will act as
a bridge to the peripheral bus as well as providing the logic to configure the FPGA.
The MIPS CPU requires various signals to be provided for operation.
Most importantly, the processor bus (SysAD bus) needs [3, 1] to interface with
a memory controller. The GT64115 provides a directly compatible interface,
which will serve this function. The processor requires various interrupt inputs,
which are provided by the active components of the system. All future
interrupt sources must be accommodated for in order to make the design more
speed. 
a 
Figure 5.3: Design architecture choice C
generic. The configurable logic devices provide further possibilities. The processor
also requires a boot-up configuration, which must be provided by a
ROM memory or dynamically generated. This will be one of the functions provided
by a 'Config CPLD'.
The memory controller does not require any interfaces itself; it provides
functions for the requirements of the system. The need for high speed, high-density
memory is met by SDRAM and the memory controller provides an
interface for controlling this memory.
The FPGA will have a connection to the local bus. By providing as many bus
signals as possible to the FPGA device, the configurability and options available for
implementation are kept high. It introduces the possibility for the FPGA to emulate
the memory controller if it has access to all required signals. The FPGA has the
following modes of configuration: JTAG, Serial bitstream and 8-bit Parallel. JTAG
programming is very simple and works reliably with a simple programmer. Support
for JTAG will be provided for initial system testing and as a backup in
the event that other methods do not work. Serial configuration requires a special
serial memory device connected to the FPGA. The FPGA's parallel interface
is the other method available for configuration. The interface is 8 bits wide
and can interface to a standard memory bus with some additional signals for the
control of the interface. The 'Config CPLD' will control the use of these signals
and allow for parallel programming of the FPGA via an interface connected to the
bus. FPGA devices have some internal RAM blocks that operate at very
high speeds. For complicated algorithms however, more memory is required
than is available in the device (the XCV200E has 114kbits). For this reason, a high-speed
4Mbit SRAM device will be provided with a dedicated interface to the FPGA.
The system requires a standard serial connection for use as a system console and
for debugging. Devices are available that will provide this interface and which
interface to the peripheral bus. The logic to implement serial links, on the
other hand, uses relatively few resources in an FPGA. By using the FPGA to
implement the link, the flexibility to change the physical and software interfaces when
required is provided. For these reasons, it was decided to implement the serial port
functions in the FPGA.
The non-volatile for if a is to 
operate stand-alone. A Linux kernel and simple filesystem, which is enough to get the
system fully functional, requires at least 1-2MB of storage. For this reason, a 16Mbit
(2MB) device will be provided for boot-up. The FPGA designs may also be required
to be stored in non-volatile memory. The XCV200E FPGA that
was to be used requires 1.442Mbits (180kB) of configuration data. The FPGA device
however is upward compatible to a 600,000 gate device which will require 4Mbits of
configuration data. In order to allow for this and to provide for the possibility of
multiple configurations, an 8Mbit FLASH device was chosen. This will allow
up to five configurations for the XCV200E to be stored.
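The FLASH sizing above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in the text; the function and constant names are illustrative, and it assumes that 1Mbit is 1000kbit and that no per-configuration overhead is needed in the FLASH.

```python
# Figures quoted in the text, expressed in kbit to keep the arithmetic exact.
FLASH_KBIT = 8000      # 8Mbit configuration FLASH
XCV200E_KBIT = 1442    # 1.442Mbit per XCV200E configuration
LARGER_KBIT = 4000     # 4Mbit for the upward compatible 600,000 gate device


def configs_that_fit(flash_kbit: int, bitstream_kbit: int) -> int:
    """Return the number of whole configurations that fit in the FLASH."""
    return flash_kbit // bitstream_kbit


if __name__ == "__main__":
    print("XCV200E configurations:", configs_that_fit(FLASH_KBIT, XCV200E_KBIT))
    print("Larger device configurations:", configs_that_fit(FLASH_KBIT, LARGER_KBIT))
```

Under these assumptions the 8Mbit device holds five XCV200E configurations, or two configurations of the larger device.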
The number of LVDS channels provided will depend on the available pin resources
after routing of other components to the FPGA and the number of lines available
to board connectors.
A watchdog timer and real-time clock device are desirable for system time keeping
and recovery from system errors. These typically interface to a microprocessor
bus or provide a serial interface such as I2C. The choice of device will
be decided by component availability and the interfacing complexity required. The lower
the complexity, the more favourable the device.
If cost and PCB space allow, a user FLASH device will be provided for
miscellaneous data storage.
Finally, the various components require power supplies that are capable of supplying
the required current in addition to complying with device tolerances in terms of
noise and accuracy. The processor and the LVDS I/O blocks of the FPGA require
a 2.5V supply, the CPLD devices, GT64115 and FPGA I/O require 3.3V, and the clock
oscillator requires a 5V supply. Switch mode power supplies are most likely to be
used due to the relatively high current requirements and to reduce power dissipation.
5.4 Functional Unit Design 
This section gives a detailed description of the design process for each of the
functional units in the design.
5.4.1 Microprocessor
The chosen microprocessor was the IDT 79RC64574 64-bit MIPS processor. The
internal functional structure to the device is very and 
mance. As can be seen in Figure 5.4, the processor has multiple control and execution
units. The two execution units, the 64-bit Integer unit and the Floating point unit,
operate in parallel, each with a five stage pipeline that is fed by a dual-issue
instruction fetch unit. This allows both units to operate at full speed simultaneously.
The instruction fetch unit is connected to the cache controller, which returns cached
entries and performs external accesses when misses occur. The processor also has a
control processor which manages interrupts, cache and memory management functions.
From an external view, the processor has interrupt, configuration, JTAG and
SysAD bus interfaces, which are shown in Figure 5.5. The processor clock is derived
from the system clock by multiplication using an on-chip Phase Locked Loop (PLL). The
PLL is sensitive to noise and thus requires its own power supply to ensure
proper operation. This was provided by filtering the PLL power supply through an LC
network, effectively acting as a low-pass filter. The processor has 7 interrupt
inputs: six normal interrupts and one Non-Maskable Interrupt (NMI). The sources of
normal interrupts are an interrupt from the memory controller, which can later be decoded
into memory errors, DMA and other interrupts, and an interrupt from each of
the Bus CPLD and Config CPLD, which are to be used for application specific
purposes. The possibility of accepting interrupts from the PCI bus directly is allowed
for by routing a PCI interrupt line through the Bus CPLD, which has 5V tolerant inputs.
The processor can only accept a maximum 3.3V input voltage.
Figure 5.4: RC64574 Functional Block Diagram
The processor configuration interface has five signals that need monitoring and
control. There are three reset signals, which are generated by the Config CPLD, and
two signals for processor configuration. The processor configuration
uses a serial bit-stream. The Config CPLD can generate this bit-stream and support for it
is provided. In the event that this is not possible, provision is made for using a serial
EEPROM to configure the processor.
The processor supports a JTAG test access port with boundary scan
and instruction execution. The JTAG interface can prove very useful for debugging
and the processor was added to the system JTAG chain.
The 32-bit MIPS SysAD bus interfaces to the
Galileo memory controller. A direct connection between the two devices, along with
the relevant pull-up and pull-down resistors, was made as per the recommendations
provided in the datasheets.
Other than these features, the processor provides no other on-chip peripherals
which need to be interfaced.
Figure 5.5: RC64574 Interfaces 
5.4.2 Memory Controller 
The Galileo GT64115 MIPS companion, PCI and memory controller device was se­
lected to provide access to the PCI bus and memory devices. It supports three main 
interfaces, a 32-bit MIPS SysAD bus, a 32-bit PCI interface (up to 66MHz) and a 
32-bit SDRAM and Device local bus interface. See figure 5.6. On the local bus, the 
memory controller supports up to four banks of SDRAM and provides five device 
chip-select signals, which includes a chip-select for the boot memory. 
Like the processor, this device uses an internal PLL device for clock generation 
and required a power supply filter for correct operation. 
The SysAD and PCI bus interfaces were connected up according to the specifications
in the datasheets. For certain signals on the PCI bus, such as the PRSNT1/2 signals,
more research and examples were needed to clarify their usage.
The local bus SDRAM interface was connected to two 16-bit SDRAM devices 
forming a single 32-bit 32MB SDRAM bank. The decision was made not to use an 
SDRAM module due to the difficulty in obtaining parts and the possibility of over­
loading and lengthening the high-speed bus. The FPGA and Bus CPLD devices are 
configurable logic devices and as such are very flexible. Because the chip-select sig­
nals are multiplexed on the local bus, all are available to each device. Thus, no specific 
chip-select signals are dedicated to either device. A convention was however specified 
for the usage of the chip-select lines, see Table 5.1. The peripheral bus bridge is the
logic implemented in the Bus CPLD to provide transparent access to devices on the
peripheral bus from the local bus. The local bus chip-select signal will be decoded into
address banks to address individual memory devices.
The FPGA local bus interface was provided with the full set of signals required,
including the memory controller's multipurpose pins, which can be configured to provide
DMA support.
Figure 5.6: GT64115 Interfaces
Table 5.1: Memory device assignments
Chip-select signal | Assigned Device
SCS[0..1]          | SDRAM Bank 0
SCS[2..3]          | Not assigned
Boot CS            | Peripheral Bus Bridge
The Galileo memory controller device has a fully software configurable system 
memory map which allows reprogramming of device widths, timing and address ranges. 
For processor boot-up, however, a default configuration is needed. The device provides
two mechanisms for initial basic configuration. Firstly, a series of weak pull-up and
pull-down resistors can be configured on certain local bus signals. On power up,
the busses are tri-stated and the signal level on each of these lines is sampled. This
allows various configuration options to be set. The alternative method is via
auto-configuration, whereby the memory controller reads a set of configuration values from
a special location in the Boot memory device. Provision for both these modes was
made. 
Figure 5.7: Local and Peripheral Bus Configuration 
5.4.3 Bus CPLD and Peripheral Bus 
The Bus CPLD device operation has been largely discussed in previous sections. The 
device chosen was a 144 macro-cell Xilinx CPLD in a 144 pin package [10]. This 
provided enough I/O pin resources to interface the local bus and peripheral bus along 
with providing various other interfaces. This device generates the address and control 
signals for the peripheral bus (master operation) and is a slave on the local bus. This 
device also generates the configuration signals required to program the FPGA. The 
system JTAG interface is routed to the I/O pins of the CPLD. This allows for the JTAG
chain to be accessed from the CPLD and thus, with the correct logic design, from the
system itself.
The peripheral bus was chosen to have an 8-bit (byte) wide data bus in order to 
reduce PCB routing complexity. The address bus was sized to address the maximum 
possible memory device on the bus. The processor boot FLASH is specified at 16Mbit
but provision was made for supporting a 32Mbit device. This meant that the address
bus needed to be 22 bits wide (2^22 words x 8 bits = 32Mbits). The SRAM device
chosen for general-purpose storage and FPGA configuration cache was selected on the 
basis of availability. The device chosen was a 16-bit wide, 4Mbit device, which had 
already been purchased for other projects. This would not however directly interface 
to the 8-bit bus. Fortunately, the device selected has two control lines for byte wide
accesses. During read and write cycles with only a single bank selected, the unselected
bank is driven to tri-state, thus not affecting that side of the bus. Therefore it was
decided that both banks could be hardwired together, provided that each bank was
selected individually. The Bus CPLD could then be programmed to use the two banks 
as separate chip-selects, effectively accessing the upper and lower banks of the SRAM 
as separate devices. 
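The address bus sizing and the byte-bank treatment of the SRAM can be illustrated with a short sketch. The address width calculation follows directly from the figures in the text. The bank mapping is only a hypothetical illustration of how two byte banks might be presented as two separate devices; the actual Bus CPLD decoding is not described here, so the function `sram_bank_and_offset` and its bank ordering are assumptions.

```python
MBIT = 1 << 20  # one megabit, in bits


def address_bits(capacity_bits: int, data_width_bits: int) -> int:
    """Return the address lines needed to reach every word of a memory device."""
    words = capacity_bits // data_width_bits
    return (words - 1).bit_length()


def sram_bank_and_offset(byte_address: int, bank_bytes: int = 256 * 1024):
    """Hypothetical mapping of a byte address onto the two SRAM byte banks.

    A 4Mbit, 16-bit wide SRAM holds 256K words, so each byte bank is 256KB.
    Returns (bank, offset), where bank 0 and bank 1 would use separate
    chip-selects. Which bank is the upper or lower byte is an assumption.
    """
    return byte_address // bank_bytes, byte_address % bank_bytes


if __name__ == "__main__":
    print("32Mbit x8 FLASH address bits:", address_bits(32 * MBIT, 8))
    print("16Mbit x8 FLASH address bits:", address_bits(16 * MBIT, 8))
    print("Byte address 0x40000 maps to:", sram_bank_and_offset(0x40000))
```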
5.4.4 FPGA 
The primary task for the FPGA in the requirements is hardware acceleration of pro­
cessing algorithms. The FPGA however also needs to provide some essential system 
support for the system to function. To interface to the processor, the FPGA is connected
on the memory controller local bus to provide 32-bit data access and provide the
possibility of DMA transfers. The FPGA is required to implement a serial controller
(UART). This can be performed in logic, but an RS-232 level translator device is
needed external to the FPGA to interface to other systems. An LT1386 EIA/TIA
compatible line driver was selected. It provided the capability for two UART channels;
however, only a single pair of Transmit and Receive signals was routed. The reason
for this was that there would not be sufficient physical space on the board for the extra
channel. The serial port will not be used as a primary communications channel in the
system and only one is required.
The FPGA is connected directly to a 16-bit high-speed 4Mbit SRAM device which
will provide dedicated external storage space for the processing algorithms
implemented. The JTAG interface of the FPGA is connected as part of the system JTAG
chain.
To use the LVDS I/O capabilities of the FPGA, the I/O blocks containing the LVDS
channels to be used must be powered at 2.5V. Because all the other interfaces
operate at 3.3V, there was only a single I/O block available that could be used
for LVDS. This I/O block provides for eight LVDS pairs (configured as four input
and four output) and an LVDS clock input. The LVDS clock input was not
implemented because the clock can be recovered with the use of clock-data recovery.
Twisted pair cabling is suggested for LVDS signalling and thus two RJ-45 jacks
were used, providing four channels over two cables. This will allow for
point-to-point communications systems in a ring or square matrix configuration
depending on the cable configuration.
Finally, the FPGA configuration interface was configured to allow parallel programming
on the SelectMAP interface as described in the datasheets and application
notes.
5.4.5 Clocking 
The processor, memory controller and SDRAM all require a common in-phase
clock to which they need to synchronise. The Bus CPLD and Config
CPLD also require clock inputs. The memory controller requires a clock greater than
the frequency of the clock for PC environment) but no more than 
The processor requires a clock of between 33MHz and 125MHz and an internal
pipeline clock greater than 100MHz. The SDRAM memory used has a maximum
frequency of 100MHz. The CPLD devices each operate up to 100MHz and the FPGA
is specified at around 200-300MHz.
The bus frequency was chosen to be 66MHz with the possibility of going up to
75MHz. With a high frequency bus clock and the number of devices running
from it, reliability problems can occur. If all the devices shared the same clock net, the
loading on that signal would be too high and possibly cause increased delays and
signal degradation. The other problem is that each device would not be guaranteed to
receive the clock in phase, which would cause synchronisation problems. To solve these
problems, a clock distribution system needed to be designed with terminations to prevent
reflections. A clock driver chip with ten output drivers was selected to manage the
system clock. It offers a worst-case 350ps skew between output clocks. Because each
device has its own dedicated clock signal, the nets are single ended. [11] indicates two
recommended methods for terminating single ended clocks: source and end termination.
End termination works well but requires a resistor-capacitor network. Source
termination requires only a single resistor at the signal source and was chosen for its
simplicity. In addition to the low skew of the clock distribution device, the signal
propagation delays on each net need to be very close to each other. For this reason, the net
lengths on the PCB were equalised by routing in zigzag patterns where necessary to get
the nets to within 2mm of each other.
Figure 5.8: Clock Distribution Architecture
One particular issue, described in an errata document for the memory
controller, was that the clock signal to this particular processor (RC64574) needed to be
slightly delayed relative to the memory controller's own clock input. The simplest method
of achieving this was to increase that clock net length to create the required signal
delay of 1ns.
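To give a feel for the numbers involved in delaying a clock by lengthening its net, the sketch below converts between trace length and delay. The propagation velocity of 15cm/ns is an assumed, typical value for a signal on an FR4 board and is not taken from the text; the real value depends on the board stack-up. The function names are illustrative.

```python
# Assumed propagation velocity on the PCB (roughly half the speed of light).
VELOCITY_CM_PER_NS = 15.0


def extra_length_cm(delay_ns: float, velocity: float = VELOCITY_CM_PER_NS) -> float:
    """Extra trace length needed to add a given delay to a net."""
    return delay_ns * velocity


def skew_ps(mismatch_mm: float, velocity: float = VELOCITY_CM_PER_NS) -> float:
    """Timing skew caused by a length mismatch between two nets."""
    return (mismatch_mm / 10.0) / velocity * 1000.0


if __name__ == "__main__":
    print("Extra length for a 1ns delay: %.1f cm" % extra_length_cm(1.0))
    print("Skew from a 2mm mismatch: %.1f ps" % skew_ps(2.0))
```

Under this assumption, the 2mm length matching contributes far less skew than the 350ps worst-case figure of the clock driver, while a 1ns delay needs a net that is considerably longer than the others.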
5.4.6 Power and Peripheral Devices 
The power supplies for the system needed to be designed to be capable of handling the
worst-case current requirements of the various devices. These worst-case calculations
are shown in Table 5.2. The input voltage from the PCI bus is 5V, and also 3.3V in
newer systems. For the 3.3V and 1.8V supply, it was decided to use a buck mode
(step-down) switching regulator from the 5V supply. This was mainly in order to reduce the
heat dissipation that would result from a linear regulator. For the 2.5V supply, a linear
regulator was used due to concerns about noise effects on the high-speed processor and
LVDS channels using this supply. A heat sink was specified for use with this regulator.
The switch mode supply needed special attention to be paid to the choice of
inductors and smoothing capacitors. High ripple current (low ESR) supply capacitors and
power inductors were used, with values calculated from datasheet information.
A real-time clock with EEPROM memory, watchdog timer and voltage monitor was
selected as a single device in an 8-pin SOIC package. This small device has a two wire
I2C bus interface connected to the FPGA, which will implement a controller. This device
satisfies the need for a voltage monitor, watchdog and RTC in a single device.
Table 5.2: Worst case power supply calculations
Device              | 5V Supply | 3.3V Supply | 2.5V Supply | 1.8V Supply
CPU RC64574 250MHz  | -         | -           | 880mA       | -
Galileo GT64115     | -         | 250mA       | -           | -
Xilinx XCV200E      | -         | 400mA       | 20mA        | 1000mA
SDRAM (128Mbit)     | -         | 2x140mA     | -           | -
SRAM (4Mbit) - 12ns | -         | 2x120mA     | -           | -
FLASH - 90ns        | -         | 2x12mA      | -           | -
Xilinx XC9572XL     | -         | 60mA        | -           | -
Xilinx XC95144XL    | -         | 110mA       | -           | -
Clock Distribution  | -         | 60mA        | -           | -
RTC XC1227A         | -         | 2mA         | -           | -
LTC 1386            | -         | 400uA       | -           | -
66MHz clock         | 20mA      | -           | -           | -
Total               | 20mA      | 1426mA      | 900mA       | 1000mA
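The totals in Table 5.2 can be reproduced by summing the per-device currents on each supply rail. The following sketch does this with the values from the table; the dictionary layout and function name are illustrative only.

```python
# Worst-case currents from Table 5.2, in mA, grouped by supply rail.
LOADS_MA = {
    "5V": [20.0],                                   # 66MHz clock oscillator
    "3.3V": [250.0, 400.0, 2 * 140.0, 2 * 120.0,    # GT64115, FPGA I/O, SDRAM, SRAM
             2 * 12.0, 60.0, 110.0, 60.0,           # FLASH, CPLDs, clock distribution
             2.0, 0.4],                             # RTC, line driver (400uA)
    "2.5V": [880.0, 20.0],                          # CPU, FPGA LVDS I/O
    "1.8V": [1000.0],                               # FPGA core
}


def rail_totals_ma(loads):
    """Return the total worst-case current per rail in mA."""
    return {rail: sum(values) for rail, values in loads.items()}


if __name__ == "__main__":
    for rail, total in rail_totals_ma(LOADS_MA).items():
        print("%s rail: %.1f mA" % (rail, total))
```

The 3.3V rail sums to 1426.4mA, which the table rounds down to 1426mA.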
5.5 System Configuration Options 
One of the aims of the hardware design was to make the system flexible and config­
urable in order to provide a wide range of possible configurations and reduce the risk 
of a design problem with an inflexible design. 
The configuration options provided for the Galileo memory controller are shown 
in Table 5.3. The interrupt and miscellaneous configuration is shown in Table 5.4. 
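As an illustration of how the DIP switch settings listed in Table 5.4 combine, the sketch below decodes a 6-bit switch value into the options named in that table. The bit packing (switch n read as bit n of an integer, with an "on" switch reading as 1) is an assumption made for the example, and the function name is hypothetical.

```python
def decode_dip_switches(value: int) -> dict:
    """Decode DIP switches 5..0 into the options listed in Table 5.4.

    Assumes switch n is bit n of `value` (a hypothetical packing).
    """
    return {
        # DIP SW 1..0: CPU PLL multiplier
        "cpu_pll_multiplier": (2, 3, 4, 5)[value & 0b11],
        # DIP SW 2: CPU Timer Int 5 enable
        "timer_int5_enabled": bool((value >> 2) & 1),
        # DIP SW 4..3: CPU write mode
        "write_mode": ("R4400", "Reserved", "Pipeline", "Reissue")[(value >> 3) & 0b11],
        # DIP SW 5: address to data delay
        "addr_to_data_delay": "Fast" if (value >> 5) & 1 else "Slow",
    }


if __name__ == "__main__":
    settings = decode_dip_switches(0b110101)
    print(settings)
    # With a 66MHz bus clock the pipeline clock is the bus clock times the multiplier.
    print("Pipeline clock: %d MHz" % (66 * settings["cpu_pll_multiplier"]))
```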
5.6 Printed Circuit Board Design 
The printed circuit board design had the constraint that the hardware needs to be
compatible with PCI in terms of physical size and layout. The board was specified to be
a six layer design in order to keep manufacturing prices lower, although for the
prototype, additional layers could be accepted if necessary. Components were restricted
to a single side with only very low profile passive components allowed on the back of
the board as per the PCI clearance constraints. Where at all possible, surface mount
components were used. The design was specified to keep EMC considerations in mind
as the hardware was required to interface to a PC system, which typically produces a lot
of interference and may be susceptible as well. (See [12] for EMC design)
Table 5.3: Galileo GT64115 Configuration
Resistor/Jumper | Configuration
R43, R44        | Swapped CS[3] and Boot CS (PCI BAR)
R45, R46        | Swapped SCS[3:2] PCI BAR
R47, R48        | PCI Expansion ROM enable
R49, R50        | CS[3] and BootCS PCI enable
R51, R52        | Internal Registers PCI enable
R53, R54        | Autoload enable
R55, R56        | CS[2:0] PCI enable
R57, R58        | SCS[3:2] PCI enable
R61-R64         | BootCS, CS[3] bus width
R65, R66        | PCI Conditional Retry
R35-R42         | PLL Configuration
Table 5.4: Interrupt and other Configuration
Resistor/Jumper | Configuration
J5              | GT64115 Int Enable
J6              | PCI Int B to CPLD
J7              | PCI Power Required (Open 15W, Closed 7.5W)
J8              | PCI Clock to System Clock
J10             | FPGA - No Config Pullups
J11             | FPGA - Config Pullups
DIP SW 1..0     | CPU PLL: '0' - x2, '1' - x3, '2' - x4, '3' - x5
DIP SW 2        | CPU Timer Int 5 Enable
DIP SW 4..3     | CPU Write Mode: '0' - R4400, '1' - Res., '2' - Pipeline, '3' - Reissue
DIP SW 5        | ADDR to Data Delay: '0' - Slow, '1' - Fast
The most important aspects of the PCB design were the bus and high-speed track
layout and the component placement to optimise this. The local bus is a 32-bit wide
bus with additional control signals running at 66MHz to 75MHz. A circuit board trace
typically requires termination when the propagation delay on the net exceeds the time
of 1/10th of a wavelength. This gives a maximum net length of 20cm.
It was therefore important to keep the bus net lengths normalised and shorter than
20cm in order to avoid having to terminate the bus.
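The 20cm limit can be reproduced from the 1/10th rule with one extra assumption. The sketch below takes the upper bus frequency of 75MHz from the text and assumes a propagation velocity of 15cm/ns, a typical figure for FR4 that is not stated in the text. It is offered as a plausible reconstruction of the calculation rather than the original derivation.

```python
def max_unterminated_length_cm(clock_mhz: float,
                               velocity_cm_per_ns: float = 15.0,
                               fraction: float = 0.1) -> float:
    """Longest net whose propagation delay stays within `fraction` of a clock period."""
    period_ns = 1000.0 / clock_mhz
    return velocity_cm_per_ns * period_ns * fraction


if __name__ == "__main__":
    for f_mhz in (66.0, 75.0):
        print("%.0f MHz: %.1f cm" % (f_mhz, max_unterminated_length_cm(f_mhz)))
```

With these assumptions the 75MHz case gives 20cm, and the 66MHz case is slightly more relaxed.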
Due to the specification to try to stick to a six layer board, a lot of effort was required
to achieve this. The hardware design uses four separate supply voltages. Fortunately, the
supplies are each only required in localised areas on the board. This allowed the use
of segmented power planes, reducing the need for additional physical planes.
All the clock and bus nets were routed first in an effort to minimise the number of
layer changes and net lengths. Design rules such as no 90° corners and equal track
spacing and widths were used to help improve signal condition. All bus nets were
routed in a daisy-balanced configuration and the bus devices were arranged to support 
this configuration. The support and low speed signals were routed around the critical 
nets towards the end of the PCB design phase once all critical paths had been finalised. 
Various test points for the power supplies were placed around the board for testing
the board. Some of the remaining free I/O lines on the FPGA were routed to a DEBUG
header for testing purposes.
The switch mode power supplies and smoothing capacitors were placed together 
and as far as possible from the logic devices. This was an attempt to keep the device 
power supply voltages as noise free as possible. Each device power pin was
additionally decoupled with a 100nF capacitor for high frequency supply demand changes
and 22uF tantalum capacitors were distributed around the board to help with lower
frequency supply demand changes.
The LVDS and DB9 serial port connectors were placed on the East edge of the
board so that they would be accessible from the PCI slot interface panel of a
standard PC. The PCI edge connector was hard gold plated for improved contact
conductance. Small copper planes were placed underneath the power supply devices with vias
through the board to help increase the surface area for heat dissipation. A ground plane
was placed beneath the processor device on the top layer as the processor contains an
exposed metal heat sink on its underside.
The entire design was hand-routed due to the complicated power plane design and 
special high-speed routing requirements. This allowed precise control of the board 
design to meet the requirements. The SDRAM devices were routed in a manner that
optimised the net placement; however, this required track widths smaller than the
manufacturer's recommended minimum track widths. After consultation with the manufacturer,
it was confirmed that localised areas on the top and bottom layers may have track
widths smaller than the minimum size specified. This greatly simplified the SDRAM
routing, which otherwise may not have worked within the six layer constraint.
Figure 5.9: Top side, unpopulated project hardware
5.7 Conclusion 
This chapter gave a quick summary of the hardware design process and some of the 
decisions made. The result of this work is a hardware design capable of meeting the 
system requirements. Each node has a microprocessor with boot memory and
high-density program memory, an FPGA for implementing system interfaces and
processing, and high speed communications links for inter-node communications.
Four PCBs were manufactured at Trax Interconnect and the majority of the com­
ponents were placed by Rhomco electronic assembly services. 
This hardware design is a first revision prototype design and thus was designed 
with a major focus on risk reduction. This meant that the fastest and most optimal 
solutions were not always favoured over more reliable and proven ones. Future projects 
may enhance the design to improve performance and features. 
Figure 5.10: Bottom side, unpopulated project hardware 
Figure 5.11: Top side view, populated project hardware
Chapter 6 
Hardware Verification 
This chapter describes the process of verifying the design and testing the hardware for
errors and compliance. This process was ongoing and ran in parallel with the firmware
and software design phases described in the following two chapters.
This chapter begins with a description of the initial testing performed on the PCBs
before placing any of the components. This is followed by a description of the
testing of the power supplies and the configurable logic devices. The memory
controller was then tested, followed by the PCI interface. Finally, when the
whole system was functioning, the processor operation was tested and verified.
6.1 PCB Inspection 
The PCB inspection was relatively simple. Each individual PCB had already been
flying probe tested after manufacture to verify compliance with the Gerber design
information provided. However, errors in the design may have existed which would not have
been detected. The most important feature to test was the power plane connections
and isolation. If any short-circuits were present between any of the power planes, the
hardware would not function and could be damaged if powered up.
A simple digital multimeter was used to test the power plane isolation as well 
as testing for continuity between distant points of the same supply plane. All the 
planes were verified to be isolated from each other as required and the planes were all 
continuous from source to load. 
A quick visual inspection of the design was also performed with a printout of the 
PCB design as a reference. Everything appeared to be correctly manufactured. 
6.2 Power Supply 
The switch mode power supply devices selected for the project had not been previously
used or tested and thus full compliance testing was performed to verify their
operation. The power supply devices and all the supporting components were manually
placed on the PCB in the laboratory. For each supply, the power was slowly applied to 
the board by increasing the current limit until constant voltage operation was achieved. 
If any problems existed, the current consumption would greatly exceed the power
supplies' quiescent current requirements. By using the current limit, damage to the devices
can be prevented.
The switch mode power supplies were initially verified to generate the correct out­
put voltage levels at 'no load' conditions even though the device specifications indi­
cated that a load was required. Resistive loads were used to progressively test the 
power supply output current up to the rated 3A. The output voltage was observed to 
drop slightly with the increasing load but never fell below the specifications. An
oscilloscope was used to measure the ripple voltage produced. Initially, the ripple voltage
was measured at 80mV (rms); however, after placing the smoothing capacitors this was
reduced to levels below the noise level of the oscilloscope (<2mV). Device power
supply ripple and noise tolerances are not given for most of the devices; however, noise
levels in a standard PC environment are generally orders of magnitude greater than
this and thus correct device operation can be assumed. During the 3A load test, the
heat dissipation observed from the switch mode devices was negligible, with the devices
only becoming slightly warm to the touch (approximately 40°C).
The 2.5V supply is generated from the linear regulator and was also tested. No 
detectable noise was observed on the output. Linear regulator devices are typically 
very stable and with low noise and do not normally cause problems. The only problem 
with them is power dissipation. The heat generated by the device with a 1A load from
5 Volts is 2.5W. With the heatsink rating of 57°C @ 2W and the additional heat removal
designed into the PCB, no overheating was observed.
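The 2.5W figure follows from the voltage dropped across a linear regulator multiplied by the load current. A minimal sketch of that calculation is given below, using the 5V input, 2.5V output and 1A load from the text, along with the 900mA worst-case total for the 2.5V rail from Table 5.2. The function name is illustrative.

```python
def linear_regulator_dissipation_w(v_in: float, v_out: float, load_a: float) -> float:
    """Power dissipated as heat in a linear regulator, ignoring quiescent current."""
    return (v_in - v_out) * load_a


if __name__ == "__main__":
    print("1A load: %.2f W" % linear_regulator_dissipation_w(5.0, 2.5, 1.0))
    print("0.9A worst case: %.2f W" % linear_regulator_dissipation_w(5.0, 2.5, 0.9))
```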
The power supply devices were found to produce the correct output voltages with 
no detectable noise and each was capable of supplying the specified maximum current 
requirements. Finally, the clock and clock distribution components were placed and 
their operation verified. The clocks are critical for the functioning of the logic devices 
in the system. The 1ns clock delay designed for the processor was difficult to verify
due to the sampling frequency of the oscilloscope available.
6.3 Configurable Logic 
The passive devices (resistors and capacitors), memory, clock and the configurable 
logic devices were then placed by a contract manufacturer. The first tests to be con­
ducted were to again test the power supply isolation. Any design errors in which power 
pins had been routed to ground pins for instance would immediately be apparent. The 
tests showed that at least no power pins and ground pins had been wrongly connected. 
Power was applied to the board by slowly increasing the current limit until the 
power supply switched into voltage limit mode meaning that no short circuits or ex­
cessive current drawing problems were present. If the current usage of the board had 
exceeded a limit specified by totalling the quiescent current requirements of each
device, a fault could be assumed. Further testing would then have been required to locate
the fault.
It was found that power on the board was not causing any of the devices to heat
up and current consumption was as expected. Verification of the configurable logic
devices was then started. All the configurable logic devices used have a JTAG test
access port. The objective was to establish communications with the devices
using the JTAG interface in order to configure them.
A link was used to the processor JTAG 
placed on PCB. Xilinx JTAG downloader to 
was on VeroBoard as per the Xilinx provided design. 
Initial attempts to probe the chain using JTAG failed. The software
indicated that no devices were present in the chain. After some investigation, it was
found that the problem was with the download hardware and that a jumper for the
configuration mode needed to be set. The JTAG software finally detected the
correct devices in the right order on the chain. This verified that the power
connections to each device were correct and that each was functioning internally.
The was the device to tested. A design to flash the was 
written in The compiled code however configuration 
the JTAG interface although indicating correct operation, produced no change 
the state of the A setting in the software configuration options was found to 
cause this problem the FPGA configuration was successfully tested. tests 
were used to check the functionality the Bus and CPLD. 
The CPLD design was the first to tested. While no devices 
reset delays were IJ"'LJ'\.'H"~ on the state of the reset signals were present, 
and on the oscilloscope. were verified to be the same as those 
The push-button 
ate the correct behaviour and ensure the logic was functional. The 
brown-out reset detection the MAX811 device was also verified by reducing 
power supply current limit to the point that the system supply voltage started to 
triggered MAX811 to assert its reset output and the to perform 
the programmed response. 
Work began on the emulation of the local-bus to test Bus design, 
this is in 9.1.2. 
6.4 Memory Controller
Once correct of the devices peripheral mem­
was established, the processor and controller devices were placed on 
The CPLD was programmed to keep the processor reset during this 
testing phase. Again, a test power planes was conducted to check PCB de­
sign errors. These tests found no problems with the memory controller design. 
Resistors to specify power-up configuration GT641 were then soldered 
ntp'rt<l'" 
the host motherboard. J:SeI'ore 
Any 





unused Pentium 1 was used as the host. 
On with the hardware in the PC, the did not respond 
concern and the power was off. After an 
on the card had heated or seemed unusual . 
a ' PCI Retry 
option of the was identified. The initial 
interpretation of use 
enabled which meant memory controller HJ....,"""-' 
received a signal local processor to indicate that 
...,<11.. "...,.... 
...,"'••"'''''''-'''''-' and during the 
configuration option was .nl'''n"·Pl't 
all 
of 
DrCice~;s01 requests, thus if PCI accesses 
was operational. 
host PCI PC computer 




into position. was for a little-Endian default configuration. 
On providing power to it was observed that the memory controller 
device was warm to about the power usage was 
not provided. The the use of a heatsink was 
reduced. This removed any concerns about the devices 
The first major test was performed was 
a PC by inserting it into a 
on the PCI edge connector was double checked to make sure that 
nets and device pins were to the interface. 
critical failure of the processor card or PC hardware because the 
capable of providing 
Once it was decided that the hardware was ready for testing in the 
ceed. The CPU was reset with no code present it to run. memory controller 
thus kept the system waiting indefinitely. By simply changing the resistor 
setting, the started up correctly with the identified the Galileo 
memory controller device on the PCI bus. The Linux on the host's 
PCI probe the device but no drivers were as expected for the de-
correct layout and design of interface on the Galileo 
testing the memory controller was to access the memory on the 
across the PCI bus. was very important in that it 
entire memory work on getting the 
memory requests in the same 
it could be assumed that 
was written. This was firstly 
used to test access to the SRAM device on Bus. Various 
were overcome and a greater knowledge of controller and PCI 
After finding and fixing various was able to read from 
through the Config CPLD and controller from the PC. 
was the last component of the system to be verified. A 
sequence is required to by the SDRAM devices in order to 
Some support for the special bus commands sent to the SDRAM are 
memory controller SDRAM initialisation sequence was 
sent from the PC Linux driver and the SDRAM was tested. The SDRAM was set up
for 2CLK precharge, 4 bank interleaving, 2CLK CAS/RAS latency and 8 data bursts.
Two prototype boards were being built during this hardware verification stage. On one
board, the SDRAM was working while on the other, specific bits in the data read back
from the devices were incorrect. It was found that some of the SDRAM pins on the
one card had 'dry joints', or pins that had not soldered correctly. This was caused partly
because the PCB footprint for the SDRAM devices was too small and solder did not
make proper contact on all the pins. Each dry joint needed to be carefully re-soldered
and tested. This solved the problem with the SDRAM memory.
6.5 Processor 
The processor was placed at the same time as the memory controller although it was
initially held in reset. During the PCB testing phase, a very low impedance was detected
between the power and ground of the processor. After closer inspection it was
discovered that a design error had caused the processor's PLL power supply pins to
be connected the wrong way around. Fortunately the PLL supply was filtered and
thus moving the components and using link wire corrected the problem. On one of
the boards, a low impedance in the order of 45Ω still existed between the processor
power supplies. The cause of this low impedance was never found; however, device
operation was not affected.
The processor operation was only tested once the memory system was verified to be
operational. The first step taken was to test the CPU 'ColdReset' serial configuration
sequence. This involved testing the Config CPLD code and using an oscilloscope on
the serial channel to capture and verify the sequence. The processor PLL supply was
also tested with the oscilloscope to check for supply ripple characteristics from the
PLL. A weak 200MHz ripple verified that the on-chip PLL was operating in X3 mode
as specified: 66.7MHz * 3 = 200MHz. Various changes to the CPLD code were
implemented in order to generate the correct configuration sequences.
The processor was then ready to be booted. The memory controller was configured 
to allow the CPU to boot from the SDRAM memory. A test program was loaded into 
the SDRAM and the processor reset released. An oscilloscope was used to monitor the 
processor's SysAD Bus interface to the memory controller to check for bus activity. 
After various processor configuration changes, the system setup was correct and the 
processor operation was verified by running test code and verifying the results across 
the PCI bus. 
Chapter 7 
Firmware Implementation 
The node contains three configurable devices in which
designs are required to be in place for the hardware to
function. The language used for the designs implemented in
this thesis was VHDL. This provides a method for describing digital logic
in a high-level programming language style. Some of the important VHDL
implications are discussed in section 7.1 of this chapter.
Each individual design used during the course of the project, including test and
verification designs as well as firmware, is documented. Each design is
documented with a requirements statement, specification, various options and
decisions made, as well as implementation details.
7.1 VHDL design implications 
VHDL stands for VHSIC Hardware Description Language, where VHSIC stands for Very
High Speed Integrated Circuit. It is an IEEE standard logic description language which
resembles a computer programming language.
Designing and implementing logic designs requires very specialised software that
is generally manufacturer specific. The design process which is followed is
as follows: A design is started with a set of requirements, which are translated into a
specification. VHDL is written to describe the design in a standard programming
language style. Very sophisticated synthesis and compiler software is
then used to infer logic elements and connections from the VHDL descriptions. A
net-list is produced as the result of this operation. This net-list
is usually not hardware dependent but may contain elements not directly
translatable to the target technology. The net-list and various other net-lists from
other sources, such as vendor specific macros, are then brought together in a
synthesiser that produces a net-list consisting only of elements available in the logic
device. Place-and-Route software is then employed to allocate logic and
routing resources in the device to implement the design. In an FPGA the routing
delay between logic elements is path and location specific. Thus the place and
route process is comprised of an iterative placement and net-delay analysis procedure
that attempts to optimise the performance of the design to a set of user specified design
constraints. The logic designer needs to be very aware of the capabilities of the
compiler and target technology used. Very complicated logic equations or circuits can
take a long time to produce a valid result. This time may be longer than the period
of the clock used for the design. This then requires a redesign of that block to either
reduce the complexity or use techniques such as pipelining and manual placement
to achieve the performance goals. The designer also needs to be aware of the various
VHDL constructs and what logic elements are instantiated or inferred from them. Often,
the incorrect usage of language constructs can create a design that either performs
very poorly or not at all, even when the VHDL appears to be logically correct.
At two stages during the design process, the entire design or a particular module
may be simulated. The logical operation of the system may be simulated directly from
the compiled net-list. This provides no timing information but can be used to quickly
test the design logic. This simulation does not require a full compile process and
contains no routing and timing information and is thus much faster in runtime. The
final design as implemented on the device may also be simulated, which includes actual
silicon path and logic element delay information. This type of simulation can be very
useful to analyse asynchronous operations and the response to external stimulus or the
device's own output.
All the CPLD and FPGA devices used in the ERPCN01 hardware are Xilinx products
and use the same compiler and synthesis tools. The compiler used in this project
was a Xilinx vendor release of the Synopsys FPGA Express compiler, version 3.5.1.
The Xilinx Foundation 3.1-r8 software package was used for all the logic design and
simulation.
7.2 Configuration CPLD 
The configuration CPLD device is responsible for system reset control and processor
boot-up configuration. It also has non-dedicated connections to the Bus CPLD and
FPGA for general-purpose usage, and an interrupt line to the processor is also provided
but unassigned.
System reset control is the primary function of this CPLD. The project hardware
has three system resets and five reset sources capable of producing different reset
conditions and sequences. The Bus CPLD device is responsible for monitoring the reset
inputs and synchronising the reset outputs. It also needs to generate the timings
required by the various devices it controls.
Only a single design was implemented on this device due to its specific function.
The design underwent a series of developments that fixed problems and added or
removed features as required.
79RC64574 
are: 




device has three reset inputs and two 
interfaced and controlled in a specific manner 
'VCCOK' - indicates the state of the system supply 
reset. 'ColdReset' - used at power on and during normal 
that requires the same sequence as a power-on reset. 
on and during normal operation to perform a 'Warm 
initialised from a 'Cold Reset'. After a 'Warm Reset' 
code from the boot-vector. 
specifies the timing diagrams for boot-time initiation 
The requirements laid out by the processor manual 
from the timing diagrams are: 
synchronous with the processor/CPU clock input and com­
of the device. 
for a minimum of 100ms after the supply voltage
operational voltage range. 
at 64k clock cycles after the assertion of 
synchronously with the CPU clock. 
'Reset' must be until at least clock cycles after 'ColdReset' has been 
synchronously with 
"""....."'''' 
tion interface aPt'lpr~tp 	
H'H.JLI'J'-''-'......~... 
of 'VCCOK', the initialisa­
output and samples a 256 






• 	 'ColdReset' must 
'VCCOK' and must 
• 
• 	 At least 
and the device supplying
the bit-stream must rising edge of this clock. 
• 	 The 'MODECLK' 
The format of the 
X. This bit-stream is required to 
ROM device. 
X in Appendix 
or the Serial EEP-
System Requirements 
The devices other processor that are or by the Config CPLD 
are the FPGA, Bus Control CPLD, Monitor, and 
Galileo MIPS companion chip. Two push button inputs are also available to the Config 
CPLD, which are required to be de-bounced in are: 
(I Config and require a reset input that is 
VLv"''''''''' from reset. 
C'"""ct""fi after 
the has stabilised and the processor is 
• The Reset Monitor Device has a reset input for manual reset generation, which
needs to be asserted to emulate a cold reset.
• The serial EEPROM has a chip enable and output enable which need to
be generated when that device is used for processor initialisation.
• The 'mTest' input of the Galileo device is routed to the Config CPLD and needs
to be asserted for the device to enter normal operation.
Specification and Design 
The logic for the Config CPLD will be synchronous with the 'ConfCLK' input
clock, which is synchronous with the clocks to the bus CPLD, companion chip,
SDRAM and FPGA. This will allow output signals to the processor to meet
synchronisation requirements. All reset input sources are synchronised with the
clock input, which is nominally 66.667MHz to a maximum of 75MHz. A 23-bit
pre-scaler counter will be used to generate the timing requirements. This will give a worst
case count period of $2^{23}/(75 \times 10^6) \approx 0.112$ s. This counter is [...] count period
after the de-assertion of all reset inputs.
'SystemReset', controlling the [...] and CPLD, is asserted until
half the count value is reached after a reset input event:
$2^{22}/(75 \times 10^6) \approx 0.0559$ s $\approx 56$ ms.
The'VCCOK' is de-asserted the first 
The completion of initialisation signal is asserted at the pre-scaler
count value with bits 15 and 16 set following the assertion of 'VCCOK'. This
effectively allows for $2^{15} + 2^{16} = 98304$ clocks in which to complete the initialisation,
at 256 bits $\times$ 256 clocks $= 65536$ clocks for completion.
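The pre-scaler figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration and is not part of the CPLD design: the function and variable names are assumptions made here, and the initialisation window is read as the count value with bits 15 and 16 set.

```python
# Worst-case timing of the 23-bit pre-scaler at the maximum 75 MHz clock.
CLOCK_HZ_MAX = 75e6      # maximum 'ConfCLK' frequency quoted in the text
PRESCALER_BITS = 23      # width of the pre-scaler counter


def count_period_s(bits, clock_hz):
    """Time taken for a free-running counter of `bits` bits to count 2**bits clocks."""
    return (1 << bits) / clock_hz


full_period = count_period_s(PRESCALER_BITS, CLOCK_HZ_MAX)      # full count period
half_period = count_period_s(PRESCALER_BITS - 1, CLOCK_HZ_MAX)  # 'SystemReset' hold time

init_window = (1 << 15) + (1 << 16)   # count value with bits 15 and 16 set
init_needed = 256 * 256               # 256 bits at 256 clocks per bit

print(f"full period  : {full_period * 1e3:.1f} ms")
print(f"half period  : {half_period * 1e3:.1f} ms")
print(f"init window  : {init_window} clocks, needed: {init_needed} clocks")
```

At the fastest clock the full count period still exceeds the 100ms minimum required for 'ColdReset', which is why 75MHz is the worst case for this requirement.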
The initialisation sequence is reset by de-assertion of the 'VCCOK' and
is clocked by the 'MODECLK' input from the processor. Certain bits in the stream
are fixed and the rest are sampled from the inputs of an 8-bit dip-switch, which allows
various bits in the stream to be modified without changing the VHDL and
re-programming the CPLD device.
The 'ColdReset' signal is de-asserted at the next overflow of the pre-scaler 
following of 'VCCOK' and completion of the bit-stream LU'.UUL 
(>::::::1 
The CPU's 'Reset' input is controlled by the boot-up sequence but is required
to be toggled without a full system reset to enable 'Warm Resets'. This signal is asserted
by either a system or soft reset input signal and can only be de-asserted 
dReset' is pair delay is triggered following the 
'Col-
of the signals which CPU 'Reset' signal delayed at least 64 
cycles de-assertion as required. 
The two push-button inputs are used as 'SystemReset' and 'Cold Reset' inputs.
The push-button is and push-button 
is routed to external MAX811 Voltage Reset Monitor The input 
MAX811 is used to generate the CPLD's 'Cold events. 
Finally, LED devices are connected to the CPLD. These were used to
indicate the status of the 'ColdReset', CPU 'Reset' and 'SystemReset' signals.
7.2.3 Implementation 
The design was written according to the and 
designs compliance to the requirements. During the course of the 
process, errors in the implementation and 
the simulation results and 
CPLD was specified in the hardware design to be a macro-

however a 72-macro pin-compatible was used. 

macro cell device not have had resources this design. 

of the final resource utilisation implementation are: 
of macro cells, 30% of product terms, 65% registers and of function 
block inputs used. 
maximum input frequency 100MHz, 'ModeCLK' maximum input frequency 71MHz
7.3 Bus + Control CPLD 
The Bus and Control CPLD has three main functions. Most importantly, it implements
a bridge between the local bus and the peripheral bus, allowing bus
transactions to be performed across it. Secondly, it has the task of handling FPGA
run-time reconfiguration. Finally, it provides a memory mapped system control interface
for processor reset and FPGA initialisation.
Requirements 
Local to Peripheral Bridge 
local to peripheral bus bridge needs to interface the standard 8-bit FLASH and 
devices with address and busses to the memory con-
32-bit AddresslData 
understanding 
the bus operation was required in to the design. The 
memory local bus is multiplexed meaning that control lines are 
timing on the bus are 
shared with the data bus connections. Various signals are provided to latch and qualify
these signals, which need to be decoded. The bus also supports up to 8-word burst
read/write transfers to a selected device. This is possible due to the fact that the lower
three address bits are provided as dedicated pins on the Galileo device. A detailed
description of the memory controller's memory bus (local bus) along with full timing
diagrams is located in the Galileo GT64115 datasheet [Chapter 5 Section 8].
The CPLD is required to bridge single and multi-word read/write cycles from the
local bus to the peripheral bus. The CPLD is also required to perform address decoding to
fragment the local bus chip-select 0 (CS0) and 'BootCS' windows into smaller memory
windows for each device on the peripheral bus. The memory controller's 'BootCS'
signal must access the processor boot memory device.
The peripheral bus has an 8-bit data bus with a 22-bit address bus and operates
as a standard asynchronous bus. The CPLD is required to generate appropriate
chip-select lines to the individual devices and generate the read/write signals.
The CPLD is also provided with an interrupt input from the PCI bus and
has two interrupt lines to the processor. These lines must be de-asserted if not in use.
Finally, a connection to the system's JTAG chain is provided. This is for future work
and it is required to be driven tri-state.
FPGA Configuration 
The Bus + Control CPLD is required to generate the signals needed to configure the
Xilinx Virtex-E FPGA device via the SelectMAP [x] interface. This involves
generating the various control signals to initiate and control the configuration, and
generating an address and control signals for the memory device on the peripheral bus that
holds the FPGA design bit-codes.
During this configuration mode, the peripheral bus is busy and may not be accessed
from the local bus. All requests to the peripheral bus should be ignored.
The design must allow for power-up as well as run-time configuration and
reconfiguration. A maximum configuration speed of 50MHz may be used without the
need for a more complicated configuration algorithm. A detailed description
of the Virtex-E configuration via the SelectMAP interface is provided in the Xilinx
Application Notes [15] [16].
System Control Interface 
The system control interface is required to provide an interface to allow the state of
the system to be controlled without physical intervention. It is required to provide
an interface that will allow the processor reset to be asserted, de-asserted and pulsed
(off->on). This will allow the state of the processor to be controlled either by itself
or from across the PCI bus by the host. It is also required to provide an interface
to initiate FPGA configuration. No requirement was set for the selection of an FPGA
configuration memory device or address. Finally, an interface to allow the toggling
of the 'SystemReset' signal and a method to acquire the system's revision number for
software purposes are required.
7.3.2 Specification and Design 
The specification and design process for this component of the system went through
a series of revisions before reaching its final state. During various hardware test and
verification processes, slight modifications and changes in functionality were required
and some features were only implemented when required.
The design is required to operate in two distinct modes: FPGA configuration and 
bus bridge modes. A global signal will indicate this state and disable the appropriate 
logic and act as the control signal for bus multiplexers. Of the signals that need to 
be multiplexed, the address and data on the peripheral bus have two possible sources, 
one from the FPGA configuration logic and the other decoded from the local bus. The 
peripheral bus control signals need to be generated for both modes. 
The FPGA configuration requires a configuration clock. 8-bit parallel data is 
latched on every rising edge of this clock. The FPGA interface is signalled that a 
new configuration cycle should begin. A read request starting from address zero is 
directed to the memory device containing the configuration data from the Bus CPLD. 
The memory device outputs this data onto the bus which is connected to the FPGA 
interface. During each configuration clock cycle, the memory address is incremented 
thus loading the entire design into the FPGA. On completion, the FPGA notifies the
Bus CPLD with a 'DONE', which returns the CPLD to the bridge mode of
operation.
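The configuration sequence described above (start at address zero, present one byte per configuration clock, increment the address, return to bridge mode on 'DONE') can be modelled behaviourally. The Python sketch below is a simplified illustration only: `ToyFpga`, `configure` and the bitstream length are hypothetical names and values introduced here, and the real SelectMAP interface involves further control signals that are not modelled.

```python
class ToyFpga:
    """Stand-in for the FPGA configuration port: reports DONE after `length` bytes."""

    def __init__(self, length):
        self.length = length      # hypothetical bitstream length in bytes
        self.received = []

    def clock_in(self, byte):
        """One rising configuration-clock edge: latch a byte and return DONE."""
        self.received.append(byte)
        return len(self.received) >= self.length


def configure(memory, fpga):
    """Stream bytes from address zero until DONE; return (mode, bytes sent)."""
    mode = "config"               # peripheral bus is owned by the configuration logic
    address = 0                   # read request starts from address zero
    while mode == "config":
        done = fpga.clock_in(memory[address])
        address += 1              # address incremented on every configuration clock
        if done:
            mode = "bridge"       # 'DONE' returns the CPLD to bridge mode
    return mode, address


memory = bytes(range(32))         # hypothetical configuration memory contents
fpga = ToyFpga(length=20)
mode, bytes_sent = configure(memory, fpga)
print(mode, bytes_sent)
```

The single `mode` variable plays the role of the global signal that selects between the FPGA configuration logic and the bus bridge as the source of the peripheral bus address and control signals.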
The local-bus interface, upon closer inspection, is not very complicated but requires
precise timing. The address, chip-selects and control signals need to be latched on the
falling edge of the 'ALE' (Address Latch Enable) signal from the memory controller.
A chip-select timing signal from the memory controller is used to qualify the latched 
control signals. The peripheral bus read and write signals are generated directly from 
the decoded signals from the local-bus. Bi-directional buffers are controlled by the 
decoded read/write signals to allow the data bus to be bridged to the local-bus during 
data transfers. 
The system control interface is to be implemented as a memory mapped virtual
address on an unused chip-select signal. The communication lines between the Bus
CPLD and Config CPLD are to be used to manipulate the Reset signals. Any access to
the virtual device (chip select) will invoke an action determined by the address accessed.
A table of 'commands' is provided in Table 7.1.
7.3.3 Implementation 
The design was written as a single VHDL file and targeted to the 144 macro-cell Bus
CPLD. The Bus CPLD design was tested initially in simulations to verify correct
operation. During the simulations, errors in the design and implementation were found
7.4 
----0.- did not give many 
final design implemented 
macro cells used, 35% 
7.4. FPGA 

Asserts processor Reset signal 
Deasserts the processor Reset signal 
Asserts then the processor Reset 
Changes state of the system reset 
number on the 
and until the simulated operation to the local 
bus interface and other requirements. 
The of the design in was performed hardware 
verification described in chapter FPGA was used initially to emulate the 
while the memory controller device was not present in the A more 
of that process is in the FPGA section 7.4.6. 
although during the hardware verification, 
a dry joint on Flash, which the SRAM from 
This an investigation that looked partly at the CPLD 




terms used, 38% 

of functional block inputs used. 

The clock frequency design was limited to 100MHz. 

Designs 
the development of 10 basic designs developed to test and 
for the designs underwent constant review, additions 
and modifications while simple tests. designs are de-
briefly with more paid to the cornplllc3l£ea 
All the FPGA designs the system local bus are "...o,,..,.,,,,,rt to use 
WishBone™ System on Chip (SOC) bus. This is an open standard for linking modules
together with a common interface. The design and specifica­
implementing Wishbone compatible systems were and followed in 
designs. By strictly to a single interconnect design reuse was 
improved. 
7.4.1 Basic LED 
design served as a example in order to hardware design and 
software tools used to the FPGA designs. Initially, this design 
system clock down to a human detectable ....."'...."."'''',..." 
for display on the LED FPGA. Once this was verified, ex-
began Locked (DLL) and 
This using Xilinx components in the design and thus 
served to test this functionality. It also whether the design allowed 
for high-speed generation in terms of correct decoupling 
routing. Clocks up to 266MHz were successful implementa­
tion problems occurred in the use of the components due to the use of out-of-date 
This was quickly resolved using the latest documentation. 
LVDS Test 
test design was an experiment in use Low 
(LVDS). The FPGA 
hardware serialiser/deserialiser circuitry. This design was an at 
an in synthesisable VHDL. 
connected to it, 8 LVDS channels (4 in , 4 out) but 
Raw LVDS is simply a physical capabilities. During 
design period, clock I were investigated use as 
a link layer. Data Strobe Encoding was investigated for use as a 
it reduces the data interference between the channels. 
Two LVDS channels are employed to carry the Data and Strobe channels.
The Strobe channel is generated as the
data xor'ed with the transmission clock. The clock can be recovered
at the receiver by xoring the channel with Strobe channel. figure 
L 
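The XOR relationship used by Data-Strobe encoding can be illustrated in a few lines. The following Python sketch is not taken from the FPGA design; it assumes a bit clock that toggles once per bit period, and the function names are introduced here purely for illustration.

```python
def ds_encode(data_bits):
    """Return (strobe, clock) for a list of 0/1 data bits.

    Assumption: the transmission clock toggles once per bit period (0, 1, 0, 1, ...).
    """
    clock = [i & 1 for i in range(len(data_bits))]
    strobe = [d ^ c for d, c in zip(data_bits, clock)]   # strobe = data XOR clock
    return strobe, clock


def ds_recover_clock(data_bits, strobe_bits):
    """Recover the clock at the receiver by XORing the Data and Strobe channels."""
    return [d ^ s for d, s in zip(data_bits, strobe_bits)]


data = [1, 0, 0, 1, 1, 1, 0, 1]
strobe, clock = ds_encode(data)
recovered = ds_recover_clock(data, strobe)
print(recovered == clock)
```

Under this model exactly one of the two channels changes state in each bit period, so the two lines never switch at the same instant.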
Various detection synchronising data were investigated. 
They all negatively affect the data rate of the channel by inserting framing information. 
Finding a reasonable implementable solution that provided a high data rate to baud rate 
was the 
During the implementation of basic data transmission VHDL, 
were encountered. The outputs require Double 
registers, which can be implemented using the components of the Virtex 
FPGA architecture. Unfortunately, FPGA 
high-level description, reducing it which caused design to fail. 
VHDL implementation of LVDS was cancelled and a schematic 
attempted. 
net. is also the 
to optimise the logic. 
exampleofa 
was as the reference for 























7.4.3 LVDS Schematic Test 
implementation of an transmitter and receiver --~'r-," sCflenlatlc capture 
is much simpler than the equivalent in VHDL. Timing are extremely 
for LVDS can be clicking and properties of a 
schematic translates 
in the Xilinx Application 
LVDS transmission, the transmission rate far exceeds the 
limits of the device. this reason, the high-speed sections ne(~ae:a 
The transmitter design used Parallel Input to Serial Output (PISO) ,",'V""P"'" 
nents. Each PISO takes a 4-bit parallel input and a 4X-Clock from 
a four-stage output sequentially though 
Two PISO devices, one positive one negative edge 
connected to two a (DDR) Flip Flop which 
two signal together switching state on of the 4X-Clock. 
..or",.. ,,,,,, .. is the transmitter, de-multiplexing 
output using the rpr"pn'Pri Note that the receiver's local 
to the transmitter's and a synchronisation barrier u'-',,.,u ....u 
was implemented with the 
a baud rate of 66MHz * 4 * 2 Framing information
reduces baud rate. Only basic clock and data ..P{·"''',,, ....., was demonstrated on 
hardware to test physical design. The FPGA LED devices were to 
display data in real-time. 
briefly by 
The results of test "nllUlF'" 
displaying the recieved 
synchronisation was ex­
UtUJ"vU constant even though 
to be very tolerant to loading 
UU'vllI was observed in the test. The re­
rpf',''''''>'' data displayed on the 
The link was also 
was not same as the transmitted .... "t·t"' ...... The problem was analysed 
,-,uu",-,u by sampling the received oscilloscope and posterised to 
waveform with the wrong timing. 
7.4.4 Local Bus Emulation Test 
The need for the emulation of the memory controller arose from the need to test the
functioning of the Bus CPLD, which implemented the Peripheral Bus Bridge, and the
need to access and program the FLASH memory in-system. The local bus emulation
design needed to emulate exactly the signals from the memory controller. For
this reason, a good understanding of the local bus protocol was needed. The Galileo
datasheet [x] and additional timing diagrams [x] provided for the device by the
manufacturer proved invaluable. In addition, the timing parameters configurable on the
memory controller were specified to be implemented in the design to allow various
configurations to be tested.
The implementation of the design was incremental with various sections being 
designed and simulated to test compliance individually. The local bus emulation unit 
was specified to have a simple internal interface to which an intelligent module could 
attach to gain access to the local bus. This interface contained: a request signal and a 
direction signal to start a transfer, a completion strobe to indicate the completion of a 
bus cycle, and data input and output ports. 
The design was implemented as a number of synchronous state machines that
interoperated to manage the bus. A main state machine monitored the request signal and
initiated a bus transfer. Firstly, an address state is used to place the requested address on
the bus. This is followed by either a read or write state and completed with a bus hold-off
cycle. Various parallel state machines generate the cycles required for read and
write cycles as well as generating the timing and burst accesses as specified by input
parameters. Various parallel processes are also implemented that monitor the states of
the various state machines and generate the required control and output signals.
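The sequencing of the main state machine (request, address state, read or write state, hold-off cycle) can be sketched as a simple transition function. The Python model below is an illustrative assumption only: the state names and the one-cycle-per-state timing are chosen here for clarity, whereas the actual design uses further parallel state machines to generate the configurable timing and burst accesses.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ADDRESS = auto()
    READ = auto()
    WRITE = auto()
    HOLDOFF = auto()


def next_state(state, request, is_write):
    """Transition function of the (simplified) main bus state machine."""
    if state is State.IDLE:
        return State.ADDRESS if request else State.IDLE
    if state is State.ADDRESS:
        return State.WRITE if is_write else State.READ
    if state in (State.READ, State.WRITE):
        return State.HOLDOFF
    return State.IDLE            # HOLDOFF completes the bus cycle


def run_transfer(is_write):
    """Clock the state machine through one requested transfer and record the states."""
    state, request, trace = State.IDLE, True, []
    while True:
        state = next_state(state, request, is_write)
        request = False          # the request is only sampled to start the transfer
        trace.append(state)
        if state is State.IDLE:
            return trace


print([s.name for s in run_transfer(is_write=False)])
print([s.name for s in run_transfer(is_write=True)])
```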
A simple state machine was finally written to interface to the bus control logic and
perform test read and write accesses to the SRAM device across the bridge. In total,
ten independent parallel processes were implemented to perform the emulation of
the memory controller device. The design for the basic bus emulation test was
implemented in the FPGA and used the following resources: 5% of device slices, 2% of
available Flip-Flops and 32% of the I/O pins, with an equivalent gate
count of 9075. The place and route software optimised the design to run at up to 75MHz.
7.4.5 16550 Compatible UART
The 16550 UART is the standard UART used in a PC and many embedded applications.
It was thus desirable to implement a UART design that was software compatible
with it. The design requirements and interface specification were taken from the
National Semiconductor NS16550 UART datasheets [x]. Reference was also made to
a Verilog 16550 UART design from the opencores.org Internet site for free IP cores
[x]. Various design errors and flaws were identified in the Verilog design; however, it
still influenced some of the implementation features. The design was specified to be
implemented using the Wishbone™ bus.
The design was segmented into six functional units: a transmission unit, a receiver,
FIFO memories, a baud rate generator, a registers and control unit and a WishBone bus
interface unit.
The baud rate generator is implemented as a simple down-counter that resets to 
a value specified at its input when it counts to zero. On reaching zero, a pulse is 
generated which acts as the baud rate enable. This pulse has a frequency of 16 times 
the wire baud rate. This is purely to increase the reliability of the receiver, which 
performs over-sampling and synchronises to the received start bit. 
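The reloading down-counter described above can be modelled cycle by cycle. The Python sketch below is an illustration under stated assumptions: the function names are introduced here, the reload value is taken as the divisor minus one so that the enable pulse repeats every `reload + 1` clocks, and the 1.8432MHz example clock is a conventional UART reference frequency rather than a value from this design.

```python
def reload_value(clock_hz, baud, oversample=16):
    """Reload value so the enable pulse runs at roughly oversample * baud."""
    return round(clock_hz / (oversample * baud)) - 1


def enable_pulses(reload, n_clocks):
    """Simulate the down-counter; return the enable output for each clock cycle."""
    counter = reload
    pulses = []
    for _ in range(n_clocks):
        if counter == 0:
            pulses.append(1)     # counted to zero: emit the baud rate enable
            counter = reload     # and reset to the value specified at the input
        else:
            pulses.append(0)
            counter -= 1
    return pulses


print(enable_pulses(reload=3, n_clocks=8))
print(reload_value(1_843_200, 115_200))
```

With a reload value of 3 the enable pulse occurs once every four clocks; the receiver would then use sixteen such pulses per bit period for over-sampling.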
The transmitter unit was specified as a stand-alone functional unit performing
parallel to serial conversion. The interface was specified to provide all the functionality
required by the 16550 transmitter, including parity, bit-count and stop bit generation
controls. The transmitter was implemented as a state machine with a shift register used
to serially send the data.
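The bit sequence that such a transmitter shifts out can be sketched as follows. This Python model is only an illustration of standard asynchronous serial framing (start bit, data bits least significant bit first, optional parity, stop bits); the function name and parameters are assumptions made here and do not correspond to signals in the VHDL design.

```python
def frame_bits(byte, data_bits=8, parity="even", stop_bits=1):
    """Return the line levels for one character, in transmission order."""
    bits = [0]                                            # start bit (line low)
    data = [(byte >> i) & 1 for i in range(data_bits)]    # LSB is sent first
    bits += data
    if parity in ("even", "odd"):
        ones = sum(data) & 1
        # Even parity makes the total number of ones even; odd parity inverts it.
        bits.append(ones if parity == "even" else ones ^ 1)
    bits += [1] * stop_bits                               # stop bit(s) (line high)
    return bits


print(frame_bits(0x55))
print(frame_bits(0x01, parity="odd", stop_bits=2))
```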
The receiver unit is also a functional unit that is based on a shift register to serially
capture data received. The input signal is first synchronised to the system clock to
prevent meta-stability in the flip-flops. A state machine monitors the state of the receive
line for a start bit and then proceeds to capture the data in the format specified by its
input settings. Parity, frame and break errors are detected and provided along with the
received data to the interface.
The FIFO unit implements the transmit and receive FIFOs as indicated by the 
16550 specification. All error information is inserted along with the received data in 
the receiver FIFO. Various full, empty and control signals are provided for the system 
control logic. 
The registers unit contains the 16550 interface compatible registers and the control 
logic required to integrate all the components of the system. This unit has a simple 
bus interface to read and write the various registers. The control logic has the task of 
sending data to the transmitter unit from the transmit FIFO, generating interrupts and 
sending and receiving data from the FIFOs. 
The last unit required was the WishBone bus interface, which is a simple unit that 
translates WishBone read/write cycles into the simple read/write interface provided by 
the registers unit. 
7.4.6 Remote Bus Access 
The remote bus access design was implemented to test various aspects of the Peripheral
Bus through the FPGA across Local Bus emulation. The requirement for the design was
to implement a controller on the FPGA that would allow a PC to access and modify
memory devices on the Peripheral Bus through the FPGA implemented RS-232 port.
This design was implemented with three major system modules. The 16550 UART
design was used to provide the serial port access and the Local Bus emulation design
was used to provide access to the Peripheral Bus through the bridge in the Bus CPLD.
Figure 7.3: High level overview of Remote Bus Access system
The third module that was required was a control logic unit to interface the two other
units and implement a protocol to communicate with the PC. A high level diagram of the
system is shown in Figure 7.3.
The protocol for the RS-232 communications was specified to be a general purpose 
as possible, not providing the most optimal performance. The protocol specifications 
were: 
Baud Rate: 115200 bps
Format: 8 bits, Even Parity, 1 stop bit
PC - Communications master, FPGA - slave
Address Format - Segmented: 24 bit (16 bit segment, 8 bit address)
Packet Format:
    Master - 1 Control byte followed by 1 or 2 data bytes
    Slave - 1 Ack/Nack byte followed by 0 or 1 data bytes
Control byte (master):
    0x33 -> Read operation + 1 data byte (8-bit address)
    0xCC -> Write operation + 2 data bytes (address, data)
    0xA5 -> Set segment + 2 data bytes (16-bit segment)
Slave Ack/Nack responses:
    0xF0 -> Data acknowledge + 1 data byte
    0x0F -> Negative acknowledge + 0 data bytes
    0x3C -> Simple acknowledge
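To make the packet layout concrete, the protocol listed above can be sketched from the PC (master) side in a few lines of Python. This is only an illustration and not the software written for the project. The function names are hypothetical, and the order of the two segment bytes (high byte first) is an assumption, as the specification above does not state it.

```python
# Control bytes sent by the master (PC), taken from the protocol listing.
CMD_READ = 0x33          # followed by 1 data byte (8-bit address)
CMD_WRITE = 0xCC         # followed by 2 data bytes (address, data)
CMD_SET_SEGMENT = 0xA5   # followed by 2 data bytes (16-bit segment)

# Response bytes sent by the slave (FPGA).
RSP_DATA_ACK = 0xF0      # followed by 1 data byte
RSP_NACK = 0x0F          # no data
RSP_ACK = 0x3C           # simple acknowledge, no data


def _check(value, bits, name):
    """Reject values that do not fit in the given field width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


def read_packet(address):
    """Build a read request for an 8-bit address in the current segment."""
    _check(address, 8, "address")
    return bytes([CMD_READ, address])


def write_packet(address, data):
    """Build a write request for one data byte."""
    _check(address, 8, "address")
    _check(data, 8, "data")
    return bytes([CMD_WRITE, address, data])


def set_segment_packet(segment):
    """Build a set-segment request (ASSUMPTION: high byte is sent first)."""
    _check(segment, 16, "segment")
    return bytes([CMD_SET_SEGMENT, segment >> 8, segment & 0xFF])


def parse_response(raw):
    """Return (ok, data) where data is an int for a data acknowledge, else None."""
    if len(raw) == 2 and raw[0] == RSP_DATA_ACK:
        return True, raw[1]
    if len(raw) == 1 and raw[0] == RSP_ACK:
        return True, None
    if len(raw) == 1 and raw[0] == RSP_NACK:
        return False, None
    raise ValueError("malformed response")
```

A 24-bit access is then a set-segment packet followed by a read or write packet carrying the low 8 address bits.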

State machine diagrams were designed before the VHDL implementation in order
to simplify the implementation through a better understanding of the logic sequences
required. The VHDL implementation of these state machines and the control logic
driven by the states was then written and simulated.
The final design resource utilisation in the FPGA was: 14% of Slices, 7% of Registers
and 10% of LUTs. The total equivalent gate count of the design was 12,887 gates.
The design was optimised to run up to 72MHz.

7.4.7 Local-Bus Test 
The Local-Bus test design was required to develop, verify and test the FPGA interface
for implementing a slave device on the Local Bus. This is necessary in order to
communicate with the FPGA from the memory controller device.
Considerable experience of the workings of the Local Bus had been gained from the
Bus CPLD bridge and emulation work. This design however needed to interface the
asynchronous Local Bus with the synchronous internal FPGA logic.
Interfacing the asynchronous bus proved to be relatively simple in an initial working
design. This basic functionality operated with the constraint that very slow timing
was required from the memory controller due to the discrete sampling properties of
a synchronous system. Re-thinking the design to speed up the interface would be
considerably more difficult. This was a low priority for this project and was left for
future work.
The basic Local Bus test displayed the lower 8-bits of data written to the FPGA on 
the LEDs and on reading from the FPGA provided a hard-coded value on the bus. 
Finally, the 16550 UART was connected to the Local Bus interface module by 
implementing a local-bus to WishBone bridge. This was done to test the operation of 
a peripheral device implemented in the FPGA from across the bus. 
7.4.8 Sigma-Delta Modulator 
A Sigma-Delta modulator [13] is a function that generates a digital stream of pulses
from an input value. The proportion of Logic '1' outputs among all outputs over a
large sample count is equivalent to the ratio of the current input value to the maximum
input value. By low pass filtering this digital bit-stream, an analogue output value can
be obtained. This creates a simple Digital to Analogue Converter (DAC). The purpose 
of designing a Sigma-Delta modulator was to demonstrate the high-speed processing 
and general purpose ability of the FPGA. 
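The relationship between the input value and the density of '1' outputs can be illustrated with a small software model. The sketch below assumes a first-order modulator built from an accumulator whose carry-out is the output bit, which is one common way of building such a DAC in an FPGA. It is not necessarily the structure of the VHDL design described here, and the function name is hypothetical.

```python
from typing import List


def sigma_delta_bits(value: int, width: int, n_samples: int) -> List[int]:
    """Model a first-order accumulator modulator.

    On every clock the input is added to a `width`-bit accumulator and the
    carry out of the adder is emitted as the output bit.
    """
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the accumulator width")
    mask = (1 << width) - 1
    acc = 0
    bits = []
    for _ in range(n_samples):
        acc += value
        bits.append(acc >> width)  # carry out of the adder is the output
        acc &= mask                # keep only the low `width` bits
    return bits


def ones_density(bits: List[int]) -> float:
    """Fraction of '1' outputs, which a low pass filter would recover."""
    return sum(bits) / len(bits)
```

For an 8-bit model with an input of 64, one output in four is a '1', so the filtered level is a quarter of full scale.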
The Sigma-Delta modulator was researched and a design for a simple converter, without
compensation or additional signal processing, was developed. The design consisted of two
identical modulators for stereo output channels, a FIFO memory for buffering samples and
a control unit that generated the timing and external interface. The sample format was
specified to be 16-bits per channel in sign-magnitude format which is used for digital
audio representation. The modulator was run at 66MHz, which gave around 11 bits of
accuracy per sample; however, it could be run at 133MHz to improve the signal quality.
The FIFO was polled at 44100Hz so that CD quality audio samples could be demonstrated
in the system.
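The quoted accuracy is consistent with a simple count of modulator clocks per audio sample. The sketch below assumes that the resolution of an uncompensated modulator is bounded by the number of output bits produced in one sample period. This is an assumption made for illustration, as the text does not state how the figure was derived.

```python
import math


def clocks_per_sample(modulator_hz: float, sample_hz: float) -> float:
    """Number of modulator output bits produced during one audio sample."""
    return modulator_hz / sample_hz


def equivalent_bits(modulator_hz: float, sample_hz: float) -> float:
    """Upper bound on resolution, in bits, under the stated assumption."""
    return math.log2(clocks_per_sample(modulator_hz, sample_hz))


bits_at_66mhz = equivalent_bits(66e6, 44100)
bits_at_133mhz = equivalent_bits(133e6, 44100)
```

Under this assumption the 66MHz clock gives between 10 and 11 bits, and doubling the clock to 133MHz adds roughly one further bit.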
Various revisions were required to perfect the design as problems in the initial 
implementation of the audio format and FIFOs existed. The design was tested initially 
from the host PC by a program that interfaced the Sigma-Delta converter using the 
Linux driver. 

7.4.9 Propane UART
The 16550 UART design was unnecessarily complicated and used a lot of FPGA resources.
For future work, where other modules were to be incorporated in the FPGA, the 16550
design was dropped and a simpler UART design was specified: the 'Propane UART',
for eventual use in the propane system interface described in the next subsection.
The specifications for the simplified Propane UART design were:
• Only 7 and 8-bit selectable communication modes 
• Only 1 or 2 stop bits 
• Parity modes, Even, Odd, None 
• Transmit and receive FIFOs 
• Interrupt on receive FIFO content size - programmable - 1,6,12,Full 
• Programmable baud rate - 1100 baud to 4M baud 
• Parity and Frame error detection 
This design, although still providing a great deal of flexibility, is implementable using
less logic than the 16550 UART, particularly because it avoids the complexity of implementing
infrequently used functions like break detection and 5 and 6-bit transmission modes.
The design implemented the WishBone interface as required by the Propane interface.
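The framing options in the specification above (7 or 8 data bits, 1 or 2 stop bits and even, odd or no parity) can be illustrated with a short model of the bit sequence placed on the line for one character. The sketch follows standard asynchronous serial framing with the least significant bit sent first. It is not derived from the Propane UART VHDL, and the function name is hypothetical.

```python
def uart_frame(value, data_bits=8, parity="none", stop_bits=1):
    """Return the line levels for one character, start bit first."""
    if data_bits not in (7, 8):
        raise ValueError("only 7 or 8 data bits are supported")
    if stop_bits not in (1, 2):
        raise ValueError("only 1 or 2 stop bits are supported")
    if parity not in ("none", "even", "odd"):
        raise ValueError("parity must be 'none', 'even' or 'odd'")
    if not 0 <= value < (1 << data_bits):
        raise ValueError("value does not fit in the selected data width")

    data = [(value >> i) & 1 for i in range(data_bits)]  # LSB first
    frame = [0] + data                                    # start bit is low
    if parity != "none":
        even_bit = sum(data) & 1  # makes the total number of ones even
        frame.append(even_bit if parity == "even" else even_bit ^ 1)
    frame += [1] * stop_bits                              # stop bits are high
    return frame
```

The frame length ranges from 9 bits (7 data bits, no parity, 1 stop bit) to 12 bits (8 data bits, parity, 2 stop bits), which determines the character rate for a given baud rate.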
7.4.10 Propane Design 
The propane system interface is a specification developed for this project to allow 
the processor to automatically detect and use various functional units present in the 
active FPGA configuration. Examples of such functional units are UARTs and the 
Real Time Clock interface. The propane interface aims to create a simple 'Plug and 
Play' environment for the FPGA designs so that operating system and boot-up software 
need not be modified every time a new design is loaded into the FPGA. 
The Propane interface was specified and the major requirement of the system was
to implement a design that met the interface specification and used the WishBone SOC
bus for linking modules in the system. The Propane interface is specified to provide
an interface for up to eight modules or cores to be used on the FPGA, with provision
for extending this to more if required. The specification is provided in Appendix C.
The Local Bus interface from previous designs was used for the Propane design. 
Many problems existed with this interface and a lot of development time was required 
to try and obtain reliable operation. Some of the major problems experienced with 
the Local Bus interface were metastability from input signals, double read/write
strobes being issued for single read/write operations and bus-hold contention. Many
of these problems were only detected when implementing the Propane UART because 
the FIFO read and write operations are not tolerant of any extra read / write strobes 
and problems such as missing received characters and transmitting multiple copies of 
the same character were experienced. After much debugging and simulation of the 
interface, a stable working but slower than optimal bus interface was obtained. It was 
left for future work to improve the interface speed of this unit. 
The various functional units were all implemented with a WishBone SOC bus interface.
The Local Bus interface acted as a bridge and WishBone bus master allowing
external accesses from the Local Bus to reach the addressed functional unit. No support
for burst reads or writes was provided by the Local Bus interface.
The Real Time Clock (RTC) interface was specified as being part of the same
functional unit as the Propane interface. The RTC device used in the project is a Xicor
X1227, which provides an RTC, watchdog timer, alarms and reset and voltage monitor
in a single 8-pin SOIC device. The interface to the device is the I2C protocol [14].
An I2C master was therefore the first logic unit required to implement the RTC functional
unit. A basic I2C master controller was designed for use by the other units. After an
analysis of the X1227 I2C interface, the master interface was specified. The interface
consisted of: an address input which translated to the X1227's Clock Control Register
address, 8-bit input and output data, a read/write direction control signal and Start and
Done control signals.
The implementation consisted of a complicated 11-state master state machine that
provided the basic major protocol state logic. Various smaller state-machines and control
logic were used to generate fine grain protocol details and control the sequencing
of the main state-machine. The unit interfaced the X1227 RTC at 346kHz, which is
reasonably close to the rated maximum speed of 400kHz.
The I2C protocol generated by the unit was very carefully simulated for compliance
in the functional simulator. All error states and possible problems were specified
to be handled in the implementation and simulated for compliance.
The RTC unit interfaced to the Propane system interface and the I2C master controller.
This allowed requests from the Local Bus, translated to the WishBone Bus, to
be performed over the I2C protocol to the X1227 RTC.
With the Propane FPGA interface design complete, adding and removing custom 
Propane compatible cores requires no changes to any other modules. With properly 
designed software, automatic detection and use of the modules provided in the FPGA 
is possible. 
7.4.11 DSP Experiment 
The Digital Signal Processing experiment was required to demonstrate the processing
capabilities of the node. Two possible processing algorithms were identified for possible
implementation in the FPGA: the Fast Fourier Transform (FFT), used mainly for
[...], and Discrete Cosine Transform (DCT) processing.
FFT and DCT processing cores are both supplied as modules [...] for integration
into a design. On inspection it was found that the FFT core supplied would
not be implementable in the 200,000 gate FPGA due to the architectural requirements
of the core. It was for that reason that it was decided to use the DCT core
for processing benchmarks.
The DCT DSP functional unit was implemented as a module for the Propane interface,
which made software development to use the core simple. The core was generated
by the Xilinx CORE Generator with the following parameters: 16-bit coefficient width,
9 clocks sample latency, [...] data input width, 16-bit data output width, [...]
mathematics, [...] point DCT, 1 [...].
To optimise the DCT core usage, two [...] memories were used, one for input data
storage and one for output data [...]. Since the Local Bus interface is bi-directional,
during [...] cycles to the input, no output data can [...]. A control unit was [...]
that converted the Local [...] values, [...] serially [...] output values were multiplexed
[...] 32-bit values [...] reading. [...] conversions optimised Local [...] throughput
[...] required synchronisation and a lot of simulation to perfect the interface.
The performance of the CORE was not [...] to be [...] through the Local Bus [...].
The DCT core used is a pipelined unit that is capable of reading an input sample and
generating an output on every clock. Thus the core at 66MHz is theoretically
capable of processing 66666666.67/16 = 4.167 million DCT [...] per second. The Local
Bus interface was a limiting factor to this performance.
7.5 Conclusions
In conclusion, various VHDL cores were designed, implemented, [...]
and finally implemented on the various Programmable Logic Devices (PLD) in the
system. [...] configurations were used to [...] the correct functioning and performance
of the [...]. Final designs [...] interface were developed that provide a stable and
flexible base for the software environment to be developed.
Chapter 8 
Software Development 
Previous chapters described the Hardware and Firmware specification and design processes.
The final design and development effort was related to the software for both
the node and the host PC systems. The host PC was running a version of the Linux 2.4 
kernel for which drivers and utility programs were developed to communicate with and 
test the nodes. The node software consisted of a combination of low-level assembler, 
embedded C and Linux application programming. 
The chapter begins with a description of the software developed for the PC host 
system including Linux drivers, test applications and utility programs. The second part 
of the chapter describes the software developed for the MIPS processor on the node 
hardware. Various complicated design problems and their solutions are described in 
more detail in each of the sections. 
All software developed for this project used open source technologies including
compilers, debuggers and operating systems as specified by the requirements for the 
project. 
8.1 Linux PC Software 
In order to properly use, debug and test the hardware and software on a node, PC 
software was required to communicate with the nodes on low-level hardware as well 
as high-level software interfaces. The first requirement was to interface the Galileo 
PCI memory controller on the nodes. Programs to communicate with the FPGA and 
memory over the serial port were needed to test the Bus Access firmware designs as
well as the hardware. Various utility and test programs were required to provide higher 
level interfaces to the hardware for testing, debugging and controlling the nodes. 
8.1.1 PCI Driver 
The Linux kernel only allows hardware drivers access to the hardware resources in the 
system. Since the nodes were connected to the host PC system on the PCI bus, a driver 
was required to allow user-space programs to communicate to the nodes. 
A test driver was originally developed to identify the Galileo PCI device for an 
evaluation board that was purchased for testing the Galileo memory controller device. 
The main purpose of this driver was to develop an understanding of the Linux PCI 
driver system and provide a skeleton driver for development of the project hardware 
driver. This driver identified the Galileo memory controller devices in the system 
and registered the driver for each instance of the device found. It also performed a 
secondary role, which was to reset the evaluation system's processor, which was useful 
while experimenting on the platform. 
The development of the driver for the ERPCN01 node was based on the initial
driver developed for the evaluation board. The development of the driver was also
progressive, with features being added as required. The order in which these features
are described below is roughly the order in which they were developed.
Various features that were used for testing hardware elements were omitted in
the later versions.
The first major goal of the driver was to test the Galileo memory controller. Once 
the hardware was in a state that the Linux host operating system identified the Galileo 
device, development of the driver was started. 
The driver on detection of a Galileo memory controller device remaps the memory 
windows provided by the device into the host system's memory space. This effectively 
allows the physical memory on the node to be accessible directly by the driver. Firstly, 
the driver was modified to perform write operations to the SRAM device on the node. 
A correct read from that device would verify the operation of the Galileo device, Local 
Bus - Peripheral Bus Bridge and SRAM device. During this stage many problems were
encountered, which were gradually solved as a greater understanding of the Galileo
device and the issues associated with a PC environment was gained. One particular problem that
took a long time to rectify was caused by the PC's PCI BIOS reprogramming the 
Galileo's PCI registers, adjusting the PCI memory map without reprogramming the 
internal maps. This caused memory access errors and PCI retry operations, which 
effectively locked up the host PC in certain situations. 
Following the testing of devices on the Peripheral Bus, the SDRAM memory
needed to be initialised and tested. Since the Galileo's internal registers are memory
mapped to the host, full control of the device is possible without the on-board
processor's intervention. The SDRAM memory controller needed to be programmed
to enable the SDRAM and to set up the device timings. The initialisation sequence and
timing settings from the SDRAM's datasheets were used as a reference for this. A test
sequence then proceeded to write and verify the SDRAM to test its functionality.
Once access to physical memory devices on the project hardware from the driver
was functional, a device interface for user programs was implemented. A Linux character
device interface was chosen for access to the physical memory on the board. This
would allow user programs to read and write various memory locations with a file-like
interface. In addition, the Linux kernel provides a memory map file access extension
which allows a user space program to access the memory identically to the way it
accesses its local memory.
A [...] and the four memory windows on the [...] were [...] names:
[...]X - 63: (X) - [...]
/dev/parnoderegX - 63: (X+8) - [...] registers
/dev/parnodesdramX - 63: (X+16) - SDRAM memory
[...] - 63: (X+24) - Chip Selects 0..2
/dev/parnodebootX - 63: (X+32) - Boot CS 3
[...] and [...] numbers identifying the sub-device referenced. [...] '0' to [...]
and indicates the node number in the [...]. [...] stands for 'Parallel Node', which is
how the [...].
The [...] of each of the Galileo [...]. In some cases, the [...] window is not large
enough to address the entire device. For this reason, PCI windows can [...] these areas.
One of the primary reasons [...] an interface to modify these offsets.
The final addition to the Linux 'parnode' driver was a [...] emulation layer
to provide a virtual network between the local [...] the Linux
operating system that was implemented on the MIPS processor.
Two approaches to this networking emulation were [...]. [...] was to emulate a
full Ethernet network including Ethernet [...] Address Resolution Protocol support.
This [...] Linux PCI-skeleton [...] networking code for NE2000 compatible [...]
and buffers [...] that are provided by a physical network [...] by using shared [...]
between the host and target board. The basic [...] was to place a packet
ready for transmission into the shared [...] and signal the remote side that a
packet has arrived. The remote [...] will then read this buffer and insert the packet
into the networking layer as if [...] had just received it. This approach
worked well initially for the low [...] emulation and ARP / ICMP protocols;
however, TCP connections were ignored. After an extensive debugging effort, a
solution was not found and it was [...] to try the alternative approach, that was to
emulate a Point-To-Point link.
For the Point-To-Point [...] concepts were the same as for the Ethernet
link except no [...] information, MAC addresses or ARP messages
are sent across the [...] knows [...] address of the target machine and
less processing is performed as no [...] decoding has to be performed. This
driver used [...] Point-to-Point (TUN) and Ethernet (TAP) device drivers as [...]
implemented and after debugging and fixing various [...]

Several changes were made to try and improve the performance of the virtual
networking link. The initial driver used the unused upper bank of the SRAM on the 
node as a shared memory buffer. The host operating system did not use this device in 
its system memory map. For this reason, no other programs or data would be affected 
by using this memory. The problem however was that it was an 8-bit device with slow 
timing. To try and increase access speed, a 32kb block of SDRAM main memory was 
reserved early during the Linux MIPS kernel memory management initialisation. This 
was reserved for the networking driver. Both the host PC and board processor had 
much greater bandwidth to the 32-bit SDRAM memory and the performance increase 
justified the effort. 
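The shared memory packet exchange used by the virtual network can be illustrated with a simple single-slot mailbox. The layout below (a one byte flag, a two byte length and then the packet data) is entirely hypothetical, as the buffer format used by the 'parnode' driver is not given in the text. A Python bytearray stands in for the reserved block of node memory that both sides can access.

```python
import struct

FLAG_EMPTY = 0
FLAG_FULL = 1
# Hypothetical header: 1 byte flag followed by a 16-bit little endian length.
HEADER = struct.Struct("<BH")


class Mailbox:
    """One direction of a shared memory packet link (illustrative only)."""

    def __init__(self, shared: bytearray):
        self.mem = shared

    def send(self, packet: bytes) -> bool:
        """Place a packet in the buffer; return False if the slot is still full."""
        flag, _ = HEADER.unpack_from(self.mem, 0)
        if flag == FLAG_FULL:
            return False
        if len(packet) > len(self.mem) - HEADER.size:
            raise ValueError("packet too large for the shared buffer")
        start = HEADER.size
        self.mem[start:start + len(packet)] = packet
        # The header, including the 'full' flag, is written after the payload.
        HEADER.pack_into(self.mem, 0, FLAG_FULL, len(packet))
        return True

    def receive(self):
        """Return the waiting packet and free the slot, or None if it is empty."""
        flag, length = HEADER.unpack_from(self.mem, 0)
        if flag != FLAG_FULL:
            return None
        start = HEADER.size
        packet = bytes(self.mem[start:start + length])
        HEADER.pack_into(self.mem, 0, FLAG_EMPTY, 0)
        return packet
```

In a real driver the 'full' flag would be accompanied by an interrupt or doorbell to signal the remote side, and one such slot would be needed for each direction.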
8.1.2 Bus Access and FLASH programmer 
The Bus Access design in the FPGA implemented a serial port and protocol to which
a PC could interface in order to read and write devices on the Peripheral Bus through
the Local Bus. The first application to be developed was a speed-test program to test
the system by reading and writing the SRAM device as fast as the serial port allowed.
During the development of this program, the routines for handling the communications
protocol were developed.
One of the primary reasons for developing the Bus Access design was to allow 
the FLASH memory on the board to be programmed without the need for a working 
processor or memory controller. A suite of programs was written to perform actions 
using the Bus Access design. 
The next program to be developed was a FLASH identification program. It was 
required to read the FLASH identification code from the FLASH devices on the board 
using the common FLASH interface, which the FLASH devices supported. The FLASH interface
is memory mapped in the FLASH devices and allows the programming, erasing
and controlling of the device. The protocol was implemented using the specifications 
given in the FLASH datasheets. This involved sending a sequence of commands to the 
device and reading back the result. The successful reading of the FLASH ID would 
also verify the operation of the device as it was the first test to interface the devices. 
A similar program to erase the FLASH memory was written based on the Get_ID
program. Finally, two programs, 'program_flash' and 'fast_program', were developed
to program the FLASH memory. They implemented different programming algorithms
as specified in the datasheets [x]. The 'fast_program' program was finally used
to program the FPGA Configuration FLASH memory device with an FPGA configuration.
The standard FLASH programming routines in the 'program_flash' program
involved many more read/write operations, and over the serial link it was calculated
that they would take more than 3 hours to program the device. Finally, routines to read the
FPGA configuration files in Intel MCS-86 format were written in order to program
the FPGA FLASH device with configuration data. The FLASH programming operations
were proven successful as the read-back from the device was identical to the file
programmed into the device.
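Reading a configuration file in this format amounts to decoding a sequence of Intel hex records, each consisting of a byte count, a 16-bit address, a record type, the data bytes and a checksum. The following sketch shows the idea. It is not the routine written for the project, the function names are hypothetical, and only the data, end-of-file and extended address record types are handled.

```python
def parse_hex_record(line):
    """Decode one Intel hex record into (record_type, address, data)."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record does not start with ':'")
    raw = bytes.fromhex(line[1:])
    if len(raw) < 5 or len(raw) != raw[0] + 5:
        raise ValueError("record length does not match its byte count")
    if sum(raw) & 0xFF != 0:
        raise ValueError("checksum error")  # all bytes must sum to 0 modulo 256
    address = (raw[1] << 8) | raw[2]
    return raw[3], address, raw[4:4 + raw[0]]


def load_hex(lines):
    """Return a dict mapping absolute addresses to byte values."""
    image = {}
    base = 0
    for line in lines:
        if not line.strip():
            continue
        rtype, address, data = parse_hex_record(line)
        if rtype == 0x00:      # data record
            for i, byte in enumerate(data):
                image[base + address + i] = byte
        elif rtype == 0x01:    # end of file record
            break
        elif rtype == 0x02:    # extended segment address (bits 4..19)
            base = int.from_bytes(data, "big") << 4
        elif rtype == 0x04:    # extended linear address (bits 16..31)
            base = int.from_bytes(data, "big") << 16
    return image
```

The resulting address to byte map can be compared directly with data read back from the device, in the same way as the read-back verification described above.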

It is possible to implement the FLASH programming routines in the FPGA but the 
time taken to implement the design was not worth the effort. The main method for 
programming the FLASH in the final system would be from the host PC system across 
the PCI bus. This method would be orders of magnitude faster. The Bus Access design 
merely proved a useful development and testing tool. 
8.1.3 Debugging and Utilities 
A full suite of debugging, testing and utility programs was written for the host PC
Linux system for interfacing the ERPCN01 nodes. The major programs are described
briefly below.
Flash-PCI 
The Flash-PCI utilities are a set of programs to test and program the FLASH memory 
devices on the boards. They talk to the board using the 'parnode' Linux driver and use
the same programming and identification routines as the Bus Access software. 
UART 16550 Test 
In order to test the 16550 VHDL design in the FPGA from across the Local Bus, a 
simple program was written to memory map the FPGA through the Linux 'parnode'
driver. This allowed the registers of the UART in the FPGA to be modified by simply 
reading and writing to a pointer. This created a very simple and fast utility to directly 
control the UART in the FPGA. 
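The memory mapping technique can be illustrated as follows. The sketch maps a file into the program's address space and then accesses UART registers by simple indexing. A temporary file stands in for the 'parnode' device node so that the example runs without the hardware. The register offsets are those of a standard 16550, and their byte-wide spacing within the FPGA window is an assumption.

```python
import mmap
import os
import tempfile

# Standard 16550 register offsets (byte-wide spacing is assumed here).
UART_LCR = 3   # Line Control Register
UART_SCR = 7   # Scratch Register


def make_stand_in(size=4096):
    """Create a zero-filled temporary file that stands in for the device node."""
    fd, path = tempfile.mkstemp()
    os.write(fd, bytes(size))
    os.close(fd)
    return path


def map_window(path, size):
    """Memory map `size` bytes of the given file for reading and writing."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor is closed


def scratch_test(window, pattern=0x5A):
    """Write a pattern to the scratch register and check that it reads back."""
    window[UART_SCR] = pattern
    return window[UART_SCR] == pattern
```

With the real driver, the path would be one of the device nodes it provides, and a write such as `window[UART_LCR] = 0x03` would select 8 data bits, no parity and 1 stop bit on a 16550 compatible UART.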
FPGA Configuration 
One of the major aims of the project was to enable FPGA configuration at run-time 
in the system. Run-time basically means that the process occurs while the rest of the 
system is running. The software to program the FPGA is fairly simple. The FPGA 
configuration file is decoded and programmed into the SRAM configuration buffer. 
The Bus CPLD device is then triggered to program the FPGA by accessing the system 
interface it provides. 
Audio Test 
The Sigma-Delta firmware design in the FPGA was tested by sending raw audio samples
to the FPGA into the FIFO buffer. A simple loop was written that polled the FIFO
state in the FPGA and inserted data as required. The input data was taken from the system
'stdin' (standard input) which allowed audio data to be 'piped' into the program.
Most of the debugging and fixing involved overcoming the read latencies and
buffering the input to prevent the FIFO from emptying.

Utility Programs 
A number of utility programs were developed to aid various system operations and 
functional requirements. Some of these programs were: 
cpu_reboot Accesses the Bus CPLD system interface to reboot the processor
cpu_reset Accesses the Bus CPLD system interface to reset the processor
cpu_run Accesses the Bus CPLD system interface to release the processor from reset
dev_dump Programs a binary file into a memory location in the CS 0..2 PCI window
dev_read Reads from a memory location in the CS 0..2 PCI window
map_cs1_devs Instructs the 'parnode' driver to map the PCI window CS 0..2 to CS 1
reset_maps Instructs the 'parnode' driver to set the PCI windows back to their original state
sdram_read Reads from the SDRAM memory
setup_sdram_boot Programs the memory controller to boot the processor from SDRAM
setup_sram_boot Programs the memory controller to boot the processor from SRAM
One last Unix program, 'dd', was used extensively to dump binary files into the board
memory through the device driver.
8.2 MIPS Software 
The major software development effort focused on the software written for the MIPS
processor in the project hardware. Four distinct sets of software were developed during
the course of this project: test programs to verify the operation of the processor and
system, the Diesel boot-loader software to initialise and interact with the processor, the
Linux operating system kernel porting effort and various Linux applications.
Before any programs for the MIPS processor could be developed, a development
environment needed to be set up on a development PC. The development tools that were
used are all GNU open source software and consisted of: binutils - Binary utilities for
creating, modifying, extracting and viewing compiled object files and archives. GCC
- The GNU C Compiler. GDB - The GNU Debugger including DDD and Insight.
glibc - The GNU standard C Library. newlib - An embedded glibc alternative with
no operating system calls. Three sets of tools were configured for different processor
configurations and feature usage. These were: MIPSel - little endian 32-bit MIPS
compatible, MIPS64el - little endian 64-bit MIPS compatible and MIPSel-Linux -
little endian 32-bit MIPS compatible for Linux applications. The MIPS Linux tool-chain had
many problems and issues at the time of use and pre-compiled, tested versions of the
compilers were used to ensure correct operation.
8.2.1 Verification Programs 
The first programs to be developed were used to test the compiler tool chains and gain 
a greater understanding of the tools and the MIPS instruction set. The code written for 
these tests was in MIPS assembler language and was used to boot the processor and 
initialise the Galileo memory controller. This code was simulated but never run on the 
processor. 
The first program to be executed on the processor was designed to be run from
SRAM. The program configured the processor for basic operation and a main loop incremented
a 64-bit counter at a specific memory location starting from value '0'. During
this test, the processor was running un-cached which meant that the physical memory
location was constantly updated and could be read from the host PC across the PCI
bus. This was used to verify the operation of the processor.
The first programs written in 'C' were then developed to test the compiler tool
chains. An initialisation function in assembler language was used to initialise the processor,
set up the stack and clear the program heap. This is required to start execution
of a 'C' program. The GNU term for this file is the 'crt0' file, which is used to initialise
all programs and is required to be custom written for embedded systems. The first 'C'
program tested implemented the same counter as the previous assembler program.
This was followed by a program that interfaced the 16550 UART in the FPGA to echo
received characters.
One last assembler program was written to test the SDRAM memory. A value was 
written to the SDRAM while the program was running from SRAM to verify that the 
processor to SDRAM interface of the GT64115 was working. 
8.2.2 Diesel Boot-Loader 
Before work on the Linux kernel could begin, a boot loader and debugging platform 
was required which would also have the responsibility of setting up the hardware sys­
tem for Linux and executing it. 
The processor initialisation code from the previous programs was used as a base 
for the Diesel initialisation. Various additional checks such as endianness and Galileo 
configuration were added to prevent the program from executing under an incorrect 
hardware environment. Interrupt handling routines were written to provide error messages
when the program execution caused an exception. Tests to determine from what
location the program was running were performed to allow various features to be enabled
or disabled. One such feature was that processor caches should only be used
when running from SDRAM memory. The processor cache controller was reset and 
the various routines to enable a 'C' environment were performed before executing the 
main program. 
The Diesel Boot-Loader was designed to be simple and to make it easy to add new features.
A serial port communications interface was designed so that changing the serial
port functionality (in the FPGA) would not affect the rest of the system. A shell interface
was written to provide a user interface to the [...]. This implemented
a [...] style to allow various shell commands to be added into the
system to provide functionality. The shell functions provided were:
boot Execute a program from a user provided address, typically used to start Linux
cache Enable/disable processor [...] the [...] modes
cp0 Allows access to the processor's [...] 0 (cp0) for system control as well as [...]
help Provide information on use of the [...] and various shell commands
memory Display the memory controller's configuration in an easy to understand [...]
port [...] read or [...] operations to a specified [...] (processor address)
propane [...] the core modules present in the Propane interface
reset Perform a processor reset
rtc [...] and modify the date and [...] in the on board Real Time Clock
The Diesel boot [...] was modified to [...] a test [...]
benchmarks in a 64-bit environment without an operating [...].
8.2.3 Linux Kernel 
Porting little endian Linux to the ERPCN01 hardware platform took the
majority of the software development time and [...]. During the development period,
various versions of the kernel were experimented with and much of the kernel memory
management, filesystem [...] architecture specific [...] was under [...].
The porting effort from the start was a very big task, which needed to be broken
down into units in order to manage the work load and understand what was [...].
The [...] task was to [...] the kernel sources and add in support for the ERPCN01
platform into the MIPS64 architecture port. The next task was to [...]
source adding configuration and [...] problems to [...] the [...] to
compile. Then the task of debugging code on actual hardware was [...].
Drivers for serial port, networking and audio processor were
then written. One last task was to implement an initialisation RAM disk from which [...].
[...] task was to implement architecture [...] code [...] involved developing
[...] system error handling interfaces [...] the kernel. [...] bus error handling
code was taken from the Silicon Graphics 'Indy' port as the internal
operation was identical. A low-level interrupt handler [...] on the 'Indy' port was
[...] assembler that differentiated between [...] and system [...]. The
reason for handling the timer differently was in order to reduce the latency of this high
to the Linux system. 
terrupt, 
as shown in 
num­
interrupts were 
memory controller and the Propane 
the two interfaces. 
of the generic 
cleared the Linux 
was configured and setup memory regions as per the 
shared memory was 
YnI"Ynl'\l"'\/ maps, the various 
getting the kernel sources setup for the new 
next major task. It was discovered that the 64-bit 
During the process of 
options in order to compile the required source 
of code were 
would need to 
required. The initial 
was the latest 
.... 
setup the cor­
as 'stub' code 
at a later stage 
version that was 
development 
which had not 
on fixing the 
frequency interrupt. main high-level interrupt 
rupts from the devices in the system. MIPS 
hardware interrupt however each interrupt can cascaded i.e. the 
ory controller has a interrupt line to the 
sources. The interrupt will respond the the Galileo interrupt by requesting the 
cause of the from the device. A table was specified the 
direct and 
Routines were written to allow the ,",..,un'.. l 
ber. This would automatically mask or 
Handlers for 
rupts were to handle specific details 
tines to initialise memory system prior to 
management were needed. These 
how the 
the MIPS interface. This is 
for the networking device. A was implemented to setup the MIPS inter­
nal timer to act as the Linux kernel Much of the timer code was devised from 
the 'Indy' Clock in the 
Propane interface was developed. was used to set Linux clock and 
keep it synchronised. Finally, routines to reset and reboot the were provided 
as required by Linux kernel. 
was not 
to compile. 
during debugging of the 
worked on was the Linux 2.4.9 
started. It was found that various "'U""U,,",'~" 
been in the MIPS port. 
MIPS port to allow the kernel to ""'-""..... u 
was written to handle 
only nrrn,,,'1..., 
but has 21 possible .nt,,,,"rr,nnt 
a fixing the problems the kernel from COlmpUIfI,2; 
fixing and implementing 'stub' routines as started. boot loader 
was used to execute kernel, which was placed into SDRAM memory by 
over the PCI bus. from the kernel entry point, an iterative process 
debugging break-points was initialisation 
function on the Linux was as debugging As 'stub' func­
were encountered, they were with the to implement the 
function. One of the first tasks was to the kernel operational which 
a working serial port When the serial port was working and 
console linked to it, it allowed kernel-debugging to echoed back to 
from the serial port. functions were verified, the debugging check-points 

Table 8.1: Virtual Interrupt Allocation
Interrupt | Cascaded IRQ | Description
0         | no           | Software Interrupt 0
1         | no           | Software Interrupt 1
2         | no           | Galileo GT64115
3         | no           | FPGA
4         | no           | PCI / Bus CPLD
5         | no           | Bus CPLD
6         | no           | Config CPLD
7         | no           | MIPS Timer Counter
16        | 2            | CPU Memory Address Bounds Error
17        | 2            | DMA Memory Address Bounds Error
18        | 2            | PCI Host Access Error (Memory Error / Parity etc)
19        | 2            | DMA-0 Complete
20        | 2            | DMA-1 Complete
21        | 2            | DMA-2 Complete
22        | 2            | DMA-3 Complete
23        | 2            | Timer 0 Interrupt
24        | 2            | Timer 1 Interrupt
25        | 2            | Timer 2 Interrupt
26        | 2            | Timer 3 Interrupt
27        | 2            | PCI Read Parity Error
28        | 2            | PCI Write Parity Error
30        | 2            | PCI Abort Termination
31        | 2            | PCI Retry Expire
32        | 2            | Power Management Request
33        | 2            | PCI Interrupt 0
34        | 2            | PCI Interrupt 1
35        | 2            | PCI Interrupt 2
36        | 2            | PCI Interrupt 3
37        | 2            | PCI Interrupt 4 (remote debug request)
48-55     | 3            | Propane Interface Interrupt 0-7
48        | 3            | Real Time Clock
49        | 3            | Typically Propane UART (should auto-detect)
63        | -            | Last interrupt

and break-points were removed to allow the kernel to boot further. 
Some major problems were encountered during this process as soon as the cache 
and memory management sections of the processor were enabled. Problems in the 
cache caused strange and unpredictable operation and a lot of time was spent reading 
back the SDRAM and comparing it to the original binary and Linux System Map. A 
lot of time was spent on the Linux MIPS IRC channel talking to the developers from 
SGI and other companies in order to solve these problems. 
Various small goals were set during this development period, such as getting
the MIPS timer operational, enabling system interrupt handling and fixing memory
management problems. Identifying the cause of the memory management problems
required tracing program execution through the kernel filesystem, drivers and
process handling code. The problem was finally tracked down to the handling of the
Translation Look-aside Buffers (TLB) in the low-level MIPS memory management
routines. A quick fix was implemented, but it carried a performance penalty. The proper
fix was left to the Linux MIPS developers, who confirmed that a problem existed
but could not state when it would be fixed. Getting the rest of the system
operational was the priority, so the memory management was left in this state.
Once the kernel was operational to the point where it required the 'init' program to
be executed to boot the system, work started on implementing an initialisation RAM
disk. An initialisation RAM disk is a filesystem held in memory that the kernel
uses as its root filesystem in order to run just enough programs to enable access to a
larger, possibly remote, filesystem. For the embedded system, this filesystem needed
to be linked into the kernel object code by the linker, which would provide the kernel
with pointers to the location of the data. Code to implement this was written, along
with utility programs to package the RAM disk as a linkable object file. Fortunately,
the Linux MIPS applications in the RAM disk are unaware of the underlying
hardware, as the kernel provides all the interfaces. For this reason, a standard Linux
MIPS RAM disk file can be obtained or generated that will work with any Linux MIPS
kernel as long as the endianness of the files is correct.
Debugging the kernel, although complicated by the caches, was reasonably practical
with the use of debugging messages and by monitoring memory from the host. Once
interrupts were enabled and processes started to be spawned, however, any piece of
code in the interrupt handler and system call interface could be called at any time from
any process or interrupt routine. Debugging became very difficult in an environment
where it is not certain what is running when, or whether the currently executing
function will complete before an interrupt occurs and the system executes other code
before returning to it.
Debugging user-land processes was even more complicated because they execute
in virtual memory, which may or may not be resident in physical memory and is
probably not in contiguous blocks. User space programs were tracked by inserting
debugging code at the kernel system call entry and exit points to try to deduce which
operating system calls were being executed.

A major problem in the entire system was that the kernel ran in 64-bit
mode and user space programs in 32-bit mode. This was primarily because
the Linux compilers were unable to produce 64-bit Linux user space programs at the
time of development. Every system call had to be evaluated to make sure that it was
32-bit safe. This was implemented as a 32-bit compatibility layer that was mostly
provided by the standard MIPS kernel, but a few extra handlers had to be implemented.
The most common problem when running a 32-bit program on a 64-bit kernel is that the
sizes of C pointers and 'long' variables differ between the two, and thus data-structure
sizes may differ. This can cause memory overruns, resulting in unpredictable errors.
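The size mismatch described above can be illustrated with a short sketch. The record layout below is hypothetical (a 'long', a pointer and an 'int') and is not taken from the kernel sources; alignment padding, which a real compiler would add, is deliberately ignored so that the numbers are the same on any host.

```python
import struct

# Hypothetical record: one 'long', one pointer, one 'int'.
# The '<' prefix selects fixed standard sizes with no padding.
LAYOUT_32 = "<lLi"  # 32-bit ABI: long = 4, pointer = 4, int = 4 bytes
LAYOUT_64 = "<qQi"  # 64-bit ABI: long = 8, pointer = 8, int = 4 bytes


def record_size(layout):
    """Size in bytes of the record under the given layout."""
    return struct.calcsize(layout)


def overrun_bytes(writer_layout, buffer_layout):
    """Bytes written past the end of a buffer sized for buffer_layout
    when the writer assumes writer_layout."""
    return max(0, record_size(writer_layout) - record_size(buffer_layout))


if __name__ == "__main__":
    print("32-bit record:", record_size(LAYOUT_32), "bytes")
    print("64-bit record:", record_size(LAYOUT_64), "bytes")
    print("overrun:", overrun_bytes(LAYOUT_64, LAYOUT_32), "bytes")
```

If a 64-bit kernel copies its own version of such a record into a buffer that a 32-bit program sized for the smaller layout, the extra bytes land in whatever follows the buffer, which is the kind of overrun the compatibility layer has to prevent.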
After a very extensive debugging period, the Linux kernel was successfully able 
to execute a shell program, giving console access to the system. This allowed the 
development of further Linux application software for the system to be started. 
Following the successful implementation of Linux on the ERPCN01 system, drivers
were written to implement the sigma-delta audio interface and virtual networking
interfaces. The sound card driver that provides an interface to the sigma-delta
DSP was developed from existing Linux drivers but did not implement most of the
'ioctls' (I/O controls) defined by the sound interface. The driver was written to provide
just enough functionality to 'pipe' audio data from the application, through the driver, to
the FPGA. A ring buffer was implemented in the driver to prevent buffer under-run
problems; the FPGA FIFO state was polled at a constant rate and new data was
provided when necessary. Interrupt-driven operation was tried but was unsuccessful,
and the time needed to fix it could not be justified.
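The ring buffer and polling scheme described above can be sketched as follows. This is only an illustration of the technique, not the driver code: the class name, the function names and the callback standing in for the FPGA FIFO are all assumptions.

```python
class RingBuffer:
    """Fixed-size byte ring buffer between the application and the FIFO."""

    def __init__(self, capacity):
        self._data = bytearray(capacity)
        self._head = 0    # index of the next byte to write
        self._tail = 0    # index of the next byte to read
        self._count = 0   # number of bytes currently stored

    def write(self, chunk):
        """Store as much of chunk as fits; return the number of bytes accepted."""
        accepted = min(len(chunk), len(self._data) - self._count)
        for byte in chunk[:accepted]:
            self._data[self._head] = byte
            self._head = (self._head + 1) % len(self._data)
        self._count += accepted
        return accepted

    def read(self, max_bytes):
        """Remove and return up to max_bytes of the oldest data."""
        taken = min(max_bytes, self._count)
        out = bytearray()
        for _ in range(taken):
            out.append(self._data[self._tail])
            self._tail = (self._tail + 1) % len(self._data)
        self._count -= taken
        return bytes(out)


def poll_fifo(ring, fifo_free_bytes, fifo_write):
    """One periodic poll: top up the FIFO with whatever it has room for.

    fifo_free_bytes and fifo_write are hypothetical stand-ins for reading
    the FPGA FIFO state and writing samples to it.
    """
    data = ring.read(fifo_free_bytes)
    if data:
        fifo_write(data)
    return len(data)
```

Because the application fills the ring in large writes while the poll drains it in FIFO-sized pieces, short stalls in the application do not immediately starve the FIFO, which is the under-run protection the text refers to.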
The virtual networking driver was implemented to provide a TCP/IP communications
link between the host PC and the node. Two versions of the driver were
implemented, as described in the section on the Linux host PC driver. The specifications
for the virtual Point-to-Point networking link, which was eventually successfully
implemented, were:
•  Two 'mailbox' shared memory windows would be used for transmit and receive
buffers.
•  Each 'mailbox' will have a field indicating the size of the message packet and
an area of memory for the packet.
•  The signalling between the two processors will be done using interrupts. All
required information will be present in the 'mailbox' for the receiving side to
use.
•  Presently, no acknowledgement of receipt is implemented.
•  The interface will emulate a TCP/IP Point-to-Point link; no other low-level
protocols will be usable.
•  The virtual interrupt number 33 in the Linux MIPS kernel was allocated for the
transmit interrupt and 34 used for the receive interrupt. See Table 8.1.
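The mailbox specification above can be sketched in a few lines. The layout details here are assumptions made for illustration only: the text does not state the width or byte order of the size field or the size of the window, so a 32-bit little-endian length at offset 0 is assumed, and the interrupt is represented by a callback.

```python
import struct

# Assumed layout: 32-bit little-endian length at offset 0, packet data after it.
LENGTH_FIELD = struct.Struct("<I")

# Virtual interrupt numbers quoted in the specification.
IRQ_TRANSMIT = 33
IRQ_RECEIVE = 34


def mailbox_send(window, packet, raise_irq):
    """Copy packet into the shared window, record its size, then signal."""
    capacity = len(window) - LENGTH_FIELD.size
    if len(packet) > capacity:
        raise ValueError("packet does not fit in the mailbox")
    start = LENGTH_FIELD.size
    window[start:start + len(packet)] = packet
    # The length is written last so the receiver never sees a size
    # for data that has not been copied yet.
    LENGTH_FIELD.pack_into(window, 0, len(packet))
    raise_irq()


def mailbox_receive(window):
    """Called from the interrupt handler: return the packet in the window."""
    (length,) = LENGTH_FIELD.unpack_from(window, 0)
    if length > len(window) - LENGTH_FIELD.size:
        raise ValueError("corrupt length field")
    start = LENGTH_FIELD.size
    return bytes(window[start:start + length])
```

With no acknowledgement of receipt, a sender using this scheme can overwrite a mailbox before the receiver has read it, so the sketch relies on the receiver servicing the interrupt promptly.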
link enabled use of Network System (NFS) between the host 
and NFS come as part of the source code 
were was used to mount the host's 
SDRAM memory reserved ..,,""'''' memory communications was •.aU• 

at address 0x00C00000.
exported filesystems to 
8.2.4 Linux Programs 
Compiling programs for Linux MIPS is as simple as changing the compiler
used to compile for Linux on a PC. Once the Linux kernel was
operational, software development could begin on the system.
What follows is a short description of some of the various programs used for testing and
benchmarking the MIPS platform.
the first intensive nrnOT<l 
as the output 
on the processor was an 
audio decoding program called audio driver to 
the sigma-delta converter was MPEG 1 3 
files were demonstrated on the system. 
FFTW is a collection of fast C routines to compute the Discrete Fourier Transform in
one or more dimensions. Provided with the library is a test program that computes
one or multi-dimensional transforms given a set of command line options.
whetstone 
A converted version of the Whetstone Double Precision Benchmark. This is a
simple benchmark to provide a performance measure of both the floating point and integer
performance of a processor.
fastdct 
Fastdct is a program to compute the Discrete Cosine Transform algorithm 
from IEEE proc, 1992 Yugoslavian authors. This " ....,.......... was used to 
benchmark processors DCT performance to with the performance. 

tcpserver test 
A simple TCP server running on port 37 (time) was written to test the networking 
code in the kernel. This program used simple Unix sockets to listen on the port. The 
program was first tested on a standard Intel PC running Linux before being compiled 
for the MIPS processor. 
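A server of this kind can be sketched as below. The text only says that the program listened on port 37 using simple sockets; that it replied in the standard time protocol format (a 32-bit count of seconds since 1900) is an assumption, and the function names are illustrative.

```python
import socket
import struct
import time

# Seconds between 1900-01-01 (time protocol epoch) and 1970-01-01 (Unix epoch).
EPOCH_OFFSET_1900 = 2208988800


def time_reply(unix_seconds):
    """Encode a Unix timestamp as the 4-byte big-endian time protocol reply."""
    return struct.pack("!I", (int(unix_seconds) + EPOCH_OFFSET_1900) & 0xFFFFFFFF)


def make_listener(port=37, host=""):
    """Create a listening TCP socket. Port 37 normally needs root privileges."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def serve_once(listener):
    """Accept one connection, send the current time and close it."""
    conn, _addr = listener.accept()
    with conn:
        conn.sendall(time_reply(time.time()))
```

A test like the one described needs nothing more than connecting from the host to the node's address and reading four bytes, which exercises connection set-up, data transfer and tear-down in the kernel networking code.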
PVM 
From the PVM manual: PVM is a software system that enables a collection of
heterogeneous computers to be used as a coherent and flexible concurrent computational
resource, or a "parallel virtual machine".
PVM is a program that runs on various computer systems and provides transparent
communications between computing machines. A master program is responsible for
managing the application; it requests that processing clients be started and provides
them with data. The PVM software is responsible for executing the programs
on the remote machines as well as for performing communications, with data conversion
where machines have different data representations. Because each machine can be
different, versions of the client programs must be compiled for each of the target
architectures.
The first problem experienced with PVM was getting the application to
compile and execute. After sorting out a number of problems with the PVM source code,
a compiled version of PVM was obtained. After a lot of work, PVM executed on the
system and opened network connections but failed to communicate with the host PC.
This problem could not be resolved in a reasonable time frame.
miscellaneous 
A simple http server program was compiled and tested on the platform. 
Various network client programs such as 'ping' and 'ftp' were used to verify the 
operation of the networking interface. Network sniffing programs such as 'tcpdump' 
were used on the host system to monitor network communications. 
Unix and Linux programs such as 'mount', 'cat', 'ls' and 'ifconfig' were used
extensively on the system.
Chapter 9 
Testing and Verification 
Testing and verification were important in order to demonstrate that the hardware and
software performed to the requirements. This chapter describes the testing
and verification procedures performed to test the firmware and software.
The hardware verification process was described in chapter 6.
9.1 Firmware Verification 
9.1.1 Config CPLD 
The Config CPLD firmware design in VHDL was tested incrementally during the
design phase through the use of simulations. Each sub-section of the design was
simulated after implementation to verify correct logic behaviour. The design included
counters that produced real-time delays in excess of 100ms from a 66MHz clock.
Simulating these in full would have been extremely large, so the delays were reduced
considerably for the simulated tests. A full simulation at 100ps resolution would require
10,000,000,000/66,666,666.667*8,500,000 = 1,275,000,000 (roughly 1.3 billion) steps.
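The step count quoted above can be checked with a short calculation. The reading of the expression used here is an interpretation: 10,000,000,000 is taken as the number of 100ps steps in one second, 66,666,666.667 as the clock frequency in Hz, and 8,500,000 as the number of clock cycles counted by the delay.

```python
def simulation_steps(clock_hz, clock_cycles, resolution_s):
    """Number of simulator time steps needed to cover clock_cycles cycles."""
    steps_per_cycle = 1.0 / (clock_hz * resolution_s)
    return steps_per_cycle * clock_cycles


def delay_seconds(clock_hz, clock_cycles):
    """Real-time delay produced by counting clock_cycles cycles."""
    return clock_cycles / clock_hz


if __name__ == "__main__":
    clock = 66_666_666.667   # Hz
    cycles = 8_500_000       # cycles counted by the delay
    print(f"{simulation_steps(clock, cycles, 100e-12):.4g} steps")
    print(f"{delay_seconds(clock, cycles) * 1e3:.1f} ms")
```

Under this reading there are 150 steps per clock cycle, giving about 1.3 x 10^9 steps, on the order of one billion, for a delay of about 127.5ms. This is consistent with the "in excess of 100ms" figure and shows why the delays were shortened for simulation.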
Once the design had been fully simulated, corrected and verified, the task of testing
the design in physical devices was initiated. Initially, the design was modified to
hold the processor-reset asserted while enabling the functionality of the rest of
the system. This allowed the testing of the reset delay functionality without affecting the
processor.
Once it was verified that the reset counters operated correctly, the
initialisation section of the design was enabled. An oscilloscope was used to verify
the signals from the CPLD as well as to check the timing conformance
of the various CPU interface signals. Various unforeseen problems, such as incorrect
bit-ordering and inverted inputs from the dip-switches, were found and corrected.
Finally, the initialisation sequences were verified and the final processor
reset was enabled. An oscilloscope was used to monitor the processor's SysAD
bus for activity, which would indicate that the device had initialised. This
confirmed the correct functioning of the Config CPLD design.
9.1.2 Bus + Control CPLD 
The Bus CPLD design was extensively simulated to verify the correct operation of the 
local-bus interface before the design was implemented in the CPLD. Any problems 
in the design could cause bus contention, which could interfere with the operation of 
other devices on the bus. 
A 500MHz oscilloscope with a PC interface was used to assemble a timing diagram 
of the Bus CPLD's interaction with an emulated memory controller in the FPGA. See 
7.4.4. The FPGA simulated four byte burst transfers across the Bus CPLD bridge to 
the SRAM and Flash Devices. 
The only major problem encountered during the testing of the Bus CPLD was with
the memory controller. The device did not respond to any requests on the Local Bus.
During an investigation of the problem, every control signal on the Local Bus was
captured to make sure the memory controller was producing the correct signals. Two
problems were discovered: the memory controller was not generating a chip-select
signal, and the Bus CPLD design specified an incorrect pin location for a Local Bus
signal. The memory controller problem was tracked down to the memory
controller's register configuration, which was being modified by the host PC's PCI BIOS.
Once memory behind the Bus CPLD on the Peripheral Bus Bridge was accessible
and working correctly, the Bus CPLD design was accepted as meeting the
requirements.
9.1.3 FPGA Testing 
Many designs were implemented in the FPGA during this project. Many of them
re-used components from previous implementations such as the UART module, Local
Bus interface and Propane system interface. Each module was tested individually and
proven before use in other designs.
The initial tests of the FPGA were required to verify that the hardware design was 
correct and that the FPGA was functional. These tests output signals to the LEDs to 
provide a visual verification of the design. 
The Local Bus emulation testing and verification was performed mostly in the 
simulator, with functional and timing simulations being performed. The Local Bus 
interface was required to be correct in order to verify the operation of the Bus CPLD. 
The design was worked on until it satisfied all the specifications of the Local Bus 
interface in the simulations. A captured trace of the physical bus signals was also 
acquired with the digital oscilloscope. 
The next FPGA design to be tested was the 16550 UART core. The initial tests
were performed on the transmitter unit. A continuous stream of characters was
output from the FPGA. Once the baud rate generator and transmitter unit were
functioning as specified, the characters were received and verified using a serial port on the
the 
the last task was 

host receiver was the next to be tested. transmitter was connected to 
the receiver unit so that characters were retransmitted or 'echoed' back to the 
The correct reception of transmitted verified UART and 
final test of the UART was the generation of a state machine that 
bus setting up the registers initially, then polling 
of a the UART, incremented by 
one and sent to the transmitter back to the operation of the 16550 
UART design. 
The Local and 16550 UART designs were linked with control that im­
plemented a protocol communicating over serial port. design was 
In The first was to reading and writing to the SRAM in 
verify the operation. A digital oscilloscope was to verify the assertion of 
the correct on the and Peripheral busses. Once the read write oper­
ations were working, the design was shown to be working 
to that the devices the system be 
to 
Successful reading and writing of the FLASH device proved that the design was working
correctly. During the programming, over 700,000 characters were correctly
transferred over the UART without a single error.
The Local Bus device interface was tested with the Galileo memory controller as the
master of the Local Bus. Transactions on the Bus were initiated from the host
PC across the PCI bus. During testing the interface was refined to resolve problems
such as write cycle synchronisation. The final design was verified when the
interface was functioning as specified and the physical signals on the bus were verified
with the oscilloscope.
The sigma-delta modulator was only verified by the quality of the output
signal. The first tests produced output which only vaguely indicated
the presence of a signal. This was found to be a problem with the modulator design and
was corrected. The next problem was that the digital audio format was interpreted
incorrectly and the signal was 'clipped' and inverted, which gave very poor
quality sound with a heavy noise component. Once this was identified and
fixed, the audio quality output equated to AM radio. Various problems causing the
FIFO buffer to under-run were fixed and eventually, high quality audio output was
achieved with only very little modulation noise at very low output volume.
final testing on the system was concerned with the 
system interface and its associated modules. 
read/write access to interface and then the operation 
The first 
rlP,,"r'Pof each 
tasks, a on the PC was to test Propane interface across 
The verifying of the Propane interface found a problem with the Local Bus 
rpt"tQf"'P that had been adapted this design. lot of work was required to find and 
solve problem. This difficulty was mainly due to the that certain problems 
only with some the Propane modules. interface was finally 
and verified. 

Each of the modules implemented for the Propane interface, including the Propane
UART, Real Time Clock interface, Sigma-delta processor and Discrete Cosine
Transform processor, was individually tested. The test program on the PC host identified
each of the modules present in the Propane interface and called functions to test them.
Each module was tested and fixed to meet the specifications before the system was
declared to be working correctly.
9.2 Software Testing 
Through most of this project, software was used to test hardware and firmware
designs. The only software component that was tested with other software was the Linux
operating system.
Testing Linux was performed primarily to identify problems with the port in order
to fix them. The first program tested on the system was BusyBox, a program containing
various common Unix utility functions built into a single file for embedded systems.
The 'init' part of the program emulates the Linux Init program which is used for system
start-up. During the initial stages of getting Linux to execute user programs, the 'init'
program was modified to make various system calls to try to find the causes of problems
in the port. One major problem that was identified and solved with this program was
that the 'sys_info' system call was not 32-bit safe, and a 32-bit version, sys32_info, was
created to address the problem.
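The job of a 32-bit wrapper such as the one described above can be sketched as a conversion between two record layouts. The field list below is a simplified, hypothetical subset chosen for illustration and is not the kernel's actual structure; clamping of oversized values is likewise an assumption about one possible policy.

```python
import struct

# Hypothetical subset of fields: uptime, total RAM, free RAM, process count.
NATIVE_64 = struct.Struct("<qQQH")  # layout as the 64-bit kernel fills it in
COMPAT_32 = struct.Struct("<lLLH")  # layout a 32-bit program expects

INT32_MAX = 0x7FFFFFFF
UINT32_MAX = 0xFFFFFFFF


def sysinfo_to_32bit(raw64):
    """Repack a 64-bit record into the 32-bit layout, clamping large values."""
    uptime, totalram, freeram, procs = NATIVE_64.unpack(raw64)
    return COMPAT_32.pack(
        min(uptime, INT32_MAX),
        min(totalram, UINT32_MAX),
        min(freeram, UINT32_MAX),
        procs,
    )
```

Copying the 64-bit record straight into the caller's buffer would write 26 bytes where the 32-bit program reserved 14 in this sketch, so the wrapper must convert field by field before returning to user space.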
Various networking and audio programs were used to verify the virtual networking
link and sigma-delta sound card interface. The networking testing was a considerable
effort, caused mainly by the problems in implementing the Ethernet network emulation.
The TCP/IP Point-to-Point link caused far fewer problems. The proving test for the
networking link was to mount a filesystem from the host PC on the Linux MIPS kernel
via the NFS protocol. NFS operation was stable and reliable. A small TCP server
program was written and run on the Linux MIPS kernel to test connections initiated by
the host to a server on the Linux MIPS system. This proved successful and meant that it
would be possible to implement programs to communicate and exchange information
over TCP/IP for parallel processing.
9.3 Benchmarks 
Various benchmarks were performed on the processor and FPGA to analyse the
performance of the prototype system and compare it to other reference systems. Most of the
benchmarks on the processor were run under the Linux operating system. Due to
limitations in the tools available, only 32-bit programs could be created. This limited the
instruction set available to the compiler for optimisation. The pre-compiled libraries
used were built for stability and only used the MIPS-I instruction set, which reduced
the performance of the system.









results of various benchmarks are shown in Table 1. 
9.1 the normalised performance of the MIPS processor at 200MHz 
compared to the AMD Athlon (normalised to 200MHz). Past 1024 the
smaller on the processor makes the performance drop 
The final benchmarks performed on platform were to test the 
capabilities compared to MIPS processor and a standard 'TWo possible 
algorithms were identified for in the The function was the first 
common function identified and is used in a wide variety of and 
ing applications. The Cosine Transform (DCT) is a common function 
TheIn Image audio processing, most notably in JPEG MPEG 
core provided by the manufacture was not compatible with the particular 
that was chosen. left the function for testing. A module was developed 
using manufacture provided core. This was integrated into the Propane interface 
and a to access it was written Linux. 
The results of the 16-point DCT transform benchmark for MIPS processor, 
and 1GHz Athlon were: MIPS CPU X DCTs second,
450,000 DCTs second, Athlon - 700,000 DCTs per second. 
Note, the speed the FPGA was greatly limited by the 
requires a rC-(JCSl 
The for the network speeds were: 1.2MB second speed over the 
link. This was mainly limited by CPU overhead involved. ICMP 
reached 2.374MB/s with the program. Raw copying of data across the PCI bus
reached 32.768MB/s which is maximum of the Bus. 
."'.....""A which 
Figure 9.1: Normalised FFTW performance
Chapter 10 
Conclusions and Future Work 
This chapter presents the conclusions of the project as taken from the results of the
hardware, firmware and software design, development and implementation. The
various requirements that were specified and the manner in which these were achieved are
described. The conclusions are followed by a section describing possible future work that
can be performed on the hardware system.
10.1 Conclusions 
A hardware prototype platform with accompanying system software was developed
for research into re-configurable hardware parallel processing.
The hardware platform provides all that is needed for a stand-alone microprocessor
system with configurable logic. A 64-bit MIPS processor is provided for general
purpose processing and control, and a memory controller and SDRAM provide the
high-speed, high-capacity memory required for high speed computing with large data
sets. Sufficient FLASH memory is provided for processor booting and FPGA
configuration, and an SRAM device is provided as an FPGA configuration cache. The FPGA
provides system functions and the ability to implement application specific hardware
processing routines, together with a simple interface through which the processor can
access them. The system provides Real Time Clock, processor configuration and reset
management devices that support the hardware and software environment. LVDS
channels are provided by the FPGA for high-speed inter-node communications via a
dedicated interface.
The software developed for the system provides a base for future research in
parallel processing with the system. A boot loader and system test application was
developed to execute the Linux operating system as well as to provide a simple base
for real-time application development. The Linux operating system was ported to the
MIPS processor and provides a software environment that is familiar to people with
Unix and Linux experience. The benefits of the Linux operating system were that
standard Linux programs simply require a recompile and binary files from other Linux
MIPS systems are directly supported.
The software for the hardware platform also required the support of programs and
drivers on the host PC running the Linux operating system. These programs provide
debugging, testing and essential functions for loading and executing code on the
processor.
The PCI Bus connection to a host PC provides a very fast and simple interface 
to the project hardware and allows firmware designs to be tested directly from the 
host without having to interface them via the MIPS processor. The PCI Bus interface 
also provides for real-time access to the platform memory while the processor is in 
operation, which simplifies debugging and allows for shared memory communications. 
The MIPS processor was demonstrated to perform well despite the restrictions
under which the benchmarks were performed, namely running in 32-bit mode under
Linux using only the MIPS-I and MIPS-II instructions. The processor
achieved 75-100% of the normalised speed of an AMD Athlon 1GHz processor with
a 266MHz Front Side Bus. Given that the AMD processor has 256KB of cache and
advanced features such as branch prediction, the MIPS processor performed well and
could possibly outperform the AMD processor if its full set of features were used.
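The normalisation used in this comparison can be sketched as follows, assuming that "normalised" means scaling the reference processor's score linearly by the ratio of the clock frequencies. The scores in the usage note are placeholders, not measured results.

```python
def normalise_to_clock(score, measured_clock_hz, reference_clock_hz):
    """Scale a benchmark score linearly to a different clock frequency."""
    return score * reference_clock_hz / measured_clock_hz


def relative_performance(mips_score, athlon_score,
                         mips_clock_hz=200e6, athlon_clock_hz=1e9):
    """MIPS score as a fraction of the Athlon score normalised to the MIPS clock."""
    normalised = normalise_to_clock(athlon_score, athlon_clock_hz, mips_clock_hz)
    return mips_score / normalised
```

For example, a hypothetical Athlon score of 1000 at 1GHz normalises to 200 at 200MHz, so a MIPS score of 150 would be reported as 75%. Linear scaling ignores memory and bus effects, so the comparison is only a first-order one.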
The FPGA's DCT algorithm is pipelined and can theoretically perform over 4 million
transforms per second at 66MHz. The external Local Bus interface, however,
limited the performance, as data could not be provided and read back fast enough. The
FPGA DCT benchmark still produced reasonable results at around 450,000 transforms
per second. This was still in the order of 5.8 times faster than the most optimised
algorithm running on the MIPS processor. The FPGA benchmarks have shown that
even a simple un-optimised algorithm can have large performance advantages when
performed by the FPGA.
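The figures in this paragraph can be related to each other with a small calculation. The assumption that the pipeline accepts one sample per clock cycle is introduced here to reproduce the "over 4 million" figure for a 16-point transform and is not stated in the text; the MIPS rate is only what the quoted 5.8 times ratio implies.

```python
def pipelined_transforms_per_second(clock_hz, points, samples_per_clock=1):
    """Peak rate of a fully pipelined N-point transform core."""
    return clock_hz * samples_per_clock / points


def implied_software_rate(fpga_rate, speedup):
    """Software transform rate implied by a measured FPGA rate and speed-up."""
    return fpga_rate / speedup


if __name__ == "__main__":
    peak = pipelined_transforms_per_second(66e6, 16)
    print(f"theoretical peak:   {peak:,.0f} transforms/s")
    print(f"bus-limited result: {450_000 / peak:.1%} of peak")
    print(f"implied MIPS rate:  {implied_software_rate(450_000, 5.8):,.0f} transforms/s")
```

Under these assumptions the peak is 4,125,000 transforms per second, so the measured 450,000 is roughly a ninth of what the core could deliver, which supports the conclusion that the Local Bus interface rather than the DCT core was the limiting factor.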
10.2 Future Work 
The focus of the project was the development of a hardware platform with supporting
software for re-configurable parallel processing research. In order for the platform to
be of use for parallel processing, applications need to be developed to make use of the
hardware. A developer skilled in Linux and FPGA programming could develop an
application to use the platform as a base for experimenting with various algorithms. The
PCI bus can be used for point-to-point communications and, with more work, a shared
infrastructure could be developed. A communications link can also be implemented
over the LVDS channels to provide dedicated high-speed communications between
nodes. Experiments with the PVM software package can be performed and work can
be focused on porting it to use the high-speed custom communications channels.
Further work can also take the form of optimising the hardware platform. On
the provided hardware, better bus interface logic can be experimented with, including
DMA and VUMA support. Future work can also analyse the strong and weak points
of the current design and improve upon it. Given that the design developed during
this project was a prototype aimed mainly at implementing a demonstration platform,
future work can be focused on optimising the performance. Faster processors, faster
and wider buses and deeper memory are features that should possibly be implemented.
Future work can also look at partial reconfiguration of the FPGA. Although the
current platform was designed to permit this, none of the software tools used supported
this feature of the FPGA.




Appendix A 




[Schematic diagrams. Sheet titles legible in the drawings: Project Toplevel, CPU Interface, System Controller, Peripheral Bus Interface, Peripherals.]
Freq(1) = 259.7 MHz, Rise(1) = 1.350 ns 
Figure B.1: 4X Clock Generation 
Figure B.2: 4X Clock - dV/dt (1 GV/s per division) 
Figure B.3: LVDS Clock - FFT (266 MHz) 















The Propane interface provides eight 1 MB address windows for function implementation. 
Address bits 22..20 are therefore used for device (window) addressing. 
Window 0 is the Propane control interface. 
Windows 1..7 are designer/implementation specific. 
Address 0x00 of each window is mapped to a configuration register which 
provides device identification and designer-specifiable features. 
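The window addressing described above can be sketched in C as follows. This is only an illustration: the macro and function names are hypothetical, and addresses are taken relative to the start of the Propane region, whose base address is not given here.

```c
#include <assert.h>
#include <stdint.h>

/* Each window spans 1 MB, so the low 20 bits are the offset inside it. */
#define PROPANE_WINDOW_SHIFT 20u
#define PROPANE_WINDOW_COUNT 8u
#define PROPANE_OFFSET_MASK  ((1u << PROPANE_WINDOW_SHIFT) - 1u)

/* Window number selected by address bits 22..20. */
static inline unsigned propane_window(uint32_t addr)
{
    return (addr >> PROPANE_WINDOW_SHIFT) & (PROPANE_WINDOW_COUNT - 1u);
}

/* Byte offset within the selected 1 MB window. */
static inline uint32_t propane_offset(uint32_t addr)
{
    return addr & PROPANE_OFFSET_MASK;
}

/* Region-relative address of a register in a given window (hypothetical helper). */
static inline uint32_t propane_address(unsigned window, uint32_t offset)
{
    assert(window < PROPANE_WINDOW_COUNT);
    assert(offset <= PROPANE_OFFSET_MASK);
    return ((uint32_t)window << PROPANE_WINDOW_SHIFT) | offset;
}
```

With these helpers, the configuration register of window 3 sits at region-relative address `propane_address(3, 0x0)`, and any address can be split back into its window and offset.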






































































I - Function ID, V - Function Version, x - Design specific 
Windows 0..7 correspond to internal WishBone chip-selects 0..7. 
The interface specification for the Propane interface on device window 0 is: 
| Window Address | Description                     | Attributes | 
| 0x00000        | Configuration Register          | read-only  | 
| 0x00004        | Interrupt Status Register (raw) | read-only  | 
| 0x00008        | Interrupt Mask Register         | write-only | 
| 0x00100        | RTC Status/Control Register     | read-write | 
| 0x00104        | RTC Request Address             | read-write | 
| 0x00108        | RTC Request Data                | read-write | 
| 0x00200        | Reserved - DMA Controller       |            | 
... 
Configuration Register 0x00000 
Function ID: '0xE' or '1110' 
Function Version: 0x1 
Bits 23..8 Reserved (read as '0') 
Bits 7..1 Indicate function (7..1) presence in the system 
Bit 0 Function 0 presence - always '1' 
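As an illustration, the window 0 configuration register could be decoded along the following lines. The presence bits follow the description above. The positions of the Function ID and Function Version fields are an assumption: the text does not state them, and the sketch places them in bits 31..28 and 27..24 only because bits 23..0 are otherwise accounted for. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: Function ID in bits 31..28 and Function Version in bits 27..24.
 * The register description does not give these positions. */
#define CFG_ID_SHIFT       28u
#define CFG_VERSION_SHIFT  24u
#define CFG_NIBBLE_MASK    0xFu

/* Function ID listed for the window 0 control interface. */
#define PROPANE_CONTROL_ID 0xEu

static inline unsigned cfg_function_id(uint32_t cfg)
{
    return (cfg >> CFG_ID_SHIFT) & CFG_NIBBLE_MASK;
}

static inline unsigned cfg_function_version(uint32_t cfg)
{
    return (cfg >> CFG_VERSION_SHIFT) & CFG_NIBBLE_MASK;
}

/* Bits 7..0 flag the presence of functions 7..0 (bit 0 always reads '1'). */
static inline bool cfg_function_present(uint32_t cfg, unsigned function)
{
    assert(function < 8u);
    return ((cfg >> function) & 1u) != 0u;
}
```

A driver could read offset 0x00000 of window 0, compare `cfg_function_id()` against `PROPANE_CONTROL_ID`, and then probe only the windows whose presence bit is set.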

Interrupt Status Register 0x00004 
Bit 10 NMI Summary (Masked) 
Bit 9 Interrupt Summary (Masked) 
Bit 8 Interrupt Summary (Raw) 
Bits 7..0 Interrupt status of each functional unit 

Interrupt Mask Register 0x00008 
Bits 23..16 Logic 'AND' mask to generate NMI 
Bits 7..0 Logic 'AND' mask to generate Interrupt 
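One plausible reading of how the status and mask registers interact is modelled below. The AND with the mask fields comes from the register description; treating each summary bit as the OR of the (masked) unit bits is an assumption of this sketch, as are all the names.

```c
#include <assert.h>
#include <stdint.h>

#define IRQ_UNIT_MASK      0x000000FFu  /* status bits 7..0, one per functional unit */
#define IRQ_SUMMARY_RAW    (1u << 8)
#define IRQ_SUMMARY_MASKED (1u << 9)
#define NMI_SUMMARY_MASKED (1u << 10)
#define MASK_NMI_SHIFT     16u          /* NMI mask lives in mask bits 23..16 */

/* Software model of the Interrupt Status Register value for a given set of
 * pending units and a given Interrupt Mask Register value.
 * ASSUMPTION: a summary bit is set when any (masked) unit bit is set. */
static inline uint32_t irq_status_model(uint32_t units, uint32_t mask_reg)
{
    assert((units & ~IRQ_UNIT_MASK) == 0u);

    uint32_t irq_mask = mask_reg & IRQ_UNIT_MASK;
    uint32_t nmi_mask = (mask_reg >> MASK_NMI_SHIFT) & IRQ_UNIT_MASK;
    uint32_t status = units;

    if (units != 0u)
        status |= IRQ_SUMMARY_RAW;
    if ((units & irq_mask) != 0u)
        status |= IRQ_SUMMARY_MASKED;
    if ((units & nmi_mask) != 0u)
        status |= NMI_SUMMARY_MASKED;
    return status;
}
```

Under this model, a pending interrupt from unit 2 with only mask bit 2 set raises both the raw and the masked interrupt summaries, while the NMI summary stays clear.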

RTC Status/Control Register 0x00100 
Bit 9 Alarm 2 Status 
Bit 8 Alarm 1 Status 
Bit 7 Alarm 2 Enable 
Bit 6 Alarm 1 Enable 
Bit 5 Enable alarm polling 
Bit 4 Send 'RTC Write' request (write-only) 
Bit 3 Send 'RTC Read' request (write-only) 
Bit 2 I2C Data valid 
Bit 1 I2C Interface busy 
Bit 0 I2C User request busy 
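The bit assignments of the RTC Status/Control Register translate directly into constants. The helpers after them sketch one way a driver might use the bits. The request sequence they imply (wait until idle, write the control word with the read request bit set, then wait for data valid) is an assumption rather than something specified above, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the RTC Status/Control Register description. */
#define RTC_I2C_REQUEST_BUSY (1u << 0)
#define RTC_I2C_BUSY         (1u << 1)
#define RTC_I2C_DATA_VALID   (1u << 2)
#define RTC_SEND_READ        (1u << 3)  /* write-only */
#define RTC_SEND_WRITE       (1u << 4)  /* write-only */
#define RTC_ALARM_POLLING    (1u << 5)
#define RTC_ALARM1_ENABLE    (1u << 6)
#define RTC_ALARM2_ENABLE    (1u << 7)
#define RTC_ALARM1_STATUS    (1u << 8)
#define RTC_ALARM2_STATUS    (1u << 9)

#define RTC_ENABLE_BITS (RTC_ALARM_POLLING | RTC_ALARM1_ENABLE | RTC_ALARM2_ENABLE)

/* True when neither the I2C interface nor a user request is busy. */
static inline bool rtc_can_issue_request(uint32_t status)
{
    return (status & (RTC_I2C_BUSY | RTC_I2C_REQUEST_BUSY)) == 0u;
}

/* True when a previously issued read has completed and its data is valid. */
static inline bool rtc_read_complete(uint32_t status)
{
    return (status & RTC_I2C_DATA_VALID) != 0u
        && (status & RTC_I2C_REQUEST_BUSY) == 0u;
}

/* Control word that requests an RTC read while preserving the enable bits.
 * ASSUMPTION: requests are only issued while the interface is idle. */
static inline uint32_t rtc_read_command(uint32_t status)
{
    assert(rtc_can_issue_request(status));
    return (status & RTC_ENABLE_BITS) | RTC_SEND_READ;
}
```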
Various other Propane functional unit interfaces were specified: 
Propane UART Interface: 
| Window Address | Description                 | Attributes                       | 
| 0x00000        | Configuration Register      | read-write                       | 
| 0x00004        | Transmit / Receive Register | read (receive), write (transmit) | 
| 0x00008        | Status Register             | read-write                       | 
Configuration Register 0x00000 
Function ID: '0xA' or '1010' 
Function Version: 0x1 
Bit 21 UART Enable 
Bits 19..8 Baud rate (bits 11..0) 
Bit 7 interrupt enable 
Bit 6 Transmitter empty interrupt enable 
Bits 5..4 FIFO interrupt trigger 
Bit 3 bits - '0' - 8-bits, '1' 
Bit 2 bits - '0' - one 
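A configuration word for the UART could be assembled from the fields listed above roughly as follows. Only the fields whose meaning is legible are used: the enable bit, the 12-bit rate field, the FIFO trigger field and the transmitter-empty interrupt bit, which this sketch treats as an enable. The encoding of the rate and trigger values is not specified, so they are passed through unchanged. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define UART_CFG_ENABLE        (1u << 21)
#define UART_CFG_RATE_SHIFT    8u       /* rate field occupies bits 19..8 */
#define UART_CFG_RATE_MASK     0xFFFu
#define UART_CFG_TX_EMPTY_IRQ  (1u << 6)
#define UART_CFG_TRIGGER_SHIFT 4u       /* FIFO interrupt trigger, bits 5..4 */
#define UART_CFG_TRIGGER_MASK  0x3u

/* Build a configuration word with the UART enabled.
 * 'rate' and 'trigger' are raw field values; their encoding is not specified. */
static inline uint32_t uart_cfg_pack(uint32_t rate, uint32_t trigger, int tx_empty_irq)
{
    assert(rate <= UART_CFG_RATE_MASK);
    assert(trigger <= UART_CFG_TRIGGER_MASK);

    uint32_t cfg = UART_CFG_ENABLE
                 | (rate << UART_CFG_RATE_SHIFT)
                 | (trigger << UART_CFG_TRIGGER_SHIFT);
    if (tx_empty_irq)
        cfg |= UART_CFG_TX_EMPTY_IRQ;
    return cfg;
}

/* Extract the raw 12-bit rate field from a configuration word. */
static inline uint32_t uart_cfg_rate(uint32_t cfg)
{
    return (cfg >> UART_CFG_RATE_SHIFT) & UART_CFG_RATE_MASK;
}
```

Range checks are done with `assert` so that an out-of-range field value cannot silently spill into a neighbouring field.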






Transmit / Receive Register 0x00004 

Bit 8 Transmitter empty 
Receive FIFO contents count 
Bits 3..0 Transmit FIFO contents count 
LED Display Interface: 
Configuration Register 0x00000 
Function ID: '0xB' or '1011' 
Function Version: 0x1 
Bits 23..16 LED 7..0 available status 





Configuration Register 0x00000 

Bit 17 FIFO Full 
Bit 16 FIFO Empty 
Bits 15..0 FIFO contents count 
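The FIFO status fields above could be unpacked as in the sketch below. The functional unit this register belongs to is not identified in the surviving text, so the names are generic and hypothetical, and the FIFO depth is a caller-supplied assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_FULL       (1u << 17)
#define FIFO_EMPTY      (1u << 16)
#define FIFO_COUNT_MASK 0xFFFFu   /* contents count, bits 15..0 */

typedef struct {
    bool full;
    bool empty;
    unsigned count;
} fifo_state;

/* Split a raw register value into its three FIFO status fields. */
static inline fifo_state fifo_decode(uint32_t reg)
{
    fifo_state s;
    s.full = (reg & FIFO_FULL) != 0u;
    s.empty = (reg & FIFO_EMPTY) != 0u;
    s.count = reg & FIFO_COUNT_MASK;
    return s;
}

/* Free entries for a FIFO of the given depth (depth is an assumed parameter).
 * Returns 0 if the reported count already reaches or exceeds the depth. */
static inline unsigned fifo_free_slots(uint32_t reg, unsigned depth)
{
    assert(depth <= FIFO_COUNT_MASK);
    unsigned count = reg & FIFO_COUNT_MASK;
    return count >= depth ? 0u : depth - count;
}
```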
Propane DSP Interface: 
| Window Address | Description            | Attributes | 
| 0x00000        | Configuration Register | read-write | 
Configuration Register 0x00000 
Function ID: '0xF' or '1111' 
Function Version: 0x0 - (DCT) 
Appendix D 
Source Code and Datasheets 
Due to the size of the associated source code and datasheets, this information has 
been omitted from this dissertation. 
Copies of the full source code, development tools and datasheets are provided in CD-ROM 
format. 


