Solving graph coloring and SAT problems using field programmable gate arrays. by Chung, Chu-Keung. & Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Solving Graph Coloring and SAT Problems 
using Field Programmable Gate Arrays 
Chu-Keung CHUNG 
A Thesis Submitted in Partial Fulfilment 
of the Requirements for the Degree of 
Master of Philosophy 
in 
Computer Science & Engineering 
Supervised by: 
Prof. Philip LEONG 
© The Chinese University of Hong Kong
July 1999 
The Chinese University of Hong Kong holds the copyright of this thesis. Any 
person(s) intending to use a part or whole of the materials in the thesis in a pro-
posed publication must seek copyright release from the Dean of the Graduate 
School. 





















Solving Graph Coloring and SAT Problems 
using Field Programmable Gate Arrays 
submitted by 
Chu-Keung CHUNG 
for the degree of Master of Philosophy 
at the Chinese University of Hong Kong 
Abstract 
To solve a Constraint Satisfaction Problem (CSP) means finding appropriate values 
for its set of variables such that all of the specified constraints are satisfied. Tradi-
tionally, CSPs have been solved using general-purpose computers or supercomputers. 
Since the complexity of most CSPs is exponential, solving them requires a large amount
of computational power and time. In order to speed up the solution of CSPs,
particularly for small to medium sized problems, an FPGA hardware based method is
proposed.
Four different hardware based solving machines were developed to solve graph col-
oring and boolean satisfiability in this thesis. The first three architectures employed 
a forward checking tree search algorithm and the fourth one used an incomplete algo-
rithm called GSAT. The first approach was a fully parallel design for graph coloring 
problems. Although fast, hardware consumption was large. The second approach, 
also for graph coloring, traded off speed and parallelism for reduced hardware require-
ments. The third architecture was for solving boolean satisfiability (SAT) problems. 
The design could be configured at runtime, avoiding the need for resynthesis. The last 
approach made use of a runtime reconfigurable FPGA based clause evaluator for SAT 
problems in which a customized bitstream was directly generated from the problem 
specification, again avoiding the need for resynthesis. 
All of the designs were tested on FPGA hardware and showed between one and 
two orders of magnitude improvement in execution time over software approaches. 
The runtime configurable version showed 3 orders of magnitude improvement in re-
configuration time over the standard approach which required resynthesis, placement 
and routing for new constraints. It is envisaged that such machines could be used in 
hardware based real time constraint solving systems. 
Acknowledgments 
I would firstly like to thank my supervisor, Prof. Philip Leong, for the useful weekly 
discussions. His insight gives me many ideas, inspiration, and guidance for my research. 
I would also like to thank him for reviewing my manuscript, which is a time-consuming job.
I would also like to thank all my colleagues in Room 1026, Ho Sing Hang Engi-
neering Building of CUHK, in particular, Peter, Bobo, Small Keung, Oldfield, Fei, 
Thomas, Philip and Polly for the daily gatherings at lunch, useful discussions, support 
and encouragement. 
I would also like to thank H.Y. Wong and W.S. Yuen for their help with the implementation
of GSAT.
I would also like to thank my best friend, Winnie Chui, who has shared all my happiness
and unhappiness, especially in the past 2 years.
Finally, I thank my father and mother who supported me and gave me the chance 





1 Introduction
1.1 Motivation and Aims
1.2 Contributions
1.3 Structure of the Thesis
2 Literature Review
2.1 Introduction
2.2 Complete Algorithms
2.2.1 Parallel Checking
2.2.2 Mom's
2.2.3 Davis-Putnam
2.2.4 Nonchronological Backtracking
2.2.5 Iterative Logic Array (ILA)
2.3 Incomplete Algorithms
2.3.1 GENET
2.3.2 GSAT
2.4 Summary
3 Algorithms
3.1 Introduction
3.2 Tree Search Techniques
3.2.1 Depth First Search
3.2.2 Forward Checking
3.2.3 Davis-Putnam
3.2.4 GRASP
3.3 Incomplete Algorithms
3.3.1 GENET
3.3.2 GSAT Algorithm
3.4 Summary
4 Field Programmable Gate Arrays
4.1 Introduction
4.2 FPGA
4.2.1 Xilinx 4000 series FPGAs
4.2.2 Bitstream
4.3 Giga Operations Reconfigurable Computing Platform
4.4 Annapolis Wildforce PCI board
4.5 Summary
5 Implementation
5.1 Parallel Graph Coloring Machine
5.1.1 System Architecture
5.1.2 Evaluator
5.1.3 Finite State Machine (FSM)
5.1.4 Memory
5.1.5 Hardware Resources
5.2 Serial Graph Coloring Machine
5.2.1 System Architecture
5.2.2 Input Memory
5.2.3 Solution Store
5.2.4 Constraint Memory
5.2.5 Evaluator
5.2.6 Input Mapper
5.2.7 Output Memory
5.2.8 Backtrack Checker
5.2.9 Word Generator
5.2.10 State Machine
5.2.11 Hardware Resources
5.3 Serial Boolean Satisfiability Solver
5.3.1 System Architecture
5.3.2 Solutions
5.3.3 Solution Generator
5.3.4 Evaluator
5.3.5 AND/OR
5.3.6 State Machine
5.3.7 Hardware Resources
5.4 GSAT Solver
5.4.1 System Architecture
5.4.2 Variable Memory
5.4.3 Flip-Bit Vector
5.4.4 Clause Evaluator
5.4.5 Adder
5.4.6 Random Bit Generator
5.4.7 Comparator
5.4.8 Sum Register
5.5 Summary
6 Results
6.1 Introduction
6.2 Parallel Graph Coloring Machine
6.3 Serial Graph Coloring Machine
6.4 Serial SAT Solver
6.5 GSAT Solver
6.6 Summary
7 Conclusion
7.1 Future Work
A Software Implementation of Graph Coloring in CHIP
B Density Improvements Using Xilinx RAM




List of Tables 
4.1 XC4000XL and XC4000XV series FPGA data
5.1 Estimated resources for the parallel graph coloring machine
5.2 Estimated resources for the serial graph coloring machine
5.3 Number of CLBs used for several graph coloring problems
5.4 Estimated hardware resources for the serial SAT Solver
5.5 Number of CLBs used for several DIMACS SAT problems
6.1 Summary of results obtained for the four architectures
B.1 Number of CLBs used to store the input assignment in parallel and serial machines for several graph coloring problems
List of Figures 
3.1 An example of depth-first search
3.2 The 5 node (z0 to z4) and 3 color ({0, 1, 2}) graph coloring problem and its corresponding GENET network (with an initial assignment)
4.1 FPGA structure
4.2 Simplified block diagram of a configurable logic block
4.3 A wired-AND function implemented by open-drain buffers
4.4 A multiplexer implemented by 3-state buffers
4.5 The block diagram of Input/Output Block
4.6 The block diagram of Programmable Switch Matrix
4.7 The high-level routing diagram of XC4000 Series CLB (shaded arrows indicate XC4000X only)
4.8 The picture of the GigaOps Reconfigurable Card
4.9 The block diagram of the G900 Reconfigurable Interface Card
4.10 The block diagram of the XMOD computing module
4.11 The picture of the XMOD computing module
4.12 The block diagram of the Wildforce board
5.1 The 4-node graph
5.2 The tree representation of the previous graph
5.3 Block diagram of the Parallel Graph Coloring Machine
5.4 Connection table for 4-node graph coloring problem
5.5 Assignment and Constraint Tables for 4 nodes and 3 colors
5.6 Gate level diagram of evaluator
5.7 Symbol table of the evaluator
5.8 Block diagram of the Serial Graph Coloring Machine
5.9 Contents of the Input Memory
5.10 Contents of the Constraint Memory
5.11 Gate level diagram of a 1-bit evaluator
5.12 Block diagram of Input Mapper
5.13 Contents of the Output Memory
5.14 Gate level diagram of the backtrack checker
5.15 State diagram of the constraint writing
5.16 Timing diagram for the hand-shaking
5.17 Timing diagram for consecutive memory read accesses
5.18 State machine for problem solving
5.19 State diagram of the solution write-back
5.20 The tree representation of a 4-variable SAT problem
5.21 Block diagram of the search machine
5.22 Block diagram of the Solutions module
5.23 Gate level diagram of a 1-bit evaluator
5.24 State diagram of the evaluation
5.25 State diagram of the solution write-back
5.26 Block diagram of the GSAT Solver
5.27 Block diagram of the Flip-Bit Vector
5.28 Block diagram of the Clause Evaluator
5.29 Layout of the Clause Evaluator template
5.30 Layout of the Clause Evaluator after placement
5.31 Layout of the Clause Evaluator after placement and routing
5.32 F function generator RAM configuration




Chapter 1 Introduction
1.1 Motivation and Aims
A Constraint Satisfaction Problem (CSP) can be defined as a triple $(Z, D, C)$, where
$Z$ is a finite set of variables $\{x_1, x_2, \ldots, x_n\}$ and $D$ is a function that maps every
variable $x_i$ in $Z$ to a set of objects $D_{x_i}$ of arbitrary type. These objects are the possible
values of $x_i$, and the set $D_{x_i}$ is the domain of $x_i$. $C$ is a finite set of constraints. Each
constraint in $C$ is a restriction on the values the variables can take. To solve a CSP is to
find appropriate values for its variables such that all the constraints are satisfied. As an
example, the map coloring problem (a special case of the graph coloring problem) with 3
colors and N countries involves finding a color for each of the N countries on a map that
does not violate the constraint that no two adjacent countries can have the same color. The
variables are the countries and the colors form their domain.
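As an illustration of the (Z, D, C) definition, the following minimal Python sketch encodes a small map coloring problem and solves it by exhaustive enumeration. The four-country map and the names Z, D, C, satisfies and solve are hypothetical choices made for this example and do not come from the thesis.

```python
from itertools import product

# Variables Z: four hypothetical countries.
Z = ["A", "B", "C", "D"]
# Domain function D: every country may take one of 3 colors.
D = {z: [0, 1, 2] for z in Z}
# Constraints C: pairs of adjacent countries that must differ in color.
C = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def satisfies(assignment):
    """Return True if no two adjacent countries share a color."""
    return all(assignment[u] != assignment[v] for u, v in C)

def solve():
    """Enumerate every assignment and return the first one satisfying C."""
    for values in product(*(D[z] for z in Z)):
        assignment = dict(zip(Z, values))
        if satisfies(assignment):
            return assignment
    return None

solution = solve()
```

Exhaustive enumeration visits up to 3^N assignments for N countries, which is the exponential growth that motivates the hardware approaches discussed in this thesis.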
Many real-life problems such as scheduling, graph coloring and scene labeling can be
formulated as constraint satisfaction problems. Most of these are NP-hard problems,
and algorithms to solve them efficiently have been an active field of research (e.g.
[14, 16, 21, 20, 37, 41]).
CSP solving systems require large amounts of computation to find a solution. One
method of improving the execution speed of a CSP solving system is to use
custom hardware. The two most common technologies are VLSI (Very Large
Scale Integrated circuit) and FPGA (Field Programmable Gate Array).
VLSI has the advantages that the circuit for the solving machine can be tailor-made
and the available area inside a chip is large. Furthermore, a higher clock speed can
be achieved because the design is fully customized. Unfortunately, the turnaround
time is long because fabrication normally takes at least two months, and the
development cost is high for small quantities. These factors limit the practicality
of using VLSI to implement a CSP solving machine.
FPGAs have several advantages over VLSI. The turnaround time to configure a
problem (download a new bitstream configuration to the FPGA) is only several seconds.
This feature leads to much shorter design times compared with VLSI. The cost
of a single FPGA is relatively low. Moreover, the density of the chips has improved
dramatically over recent years and hundreds of thousands of gates are available on a
single FPGA, making it feasible to solve large CSPs using this technology. Another
advantage of FPGAs is that they are reconfigurable and reusable. Different problems can
be solved on the same FPGA by downloading different bitstream configurations, and
circuits can be customized to solve a specific instance of a problem. This is normally
not possible for custom VLSI due to the long turnaround times and high development
and manufacturing costs involved.
The aim of this thesis was to develop graph coloring and boolean satisfiability
solving systems which exploit the benefits of FPGA technology. Due to limitations
in the size of the FPGAs and the limited complexity of algorithms that are amenable to
hardware design, traditional workstations are better for very large problems. However,
FPGA implementation offers significant speedups over workstations for a range of small
to medium sized problems, which were addressed in this thesis. This
technology may be useful in real-time CSP applications such as those listed below.
• Axelsson [9] proposed a heuristic method to select a suitable system architecture
automatically for implementing real-time applications, in particular, genetic
algorithm, simulated annealing and tabu search.
• Different automatic target recognition tasks require different numbers of operations
to execute. David et al. [12] proposed using a scalable architecture to
meet the processing requirements and real-time constraints.
• Athanas et al. [27] explored the utility of custom computing machines for
accelerating the development, testing, and prototyping of a diverse set of image
processing applications.
• Mooney et al. [28] developed a tool that automatically generated a run-time
scheduler for a target architecture.
• Mostert [29] designed dynamic reconfigurable distributed hardware and software
systems, supporting permanently available computer resources and hard real-time
constraints.
• Dave et al. [32] developed the first co-synthesis algorithm which provides
simultaneous support of periodic and aperiodic task graphs with hard real-time
constraints.
1.2 Contributions 
This thesis presents a set of 4 different architectures for solving graph coloring and
boolean satisfiability problems. The first architecture is the most parallel one and
was developed for solving graph coloring problems. The main advantage of this
architecture is that the evaluation of constraints requires only one clock cycle. The hardware
consumption is high because storage is implemented with D-type flip-flops, each of which
occupies 1/2 of a CLB. Only small problems can be tackled by this design.
To tackle larger problems, one approach is to reduce the hardware
consumption by using Xilinx internal memory for storage instead of D-type flip-flops and
to use more clock cycles per evaluation of constraints. Based on this approach, the second
architecture was developed for solving larger graph coloring problems. The solving
machine requires more cycles per evaluation but consumes fewer hardware resources. Since
the solving machine is rather simple, a much higher clock speed can be achieved. This
partially compensates for the reduced parallelism, and
a performance gain is still observable.
The third architecture is a runtime configurable solver for boolean
satisfiability (SAT) problems using Xilinx internal memory. It is the first such system
reported. The significance of this architecture is that all the memories are runtime
configurable, so that different constraints of a problem can be written into the
memories within several seconds. It is particularly useful when a problem of fixed size
must be solved repeatedly with different constraints. Traditionally, a complete iteration
of synthesis, placement and routing is required for each new set of constraints (it can take
several hours for a large design), which limits the benefits of the hardware approach.
The fourth architecture was developed for solving SAT problems using a heuristic
search algorithm called GSAT [36]. It is the first reported method of directly generating
a bitstream configuration from the problem specification for Xilinx 4000 series FPGA
devices. No resynthesis is required, and placement and routing are done manually using
predefined locations and styles. The locations of the memory contents are known,
so they can be configured by direct modification of the original bitstream.
This approach can greatly reduce the development time, especially for solving large
problems.
Most of the designs were tested on FPGA hardware and showed between one and two
orders of magnitude improvement in execution time over software approaches. The
runtime configurable version showed up to 3 orders of magnitude improvement in
reconfiguration time over the standard approach which requires resynthesis, placement
and routing for new constraints.
1.3 Structure of the Thesis 
In Chapter 2, a review of prior research on solving CSPs using FPGAs is presented in
two different sections, complete and incomplete algorithms. Chapter 3 describes several
incomplete and complete algorithms. The complete algorithms include backtracking,
forward checking, the Davis-Putnam algorithm and GRASP. The two incomplete
methods are called GENET and GSAT. They each have their own advantages and
disadvantages which will be discussed in detail in the following chapters.
In Chapter 4, an introduction to field programmable gate arrays (FPGAs), and
in particular the Xilinx 4000 series FPGAs, is presented. The architecture of an FPGA
and its advantages are described. Two different development platforms, namely the
GigaOps Reconfigurable Interface card and Annapolis Micro Systems Wildforce board,
are discussed in detail.
The 4 different architectures of graph coloring and boolean satisfiability solving
systems using FPGAs are presented in Chapter 5. The first one is the parallel solving
machine which was used to solve graph coloring problems. It is the fastest approach
and gives the maximum performance gain, but its hardware consumption is large. The
second one is the serial solving machine which is intended to solve some larger graph
coloring problems. The machine sacrifices parallelism but the consumption of
hardware resources is greatly reduced. The third one is the boolean satisfiability (SAT)
solver. The significance of the approach is that Xilinx internal RAM was used, which
yields a 16-fold improvement in circuit density. The memory can be configured at
runtime so that no time-consuming development steps (synthesis, placement and
routing) are required. The last one is the GSAT solver. Xilinx internal RAM was
used again and the contents of the memory can be configured by direct modification
of the bitstream. It is the first reported approach that can generate a bitstream directly
from the problem specification for Xilinx 4000 series FPGAs.
Results from the different hardware implementations of graph coloring and boolean 
satisfiability solving systems are presented in Chapter 6. Conclusions and future works 




Chapter 2 Literature Review
2.1 Introduction
There has been considerable recent interest in the application of field programmable
gate array devices (FPGAs) as accelerators for solving constraint satisfaction problems 
(CSPs) and, in particular, the boolean satisfiability (SAT) problem. In this chapter, 
a review of prior research on solving CSPs using FPGAs is presented in two different 
sections, complete and incomplete algorithms. 
In particular, the boolean satisfiability (SAT) problem is discussed. A conjunctive
normal form (CNF) formula on $m$ binary variables $x_1, x_2, \ldots, x_m$ is the conjunction
(AND) of $n$ clauses $C_1, \ldots, C_n$, each of which is the disjunction (OR) of one or more
literals, where a literal is the occurrence of a variable or its complement. Such a formula
denotes an $m$-variable Boolean function $f(x_1, \ldots, x_m)$. The boolean satisfiability
problem (SAT) is concerned with finding an assignment of binary values to variables
so that $f(x_1, \ldots, x_m) = 1$, or proving that no solution exists. The SAT problem appears
in many fields, such as automatic test pattern generation [17], timing analysis [10] and
logic verification [26].
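The CNF definition above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: clauses are written as lists of signed integers (k for the variable x_k, -k for its complement, as in the DIMACS file format), and the function names evaluate and brute_force_sat are hypothetical.

```python
from itertools import product

def evaluate(cnf, assignment):
    """Evaluate f: the AND over all clauses of the OR over their literals.

    A literal k > 0 stands for x_k; k < 0 stands for the complement of x_|k|.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def brute_force_sat(cnf, m):
    """Try all 2^m assignments; return a satisfying one, or None if none exists."""
    for bits in product([False, True], repeat=m):
        assignment = {i + 1: bits[i] for i in range(m)}
        if evaluate(cnf, assignment):
            return assignment
    return None

# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
model = brute_force_sat(cnf, 3)
# x1 AND NOT x1 has no satisfying assignment.
unsat = brute_force_sat([[1], [-1]], 1)
```

The exhaustive loop is a baseline only; the algorithms reviewed in this chapter differ in how they avoid visiting all 2^m assignments.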
2.2 Complete Algorithms 
A complete algorithm can determine whether or not a solution exists and can find all
possible solutions of a CSP. Complete algorithms employ a tree search (as will be
described in Chapter 3), and they differ in how efficiently
they can prune the tree being searched. Complete algorithms are typically applied to
relatively small CSPs due to the large search spaces involved. This makes them excellent
candidates for FPGA implementation and, in fact, most FPGA implementations of
CSP solving machines have employed complete algorithms.
2.2.1 Parallel Checking 
M. Yokoo, T. Suyama and H. Sawada [44] were the first to propose using FPGAs 
for solving satisfiability problems. They developed an algorithm called the parallel-
checking algorithm. 
Instead of determining variable values sequentially, all variable values are
determined simultaneously, and all constraints are checked concurrently. Moreover, the
backtrack position of each variable is calculated in parallel to record which previously
assigned variable is to be evaluated again when backtracking is executed. A unit
value is calculated for each variable in every evaluation. This unit value is used
to determine the variable p with the next lowest index in the same clause that
constrains the current variable. This means that unless the value of p is changed, the
current variable has only one possible value. Multiple variable values can be changed
simultaneously when some constraints are not satisfied.
In order to prune the search space, this algorithm introduces a technique similar 
to forward checking. Simulation results show that the order of the search tree size in 
this algorithm is approximately the same as that in the Davis-Putnam algorithm [13]
(described in Section 3.2.3). 
Their implementation could solve hard random 3-SAT problems with 300 variables
at clock rates of about 1 MHz, checking one million states per second. No hardware
results were presented. Their implementation included three functional units: a rule
checker, a next state generator and a next unit generator. In the rule checker, the
constraints and the backtrack position of each node were calculated from the current state
in parallel. Only one clock cycle was required to do the evaluation and calculation.
The next state generator was used to determine the next variable to be evaluated based
on the result from the rule checker. The next unit generator calculated the unit values
in the next state, using the outputs from the rule checker.
2.2.2 Mom's
T. Suyama, M. Yokoo and H. Sawada [39] developed an algorithm which is equivalent
to the Davis-Putnam procedure with a powerful dynamic variable ordering heuristic
called the Maximum Occurrences in clauses of Minimum Size (Mom's) heuristic, which
selects the variable p having the maximum number of occurrences in clauses of two variables.
The algorithm does not have a large memory structure like a stack; thus sequential accesses
to memory do not become a bottleneck in its execution. Instead, a register for each variable
records the depth of the search tree at which the variable value was
determined. This information is used for backtracking. When a unit clause (a clause
with only one literal) contains the literal p and another unit clause contains the
complementary literal of p, or when a clause is not satisfied, a contradiction exists because
different values would have to be assigned to variable p. Backtracking is then executed and
execution jumps back to the previously assigned variable indicated by the register.
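The variable selection rule can be sketched in a few lines of Python. This is a software illustration, not the circuit of [39]: it counts occurrences in the shortest clauses present (which are the two-variable clauses in a 3-SAT instance without unit clauses), and the tie-break on the lowest variable index and the name moms_variable are assumptions made here.

```python
from collections import Counter

def moms_variable(clauses):
    """Pick the variable occurring most often in the clauses of minimum size.

    Clauses are lists of signed integers; ties go to the lowest index
    (an assumption of this sketch).
    """
    min_size = min(len(c) for c in clauses)
    counts = Counter(
        abs(lit) for c in clauses if len(c) == min_size for lit in c
    )
    return min(counts, key=lambda v: (-counts[v], v))

# Three clauses of two variables and one of three variables.
clauses = [[1, 2], [-1, 3], [2, 3, 4], [1, -4]]
chosen = moms_variable(clauses)
```

In this example variable 1 appears in all three two-variable clauses, so it is selected.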
An FPGA hardware system called the ZyCAD RP2000 [11] was used for the hardware
implementation. The system contains 16 FPGAs where each chip is a Xilinx XC4025
with 15,000 equivalent gates. A 3-SAT problem instance with 30 variables and 80
clauses, which was created by a problem generator contributed to the DIMACS
Challenge by Oliver Dubois [15], was actually implemented. The simulation result shows
that the maximum clock rate is 9.24 MHz.
2.2.3 Davis-Putnam 
Zhong, Martonosi, Ashar and Malik [46] implemented the Davis-Putnam algorithm
(refer to Section 3.2.3). Their implementation consisted of two parts: the implication
circuit and the state machine that manages the backtracking-based exploration of the
search space. The role of the implication circuit is to determine the implied value of
each literal and to detect whether a contradiction exists. Each implication circuit is
controlled by a local state machine. This architecture is treated as a building
block, and each building block represents a node. To solve a larger problem,
extra building blocks are simply added at the end of the bus.
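The role of the implication circuit can be illustrated by a software analogue. The Python sketch below performs unit implication sequentially, whereas the hardware of [46] evaluates its implication circuits in parallel; the clause encoding (signed integers) and the function name propagate are assumptions of this example.

```python
def propagate(clauses, assignment):
    """Repeatedly apply unit implications to a partial assignment.

    `assignment` maps variable index -> bool. Returns the extended
    assignment and a flag that is True when a contradiction was found.
    """
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                # Every literal is false: contradiction.
                return assignment, True
            if len(unassigned) == 1:
                # Only one literal left: its value is implied.
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment, False

implied, conflict = propagate([[-1, 2], [-2, 3]], {1: True})
_, conflict2 = propagate([[-1, 2], [-1, -2]], {1: True})
```

In the first call, setting x1 to true implies x2 and then x3; in the second, the two clauses force x2 to both values, so a contradiction is reported and a backtrack would follow.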
A Digital Pamette board [38] and an IKOS VirtualLogic Emulator [40] were used
to implement the algorithm on configurable hardware. The Pamette board has a
PCI interface and four Xilinx XC4010E FPGAs. The IKOS Emulator consists of one
system control board and 1 to 6 FPGA array boards. They used only 1 FPGA board,
and each FPGA board has an array of 64 Xilinx XC4013E FPGAs. Their approach
offers speedups from 17 times to several hundred times on benchmark problems from
the DIMACS¹ SAT suite [4]. The DIMACS benchmark problem "hole10" was used to
illustrate the impact of compilation time on performance. A software implementation,
GRASP [34], took more than 8 hours of CPU time to solve this problem. The hardware
implementation required 566 seconds to complete the problem. This represents a 51
times speedup ratio. Even when the development time (synthesis, placement and routing),
which took 2904 seconds, is included, the speedup ratio is still 8.3 times.
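The two speedup ratios can be checked with a short calculation. The sketch below takes the software time as exactly 8 hours, which is an assumption because the text says only "more than 8 hours"; the variable names are illustrative.

```python
software_s = 8 * 3600   # GRASP CPU time, taken as exactly 8 hours (assumed lower bound)
hardware_s = 566        # hardware solving time in seconds
compile_s = 2904        # synthesis, placement and routing time in seconds

# Speedup when only the solving time is counted.
speedup_solve = software_s / hardware_s
# Speedup when the compilation time is added to the hardware time.
speedup_total = software_s / (hardware_s + compile_s)

print(f"{speedup_solve:.1f} {speedup_total:.1f}")
```

With these figures the ratios come to roughly 51 and 8.3, in agreement with the values quoted above.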
2.2.4 Nonchronological Backtracking 
Zhong, Ashar, Malik and Martonosi [45] produced another SAT implementation which
was similar to their previous work, except that the algorithm no longer performed
chronological backtracking, where the algorithm backtracks to the most recently
assigned variable. Instead, nonchronological backtracking was used, where the algorithm
jumps over several previously assigned variables to a variable more than one level above
the current variable. In order to jump directly to a previous level, the algorithm must
first determine that no combination of values on the skipped variables will result in a
satisfying assignment.
¹ Centre for Discrete Mathematics & Theoretical Computer Science
The only modification required to implement the nonchronological backtracking
algorithm was in the state machine. The results show that the hardware implementation
achieves a median speedup of 63.5 over the software implementation GRASP [34].
For the full DIMACS [4] suite, it offers speedups of 100 times or greater on 63 of the
problems. The median CLB requirement for the DIMACS benchmarks is 3655, and
202 of 240 problems require fewer than 20,000 CLBs.
2.2.5 Iterative Logic Array (ILA) 
Abramovici, Sousa and Saab [8] introduce a new massively-parallel fine-grain satisfier
architecture to accelerate a SAT solver implemented on reconfigurable hardware. It 
provides new forms of massive parallelism - parallel backtracing of all objectives along 
all possible paths and concurrent assignments of several variables. 
Modular design techniques using iterative logic array (ILA) structures were developed
to overcome the high computational costs of conventional FPGA physical design
tools. The aim of the ILA approach is to design several types of basic building blocks
(including their internal placement and routing), and to create a library of modules as ILA
cells to be used by any satisfier. After the library modules have been created, the complexity
of the place-and-route procedure for an ILA grows only linearly with the size of the
ILA. For inter-ILA connections, a conventional router is used. While an unstructured
chip design ends up with many unrouted nets after many CPU hours, the same circuit
takes only a few minutes to compile successfully using ILA-based design techniques.
For a SAT problem with 13 variables, 29 clauses, and 69 literals, a hardware
implementation on a Xilinx XC6264 FPGA was developed. For larger problems, a
performance evaluation was made by comparing simulations with a software implementation
run by GRASP [34]. The maximum clock frequency of the satisfier was 3.5 MHz and
GRASP was run on a workstation with a 248 MHz clock frequency. The results showed
that for 11 examples out of 20, the satisfier achieved speed-ups between 78 and 7,000,
and for 3 instances the speed-up was in the 1.5 to 2.8 range. For 6 examples, GRASP
was faster than the satisfier due to its sophisticated search features that do not have
a match in the satisfier.
2.3 Incomplete Algorithms 
An incomplete algorithm is not guaranteed to find a solution even if one or more
solutions exist. It examines only the regions of the search space that are likely
to contain solutions and does not search regions that are unlikely to contain them.
The main advantage of incomplete algorithms over complete algorithms is that they
can normally find a solution in a much shorter time. However, this speed comes at the
cost of the guarantee described above.
2.3.1 GENET
An FPGA implementation of the GENET algorithm (described in Section 3.3.1) was
proposed by Lee et al. [24]. Their GENET network was implemented by processing
elements (PEs) arranged in a ring structure, each PE representing a cluster. All the
connection weights were stored in the local memory of each PE. The values of the ON
nodes in all the clusters were propagated to every other cluster in the network in
N-1 cycles. The system either computes the inputs to every node and updates every
cluster or executes the learning phase. Their system was applied to a 125 node, 18 color
graph coloring problem from the DIMACS [4] graph suite and, assuming a
clock speed of 5 MHz, the expected speed-up was up to 127 times.
2.3.2 GSAT
Y. Hamadi and D. Merceron [18] employed an incomplete heuristic search algorithm 
called GSAT [36] (described in Section 3.3.2) to implement a reconfigurable solver for 
solving satisfiability problems. 
No hardware implementation was done at that time. The hardware implementation
was assumed to run at 60 MHz. The expected performance was compared with
the results shown in [35]. The comparison showed speedups of 70 to 300 times for
problems of 6 different sizes.
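For reference, the GSAT procedure of [36] can be sketched in software as follows. This is a minimal Python illustration of the greedy flipping idea, not the hardware design of [18]; the parameter names max_tries and max_flips, the fixed random seed and the signed-integer clause encoding are assumptions of this sketch.

```python
import random

def num_satisfied(cnf, assign):
    """Count the clauses that contain at least one true literal."""
    return sum(any(assign[abs(l)] == (l > 0) for l in c) for c in cnf)

def gsat(cnf, m, max_tries=10, max_flips=100, seed=0):
    """Greedy local search: flip the variable that satisfies the most clauses."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        # Start each try from a random truth assignment.
        assign = {v: rng.random() < 0.5 for v in range(1, m + 1)}
        for _ in range(max_flips):
            if num_satisfied(cnf, assign) == len(cnf):
                return assign
            # Score every single-variable flip.
            scores = {}
            for v in assign:
                assign[v] = not assign[v]
                scores[v] = num_satisfied(cnf, assign)
                assign[v] = not assign[v]
            best = max(scores.values())
            # Break ties between equally good flips at random.
            v = rng.choice([u for u, s in scores.items() if s == best])
            assign[v] = not assign[v]
        if num_satisfied(cnf, assign) == len(cnf):
            return assign
    return None  # incomplete: failure does not prove unsatisfiability

cnf = [[1, -2], [2, 3], [-1, -3]]
result = gsat(cnf, 3)
```

The scoring loop, which re-evaluates every clause for every candidate flip, dominates the running time in software and is the part that a hardware clause evaluator can perform in parallel.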
Wong, Yuen, Lee and Leong [43] employed the same algorithm and architecture 
to implement a runtime reconfigurable solver for satisfiability problems. The input to 
the system is a 3-SAT problem from which a software program directly generates a 
problem-specific configuration which can be directly downloaded to a Xilinx XC6216 
reconfigurable processing unit (RPU), avoiding the need for resynthesis, placement and 
routing for different constraints. 
The clause checkers, which are used to evaluate the constraints, are problem dependent
and are customized by a C program. All of the variables are routed in horizontal
lines and the logic to implement a particular clause is distributed in the vertical
direction. The software customizes the logic equation of each clause checker and writes the
new configuration into the address-mapped configuration memory of the XC6200. In
the XC6200 devices, this can be done without affecting the nonconfigurable parts of
the circuit.
The results demonstrated a four orders of magnitude improvement in the reconfiguration
time of the GSAT algorithm over the conventional approach involving resynthesis,
placement and routing. The design was tested on hardware and
achieved approximately the same performance as that of a modern workstation but at
greatly reduced hardware cost, power consumption and memory requirements.
2.4 Summary 
In this chapter, a review of earlier work using field programmable gate arrays was
presented. Two different approaches, complete and incomplete algorithms, have been
employed to solve constraint satisfaction problems.
A hardware implementation of a CSP usually generates a custom circuit for a given
CSP. This circuit is normally expressed in a hardware description language, which is
synthesized, placed and routed to produce a bitstream configuration that can be
downloaded to a FPGA. This process can often take more time than solving the CSP
itself. In the implementations reviewed, Abramovici et al. (Section 2.2.5) and
Wong et al. (Section 2.3.2) proposed methods to create the bitstream configuration
directly from the CSP specification. This shortens the compilation time since
synthesis, placement and routing are avoided. Note, however, that both designs used
Xilinx 6200 series devices. These have much lower logic densities than the 4000 series
FPGAs that will be described in Chapter 4. In Chapter 5, a technique for runtime

Chapter 3
Algorithms
3.1 Introduction
This chapter provides a description of the algorithms used in this thesis. A descrip-
tion of (complete) tree search techniques is first given including backtracking, forward 
checking, the Davis-Putnam algorithm and GRASP. This is followed by a description 
of two incomplete methods, namely GENET and GSAT. For a detailed description of 
algorithms for solving CSPs, the reader is referred to [19, 41]. 
3.2 Tree Search Techniques 
A tree is a connected graph with no cycles. A graph is a tuple (V, U) where V is a set
of nodes and U (⊆ V × V) is a set of arcs. A node can be an object of any type and
an arc is a pair of nodes. For a CSP with n variables, the search tree has n levels,
one per variable. Each node has several children, connected to it by arcs, and each
arc represents a possible assignment for the parent node. In other words, the nodes
represent the variables and the arcs represent the assignments. Each solution of the
CSP corresponds to a path from the root to a leaf node, so searching the tree finds
the solutions of the problem.
3.2.1 Depth First Search 
Depth first search is a tree search that picks one of the children to visit at every
node, and moves forward from that child until a constraint is violated or a leaf is
reached. When a constraint is violated, the algorithm backtracks to the nearest
previous node that has an unexplored alternative. If a leaf can be reached without
violating any constraints, the corresponding path represents a solution to the CSP.
Figure 3.1: An example of depth-first search. 
Figure 3.1 shows an example of depth first search. The dotted line shows the search
path from the root A to the second node D from the left. If the partial assignment
(A,B,C) = (0,0,0) violates a constraint, the search backtracks to the previous node
(node C) and takes the alternative choice. The search path can then reach the leaf
node marked X. If this does not violate any constraints, the values of the nodes
along the search path form a solution to the problem.
Backtracking is normally applied to a depth-first search algorithm. When the most
recently made selection violates the constraints, backtracking withdraws that
selection and chooses an alternative. If all the alternatives at the node have been
explored, the algorithm goes further back until an unexplored alternative is found.
If the whole search space has been explored and no solution is found, the problem
is not solvable.
The following pseudo code describes a non-recursive form of the backtracking
algorithm. The function select() selects, in a fixed order, a node with an unexplored
alternative. The function domain() selects an assignment n for that node, which is
then added to the set of temporary assignments, S. The function constraint()
determines whether the new (possibly incomplete) assignment violates any constraint.
If no violation exists, the nodes connected to the newly assigned node are
constrained; otherwise the assignment n is removed from S.
backtrack()
BEGIN
    WHILE (have space to search)
        node = select();
        n = domain(node);
        S = S ∪ {n};
        IF (constraint(S) = TRUE) THEN
            S = S - {n};
        ELSE
            update the constraints introduced by node n;
            IF (S is a solution) THEN
                return S;
            END IF
        END IF
    END WHILE
    return NULL;
END
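As an illustration, the backtracking procedure can be sketched in Python for the graph coloring case. This is a minimal sketch under stated assumptions, not the implementation used in this thesis: the function name backtrack, its arguments, the shared color domain and the representation of edges as frozensets are all choices made for the example.

```python
def backtrack(nodes, domain, edges):
    """Non-recursive chronological backtracking for graph coloring.

    nodes  : list of node names, visited in a fixed order
    domain : list of colors every node may take (assumed shared domain)
    edges  : set of frozenset({u, v}) pairs that must differ in color
    """
    assignment = {}                 # the set S of temporary assignments
    next_choice = [0] * len(nodes)  # next untried domain index per tree level
    level = 0
    while level >= 0:               # "have space to search"
        if level == len(nodes):
            return assignment       # a leaf was reached without a violation
        node = nodes[level]
        if next_choice[level] == len(domain):
            # All alternatives explored: withdraw the node and go further back.
            next_choice[level] = 0
            assignment.pop(node, None)
            level -= 1
            continue
        color = domain[next_choice[level]]
        next_choice[level] += 1
        # constraint(): does the new assignment clash with an earlier node?
        violated = any(
            assignment.get(other) == color
            for other in nodes[:level]
            if frozenset((node, other)) in edges
        )
        if not violated:
            assignment[node] = color
            level += 1
    return None                     # search space exhausted
```

For a triangle graph, three colors give a solution while two colors exhaust the search space and return None.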
3.2.2 Forward Checking
The following procedure describes the forward checking algorithm. It is almost the
same as the backtracking algorithm, except that it contains an extra criterion,
solution(), which determines whether the new assignment leaves another node with no
possible value. The remaining functions are explained in Section 3.2.1.
forward_checking()
BEGIN
    WHILE (have space to search)
        node = select();
        n = domain(node);
        S = S ∪ {n};
        IF (constraint(S) = TRUE OR solution() = FALSE) THEN
            S = S - {n};
        ELSE
            update the constraints introduced by node n;
            IF (S is a solution) THEN
                return S;
            END IF
        END IF
    END WHILE
    return NULL;
END
Forward checking is very similar to backtracking, the difference being that when 
a choice is selected, even if no constraints have yet been violated, backtracking is 
executed if no possible solution exists for another node. This method allows us to 
avoid searching subtrees which cannot have a solution and hence significantly reduces 
the search tree. 
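The pruning step can be made concrete with a short Python sketch for graph coloring. It is written recursively for brevity, whereas the procedure above is non-recursive, and the names forward_check, neighbors and live are assumptions introduced for this example only.

```python
def forward_check(nodes, domain, neighbors):
    """Depth-first search with forward checking for graph coloring.

    neighbors : dict mapping each node to the set of adjacent nodes
    Returns a dict node -> color, or None if no coloring exists.
    """
    def search(level, assignment, live):
        if level == len(nodes):
            return dict(assignment)
        node = nodes[level]
        for color in live[node]:
            # Remove this color from the domain of every unassigned neighbor.
            pruned = {
                n: [c for c in live[n] if c != color]
                for n in neighbors[node] if n not in assignment
            }
            # solution() check: some other node would have no value left.
            if any(not values for values in pruned.values()):
                continue
            assignment[node] = color
            result = search(level + 1, assignment, {**live, **pruned})
            if result is not None:
                return result
            del assignment[node]
        return None

    return search(0, {}, {n: list(domain) for n in nodes})
```

Because the remaining domains are kept consistent with all earlier choices, every color still listed for a node is already known not to clash with its assigned neighbors, so no separate constraint test is needed in this sketch.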
3.2.3 Davis-Putnam
The Davis-Putnam algorithm is a form of backtracking search that is specialized for
solving boolean satisfiability problems. The algorithm uses the following rules:
• Elimination of one-literal (unit) clauses. If variable p appears in a one-literal
clause, it is assigned the value that makes the literal true. All clauses made true
by this assignment can be eliminated, and the complementary literal can be removed
from the other clauses.
• Elimination of variables with consistent assignment. If the literals of a variable
p in all clauses are all positive (p) or all negative (¬p), then p can be assigned
the value that makes these literals true and all clauses containing them can be
deleted. A variant of this rule that can be used during search is to instantiate
variables that appear as the only literal in a unit clause.
• Generation of subproblems where p is given alternative assignments (Splitting
Rule). Two subproblems are generated by assigning the values '0' and '1'
respectively to the variable p. They are then tested in turn.
The Davis-Putnam algorithm consists of two main procedures, unit propagation
and variable splitting. Unit propagation assigns a value to the literal in every
unit clause (a clause with only one literal) such that all the unit clauses are
satisfied. After a value has been assigned to each such literal l_i, the clauses
containing l_i, including the unit clauses themselves, are deleted, and the
complementary literal ¬l_i is deleted from all clauses containing it. Variable
splitting assigns a true and a false value to a literal in turn and searches the
two resulting subtrees separately.
The following pseudo code describes the Davis-Putnam algorithm. S is the set of
clauses C_1, C_2, ..., C_n, where each C_i is a clause formed by the disjunction
of m literals. A literal l represents a variable or its negation. A unit clause L
is a clause with only one literal.
Satisfiable(clause set S)
BEGIN
    /* unit propagation */
    DO {
        FOR EACH unit clause L in S DO
            delete from S every clause containing L
            delete ¬L from every clause of S in which it occurs
        END DO
        IF S is empty THEN
            return TRUE
        ELSE IF the null clause is in S THEN
            return FALSE
        END IF
    } WHILE (further changes result)
    /* splitting */
    choose a literal l occurring in S
    IF Satisfiable(S ∪ {l}) THEN
        return TRUE
    ELSE IF Satisfiable(S ∪ {¬l}) THEN
        return TRUE
    ELSE
        return FALSE
    END IF
END
Forward checking is normally applied to depth-first search. At a particular
choice point, all the choices at that point have to be checked, even if a choice
cannot make any useful contribution. The Davis-Putnam algorithm overcomes this
problem by unit propagation. All the unit clauses can be solved concurrently by
assigning a value to the literal in each of them. The constraints are then updated
with the new assignments. This procedure can greatly reduce the search space.
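Unit propagation followed by splitting can be sketched in Python as below. The encoding is an assumption made for the example: each clause is a collection of non-zero integers and -v stands for the negation of variable v. The choice of splitting literal (the first literal of a shortest clause) is likewise only one possible heuristic.

```python
def satisfiable(clauses):
    """Davis-Putnam style test: unit propagation followed by splitting.

    clauses : iterable of clauses, each an iterable of non-zero ints,
              where -v denotes the negation of variable v (assumed encoding).
    """
    s = {frozenset(c) for c in clauses}
    # Unit propagation: repeat until no unit clause remains.
    while True:
        if not s:
            return True            # every clause has been satisfied
        if frozenset() in s:
            return False           # the null clause is in S
        unit = next((c for c in s if len(c) == 1), None)
        if unit is None:
            break
        (lit,) = unit
        # Delete clauses containing lit; delete -lit from the others.
        s = {c - {-lit} for c in s if lit not in c}
    # Splitting: add the chosen literal, then its negation, as a unit clause.
    lit = next(iter(min(s, key=len)))
    return satisfiable(s | {frozenset([lit])}) or satisfiable(s | {frozenset([-lit])})
```

Adding the literal as a unit clause, as in the pseudo code, lets the next call's unit propagation carry out the assignment.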
3.2.4 GRASP
GRASP (Generic seaRch Algorithm for the Satisfiability Problem) was proposed
by Marques and Sakallah in 1996 [34]. It represents the state of the art in
software-based complete SAT solvers. GRASP integrates a basic backtracking search
with a powerful conflict analysis procedure. By analyzing conflicts, GRASP can
backtrack non-chronologically to earlier levels in the search tree, which can
potentially prune large portions of the search tree. In addition, by "recording"
the causes of conflicts, GRASP can recognize and preempt the occurrence of similar
conflicts later on in the search.
3.3 Incomplete Algorithms 
3.3.1 GENET
Figure 3.2: The 5-node (Z0 to Z4) and 3-color ({0, 1, 2}) graph coloring problem and
its corresponding GENET network (with an initial assignment).
The GENET algorithm was proposed by C.J. Wang and E.P.K. Tsang [22, 23, 33].
Figure 3.2 shows a graph coloring problem with 5 nodes and its corresponding GENET
network (with an initial assignment). Each node of the graph is represented by a
group of neurons, called a cluster, with one neuron for each value in the node's
domain. The connections among neurons are constructed according to the constraints,
with no connection between compatible neurons. Initially, one neuron in every
cluster is ON and the weights of all the connections are initialized to -1. Each
neuron then sends its state to its neighbors. Each cluster chooses the neuron that
has the largest input (the least negative number) and sets it to ON. The remaining
neurons in the same cluster are set to OFF. This step is iterated until a solution
is found or the network settles in a local minimum. A "learning" phase is then
applied to get the network out of the minimum by adjusting the weights. Any network
state in which no two ON neurons are connected represents a solution to the problem.
The following pseudo code describes the procedure of GENET. Here a node denotes a
neuron, and the input to a node is the weighted sum of the states of all the nodes
connected to it.
GENET()
BEGIN
    select one arbitrary node for each cluster (ON);
    REPEAT /* network convergence */
        REPEAT
            modified = false;
            FOR each cluster DO IN PARALLEL
                on_node = node which is at present ON;
                node_value = node with calculated weight;
                label_set = the nodes with the maximum input of node_value;
                IF NOT (on_node in label_set) THEN
                    modified = true;
                    on_node = OFF;
                    switch an arbitrary node in label_set to ON;
                END IF
            END FOR
        UNTIL (NOT modified); /* the network has converged */
        /* learning */
        IF (sum of input to all ON nodes < 0) THEN /* in local minima */
            FOR connection c between nodes x and y DO IN PARALLEL
                IF (both x and y are ON) THEN
                    decrease the weight of connection c by 1;
                END IF
            END FOR
        END IF
    UNTIL (input to all ON nodes is 0)
    /* solution found */
END
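A sequential Python sketch of this procedure for graph coloring is given below. Several points are assumptions made for the example rather than part of the procedure above: clusters are updated one at a time instead of in parallel, a max_cycles limit bounds the outer loop, the random generator is seeded, and the function and argument names are hypothetical.

```python
import random

def genet_coloring(nodes, colors, edges, max_cycles=1000, seed=0):
    """Sequential sketch of GENET for graph coloring.

    edges : list of (u, v) pairs whose endpoints must receive different colors.
    Returns a dict node -> color, or None if max_cycles is exhausted.
    """
    rng = random.Random(seed)
    # One connection per (edge, color): it links neuron (u, c) to neuron (v, c).
    weight = {(frozenset(e), c): -1 for e in edges for c in colors}
    adjacent = {n: [] for n in nodes}
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)
    state = {n: rng.choice(colors) for n in nodes}   # ON neuron of each cluster

    def net_input(node, color):
        # Weighted sum over the connected neurons that are currently ON.
        return sum(weight[(frozenset((node, m)), color)]
                   for m in adjacent[node] if state[m] == color)

    for _ in range(max_cycles):
        modified = True
        while modified:                               # network convergence
            modified = False
            for node in nodes:
                inputs = {c: net_input(node, c) for c in colors}
                best = max(inputs.values())
                if inputs[state[node]] != best:
                    state[node] = rng.choice([c for c in colors if inputs[c] == best])
                    modified = True
        if all(net_input(n, state[n]) == 0 for n in nodes):
            return dict(state)                        # no two connected ON neurons
        for u, v in edges:                            # learning at a local minimum
            if state[u] == state[v]:
                weight[(frozenset((u, v)), state[u])] -= 1
    return None
```

Updating one cluster at a time means every switch strictly increases the input of the switching cluster, so the inner loop always terminates in this sketch.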
3.3.2 GSAT Algorithm
GSAT [36] is a simple greedy local search algorithm for solving satisfiability
problems. Although GSAT may fail to find a satisfying assignment even if one exists
(GSAT is therefore an incomplete algorithm), it works surprisingly well.
The procedure starts with a random initial assignment to every variable. Each
variable is then flipped in turn (changed from '0' to '1' or from '1' to '0') to
determine which flip produces the largest number of satisfied clauses, and that
variable is flipped. This step is repeated up to MAX-FLIPS times, where MAX-FLIPS
is a preset maximum number of flips, or until a solution is found. If no solution
is found, another random assignment is made and the procedure is run again.
MAX-TRIES is another preset value, which determines the number of times the
procedure is restarted.
For a boolean constraint equation F, the search algorithm can be described by the
following pseudo code.
GSAT(int MAX-FLIPS, MAX-TRIES)
BEGIN
    FOR (i = 1 to MAX-TRIES)
        S = an initial random variable assignment;
        FOR (j = 1 to MAX-FLIPS)
            IF (F(S) = TRUE) THEN
                return S;
            END IF;
            p = variable whose negation yields largest increase
                in number of satisfied clauses;
            S = S with flipped p;
        END FOR
    END FOR
    return NULL;
END
It has been shown that the GSAT algorithm outperforms the Davis-Putnam procedure
by an order of magnitude on hard random formulas [36]. Since the algorithm is very
simple, it is well suited to hardware implementation.
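The simplicity of GSAT is visible in a software sketch. The Python version below follows the pseudo code; the signed-integer clause encoding (-v for the negation of variable v), the seeded random generator and the random tie-breaking among equally good flips are assumptions made for this example.

```python
import random

def gsat(clauses, num_vars, max_flips, max_tries, seed=0):
    """GSAT local search. Clauses use signed ints: -v is the negation of v."""
    rng = random.Random(seed)

    def num_satisfied(assign):
        return sum(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

    for _ in range(max_tries):
        # A fresh random assignment for every try.
        assign = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        for _ in range(max_flips):
            if num_satisfied(assign) == len(clauses):
                return assign
            # Score every single-variable flip, then apply one of the best.
            scores = {}
            for v in assign:
                assign[v] = not assign[v]
                scores[v] = num_satisfied(assign)
                assign[v] = not assign[v]
            best = max(scores.values())
            p = rng.choice([v for v in scores if scores[v] == best])
            assign[p] = not assign[p]
    return None
```

The scoring loop evaluates each candidate flip independently of the others, which is the part of the algorithm that lends itself to parallel evaluation.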
3.4 Summary 
A description of tree search techniques including backtracking, forward checking,
the Davis-Putnam algorithm and GRASP was presented. Two incomplete methods, GENET
and GSAT, were also discussed for solving constraint satisfaction problems.
Chapter 4 
Field Programmable Gate Arrays 
4.1 Introduction 
In this chapter, an introduction to field programmable gate arrays (FPGAs), and
in particular the Xilinx 4000 series FPGAs, is presented. The architecture of a
FPGA and its advantages are described. The two different development platforms
used in this thesis, namely the GigaOps Reconfigurable Interface Card and the
Annapolis Micro Systems Wildforce board, are discussed in detail.
4.2 FPGA 
A FPGA is a programmable logic device in which the user can configure the logic
functions and interconnections. The FPGAs considered here are based on SRAM, so
they are reusable: the logic inside the FPGA is reconfigured whenever a different
bitstream configuration is downloaded. The development time is short because the
download process takes only several seconds. There are many manufacturers of
programmable logic devices, such as Xilinx, Actel, Altera and Motorola. In this
thesis, only Xilinx 4000 series FPGAs are considered. The current devices of the
Xilinx 4000 series have densities of up to 8,464 CLBs or 250,000 equivalent gates
[3] (refer to Table 4.1).
Device       Logic     CLB        Total    Number of     Max.        Max. Logic
             Cells     Matrix     CLBs     Flip-Flops    User I/O    Gates
XC4005XL        466    14 x 14      196         616         112         5,000
XC4010XL        950    20 x 20      400       1,120         160        10,000
XC4013XL      1,368    24 x 24      576       1,536         192        13,000
XC4020XL      1,862    28 x 28      784       2,016         224        20,000
XC4028XL      2,432    32 x 32    1,024       2,560         256        28,000
XC4036XL      3,078    36 x 36    1,296       3,168         288        36,000
XC4044XL      3,800    40 x 40    1,600       3,840         320        44,000
XC4052XL      4,598    44 x 44    1,936       4,576         352        52,000
XC4062XL      5,472    48 x 48    2,304       5,376         384        62,000
XC4085XL      7,448    56 x 56    3,136       7,168         448        85,000
XC40110XV     9,728    64 x 64    4,096       9,216         448       110,000
XC40150XV    12,312    72 x 72    5,184      11,520         448       150,000
XC40200XV    16,758    84 x 84    7,056      15,456         448       200,000
XC40250XV    20,102    92 x 92    8,464      18,400         448       250,000
Table 4.1: XC4000XL and XC4000XV series FPGA data
FPGA implementations have the following advantages over microprocessor systems:
• FPGAs may provide large speedups for CSPs since highly parallel, problem-specific
circuits can be generated with a short turnaround time.
• FPGA implementations normally achieve a higher degree of parallelism than
microprocessor implementations.
• FPGAs have flexible word sizes so the architecture can be tailored to the problem.
• A single FPGA will usually have lower cost and power consumption than an
equivalent microprocessor based system.
The disadvantages of FPGA systems compared with microprocessor systems are:
• The clock speed of a FPGA is normally lower than that of a microprocessor.
• FPGA systems have relatively limited resources compared with microprocessors,
which have almost unlimited amounts of virtual memory.
• More sophisticated algorithms can be implemented in software than in hardware.
4.2.1 Xilinx 4000 series FPGAs
A Xilinx 4000 series FPGA consists of four major components: configurable logic
blocks (CLBs), three-state buffers, input/output blocks (IOBs) and programmable
interconnect. Figure 4.1 shows the internal structure of a FPGA. These blocks are
described in detail in the following sections.
Figure 4.1: FPGA structure. 
4.2.1.1 Configurable Logic Block 
Figure 4.2 shows a simplified block diagram of a configurable logic block. Configurable
logic blocks implement most of the logic in a FPGA. The following paragraphs describe
each main block of a CLB in detail.
Four independent inputs are provided to each of two function generators (F1-F4 and
G1-G4). These function generators, with outputs labeled F' and G', are each capable
of implementing any arbitrarily defined Boolean function of four inputs. Signals from
the function generators can exit the CLB on two outputs. F' or H' can be connected
to the X output. G' or H' can be connected to the Y output.
A third function generator, labeled H', can implement any Boolean function of its
three inputs, F', G' and H1. Using all three function generators, each CLB can
implement certain functions of up to nine variables.
The function generator outputs are optionally connected to two positive edge-triggered
D-type flip-flops having common clock and clock enable inputs. Thus, combinational
or sequential logic can be implemented.
There are a total of 8 multiplexers inside a CLB. They are used to select among the
various F, G and C1-C4 signals.
The F and G function generators in any CLB can be configured as RAM arrays
of two different sizes: two 16 x 1 RAMs with two data inputs and two data outputs
with identical addressing, or one 32 x 1 RAM with one data input and one data
output. A CLB can therefore store up to 32 bits as RAM, compared with the 2 bits
stored by its 2 D-type flip-flops, a 16x improvement in storage density. The RAM
has two timing modes, edge-triggered (synchronous) and level sensitive
(asynchronous). The F1-F4 and G1-G4 inputs to the function generators act as address
lines, selecting a particular memory cell in each look-up table.
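The look-up table view of a function generator can be illustrated with a small Python model. This is only a behavioural sketch under assumed conventions: the helper names (lut4, lut3, make_table, clb_and9) and the bit ordering of the address are choices made for the example and do not describe the actual device configuration format.

```python
def lut4(truth_table, a, b, c, d):
    """4-input function generator: a 16-entry look-up table addressed by its inputs."""
    return (truth_table >> (a | b << 1 | c << 2 | d << 3)) & 1

def lut3(truth_table, a, b, c):
    """3-input function generator (H): an 8-entry look-up table."""
    return (truth_table >> (a | b << 1 | c << 2)) & 1

def make_table(func, n_inputs):
    """Build the truth-table word for an arbitrary Boolean function."""
    table = 0
    for addr in range(1 << n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]
        table |= (func(*bits) & 1) << addr
    return table

# A 9-input AND: F and G each AND four inputs, H combines F', G' and H1.
AND4 = make_table(lambda a, b, c, d: a & b & c & d, 4)
AND3 = make_table(lambda a, b, c: a & b & c, 3)

def clb_and9(f_in, g_in, h1):
    return lut3(AND3, lut4(AND4, *f_in), lut4(AND4, *g_in), h1)
```

The same 16-bit word that defines a Boolean function serves as the memory contents when the function generator is used as a 16 x 1 RAM, with the four inputs acting as the address.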
4.2.1.2 Three-state Buffer 
A pair of 3-state buffers is associated with each CLB in the array. These 3-state 
buffers connect to their nearest horizontal longlines (see Section 4.2.1.4). Such an 
arrangement can be used to implement wired-AND and bus functions. Note that a 
pull-up resistor should be attached to the longlines if used in this manner. Figure 4.3 
shows the implementation of a wired-AND function. 
The 3-state buffers can be configured in three modes, standard 3-state buffer, wired-
AND and wired OR-AND mode. Figure 4.4 shows the implementation of a multiplexer 
using 3-state buffers having active-low input enable. 
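The behaviour of buffers sharing a longline can be modelled in a few lines of Python. The sketch below is a logical model only; the function names and the decision to raise an error on bus contention are assumptions made for the example.

```python
def wired_and(inputs):
    """Longline with a pull-up: any open-drain buffer driving 0 pulls the line low."""
    return int(all(inputs))

def tbuf_mux(data, enable_n):
    """Multiplexer built from 3-state buffers with active-low enables.

    Each buffer drives the shared line only when its enable is 0.
    Raises ValueError on bus contention (assumed policy for this sketch).
    """
    driven = [d for d, en in zip(data, enable_n) if en == 0]
    if len(set(driven)) > 1:
        raise ValueError("bus contention: buffers drive conflicting values")
    return driven[0] if driven else 1   # an undriven line is held high by the pull-up
```

In the multiplexer case exactly one enable is expected to be active at a time, so the line simply takes the value of the selected data input.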
4.2.1.3 Input/Output Block 
Figure 4.2: Simplified block diagram of a configurable logic block. 
User-configurable input/output blocks provide the interface between external package
pins and the internal logic. Each IOB controls one package pin and can be configured
for input, output, or bi-directional signals. Figure 4.5 shows a simplified block
diagram of the XC4000E IOB. Two paths, labeled I1 and I2, bring input signals into
the array. The inputs can be globally configured for either TTL or CMOS thresholds.
Output signals can be optionally inverted within the IOB. There are also additional
programming options for the IOB, such as pull-up and pull-down resistors, independent
clocks and global set/reset. Pull-up and pull-down resistors are useful for tying
unused pins to Vcc or GND to minimize power consumption and reduce noise. The global
set/reset signal (GSR) is useful during initialization.
Z = D_A · D_B · (D_C + D_D) · (D_E + D_F)
Figure 4.3: A wired-AND function implemented by open-drain buffers.
Z = D_A · A + D_B · B + D_C · C + ... + D_N · N
Figure 4.4: A multiplexer implemented by 3-state buffers.
Figure 4.5: The block diagram of the Input/Output Block.
4.2.1.4 Programmable Interconnect 
All internal connections are composed of metal segments with programmable switching 
points and switching matrices to implement the desired routing. A structured, hierar-
chical matrix of routing resources is provided to achieve efficient automated routing. 
There are several types of interconnect: 
• CLB routing is associated with each row and column of the CLB array.
• IOB routing forms a ring around the outside of the CLB array. It connects the I/O
with the internal logic blocks.
• Global routing consists of dedicated networks primarily designed to distribute
clocks throughout the device with minimum delay and skew.
Figure 4.6: The block diagram of Programmable Switch Matrix.
Figure 4.7: The high-level routing diagram of XC4000 Series CLB (shaded arrows
indicate XC4000X only).
For the Xilinx XC4000XL series, there are four main types of interconnect:
single-length lines, double-length lines, quad lines and long lines (see Figure 4.7).
The names reflect the relative lengths of their segments. The horizontal and
vertical single- and double-length lines intersect at a box called a programmable
switch matrix (PSM). Each switch matrix consists of programmable pass transistors
used to establish connections between the lines. Figure 4.6 shows the block diagram
of the PSM.
Single-length lines provide the greatest interconnect flexibility and offer fast routing 
between adjacent blocks. Double-length lines consist of a grid of metal segments, each 
twice as long as the single-length lines. These lines provide faster signal routing over 
intermediate distances, while retaining routing flexibility. They are connected by way 
of the programmable switch matrices. Quad lines are four times as long as the single-
length lines. They are interconnected via buffered switch matrices. They run past four 
CLBs before entering a buffered switch matrix. 
Longlines form a grid of metal interconnect segments that run the entire length
or width of the array. They are intended for high fan-out, time-critical signal nets,
or nets that are distributed over long distances. There are 6 vertical and 6
horizontal longlines for a Xilinx XC4000E FPGA. Two horizontal longlines per CLB can
be driven by 3-state or open-drain drivers (TBUFs). They can therefore implement
unidirectional or bidirectional buses, wide multiplexers, or wired-AND functions. A
pull-up resistor is attached to each horizontal longline driven by TBUFs.
4.2.2 Bitstream 
A bitstream defines the configuration of all parts of the FPGA and is downloaded
through a programming cable or bus.
The bitstream begins with a string of eight '1's and a preamble code, followed by
a 24-bit length count and a separator field of ones. This header is followed by the
actual configuration data in frames. The length and number of frames depend on the
device type. A XC4062 FPGA bitstream contains 2,339 frames and each frame consists
of 613 bits. Each frame begins with a start field and ends with a CRC check. A
postamble code is required to signal the end of data for a single device.
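The frame figures quoted for the XC4062 give a feel for the size of the configuration data. The short Python calculation below counts only the frame data; the header and postamble are left out because their exact field widths are not all stated here, and the function name is a hypothetical helper.

```python
def frame_data_bits(frames, bits_per_frame):
    """Total number of bits carried in the configuration frames."""
    return frames * bits_per_frame

FRAMES = 2339          # frames in an XC4062 bitstream (from the text)
BITS_PER_FRAME = 613   # bits per frame (from the text)

total_bits = frame_data_bits(FRAMES, BITS_PER_FRAME)
print(total_bits)              # 1433807 bits of frame data
print(round(total_bits / 8))   # 179226, i.e. roughly 175 Kbytes
```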
The bitstream can optionally be generated in ASCII format, which is useful for
direct modification. Unfortunately, Xilinx does not publish documentation of the
bitstream format for the XC4000 series.
4.3 Giga Operations Reconfigurable Computing Platform 
Figure 4.8: The picture of the GigaOps Reconfigurable Card. 
The GigaOps G900 PCI Reconfigurable Interface Card (shown in Figure 4.8) con-
tains two system FPGAs. One is used to interface PCI bus transactions to the XBUS 
and the components connected to it. The other one implements miscellaneous func-
tions such as clock generation, runtime reconfiguration of the FPGAs and power up 
configuration of the FPGAs.
Figure 4.9: The block diagram of the G900 Reconfigurable Interface Card. 
Figure 4.9 shows the block diagram of the G900 RIC. This board is connected
through a PCI bus with a maximum bandwidth of 133 Mbytes/sec. The fast PCI bus is
important for reducing the I/O bottleneck to the host computer. To communicate with
the FPGAs from the host computer, the HBUS protocol [1] is required. GigaOps provides
support for the HBUS protocol on the PPGA. This protocol allocates some of the
XBUS pins and assigns meaning to them. The hardware system allocates 21 of the
128 XBUS lines to the HBUS protocol.
The G900 PCI card is expandable. Each board can be expanded with up to 16 XMOD
computing modules. A XMOD contains two Xilinx XC4013EPQ208 FPGAs of speed grade
-3, 8M of 60ns DRAM and 256k of 20ns SRAM with a 16-bit data width. Each FPGA
contains 576 CLBs or 13,000 equivalent gates. The maximum capacity of the G900
RIC can thus be up to 9,216 CLBs or 208,000 equivalent gates, 128M of DRAM and 4M
of SRAM. Figure 4.10 shows the block diagram and Figure 4.11 shows a picture of
the XMOD computing module. For more details about the structure and each component
of the card, please refer to the reference manual [1].
Figure 4.10: The block diagram of the XMOD computing module. 
4.4 Annapolis Wildforce PCI board 
The Wildforce PCI board is manufactured by Annapolis Micro Systems, Inc. It is a
PCI bus based, high speed parallel processing board. The PCI bus runs at a clock
speed of 33MHz and employs a 32-bit data bus, so that the peak bandwidth can be up
to 133MB/s. The board consists of one Control Processing Element (CPE), a Xilinx
XC4085XL FPGA, and 4 Array Processing Elements (PEs), which are Xilinx XC4062XL
FPGAs. The total gate count can be up to 333K equivalent gates. Each PE has
its own 4MB dual ported SRAM with a 32-bit data width.
Figure 4.11: The picture of the XMOD computing module.
The host system may communicate with the board through the FIFOs, interrupt
signals, and the memory components found on the board. To write data to the memory
or fetch data back to the host, the host should "block" the dual port memory controller
(DPMC). When the DPMC is "blocked", the processing element will never receive a
grant signal after requesting memory, and will be unable to read or write to memory.
The interrupt request and acknowledge signals provide a means of communication
between each processing element and the host system. They are often most helpful in
signaling the start or completion of a process. For example, the processing element
might interrupt the host once it has finished processing a set of data and might wait
for the host to finish acknowledging the interrupt before processing more data. The
last method is the FIFOs, which are 512 entries deep by 36 bits wide in each
direction, both to and from the board.
In addition, there are two programmable clocks on the board, PCLK and MCLK. PCLK
is the "processing clock" and is programmed by the user. MCLK, or "memory clock",
always runs at twice the frequency of PCLK. The frequency of PCLK ranges from 2.5
to 50 MHz, so MCLK ranges from 5 to 100 MHz. The maximum clock skew of both clocks
is 0.1%. Figure 4.12 shows the block diagram of the Wildforce board. Please refer to
the reference manual [6] for further details.
Figure 4.12: The block diagram of the Wildforce board. 
4.5 Summary 
FPGAs offer the advantages of flexibility, reusable logic and fast development times while
maintaining reasonable logic densities. In this chapter, the architecture and internal
structure of the Xilinx 4000 series FPGAs were presented, and the architectures of two
different hardware development platforms, the GigaOps Reconfigurable Interface Card and
the Annapolis Wildforce Board, were discussed in detail.
Chapter 5 
Implementation 
In this chapter, the implementation of graph coloring and boolean satisfiability solving
systems using FPGAs is presented. A parallel machine and a serial machine for
solving graph coloring problems are discussed in the first two sections. A boolean
satisfiability (SAT) solver is then presented, followed by a GSAT solver.
5.1 Parallel Graph Coloring Machine 
Figure 5.1: The 4-node graph. 
The graph coloring problem [42] involves assigning a color to every node in the
graph such that connected nodes in the graph are of different colors. Figure 5.1 shows
a simple 4-node graph with nodes {M, N, O, P}. Each node can be one of
three colors: red, green or blue. This problem can be posed as a CSP (see Chapter 1)
where Z is the set of nodes {M, N, O, P}, D is a function that maps every node in Z to
its domain ($D_M$, $D_N$, $D_O$ and $D_P$, where $D_M = D_N = D_O = D_P$ = {red, green, blue}),
and C is the set of constraints. Since M and N are connected, they are constrained to be
of different colors. In contrast, N and O are not connected, so they can be of the same color.
One of the solutions to this problem is {M, N, O, P} = {red, green, green, blue}.
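As a minimal sketch (not the thesis implementation), the CSP formulation can be written down directly in Python. The edge list is an assumption chosen to agree with the text (M is connected to N, O and P; N and O are the only unconnected pair), and the names `EDGES` and `is_valid` are hypothetical.

```python
# Nodes (Z), domain (D) and constraints (C) of the 4-node example.
# The edge list is an assumption consistent with the text, not taken
# verbatim from the thesis.
NODES = ["M", "N", "O", "P"]
COLORS = ["red", "green", "blue"]
EDGES = [("M", "N"), ("M", "O"), ("M", "P"), ("N", "P"), ("O", "P")]


def is_valid(assignment):
    """True if every node has a color from the domain and no edge joins equal colors."""
    if any(assignment.get(node) not in COLORS for node in NODES):
        return False
    return all(assignment[a] != assignment[b] for a, b in EDGES)


solution = {"M": "red", "N": "green", "O": "green", "P": "blue"}
print(is_valid(solution))
```

Under these assumptions the solution quoted in the text passes the check, while any assignment that gives two connected nodes the same color fails it.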
[Figure 5.2 diagram omitted: search tree in which the branches at each level are labeled
red, green and blue.]
Figure 5.2: The tree representation of the previous graph. 
A forward checking tree search algorithm (described in Section 3.2.2) was applied
to the graph coloring problem and implemented on an FPGA. In this implementation,
all constraints were evaluated in parallel. This approach, although very fast, is limited
to relatively small problems because of its large hardware usage.
The following pseudo code describes the search algorithm. Each node is identified
by a unique number. The variable node_no is the index of the node currently being
assigned. The function solution() returns the previous assignment of the
node. The function constraint() returns the current constraints on the node. The
function generate() produces a new assignment for the node and saves it back to the
node's register. The function evaluate() checks the constraints against
the current assignments of all nodes (denoted by all_sol). If the constraints are violated,
backtracking is executed and the index of the node is decremented. A
solution is found when an assignment exists for all nodes.
graph_search()
BEGIN
    WHILE (have space to search)
    BEGIN
        sol = solution(node_no);
        con = constraint(node_no);
        solution(node_no) = generate(sol, con);
        backtrack = evaluate(all_sol);
        IF (backtrack = TRUE) THEN
            node_no--;
        ELSE
            IF (last node assigned) THEN
                save all_sol to memory;
            ELSE
                node_no++;
            END IF
        END IF
    END WHILE
END
Figure 5.2 shows the tree representation for the 4-node graph of Figure 5.1. Suppose
that node M is red, N is green and O is blue. Although no constraints have yet been
violated, forward checking detects that there is no possible color for node P, so the
algorithm backtracks. It then discovers that no further color is available for node O,
so backtracking is executed again. This causes the next color (blue) to be assigned to
node N.
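The behaviour described above can be illustrated with a short software sketch of depth-first search with forward checking. This is a recursive Python model written for illustration only; the hardware iterates with a node index rather than recursing. The edge set is an assumption consistent with the text, and the helper names (`allowed`, `dead_end`, `search`) are hypothetical.

```python
NODES = ["M", "N", "O", "P"]
COLORS = ["red", "green", "blue"]
# Assumed edge set: every pair is connected except N and O.
EDGES = {("M", "N"), ("M", "O"), ("M", "P"), ("N", "P"), ("O", "P")}


def connected(a, b):
    return (a, b) in EDGES or (b, a) in EDGES


def allowed(node, color, assignment):
    """A color is allowed if no connected, already-assigned node uses it."""
    return all(not (connected(node, other) and used == color)
               for other, used in assignment.items())


def dead_end(assignment):
    """Forward check: True if some unassigned node has no color left."""
    for node in NODES:
        if node not in assignment and not any(
                allowed(node, c, assignment) for c in COLORS):
            return True
    return False


def search(assignment, solutions):
    """Depth-first search over NODES in order, pruning with the forward check."""
    if len(assignment) == len(NODES):
        solutions.append(dict(assignment))
        return
    node = NODES[len(assignment)]
    for color in COLORS:
        if not allowed(node, color, assignment):
            continue
        assignment[node] = color
        if not dead_end(assignment):
            search(assignment, solutions)
        del assignment[node]  # backtrack and try the next color


solutions = []
search({}, solutions)
print(len(solutions), solutions[0])
```

With M red, N green and O blue, `dead_end` reports that P has no color left, which corresponds to the backtracking step described in the text.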
5.1.1 System Architecture 
Figure 5.3 shows the block diagram of the system architecture adopted for the parallel
graph coloring machine. The function calls are the interactions between the host
and the FPGA board. The finite state machine keeps track of the states and
responds to the function calls. The memory stores the solutions. The evaluator
updates the constraints concurrently for different inputs in each evaluation.
The details of the individual parts are discussed in the following sub-sections.
[Figure 5.3 diagram omitted; blocks shown: Function Calls, Finite State Machine, Memory,
Evaluator.]
Figure 5.3: Block diagram of the Parallel Graph Coloring Machine.
5.1.2 Evaluator 
The function of the evaluator is to update the constraints when a node is assigned
a color. It also detects and reports an illegal assignment to a node. Moreover,
if all assignments to a node are constrained, a "backtrack" bit is set to inform
the finite state machine of this case. A graph coloring problem with 4 nodes
(M, N, O, P) and 3 colors (red, green, blue) is used as an example (refer to Figure 5.1).
The connections between nodes are recorded in a connection table, which is shown
in Figure 5.4.
        M    N    O    P
   M         ✓    ✓    ✓
   N                   ✓
   O                   ✓
   P
Figure 5.4: Connection table for 4-node graph coloring problem.
A tick means that there is a connection between two nodes. Two further tables, the
assignment and constraint tables (see Figure 5.5), are used to record the assignments
and constraints. Assignments are made by setting an entry in the assignment table
to '1'. This causes the constraint table to be updated so that a '1' appears for all
disallowed variable assignments.
            M  N  O  P                M  N  O  P
   red      1               red          1  1  1
   green       1            green     1        1
   blue                     blue
        assignment                constraint
Figure 5.5: Assignment and Constraint Tables for 4 nodes and 3 colors.
For example, in the first step, node M is assigned the color red. The assignment is
written to the latches as a 3-bit column vector, "100", which means that the entry
(M, red) is set to '1', (M, green) is set to '0' and (M, blue) is set to '0' in the
assignment table. This vector is generated by the finite state machine (described in
Section 5.1.3). Since node M has been assigned, nodes N, O and P cannot be assigned
red (refer to the connection table). The evaluator sets three '1's in the corresponding
entries of the constraint table to indicate that they are constrained. The second step
is to assign a color to node N. Node N cannot be red since the constraint table has a
'1' in that position. Therefore, green is chosen and a '1' is assigned to (N, green) in
the assignment table. The vector is "010". The nodes connected to node N (i.e. M and P)
are marked with a '1' in the green row of the constraint table. The entry (O, green) in
the constraint table remains '0' because no connection exists between node N and node O.
The procedure continues until all possibilities have been tried.
The column vector keeps track of the colors of the corresponding nodes. In every cycle,
a new assignment is written to the evaluator's assignment table and the evaluator
updates the constraint table. A '1' entry means that the node cannot be set to that
color. A '0' entry means that a new assignment of that color can be made. The
evaluation requires only one cycle to finish.
If a whole column in the constraint table is '1', no possible assignment
to that node exists and there is no need to search any further down the tree.
A signal 'back' is activated which informs the finite state machine that backtracking
should be executed immediately. If the last node is successfully assigned a color, a
solution has been found.
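The table update and the 'back' signal can be modelled in a few lines of Python. This is a behavioural sketch under assumptions, not the gate-level design: the connection table is written out symmetrically, tables are lists indexed as `table[color][node]`, and the function name `evaluate` is hypothetical.

```python
# Symmetric connection table for nodes M, N, O, P (assumed from Figure 5.4).
CONNECT = [
    [0, 1, 1, 1],  # M
    [1, 0, 0, 1],  # N
    [1, 0, 0, 1],  # O
    [1, 1, 1, 0],  # P
]


def evaluate(assign):
    """Return (constraint table, back) for assign[color][node] in {0, 1}.

    An entry of the constraint table is the OR of the same-color assignment
    bits of all connected nodes. 'back' is raised when some node has every
    color constrained, i.e. a whole column of the table is 1.
    """
    n = len(CONNECT)
    const = [[int(any(CONNECT[node][other] and row[other] for other in range(n)))
              for node in range(n)]
             for row in assign]
    back = any(all(row[node] for row in const) for node in range(n))
    return const, back


# M = red, N = green (rows are red, green, blue).
const, back = evaluate([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]])
print(const, back)
```

Adding O = blue to the same assignment fills the whole column of node P, so the model raises `back`, matching the forward checking example given earlier.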
[Figure 5.6 diagram omitted; signals shown: assign<0:n> latched by D flip-flops,
const<0:n> produced by the combinational logic block, back, Reset, Clk.]
Figure 5.6: Gate level diagram of evaluator. 
The gate level implementation of the evaluator is shown in Figure 5.6.
Positive-edge-triggered D-type flip-flops with asynchronous reset are used to latch the
assignment values. In the example, there are 4 x 3 flip-flops, where 4 is the number
of nodes and 3 is the number of colors. The logic gates implement the constraints,
which are simply the OR of the assignments of all connected nodes in the graph.
Figure 5.7 is a table representation of the evaluator showing that each table entry has
two ports, an assignment port A and a constraint port C. The Boolean expressions of the
combinational logic are shown below. For example, $C_1$ is constrained (logic high, '1')
if any of $A_4$, $A_7$ or $A_{10}$ is set, i.e. if N, O or P is assigned the color red;
otherwise, it is set to '0'.
            M          N          O          P
   red      A1/C1      A4/C4      A7/C7      A10/C10
   green    A2/C2      A5/C5      A8/C8      A11/C11
   blue     A3/C3      A6/C6      A9/C9      A12/C12
Figure 5.7: Symbol table of the evaluator. 
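$C_1 = A_4 \cup A_7 \cup A_{10}$, $C_4 = A_1 \cup A_7 \cup A_{10}$, $C_7 = A_1 \cup A_4 \cup A_{10}$, $C_{10} = A_1 \cup A_4 \cup A_7$
$C_2 = A_5 \cup A_8 \cup A_{11}$, $C_5 = A_2 \cup A_8 \cup A_{11}$, $C_8 = A_2 \cup A_5 \cup A_{11}$, $C_{11} = A_2 \cup A_5 \cup A_8$
$C_3 = A_6 \cup A_9 \cup A_{12}$, $C_6 = A_3 \cup A_9 \cup A_{12}$, $C_9 = A_3 \cup A_6 \cup A_{12}$, $C_{12} = A_3 \cup A_6 \cup A_9$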
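Because the parallel machine needs its combinational logic regenerated for each problem, it can help to see how such expressions follow mechanically from a connection table. The following Python sketch is illustrative only: the function name `constraint_equations` is hypothetical, entries are assumed to be numbered as in Figure 5.7 (k = node x colors + color + 1), and `|` stands for the OR written as a union above.

```python
def constraint_equations(connect, n_colors):
    """Build one OR expression per table entry from a connection table.

    connect[i][j] is 1 if nodes i and j are connected. Entry numbering is
    assumed to follow Figure 5.7: node 0 owns entries 1..n_colors, and so on.
    """
    n = len(connect)
    equations = {}
    for node in range(n):
        for color in range(n_colors):
            k = node * n_colors + color + 1
            terms = [f"A{other * n_colors + color + 1}"
                     for other in range(n) if connect[node][other]]
            equations[f"C{k}"] = " | ".join(terms)
    return equations


# A fully connected 4-node graph reproduces the twelve expressions listed above.
complete = [[int(i != j) for j in range(4)] for i in range(4)]
eqs = constraint_equations(complete, 3)
print(eqs["C1"], "/", eqs["C5"], "/", eqs["C12"])
```

For a graph with fewer connections, the corresponding terms simply drop out of each expression.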
5.1.3 Finite State Machine (FSM)
The finite state machine has three functions. The first is to handle and
respond to the function calls from the host computer. The second is to store
the temporary solution of the nodes so that it can be written to the external static
memory at the proper time. The third is to generate the "column vector" for the
evaluator. These three functions are discussed in detail in the following paragraphs.
There are three user functions which can be called by the host computer. Function
1 is the reset function. It resets all the registers to their initial state and is called
only once. Function 2 orders the FPGA to find solutions. In the implementation,
immediately after it is called, the FPGA executes until all the external static
memories are full of solutions. Then, the FPGA becomes idle as it waits for the
host to read back the solutions. Function 3 is the memory fetch function. It is used to
fetch solutions back to the host after Function 2 has finished its job. When Function 3
has finished, Function 2 is called again so that the FPGA can continue
to find further solutions.
As mentioned before, the temporary solution can be read from the assignments
held in the evaluator. After the last node has been assigned a color, the
whole vector of assignments is one of the solutions of the problem. The finite state
machine must fetch this vector at the proper time and move to
a state that writes it to memory. This state lasts for 3 cycles, because 2
bits are required to encode the 4 possible colors of a node and the example used
here has 20 nodes, so 40 bits are required to encode one solution. As mentioned
in Section 4.3, the data width of the memory is 16 bits, so 3 cycles are required to
write the 40-bit word to 3 consecutive locations.
In each evaluation of an assignment, the finite state machine receives two vectors,
namely constrain and occupy. The first vector, constrain, indicates which
assignments of that node are constrained by other nodes. The second vector, occupy,
shows the previous assignment of that node. With occupy, the next possible assignment
of that node can be selected when backtracking is executed. A new vector, namely assign,
is generated from the two vectors. It indicates the new assignment of that node. This
vector is written to the evaluator so that the constraints are updated by the new assignment.
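One plausible way to derive assign from constrain and occupy is sketched below in Python. The bit ordering (bit 0 is the first color), the convention that occupy is 0 when the node has not yet been assigned, and the function name `next_assign` are all assumptions made for illustration; the thesis does not specify them.

```python
def next_assign(occupy, constrain, n_colors):
    """Return the next one-hot assignment, or 0 if none is left (backtrack).

    occupy    one-hot previous assignment of the node (0 if never assigned)
    constrain bit i is 1 if color i is disallowed by a connected node
    Colors at or below the previously tried one are skipped so that the
    search never revisits an assignment.
    """
    start = occupy.bit_length()  # index just above the previous color
    for i in range(start, n_colors):
        if not (constrain >> i) & 1:
            return 1 << i
    return 0


# Node N in the running example: red (bit 0) is constrained by M.
print(bin(next_assign(0b000, 0b001, 3)))   # first try for N
print(bin(next_assign(0b010, 0b001, 3)))   # next try after backtracking to N
```

A return value of 0 corresponds to the case in which no further assignment exists for the node, so the node index must be decremented.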
5.1.4 Memory
The main purpose of the Memory module is to store the solutions of the problem. As
discussed in the previous section, each solution consists of 40 bits, so three consecutive
locations are required to store one solution. As the memory space is 512kb, 174762
(512k/3) solutions can be generated per run. The host computer retrieves the
solutions after the memory is full of solutions.
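The arithmetic behind these figures can be checked with a small Python sketch. The function names are hypothetical, and the memory size is interpreted as 512k addressable 16-bit locations, which is an assumption implied by the division 512k/3 rather than something stated explicitly.

```python
def bits_per_solution(nodes, colors):
    """Binary-encoded solution size: ceil(log2(colors)) bits per node."""
    bits_per_node = max(1, (colors - 1).bit_length())
    return nodes * bits_per_node


def words_per_solution(nodes, colors, word_bits):
    """Memory locations (and write cycles) needed for one solution."""
    return -(-bits_per_solution(nodes, colors) // word_bits)  # ceiling division


def solutions_per_run(locations, nodes, colors, word_bits):
    """Whole solutions that fit in the memory before the host must read it."""
    return locations // words_per_solution(nodes, colors, word_bits)


print(bits_per_solution(20, 4))                    # bits per solution
print(words_per_solution(20, 4, 16))               # locations per solution
print(solutions_per_run(512 * 1024, 20, 4, 16))    # solutions per run
```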
5.1.5 Hardware Resources 
To estimate the hardware resources required by the design, the equations in
Table 5.1 can be used. Color and node are the number of colors and nodes respectively for
a particular problem. Density defines the number of connections between two nodes.
Modules      Number of CLBs used
Evaluator    color × node + color × node × ⌈[illegible]⌉ + ⌈color × node / 8⌉ + ⌈color × node / [illegible]⌉
FSM          state × 4
Interface    ≈ 150
Total        color × node + color × node × ⌈[illegible]⌉ + ⌈color × node / 8⌉ + ⌈color × node / [illegible]⌉
             + state × 4 + 150
Table 5.1: Estimated resources for the parallel graph coloring machine 
5.2 Serial Graph Coloring Machine 
Parallel implementation has the advantage of speed, but its serious shortcomings are that
it uses a lot of hardware resources and that a new circuit must be generated, synthesized,
placed and routed for each new problem. In this section, a design which trades off
speed for the ability to solve larger problems is presented.
A key feature of the Xilinx 4000 devices is their "distributed RAM" function (see
Section 4.2.1). Using this feature can lead to a 16 times improvement in circuit density.
The search algorithm used was the same as that of the parallel solver, namely depth first
search with forward checking. To minimize hardware requirements, the system was
designed with a lower degree of parallelism. It is much more complex than the first design.
Although it uses more cycles for each evaluation, it can be run at a faster clock speed,
which compensates for the reduced parallelism to a certain extent.
5.2.1 System Architecture 
Figure 5.8 shows the block diagram of the architecture. The design consists of
eight modules: Input Memory, Solution Store, Constraint Memory, Evaluator, Input
Mapper, Output Memory, Backtrack Evaluator and Word Generator. The
Input Memory stores the assigned vector of each node. By using a one-hot
encoding (one bit per state), each vector can take four different values,
which correspond to the different possible assignments to a variable. The Solution Store
stores the values from the Input Memory when a solution is found. The Constraint
Memory stores the connections between nodes. The Evaluator
evaluates a variable assignment to see if it violates any constraints. The Input Mapper
maps the outputs from the Evaluator to the Output Memory. The Output Memory
stores the output of the Evaluator. The Backtrack Evaluator checks whether
any constraints have been violated by the current partial assignment. Backtracking
should be executed if no solution can exist. The Word Generator
generates a new variable assignment to test from the present variable assignment.
[Figure 5.8 diagram omitted; legible labels: Constraint Memory, Evaluator, constrain, back,
assign, occupy, output, addr_c, addr_i, addr_o.]
Figure 5.8: Block diagram of the Serial Graph Coloring Machine.
In the following sections, a 20 node, 4 color graph coloring example is used.
Each module is described assuming this particular problem;
however, the approach can be expanded to a problem of any size, given enough hardware
resources.
           mem_i0                       mem_i1                       mem_i2                       mem_i3
addr 0     A0,0  A0,1  A0,2  A0,3       A1,0  A1,1  A1,2  A1,3       A2,0  A2,1  A2,2  A2,3       A3,0  A3,1  A3,2  A3,3
addr 1     A4,0  A4,1  A4,2  A4,3       A5,0  A5,1  A5,2  A5,3       A6,0  A6,1  A6,2  A6,3       A7,0  A7,1  A7,2  A7,3
addr 2     A8,0  A8,1  A8,2  A8,3       A9,0  A9,1  A9,2  A9,3       A10,0 A10,1 A10,2 A10,3      A11,0 A11,1 A11,2 A11,3
addr 3     A12,0 A12,1 A12,2 A12,3      A13,0 A13,1 A13,2 A13,3      A14,0 A14,1 A14,2 A14,3      A15,0 A15,1 A15,2 A15,3
addr 4     A16,0 A16,1 A16,2 A16,3      A17,0 A17,1 A17,2 A17,3      A18,0 A18,1 A18,2 A18,3      A19,0 A19,1 A19,2 A19,3
(each block: 5 addresses, 4 bits per word)
Figure 5.9: Contents of the Input Memory. 
5.2.2 Input Memory
The Input Memory module consists of four individual memory blocks. Each block has 5
addresses with 4-bit data. The assignment pattern of each node is shown in Figure 5.9.
A whole word is fetched from one of the memory blocks, depending on the node
number (from 0 to 19). If the last two bits of the node number are "00", "01", "10" or
"11", the word is fetched from mem_i0, mem_i1, mem_i2 or mem_i3 respectively.
This word, namely occupy, is one of the inputs used by the Word Generator to determine
a new word for the particular node. Xilinx RAM (described in Section 4.2.1) is
used to implement the Input Memory, so the 4 memory blocks occupy a total of
8 CLBs. The design is easily expandable to larger problems by increasing the data width
and the number of address lines.
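The block selection described above amounts to splitting the node number into two fields. The Python sketch below models it; taking the address from the remaining upper bits is an assumption read off the layout in Figure 5.9, and the names `locate`, `write_word` and `read_word` are hypothetical.

```python
N_BLOCKS = 4      # mem_i0 .. mem_i3
N_ADDRESSES = 5   # 5 addresses per block for 20 nodes

# Software model of the four 5 x 4-bit memory blocks.
input_memory = [[0] * N_ADDRESSES for _ in range(N_BLOCKS)]


def locate(node):
    """Split a node number into (block index, address).

    The last two bits select the block; the upper bits are assumed to
    form the address within that block.
    """
    return node & 0b11, node >> 2


def write_word(node, word):
    block, addr = locate(node)
    input_memory[block][addr] = word & 0xF  # 4-bit one-hot color word


def read_word(node):
    block, addr = locate(node)
    return input_memory[block][addr]


write_word(13, 0b0100)
print(locate(13), bin(read_word(13)))
```

Node 13, for example, lands in block mem_i1 at address 3, which agrees with its position in Figure 5.9.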
5.2.3 Solution Store 
The Solution Store module stores solutions into the external static memory
when they are found. A solution is detected when an assignment is found for the last node.
In this particular case, each assignment to a node consists of 4 bits which correspond
to the 4 colors encoded in a one-hot fashion. There are 20 nodes and therefore
the total number of bits for a solution is 80. Since the width of each word
of the external static memory on the Wildforce board is 32 bits, 3 words are
used to store a solution. The lowest 32 bits (bits 0 - 31) are stored at address 'x', the
next 32 bits (bits 32 - 63) at address 'x+1' and the remaining 16 bits (bits 64 - 79)
at address 'x+2'. The writing process requires three clock cycles to store
a solution. According to the Annapolis Wildforce Reference Manual,
the memory has 1048576 ($2^{20}$) locations with a 32-bit data width. So, the maximum
number of solutions that can be stored in the 4M byte external memory is 1048576/3 (or
349525). After the memory has been filled with solutions, the FPGA interrupts
the host. On the other side, the host polls for the interrupt request. If it detects an
interrupt request from a processing element (PE), it fetches all solutions back to the
host and then resets the interrupt request from that PE. The FPGA can then continue
to execute and find more solutions.
The previous discussion covers the case in which a solution exists. If there is no solution,
nothing is written to the external static memory. The host can detect this case when it
receives an interrupt from the PE and the first location of the memory has not changed.
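The splitting of an 80-bit solution into three memory words can be sketched in Python as follows. The placement of node 0 in the least significant 4 bits is an assumption (the text only fixes which bit ranges go to which address), and `solution_words` is a hypothetical name.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit external memory word


def solution_words(one_hot_colors):
    """Pack twenty 4-bit one-hot color words into words for x, x+1, x+2.

    Node 0 is assumed to occupy bits 0-3, node 1 bits 4-7, and so on.
    """
    value = 0
    for node, word in enumerate(one_hot_colors):
        value |= (word & 0xF) << (4 * node)
    # bits 0-31 -> address x, bits 32-63 -> x+1, bits 64-79 -> x+2
    return [(value >> shift) & WORD_MASK for shift in (0, 32, 64)]


words = solution_words([0b0001] * 20)
print([hex(w) for w in words])

# Capacity of a 2**20-location memory at three locations per solution.
print((1 << 20) // 3)
```

The third word carries only 16 useful bits, so part of every third location is unused in this scheme.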
5.2.4 Constraint Memory
addr 0     X0,0  X1,0  X2,0  X3,0       X0,1  X1,1  X2,1  X3,1       ...    X0,19  X1,19  X2,19  X3,19
addr 1     X4,0  X5,0  X6,0  X7,0       X4,1  X5,1  X6,1  X7,1       ...    X4,19  X5,19  X6,19  X7,19
addr 2     X8,0  X9,0  X10,0 X11,0      X8,1  X9,1  X10,1 X11,1      ...    X8,19  X9,19  X10,19 X11,19
addr 3     X12,0 X13,0 X14,0 X15,0      X12,1 X13,1 X14,1 X15,1      ...    X12,19 X13,19 X14,19 X15,19
addr 4     X16,0 X17,0 X18,0 X19,0      X16,1 X17,1 X18,1 X19,1      ...    X16,19 X17,19 X18,19 X19,19
Figure 5.10: Contents of the Constraint Memory. 
The Constraint Memory module stores the connections between nodes in
the graph with the ordering shown in Figure 5.10. The memory block contains 5
addresses with a word length of 80 bits. Each bit $X_{x,y}$ is set if there is a connection
between node x and node y. The constraint memory totals 400 bits, which represent
all ($20^2$) possible connections between 20 nodes. Each 80-bit word is divided into 20
pieces of 4 bits each. Each 4-bit piece is connected to a 1-bit evaluator. There are
20 1-bit evaluators in total, which are used to evaluate the constraints.
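As an illustration of this layout, the Python sketch below builds the five 80-bit words from an edge list. The exact bit position of each $X_{x,y}$ inside a word (here bit 4y + x mod 4) is an assumption; the figure fixes only the grouping into 20 pieces of 4 bits. The function names are hypothetical.

```python
def build_constraint_memory(edges, n_nodes=20):
    """Return one integer per address, each holding an 80-bit word.

    Address x // 4 holds, for every node y, the 4-bit piece
    X[4a..4a+3][y]; the piece for node y is assumed to sit at bits 4y..4y+3.
    """
    words = [0] * ((n_nodes + 3) // 4)
    for a, b in edges:
        for x, y in ((a, b), (b, a)):  # connections are symmetric
            words[x // 4] |= 1 << (4 * y + x % 4)
    return words


def is_connected(words, x, y):
    """Read bit X[x][y] back from the packed memory."""
    return (words[x // 4] >> (4 * y + x % 4)) & 1 == 1


memory = build_constraint_memory([(5, 17), (0, 1)])
print(len(memory), is_connected(memory, 17, 5), is_connected(memory, 5, 16))
```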
The constraint memory is implemented using Xilinx internal RAM (refer to
Section 4.2.1). This minimizes wiring delays, increases logic density and, since
a very wide memory can be implemented, increases parallelism. The contents of the
memory are initialized by the host at runtime. Each 16x1 synchronous RAM (16
addresses with 1 bit of data) occupies 1/2 of a CLB, so 40 (80/2) CLBs are required to
implement 5 addresses with 80-bit data. To expand the design to larger problems,




[Figure 5.11 diagram omitted; signals shown: assign<3:0>, constraint<3:0>, clk, reset, output.]
Figure 5.11: Gate level diagram of a 1-bit evaluator. 
5.2.5 Evaluator
There are 20 outputs $C_{x,y}$ to be updated in a whole evaluation process. The
Evaluator consists of 20 1-bit evaluators, each as shown in Figure 5.11. A 1-bit evaluator
accepts only 8 input bits because a CLB contains only an F function generator and a G
function generator (explained in Section 4.2.1), and each generator has at most
4 inputs. To achieve good CLB utilization, the 1-bit evaluator takes 4 bits from
the Input Memory and 4 constraint bits from the Constraint Memory. There is a
flip-flop inside the evaluator which stores the partially evaluated constraint.
Therefore, each 1-bit evaluator occupies 2 CLBs.
The 20 1-bit evaluators each evaluate 4 of the OR terms per cycle, hence it takes 5
cycles to generate an output. For example, to generate the output $C_{0,0}$ in 5 cycles, a
1-bit evaluator computes
1. $C_{0,0} = A_{0,0}X_{0,0} + A_{1,0}X_{1,0} + A_{2,0}X_{2,0} + A_{3,0}X_{3,0}$
2. $C_{0,0} = C_{0,0} + A_{4,0}X_{4,0} + A_{5,0}X_{5,0} + A_{6,0}X_{6,0} + A_{7,0}X_{7,0}$
3. $C_{0,0} = C_{0,0} + A_{8,0}X_{8,0} + A_{9,0}X_{9,0} + A_{10,0}X_{10,0} + A_{11,0}X_{11,0}$
4. $C_{0,0} = C_{0,0} + A_{12,0}X_{12,0} + A_{13,0}X_{13,0} + A_{14,0}X_{14,0} + A_{15,0}X_{15,0}$
5. $C_{0,0} = C_{0,0} + A_{16,0}X_{16,0} + A_{17,0}X_{17,0} + A_{18,0}X_{18,0} + A_{19,0}X_{19,0}$
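The five-step accumulation can be mimicked in software as below. This Python sketch is a behavioural model under assumptions: `a[i]` stands for the assignment bit of node i for the color under test, `x[i]` for its connection bit to the target node, and the function name `evaluate_serial` is hypothetical.

```python
def evaluate_serial(a, x, group=4):
    """Accumulate OR of a[i] AND x[i], four product terms per cycle.

    Returns the final constraint bit and its value after each cycle,
    mirroring the flip-flop that holds the partially evaluated result.
    """
    c = 0
    history = []
    for start in range(0, len(a), group):
        partial = 0
        for i in range(start, min(start + group, len(a))):
            partial |= a[i] & x[i]
        c |= partial          # the flip-flop keeps the running OR
        history.append(c)
    return c, history


a = [0] * 20
x = [0] * 20
a[13] = 1   # node 13 holds the color under test
x[13] = 1   # node 13 is connected to the target node
result, history = evaluate_serial(a, x)
print(result, history)
```

The result equals the OR over all 20 product terms; the serial machine simply spreads that OR over five clock cycles.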
5.2.6 Input Mapper 
[Figure 5.12 diagram omitted. Routing from the Input Mapper to the Output Memory blocks:]
   C0,0  C4,0  C8,0   C12,0  C16,0   ->  mem_o0
   C1,0  C5,0  C9,0   C13,0  C17,0   ->  mem_o1
   C2,0  C6,0  C10,0  C14,0  C18,0   ->  mem_o2
   C3,0  C7,0  C11,0  C15,0  C19,0   ->  mem_o3
Figure 5.12: Block diagram of Input Mapper. 
Figure 5.12 shows the block diagram of the Input Mapper. The Input Mapper
maps the outputs from the Evaluator to the Output Memory. An output of the
Evaluator is set if a node is constrained from taking a color, i.e. if that color has
been assigned to a previous node to which it is connected.
5.2.7 Output Memory 
The layout of the Output Memory module is the same as that of the Input Memory module. It
consists of 4 individual memory blocks and each block consists of 5 addresses with
4-bit data. The layout of the memory is shown in Figure 5.13. The inputs are taken
from the outputs of the Evaluator (see Figure 5.8), which are the updated constraints
for a possible assignment to a node. Constrain indicates the constrained assignments
for that node and is sent from the Output Memory to the Word Generator. There are
4 memory blocks and the output depends on the node number. If the last two
bits of the node number are "00", "01", "10" or "11", the word is fetched from
mem_o0, mem_o1, mem_o2 or mem_o3 respectively.
           mem_o0                       mem_o1                       mem_o2                       mem_o3
addr 0     C0,0  C0,1  C0,2  C0,3       C1,0  C1,1  C1,2  C1,3       C2,0  C2,1  C2,2  C2,3       C3,0  C3,1  C3,2  C3,3
addr 1     C4,0  C4,1  C4,2  C4,3       C5,0  C5,1  C5,2  C5,3       C6,0  C6,1  C6,2  C6,3       C7,0  C7,1  C7,2  C7,3
addr 2     C8,0  C8,1  C8,2  C8,3       C9,0  C9,1  C9,2  C9,3       C10,0 C10,1 C10,2 C10,3      C11,0 C11,1 C11,2 C11,3
addr 3     C12,0 C12,1 C12,2 C12,3      C13,0 C13,1 C13,2 C13,3      C14,0 C14,1 C14,2 C14,3      C15,0 C15,1 C15,2 C15,3
addr 4     C16,0 C16,1 C16,2 C16,3      C17,0 C17,1 C17,2 C17,3      C18,0 C18,1 C18,2 C18,3      C19,0 C19,1 C19,2 C19,3
(each block: 5 addresses, 4 bits per word)
Figure 5.13: Contents of the Output Memory.
5.2.8 Backtrack Checker 
[Figure 5.14 diagram omitted; signals shown: input1<3:0>, input2<3:0>, back1.]
Figure 5.14: Gate level diagram of the backtrack checker. 
Figure 5.14 shows the gate level diagram of the Backtrack Checker. Its function
is to check whether backtracking should be executed. The basic mechanism is to check,
for each node, whether all possible assignments to that node are constrained, i.e. equal
to '1'. Since there are 4 possible entries (colors), a basic building block of two levels
of 2-input AND gates is used. If the output is equal to '1', no solution
exists for that node and backtracking should be executed.
In the implementation, after the 5 cycles of evaluation, another 5 cycles are used
to store the updated constraints to the Output Memory. To check all 20 nodes in 5
cycles, 4 nodes must be checked in each cycle. So, 4 basic building blocks are used and
their results are ORed. However, each CLB accepts only 8 inputs, so two blocks of
Figure 5.14 (2 CLBs) are used to evaluate the results concurrently. The results, back1
and back2, are simply ORed to produce the backtrack signal.
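A behavioural Python sketch of this check follows. It follows the textual description (a two-level AND per 4-bit constraint word, results ORed over 4 nodes per cycle for 5 cycles) rather than the gate-level figure, and the names `all_constrained` and `backtrack` are hypothetical.

```python
def all_constrained(word):
    """Two-level 2-input AND over a 4-bit constraint word."""
    low = (word & 1) & ((word >> 1) & 1)
    high = ((word >> 2) & 1) & ((word >> 3) & 1)
    return low & high


def backtrack(output_memory):
    """OR the per-node results, checking one word from each block per cycle.

    output_memory[block][addr] holds the 4-bit constraint word of node
    4 * addr + block, so 5 addresses cover all 20 nodes.
    """
    back = 0
    for addr in range(5):           # one cycle per address
        for block in range(4):      # 4 nodes checked concurrently
            back |= all_constrained(output_memory[block][addr])
    return back


memory = [[0] * 5 for _ in range(4)]
memory[2][3] = 0b0111               # node 14 still has one free color
print(backtrack(memory))
memory[2][3] = 0b1111               # node 14 now has no free color
print(backtrack(memory))
```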
5.2.9 Word Generator 
The Word Generator generates a new trial color for a particular node. It
receives two inputs, occupy and constrain (see Figure 5.8), and produces a new node
assignment, namely assign. Constrain is fetched from the Output Memory module and
indicates which colors are constrained, and therefore which colors may still be tried,
for that node. The occupy vector from the Input Memory module indicates the previous
assignment of that node and is used when backtracking is executed. The next possible
assignment is generated from these two inputs. If no possible assignment exists, the
constraints caused by the previous assignment are removed and backtracking is executed.
5.2.10 State Machine 
There are three main parts of the state machine. The first part copies constraints from
the external static memory to the FPGA internal memory. The second part is the
main process of problem solving. The third part writes solutions to the
external static memory.
Figure 5.15 shows the state diagram of the constraint copying. Initially, the FPGA
sends an interrupt request to the host to indicate that it is ready to fetch the words
from the external static memory (state 0). This interrupt request is important because
the state of the FPGA is undetermined. It allows the FPGA to wait until the host
has written the correct constraint contents to the memory. After the contents have
been written to the external memory, the host sends an interrupt acknowledge to the
FPGA. Refer to Figure 5.16 for the 2-way hand-shaking protocol. The FPGA then
requests access to the local memory bus (state 1). If no other FPGA is accessing the
bus, this FPGA receives the grant after 3 clock cycles. Then, the FPGA fetches the
contents from the external memory and writes them to the designated Constraint Memory.
The content arrives 2 cycles after the addresses are sent to the dual port memory
controller (refer to Figure 5.17 for details). State 2 and state 3 are wait states.
The state machine remains in state 4 until 15 consecutive addresses (including those
of state 2 and state 3) have been sent to the controller. The next two states, 5 and 6,
receive the remaining two data words. After that, all the relevant data have been copied
to Xilinx internal memory.
[Figure 5.15 diagram omitted. State legend:]
   State 0 : InterruptFrst
   State 1 : MemGnt
   State 2 : RdFrstLoc
   State 3 : RdSecLoc
   State 4 : Read
   State 5 : RDLstTwo
   State 6 : RDLstOne
Figure 5.15: State diagram of the constraint writing.
[Figure 5.16 diagram omitted; signals shown: PE_Pclk, PE_InterruptReq_n, PE_InterruptAck_n.]
Figure 5.16: Timing diagram for the hand-shaking. 
[Figure 5.17 diagram omitted; legible signals: PE_Pclk, PE_MemBusReq_n, PE_MemBusGrant_n,
PE_MemStrobe_n, PE_MemAddr_OutReg (A1, A2, A3), PE_MemData_InReg (D1, ...).]
Figure 5.17: Timing diagram for consecutive memory read accesses.
Figure 5.18 shows the state diagram of the problem solving state machine. State 7
causes a second interrupt request to be sent to the host. Its purpose is debugging:
it indicates that the memory fetch from the external memory to the Xilinx
internal memory has been completed. The FPGA waits in state 7 until an interrupt
acknowledge is received from the host. State 8 generates the solution of
the target node (word) depending on the node number. At the same time, this word
is fed into the Input Memory. The system remains in state 9 for 5 cycles, during
which the Evaluator evaluates and updates the constraints. The system also remains in
state 10 for 5 cycles, during which the outputs from the Evaluator are mapped to
the Output Memory. At the same time, the Backtrack Checker checks whether there is a
possible solution for each node. State 11 updates the node number. There are three cases:
increment, remain unchanged or decrement. If a color is successfully assigned to a node,
the node number is incremented. The node number remains unchanged
when a new assignment causes constraint violations, so that the next color can be tried
on the same node. When there is no next assignment for the node, the constraints caused
by the previous assignment are removed and the node number is decremented.
When an assignment to the last node is made successfully, the solution is written 
to the external memory. Otherwise, the next state will be Start state (State 8). The 
FPGA continues to run until the entire search is completed. After that, the FPGA 
goes to the Idle state. 
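The cycle-by-cycle behaviour of states 8 to 11 can be sketched as a small Python model. It is illustrative only: it assumes that states 8 and 11 each take a single cycle (the text gives cycle counts only for states 9 and 10), uses the counter test "Counter = 4" that appears in the state diagram, and ignores the solution write-back and idle transitions. The names `State` and `step` are hypothetical.

```python
from enum import IntEnum


class State(IntEnum):
    START = 8       # generate the word for the target node
    EVALUATE = 9    # 5 cycles: Evaluator updates the constraints
    STORE = 10      # 5 cycles: outputs mapped to the Output Memory
    CLUSTER = 11    # update the node number


def step(state, counter):
    """Advance the model by one clock cycle; returns (state, counter)."""
    if state in (State.EVALUATE, State.STORE):
        if counter != 4:
            return state, counter + 1
        return State(state + 1), 0
    if state == State.START:
        return State.EVALUATE, 0
    return State.START, 0  # CLUSTER: begin the next assignment


trace = []
state, counter = State.START, 0
for _ in range(12):
    trace.append(state)
    state, counter = step(state, counter)
print([s.name for s in trace])
```

Under these assumptions, one trial assignment costs 12 clock cycles in this model, of which 10 are spent in the Evaluate and Store states.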
[Figure 5.18 diagram omitted. Transitions are labeled "Counter != 4" and "Counter = 4".
State legend:]
   State 7 : InterruptSec
   State 8 : Start
   State 9 : Evaluate
   State 10 : Store
   State 11 : Cluster
Figure 5.18: State machine for problem solving.
Figure 5.19 shows the state diagram of the solution write-back to the external memory.
State 12 requests access to the local memory bus. State 13 is the
first write cycle to the external memory. State 14 is the second write cycle and
State 15 is the third write cycle. In State 15, the FPGA checks whether the memory is
full. If it is, the FPGA sends an interrupt request to inform the host that the memory is
full. The host reads all the solutions and sends an interrupt acknowledge back to
the FPGA. The FPGA can then continue to find further solutions, overwriting the
previous solutions.
[Figure 5.19 diagram omitted. Transitions are labeled "< 65535 solutions" and
"= 65535 solutions". State legend:]
   State 12 : MemGnt2
   State 13 : WR1
   State 14 : WR2
   State 15 : WR3
   State 16 : InterruptThird
Figure 5.19: State diagram of the solution write-back.
5.2.11 Hardware Resources 
To estimate the hardware resources required by the design, the equations in
Table 5.2 can be used.
Modules          Number of CLBs used
Input            ⌈node / [illegible]⌉ × color × 4
Output           ⌈node / [illegible]⌉ × color × 4
Constraint       ⌈node / [illegible]⌉ × node × 4
Evaluator        node × 2
Mapper           ⌈[illegible]⌉ × 8
Backtrack        ⌈[illegible]⌉ × 4
Word Generator   ⌈[illegible]⌉ × color
FSM              state × 4
Interface        ≈ 150
Total            (color × 2 + node) ⌈[illegible]⌉ × 2 + (4 + color) ⌈[illegible]⌉ + node × 2
                 + ⌈[illegible]⌉ × 8 + state × 4 + 150
Table 5.2: Estimated resources for the serial graph coloring machine 
In the case of the 20 node, 4 color graph coloring problem, the estimated number
of CLBs is 358, compared with an actual value of 390 (see Chapter 6). The estimate
differs slightly from the actual value because the equations are intended only to
give a rough figure for the usage at different problem sizes.
Table 5.3 shows the estimated number of CLBs for several problems and compares
the requirements with those of the parallel design of Section 5.1. A detailed comparison
of the improvement obtained by using Xilinx RAM is given in Appendix B.
Nodes  Colors  CLBs used by       CLBs used by
               parallel machine   serial machine
…      4       …                  …
125    18      10624              1124
250    15      36403              3532
250    29      62436              3850
Table 5.3: Number of CLBs used for several graph coloring problems
5.3 Serial Boolean Satisfiability Solver 
A forward checking tree search algorithm was applied to the boolean satisfiability
problem (see Chapter 2) and implemented on an FPGA. Figure 5.20 shows the tree
representation for 4 variables. Each node in the tree represents a variable, and the
two paths leaving each node represent the two binary values of that variable.
Figure 5.20: The tree representation of a 4-variable SAT problem.
To illustrate the concept of forward checking, consider an example with 4 variables
and 3 clauses. The 3 clauses are:
clause 1 : $x_1 + x_2 + x_3$
clause 2 : $\bar{x}_1 + x_2 + x_3$
clause 3 : $x_2 + x_3 + \bar{x}_4$
An array x[p] is used to keep the current state of each variable. A Global Pointer,
GPointer, is used to index into array x, and a Global Counter, GCounter, iterates
through the variables for the purpose of evaluation.
Initially, all variables are free and GPointer is reset to 0. With a fixed variable order,
the system fetches the previous assignment of variable 1 (free initially) and
generates a new assignment for it, which is '0'. The values of the 4 variables are then
fetched in 4 consecutive cycles and tested in the Evaluator (see Section 5.3.4) to see
whether any constraint is violated. Clause 2 is satisfied and the remaining clauses
are undetermined. The next two steps assign the value '0' to variables 2 and 3,
incrementing GPointer after each. During the evaluation period, once the values
of variables $x_1$ to $x_3$ have been fetched, clause 1 is determined and its value is '0'.
Therefore, clause 1 is not satisfied. Backtracking is executed immediately to
search the next subtree. Assigning the next variable ($x_4$) would be useless because
no solution can exist in this subtree.
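The walk-through above can be reproduced with a small sketch that classifies each clause under a partial assignment. The helper names (`CLAUSES`, `clause_status`, `statuses`) are hypothetical, and the third clause is read here as x2 + x3 + NOT x4, which is an assumption about the garbled source.

```python
# A literal is (variable_number, negated); variables are numbered from 1 as in the text.
CLAUSES = [
    [(1, False), (2, False), (3, False)],  # clause 1: x1 + x2 + x3
    [(1, True), (2, False), (3, False)],   # clause 2: NOT x1 + x2 + x3
    [(2, False), (3, False), (4, True)],   # clause 3: x2 + x3 + NOT x4 (assumed reading)
]


def clause_status(clause, assignment):
    """Classify a clause as 'satisfied', 'violated' or 'undetermined'.

    assignment maps a variable number to 0 or 1; free variables are absent.
    """
    has_free = False
    for var, negated in clause:
        if var not in assignment:
            has_free = True
        elif assignment[var] != int(negated):
            # A plain literal is true when the value is 1, a negated one when it is 0.
            return "satisfied"
    return "undetermined" if has_free else "violated"


def statuses(assignment):
    """Status of every example clause under the given partial assignment."""
    return [clause_status(clause, assignment) for clause in CLAUSES]


print(statuses({1: 0}))
print(statuses({1: 0, 2: 0, 3: 0}))
```

With only variable 1 assigned, clause 2 is the only satisfied clause. Once variables 1 to 3 are all '0', clause 1 becomes violated while variable 4 is still free, which is the point at which the text backtracks.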
The following pseudo code describes the forward checking tree search algorithm.
sat_search()
BEGIN
  WHILE (true)
  BEGIN
    IF (GPointer = 0 AND backtracking) THEN
      search completed;
    END IF
    out_val = previous assignment(GPointer);
    in_val = generate(out_val);
    save in_val;
    GCounter = 0;
    DO {
      out_val = fetch assignment(GCounter);
      backtracking = evaluate(out_val);
      GCounter = GCounter + 1;
    } WHILE (GCounter < NO_VAR AND !backtracking);
    /* NO_VAR : number of variables */
    IF (!backtracking) THEN
      GPointer = GPointer + 1;
    ELSE
      GPointer = GPointer - 1;
    END IF
    IF (all clauses are satisfied) THEN
      solution found;
      exit the loop;
    END IF
  END WHILE
END
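As a rough software analogue of the pseudo code, the following sketch searches with a fixed variable order and stops at the first solution. It is not the thesis implementation: it follows the Solution Generator description (retry the same variable with its next value, and move the pointer back only when both values are exhausted), uses 0-based variable indices, and all names are hypothetical.

```python
FREE = None  # marker for an unassigned variable


def evaluate(clauses, x):
    """Return (backtrack, all_satisfied) for the partial assignment x."""
    backtrack, all_satisfied = False, True
    for clause in clauses:
        satisfied = any(x[v] is not FREE and x[v] != int(neg) for v, neg in clause)
        determined = all(x[v] is not FREE for v, _ in clause)
        if not satisfied:
            all_satisfied = False
            if determined:
                backtrack = True  # every literal is assigned and none is true
    return backtrack, all_satisfied


def sat_search(clauses, num_vars):
    """Forward checking tree search; returns an assignment list or None."""
    x = [FREE] * num_vars
    gpointer = 0
    while True:
        # Generator step: free -> 0 -> 1 -> free (backtrack).
        if x[gpointer] is FREE:
            x[gpointer] = 0
        elif x[gpointer] == 0:
            x[gpointer] = 1
        else:
            x[gpointer] = FREE
            if gpointer == 0:
                return None  # search completed without a solution
            gpointer -= 1
            continue
        backtrack, all_satisfied = evaluate(clauses, x)
        if all_satisfied:
            return list(x)  # variables still FREE are don't-cares
        if not backtrack:
            gpointer += 1


# The 4-variable example, 0-based, with clause 3 read as x2 + x3 + NOT x4 (assumed).
example = [
    [(0, False), (1, False), (2, False)],
    [(0, True), (1, False), (2, False)],
    [(1, False), (2, False), (3, True)],
]
print(sat_search(example, 4))
```

The hardware evaluates the variables serially over several clock cycles; the sketch evaluates all clauses in one call, which changes timing but not the order in which the tree is explored.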
5.3.1 System Architecture 
Figure 5.21: Block diagram of the search machine.
[Blocks shown: State Machine, Global Counter, Global Pointer, Evaluator, AND/OR.]
Figure 5.21 shows the block diagram of the architecture. It consists of five
modules: Solution Generator, Solutions, Evaluator, AND/OR and State Machine.
The Solution Generator module generates a new variable assignment to test from
the previous assignment. The Solutions module stores the assignment of each
variable. The Evaluator module evaluates a new assignment to see whether it
violates any constraint. The AND/OR module consists of two independent blocks of
logic. It receives the outputs of every evaluator, output_i and back_i, and determines
whether a solution has been found and whether backtracking is required. The State
Machine keeps track of the state of the system. It also controls two global counters,
namely the Global Counter and the Global Pointer. The Global Pointer stores the
index of the current variable. The Global Counter counts a fixed number of cycles
for evaluation. As in the Serial Graph Coloring Machine of Section 5.2, this design
uses FPGA RAM to achieve a 16x reduction in hardware. Furthermore, the design is
runtime configurable for different SAT problems.
A DIMACS [4] 3-SAT benchmark problem (aim-50-1_6-yes1-1) with 50 variables and
80 clauses will be used as an example in this thesis. The following sub-sections
describe each module based on this problem. Expanding the design to any sized
5.3.2 Solutions
Figure 5.22: Block diagram of the Solutions module.
Figure 5.22 shows the block diagram of the Solutions module. It is built using the
distributed RAM [2] feature of Xilinx 4000 series FPGAs. The assignment of each
variable consists of two bits, $b_1 b_0$. Bit $b_0$ indicates whether the variable is free or
assigned. Bit $b_1$ stores the assigned value. If the variable is free, the value of $b_1$
should be '0'. For 50 variables of 2 bits each, 4 Xilinx 32 x 1 RAMs are used.
5.3.3 Solution Generator 
To determine the index of the current variable, a Global Pointer is used. Based on
this Global Pointer, the previous assignment, out_val, of that variable is fetched from
the Solutions module into the Generator. The Generator then produces a new
assignment for that variable, in_val, and saves it in the same clock cycle. When the
variable is free or holds '0', the Generator produces '0' or '1' respectively
and the State Machine jumps to the next state. If the value of the variable is '1', no
further assignment is available, so backtracking is executed and the assignment
is removed. The State Machine remains in the same
state and the Generator produces the next possible assignment for the previous variable.
If the Global Pointer is zero in this case, the search is completed.
The following pseudo code describes the mechanism of the Solution Generator. 
generate()
BEGIN
  /* free to '0' */
  IF out_val = "00" THEN
    in_val = "01";
    jump to next state;
  /* '0' to '1' */
  ELSE IF out_val = "01" THEN
    in_val = "11";
    jump to next state;
  /* '1' to free, backtrack */
  ELSE IF out_val = "11" THEN
    in_val = "00";
    IF GPointer = 0 THEN
      finish searching;
      jump to idle state;
    ELSE
      GPointer = GPointer - 1;
      remain in current state;
    END IF
  END IF
END
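The generator's three cases map directly onto a small function over the 2-bit encoding ("00" free, "01" assigned '0', "11" assigned '1'). The sketch below is illustrative only; the returned action strings and the function signature are hypothetical stand-ins for the state machine transitions.

```python
FREE, ZERO, ONE = 0b00, 0b01, 0b11  # b1 b0 encodings of a variable's state


def generate(out_val, gpointer):
    """Return (in_val, new_gpointer, action) for the previous assignment out_val."""
    if out_val == FREE:            # free -> '0'
        return ZERO, gpointer, "next_state"
    if out_val == ZERO:            # '0' -> '1'
        return ONE, gpointer, "next_state"
    if out_val == ONE:             # '1' -> free, backtrack
        if gpointer == 0:
            return FREE, gpointer, "idle"      # search finished
        return FREE, gpointer - 1, "stay"      # retry the previous variable
    raise ValueError("encoding '10' is not used")


print(generate(FREE, 3))
print(generate(ONE, 3))
```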
5.3.4 Evaluator 
The implementation uses one 1-bit evaluator per clause. The Evaluator therefore
consists of 80 1-bit evaluators, one of which is shown in Figure 5.23. Each evaluator
contains two independent memory modules, Sel and Inv. Sel stores the indexes of the
variables in the clause, so 50 variables require 50 bits located at addresses 1-50. For
example, $x_i + \bar{x}_j + \bar{x}_k$ is a 3-clause (where an n-clause is a clause with n literals),
and the values at addresses i, j and k are set to '1'. The remaining addresses
are set to '0'. Sel occupies 2 Xilinx 32 x 1 RAMs, or 2 CLBs. Inv stores which
variables are negated, if any. Inv is the same size as Sel: 50 addresses, each
1 bit wide. In the previous example, the values at addresses j and k
are set to '1' and the other addresses to '0'. Inv also occupies 2 Xilinx 32 x 1 RAMs.
The contents of the Sel and Inv memories are configured at run time.
Figure 5.23: Gate level diagram of a 1-bit evaluator.
[Signals shown: Enable, b0, b1, count<5:0>, sel_val, sel_en, inv_val, inv_en, clk.
Blocks: Sel RAM, Inv RAM, Counter, FDCE. Outputs: back, output.]
To calculate the value of each clause, a summation circuit with a D-type flip-flop
is used. The output is '1' if the value of any literal in the clause is '1'. The boolean
expression for the variable fetched in cycle k is $(I_k \oplus b_1) \wedge S_k$, where $S_k$ and $I_k$
are the Sel and Inv bits at address k. Therefore, the output from the evaluator is
$(I_1 \oplus b_1) \wedge S_1 + (I_2 \oplus b_1) \wedge S_2 + \ldots + (I_n \oplus b_1) \wedge S_n$
A 2-bit counter is used to check whether the clause is determined or undetermined.
If the clause contains a variable that is free, the clause is undetermined. Otherwise
the clause is determined. An Enable signal is generated to activate the counter. For
simplicity, a 3-clause is assumed. When the counter output is "11" (3), all variables
in the clause are assigned. Backtracking (back = '1') is then executed if the output
is '0'. A '0' output means that the partial assignment cannot satisfy the clause.
Each evaluator requires 8 CLBs. Therefore, the total number of CLBs required for
the Evaluator module is 80 x 8 = 640 CLBs.
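A behavioural sketch of one 1-bit evaluator is given below. It steps through the variables one per cycle, ORs the selected literal values into a flip-flop, and counts selected variables that are assigned. Gating the literal term with b0 (so that a free, negated variable does not count as true) is an assumption of this sketch; the text gives only the term combining the Inv bit, b1 and the Sel bit. All names are hypothetical.

```python
def evaluate_clause(sel, inv, solutions, clause_size=3):
    """Serially evaluate one clause.

    sel[k], inv[k]: Sel and Inv memory bits for variable k.
    solutions[k]:   (b1, b0) pair, b0 = 1 if assigned, b1 = assigned value.
    Returns (output, back).
    """
    output = 0   # D flip-flop holding the OR of the literal values seen so far
    counter = 0  # number of selected variables that are assigned
    for k, (b1, b0) in enumerate(solutions):
        if sel[k] and b0:
            counter += 1
            output |= inv[k] ^ b1  # literal value: variable XOR negation flag
    determined = counter == clause_size
    back = int(determined and output == 0)
    return output, back


# Clause x0 + NOT x1 + NOT x2 over 5 variables (0-based indices).
sel = [1, 1, 1, 0, 0]
inv = [0, 1, 1, 0, 0]
print(evaluate_clause(sel, inv, [(0, 1), (1, 1), (1, 1), (0, 0), (0, 0)]))
```

With x0 = 0, x1 = 1 and x2 = 1 every literal is false and the clause is determined, so `back` is raised; leaving x1 and x2 free keeps the clause undetermined and `back` stays low.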
5.3.5 AND/OR
There are two outputs from each evaluator, back and output. To see if backtracking
should be executed, all the back outputs are ORed together. If the result,
tot_back, is '1', at least one clause is not satisfied and backtracking is executed.
Otherwise, the search continues with the next variable.
To see if a solution has been found, all the output signals are ANDed together. If the
result, tot_out, is '1', all the individual outputs are '1'. Therefore, all clauses
are satisfied by the current assignment, which is one of the solutions.
13 CLBs are used for each part.
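In software terms the two reductions are a wide OR and a wide AND over the evaluator outputs. The function below is a minimal sketch with a hypothetical name and signature.

```python
def and_or(outputs, backs):
    """Combine per-clause evaluator signals into (tot_out, tot_back)."""
    tot_back = int(any(backs))    # OR of all back signals
    tot_out = int(all(outputs))   # AND of all output signals
    return tot_out, tot_back


print(and_or([1, 1, 0], [0, 0, 1]))
```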
5.3.6 State Machine 
The state machine has three main parts. The first part copies the indexes
of variables and the negation information from the external static memory to the
Xilinx RAMs. The second part implements the main evaluation process. The third
part writes the solution to an external memory.
Figure 5.15 shows the state diagram for writing the indexes of variables and the negation
information. Refer to Section 5.2.10 for details. To fetch data from the host, the
FPGA requests access to the local memory bus. After it has received a grant from the
memory controller, it fetches the data from the external static memory and writes
them to the Xilinx RAMs, Sel and Inv, inside the Evaluator. The state machine
remains in state 4 until 250 (80 x 50 x 2 / 32) consecutive addresses have been sent to the
memory controller.
Figure 5.24 shows the state diagram of the evaluation process. Initially, all the
variables are free and reset to "00". In State 7, a new assignment to the current
variable is generated. At the same time, the new assignment is updated in the
Solutions module. The search is finished if GPointer is equal to 0 and no possible
assignment exists for that variable. In that case the state machine jumps to state 14,
which is an idle state. Otherwise, the system remains in state 8 for 50 cycles (worst case),
during which the variables are tested in the Evaluator. During these 50 cycles, the
value of tot_back from the AND/OR module is checked in every cycle. If the value
is '1', backtracking is executed and the state machine jumps to state 9. In state 9,
GPointer is updated if no backtracking is required. The system also checks the value
of tot_out from the AND/OR module. If it is equal to '1', one of the solutions has
been found.
Figure 5.24: State diagram of the evaluation.
[States: 7 Update_bit, 8 Evaluation, 9 Update_pointer, 14 Idle. Labels shown:
Counter != 50, Solution found.]
Figure 5.25 shows the state diagram for writing a solution back to the external memory.
State 10 requests access to the local memory bus. State 11 lasts for 50
cycles to fetch the assignment, $x_i$, of every variable. States 12 and 13 are the first and
second write cycles to the external memory. Only two cycles are required
because the data width is 32 bits.
Figure 5.25: State diagram of the solution write-back.
[States: 10 MemGnt2, 11 Fetch, 12 WR1, 13 WR2. Label shown: Counter != 50.]
5.3.7 Hardware Resources
To estimate the hardware resources required by the design, the equations in Table 5.4 
can be used. 
Modules             Number of CLBs used
Solutions           ⌈variable/32⌉ × 2
Solution Generator  ~10
Evaluator           clause × (2 × ⌈variable/32⌉ + 4)
AND/OR              ~(⌈clause/8⌉ + ⌈clause/…⌉) × 2
State Machine       state × 4
Interface           ~200
Total               (clause + 1) × ⌈variable/32⌉ × 2 +
                    (⌈clause/8⌉ + ⌈clause/…⌉) × 2 +
                    (clause + state) × 4 + 210
Table 5.4: Estimated hardware resources for the serial SAT Solver
In the case of the 50 variable, 80 clause 3-SAT problem, the estimated number of CLBs
is 956, compared with an actual value of 977 (see Chapter 6).
Table 5.5 shows the estimated number of CLBs for several standard DIMACS benchmark
SAT problems [4].
Problem               Variables  Clauses  CLBs used
aim-50-1_6-yes1-1     50         80       956
aim-50-2_0-yes1-1     50         100      1084
aim-100-2_0-yes1-1    100        200      2716
aim-100-6_0-yes1-1    100        600      7628
aim-200-1_6-yes1-1    200        320      6114
aim-200-6_0-yes1-1    200        600      22242
dubois20              60         160      1580
dubois30              90         240      2724
hole6                 42         133      1358
ii8a2                 180        800      9838
ii32c1                225        1280     26226
par8-1-c              64         254      2358
pret60_25             60         160      1580
pret150_25            150        400      5974
Table 5.5: Number of CLBs used for several DIMACS SAT problems
5.4 GSAT Solver 
The main disadvantage of the previous SAT architecture is that it is slow. This problem
was addressed by the final implementation, described in this section, which
combines an incomplete algorithm, a parallel clause evaluator and runtime
reconfiguration to make execution as fast as possible.
The serial GSAT solver implements the GSAT algorithm (refer to Section 3.3.2)
and uses the same architecture as Hamadi et al. [18]. Changes were made to allow
for runtime reconfiguration of the bitstream.
5.4.1 System Architecture 
The inner loop of the algorithm (Refer to Section 3.3.2), i.e. the calculation of p is 
implemented by the reconfigurable hardware. The remaining part is implemented 
in software to decrease the logic complexity and therefore minimize the hardware 
resources. 
Figure 5.26 shows the block diagram of the GSAT Solver. There are 7 major 
components, Variable Memory, Flip-Bit Vector, Clause Evaluator, Adder, Random Bit 
Generator, Comparator and Sum Register in the design. A 50 variable, 80 clause SAT 
problem will be used as an example in the description which follows. 
5.4.2 Variable Memory 
The Variable Memory module is used to store the new variable assignment from the 
host computer. The host computer writes a new variable configuration in every itera-
tion of the outer loop of the algorithm. 
Figure 5.26: Block diagram of the GSAT Solver.
[Blocks shown: Host, Variable Memory, Flip-bit Vector, Clause Evaluator, Adder,
Comparator (a < b, a = b), random bit generator, Register (Sum), Output Registers, Counter.]
5.4.3 Flip-Bit Vector 
Figure 5.27 shows the block diagram of the Flip-Bit Vector module. The Flip-Bit
Vector module consists of a shift register and a series of exclusive-OR (XOR) gates.
Initially, the most significant bit (left side) of the shift register is '1' and the remaining
bits are '0'. Each time the shift register receives an enable signal, the '1' is
shifted one position towards the least significant bit (right side). When the '1' reaches
the least significant bit, the D flip-flop is triggered and the FINISH signal is activated.
The chain of XOR gates flips the target variable (from '0' to '1' or from '1' to '0'). The
target variable is determined by the location of the '1' in the shift register. The outputs
of the chain of XOR gates are passed to the Clause Evaluator module.
Figure 5.27: Block diagram of the Flip-Bit Vector.
[Labels shown: From Variable Memory, Shift Register, D flip-flop (FINISH), To Clause Evaluator.]
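The behaviour of the shift register and XOR chain can be sketched as a generator that yields, for each enable pulse, the variable vector with exactly one bit flipped. The function name is hypothetical and index 0 is taken to be the most significant (left) position.

```python
def flip_bit_sequence(variables):
    """Yield (position, candidate) pairs; candidate has one variable flipped."""
    n = len(variables)
    shift = [1] + [0] * (n - 1)            # one-hot register, '1' starts at the MSB
    for position in range(n):
        # XOR chain: only the variable under the '1' is inverted.
        yield position, [v ^ s for v, s in zip(variables, shift)]
        shift = [0] + shift[:-1]           # enable pulse: move the '1' one place right
    # After n pulses the '1' has left the LSB, which corresponds to FINISH.


for position, candidate in flip_bit_sequence([0, 1, 1]):
    print(position, candidate)
```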
5.4.4 Clause Evaluator
Figure 5.28: Block diagram of the Clause Evaluator.
[Labels shown: 50-bit variable assignment (X0-X3, X4-X7, ...), CLB, 80-bit output.]
The Clause Evaluator module is used to evaluate all the clauses with different
variable assignments. Figure 5.28 shows a block diagram of the Clause Evaluator. It
contains an array of configurable logic blocks (CLBs), the logic primitives of Xilinx
XC4000 devices [2]. Each CLB is configured as two 16x1 RAM memories and produces
two outputs on different rows as illustrated in the figure. The inputs to the Clause
Evaluator are 50 bits corresponding to the variables from the Flip-bit Vector and the
outputs are the 80 clause evaluations.
Each row of the array in Figure 5.28 corresponds to two clauses, the outputs ap-
pearing in the two wires immediately above and below the CLB. Each 1/2 CLB in 
the row has its address lines connected to 4 consecutive inputs of the variable to be 
evaluated. The output of the CLB is the evaluation of the sum terms for the input 
variables to which it is connected. The RAM outputs are connected to the row line 
through an open drain buffer, implementing the sum terms as a wired-AND (which is 
equivalent to an active low wired-OR operation). Note also that a pull-up resistor is 
connected to each row. 
As an example, for the clause $C_0 = \bar{x}_0 + x_2 + x_5$, the 1st column CLB of Figure 5.28
implements $\bar{x}_0 + x_2$ (as a lookup table) and the 2nd column CLB implements $x_5$. If
one or more literals evaluates to a logical true (in the example, this corresponds to $x_0$
being false or $x_2$ being true or $x_5$ being true), its CLB will drive the row low, asserting
the (active low) output.
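The lookup-table view of a clause can be sketched as follows: each group of 4 consecutive variables addresses one 16 x 1 table, and any table that reads '1' pulls the shared row low. The address bit ordering (variable 4g + i on address bit i) and all names are assumptions of this sketch, not details taken from the thesis.

```python
def build_luts(clause, num_vars):
    """Build one 16-entry table per group of 4 variables for a single clause.

    clause is a list of (variable_index, negated) with 0-based indices.
    """
    groups = (num_vars + 3) // 4
    luts = []
    for g in range(groups):
        literals = [(v - 4 * g, neg) for v, neg in clause if v // 4 == g]
        lut = []
        for addr in range(16):
            # Entry is 1 if any literal of this group is true for this address.
            lut.append(int(any(((addr >> i) & 1) != int(neg) for i, neg in literals)))
        luts.append(lut)
    return luts


def row_line(luts, variables):
    """Active-low row output: 0 if the clause is satisfied, 1 if it is not."""
    padded = list(variables) + [0] * (4 * len(luts) - len(variables))
    line = 1                                   # pull-up resistor
    for g, lut in enumerate(luts):
        addr = sum(padded[4 * g + i] << i for i in range(4))
        if lut[addr]:
            line = 0                           # open-drain buffer drives the row low
    return line


# C0 = NOT x0 + x2 + x5 over 8 variables.
luts = build_luts([(0, True), (2, False), (5, False)], 8)
print(row_line(luts, [1, 0, 0, 0, 0, 0, 0, 0]))
```

A group that contains no literal of the clause gets an all-zero table and therefore never drives the row, which mirrors the wired-AND of the row line.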
All the components and routing were placed into predefined locations and routed
automatically by the Xilinx Epic Editor from a script created by a C program.
Figure 5.29 shows the template of the Clause Evaluator for problems with up to 50
variables and 80 clauses. It is obtained by removing the Clause Evaluator from a
pre-compiled design. Figure 5.30 shows the layout of the Clause Evaluator after placement
of an array of CLBs, open drain buffers and pull-up resistors. Figure 5.31 shows the Clause
Evaluator with complete routing. The interconnect for the inputs and outputs of the
Clause Evaluator is implemented using longlines, which are intended for high fan-out
signals that are distributed over long distances.
Figure 5.29: Layout of the Clause Evaluator template.
Figure 5.30: Layout of the Clause Evaluator after placement.
As the bitstream format for XC4000 series devices is not documented, the mapping
between RAM contents and the bitstream was determined by using a program to
produce designs with known patterns in each RAM (0000, 0001, ..., 8000 in hexadecimal
format), compiling each design to a bitstream using the standard Xilinx tools and then
finding the offset by comparing two bitstream configuration files. A table of the
starting positions of all the RAMs in the FPGA's bitstream was thus compiled (refer to
Figures 5.32 and 5.33 for the configuration of RAM in the F and G function generators
respectively). Using this table, another C program (see
Appendix C) can configure the contents of the memories in the bitstream directly from
a SAT problem specification in the standard DIMACS benchmark format [4].
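The comparison step can be illustrated by diffing two equal-length byte strings and listing the bit positions that differ. This is only a sketch of the idea: the thesis tool was a C program, real bitstream files have headers and framing that are ignored here, and the function name and MSB-first bit numbering are assumptions.

```python
def differing_bits(a: bytes, b: bytes):
    """Return the bit offsets (MSB-first within each byte) where a and b differ."""
    if len(a) != len(b):
        raise ValueError("bitstreams must have the same length")
    offsets = []
    for i, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y
        for bit in range(8):
            if diff & (0x80 >> bit):
                offsets.append(8 * i + bit)
    return offsets


# Two toy "bitstreams" that differ in their first and last bits.
print(differing_bits(b"\x00\x00", b"\x80\x01"))
```

Repeating such a comparison for designs whose RAM contents differ in a single known bit gives the position of that bit, from which a table of offsets can be built up.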
For a particular SAT problem with m variables and n clauses, the hardware resource
required is $\lceil m/4 \rceil \times \lceil n/2 \rceil$ CLBs.
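As a quick check of the resource count, the sketch below reads the expression as ceil(m/4) x ceil(n/2), an assumption based on each half CLB covering 4 variables and each CLB row serving 2 clauses. For the 50 variable, 80 clause example it gives the 520 CLBs reported in Chapter 6.

```python
def clause_evaluator_clbs(m, n):
    """CLBs for m variables and n clauses, assuming ceil(m/4) * ceil(n/2)."""
    columns = -(-m // 4)   # ceiling division: 4 variables per half CLB
    rows = -(-n // 2)      # ceiling division: 2 clauses per CLB row
    return columns * rows


print(clause_evaluator_clbs(50, 80))
```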
Figure 5.31: Layout of the Clause Evaluator after placement and routing. 
3 VAL-2458 7 VAL-3070 11 VAL-3686 15 VAL-4298 
2 VAL-2 6 VAL-614 10 VAL-1230 14 VAL-1842 
1 VAL-3072 5 VAL-2456 9 VAL-4300 13 VAL-3684 
0 VAL-616 4 VAL 8 VAL-1844 12 VAL-1228 
Figure 5.32: F function generator RAM configuration. 
5.4.5 Adder 
The Adder is used to calculate the number of unsatisfied clauses. Each satisfied
clause produces a '0' output and each unsatisfied clause a '1' output. For a problem
with n clauses, n 1-bit numbers must therefore be summed to find the number of
unsatisfied clauses. Since the Adder is in the critical path of the design, a tree adder
was used. As a result, there are $\log_2 n$ levels of delay, where n is the number of
clauses. Moreover, each level towards the root of the tree has an additional bit of
precision. Thus in the kth level, all inputs are k bits wide and they are added together
pairwise to generate a (k+1)-bit result.
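The pairwise structure of the tree adder can be mimicked in a few lines. This is a behavioural sketch with a hypothetical function name; padding an odd-sized level with a zero is a choice made here, not something the text specifies.

```python
def tree_add(bits):
    """Sum 1-bit inputs pairwise; return (total, number_of_levels)."""
    level = list(bits)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)                       # pad so every input has a partner
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


print(tree_add([1] * 80))
```

For 80 inputs the loop runs through 7 levels, which is the base-2 logarithm of 80 rounded up.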
3 VAL-3072 7 VAL-616 11 VAL-2458 15 VAL-2 
2 VAL-2456 6 VAL 10 VAL-3070 14 VAL-614 
1 VAL-4300 5 VAL-1844 9 VAL-3686 13 VAL-1230 
0 VAL-3684 4 VAL-1228 8 VAL-3685 12 VAL-1842 
Figure 5.33: G function generator RAM configuration. 
5.4.6 Random Bit Generator 
In the event that the current number of unsatisfied clauses is equal to the smallest value
stored in the Sum Register, the Random Bit Generator is used to decide which solution to
keep. This prevents the algorithm from being trapped in a local minimum. The Random
Bit Generator is implemented as a linear feedback shift register and the equation used
is:
bit(0) = bit(3) xor bit(4) xor bit(5) xor bit(7) 
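The feedback equation can be exercised with a small software model. The register width of 8 bits, the shift direction (bit i moves to bit i+1) and the seed are assumptions of this sketch; only the tap positions come from the equation above.

```python
def lfsr_step(state):
    """Advance an 8-bit LFSR; state[i] holds bit(i)."""
    new_bit = state[3] ^ state[4] ^ state[5] ^ state[7]
    return [new_bit] + state[:7]       # bit(0) gets the feedback, the rest shift up


def period(seed):
    """Number of steps until the register returns to the (non-zero) seed."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        steps += 1
    return steps


seed = [1, 0, 0, 0, 0, 0, 0, 0]
print(lfsr_step(seed), period(seed))
```

Under these assumptions the register cycles through all 255 non-zero states before repeating. The all-zero state is a fixed point, so a hardware implementation would need a non-zero reset value.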
5.4.7 Comparator 
The Comparator is used to compare the current number of the unsatisfied clauses with 
the value in the Sum Register to see if a smaller result can be found. 
5.4.8 Sum Register 
The Sum Register is used to store the smallest number of unsatisfied clauses. That 
means it stores the largest number of satisfied clauses. 
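Taken together, the Flip-Bit Vector, Clause Evaluator, Adder, Comparator, Random Bit Generator and Sum Register implement one pass over all single-variable flips. The sketch below models that pass in software. Accepting the first candidate unconditionally (the initial Sum Register value) and keeping the new candidate on a tie when the random bit is '1' are assumptions, and all names are hypothetical.

```python
def count_unsat(clauses, x):
    """Number of clauses with no true literal; clauses hold (index, negated) pairs."""
    return sum(not any(x[v] != int(neg) for v, neg in clause) for clause in clauses)


def best_flip(clauses, x, random_bit):
    """Try every single-variable flip; return (variable, unsatisfied_count)."""
    best_sum, best_var = None, None
    for var in range(len(x)):
        candidate = list(x)
        candidate[var] ^= 1                          # Flip-Bit Vector
        total = count_unsat(clauses, candidate)      # Clause Evaluator + Adder
        # Comparator: keep strictly better results, break ties with a random bit.
        if best_sum is None or total < best_sum or (total == best_sum and random_bit()):
            best_sum, best_var = total, var          # Sum Register update
    return best_var, best_sum


clauses = [[(0, False)], [(1, False)], [(1, False), (2, True)]]
print(best_flip(clauses, [0, 0, 0], lambda: 0))
print(best_flip(clauses, [0, 0, 0], lambda: 1))
```

Flipping variable 0 or variable 1 both leave one clause unsatisfied, so the random bit alone decides which of the two is kept.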
5.5 Summary 
Four different approaches for solving graph coloring and boolean satisfiability problems
and their architectures were presented. The first was the most parallel and gave the
best performance, but the hardware resources it required were very large. This
approach was intended for solving small problems. The second was a serial graph
coloring machine, which was less parallel but could solve larger problems. The third
approach was a serial SAT solver which employed the Xilinx RAM to be runtime
reconfigurable. The last design was a GSAT solver which, in contrast to the other
designs, implements an incomplete algorithm. It has a parallel clause evaluator which
Chapter 6
Results
In this chapter, the hardware resources required by each approach for a particular
problem are presented, and the hardware and software performance of each
approach is discussed. All hardware timings given are best case results at room
temperature. More conservative commercial temperature range results would be
approximately 50% slower.
6.2 Parallel Graph Coloring Machine 
A complete parallel prototype for finding all solutions to a graph coloring problem with
20 nodes and 4 colors was successfully constructed on the GigaOps G900 Reconfigurable
Interface Card. Its functional specification was written in VHDL and the
bitstream was generated using Synopsys' FPGA Compiler
and the Xilinx 1.4 Alliance Series tools (UNIX version). The design used 307 CLBs and
64 IOBs of a Xilinx XC4013E FPGA, corresponding to
53% and 33% utilization respectively. The system ran at 16.7 MHz,
taking 61 seconds to find all solutions (approximately 100 million in our example) and
an additional 118 seconds to read the solutions back to the host computer¹.
A software implementation of the same algorithm was also developed in CHIP [14].
The total CPU time on a Sun Microsystems Ultra 1/170 workstation
was 7000 seconds, so a performance gain of 114x was achieved.
A more efficient software algorithm is the forward checking algorithm with the fail-first
principle [7, 14] in CHIP, which obtains speedups through dynamic variable ordering.
This took 4100 seconds to solve the same graph coloring problem. The hardware
system had a performance gain of 65x over this algorithm.
6.3 Serial Graph Coloring Machine 
A complete serial graph coloring prototype for the same problem was successfully
constructed on the Annapolis Micro Systems Wildforce board. The bitstream
was generated using Synopsys FPGA Express 2.0 and the Xilinx 1.4 Alliance Series
tools (PC version). The design used 390 CLBs and 77 IOBs of a Xilinx XC4062XL FPGA,
corresponding to 17% and 39% utilization respectively.
More CLBs were used than in the parallel machine because the serial machine is
intended to solve larger problems (refer to Section 5.2.11 and Table 5.3). The system
was successfully tested at 28 MHz at room temperature. It took 362 seconds to find
all the solutions and a further 95 seconds to read them back. The speedup
was 19x compared with the same software implementation and 11x over the fail-first
software implementation described in the previous section.
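The two speedup figures for the serial machine follow from the software times quoted in Section 6.2 (7000 s and 4100 s) and the 362 s hardware search time. The short check below uses only those numbers; the variable names are illustrative.

```python
forward_checking_s = 7000   # CHIP forward checking, software
fail_first_s = 4100         # CHIP forward checking with fail-first, software
serial_hw_s = 362           # serial graph coloring machine, search time only

speedups = (forward_checking_s / serial_hw_s, fail_first_s / serial_hw_s)
print(speedups)
```

Both ratios exclude the 95 seconds needed to read the solutions back.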
6.4 Serial SAT Solver 
A complete prototype for solving a boolean satisfiability problem with 50 variables and
80 clauses (aim-50-1_6-yes1-1 from DIMACS) was implemented using a single Xilinx
XC4062XL FPGA on the Annapolis Micro Systems Wildforce board. In the system,
the contents of the two RAMs, Sel and Inv, of the evaluator (see Section 5.3.4) were
configured at runtime. In fact, the prototype could solve any 3-SAT problem with ≤ 50
variables and ≤ 80 clauses using only runtime configuration. The design used 977 CLBs
and 77 IOBs. The prototype was able to run at 30 MHz and
could check 600,000 variables per second. After running for 3 days, neither the hardware
nor the software version of this machine had found a solution. Correctness was checked by
initializing the search close to a solution, but the design was abandoned in favor of the
incomplete algorithm presented in the next section.
¹ A Pentium II 233 MHz personal computer with 128 MB of RAM running Windows NT 4.0.
6.5 GSAT Solver 
The GSAT solver was tested on the same DIMACS 3-SAT benchmark problem
(aim-50-1_6-yes1-1) with 50 variables and 80 clauses. On a Sun Ultra 5/10 UPA/PCI
(UltraSPARC-IIi 270MHz), the time required to generate the bitstream for this
problem was 0.7 seconds. For comparison, a VHDL description of the clause evaluator
was written and, on the same machine, synthesis took 407 seconds
and place and route 660 seconds, giving a total implementation time of 1067 seconds.
Thus the runtime reconfigurable version is a three orders of magnitude improvement over
the resynthesis approach.
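The "three orders of magnitude" claim can be checked directly from the timings above; the variable names below are illustrative.

```python
import math

bitstream_generation_s = 0.7
synthesis_s, place_route_s = 407, 660

resynthesis_s = synthesis_s + place_route_s
ratio = resynthesis_s / bitstream_generation_s
orders_of_magnitude = math.log10(ratio)
print(resynthesis_s, ratio, orders_of_magnitude)
```

The ratio comes out at roughly 1500, a little over 3 on a base-10 logarithmic scale.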
The resulting runtime configurable implementation (shown in Figure 5.28) required
520 CLBs, approximately 1/4 of the resources of a Xilinx XC4062XL device. This
implementation was successfully tested at 12 MHz on a single XC4062XL chip of
an Annapolis Micro Systems Wildforce board. A software implementation of GSAT
(version 41) by Selman and Kautz [34], used for comparison, took 10 ms to find a
solution². The hardware implementation of GSAT took 1 ms to find a solution, so
it was 10 times faster than the software implementation. The relatively low clock
frequency was due to the large fanouts of the clause evaluator outputs, which drive
approximately 80 inputs. Buffering techniques could help to improve the performance
of the design.
² Note that in this program, variable flips are done in an intelligent fashion, only the
clauses affected by a variable flip being recomputed.
6.6 Summary 
Table 6.1 presents a summary of the results obtained for the four architectures. The
parallel and serial graph coloring machines have two results in the "Speedup over
Software" column, corresponding to the forward checking and fail-first algorithms
respectively. The figure given is the ratio of the execution time of the software
implementation to that of the hardware implementation. The hardware implementation
was hosted by a Pentium II 233 MHz personal computer with 128 MB of RAM running
Windows NT 4.0, and the state-of-the-art software implementation was executed on an
Ultra 1/170 workstation under Solaris 2.6. As can be seen, the fully parallel approach
leads to the highest speedups. The serial approaches require more clock cycles, but this
was partially compensated by higher clock frequencies. The GSAT implementation's
speedup was limited by large fanout, but it avoids a costly resynthesis step.
System                    Device     Frequency (MHz)  Speedup over Software
Parallel Graph Coloring   XC4013E    16.7             114/65
Serial Graph Coloring     XC4062XL   28               19/11
Serial SAT                XC4062XL   30               -
GSAT                      XC4062XL   12               10
Table 6.1: Summary of results obtained for the four architectures.
Chapter 7 
Conclusion 
The aim of this thesis was to explore the suitability of configurable computing for 
solving constraint satisfaction problems. Four different architectures were developed 
and tested on hardware. 
A machine for solving graph coloring problems was implemented in which all 
constraints and the backtrack signal are evaluated in parallel. The hardware 
requirements of this architecture were large, so only small problems could be tackled. 
This limitation was addressed in the serial graph coloring machine by trading 
parallelism for reduced hardware. A boolean satisfiability (SAT) machine was then 
introduced which used Xilinx internal RAM so that runtime configuration could be 
achieved. This was the first runtime reconfigurable SAT machine reported for Xilinx 
4000 series FPGAs. Different constraints could be configured within several seconds, 
whereas traditional approaches require resynthesis, re-placement and re-routing for 
each problem. The final architecture was a runtime reconfigurable clause evaluator 
which generates a customized circuit for a particular problem instance. Distributed 
RAM devices in an FPGA were used to customize the circuit by directly changing the 
bitstream of the FPGA. This approach showed a speedup of three orders of magnitude 
over resynthesis from a hardware description of a problem, and it is the first 
runtime reconfigurable system reported for Xilinx 4000 series devices which directly 
modifies the bitstream. All four architectures were implemented and tested on 
hardware. In most cases, at least an order of magnitude improvement in execution 
speed was observed. 
Constraint Satisfaction Problems (CSPs) are computationally expensive. This work 
has shown that it is possible to implement small to medium sized CSPs on configurable 
hardware systems, and that the increased parallelism of hardware over software 
implementations leads to significant speedups. Each system was implemented on a 
single FPGA, which has cost and power consumption advantages over workstation 
based systems. 
7.1 Future Work 
The forward checking tree search algorithm was used by three solving machines. Al-
though the algorithm was simple and easy to implement, the performance could be 
improved by using more sophisticated algorithms. 
The GSAT machine had a clause evaluator module which was implemented using 
Xilinx internal memory. Placement and routing were done automatically by a script, 
using predefined locations and styles, and a customized bitstream was generated 
directly. This novel technique could be applied to other FPGA designs. 
Recently introduced Virtex devices [5] offer about 10 times more logic than those 
used in this work. Larger CSPs, which were previously intractable, could be tackled 
with these devices, and the additional logic gates could be used to increase the 
parallelism of the implementations. In addition, Xilinx have recently documented the 
format of the bitstream for Virtex devices [25], which aids techniques that directly 
modify the bitstream. 
Appendix A 
Software Implementation of 
Graph Coloring in CHIP 
The following program is a CHIP [14] program that implements forward checking with 
the fail-first (FF) principle [7, 14] for a 20 node, 4 color graph coloring problem. 
Under the fail-first principle, the node with the most constraints is assigned a color 
first (if several nodes tie, one of them is chosen at random). This method limits the 
search space by restricting the branching of the search tree at every choice point. 
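The selection rule described above can be sketched in C as follows. This is a minimal illustration and not the CHIP program itself: the adjacency matrix adj, the assigned flags, the 5 node size and the function name pick_fail_first are all assumptions made for the example, and ties are broken by lowest index rather than at random so that the result is reproducible.

```c
#define NUM_NODES 5   /* hypothetical toy size, not the 20 node problem */

/* Count the constraints (edges) that touch a node. */
static int degree(int adj[NUM_NODES][NUM_NODES], int node)
{
    int d = 0;
    for (int k = 0; k < NUM_NODES; k++)
        d += adj[node][k];
    return d;
}

/* Return the unassigned node with the most constraints, or -1 if every
 * node is already assigned. Ties go to the lowest index (an assumption
 * of this sketch; the text picks among tied nodes at random). */
int pick_fail_first(int adj[NUM_NODES][NUM_NODES],
                    const int assigned[NUM_NODES])
{
    int best = -1, best_degree = -1;
    for (int n = 0; n < NUM_NODES; n++) {
        if (assigned[n])
            continue;
        int d = degree(adj, n);
        if (d > best_degree) {
            best = n;
            best_degree = d;
        }
    }
    return best;
}
```

A solver would call such a function at every choice point, try a color for the returned node, and mark the node as assigned before descending.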
There are two segments at the end of the source code which follows. The first 
segment is used to implement forward checking with the fail-first principle. The second 














T18,T19,T20] :: 0..3, 
T10#\=T1, T11#\=T1, T8#\=T2, T16#\=T2, T18#\=T2, T20#\=T2, T6#\=T3, 
T8#\=T3, T14#\=T3, T15#\=T3, T19#\=T3, T6#\=T4, T7#\=T4, T20#\=T4, 
T7#\=T5, T9#\=T5, T19#\=T5, T11#\=T7, T16#\=T8, T18#\=T8, T11#\=T9, 
T20#\=T10, T16#\=T11, T19#\=T11, T20#\=T11, T19#\=T12, T19#\=T13, 
T18#\=T16, T20#\=T17, T19#\=T18. 












Appendix B 
Density Improvements Using 
Xilinx RAM 
A CLB can be configured as memory which stores 32 bits. When used as a logic cell, 
however, a CLB can store only 2 bits, using its two D-type flip flops. Storing data in 
Xilinx internal RAM therefore needs up to 16 times fewer CLBs than storing it in 
D-type flip flops. The following table shows the CLBs used for storing the input 
assignment in the parallel (refer to Section 5.1) and serial (refer to Section 5.2) 
graph coloring machines. The serial design can have up to a 16 times improvement in 
circuit density over the parallel machine. 
                 CLBs used to store     CLBs used to store     Reduction 
Nodes   Colors   input assignment in    input assignment in    in CLBs 
                 parallel machine       serial machine         (times) 
125     18       1125                   72                     15.625 
250     15       1875                   120                    15.625 
250     29       3625                   232                    15.625 
Table B.1: Number of CLBs used to store the input assignment in parallel and serial 
machines for several graph coloring problems 
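The arithmetic behind Table B.1 can be sketched as below. The packing model is an assumption inferred from the 32 bit and 2 bit figures quoted above and is not stated explicitly in the text: one bit of input assignment per (node, color) pair, two flip flop bits per CLB in the parallel machine, and two 16x1 RAMs per CLB in the serial machine, each RAM holding one color bit for up to 16 nodes. The function names are hypothetical.

```c
/* Parallel machine: every (node, color) bit sits in a D-type flip flop,
 * and a CLB provides two flip flops. */
long clbs_parallel(long nodes, long colors)
{
    long bits = nodes * colors;
    return (bits + 1) / 2;              /* round up to whole CLBs */
}

/* Serial machine: every color needs ceil(nodes / 16) RAMs of 16x1 bits,
 * and a CLB provides two such RAMs (32 bits in total). */
long clbs_serial(long nodes, long colors)
{
    long rams = ((nodes + 15) / 16) * colors;
    return (rams + 1) / 2;              /* round up to whole CLBs */
}
```

The reduction falls short of the ideal factor of 16 whenever the number of nodes is not a multiple of 16, because the last RAM of each color is only partly used.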
The following VHDL code demonstrates how to implement a 16x4 synchronous 
memory module. It instantiates four RAM16X1S parts from Xilinx to form the memory 
module. Several applications of Xilinx memory can be found in [30, 31]. 
-- RAM 4-bit inputs, 4-bit outputs using RAM16X1S 
library ieee; 
use ieee.std_logic_1164.all; 

entity mem_input is 
    port ( addr : in std_logic_vector(3 downto 0); 
           din  : in std_logic_vector(3 downto 0); 
           we   : in std_logic; 
           wclk : in std_logic; 
           dout : out std_logic_vector(3 downto 0)); 
end mem_input; 

architecture synthesis of mem_input is 
    component RAM16X1S 
        port ( d    : in std_logic; 
               a0   : in std_logic; 
               a1   : in std_logic; 
               a2   : in std_logic; 
               a3   : in std_logic; 
               we   : in std_logic; 
               wclk : in std_logic; 
               o    : out std_logic); 
    end component; 
begin 
    -- one 16x1 RAM per data bit, all sharing the address and write signals 
    RAMBANK : for I in 0 to 3 generate 
        RAM16 : RAM16X1S port map ( d    => din(I), 
                                    a0   => addr(0), 
                                    a1   => addr(1), 
                                    a2   => addr(2), 
                                    a3   => addr(3), 
                                    we   => we, 
                                    wclk => wclk, 
                                    o    => dout(I)); 
    end generate RAMBANK; 
end synthesis; 




Appendix C 
Bitstream Configuration 
Direct modification of the bitstream is beneficial because the constraints of a 
different problem can be written into the memory within several seconds (refer to 
Section 5.4.4). The location of each Xilinx memory bit of every CLB within the 
bitstream was found by comparing two bitstream configuration files and noting where 
they differ. A C program was written to set the contents of the memory in the 
bitstream directly from a SAT problem specification in the standard DIMACS [4] 
benchmark format. 
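The comparison step can be illustrated with a short C sketch. It assumes that the two configurations are available as strings of '0' and '1' characters (as in the body of an ASCII bitstream file) and that they were generated from designs which differ in a single memory bit, so that the differing offsets identify where that bit lives. The function name diff_offsets is hypothetical, and the tool actually used for the comparison is not described in the text.

```c
#include <stddef.h>
#include <string.h>

/* Compare two ASCII bitstream bodies and record the offsets at which
 * they differ. At most max_out offsets are stored in out; the return
 * value is the total number of differences found. */
size_t diff_offsets(const char *a, const char *b,
                    size_t *out, size_t max_out)
{
    size_t len_a = strlen(a), len_b = strlen(b);
    size_t len = len_a < len_b ? len_a : len_b;   /* compare common prefix */
    size_t found = 0;

    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            if (found < max_out)
                out[found] = i;
            found++;
        }
    }
    return found;
}
```

Repeating the comparison once per memory bit would yield a table of offsets of the kind that appears as constants in the configuration program.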
The source code of the bitstream configuration program is listed below. Three input 
files are required to generate a modified bitstream file. The first two are the binary 
bitstream (pe4.bit) and the ASCII bitstream (pe4.rbt) of the design. The third file 
(50-80) describes each clause in the standard DIMACS benchmark format. The output 
file (new.bit) is a new binary bitstream in which the contents of the memory have 
been modified. 
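In the listing that follows, every group of four variables shares one 16-entry table, and the table entries are set with explicit assignments. The pattern behind those assignments can be written compactly: entry k is set when bit (j % 4) of k is 1 for a plain literal, or 0 for an inverted literal. The sketch below expresses that rule; the function name add_literal and its parameters are hypothetical and do not appear in the original program.

```c
#define LUT_ENTRIES 16

/* OR one literal into a 16-entry lookup table.
 * pos      : position of the variable inside its group of four (0..3)
 * inverted : nonzero if the literal is negated in the clause
 * Entry k corresponds to the four variable values given by the bits of k. */
void add_literal(int lut[LUT_ENTRIES], int pos, int inverted)
{
    for (int k = 0; k < LUT_ENTRIES; k++) {
        int bit = (k >> pos) & 1;        /* value of the variable in row k */
        if (bit != (inverted != 0))      /* literal is true in this row */
            lut[k] = 1;
    }
}
```

Because entries are only ever set and never cleared, calling the function for several literals of the same clause that fall in the same group yields the OR of those literals.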




#define NO_CLAUSE 80 
#define NO_VARIABLE 50 
#define NO_COL ((NO_VARIABLE/4)+1) 




int m,i,j,k,total; 
int start, end, no_bit=0; 
int value, local_max, temp; 
unsigned char c, a[2000000], s[255]; 
int sat[NO_CLAUSE][3], sel[NO_CLAUSE][NO_VARIABLE], inv[NO_CLAUSE][NO_VARIABLE]; 
int config[NO_CLAUSE][NO_COL][16]; 
FILE *fp1, *fp2, *fp3, *fp4; 
if ((fp1=fopen("pe4.bit","r")) == NULL) { 
    printf("can't open PE4.bit\n"); return 0; 
} 
if ((fp2=fopen("new.bit","w")) == NULL) { 
    printf("can't open new.bit\n"); return 0; 
} 
if ((fp3=fopen("pe4.rbt","r")) == NULL) { 
    printf("can't open PE4.rbt\n"); return 0; 
} 
if ((fp4=fopen("50-80","r")) == NULL) { 
    printf("can't open 50-80\n"); return 0; 
} 
// Initialization 
for (i=0; i<NO_CLAUSE; i++) 
    for (j=0; j<NO_VARIABLE; j++) 
        sel[i][j] = inv[i][j] = 0; 
for (i=0; i<NO_CLAUSE; i++) 
    for (j=0; j<NO_COL; j++) 
        for (k=0; k<16; k++) 
            config[i][j][k] = 0; 
// get the values from the benchmark file 
fgets(s, 100, fp4); 
for (i=0; i<NO_CLAUSE; i++) { 
    fgets(s, 100, fp4); 
    sscanf(s, "%d %d %d", &sat[i][0],&sat[i][1],&sat[i][2]); 
} 
// configure the array of Sel and Inv 
for (i=0; i<NO_CLAUSE; i++) 
    for (j=0; j<3; j++) { 
        sel[i][abs(sat[i][j])-1] = 1; 
        if (sat[i][j] < 0) 
            inv[i][abs(sat[i][j])-1] = 1; 
    } 
// configure the ROM content into a 3-d array 
for (i=0; i<NO_CLAUSE; i++) // 80 iterations 
    for (j=0; j<NO_VARIABLE; j++) // 50 iterations 
        if (sel[i][j] == 1) { 
            m = j/4; 
            if (inv[i][j] == 0) { 
                if ((j % 4) == 0) { 
                    config[i][m][1] = 1; config[i][m][3] = 1; config[i][m][5] = 1; 
                    config[i][m][7] = 1; config[i][m][9] = 1; config[i][m][11] = 1; 
                    config[i][m][13] = 1; config[i][m][15] = 1; 
                } else if ((j % 4) == 1) { 
                    config[i][m][2] = 1; config[i][m][3] = 1; config[i][m][6] = 1; 
                    config[i][m][7] = 1; config[i][m][10] = 1; config[i][m][11] = 1; 
                    config[i][m][14] = 1; config[i][m][15] = 1; 
                } else if ((j % 4) == 2) { 
                    config[i][m][4] = 1; config[i][m][5] = 1; config[i][m][6] = 1; 
                    config[i][m][7] = 1; config[i][m][12] = 1; config[i][m][13] = 1; 
                    config[i][m][14] = 1; config[i][m][15] = 1; 
                } else if ((j % 4) == 3) { 
                    config[i][m][8] = 1; config[i][m][9] = 1; config[i][m][10] = 1; 
                    config[i][m][11] = 1; config[i][m][12] = 1; config[i][m][13] = 1; 
                    config[i][m][14] = 1; config[i][m][15] = 1; 
                } 
            } else { 
                if ((j % 4) == 0) { 
                    config[i][m][0] = 1; config[i][m][2] = 1; config[i][m][4] = 1; 
                    config[i][m][6] = 1; config[i][m][8] = 1; config[i][m][10] = 1; 
                    config[i][m][12] = 1; config[i][m][14] = 1; 
                } else if ((j % 4) == 1) { 
                    config[i][m][0] = 1; config[i][m][1] = 1; config[i][m][4] = 1; 
                    config[i][m][5] = 1; config[i][m][8] = 1; config[i][m][9] = 1; 
                    config[i][m][12] = 1; config[i][m][13] = 1; 
                } else if ((j % 4) == 2) { 
                    config[i][m][0] = 1; config[i][m][1] = 1; config[i][m][2] = 1; 
                    config[i][m][3] = 1; config[i][m][8] = 1; config[i][m][9] = 1; 
                    config[i][m][10] = 1; config[i][m][11] = 1; 
                } else if ((j % 4) == 3) { 
                    config[i][m][0] = 1; config[i][m][1] = 1; config[i][m][2] = 1; 
                    config[i][m][3] = 1; config[i][m][4] = 1; config[i][m][5] = 1; 
                    config[i][m][6] = 1; config[i][m][7] = 1; 
                } 
            } 
        } 
// copy the header from .bit 
for (i=1; i<69; i++) { 
    c = fgetc(fp1); 
    fputc((int)c, fp2); 
} 
// cut the header from .rbt 
for (i=0; i<7; i++) { 
    fgets(s,255,fp3); 
    no_bit = no_bit + strlen(s); 
} 
start = no_bit; 
// put all the .rbt content into an array 
while (feof(fp3) == 0) { 
    a[no_bit]=fgetc(fp3); 
    no_bit++; 
} 
end = no_bit; 
// configure the ROM content 
for (i=0; i<NO_CLAUSE; i++) { // 80 iterations 
    if ((i % 2) == 0) { 
        temp = i/2; 
        value = MAX - 12*temp; 
        if ((temp>=12) && (temp<24)) 
            value = value - 2; 
        else if ((temp>=24) && (temp<36)) 
            value = value - 6; 
        else if (temp>=36) 
            value = value - 8; 
        local_max = value; 
    } else 
        value = local_max - 7982; 
    for (j=0; j<NO_COL; j++) { // 13 iterations 
        if ((i % 2) == 0) { 
            if (config[i][j][0] == 1) a[value-616] = '1'; if (config[i][j][1] == 1) a[value-3072] = '1'; 
            if (config[i][j][2] == 1) a[value-2] = '1'; if (config[i][j][3] == 1) a[value-2458] = '1'; 
            if (config[i][j][4] == 1) a[value] = '1'; if (config[i][j][5] == 1) a[value-2456] = '1'; 
            if (config[i][j][6] == 1) a[value-614] = '1'; if (config[i][j][7] == 1) a[value-3070] = '1'; 
            if (config[i][j][8] == 1) a[value-1844] = '1'; if (config[i][j][9] == 1) a[value-4300] = '1'; 
            if (config[i][j][10] == 1) a[value-1230] = '1'; if (config[i][j][11] == 1) a[value-3686] = '1'; 
            if (config[i][j][12] == 1) a[value-1228] = '1'; if (config[i][j][13] == 1) a[value-3684] = '1'; 
            if (config[i][j][14] == 1) a[value-1842] = '1'; if (config[i][j][15] == 1) a[value-4298] = '1'; 
        } 
        else { 
            if (config[i][j][0] == 1) a[value-3684] = '1'; if (config[i][j][1] == 1) a[value-4300] = '1'; 
            if (config[i][j][2] == 1) a[value-2456] = '1'; if (config[i][j][3] == 1) a[value-3072] = '1'; 
            if (config[i][j][4] == 1) a[value-1228] = '1'; if (config[i][j][5] == 1) a[value-1844] = '1'; 
            if (config[i][j][6] == 1) a[value] = '1'; if (config[i][j][7] == 1) a[value-616] = '1'; 
            if (config[i][j][8] == 1) a[value-3685] = '1'; if (config[i][j][9] == 1) a[value-3686] = '1'; 
            if (config[i][j][10] == 1) a[value-3070] = '1'; if (config[i][j][11] == 1) a[value-2458] = '1'; 
            if (config[i][j][12] == 1) a[value-1842] = '1'; if (config[i][j][13] == 1) a[value-1230] = '1'; 
            if (config[i][j][14] == 1) a[value-614] = '1'; if (config[i][j][15] == 1) a[value-2] = '1'; 
        } 
        if (j == 4) 
            value = value - 29472; 
        else 




// pack the ASCII bits into bytes, most significant bit first 
total = 0; 
for (j=start; j<end; j++) { 
    if (a[j]=='\n') continue; 
    if (a[j]=='1') total = total + (1<<i); 
    i--; 
    if (i < 0) { 
        i=7; 
        fputc(total,fp2); 
        total = 0; 
    } 
} 






Bibliography 
[1] Hardware Reference Documents for G900 PCI System, Release 4-2. Giga Operations 
Corporation, 1996. 
[2] The Programmable Logic Data Book. Xilinx, Inc., 1999. 
[3] XC4000XLA and XC4000XV FPGA Series - Description v1.2. Xilinx, Inc., May 
1999. 
[4] DIMACS challenge benchmarks, ftp://dimacs.rutgers.edu/pub/challenge. 
[5] Virtex Series FPGAs. Xilinx, Inc., http://www.xilinx.com/products/virtex.htm. 
[6] WILDFORCE™ Reference Manual. Annapolis Micro Systems, Inc., Revision 
3.4, 1999. 
[7] C. A. Brown and P. W. Purdom. How to search efficiently. In Proceedings 7th 
International Joint Conference on AI, pages 588-594, 1981. 
[8] M. Abramovici, J. T. Sousa, and D. Saab. A massively-parallel easily-scalable 
satisfiability solver using reconfigurable hardware. In Proc. ACM/IEEE Design 
Automation Conference, pages 684-690, 1999. 
[9] J. Axelsson. Architecture synthesis and partitioning of real-time systems: a 
comparison of three heuristic search strategies. In Proceedings of the Fifth 
International Workshop on Hardware/Software Codesign, pages 161-165, 1997. 
[10] P. C. McGeer and R. K. Brayton. Timing analysis and delay-fault test generation 
using path recursive functions. In Proceedings of the International Conference on 
Computer Aided Design, pages 180-183, November 1991. 
[11] Zycad Corp. Paradigm RP Concept Silicon User's Guide, Hardware Reference 
Manual, Software Reference Manual, 1994. 
[12] P. David, P. Emmerman, and S. Ho. A scalable architecture system for automatic 
target recognition. In Proceedings of AIAA/IEEE on Digital Avionics Systems 
Conference, pages 414-420, 1994. 
[13] M. Davis and H. Putnam. A computing procedure for quantification theory. In 
Journal of the Association for Computing Machinery, pages 201-215, July 1960. 
[14] M. Dincbas, P. V. Hentenryck, H. Simonis, A. Aggoun, T. Graf, and F. Berthier. 
The constraint logic programming language CHIP. In Proceedings of the 
International Conference on Fifth Generation Computer Systems, pages 693-702, Japan, 
December 1988. 
[15] O. Dubois, P. Andre, Y. Boufkhad, and J. Carlier. Can a very simple algorithm be 
efficient for solving the SAT problem? In Proceedings of the DIMACS Challenge 
II Workshop, 1993a. 
[16] R. Durbin and D. Willshaw. An analogue approach to the traveling salesman 
problem using an elastic net method. In Nature, volume 326, pages 689-691, 
1987. 
[17] M. H. Schulz and E. Auth. Improved deterministic test pattern generation with 
applications to redundancy identification. In IEEE Transactions on 
Computer-Aided Design, volume 8, pages 811-816, July 1989. 
[18] Y. Hamadi and D. Merceron. Reconfigurable architectures: A new vision for 
optimization problems. In Principles and Practice of Constraint Programming 
CP97, pages 209-215, Austria, 1997. 
[19] Patrick Henry Winston. Artificial Intelligence. Addison Wesley, third edition, 
1992. 
[20] J. H.M. Lee, H. F. Leung, and H. W. Won. Extending GENET for non-binary 
CSPs. In Proceedings of the Seventh IEEE International Conference on Tools with 
Artificial Intelligence, pages 338-343, November 1995. 
[21] J. J. Hopfield and D. W. Tank. Neural computation of decisions in optimization 
problems. In Biological Cybernetics, pages 52:141-152, 1985. 
[22] C. J. Wang and E. P. K. Tsang. Solving constraint satisfaction problems using 
neural-networks. In IEE Second International Conference on Artificial Neural 
Networks, pages 295-299, 1991. 
[23] C. J. Wang and E. P. K. Tsang. A cascadable VLSI design for GENET. In 
International Workshop on VLSI for Neural Networks and Artificial Intelligence, 
Oxford, 1992. 
[24] T. K. Lee, P. H.W. Leong, K. H. Lee, K. T. Chan, S. K. Hui, H. K. Yeung, 
M. F. Lo, and J. H.M. Lee. An FPGA implementation of GENET for solving 
graph coloring problems. In IEEE Symposium on Field-Programmable Custom 
Computing Machines, pages 284-285, 1998. 
[25] S. Kelem. Xilinx Virtex Configuration Architecture Advanced User's Guide 
(XAPP151). 1999. 
[26] H. K.T. Ma, S. Devadas, Ruey-Sing Wei, and A. Sangiovanni-Vincentelli. Logic 
verification algorithms and their parallel implementation. In IEEE Transactions 
on Computer-Aided Design of Integrated Circuits and Systems, volume 8:2, pages 
181-189, Feb 1989. 
[27] P. M. Athanas and A. L. Abbott. Real-time image processing on a custom 
computing platform. In IEEE Computer, pages 16-25, 1995. 
[28] V. Mooney, T. Sakamoto, and G. De Micheli. Run-time scheduler synthesis for 
hardware-software systems and application to robot control design. In Proceedings 
of the Fifth International Workshop on Hardware/Software Codesign, pages 95-99, 
1997. 
[29] S. Mostert. Towards hard real-time system engineering. In Proceedings of the 
IEEE Workshop on Real-Time Applications, pages 207-210, 1993. 
[30] R. Murgai, M. Fujita, and F. Hirose. Logic synthesis for a single large look-up 
table. In IEEE International Conference on Computer Design: VLSI in Computers 
and Processors, pages 415-424, Oct 1995. 
[31] T. Ngai, J. Rose, and S. J.E. Wilton. An SRAM-programmable field-configurable 
memory. In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 
499-502, May 1995. 
[32] B. P. Dave and N. K. Jha. Casper: Concurrent hardware-software co-synthesis of 
hard real-time aperiodic and periodic specifications of embedded system 
architectures. In Proceedings of Design, Automation and Test in Europe, pages 118-124, 
1998. 
[33] E. P. K. Tsang and C. J. Wang. A generic neural network approach for constraint 
satisfaction problems. In Taylor, J.G. (ed.), Neural network applications, pages 
12-22, Springer-Verlag, 1992. 
[34] J. P. Marques Silva and K. A. Sakallah. GRASP - a new search algorithm for 
satisfiability. In IEEE/ACM Inter. Conf. on Computer-Aided Design, pages 
220-227, 1996. 
[35] B. Selman and H. Kautz. Domain-independent extensions to GSAT: Solving large 
structured satisfiability problems. In International Joint Conference on Artificial 
Intelligence, pages 290-295, 1993. 
[36] B. Selman, H. Levesque, and D. Mitchell. A new method for solving hard 
satisfiability problems. In Proceedings of the Tenth National Conference on Artificial 
Intelligence (AAAI-92), pages 440-446, San Jose CA, 1992. 
[37] O. Shagrir. A neural net with self-inhibiting units for the n-queens problem. In 
International Journal of Neural Systems, pages 8(3):249-252, 1993. 
[38] M. Shand. PCI Pamette V1. http://www.research.digitial.com/SRC/pamette. 
[39] T. Suyama, M. Yokoo, and H. Sawada. Solving satisfiability problems using logic 
synthesis and reconfigurable hardware. In Proceedings of the Thirty-First Hawaii 
International Conference on System Sciences, pages 179-186, 1998. 
[40] IKOS Systems. Virtual Logic SLI Emulator, http://www.ikos.com. 
[41] E. Tsang. Foundations of Constraint Satisfaction. Academic Press, 1993. 
[42] D. W. Matula, G. Marble, and J. D. Isaacson. Graph coloring algorithms - Graph 
theory and computing. Academic Press Inc., 1972. 
[43] H. Y. Wong, W. S. Yuen, K. H. Lee, and P. H.W. Leong. A runtime reconfigurable 
implementation of the GSAT algorithm. In to appear in Proc. Field Programmable 
Logic and Applications Workshop (FPL'99), Scotland, 1999. 
[44] M. Yokoo, T. Suyama, and H. Sawada. Solving satisfiability problems using field 
programmable gate arrays: First results. In Proceedings of the 2nd Inter. Conf. 
on Principles and Practice of Constraint Programming, pages 497-509, 1996. 
[45] P. Zhong, P. Ashar, S. Malik, and M. Martonosi. Using reconfigurable computing 
techniques to accelerate problems in the CAD domain: a case study with boolean 
satisfiability. In Proceedings on Design Automation Conference, pages 194-199, 
1998. 
[46] P. Zhong, M. Martonosi, P. Ashar, and S. Malik. Accelerating boolean 
satisfiability with configurable hardware. In IEEE Symposium on Field-Programmable 
Custom Computing Machines, pages 186-195, 1998. 
Publications 
• C.K. Chung, "Solving Constraint Satisfaction Problems using Field Programmable 
Gate Arrays", in Proceedings of The First ACM Hong Kong Postgraduate Re-
search Day, pages 76-79, Oct. 1998. 
• C.K. Chung and P.H.W. Leong, "An Architecture for solving boolean satisfia-
bility using runtime configurable hardware", accepted for publication at the In-
ternational Workshop on Parallel Execution on Reconfigurable Hardware, Japan, 
Sept. 1999. 
• P.H.W. Leong and C.K. Chung, "A FPGA based Runtime Configurable Clause 
Evaluator for SAT problems", accepted for publication at the Electronic Letters. 