In this work we have developed a new SAT solver architecture to address three fundamental hurdles blocking the way to a wider application of reconfigurable-hardware-based acceleration of SAT, namely, (1) 
Introduction
Boolean Satisfiability Checking (SAT) is at the core of many applications in VLSI CAD [5, 6, 9, 10, 11, 12]. We and other groups have proposed a number of techniques for accelerating SAT using FPGA-based instance-specific reconfigurable hardware in the recent past [2, 13, 16]. While significant speedups have been reported over software implementations of SAT for a number of examples, a few fundamental hurdles remain before this technology can be applied widely.
The first hurdle is the time overhead for compiling the hardware implementation of the algorithm onto FPGAs. This includes the time for mapping the logic netlist onto the Configurable Logic Blocks (CLBs) of the FPGA and the time for place-and-route. In most cases, this time is comparable to or greater than the time needed to solve the formula, even in software.
The second drawback is that the hardware algorithm is not as sophisticated as the best software algorithms. The hardware algorithm relies on raw parallelism to achieve the speedup. As a result, there are a number of examples for which the software implementation is comparable or faster because the software heuristics work very well. In previous hardware acceleration efforts for SAT, we were able to incorporate non-chronological backtracking into the hardware SAT algorithm [16]. Another major feature that often speeds up a SAT solver considerably is the addition of clauses corresponding to solution subspaces that do not need to be explored. This feature is practically impossible to implement in our previous architectures.
Another fundamental drawback is that previous architectures have led to much slower clock speeds than what is potentially achievable in FPGAs. The slow clock speeds are a result of the hardwired implementation of connections between literals leading to irregular layout, long wires and wires crossing FPGA boundaries.
Our Contribution
The primary contribution of our work is a new SAT-solver architecture in which the interconnection between literals, and between literals and clauses, is implemented by a physically shared but time-multiplexed pipelined-bus-based scheme. Each time slot is associated with a small set of literals. This assignment of time slots to literals is done once, statically. Using this scheme, the architecture implements a version of the Davis-Putnam algorithm [7] for SAT. The implementation consists of a subcircuit for each clause in the CNF formula. Each of these subcircuits computes two things: (1) the implications of a variable assignment during the forward phase of the algorithm, and (2) the cause of implications on its literals during the backtrack phase. This architecture enables us to address the drawbacks of previous approaches in the following manner:
1. Each clause subcircuit, including its input and output connections, is identical for all practical purposes except for some reference bits that indicate which time slots it needs to look at. The identical subcircuits imply that the layout of each FPGA can be made very regular, consisting of a tiling of identical blocks. This layout, including all the routing, is independent of the specific instance of the SAT problem. Therefore, configuring the FPGA board for solving a specific SAT problem only involves downloading the predetermined configuration bits for each FPGA and the predetermined reference bits for each clause subcircuit. This time is negligible compared to the time required even for an easy SAT problem.
2. The fact that the clause subcircuits are identical and that the FPGAs are pre-laid out implies that the addition of new clauses dynamically to implement some form of learning is feasible.
3. The fact that all clause subcircuits connect directly to the bus and that there are no long wires running end-to-end on an FPGA and crossing FPGA boundaries means that the FPGAs can be clocked at much higher speeds than in previous architectures.
Organization of the Paper
In the rest of the paper, Section 2 provides the background for our work in terms of an introduction to the SAT problem and to recent previous work on hardware acceleration of SAT using FPGA-based reconfigurable hardware. Sections 3 and 4 contain the details of our architecture with experimental results. Section 5 concludes the paper.
Background and Related Work

Boolean Satisfiability Checking (SAT)
The Boolean satisfiability (SAT) problem is a well-known constraint satisfaction problem with many applications in computer-aided design, such as test generation, logic verification and timing analysis. Given a Boolean formula, the objective is either to find an assignment of 0-1 values to the variables so that the formula evaluates to true, or to establish that such an assignment does not exist.
The Boolean formula is typically expressed in conjunctive normal form (CNF), also called product-of-sums form. Each sum term (clause) in the CNF is a sum of single literals, where a literal is a variable or its negation. An n-clause is a clause with n literals. For example, (v_i + v'_j + v_k) is a 3-clause. In order for the entire formula to evaluate to 1, each clause must be satisfied, i.e., evaluate to 1.
Most current SAT solvers are based on the Davis-Putnam algorithm [7]. The basic algorithm begins from an empty partial assignment. It proceeds by assigning a 0 or 1 value to one free variable at a time. After each assignment, the algorithm determines the direct and transitive implications of that assignment on other variables. If no contradiction is detected during the implication procedure, the algorithm picks the next free variable, and repeats the procedure. Otherwise, the algorithm attempts a new partial assignment by complementing the most recently assigned variable for which only one value has been tried so far. This step is called backtracking. The algorithm terminates when no free variables are available and no contradictions have been encountered (implying that all the clauses have been satisfied and a solution has been found), or when all possible assignments have been exhausted. The algorithm is complete in that it will find a solution if it exists.
Determining implications is crucial to pruning the search space since (1) it allows the algorithm to skip entire regions of the search space corresponding to contradictory partial assignments, and (2) every implied variable corresponds to one less free variable on which search must be performed. Unfortunately, detecting implications in software is very slow since each clause containing the newly assigned or implied variable is scanned and updated sequentially, with the process repeated until no new implications are detected.
Our intuition for the hardware speedup potential in the SAT algorithm stems from recognizing that the implication procedure central to the algorithm is both highly parallelizable and easily mapped to basic logic gates. Our entire hardware architecture is designed to take advantage of this parallelism. Pseudo code for the basic Davis-Putnam algorithm is shown below. Several solvers extend this algorithm [6, 10, 11, 12] while maintaining the same basic flow. The contribution of the GRASP work [11] is notable since it applies non-chronological backtracking (conflict analysis) and dynamic clause addition to prune the search space further.
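As a rough software illustration of the basic flow described above (assign a free variable, compute implications, backtrack on contradiction), a minimal Python sketch follows. The DIMACS-style integer literals and the names `propagate` and `solve` are assumptions made for this sketch; it uses plain chronological backtracking and none of the refinements of [6, 10, 11, 12].

```python
from typing import Dict, List, Optional

Clause = List[int]  # +v stands for variable v, -v for its negation


def propagate(clauses: List[Clause], assign: Dict[int, bool]) -> bool:
    """Apply implications until nothing changes; False means contradiction."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assign.get(abs(l)) == (l > 0) for l in clause):
                continue  # clause is already satisfied
            free = [l for l in clause if abs(l) not in assign]
            if not free:
                return False  # every literal is false: contradiction
            if len(free) == 1:  # last free literal is implied
                assign[abs(free[0])] = free[0] > 0
                changed = True
    return True


def solve(clauses: List[Clause],
          assign: Optional[Dict[int, bool]] = None) -> Optional[Dict[int, bool]]:
    """Return a satisfying assignment, or None if none exists."""
    assign = dict(assign or {})
    if not propagate(clauses, assign):
        return None
    variables = sorted({abs(l) for c in clauses for l in c})
    free = [v for v in variables if v not in assign]
    if not free:
        return assign  # no free variables and no contradiction
    for value in (True, False):  # try one value, then its complement
        result = solve(clauses, {**assign, free[0]: value})
        if result is not None:
            return result
    return None
```

The sequential scan over all clauses inside `propagate` is the step that the hardware evaluates in parallel.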
Conflict analysis and clause addition are relatively easy to implement in software. When a new value is implied, it is added to a data structure recording the implication graph. When a conflict occurs, traversing the graph backwards identifies predecessors of the conflict. It is also easy to add new clauses to the clause database. Such techniques may greatly improve the performance on many problems. Recognizing these two as important features, we show how they can be mapped to reconfigurable hardware in our architecture.
Previous Work in Hardware Acceleration of SAT
Prior work includes several proposals for solving SAT using reconfigurable hardware [13, 16]. Suyama et al. [13] have proposed their own SAT algorithm distinct from the Davis-Putnam approach. Their approach is characterized by the fact that at any point, a full (not partial) variable assignment is evaluated. While the authors propose heuristics to prune the search space, they acknowledge that the number of states visited in their approach can be 8x the number of states visited in the basic Davis-Putnam approach.
The work by Abramovici and Saab also proposed a configurable hardware SAT solver [2]. Their approach amounts to an implementation of a PODEM-based [8] algorithm in reconfigurable hardware. PODEM is typically used to solve test generation problems. Unlike PODEM, which relies on controlling and observing the primary inputs and outputs of a circuit, the Davis-Putnam algorithm also captures relationships between the internal variables in the circuit; this significantly reduces the state space visited and the run time [6, 10, 11].
In prior work, we designed a SAT solver based on the basic Davis-Putnam algorithm and implemented it on an IKOS Virtualogic Emulator. This work was the first to publish results of an actual implementation of SAT in programmable logic [16]. We also designed and implemented an improved algorithm which uses a modified version of non-chronological backtracking to prune the search space [16]. This method indirectly identifies predecessors for a conflict but is not as efficient as the direct conflict analysis we perform in this paper. Finally, none of the previous hardware SAT solvers have employed dynamic clause addition.
Importantly, all the previous approaches generate formula-specific solver circuits which differ significantly from formula to formula. As a result, the entire circuit for each formula must be completely synthesized, placed and routed, and downloaded to the multiple FPGAs from scratch, which is a significant run-time overhead. The approach we are presenting in this paper, on the other hand, overcomes this hurdle.
Proposed SAT Solver Architecture
A Modular Design Approach
As discussed in earlier sections, one of our major goals in developing the new architecture is to greatly reduce the compilation overhead. In the design flows for previous architectures, all four steps of synthesis, mapping, place-and-route and bitstream generation are required for each SAT formula. Through the new architecture, we wish to maximize the amount of regularity and the number of repetitive structures in the design. We can take advantage of this regularity by means of our own customized compilation flow that is orders of magnitude faster. A recent development that allows us to do that is the availability of tools from Xilinx for direct modification of their FPGA configurations [15].
The modularity in our design arises from the fact that our SAT solver circuit is clause based. The subcircuit for each clause encapsulates the variable implications arising out of the clause. The subcircuits for any two n-literal clauses are identical except for the input and output connections. In particular, the implication subcircuit for a general 3-literal clause v_i + v_j + v_k would be:
Here v'_i, v'_j and v'_k are the inputs and v_i, v_j and v_k are the implied outputs. Only one module needs to be designed for clauses with the same number of literals. These modules can be compiled once and be placed at different locations in the FPGA. The final subcircuit for each clause is then determined by the identity of the input and output wires.
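As an illustration of such a subcircuit, a minimal Python model is given here. It assumes the standard unit-clause rule: a literal of the clause is implied when the complements of the other two literals are both true. The function name `imply3` and the 0/1 encoding of literal wires are assumptions of this sketch, not the actual netlist.

```python
from typing import Tuple


def imply3(ni: int, nj: int, nk: int) -> Tuple[int, int, int]:
    """Model of a 3-literal clause (v_i + v_j + v_k).

    Inputs are the complement literals v'_i, v'_j, v'_k (1 = true).
    Outputs are the implied literals v_i, v_j, v_k.
    """
    vi = nj & nk  # v_j and v_k are both false, so v_i must be true
    vj = ni & nk
    vk = ni & nj
    return vi, vj, vk
```

For example, `imply3(0, 1, 1)` yields `(1, 0, 0)`. When all three complements are true the clause is violated, and every output is asserted alongside its complement, which is the situation the conflict check looks for.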
Pipelined Bus
Improved Algorithm
There are several algorithmic improvements that lead to higher software efficiency, and that we can use to improve the hardware performance as well. We find that conflict-analysis-based techniques have the most impact on performance for many problems [11]. The major characteristics of our algorithm implementation are the following:
Conflict analysis:
This analysis identifies the branching assignments that lead to the conflict. It is the reverse of the implication process. Starting from the conflict variable, whose two literals are both implied or assigned to 1, we need to find the transitive predecessors that implied the values leading to the conflict. Adding this function requires additional hardware. When a clause generates an implication, this event must be remembered. In the analysis mode, this stored implication information is used to identify the predecessors. If a true literal was assigned by implication, its predecessors are posted on the bus while the literal itself is reset to false. If a literal was assigned by branching, it has no predecessors and keeps circulating on the bus. The analysis ends when no further value changes occur on the bus. At this moment, only the predecessors assigned by branching are circulating with value one. Collecting these values creates the set of assignments that leads to the conflict.
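The analysis just described can be mimicked in software as a fixed-point iteration over the set of literals that are true on the bus. The sketch below is only a behavioral model under stated assumptions: literals are signed integers, and `reason` is a hypothetical map from each implied literal to the literals that caused it (branch literals have no entry).

```python
from typing import Dict, List, Set


def analyze_conflict(conflict_var: int, reason: Dict[int, List[int]]) -> Set[int]:
    """Return the branch literals responsible for a conflict on conflict_var."""
    # Analysis starts with both literals of the conflict variable set to 1.
    bus = {conflict_var, -conflict_var}
    changed = True
    while changed:  # stop when no value on the bus changes
        changed = False
        for lit in list(bus):
            if lit in reason:            # literal was implied by some clause
                bus.discard(lit)         # reset it to false
                bus.update(reason[lit])  # post its predecessors instead
                changed = True
    return bus  # only branch assignments keep circulating
```

For instance, if branches 1 and 2 imply 3, and 3 implies both 4 and its complement, then `analyze_conflict(4, {3: [1, 2], 4: [3], -4: [3]})` returns `{1, 2}`.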
Non-chronological backtrack:
With the list of the predecessors, the backtrack target is the most recent assignment within this list. It often backtracks more levels than it would with the basic backtrack algorithm, which backtracks to the most recent assignment.
Dynamic clause addition:
The conflict analysis can also lead to the addition of redundant clauses which do not affect the final result, but potentially speed up the algorithm. Based on the conflict analysis, non-chronological backtrack and dynamic clause addition render much higher performance on many problems. We decided to implement these techniques in the new design. As a result, we will need to add some memory in the clause module to store whether it has made a certain implication. We also need additional circuits to perform the conflict analysis.
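To make the last two items concrete, the sketch below derives a backtrack target and a candidate clause from a conflict set. It assumes the usual construction in which the added clause is the disjunction of the complements of the responsible branch literals; the exact form of the added clause is not spelled out here, so this is an assumption, as are the names `backtrack_and_learn` and `decisions`.

```python
from typing import List, Optional, Set, Tuple


def backtrack_and_learn(conflict_set: Set[int],
                        decisions: List[int]) -> Tuple[Optional[int], List[int]]:
    """decisions holds the branch literals, oldest first.

    Returns (backtrack target, candidate clause). A target of None means
    there is nothing to backtrack to.
    """
    if not conflict_set:
        return None, []
    # Most recent branch assignment among those responsible for the conflict.
    target = max(conflict_set, key=decisions.index)
    # Assumed clause form: forbid this combination of branch assignments.
    learned = sorted((-lit for lit in conflict_set), key=abs)
    return target, learned
```

With `decisions = [1, 5, 2, 7]` and a conflict set `{1, 2}`, the target is 2, so the assignment 7 is skipped, whereas chronological backtracking would return to 7.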
Centralized Control
Since we have a pipelined bus to handle the data communication, the value changes can be monitored at any point on the bus. It is therefore beneficial to use centralized control. The task of the centralized control is to manage the conflict-analysis-based backtrack search algorithm. Its functionality is as follows:
State control:
The SAT solver changes between several major states: branch, implication, analysis and backtrack. The control unit should decide the state transition depending on the signals of conflict and value change. It should send the control signals to the clauses so they can perform the appropriate operation.
Value monitoring:
The value of the variables should be monitored to identify conflict. It also should monitor whether a new value has been generated. During the conflict analysis, it should also record the variables identified as the predecessors for the conflict. This information is used to determine the backtrack destination and to construct the new addition clause.
Value decision:
During the backtrack and branching operations, the control block determines value changes on the variables.
Interface with the host:
If clause addition is implemented, it is necessary to send the new clause information to the host to create new circuits for the new clause module. The control unit should also notify the host of termination and send the solution to the host.
Based on these high-level requirements for the new design, we next describe the detailed design of the modular SAT solver.
Configurable Hardware Mapping
This section describes a hardware organization for the new SAT solver. This design is intended to be mapped to an array of FPGA chips, because as in previous architectures, one FPGA does not provide the capacity necessary for interesting (i.e. large) SAT problems.
Global Topology
The circuit topology is based on a regular ring structure as shown in Figure 3. There is a main control unit and a series of processing elements (PEs) on the ring. Each PE is one FPGA containing multiple clause modules. A pipelined bus runs around the ring. Each PE represents one stage of the pipeline, so a wire will cross no more than one FPGA boundary. The size of the ring is determined by the size of the SAT formula. More clauses require more clause modules, hence more FPGAs. Within each FPGA, the clause modules have access to the values on the bus. They can also update the values on the bus. For example, if a new implication is made, the module will set the corresponding bit to 1. The newly updated literal will pass around the ring, so every clause will be able to see the update in a few cycles.
This structure bears a similarity to the SPLASH and SPLASH-2 [3] reconfigurable computers. Physically, the implementation system does not need to be a ring. For example, we can map it to an FPGA board like the one used in the IKOS emulator. The ring structure also has the flexibility of extending the capacity by allowing the connection of a number of boards into one system. We can also divide a large system into several smaller SAT solvers. This flexibility of expanding capacity or dividing the hardware among multiple problems improves the utilization and efficiency of the hardware.
Main Control Unit
The main control maintains the global state and monitors the ring for value changes and conflicts.
State control:
The main control manages the state transitions between the four main states: branching, implication, analysis and backtrack. It sets the control signals on the bus to ensure proper operation of the clause modules.
Conflict check:
If both literals of a variable are set to 1, a conflict arises and causes conflict analysis and backtrack. Since we need to perform conflict analysis, it is necessary to identify which variable has the conflict. Each pair of literals for one variable feeds an AND gate to create a variable conflict signal. These conflict signals feed a wide OR gate to produce a global conflict signal. The conflict signals also feed a priority encoder so the identity of the conflict variable can be stored. This variable will be posted on the bus as the starting point for conflict analysis.
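A behavioral model of this conflict check is sketched below. The per-variable AND, the wide OR and the priority encoder follow the description above; the choice that the encoder favors the lowest-numbered variable, and the names `conflict_check`, `pos` and `neg`, are assumptions of the sketch.

```python
from typing import List, Optional, Tuple


def conflict_check(pos: List[int], neg: List[int]) -> Tuple[int, Optional[int]]:
    """pos[v] and neg[v] are the two literal bits of variable v.

    Returns (global conflict signal, index of the conflict variable).
    """
    per_var = [p & n for p, n in zip(pos, neg)]  # one AND gate per variable
    any_conflict = int(any(per_var))             # wide OR gate
    # Priority encoder: pick the first asserted conflict line (assumed order).
    index = per_var.index(1) if any_conflict else None
    return any_conflict, index
```

For example, `conflict_check([1, 0, 1, 1], [0, 0, 1, 1])` returns `(1, 2)`.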
Branch decision:
Branch decision is selecting a new value for an undetermined variable. The control circuit should decide on the next available free variable. During implication, the circuit checks if both literals of a variable are 0. Then the signal feeds a priority encoder, just like the one for the conflict check. Then the variable selected will be assigned 1 through a decoder during the branching. After a branch decision is made, the system changes into implication mode.
Analysis and backtrack decision:
During the analysis, the main control waits until all values settle. The variable assignments leading to the conflict are then collected. The most recent assignment is the backtrack destination. This set of variables should be forwarded to the host machine if clause addition is implemented.
Host interface:
The main control unit should be connected to the host to send the termination signal and to shift the solution out. If clause addition is implemented, it should send out the information about the additional clause. It should also coordinate the reconfiguration for additional clause modules.
Compilation:
Since the function of the main control unit is fixed, it need only be compiled once. It occupies one FPGA and is connected to both the pipelined bus and a communication interface to the host computer.
Clause Module
The clause module is the unit that performs the computing task of a clause in the SAT formula. It computes implications and performs conflict analysis. Since the clause module is connected to the pipelined bus, it also includes the interface to the bus. Figure 4 shows one clause module in the context of one pipeline stage. It sets or resets the values on the pipelined bus during implication or analysis. In an actual implementation, since the number is compared to a constant, there is no need for a full comparator. It just needs to be a combinational gate that outputs one only for a given input vector. In the Xilinx XC4000 FPGAs, a CLB can take up to nine inputs. With the identification performed by the counter and checker, the bus interface should take care of updating values both from the bus to the clause and from the clause to the bus. This task is handled by a set of registers and some control circuits. Each clause module has a set of local registers to store the values of the relevant variables. They are updated from the bus once the variable is available. It also has a set of registers to store the values it wants to put on the bus. When the variable is accessible, the value is updated. It also sets the update signal, telling the control unit that new values have been asserted.
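The remark about comparing against a constant can be illustrated with a small model: matching a fixed value needs only one AND over the counter bits, each taken directly or inverted. In the sketch below, `slot` (the time-slot number a clause listens for), `width` and `make_slot_checker` are hypothetical names introduced for illustration.

```python
from typing import Callable, List


def make_slot_checker(slot: int, width: int) -> Callable[[List[int]], int]:
    """Build a gate that outputs 1 only when the counter bits equal slot.

    Counter bits are given least significant bit first.
    """
    # Bit i of the constant decides whether input i is used as-is or inverted.
    pattern = [(slot >> i) & 1 for i in range(width)]

    def gate(counter_bits: List[int]) -> int:
        out = 1
        for bit, want in zip(counter_bits, pattern):
            out &= bit if want else 1 - bit
        return out

    return gate
```

A checker built with `make_slot_checker(5, 4)` fires on the bit vector `[1, 0, 1, 0]` and on no other 4-bit value. Only the inversion pattern differs from clause to clause.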
Implication:
The implication is similar to the original design. It checks the local variable registers and generates new implications accordingly. The new implications will subsequently update the value on the bus. The unit also remembers that such an implication has been made. This information is used in the analysis process.
Analysis:
Analysis is the reversal of the implication process. It determines the predecessors of a conflict. It starts by setting both literals of the conflict variable to 1. If a clause has generated an implication, it checks whether this implied literal is set to one on the bus. If so, it means the analysis process is looking for the predecessors of this literal. Then this clause module sets the complement of the other literals in the clause to 1 because these values have led to the implication. The implied literal is reset to 0 on the bus because this literal has already been accounted for. This process ends when all the true values on the bus are determined by branch decisions. At this moment, the main control should generate the predecessor set and decide the backtrack destination.
Running SAT Algorithm on Hardware
The primary source of parallelism in the hardware implementation of SAT is that each clause is implemented as a separate cell in hardware. As a result, many clauses can be evaluated in parallel. The basic flow of the SAT solver is shown in Figure 5 . It includes four major states: Branch, Imply, Analysis and Backtrack.
Branch:
After initialization (and whenever a partial assignment does not lead to a conflict), the circuit is in the branch state. The decision of which variable to assign (i.e., branch on) next, is determined by the main control. In the simple approach we simulate here, the variable ordering is statically determined, but more elaborate dynamic techniques are the subject of future research. It then proceeds to the Implication state. If no more free variables exist, a solution has been found.
Implication:
This state is used to compute the logical implications of the assignment. The clause cells check the values and generate implications when necessary. The implied values propagate to other clauses through the pipelined bus. The main control monitors the value changes. When implications have settled, it will proceed to branch on a new variable. It also monitors conflicts in variables, and initiates conflict analysis when needed.
Analysis:
To indicate conflict analysis mode, both literals of the conflicted variable are set to 1, and all other data bits are set to 0. In the analysis mode, clause modules generate predecessors for the implied literals. When this process settles, only the assigned variables responsible for the conflict will have a true value on the bus. This list may be used by the host computer to generate a new clause module. In our work, a new clause is generated only when both assignments for a variable end with a conflict. The backtrack destination is chosen to be the most recently assigned variable of the ones responsible for the conflict. If there is no variable to backtrack to, there is no solution to the formula because the observed conflict cannot be resolved.
Backtrack:
After conflict analysis, the backtrack destination is determined. All the variables that have obtained their value after the backtrack variable should be reset to free. A new branch of searching begins and the state machine goes into the implication state.
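The transitions between the four states described above can be summarized in a small transition function. This is a sketch of the control flow only; the `State` enum, the two terminal states and the flag names are assumptions introduced here.

```python
from enum import Enum


class State(Enum):
    BRANCH = "branch"
    IMPLY = "imply"
    ANALYSIS = "analysis"
    BACKTRACK = "backtrack"
    SAT = "solution found"   # terminal
    UNSAT = "no solution"    # terminal


def next_state(state: State, *, conflict: bool = False, settled: bool = True,
               free_vars: bool = True, has_target: bool = True) -> State:
    """One step of the main control, following the four states in the text."""
    if state is State.BRANCH:
        # No free variable left means a solution has been found.
        return State.IMPLY if free_vars else State.SAT
    if state is State.IMPLY:
        if conflict:
            return State.ANALYSIS
        return State.BRANCH if settled else State.IMPLY
    if state is State.ANALYSIS:
        if not settled:
            return State.ANALYSIS
        # Nothing to backtrack to means the conflict cannot be resolved.
        return State.BACKTRACK if has_target else State.UNSAT
    if state is State.BACKTRACK:
        return State.IMPLY
    return state  # SAT and UNSAT do not change
```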
Performance
In order to evaluate the performance of this design, we built a C++ simulator and used the DIMACS SAT benchmark suite as input [1]. Table 1 shows a comparison of the number of hardware cycles needed to solve the problems. The first column of data shows the cycle counts for the basic configurable approach described in the previous chapter. The next column shows performance data for the newer design discussed here. This column includes conflict analysis and non-chronological backtracking, but does not include dynamic clause addition. The following column of data is the new design with dynamic clause addition.
Although the run time is expressed in number of cycles in each case, the actual cycle time is very different between the original design and the current one. In the old design, long wires between implication units and VirtualWire pin multiplexing make the user clock slow. Typical user clock rates are several hundred kHz to 2 MHz. In the newer design, the communication is pipelined and the routing is shorter and more regular. Our compilation to FPGAs shows that clock rates in the range of 30 MHz can be achieved quite easily. Therefore, a speedup occurs whenever the new design requires fewer than about 15 times as many cycles as the old design.
From the results, we can see that the new design without clause addition has a speedup of about 1x to 10x. Speedups in this case are due to (1) the improved clock rate and (2) the improved conflict analysis, which leads to more direct backtracking. Implementing dynamic clause addition offers benefits that vary with the characteristics of the problems. For some problems, there is marginal or no benefit. For the aim problems, we can achieve up to several thousand times additional speedup. The performance gain is especially significant in the problems with no solution. In these cases, the dynamically added clauses significantly prune the search space, allowing the circuit to declare the problem unsatisfiable much earlier. The results show that the performance of the new design is similar to or better than that of the original configurable SAT solver. With clause addition, the performance for some problems may be improved by several hundred times.
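The break-even figure follows from the ratio of the two clock rates. The short calculation below assumes the upper end of the old clock range (2 MHz) and the 30 MHz figure quoted above; the function name and default values are illustrative only.

```python
def wall_clock_speedup(old_cycles: float, new_cycles: float,
                       old_hz: float = 2e6, new_hz: float = 30e6) -> float:
    """Ratio of old run time to new run time; above 1 means the new design wins."""
    return (old_cycles * new_hz) / (new_cycles * old_hz)
```

With these defaults, `wall_clock_speedup(1000, 15000)` is 1.0, so the new design breaks even when it needs 15 times as many cycles. Against an old clock of a few hundred kHz the margin would be larger.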
Fast Configuration Generation
One of the major objectives for the new modular design is to achieve faster hardware compilation. In this section, we elaborate on the fast configuration generation of the SAT solver using the JBits tools [15].
JBits Tools
Typically, the physical design tools are provided by the FPGA vendor, and the user cannot manipulate the configuration without the vendor tools. However, these tools are in general quite slow.
JBits is a tool set that provides the means to directly program the Xilinx FPGAs. JBits is an Application Program Interface (API) to the Xilinx configuration bitstream. This API permits Java applications to dynamically modify Xilinx XC4000EX/XL bitstream configurations. The advantage of this approach is primarily one of speed.
Typical FPGA tools can automatically compile the circuit design to the FPGA configuration. This process is quite slow. The JBits tool is not yet another automatic tool. Instead, it allows direct control of each single connection and look-up table in the FPGA. For example, we can write a Java program that creates the configuration of an FPGA by directly assigning the routing and logic functions. Since the configuration is directly generated without automatic optimization, the process can be very fast.
Fast SAT-solver Generator
With the capability of direct programming provided by JBits, we use a two-step approach for a fast configuration generator. The first step is creating a general template and the second step is customizing the template according to the SAT formula.
We will limit our solver to 3-SAT problems first, in which each clause has no more than three literals. Any SAT problem can be transformed to 3-SAT in polynomial time by introducing new variables. Therefore, limiting the solver to 3-SAT does not lose the generality of the approach. We need only create one 3-SAT module instead of modules of many different sizes. We can create a basic template with the 3-SAT clause modules. The following steps are used to create such a template:
Designing single clause module:
We have designed one 3-SAT clause module by schematic entry. Schematic design gives a simple correspondence between the design and the implementation. The Xilinx Foundation implementation tool is used to place and route the design into the FPGA. This design is used as the starting point.
Reshaping clause module:
The compiled clause module is initially spread over the whole FPGA. However, we want it to have a regular shape so that many modules can be tiled into one FPGA. The EPIC tool provides a graphical user interface to edit the actual layout of the FPGA. We rearrange the CLBs to create a rectangular module for a clause.
Creating SAT template:
The clause module is duplicated in the FPGA and global routing is added to the circuit. This provides a generic template for the SAT solver.
The template should be further customized according to the SAT formula. Two entities must be customized within each clause module: (1) the comparator to identify the variable on the bus, and (2) the connection to the global bus.
The customization is performed by a Java program we developed, called SAT-Solver Generator. It reads in the SAT formula and the bitstream file of the generic template. It performs the customization by calling the JBits API. The FPGA bitstream files are the products of this generator and they can be directly downloaded to the FPGAs.
This approach minimizes the work needed to customize the circuit. All the duplicated parts are built into the template. The instance-specific customization only involves a small number of simple operations. The experimental result of dramatic reduction in compilation time is shown in the next section.
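The restriction to 3-SAT made at the start of this section relies on splitting longer clauses with fresh linking variables. The sketch below uses the textbook transformation, which is an assumption here since the preprocessing actually used is not described; `to_3sat` and the DIMACS-style integer literals are illustrative.

```python
from typing import List, Tuple


def to_3sat(clauses: List[List[int]], num_vars: int) -> Tuple[List[List[int]], int]:
    """Split every clause with more than three literals.

    Returns the new clause list and the new variable count. The result is
    satisfiable exactly when the input is.
    """
    out: List[List[int]] = []
    for clause in clauses:
        rest = list(clause)
        while len(rest) > 3:
            num_vars += 1  # fresh linking variable
            out.append([rest[0], rest[1], num_vars])
            rest = [-num_vars] + rest[2:]
        out.append(rest)  # three or fewer literals remain
    return out, num_vars
```

For example, the 5-literal clause `[1, 2, 3, 4, 5]` over five variables becomes `[1, 2, 6]`, `[-6, 3, 7]` and `[-7, 4, 5]`, at the cost of two extra variables and two extra clause modules.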
Experimental Results
We use the Xilinx XC4036EX FPGA as our target device. Each FPGA contains 36 x 36 CLBs. Larger FPGAs can also be used with a small number of modifications to our program. The clause module is a 4 x 16 rectangle. Sixteen modules are placed in each FPGA. The global signals propagate horizontally. The modules connect to the global signals through vertical wires. Their connections are determined according to the SAT formula. The comparators are also configured according to the formula. All other elements are identical for different problems.
The customized configuration files are generated by the SAT-Solver Generator. It reads in the template file and the SAT formula, and generates a number of bitstream files for the FPGAs. The Sun Java 1.1.7 tools are used to compile and run the Java program. The computer is an Intel Pentium Pro running Microsoft Windows NT 4.0. The CPU clock rate is 200 MHz and the main memory is 128 MB. Table 2 shows the compilation time comparison. The old compilation time is the IKOS compilation time for the initial design. It is on the order of hours, while the new compilation time is only a few seconds. The new approach is several orders of magnitude faster. In the old design, there is no useful acceleration for problems solvable in several hundred seconds because of the compilation overhead. In the new design, we can achieve useful speedups of up to 91 times.
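These figures give a quick capacity estimate: sixteen 4 x 16 modules occupy 1024 of the 1296 CLBs, and the number of FPGAs grows with the clause count. The helper below is a back-of-the-envelope sketch that assumes one clause per module and one extra FPGA for the main control unit; `fpgas_needed` is a hypothetical name.

```python
import math


def fpgas_needed(num_clauses: int, modules_per_fpga: int = 16) -> int:
    """Estimate the ring size for a 3-SAT formula with num_clauses clauses."""
    clause_fpgas = math.ceil(num_clauses / modules_per_fpga)
    return clause_fpgas + 1  # plus one FPGA for the main control unit
```

A formula with 100 clauses would need 7 clause FPGAs plus the control FPGA under these assumptions.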
This design clearly demonstrates that with proper tools and careful design, the overhead of instance-specific configurable computing can be reduced to a few seconds. Our approach successfully exploits the repetitive structure in configurable computing. We will elaborate the general strategies of expediting the implementation of configurable computing in the discussion chapter.
Conclusions
We have proposed a new FPGA-based reconfigurable architecture for solving the satisfiability problem. The goal in proposing this new architecture has been primarily to achieve speed up over software implementations of SAT and, at the same time, remove the synthesis and place-and-route overheads which have been associated with previous proposals for accelerating SAT with reconfigurable hardware. Our proposal represents a novel architecture which not only achieves that goal but also enables dynamic clause addition for incorporating learning into the algorithm, and allows for much faster clock speeds. We present experimental results indicating that significant speed up is possible using our architecture over software implementations of satisfiability, and that the compilation overhead of the architecture is negligible.
