We apply our object-oriented design environment PAM-Blox to the dynamic generation of circuits for reconfigurable computing. Our approach combines the structural hardware design environment with commercial synthesis of finite state machines (FSMs). The PAM-Blox environment features a well-defined hardware object interface and the ability to control the placement of hand-optimized circuits. We integrate the advantages of an object-oriented design environment, including full control over placement at every level of abstraction, with commercial FSM synthesis and optimization.
Introduction
Reconfiguration models used in configurable computing can be classified into compile-time reconfiguration (CTR) and run-time reconfiguration (RTR) [6]. In CTR, the hardware compilation and the reconfiguration are done at compile-time. At run-time, the circuit loaded onto a reconfigurable resource is executed for many sets of input data. In RTR, the configuration of the reconfigurable resource is changed while the application is running. Here, the reconfiguration time becomes part of the application's run-time and has to be minimized.
Recently, another model of reconfigurable computing has emerged: instance-specific reconfiguration. In this model, new hardware is generated for every problem instance, i.e., every set of input data, of a particular algorithm. In such a case, not only the reconfiguration time but also the hardware compilation time becomes part of the overall run-time. This is denoted as dynamic circuit generation.
Applications that can make use of the instance-specific reconfiguration model must show two characteristics: a large amount of fine-grained parallelism that depends on the actual input data, and long run-times in software. The first characteristic ensures that fine-grained reconfigurable resources, such as Field Programmable Gate Arrays (FPGAs), achieve high speed-ups compared to microprocessors. The second characteristic is required to hide the hardware compilation overhead. The driving application for this reconfiguration model is the acceleration of Boolean satisfiability problems [13] [9] [2] [10] [12].
For this paper, we required a design environment that supports dynamic reconfiguration. Our approach combines PAM-Blox, a structural object-oriented design tool that gives control over placement, with commercial FSM synthesis and optimization. The concept of a hardware object demonstrated with PAM-Blox was introduced in [8] and has proven itself to be highly competitive with commercial tools such as Synopsys FPGA Express II. Circuits for solving Boolean satisfiability problems consist of some datapath blocks and many cooperating finite state machines. In terms of hardware area, the FSMs dominate. A design environment that supports dynamic reconfiguration must therefore provide not only a method to reuse highly optimized datapath components, but also a method of specifying complex FSMs, both with control over placement.
In the remaining part of this section, we define the Boolean satisfiability problem, discuss the general design tool flow for reconfigurable satisfiability solvers, and introduce the PAM-Blox design environment. In Section 2, we describe our new environment that combines FSMs with PAM-Blox. Section 3 presents an example of a hardware architecture to solve satisfiability problems in reconfigurable hardware. First experimental results are discussed in Section 4. Section 5 concludes this paper.
The Boolean Satisfiability Problem
The Boolean satisfiability problem (SAT) is a fundamental problem in mathematical logic and computing theory with many practical applications in areas such as computer-aided design of digital systems, automated reasoning, and machine vision. In computer-aided design, tools for synthesis, optimization, verification, timing analysis, and test pattern generation use variants of SAT solvers as core algorithms. The SAT problem is commonly defined as follows [5]:
Definition 1 Given i) a set of $n$ Boolean variables $x_1, x_2, \ldots, x_n$, ii) a set of literals, where a literal is a variable $x_i$ or the negation of a variable $\bar{x}_i$, and iii) a set of $m$ distinct clauses $C_1, C_2, \ldots, C_m$, where each clause consists of literals combined by the logical or connective $\lor$, determine whether there exists an assignment of truth values to the variables that makes the Conjunctive Normal Form (CNF) $C_1 \land C_2 \land \cdots \land C_m$ true, where $\land$ denotes the logical and connective.
An example of a SAT problem with 4 variables and 3 clauses is $(x_1 \lor x_2) \land (x_1 \lor x_3 \lor x_4) \land (x_2 \lor x_4)$. The vector $(x_1, x_2, x_3, x_4) = (1, 1, 0, 0)$ is one possible solution to this SAT problem.
Since the general SAT problem is NP-complete, exact methods to solve SAT show an exponential worst-case run-time complexity. This limits the applicability of exact SAT solvers in many areas.
The SAT problem is a discrete, constrained decision problem [5]. A straightforward but inefficient procedure to solve it exactly is to enumerate all possible truth value assignments and check whether one satisfies the CNF. Many of the improved techniques that have been proposed to solve SAT problems eliminate one variable from the CNF at a time. There are two basic methods: splitting and resolution. Resolution was implemented in the original Davis-Putnam (DP) algorithm [4]. Splitting was first used in Loveland's modification to DP, the DPL algorithm [3]. In splitting, a variable is selected from the CNF and two sub-CNFs are generated by setting the variable to 0 and 1, respectively. The iterative application of splitting generates a search tree; a leaf of the tree denotes a full assignment of values to variables. Most practical SAT solvers use the splitting technique and combine it with backtracking. Backtracking traverses the search tree in depth-first order and thus avoids excessive memory requirements.
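The splitting-with-backtracking scheme described above can be sketched in a few lines of C++. This is a minimal illustration and not the implementation of any solver cited here: the signed-integer clause encoding, the fixed variable order, and the names `Val`, `evaluate`, `split`, and `solve` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <initializer_list>
#include <vector>

// Three-valued logic: 0, 1, or not yet determined (X).
enum class Val { Zero, One, X };

// Assumed encoding: literal +i means x_i, literal -i means NOT x_i (i >= 1).
using Clause = std::vector<int>;
using Cnf = std::vector<Clause>;

// Evaluate the CNF under a partial assignment (assign[0] is unused).
Val evaluate(const Cnf& cnf, const std::vector<Val>& assign) {
    bool undetermined = false;
    for (const Clause& clause : cnf) {
        bool satisfied = false;
        bool open = false;  // clause still contains an unassigned literal
        for (int lit : clause) {
            Val v = assign[std::abs(lit)];
            if (v == Val::X) { open = true; continue; }
            if ((v == Val::One) == (lit > 0)) { satisfied = true; break; }
        }
        if (satisfied) continue;
        if (!open) return Val::Zero;  // all literals false: clause violated
        undetermined = true;
    }
    return undetermined ? Val::X : Val::One;
}

// Split on variable `var`: try 0, then 1, and undo the choice on failure.
// Variables are selected in index order, so the recursion is depth-first.
bool split(const Cnf& cnf, std::vector<Val>& assign, std::size_t var) {
    Val status = evaluate(cnf, assign);
    if (status == Val::One) return true;    // partial assignment satisfies CNF
    if (status == Val::Zero) return false;  // conflict: backtrack
    // status is X, so some variable with index >= var is still unassigned.
    for (Val choice : {Val::Zero, Val::One}) {
        assign[var] = choice;
        if (split(cnf, assign, var + 1)) return true;
    }
    assign[var] = Val::X;  // both values failed: release the variable
    return false;
}

// Solve a CNF over variables x_1..x_n; on success `assign` holds a solution.
bool solve(const Cnf& cnf, std::size_t n, std::vector<Val>& assign) {
    for (const Clause& c : cnf)
        for (int lit : c)
            assert(lit != 0 && static_cast<std::size_t>(std::abs(lit)) <= n);
    assign.assign(n + 1, Val::X);
    return split(cnf, assign, 1);
}
```

Only the current path of the search tree is stored in `assign`, which is why backtracking keeps memory requirements linear in the number of variables.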
Existing software SAT solvers use a wide variety of backtracking methods and strategies for decision, deduction, and diagnosis. A very sophisticated example is the GRASP algorithm [11], which is also used as a reference in our work. The powerful strategies that are implemented by sophisticated SAT solvers reduce the number of variable assignments required to find a solution or to prove that there is no solution. However, these strategies can be computationally very expensive.
Recently, reconfigurable hardware architectures have been proposed to solve hard instances of the SAT problem [13] [12]. For each instance of a SAT problem, i.e., for each CNF, new hardware is generated on the fly, reflecting the particular structure of the CNF. These instance-specific architectures rely on fine-grained computing structures as provided by FPGA technology.
Design Tool Flow for Reconfigurable SAT Solvers
A design tool flow for instance-specific computation of SAT problems consists of three basic steps, as shown in Figure 1. The first step is a generator program that takes a SAT problem as input and generates the instance-specific logic description of this problem. The next step, the FPGA compilation, maps, places, and routes this description for a specific target FPGA family. The result of this step is a configuration bitstream. The third step, the backend, configures the reconfigurable resource, starts the computation, waits for completion, and extracts the results.
The two major issues in the design tool flow for reconfigurable SAT solvers are fast circuit generation and the use of predesigned and optimized FSMs. Depending on the complexity of the SAT problem, circuit generation can take orders of magnitude longer than the execution of the hardware algorithm itself. FSM optimization is crucial because, as simulations have shown, for most SAT problems the FSMs are the limiting factor in terms of hardware complexity.
Zhong et al. [14] presented a tool flow for a reconfigurable SAT solver where the instance-specific logic is described in VHDL. This description is then partitioned and mapped onto an array of Xilinx XC4000 series FPGAs by an IKOS logic emulation system. The advantage of this approach is that large FPGA systems can be targeted. The drawback is that the utilization of the FPGAs and the achieved clock frequencies are usually rather low.
A different approach is proposed by Rashid et al. [10]. SAT-specific CAD tools for synthesis, partitioning, placement, and routing are being developed for an open FPGA architecture, namely the Xilinx XC6200 family. The proposed design tool can generate either a logic description in VHDL, requiring commercial tools for FPGA compilation, or directly a configuration bitstream for the Xilinx XC6216.
The design tool flow of Suyama et al. [12] generates a logic description in the hardware description language SFL. This description is synthesized by the PARTHENON CAD tools and mapped onto a ZyCAD system, which consists of Xilinx XC4000 FPGAs.
In our approach, we generate a logic description in the form of a Xilinx FPGA netlist and use the commercial Xilinx design implementation tools for mapping, placement, and routing. We address the issue of fast circuit generation by controlling the placement of the FSMs with the object-oriented hardware design environment PAM-Blox/PamDC. FSMs are optimized with a commercial synthesis and optimization tool such as Synopsys FPGA Express II.
PAM-Blox: Object-Oriented Structural Design
PAM stands for Programmable Active Memories, described in [7]. PamDC was developed for the PAM project to offer circuit generation at the register-transfer level (RTL) in C++. For datapaths, hand designs are typically more efficient than compiled behavioral descriptions. In order to exploit the efficiency of hand design while simplifying the design process, PAM-Blox, introduced in [8], offers a bottom-up approach to compilation for custom computing machines. By using PAM-Blox, a powerful and highly optimized parameterizable library of hardware object generators, we add levels of abstraction that preserve optimal area and performance while simplifying the design process.
The first level, PamBlox, consists of parameterizable simple elements such as counters and adders. Automatic placement of carry chains and flexible shapes are supported. PaModules are more complex elements, possibly instantiating PamBlox. PaModules have fixed shapes and are usually optimized for a specific data width. Examples of PaModules are multipliers, Coordinate Rotations (CORDICs), and special arithmetic units for encryption.
PAM-Blox simplifies the design of datapaths for FPGAs by implementing an object-oriented hierarchy in PamDC/C++. With PAM-Blox, hardware designers can benefit from all the advantages of object-oriented system design that the software industry has learned to cherish during the last decade. Efficient use of function overloading, virtual functions, and templates makes PAM-Blox a very powerful and yet simple-to-use design environment.
By implementing PAM-Blox together with the actual design within a C++ hierarchy, we simplify the task of adapting library modules to the specific needs of the application. Therefore, PAM-Blox circuit generators are easily scalable and allow FPGA designers to share and reuse pieces of designs by writing new PamBlox and PaModules.
2 Combining PAM-Blox with FSM Synthesis
PAM-Blox has proven itself to be very useful for designing high-performance, data-intensive applications.
    FSM *myFSM = new FSM("cruise");
    ...
    // Declare the outputs and inputs of the state machine.
    myFSM->set_output("offOut");
    myFSM->set_output("readyOut");
    myFSM->set_output("setOut");
    myFSM->set_output("waitOut");
    myFSM->set_input("cruiseOn");
    myFSM->set_input("cruiseOff");
    myFSM->set_input("cruiseSet");
    ...
    // Declare the states with their encodings and select the initial state.
    myFSM->set_state("OFF", "0001");
    myFSM->set_state("READY", "0010");
    myFSM->set_state("SET", "0100");
    myFSM->set_state("WAIT", "1000");
    myFSM->set_init_state("OFF");
    // Declare the transitions: source state, target state, triggering input.
    myFSM->set_trans("OFF", "READY", "cruiseOn");
    myFSM->set_trans("READY", "SET", "cruiseSet");
    myFSM->set_trans("SET", "WAIT", "cruiseStop");
    myFSM->set_trans("WAIT", "SET", "cruiseResume");
    ...
    // Assign a location to each output.
    myFSM->set_output_loc("offOut", dx, dy0, FFX);
    myFSM->set_output_loc("readyOut", dx, dy0, FFY);
    myFSM->set_output_loc("setOut", dx, dy1, FFX);
    myFSM->set_output_loc("waitOut", dx, dy1, FFY);
    // Generate an instance of the state machine at position (x, y).
    myFSM->generate_instance("cruise1", x, y);
Figure 4: Code fragment showing the specification of the state machine CRUISE1 in C++. The state machine is implemented as a Moore automaton; thus, the state encoding corresponds to the specified order of outputs.
However, the tedious creation of control units remained a major drawback of the object interface.
Applications such as Boolean satisfiability require optimized state machines. In order to keep a unified specification of the circuit in C++ and still get maximal optimization of the state machine, we integrate the PAM-Blox design flow with Synopsys FPGA Express II. The tool flow is shown in Figure 2. The application circuit is described in C++, using the libraries PamBlox, PaModules, and PamFSM, the latter for specifying state machines. Running the design executable creates behavioral Verilog for the state machines. Synopsys FPGA Express II is called for synthesis, optimization, and technology mapping. The structural elements of the FSMs and the PAM-Blox design are merged on the Xilinx netlist level, possibly augmented with placement directives.
Figure 3 shows a simple state machine controlling the cruise control of a car. The PamFSM specification is presented in Figure 4. State machines can be instantiated multiple times and placed anywhere on the FPGA. Hand placement or clever automatic placement can significantly improve the performance of FPGA designs. In addition, placing the state machines is a simple and convenient way to determine the FPGA read-back positions of the state variables. Placement of state machines is a key feature of our environment, as it is not supported by conventional CAD tools such as Synopsys FPGA Express.
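To make the step from a C++ state machine description to behavioral Verilog more concrete, the following sketch emits a Verilog module for a Moore machine whose outputs are the state bits. It is a hypothetical stand-in written for illustration only: `MooreFsm`, `Transition`, and `toVerilog` are assumed names, the emitted Verilog style is one possible choice, and none of this reproduces the actual PamFSM library or the input expected by FPGA Express.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical FSM description; not the PamFSM API.
struct Transition {
    std::string from, to, input;  // go from `from` to `to` when `input` is high
};

struct MooreFsm {
    std::string name;
    std::vector<std::string> inputs;
    std::vector<std::string> states;     // states[0] is the initial state
    std::vector<std::string> encodings;  // one bit string per state
    std::vector<Transition> transitions;
};

// Emit behavioral Verilog: a state register and a case statement that
// selects the next state. The state register doubles as the Moore output.
std::string toVerilog(const MooreFsm& fsm) {
    assert(!fsm.states.empty() && fsm.states.size() == fsm.encodings.size());
    const std::size_t width = fsm.encodings.front().size();
    std::ostringstream v;
    v << "module " << fsm.name << "(clk, rst";
    for (const std::string& in : fsm.inputs) v << ", " << in;
    v << ", state);\n  input clk, rst";
    for (const std::string& in : fsm.inputs) v << ", " << in;
    v << ";\n  output [" << width - 1 << ":0] state;\n";
    v << "  reg [" << width - 1 << ":0] state;\n";
    for (std::size_t i = 0; i < fsm.states.size(); ++i)
        v << "  parameter " << fsm.states[i] << " = " << width << "'b"
          << fsm.encodings[i] << ";\n";
    v << "  always @(posedge clk)\n";
    v << "    if (rst) state <= " << fsm.states[0] << ";\n";
    v << "    else case (state)\n";
    for (const std::string& s : fsm.states) {
        v << "      " << s << ":";
        bool first = true;
        for (const Transition& t : fsm.transitions) {
            if (t.from != s) continue;
            v << (first ? " " : " else ") << "if (" << t.input
              << ") state <= " << t.to << ";";
            first = false;
        }
        if (first) v << " state <= " << s << ";";  // no outgoing edge: hold
        v << "\n";
    }
    v << "      default: state <= " << fsm.states[0] << ";\n";
    v << "    endcase\nendmodule\n";
    return v.str();
}
```

In a flow like the one described here, the returned text would be written to a file and handed to the synthesis tool; placement information is kept separately because it is applied at the netlist level.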
Reconfigurable Architectures for SAT
The block diagram of the basic architecture for solving SAT in hardware is shown in Figure 5. The circuit consists of three parts: i) an array of FSMs, ii) a datapath, and iii) a global controller. Each variable of the CNF corresponds to one FSM. The FSMs are connected in a one-dimensional array; each FSM can activate its two neighboring FSMs at the top and at the bottom. The architecture of the FSM is algorithm-specific; i.e., for a specific SAT algorithm, all the FSMs are identical. The datapath is a combinational circuit that takes the variables as input and computes outputs that are fed back to the FSMs.
An activated FSM assigns 0 to its variable and checks the resulting CNF value (Figure 6). If the CNF value is 1, the partial assignment already satisfies the CNF and the computation stops. If the CNF value is 0, the partial assignment has made the CNF unsatisfiable. In this case, the FSM assigns the complementary value to its variable. If the CNF value is X, the partial assignment has neither satisfied the CNF nor made it unsatisfiable. In this case, the FSM activates the next FSM at the bottom. If both value assignments have been tried, the FSM relaxes its variable by assigning X to it and activates the previous FSM at the top.
When the first FSM relaxes its variable and activates the global controller, the SAT problem is proven to be unsatisfiable. By this procedure, the array of interconnected FSMs implements chronological backtracking.
Most reconfigurable architectures that have been proposed for solving SAT in CNF share the basic block diagram shown in Figure 5. They differ in the modeling of the variables (2-valued, 3-valued, or 4-valued logic) and in the deduction strategy used, which is reflected in the actual implementation of the datapath and the FSM. For all architectures, the datapath and the number of FSMs are instance-specific. However, the global controller and the single FSM do not change with the CNF.
Table 1: hole benchmarks from the DIMACS benchmark suite. The software SAT solver GRASP was executed with parameters +bD +dDLIS on a Pentium-II/300MHz/128MB RAM PC platform running Linux.
Architectures that implement more powerful deduction strategies also have more complex datapaths and FSMs. The architecture presented here offers the least powerful deduction strategy and has the smallest hardware requirements. As discussed in [9], this basic architecture can be a viable option for smaller SAT problems or in cases with resource limitations. In this paper, we restrict our discussion to this basic SAT architecture, as the more complex alternatives lead to the same issues.
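The behavior of the FSM array in this basic architecture can be mimicked in software. The sketch below is a sequential simulation under stated assumptions, not a description of the actual circuit: the datapath is modeled by a three-valued `evaluate` function, FSM activation is modeled by an index and a direction flag, and all names (`runFsmArray`, `Result`, the signed-integer clause encoding) are chosen for this example. Loop iterations do not correspond to clock cycles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

enum class Val { Zero, One, X };

// Assumed encoding: literal +i means x_i, literal -i means NOT x_i (i >= 1).
using Clause = std::vector<int>;
using Cnf = std::vector<Clause>;

// Model of the datapath: three-valued CNF value for the current variables.
Val evaluate(const Cnf& cnf, const std::vector<Val>& vars) {
    bool undetermined = false;
    for (const Clause& clause : cnf) {
        bool satisfied = false;
        bool open = false;  // clause still contains a variable at X
        for (int lit : clause) {
            Val v = vars[std::abs(lit)];
            if (v == Val::X) { open = true; continue; }
            if ((v == Val::One) == (lit > 0)) { satisfied = true; break; }
        }
        if (satisfied) continue;
        if (!open) return Val::Zero;
        undetermined = true;
    }
    return undetermined ? Val::X : Val::One;
}

struct Result {
    bool satisfiable;
    std::vector<Val> vars;  // vars[i] is the value held by FSM i (index 0 unused)
};

// Simulate n chained FSMs, one per variable x_1..x_n.
Result runFsmArray(const Cnf& cnf, std::size_t n) {
    for (const Clause& c : cnf)
        for (int lit : c)
            assert(lit != 0 && static_cast<std::size_t>(std::abs(lit)) <= n);
    std::vector<Val> vars(n + 1, Val::X);
    std::size_t active = 1;  // FSM that currently holds control
    bool fromTop = true;     // true: freshly activated by the FSM above
    while (active >= 1) {
        if (fromTop) {
            vars[active] = Val::Zero;   // first try 0
        } else if (vars[active] == Val::Zero) {
            vars[active] = Val::One;    // then the complementary value
        } else {
            vars[active] = Val::X;      // both tried: relax the variable
            --active;                   // and activate the FSM at the top
            continue;
        }
        Val cnfValue = evaluate(cnf, vars);
        if (cnfValue == Val::One) return {true, vars};  // satisfied: stop
        if (cnfValue == Val::X) {
            ++active;                   // activate the FSM at the bottom
            fromTop = true;
        } else {
            fromTop = false;            // conflict: stay and try again
        }
    }
    // The first FSM relaxed its variable: the global controller reports UNSAT.
    return {false, vars};
}
```

Because an FSM only ever hands control to a direct neighbor, the order in which assignments are undone is exactly the reverse of the order in which they were made, which is what chronological backtracking means.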
Experimental Results
In this section, we report experimental results for the class of benchmarks hole taken from the DIMACS satisfiability benchmark suite [1]. The hole benchmarks are instantiations of the pigeon hole problem, formulated as a SAT problem in CNF. This benchmark class is well-suited for evaluation, as all the examples are unsatisfiable and hard to solve, i.e., software SAT solvers have long run-times. Table 1 lists the benchmark examples with their problem size, given by the numbers of variables and clauses, and the run-time of the software SAT solver GRASP [11] on a PC platform.
We compare the performance of the software implementation with the performance on our reconfigurable computing system. Our hardware prototype is implemented on the PC platform, this time running Windows NT 4.0. As the reconfigurable resource we use a Digital PCI Pamette board, equipped with 4 FPGAs of the type Xilinx XC4020.
We define the raw speed-up $S_{raw}$ of the reconfigurable SAT solver as $t_{sw}/t_{hw}$, the run-time ratio of the software and hardware SAT solvers. The overall run-time for computing a SAT problem in reconfigurable hardware consists of the hardware compilation time $t_{comp}$, the time for configuring the FPGA $t_{config}$, the actual hardware execution time $t_{hw}$, and the time for reading back and extracting the result $t_{read}$:
$$t_{overall} = t_{comp} + t_{config} + t_{hw} + t_{read} \qquad (1)$$
The overall speed-up $S_{overall}$ is then given by $t_{sw}/t_{overall}$. Table 2 presents the experimental results for the hole benchmarks. With our design tool flow, the time for FPGA configuration and read-back can be neglected compared to the hardware compilation time, which itself is strongly dominated by the Xilinx design implementation tools.
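The relation between the two speed-up figures can be made explicit by rearranging Equation (1). The following short derivation is added for illustration and uses only the definitions above; the shorthand $t_{ovh}$ for the combined non-execution overhead is introduced here and does not appear elsewhere in the paper.

```latex
\begin{align*}
t_{ovh} &= t_{comp} + t_{config} + t_{read}, \\
S_{overall} &= \frac{t_{sw}}{t_{overall}}
             = \frac{t_{sw}}{t_{hw} + t_{ovh}}
             = \frac{S_{raw}}{1 + t_{ovh}/t_{hw}}, \\
S_{overall} > 1 &\iff t_{sw} > t_{hw} + t_{ovh}
                 \iff t_{ovh} < (S_{raw} - 1)\, t_{hw}.
\end{align*}
```

In words, the overall speed-up approaches the raw speed-up only when the hardware execution time is large compared to the overhead, and the reconfigurable solver wins as soon as the overhead is smaller than the execution time saved relative to software.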
The examples hole6 to hole9 were mapped onto one Xilinx XC4020. For hole10, an FPGA of type XC4025/XC4028 is necessary. As we know the number of clock cycles for hole10 from a simulation of the SAT solver and the maximum clock frequency from running the FPGA compilation tools, we were able to determine the speed-ups for this benchmark exactly. The hardware cost in Table 2 suggests that hole10 can be mapped onto one XC4020. However, our placement strategy tries to minimize the distances between the FSMs and the datapath logic blocks. This prevents us from placing too many FSMs in an FPGA. With this strategy, we never ran into routing problems for the datapath logic and we were able to achieve a rather high performance for these irregular designs.
The hardware and software execution times are shown in Figure 7. The raw execution times of the hardware and software SAT solvers increase more rapidly with the problem size than the hardware compilation time. This leads to a cross-over point in the overall speed-up around hole9. For this benchmark, SAT solvers in instance-specific hardware and in software have similar overall run-times. For hole10 we achieve a speed-up of 7.408, which reduces the run-time from more than 2 hours in software to about 17 minutes in hardware. Table 2 shows that the raw speed-up $S_{raw}$ decreases with the problem size. There are two reasons for this. First, as simulations [9] have shown, a slightly decreasing speed-up seems to be an artifact of applying the presented deduction strategy to this particular SAT problem class. Second, larger problems result in more complex circuits, which lead to longer clock cycle times.
Conclusion and Future Work
Although hardware compilation for current FPGAs takes minutes to hours, instance-specific SAT solvers are promising for hard SAT problems, i.e., problems where software solvers show very long run-times. The sources of the potential speed-ups are the deduction steps, which show large amounts of fine-grained parallelism. This makes FPGAs with their fine-grained logic blocks an optimal target for instance-specific hardware accelerators for SAT problems.
Our design environment, which combines PAM-Blox/PamDC with commercial FSM synthesis and optimization, has proven itself to be very convenient for dynamic hardware generation. The entire application, including the code generation for the instance-specific circuits as well as the run-time functions for downloading bitstreams and reading back results, can be handled in a single representation in C++. This greatly simplifies the construction and usage of instance-specific systems.
Table 2: Results for running the hole benchmarks. For each benchmark, the software run-time $t_{sw}$, the hardware cost in configurable logic blocks (CLBs), the minimum clock period, the hardware execution time $t_{hw}$, the hardware compilation time $t_{comp}$, the overall run-time for the reconfigurable system $t_{overall}$, and the raw and overall speed-ups are shown.
The tool flow is still cumbersome, as it requires us to run the Xilinx place-and-route tools. A specialized place-and-route tool, together with the availability of larger FPGAs, would make the reconfigurable solution even more competitive.
In the future, we will extend the SAT architectures by assigning cost values to the variables. This will make it possible to solve the important class of minimum-cost problems. We plan to apply these reconfigurable SAT solvers to real-world problems in CAD.
In the current implementation, we use only one FPGA of the Pamette board. To employ arrays of FPGAs, we are investigating different partitioning techniques. These include mapping large SAT circuits onto several FPGAs and splitting large SAT problems into subproblems that can be run independently.
In addition, we are considering extending PamFSM to more general models, including Mealy machines and derivatives.
