WalkSAT (WSAT) is one of the best performing stochastic local search algorithms for the Boolean satisfiability (SAT) and maximum Boolean satisfiability (MaxSAT) problems. WSAT is very suitable for hardware acceleration because of its high inherent parallelism. Formal verification of digital circuits is one of the most important applications of SAT and MaxSAT. Structural knowledge such as logic gates and their dependencies can be derived from SAT/MaxSAT instances generated from formal verification of digital circuits, and such knowledge is useful for solving these instances efficiently. In this paper, we first discuss a heuristic that utilizes this structural knowledge when solving these problems with WSAT. Then, we show its implementation on an FPGA. The problem size in formal verification is typically very large, and most data have to be placed in off-chip DRAMs. In this situation, the acceleration by the FPGA is limited by the throughput and access latency of the DRAMs. In our implementation, data are carefully mapped onto the on-chip memory banks and off-chip DRAMs so that most data in the off-chip DRAMs can be accessed continuously using burst reads. Furthermore, a variable-way cache memory composed of the on-chip memory banks is used to hide the DRAM access latency: the head portion of a continuous read from the DRAMs is cached and supplied to the circuit until the rest of the data starts arriving via the burst read. We evaluate the performance of our proposed method by changing the configuration of the variable-way cache and the processing parallelism, and discuss how much acceleration can be achieved.
Introduction
Given a set of variables and a set of clauses that are disjunctions of the variables and their negations, the goal of the Boolean satisfiability (SAT) problem is to find a truth assignment to the variables in order to satisfy all clauses. The Maximum satisfiability (MaxSAT) problem is a variant of the SAT problem, and its goal is to find a truth assignment that satisfies as many clauses as possible. Many real-world applications can be encoded as SAT or MaxSAT problems. Formal verification, which is one of the most important applications of the SAT problem, is a mathematical method to verify the correctness of digital circuit systems.
WalkSAT (WSAT) and its variants [1], [2] are among the best performing Stochastic Local Search (SLS) algorithms for SAT and MaxSAT problems. These algorithms begin with a random truth assignment and the set of clauses unsatisfied under that assignment. Then, an unsatisfied clause is chosen, and one of its literals is flipped from false to true (a literal is a variable or its negation; when the literal is a negation of a variable, that variable becomes false by the flip) to satisfy the clause. Clauses that include the negation of the flipped literal are re-evaluated to update the set of unsatisfied clauses. This procedure is repeated until all clauses are satisfied. The heuristic used to choose an unsatisfied clause and a literal to flip in it is the most critical part of the algorithm. By choosing proper literals, larger problems can be solved efficiently in a shorter time. A SAT or MaxSAT problem is typically given as a propositional conjunctive normal form (CNF), and it is known that the functional dependencies among the clauses (namely, the gates in the circuit) can be reconstructed from the given CNF [7], [8], which helps to solve those problems efficiently [18].
In [22], [23], we proposed a new heuristic for the WSAT algorithm that utilizes the structural knowledge in a given CNF, and evaluated its performance using SAT-encoded formal verification problems of digital circuits. The size of formal verification problems is very large, and most of the data have to be placed in off-chip DRAMs. Under this condition, the performance of the implementation is limited by the throughput and access latency of the DRAMs. In [22], we showed an implementation method of our heuristic leveraging the high data transfer rate of DRAMs, and studied how much speedup was possible using the memory throughput and memory access latency as parameters. In [23], we evaluated its performance on an SoC in order to remove the hardware resource limitation of the FPGA and verify the maximum performance of the proposed implementation.
In [24], we introduced a variable-way cache (V-cache) memory on the FPGA in order to hide the DRAM access latency by using the limited amount of on-chip memory efficiently. The size of the data blocks that are frequently fetched from the DRAMs varies considerably in the WSAT algorithm. When the block size is small enough, all data in the block are cached in the V-cache using several entries on the same line according to the block size. When the block size exceeds the total number of entries on one line, only the head portion of the block is cached using all the cache entries on that line. In the former case, data in the cache memory are simply sent to the circuit. In the latter case, on the other hand, the DRAM access for the rest of the block data is started immediately, and until the first data from the DRAM arrives, the cached data are sent to the circuit.
In this paper, we first show that our heuristic is effective for MaxSAT problems as well as for SAT problems. Then, we describe its FPGA implementation in detail (mainly the data mapping onto on-chip and off-chip memory banks, and the V-cache), and evaluate the performance gain of our implementation using more benchmark problems. We also discuss how much parallelism can be exploited under the resource limitations of the largest FPGAs currently available, and how much acceleration can be achieved in this situation.
This paper is organized as follows. In Sect. 2, we define the SAT and MaxSAT problems, and we introduce the related work in Sect. 3. In Sect. 4, our heuristic is described and its performance is evaluated. Its FPGA implementation is described in Sect. 5 in detail. The performance of the FPGA implementation is shown in Sect. 6. Section 7 gives the conclusions and future work.
Satisfiability and Maximum Satisfiability Problems
The SAT problem is a well-known combinatorial problem. An instance of the problem is defined by a given Boolean formula F(x_1, x_2, ..., x_n), and the question is to determine whether there exists an assignment of binary values to the variables (x_1, x_2, ..., x_n) that makes the formula true. Typically, F is presented in conjunctive normal form (CNF), which is a conjunction of a number of clauses, where a clause is a disjunction of a number of literals. Each literal represents either a Boolean variable or its negation. For example, for a CNF formula that consists of four clauses over x_1, x_2, and x_3, an assignment such as {x_1, x_2, x_3} = {1, 1, 0} can satisfy all clauses.
The MaxSAT problem, on the other hand, is an optimization variant of the SAT problem, whose goal is to find an assignment to the variables that maximizes the number of satisfied clauses (i.e., minimizes the number of unsatisfied clauses). For instance, a CNF formula may admit no satisfying assignment, while an assignment such as {x_1, x_2, x_3} = {1, 1, 0} minimizes the number of unsatisfied clauses (see the sketch below).
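To make the two definitions concrete, the sketch below (Python, for illustration only) evaluates clauses under an assignment and counts the unsatisfied ones. The example clause sets are hypothetical stand-ins that merely have the properties described above; they are not the formulas of the original evaluation.

```python
# Illustrative sketch: evaluating CNF clauses under a truth assignment.
# A clause is a tuple of literals; literal +i means variable x_i, -i means its negation.
def count_unsatisfied(clauses, assignment):
    """Return the number of clauses not satisfied by `assignment` (dict: var -> bool)."""
    unsat = 0
    for clause in clauses:
        satisfied = any(assignment[abs(l)] == (l > 0) for l in clause)
        if not satisfied:
            unsat += 1
    return unsat

# Hypothetical four-clause formula over x1, x2, x3, satisfied by {x1, x2, x3} = {1, 1, 0}.
sat_example = [(1, 2), (-3, 1), (2, 3), (-2, -3)]
# Hypothetical unsatisfiable formula; the same assignment leaves exactly one clause unsatisfied.
maxsat_example = [(1, 2), (-1, 2), (1, -2), (-1, -2), (-3, 1)]

assignment = {1: True, 2: True, 3: False}
print(count_unsatisfied(sat_example, assignment))     # 0 -> all clauses satisfied (SAT)
print(count_unsatisfied(maxsat_example, assignment))  # 1 -> MaxSAT objective: minimize this count
```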
Related Work
Many algorithms and hardware solvers have been proposed to date. Algorithms for solving the SAT problem can be divided into two major groups: complete and incomplete. Complete algorithms can always find a solution, or conclude that the problem is unsatisfiable. Incomplete algorithms, on the other hand, do not guarantee that a solution will be found. When such an algorithm cannot find a solution, it is impossible to determine whether the problem is unsatisfiable or whether the algorithm simply failed to find the solution. Nevertheless, these algorithms are of particular interest, because they are very effective on many large problems, and can be used to solve the MaxSAT problem. WSAT [1], [2] is one of the best performing incomplete algorithms. Figure 1 shows the basic procedure of WSAT algorithms. The procedure begins with a random truth assignment to the variables. It searches for a solution by repeatedly selecting an unsatisfied clause at random, and then employing some heuristic to select a variable in that clause to flip (change its truth value from true to false or vice versa). The heuristic used to choose the literal to be flipped is the key of the search, and determines the performance of the algorithm. Many heuristics have been proposed for this purpose [1]-[4], [19], [20]. Among them, WSAT/SKC [1] is the basis of our proposed method. Here, we introduce the variable selection heuristic of WSAT/SKC:
For each variable in the randomly selected unsatisfied clause, count the number of clauses that are true under the current truth assignment but would become false if the flip were made (this count is called the break-value). If variables with a break-value of 0 exist, pick any of them. If not, with probability p, pick any variable; otherwise, pick a variable that gives the minimum break-value.
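The selection rule can be sketched as follows; break_value rescans the clause list for clarity, whereas a real WSAT implementation keeps per-clause counters of satisfied literals. The clause representation (tuples of signed variable numbers) follows the illustrative convention used above.

```python
import random

def break_value(var, clauses, assignment):
    """Number of currently satisfied clauses that would become unsatisfied if `var` were flipped."""
    broken = 0
    for clause in clauses:
        sat_lits = [l for l in clause if assignment[abs(l)] == (l > 0)]
        # The clause breaks only if its single satisfying literal is the one being flipped.
        if len(sat_lits) == 1 and abs(sat_lits[0]) == var:
            broken += 1
    return broken

def skc_pick_variable(unsat_clause, clauses, assignment, p=0.5):
    """WSAT/SKC heuristic: prefer zero-break variables, else random walk with probability p."""
    scores = [(break_value(abs(l), clauses, assignment), abs(l)) for l in unsat_clause]
    zero_break = [v for b, v in scores if b == 0]
    if zero_break:
        return random.choice(zero_break)
    if random.random() < p:
        return abs(random.choice(unsat_clause))
    return min(scores)[1]  # variable with the minimum break-value
```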
Several approaches to accelerate simple WSAT algorithms with hardware systems have been proposed [9]-[11]. However, the size of the problems that can be solved by those hardware solvers is very limited. This is because it is not easy to solve large real-world problems using only simple WSAT algorithms. Sophisticated SLS algorithms such as [3], [4], [19] have been proposed, but their control structures and the required memory access sequences are complex, and it is not easy to implement them in hardware. In [21], an FPGA solver for probSAT [20], one of the latest SLS algorithms, is proposed. In probSAT, selecting the next flip variable is based only on a probability distribution calculated by an elementary function. Using a Xilinx XC7V690T, this FPGA solver achieves up to 99 times speedup over software running on an Intel Core-i5 4670K with 32GB of main memory. However, the performance of this solver on formal verification problems is not clear, because the benchmark problems used in the evaluation are not based on real-world applications, and their sizes are very small (up to 250 variables and 1065 clauses).
Several approaches based on complete algorithms have been proposed [12] , [13] . The performance of recent complete algorithms [14] - [16] has been significantly improved by introducing several techniques to prune the search space [14] - [17] . However, these techniques also require complicated control structure, which makes it difficult to handle large real-world problems in hardware solvers.
Outline of Our Algorithm
In this section, we describe our heuristic for the WSAT algorithm that is designed for formal verification problems of digital circuits.
Gates and Dependencies
Formal verification is a mathematical method of verifying hardware or software systems. In SAT/MaxSAT-encoded formal verification of hardware systems, a design of the hardware (typically at the gate level) together with its verification specification is translated into a CNF instance. Here, we describe several logic gates that mainly appear in such CNFs.
In the following discussion, we follow the terminology used in [7] and [18] . First, we consider the following CNF formula:
(¬x_1 ∨ y) ∧ . . . ∧ (¬x_n ∨ y) ∧ (x_1 ∨ . . . ∨ x_n ∨ ¬y)
This formula becomes true when y is true and at least one of x_1, . . . , x_n is true, or when y is false and x_1, . . . , x_n are all false. Namely, this formula represents an OR gate y = ∨(x_1, . . . , x_n). In the same way, the following formula: (x_1 ∨ ¬y) ∧ . . . ∧ (x_n ∨ ¬y) ∧ (¬x_1 ∨ . . . ∨ ¬x_n ∨ y) means an AND gate y = ∧(x_1, . . . , x_n). In the formulas above, y is the output variable of the gate, and x_1, . . . , x_n are its input variables. Another commonly used gate is XOR y = ⊕(x_1, x_2), which can be described as follows: (¬x_1 ∨ ¬x_2 ∨ ¬y) ∧ (x_1 ∨ x_2 ∨ ¬y) ∧ (x_1 ∨ ¬x_2 ∨ y) ∧ (¬x_1 ∨ x_2 ∨ y)
An XNOR gate y = ⇔(x_1, x_2) can be represented as follows: (¬x_1 ∨ ¬x_2 ∨ y) ∧ (x_1 ∨ x_2 ∨ y) ∧ (x_1 ∨ ¬x_2 ∨ ¬y) ∧ (¬x_1 ∨ x_2 ∨ ¬y)
Here, note that it is not possible to determine the output variable from the clauses for XOR and XNOR.
Basically, by finding sets of clauses corresponding to any of the patterns above, we can find the gates in the CNF (more sophisticated approaches are necessary to detect more gates, as described in [7], [8]). An output variable of a gate becomes an input variable of the next gate. In order to make the data dependencies among the gates clear, we define "internal gate", "external gate" and "independent variable". An internal gate is any gate that can be recognized as ∨, ∧, ⇔ or ⊕, namely a set of clauses which corresponds to any of the patterns described above. An external gate is a clause which is not part of any internal gate. No literal included in an external gate is an input signal to other gates, and these literals are considered to correspond to the output signals of the circuit [18]. An independent variable is a variable that is never found in the internal and external gates; namely, the independent variables can be considered to be the input signals of the circuit. Figure 2 shows an example of gates and their data dependencies. External gates have no output variables to other parts of the given CNF, and the status of each external gate becomes an output of the circuit. In our implementation, AND/OR-type gates are detected first, and the inputs/outputs of XOR and XNOR gates are inferred from the inputs/outputs of the AND and OR gates using a simple back-tracking method, so that no input/output conflicts arise among the gates (the input/output variables of XOR and XNOR gates cannot be determined from the clauses of the gate alone, while those of AND and OR gates can). A variable which is not the output variable of any gate then becomes an "independent" variable.
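The basic pattern matching can be illustrated as in the sketch below, which recognizes only the n-input OR encoding shown earlier (assuming, for simplicity, non-negated gate inputs). It is a simplified stand-in for the extraction techniques of [7], [8], not their actual procedure.

```python
def match_or_gate(clause_group):
    """Check whether a set of clauses encodes y = OR(x_1, ..., x_n).

    Clauses are tuples of signed integers (+v / -v).  Returns (y, inputs) if the
    group matches the OR pattern, otherwise None.  Illustrative simplification;
    real extraction also handles AND, XOR and XNOR patterns."""
    long_clauses = [c for c in clause_group if len(c) > 2]
    binary_clauses = [c for c in clause_group if len(c) == 2]
    if len(long_clauses) != 1:
        return None
    long_clause = long_clauses[0]
    # Candidate output: the single negated literal in the long clause (x_1 v ... v x_n v -y).
    neg_lits = [l for l in long_clause if l < 0]
    if len(neg_lits) != 1:
        return None
    y = -neg_lits[0]
    inputs = [l for l in long_clause if l > 0]
    # Each input x_i must also appear in a binary clause (-x_i v y).
    expected = {tuple(sorted((-x, y))) for x in inputs}
    found = {tuple(sorted(c)) for c in binary_clauses}
    if expected == found and len(binary_clauses) == len(inputs):
        return y, inputs
    return None

# Example: y = x5 = OR(x2, x3) encoded as three clauses.
print(match_or_gate([(-2, 5), (-3, 5), (2, 3, -5)]))   # (5, [2, 3])
```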
Heuristic for Formal Verification Problems of Digital Circuits
Figure 3 shows the procedure of our algorithm [22]-[24]. In this procedure, the parameters MAX-TRIES (the number of new search sequences) and MAX-FLIPS (the maximum number of flips per try) are used to control the maximum run-time of the algorithm. Given a random truth assignment to the variables, several clauses become unsatisfied. This unsatisfiability of the clauses moves toward the output side of the circuit (namely, toward the external gates) by flipping the output variables of the gates (forward search). On the other hand, by flipping the input variables of the gates, the unsatisfiability moves toward the input side (toward the independent variables) (backward search). The scenario of our search is as follows:
1. Flip the output variables of the gates preferentially (forward search).
2. All gates except for some external gates become satisfied.
3. Starting from the unsatisfied external gates, continue to flip one of the input variables of the gates until some of the independent variables are flipped (a series of clauses leading to one of the independent variables which make the external gate false is tracked) (backward search).
4. Repeat Steps 1 to 3.
According to our observations, by flipping the literals that correspond to the output signals of the gates preferentially, the search converges very quickly to a local minimum. When the search is stuck in the local minimum, it is possible to get out of the minimum by flipping the input signals of the gates preferentially. Thus, by repeating these two phases, better local minima can be found efficiently. In Fig. 3, p decides the probability of choosing the output signals to be flipped, and it is automatically adjusted by a noise parameter tuning mechanism [3] considering the period of being stuck in a local minimum.
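A condensed, illustrative sketch of the two-phase flip selection is given below. It assumes that gate extraction has annotated each clause with the output variable of its gate (None for external gates); the exact role of the noise probability p and its automatic tuning [3] follow Fig. 3 and are only approximated here.

```python
import random

def pick_flip_variable(unsat_clause, output_var_of, independent_vars, forward_phase, p):
    """Two-phase flip selection (illustrative).  `output_var_of` maps a clause to the
    output variable of the gate it belongs to (None for external gates)."""
    out_var = output_var_of.get(unsat_clause)
    candidates = [abs(lit) for lit in unsat_clause]
    if forward_phase and out_var is not None:
        # Forward search: flip the gate's output variable preferentially,
        # falling back to a random literal with probability p (noise).
        return random.choice(candidates) if random.random() < p else out_var
    # Backward search: flip input-side variables, eventually reaching the independent variables.
    inputs = [v for v in candidates if v != out_var] or candidates
    independents = [v for v in inputs if v in independent_vars]
    return random.choice(independents or inputs)
```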
Performance of the Proposed Heuristic
We compare the performance of our heuristic with four algorithms, among them Sparrow [19], one of the latest SLS algorithms, and RSAPS [4], one of the best performing WSAT variants for real-world problems. Table 1 shows the number of instances in each benchmark suite for which correct solutions could be found. For example, '1' in the '20/20' column of LIVENESS-SAT-1.0 means that the correct solutions could be found for one instance in all of the 20 tries, and '4' in '19/20-15/20' means that the correct solution could be found for 4 instances in 15 to 19 tries out of the 20. For the two unsatisfiable suites, solutions that leave only one clause unsatisfied are considered as their correct solutions. As shown in this table, our heuristic is comparable to RSAPS on the MaxSAT problems and superior to all other heuristics on the SAT problems.
Hardware Architecture
In our system, first, (1) data arrays are generated from the given CNF, (2) the data dependencies in it are analyzed, and (3) a random truth assignment for the variables and the initial set of unsatisfied clauses are generated on the host computer. Then, the data arrays, the random truth assignment and the set of unsatisfied clauses are downloaded to the FPGA, and the solution of the problem is searched for on the FPGA.
Data Arrays

Processing Sequence
On the host computer, the data arrays are prepared as described above. On the FPGA, an unsatisfied clause c is repeatedly selected, and the literal l_f to be flipped is chosen from c: with probability p, l_f is any of the literals in c; with probability 1 − p, l_f is the literal that represents the output signal of c.
Parallelism in the Circuit
The parallelism in the algorithm is very high. However, the throughput of the off-chip DRAMs, namely, the number of words (L) that are given in parallel by the off-chip DRAM interface, limits the circuit parallelism, because most parts of the tables have to be placed in the off-chip DRAMs (in the following discussion, we call the L words given by the DRAM interface in parallel "L-words").
Suppose that the circuit on the FPGA runs at f_c MHz. The data bus of the DDR3-SDRAM operates at 4 × f_c MHz and transfers data with double data rate operation. Therefore, one DDR3-SDRAM bank with a 32b word provides 8 × 32b of data to the circuit in parallel. Then, up to 32 words can be given to the circuit in parallel when it has four DRAM banks of 32b width (many FPGA boards, e.g. the Xilinx VC709, have a DRAM interface with this configuration). In our circuit, most of the entries of the tables can be represented with up to 26b, and it is reasonable to use a 32b word for each entry of the tables except for some special cases in which 64b words are used. Figure 5 shows a block diagram of our circuit when we have four off-chip DDR3-SDRAM banks of 32b word width. The circuit runs at 1/8 of the DRAM data transfer rate. By deciding how to map the tables into the DRAMs, the architecture of the circuit is almost fixed. In the mapping in Fig. 5, the four DRAM banks work as one large bank, and 32 (= 4 × 8) words are given from the DRAM interface to the circuit in parallel. As shown in Fig. 5, the width of c_sat_tbl[] is changed from 4b to 18b (4b, 6b and 18b for clauses with up to 15, up to 63, and more literals, respectively) in order to use the block RAMs efficiently. In our tested benchmarks, the maximum number of k-arg clauses is 408055 (for iq54_a). With this variable-width approach, entries for up to 544K k-arg clauses can be supported with 192 18Kb block RAMs, which is enough for our target FPGA.
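As a quick sanity check of the figure L = 32 under this configuration (N_bank below denotes the number of 32b DRAM banks and is notation introduced only for this illustration):

\[
L = N_{\mathrm{bank}} \times \frac{2 \times 4 f_c}{f_c} = 4 \times 8 = 32 \quad \text{words per FPGA clock cycle},
\]

where the factor 2 comes from the double data rate operation and 4 f_c is the data-bus clock frequency.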
On this architecture, up to 32 unsatisfied clauses may be generated at the same time (though far fewer on average). These unsatisfied clauses are first put into the FIFOs.
Parallel Evaluation of the Clauses
In our circuit, the off-chip DRAMs are accessed as one larger memory bank to simplify the memory access control, and up to L words are given to the circuit in parallel. As described above, the processing of a clause list is repeated #args + 2 times and occupies most of the computation time. Therefore, the parallelism in our circuit mainly lies in the processing of the clause lists given from c_list_tbl[].
Each clause list consists of three parts: (1) 2-arg clauses, (2) 3-arg clauses, and (3) k-arg clauses. As described in Sect. 5.4, v_tbl[] is used for evaluating 2- and 3-arg clauses, while c_sat_tbl[] is used for evaluating k-arg clauses. These three types of clauses are arranged in the L-words of a clause list so that the maximum parallelism can be achieved. In order to evaluate the argument literals of 2- or 3-arg clauses, we need to refer to v_tbl[] to obtain the truth values of the variables used as the argument literals. Figure 6 shows the details of v_tbl[] (L = 8 for simplicity). v_tbl[] consists of 64 banks, each of which includes a block RAM, an address encoder and two decoders. The truth values of the variables are stored in one of the 64 block RAMs. To read the truth value of a variable, the bank number (to choose one of the block RAMs) and the address within the block RAM are required. In our implementation, the least significant six bits of the variable number are used as the bank number (called the bank-index), and the remaining bits are used as the address of the block RAM (called the bank-address).
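This address split can be written down directly; the constants below follow the 64-bank organization described above.

```python
NUM_BANKS = 64           # number of v_tbl[] banks, each backed by one block RAM
BANK_BITS = 6            # log2(NUM_BANKS): least significant bits select the bank

def split_variable_number(var_num):
    """Split a variable number into (bank-index, bank-address) for a v_tbl[] lookup."""
    bank_index = var_num & (NUM_BANKS - 1)   # low 6 bits choose the block RAM
    bank_address = var_num >> BANK_BITS      # remaining bits address within that block RAM
    return bank_index, bank_address

# Example: variable number 1027 maps to bank 3, address 16.
print(split_variable_number(1027))   # (3, 16)
```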
Argument literals of 2- or 3-arg clauses in the L-words given from c_list_tbl[] are divided into two groups according to their positions in the L-words (even and odd). We call these groups literal-groups. The two argument literals of each 3-arg clause are always assigned to different literal-groups, so that the values of the two argument literals can be accessed at the same time using the dual-port access of the block RAMs. Clauses whose argument literals require accesses to the same block RAM cannot be assigned to the same L-word; they are placed in different L-words to avoid memory access conflicts.
The variable numbers of the argument literals in the two literal-groups are broadcast to all of the banks along with their positions in the literal-groups (called source-positions), as shown in Fig. 6. Each bank has its own bank number (for example, 0b000000 is shown in Fig. 6), and in each bank, the bank-indexes of the broadcast variables are compared with its bank number in the address encoder. Through this comparison, the bank-indexes and source-positions of the variables that do not match the bank number are masked to zero, and those in the same literal-group are ORed. Then, the bank-address of the ORed result is used to access the block RAM. At the same time, the source-position of the ORed result is decoded, and the truth value from the block RAM is masked by the decoder outputs, yielding the results (4 results for each literal-group in Fig. 5). The scheduling of these three kinds of clauses for avoiding bank conflicts is executed on the host computer in advance. There is no data dependency in this scheduling, and most of the L-words can be filled.
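Since the conflicts are resolved statically, the host-side scheduling can be as simple as a greedy packing pass. The sketch below is an illustrative simplification that only enforces the bank-conflict constraint; the actual scheduler also has to respect the even/odd literal-group assignment, the k-arg clauses, and the slot layout of the L-words.

```python
def pack_clauses(clauses, num_banks=64, slots_per_word=8):
    """Greedy packing of clauses into L-words while avoiding v_tbl[] bank conflicts.
    `clauses` is a list of argument-literal lists; returns a list of L-words,
    each being a list of clauses whose argument literals use distinct banks."""
    words = []   # each entry: (set of occupied bank-indexes, list of packed clauses)
    for clause in clauses:
        banks = {abs(lit) % num_banks for lit in clause}
        for occupied, packed in words:
            if len(packed) < slots_per_word and not (banks & occupied):
                occupied |= banks        # reserve the banks used by this clause
                packed.append(clause)
                break
        else:
            words.append((set(banks), [clause]))
    return [packed for _, packed in words]
```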
Variable-Way Cache Memory
As mentioned in the previous section, parallel processing of the clause lists is the main source of the performance gain. L = 32 can be considered a good balance point between performance and hardware resource usage, because the average length of the clause lists is 19 to 36. However, for each access to a clause list, the idle time caused by the DRAM access latency becomes the main factor that limits the system performance.
Let f be the operational frequency of the FPGA, RCD the number of cycles of the row-address-to-column-address latency (the cycles required between the activation of a row and reading the first data from that row) and T_RCD its time, CL the number of cycles of the CAS latency and T_CL its time, and T_IF the latency of the DRAM interface. The total access latency is given by

T_d = T_RCD + T_CL + T_IF.

Here, note that T_RCD equals T_CL. The number of FPGA clock cycles in T_d is given by

C_d = ⌈T_d × f⌉.

Figure 7 shows the structure of the F-cache, in which the L words on one cache line are given to the circuit in parallel. In Fig. 7, the data width of one word is 21b, which is wide enough to represent the literal numbers and clause numbers of our target benchmarks. By holding C_d lines in each block (k = C_d), the access latency of the DRAMs can be completely hidden, as shown at the lower right corner of Fig. 7.
To realize this data access sequence, the F-cache is looked up first. If it hits, the clause list in the DRAMs is read starting from the (L × C_d)-th word to obtain the uncached part. Otherwise, the clause list is read from the beginning to obtain the whole list. The F-cache is looked up using log_2 D − 1 bits of the variable number together with its negation bit as the cache index, and the remaining bits as the tag. The reason for using the negation bit as part of the index is to avoid cache conflicts between x and ¬x. This F-cache can easily be extended to a set-associative cache.

This simple approach, however, does not work well. Table 3 shows the ratio of clause lists that can be read within a given number of FPGA clock cycles, excluding the access latency, for four benchmarks (see Table 4 for their problem sizes). In Table 3, for the benchmark 'bug1', 92.2% of the clause lists can be given to the FPGA in only one FPGA clock cycle, and 99.5% of the clause lists can be given within 16 FPGA clock cycles. This means that most of the clause lists can be read within the DRAM access latency. Furthermore, 80 to 90% of them can be stored in one line of the F-cache in Fig. 7, which means that the remaining k − 1 lines of the block are wasted.

Figure 8 shows our approach (called V-cache) to this problem. In Fig. 8, the cache memory is constructed using 8 banks (8-way set associative). Each bank consists of N blocks, and each block has k lines of L-word width. As in the F-cache, log_2 N − 1 bits of the variable number are used as the cache index (block address), and the remaining bits as the tag. As shown in Fig. 8, if the clause lists 'A' and 'B' are short enough, only one block of the V-cache is assigned to cache each of them in its entirety. In this case, the set of blocks that store 'A' and 'B' and the other blocks on the same block address works as an 8-way set-associative cache. If a clause list is very long, on the other hand, up to its first L × C_d words have to be cached. When the clause list is longer than the size of one block (like 'C' and 'D'), all blocks on the same block address are used to cache it, and this set of blocks works as a direct map.
In the V-cache, the target clause list is first looked up, and if the cache hits, a flag in the cache block (the DRAM access flag shown in Fig. 9) is checked to decide whether to start the DRAM access or not. When the whole clause list is cached, the DRAM access is not started; otherwise, the DRAM access is started to read the uncached part. Figure 9 shows the detailed structure of the V-cache (maximum associativity = 8). In Fig. 9, 'data' indicates the data caching field. 'cache mode' represents whether the cache blocks are used as a direct map or as set associative. 'lru' contains the usage history of the corresponding data; all of the lrus are initialized to zero. 'valid bit' represents whether the cached data on the corresponding line are valid or not.
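The lookup-and-fetch decision described at the beginning of this subsection can be summarized as in the following sketch. The helper objects (vcache.lookup, vcache.fill, dram.burst_read) are hypothetical and only indicate where the cached head portion overlaps with the DRAM burst for the uncached tail.

```python
def fetch_clause_list(vcache, dram, literal, L, C_d):
    """Return the clause list for `literal`, overlapping cached data with the DRAM burst.
    Illustrative control flow only; the helpers are hypothetical."""
    entry = vcache.lookup(literal)
    if entry is None:
        # Miss: read the whole clause list from DRAM and cache its head portion.
        data = dram.burst_read(literal, start_word=0)
        vcache.fill(literal, data[:L * C_d])
        return data
    if not entry.dram_access_flag:
        # The whole clause list fits in the cache: no DRAM access is started.
        return entry.data
    # Only the head portion is cached: start the burst for the uncached part immediately.
    # In hardware, the cached head is fed to the circuit while the burst is in flight;
    # here this overlap is modeled simply by concatenation.
    tail = dram.burst_read(literal, start_word=L * C_d)
    return entry.data + tail
```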
Cache Block Replacement
The behavior of the cache replacement depends on the cache mode. In the direct map mode, blocks for the same index are simply overwritten by the newly cached data (before caching the new data, all of the valid bits for the same index are set to zero in order to invalidate the old data; whenever the cache mode changes, all of the valid bits for the same index are likewise cleared before caching new data). In the set-associative mode, on the other hand, the cache replacement follows an LRU policy. When a read or write access to a block occurs, the lrus whose values are smaller than that of the accessed block are incremented, and the lru of the accessed block is set to zero. Hence, the block with the largest lru value is the least recently used among those for the same index. The maximum value of lru is at most N_a − 1, where N_a is the maximum associativity; therefore, the data width of lru is log_2 N_a. When a replacement becomes necessary, the block with the largest lru value is selected. Then, the lru of the replaced block is set to zero, and those of the other blocks are incremented.
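The per-index LRU bookkeeping described above amounts to a few lines; lru below is the list of counters for the blocks on one index, and the example assumes a maximum associativity of N_a = 4.

```python
def on_hit(lru, block):
    """Read/write hit: counters smaller than the accessed block's counter are
    incremented, and the accessed block's counter is reset to zero."""
    old = lru[block]
    for i in range(len(lru)):
        if lru[i] < old:
            lru[i] += 1
    lru[block] = 0

def on_replace(lru):
    """Replacement: the block with the largest counter (least recently used) is the
    victim; its counter is reset to zero and all other counters are incremented."""
    victim = max(range(len(lru)), key=lambda i: lru[i])
    for i in range(len(lru)):
        if i != victim:
            lru[i] += 1
    lru[victim] = 0
    return victim

# Example: four consecutive fills use blocks 0..3 in turn; after a subsequent
# hit on block 0, block 1 becomes the next replacement victim.
lru = [0, 0, 0, 0]
fills = [on_replace(lru) for _ in range(4)]   # [0, 1, 2, 3]
on_hit(lru, 0)
print(fills, on_replace(lru))                 # [0, 1, 2, 3] 1
```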
Performance Evaluation
The circuit size for L = 32 without the F- or V-cache is 90K LUTs, 42K flip-flops (FFs) and 303 18Kb block RAMs. These block RAMs are used mainly for v_tbl[], c_sat_tbl[], and usc_buf[]. The size of the control logic for the cache is 17K LUTs, 2.6K FFs and 54 18Kb block RAMs (mainly used for the tags and the lrus) when its maximum associativity, k and L are 16, 2 and 32, respectively. In total, 106.5K LUTs, 45K FFs and 387 block RAMs are required, which fits in all of the devices of the Xilinx Virtex-7 FPGA series. The size of the data field of the F- or V-cache is limited by the amount of block RAM on the FPGA. XC7V1140T, the largest FPGA in the Virtex-7 series, has 3760 18Kb block RAMs; therefore, more than 3000 18Kb block RAMs can still be utilized. In the following evaluation, 3072 18Kb block RAMs are used for the data field of the F- or V-cache.
A hardware simulator is used for the following performance evaluation in order to facilitate changing the configuration of the F-/V-cache and the circuit parallelism. This simulator simulates the hardware at the logic circuit level and counts the number of FPGA clock cycles. It also counts the number of accesses to the off-chip DRAMs, calculates the total DRAM access latency, and converts it to a number of FPGA clock cycles. In the simulation, four DRAM banks of 32b word width are used as one bank, as described in Sect. 5. We assume the DRAM throughput of DDR3-2133 (11-11-11), which is the fastest speed grade in the JEDEC standard. Each DRAM has internal memory banks, and data with continuous addresses can be read out by burst reads. The operational frequency of the FPGA is 1/8 of the memory data transfer rate of the DRAMs. The DRAM interface latency is assumed to be 100 nsec.
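Under these parameters, a back-of-the-envelope estimate of T_d and C_d (as defined above) can be obtained as follows; the DDR3-2133 timing values are assumptions based on the 11-11-11 speed grade, and the simulator's exact cycle counts may differ.

```python
import math

# Rough estimate of the DRAM access latency and C_d for DDR3-2133 (11-11-11).
mem_clock_hz = 2133e6 / 2          # I/O clock of a DDR3-2133 device (double data rate)
t_rcd = 11 / mem_clock_hz          # row-to-column delay: 11 memory-clock cycles
t_cl = 11 / mem_clock_hz           # CAS latency: also 11 cycles (T_RCD = T_CL)
t_if = 100e-9                      # assumed DRAM interface latency (100 nsec, as above)
t_d = t_rcd + t_cl + t_if          # total access latency

fpga_clock_hz = 2133e6 / 8         # FPGA runs at 1/8 of the DRAM data transfer rate
c_d = math.ceil(t_d * fpga_clock_hz)
print(f"T_d = {t_d * 1e9:.1f} ns, C_d = {c_d} FPGA clock cycles")
# -> roughly 120 ns and about 33 cycles, consistent with hiding the latency
#    with on the order of 32 cached lines per clause list.
```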
Performance Comparison Over Software
First, we evaluate the system performance against software using the 20 benchmarks used in Sect. 4.3. In this experiment, L, k and the maximum associativity of the V-cache are fixed to 32, 2 and 16, respectively, based on the experiment described in the next subsection. Table 4 shows the results of this evaluation (averages of 20 runs). N_v and N_c are the numbers of variables and clauses, respectively. 'ratio' shows the ratio of the 20 runs (with 2^27 flips at a maximum) in which the correct solutions could be found. The two ratios on each line should be the same because the same algorithm is used, but they differ because they depend on random numbers. #flips is the number of flips required to find the solutions. t_ave is the average total execution time (the processing time on the host computer and the time for data downloading are included), and tf_ave is the average execution time per flip. X_f shows the speedup per flip over our algorithm on a CPU (Intel Core i7-3770 3.4 GHz with 8GB main memory), and X_t shows the speedup of the total execution time (X_t does not necessarily give the true performance because it depends on random numbers). The execution time of our algorithm on the CPU is used as the base of this comparison, because it finds the solutions faster and more frequently than the other WSAT variants, as described in Sect. 4.3. X_VC is the speedup by the V-cache (the ratio between X_f with no cache and with the V-cache). Hdn is the ratio of the idle time that could be hidden by the V-cache. Runs that failed to find the solutions are not included in these values.
As shown in Table 4, our system with the V-cache achieves 1.16 to 8.07 times speedup over the CPU in execution time per flip, and the improvement by the V-cache is up to 26% (X_VC). The two lowest speedups, 1.16 and 2.08, are given by the two smallest benchmarks, and the speedup for the other problems is approximately 3 to 8, which is fast enough considering that the performance is limited by the DRAM throughput. The idle time hidden by the V-cache is about 50%. This is not very high, but it is reasonable considering the size ratio of the V-cache and c_list_tbl[].
For bug9 and bug10, the 'ratio' values of our FPGA solver seem to be inferior to those of the software. We are still investigating the reason, but it probably comes from the method of selecting an unsatisfied clause, the accuracy of the random numbers in our solver, and the problem size. When selecting an unsatisfied clause, our FPGA solver selects from usc_buf[] on the block RAMs, and gradually moves unsatisfied clauses from the DRAMs when usc_buf[] on the block RAMs becomes empty. This may cause an unfair selection of unsatisfied clauses. In addition, the lower accuracy of the random number generation may cause the degradation (our solver uses a linear congruential generator, whereas the software solver uses the Mersenne Twister). When the problem size becomes larger, the influence of these factors may no longer be negligible.
Performance by Changing V-Cache Configuration
For deciding the optimal configuration of the V-cache (k and the associativity), we evaluated the performance of the F-cache and the V-cache using the four benchmarks used in Table 2. Table 5 shows the cache hit ratio (hit), the ratio of the idle time that could be hidden (hdn), and the performance gain (X_f, which is the same as X_f in Table 4). The values in Table 5 are the averages of 20 runs with 2^27 flips at a maximum. The left half of Table 5 shows the results of the F-cache. The hit ratio decreases as k is increased (this is because the number of entries in the cache decreases as k is increased). The performance, however, becomes better as k is increased. The peak (shown in bold) is given when k = 32 in most cases, and it does not depend on the associativity. In the F-cache, for example, when k = 1, only one line (up to 32 words) of a clause list can be cached. This works well for short clause lists that can be stored in one line, but does not work well for long clause lists, because their access latency can be hidden by only one clock cycle. When k = 32, the access latency is hidden by 32 clock cycles, and this showed the best balance between the cache hit ratio and the access latency hiding.
The right half of Table 5 shows the results of the V-cache. Larger k gives a lower hit ratio, as with the F-cache. Unlike the F-cache, higher associativity also brings a lower hit ratio. In the V-cache, several entries on the same horizontal line of the cache memory are used to cache a long clause list, as shown in Fig. 8. With higher associativity, more entries are used to cache a longer clause list, and all short clause lists that have already been cached in these entries are wiped out. This is the reason for the lower hit ratio at higher associativity. However, the lower hit ratio does not imply a lower performance gain, as shown in the table (for the same value of k). The best performance is given by the V-cache when k = 2 and the associativity is 16. In this case, 32 lines (32 × L words) can be cached for each long clause list, which is enough to hide the DRAM access latency.
The interesting point here is that in both the F- and the V-cache, the maximum number of lines to be cached to achieve the highest performance is the same (32 in this experiment). In this situation, the hit ratio and hdn of the V-cache are always 5 to 10% higher than those of the F-cache, and the V-cache therefore always outperforms the F-cache. This indicates that the V-cache can utilize the same cache capacity more efficiently than the F-cache.
Performance with More Parallelism
The latest FPGAs provide more hardware resources than the Virtex-7 series. Therefore, we can increase the performance in two ways: executing different problems in parallel by replicating the circuit, or accelerating the same problem by enlarging L. Here, we consider the second approach. Xilinx XCVU13P, the largest FPGA in the Virtex UltraScale+ series, has about three times as many logic cells and six times as many on-chip memory banks as XC7V1140T. Its number of High Performance I/Os (HP I/Os), which are required for high-speed transfer between the DDR3-SDRAMs and the FPGA, is slightly smaller than that of XC7V1140T, but still large enough to support L = 96. Figure 10 shows the speedup of the four benchmarks when changing L from 32 to 96. The improvement obtained by enlarging L is not very high, because the average length of the clause lists is at most 36 in the tested benchmarks, as shown in Table 2. However, the performance continues to improve as L is enlarged, except for iq3_C1. This is probably because the length of the frequently fetched clause lists is longer than the average; to confirm this, we need to analyze in detail the lengths of the clause lists that are actually fetched during the search. The performance improvement obtained by using a larger FPGA is thus not very high, but it can be effective when a larger FPGA may be used.
Conclusions and Future Work
In this paper, we have presented an FPGA solver for large SAT/MaxSAT-encoded formal verification problems of digital circuits. To solve large real-world problems efficiently on an FPGA, we first proposed a new heuristic for WSAT algorithms that is simple enough to be realized as a hardware circuit, and then showed its implementation on an FPGA that uses off-chip DRAM banks to hold the main data. The performance gain of our system is approximately 3 to 8 times for large problems. This speedup is not drastic, but it is fast enough when we consider that it is limited by the throughput and access latency of the off-chip DRAMs, and that a search time on the CPU that is sometimes longer than one hour can be reduced to 1/3 to 1/8. In this system, all of the on-chip memory banks that are not used to store the data arrays are used to configure a specialized cache memory for the search, and the performance can be improved by up to 26%. We have also evaluated the system performance when the parallelism is enlarged, in order to clarify how much speedup would be possible using the largest FPGAs available.
It will be possible to improve our heuristic by analyzing the relation between the unsatisfied external gates and the values of the independent variables when the search is stuck, so that the search can escape from local minima more easily. The search can also be improved by managing the history of the flipped variables and choosing the next variable to be flipped using that history. These improvements require run-time analysis and management of the relations among the variables. Implementing these improvements in hardware is our future work.
