Introduction
The use of hardware accelerators has a long tradition in data processing [1], which is essential for a vast variety of computational systems. Among the numerous tasks that need to be solved, combinatorial problems are considered very time-consuming, and therefore a speed-up is highly desirable for many practical applications. A number of recent research works target the potential of advanced hardware accelerators, which are analyzed in detail in [2, 3]. Notable results have been achieved by applying parallelism, pipelining, non-sequential circuits, and other techniques, and by building specialized blocks in hardware. Special attention has been paid to such competing implementation platforms as Field-Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs) and multi-core Central Processing Units (CPUs). The use of FPGAs, which removes the design constraints imposed by the predefined architectures of CPUs and GPUs, is studied in a number of publications [4] [5] [6] [7] [8]. The most important features of FPGA-based circuits are their inherent configurability and relatively low development cost. FPGAs are especially well suited to prototyping, experiments and comparisons.
In this paper we further study the application of FPGAs to solving computationally intensive problems, using the Sudoku game [9] as an example. The remainder of the paper is organized in seven sections. Section 2 gives a brief description of FPGA evolution and a comparison with other technologies. Section 3 introduces the Sudoku game. Section 4 gives an overview of related work. The implemented algorithms are described in Section 5. The proposed architectures are presented in Section 6. Section 7 discusses implementation details and the results of experiments. Conclusions are given in Section 8.
arrays to multi-and extensible platform FPGAs combining traditional gate arrays with advanced multicore processors. Fig. 1a contains a table showing milestones of the two best-known companies, Altera and Xilinx, which dominate the market. The table shows the year in which the indicated FPGAs were introduced and the characteristics of the most advanced product of the respective family.
The chart in Fig. 1b shows that FPGA capacity grew substantially from 1985 to 2012. Note that we do not intend to compare companies; the main objective is to demonstrate the continuous progress in the FPGA area. Besides, logic elements (LE) from Altera and logic cells (LC) from Xilinx cannot be easily scaled to each other. Fig. 1c summarizes advancements in the process technology and shows the maximum number of transistors in FPGAs from their invention to the present time. In accordance with [13]
Let us contrast FPGAs with the most widely used alternatives with the aid of particular examples. In [7] FPGAs are compared with single- and multi-core CPU-based accelerators for linear algebra kernels/solvers, and the authors report performance increases ranging from 20 to 240 times. Publications [1, 3] demonstrate advantages of FPGA-based circuits over CPUs and GPUs for different types of data processing in very large databases. In [4] a performance study of three diverse applications (Gaussian Elimination, DES, and Needleman-Wunsch) was presented, and three different platforms (FPGA, GPU, and CPU systems) were compared. It was demonstrated that, in general, FPGAs provide the best expectation of performance, flexibility and low overhead, while GPUs tend to be easier to program and require fewer hardware resources. In [8] the Lightfield descriptor method for 3D computer graphics was implemented on a Virtex-5 FPGA and on a GeForce GPU. The FPGA-based circuit demonstrated an average speed-up of 50x over software. The corresponding GPU results were better than the software solutions and worse than the FPGA-based implementations. In [6] a systematic approach to the comparison is applied to five case study algorithms, characterized by their complexity, memory requirements, and data dependency. The algorithms were implemented and tested on an nVidia GeForce 7900 GTX GPU and a Xilinx Virtex-4 FPGA. A speed-up of two orders of magnitude over a general-purpose processor is achieved on each device for arithmetic-intensive applications. In the presence of data dependency, the implementation of a customized data path in the FPGA exceeded GPU performance by up to eight times. Other examples can be found in [14] [15] [16].
Combinatorial problems and the Sudoku game
Many combinatorial problems that are frequently encountered in practical applications are NP-complete and therefore have high computational complexity. These problems may benefit from FPGA-based solutions because the required basic operations can be tailored to the characteristics of a particular problem, and these operations can be easily executed in parallel. In this paper we report our experience with implementing a Sudoku solver on an FPGA; Sudoku is a good representative of problems with high computational complexity.
Sudoku is a well-known combinatorial number-placement puzzle which achieved widespread international popularity in 2005 [17]. The most common version of the game consists of 81 cells organized in a 9×9 grid (which in turn is divided into nine 3×3 regions) where some cells contain numbers from 1 to 9. The objective is to fill in the remaining empty cells with numbers 1..9 so that:
1. every number (1..9) appears in each row only once;
2. every number (1..9) appears in each column only once;
3. every number (1..9) appears in each 3×3 region only once.
The most common version of Sudoku (9×9 grid with 3×3 regions) can be generalized to other sizes. A Sudoku of order N is an N²×N² grid divided into N² regions (each of size N×N) which have to be filled with numbers from 1 to N². The generalized Sudoku problem is NP-complete [18].
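The three rules, in their generalized order-N form, can be checked mechanically. The following is a minimal Python sketch, not part of the FPGA design; the names `is_valid_solution` and `example_grid` are hypothetical, and the grid is assumed to be a list of N² rows with N² integers each.

```python
def is_valid_solution(grid, n=3):
    """Return True if a completed order-n Sudoku grid satisfies all three rules."""
    size = n * n
    full = set(range(1, size + 1))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(size)) for c in range(size)]
    regions = [
        set(grid[br * n + r][bc * n + c] for r in range(n) for c in range(n))
        for br in range(n) for bc in range(n)
    ]
    # Every row, column and region must contain each number exactly once.
    return all(group == full for group in rows + cols + regions)


def example_grid(n=3):
    """Build one valid completed grid using a simple shifting pattern."""
    size = n * n
    return [[(n * (r % n) + r // n + c) % size + 1 for c in range(size)]
            for r in range(size)]
```

For instance, `is_valid_solution(example_grid(3))` holds for the order-3 case considered in this paper, and the same check applies to other orders by changing `n`.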
Very recently (and especially after the Field-Programmable Technology Conference launched the design competition), interest has arisen in exploring the feasibility and effectiveness of solving Sudoku on FPGAs. In this paper we also tackle the problem, but consider only puzzles of order 3. In particular, we exploit parallelism to increase performance.
Related work
A number of research works explore the FPGA-based implementation of computationally intensive problems such as fast Fourier transform [19] , data sort [20] , Boolean Satisfiability problem (SAT) [21] , Eternity II [22] , and Sudoku [23] [24] [25] [26] .
In [23] a probabilistic Sudoku solver based on simulated annealing methods is presented. The solver was implemented on a Xilinx Virtex-II FPGA, ran at 53 MHz and is capable of handling puzzles up to 12th order. The obtained results were compared to an implementation of exactly the same algorithm in software, and the authors report that both systems behaved similarly in terms of performance (the FPGA was faster for lower-order puzzles, software performed better for higher-order puzzles). The suggested technique is not applicable to hard instances (none of the tested "hard" puzzles was solved in [23]).
A hardware implementation of an exhaustive search algorithm for Sudoku (which does not use any search space pruning) is presented in [24]. The solver was implemented on a Xilinx Virtex-II FPGA, ran at a 50 MHz clock frequency and was able to tackle problems up to order 8. Once again, the technique is not applicable to hard instances because the search is not directed.
Another FPGA-based Sudoku solver is proposed in [25] . In this case, the backtracking search algorithm is augmented with several heuristics that lead to search space reduction. The solver can process puzzles up to order 11.
In [26] a depth-first search algorithm is implemented and a new memory scheme is proposed, which reduced the memory requirements of [23] [24] [25] and made it possible to solve puzzles up to order 14 on a Spartan-3E FPGA. Parallelization was introduced as well. The results were compared to a state-of-the-art SAT solver (by modeling Sudoku as a SAT problem as in [27]), and the authors found that their FPGA-based implementation can only speed up easy puzzles, while the hard ones are processed more slowly than in software.
In this work we further explore the problem with particular emphasis on implementing: a) a backtracking search algorithm with search space pruning; b) a number of parallel Sudoku solvers (all the reviewed research works indicate that state-of-the-art software solving methods are more competitive and therefore parallelism must be the main target of hardware implementations).
Implemented algorithms
For every empty cell in a puzzle a list of possibilities must be generated. This list includes all the numbers that could theoretically be placed in that cell (taking into account the current puzzle constraints).
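A minimal Python sketch of generating such a list for one cell of an order-3 puzzle is given below. The function name `candidates`, the list-of-rows layout with 0 marking an empty cell, and the sample instance `PUZZLE` (a widely circulated example, not one of the benchmarks used in this paper) are assumptions made for illustration only.

```python
PUZZLE = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]


def candidates(grid, r, c):
    """Return the sorted candidate numbers for cell (r, c); 0 marks an empty cell."""
    if grid[r][c] != 0:
        return []                      # a filled cell has no candidates
    used = set(grid[r])                # numbers already in the row
    used |= {grid[i][c] for i in range(9)}          # ... in the column
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]
```

With this instance, `candidates(PUZZLE, 0, 2)` gives `[1, 2, 4]`, while the central cell `candidates(PUZZLE, 4, 4)` gives the single candidate `[5]`.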
The following search space pruning techniques have been implemented:
- Singles: when a cell has only one candidate number, this number is assigned to that cell and removed from the lists of candidates of all other cells in the same row, column, and region;
- Hidden Singles: if a candidate number appears only once in a given row, column or region, it is assigned to the respective cell.
When further pruning is not possible, we resort to a backtracking algorithm which:
1. Finds a cell with the smallest number of candidates;
2. Tries those candidates one by one (performing problem simplification as described above);
3. If a cell without candidates appears, this trial was wrong and the algorithm backtracks to explore the remaining possibilities; otherwise it goes to point 1 above;
4. Stops when all the cells are filled with valid numbers.
Basically, we implemented two algorithms: A_simple, which only prunes the search space and can therefore solve only very simple puzzles, and A_search, which applies a breadth-first search strategy and thus has virtually no limitation on the type of puzzles that can be solved.
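The two pruning techniques and the search over puzzle copies can be sketched in software as follows. This is an illustrative Python sketch under stated assumptions, not the VHDL design: the flat 81-element list with 0 for empty cells, and the names `cands`, `prune` and `solve`, are hypothetical. Candidate lists are recomputed on demand here rather than stored and updated.

```python
from collections import deque

ROWS = [[9 * r + c for c in range(9)] for r in range(9)]
COLS = [[9 * r + c for r in range(9)] for c in range(9)]
REGIONS = [[9 * (3 * br + r) + 3 * bc + c for r in range(3) for c in range(3)]
           for br in range(3) for bc in range(3)]
UNITS = ROWS + COLS + REGIONS
# Cells sharing a row, column or region with cell i (20 peers each).
PEERS = [set().union(*(u for u in UNITS if i in u)) - {i} for i in range(81)]


def cands(grid, i):
    """Candidate numbers for empty cell i under the current constraints."""
    used = {grid[p] for p in PEERS[i]}
    return [v for v in range(1, 10) if v not in used]


def prune(grid):
    """Apply Singles and Hidden Singles in place until nothing changes.

    Returns False if some empty cell is left without candidates."""
    changed = True
    while changed:
        changed = False
        for i in range(81):
            if grid[i] == 0:
                options = cands(grid, i)
                if not options:
                    return False
                if len(options) == 1:              # Singles
                    grid[i] = options[0]
                    changed = True
        for unit in UNITS:                          # Hidden Singles
            for v in range(1, 10):
                places = [i for i in unit
                          if grid[i] == 0 and v in cands(grid, i)]
                if len(places) == 1:
                    grid[places[0]] = v
                    changed = True
    return True


def solve(grid):
    """Breadth-first search over puzzle copies, pruning each copy first."""
    queue = deque([list(grid)])
    while queue:
        puzzle = queue.popleft()
        if not prune(puzzle):
            continue                                # wrong trial: drop this copy
        empty = [i for i in range(81) if puzzle[i] == 0]
        if not empty:
            return puzzle                           # all cells filled
        cell = min(empty, key=lambda i: len(cands(puzzle, i)))
        for v in cands(puzzle, cell):               # one copy per candidate
            copy = list(puzzle)
            copy[cell] = v
            queue.append(copy)
    return None
```

In this sketch a failed trial is simply discarded from the queue, which plays the role of backtracking; calling `prune` alone corresponds to A_simple, and `solve` to A_search.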
Proposed architectures
Three different Sudoku solver architectures have been implemented. The first solver, S_simple, implements the algorithm A_simple and is therefore only able to solve simple puzzles by reasoning, i.e. without search. The second solver, S_search, applies the breadth-first search algorithm A_search and can process any puzzle (provided memory is available). Finally, the third solver, S_parallel, implements the algorithm A_search and explores the possibility of processing nodes of the search tree in parallel. In particular, all the sub-puzzles which are instantiated at the same level of the search tree are solved in parallel. Moreover, if further branching occurs, the new lower-level sub-puzzles are also executed in parallel with those already running.
Organization of Data in Memory
The original puzzle as well as auxiliary puzzles are stored in memory as 81 (9×9) 4-bit words, where every cell value is represented by the respective binary code (code "0000" is reserved for empty cells).
Each cell might have up to 9 candidate numbers (in the worst case). Therefore we store the list of candidates as 81 words of 36 (9×4) bits each, where every word is organized as illustrated in Fig. 2. For example, the cell A1 from Fig. 2,a already has a value. Therefore this cell does not have any candidates and all 36 bits are set to zero. The cell A2 has two possible candidate numbers: 2 and 6. These numbers are coded in binary and occupy 4 bits each, as illustrated in Fig. 2,b; the remaining bits are set to zero. The cell A4 has three possible candidate numbers: 2, 3, and 5, which are coded in binary with 4 bits each as shown in Fig. 2,b.
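The 36-bit candidate word can be illustrated with a short Python sketch. The function names are hypothetical, and placing the first candidate in the most significant 4 bits is an assumption about the bit ordering, chosen so that the printed word reads left to right.

```python
def pack_candidates(cands):
    """Pack up to 9 candidates (each 1..9) into one 36-bit word, 4 bits each."""
    if len(cands) > 9 or any(not 1 <= v <= 9 for v in cands):
        raise ValueError("expected at most 9 candidates in the range 1..9")
    word = 0
    for v in cands:
        word = (word << 4) | v
    return word << (4 * (9 - len(cands)))    # unused slots stay 0000


def unpack_candidates(word):
    """Recover the candidate list from a 36-bit word."""
    nibbles = [(word >> (4 * (8 - k))) & 0xF for k in range(9)]
    return [v for v in nibbles if v != 0]


def show(word):
    """Render a 36-bit word as nine space-separated 4-bit groups."""
    bits = format(word, "036b")
    return " ".join(bits[k:k + 4] for k in range(0, 36, 4))
```

For the candidates 2, 3 and 5, `show(pack_candidates([2, 3, 5]))` yields `0010 0011 0101` followed by six `0000` groups, and a cell that already has a value packs to all zeros.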
The maps of columns and regions are used to find easily which column or region a particular cell belongs to. These memories are organized as 81 (9×9) 4-bit words. The maps of columns and regions are implemented as ROMs whose contents are defined statically and do not change during the solution of a problem (or when we switch between different problem instances). These maps are not required in software, but we found that they facilitate memory addressing in hardware.
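The static contents of the two maps can be generated as in the following Python sketch. Row-major cell addressing (address = 9 × row + column) and 0-based column and region numbers are assumptions, as are the names `build_maps` and `rom_init`.

```python
def build_maps():
    """Return (column_map, region_map), each with 81 entries in the range 0..8."""
    column_map = [addr % 9 for addr in range(81)]
    # Region number = 3 * (row // 3) + (column // 3), with row = addr // 9.
    region_map = [3 * (addr // 27) + (addr % 9) // 3 for addr in range(81)]
    return column_map, region_map


COLUMN_MAP, REGION_MAP = build_maps()


def rom_init(words):
    """Render ROM contents as 4-bit binary strings, one per address."""
    return [format(w, "04b") for w in words]
```

Every entry fits in a 4-bit word, so `rom_init(COLUMN_MAP)` and `rom_init(REGION_MAP)` give 81 four-bit initialization values each.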
The map of minimals keeps the addresses of cells ordered by their number of candidates and is used in branching (in order to split the search on the cell that has the smallest number of candidates). This memory is organized as 81 7-bit words.
[Fig. 2: (a) example puzzle; (b) the corresponding 36-bit candidate words for selected cells]
Finally, the map of indices indicates which memories are free and can therefore be used for storing auxiliary puzzles in the solvers S_search and S_parallel. This memory is composed of 32 1-bit flags and can be used for managing at most 32 auxiliary puzzles (but is easily scalable to deal with more puzzles). At the beginning of execution, the map is filled with zeros.
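The behaviour of these two maps can be modelled in a few lines of Python. This is a sketch under assumptions: the class and function names are hypothetical, a flag value of 0 is taken to mean "free" (consistent with the map being filled with zeros at start), ties in the map of minimals are broken by address, and allocation picks the lowest free index.

```python
def minimals_map(candidate_counts):
    """Addresses of cells that have candidates, ordered by ascending count."""
    cells = [addr for addr, n in enumerate(candidate_counts) if n > 0]
    return sorted(cells, key=lambda addr: candidate_counts[addr])


class IndexMap:
    """32 one-bit flags: 0 = memory free, 1 = memory holds an auxiliary puzzle."""

    SLOTS = 32

    def __init__(self):
        self.flags = 0                       # all zeros at the start

    def allocate(self):
        """Mark the lowest free memory as used and return its index, or None."""
        for slot in range(self.SLOTS):
            if not (self.flags >> slot) & 1:
                self.flags |= 1 << slot
                return slot
        return None                          # no memory left

    def release(self, slot):
        """Mark a memory as free again."""
        self.flags &= ~(1 << slot)
```

The first entry of `minimals_map(...)` is the cell on which the search would branch, and `IndexMap.allocate` returning `None` corresponds to running out of memories for auxiliary puzzles.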
Solver Architectures
The block diagrams of the solvers S_simple and S_search are illustrated in Fig. 3. The first solver, S_simple (see Fig. 3,a), is able to process very simple puzzles that can be solved without branching by applying two reduction techniques: Singles and Hidden Singles. The Control Unit manages access to memory blocks (which hold the original puzzle, the list of candidate numbers, the map of columns, and the map of regions) with the aid of a priority encoder and multiplexers for addresses, data and write enable signals. The access is arbitrated between the external interface (for example, a USB interface to a host computer) and three VHDL processes, which are responsible for applying the two reduction techniques (the singles and hidden singles processes) and for updating the list of candidate numbers (the list of candidates process) according to the algorithm A_simple. The processes are implemented as communicating finite state machines [28].
The second solver, S_search (see Fig. 3,b), applies the breadth-first search algorithm as soon as it concludes that further reduction is not possible. The process find minimal is responsible for finding an empty cell C which has the smallest number of candidates n. Then n copies of the puzzle are made, and each of these has the cell C filled with one of the candidates. After that, the copied auxiliary puzzles are processed by the solver one by one, sequentially, according to the algorithm A_search, eventually creating more copies of the sub-puzzles, which are added to the end of the queue and processed in the order of their instantiation. The process check solution is responsible for verifying whether the current puzzle is completely filled in and a solution has been found, in which case the Control Unit has to stop running the remaining puzzles.
With this architecture any Sudoku puzzle can be solved provided there is memory available for keeping auxiliary copies as described above. Access to memories (which store the list of candidates and the different maps) is arbitrated by the Control Unit in a way similar to the solver S_simple. Access to K puzzles is controlled with the aid of a demultiplexer (not shown in Fig. 3 for the sake of simplicity) activated by an index supplied by the Control Unit. As soon as a solution is found, data from the respective puzzle are sent to the external interface via a multiplexer (also not shown in Fig. 3) controlled by an index from the Control Unit.
Finally, the block diagram of the third solver, S_parallel, is illustrated in Fig. 4. The solver is started by activating the first engine, Solver 1. Each solving engine (see Fig. 4,a) is only responsible for applying the Singles and Hidden Singles reduction techniques and for finding a cell C with the minimum number of candidate numbers n. Information about the cell C is then sent to the Parallel Control Unit (see Fig. 4,b), which makes n copies of the puzzle and activates the next n available solving engines, which all execute in parallel. As soon as one of these engines needs to branch the search space, the process repeats by activating n more solving engines. As soon as one of them reports that a solution has been found, the Parallel Control Unit aborts all the remaining solving engines and notifies the respective external entity, and data from the relevant solving engine are sent to the external interface. As in the previous case, with this architecture any Sudoku puzzle can be solved provided resources are available for implementing the required number of solving engines. Access to different solving engines is controlled with the aid of demultiplexers activated by the respective index supplied by the Parallel Control Unit.
The results of experiments
All three proposed architectures were described in VHDL, synthesized with Xilinx ISE, and tested on a Nexys-2 prototyping board containing the XC3S1200E FPGA. The FPGA has 28 18-Kbit embedded memory blocks (BlockRAMs). We configured each BlockRAM as a single-port memory and arranged for every entity described in Section 6.1 to occupy one BlockRAM. This configuration made it easy to implement the solver S_simple, to keep up to 21 auxiliary puzzles in the solver S_search, and to instantiate at most 5 parallel solving units in the solver S_parallel. The occupied FPGA resources are summarized in Table 1. Please note that although it is possible to keep up to 21 auxiliary puzzles in the solver S_search, only 12 were instantiated because these were sufficient for the experiments. That is why this solver needs just 19 BlockRAMs.
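The BlockRAM figures quoted above can be cross-checked with simple arithmetic. The Python sketch below uses only numbers stated in the text; the inference that 19 − 12 BlockRAMs are occupied by entities other than auxiliary puzzles assumes one BlockRAM per auxiliary puzzle, and 18 Kbit is taken as 18 × 1024 bits.

```python
BLOCKRAM_BITS = 18 * 1024          # capacity of one 18-Kbit BlockRAM
TOTAL_BLOCKRAMS = 28               # available in the XC3S1200E

puzzle_bits = 81 * 4               # one puzzle: 81 words of 4 bits
candidate_bits = 81 * 36           # candidate list: 81 words of 36 bits
minimals_bits = 81 * 7             # map of minimals: 81 words of 7 bits

# Each entity fits comfortably in a single BlockRAM.
largest = max(puzzle_bits, candidate_bits, minimals_bits)
fits_in_one_block = largest <= BLOCKRAM_BITS

# S_search with 12 auxiliary puzzles uses 19 BlockRAMs, so the remaining
# entities occupy 19 - 12 = 7 (assuming one BlockRAM per auxiliary puzzle).
fixed_blocks = 19 - 12
max_auxiliary_puzzles = TOTAL_BLOCKRAMS - fixed_blocks
```

Under these assumptions `max_auxiliary_puzzles` evaluates to 21, which agrees with the limit stated for the solver S_search.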
Experiments were done with benchmarks available at [29] [30] [31]. The results are summarized in Tables 2-3. We have designed a software program (in C) which implements the algorithms A_simple and A_search. The respective software execution times t_sw (measured on an HP EliteBook 2730p with an Intel Core Duo CPU at 1.87 GHz) are presented in the rightmost columns of Tables 2-3. The column "# copied puzzles" shows how many search tree branches had to be explored (how many puzzle copies were executed or, in the case of the solver S_parallel, how many solving engines were activated). The solver S_parallel supports at most 5 parallel solving units and is therefore not able to process the last 3 instances from Table 3. The tables show that the solvers S_simple and S_search are less efficient than the respective software implementations. This is an expected result because these two solvers implement very control-oriented algorithms and only exploit low-level parallelism (at the level of operations). On the other hand, the solver S_parallel, which traverses the search tree branches in parallel, leads to execution times closer to those of the software solution. The solver S_parallel is on average 2.4 times faster than the solver S_search.
We would like to underline that the experiments were done on a low-cost FPGA which supports a very limited clock frequency (about 37 times lower than that of the PC used). If the same designs are migrated to a more advanced FPGA chip, better results can be expected. Besides, we used a number of auxiliary test circuits that frequently check the validity of puzzles. These circuits slow down the solvers and will be removed in future implementations.
We have also compared our results with those of [23] [24] [25] [26] using benchmark 3a from the Field-Programmable Technology Conference [32]. The solver S_simple was 5 times faster than [24] and 2 times faster than [25], but slower than [26] (the simulated annealing-based solver [23] was not able to process the 3a instance).
Conclusions
In this paper three different architectures were presented that allow a Sudoku puzzle to be solved on an FPGA. We showed that, despite the serial nature of the implemented algorithms, parallelism can be applied efficiently, increasing performance by a factor of 2.4. Nevertheless, even the parallel FPGA-based solver is slower than the software solution. This leads to the conclusion that parallelism should be explored more aggressively. There is room for such exploration even with the low-cost FPGA used, through more efficient data structures and BlockRAM management.
