Abstract
Introduction
Practical applications of reconfigurable computing systems already cover a wide and diversified field [1], such as image processing, video processing, pattern recognition, neural networks, applications in high-energy physics, search of genetic databases, cryptography, etc. No definite conclusions can yet be drawn on how their performance compares with that of general-purpose processors and DSPs. The available data, however, indicate that reconfigurable computation achieves the best results when dealing with problems that exhibit significant levels of intrinsic parallelism and require the processing of data represented in formats not supported by the ISAs of microprocessors or DSPs.
In this paper we investigate the capabilities of FPGAs to be used for computations that are required for NP-hard combinatorial problems. Note that no efficient algorithms are known for solving such problems on general-purpose computers.
Various problems of combinatorial optimization appear in different application areas, such as synthesis and optimization of digital circuits; mapping, placement and routing of microchips; topology and cartography (mapmaking) [2]; artificial intelligence [3]; etc. Examples of such problems are: finding the shortest and longest path, graph coloring, Boolean function optimization, covering and planarity problems, encoding problems, etc. Due to their universality and wide scope of applications, the problems of combinatorial optimization are well studied and specified through mathematical models that can be applied to different optimization tasks. The problems of combinatorial optimization can mainly be formulated in terms of graphs, matrices [3], sets [4], Boolean functions and equations [5], etc. Given the heterogeneity of combinatorial problems it does not make sense to construct some kind of universal device, which on the one hand would be rather complex and expensive and on the other would not be used to its full power. Instead, a particular co-processor might be constructed just for those combinatorial problems that need to be solved. An excellent platform for such devices is based on reconfigurable circuits whose architecture can be customized to the specific problem through reconfiguration. The complexity of recent FPGAs makes it possible to implement a complete combinatorial co-processor with the desired architecture. There are many approaches that make use of FPGA facilities in order to take advantage of specific circuits [6], such as datapath [7] and memory [8].
This paper is organized in six sections. Section 1 is this introductory section. A reconfigurable processor for solving combinatorial tasks is considered in section 2. Section 3 suggests the structure of a dynamically modifiable core for the combinatorial processor. Section 4 presents a design example. Section 5 describes the developed design tools for problems of reconfigurable computations. The conclusion is in section 6.
A configurable combinatorial processor
It is known that the majority of combinatorial problems of logic design and artificial intelligence can be formulated on logic (Boolean, ternary or some other) matrices [3]. All of them are discrete in the sense that the number of values for the elements of these matrices is limited. In practice mainly three values (0, 1 and don't care) are used. Below we will show that considering a fourth value simplifies the descriptions of many combinatorial tasks. The primary problem to be solved was presented in [3]. Let A and B denote some finite sets and # be a binary relationship between the sets: # ⊆ A × B. Let [A#B] be a logic matrix of this relationship, which has rows corresponding to elements from A and columns corresponding to elements from B. For example, B = {b1, b2, b3, b4}, A = {B1, B2, B3}, B1 = {b1, b4}, B2 = {b2, b3}, B3 = {b1, b3, b4}. Here A and B are connected by the relationship ∈, which reflects the membership of elements from B in elements from A. This can be presented in matrix form on the one hand, and disjunction (∨) or exclusive disjunction (⊕), on the other hand, serve respectively as inside and outside operators when multiplying matrices [3]. Some of these matrices are given, and others have to be found. For complex circuits this problem can be decomposed into a set of well-specified and rather simple problems over separate matrices [9]. Even when such a problem is clearly specified, most computations over logic matrices are NP-hard, i.e. for the known algorithms the complexity grows exponentially with the quantity of input data. Besides, the majority of basic operations implemented on general-purpose processors are not well suited to such kinds of computations, and it is very desirable to accelerate the computations using specialized coprocessors.
Let us consider an example of some operations over logic matrices. For many practical tasks we need more than 2 values for computations. For instance, in the case of minimization of Boolean functions at least three values are needed for variables (0, 1 and don't care, denoted -) and three values for functions (0, 1 and any value, i.e. either 0 or 1). The latter might also be denoted by -.
Consider some typical operations that must be applied to Boolean matrices in order to solve different combinatorial problems. For this purpose we will distinguish binary and unary operations, which are applied to two matrices and to one matrix respectively. Suppose we have two given matrices X and Y. For certain applications they have equal numbers q of rows X_1, ..., X_q, Y_1, ..., Y_q (we will use subscripts to denote rows) and n of columns X^1, ..., X^n, Y^1, ..., Y^n (we will use superscripts to denote columns), and in the general case n ≠ q.
For such matrices we can use traditional binary logic operations, such as OR, AND, XOR, NOR, NAND, etc., which are applied either to two rows X_i and Y_i with the same index i or to two columns X^j and Y^j with the same index j.
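The row-wise and column-wise operations described above can be sketched in software as follows. This is only an illustrative model, assuming matrices stored as Python lists of 0/1 rows; the names `row_op`, `col_op` and `OPS` are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

Matrix = List[List[int]]  # q rows by n columns, every entry is 0 or 1

# Two-input Boolean operations named in the text (illustrative table).
OPS: Dict[str, Callable[[int, int], int]] = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}


def row_op(x: Matrix, y: Matrix, i: int, op: Callable[[int, int], int]) -> List[int]:
    """Apply op element by element to rows X_i and Y_i (i is 0-based here)."""
    return [op(a, b) for a, b in zip(x[i], y[i])]


def col_op(x: Matrix, y: Matrix, j: int, op: Callable[[int, int], int]) -> List[int]:
    """Apply op element by element to columns X^j and Y^j (j is 0-based here)."""
    return [op(row_x[j], row_y[j]) for row_x, row_y in zip(x, y)]


X = [[1, 1, 0],
     [0, 1, 1]]
Y = [[1, 0, 0],
     [1, 1, 0]]

print(row_op(X, Y, 0, OPS["AND"]))  # first rows of X and Y
print(col_op(X, Y, 1, OPS["XOR"]))  # second columns of X and Y
```

In hardware all positions of a row would be processed in parallel; the list comprehension only models the result, not the timing.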
For many practical applications, especially in the scope of VLSI design, for a given matrix X we need to construct a new matrix Z, which describes some new relationship extracted from X. Suppose X = [x_i^j], i = 1, ..., q, j = 1, ..., n, and it describes a non-directed graph. Examples of such relationships might be the following:
1.
2.
All these operations can be either unary or binary. For many practical tasks, such as the covering problem [3], we have to search for specific properties of rows and columns; for example, we might want to find a row that contains the maximum number of ones or a column with the minimal number of ones, etc.
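As a minimal sketch of such a search, assuming a Boolean matrix stored as a list of 0/1 rows, the following functions count ones per row and per column and return the first row with the maximum count and the first column with the minimum count. The function names and the tie-breaking rule (first index wins) are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[int]]


def ones_per_row(m: Matrix) -> List[int]:
    """Number of ones in every row."""
    return [sum(row) for row in m]


def ones_per_column(m: Matrix) -> List[int]:
    """Number of ones in every column."""
    return [sum(col) for col in zip(*m)]


def first_row_with_max_ones(m: Matrix) -> int:
    """0-based index of the first row holding the maximum number of ones."""
    counts = ones_per_row(m)
    return counts.index(max(counts))


def first_column_with_min_ones(m: Matrix) -> int:
    """0-based index of the first column holding the minimum number of ones."""
    counts = ones_per_column(m)
    return counts.index(min(counts))


M = [[1, 0, 1],
     [1, 1, 1],
     [0, 0, 1]]

print(ones_per_row(M), first_row_with_max_ones(M))
print(ones_per_column(M), first_column_with_min_ones(M))
```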
An important class of applications includes logic matrices that describe systems of Boolean functions presented in disjunctive or conjunctive normal forms. In such a case the matrices X and Y in the matrix logic equation Z = X × Y will have different numbers of rows and columns [10].
The problem becomes more complicated when dealing with logic matrices X = [x_i^j], i = 1, ..., q, j = 1, ..., n, whose components x_i^j have more than just two values. Typically we need three values: 0, 1 and -. Let's introduce one extra value (denoted +). For example, we can mark with + any unused (or unnecessary) area of a logic matrix. This value can also be interpreted from the point of view of logic. Indeed, if x is an input, then:
1. The value x might be inverted (¬x), denoting 0;
2. The value x might be taken without any change (x), denoting 1;
3. The value x might be ignored (i.e. disconnected), denoting -;
4. The value x might be logically combined with its own inverse (x # ¬x, where # is some Boolean operation), denoting +.
We have encoded these four operations with a 2-bit code. So our matrices are represented as a 2-dimensional array X = [x_i^j], i = 1, ..., q, j = 1, ..., n, where each individual element x_i^j is 2 bits in size. The computational unit for the problem that we are going to solve can be specified in the following way:
1. The most general task for this unit is calculations on a single logic matrix or a pair of logic matrices with 2-bit primary elements.
2. Most computations are homogeneous, i.e. they invoke the same operation(s) for regular data.
3. We have to be able to perform the required operations either on rows or on columns of logic matrices.
4. The volume of data is usually very large.
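To illustrate the 2-bit representation of matrix elements, the sketch below packs a row of four-valued symbols into one integer and unpacks it again. The paper does not state which 2-bit code is assigned to which value, so the assignment in `CODE` is purely an assumption, as are the function names.

```python
# Assumed code assignment (not given in the paper): 0 -> 00, 1 -> 01, - -> 10, + -> 11.
CODE = {"0": 0b00, "1": 0b01, "-": 0b10, "+": 0b11}
SYMBOL = {bits: sym for sym, bits in CODE.items()}


def pack_row(row: str) -> int:
    """Pack a row such as '01-+' into an integer, 2 bits per element, leftmost element first."""
    word = 0
    for sym in row:
        word = (word << 2) | CODE[sym]
    return word


def unpack_row(word: int, n: int) -> str:
    """Recover the n symbols of a row from its packed form."""
    return "".join(SYMBOL[(word >> (2 * (n - 1 - k))) & 0b11] for k in range(n))


packed = pack_row("01-+")
print(bin(packed), unpack_row(packed, 4))
```

A row of n elements therefore occupies 2n bits, which is why the word size of the RAM blocks has to match the matrix dimensions rather than a fixed processor word.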
Taking all these requirements into consideration, we have proposed the first architecture of a combinatorial processor, depicted in fig. 1, which is going to be used for analysis and estimation.
Figure 1. Primary architecture of a combinatorial processor
The processor contains 3 very fast blocks based on RAM implemented in the CLBs of an FPGA. The size of the RAM-based blocks is statically modifiable. They are used to store 3 matrices Z, X and Y, where X and Y are the matrix operands and Z keeps the result of combinatorial computations. The matrices X and Y are loadable from outside and the matrix Z is readable from outside. This can be implemented with a dual-port RAM, as available in the FPGAs of the XC4000XL family. On the other hand, the value of Z might be used as an operand (either X or Y) in future computations (see fig. 1).
There are two blocks that can be re-programmed during run-time. They are a Re-programmable Function Unit (RFU) and a Re-programmable Control Unit (RCU), depicted in fig. 1, which are used for different kinds of operations on matrices X, Y and Z.
Structure of dynamically Re-programmable Function Unit (RFU)
We are considering a primary structure of the RFU shown in fig. 2. Finally, consider an example where a row X_i and a column X^j are exchanged (X_i ↔ X^j). In this case the RCU provides parallel reading of the row X_i and then sequential reading of the column X^j. The latter is performed by sequential reading of bit j in the column RG (see fig. 1) and shifting it while incrementing the address of the respective matrix in RAM (see fig. 1).
Note that there exists a large number of different operations that we might want to perform on logic matrices. It is known that the number of Boolean functions of n variables is equal to 2^(2^n). So even for trivial Boolean computations (over pairs of elements of Boolean matrices with two possible values 0 and 1, i.e. n = 2) we have 16 different functions. In the case of 4 feasible values for each element of a matrix, such as 0, 1, -, +, a pair of 2-bit elements gives n = 4 Boolean variables and we have 65536 possible Boolean functions that might be performed. On the other hand, any real task needs only a very limited number of such functions. That is why it is unreasonable to construct a complicated logic block for the respective computations, and we have proposed to provide it with dynamically modifiable functionality (see fig. 3).
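The counts quoted above can be checked with a few lines of code. The sketch below evaluates the formula for the number of Boolean functions of n variables and, for small n, confirms it by enumerating every possible truth table; the function names are illustrative only.

```python
from itertools import product
from typing import List, Tuple


def count_boolean_functions(n: int) -> int:
    """A function of n Boolean variables has 2**n truth-table rows, each of which is 0 or 1."""
    return 2 ** (2 ** n)


def all_truth_tables(n: int) -> List[Tuple[int, ...]]:
    """Enumerate every Boolean function of n variables as a tuple of 2**n output bits."""
    return list(product((0, 1), repeat=2 ** n))


print(count_boolean_functions(2))  # two 1-bit operands
print(count_boolean_functions(4))  # two 2-bit operands
print(len(all_truth_tables(2)) == count_boolean_functions(2))
```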
In our case the block is built from dynamically reprogrammable computational primitives P_1, ..., P_n (see fig. 3). Each P_i has two inputs a_i and b_i and one output. The inputs a_i and b_i and the output are elements of matrices X, Y and Z respectively. Each variable is two bits in size, which allows the values 0, 1, - and + to be represented. By re-programming each primitive from {P_1, ..., P_n} we can implement any n (out of 65536) Boolean functions of 4 Boolean variables (in fact, any primitive realizes two such Boolean functions because each block P_i has two Boolean outputs). The primitives shown in fig. 3 are built from 2 RAM-based CLBs of the XC4010XL. Run-time reprogrammability is achieved with the aid of dual-port RAM organized as 16×2. The first port is used to provide the desired functionality. The second port makes it possible to re-program a primitive at run time in order to change the respective Boolean function. Thus the processor has two dynamically modifiable resources that affect the functionality of the hardware: the RFU and the RCU with a run-time alterable core.
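A software model may help to picture such a primitive. The sketch below treats one primitive as a 16-word by 2-bit lookup table addressed by the two 2-bit inputs, with one method standing in for each RAM port. The class name, the address layout (input a in the upper two address bits) and the example functions are assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, List


class Primitive:
    """Model of one reprogrammable primitive: a 16x2 lookup table."""

    def __init__(self) -> None:
        self.table: List[int] = [0] * 16  # 16 words, each holding a 2-bit result

    def program(self, func: Callable[[int, int], int]) -> None:
        """Rewrite all 16 words; models run-time re-programming through the second port."""
        self.table = [func(addr >> 2, addr & 0b11) & 0b11 for addr in range(16)]

    def evaluate(self, a: int, b: int) -> int:
        """Look up the 2-bit result for 2-bit inputs a and b; models the first port."""
        return self.table[((a & 0b11) << 2) | (b & 0b11)]


p = Primitive()
p.program(lambda a, b: a | b)   # bitwise OR of the two 2-bit codes
print(p.evaluate(0b01, 0b10))
p.program(lambda a, b: a & b)   # re-programmed at "run time" to bitwise AND
print(p.evaluate(0b01, 0b10))
```

Because the whole behavior lives in the table contents, changing the function costs 16 writes and no change to the surrounding circuit, which is the point of the RAM-based design.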
Figure 3. Dynamically reconfigurable core
In conclusion we will demonstrate how to use 4 values of logic variables in practice. Let's consider a product of logic variables in which every variable x carries one of the four values as its exponent, where x^0 = ¬x, x^1 = x, x^- = 1, x^+ = 0. For a disjunction of such variables we have x^0 = ¬x, x^1 = x, x^- = 0, x^+ = 1. This is not the same as in the previous case. However, the following representation can be applied to both logic expressions: x^0 = ¬x, x^1 = x, x^+ = x # ¬x, with x^- being the complement of x^+, where # is the respective Boolean operation (AND in the first case and OR in the second case). That is, if x^- = 0 then x^+ = 1 and vice versa. Since we know how to define don't care for any variable, we can find the fourth value for this variable. By definition this value is different from the first three, which are 0, 1 and don't care. Depending on the combinatorial problem to be solved, the value + might be interpreted in different ways. For instance, for programmable logic arrays or similar matrix-based circuits [4] this value can mark primary re-programmable units that do not need to be programmed at all. This makes it possible to modify the functionality of the circuit after programming, even for mask-programmable devices. Note that using the values 0, 1, - and + does not restrict the potential scope of the approach. Indeed, we might ignore the fourth value "+" or even the don't care value "-". In the last case we will deal with the pure Boolean space [3].
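The interpretation of the four values can be made concrete with a small evaluator. The sketch assumes that the value + stands for x combined with its own inverse under the operation of the surrounding expression, and that the don't care value is the complementary constant, as the text indicates; the function names and the string encoding of exponents are hypothetical.

```python
from typing import Sequence


def literal(x: int, value: str, op: str) -> int:
    """Value contributed by variable x with exponent '0', '1', '-' or '+'.

    op is 'and' for a product of variables and 'or' for a disjunction.
    """
    inv = 1 - x
    plus = (x & inv) if op == "and" else (x | inv)  # x # (not x): 0 for AND, 1 for OR
    if value == "0":
        return inv
    if value == "1":
        return x
    if value == "+":
        return plus
    if value == "-":
        return 1 - plus  # neutral element of the operation
    raise ValueError("unknown value: " + value)


def evaluate(xs: Sequence[int], exponents: str, op: str) -> int:
    """Evaluate a product (op='and') or a disjunction (op='or') of four-valued literals."""
    result = 1 if op == "and" else 0
    for x, value in zip(xs, exponents):
        lit = literal(x, value, op)
        result = (result & lit) if op == "and" else (result | lit)
    return result


print(evaluate([1, 0, 1], "10-", "and"))  # third variable ignored
print(evaluate([1, 0, 1], "1+-", "and"))  # a '+' forces the product to 0
print(evaluate([0, 1, 0], "10+", "or"))   # a '+' forces the disjunction to 1
```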
Design example
A large variety of combinatorial tasks required for solving problems of digital circuit optimization, artificial intelligence, etc. was considered in [3, 10], where it was also shown that on general-purpose computers the respective procedures involve intensive data exchange between the central processor and RAM. As a result they are time consuming. Since for the majority of practical applications the size of the vectors used does not coincide with the size of RAM words, we have to use many specific operations that form vectors from RAM words and divide vectors into RAM words. This invokes many supplementary operations that lower the effectiveness of the respective combinatorial algorithms implemented on general-purpose computers. On the other hand, widely used FPGA architectures are well suited for implementing operations that intensively access different data, including feasible pipelining and managing built-in memory. Besides, many FPGAs contain RAM-based cells that have a speed comparable with fast gates and can be combined in structures with the desired organization, which in particular provide the proper size of words and the required number of words. Thus we are able to construct the memory organization that the task requires. An example of a widely used combinatorial task is presented below.
Let us assume that we have to find the minimal column cover of a Boolean matrix [3, 10]. The procedure includes some predefined steps, such as searching for the row with the minimal number of ones, detecting the column that covers the row found (i.e. that has a one in this row) and has the maximum number of ones, removing this column and all the rows that it covers, etc. In this case the RCU provides the predefined sequence of steps, which depends on the generated logic conditions. The circuits in fig. 2 enable us to count the number of ones, to check which rows will be covered by the selected columns, etc. In order to set a column to the state "has been selected" we can use the fourth (specific) value (+) of its elements, which is "not 0", "not 1" and "not -" (see section 2).
On the other hand, we can use special registers that keep additional information about the matrix vectors (see fig. 2).
Consider the following matrix X:

        X^1  X^2  X^3  X^4
  X_1    1    1    0    0
  X_2    1    1    0    0
  X_3    0    1    0    1
  X_4    0    0    1    1

Fig. 4 depicts the flow-chart of the algorithm. At the first step row 1 has to be selected (because this is the first among the rows with minimal number of ones). It covers columns 1 and 2. Since column 2 has the maximum number of ones it is chosen in step 2. At the next step it is deleted together with all the rows that are covered by it (i.e. rows 1, 2 and 3). After that the new matrix will look like the following:

        X^1  X^3  X^4
  X_4    0    1    1

Now we have to choose row 4 and column 3. They will be removed from the matrix and the latter becomes empty. As a result we obtain the solution that is represented by the columns 2 and 3. The algorithm (see fig. 4) includes one basic operation that is "counting the number of ones in a given vector". This operation will be implemented in the dynamically reprogrammable core shown in fig. 2. Note that the flowchart, shown in fig. 4, actually represents the control algorithm, which has to be realized in the RCU (see fig. 1). This algorithm can be converted to a standard description of a finite state machine (FSM) [4]. In section 3 we have already mentioned that the values L and N for the FSM are fixed. So we can implement the desired behavior just by altering the parameters A, φ and w.
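The steps of the worked example can be reproduced with a short software sketch. It follows the procedure as described in the text (row with the fewest ones, then the covering column with the most ones, then removal), with ties resolved in favor of the first candidate; the function name and the data layout are assumptions, and the code models only the control flow of fig. 4, not its hardware realization.

```python
from typing import List


def greedy_column_cover(matrix: List[List[int]]) -> List[int]:
    """Return 1-based indices of the chosen columns, in the order they are selected."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0]))) if matrix else []
    cover: List[int] = []
    while rows:
        # Step 1: first remaining row with the minimal number of ones.
        row = min(rows, key=lambda i: sum(matrix[i][j] for j in cols))
        # Step 2: among the columns having a one in that row, first one with the most ones.
        candidates = [j for j in cols if matrix[row][j]]
        if not candidates:
            raise ValueError("row %d cannot be covered" % (row + 1))
        col = max(candidates, key=lambda j: sum(matrix[i][j] for i in rows))
        cover.append(col + 1)
        # Step 3: delete the chosen column and every row it covers.
        rows = [i for i in rows if not matrix[i][col]]
        cols.remove(col)
    return cover


X = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]

print(greedy_column_cover(X))  # [2, 3], matching the example
```

Note that this greedy procedure is a heuristic: it reproduces the result of the example, but in general it is not guaranteed to return a cover with the smallest possible number of columns.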
Design tools for problems of reconfigurable computations
To support the design of FPGA-based computational units a set of tools is being developed with the aid of Visual C++ 6.0 and the MFC library. The user interface of the application is shown in fig. 5. The two overlapped left-hand windows named Hierarchy and Library are used to manage the hierarchy of a project and its library respectively. The right-hand window is a schematic editor. The user can create a library of basic elements and then use them in the schematic editor. There are two kinds of library elements: basic elements like adders, decoders, etc. and more complex elements, which are hierarchically composed of the basic ones. Once a complete circuit specification is available, its VHDL description can be generated. The VHDL code can then be analyzed and tested with the aid of the Xilinx Foundation software and implemented in an FPGA. All the experiments have been based on the XC4010XL FPGA. We are now working on the integration of the developed software with the application GraphBuilder [14], which makes it possible to describe control algorithms in the form of hierarchical graph-schemes, to synthesize the respective control circuits with dynamically modifiable functionality and to implement them in an FPGA.
Conclusion
This paper describes the problems of computations in a discrete (logic) space and presents the results of the design of an FPGA-based combinatorial processor intended to be used as a fast co-processor for general-purpose computers. The processor can execute a subset of combinatorial operations on discrete matrices. The implemented subset can be changed during run-time if required. The primary architecture has been chosen on the assumption that it reflects the universal structure of combinatorial computations in the form of the basic logic equation Z = X # Y and its varieties. The generality of this equation has been shown in publications such as [3, 8, 10]. The major components of the combinatorial processor, such as the RFU and the RCU, have been implemented and tested in hardware. An integrated environment targeted at the considered problems is being developed. Note that the complexity of new generations of FPGAs is increasing drastically [15]. As a result, very complex combinatorial problems can be solved using the proposed approach even with a single FPGA chip.
