This article proposes using symbolic learning methods based on multiple-valued (MV) logic and implemented in reconfigurable hardware. In part one, we discussed why symbolic learning is useful in some applications, such as robotics. We presented an architecture for a massively parallel reconfigurable processor that speeds up the logic operations performed in learning hardware.
Rather than learning using evolutionary and neural network methods in hardware, our approach uses combinatorial synthesis methods developed in the framework of the logic synthesis approach in digital-circuit-design automation. In contrast to previous approaches to evolvable hardware that so far have dominated learning in reconfigurable systems, here the learning takes place on the level of constraint acquisition and quasioptimal logic synthesis rather than on the lower level of programming binary switches based on (close to random) decisions of evolutionary programming methods. Our learning strategy is based on the principle of Occam's Razor, which facilitates generalization, discovery, and strong learning methods. We realize directly in reconfigurable hardware such MV cube-algebra operators as intersection, supercube, sharp, and crosslink, and such algorithms as disjunctive normal form (DNF) minimization, Ashenhurst/Curtis decomposition, satisfiability, and decision tree generation. Part two of our article presents cube calculus in more detail as well as various aspects of realizing cube calculus operations in hardware. We also evaluate two variants of our experimental designs.
Cube calculus
Let's consider discrete variables (attributes) X_1, X_2, ..., X_n, such that each variable X_i can take values from a certain finite discrete set V_i (V_i can be any finite set of symbols). A literal X_i^{S_i} of variable X_i represents the characteristic function of a subset S_i of V_i; that is, the literal's value is 1 for symbols from this subset and 0 otherwise. For example,
• for binary logic, X^{1} = X and X^{0} = X′ are two literals; and
• for four-valued logic with V_i = {0, 1, 2, 3}, Y^{0,2} is a literal that equals 1 if Y ∈ {0, 2} and 0 if Y ∈ {1, 3}.
A cube on X_1, X_2, …, X_n is an ordered set of literals on X_1, X_2, …, X_n, that is, X_1^{S_1} X_2^{S_2} … X_n^{S_n}. A cube represents a subspace in the n-dimensional discrete MV space, as Figure 1 shows. Usually, traditional switching algebra interprets a cube as a product of literals. Here we also interpret it as a sum of literals or an XOR of literals. In general, a cube can represent any ordered set of subsets of certain discrete sets V_i, i = 1, ..., n (that is, a Cartesian product of the subsets).
Cube calculus is a system of
• a set of all cubes on a certain ordered set of discrete variables X_1, X_2, ..., X_n that also contains an empty cube and a full cube (the full discrete space defined by X_1, X_2, …, X_n); and
• a set of operations on sets of cubes.
There are several types of cube operations:
• Cube operators: the result is a list of 0 to n cubes.
• Cube predicates: the result is the logic value 0 or 1.
• Counting operations: the result is a number (for instance, the Hamming distance of two cubes).
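The three result types can be illustrated with a minimal Python sketch. It assumes a cube is stored as a tuple of value subsets, one per variable; the function names are hypothetical, and the distance used here (the number of variables whose literals do not overlap) is the usual cube-calculus convention rather than a definition quoted from this article.

```python
F = frozenset  # shorthand; a cube is a tuple of value subsets, one per variable


def supercube(a, b):
    """Cube operator: smallest cube containing both a and b (per-variable union)."""
    return tuple(x | y for x, y in zip(a, b))


def is_disjoint(a, b):
    """Cube predicate: 1 if the two cubes share no point, else 0."""
    return int(any(not (x & y) for x, y in zip(a, b)))


def distance(a, b):
    """Counting operation: number of variables whose literals do not overlap."""
    return sum(1 for x, y in zip(a, b) if not (x & y))


# Two cubes over three three-valued variables (illustrative data).
a = (F({0, 2}), F({1, 2}), F({2}))
b = (F({1}), F({2}), F({2}))

print(supercube(a, b))    # a single cube
print(is_disjoint(a, b))  # a logic value
print(distance(a, b))     # a number
```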
Positional notation represents literals in the cube calculus machine (CCM). Positional notation uses a separate bit to represent each possible value of each literal. If the literal is true for a specific value, the corresponding bit is set to 1. For example, assuming that each of the variables X_1, X_2, and X_3 is a three-valued variable, and the values are {0, 1, 2}, the cube X_1^{0,2} X_2^{1,2} X_3^{2} will be denoted by [101 011 001]. The base K of a logic machine is the number of bits required to represent a simple symbol in this machine.
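As a small illustration of positional notation, the following Python sketch converts a cube given as a list of value subsets into the bracketed bit-string form used above. The function name and the choice of writing value 0 as the leftmost bit are assumptions made to match the example in the text.

```python
def encode_cube(literals, values):
    """Return the positional-notation string of a cube.

    literals: one set of allowed values per variable.
    values:   the ordered value set shared by all variables (an assumption;
              in general each variable may have its own set V_i).
    """
    groups = []
    for subset in literals:
        # One bit per possible value: 1 if the literal is true for that value.
        groups.append("".join("1" if v in subset else "0" for v in values))
    return "[" + " ".join(groups) + "]"


# The cube X_1^{0,2} X_2^{1,2} X_3^{2} over three-valued variables.
print(encode_cube([{0, 2}, {1, 2}, {2}], values=(0, 1, 2)))  # [101 011 001]
```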
For example, K = 2 realizes all logic operations in MV logic with no more than 2^2 = 4 values. Every operator can be represented by its map. The four simple symbols are 0 for a negated variable, 1 for a positive variable, X for don't care, and ε for contradiction. We encode these symbols in positional notation as follows: 0 = 10, 1 = 01, X = 11, and ε = 00. With this encoding, cube bcd′ (X110) on the ordered set of discrete variables (a, b, c, and d) is represented by [11 01 01 10].
When we realize MV logic using binary signals, K binary signals represent each simple symbol from the set of 2^K symbols, as Figure 2a shows. A simple symbol expressed in base K is a fundamental idea of our machine.
A symbol of base K = 1 is sufficient to realize binary logic, set theory, and binary arithmetic. A K of 2 is necessary for binary cube calculus and some MV systems (shown in Figure 2b). Figure 2c shows the programmability matrix for three binary arguments and K = 2. For more operations on MV systems, K must be greater than 2. A W-input, base-K universal cell is a logic block with W inputs and one output, each input and output being a base-K signal (Figure 2d). A big enough K can realize any operator, which is why we call this the universal cell realizing operations of universal logic.
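The K = 2 encoding can be checked with a short Python sketch. The table below copies the four codes given in the text; the function names are hypothetical. The sketch also shows one consequence of the encoding: a bitwise AND of the codes for 0 and 1 yields 00, the contradiction symbol.

```python
# Codes for the four simple symbols, as given in the text.
SYMBOL_CODE = {"0": "10", "1": "01", "X": "11", "ε": "00"}
CODE_SYMBOL = {code: sym for sym, code in SYMBOL_CODE.items()}


def encode_binary_cube(symbols):
    """Encode a string such as 'X110' in positional notation with K = 2."""
    return "[" + " ".join(SYMBOL_CODE[s] for s in symbols) + "]"


def and_symbols(s, t):
    """Bitwise AND of two simple symbols, computed on their 2-bit codes."""
    bits = "".join(
        "1" if p == "1" and q == "1" else "0"
        for p, q in zip(SYMBOL_CODE[s], SYMBOL_CODE[t])
    )
    return CODE_SYMBOL[bits]


print(encode_binary_cube("X110"))  # cube bcd' on variables (a, b, c, d)
print(and_symbols("0", "1"))       # contradiction
print(and_symbols("X", "1"))       # don't care AND 1 gives 1
```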
The iterative cell (IT) is the elementary processor of a CCM and processes each simple symbol. A K-base symbol requires a K-base IT cell, shown in Figure 3. Cube operations take three kinds of arguments:
• a cube,
• a list (array) of cubes (clist), and
• a list of lists of cubes (cclist).
Hence, MV cube calculus (MVCC) is a set of cubes of a certain discrete space and a set of operations on the cubes, clists, and cclists.
Computation patterns
There are many (two-argument) operators on cubes, but fortunately they expose certain common computation patterns, which we can subdivide into three classes:
• simple combinational operations (such as intersection),
• complex (conditional) combinational operations (such as prime), and
• sequential operations (such as nondisjoint sharp).
Operations of each class have the same computation pattern, and the pattern of a simpler class is a special case of a more complex class; Table 1 shows these patterns. The common computation pattern enables the design of certain dedicated-hardware architectures for cube operations. We implement each particular operation using a particular instantiation of the general architecture, realized by appropriately programming the field-programmable gate arrays (FPGAs).
Figure 2. Binary-realized universal cell of the CCM.
Figure 3. Iterative cell.
Simple combinational operation example
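The following is an intersection (product) operation on cubes A and B: A ∩ B = X_1^{A_1 ∩ B_1} X_2^{A_2 ∩ B_2} … X_n^{A_n ∩ B_n} if A_i ∩ B_i ≠ ∅ for every i = 1, ..., n; otherwise the result is the empty cube.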
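In positional notation, intersection reduces to a bitwise AND of corresponding bit groups, with the result declared empty as soon as one group becomes all zeros. The Python sketch below illustrates this under the assumption that each variable's bit group is held in one integer; the function name and the use of None for the empty cube are illustrative choices, not the CCM's actual interface.

```python
def intersect(a, b):
    """Intersection of two cubes given as lists of per-variable bit masks.

    Returns the resulting cube, or None for the empty cube (some variable
    has no value common to both arguments).
    """
    c = [x & y for x, y in zip(a, b)]  # the same simple function at every position
    return None if any(group == 0 for group in c) else c


# Three three-valued variables, three bits per variable.
a = [0b101, 0b011, 0b001]
b = [0b110, 0b010, 0b111]
result = intersect(a, b)
print([format(g, "03b") for g in result])          # ['100', '010', '001']
print(intersect(a, [0b010, 0b111, 0b111]))         # None: first variable is empty
```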
Complex combinational operation example
The following is a prime operation, A prime B, where A_i ∪ B_i is computed only for those variables X_i for which A_i ∩ B_i ≠ ∅ is satisfied (such positions are called active positions).
Sequential operation example
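The following is a nondisjoint sharp operation: A # B = A when A ∩ B = ∅; A # B = ∅ when A ⊆ B; otherwise A # B is the union of the cubes X_1^{A_1} … X_{i-1}^{A_{i-1}} X_i^{A_i \ B_i} X_{i+1}^{A_{i+1}} … X_n^{A_n}, taken over all positions i for which A_i ⊄ B_i.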
It is important to observe that the same general pattern given below can describe every sequential operation: 
Functions REL, BEF, AFT, and ACT can thus specify every operation. Table 1 shows that for simple and complex combinational operations, only those columns that specify this operation are filled.
Patterns of combinational operations are special cases of this general pattern. Thus, simple combinational operations are a special case of the complex combinational operations, and all combinational operations are a special case of the sequential operations. For different operations, we select different functions for REL, BEF, ACT, and AFT.
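To make the role of the four functions concrete, here is a minimal Python sketch of a generic sequential operation. It assumes the layout described later in this article: for each active position, positions to its left use the after function and positions to its right use the before function. The particular REL, BEF, ACT, and AFT choices for the two sharp operations are the standard textbook definitions and are an assumption here, since Table 1 is not reproduced. Cubes are tuples of value subsets, and all names are hypothetical.

```python
F = frozenset


def sequential_op(a, b, rel, bef, act, aft):
    """Generate one resultant cube per active position (where rel holds)."""
    result = []
    n = len(a)
    for i in range(n):
        if not rel(a[i], b[i]):
            continue  # position i is not active
        cube = tuple(
            aft(a[j], b[j]) if j < i        # left of the active position
            else act(a[j], b[j]) if j == i  # the active position itself
            else bef(a[j], b[j])            # right of the active position
            for j in range(n)
        )
        if all(cube):  # skip cubes that contain an empty literal
            result.append(cube)
    return result


def nondisjoint_sharp(a, b):
    return sequential_op(
        a, b,
        rel=lambda x, y: not x <= y,  # A_i is not a subset of B_i
        bef=lambda x, y: x,
        act=lambda x, y: x - y,
        aft=lambda x, y: x,
    )


def disjoint_sharp(a, b):
    return sequential_op(
        a, b,
        rel=lambda x, y: not x <= y,
        bef=lambda x, y: x,
        act=lambda x, y: x - y,
        aft=lambda x, y: x & y,  # makes the resultant cubes pairwise disjoint
    )


a = (F({0, 1}), F({0, 1}))  # full binary cube on two variables
b = (F({1}), F({1}))        # cube x1 x2
print(nondisjoint_sharp(a, b))  # cubes x1' and x2' (overlapping)
print(disjoint_sharp(a, b))     # cubes x1' and x1 x2' (disjoint)
```

Only the four lambdas differ between the two operations, which mirrors the idea that selecting an operation amounts to selecting REL, BEF, ACT, and AFT.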
Compare the previous operation examples and Table 1. Functions REL, BEF, ACT, and AFT are K-wise functions. If, for example, K = 2, then each 2 bits of resulting cube C that represent a simple symbol depend only on the corresponding 2 bits of argument cubes A and B; that is, C_i C_{i+1} depend only on A_i A_{i+1} and B_i B_{i+1}. A complex symbol that represents a value of variable C and has length R × K(IT) is a composition of R simple symbols and represents the results computed in all ITs representing this variable. The ITs form the R K(IT)-sliced CCM chips.
Example applications of these operations include intersection, the most common operation in standard binary logic, rough set, and decomposition calculations; supercube, used in DNF minimization and ESOP synthesis; prime, used in XOR-based synthesis; nondisjoint sharp, used in tautology, satisfiability, covering, and DNF algorithms; disjoint sharp, used in representation transformations; and consensus, used in automatic theorem proving (Prolog's subset implementation) and DNF minimization.
CCM architecture
Software implementation of each cube operation uses a single loop that runs through all cube variables. The following two crucial ideas form the basis for the CCM:
• Execution of the lowest-level loop of the cube operation algorithms (the variable loop) occurs in hardware using a linear, iterative array of cellular automata (finite state machines, or FSMs). Information flows between the FSMs from left to right and from right to left. Every FSM can be in one of the internal states BEF, ACT, and AFT. This state selects the corresponding BEF, ACT, or AFT function programmed into an iterative logic unit (ILU).
• Reconfiguration of logic functions of the cellular automata is achieved by their implementation with FPGAs.
The CCM system architecture involves
• a host processor, which is a traditional general-purpose computer; and
• the massively parallel array of CCM processors, which forms a coprocessor (an application-specific reconfigurable hardware accelerator) of the host processor.
A single CCM processor 1 consists of
• an ILU, which is a horizontal linear array of R K(IT)-sliced CCM processors, each composed of R ITs, as Figure 4 shows;
• a control unit that controls ILU operation and executes cube calculus operations;
• a register file to store auxiliary and control registers that aid in operations; and
• a bus-interface unit (BIU) to control internal and external flow of cube array data among processors, host, and memories.
The lowest-level loop (usually the variable loop) is implemented inside the CCM processors by horizontal communication between the ITs or among a few horizontally connected CCM processors. The CCM processors' vertical linear array (pipeline) implements the second-lowest-level loop, usually the cube loop. This enables 2D data movement: horizontal (inside or among CCM processors) and vertical (among CCM processors). RAM memories connected to ITs realize the third dimension of data movement.
After performing the previously described computations for a certain active literal, the CCM activates the literal corresponding to the next potentially active position to the right and creates the corresponding resultant cube. When producing a particular resultant cube, all literals on positions to the left of the active literal are of the after type, and all literals on positions to the right of the active literal are of the before type. The iterative network of small, fast FSMs (ITs) executes all of these operations in hardware, using fast communication between the FSMs. This type of controlled cellular automata used as a data path of a general-purpose processor is a new concept in computer architecture. The number of all possible operations programmed in this way is extremely large. The operations that are possible to program form a multidimensional space created by a Cartesian product of basic programmable features like those shown in Table 1: REL, BEF, ACT, AFT, composition, and pipelining. Every realizable operation is a point in this space. Selection of the operation occurs without reconfiguring the entire data path or control. We consider the CCM a prototype symbol-processing computer with a sort of data path microprogramming, in contrast to the control path microprogramming of traditional arithmetic computers.
Advantages
CCMs can greatly speed up many applications. They efficiently implement (multilevel) logic operations unlike a conventional computer's ALU. For instance, to calculate the consensus of two cubes, an ALU must execute a long series of shifts and ANDs. Also, some resultant cubes are empty and require removal, making the generation of resultant cubes irregular and inefficient. The CCM can execute each MVCC operation in a single clock pulse or a few clock pulses; this execution requires only one CCM instruction per operation. The CCM does not generate empty resultant cubes, so resultant-cube generation is regular. The time needed to generate the cubes depends solely on the number of nonempty resultant cubes of the particular operation.
Designers tune traditional, general-purpose information processors (Turing-machine equivalents) for arithmetic computations. In contrast, the CCM, although also a general-purpose processor, is tuned to symbolic computations. The result of a single development process, the CCM can efficiently manage a broad range of applications. It is reprogrammable, which enables implementation of only the operations actually required for a certain algorithm's execution during a certain time slot. It allows a customized instruction set, which programmers can optimize for each application. Moreover, the reconfiguring program on a host computer can reconfigure SRAM-based FPGAs even while the host program that uses the CCMs is in full operation.
Furthermore, in a conventional computer, a program stored in RAM provides the control. This strategy results in considerable control overhead, because the instructions must be fetched from RAM. If an algorithm contains loops, the processor will read the same instructions many times. This repeated work causes bottlenecks in the memory interface of conventional computer architecture, especially when the memory bus is not as fast as the internal processor bus. In the CCM architecture, the CCM data path itself implements most of the control. Once a complex MVCC instruction is loaded into the CCM, the host computer only needs to write data cubes to the CCM and read the resultant cubes from the CCM. The host processor can process the resultant values from the CCM while loading them; meanwhile, the CCM awaits the next clock pulse to send another cube.
Additionally, in most commonly used computer architectures, parallel processing is very limited, even in modern RISC or Pentium processors. Parallel processing has also proven difficult for compilers. In the CCM architecture, a single CCM instruction can replace parts of an existing program for a traditional computer. Hardware specifically designed for this particular instruction can then execute it, allowing microparallelism in the CCM.
Another limitation of conventional computer architectures is the ALU's bandwidth. The CCM suffers from this problem to a much lesser extent because the FPGA implementation flexibly adapts ALU bandwidth to each application. The only limiting factor is the capacity and speed of the FPGAs in the hardware.
Furthermore, the CCM architecture is regular and scalable, and lets designers build massively parallel computers from many CCM processors. Other CCMs-and ultimately, the host computer-control these CCM processors. Thus, it's possible to realize true massively parallel processing. Mapping these architectures onto the FPGAs requires considerable time, but once compiled, these new architectures are instantly loadable into the FPGA board.
The question arises as to whether the speedup of a certain application justifies the development or purchase of a costly FPGA board. However, we can spread the cost of FPGA hardware among various applications, not only those involving symbolic computations. Moreover, the essence of the CCM is not the FPGA board, but the architecture programmed into the FPGAs. Because this architecture involves reprogrammable, basic logic operations of the ILU for REL, BEF, ACT, and AFT, it's not necessary to implement the CCM on a classical FPGA board. Limiting reprogrammability to only REL, BEF, ACT, and AFT, implementing most of the CCM processor in classical, hardwired hardware, or both can provide faster execution for one of the CCM's variant implementations. In this way, designers can implement a CCM processor as a very fast classical VLSI hardware chip with a few small, reprogrammable lookup tables in its ILU.
First prototype evaluation
We have designed, simulated, and implemented a CCM prototype for a word length of 16 binary, eight 4-valued, or four 8-valued variables; or any combination of binary, 4-, or 8-valued variables for a total of 32 bits. Our prototype implementation used two Xilinx FPGA XC-3090-50 PP175C chips with 175 pins and running at 50 MHz. Eight iterative cells of the prototype consumed approximately 48 percent of the available configurable logic blocks (CLBs). We simulated and tested the prototype implementation on many data examples for each operation. We also performed timing analysis: The greatest delay was 145.8 ns. The sharp operation speedup on six variable terms was approximately 25 times that of the software implementation. The speedup of the algorithm for the satisfiability problem was approximately 14 times that of its software implementation. We achieved these speedups on a single CCM processor having an ILU with a short word composed of eight IT cells.
Since speedups grow with the number of IT cells in a single CCM processor and with the number of CCM processors in various massively parallel architectures, computation speed enhancement in the full-scale massively parallel implementation should be much higher. To develop an idea of the possible speed enhancement, we performed several simulation experiments with a tree of pipelined CCM processors used for computation of the generalized Petrick function (this function, being a product of sums of literals, solves the unate covering problem). Among others, we considered a small tree with three levels and seven CCM processors. We assumed a host processor clock rate of 100 MHz. Because the host processor must fetch all four leaf nodes of the CCM processor tree in every execution cycle, it limits the clock rate of the tree to a slow 2.5 MHz. With these assumptions, the considered (small) parallel processing structure solves a generalized Petrick function with 1,000 sums within 0.7 ms. To solve this problem with the same algorithm implemented in C, a traditional host computer (PC) operating at 100 MHz requires 7.08 seconds. Thus, application of an appropriate parallel structure of the CCM processors, even with a small number of processors, resulted in an approximately 10,000 times speedup.
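The figures quoted in this paragraph can be cross-checked with a few lines of Python. Only the numbers come from the text; the variable names are illustrative, and the per-leaf figure is a derived value that assumes the host clock budget is split evenly among the four leaf nodes.

```python
# Numbers quoted in the text.
software_time_s = 7.08      # C implementation on a 100-MHz PC
ccm_tree_time_s = 0.7e-3    # seven-processor CCM tree
host_clock_mhz = 100.0
tree_clock_mhz = 2.5
leaf_nodes = 4

speedup = software_time_s / ccm_tree_time_s
host_cycles_per_tree_cycle = host_clock_mhz / tree_clock_mhz
# Assumption: the host cycles are shared evenly by the four leaf nodes.
host_cycles_per_leaf = host_cycles_per_tree_cycle / leaf_nodes

print(f"speedup: about {speedup:,.0f} times")
print(f"host cycles per tree cycle: {host_cycles_per_tree_cycle:.0f}")
print(f"host cycles per leaf fetch: {host_cycles_per_leaf:.0f}")
```

The ratio comes out slightly above 10,000, which agrees with the approximate speedup stated above.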
Second prototype evaluation
For the second evaluation we used the DEC-PERLE-1 board. 2 This has a central computational matrix composed of 16 Xilinx XC3090 logic cell arrays (LCAs), surrounded by four 1-Mbyte RAM banks. It includes seven other LCAs to implement switching and controlling functions. To compare the DEC-PERLE-1 CCM's performance with that of the software approach, we used a C program that executes the disjoint-sharp operation on two arrays of cubes. We also used this program on the CCM to solve all minterms with three, four, and five binary variables. We compiled the C program with the GNU C compiler v2.7.2 and ran it on a Sun Ultra5 workstation with 64 Mbytes of RAM. For the CCM, we used a 1.33-MHz clock (a 750-ns clock period). Table 2 shows the results of this experiment: the software approach takes about one-fourth the time of the DEC-PERLE-1 CCM. But the Sun Ultra5 workstation's CPU clock is 270 MHz, roughly 200 times faster than the CCM's clock. Therefore, although the CCM design is slower than the software implementation, we can still state that it is very efficient for cube calculus operations. Table 2 also shows that the more variables the input cubes have, the more efficient the CCM. This is because the software approach must iterate through one loop for each variable present in the input cubes.
However, the clock period of 750 ns is too slow. The BIU state diagram shows that delays from an empty carry path and a counter carry path occur in only a few states. Thus, if we can give a little more time to these states, which is easily achievable, we could speed up the clock of the entire CCM. For example, if state P2 of the BIU needs more time for the delay of the counter carry path, we can add two more states in series between states P2 and P3.
These two extra states do nothing but give the CCM two more clock periods to evaluate signal prel_res, which means that the CCM has three clock periods to evaluate signal prel_res in state P2 after adding two more delay states. After making similar modifications to all these states, the CCM can run with a 4-MHz clock frequency (clock period of 250 ns). Table 2 shows these results. It is difficult to increase the clock frequency again with this mapping because other paths, like memory paths, have delays greater than 150 ns. Table 2 illustrates a real need for human intelligence combined with the EDA tools to optimize the FPGA architectures.
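The clock figures in this discussion follow from simple arithmetic, sketched below in Python. The clock periods and the 270-MHz workstation clock are taken from the text; the helper name is hypothetical, and the computed ratios are derived values, not measurements.

```python
def clock_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in megahertz."""
    return 1000.0 / period_ns


sun_clock_mhz = 270.0
original_ccm_mhz = clock_mhz(750)  # about 1.33 MHz
retimed_ccm_mhz = clock_mhz(250)   # 4 MHz after adding delay states

ratio_original = sun_clock_mhz / original_ccm_mhz
ratio_retimed = sun_clock_mhz / retimed_ccm_mhz

print(f"original CCM clock: {original_ccm_mhz:.2f} MHz")
print(f"retimed CCM clock:  {retimed_ccm_mhz:.2f} MHz")
print(f"workstation/CCM clock ratio: {ratio_original:.1f} before, "
      f"{ratio_retimed:.1f} after retiming")
```

With a 250-ns period, a state that is stretched by two extra delay states gets three clock periods, that is, 750 ns in total, the same time budget the slowest path had under the original clock.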
From this comparison, we can conclude that it is not efficient to map a CCM design with a complex control unit and complex data path to a board such as the DEC-PERLE-1. Because our CCM mapping sends many signals through multiple FPGA chips, the signal delays are large. For instance, if we directly connect the memory banks and registers, the memory path has a delay of only 35 ns. But the DEC-PERLE-1 memory path has a delay of 160 ns.
Another issue is that the XC3090 FPGA is now outdated technology, so it's at a disadvantage when compared with more modern microprocessors. The latest FPGAs from Xilinx, Altera, or other vendors have more powerful CLBs and more routing resources, and greater speed because of deep-submicron processes. This direction in FPGA technology will continue, providing an advantage to their use in massively parallel accelerators.
For instance, if we map the entire CCM onto a single modern FPGA chip or a special connection pattern of modern FPGAs, we can speed up the CCM multiple times because of the following factors:
• The signals do not need to go through multiple chips, reducing routing delay.
• The new FPGA chips provide more powerful CLBs and routing resources, allowing denser CCM mapping. This also reduces the routing delays.
• Implementation of new FPGA chips in deep-submicron technology reduces CLB and routing wire delays. For example, CLB delay on an XC3090A is 4.5 ns, while delay on a Virtex II is 2.5 ns for a much more powerful cell.
We expect that with the new FPGAs and the corresponding FPGA-based boards, new versions of CCMs will become a true competitor to software, in terms of speed, for robotics applications.
To our knowledge, CCM is the first logic machine for MV logic, universal logic, and cube representation. It generalizes the previous machines of T. Sasao, 3 Ulug and Bowen, Zakrevskij, and others. 4 The results of our experiments indicate that future CCM processors could provide significant speedups for many applications.
IEEE MICRO

